Excel Learning Library - Extracting PDF Tables into Excel

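PDF stands for Portable Document Format. It packages text, fonts, graphics, images, color, layout, and printing-device parameters into a single file, so that every character and color is reproduced faithfully on any printer.

Because of this format, however, the content of a PDF document cannot be edited directly, and its tables cannot be used for calculations the way an Excel sheet can. This section shows how to extract the tables in a PDF into an Excel file.

Prerequisites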

Python

Reading the PDF

Install the third-party library pdfplumber: pip install pdfplumber

1. To read the content of a PDF, first open the PDF file (file is the path to the PDF):

import pdfplumber

with pdfplumber.open(file) as pdf:
    print('Reading data...')

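2. Next, still inside the with block, loop over every page of the PDF:

    for page in pdf.pages:

3. Then, inside that loop, get all the tables on the current page:

        for table in page.extract_tables():

4. Each table object holds the content of one extracted table as a list of rows. Inside the inner loop, print it to check the result:

            print(table)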
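
Steps 1 to 4 can be combined into one small generator. This is only a minimal sketch: the helper name iter_tables is hypothetical, and it assumes the object passed in behaves like the one returned by pdfplumber.open(), meaning it exposes pages, and each page exposes page_number and extract_tables().

```python
def iter_tables(pdf):
    """Yield (page_number, table_index, table) for every table in an opened PDF.

    `pdf` is assumed to behave like the object returned by pdfplumber.open().
    """
    for page in pdf.pages:
        # extract_tables() returns a list of tables; each table is a list of rows
        for table_index, table in enumerate(page.extract_tables(), start=1):
            yield page.page_number, table_index, table


# Typical use (needs `pip install pdfplumber` and a real file path):
#
#     import pdfplumber
#     with pdfplumber.open("example.pdf") as pdf:
#         for page_number, table_index, table in iter_tables(pdf):
#             print(page_number, table_index, table)
```

Keeping the page number next to each table makes it easier to trace a sheet in the output back to the page it came from.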

Writing to an Excel file

Install the third-party library openpyxl: pip install openpyxl

Install the third-party library pandas: pip install pandas

1. Write to Excel

import pandas as pd

df = pd.DataFrame(table)

df.to_excel("1.xlsx", header=False, index=False)

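This does write the content to an Excel file, but the result is not what we want: opening the file shows that only the data written last, from the final page, was kept. The reason is that every call to to_excel("1.xlsx") creates a new file named 1.xlsx. When the call runs once per table inside the loop, each call overwrites the file written by the previous one, so only the last write survives.

2. Write the content to multiple sheets in one Excel file. Pass the same pd.ExcelWriter to every to_excel call and give each DataFrame its own sheet name (here df and df2 are two DataFrames and newfile is the output path):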

xlsx = pd.ExcelWriter(newfile)

df.to_excel(excel_writer=xlsx, sheet_name="sheet1")

df2.to_excel(excel_writer=xlsx, sheet_name="sheet2")

xlsx.close()
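
Each sheet_name has to be a valid and unique Excel worksheet name: at most 31 characters, and without any of the characters [ ] : * ? / \. Hard-coded names like "sheet1" are safe, but names built from data (for example a table title) may not be. The sketch below shows one possible way to sanitize them; the helper name safe_sheet_name and the "_2", "_3" suffix scheme are assumptions made for illustration, not part of pandas or openpyxl.

```python
import re

# Characters Excel does not allow in worksheet names
_INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")
_MAX_SHEET_NAME_LEN = 31  # Excel's length limit for worksheet names


def safe_sheet_name(name, used):
    """Return a valid, unused sheet name based on `name` and record it in `used`.

    `used` is a set of lower-cased names that have already been handed out.
    """
    base = _INVALID_SHEET_CHARS.sub("_", name).strip("'") or "Sheet"
    candidate = base[:_MAX_SHEET_NAME_LEN]
    counter = 2
    # Excel compares sheet names case-insensitively
    while candidate.lower() in used:
        suffix = f"_{counter}"
        candidate = base[:_MAX_SHEET_NAME_LEN - len(suffix)] + suffix
        counter += 1
    used.add(candidate.lower())
    return candidate
```

It could be used as df.to_excel(excel_writer=xlsx, sheet_name=safe_sheet_name(title, used)), where title and used are hypothetical variables holding the desired name and the set of names already taken.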

Complete, working code

# coding:utf8

import os
import tkinter.filedialog as tf
from tkinter import Tk

import pandas as pd
import pdfplumber


def pdf_to_excel(file):
    """Extract every table in the PDF `file` into an .xlsx file next to it."""
    table_all = []
    with pdfplumber.open(file) as pdf:
        print('Reading data...')

        for page in pdf.pages:
            # Get all tables on the current page
            for table in page.extract_tables():
                table_all.append(table)

    if not table_all:
        # Nothing to write, so skip creating a workbook
        print('No tables found in', file)
        return

    # Replace only the file extension with .xlsx
    newfile = os.path.splitext(file)[0] + ".xlsx"
    # Write each table to its own sheet, numbered in extraction order
    with pd.ExcelWriter(newfile) as xlsx:
        for i, table in enumerate(table_all):
            df = pd.DataFrame(table)
            df.to_excel(excel_writer=xlsx, sheet_name=f"Table{i+1}", header=False, index=False)


if __name__ == '__main__':
    # Hide the empty Tk root window and show only the file picker
    root = Tk()
    root.withdraw()
    files = tf.askopenfilenames(filetypes=[("PDF files", "*.pdf")])
    for file in files:
        pdf_to_excel(file)
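
The script saves the workbook next to the source PDF by swapping the file extension. As an alternative sketch, pathlib can derive the same output path while touching only the final suffix, which matters when the text ".pdf" also appears elsewhere in the path. The helper name excel_path_for is hypothetical and not part of the original script.

```python
from pathlib import Path


def excel_path_for(pdf_path):
    """Return the path of the .xlsx file that sits next to `pdf_path`."""
    # with_suffix() replaces only the final extension and leaves the rest of the path alone
    return Path(pdf_path).with_suffix(".xlsx")
```

For example, excel_path_for("reports/2024.pdf/summary.pdf") keeps the directory name 2024.pdf unchanged and only renames summary.pdf to summary.xlsx. Wrap the result in str() where a plain string is needed.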

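Final notes

The pdfplumber module makes it easy to extract tables from a PDF into Excel. Because it reads the text data stored in the PDF directly instead of recognizing it from an image, the extracted text matches the source. The drawback is that this approach only extracts table content: images and anything outside a table are not captured.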
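
Even when the text itself is read accurately, raw cells may still need tidying before they are convenient to use in Excel. The sketch below assumes a table is a list of rows as returned by extract_tables(), in which empty cells can come back as None and multi-line cells can contain line breaks. The helper name clean_table is hypothetical.

```python
def clean_table(table):
    """Return a copy of `table` with None cells emptied and whitespace normalized."""
    cleaned = []
    for row in table:
        cleaned.append([
            # split() + join() collapses line breaks and repeated spaces into single spaces
            "" if cell is None else " ".join(str(cell).split())
            for cell in row
        ])
    return cleaned
```

The cleaned rows can then be passed to pd.DataFrame in place of the raw table. All values stay strings; converting numeric columns is left out of this sketch.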


Extracting PDF Tables into Excel, 2024-04-15 02:17:30
