Extracting PDF Text and Tables with pdfplumber

2022-07-05

This post uses pdfplumber 0.5.12, which supports Python 2.7, 3.5, and 3.6. Project home page: https://github.com/jsvine/pdfplumber

1. Installation

pip install pdfplumber

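2. Usage

Extract text: pdf.pages[0].extract_text(). Extract tables: pdf.pages[0].extract_tables().

The following code extracts the text and the tables (dropping empty cells) and writes each to its own txt file: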

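import os

import pdfplumber

path = "path/to/file.pdf"
fileNames = os.path.splitext(path)  # (path without extension, extension)

pdf = pdfplumber.open(path)
for page in pdf.pages:
    # Get all text on the current page, including text inside tables, and append it to a txt file.
    # extract_text() returns one string; its line breaks follow the visual line breaks
    # in the PDF, not the logical paragraphs.
    # print(page.extract_text())
    # encoding="utf-8" requires Python 3; a page without text may yield None, hence `or ""`.
    with open(fileNames[0] + "_txt.txt", "a+", encoding="utf-8") as m:
        m.write(page.extract_text() or "")

    # Extract every table on this page, drop empty cells, and append the rows to a txt file.
    # Each table is a nested list: a list of rows, each row a list of cells.
    for table in page.extract_tables():
        for row_old in table:  # each row of the table
            row_new = [cell for cell in row_old if cell]
            if row_new:
                with open(fileNames[0] + "_table.txt", "a+", encoding="utf-8") as f:
                    # str() of a list shows in-cell line breaks as the two characters \n,
                    # so the escaped form is what gets removed here.
                    f.write(str(row_new).replace("\\n", "") + "\n")
pdf.close()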
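
The row-cleaning step in the loop above can be pulled out into a small helper, which makes it easy to check without a PDF at hand. This is only a sketch: `clean_row` and `table_to_lines` are hypothetical names (not part of pdfplumber), the input is assumed to have the nested-list shape returned by extract_tables(), and cells are joined with tabs rather than written as a list repr.

```python
def clean_row(row):
    """Drop empty cells (None or "") and remove line breaks inside the rest."""
    return [cell.replace("\n", "") for cell in row if cell]


def table_to_lines(table):
    """Turn one table (a list of rows) into text lines, skipping empty rows."""
    lines = []
    for row in table:
        cleaned = clean_row(row)
        if cleaned:
            lines.append("\t".join(cleaned))
    return lines


# Hand-written stand-in for one table from page.extract_tables()
sample_table = [
    ["Name", None, "Amount"],
    ["", None, ""],
    ["Widget\nA", "", "3"],
]
print(table_to_lines(sample_table))
```

Joining with tabs keeps the output easy to load into a spreadsheet; writing str(row_new) as in the code above works just as well if a list-like dump is enough.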

About the module:

https://github.com/jsvine/pdfplumber#pdfplumber-v0512

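The pdfplumber.PDF class

Property   Description
.metadata  A dictionary of metadata key/value pairs, taken from the PDF's Info trailers. Typically includes "CreationDate", "ModDate", "Producer", and so on.
.pages     A list containing one pdfplumber.Page instance per loaded page.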
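
As a rough illustration of these two properties, the sketch below reads the page count and a few common metadata keys. `describe_pdf` is a hypothetical helper, and a stand-in object is used in place of a real PDF so the snippet runs without a file; only the `.metadata` and `.pages` attributes described above are assumed.

```python
from types import SimpleNamespace


def describe_pdf(pdf):
    """Return the page count and a few common metadata fields of an opened PDF."""
    info = {"page_count": len(pdf.pages)}
    for key in ("CreationDate", "ModDate", "Producer"):
        # Not every PDF sets every key, so fall back to None.
        info[key] = pdf.metadata.get(key)
    return info


# Stand-in with the same two attributes as an opened PDF (illustrative values).
fake_pdf = SimpleNamespace(
    metadata={"Producer": "ExampleWriter"},
    pages=[object(), object()],
)
print(describe_pdf(fake_pdf))
```

With a real file, the same helper would be called on the object returned by pdfplumber.open("path/to/file.pdf").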

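The pdfplumber.Page class

Property                            Description
.page_number                        The page number.
.width                              The width of the page.
.height                             The height of the page.
.objects / .chars / .lines / .rects  Each of these properties is a list, and each list contains one dictionary for each such object embedded on the page.

For the methods and their descriptions, see: https://github.com/jsvine/pdfplumber#the-pdfplumberpage-class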
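
The page properties above could be gathered into a quick per-page overview along these lines. `page_summary` is a hypothetical helper, and the stand-in page with made-up dimensions only mimics the attributes named in the table; it is not a real pdfplumber.Page.

```python
from types import SimpleNamespace


def page_summary(page):
    """Collect the page number, size, and object counts of one page."""
    return {
        "page_number": page.page_number,
        "size": (page.width, page.height),
        "chars": len(page.chars),
        "lines": len(page.lines),
        "rects": len(page.rects),
    }


# Stand-in page: two characters, no lines, one rectangle (illustrative values).
fake_page = SimpleNamespace(
    page_number=1,
    width=595,
    height=842,
    chars=[{"text": "A"}, {"text": "B"}],
    lines=[],
    rects=[{}],
)
print(page_summary(fake_page))
```

A page with many rects or lines but few chars is often a hint that it is mostly a table or a drawing, which can help decide whether to call extract_tables() on it.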

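Objects

Each instance of pdfplumber.PDF and pdfplumber.Page provides access to several types of PDF objects. The following properties each return a Python list of the matching objects:

Property  Meaning
.chars    Individual text characters
.annos    Annotation-text characters
.lines    One-dimensional lines
.rects    Two-dimensional rectangles
.curves   A series of connected points

For the properties of each object type, see: https://github.com/jsvine/pdfplumber#objects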
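
Because each of these lists holds plain dictionaries, ordinary Python is enough to work with them. The sketch below groups characters by font; it assumes each char dictionary carries "text" and "fontname" keys as described in the Objects section of the pdfplumber README, and `text_by_font` is a hypothetical helper fed with hand-written sample data.

```python
from collections import defaultdict


def text_by_font(chars):
    """Concatenate the text of the characters, grouped by font name."""
    grouped = defaultdict(str)
    for char in chars:
        grouped[char["fontname"]] += char["text"]
    return dict(grouped)


# Hand-written stand-in for a page's .chars list (illustrative values).
sample_chars = [
    {"text": "H", "fontname": "Helvetica-Bold"},
    {"text": "i", "fontname": "Helvetica-Bold"},
    {"text": "!", "fontname": "Helvetica"},
]
print(text_by_font(sample_chars))
```

Grouping like this is one simple way to spot headings, which are frequently set in a different font from the body text.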

Source: IT技术分享网
Original link: http://www.5ityx.cn/cate100/57128.html