姜鹏辉的个人博客 GreyNius

【Python】使用python读取pdf文件

2020-08-22

代码来源:https://blog.csdn.net/leviopku/article/details/86443426

读取时使用try,可能存在pdf文件无法读取而报错的情况。

from io import StringIO
from io import open
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
 
 
def read_pdf(pdf_name):
    pdf = open(pdf_name,"rb")
    # resource manager
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # device
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    # 获取所有行
    lines = str(content).split("\n")
    pdf.close()
    return lines




Comments

Content