Python Pdfminer
pdfminer is a third-party module for parsing PDF files.
Installation
The most up-to-date implementation of pdfminer is available through pip(1) as pdfminer.six. Install it with: pip install pdfminer.six
Usage
Simple Usage
from pdfminer.high_level import extract_text

# filename is the path of the PDF to read
with open(filename, "rb") as f:
    text = extract_text(f)
Document Processing
This example uses HTML as the output format. To produce XML instead, replace "html" with "xml".
from io import StringIO
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

buffer = StringIO()
with open(filename, "rb") as f:
    extract_text_to_fp(f, buffer, laparams=LAParams(), output_type="html", codec=None)
html = buffer.getvalue()
Page-by-page Processing
This example uses XML as the output format. To produce HTML instead, replace XMLConverter with HTMLConverter.
from io import StringIO
from pdfminer.converter import XMLConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams

buffer = StringIO()
manager = PDFResourceManager(caching=False)
converter = XMLConverter(manager, buffer, laparams=LAParams(), codec=None)
interpreter = PDFPageInterpreter(manager, converter)
with open(filename, 'rb') as f:
    # password is a string; use "" for documents that are not encrypted
    for page in PDFPage.get_pages(f, pagenos=None, maxpages=0, password="", caching=False):
        interpreter.process_page(page)
# close() writes the closing tag of the document
converter.close()
xml = buffer.getvalue()
Functions
Extract_Pages
pdfminer.high_level.extract_pages() reads an open binary (PDF) filestream and returns an iterator of pdfminer.layout.LTPage objects, one per page.
Extract_Text
pdfminer.high_level.extract_text() reads an open binary (PDF) filestream and returns the text as a string.
Extract_Text_To_Fp
pdfminer.high_level.extract_text_to_fp() reads an open binary (PDF) filestream and writes the content to an open filestream, either as plain text or, optionally, formatted as HTML or XML.
Classes
LAParams
pdfminer.layout.LAParams is a class for layout analysis parameters.
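Parameter | Meaning
line_overlap | Minimum overlap, relative to the smaller character height, for two characters to be treated as being on the same line.
char_margin | Maximum gap, relative to the character width, for two characters to be grouped into the same line.
word_margin | Gap, relative to the character width, beyond which two characters on the same line are treated as separate words and a space is inserted.
line_margin | Maximum gap, relative to the line height, for two lines to be grouped into the same paragraph.
boxes_flow | How much horizontal versus vertical position matters when ordering text boxes, from -1.0 (only horizontal) to +1.0 (only vertical); None disables advanced layout analysis.
detect_vertical | Whether vertical text is considered during layout analysis.
all_texts | Whether layout analysis is also performed on text inside figures.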
