Python Pdfminer
pdfminer is a third-party module for parsing PDF files.
Installation
The most up-to-date implementation of pdfminer is available through pip(1) as pdfminer.six.
Usage
To extract the content of a PDF file and convert it to HTML, try:
    from io import StringIO
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    filename = "example.pdf"  # path to the PDF to convert

    buffer = StringIO()
    with open(filename, "rb") as f:
        # codec=None because the output buffer is a text stream
        extract_text_to_fp(f, buffer, laparams=LAParams(), output_type="html", codec=None)
    html = buffer.getvalue()
Page-by-page Processing
To process pages of a PDF file separately, try:
    from io import StringIO
    from pdfminer.converter import XMLConverter
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.layout import LAParams

    filename = "example.pdf"  # path to the PDF to process

    buffer = StringIO()
    manager = PDFResourceManager(caching=False)
    converter = XMLConverter(manager, buffer, laparams=LAParams(), codec=None)
    interpreter = PDFPageInterpreter(manager, converter)
    with open(filename, "rb") as f:
        # pagenos=None and maxpages=0 select every page
        for page in PDFPage.get_pages(f, pagenos=None, maxpages=0, password="", caching=False):
            interpreter.process_page(page)
    converter.close()  # writes the closing </pages> tag
    xml = buffer.getvalue()
Functions
Extract_Pages
pdfminer.high_level.extract_pages() reads an open binary (PDF) filestream and returns an iterator of pdfminer.layout.LTPage objects.
Extract_Text
pdfminer.high_level.extract_text() reads an open binary (PDF) filestream and returns the text as a string.
    from pdfminer.high_level import extract_text

    with open('example.pdf', 'rb') as f:
        text = extract_text(f)
Extract_Text_To_Fp
pdfminer.high_level.extract_text_to_fp() reads an open binary (PDF) filestream and writes the content to an open filestream. The output_type argument selects the output format: plain text ("text"), or a semi-structured format such as "html", "xml", or "tag".
Classes
LAParams
pdfminer.layout.LAParams is a class for layout analysis parameters.
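Parameter | Meaning |
line_overlap | Minimum overlap, relative to character height, for two characters to be treated as being on the same line |
char_margin | Maximum gap, relative to character width, for two characters to be treated as part of the same line |
word_margin | Gap, relative to character width, beyond which two characters on a line are treated as separate words and a space is inserted |
line_margin | Maximum gap, relative to line height, for two lines to be treated as part of the same paragraph |
boxes_flow | Weight of horizontal versus vertical position when ordering text boxes, from -1.0 (horizontal only) to +1.0 (vertical only) |
detect_vertical | Whether vertical text is considered during layout analysis |
all_texts | Whether layout analysis is also performed on text inside figures |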