Python Pdfminer

pdfminer is a third-party module for parsing PDF files.


Installation

The most up-to-date implementation of pdfminer is available through pip(1) as pdfminer.six.


Usage

To extract the content of a PDF file and convert it to HTML, try:

from io import StringIO
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

filename = "example.pdf"  # path to the PDF to convert

# codec=None keeps the output as str, which is what a StringIO buffer expects.
buffer = StringIO()
with open(filename, "rb") as f:
    extract_text_to_fp(f, buffer, laparams=LAParams(), output_type="html", codec=None)
html = buffer.getvalue()

Page-by-page Processing

To process pages of a PDF file separately, try:

from io import StringIO
from pdfminer.converter import XMLConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams

filename = "example.pdf"  # path to the PDF to process

buffer = StringIO()
manager = PDFResourceManager(caching=False)
# codec=None makes the converter write str rather than encoded bytes.
converter = XMLConverter(manager, buffer, laparams=LAParams(), codec=None)
interpreter = PDFPageInterpreter(manager, converter)
with open(filename, "rb") as f:
    # pagenos=None and maxpages=0 select every page; "" is the empty password.
    for page in PDFPage.get_pages(f, pagenos=None, maxpages=0, password="", caching=False):
        interpreter.process_page(page)
converter.close()  # writes the closing </pages> tag
xml = buffer.getvalue()


Functions

Extract_Pages

pdfminer.high_level.extract_pages() reads an open binary (PDF) filestream and returns an iterator of pdfminer.layout.LTPage objects.


Extract_Text

pdfminer.high_level.extract_text() reads an open binary (PDF) filestream and returns the text as a string.

from pdfminer.high_level import extract_text

with open('example.pdf', 'rb') as f:
    text = extract_text(f)


Extract_Text_To_Fp

pdfminer.high_level.extract_text_to_fp() reads an open binary (PDF) filestream and writes the content to an open filestream, optionally formatted in a semi-structured language.


Classes

LAParams

pdfminer.layout.LAParams is a class for layout analysis parameters.

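Parameters        Meaning

line_overlap      Minimum overlap, relative to character height, for two characters to be placed on the same line.

char_margin       Maximum gap, relative to character width, for two characters to be joined into the same line.

word_margin       Minimum gap, relative to character width, at which a space is inserted between two characters on a line.

line_margin       Maximum gap, relative to line height, for two lines to be grouped into the same text box.

boxes_flow        Weight of horizontal versus vertical position when ordering text boxes, from -1.0 (horizontal only) to +1.0 (vertical only).

detect_vertical   Whether vertical text is considered during layout analysis.

all_texts         Whether layout analysis is also performed on text inside figures.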


LTPage



Python/Pdfminer (last edited 2023-10-12 17:52:53 by DominicRicottone)