Python Pdfminer


Installation

The most up-to-date implementation of pdfminer is available through pip(1) as pdfminer.six.
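
The package is typically installed with `pip install pdfminer.six`. As a quick check before running the examples on this page, the following sketch reports whether the distribution is present, using only the standard library. The helper name `pdfminer_six_version` is made up for illustration.

```python
from importlib import metadata


def pdfminer_six_version():
    """Return the installed pdfminer.six version string, or None if absent."""
    try:
        return metadata.version("pdfminer.six")
    except metadata.PackageNotFoundError:
        return None


version = pdfminer_six_version()
if version is None:
    print("pdfminer.six is not installed; try: pip install pdfminer.six")
else:
    print(f"pdfminer.six {version} is installed")
```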


Usage

Simple Usage

from pdfminer.high_level import extract_text
with open(filename, "rb") as f:
    text = extract_text(f)
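
`extract_text` also accepts a `page_numbers` argument holding zero-based page indices, which helps when only part of a document is needed. The sketch below converts a human-friendly, inclusive 1-based range into that form. The helper names `page_numbers_for` and `extract_page_range` are hypothetical, and pdfminer is imported inside the function only so the range helper can be used on its own.

```python
def page_numbers_for(first, last):
    """Convert an inclusive 1-based page range to zero-based page indices."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


def extract_page_range(path, first, last):
    """Extract text from pages first..last (1-based, inclusive) of a PDF."""
    # Imported lazily; requires pdfminer.six to be installed.
    from pdfminer.high_level import extract_text

    return extract_text(path, page_numbers=page_numbers_for(first, last))
```

For example, `extract_page_range(filename, 2, 4)` would request the second through fourth pages.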

Document Processing

This example uses HTML as the output format. To produce XML instead, replace "html" with "xml".

from io import StringIO
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

buffer = StringIO()
with open(filename, "rb") as f:
    extract_text_to_fp(f, buffer, laparams=LAParams(), output_type="html", codec=None)
html = buffer.getvalue()
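
Since the output format is just a string argument, it can be chosen at run time. The following is a minimal sketch that picks "html" or "xml" from the destination file's extension and writes the result to disk. The names `OUTPUT_TYPES`, `output_type_for`, and `convert_pdf` are assumptions made for illustration, not part of pdfminer.

```python
from io import StringIO
from pathlib import Path

# Hypothetical mapping from file extension to the output_type string.
OUTPUT_TYPES = {".html": "html", ".htm": "html", ".xml": "xml"}


def output_type_for(destination):
    """Pick the output_type string from the destination's extension."""
    suffix = Path(destination).suffix.lower()
    try:
        return OUTPUT_TYPES[suffix]
    except KeyError:
        raise ValueError(f"unsupported output extension: {suffix!r}") from None


def convert_pdf(source, destination):
    """Convert a PDF to HTML or XML, depending on the destination name."""
    # Imported lazily; requires pdfminer.six to be installed.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    buffer = StringIO()
    with open(source, "rb") as f:
        extract_text_to_fp(
            f,
            buffer,
            laparams=LAParams(),
            output_type=output_type_for(destination),
            codec=None,
        )
    Path(destination).write_text(buffer.getvalue(), encoding="utf-8")
```

For example, `convert_pdf(filename, "report.xml")` would produce XML, while a destination ending in `.html` would produce HTML.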

Page-by-page Processing

This example uses XML as the output format. To produce HTML instead, replace XMLConverter with HTMLConverter, which is also imported from pdfminer.converter.

from io import StringIO
from pdfminer.converter import XMLConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams

buffer = StringIO()
manager = PDFResourceManager(caching=False)
converter = XMLConverter(manager, buffer, laparams=LAParams(), codec=None)
interpreter = PDFPageInterpreter(manager, converter)
with open(filename, 'rb') as f:
    for page in PDFPage.get_pages(f, pagenos=None, maxpages=0, password=None, caching=False):
        interpreter.process_page(page)
converter.close()  # write the document footer (for XML, the closing </pages> tag)
xml = buffer.getvalue()
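
Once the XML is in hand, the per-page structure can be read back with the standard library. The sketch below groups the extracted characters by page. It assumes the output is well-formed, which requires the trailing `</pages>` tag that the converter writes when `converter.close()` is called. `SAMPLE_XML` is a hand-written, simplified imitation of the converter's output with made-up attribute values, and `text_by_page` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for converter output; real output carries more detail.
SAMPLE_XML = """<?xml version="1.0" ?>
<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox id="0" bbox="72.000,700.000,90.000,712.000">
<textline bbox="72.000,700.000,90.000,712.000">
<text font="Helvetica" bbox="72.000,700.000,80.000,712.000" size="12.000">H</text>
<text font="Helvetica" bbox="80.000,700.000,84.000,712.000" size="12.000">i</text>
<text>
</text>
</textline>
</textbox>
</page>
<page id="2" bbox="0.000,0.000,612.000,792.000" rotate="0">
</page>
</pages>
"""


def text_by_page(xml_output):
    """Map each page's id attribute to the text found in its <text> elements."""
    root = ET.fromstring(xml_output)
    pages = {}
    for page in root.iter("page"):
        # Each <text> element holds one character (or a line break).
        pages[page.get("id")] = "".join(t.text or "" for t in page.iter("text"))
    return pages


pages = text_by_page(SAMPLE_XML)
```

With real output, the call would be `text_by_page(buffer.getvalue())` after the converter has been closed.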


CategoryRicottone

Python/Pdfminer (last edited 2023-10-12 17:52:53 by DominicRicottone)