Differences between revisions 2 and 7 (spanning 5 versions)

Python Pdfminer

pdfminer is a third-party module for parsing PDF files.

Installation

The most up-to-date implementation of pdfminer is available through pip(1) as pdfminer.six.

Usage

To extract the content of a PDF file and convert it to HTML, try:

from io import StringIO
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

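filename = "example.pdf"  # path to the PDF file to convert

buffer = StringIO()
with open(filename, "rb") as f:
    # codec=None makes the converter write str rather than bytes, as StringIO requires
    extract_text_to_fp(f, buffer, laparams=LAParams(), output_type="html", codec=None)
html = buffer.getvalue()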
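
The same call can produce XML instead of HTML by changing output_type. The sketch below wraps that choice in a small helper. The names pdf_to_markup and SUPPORTED_OUTPUT_TYPES are hypothetical, introduced only for illustration, and the pdfminer imports are deferred so that the argument check can be tried without the library installed.

```python
from io import StringIO

# Output types this sketch accepts (a deliberate subset, chosen for illustration).
SUPPORTED_OUTPUT_TYPES = ("html", "xml")


def pdf_to_markup(filename, output_type="html"):
    """Return the content of a PDF file as an HTML or XML string."""
    if output_type not in SUPPORTED_OUTPUT_TYPES:
        raise ValueError(f"output_type must be one of {SUPPORTED_OUTPUT_TYPES}")

    # Deferred imports: only needed once a PDF is actually read.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    buffer = StringIO()
    with open(filename, "rb") as f:
        # codec=None keeps the output as str, which StringIO requires.
        extract_text_to_fp(
            f, buffer, laparams=LAParams(), output_type=output_type, codec=None
        )
    return buffer.getvalue()
```

For example, pdf_to_markup("example.pdf", output_type="xml") should return the XML rendering of the same document, assuming the file exists.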

Page-by-page Processing

To process pages of a PDF file separately, try:

from io import StringIO
from pdfminer.converter import XMLConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams

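filename = "example.pdf"  # path to the PDF file to process

buffer = StringIO()
manager = PDFResourceManager(caching=False)
# codec=None makes the converter write str rather than bytes, as StringIO requires
converter = XMLConverter(manager, buffer, laparams=LAParams(), codec=None)
interpreter = PDFPageInterpreter(manager, converter)
with open(filename, 'rb') as f:
    for page in PDFPage.get_pages(f, pagenos=None, maxpages=0, password=None, caching=False):
        interpreter.process_page(page)
converter.close()  # writes the closing </pages> tag
xml = buffer.getvalue()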
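
When each page's output is needed on its own, one approach is to empty the shared buffer after every page. The following is a minimal sketch under stated assumptions: it swaps in TextConverter for plain text, because the XML converter writes a document-level header and footer around all pages, and the helper names drain and iter_page_texts are hypothetical.

```python
from io import StringIO


def drain(buffer):
    """Return everything written to the buffer so far, then empty it."""
    text = buffer.getvalue()
    buffer.seek(0)
    buffer.truncate(0)
    return text


def iter_page_texts(filename):
    """Yield the plain text of each page of a PDF, one string per page."""
    # Deferred imports: only needed once a PDF is actually read.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    buffer = StringIO()
    manager = PDFResourceManager(caching=False)
    converter = TextConverter(manager, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    try:
        with open(filename, "rb") as f:
            for page in PDFPage.get_pages(f):
                interpreter.process_page(page)
                # The buffer now holds only this page's text.
                yield drain(buffer)
    finally:
        converter.close()
```

A caller might then write: for number, text in enumerate(iter_page_texts("example.pdf"), start=1): print(number, len(text)).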

Functions

Extract_Pages

pdfminer.high_level.extract_pages() reads an open binary (PDF) filestream and returns an iterator of pdfminer.layout.LTPage objects, one per page.
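
As a rough illustration of consuming that iterator, the sketch below collects the text of each page layout. The helpers page_text and texts_by_page are hypothetical names, and selecting elements by the presence of a get_text() method is an assumption made so the helper can be tried without pdfminer installed.

```python
def page_text(page_layout):
    """Concatenate the text of every text-bearing element on one page."""
    # Duck typing is an assumption of this sketch. With pdfminer imported,
    # isinstance(element, pdfminer.layout.LTTextContainer) is the stricter check.
    return "".join(
        element.get_text()
        for element in page_layout
        if hasattr(element, "get_text")
    )


def texts_by_page(filename):
    """Yield one string of text per page of the PDF at filename."""
    # Deferred import: only needed once a PDF is actually read.
    from pdfminer.high_level import extract_pages

    with open(filename, "rb") as f:
        for page_layout in extract_pages(f):
            yield page_text(page_layout)
```

Elements without text, such as lines, rectangles and figures, are skipped by the filter.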

Extract_Text

pdfminer.high_level.extract_text() reads an open binary (PDF) filestream and returns the text as a string.

from pdfminer.high_level import extract_text

with open('example.pdf', 'rb') as f:
    text = extract_text(f)

Extract_Text_To_Fp

pdfminer.high_level.extract_text_to_fp() reads an open binary (PDF) filestream and writes the content to an open filestream, either as plain text or formatted in a semi-structured language such as HTML or XML.
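
The output filestream can be text or binary, and the codec argument should agree with it: None for a text stream, an encoding name for a binary one. A minimal sketch of that choice follows. The helpers codec_for and write_pdf_as_html are hypothetical names, and the rule reflects how the HTML and XML converters are understood to behave in pdfminer.six, so it is worth checking against the installed version.

```python
import io


def codec_for(stream):
    """Pick a codec argument that suits the given output stream."""
    # Text streams receive str, so no encoding step is wanted.
    return None if isinstance(stream, io.TextIOBase) else "utf-8"


def write_pdf_as_html(pdf_path, out_stream):
    """Write the PDF at pdf_path to out_stream as HTML."""
    # Deferred imports: only needed once a PDF is actually read.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    with open(pdf_path, "rb") as f:
        extract_text_to_fp(
            f,
            out_stream,
            laparams=LAParams(),
            output_type="html",
            codec=codec_for(out_stream),
        )
```

For instance: with open("example.html", "wb") as out: write_pdf_as_html("example.pdf", out).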

Classes

LAParams

pdfminer.layout.LAParams is a class for layout analysis parameters.

Parameters	Meaning
`line_overlap`
`char_margin`
`word_margin`
`line_margin`
`boxes_flow`
`detect_vertical`
`all_texts`
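
To show how these parameters are passed, here is a hedged sketch that builds an LAParams instance from a dictionary. The values are believed to match the pdfminer.six defaults and the comments paraphrase the upstream documentation from memory, so both should be verified against the installed version. LAYOUT_DEFAULTS and make_laparams are hypothetical names.

```python
# Assumed defaults; verify against the installed pdfminer.six version.
LAYOUT_DEFAULTS = {
    "line_overlap": 0.5,       # overlap needed for two characters to share a line
    "char_margin": 2.0,        # characters closer than this join the same line
    "word_margin": 0.1,        # wider gaps on a line get a space inserted
    "line_margin": 0.5,        # lines closer than this join the same paragraph
    "boxes_flow": 0.5,         # weight of horizontal vs. vertical position in box ordering
    "detect_vertical": False,  # whether vertical text is considered
    "all_texts": False,        # whether text inside figures is analysed
}


def make_laparams(**overrides):
    """Build an LAParams from the assumed defaults plus any overrides."""
    unknown = set(overrides) - set(LAYOUT_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown layout parameters: {sorted(unknown)}")

    # Deferred import: only needed once the object is actually built.
    from pdfminer.layout import LAParams

    return LAParams(**{**LAYOUT_DEFAULTS, **overrides})
```

A call such as make_laparams(detect_vertical=True) would then yield an object suitable for the laparams argument used in the examples above.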

LTPages

CategoryRicottone

-  Revision 2 as of 2022-05-01 20:59:20
  Size: 1778
  Editor: DominicRicottone
  Comment:
+  Revision 7 as of 2023-10-12 17:52:53
  Size: 2609
  Editor: DominicRicottone
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 2:
+'''`pdfminer`''' is a third-party module for parsing PDF files.
-Line 11:
+Line 13:
-The most up-to-date implementation of `pdfminer` is available through `pip(1)` as `pdfminer.six`.
+The most up-to-date implementation of `pdfminer` is available through [[Python/Pip|pip(1)]] as `pdfminer.six`.
-Line 19:
+Line 21:
-=== Simple Usage ===

{{{
from pdfminer.high_level import extract_text
with open(filename, "rb") as f:
    text = extract_text(f)
}}}



=== Document Processing ===

In this case, using HTML as the destination encoding. To use XML instead, replace `"html"` with `"xml"`.
+To extract the content of a PDF file and convert it to HTML, try:
-Line 48:
+Line 38:
-In this case, using XML as the destination encoding.
+To process pages of a PDF file separately, try:
-Line 59:
+Line 49:
-converter = XMLConverter(manager, buffer, laparams=LAParams(), codec=None, imagewriter=None)
+converter = XMLConverter(manager, buffer, laparams=LAParams(), codec=None)
-Line 62:
+Line 52:
-    for page in PDFPage.get_pages(f, None, 0, password=None, caching=False):
+    for page in PDFPage.get_pages(f, pagenos=None, maxpages=0, password=None, caching=False):
-Line 67:
+Line 57:
-To use HTML instead, translate the `XMLConverter` instance with an `HTMLConverter` instance.
+----



== Functions ==



=== Extract_Pages ===

''' `pdfminer.high_level.extract_pages()`''' reads an open binary (PDF) filestream and returns an iterator of `pdfminer.layout.LTPage` objects

----



=== Extract_Text ===

''' `pdfminer.high_level.extract_text()`''' reads an open binary (PDF) filestream and returns the text as a [[Python/Builtins/Types#Str|string]].
-Line 70:
+Line 78:
-XMLConverter(manager, buffer, laparams=LAParams(), codec=None)
# vs
HTMLConverter(manager, buffer, laparams=LAParams(), codec=None)
+from pdfminer.high_level import extract_text

with open('example.pdf', 'rb') as f:
    text = extract_text(f)
-Line 74:
+Line 83:
+----



=== Extract_Text_To_Fp ===

''' `pdfminer.high_level.extract_text_to_fp()`''' reads an open binary (PDF) filestream and writes the content to an open filestream, optionally formatted in a semi-structured language.

----



== Classes ==



=== LAParams ===

'''`pdfminer.layout.LAParams`''' is a class for layout analysis parameters.

||'''Parameters''' ||'''Meaning'''||
||`line_overlap`   ||             ||
||`char_margin`    ||             ||
||`word_margin`    ||             ||
||`line_margin`    ||             ||
||`boxes_flow`     ||             ||
||`detect_vertical`||             ||
||`all_texts`      ||             ||

----



=== LTPages ===

Diff for "Python/Pdfminer"