Python Pdfminer

pdfminer is a third-party module for parsing PDF files.


Installation

The most up-to-date implementation of pdfminer is available through pip(1) as pdfminer.six.


Usage

Simple Usage

from pdfminer.high_level import extract_text
with open(filename, "rb") as f:
    text = extract_text(f)

Document Processing

This example produces HTML output. To produce XML instead, replace "html" with "xml".

from io import StringIO
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

buffer = StringIO()
with open(filename, "rb") as f:
    extract_text_to_fp(f, buffer, laparams=LAParams(), output_type="html", codec=None)
html = buffer.getvalue()

Page-by-page Processing

This example produces XML output. To produce HTML instead, replace XMLConverter with HTMLConverter.

from io import StringIO
from pdfminer.converter import XMLConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams

buffer = StringIO()
manager = PDFResourceManager(caching=False)
converter = XMLConverter(manager, buffer, laparams=LAParams(), codec=None)
interpreter = PDFPageInterpreter(manager, converter)
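with open(filename, 'rb') as f:
    for page in PDFPage.get_pages(f, pagenos=None, maxpages=0, password=None, caching=False):
        interpreter.process_page(page)
# Closing the converter writes the closing </pages> tag, so the XML is well-formed.
converter.close()
xml = buffer.getvalue()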


Functions

Extract_Pages

pdfminer.high_level.extract_pages() reads an open binary (PDF) filestream and returns an iterator of pdfminer.layout.LTPage objects, one per page.


Extract_Text

pdfminer.high_level.extract_text() reads an open binary (PDF) filestream and returns the text as a string.


Extract_Text_To_Fp

pdfminer.high_level.extract_text_to_fp() reads an open binary (PDF) filestream and writes the content to an open filestream, either as plain text or, when output_type is set, as HTML or XML.


Classes

LAParams

pdfminer.layout.LAParams is a class for layout analysis parameters.

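||Parameters ||Meaning||
||line_overlap ||Two characters that overlap by more than this are treated as being on the same line; relative to the smaller character height||
||char_margin ||Two characters closer together than this are treated as part of the same line; relative to the character width||
||word_margin ||Two characters on the same line further apart than this are treated as separate words, and a space is inserted; relative to the character width||
||line_margin ||Two lines closer together than this are treated as part of the same paragraph; relative to the line height||
||boxes_flow ||How much horizontal versus vertical position matters when ordering text boxes, from -1.0 (only horizontal) to +1.0 (only vertical)||
||detect_vertical ||Whether vertical text is considered during layout analysis||
||all_texts ||Whether layout analysis is also performed on text inside figures||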


LTPages


CategoryRicottone

Python/Pdfminer (last edited 2023-10-12 17:52:53 by DominicRicottone)