Revision 7 as of 2023-10-12 17:52:53
Python Pdfminer
pdfminer is a third-party module for parsing PDF files.
Installation
The most up-to-date implementation of pdfminer is available through pip(1) as pdfminer.six.
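After installing, it can be useful to confirm that the pdfminer.six distribution is the one actually present. The following is a minimal sketch that uses only the standard library; the helper name `pdfminer_six_version` is hypothetical and not part of pdfminer.

```python
from importlib import metadata


def pdfminer_six_version():
    """Return the installed pdfminer.six version string, or None if it is absent.

    Hypothetical helper for illustration; it only queries package metadata
    and never imports pdfminer itself.
    """
    try:
        return metadata.version("pdfminer.six")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = pdfminer_six_version()
    print(version or "pdfminer.six is not installed; try: pip install pdfminer.six")
```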
Usage
To extract the content of a PDF file and convert it to HTML, try:
from io import StringIO
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

buffer = StringIO()
with open(filename, "rb") as f:
    extract_text_to_fp(f, buffer, laparams=LAParams(), output_type="html", codec=None)
html = buffer.getvalue()
Page-by-page Processing
To process pages of a PDF file separately, try:
from io import StringIO
from pdfminer.converter import XMLConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams

buffer = StringIO()
manager = PDFResourceManager(caching=False)
converter = XMLConverter(manager, buffer, laparams=LAParams(), codec=None)
interpreter = PDFPageInterpreter(manager, converter)
with open(filename, 'rb') as f:
    for page in PDFPage.get_pages(f, pagenos=None, maxpages=0, password=None, caching=False):
        interpreter.process_page(page)
xml = buffer.getvalue()
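The example above accumulates every page into a single buffer. If the output of each page is wanted separately, one possible approach is to give each page its own buffer and converter. The sketch below assumes plain-text output via `TextConverter`; the function name `text_by_page` is hypothetical, and the pdfminer imports are placed inside the function only so that the module can be loaded where pdfminer.six is not installed.

```python
from io import StringIO


def text_by_page(path):
    """Return a list with one plain-text string per page of the PDF at `path`.

    Hypothetical helper: a sketch, not a definitive implementation.
    """
    # Imported lazily so this module still loads without pdfminer.six.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    manager = PDFResourceManager(caching=False)
    pages = []
    with open(path, "rb") as f:
        for page in PDFPage.get_pages(f):
            # A fresh buffer and converter per page keeps the outputs apart.
            buffer = StringIO()
            converter = TextConverter(manager, buffer, laparams=LAParams())
            PDFPageInterpreter(manager, converter).process_page(page)
            pages.append(buffer.getvalue())
            converter.close()
    return pages
```

Creating a converter per page costs a little more than reusing one, but it avoids having to reset a shared buffer between pages.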
Functions
Extract_Pages
pdfminer.high_level.extract_pages() reads an open binary (PDF) filestream and returns an iterator of pdfminer.layout.LTPage objects.
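As an illustration of walking the returned layout objects, the sketch below yields the text of each text container together with a page counter. The generator name `iter_text_blocks` is hypothetical, and the page number comes from `enumerate` rather than from any attribute of the page object.

```python
def iter_text_blocks(path):
    """Yield (page_number, text) for each text container in the PDF at `path`.

    Hypothetical helper: a sketch of iterating over LTPage objects.
    """
    # Imported lazily so this module still loads without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    with open(path, "rb") as f:
        for page_number, page_layout in enumerate(extract_pages(f), start=1):
            # Iterating an LTPage yields its top-level layout elements.
            for element in page_layout:
                if isinstance(element, LTTextContainer):
                    yield page_number, element.get_text()
```

A caller might use it as `for number, text in iter_text_blocks("example.pdf"): ...`, skipping figures, lines, and other non-text elements.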
Extract_Text
pdfminer.high_level.extract_text() reads an open binary (PDF) filestream and returns the text as a string.
from pdfminer.high_level import extract_text

with open('example.pdf', 'rb') as f:
    text = extract_text(f)
Extract_Text_To_Fp
pdfminer.high_level.extract_text_to_fp() reads an open binary (PDF) filestream and writes the content to an open filestream, optionally formatted in a semi-structured language such as HTML or XML.
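A small wrapper can make the choice of output format explicit. The sketch below is an assumption-laden illustration: the names `convert_pdf` and `SUPPORTED_OUTPUT_TYPES` are hypothetical, and the wrapper deliberately limits itself to the "text", "html", and "xml" output types written to an in-memory text buffer.

```python
from io import StringIO

# Output types this sketch chooses to accept (a deliberate subset).
SUPPORTED_OUTPUT_TYPES = ("text", "html", "xml")


def convert_pdf(path, output_type="text"):
    """Return the content of the PDF at `path` as a string in `output_type`.

    Hypothetical helper: a sketch, not a definitive implementation.
    """
    if output_type not in SUPPORTED_OUTPUT_TYPES:
        raise ValueError(f"unsupported output_type: {output_type!r}")

    # Imported lazily so this module still loads without pdfminer.six.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    buffer = StringIO()
    with open(path, "rb") as f:
        # codec=None because the destination is a text buffer, not bytes.
        extract_text_to_fp(
            f, buffer, laparams=LAParams(), output_type=output_type, codec=None
        )
    return buffer.getvalue()
```

For example, `convert_pdf("example.pdf", "xml")` would return the XML rendering, while an unrecognized type fails early, before the file is opened.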
Classes
LAParams
pdfminer.layout.LAParams is a class for layout analysis parameters.
Parameters | Meaning
line_overlap |
char_margin |
word_margin |
line_margin |
boxes_flow |
detect_vertical |
all_texts |
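To show how these parameters are passed, the sketch below builds an `LAParams` instance from a dictionary and hands it to `extract_text()`. The values are what I understand to be the pdfminer.six defaults and should be checked against the installed version. The names `LAYOUT_DEFAULTS` and `extract_text_with_layout` are hypothetical.

```python
# Assumed default values; verify against the installed pdfminer.six release.
LAYOUT_DEFAULTS = {
    "line_overlap": 0.5,
    "char_margin": 2.0,
    "word_margin": 0.1,
    "line_margin": 0.5,
    "boxes_flow": 0.5,
    "detect_vertical": False,
    "all_texts": False,
}


def extract_text_with_layout(path, **overrides):
    """Extract text from `path`, overriding selected layout parameters.

    Hypothetical helper: a sketch, not a definitive implementation.
    """
    unknown = set(overrides) - set(LAYOUT_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown layout parameters: {sorted(unknown)}")

    # Imported lazily so this module still loads without pdfminer.six.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    params = {**LAYOUT_DEFAULTS, **overrides}
    with open(path, "rb") as f:
        return extract_text(f, laparams=LAParams(**params))
```

A call such as `extract_text_with_layout("example.pdf", detect_vertical=True)` changes one parameter and leaves the rest at the assumed defaults.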
