= Python Pdfminer =

'''`pdfminer`''' is a third-party module for parsing PDF files.

<>

----



== Installation ==

The most up-to-date implementation of `pdfminer` is available through [[Python/Pip|pip(1)]] as `pdfminer.six`.

----



== Usage ==

To extract the content of a PDF file and convert it to HTML, try:

{{{
from io import StringIO
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

# filename is the path to the PDF file
buffer = StringIO()
with open(filename, "rb") as f:
    # codec=None makes the converter write str rather than bytes, as StringIO requires
    extract_text_to_fp(f, buffer, laparams=LAParams(), output_type="html", codec=None)
html = buffer.getvalue()
}}}



=== Page-by-page Processing ===

To process pages of a PDF file separately, try:

{{{
from io import StringIO
from pdfminer.converter import XMLConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams

# filename is the path to the PDF file
buffer = StringIO()
manager = PDFResourceManager(caching=False)
converter = XMLConverter(manager, buffer, laparams=LAParams(), codec=None)
interpreter = PDFPageInterpreter(manager, converter)
with open(filename, "rb") as f:
    # an empty string means the document has no password
    for page in PDFPage.get_pages(f, pagenos=None, maxpages=0, password="", caching=False):
        interpreter.process_page(page)
# close() writes the closing </pages> tag
converter.close()
xml = buffer.getvalue()
}}}

----



== Functions ==



=== Extract_Pages ===

'''`pdfminer.high_level.extract_pages()`''' reads an open binary (PDF) filestream and returns an iterator of `pdfminer.layout.LTPage` objects, one per page.

----



=== Extract_Text ===

'''`pdfminer.high_level.extract_text()`''' reads an open binary (PDF) filestream and returns the text as a [[Python/Builtins/Types#Str|string]].

{{{
from pdfminer.high_level import extract_text

with open('example.pdf', 'rb') as f:
    text = extract_text(f)
}}}

----



=== Extract_Text_To_Fp ===

'''`pdfminer.high_level.extract_text_to_fp()`''' reads an open binary (PDF) filestream and writes the content to an open filestream, optionally formatted in a semi-structured language such as HTML or XML.
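
As a minimal sketch of that behavior, the helper below wraps `extract_text_to_fp()` and returns the output as a string. The names `pdf_to_markup` and `SUPPORTED_TYPES` are hypothetical and not part of `pdfminer`; the list of output types is deliberately limited to three for this example.

```python
from io import StringIO

# Output formats handled by this sketch (hypothetical constant, not a pdfminer name).
SUPPORTED_TYPES = ("text", "html", "xml")


def pdf_to_markup(path, output_type="text"):
    """Return the content of the PDF at `path` as plain text, HTML, or XML."""
    if output_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported output_type: {output_type!r}")

    # Imported here so that argument validation works even where
    # pdfminer.six is not installed.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    buffer = StringIO()
    with open(path, "rb") as f:
        # codec=None keeps the output as str, which StringIO requires.
        extract_text_to_fp(f, buffer, laparams=LAParams(),
                           output_type=output_type, codec=None)
    return buffer.getvalue()
```

For example, `pdf_to_markup("example.pdf", output_type="xml")` should return the XML rendering of the file, assuming `example.pdf` exists and is readable.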
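
----



== Classes ==



=== LAParams ===

'''`pdfminer.layout.LAParams`''' is a class for layout analysis parameters.

||'''Parameters''' ||'''Meaning''' ||
||`line_overlap` ||Minimum overlap for two characters to be considered part of the same line, relative to the smaller character height ||
||`char_margin` ||Maximum gap between two characters for them to be considered part of the same line, relative to the character width ||
||`word_margin` ||Minimum gap between two characters on a line for them to be treated as separate words (a space is inserted), relative to the character width ||
||`line_margin` ||Maximum gap between two lines for them to be considered part of the same paragraph, relative to the line height ||
||`boxes_flow` ||Relative weight of horizontal and vertical position when ordering text boxes, from -1.0 (horizontal only) to +1.0 (vertical only); `None` disables advanced layout analysis ||
||`detect_vertical` ||Whether vertical text is considered during layout analysis ||
||`all_texts` ||Whether layout analysis is also performed on text in figures ||

----



=== LTPage ===

----
CategoryRicottone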