PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It also has an extensible PDF parser that can be used for purposes other than text analysis.
- Written entirely in Python (version 2.4 or newer).
- Parse, analyze, and convert PDF documents.
- PDF-1.7 specification support. (well, almost)
- CJK languages and vertical writing scripts support.
- Various font types (Type1, TrueType, Type3, and CID) support.
- Basic encryption (RC4) support.
- PDF to HTML conversion (with a sample converter web app).
- Outline (TOC) extraction.
- Tagged contents extraction.
- Reconstruct the original layout by grouping text chunks.
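The text-location and layout-grouping features listed above can be tried from Python. The following is a minimal sketch, not an official recipe: it assumes the module layout of the PDFMiner 20131113 release (pdfminer.pdfpage, pdfminer.pdfinterp, pdfminer.converter, pdfminer.layout), and the names extract_text_boxes and describe_box are hypothetical helpers introduced here for illustration.

```python
# Sketch only: assumes the module layout of the PDFMiner 20131113 release.
# extract_text_boxes and describe_box are hypothetical helper names.


def describe_box(bbox, text):
    """Format one text box as 'x0,y0,x1,y1: text'."""
    x0, y0, x1, y1 = bbox
    return "%.1f,%.1f,%.1f,%.1f: %s" % (x0, y0, x1, y1, text.strip())


def extract_text_boxes(path):
    """Return a list of (page_number, bbox, text) for each text box in a PDF."""
    # Imported lazily so this file can be loaded without PDFMiner installed.
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox

    rsrcmgr = PDFResourceManager()
    # Passing LAParams() enables layout analysis, which groups text chunks
    # into boxes.
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    boxes = []
    fp = open(path, "rb")
    try:
        for index, page in enumerate(PDFPage.get_pages(fp)):
            interpreter.process_page(page)
            layout = device.get_result()  # layout of the page just processed
            for obj in layout:
                if isinstance(obj, LTTextBox):
                    # bbox is (x0, y0, x1, y1) in PDF page coordinates.
                    boxes.append((index + 1, obj.bbox, obj.get_text()))
    finally:
        fp.close()
    return boxes
```

Calling extract_text_boxes("samples/simple1.pdf") and passing each bbox and text to describe_box would print one line per text box. Only top-level text boxes are reported here; a fuller version would also walk into nested layout objects such as figures.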
PDFMiner is about 20 times slower than C/C++-based counterparts such as XPdf.
How to Install
- Install Python 2.4 or newer. (Python 3 is not supported.)
- Download the PDFMiner source.
- Unpack it.
- Run setup.py to install:
    # python setup.py install
- Do the following test:
    $ pdf2txt.py samples/simple1.pdf
    Hello World
    Hello World
    H e l l o W o r l d
    H e l l o W o r l d
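Beyond the test above, the same pdf2txt.py script can write other formats such as HTML. The sketch below shows one way to drive it from Python. It uses the script's -t (output type) and -o (output file) options; the helper names build_pdf2txt_command and convert are hypothetical, and the code assumes pdf2txt.py is installed on the PATH and directly executable.

```python
import subprocess

# Output types accepted by the -t option of pdf2txt.py.
OUTPUT_TYPES = ("text", "html", "xml", "tag")


def build_pdf2txt_command(pdf_path, output_path=None, output_type="text"):
    """Return the argument list for one pdf2txt.py run (hypothetical helper)."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError("unsupported output type: %r" % (output_type,))
    command = ["pdf2txt.py", "-t", output_type]
    if output_path is not None:
        # Without -o, pdf2txt.py writes the result to standard output.
        command += ["-o", output_path]
    command.append(pdf_path)
    return command


def convert(pdf_path, output_path=None, output_type="text"):
    """Run pdf2txt.py and return its exit status.

    Assumes pdf2txt.py is on the PATH and executable.
    """
    return subprocess.call(
        build_pdf2txt_command(pdf_path, output_path, output_type))
```

For example, convert("samples/simple1.pdf", "simple1.html", "html") would run pdf2txt.py -t html -o simple1.html samples/simple1.pdf. Building the argument list separately from running it keeps the command easy to inspect before anything is executed.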
For more tutorials, visit the official website.
Linux :: PDFMiner-20131113.tar.gz
Official Website :: http://www.unixuser.org/~euske/python/pdfminer/index.html