libpdf

libpdf extracts structured data from machine-readable PDFs. It is tested with Python 3.8, 3.9, 3.10, 3.11 and 3.12.

Motivation

libpdf aims to bridge the gap between low-level PDF extraction libraries like pdfminer, pdfplumber, PyPDF2 or Poppler and end users who are looking for a structure- and content-aware extraction solution.

libpdf specifically cares about the structure of PDFs. It extracts the chapter hierarchy and puts paragraphs, tables and figures into their corresponding hierarchical position. Extracted PDF links do not just point to a coordinate on a page, as specified in the PDF standard, but to the very chapter, table, figure or paragraph at that position. That makes it possible to get human- and machine-readable access to the original document structure.
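The idea of resolving a link coordinate to a document element can be illustrated with a minimal sketch. It is not libpdf's implementation or API: the Element class, its fields, the identifiers and the resolve_link function are all hypothetical names chosen for illustration, and the rule of picking the smallest enclosing bounding box is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical extracted element; not libpdf's actual class."""
    kind: str   # "chapter", "paragraph", "table" or "figure"
    uid: str    # illustrative identifier
    page: int
    x0: float   # bounding box: left
    y0: float   # bounding box: bottom
    x1: float   # bounding box: right
    y1: float   # bounding box: top

    def contains(self, page: int, x: float, y: float) -> bool:
        """True if the point lies on this element's page and inside its box."""
        return (
            page == self.page
            and self.x0 <= x <= self.x1
            and self.y0 <= y <= self.y1
        )


def resolve_link(
    elements: List[Element], page: int, x: float, y: float
) -> Optional[Element]:
    """Return the smallest element whose box contains the target, if any."""
    hits = [e for e in elements if e.contains(page, x, y)]
    if not hits:
        return None
    # Assumption: the element with the smallest area is the most specific one.
    return min(hits, key=lambda e: (e.x1 - e.x0) * (e.y1 - e.y0))


# Made-up layout data, for illustration only.
elements = [
    Element("chapter", "chapter.1", 1, 50, 700, 550, 730),
    Element("table", "table.1", 2, 50, 300, 550, 500),
    Element("paragraph", "paragraph.7", 2, 50, 520, 550, 600),
]

target = resolve_link(elements, page=2, x=100.0, y=400.0)
print(target.uid if target else "no element at target")
```

With the made-up data above, the link target on page 2 falls inside the table's bounding box, so the link resolves to that table rather than to a bare coordinate.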

PDF documents are inherently hard to extract because the PDF standard is optimized for visual representation. PDF does not know about words, spaces, line breaks or tables. Some libraries specialize in forming words and text lines from characters and extracting that text. However, they commonly don’t know whether the text is part of a table, inside a figure or part of a chapter. They also won’t recognize headers and footers. That means the user has to post-process the output and deal with the layout issues. Libraries like Camelot or Tabula specialize only in table extraction (which is great) and can only be a part of the overall solution.

libpdf implements a well-defined UML PDF model and populates it with the extracted data. The API as well as the JSON/YAML output follow the model. The design of the model is generic and should fit many use cases.

The library is mainly targeted at machine-readable technical documentation PDFs, but could also work on others. Machine-readable means the PDF does not consist of bitmaps, so users can select and copy text with a PDF viewer.

After evaluating multiple low-level libraries, pdfplumber and pdfminer were chosen as a basis. Understanding these libraries and their specifics tends to consume a lot of time and resources, so libpdf was created to bring users a more ready-to-use experience.

libpdf would not exist without the great underlying libraries and their maintainers’ support. Thank you!
