API

Loading a PDF

libpdf.load(pdf, verbose=1, page_range=None, page_crop=None, smart_page_crop=False, save_figures=False, figure_dir='figures', no_annotations=False, no_chapters=False, no_paragraphs=False, no_tables=False, no_figures=False, no_rects=False, init_logging=True, visual_debug=False, visual_debug_output_dir='visual_debug_libpdf', visual_split_elements=False, visual_debug_include_elements=None, visual_debug_exclude_elements=None)

Entry point for using libpdf as a library.

Internally the function is named main_api() to parallel main_cli() and main(). It is exposed to API users as libpdf.load(), which is the more expressive name.

Parameters:
  • pdf (str) – path to the PDF to read

  • verbose (int) – verbosity level as integer (0 = errors, fatal, critical, 1 = warnings, 2 = info, 3 = debug)

  • page_range (str) – range of pages to extract as string without spaces (e.g. 3-5 or 3,4,7 or 3-5,7)

  • page_crop (Tuple[float, float, float, float]) – see description in function core.main()

  • smart_page_crop (bool) – see description in function core.main()

  • save_figures (bool) – flag triggering the export of figures to the figure_dir

  • figure_dir (str) – output directory for extracted figures; if it does not exist, it will be created

  • no_annotations (bool) – flag triggering the exclusion of annotations from pdf catalog

  • no_chapters (bool) – flag triggering the exclusion of chapters (resulting in a flat list of elements)

  • no_paragraphs (bool) – flag triggering the exclusion of paragraphs (no normal text content)

  • no_tables (bool) – flag triggering the exclusion of tables

  • no_figures (bool) – flag triggering the exclusion of figures

  • no_rects (bool) – flag triggering the exclusion of rects

  • init_logging (bool) – flag indicating whether libpdf shall instantiate a root log handler that is capable of handling both log messages and progress bars; it does so by passing all log messages to tqdm.write()

  • visual_debug (bool) – flag triggering visual debug feature

  • visual_debug_output_dir (str) – output directory for visualized pdf pages

  • visual_split_elements (bool) – flag triggering the output of visualized elements into separate folders

  • visual_debug_include_elements (List[str]) – a list of elements that shall be included when visual debugging

  • visual_debug_exclude_elements (List[str]) – a list of elements that shall be excluded when visual debugging

Returns:

instance of ApiObjects class

Return type:

ApiObjects
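The page_range string format described above can be illustrated with a small parser. This is a minimal sketch and not libpdf's own parsing code: expand_page_range is a hypothetical helper name, and the commented call at the end assumes that libpdf is installed and that a file named example.pdf exists.

```python
from typing import List, Set


def expand_page_range(page_range: str) -> List[int]:
    """Expand a string such as '3-5,7' into a sorted list of page numbers."""
    pages: Set[int] = set()
    for part in page_range.split(","):
        if "-" in part:
            # A part like '3-5' is an inclusive range.
            first, last = part.split("-")
            pages.update(range(int(first), int(last) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_range("3-5,7"))

# Hypothetical usage, assuming libpdf is installed and example.pdf exists:
# import libpdf
# objects = libpdf.load("example.pdf", page_range="3-5,7", no_figures=True)
```

The helper only shows which pages a given string selects; libpdf.load() takes the string itself, not the expanded list.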

Returned objects

class libpdf.apiobjects.ApiObjects(root, chapters, paragraphs, tables, figures, rects, pdfplumber, pdfminer)

Bases: object

Data class that stores instances for all extracted PDF objects.

Variables:
  • root (Root) – Main entry point to structured data as per the UML PDF model.

  • flattened (Flattened) – named tuple holding flattened versions of all nested objects in root.contents.*; the element types chapters/paragraphs/tables/figures can be directly accessed (API convenience)

  • pdfplumber (PDF) – pdfplumber PDF object for further processing by API users

  • pdfminer (PDFDocument) – pdfminer PDF object for further processing by API users, also available in pdfplumber.doc

Parameters:
  • root (Root) –

  • chapters (List[Chapter]) –

  • paragraphs (List[Paragraph]) –

  • tables (List[Table]) –

  • figures (List[Figure]) –

  • rects (List[Rect]) –

  • pdfplumber (PDF) –

  • pdfminer (PDFDocument) –

class libpdf.apiobjects.Flattened(chapters, paragraphs, tables, figures, rects)

Bases: NamedTuple

NamedTuple holding flattened Element instances, with type hints for each field.

Parameters:
chapters: List[Chapter]

Alias for field number 0

paragraphs: List[Paragraph]

Alias for field number 1

tables: List[Table]

Alias for field number 2

figures: List[Figure]

Alias for field number 3

rects: List[Rect]

Alias for field number 4
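The "Alias for field number N" entries mean that each field can be read either by name or by position. The following sketch shows this with a stand-in class; FlattenedSketch is a hypothetical name, and plain strings replace the real Chapter, Paragraph, Table, Figure and Rect types.

```python
from typing import List, NamedTuple


class FlattenedSketch(NamedTuple):
    """Stand-in for Flattened; strings replace the real element types."""

    chapters: List[str]
    paragraphs: List[str]
    tables: List[str]
    figures: List[str]
    rects: List[str]


flat = FlattenedSketch(
    chapters=["chapter.1"],
    paragraphs=["paragraph.1", "paragraph.2"],
    tables=[],
    figures=["figure.1"],
    rects=[],
)

# Access by name and access by position return the same list object.
print(flat.paragraphs is flat[1])
print(flat._fields)
```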

Model classes

The ApiObjects instance returned by libpdf.load() contains instances of the following classes.

Root

class libpdf.models.root.Root(file, pages, content)

Bases: ModelBase

Main entry point to the UML PDF model.

File

class libpdf.models.file.File(name, path, page_count, crop_top=0, crop_bottom=0, crop_left=0, crop_right=0, file_meta=None, root=None)

Bases: ModelBase

PDF file data.

There is a file wide crop feature that removes static parts from each page:

*-page-------------------------------------*
|                    ^                     |
|                crop_top                  |
|                    v                     |
|               +-content-+                |
|<--crop_left-->|         |<--crop_right-->|
|               |         |                |
|               |         |                |
|               +---------+                |
|                   ^                      |
|              crop_bottom                 |
|                   v                      |
*------------------------------------------*

It can be used to ignore headers, footers or sidebars. The user-defined parameters are exposed to both the CLI and API.
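As an illustration of the crop diagram, the remaining content area can be computed from the page size and the four crop distances. This is a sketch under the assumption that coordinates are in points with the origin at the bottom left of the page; content_box is a hypothetical helper and not part of libpdf.

```python
from typing import Tuple


def content_box(
    page_width: float,
    page_height: float,
    crop_top: float = 0,
    crop_bottom: float = 0,
    crop_left: float = 0,
    crop_right: float = 0,
) -> Tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) of the area that remains after cropping."""
    x0 = crop_left
    y0 = crop_bottom
    x1 = page_width - crop_right
    y1 = page_height - crop_top
    if x0 >= x1 or y0 >= y1:
        raise ValueError("crop margins leave no content area")
    return (x0, y0, x1, y1)


# A 612 x 792 point page with a 50 pt header and a 40 pt footer ignored.
print(content_box(612, 792, crop_top=50, crop_bottom=40))
```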

Variables:
  • name (str) – PDF file name

  • path (str) – PDF file path

  • page_count (int) – number of pages in PDF

  • crop_top (float) – distance in points from top of each page to ignore for extraction

  • crop_bottom (float) – distance in points from bottom of each page to ignore for extraction

  • crop_left (float) – distance in points from left side of each page to ignore for extraction

  • crop_right (float) – distance in points from right side of each page to ignore for extraction

  • file_meta (FileMeta) – reference to FileMeta instance

  • b_root (Root) – back reference to Root instance

Parameters:
  • name (str) –

  • path (str) –

  • page_count (int) –

  • crop_top (float) –

  • crop_bottom (float) –

  • crop_left (float) –

  • crop_right (float) –

  • file_meta (FileMeta) –

  • root (Root) –

property id_

Return the identifier to address the file.

The parameter can later be used during libpdf postprocessing to link to elements in other files.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used. The file identifier is built from the file name including extension. All characters are removed that do not follow the Python identifier character set (Regex character set [_a-zA-Z0-9]).
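The character filtering described above can be sketched with a single regular expression substitution. strip_non_identifier_chars is a hypothetical helper that covers only the filtering step; whatever else the property does to build the final identifier is not shown here.

```python
import re


def strip_non_identifier_chars(file_name: str) -> str:
    """Drop every character outside the set [_a-zA-Z0-9]."""
    return re.sub(r"[^_a-zA-Z0-9]", "", file_name)


# Spaces, hyphens and the dot before the extension are all removed.
print(strip_non_identifier_chars("my report-v2.pdf"))
```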

FileMeta

class libpdf.models.file_meta.FileMeta(author=None, title=None, subject=None, creator=None, producer=None, keywords=None, creation_date=None, modified_date=None, trapped=None, file=None)

Bases: ModelBase

PDF file meta data.

Variables:
  • author (str) – PDF author meta data field

  • title (str) – PDF title meta data field

  • subject (str) – PDF subject meta data field

  • creator (str) – PDF creator meta data field

  • producer (str) – PDF producer meta data field

  • keywords (str) – PDF keywords meta data field

  • creation_date (datetime) – PDF creation date given as datetime instance

  • modified_date (datetime) – PDF modified date given as datetime instance

  • trapped (bool) – PDF printing trap flag (https://en.wikipedia.org/wiki/Trap_%28printing%29)

  • b_file (File) – back reference to a File instance

Parameters:
  • author (str) –

  • title (str) –

  • subject (str) –

  • creator (str) –

  • producer (str) –

  • keywords (str) –

  • creation_date (datetime) –

  • modified_date (datetime) –

  • trapped (bool) –

  • file (File) –

Page

class libpdf.models.page.Page(number, width, height, content=None, root=None, positions=None)

Bases: ModelBase

PDF page data.

Variables:
  • number (int) – PDF page number, 1-based

  • width (float) – page width in points

  • height (float) – page height in points

  • content (List[Union[Chapter, Paragraph, Table, Figure]]) – ordered list of elements on the page; chapters might still be nested if the page contains sub-chapters

  • root (Root) – back reference to a Root instance

  • b_positions (List[Position]) – back reference to all Position instances on the page

property id_

Return the identifier to address the Page.

The identifier follows the pattern page.<number>. It is used as a link target if a PDF link annotation points to a blank space position, i.e. there is no Chapter, Paragraph, Table, Figure at the target location.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Element

class libpdf.models.element.Element(position, root=None, chapter=None)

Bases: ModelBase, ABC

Base class for Chapter, Paragraph, Table and Figure.

Variables:
  • position (Position) – a Position instance determining the location of the Element

  • b_root (Root) – Root instance (mutually exclusive with the b_chapter parameter)

  • b_chapter (Chapter) – parent Chapter instance (mutually exclusive with the b_root parameter)

abstract property id_

Return the identifier to address the Element.

property uid

Return the unique identifier to address the full path to the Element.

The identifier follows the pattern element.<number>/element.<number>.

For example, the uid for a paragraph in chapter.2.1.4 is ‘chapter.2/chapter.2.1/chapter.2.1.4/paragraph.6’.

Type:

str
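The uid pattern can be reproduced by walking from an element up through its parent chapters and joining the identifiers with a slash. The sketch below uses a hypothetical stand-in class, NodeSketch, with a single parent attribute instead of the real b_chapter and b_root back references.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeSketch:
    """Stand-in for an Element: an id_ plus an optional parent chapter."""

    id_: str
    parent: Optional["NodeSketch"] = None

    @property
    def uid(self) -> str:
        # Collect ids from the element up to the top-level chapter,
        # then join them from outermost to innermost.
        parts: List[str] = []
        node: Optional["NodeSketch"] = self
        while node is not None:
            parts.append(node.id_)
            node = node.parent
        return "/".join(reversed(parts))


chapter_2 = NodeSketch("chapter.2")
chapter_2_1 = NodeSketch("chapter.2.1", parent=chapter_2)
chapter_2_1_4 = NodeSketch("chapter.2.1.4", parent=chapter_2_1)
paragraph = NodeSketch("paragraph.6", parent=chapter_2_1_4)

print(paragraph.uid)
```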

Chapter

class libpdf.models.chapter.Chapter(title, number, position, content=None, chapter=None, textbox=None)

Bases: Element

PDF chapter (extracted from PDF outline).

Chapter elements define the structure of the PDF. If an outline is given, Chapters are extracted from it and all elements (sub-chapters, tables, figures, paragraphs) are put below the Chapter in the ordered content list.

Variables:
  • title (str) – the title of the chapter, as extracted from outline

  • number (str) – the chapter number as string (e.g. ‘3.2.4’)

  • textbox (HorizontalBox) – the textbox of the chapter, as extracted from pdfminer

  • position (Position) – a Position instance determining the location of the Chapter; a Chapter commonly spans across several pages, however only one Position is aggregated because the end of the Chapter can be determined by looking at the next Chapter

  • content (List[Union[Chapter, Paragraph, Table, Figure]]) – the content of the chapter (other sub-chapters, paragraphs, tables, figures)

property id_

Return the identifier to address the Chapter.

The identifier follows the pattern chapter.<number>.

It is used as a link target if a PDF link-annotation points to the Element.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Type:

str
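Because a Chapter's content list can itself contain sub-chapters, collecting all non-chapter elements requires a recursive walk. The following is a sketch with a hypothetical stand-in class, ChapterSketch, in which plain strings take the place of Paragraph, Table and Figure instances.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class ChapterSketch:
    """Stand-in for Chapter: a number and an ordered, possibly nested content list."""

    number: str
    content: List[Union["ChapterSketch", str]] = field(default_factory=list)


def iter_leaf_elements(chapter: ChapterSketch) -> Iterator[str]:
    """Yield non-chapter elements depth-first, in content order."""
    for item in chapter.content:
        if isinstance(item, ChapterSketch):
            yield from iter_leaf_elements(item)
        else:
            yield item


chapter = ChapterSketch(
    "3",
    [
        "paragraph.1",
        ChapterSketch("3.1", ["table.1", "figure.1"]),
        ChapterSketch("3.2", ["paragraph.1"]),
    ],
)

print(list(iter_leaf_elements(chapter)))
```

Note that idx values restart in each scope, which is why paragraph.1 appears twice in the flattened result.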

Paragraph

class libpdf.models.paragraph.Paragraph(idx, position, links, textbox=None, root=None, chapter=None)

Bases: Element

PDF paragraph (normal text).

A paragraph always ends at the end of a page.

Variables:
  • idx (int) – the number of the instance in the current scope, 1-based

  • position (Position) – the position of the paragraph

  • links (List[Link]) – list of links in the paragraph text

  • textbox (HorizontalBox) – the textbox of the paragraph, as extracted from pdfminer

property id_

Return the identifier to address the Paragraph.

The identifier follows the pattern paragraph.<idx>, where idx is the 1-based number of the Paragraph in the current scope (root, chapter, sub-chapters, page).

It is used as a link target if a PDF link-annotation points to the Element.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Type:

str

Set b_source back reference on all links.

Table

class libpdf.models.table.Table(idx, cells, position, caption=None)

Bases: Element

PDF table data.

Variables:
  • idx (int) – the number of the instance in the current scope, 1-based

  • cells (List[Cell]) – a list of Cell instances that are part of the table

  • caption (str) – the caption of the table (text over/under the table describing it)

  • position (Position) – a Position instance determining the location of the table

Parameters:
  • idx (int) –

  • cells (List[Cell]) –

  • position (Position) –

property id_

Return the identifier to address the Table.

The identifier follows the pattern table.<idx>, where idx is the 1-based number of the Table in the current scope (root, chapter, sub-chapters, page).

It is used as a link target if a PDF link-annotation points to the Element.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Type:

str

property rows

Return the table as a list of rows, where each row is a list of the cells in that row.

Type:

List[List[Cell]]

property columns

Return the table as a list of columns, where each column is a list of the cells in that column.

Type:

List[List[Cell]]

property rows_count

Return the number of rows in the table.

Type:

int

property columns_count

Return the number of columns in the table.

Type:

int
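The relation between the flat cells list and the rows and columns properties can be sketched as a grouping by row or column number. This is an illustration and not libpdf's implementation: CellSketch, group_rows and group_columns are hypothetical names, and the ordering within each group is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CellSketch:
    """Stand-in for Cell with 1-based row and col numbers."""

    row: int
    col: int
    text: str


def group_rows(cells: List[CellSketch]) -> List[List[CellSketch]]:
    """Return rows ordered by row number, each row ordered by column number."""
    row_numbers = sorted({cell.row for cell in cells})
    return [
        sorted((c for c in cells if c.row == number), key=lambda c: c.col)
        for number in row_numbers
    ]


def group_columns(cells: List[CellSketch]) -> List[List[CellSketch]]:
    """Return columns ordered by column number, each column ordered by row number."""
    col_numbers = sorted({cell.col for cell in cells})
    return [
        sorted((c for c in cells if c.col == number), key=lambda c: c.row)
        for number in col_numbers
    ]


cells = [
    CellSketch(2, 1, "c"),
    CellSketch(1, 2, "b"),
    CellSketch(1, 1, "a"),
    CellSketch(2, 2, "d"),
]
rows = group_rows(cells)
columns = group_columns(cells)

print([[c.text for c in row] for row in rows])
print([[c.text for c in col] for col in columns])
print(len(rows), len(columns))  # counterparts of rows_count and columns_count
```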

Cell

class libpdf.models.table.Cell(row, col, position, links, table=None, textbox=None)

Bases: ModelBase

PDF table cell data.

Variables:
  • row (int) – the row number of the cell, 1-based

  • col (int) – the column number of the cell, 1-based

  • textbox (HorizontalBox) – the textbox of the cell, as extracted from pdfminer and converted to HorizontalBox

  • position (Position) – a Position instance determining the location of the cell

  • b_table (Table) – a Table instance that contains the cell

  • links (List[Link]) – list of links in the cell text

Set b_source back reference on all links.

Figure

class libpdf.models.figure.Figure(idx, rel_path, position, links, textboxes, text=None, caption=None)

Bases: Element

PDF figure.

A figure can be a bitmap image or vector graphics mixed with overlaying text. libpdf extracts figures into an external file where rel_path defines the path to the external file. The text property contains text extracted from the figure area. This can be highly unstructured because libpdf does not analyze the text layout within figures as there is no common denominator for an algorithm. libpdf will however do the same character grouping analysis as for paragraphs, so the user can assume text flow is from top left to bottom right.

Variables:
  • idx (int) – the number of the instance in the current scope, 1-based

  • rel_path (str) – the path to the external file containing the figure

  • textboxes (List[HorizontalBox]) – the textboxes of the figure, as extracted from pdfminer

  • caption (str) – the caption of the figure (text over/under the figure describing it)

  • position (Position) – a Position instance determining the location of the figure

  • links (List[Link]) – list of text links in the figure area

Parameters:
  • idx (int) –

  • rel_path (str) –

  • position (Position) –

  • links (List[Link]) –

  • textboxes (List[HorizontalBox]) –

  • text (str) –

  • caption (str) –

property id_

Return the identifier to address the Figure.

The identifier follows the pattern figure.<idx>, where idx is the 1-based number of the Figure in the current scope (root, chapter, sub-chapters, page).

It is used as a link target if a PDF link-annotation points to the Element.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Type:

str

Set b_source back reference on all links.

Rect

class libpdf.models.rect.Rect(idx, position, textbox, non_stroking_color=None)

Bases: Element

Rectangles in a PDF.

The rectangles are extracted from pdfplumber. The text covered by a rectangle is extracted and stored in a newly instantiated textbox.

property id_: str

Return the identifier to address the Rect.

The identifier follows the pattern rect.<idx>, where idx is the 1-based number of the Rect in the current scope (root, chapter, sub-chapters, page).

It is used as a link target if a PDF link-annotation points to the Element.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Type:

str

Position

class libpdf.models.position.Position(x0, y0, x1, y1, page, element=None, cell=None)

Bases: object

Define the coordinates of an Element or Cell.

A position is either linked by an Element or by a Cell (mutually exclusive). A position keeps a reference to the Page it is located on.

Here is some ASCII art to explain the libpdf coordinates:

*-page------------------------*
|                             |
|                             |
|                             |
|        +-bbox----+          |
|<--x0-->|         |    ^     |
|        |         |    |     |
|<-------|---x1--->|    |     |
|        +---------+    |     |
|            ^          |     |
|           y0         y1     |
|            v          v     |
*-----------------------------*

The bbox definition [x0, y0, x1, y1] is in sync with pdfminer and the PDF standard. Coordinate type is float for both libpdf and pdfminer.

Note

pdfplumber has a different definition of bounding boxes:

*-page------------------------*
|                ^      ^     |
|               top     |     |
|                v      |     |
|        +-bbox----+    |     |
|<--x0-->|         |    |     |
|        |         |  bottom  |
|<-------|---x1--->|    |     |
|        +---------+    v     |
|                             |
|                             |
|                             |
*-----------------------------*

The pdfplumber bounding box is [x0, top, x1, bottom]. Coordinate type is Decimal.

To deal with the coordinate and type differences there are conversion functions to_pdfplumber_bbox and from_pdfplumber_bbox in module libpdf.utils.
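The difference between the two bounding box definitions comes down to flipping the vertical axis at the page height. The sketch below illustrates the arithmetic only; the function names are hypothetical, and the real to_pdfplumber_bbox and from_pdfplumber_bbox in libpdf.utils may have different signatures.

```python
from decimal import Decimal
from typing import Tuple

FloatBBox = Tuple[float, float, float, float]
DecimalBBox = Tuple[Decimal, Decimal, Decimal, Decimal]


def to_plumber_bbox_sketch(bbox: FloatBBox, page_height: float) -> DecimalBBox:
    """Convert [x0, y0, x1, y1] (origin bottom left) to [x0, top, x1, bottom]."""
    x0, y0, x1, y1 = bbox
    top = page_height - y1  # upper edge, measured from the top of the page
    bottom = page_height - y0  # lower edge, measured from the top of the page
    return (Decimal(str(x0)), Decimal(str(top)), Decimal(str(x1)), Decimal(str(bottom)))


def from_plumber_bbox_sketch(bbox: DecimalBBox, page_height: float) -> FloatBBox:
    """Convert [x0, top, x1, bottom] back to [x0, y0, x1, y1] as floats."""
    x0, top, x1, bottom = (float(value) for value in bbox)
    return (x0, page_height - bottom, x1, page_height - top)


page_height = 792.0
libpdf_bbox = (72.0, 100.0, 200.0, 150.0)
plumber_bbox = to_plumber_bbox_sketch(libpdf_bbox, page_height)

print(plumber_bbox)
print(from_plumber_bbox_sketch(plumber_bbox, page_height))
```

The x coordinates are unchanged in both directions; only the vertical pair is mirrored and swapped, so that top is smaller than bottom while y0 is smaller than y1.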

Variables:
  • x0 (float) – distance from the left of the page to the left edge of the box

  • y0 (float) – distance from the bottom of the page to the lower edge of the box (less than y1)

  • x1 (float) – distance from the left of the page to the right edge of the box

  • y1 (float) – distance from the bottom of the page to the upper edge of the box (greater than y0)

  • page (Page) – reference to a Page object

  • element (Element) – element that refers to the position (mutually exclusive with cell)

  • cell (Cell) – cell that refers to the position (mutually exclusive with element)

Parameters:
  • x0 (float) –

  • y0 (float) –

  • x1 (float) –

  • y1 (float) –

  • page (Page) –

  • element (Element) –

  • cell (Cell) –

Char

class libpdf.models.horizontal_box.Char(text, x0=None, y0=None, x1=None, y1=None, ncolor=None, fontname=None)

Bases: object

Define the character class.

Variables:
  • text (str) – the plain text of the character

  • x0 (float) – distance from the left of the page to the left edge of the character

  • y0 (float) – distance from the bottom of the page to the lower edge of the character (less than y1)

  • x1 (float) – distance from the left of the page to the right edge of the character

  • y1 (float) – distance from the bottom of the page to the upper edge of the character (greater than y0)

  • ncolor (Tuple[float, float, float]) – non-stroking-color as rgb value

Parameters:
  • text (str) –

  • x0 (float | None) –

  • y0 (float | None) –

  • x1 (float | None) –

  • y1 (float | None) –

  • ncolor (tuple | None) –

  • fontname (str | None) –

Word

class libpdf.models.horizontal_box.Word(chars, x0=None, y0=None, x1=None, y1=None)

Bases: object

Define the word class.

A word shall contain several characters.

Variables:

chars (List[Char]) – a list of the characters

Parameters:
  • chars (list[Char]) –

  • x0 (float | None) –

  • y0 (float | None) –

  • x1 (float | None) –

  • y1 (float | None) –

property text: str

Return plain text.

HorizontalLine

class libpdf.models.horizontal_box.HorizontalLine(words, x0=None, y0=None, x1=None, y1=None)

Bases: object

Define the horizontal line class.

A horizontal line shall contain a word or several words.

Variables:

words (List[Word]) – a list of the words

Parameters:
  • words (list[Word]) –

  • x0 (float | None) –

  • y0 (float | None) –

  • x1 (float | None) –

  • y1 (float | None) –

property text: str

Return plain text.

HorizontalBox

class libpdf.models.horizontal_box.HorizontalBox(lines, x0=None, y0=None, x1=None, y1=None)

Bases: object

Define the horizontal box class.

A horizontal box shall contain one or more horizontal lines.

Variables:

lines (List[HorizontalLine]) – a list of HorizontalLine instances

Parameters:
  • lines (list[HorizontalLine]) –

  • x0 (float | None) –

  • y0 (float | None) –

  • x1 (float | None) –

  • y1 (float | None) –

property text: str

Return plain text.

property words: list[str]

Return list of words.
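The Char, Word, HorizontalLine and HorizontalBox classes form a containment hierarchy in which each level derives its text from the level below. The following sketch mirrors that structure with hypothetical stand-in classes that omit coordinates; joining words with a space and lines with a newline is an assumption for illustration, not a statement about libpdf's behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharSketch:
    text: str


@dataclass
class WordSketch:
    chars: List[CharSketch]

    @property
    def text(self) -> str:
        return "".join(char.text for char in self.chars)


@dataclass
class LineSketch:
    words: List[WordSketch]

    @property
    def text(self) -> str:
        # Assumption: words are separated by a single space.
        return " ".join(word.text for word in self.words)


@dataclass
class BoxSketch:
    lines: List[LineSketch]

    @property
    def text(self) -> str:
        # Assumption: lines are separated by a newline.
        return "\n".join(line.text for line in self.lines)

    @property
    def words(self) -> List[str]:
        return [word.text for line in self.lines for word in line.words]


def make_line(sentence: str) -> LineSketch:
    """Build a line from a whitespace-separated string (test helper)."""
    return LineSketch(
        [WordSketch([CharSketch(c) for c in word]) for word in sentence.split()]
    )


box = BoxSketch([make_line("PDF file"), make_line("meta data")])
print(repr(box.text))
print(box.words)
```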