API

Loading a PDF

libpdf.load(pdf, verbose=1, page_range=None, page_crop=None, smart_page_crop=False, save_figures=False, figure_dir='figures', no_annotations=False, no_chapters=False, no_paragraphs=False, no_tables=False, no_figures=False, no_rects=False, init_logging=True, visual_debug=False, visual_debug_output_dir='visual_debug_libpdf', visual_split_elements=False, visual_debug_include_elements=None, visual_debug_exclude_elements=None)

Entry point for using libpdf as a library.

Internally the function is named main_api(), matching main_cli() and main(). It is exposed to API users as libpdf.load(), which is considered more expressive.

Parameters:

pdf (str) – path to the PDF to read
verbose (int) – verbosity level as integer (0 = errors, fatal, critical, 1 = warnings, 2 = info, 3 = debug)
page_range (str) – range of pages to extract as string without spaces (e.g. 3-5 or 3,4,7 or 3-5,7)
page_crop (Tuple[float, float, float, float]) – see description in function core.main()
smart_page_crop (bool) – see description in function core.main()
save_figures (bool) – flag triggering the export of figures to the figure_dir
figure_dir (str) – output directory for extracted figures; if it does not exist, it will be created
no_annotations (bool) – flag triggering the exclusion of annotations from pdf catalog
no_chapters (bool) – flag triggering the exclusion of chapters (resulting in a flat list of elements)
no_paragraphs (bool) – flag triggering the exclusion of paragraphs (no normal text content)
no_tables (bool) – flag triggering the exclusion of tables
no_figures (bool) – flag triggering the exclusion of figures
no_rects (bool) – flag triggering the exclusion of rects
init_logging (bool) – flag indicating whether libpdf shall instantiate a root log handler that is capable of handling both log messages and progress bars; it does so by passing all log messages to tqdm.write()
visual_debug (bool) – flag triggering visual debug feature
visual_debug_output_dir (str) – output directory for visualized pdf pages
visual_split_elements (bool) – flag triggering the split of visualized elements into separate folders
visual_debug_include_elements (List[str]) – a list of elements that shall be included when visual debugging
visual_debug_exclude_elements (List[str]) – a list of elements that shall be excluded when visual debugging

Returns:

instance of ApiObjects class

Return type:

ApiObjects
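A minimal usage sketch, assuming libpdf is installed and a PDF exists at the given path (the file name report.pdf in the comment is hypothetical). The helper expand_page_range is not part of libpdf; it only illustrates the page_range string format described above. The libpdf import is deferred so the helper can be tried without the library.

```python
def expand_page_range(page_range):
    """Expand a string such as '3-5,7' into [3, 4, 5, 7].

    Illustration of the format only; this is not libpdf's own parser.
    """
    pages = []
    for part in page_range.split(","):
        if "-" in part:
            first, last = part.split("-")
            pages.extend(range(int(first), int(last) + 1))
        else:
            pages.append(int(part))
    return pages


def load_tables_only(pdf_path, page_range=None):
    """Load a PDF and return only its tables (requires libpdf)."""
    import libpdf  # deferred import: only needed when this function is called

    objects = libpdf.load(
        pdf_path,
        verbose=0,
        page_range=page_range,
        no_chapters=True,
        no_paragraphs=True,
        no_figures=True,
        no_rects=True,
    )
    return objects.flattened.tables


print(expand_page_range("3-5,7"))
# tables = load_tables_only("report.pdf", page_range="3-5,7")
```

Excluding element types that are not needed keeps the returned structure small; the flags used here are the ones listed in the parameter list above.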

Returned objects

class libpdf.apiobjects.ApiObjects(root, chapters, paragraphs, tables, figures, rects, pdfplumber, pdfminer)

Bases: object

Data class that stores instances for all extracted PDF objects.

Variables:

root (Root) – Main entry point to structured data as per the UML PDF model.
flattened (Flattened) – named tuple holding flattened versions of all nested objects in root.content; the element types chapters, paragraphs, tables, figures and rects can be accessed directly (API convenience)
pdfplumber (PDF) – pdfplumber PDF object for further processing by API users
pdfminer (PDFDocument) – pdfminer PDF object for further processing by API users, also available in pdfplumber.doc

Parameters:

root (Root) –
chapters (List[Chapter]) –
paragraphs (List[Paragraph]) –
tables (List[Table]) –
figures (List[Figure]) –
rects (List[Rect]) –
pdfplumber (PDF) –
pdfminer (PDFDocument) –

class libpdf.apiobjects.Flattened(chapters, paragraphs, tables, figures, rects)

Bases: NamedTuple

NamedTuple to hold flattened Element instances featuring also type hinting.

Parameters:

chapters (List[Chapter]) –
paragraphs (List[Paragraph]) –
tables (List[Table]) –
figures (List[Figure]) –
rects (List[Rect]) –

chapters: List[Chapter]: Alias for field number 0

paragraphs: List[Paragraph]: Alias for field number 1

tables: List[Table]: Alias for field number 2

figures: List[Figure]: Alias for field number 3

rects: List[Rect]: Alias for field number 4
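The field aliases above are standard NamedTuple behaviour. The following sketch uses a stand-in class (FlattenedSketch is hypothetical, with plain strings in place of model instances) to show that access by name and by position yield the same list.

```python
from typing import List, NamedTuple


class FlattenedSketch(NamedTuple):
    """Stand-in for Flattened; plain strings replace the model instances."""

    chapters: List[str]
    paragraphs: List[str]
    tables: List[str]
    figures: List[str]
    rects: List[str]


flat = FlattenedSketch(
    chapters=["chapter.1"],
    paragraphs=["paragraph.1", "paragraph.2"],
    tables=[],
    figures=["figure.1"],
    rects=[],
)

# Access by name and by position refer to the same list.
assert flat.chapters is flat[0]
assert flat.figures is flat[3]

# A named tuple unpacks like a regular tuple.
chapters, paragraphs, tables, figures, rects = flat
print(len(paragraphs))
```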

Model classes

The ApiObjects instance returned by libpdf.load() contains instances of the following classes.

Root

class libpdf.models.root.Root(file, pages, content)

Bases: ModelBase

Main entry point to the UML PDF model.

Variables:

file (File) – a File instance
pages (List[Page]) – PDF pages
content (List[Union[Chapter, Paragraph, Table, Figure]]) – PDF contents/payload, given as list containing instances of type Element

Parameters:

file (File) –
pages (List[Page]) –
content (List[Chapter | Paragraph | Table | Figure]) –
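Because chapters nest other elements in their own content list, visiting every element requires a recursive walk. The sketch below uses a hypothetical Node stand-in rather than the real model classes and assumes that only chapters carry a non-empty content list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for an Element; only chapters carry content."""

    id_: str
    content: List["Node"] = field(default_factory=list)


def walk(content):
    """Yield every element depth-first, in document order."""
    for element in content:
        yield element
        yield from walk(element.content)


root_content = [
    Node("chapter.1", [Node("paragraph.1"), Node("chapter.1.1", [Node("table.1")])]),
    Node("chapter.2", [Node("figure.1")]),
]

ids = [element.id_ for element in walk(root_content)]
print(ids)
```

For the common case of collecting all elements of one type, the flattened attribute of ApiObjects already provides ready-made lists.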

File

class libpdf.models.file.File(name, path, page_count, crop_top=0, crop_bottom=0, crop_left=0, crop_right=0, file_meta=None, root=None)

Bases: ModelBase

PDF file data.

There is a file-wide crop feature that removes static parts from each page:

*-page-------------------------------------*
|                    ^                     |
|                crop_top                  |
|                    v                     |
|               +-content-+                |
|<--crop_left-->|         |<--crop_right-->|
|               |         |                |
|               |         |                |
|               +---------+                |
|                   ^                      |
|              crop_bottom                 |
|                   v                      |
*------------------------------------------*

It can be used to ignore headers, footers or sidebars. The user-defined parameters are exposed to both the CLI and API.
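The arithmetic behind the diagram can be sketched as follows. The function content_box is hypothetical and assumes a bottom-left page origin, as used by the Position class in this reference; it is not how libpdf applies the crop internally.

```python
def content_box(page_width, page_height, crop_top=0, crop_bottom=0,
                crop_left=0, crop_right=0):
    """Return [x0, y0, x1, y1] of the area left after cropping.

    Assumes a bottom-left origin and all distances given in points.
    """
    x0 = crop_left
    y0 = crop_bottom
    x1 = page_width - crop_right
    y1 = page_height - crop_top
    if x0 >= x1 or y0 >= y1:
        raise ValueError("crop margins leave no content area")
    return [x0, y0, x1, y1]


# A4 page (595 x 842 points) with a 50 pt header and a 40 pt footer removed.
print(content_box(595, 842, crop_top=50, crop_bottom=40))
```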

Variables:

name (str) – PDF file name
path (str) – PDF file path
page_count (int) – number of pages in PDF
crop_top (float) – distance in points from top of each page to ignore for extraction
crop_bottom (float) – distance in points from bottom of each page to ignore for extraction
crop_left (float) – distance in points from left side of each page to ignore for extraction
crop_right (float) – distance in points from right side of each page to ignore for extraction
file_meta (FileMeta) – reference to FileMeta instance
b_root (Root) – back reference to Root instance

Parameters:

name (str) –
path (str) –
page_count (int) –
crop_top (float) –
crop_bottom (float) –
crop_left (float) –
crop_right (float) –
file_meta (FileMeta) –
root (Root) –

property id_

Return the identifier to address the file.

The parameter can later be used during libpdf postprocessing to link to elements in other files.

According to the PDF model the parameter should be called id, but the name is reserved in Python, so id_ is used. The file identifier is built from the file name including its extension. All characters outside the Python identifier character set (regex character class [_a-zA-Z0-9]) are removed.
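The character filtering can be illustrated with a short sketch. The function name sanitize_file_name and the sample file name are hypothetical, and the sketch covers only the filtering step; anything else libpdf may do when building the identifier is not shown.

```python
import re


def sanitize_file_name(name):
    """Drop every character outside the class [_a-zA-Z0-9]."""
    return re.sub(r"[^_a-zA-Z0-9]", "", name)


print(sanitize_file_name("user-manual v2.pdf"))
```

Note that the dot before the extension is removed as well, since it is not part of the allowed character set.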

FileMeta

class libpdf.models.file_meta.FileMeta(author=None, title=None, subject=None, creator=None, producer=None, keywords=None, creation_date=None, modified_date=None, trapped=None, file=None)

Bases: ModelBase

PDF file meta data.

Variables:

author (str) – PDF author meta data field
title (str) – PDF title meta data field
subject (str) – PDF subject meta data field
creator (str) – PDF creator meta data field
producer (str) – PDF producer meta data field
keywords (str) – PDF keywords meta data field
creation_date (datetime) – PDF creation date given as datetime instance
modified_date (datetime) – PDF modified date given as datetime instance
trapped (bool) – PDF printing trap flag (https://en.wikipedia.org/wiki/Trap_%28printing%29)
b_file (File) – back reference to a File instance

Parameters:

author (str) –
title (str) –
subject (str) –
creator (str) –
producer (str) –
keywords (str) –
creation_date (datetime) –
modified_date (datetime) –
trapped (bool) –
file (File) –

Page

class libpdf.models.page.Page(number, width, height, content=None, root=None, positions=None)

Bases: ModelBase

PDF page data.

Variables:

number (int) – PDF page number, 1-based
width (float) – page width in points
height (float) – page height in points
content (List[Union[Chapter, Paragraph, Table, Figure]]) – ordered list of elements on the page; chapters might still be nested if the page contains sub-chapters
root (Root) – back reference to a Root instance
b_positions (List[Position]) – back reference to all Position instances on the page

Parameters:

content (List[Chapter | Paragraph | Table | Figure]) –
root (Root) –
positions (List[Position]) –

property id_

Return the identifier to address the Page.

The identifier follows the pattern page.<number>. It is used as a link target if a PDF link annotation points to a blank space position, i.e. there is no Chapter, Paragraph, Table, Figure at the target location.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Element

class libpdf.models.element.Element(position, root=None, chapter=None)

Bases: ModelBase, ABC

Base class for Chapter, Paragraph, Table and Figure.

Variables:

position (Position) – a Position instance determining the location of the Element
b_root (Root) – Root instance (mutually exclusive with the b_chapter parameter)
b_chapter (Chapter) – parent Chapter instance (mutually exclusive with the b_root parameter)

Parameters:

position (Position) –
root (Root) –
chapter (Chapter) –

abstract property id_: Return the identifier to address the Element.

property uid

Return the unique identifier to address the full path to the Element.

The identifier follows the pattern element.<number>/element.<number>.

For example, the uid for a paragraph in chapter.2.1.4 is ‘chapter.2/chapter.2.1/chapter.2.1.4/paragraph.6’.

Type: str
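One way to picture how such a path comes about is to follow the parent chapter references up to the top level and join the identifiers. The Node class below is a hypothetical stand-in, not the libpdf implementation.

```python
class Node:
    """Hypothetical stand-in for an Element with an optional parent chapter."""

    def __init__(self, id_, chapter=None):
        self.id_ = id_
        self.b_chapter = chapter  # None: the element sits directly below root

    @property
    def uid(self):
        parts = []
        node = self
        # Collect identifiers from the element up to the top-level chapter.
        while node is not None:
            parts.append(node.id_)
            node = node.b_chapter
        return "/".join(reversed(parts))


c2 = Node("chapter.2")
c21 = Node("chapter.2.1", c2)
c214 = Node("chapter.2.1.4", c21)
paragraph = Node("paragraph.6", c214)
print(paragraph.uid)
```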

Chapter

class libpdf.models.chapter.Chapter(title, number, position, content=None, chapter=None, textbox=None)

Bases: Element

PDF chapter (extracted from PDF outline).

The Chapter elements define the structure of the PDF. If an outline is given, Chapters are extracted from it and all elements (sub-chapters, tables, figures, paragraphs) are put below the Chapter in the ordered content list.

Variables:

title (str) – the title of the chapter, as extracted from outline
number (str) – the chapter number as string (e.g. ‘3.2.4’)
textbox (HorizontalBox) – the textbox of the chapter, as extracted from pdfminer
position (Position) – a Position instance determining the location of the Chapter; a Chapter commonly spans across several pages, however only one Position is aggregated because the end of the Chapter can be determined by looking at the next Chapter
content (List[Union[Chapter, Paragraph, Table, Figure]]) – the content of the chapter (other sub-chapters, paragraphs, tables, figures)

Parameters:

title (str) –
number (str) –
position (Position) –
content (List[Chapter | Paragraph | Table | Figure]) –
chapter (Chapter) –
textbox (HorizontalBox) –

property id_

Return the identifier to address the Chapter.

The identifier follows the pattern chapter.<number>.

It is used as a link target if a PDF link-annotation points to the Element.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Type: str

Paragraph

class libpdf.models.paragraph.Paragraph(idx, position, links, textbox=None, root=None, chapter=None)

Bases: Element

PDF paragraph (normal text).

A paragraph always ends at the end of a page.

Variables:

idx (int) – the number of the instance in the current scope, 1-based
position (Position) – the position of the paragraph
links (List[Link]) – list of links in the paragraph text
textbox (HorizontalBox) – the textbox of the paragraph, as extracted from pdfminer

Parameters:

idx (int) –
position (Position) –
links (List[Link]) –
textbox (HorizontalBox) –
root (Root) –
chapter (Chapter) –

property id_

Return the identifier to address the Paragraph.

The identifier follows the pattern paragraph.<idx>. idx is the 1-based number of the Paragraph in the current scope (root, chapter, sub-chapters, page).

It is used as a link target if a PDF link-annotation points to the Element.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Type: str

set_links_backref(): Set b_source back reference on all links.

Table

class libpdf.models.table.Table(idx, cells, position, caption=None)

Bases: Element

PDF table data.

Variables:

idx (int) – the number of the instance in the current scope, 1-based
cells (List[Cell]) – a list of Cell instances that are part of the table
caption (str) – the caption of the table (text over/under the table describing it)
position (Position) – a Position instance determining the location of the table

Parameters:

idx (int) –
cells (List[Cell]) –
position (Position) –

property id_

Return the identifier to address the Table.

The identifier follows the pattern table.<idx>. idx is the 1-based number of the Table in the current scope (root, chapter, sub-chapters, page).

It is used as a link target if a PDF link-annotation points to the Element.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Type: str

property rows

Return the table as a list of rows, where each row is a list of its cells.

Type: List[List[Cell]]

property columns

Return the table as a list of columns, where each column is a list of its cells.

Type: List[List[Cell]]

property rows_count

Return the number of rows in the table.

Type: int

property columns_count

Return the number of columns in the table.

Type: int
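The relationship between the flat cells list and the rows and columns views can be sketched with a stand-in Cell type. The named tuple and the helper group_cells are hypothetical and only mimic the 1-based row and col attributes described in the Cell class.

```python
from collections import namedtuple

# Hypothetical stand-in for Cell: 1-based row and column plus the cell text.
Cell = namedtuple("Cell", ["row", "col", "text"])

cells = [
    Cell(1, 1, "Name"), Cell(1, 2, "Unit"),
    Cell(2, 1, "width"), Cell(2, 2, "pt"),
    Cell(3, 1, "height"), Cell(3, 2, "pt"),
]


def group_cells(cells, outer, inner):
    """Group cells by the `outer` attribute and sort each group by `inner`."""
    groups = {}
    for cell in cells:
        groups.setdefault(getattr(cell, outer), []).append(cell)
    return [
        sorted(groups[key], key=lambda c: getattr(c, inner))
        for key in sorted(groups)
    ]


rows = group_cells(cells, "row", "col")
columns = group_cells(cells, "col", "row")
print(len(rows), len(columns))
```

Here len(rows) and len(columns) correspond to what rows_count and columns_count report for a real Table.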

Cell

class libpdf.models.table.Cell(row, col, position, links, table=None, textbox=None)

Bases: ModelBase

PDF table cell data.

Variables:

row (int) – the row number of the cell, 1-based
col (int) – the column number of the cell, 1-based
textbox (HorizontalBox) – the textbox of the cell, as extracted from pdfminer and converted to HorizontalBox
position (Position) – a Position instance determining the location of the cell
b_table (Table) – a Table instance that contains the cell
links (List[Link]) – list of links in the cell text

Parameters:

row (int) –
col (int) –
position (Position) –
links (List[Link]) –
table (Table) –
textbox (HorizontalBox) –

set_links_backref(): Set b_source back reference on all links.

Figure

class libpdf.models.figure.Figure(idx, rel_path, position, links, textboxes, text=None, caption=None)

Bases: Element

PDF figure.

A figure can be a bitmap image or vector graphics mixed with overlaying text. libpdf extracts figures into an external file where rel_path defines the path to the external file. The text property contains text extracted from the figure area. This can be highly unstructured because libpdf does not analyze the text layout within figures as there is no common denominator for an algorithm. libpdf will however do the same character grouping analysis as for paragraphs, so the user can assume text flow is from top left to bottom right.

Variables:

idx (int) – the number of the instance in the current scope, 1-based
rel_path (str) – the path to the external file containing the figure
textboxes (List[HorizontalBox]) – the textboxes of the figure, as extracted from pdfminer
caption (str) – the caption of the figure (text over/under the figure describing it)
position (Position) – a Position instance determining the location of the figure
links (List[Link]) – list of text links in the figure area

Parameters:

idx (int) –
rel_path (str) –
position (Position) –
links (List[Link]) –
textboxes (List[HorizontalBox]) –
text (str) –
caption (str) –

property id_

Return the identifier to address the Figure.

The identifier follows the pattern figure.<idx>. idx is the 1-based number of the Figure in the current scope (root, chapter, sub-chapters, page).

It is used as a link target if a PDF link-annotation points to the Element.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Type: str

set_links_backref(): Set b_source back reference on all links.

Rect

class libpdf.models.rect.Rect(idx, position, textbox, non_stroking_color=None)

Bases: Element

Rectangles in a PDF.

The rectangles are extracted from pdfplumber. The text covered by the rectangle is extracted and stored in a newly instantiated textbox.

Parameters:

idx (int) –
position (Position) –
textbox (HorizontalBox) –
non_stroking_color (tuple) –

property id_: str

Return the identifier to address the Rect.

The identifier follows the pattern rect.<idx>. idx is the 1-based number of the Rect in the current scope (root, chapter, sub-chapters, page).

It is used as a link target if a PDF link-annotation points to the Element.

According to PDF model the parameter should be called id but the name is reserved in Python, so id_ is used.

Type: str

Position

class libpdf.models.position.Position(x0, y0, x1, y1, page, element=None, cell=None)

Bases: object

Define the coordinates of an Element or Cell.

A position is either linked by an Element or by a Cell (mutually exclusive). A position keeps a reference to the Page it is located on.

Here is some ASCII art to explain the libpdf coordinates:

*-page------------------------*
|                             |
|                             |
|                             |
|        +-bbox----+          |
|<--x0-->|         |    ^     |
|        |         |    |     |
|<-------|---x1--->|    |     |
|        +---------+    |     |
|            ^          |     |
|           y0         y1     |
|            v          v     |
*-----------------------------*

The bbox definition [x0, y0, x1, y1] is in sync with pdfminer and the PDF standard. Coordinate type is float for both libpdf and pdfminer.

Note

pdfplumber has a different definition of bounding boxes:

*-page------------------------*
|                ^      ^     |
|               top     |     |
|                v      |     |
|        +-bbox----+    |     |
|<--x0-->|         |    |     |
|        |         |  bottom  |
|<-------|---x1--->|    |     |
|        +---------+    v     |
|                             |
|                             |
|                             |
*-----------------------------*

The pdfplumber bounding box is [x0, top, x1, bottom]. Coordinate type is Decimal.

To deal with the coordinate and type differences there are conversion functions to_pdfplumber_bbox and from_pdfplumber_bbox in module libpdf.utils.
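The coordinate flip between the two conventions follows directly from the diagrams: top = page_height - y1 and bottom = page_height - y0. The sketch below uses hypothetical function names and an explicit page_height argument; the actual signatures of to_pdfplumber_bbox and from_pdfplumber_bbox are not shown in this reference and may differ.

```python
from decimal import Decimal


def to_plumber_style(bbox, page_height):
    """Convert [x0, y0, x1, y1] (bottom-left origin) to [x0, top, x1, bottom]."""
    x0, y0, x1, y1 = bbox
    height = Decimal(str(page_height))
    return [
        Decimal(str(x0)),
        height - Decimal(str(y1)),  # top: distance from page top to upper edge
        Decimal(str(x1)),
        height - Decimal(str(y0)),  # bottom: distance from page top to lower edge
    ]


def from_plumber_style(bbox, page_height):
    """Convert [x0, top, x1, bottom] back to float [x0, y0, x1, y1]."""
    x0, top, x1, bottom = (float(value) for value in bbox)
    return [x0, page_height - bottom, x1, page_height - top]


box = [100.0, 200.0, 300.0, 250.0]
plumber_box = to_plumber_style(box, page_height=842.0)
print(plumber_box)
print(from_plumber_style(plumber_box, page_height=842.0))
```

The x coordinates are unchanged because both conventions measure them from the left page edge; only the vertical axis is mirrored.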

Variables:

x0 (float) – distance from the left of the page to the left edge of the box
y0 (float) – distance from the bottom of the page to the lower edge of the box (less than y1)
x1 (float) – distance from the left of the page to the right edge of the box
y1 (float) – distance from the bottom of the page to the upper edge of the box (greater than y0)
page (Page) – reference to a Page object
element (Element) – element that refers to the position (mutually exclusive with cell)
cell (Cell) – cell that refers to the position (mutually exclusive with element)

Parameters:

x0 (float) –
y0 (float) –
x1 (float) –
y1 (float) –
page (Page) –
element (Element) –
cell (Cell) –

Link

class libpdf.models.link.Link(idx_start, idx_stop, pos_target, libpdf_target=None, b_source=None)

Bases: ModelBase

PDF link embedded in the text.

Variables:

idx_start (int) – the 0-based index of the start char. This char is included in the link text.
idx_stop (int) – the 0-based index of the stop char. This char is excluded in the link text, so the start/stop indexes are compatible with the Python string slicing notation.
pos_target (Dict[str, float]) – the position the link points to, e.g. {'page': 4, 'x': 56, 'y': 789}
libpdf_target (str) – points either to a libpdf Element or to a Page. The libpdf Element link is built by concatenating nested elements, separated by ‘/’, e.g. chapter.3/chapter.3.2/table.2. In case pos_target cannot be resolved to a libpdf Element, the target is set to the target coordinates given as page.<id>/<X>:<Y>, e.g. page.4/56:789. In this case libpdf_target is identical to pos_target.
b_source – back reference to the link source, can be Paragraph, Figure or Cell

Parameters:

idx_start (int) –
idx_stop (int) –
pos_target (Dict) –
libpdf_target (str) –
b_source (Paragraph | Figure | Cell) –

property source_chars

Show the text between the start and stop indices.

The main use case is debugging.
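The index convention and the fallback target format can be illustrated with plain Python. The sample text, the indices and the helper fallback_target are hypothetical; how libpdf formats non-integer coordinates is not specified here.

```python
text = "See Table 2 for the full parameter list."

# Hypothetical link covering the words "Table 2".
idx_start = 4   # included in the link text
idx_stop = 11   # excluded, as in Python slicing
link_text = text[idx_start:idx_stop]
print(link_text)


def fallback_target(pos_target):
    """Format a position as page.<id>/<X>:<Y>."""
    return f"page.{pos_target['page']}/{pos_target['x']}:{pos_target['y']}"


print(fallback_target({"page": 4, "x": 56, "y": 789}))
```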

Char

class libpdf.models.horizontal_box.Char(text, x0=None, y0=None, x1=None, y1=None, ncolor=None, fontname=None)

Bases: object

Define the character class.

Variables:

text (str) – the plain text of the character
x0 (float) – distance from the left of the page to the left edge of the character
y0 (float) – distance from the bottom of the page to the lower edge of the character (less than y1)
x1 (float) – distance from the left of the page to the right edge of the character
y1 (float) – distance from the bottom of the page to the upper edge of the character (greater than y0)
ncolor (Tuple[float, float, float]) – non-stroking-color as rgb value

Parameters:

text (str) –
x0 (float | None) –
y0 (float | None) –
x1 (float | None) –
y1 (float | None) –
ncolor (tuple | None) –
fontname (str | None) –

Word

class libpdf.models.horizontal_box.Word(chars, x0=None, y0=None, x1=None, y1=None)

Bases: object

Define the word class.

A word shall contain several characters.

Variables:

chars (List[Char]) – a list of the characters

Parameters:

chars (list[Char]) –
x0 (float | None) –
y0 (float | None) –
x1 (float | None) –
y1 (float | None) –

property text: str: Return plain text.

HorizontalLine

class libpdf.models.horizontal_box.HorizontalLine(words, x0=None, y0=None, x1=None, y1=None)

Bases: object

Define the horizontal line class.

A horizontal line shall contain a word or several words.

Variables:

words (List[Word]) – a list of the words

Parameters:

words (list[Word]) –
x0 (float | None) –
y0 (float | None) –
x1 (float | None) –
y1 (float | None) –

property text: str: Return plain text.

HorizontalBox

class libpdf.models.horizontal_box.HorizontalBox(lines, x0=None, y0=None, x1=None, y1=None)

Bases: object

Define the horizontal box class.

A horizontal box shall contain one or more horizontal lines.

Variables:

lines (List[HorizontalLine]) – a list of HorizontalLine instances

Parameters:

lines (list[HorizontalLine]) –
x0 (float | None) –
y0 (float | None) –
x1 (float | None) –
y1 (float | None) –

property text: str: Return plain text.

property words: list[str]: Return list of words.
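The containment hierarchy Char, Word, HorizontalLine, HorizontalBox can be sketched with simplified stand-in classes. The separators used when joining text (a space between words, a newline between lines) are assumptions made for this illustration, and the coordinate attributes are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    text: str


@dataclass
class Word:
    chars: List[Char]

    @property
    def text(self):
        return "".join(char.text for char in self.chars)


@dataclass
class Line:
    words: List[Word]

    @property
    def text(self):
        # Assumption: words are separated by a single space.
        return " ".join(word.text for word in self.words)


@dataclass
class Box:
    lines: List[Line]

    @property
    def text(self):
        # Assumption: lines are separated by a newline.
        return "\n".join(line.text for line in self.lines)

    @property
    def words(self):
        return [word.text for line in self.lines for word in line.words]


def make_word(string):
    """Build a Word from a plain string, one Char per character."""
    return Word([Char(c) for c in string])


box = Box([Line([make_word("PDF"), make_word("file")]), Line([make_word("data")])])
print(repr(box.text))
print(box.words)
```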