API
Loading a PDF
- libpdf.load(pdf, verbose=1, page_range=None, page_crop=None, smart_page_crop=False, save_figures=False, figure_dir='figures', no_annotations=False, no_chapters=False, no_paragraphs=False, no_tables=False, no_figures=False, no_rects=False, init_logging=True, visual_debug=False, visual_debug_output_dir='visual_debug_libpdf', visual_split_elements=False, visual_debug_include_elements=None, visual_debug_exclude_elements=None)
Entry point for the usage of libpdf as a library.
The function is actually called main_api() to better correspond to main_cli() and main(). It is, however, exposed to the API as libpdf.load(), which is considered more expressive for API users.
- Parameters:
pdf (str) – path to the PDF to read
verbose (int) – verbosity level as integer (0 = errors, fatal, critical; 1 = warnings; 2 = info; 3 = debug)
page_range (str) – range of pages to extract as string without spaces (e.g. 3-5 or 3,4,7 or 3-5,7)
page_crop (Tuple[float, float, float, float]) – see description in function core.main()
smart_page_crop (bool) – see description in function core.main()
save_figures (bool) – flag triggering the export of figures to the figure_dir
figure_dir (str) – output directory for extracted figures; if it does not exist, it will be created
no_annotations (bool) – flag triggering the exclusion of annotations from pdf catalog
no_chapters (bool) – flag triggering the exclusion of chapters (resulting in a flat list of elements)
no_paragraphs (bool) – flag triggering the exclusion of paragraphs (no normal text content)
no_tables (bool) – flag triggering the exclusion of tables
no_figures (bool) – flag triggering the exclusion of figures
no_rects (bool) – flag triggering the exclusion of rects
init_logging (bool) – flag indicating whether libpdf shall instantiate a root log handler that is capable of handling both log messages and progress bars; it does so by passing all log messages to tqdm.write()
visual_debug (bool) – flag triggering visual debug feature
visual_debug_output_dir (str) – output directory for visualized pdf pages
visual_split_elements (bool) – flag triggering the output of visualized elements into separate folders
visual_debug_include_elements (List[str]) – a list of elements that shall be included when visual debugging
visual_debug_exclude_elements (List[str]) – a list of elements that shall be excluded when visual debugging
- Returns:
instance of the ApiObjects class
- Return type:
ApiObjects
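A minimal usage sketch, built only from the signature and the flattened attribute documented in this section. The helper name summarize_pdf is hypothetical, and the sketch assumes that the flattened element collections support len().

```python
def summarize_pdf(pdf_path, pages=None):
    """Load a PDF with libpdf and count the extracted elements per type."""
    # Imported lazily so that this sketch can be read without libpdf installed.
    import libpdf

    objects = libpdf.load(
        pdf_path,
        verbose=0,           # errors only
        page_range=pages,    # e.g. "3-5,7"; None extracts all pages
        no_figures=True,     # skip figure extraction
        init_logging=False,  # leave logging configuration to the caller
    )
    # Assumption: each flattened collection is a sequence of elements.
    flat = objects.flattened
    return {
        "chapters": len(flat.chapters),
        "paragraphs": len(flat.paragraphs),
        "tables": len(flat.tables),
    }
```

Calling summarize_pdf("manual.pdf", pages="3-5,7") would then return a dictionary with one count per element type for the selected pages.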
Returned objects
- class libpdf.apiobjects.ApiObjects(root, chapters, paragraphs, tables, figures, rects, pdfplumber, pdfminer)
Bases:
object
Data class that stores instances for all extracted PDF objects.
- Variables:
root (Root) – Main entry point to structured data as per the UML PDF model.
flattened (Flattened) – named tuple holding flattened versions of all nested objects in root.contents.*; the element types chapters/paragraphs/tables/figures can be directly accessed (API convenience)
pdfplumber (PDF) – pdfplumber PDF object for further processing by API users
pdfminer (PDFDocument) – pdfminer PDF object for further processing by API users, also available in pdfplumber.doc
- class libpdf.apiobjects.Flattened(chapters, paragraphs, tables, figures, rects)
Bases:
NamedTuple
NamedTuple that holds flattened Element instances and also provides type hinting.
Model classes
The ApiObjects object returned by libpdf.load() contains instances of the following classes.
Root
- class libpdf.models.root.Root(file, pages, content)
Bases:
ModelBase
Main entry point to the UML PDF model.
File
- class libpdf.models.file.File(name, path, page_count, crop_top=0, crop_bottom=0, crop_left=0, crop_right=0, file_meta=None, root=None)
Bases:
ModelBase
PDF file data.
There is a file wide crop feature that removes static parts from each page:
*-page-------------------------------------*
|                    ^                     |
|                 crop_top                 |
|                    v                     |
|               +-content-+                |
|<--crop_left-->|         |<--crop_right-->|
|               |         |                |
|               |         |                |
|               +---------+                |
|                    ^                     |
|               crop_bottom                |
|                    v                     |
*------------------------------------------*
It can be used to ignore headers, footers or sidebars. The user-defined parameters are exposed to both the CLI and API.
- Variables:
name (str) – PDF file name
path (str) – PDF file path
page_count (int) – number of pages in PDF
crop_top (float) – distance in points from top of each page to ignore for extraction
crop_bottom (float) – distance in points from bottom of each page to ignore for extraction
crop_left (float) – distance in points from left side of each page to ignore for extraction
crop_right (float) – distance in points from right side of each page to ignore for extraction
file_meta (FileMeta) – reference to FileMeta instance
b_root (Root) – back reference to Root instance
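The crop margins above can be turned into the remaining content area with simple arithmetic. The following is a sketch only: content_bbox is a hypothetical helper, not a libpdf function, and it assumes the bottom-left origin used by the libpdf Position class.

```python
def content_bbox(width, height, crop_top=0, crop_bottom=0, crop_left=0, crop_right=0):
    """Return [x0, y0, x1, y1] of the area left after cropping (origin bottom-left)."""
    x0 = crop_left
    y0 = crop_bottom
    x1 = width - crop_right
    y1 = height - crop_top
    if x0 >= x1 or y0 >= y1:
        raise ValueError("crop margins leave no content area")
    return [x0, y0, x1, y1]


# Roughly A4-sized page in points, with a 50 pt header and a 40 pt footer removed.
print(content_bbox(595, 842, crop_top=50, crop_bottom=40))  # → [0, 40, 595, 792]
```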
- property id_
Return the identifier to address the file.
The parameter can later be used during libpdf postprocessing to link to elements in other files.
According to the PDF model, the parameter should be called id, but that name is already used by a Python built-in, so id_ is used. The file identifier is built from the file name including the extension. All characters that do not belong to the Python identifier character set (regex character set [_a-zA-Z0-9]) are removed.
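The character filtering described above can be sketched with a single regular expression. The function name sanitize_identifier is hypothetical; whether libpdf adds a prefix or applies further rules is not stated here, so this shows only the removal step.

```python
import re


def sanitize_identifier(file_name):
    """Drop every character outside the set [_a-zA-Z0-9]."""
    return re.sub(r"[^_a-zA-Z0-9]", "", file_name)


print(sanitize_identifier("My Report-v2.pdf"))  # → MyReportv2pdf
```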
FileMeta
- class libpdf.models.file_meta.FileMeta(author=None, title=None, subject=None, creator=None, producer=None, keywords=None, creation_date=None, modified_date=None, trapped=None, file=None)
Bases:
ModelBase
PDF file meta data.
- Variables:
author (str) – PDF author meta data field
title (str) – PDF title meta data field
subject (str) – PDF subject meta data field
creator (str) – PDF creator meta data field
producer (str) – PDF producer meta data field
keywords (str) – PDF keywords meta data field
creation_date (datetime) – PDF creation date given as datetime instance
modified_date (datetime) – PDF modified date given as datetime instance
trapped (bool) – PDF printing trap flag (https://en.wikipedia.org/wiki/Trap_%28printing%29)
b_file (File) – back reference to a File instance
- Parameters:
author (str) –
title (str) –
subject (str) –
creator (str) –
producer (str) –
keywords (str) –
creation_date (datetime) –
modified_date (datetime) –
trapped (bool) –
file (File) –
Page
- class libpdf.models.page.Page(number, width, height, content=None, root=None, positions=None)
Bases:
ModelBase
PDF page data.
- Variables:
number (int) – PDF page number, 1-based
width (float) – page width in points
height (float) – page height in points
content (List[Union[Chapter, Paragraph, Table, Figure]]) – ordered list of elements on the page; chapters might still be nested if the page contains sub-chapters
root (Root) – back reference to a Root instance
b_positions (List[Position]) – back reference to all Position instances on the page
- property id_
Return the identifier to address the Page.
The identifier follows the pattern page.<number>. It is used as a link target if a PDF link annotation points to a blank space position, i.e. there is no Chapter, Paragraph, Table or Figure at the target location.
According to the PDF model, the parameter should be called id, but that name is already used by a Python built-in, so id_ is used.
Element
- class libpdf.models.element.Element(position, root=None, chapter=None)
Bases:
ModelBase, ABC
Base class for Chapter, Paragraph, Table and Figure.
- abstract property id_
Return the identifier to address the Element.
- property uid
Return the unique identifier to address the full path to the Element.
The identifier follows the pattern element.<number>/element.<number>.
For example, the uid for a paragraph in chapter.2.1.4 is ‘chapter.2/chapter.2.1/chapter.2.1.4/paragraph.6’.
- Type:
str
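One way to picture how such a uid comes together is to walk from an element up through its parent chapters and join the identifiers with ‘/’. The Node class below is a hypothetical stand-in for illustration, not the libpdf Element implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Stand-in for an element that knows its id_ and its parent chapter."""

    id_: str
    parent: Optional["Node"] = None

    @property
    def uid(self):
        # Collect identifiers from the element up to the top-level chapter.
        parts = []
        node = self
        while node is not None:
            parts.append(node.id_)
            node = node.parent
        # The path is written from the outermost chapter down to the element.
        return "/".join(reversed(parts))


ch2 = Node("chapter.2")
ch21 = Node("chapter.2.1", ch2)
ch214 = Node("chapter.2.1.4", ch21)
par = Node("paragraph.6", ch214)
print(par.uid)  # → chapter.2/chapter.2.1/chapter.2.1.4/paragraph.6
```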
Chapter
- class libpdf.models.chapter.Chapter(title, number, position, content=None, chapter=None, textbox=None)
Bases:
Element
PDF chapter (extracted from PDF outline).
The Chapter element defines the structure of the PDF. If an outline is given, Chapters are extracted from it and all elements (sub-chapters, tables, figures, paragraphs) are put below the Chapter in the ordered content list.
- Variables:
title (str) – the title of the chapter, as extracted from outline
number (str) – the chapter number as string (e.g. ‘3.2.4’)
textbox (HorizontalBox) – the textbox of the chapter, as extracted from pdfminer
position (Position) – a Position instance determining the location of the Chapter; a Chapter commonly spans across several pages, however only one Position is aggregated because the end of the Chapter can be determined by looking at the next Chapter
content (List[Union[Chapter, Paragraph, Table, Figure]]) – the content of the chapter (other sub-chapters, paragraphs, tables, figures)
- property id_
Return the identifier to address the Chapter.
The identifier follows the pattern chapter.<number>. It is used as a link target if a PDF link annotation points to the Element.
According to the PDF model, the parameter should be called id, but that name is already used by a Python built-in, so id_ is used.
- Type:
str
Paragraph
- class libpdf.models.paragraph.Paragraph(idx, position, links, textbox=None, root=None, chapter=None)
Bases:
Element
PDF paragraph (normal text).
A paragraph always ends at the end of a page.
- Variables:
idx (int) – the number of the instance in the current scope, 1-based
position (Position) – the position of the paragraph
links (List[Link]) – list of links in the paragraph text
textbox (HorizontalBox) – the textbox of the paragraph, as extracted from pdfminer
- Parameters:
idx (int) –
position (Position) –
links (List[Link]) –
textbox (HorizontalBox) –
root (Root) –
chapter (Chapter) –
- property id_
Return the identifier to address the Paragraph.
The identifier follows the pattern paragraph.<idx>, where idx is the 1-based number of the Paragraph in the current scope (root, chapter, sub-chapters, page). It is used as a link target if a PDF link annotation points to the Element.
According to the PDF model, the parameter should be called id, but that name is already used by a Python built-in, so id_ is used.
- Type:
str
- set_links_backref()
Set b_source back reference on all links.
Table
- class libpdf.models.table.Table(idx, cells, position, caption=None)
Bases:
Element
PDF table data.
- Variables:
idx (int) – the number of the instance in the current scope, 1-based
cells (List[Cell]) – a list of Cell instances that are part of the table
caption (str) – the caption of the table (text over/under the table describing it)
position (Position) – a Position instance determining the location of the table
- property id_
Return the identifier to address the Table.
The identifier follows the pattern table.<idx>, where idx is the 1-based number of the Table in the current scope (root, chapter, sub-chapters, page). It is used as a link target if a PDF link annotation points to the Element.
According to the PDF model, the parameter should be called id, but that name is already used by a Python built-in, so id_ is used.
- Type:
str
- property rows
Return a list of rows in the table where each contains a list of columns.
- Type:
List[List[Cell]]
- property columns
Return a list of columns in the table where each contains a list of rows.
- Type:
List[List[Cell]]
- property rows_count
Return the number of rows in the table.
- Type:
int
- property columns_count
Return the number of columns in the table.
- Type:
int
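The rows and columns properties return nested lists of cells. A rough sketch of how a flat cell list could be grouped that way, using a simplified stand-in Cell tuple (the text field and both helper functions are assumptions for illustration, not libpdf code):

```python
from collections import namedtuple

# Simplified stand-in for a table cell with 1-based row/col numbers.
Cell = namedtuple("Cell", ["row", "col", "text"])


def group_rows(cells):
    """Group cells by row; each row is ordered by column number."""
    rows = {}
    for cell in sorted(cells, key=lambda c: (c.row, c.col)):
        rows.setdefault(cell.row, []).append(cell)
    return [rows[r] for r in sorted(rows)]


def group_columns(cells):
    """Group cells by column; each column is ordered by row number."""
    cols = {}
    for cell in sorted(cells, key=lambda c: (c.col, c.row)):
        cols.setdefault(cell.col, []).append(cell)
    return [cols[c] for c in sorted(cols)]


cells = [Cell(2, 1, "c"), Cell(1, 2, "b"), Cell(1, 1, "a"), Cell(2, 2, "d")]
rows = group_rows(cells)
columns = group_columns(cells)
print([[c.text for c in row] for row in rows])     # → [['a', 'b'], ['c', 'd']]
print([[c.text for c in col] for col in columns])  # → [['a', 'c'], ['b', 'd']]
print(len(rows), len(columns))                     # → 2 2
```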
Cell
- class libpdf.models.table.Cell(row, col, position, links, table=None, textbox=None)
Bases:
ModelBase
PDF table cell data.
- Variables:
row (int) – the row number of the cell, 1-based
col (int) – the column number of the cell, 1-based
textbox (HorizontalBox) – the textbox of the cell, as extracted from pdfminer and converted to HorizontalBox
position (Position) – a Position instance determining the location of the cell
b_table (Table) – a Table instance that contains the cell
links (List[Link]) – list of links in the cell text
- Parameters:
row (int) –
col (int) –
position (Position) –
links (List[Link]) –
table (Table) –
textbox (HorizontalBox) –
- set_links_backref()
Set b_source back reference on all links.
Figure
- class libpdf.models.figure.Figure(idx, rel_path, position, links, textboxes, text=None, caption=None)
Bases:
Element
PDF figure.
A figure can be a bitmap image or vector graphics mixed with overlaying text. libpdf extracts figures into an external file, where rel_path defines the path to the external file. The text property contains text extracted from the figure area. This can be highly unstructured because libpdf does not analyze the text layout within figures, as there is no common denominator for an algorithm. libpdf will, however, do the same character grouping analysis as for paragraphs, so the user can assume text flow is from top left to bottom right.
- Variables:
idx (int) – the number of the instance in the current scope, 1-based
rel_path (str) – the path to the external file containing the figure
textboxes (a list of HorizontalBox) – the textboxes of the figure, as extracted from pdfminer
caption (str) – the caption of the figure (text over/under the figure describing it)
position (Position) – a Position instance determining the location of the figure
links (List[Link]) – list of text links in the figure area
- Parameters:
idx (int) –
rel_path (str) –
position (Position) –
links (List[Link]) –
textboxes (List[HorizontalBox]) –
text (str) –
caption (str) –
- property id_
Return the identifier to address the Figure.
The identifier follows the pattern figure.<idx>, where idx is the 1-based number of the Figure in the current scope (root, chapter, sub-chapters, page). It is used as a link target if a PDF link annotation points to the Element.
According to the PDF model, the parameter should be called id, but that name is already used by a Python built-in, so id_ is used.
- Type:
str
- set_links_backref()
Set b_source back reference on all links.
Rect
- class libpdf.models.rect.Rect(idx, position, textbox, non_stroking_color=None)
Bases:
Element
Rectangles in a PDF.
The rectangles are extracted from pdfplumber. The text covered by the rectangle is extracted and stored in a newly instantiated textbox.
- Parameters:
idx (int) –
position (Position) –
textbox (HorizontalBox) –
non_stroking_color (tuple) –
- property id_: str
Return the identifier to address the Figure.
The identifier follows the pattern figure.<idx>, where idx is the 1-based number of the Figure in the current scope (root, chapter, sub-chapters, page). It is used as a link target if a PDF link annotation points to the Element.
According to the PDF model, the parameter should be called id, but that name is already used by a Python built-in, so id_ is used.
- Type:
str
Position
- class libpdf.models.position.Position(x0, y0, x1, y1, page, element=None, cell=None)
Bases:
object
Define the coordinates of an Element or Cell.
A position is either linked by an Element or by a Cell (mutually exclusive). A position keeps a reference to the Page it is located on.
Here is some ASCII art to explain the libpdf coordinates:
*-page------------------------*
|                             |
|                             |
|                             |
|        +-bbox----+          |
|<--x0-->|         |    ^     |
|        |         |    |     |
|<-------|---x1--->|    |     |
|        +---------+    |     |
|            ^          |     |
|            y0         y1    |
|            v          v     |
*-----------------------------*
The bbox definition [x0, y0, x1, y1] is in sync with pdfminer and the PDF standard. Coordinate type is float for both libpdf and pdfminer.
Note
pdfplumber has a different definition of bounding boxes:
*-page------------------------*
|            ^          ^     |
|           top         |     |
|            v          |     |
|        +-bbox----+    |     |
|<--x0-->|         |    |     |
|        |         |  bottom  |
|<-------|---x1--->|    |     |
|        +---------+    v     |
|                             |
|                             |
|                             |
*-----------------------------*
The pdfplumber bounding box is [x0, top, x1, bottom]. Coordinate type is Decimal.
To deal with the coordinate and type differences, there are conversion functions to_pdfplumber_bbox and from_pdfplumber_bbox in module libpdf.utils.
- Variables:
x0 (float) – distance from the left of the page to the left edge of the box
y0 (float) – distance from the bottom of the page to the lower edge of the box (less than y1)
x1 (float) – distance from the left of the page to the right edge of the box
y1 (float) – distance from the bottom of the page to the upper edge of the box (greater than y0)
page (Page) – reference to a Page object
element (Element) – element that refers to the position (mutually exclusive with cell)
cell (Cell) – cell that refers to the position (mutually exclusive with element)
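The relationship between the two bounding box conventions comes down to flipping the vertical axis at the page height. The sketch below uses hypothetical helper names and plain floats. It does not reproduce the signatures of to_pdfplumber_bbox and from_pdfplumber_bbox, which are not documented in this section, and it leaves out the Decimal conversion.

```python
def to_top_bottom(bbox, page_height):
    """Convert [x0, y0, x1, y1] (origin bottom-left) to [x0, top, x1, bottom]."""
    x0, y0, x1, y1 = bbox
    # The upper edge (y1) is closest to the page top, so it becomes `top`.
    return [x0, page_height - y1, x1, page_height - y0]


def from_top_bottom(bbox, page_height):
    """Convert [x0, top, x1, bottom] (origin top-left) to [x0, y0, x1, y1]."""
    x0, top, x1, bottom = bbox
    return [x0, page_height - bottom, x1, page_height - top]


box = [72.0, 600.0, 300.0, 700.0]
plumber_box = to_top_bottom(box, 842.0)
print(plumber_box)                                 # → [72.0, 142.0, 300.0, 242.0]
print(from_top_bottom(plumber_box, 842.0) == box)  # → True
```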
Link
- class libpdf.models.link.Link(idx_start, idx_stop, pos_target, libpdf_target=None, b_source=None)
Bases:
ModelBase
PDF link embedded in the text.
- Variables:
idx_start (int) – the 0-based index of the start char. This char is included in the link text.
idx_stop (int) – the 0-based index of the stop char. This char is excluded in the link text, so the start/stop indexes are compatible with the Python string slicing notation.
pos_target (Dict[str, float]) – the position the link points to, e.g. {'page': 4, 'x': 56, 'y': 789}
libpdf_target (str) – points either to a libpdf Element or to a Page. The libpdf Element link is built by concatenating nested elements, separated by ‘/’, e.g. chapter.3/chapter.3.2/table.2. In case pos_target cannot be resolved to a libpdf Element, the target is set to the target coordinates given as page.<id>/<X>:<Y>, e.g. page.4/56:789. In this case libpdf_target carries the same information as pos_target.
b_source – back reference to the link source, can be Paragraph, Figure or Cell
- property source_chars
Show the text between the start and stop indices.
The main use case for this is debugging.
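Because idx_start is inclusive and idx_stop is exclusive, the link text is a plain Python slice of the source text. The fallback target string can be assembled from pos_target in the same spirit. Both helpers below are hypothetical illustrations, and the number formatting of the coordinates is an assumption.

```python
def link_text(text, idx_start, idx_stop):
    """Return the characters covered by a link (start inclusive, stop exclusive)."""
    return text[idx_start:idx_stop]


def fallback_target(pos_target):
    """Build a page.<id>/<X>:<Y> string from a pos_target dictionary."""
    return f"page.{pos_target['page']}/{pos_target['x']}:{pos_target['y']}"


text = "See chapter 3.2 for details."
print(link_text(text, 4, 15))                           # → chapter 3.2
print(fallback_target({"page": 4, "x": 56, "y": 789}))  # → page.4/56:789
```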
Char
- class libpdf.models.horizontal_box.Char(text, x0=None, y0=None, x1=None, y1=None, ncolor=None, fontname=None)
Bases:
object
Define the character class.
- Variables:
text (str) – the plain text of the character
x0 (float) – distance from the left of the page to the left edge of the character
y0 (float) – distance from the bottom of the page to the lower edge of the character (less than y1)
x1 (float) – distance from the left of the page to the right edge of the character
y1 (float) – distance from the bottom of the page to the upper edge of the character (greater than y0)
ncolor (Tuple[float, float, float]) – non-stroking-color as rgb value
- Parameters:
text (str) –
x0 (float | None) –
y0 (float | None) –
x1 (float | None) –
y1 (float | None) –
ncolor (tuple | None) –
fontname (str | None) –
Word
- class libpdf.models.horizontal_box.Word(chars, x0=None, y0=None, x1=None, y1=None)
Bases:
object
Define the word class.
A word shall contain several characters.
- Variables:
chars (List[Char]) – a list of the characters that make up the word
- Parameters:
chars (list[Char]) –
x0 (float | None) –
y0 (float | None) –
x1 (float | None) –
y1 (float | None) –
- property text: str
Return plain text.
HorizontalLine
- class libpdf.models.horizontal_box.HorizontalLine(words, x0=None, y0=None, x1=None, y1=None)
Bases:
object
Define the horizontal line class.
A horizontal line shall contain a word or several words.
- Variables:
words (List[Word]) – a list of the words
- Parameters:
words (list[Word]) –
x0 (float | None) –
y0 (float | None) –
x1 (float | None) –
y1 (float | None) –
- property text: str
Return plain text.
HorizontalBox
- class libpdf.models.horizontal_box.HorizontalBox(lines, x0=None, y0=None, x1=None, y1=None)
Bases:
object
Define the horizontal box class.
A horizontal box shall contain one or more horizontal lines.
- Variables:
lines (List[HorizontalLine]) – a list of HorizontalLine instances
- Parameters:
lines (list[HorizontalLine]) –
x0 (float | None) –
y0 (float | None) –
x1 (float | None) –
y1 (float | None) –
- property text: str
Return plain text.
- property words: list[str]
Return list of words.
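The four classes above nest as Char, Word, HorizontalLine and HorizontalBox. The following simplified stand-ins (coordinates, colors and font names omitted) sketch how the text properties could be composed. The separators used here, a space between words and a newline between lines, are assumptions and not confirmed libpdf behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    text: str


@dataclass
class Word:
    chars: List[Char]

    @property
    def text(self):
        return "".join(char.text for char in self.chars)


@dataclass
class HorizontalLine:
    words: List[Word]

    @property
    def text(self):
        # Assumption: words are separated by a single space.
        return " ".join(word.text for word in self.words)


@dataclass
class HorizontalBox:
    lines: List[HorizontalLine]

    @property
    def text(self):
        # Assumption: lines are separated by a newline.
        return "\n".join(line.text for line in self.lines)

    @property
    def words(self):
        return [word.text for line in self.lines for word in line.words]


def make_word(string):
    """Build a Word from a plain string, one Char per character."""
    return Word([Char(c) for c in string])


box = HorizontalBox([
    HorizontalLine([make_word("Hello"), make_word("PDF")]),
    HorizontalLine([make_word("world")]),
])
print(repr(box.text))  # → 'Hello PDF\nworld'
print(box.words)       # → ['Hello', 'PDF', 'world']
```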