API
Loading a PDF
- libpdf.load(pdf, verbose=1, page_range=None, page_crop=None, smart_page_crop=False, save_figures=False, figure_dir='figures', no_chapters=False, no_paragraphs=False, no_tables=False, no_figures=False, init_logging=True, visual_debug=False, visual_debug_output_dir='visual_debug_libpdf', visual_split_elements=False, visual_debug_include_elements=None, visual_debug_exclude_elements=None)
Entry point for the usage of libpdf as a library.
The function is actually called main_api() to better correspond to main_cli() and main(). It is, however, exposed to the API as libpdf.load(), which is considered more expressive for API users.
- Parameters
pdf (str) – path to the PDF to read
verbose (int) – verbosity level as integer (0 = errors, fatal, critical; 1 = warnings; 2 = info; 3 = debug)
page_range (Optional[str]) – range of pages to extract as string without spaces (e.g. 3-5 or 3,4,7 or 3-5,7)
page_crop (Optional[Tuple[float, float, float, float]]) – see description in function core.main()
smart_page_crop (bool) – see description in function core.main()
save_figures (bool) – flag triggering the export of figures to the figure_dir
figure_dir (str) – output directory for extracted figures; if it does not exist, it will be created
no_chapters (bool) – flag triggering the exclusion of chapters (resulting in a flat list of elements)
no_paragraphs (bool) – flag triggering the exclusion of paragraphs (no normal text content)
no_tables (bool) – flag triggering the exclusion of tables
no_figures (bool) – flag triggering the exclusion of figures
init_logging (bool) – flag indicating whether libpdf shall instantiate a root log handler that is capable of handling both log messages and progress bars; it does so by passing all log messages to tqdm.write()
visual_debug (bool) – flag triggering visual debug feature
visual_debug_output_dir (str) – output directory for visualized pdf pages
visual_split_elements (bool) – flag triggering the split of visualized elements into separate folders
visual_debug_include_elements (Optional[List[str]]) – a list of elements to include in visual debugging
visual_debug_exclude_elements (Optional[List[str]]) – a list of elements to exclude from visual debugging
- Returns
instance of the ApiObjects class
- Return type
ApiObjects
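The page_range parameter expects a compact string without spaces. The sketch below shows one way to build such a string from a list of page numbers and pass it to libpdf.load(). The keyword arguments come from the signature above; the helpers to_page_range and load_selected_pages are hypothetical names introduced only for this illustration.

```python
from typing import Iterable


def to_page_range(pages: Iterable[int]) -> str:
    """Collapse page numbers into a string such as '3-5,7' (no spaces)."""
    ordered = sorted(set(pages))
    parts = []
    start = prev = None
    for page in ordered:
        if start is None:
            start = prev = page
        elif page == prev + 1:
            prev = page  # still inside a consecutive run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = page
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def load_selected_pages(pdf_path: str, pages: Iterable[int]):
    """Hypothetical wrapper: extract only the given pages and skip figures."""
    import libpdf  # imported lazily so to_page_range works without libpdf

    return libpdf.load(
        pdf_path,
        verbose=0,
        page_range=to_page_range(pages),
        no_figures=True,
    )
```

Calling load_selected_pages requires libpdf to be installed and a readable PDF at pdf_path; to_page_range can be used on its own.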
Returned objects
- class libpdf.apiobjects.ApiObjects(root, chapters, paragraphs, tables, figures, pdfplumber, pdfminer)
Bases:
object
Data class that stores instances for all extracted PDF objects.
- Variables
root (Root) – Main entry point to structured data as per the UML PDF model.
flattened (Flattened) – named tuple holding flattened versions of all nested objects in root.contents.*; the element types chapters/paragraphs/tables/figures can be directly accessed (API convenience)
pdfplumber (PDF) – pdfplumber PDF object for further processing by API users
pdfminer (PDFDocument) – pdfminer PDF object for further processing by API users, also available in pdfplumber.doc
- Parameters
root (libpdf.models.root.Root) –
chapters (List[libpdf.models.chapter.Chapter]) –
paragraphs (List[libpdf.models.paragraph.Paragraph]) –
tables (List[libpdf.models.table.Table]) –
figures (List[libpdf.models.figure.Figure]) –
pdfplumber (pdfplumber.pdf.PDF) –
pdfminer (pdfminer.pdfdocument.PDFDocument) –
- class libpdf.apiobjects.Flattened(chapters, paragraphs, tables, figures)
Bases:
tuple
NamedTuple that holds flattened Element instances and also provides type hinting.
- Parameters
chapters (List[libpdf.models.chapter.Chapter]) –
paragraphs (List[libpdf.models.paragraph.Paragraph]) –
tables (List[libpdf.models.table.Table]) –
figures (List[libpdf.models.figure.Figure]) –
- chapters: List[libpdf.models.chapter.Chapter]
Alias for field number 0
- paragraphs: List[libpdf.models.paragraph.Paragraph]
Alias for field number 1
- tables: List[libpdf.models.table.Table]
Alias for field number 2
- figures: List[libpdf.models.figure.Figure]
Alias for field number 3
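As a rough illustration of the flattened convenience access, the sketch below counts the elements of each type. It only relies on the flattened attribute and the four field names documented above. The stand-in objects are hypothetical and replace a real ApiObjects instance so that the snippet runs without a PDF.

```python
from collections import namedtuple
from types import SimpleNamespace

FIELDS = ("chapters", "paragraphs", "tables", "figures")


def count_elements(objects):
    """Return the number of flattened elements per type."""
    flat = objects.flattened
    return {name: len(getattr(flat, name)) for name in FIELDS}


# Hypothetical stand-ins for the object returned by libpdf.load()
FakeFlattened = namedtuple("FakeFlattened", FIELDS)
fake_objects = SimpleNamespace(
    flattened=FakeFlattened(
        chapters=["c1"], paragraphs=["p1", "p2"], tables=["t1"], figures=[]
    )
)

print(count_elements(fake_objects))
```

With a real result, the same function would be called as count_elements(libpdf.load(...)).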
Model classes
The ApiObjects object returned by libpdf.load() contains instances of the following classes.
Root
- class libpdf.models.root.Root(file, pages, content)
Bases:
libpdf.models.model_base.ModelBase
Main entry point to the UML PDF model.
- Variables
- Parameters
file (libpdf.models.file.File) –
pages (List[libpdf.models.page.Page]) –
content (List[Union[libpdf.models.chapter.Chapter, libpdf.models.paragraph.Paragraph, libpdf.models.table.Table, libpdf.models.figure.Figure]]) –
File
- class libpdf.models.file.File(name, path, page_count, crop_top=0, crop_bottom=0, crop_left=0, crop_right=0, file_meta=None, root=None)
Bases:
libpdf.models.model_base.ModelBase
PDF file data.
There is a file-wide crop feature that removes static parts from each page:
*-page-------------------------------------*
|                    ^                     |
|                crop_top                  |
|                    v                     |
|               +-content-+                |
|<--crop_left-->|         |<--crop_right-->|
|               |         |                |
|               |         |                |
|               +---------+                |
|                    ^                     |
|               crop_bottom                |
|                    v                     |
*------------------------------------------*
It can be used to ignore headers, footers or sidebars. The user-defined parameters are exposed to both the CLI and API.
- Variables
name (str) – PDF file name
path (str) – PDF file path
page_count (int) – number of pages in PDF
crop_top (float) – distance in points from top of each page to ignore for extraction
crop_bottom (float) – distance in points from bottom of each page to ignore for extraction
crop_left (float) – distance in points from left side of each page to ignore for extraction
crop_right (float) – distance in points from right side of each page to ignore for extraction
file_meta (FileMeta) – reference to FileMeta instance
b_root (Root) – back reference to Root instance
- Parameters
name (str) –
path (str) –
page_count (int) –
crop_top (float) –
crop_bottom (float) –
crop_left (float) –
crop_right (float) –
file_meta (libpdf.models.file_meta.FileMeta) –
root (Root) –
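The crop margins can be read as shrinking the page to a content box. The sketch below computes that box in points, assuming a bottom-left origin and the [x0, y0, x1, y1] convention described for Position. The function content_bbox is a hypothetical helper, not part of libpdf.

```python
def content_bbox(width, height, crop_top=0.0, crop_bottom=0.0,
                 crop_left=0.0, crop_right=0.0):
    """Return (x0, y0, x1, y1) of the page area left after cropping."""
    x0 = crop_left
    y0 = crop_bottom
    x1 = width - crop_right
    y1 = height - crop_top
    if x0 >= x1 or y0 >= y1:
        raise ValueError("crop margins leave no content area")
    return (x0, y0, x1, y1)


# A 595 x 842 point page with a 50 point header and a 40 point footer removed
print(content_bbox(595, 842, crop_top=50, crop_bottom=40))
```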
- property id_
Return the identifier to address the file.
The parameter can later be used during libpdf postprocessing to link to elements in other files.
According to the PDF model, the property should be called id, but that name is already taken by a Python built-in function, so id_ is used. The file identifier is built from the file name including the extension. All characters that do not belong to the Python identifier character set (regex character set [_a-zA-Z0-9]) are removed.
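The character filter described above can be sketched with a single regular expression. This only illustrates the stated rule. The function file_identifier is a hypothetical name, and libpdf may apply further steps that are not described here.

```python
import re


def file_identifier(file_name: str) -> str:
    """Drop every character outside the set [_a-zA-Z0-9]."""
    return re.sub(r"[^_a-zA-Z0-9]", "", file_name)


print(file_identifier("my report-v2.pdf"))
```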
FileMeta
- class libpdf.models.file_meta.FileMeta(author=None, title=None, subject=None, creator=None, producer=None, keywords=None, creation_date=None, modified_date=None, trapped=None, file=None)
Bases:
libpdf.models.model_base.ModelBase
PDF file meta data.
- Variables
author (str) – PDF author meta data field
title (str) – PDF title meta data field
subject (str) – PDF subject meta data field
creator (str) – PDF creator meta data field
producer (str) – PDF producer meta data field
keywords (str) – PDF keywords meta data field
creation_date (datetime) – PDF creation date given as datetime instance
modified_date (datetime) – PDF modified date given as datetime instance
trapped (bool) – PDF printing trap flag (https://en.wikipedia.org/wiki/Trap_%28printing%29)
b_file (File) – back reference to a File instance
- Parameters
author (str) –
title (str) –
subject (str) –
creator (str) –
producer (str) –
keywords (str) –
creation_date (datetime.datetime) –
modified_date (datetime.datetime) –
trapped (bool) –
file (File) –
Page
- class libpdf.models.page.Page(number, width, height, content=None, root=None, positions=None)
Bases:
libpdf.models.model_base.ModelBase
PDF page data.
- Variables
number (int) – PDF page number, 1-based
width (float) – page width in points
height (float) – page height in points
content (List[Union[Chapter, Paragraph, Table, Figure]]) – ordered list of elements on the page; chapters might still be nested if the page contains sub-chapters
root (Root) – back reference to a Root instance
b_positions (List[Position]) – back reference to all Position instances on the page
- Parameters
- property id_
Return the identifier to address the Page.
The identifier follows the pattern page.<number>. It is used as a link target if a PDF link annotation points to a blank space position, i.e. there is no Chapter, Paragraph, Table or Figure at the target location.
According to the PDF model, the property should be called id, but that name is already taken by a Python built-in function, so id_ is used.
Element
- class libpdf.models.element.Element(position, root=None, chapter=None)
Bases:
libpdf.models.model_base.ModelBase, abc.ABC
Base class for Chapter, Paragraph, Table and Figure.
- Variables
- Parameters
- abstract property id_
Return the identifier to address the Element.
- property uid
Return the unique identifier to address the full path to the Element.
The identifier follows the pattern element.<number>/element.<number>.
For example, the uid for a paragraph in chapter.2.1.4 is ‘chapter.2/chapter.2.1/chapter.2.1.4/paragraph.6’.
- Type
str
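The uid pattern can be pictured as walking from an element up through its parent chapters and joining the identifiers with ‘/’. The sketch below uses a hypothetical Node stand-in with a parent attribute. The real Element class may store its parent reference under a different name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for an Element with a link to its parent chapter."""
    id_: str
    parent: Optional["Node"] = None


def build_uid(node: Node) -> str:
    """Join the id_ values from the outermost chapter down to the node."""
    parts = []
    current = node
    while current is not None:
        parts.append(current.id_)
        current = current.parent
    return "/".join(reversed(parts))


chapter_2 = Node("chapter.2")
chapter_2_1 = Node("chapter.2.1", parent=chapter_2)
chapter_2_1_4 = Node("chapter.2.1.4", parent=chapter_2_1)
paragraph = Node("paragraph.6", parent=chapter_2_1_4)

print(build_uid(paragraph))
```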
Chapter
- class libpdf.models.chapter.Chapter(title, number, position, content=None, chapter=None)
Bases:
libpdf.models.element.Element
PDF chapter (extracted from PDF outline).
Chapter elements define the structure of the PDF. If an outline is given, Chapters are extracted from it and all elements (sub-chapters, tables, figures, paragraphs) are put below the Chapter in the ordered content list.
- Variables
title (str) – the title of the chapter, as extracted from outline
number (str) – the chapter number as string (e.g. ‘3.2.4’)
position (Position) – a Position instance determining the location of the Chapter; a Chapter commonly spans across several pages, however only one Position is aggregated because the end of the Chapter can be determined by looking at the next Chapter
content (List[Union[Chapter, Paragraph, Table, Figure]]) – the content of the chapter (other sub-chapters, paragraphs, tables, figures)
- Parameters
- property id_
Return the identifier to address the Chapter.
The identifier follows the pattern chapter.<number>.
It is used as a link target if a PDF link annotation points to the Element.
According to the PDF model, the property should be called id, but that name is already taken by a Python built-in function, so id_ is used.
- Type
str
Paragraph
- class libpdf.models.paragraph.Paragraph(idx, text, position, links, root=None, chapter=None)
Bases:
libpdf.models.element.Element
PDF paragraph (normal text).
A paragraph always ends at the end of a page.
- Variables
- Parameters
idx (int) –
text (str) –
position (Position) –
links (List[libpdf.models.link.Link]) –
root (Root) –
chapter (Chapter) –
- property id_
Return the identifier to address the Paragraph.
The identifier follows the pattern paragraph.<idx>, where idx is the 1-based number of the Paragraph in the current scope (root, chapter, sub-chapters, page).
It is used as a link target if a PDF link annotation points to the Element.
According to the PDF model, the property should be called id, but that name is already taken by a Python built-in function, so id_ is used.
- Type
str
- set_links_backref()
Set b_source back reference on all links.
Table
- class libpdf.models.table.Table(idx, cells, position, caption=None)
Bases:
libpdf.models.element.Element
PDF table data.
- Variables
idx (int) – the number of the instance in the current scope, 1-based
cells (List[Cell]) – a list of Cell instances that are part of the table
caption (str) – the caption of the table (text over/under the table describing it)
position (Position) – a Position instance determining the location of the table
- Parameters
- property id_
Return the identifier to address the Table.
The identifier follows the pattern table.<idx>, where idx is the 1-based number of the Table in the current scope (root, chapter, sub-chapters, page).
It is used as a link target if a PDF link annotation points to the Element.
According to the PDF model, the property should be called id, but that name is already taken by a Python built-in function, so id_ is used.
- Type
str
- property rows
Return a list of rows in the table, where each row is a list of the Cells in that row.
- Type
List[List[Cell]]
- property columns
Return a list of columns in the table, where each column is a list of the Cells in that column.
- Type
List[List[Cell]]
- property rows_count
Return the number of rows in the table.
- Type
int
- property columns_count
Return the number of columns in the table.
- Type
int
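To make the shape of rows and columns concrete, the sketch below groups a flat list of cells by their 1-based row and col numbers. The Cell namedtuple and both grouping functions are hypothetical stand-ins that only mirror the documented row, col and text fields. They are not libpdf's implementation.

```python
from collections import namedtuple

# Hypothetical stand-in carrying only the fields needed for grouping
Cell = namedtuple("Cell", ["row", "col", "text"])


def group_rows(cells):
    """Return a list of rows, each ordered by column number."""
    row_numbers = sorted({cell.row for cell in cells})
    return [
        sorted((c for c in cells if c.row == number), key=lambda c: c.col)
        for number in row_numbers
    ]


def group_columns(cells):
    """Return a list of columns, each ordered by row number."""
    col_numbers = sorted({cell.col for cell in cells})
    return [
        sorted((c for c in cells if c.col == number), key=lambda c: c.row)
        for number in col_numbers
    ]


cells = [
    Cell(1, 1, "Name"), Cell(1, 2, "Value"),
    Cell(2, 1, "width"), Cell(2, 2, "595"),
]
for row in group_rows(cells):
    print([cell.text for cell in row])
```

With this layout, len(group_rows(cells)) and len(group_columns(cells)) correspond to what rows_count and columns_count report.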
Cell
- class libpdf.models.table.Cell(row, col, text, position, links, table=None)
Bases:
libpdf.models.model_base.ModelBase
PDF table cell data.
- Variables
row (int) – the row number of the cell, 1-based
col (int) – the column number of the cell, 1-based
text (str) – the text content of the cell
position (Position) – a Position instance determining the location of the cell
b_table (Table) – a Table instance that contains the cell
links (List[Link]) – list of links in the cell text
- Parameters
row (int) –
col (int) –
text (str) –
position (libpdf.models.position.Position) –
links (List[libpdf.models.link.Link]) –
table (libpdf.models.table.Table) –
- set_links_backref()
Set b_source back reference on all links.
Figure
- class libpdf.models.figure.Figure(idx, rel_path, position, links, text=None, caption=None)
Bases:
libpdf.models.element.Element
PDF figure.
A figure can be a bitmap image or vector graphics mixed with overlaying text. libpdf extracts figures into an external file, and rel_path defines the path to that file. The text property contains text extracted from the figure area. This text can be highly unstructured because libpdf does not analyze the text layout within figures, as there is no common denominator for an algorithm. libpdf does, however, perform the same character grouping analysis as for paragraphs, so the user can assume that text flows from top left to bottom right.
- Variables
idx (int) – the number of the instance in the current scope, 1-based
rel_path (str) – the path to the external file containing the figure
text (str) – all merged text inside the figure area
caption (str) – the caption of the figure (text over/under the figure describing it)
position (Position) – a Position instance determining the location of the figure
links (List[Link]) – list of text links in the figure area
- Parameters
idx (int) –
rel_path (str) –
position (Position) –
links (List[libpdf.models.link.Link]) –
text (str) –
caption (str) –
- property id_
Return the identifier to address the Figure.
The identifier follows the pattern figure.<idx>, where idx is the 1-based number of the Figure in the current scope (root, chapter, sub-chapters, page).
It is used as a link target if a PDF link annotation points to the Element.
According to the PDF model, the property should be called id, but that name is already taken by a Python built-in function, so id_ is used.
- Type
str
- set_links_backref()
Set b_source back reference on all links.
Position
- class libpdf.models.position.Position(x0, y0, x1, y1, page, element=None, cell=None)
Bases:
object
Define the coordinates of an Element or Cell.
A position is linked either by an Element or by a Cell (mutually exclusive). A position keeps a reference to the Page it is located on.
Here is some ASCII art to explain the libpdf coordinates:
*-page------------------------*
|                             |
|                             |
|                             |
|        +-bbox----+          |
|<--x0-->|         |     ^    |
|        |         |     |    |
|<-------|---x1--->|     |    |
|        +---------+     |    |
|             ^          |    |
|             y0         y1   |
|             v          v    |
*-----------------------------*
The bbox definition [x0, y0, x1, y1] is in sync with pdfminer and the PDF standard. Coordinate type is float for both libpdf and pdfminer.
Note
pdfplumber has a different definition of bounding boxes:
*-page------------------------*
|             ^          ^    |
|            top         |    |
|             v          |    |
|        +-bbox----+     |    |
|<--x0-->|         |     |    |
|        |         |   bottom |
|<-------|---x1--->|     |    |
|        +---------+     v    |
|                             |
|                             |
|                             |
*-----------------------------*
The pdfplumber bounding box is [x0, top, x1, bottom]. Coordinate type is Decimal.
To deal with the coordinate and type differences, there are conversion functions to_pdfplumber_bbox and from_pdfplumber_bbox in module libpdf.utils.
- Variables
x0 (float) – distance from the left of the page to the left edge of the box
y0 (float) – distance from the bottom of the page to the lower edge of the box (less than y1)
x1 (float) – distance from the left of the page to the right edge of the box
y1 (float) – distance from the bottom of the page to the upper edge of the box (greater than y0)
page (Page) – reference to a Page object
element (Element) – element that refers to the position (mutually exclusive with cell)
cell (Cell) – cell that refers to the position (mutually exclusive with element)
- Parameters
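The difference between the two bounding box definitions comes down to flipping the vertical axis against the page height. The sketch below shows that arithmetic under the definitions given above (float values measured from the bottom on one side, Decimal values measured from the top on the other). The function names are hypothetical. They are not the libpdf.utils helpers, whose exact signatures are not shown in this section.

```python
from decimal import Decimal


def to_top_left_bbox(bbox, page_height):
    """Convert (x0, y0, x1, y1) from the bottom to (x0, top, x1, bottom) from the top."""
    x0, y0, x1, y1 = bbox
    top = page_height - y1      # upper edge, measured from the page top
    bottom = page_height - y0   # lower edge, measured from the page top
    return tuple(Decimal(str(value)) for value in (x0, top, x1, bottom))


def to_bottom_left_bbox(bbox, page_height):
    """Convert (x0, top, x1, bottom) back to float (x0, y0, x1, y1)."""
    x0, top, x1, bottom = (float(value) for value in bbox)
    return (x0, page_height - bottom, x1, page_height - top)


box = (100.0, 700.0, 200.0, 750.0)
flipped = to_top_left_bbox(box, page_height=842.0)
print(flipped)
print(to_bottom_left_bbox(flipped, page_height=842.0))
```

Note that y0 is less than y1 in the first convention, while top is less than bottom in the second.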
Link
- class libpdf.models.link.Link(idx_start, idx_stop, pos_target, libpdf_target=None, b_source=None)
Bases:
libpdf.models.model_base.ModelBase
PDF link embedded in the text.
- Variables
idx_start (int) – the 0-based index of the start char. This char is included in the link text.
idx_stop (int) – the 0-based index of the stop char. This char is excluded from the link text, so the start/stop indexes are compatible with the Python string slicing notation.
pos_target (Dict[str, float]) – the position the link points to, e.g. {'page': 4, 'x': 56, 'y': 789}
libpdf_target (str) – points either to a libpdf Element or to a Page. The libpdf Element link is built by concatenating nested elements, separated by ‘/’, e.g. chapter.3/chapter.3.2/table.2. In case pos_target cannot be resolved to a libpdf Element, the target is set to the target coordinates given as page.<id>/<X>:<Y>, e.g. page.4/56:789. In this case libpdf_target is identical to pos_target.
b_source – back reference to the link source, can be Paragraph, Figure or Cell
- Parameters
- property source_chars
Show the text between the start and stop indices.
The main use case for this is debugging.
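The index semantics and the page fallback target can be illustrated with plain strings and dictionaries. In the sketch below, link_text mirrors the slicing behaviour described for idx_start and idx_stop, and page_target formats the page.<id>/<X>:<Y> pattern. Both function names are hypothetical, and the formatting of non-integer coordinates in libpdf is not specified here.

```python
def link_text(text: str, idx_start: int, idx_stop: int) -> str:
    """Return the linked characters; idx_stop is exclusive, as in slicing."""
    return text[idx_start:idx_stop]


def page_target(pos_target: dict) -> str:
    """Format a position as page.<id>/<X>:<Y>."""
    return f"page.{pos_target['page']}/{pos_target['x']}:{pos_target['y']}"


paragraph_text = "See chapter 3.2 for details"
print(link_text(paragraph_text, 4, 15))
print(page_target({"page": 4, "x": 56, "y": 789}))
```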