A ton of improvements and new features:
- Shifts to a lazy-loading paradigm, so that you don't have to process an entire PDF just to access one page.
- Strips out the `pandas` requirement and usage.
    - Results in a 3x-ish speedup for `within_bbox` and similar methods, thanks to short-circuiting `&` operators.
- Moves from `float`s to `Decimal`s to improve accuracy of equality comparisons.
- Moves to a more modular architecture, adding `Container`, `Page`, and `CroppedPage` classes.
- Adds `Page.crop(...)`.
- Adds `Page.extract_table(...)` for Tabula-like functionality.
- Adds `PDF.metadata` property.
- Adds derived properties `Container.rect_edges` and `Container.edges`, decomposing each rectangle into its constituent lines.
- Renames `collate_chars(...)` to `get_text(...)` (while retaining a reference to the former).
- Enriches the command-line tool's JSON output to include PDF metadata and page dimensions. [https://github.com//issues/3]
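
To illustrate the lazy-loading item above, here is a minimal sketch using only the standard library. The `LazyDocument` class and its `page(...)` method are hypothetical names invented for this example, not the library's actual API; the point is only that a page is parsed the first time it is requested and cached afterwards.

```python
class LazyDocument:
    """Hypothetical stand-in for a document whose pages are parsed on demand."""

    def __init__(self, raw_pages):
        self._raw_pages = raw_pages  # unparsed page data
        self._cache = {}             # parsed pages, keyed by index
        self.parse_count = 0         # how many pages were actually parsed

    def _parse(self, raw):
        # Stand-in for expensive per-page work.
        self.parse_count += 1
        return raw.upper()

    def page(self, index):
        # Parse only on first access; reuse the cached result afterwards.
        if index not in self._cache:
            self._cache[index] = self._parse(self._raw_pages[index])
        return self._cache[index]


doc = LazyDocument(["first", "second", "third"])
second = doc.page(1)   # only this page is parsed
again = doc.page(1)    # served from the cache
```

After these calls, `doc.parse_count` is 1 even though the document holds three pages.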
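
The motivation for moving from `float`s to `Decimal`s can be seen in a small standard-library example. This is a general illustration of float versus `Decimal` equality, not code taken from the library itself.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly,
# so values that "should" be equal can compare unequal.
float_equal = (0.1 + 0.2 == 0.3)

# Decimals constructed from strings keep the exact decimal value,
# so the same comparison behaves as expected.
decimal_equal = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

print(float_equal, decimal_equal)
```

The same issue arises when comparing object coordinates, such as checking whether two edges share an endpoint.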
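
As a rough sketch of what "decomposing each rectangle into its constituent lines" could look like, the hypothetical helper below turns one rectangle into four edges. The function name `rect_to_edges`, the dictionary keys (`x0`, `x1`, `top`, `bottom`), and the `orientation` field are assumptions made for illustration and may not match the library's internal representation.

```python
def rect_to_edges(rect):
    """Split one rectangle into its four boundary lines (hypothetical helper)."""
    x0, x1 = rect["x0"], rect["x1"]
    top, bottom = rect["top"], rect["bottom"]
    return [
        # Horizontal edges: top and bottom sides.
        {"x0": x0, "x1": x1, "top": top, "bottom": top, "orientation": "h"},
        {"x0": x0, "x1": x1, "top": bottom, "bottom": bottom, "orientation": "h"},
        # Vertical edges: left and right sides.
        {"x0": x0, "x1": x0, "top": top, "bottom": bottom, "orientation": "v"},
        {"x0": x1, "x1": x1, "top": top, "bottom": bottom, "orientation": "v"},
    ]


edges = rect_to_edges({"x0": 10, "top": 20, "x1": 110, "bottom": 70})
```

Treating rectangle sides as ordinary lines lets downstream logic, such as table detection, handle drawn lines and rectangle borders uniformly.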