Changed
- Change the (still experimental) Page/utils.extract_text(layout=True) approach so that it pads, as needed, the ends of lines with spaces and the end of the text with blank lines, to better mimic the page layout. (d3662de)
- Refactor handling of the pts attribute and, in doing so, deprecate the curve_obj["points"] attribute, and fix PageImage.draw_line(...)'s handling of diagonal lines. (216bedd)
- Breaking change: In Page.extract_table[s](...), keep_blank_chars must now be passed as text_keep_blank_chars, for consistency's sake. (c4e1b29)
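For callers that build their table settings as a dictionary, the keep_blank_chars rename could be absorbed with a small shim such as the one below. This is a minimal sketch under that assumption: migrate_table_settings is a hypothetical helper, not part of the library, and the example settings are illustrative.

```python
def migrate_table_settings(settings):
    """Return a copy of `settings` with keep_blank_chars renamed.

    Hypothetical helper for illustration; not part of the library.
    """
    migrated = dict(settings)  # leave the caller's dict untouched
    if "keep_blank_chars" in migrated:
        old_value = migrated.pop("keep_blank_chars")
        # Do not overwrite a value already given under the new name.
        migrated.setdefault("text_keep_blank_chars", old_value)
    return migrated


old_settings = {"vertical_strategy": "lines", "keep_blank_chars": True}
new_settings = migrate_table_settings(old_settings)
```

The migrated dictionary can then be passed wherever the old one was used, for instance as the table settings given to Page.extract_table(...).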
Added
- Add Page.extract_table[s](...) support for all Page.extract_text(...) keyword arguments. (c4e1b29)
- Add height and width keyword arguments to Page.to_image(...). (#798 + 93f7dbd)
- Add layout_width, layout_width_chars, layout_height, and layout_height_chars parameters to Page/utils.extract_text(layout=True). (d3662de)
- Add CITATION.cff. (#755) [h/t @joaoccruz]
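The kind of padding that extract_text(layout=True) applies, filling line ends with spaces and the end of the text with blank lines, can be pictured with the sketch below. It is illustrative only and is not the library's implementation: pad_layout and its width_chars and height_chars arguments are hypothetical names, and representing blank lines as runs of spaces is an assumption.

```python
def pad_layout(lines, width_chars, height_chars):
    """Pad a list of text lines out to a fixed character grid.

    Illustrative only; not the library's implementation.
    """
    # Right-pad short lines with spaces; longer lines are left unchanged.
    padded = [line.ljust(width_chars) for line in lines]
    # Append blank (all-space) lines until the grid is tall enough.
    missing = max(0, height_chars - len(padded))
    padded.extend([" " * width_chars] * missing)
    return "\n".join(padded)


text = pad_layout(["Name  Qty", "Bolt  4"], width_chars=12, height_chars=3)
```

Here every line of text ends up 12 characters wide and the result has 3 lines, so downstream code can treat the output as a fixed-size character grid.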
Fixed
- Fix simple edge-case for when page rotation is (incorrectly) set to None. (#811) [h/t @toshi1127]
Development Changes
- Convert utils.py into utils/ submodules. Retains the same interface; just an improvement in organization. (6351d97)
- Fix typing hints to include io.BytesIO. (d4107f6) [h/t @conitrade-as]
- Refactor text-extraction utilities, paving the way for better consistency across the various entrypoints to text extraction (e.g., via utils.extract_text(...), via Page.extract_text(...), via Page.extract_table(...)). (3424b57)