Enhancements:
- add sync methods (feature)
- add corrupt PDF searchable text detection with automatic OCR fallback (feature)
- add metadata extraction using Pandoc (feature)
- add multi-sheet spreadsheet (Excel etc.) extraction (feature)
- add `language`, `psm` and `pax_processes` keyword arguments (enhancement; api)
- gated `typing-extensions` to Python 3.10 and below (enhancement; dependencies)
- added multi-loop compatibility by switching from `asyncio` to using `anyio` (enhancement; compatibility)
- added managed worker processes for Pandoc and Tesseract using `anyio.to_process` (enhancement; performance)
- replaced `xlsx2csv` with `python-calamine` and improved implementation to extract all sheets in a workbook (enhancement; performance)
Breaking Changes:
- updated `ExtractionResult` to include `metadata` (breaking change; api)
- changed `force_ocr` to a kwarg (breaking change; api)
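
The two breaking changes above affect call sites roughly as sketched below. This is an illustration under stated assumptions, not the library's actual code: `extract_text` is a hypothetical stand-in for the extraction functions, the `content` and `mime_type` fields and the `ocr_forced` metadata key are invented for the example, and "kwarg" is read here as keyword-only.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExtractionResult:
    """Illustrative result type; only `metadata` is named in the changelog."""

    content: str  # assumed field
    mime_type: str  # assumed field
    metadata: dict[str, Any] = field(default_factory=dict)


def extract_text(
    data: bytes, mime_type: str, *, force_ocr: bool = False
) -> ExtractionResult:
    """Hypothetical extractor: the bare `*` makes `force_ocr` keyword-only."""
    text = data.decode("utf-8", errors="replace")
    metadata = {"ocr_forced": force_ocr}  # placeholder metadata key
    return ExtractionResult(content=text, mime_type=mime_type, metadata=metadata)


# Callers pass `force_ocr` by name and read `metadata` from the result.
result = extract_text(b"hello", "text/plain", force_ocr=True)
print(result.metadata)
```

Under this reading, a call that passes `force_ocr` positionally, such as `extract_text(b"hello", "text/plain", True)`, raises `TypeError` and needs updating.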
Internal:
- split the _extractors namespace into smaller packages and reorganized source code
- add matrix tests against all supported Python versions (internal; testing)
- refined ruff rules to enhance linting strictness
- increase test coverage to >=99%