Enhancements:
- added support for multiple OCR backends, with new PaddleOCR and EasyOCR backends (feature)
- added support for having no OCR backend (feature)
- changed Tesseract OCR to optional (enhancement)
- added support for creating and registering custom extractors (feature)
- added support for overriding builtin extractors (feature)
- added support for post-processing hooks (feature)
- added support for validation hooks (feature)
- added PDF metadata extraction using Playa-PDF (feature)
- added optional chunking support (feature)
- added documentation site (documentation)
Breaking Changes:
- changed ExtractionResults from NamedTuple to TypedDict (breaking change; api)
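
A minimal sketch of what the move from NamedTuple to TypedDict generally means for calling code. The classes `OldResult` and `NewResult` and the fields `content` and `mime_type` are hypothetical stand-ins chosen for illustration, not the library's actual definitions:

```python
from typing import NamedTuple, TypedDict


# Hypothetical stand-ins; the field names are assumptions for illustration only.
class OldResult(NamedTuple):
    content: str
    mime_type: str


class NewResult(TypedDict):
    content: str
    mime_type: str


old = OldResult(content="hello", mime_type="text/plain")
new: NewResult = {"content": "hello", "mime_type": "text/plain"}

# NamedTuple: attribute access and positional unpacking both work.
old_content = old.content
old_text, old_mime = old

# TypedDict: a plain dict at runtime, so fields are read by key.
new_content = new["content"]

# Unpacking a dict yields its keys, not its values,
# so code that unpacked the old tuple needs to be updated.
first, second = new
```

Code that relied on attribute access or tuple unpacking would likely need to switch to key access after upgrading.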
Internal:
- reworked internals into a class-based architecture to allow extensibility (internal; architecture)