v4.6.0 — Recursive Archives, DocumentStructure, Bug Fixes
Added
- Recursive archive extraction: Archives (ZIP, TAR, 7Z, GZIP) now recursively extract all processable files, each with its own ExtractionResult. New ArchiveEntry type and max_archive_depth config.
- YAML/JSON section chunker: New ChunkerType::Yaml with full hierarchy paths and auto-inference from metadata.
- Unified DocumentStructure: Extended with 7 new node types, 4 annotation kinds, and an attributes bag. All 35 extractors produce native DocumentStructure.
- Document-level OCR: process_document() for whole-file extraction, up to 30% faster on multi-page documents.
- DocBook/JATS inline annotations: Semantic formatting mapped to AnnotationKind variants.
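To illustrate what depth-limited recursive extraction means, here is a minimal Python sketch using only the standard library zipfile module. It is not this library's API: the function names extract_recursive and make_zip are hypothetical, only ZIP is handled, and the depth semantics (max_depth counts how many archive levels are opened) are an assumption about how a setting like max_archive_depth might behave.

```python
import io
import zipfile


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP from a name -> content mapping (helper for the demo)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()


def extract_recursive(data: bytes, max_depth: int, prefix: str = "", depth: int = 0) -> list[str]:
    """List entry paths, descending into nested ZIPs while depth allows.

    Assumption: max_depth is the number of archive levels that may be opened.
    A nested archive beyond that limit is reported as an opaque entry.
    """
    paths = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            entry_path = prefix + info.filename
            payload = zf.read(info)
            is_nested = zipfile.is_zipfile(io.BytesIO(payload))
            if is_nested and depth + 1 < max_depth:
                # Open the nested archive and prefix its entries with the parent path.
                paths.extend(extract_recursive(payload, max_depth, entry_path + "/", depth + 1))
            else:
                paths.append(entry_path)
    return paths


inner = make_zip({"notes.txt": b"hello"})
outer = make_zip({"readme.md": b"# hi", "inner.zip": inner})

shallow = extract_recursive(outer, max_depth=1)  # only the outer archive is opened
deep = extract_recursive(outer, max_depth=2)     # the nested archive is opened too
```

A depth cap of this kind bounds the work done on deeply nested or maliciously crafted archives, which is the usual reason for exposing it as configuration.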
Changed
- CSV extraction: Produces "Row N: Header: Value" format for better embedding quality.
- XML extraction: Indented hierarchical output preserving element tree.
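As a rough illustration of the row-oriented CSV output described above, the following Python sketch pairs each value with its column header. The function name render_rows, the 1-based row numbering, and the comma separator between pairs are assumptions made for this example; the library's exact output may differ.

```python
import csv
import io


def render_rows(csv_text: str) -> list[str]:
    """Render each data row as 'Row N: Header: Value, ...' (illustrative format only)."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader, [])  # first row is treated as the header row
    lines = []
    for n, row in enumerate(reader, start=1):
        pairs = ", ".join(f"{header}: {value}" for header, value in zip(headers, row))
        lines.append(f"Row {n}: {pairs}")
    return lines


sample = "name,age\nAda,36\nLinus,54\n"
rendered = render_rows(sample)
for line in rendered:
    print(line)
```

Keeping the header next to every value means each row remains self-describing when it is chunked and embedded on its own, which is the motivation the note gives for the change.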
Improved
- Zero-copy file I/O: memmap2 + simdutf8 SIMD UTF-8 validation for large files.
- Unified concurrency management: Centralized thread budget with configurable ConcurrencyConfig.
Fixed
- #557: Auto-enable extract_pages for element-based output, giving correct page numbers without manual PageConfig.
- #558: Fixed misleading PageConfig docstring defaults.
- #560: MSG extraction now supports compressed RTF bodies (PR_RTF_COMPRESSED).
- #561: Indexed colour PDF images now decode correctly with palette lookup.
- ODT extraction robustness improvements.
See CHANGELOG.md for full details.