kreuzberg-dev/kreuzberg v4.6.0 on GitHub

v4.6.0 — Recursive Archives, DocumentStructure, Bug Fixes

Recursive archive extraction: Archives (ZIP, TAR, 7Z, GZIP) now recursively extract all processable files, each with its own ExtractionResult. New ArchiveEntry type and max_archive_depth config.
YAML/JSON section chunker: New ChunkerType::Yaml with full hierarchy paths and auto-inference from metadata.
Unified DocumentStructure: Extended with 7 new node types, 4 annotation kinds, and an attributes bag. All 35 extractors now produce native DocumentStructure.
Document-level OCR: process_document() for whole-file extraction — up to 30% faster on multi-page documents.
DocBook/JATS inline annotations: Semantic formatting mapped to AnnotationKind variants.
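
To illustrate what a depth limit means for recursive archive extraction, here is a minimal sketch using only Python's standard library and ZIP files. It is not kreuzberg's implementation or API: the Entry class, the walk_zip function, and the "!/" path separator are hypothetical names chosen for this example, and the exact semantics of max_archive_depth in kreuzberg may differ.

```python
import io
import zipfile
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for one extracted file (not kreuzberg's ArchiveEntry)."""
    path: str   # path inside the archive, nested archives joined with "!/"
    depth: int  # 0 for files in the top-level archive
    size: int   # uncompressed size in bytes


def walk_zip(data: bytes, max_depth: int, prefix: str = "", depth: int = 0) -> list[Entry]:
    """Return one Entry per regular file, descending into nested ZIPs up to max_depth levels."""
    entries: list[Entry] = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            payload = zf.read(info)
            path = prefix + info.filename
            nested = zipfile.is_zipfile(io.BytesIO(payload))
            if nested and depth < max_depth:
                # Recurse: every file in the inner archive gets its own Entry.
                entries.extend(walk_zip(payload, max_depth, path + "!/", depth + 1))
            else:
                # Either a plain file, or a nested archive beyond the depth limit,
                # which is reported as a single opaque entry.
                entries.append(Entry(path, depth, len(payload)))
    return entries
```

In this sketch a limit of 0 lists nested archives as ordinary files, while a limit of 1 opens one level of nesting. Bounding the depth is also a simple guard against archives that nest pathologically deep.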

CSV extraction: Output now uses a "Row N: Header: Value" format for better embedding quality.
XML extraction: Indented hierarchical output preserving element tree.
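
The CSV change can be pictured with a small standard-library sketch. The notes only name the format as Row N: Header: Value, so the comma separator between fields and the one-line-per-row layout below are assumptions, and csv_to_rows is a hypothetical helper rather than a kreuzberg function.

```python
import csv
import io


def csv_to_rows(text: str) -> str:
    """Render each CSV record as 'Row N: Header: Value, Header: Value'."""
    reader = csv.DictReader(io.StringIO(text))
    lines = []
    for n, row in enumerate(reader, start=1):
        # Pair every value with its column header so each line is self-describing.
        fields = ", ".join(f"{header}: {value}" for header, value in row.items())
        lines.append(f"Row {n}: {fields}")
    return "\n".join(lines)
```

Repeating the header next to every value keeps each row meaningful on its own, which is likely why it helps when rows are chunked and embedded separately.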

Zero-copy file I/O: memmap2 + simdutf8 SIMD UTF-8 validation for large files.
Unified concurrency management: Centralized thread budget with configurable ConcurrencyConfig.
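
As a rough analogue of the memory-mapped read path, the sketch below maps a file with Python's mmap module and validates it as UTF-8 while decoding. It is only an illustration under stated assumptions: kreuzberg does this in Rust with memmap2 and simdutf8, whereas this version has no SIMD validation and is not zero-copy end to end, because Python still builds a str from the mapped bytes. The function name read_utf8_mapped is hypothetical.

```python
import mmap
import os


def read_utf8_mapped(path: str) -> str:
    """Memory-map a file and decode it as UTF-8, raising UnicodeDecodeError if invalid."""
    with open(path, "rb") as f:
        # An empty file cannot be memory-mapped, so handle it up front.
        if os.fstat(f.fileno()).st_size == 0:
            return ""
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Decoding straight from the mapping avoids first reading the
            # whole file into an intermediate bytes object.
            return str(mm, "utf-8")
```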

#557: Auto-enable extract_pages for element-based output — correct page numbers without manual PageConfig.
#558: Fixed misleading PageConfig docstring defaults.
#560: MSG extraction now supports compressed RTF bodies (PR_RTF_COMPRESSED).
#561: Indexed colour PDF images now decode correctly with palette lookup.
ODT extraction robustness improvements.
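
For readers unfamiliar with the palette lookup mentioned in #561, the following sketch shows the general idea for 8-bit indexed images: each sample is an index into a colour table, and decoding replaces it with the table entry. This is not the code from the fix. decode_indexed is a hypothetical helper, it handles only 8 bits per index, and clamping out-of-range indices to the last entry reflects my reading of how PDF Indexed colour spaces treat them.

```python
def decode_indexed(indices: bytes, palette: bytes, components: int = 3) -> bytes:
    """Expand 8-bit palette indices into interleaved colour samples (RGB by default)."""
    if components <= 0 or len(palette) == 0 or len(palette) % components != 0:
        raise ValueError("palette must be a non-empty multiple of the component count")
    last = len(palette) // components - 1
    out = bytearray()
    for i in indices:
        i = min(i, last)  # assumption: clamp indices beyond the palette
        out += palette[i * components:(i + 1) * components]
    return bytes(out)
```

Treating the raw indices as grey or RGB samples without this lookup produces the wrong colours, which is the class of decoding bug the fix addresses.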

See CHANGELOG.md for full details.