github kreuzberg-dev/kreuzberg v4.6.0

9 hours ago

v4.6.0 — Recursive Archives, DocumentStructure, Bug Fixes

Added

  • Recursive archive extraction: Archives (ZIP, TAR, 7Z, GZIP) now recursively extract all processable files, each with its own ExtractionResult. New ArchiveEntry type and max_archive_depth config.
  • YAML/JSON section chunker: New ChunkerType::Yaml with full hierarchy paths and auto-inference from metadata.
  • Unified DocumentStructure: Extended with 7 new node types, 4 annotation kinds, and an attributes bag. All 35 extractors now produce native DocumentStructure.
  • Document-level OCR: process_document() for whole-file extraction — up to 30% faster on multi-page documents.
  • DocBook/JATS inline annotations: Semantic formatting mapped to AnnotationKind variants.
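
To illustrate the idea behind recursive archive extraction with a depth limit, here is a minimal Python sketch using only the standard library. It is not kreuzberg's implementation or API: walk_archive and make_zip are hypothetical helpers, only ZIP is handled, and the meaning given to max_depth here (the number of archive levels that get opened) is an assumption, not the documented semantics of the max_archive_depth option.

```python
import io
import zipfile


def make_zip(files: dict) -> bytes:
    """Build an in-memory ZIP from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()


def walk_archive(data: bytes, max_depth: int, depth: int = 0, prefix: str = ""):
    """Yield (path, payload) for each file, descending into nested ZIPs.

    Assumption: max_depth counts how many archive levels may be opened.
    Nested archives beyond that limit are yielded as opaque files.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            payload = zf.read(info)
            is_nested = zipfile.is_zipfile(io.BytesIO(payload))
            if is_nested and depth + 1 < max_depth:
                # Recurse, prefixing inner paths with the nested archive's name.
                yield from walk_archive(payload, max_depth, depth + 1, path + "/")
            else:
                yield path, payload


inner = make_zip({"b.txt": b"inner"})
outer = make_zip({"a.txt": b"outer", "inner.zip": inner})
for path, payload in walk_archive(outer, max_depth=2):
    print(path, len(payload))
```

A depth limit of this kind bounds the work done on deeply nested or maliciously crafted archives. Note that zipfile.is_zipfile also returns True for ZIP-based formats such as DOCX, so a real extractor would need a stricter check before recursing.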

Changed

  • CSV extraction: Produces Row N: Header: Value format for better embedding quality.
  • XML extraction: Indented hierarchical output preserving element tree.
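
As a rough picture of what a Row N: Header: Value rendering could look like, the sketch below converts CSV text with Python's standard csv module. The function name csv_to_labeled_rows and the "; " separator between fields are assumptions made for illustration; the release notes do not specify the exact layout kreuzberg emits.

```python
import csv
import io


def csv_to_labeled_rows(text: str) -> str:
    """Render each CSV record as 'Row N: Header: Value; Header: Value'."""
    reader = csv.DictReader(io.StringIO(text))
    lines = []
    for n, row in enumerate(reader, start=1):
        # Pair every value with its column header so the row is self-describing.
        fields = "; ".join(f"{header}: {value}" for header, value in row.items())
        lines.append(f"Row {n}: {fields}")
    return "\n".join(lines)


print(csv_to_labeled_rows("name,age\nAda,36\nLinus,54\n"))
# Row 1: name: Ada; age: 36
# Row 2: name: Linus; age: 54
```

Repeating the header next to each value keeps every row meaningful on its own, which is the usual reason this style helps when rows are chunked and embedded separately.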

Improved

  • Zero-copy file I/O: memmap2 + simdutf8 SIMD UTF-8 validation for large files.
  • Unified concurrency management: Centralized thread budget with configurable ConcurrencyConfig.

Fixed

  • #557: Auto-enable extract_pages for element-based output — correct page numbers without manual PageConfig.
  • #558: Fixed misleading PageConfig docstring defaults.
  • #560: MSG extraction now supports compressed RTF bodies (PR_RTF_COMPRESSED).
  • #561: Indexed colour PDF images now decode correctly with palette lookup.
  • ODT extraction robustness improvements.
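
For context on #561: an indexed-colour image stores one palette index per pixel, and decoding means replacing each index with the colour stored at that position in the lookup table. The sketch below shows that lookup in plain Python, assuming 8-bit indices and an RGB palette of 3 bytes per entry. decode_indexed is a hypothetical helper, not the code used in kreuzberg.

```python
def decode_indexed(indices: bytes, palette: bytes) -> bytes:
    """Expand 8-bit palette indices into packed RGB bytes.

    Assumes an RGB base colour space, so each palette entry is 3 bytes.
    """
    if len(palette) % 3 != 0:
        raise ValueError("palette length must be a multiple of 3")
    out = bytearray()
    for index in indices:
        base = index * 3
        if base + 3 > len(palette):
            raise ValueError(f"index {index} is outside the palette")
        out += palette[base:base + 3]
    return bytes(out)


# Three-entry palette: black, red, blue.
palette = bytes([0, 0, 0, 255, 0, 0, 0, 0, 255])
pixels = decode_indexed(bytes([1, 2, 0]), palette)
print(list(pixels))  # [255, 0, 0, 0, 0, 255, 0, 0, 0]
```

Images with 1, 2, or 4 bits per index would need the indices unpacked from each byte before this lookup.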

See CHANGELOG.md for full details.
