github kreuzberg-dev/kreuzberg v4.5.2

6 hours ago

Fixed

  • PDF word splitting in extracted text: Pdfium's text extraction inserted spurious spaces mid-word (e.g. "s hall a b e active" instead of "shall be active"). Added selective page-level respacing: pages with detected broken word spacing are re-extracted using character-level gap analysis (font_size × 0.33 threshold). Clean pages use the fast single-call path. Reduces garbled lines from 406 to 0 on the ISO 21111-10 test document with no performance impact.
  • Markdown underscore escaping: Underscores in extracted text (e.g. CTC_ARP_01) were incorrectly escaped as CTC\_ARP\_01 throughout the markdown output. Underscore escaping has been removed entirely since extracted PDF text contains literal identifiers, not markdown formatting.
  • Page header/footer leakage: Running headers like ISO 21111-10:2021(E) and copyright footers leaked into the document body. Added fuzzy alphanumeric matching to detect repeated header/footer text even when spacing or character extraction varies across pages.
  • R batch function spurious NULL argument: R wrapper batch functions passed an extra NULL positional argument to native Rust functions, causing "unused argument" errors on all batch operations.
  • Elixir Windows ORT DLL staging: The ONNX Runtime DLL was staged only in target/release/, not in priv/native/, where the BEAM VM loads NIFs. OCR/layout/embedding features now work correctly on Windows CI.
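
The character-level gap analysis mentioned in the PDF word-splitting fix can be illustrated with a small sketch. This is not kreuzberg's Rust implementation: the Char record, its fields, and join_chars are hypothetical names chosen for illustration, and the sketch assumes glyphs arrive in reading order on a single line. Only the font_size × 0.33 threshold is taken from the notes above.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """One extracted glyph (hypothetical record, not a kreuzberg type)."""
    text: str
    x_start: float    # left edge of the glyph box
    x_end: float      # right edge of the glyph box
    font_size: float


GAP_RATIO = 0.33  # threshold factor quoted in the release notes


def join_chars(chars: list[Char]) -> str:
    """Rebuild one line of text from glyph positions.

    A space is emitted only where the horizontal gap between two
    consecutive glyphs exceeds font_size * GAP_RATIO. Space characters
    reported by the extractor are ignored, so spurious ones disappear.
    """
    out: list[str] = []
    prev = None
    for ch in chars:
        if ch.text.isspace():
            continue  # let the measured gaps decide, not the extractor
        if prev is not None:
            gap = ch.x_start - prev.x_end
            if gap > prev.font_size * GAP_RATIO:
                out.append(" ")
        out.append(ch.text)
        prev = ch
    return "".join(out)
```

Fed the glyphs of a line where the extractor reported spaces mid-word, a function like this keeps letters with small gaps together and breaks only on word-sized gaps. The selective part of the fix, deciding which pages need this slower path, is not shown here.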

Added

  • General extraction result caching: Results for all file types (PDF, Office, HTML, archives, etc.) are now cached, not only OCR results. Repeated extractions of the same file with the same config return instantly from cache.
  • Cache namespace isolation: New cache_namespace field on ExtractionConfig enables multi-tenant cache isolation on shared filesystems. Available via --cache-namespace CLI flag and across all language bindings.
  • Per-request cache TTL: New cache_ttl_secs field on ExtractionConfig overrides the global TTL for individual extractions. Set to 0 to skip cache entirely. Available via --cache-ttl-secs CLI flag.
  • Cache namespace deletion: delete_namespace() removes all cache entries under a namespace. get_stats_filtered() returns per-namespace statistics.
  • Multi-worker cleanup safety: Cache cleanup no longer triggers excessively when multiple worker pods share the same cache directory.
  • Bundled eng.traineddata: English OCR works out of the box with zero runtime configuration (~4MB bundled at build time).
  • Tessdata in cache warm: kreuzberg-cli cache warm now downloads all tessdata_fast language files (~120 languages) to KREUZBERG_CACHE_DIR/tessdata/, giving full Tesseract language support without system packages.
  • Tessdata in cache manifest: kreuzberg-cli cache manifest now includes all tessdata files with source URLs, enabling --sync-cache to download tessdata alongside models.
  • KREUZBERG_CACHE_DIR/tessdata resolution: resolve_tessdata_path() now checks KREUZBERG_CACHE_DIR/tessdata and the bundled build path before falling back to system paths.
  • CLI embed command: Generate vector embeddings from text via kreuzberg embed --text "..." --preset balanced.
  • CLI chunk command: Split text into chunks via kreuzberg chunk --text "..." --chunk-size 512.
  • CLI completions command: Generate shell completions for bash, zsh, fish, powershell.
  • CLI --log-level global flag: Override RUST_LOG via kreuzberg --log-level debug extract doc.pdf.
  • CLI extraction overrides: 27 flags exposed via ExtractionOverrides struct with #[command(flatten)].
  • CLI colored output: Text output uses anstyle for colored headers, labels, success values, and dim separators. Respects NO_COLOR env var.
  • API POST /detect, GET /version, GET /cache/manifest, POST /cache/warm: New REST endpoints.
  • MCP get_version, cache_manifest, cache_warm, embed_text, chunk_text: New MCP tools.
  • Pipeline table extraction tracing: Zero-cost tracing::trace! and tracing::debug! logging throughout layout detection and table extraction.
  • TATR model availability check: Layout detection returns an error if table regions are detected but the TATR model is unavailable.
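
The caching entries above (general result caching, cache_namespace, cache_ttl_secs, delete_namespace()) interact in a way that a small model may make clearer. The following is a minimal in-memory sketch under stated assumptions, not kreuzberg's on-disk cache: SketchCache, get_or_extract, and the key derivation (SHA-256 over file bytes plus serialized config) are hypothetical. Only the field names cache_namespace and cache_ttl_secs, the meaning of a TTL of 0, and the delete_namespace() name come from the notes.

```python
import hashlib
import json
import time
from typing import Callable, Optional


class SketchCache:
    """In-memory stand-in for an extraction cache (hypothetical)."""

    def __init__(self, default_ttl_secs: int = 3600) -> None:
        self.default_ttl_secs = default_ttl_secs
        # (namespace, key) -> (time stored, extraction result)
        self._entries: dict[tuple[str, str], tuple[float, str]] = {}

    @staticmethod
    def _key(content: bytes, config: dict) -> str:
        # Assumption: same bytes + same config => same cache key.
        digest = hashlib.sha256()
        digest.update(content)
        digest.update(json.dumps(config, sort_keys=True).encode())
        return digest.hexdigest()

    def get_or_extract(
        self,
        content: bytes,
        config: dict,
        extract: Callable[[bytes], str],
        cache_namespace: str = "default",
        cache_ttl_secs: Optional[int] = None,
    ) -> str:
        ttl = self.default_ttl_secs if cache_ttl_secs is None else cache_ttl_secs
        if ttl == 0:
            return extract(content)  # TTL 0: skip the cache entirely
        slot = (cache_namespace, self._key(content, config))
        now = time.monotonic()
        hit = self._entries.get(slot)
        if hit is not None and now - hit[0] < ttl:
            return hit[1]  # fresh entry in this namespace
        result = extract(content)
        self._entries[slot] = (now, result)
        return result

    def delete_namespace(self, cache_namespace: str) -> int:
        """Drop every entry in one namespace; return how many were removed."""
        doomed = [slot for slot in self._entries if slot[0] == cache_namespace]
        for slot in doomed:
            del self._entries[slot]
        return len(doomed)
```

In this model, two tenants extracting the same file with the same config never see each other's entries because the namespace is part of the lookup slot, and a per-request TTL only changes how a single call treats the cache.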

Changed

  • CLI batch flags: Batch command now supports all extraction override flags via shared ExtractionOverrides struct.
  • CLI config architecture: Replaced 13-parameter function with ExtractionOverrides struct using #[command(flatten)].
  • MCP tool architecture: Removed dead tools/ trait-based duplicates; all tools implemented directly in server.rs.

Improved

  • CLI validation: OCR backend values, chunk size/overlap bounds, DPI range, and layout confidence are validated.
  • API validation: Embedding preset names and chunk bounds checked.
  • MCP validation: Empty paths rejected, chunk bounds checked, embedding preset validated.
  • Chunk overlap auto-clamping: When --chunk-size is smaller than the default overlap, the overlap is automatically clamped to size/4.
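
The overlap auto-clamping rule can be sketched as below. This is an illustration rather than the CLI's actual code: DEFAULT_OVERLAP = 200, integer division, and the function names are assumptions, and the sketch also clamps when the overlap equals the chunk size so that the sliding window always advances.

```python
DEFAULT_OVERLAP = 200  # assumed default; the real value is not stated here


def effective_overlap(chunk_size: int, overlap: int = DEFAULT_OVERLAP) -> int:
    """Return the overlap to use for a given chunk size.

    If the chunk size does not exceed the overlap, fall back to
    chunk_size // 4 (the size/4 rule from the release notes).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if chunk_size <= overlap:
        return chunk_size // 4
    return overlap


def chunk_text(text: str, chunk_size: int, overlap: int = DEFAULT_OVERLAP) -> list[str]:
    """Split text into character windows that share `overlap` characters."""
    step = chunk_size - effective_overlap(chunk_size, overlap)
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With a chunk size of 8 and the assumed default overlap, the overlap becomes 2, so each window starts 6 characters after the previous one instead of failing or looping on a non-positive step.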

See full changelog: https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md
