kreuzberg-dev/kreuzberg v4.5.2 on GitHub

Fixed

PDF word splitting in extracted text: Pdfium's text extraction inserted spurious spaces mid-word (e.g. "s hall a b e active" instead of "shall be active"). Added selective page-level respacing: pages with detected broken word spacing are re-extracted using character-level gap analysis (font_size × 0.33 threshold). Clean pages use the fast single-call path. Reduces garbled lines from 406 to 0 on the ISO 21111-10 test document with no performance impact.
Markdown underscore escaping: Underscores in extracted text (e.g. CTC_ARP_01) were incorrectly escaped as CTC\_ARP\_01 throughout the markdown output. Underscore escaping has been removed entirely since extracted PDF text contains literal identifiers, not markdown formatting.
Page header/footer leakage: Running headers like ISO 21111-10:2021(E) and copyright footers leaked into the document body. Added fuzzy alphanumeric matching to detect repeated header/footer text even when spacing or character extraction varies across pages.
R batch function spurious NULL argument: R wrapper batch functions passed an extra NULL positional argument to native Rust functions, causing "unused argument" errors on all batch operations.
Elixir Windows ORT DLL staging: The ONNX Runtime DLL was staged only in target/release/, not in priv/native/, where the BEAM VM loads NIFs. OCR/layout/embedding features now work correctly on Windows CI.
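
The character-level gap analysis mentioned in the word-splitting fix can be illustrated with a short sketch. This is not Kreuzberg's implementation: the `Glyph` record, the `join_glyphs` name, and the single-line, left-to-right layout are assumptions made for illustration. Only the font_size × 0.33 threshold is taken from the notes above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Optional

GAP_RATIO = 0.33  # threshold factor quoted in the release notes


@dataclass
class Glyph:
    """Hypothetical per-character record: text plus horizontal extent."""

    char: str
    x_start: float
    x_end: float
    font_size: float


def join_glyphs(glyphs: List[Glyph]) -> str:
    """Rebuild one line of text, inserting a space only at wide gaps."""
    out: List[str] = []
    prev: Optional[Glyph] = None
    for g in glyphs:
        if g.char.isspace():
            # Assumption: extractor-supplied spaces are untrusted,
            # so only the measured gaps decide where words break.
            continue
        if prev is not None and g.x_start - prev.x_end > GAP_RATIO * g.font_size:
            out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)
```

With a 10 pt font, a gap wider than 3.3 units becomes a word break under this rule, while narrower gaps (kerning, letter spacing) are joined.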

Added

General extraction result caching: Extraction results for all file types (PDF, Office, HTML, archives, etc.) are now cached, not just OCR results. Repeated extractions of the same file with the same config return instantly from cache.
Cache namespace isolation: New cache_namespace field on ExtractionConfig enables multi-tenant cache isolation on shared filesystems. Available via --cache-namespace CLI flag and across all language bindings.
Per-request cache TTL: New cache_ttl_secs field on ExtractionConfig overrides the global TTL for individual extractions. Set to 0 to skip cache entirely. Available via --cache-ttl-secs CLI flag.
Cache namespace deletion: delete_namespace() removes all cache entries under a namespace. get_stats_filtered() returns per-namespace statistics.
Multi-worker cleanup safety: Cache cleanup no longer triggers excessively when multiple worker pods share the same cache directory.
Bundled eng.traineddata: English OCR works out of the box with zero runtime configuration (~4MB bundled at build time).
Tessdata in cache warm: kreuzberg-cli cache warm now downloads all tessdata_fast language files (~120 languages) to KREUZBERG_CACHE_DIR/tessdata/, giving full Tesseract language support without system packages.
Tessdata in cache manifest: kreuzberg-cli cache manifest now includes all tessdata files with source URLs, enabling --sync-cache to download tessdata alongside models.
KREUZBERG_CACHE_DIR/tessdata resolution: resolve_tessdata_path() now checks KREUZBERG_CACHE_DIR/tessdata and the bundled build path before falling back to system paths.
CLI embed command: Generate vector embeddings from text via kreuzberg embed --text "..." --preset balanced.
CLI chunk command: Split text into chunks via kreuzberg chunk --text "..." --chunk-size 512.
CLI completions command: Generate shell completions for bash, zsh, fish, powershell.
CLI --log-level global flag: Override RUST_LOG via kreuzberg --log-level debug extract doc.pdf.
CLI extraction overrides: 27 flags exposed via ExtractionOverrides struct with #[command(flatten)].
CLI colored output: Text output uses anstyle for colored headers, labels, success values, and dim separators. Respects NO_COLOR env var.
API POST /detect, GET /version, GET /cache/manifest, POST /cache/warm: New REST endpoints.
MCP get_version, cache_manifest, cache_warm, embed_text, chunk_text: New MCP tools.
Pipeline table extraction tracing: Zero-cost tracing::trace! and tracing::debug! logging throughout layout detection and table extraction.
TATR model availability check: Layout detection returns an error if table regions are detected but the TATR model is unavailable.
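
The cache namespace and per-request TTL semantics described above can be sketched with a toy in-memory cache. Everything here is illustrative: the `NamespacedCache` class, its method signatures, and the injectable clock are assumptions, and Kreuzberg's actual cache is file-based. The sketch only mirrors the stated behaviour: entries are isolated per namespace, a per-request TTL overrides the global one, a TTL of 0 skips the cache, and a namespace can be deleted as a unit.

```python
from __future__ import annotations

import time
from typing import Callable, Dict, Optional, Tuple


class NamespacedCache:
    """Toy in-memory cache keyed by (namespace, key) with TTL expiry."""

    def __init__(
        self,
        global_ttl_secs: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.global_ttl_secs = global_ttl_secs
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def put(self, namespace: str, key: str, value: str) -> None:
        self._entries[(namespace, key)] = (self._clock(), value)

    def get(
        self,
        namespace: str,
        key: str,
        cache_ttl_secs: Optional[int] = None,
    ) -> Optional[str]:
        # A per-request TTL, when given, overrides the global TTL.
        ttl = self.global_ttl_secs if cache_ttl_secs is None else cache_ttl_secs
        if ttl == 0:
            return None  # a TTL of 0 skips the cache entirely
        entry = self._entries.get((namespace, key))
        if entry is None:
            return None
        stored_at, value = entry
        return value if self._clock() - stored_at <= ttl else None

    def delete_namespace(self, namespace: str) -> int:
        """Remove every entry under one namespace; return the count removed."""
        doomed = [k for k in self._entries if k[0] == namespace]
        for k in doomed:
            del self._entries[k]
        return len(doomed)
```

Two tenants that cache the same key under different namespaces never see each other's entries, and deleting one namespace leaves the other untouched.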

Changed

CLI batch flags: Batch command now supports all extraction override flags via shared ExtractionOverrides struct.
CLI config architecture: Replaced 13-parameter function with ExtractionOverrides struct using #[command(flatten)].
MCP tool architecture: Removed dead tools/ trait-based duplicates; all tools implemented directly in server.rs.

Improved

CLI validation: OCR backend values, chunk size/overlap bounds, DPI range, and layout confidence are now validated.
API validation: Embedding preset names and chunk bounds checked.
MCP validation: Empty paths rejected, chunk bounds checked, embedding preset validated.
Chunk overlap auto-clamping: When --chunk-size is smaller than the default overlap, the overlap is automatically clamped to size/4.
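
The overlap auto-clamping rule amounts to a one-line adjustment. The sketch below is an assumption-laden illustration, not the CLI's code: the `clamp_overlap` name is hypothetical, the default overlap is passed in because the notes do not state its value, and integer division is assumed for size/4.

```python
def clamp_overlap(chunk_size: int, default_overlap: int) -> int:
    """Return the overlap to use for a given chunk size.

    If the chunk size is smaller than the default overlap, fall back
    to a quarter of the chunk size (integer division is an assumption).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if chunk_size < default_overlap:
        return chunk_size // 4
    return default_overlap
```

For example, with a hypothetical default overlap of 200, a chunk size of 100 yields an overlap of 25, while a chunk size of 512 keeps the default.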

See full changelog: https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md