Fixed
- PDF word splitting in extracted text: Pdfium's text extraction inserted spurious spaces mid-word (e.g. `"s hall a b e active"` instead of `"shall be active"`). Added selective page-level respacing: pages with detected broken word spacing are re-extracted using character-level gap analysis (`font_size × 0.33` threshold). Clean pages use the fast single-call path. Reduces garbled lines from 406 to 0 on the ISO 21111-10 test document with no performance impact.
- Markdown underscore escaping: Underscores in extracted text (e.g. `CTC_ARP_01`) were incorrectly escaped as `CTC\_ARP\_01` throughout the markdown output. Underscore escaping has been removed entirely since extracted PDF text contains literal identifiers, not markdown formatting.
- Page header/footer leakage: Running headers like `ISO 21111-10:2021(E)` and copyright footers leaked into the document body. Added fuzzy alphanumeric matching to detect repeated header/footer text even when spacing or character extraction varies across pages.
- R batch function spurious NULL argument: R wrapper batch functions passed an extra `NULL` positional argument to native Rust functions, causing "unused argument" errors on all batch operations.
- Elixir Windows ORT DLL staging: ONNX Runtime DLL was only staged in `target/release/` but not in `priv/native/` where the BEAM VM loads NIFs. OCR/layout/embedding features now work correctly on Windows CI.
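
The gap-analysis respacing described in the first fix can be illustrated with a minimal sketch. It is not the actual Kreuzberg implementation: the `Glyph` struct, the `respace_line` function, and the choice to measure the threshold against the previous glyph's font size are all assumptions made for illustration. Only the `font_size × 0.33` threshold comes from the notes above.

```rust
/// Hypothetical glyph record (an assumption for this sketch): one character
/// with its horizontal extent on the page and its font size.
#[derive(Debug, Clone, Copy)]
pub struct Glyph {
    pub ch: char,
    pub left: f32,
    pub right: f32,
    pub font_size: f32,
}

/// Rebuild one line of text from glyph geometry. A space is emitted only
/// where the horizontal gap between neighbouring glyphs exceeds
/// `font_size * 0.33`; whitespace reported by the extractor is ignored.
pub fn respace_line(glyphs: &[Glyph]) -> String {
    const GAP_RATIO: f32 = 0.33;
    let mut out = String::with_capacity(glyphs.len());
    let mut prev: Option<Glyph> = None;

    for g in glyphs {
        // Skip extractor-provided spaces; spacing is re-derived from gaps.
        if g.ch.is_whitespace() {
            continue;
        }
        if let Some(p) = prev {
            let gap = g.left - p.right;
            // Assumption: the threshold uses the previous glyph's font size.
            if gap > p.font_size * GAP_RATIO {
                out.push(' ');
            }
        }
        out.push(g.ch);
        prev = Some(*g);
    }
    out
}
```

With 10 pt glyphs, a 0.5 pt gap between letters stays below the 3.3 pt threshold and produces no space, while a 4.5 pt gap between words does.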
Added
- General extraction result caching: All file types (PDF, Office, HTML, archives, etc.) are now cached — not just OCR results. Repeated extractions of the same file with the same config return instantly from cache.
- Cache namespace isolation: New `cache_namespace` field on `ExtractionConfig` enables multi-tenant cache isolation on shared filesystems. Available via `--cache-namespace` CLI flag and across all language bindings.
- Per-request cache TTL: New `cache_ttl_secs` field on `ExtractionConfig` overrides the global TTL for individual extractions. Set to `0` to skip cache entirely. Available via `--cache-ttl-secs` CLI flag.
- Cache namespace deletion: `delete_namespace()` removes all cache entries under a namespace. `get_stats_filtered()` returns per-namespace statistics.
- Multi-worker cleanup safety: Cache cleanup no longer triggers excessively when multiple worker pods share the same cache directory.
- Bundled eng.traineddata: English OCR works out of the box with zero runtime configuration (~4MB bundled at build time).
- Tessdata in `cache warm`: `kreuzberg-cli cache warm` now downloads all tessdata_fast language files (~120 languages) to `KREUZBERG_CACHE_DIR/tessdata/`, giving full Tesseract language support without system packages.
- Tessdata in `cache manifest`: `kreuzberg-cli cache manifest` now includes all tessdata files with source URLs, enabling `--sync-cache` to download tessdata alongside models.
- `KREUZBERG_CACHE_DIR/tessdata` resolution: `resolve_tessdata_path()` now checks `KREUZBERG_CACHE_DIR/tessdata` and the bundled build path before falling back to system paths.
- CLI `embed` command: Generate vector embeddings from text via `kreuzberg embed --text "..." --preset balanced`.
- CLI `chunk` command: Split text into chunks via `kreuzberg chunk --text "..." --chunk-size 512`.
- CLI `completions` command: Generate shell completions for bash, zsh, fish, powershell.
- CLI `--log-level` global flag: Override `RUST_LOG` via `kreuzberg --log-level debug extract doc.pdf`.
- CLI extraction overrides: 27 flags exposed via `ExtractionOverrides` struct with `#[command(flatten)]`.
- CLI colored output: Text output uses `anstyle` for colored headers, labels, success values, and dim separators. Respects `NO_COLOR` env var.
- API `POST /detect`, `GET /version`, `GET /cache/manifest`, `POST /cache/warm`: New REST endpoints.
- MCP `get_version`, `cache_manifest`, `cache_warm`, `embed_text`, `chunk_text`: New MCP tools.
- Pipeline table extraction tracing: Zero-cost `tracing::trace!` and `tracing::debug!` logging throughout layout detection and table extraction.
- TATR model availability check: Layout detection returns an error if table regions are detected but the TATR model is unavailable.
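
To make the cache semantics above concrete, here is a rough in-memory sketch of namespace isolation, per-request TTL override, and namespace deletion. Everything in it is hypothetical: `SketchCache`, `CacheKey`, the digest strings, and the method signatures are illustrative assumptions, not Kreuzberg's API, and the real cache is filesystem-backed. Treating a TTL of `0` as "neither read nor write" is one plausible reading of "skip cache entirely".

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cache key: the namespace is part of the key, so tenants
/// never see each other's entries even for identical files and configs.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CacheKey {
    namespace: String,
    file_digest: String,
    config_digest: String,
}

/// Illustrative in-memory stand-in for the extraction cache.
pub struct SketchCache {
    global_ttl_secs: u64,
    entries: HashMap<CacheKey, (Instant, String)>,
}

impl SketchCache {
    pub fn new(global_ttl_secs: u64) -> Self {
        Self { global_ttl_secs, entries: HashMap::new() }
    }

    fn key(ns: &str, file: &str, cfg: &str) -> CacheKey {
        CacheKey {
            namespace: ns.to_string(),
            file_digest: file.to_string(),
            config_digest: cfg.to_string(),
        }
    }

    /// Store a result unless the effective TTL is 0 (cache skipped).
    pub fn put(&mut self, ns: &str, file: &str, cfg: &str,
               ttl_override: Option<u64>, value: String, now: Instant) {
        if ttl_override.unwrap_or(self.global_ttl_secs) == 0 {
            return;
        }
        self.entries.insert(Self::key(ns, file, cfg), (now, value));
    }

    /// Look up a result; the per-request TTL wins over the global TTL.
    pub fn get(&self, ns: &str, file: &str, cfg: &str,
               ttl_override: Option<u64>, now: Instant) -> Option<&str> {
        let ttl = ttl_override.unwrap_or(self.global_ttl_secs);
        if ttl == 0 {
            return None;
        }
        let max_age = Duration::from_secs(ttl);
        self.entries
            .get(&Self::key(ns, file, cfg))
            .filter(|(stored, _)| now.saturating_duration_since(*stored) < max_age)
            .map(|(_, value)| value.as_str())
    }

    /// Remove every entry under a namespace; returns how many were removed.
    pub fn delete_namespace(&mut self, ns: &str) -> usize {
        let before = self.entries.len();
        self.entries.retain(|key, _| key.namespace != ns);
        before - self.entries.len()
    }
}
```

Passing `now` explicitly keeps the expiry logic deterministic, which makes the TTL behaviour easy to check without sleeping.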
Changed
- CLI batch flags: Batch command now supports all extraction override flags via shared `ExtractionOverrides` struct.
- CLI config architecture: Replaced 13-parameter function with `ExtractionOverrides` struct using `#[command(flatten)]`.
- MCP tool architecture: Removed dead `tools/` trait-based duplicates; all tools implemented directly in `server.rs`.
Improved
- CLI validation: OCR backend values, chunk size/overlap bounds, DPI range, layout confidence validated.
- API validation: Embedding preset names and chunk bounds checked.
- MCP validation: Empty paths rejected, chunk bounds checked, embedding preset validated.
- Chunk overlap auto-clamping: When `--chunk-size` is smaller than default overlap, overlap is automatically clamped to `size/4`.
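
The overlap clamping rule can be sketched as a small function. This is an illustration only: `DEFAULT_OVERLAP` is an assumed value (the notes do not state the real default), `effective_overlap` is a hypothetical name, and the sketch assumes that an explicitly requested overlap is passed through unchanged and checked by the separate bounds validation.

```rust
/// Assumed default overlap, for illustration only; the real default is not
/// stated in these notes.
const DEFAULT_OVERLAP: usize = 200;

/// Pick the overlap for a given chunk size. If no overlap was requested and
/// the chunk size is smaller than the default overlap, clamp to `size / 4`.
pub fn effective_overlap(chunk_size: usize, requested: Option<usize>) -> usize {
    match requested {
        // Assumption: explicit values are left alone and validated elsewhere.
        Some(overlap) => overlap,
        None if chunk_size < DEFAULT_OVERLAP => chunk_size / 4,
        None => DEFAULT_OVERLAP,
    }
}
```

Under these assumptions, a chunk size of 100 with no explicit overlap yields an overlap of 25, while a chunk size of 512 keeps the default.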
See full changelog: https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md