[4.7.1] - 2026-04-03
Added
- Tree-sitter grammar management CLI — New `kreuzberg tree-sitter` subcommand with `download`, `list`, `cache-dir`, and `clean` sub-commands for managing tree-sitter grammar parsers. Supports downloading by language name, group (`--groups web,systems,scripting`), or all (`--all`). Reads `[tree_sitter]` config from `kreuzberg.toml` with `--from-config`.
- Tree-sitter grammar management API — New REST endpoints: `POST /grammars/download`, `GET /grammars/list`, `GET /grammars/cache`, `DELETE /grammars/cache` for programmatic grammar management.
- Tree-sitter grammar management MCP tools — New MCP tools: `download_grammars`, `list_grammars`, `grammar_cache_info`, `clean_grammar_cache` for AI assistant-driven grammar management.
- Tree-sitter config startup initialization — API and MCP servers auto-download tree-sitter grammars on startup when `[tree_sitter]` config specifies `languages` or `groups`.
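As an illustration of the startup initialization entry, a `[tree_sitter]` table in `kreuzberg.toml` might look like the fragment below. The key names `languages` and `groups` come from the entry above; the array-of-strings value shape and the specific language identifiers are assumptions, so check the configuration reference before relying on them.

```toml
# kreuzberg.toml (sketch; value shapes are assumed, not confirmed by this changelog)
[tree_sitter]
languages = ["python", "rust"]   # assumed language identifiers
groups = ["web", "systems"]      # group names taken from the --groups example
```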
Changed
- Normalized OCR+layout pipeline — Tesseract+layout path now follows the same architecture as pdfium+layout: hOCR → PdfParagraph → `apply_layout_overrides` → `assemble_internal_document` → comrak.
- Elixir NIF crash protection — All extraction and batch NIFs now wrapped with `catch_unwind` to prevent native C library panics from crashing the BEAM VM. Panics return `{:error, reason}` with error-level tracing and backtraces.
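The crash-protection entry above relies on Rust's `std::panic::catch_unwind`. The following is a minimal sketch of that general pattern, not the project's actual NIF code: `extract` and `guarded` are hypothetical names, and the `Result<T, String>` return type only stands in for the `{:error, reason}` tuple. Note that `catch_unwind` intercepts unwinding Rust panics only; it does not catch aborts or segmentation faults.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for an extraction routine that may panic.
fn extract(input: &str) -> String {
    if input.is_empty() {
        panic!("empty input");
    }
    input.to_uppercase()
}

/// Runs `f` and converts an unwinding panic into `Err(reason)`.
fn guarded<T>(f: impl FnOnce() -> T) -> Result<T, String> {
    panic::catch_unwind(AssertUnwindSafe(f)).map_err(|payload| {
        // Panic payloads are usually `&'static str` or `String`.
        if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}

fn main() {
    // The default panic hook still prints the panic message to stderr.
    println!("{:?}", guarded(|| extract("ok")));
    println!("{:?}", guarded(|| extract("")));
}
```

Wrapping the closure in `AssertUnwindSafe` is a deliberate shortcut for the sketch; real code should confirm that no shared state is left inconsistent after a caught panic.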
Fixed
- hOCR parser depth tracking — Fixed paragraph boundary detection using tag-name-specific depth tracking.
- hOCR multi-page content loss — Removed per-page filter that silently dropped content on pages 2+.
- OCR batch parallelization — Now uses `resolve_thread_budget()` instead of hardcoded 4 threads.
- Benchmark workflow — Removed reference to deleted `kreuzberg-extract` binary target.
- Ruby OCR backend — Added missing `ocr_internal_document` field.
- Keyword extraction tests — Updated assertions to use `extracted_keywords` field.
- PaddleOCR cache dir test — Fixed failure when `KREUZBERG_CACHE_DIR` env var is set.
- API `pdf_password` handler — Added `#[cfg(feature = "pdf")]` gate.
- Chunking page boundary regression (#636)
- HF Hub environment variables (#634)
- PDF bridge tracing panic on multibyte characters (#635)
- Go/Java FFI struct layout — Fixed missing fields causing offset shifts.
- PHP/Ruby/Node.js binding fixes — Various field and config parsing fixes.
- OCR InternalDocument propagation — Structured document now propagated through full pipeline.
- Italian/European PDF ligature corruption — Extended ligature repair for `tt`, `ti`, `tti`.
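To illustrate the OCR batch parallelization fix listed above, here is a rough sketch of deriving a worker count from the machine instead of a fixed constant. It is not the body of `resolve_thread_budget()`, whose signature and behaviour this changelog does not describe: `thread_budget` and `process_batch` are hypothetical names, capping at the detected parallelism is an assumed policy, and doubling each number is a placeholder for per-page OCR work.

```rust
use std::thread;

/// Hypothetical helper: honours an optional override, otherwise falls back
/// to the parallelism reported by the OS (at least 1).
fn thread_budget(requested: Option<usize>) -> usize {
    let detected = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    match requested {
        // Assumed policy: never exceed what the machine reports.
        Some(n) if n > 0 => n.min(detected),
        _ => detected,
    }
}

/// Splits `pages` into roughly equal chunks, one scoped thread per chunk.
/// `workers` must be at least 1. Output order matches input order.
fn process_batch(pages: &[u32], workers: usize) -> Vec<u32> {
    // Ceiling division; `.max(1)` guards against a zero chunk size.
    let chunk = ((pages.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = pages
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|p| p * 2).collect::<Vec<u32>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}

fn main() {
    let workers = thread_budget(None);
    println!("{:?}", process_batch(&[1, 2, 3, 4, 5], workers));
}
```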
See full changelog for details.