kreuzberg-dev/kreuzberg v4.7.1 on GitHub

[4.7.1] - 2026-04-03

Tree-sitter grammar management CLI — New kreuzberg tree-sitter subcommand with download, list, cache-dir, and clean sub-commands for managing tree-sitter grammar parsers. Supports downloading by language name, group (--groups web,systems,scripting), or all (--all). Reads [tree_sitter] config from kreuzberg.toml with --from-config.
Tree-sitter grammar management API — New REST endpoints: POST /grammars/download, GET /grammars/list, GET /grammars/cache, DELETE /grammars/cache for programmatic grammar management.
Tree-sitter grammar management MCP tools — New MCP tools: download_grammars, list_grammars, grammar_cache_info, clean_grammar_cache for AI assistant-driven grammar management.
Tree-sitter config startup initialization — API and MCP servers auto-download tree-sitter grammars on startup when [tree_sitter] config specifies languages or groups.

Normalized OCR+layout pipeline — Tesseract+layout path now follows the same architecture as pdfium+layout: hOCR → PdfParagraph → apply_layout_overrides → assemble_internal_document → comrak.
Elixir NIF crash protection — All extraction and batch NIFs now wrapped with catch_unwind to prevent native C library panics from crashing the BEAM VM. Panics return {:error, reason} with error-level tracing and backtraces.

hOCR parser depth tracking — Fixed paragraph boundary detection using tag-name-specific depth tracking.
hOCR multi-page content loss — Removed per-page filter that silently dropped content on pages 2+.
OCR batch parallelization — Now uses resolve_thread_budget() instead of hardcoded 4 threads.
Benchmark workflow — Removed reference to deleted kreuzberg-extract binary target.
Ruby OCR backend — Added missing ocr_internal_document field.
Keyword extraction tests — Updated assertions to use extracted_keywords field.
PaddleOCR cache dir test — Fixed failure when KREUZBERG_CACHE_DIR env var is set.
API pdf_password handler — Added #[cfg(feature = "pdf")] gate.
Chunking page boundary regression (#636)
HF Hub environment variables (#634)
PDF bridge tracing panic on multibyte characters (#635)
Go/Java FFI struct layout — Fixed missing fields causing offset shifts.
PHP/Ruby/Node.js binding fixes — Various field and config parsing fixes.
OCR InternalDocument propagation — Structured document now propagated through full pipeline.
Italian/European PDF ligature corruption — Extended ligature repair for tt, ti, tti.

See full changelog for details.