github kreuzberg-dev/kreuzberg v4.7.1
publish v4.7.1

9 hours ago

[4.7.1] - 2026-04-03

Added

  • Tree-sitter grammar management CLI — New kreuzberg tree-sitter subcommand with download, list, cache-dir, and clean sub-commands for managing tree-sitter grammar parsers. Supports downloading by language name, group (--groups web,systems,scripting), or all (--all). Reads [tree_sitter] config from kreuzberg.toml with --from-config.
  • Tree-sitter grammar management API — New REST endpoints: POST /grammars/download, GET /grammars/list, GET /grammars/cache, DELETE /grammars/cache for programmatic grammar management.
  • Tree-sitter grammar management MCP tools — New MCP tools: download_grammars, list_grammars, grammar_cache_info, clean_grammar_cache for AI assistant-driven grammar management.
  • Tree-sitter config startup initialization — API and MCP servers auto-download tree-sitter grammars on startup when [tree_sitter] config specifies languages or groups.

Changed

  • Normalized OCR+layout pipeline — Tesseract+layout path now follows the same architecture as pdfium+layout: hOCR → PdfParagraph → apply_layout_overridesassemble_internal_document → comrak.
  • Elixir NIF crash protection — All extraction and batch NIFs now wrapped with catch_unwind to prevent native C library panics from crashing the BEAM VM. Panics return {:error, reason} with error-level tracing and backtraces.

Fixed

  • hOCR parser depth tracking — Fixed paragraph boundary detection using tag-name-specific depth tracking.
  • hOCR multi-page content loss — Removed per-page filter that silently dropped content on pages 2+.
  • OCR batch parallelization — Now uses resolve_thread_budget() instead of hardcoded 4 threads.
  • Benchmark workflow — Removed reference to deleted kreuzberg-extract binary target.
  • Ruby OCR backend — Added missing ocr_internal_document field.
  • Keyword extraction tests — Updated assertions to use extracted_keywords field.
  • PaddleOCR cache dir test — Fixed failure when KREUZBERG_CACHE_DIR env var is set.
  • API pdf_password handler — Added #[cfg(feature = "pdf")] gate.
  • Chunking page boundary regression (#636)
  • HF Hub environment variables (#634)
  • PDF bridge tracing panic on multibyte characters (#635)
  • Go/Java FFI struct layout — Fixed missing fields causing offset shifts.
  • PHP/Ruby/Node.js binding fixes — Various field and config parsing fixes.
  • OCR InternalDocument propagation — Structured document now propagated through full pipeline.
  • Italian/European PDF ligature corruption — Extended ligature repair for tt, ti, tti.

See full changelog for details.

Don't miss a new kreuzberg release

NewReleases is sending notifications on new releases.