github kreuzberg-dev/kreuzberg v4.7.0

6 hours ago

Highlights

Code Intelligence — 248 Languages

New code_intelligence field on ExtractionResult via tree-sitter-language-pack. Extract functions, classes, imports, exports, symbols, docstrings, and diagnostics. Semantic code chunking that respects scope boundaries. Configure with CodeContentMode: chunks, raw, or structure.
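
To illustrate what "chunking that respects scope boundaries" means, here is a minimal sketch that uses Python's standard `ast` module as a stand-in for tree-sitter. It is not kreuzberg's implementation or API; the `chunk_by_scope` helper and the chunk dictionary layout are assumptions made for this example. It emits one chunk per top-level function or class, so a chunk never cuts through the middle of a definition.

```python
import ast


def chunk_by_scope(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in Python source."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                    "name": node.name,
                    "docstring": ast.get_docstring(node),
                    # Exact source text of the whole definition, body included.
                    "text": ast.get_source_segment(source, node),
                }
            )
    return chunks


SAMPLE = '''\
import os

def greet(name):
    """Return a greeting."""
    return f"hello {name}"

class Greeter:
    def run(self):
        return greet("world")
'''

chunks = chunk_by_scope(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

A fixed-size splitter could cut `Greeter` in half; splitting on syntax-tree nodes keeps each definition intact, which is the property the release note describes.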

Benchmark-Driven Extraction Quality

350+ test documents across 23 formats, scored with Structural F1 (SF1). LaTeX, XLSX, HTML, and RTF all reach 100% SF1. PDF table SF1: 15.5% → 53.7%. PDF overall SF1: 40.7% → 45.2% across 157 verified fixtures.
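
The release notes do not define how Structural F1 is computed. As a rough sketch of the general idea, the example below assumes the score is an ordinary F1 over the multiset of structural element labels (headings, tables, lists, and so on) found in the expected and extracted documents. The `structural_f1` function and the label names are hypothetical and may differ from the project's actual metric.

```python
from collections import Counter


def structural_f1(expected: list[str], actual: list[str]) -> float:
    """F1 over multisets of structural labels (assumed definition)."""
    exp, act = Counter(expected), Counter(actual)
    # Multiset intersection: each label counts at most as often as it
    # appears in both documents.
    matched = sum((exp & act).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(act.values())
    recall = matched / sum(exp.values())
    return 2 * precision * recall / (precision + recall)


expected = ["h1", "paragraph", "table", "table", "list"]
actual = ["h1", "paragraph", "table", "paragraph"]
score = structural_f1(expected, actual)
print(f"{score:.3f}")
```

Here three elements match, giving precision 3/4 and recall 3/5, so a missed table and a missed list lower the score even though every extracted heading is correct.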

Unified Architecture

All extractors produce a canonical InternalDocument. One rendering pipeline via comrak AST produces GFM markdown, semantic HTML5, djot, and plain text. Images extracted from 8 formats with recursive OCR. URIs classified from 20+ formats. Recursive extraction for email attachments, PDF portfolios, Office embedded objects.
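
The "one canonical document, many renderers" design can be sketched as follows. This is a toy model, not kreuzberg's `InternalDocument` or its comrak-based pipeline: the `Block` and `Document` classes and both renderer functions are hypothetical names chosen for illustration.

```python
import html
from dataclasses import dataclass, field

FENCE = "`" * 3  # markdown code fence, built here to keep this listing readable


@dataclass
class Block:
    kind: str            # "heading", "paragraph", or "code"
    text: str
    level: int = 1       # only meaningful for headings
    language: str = ""   # only meaningful for code blocks


@dataclass
class Document:
    blocks: list[Block] = field(default_factory=list)


def render_markdown(doc: Document) -> str:
    parts = []
    for b in doc.blocks:
        if b.kind == "heading":
            parts.append("#" * b.level + " " + b.text)
        elif b.kind == "code":
            parts.append(f"{FENCE}{b.language}\n{b.text}\n{FENCE}")
        else:
            parts.append(b.text)
    return "\n\n".join(parts)


def render_html(doc: Document) -> str:
    parts = []
    for b in doc.blocks:
        body = html.escape(b.text)
        if b.kind == "heading":
            parts.append(f"<h{b.level}>{body}</h{b.level}>")
        elif b.kind == "code":
            parts.append(f'<pre><code class="language-{b.language}">{body}</code></pre>')
        else:
            parts.append(f"<p>{body}</p>")
    return "\n".join(parts)


doc = Document(
    [
        Block("heading", "Title", level=2),
        Block("paragraph", "a < b"),
        Block("code", "print(1)", language="python"),
    ]
)
markdown_out = render_markdown(doc)
html_out = render_html(doc)
print(markdown_out)
print(html_out)
```

Because every extractor would target the same intermediate model, a fix to one renderer applies to all input formats at once.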

Output Quality

GFM-compliant markdown (fenced code blocks, clean escaping, validated with rumdl). Semantic HTML with class="language-X" on code blocks. Djot with shared text normalization (MD↔Djot TF1 = 1.0). HTML input passthrough via html-to-markdown. 36 cross-format parity tests.

New Features

  • code_intelligence field on ExtractionResult across all 11 bindings
  • CodeContentMode config and TreeSitterConfig for code extraction control
  • URI extraction (uris field) from 20+ formats with type classification
  • Semantic chunk labeling (chunk_type field)
  • JSON structured output format
  • TOON wire format (30-50% fewer tokens than JSON) across CLI, API, MCP, all bindings
  • Renderer registry for custom output format plugins
  • disable_ocr config option
  • Strict config validation (deny_unknown_fields)
  • OpenWebUI integration
  • Recursive extraction for email attachments, nested emails, DOCX/PPTX embedded objects
  • PDF bookmark/outline and embedded file extraction
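
For a feel of where the TOON savings listed above come from, the sketch below encodes a uniform array of flat objects in TOON's tabular style, which states the field names once instead of repeating them per row. It is a deliberately simplified encoder written for this example (no quoting, escaping, nesting, or boolean handling), not the encoder kreuzberg ships. Character count is used only as a rough proxy for token count.

```python
import json


def encode_table(key: str, rows: list[dict]) -> str:
    """Encode a non-empty list of flat dicts that share the same keys."""
    fields = list(rows[0])
    # Header: key, row count, then the field names declared once.
    header = key + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    lines = [header]
    for row in rows:
        lines.append("  " + ",".join(str(row[f]) for f in fields))
    return "\n".join(lines)


rows = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
toon_text = encode_table("users", rows)
json_text = json.dumps({"users": rows}, separators=(",", ":"))

print(toon_text)
print(len(toon_text), "characters vs", len(json_text), "for compact JSON")
```

The saving grows with the number of rows, since JSON repeats every key in every object while the tabular form does not.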

Breaking Changes

  • LayoutDetectionConfig.preset removed — layout detection is binary (enabled or not)
  • table_model changed from Option<String> to TableModel enum
  • metadata.additional HashMap replaced with typed FormatMetadata variants

Extraction Fixes

25+ format-specific fixes: LaTeX, RTF (bold bleeding, tables, lists, hyperlinks), RST (headings, code hints), HTML (preprocessing enabled), PPTX (slide titles, hyperlinks, embedded objects), XLSX/XLS (sheet headings), IPYNB (heading detection, outputs), ODT (MathML formulas), ORG (inline code, source blocks), DocBook (root wrapping), FictionBook (poems, images), Apple iWork (tables, metadata), PDF (heading hierarchy, table discrimination, formula detection, ALL-CAPS headings).

Security

  • Tesseract C++ exception crash fix at FFI boundary
  • Stack overflow prevention in PDF bookmark parsing
  • DoS caps on URI extraction
  • Recursion limits on image OCR pipeline
  • UB fix in Ruby FFI
  • Strict config validation
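
The recursion limits mentioned above guard against inputs that nest content arbitrarily deep, such as an email attached to an email attached to an email. A minimal sketch of a depth cap follows; the `Container` type, the `extract_recursive` function, and the `MAX_DEPTH` value are all hypothetical and do not reflect kreuzberg's actual limits or code.

```python
from dataclasses import dataclass, field

MAX_DEPTH = 3  # hypothetical cap, chosen only for this example


@dataclass
class Container:
    name: str
    text: str
    children: list["Container"] = field(default_factory=list)


def extract_recursive(node: Container, depth: int = 0, max_depth: int = MAX_DEPTH) -> list[str]:
    """Collect text from a container and its nested children, up to max_depth."""
    if depth > max_depth:
        # Stop descending instead of trusting the input to terminate.
        return []
    texts = [node.text]
    for child in node.children:
        texts.extend(extract_recursive(child, depth + 1, max_depth))
    return texts


# Build a chain nested 10 levels deep, as a hostile file might.
root = Container("level-0", "text-0")
current = root
for i in range(1, 10):
    nested = Container(f"level-{i}", f"text-{i}")
    current.children.append(nested)
    current = nested

texts = extract_recursive(root)
print(texts)
```

Only levels 0 through 3 are visited, so the work done is bounded by the cap rather than by how deeply the input nests.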

Install

pip install kreuzberg
npm install @kreuzberg-dev/kreuzberg
cargo add kreuzberg

Full changelog: https://kreuzberg.dev/CHANGELOG
Documentation: https://kreuzberg.dev
