kreuzberg-dev/kreuzberg v4.7.0 on GitHub

Highlights

Code Intelligence — 248 Languages

New code_intelligence field on ExtractionResult, powered by tree-sitter-language-pack. Extracts functions, classes, imports, exports, symbols, docstrings, and diagnostics. Semantic code chunking respects scope boundaries. Configure with CodeContentMode: chunks, raw, or structure.
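
To illustrate what "chunking that respects scope boundaries" means, here is a minimal sketch that splits Python source into one chunk per top-level function or class. It uses Python's standard-library ast module as a stand-in for tree-sitter, so it only handles Python; the function name chunk_by_scope and the chunk dictionary layout are hypothetical and are not Kreuzberg's API or output schema.

```python
import ast


def chunk_by_scope(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class.

    Hypothetical helper for illustration only; not part of Kreuzberg.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue  # imports and loose statements are skipped in this sketch
        kind = "class" if isinstance(node, ast.ClassDef) else "function"
        # Start at the first decorator, if any, so the chunk is self-contained.
        start = min([node.lineno] + [d.lineno for d in node.decorator_list])
        chunks.append(
            {
                "kind": kind,
                "name": node.name,
                "start_line": start,
                "end_line": node.end_lineno,
                "docstring": ast.get_docstring(node),
                "text": "\n".join(lines[start - 1 : node.end_lineno]),
            }
        )
    return chunks


SAMPLE = '''import os

def greet(name):
    """Return a greeting."""
    return f"Hello, {name}"

class Counter:
    """Count things."""

    def __init__(self):
        self.n = 0
'''

for chunk in chunk_by_scope(SAMPLE):
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk ends where its definition ends, so a function body is never cut in half the way a fixed-size character window might cut it.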

Benchmark-Driven Extraction Quality

350+ test documents across 23 formats with Structural F1 scoring. LaTeX, XLSX, HTML, RTF all at 100% SF1. PDF table SF1: 15.5% → 53.7%. PDF overall SF1: 40.7% → 45.2% across 157 verified fixtures.
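
These notes do not define how Structural F1 is computed. As a rough sketch only, the example below assumes one plausible reading: an F1 score over the multiset of structural block types (heading, paragraph, table, and so on) found in the extracted output versus the expected output. The function name structural_f1 and this definition are assumptions, not the benchmark's actual metric.

```python
from collections import Counter


def structural_f1(predicted: list[str], expected: list[str]) -> float:
    """F1 over multisets of block-type labels (assumed definition, see lead-in)."""
    pred, gold = Counter(predicted), Counter(expected)
    if not pred and not gold:
        return 1.0  # nothing expected, nothing produced
    # Multiset intersection: each label counts as matched min(pred, gold) times.
    overlap = sum((pred & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


extracted = ["heading", "paragraph", "paragraph", "table"]
reference = ["heading", "paragraph", "table", "table"]
print(structural_f1(extracted, reference))
```

In this example three of the four labels match on each side, so precision and recall are both 0.75.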

Unified Architecture

All extractors produce a canonical InternalDocument. One rendering pipeline via comrak AST produces GFM markdown, semantic HTML5, djot, and plain text. Images extracted from 8 formats with recursive OCR. URIs classified from 20+ formats. Recursive extraction for email attachments, PDF portfolios, Office embedded objects.
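
The idea of one canonical document feeding several output formats can be sketched as follows. This is a simplified illustration, not the comrak-based pipeline itself: Block, Document, RENDERERS, register, and render are hypothetical names, and only two toy renderers are shown.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Block:
    kind: str  # "heading", "paragraph", or "code" in this sketch
    text: str
    level: int = 1
    language: str = ""


@dataclass
class Document:
    blocks: list[Block] = field(default_factory=list)


# Registry mapping an output format name to a rendering function.
RENDERERS: dict[str, Callable[[Document], str]] = {}


def register(name: str):
    """Decorator that adds a renderer to the registry under the given name."""
    def decorator(fn: Callable[[Document], str]):
        RENDERERS[name] = fn
        return fn
    return decorator


@register("markdown")
def render_markdown(doc: Document) -> str:
    fence = "`" * 3  # built at runtime to avoid nesting fences in this example
    parts = []
    for block in doc.blocks:
        if block.kind == "heading":
            parts.append("#" * block.level + " " + block.text)
        elif block.kind == "code":
            parts.append(f"{fence}{block.language}\n{block.text}\n{fence}")
        else:
            parts.append(block.text)
    return "\n\n".join(parts)


@register("plain")
def render_plain(doc: Document) -> str:
    return "\n\n".join(block.text for block in doc.blocks)


def render(doc: Document, fmt: str) -> str:
    if fmt not in RENDERERS:
        raise ValueError(f"unknown output format: {fmt!r}")
    return RENDERERS[fmt](doc)


doc = Document([Block("heading", "Title", level=2), Block("paragraph", "Body")])
print(render(doc, "markdown"))
print(render(doc, "plain"))
```

Because every extractor would target the same Document shape, adding an output format means registering one more function rather than touching each extractor.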

Output Quality

GFM-compliant markdown (fenced code blocks, clean escaping, validated with rumdl). Semantic HTML with class="language-X" on code blocks. Djot with shared text normalization (MD↔Djot TF1 = 1.0). HTML input passthrough via html-to-markdown. 36 cross-format parity tests.

New Features

code_intelligence field on ExtractionResult across all 11 bindings
CodeContentMode config and TreeSitterConfig for code extraction control
URI extraction (uris field) from 20+ formats with type classification
Semantic chunk labeling (chunk_type field)
JSON structured output format
TOON wire format (30-50% fewer tokens than JSON) across CLI, API, MCP, all bindings
Renderer registry for custom output format plugins
disable_ocr config option
Strict config validation (deny_unknown_fields)
OpenWebUI integration
Recursive extraction for email attachments, nested emails, DOCX/PPTX embedded objects
PDF bookmark/outline and embedded file extraction
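
To show why a TOON-style encoding can be more compact than JSON, here is a minimal sketch of the tabular form TOON uses for uniform arrays of objects: field names are declared once in a header, then each row lists only values. The helper encode_table is hypothetical and deliberately incomplete (flat rows only, no quoting or escaping, no nesting); it is not Kreuzberg's encoder, and it compares character counts rather than tokens, since token counts depend on the tokenizer.

```python
import json


def encode_table(key: str, rows: list[dict]) -> str:
    """Encode a non-empty list of flat, same-shaped dicts in a TOON-like table.

    Simplified sketch: values must not contain commas, quotes, or newlines.
    """
    if not rows:
        raise ValueError("this sketch only handles non-empty row lists")
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError("rows must share the same keys in the same order")
    # Header declares the row count and the field names exactly once.
    header = key + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    body = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header] + body)


rows = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
toon_like = encode_table("users", rows)
as_json = json.dumps({"users": rows})

print(toon_like)
print(len(toon_like), "characters versus", len(as_json), "for JSON")
```

The saving comes from not repeating keys, braces, and quotes on every row, so it grows with the number of rows that share a shape.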

Breaking Changes

LayoutDetectionConfig.preset removed: layout detection is now binary (enabled or disabled)
table_model changed from Option<String> to TableModel enum
metadata.additional HashMap replaced with typed FormatMetadata variants

Extraction Fixes

25+ format-specific fixes: LaTeX, RTF (bold bleeding, tables, lists, hyperlinks), RST (headings, code hints), HTML (preprocessing enabled), PPTX (slide titles, hyperlinks, embedded objects), XLSX/XLS (sheet headings), IPYNB (heading detection, outputs), ODT (MathML formulas), ORG (inline code, source blocks), DocBook (root wrapping), FictionBook (poems, images), Apple iWork (tables, metadata), PDF (heading hierarchy, table discrimination, formula detection, ALL-CAPS headings).

Security

Tesseract C++ exception crash fix at FFI boundary
Stack overflow prevention in PDF bookmark parsing
DoS caps on URI extraction
Recursion limits on image OCR pipeline
UB fix in Ruby FFI
Strict config validation
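
Several of the fixes above come down to bounding recursion over untrusted nested input. The sketch below shows one common way to do that, assuming a bookmark outline represented as nested dictionaries: walk the tree with an explicit stack instead of recursive calls, and reject input nested deeper than a fixed cap. The names collect_titles and DepthLimitExceeded, the dictionary layout, and the default limit of 64 are all hypothetical; this is not the actual fix in the codebase.

```python
class DepthLimitExceeded(Exception):
    """Raised when the outline is nested deeper than the allowed limit."""


def collect_titles(root: dict, max_depth: int = 64) -> list[str]:
    """Return bookmark titles in document order without using recursion."""
    titles = []
    stack = [(root, 0)]  # explicit stack of (node, depth) pairs
    while stack:
        node, depth = stack.pop()
        if depth > max_depth:
            raise DepthLimitExceeded(f"outline nesting exceeds {max_depth} levels")
        titles.append(node["title"])
        # Push children in reverse so they are popped in document order.
        for child in reversed(node.get("children", [])):
            stack.append((child, depth + 1))
    return titles


outline = {
    "title": "A",
    "children": [
        {"title": "B", "children": [{"title": "C"}]},
        {"title": "D"},
    ],
}
print(collect_titles(outline))

# A maliciously deep outline is rejected instead of exhausting the call stack.
deep = {"title": "leaf"}
for i in range(1000):
    deep = {"title": str(i), "children": [deep]}
try:
    collect_titles(deep)
except DepthLimitExceeded as exc:
    print("rejected:", exc)
```

The explicit stack keeps memory use on the heap, and the depth cap turns pathological input into an ordinary error the caller can handle.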

Install

pip install kreuzberg
npm install @kreuzberg-dev/kreuzberg
cargo add kreuzberg

Full changelog: https://kreuzberg.dev/CHANGELOG
Documentation: https://kreuzberg.dev