Highlights
Code Intelligence — 248 Languages
New code_intelligence field on ExtractionResult via tree-sitter-language-pack. Extract functions, classes, imports, exports, symbols, docstrings, and diagnostics. Semantic code chunking that respects scope boundaries. Configure with CodeContentMode: chunks, raw, or structure.
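To make "chunking that respects scope boundaries" concrete, here is a minimal sketch of the idea. It does not use the Kreuzberg API or tree-sitter: it swaps in Python's standard `ast` module and handles only Python source, and the function name `chunk_by_scope` and the chunk dictionary layout are hypothetical.

```python
import ast


def chunk_by_scope(source: str) -> list[dict]:
    """Split Python source into chunks aligned with top-level definitions.

    Illustrative stand-in for scope-aware chunking; not the library's
    implementation, which is built on tree-sitter.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            # lineno and end_lineno are 1-based and inclusive, so the slice
            # covers the whole definition and never cuts a scope in half.
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(
                {
                    "kind": kind,
                    "name": node.name,
                    "text": text,
                    "docstring": ast.get_docstring(node),
                }
            )
    return chunks


SAMPLE = '''import os

def greet(name):
    """Return a greeting."""
    return "hi " + name

class Box:
    def size(self):
        return 1
'''

chunks = chunk_by_scope(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

Each chunk carries a whole function or class, so a downstream consumer never sees a body separated from its signature.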
Benchmark-Driven Extraction Quality
350+ test documents across 23 formats with Structural F1 scoring. LaTeX, XLSX, HTML, RTF all at 100% SF1. PDF table SF1: 15.5% → 53.7%. PDF overall SF1: 40.7% → 45.2% across 157 verified fixtures.
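These notes do not define how Structural F1 is computed. As a rough illustration only, the sketch below assumes SF1 is an ordinary F1 score over the multiset of structural block labels (heading, table, paragraph, and so on) in the expected and extracted documents. The real benchmark may weight or align blocks differently, and `structural_f1` is a hypothetical name.

```python
from collections import Counter


def structural_f1(expected: list[str], extracted: list[str]) -> float:
    """F1 over multisets of block labels (assumed definition, see lead-in)."""
    exp, got = Counter(expected), Counter(extracted)
    # Multiset intersection: each label counts as matched min(exp, got) times.
    overlap = sum((exp & got).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(got.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)


expected = ["heading", "paragraph", "table", "table"]
extracted = ["heading", "paragraph", "table", "paragraph"]
score = structural_f1(expected, extracted)
print(f"{score:.2f}")  # 0.75: three of four blocks match on each side
```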
Unified Architecture
All extractors produce a canonical InternalDocument. One rendering pipeline via comrak AST produces GFM markdown, semantic HTML5, djot, and plain text. Images extracted from 8 formats with recursive OCR. URIs classified from 20+ formats. Recursive extraction for email attachments, PDF portfolios, Office embedded objects.
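The shape of this design, one canonical document model feeding several renderers through a registry, can be sketched in a few lines. Everything below is a hypothetical toy: `Block`, `Document`, `RENDERERS`, and `render` are illustrative names, not the library's InternalDocument or its comrak-based pipeline.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Callable


@dataclass
class Block:
    kind: str  # "heading" or "paragraph" in this toy model
    text: str
    level: int = 1  # only meaningful for headings


@dataclass
class Document:
    blocks: list[Block] = field(default_factory=list)


def render_markdown(doc: Document) -> str:
    parts = []
    for b in doc.blocks:
        parts.append("#" * b.level + " " + b.text if b.kind == "heading" else b.text)
    return "\n\n".join(parts)


def render_html(doc: Document) -> str:
    parts = []
    for b in doc.blocks:
        if b.kind == "heading":
            parts.append(f"<h{b.level}>{escape(b.text)}</h{b.level}>")
        else:
            parts.append(f"<p>{escape(b.text)}</p>")
    return "\n".join(parts)


# Registry mapping a format name to a renderer; custom formats plug in here.
RENDERERS: dict[str, Callable[[Document], str]] = {
    "markdown": render_markdown,
    "html": render_html,
}


def render(doc: Document, fmt: str) -> str:
    return RENDERERS[fmt](doc)


doc = Document([Block("heading", "Report", 2), Block("paragraph", "a < b")])

# Registering an extra output format without touching any extractor.
RENDERERS["plain"] = lambda d: "\n".join(b.text for b in d.blocks)

print(render(doc, "markdown"))
print(render(doc, "html"))
print(render(doc, "plain"))
```

Because every extractor targets the same model, adding an output format means adding one renderer rather than touching each format's extractor.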
Output Quality
GFM-compliant markdown (fenced code blocks, clean escaping, validated with rumdl). Semantic HTML with class="language-X" on code blocks. Djot with shared text normalization (MD↔Djot TF1 = 1.0). HTML input passthrough via html-to-markdown. 36 cross-format parity tests.
New Features
- code_intelligence field on ExtractionResult across all 11 bindings
- CodeContentMode config and TreeSitterConfig for code extraction control
- URI extraction (uris field) from 20+ formats with type classification
- Semantic chunk labeling (chunk_type field)
- JSON structured output format
- TOON wire format (30-50% fewer tokens than JSON) across CLI, API, MCP, all bindings
- Renderer registry for custom output format plugins
- disable_ocr config option
- Strict config validation (deny_unknown_fields)
- OpenWebUI integration
- Recursive extraction for email attachments, nested emails, DOCX/PPTX embedded objects
- PDF bookmark/outline and embedded file extraction
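Strict config validation means a misspelled option now fails loudly instead of being silently ignored. The sketch below mimics that behavior in plain Python as an illustration only. `ExtractionConfig` and `load_config` are hypothetical, `output_format` is an assumed field, and only `disable_ocr` comes from the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class ExtractionConfig:
    # Hypothetical subset of options, for illustration only.
    disable_ocr: bool = False
    output_format: str = "markdown"  # assumed field name


def load_config(raw: dict) -> ExtractionConfig:
    """Build a config, rejecting any key that is not a declared field."""
    allowed = {f.name for f in fields(ExtractionConfig)}
    unknown = sorted(set(raw) - allowed)
    if unknown:
        raise ValueError(f"unknown config field(s): {', '.join(unknown)}")
    return ExtractionConfig(**raw)


print(load_config({"disable_ocr": True}))

try:
    load_config({"disable_ocrr": True})  # typo in the key
except ValueError as err:
    print(err)
```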
Breaking Changes
- LayoutDetectionConfig.preset removed; layout detection is binary (enabled or not)
- table_model changed from Option<String> to TableModel enum
- metadata.additional HashMap replaced with typed FormatMetadata variants
Extraction Fixes
25+ format-specific fixes: LaTeX, RTF (bold bleeding, tables, lists, hyperlinks), RST (headings, code hints), HTML (preprocessing enabled), PPTX (slide titles, hyperlinks, embedded objects), XLSX/XLS (sheet headings), IPYNB (heading detection, outputs), ODT (MathML formulas), ORG (inline code, source blocks), DocBook (root wrapping), FictionBook (poems, images), Apple iWork (tables, metadata), PDF (heading hierarchy, table discrimination, formula detection, ALL-CAPS headings).
Security
- Tesseract C++ exception crash fix at FFI boundary
- Stack overflow prevention in PDF bookmark parsing
- DoS caps on URI extraction
- Recursion limits on image OCR pipeline
- UB fix in Ruby FFI
- Strict config validation
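The recursion limits and DoS caps above follow a common pattern: walk nested content with an explicit stack, and stop once a depth or item budget is exceeded. The following is a generic sketch of that pattern, not the library's code. `Item`, `LimitExceeded`, `walk`, and the default limits are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    children: list["Item"] = field(default_factory=list)


class LimitExceeded(Exception):
    """Raised when nested content exceeds a configured budget."""


def walk(root: Item, max_depth: int = 3, max_items: int = 100) -> list[str]:
    """Visit nested items in document order without using call-stack recursion."""
    names: list[str] = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if depth > max_depth:
            raise LimitExceeded(f"nesting deeper than {max_depth}")
        if len(names) >= max_items:
            raise LimitExceeded(f"more than {max_items} items")
        names.append(node.name)
        # Push children in reverse so the LIFO stack yields document order.
        for child in reversed(node.children):
            stack.append((child, depth + 1))
    return names


tree = Item("mail.eml", [Item("a.pdf", [Item("img.png")]), Item("b.docx")])
print(walk(tree))
```

Using an explicit stack keeps hostile, deeply nested input from exhausting the call stack, and the two budgets bound total work.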
Install
pip install kreuzberg
npm install @kreuzberg-dev/kreuzberg
cargo add kreuzberg
Full changelog: https://kreuzberg.dev/CHANGELOG
Documentation: https://kreuzberg.dev