kreuzberg-dev/kreuzberg v4.3.7 on GitHub

Added

NFC Unicode normalization applied to all extraction outputs, ensuring consistent representation of composed characters across all backends (gated behind the quality feature)
Configurable PDF page margin fractions (top_margin_fraction, bottom_margin_fraction) in PdfConfig
PDF annotation extraction with new PdfAnnotation type supporting Text, Highlight, Link, Stamp, Underline, StrikeOut, and Other annotation types
extract_annotations configuration option in PdfConfig
annotations field on ExtractionResult across all language bindings (Rust, Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, WASM)
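
To illustrate what the NFC normalization entry above means in practice, here is a minimal sketch using Python's standard library rather than kreuzberg itself. The helper name normalize_nfc is hypothetical; the release does not say how the normalization is exposed beyond the quality feature gate.

```python
import unicodedata


def normalize_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# The same visible word, encoded two different ways.
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "caf\u00e9"     # precomposed U+00E9

# Without normalization the two strings compare unequal and differ in length.
assert decomposed != composed
assert len(decomposed) == 5 and len(composed) == 4

# After NFC normalization both collapse to the composed form.
assert normalize_nfc(decomposed) == composed
assert normalize_nfc(composed) == composed
```

Normalizing at extraction time means downstream string comparison, deduplication, and search see one representation regardless of which backend produced the text.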

Fixed

PDF markdown extraction quality at parity with docling (91.0% avg F1 vs docling's 91.4% across 16 test PDFs, while being 10-50x faster): Replaced PdfiumParagraph::from_objects() with per-character text extraction using pdfium's PdfPageText::chars() API, which correctly handles font matrices, CMap lookups, and text positioning. Adaptive line-break detection uses measured Y-position changes rather than font-size-relative thresholds, fixing PDFs where pdfium reports incorrect unscaled font sizes.
PDF markdown extraction no longer drops all content on PDFs with broken font metrics: Added font-size filter fallback — when the MIN_FONT_SIZE filter (4pt) removes all text segments (e.g. PDFs where pdfium reports font_size=1 due to font matrix scaling), the filter is skipped and unfiltered segments are used instead.
PDF margin filter no longer drops all content on edge-case PDFs: Added margin filter fallback — when margin filtering removes all text segments (e.g. PDFs where pdfium reports baseline_y values outside expected margin bands), the filter is skipped for that page.
PDF ligature repair integrated into per-character extraction: Ligature corruption (fi→!, fl→#, ff→") is now repaired inline during character iteration rather than as a separate post-processing pass, improving both accuracy and performance.
PDF multi-column text extraction improved: Federal Register-style multi-column PDFs went from 69.9% to 90.7% F1 by using pdfium's text API, which naturally handles reading order.
PDF table detection now requires ≥3 aligned columns, eliminating false positives from two-column text layouts (academic papers, newsletters)
PDF table post-processing rejects tables with ≤2 columns, >50% long cells, or average cell length >50 chars
PDF markdown rendering no longer drops content when pdfium returns zero-value baseline coordinates (fixes missing titles/authors in some LaTeX-generated PDFs)
PaddleOCR backend validation now dynamically checks the plugin registry instead of relying on a hardcoded check, preventing false "backend not registered" errors when the plugin is available (#403)
WASM bindings now export detectMimeFromBytes and getExtensionsForMime MIME utility functions
Node.js NAPI-RS binding correctly exposes annotations field on ExtractionResult
Python output format validation tests updated to reflect json as a valid format (alias for structured)
XLSX extraction with output_format="markdown" now produces markdown tables instead of plain text (#405)
MCP tools with no parameters (cache_stats, cache_clear) now emit valid inputSchema with {"type": "object", "properties": {}} instead of {"const": null}, fixing Claude Code and other MCP clients that validate schema type (#406)
Python get_valid_ocr_backends() now unconditionally includes paddleocr in the returned list, matching all other language bindings
TypeScript E2E test generator now maps extract_annotations to extractAnnotations in mapPdfConfig(), fixing annotation assertion failures
PHP PdfConfig now includes extractAnnotations, topMarginFraction, and bottomMarginFraction fields, restoring parity with the Rust core config
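
The font-size and margin filter fallbacks listed among the fixes above share one pattern: apply the filter, and if it removes every segment, treat the metrics as unreliable and keep the unfiltered input. The sketch below shows that pattern for the font-size case. It is an illustration only, not the code in the Rust core: the Segment type and the function name are hypothetical, and whether the real comparison is inclusive at 4pt is an assumption.

```python
from dataclasses import dataclass
from typing import List

# 4pt threshold taken from the release note; the actual constant lives in the Rust core.
MIN_FONT_SIZE = 4.0


@dataclass
class Segment:
    """Hypothetical stand-in for an extracted text segment."""
    text: str
    font_size: float


def filter_by_font_size(
    segments: List[Segment], min_size: float = MIN_FONT_SIZE
) -> List[Segment]:
    """Drop segments below min_size, unless that would drop everything."""
    # Assumption: segments exactly at the threshold are kept.
    kept = [s for s in segments if s.font_size >= min_size]
    # Fallback: an empty result suggests broken font metrics
    # (e.g. font_size reported as 1), so return the unfiltered segments.
    return kept if kept else list(segments)


# Normal page: the tiny segment is filtered out.
normal = [Segment("Title", 18.0), Segment("watermark", 2.0)]
assert [s.text for s in filter_by_font_size(normal)] == ["Title"]

# Broken metrics: every segment reports size 1, so nothing is dropped.
broken = [Segment("Title", 1.0), Segment("Body text", 1.0)]
assert [s.text for s in filter_by_font_size(broken)] == ["Title", "Body text"]
```

The margin filter fallback follows the same shape, applied per page with a margin-band predicate in place of the font-size check.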