github kreuzberg-dev/kreuzberg v4.3.7

6 hours ago

Added

  • NFC unicode normalization applied to all extraction outputs, ensuring consistent representation of composed characters across all backends (gated behind quality feature)
  • Configurable PDF page margin fractions (top_margin_fraction, bottom_margin_fraction) in PdfConfig
  • PDF annotation extraction with new PdfAnnotation type supporting Text, Highlight, Link, Stamp, Underline, StrikeOut, and Other annotation types
  • extract_annotations configuration option in PdfConfig
  • annotations field on ExtractionResult across all language bindings (Rust, Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, WASM)
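
For readers unfamiliar with NFC, the effect of the normalization step can be illustrated with Python's standard library alone. This is a minimal sketch of what NFC does to a string, not kreuzberg's implementation; the helper name `normalize_nfc` is hypothetical.

```python
import unicodedata


def normalize_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition).

    Hypothetical helper for illustration; not part of kreuzberg's API.
    """
    return unicodedata.normalize("NFC", text)


# The same visible word, encoded two different ways.
decomposed = "Cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "Caf\u00e9"     # single precomposed code point U+00E9

# Before normalization the two strings compare unequal.
assert decomposed != composed

# After NFC both collapse to the composed form.
assert normalize_nfc(decomposed) == composed
assert normalize_nfc(composed) == composed
```

Without this step, text that looks identical can fail equality checks, deduplication, or search, depending on which form a given backend happened to emit.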

Fixed

  • PDF markdown extraction quality at parity with docling (91.0% avg F1 vs docling's 91.4% across 16 test PDFs, while being 10-50x faster): Replaced PdfiumParagraph::from_objects() with per-character text extraction using pdfium's PdfPageText::chars() API, which correctly handles font matrices, CMap lookups, and text positioning. Adaptive line-break detection uses measured Y-position changes rather than font-size-relative thresholds, fixing PDFs where pdfium reports incorrect unscaled font sizes.
  • PDF markdown extraction no longer drops all content on PDFs with broken font metrics: Added font-size filter fallback — when the MIN_FONT_SIZE filter (4pt) removes all text segments (e.g. PDFs where pdfium reports font_size=1 due to font matrix scaling), the filter is skipped and unfiltered segments are used instead.
  • PDF margin filter no longer drops all content on edge-case PDFs: Added margin filter fallback — when margin filtering removes all text segments (e.g. PDFs where pdfium reports baseline_y values outside expected margin bands), the filter is skipped for that page.
  • PDF ligature repair integrated into per-character extraction: Ligature corruption (fi→!, fl→#, ff→") is now repaired inline during character iteration rather than as a separate post-processing pass, improving both accuracy and performance.
  • PDF multi-column text extraction improved: Federal Register-style multi-column PDFs went from 69.9% to 90.7% F1 by using pdfium's text API, which naturally handles reading order.
  • PDF table detection now requires ≥3 aligned columns, eliminating false positives from two-column text layouts (academic papers, newsletters)
  • PDF table post-processing rejects tables with ≤2 columns, >50% long cells, or average cell length >50 chars
  • PDF markdown rendering no longer drops content when pdfium returns zero-value baseline coordinates (fixes missing titles/authors in some LaTeX-generated PDFs)
  • PaddleOCR backend validation now dynamically checks the plugin registry instead of relying on a hardcoded check, preventing false "backend not registered" errors when the plugin is available (#403)
  • WASM bindings now export detectMimeFromBytes and getExtensionsForMime MIME utility functions
  • Node.js NAPI-RS binding correctly exposes annotations field on ExtractionResult
  • Python output format validation tests updated to reflect json as a valid format (alias for structured)
  • XLSX extraction with output_format="markdown" now produces markdown tables instead of plain text (#405)
  • MCP tools with no parameters (cache_stats, cache_clear) now emit valid inputSchema with {"type": "object", "properties": {}} instead of {"const": null}, fixing Claude Code and other MCP clients that validate schema type (#406)
  • Python get_valid_ocr_backends() now unconditionally includes paddleocr in the returned list, matching all other language bindings
  • TypeScript E2E test generator now maps extract_annotations to extractAnnotations in mapPdfConfig(), fixing annotation assertion failures
  • PHP PdfConfig now includes extractAnnotations, topMarginFraction, and bottomMarginFraction fields, restoring parity with the Rust core config
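
The font-size and margin filter fallbacks listed under Fixed share one pattern: apply the filter, and if it would remove every text segment, keep the unfiltered input. A minimal Python sketch of that pattern follows. The `Segment` type, the function names, the `page_height` parameter, and the reading of the margin fractions as fractions of page height (with y measured from the bottom of the page) are all assumptions made for illustration; only the 4pt `MIN_FONT_SIZE` threshold comes from the notes above.

```python
from dataclasses import dataclass
from typing import Callable, List

MIN_FONT_SIZE = 4.0  # points; threshold quoted in the release notes


@dataclass
class Segment:
    """Hypothetical text segment; field names are illustrative only."""
    text: str
    font_size: float
    baseline_y: float


def filter_with_fallback(
    segments: List[Segment], keep: Callable[[Segment], bool]
) -> List[Segment]:
    """Apply `keep`, but return the unfiltered input if nothing survives."""
    kept = [s for s in segments if keep(s)]
    return kept if kept else list(segments)


def drop_tiny_fonts(segments: List[Segment]) -> List[Segment]:
    """Font-size filter that is skipped when it would remove all text."""
    return filter_with_fallback(segments, lambda s: s.font_size >= MIN_FONT_SIZE)


def drop_margins(
    segments: List[Segment],
    page_height: float,
    top_margin_fraction: float,
    bottom_margin_fraction: float,
) -> List[Segment]:
    """Margin filter for one page, skipped when it would remove all text.

    Assumes margins are fractions of page height and y grows upward.
    """
    low = bottom_margin_fraction * page_height
    high = (1.0 - top_margin_fraction) * page_height
    return filter_with_fallback(segments, lambda s: low <= s.baseline_y <= high)


# A page where every segment reports font_size=1: the filter is skipped.
broken = [Segment("Title", 1.0, 700.0), Segment("Body", 1.0, 400.0)]
assert drop_tiny_fonts(broken) == broken

# A normal page: only the tiny segment is dropped.
mixed = [Segment("noise", 1.0, 400.0), Segment("Body", 12.0, 400.0)]
assert [s.text for s in drop_tiny_fonts(mixed)] == ["Body"]
```

The trade-off in this design is that a page consisting entirely of genuinely unwanted text (for example, only headers and footers) is returned unfiltered, in exchange for never producing an empty result from a page that has content.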
