Added
- NFC Unicode normalization applied to all extraction outputs, ensuring consistent representation of composed characters across all backends (gated behind `quality` feature)
- Configurable PDF page margin fractions (`top_margin_fraction`, `bottom_margin_fraction`) in `PdfConfig`
- PDF annotation extraction with new `PdfAnnotation` type supporting `Text`, `Highlight`, `Link`, `Stamp`, `Underline`, `StrikeOut`, and `Other` annotation types
  - `extract_annotations` configuration option in `PdfConfig`
  - `annotations` field on `ExtractionResult` across all language bindings (Rust, Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, WASM)
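
As a rough illustration of what NFC normalization does to extracted text, the sketch below uses Python's standard `unicodedata` module. It is not the library's own code path, and the helper name `normalize_nfc` is hypothetical; it only shows how a decomposed sequence and its precomposed equivalent become identical after normalization.

```python
import unicodedata


def normalize_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition).

    Hypothetical helper for illustration; not part of the library's API.
    """
    return unicodedata.normalize("NFC", text)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
decomposed = "Cafe\u0301"
# Precomposed U+00E9 LATIN SMALL LETTER E WITH ACUTE
composed = "Caf\u00e9"

# The two strings render the same but compare unequal until normalized.
same_before = decomposed == composed
same_after = normalize_nfc(decomposed) == normalize_nfc(composed)
```

Normalizing at the output boundary means downstream string comparison, search, and deduplication see one representation regardless of which backend produced the text.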
Fixed
- PDF markdown extraction quality at parity with docling (91.0% avg F1 vs docling's 91.4% across 16 test PDFs, while being 10-50x faster): Replaced `PdfiumParagraph::from_objects()` with per-character text extraction using pdfium's `PdfPageText::chars()` API, which correctly handles font matrices, CMap lookups, and text positioning. Adaptive line-break detection uses measured Y-position changes rather than font-size-relative thresholds, fixing PDFs where pdfium reports incorrect unscaled font sizes.
- PDF markdown extraction no longer drops all content on PDFs with broken font metrics: Added a font-size filter fallback. When the `MIN_FONT_SIZE` filter (4pt) removes all text segments (e.g. PDFs where pdfium reports `font_size=1` due to font matrix scaling), the filter is skipped and unfiltered segments are used instead.
- PDF margin filter no longer drops all content on edge-case PDFs: Added a margin filter fallback. When margin filtering removes all text segments (e.g. PDFs where pdfium reports `baseline_y` values outside expected margin bands), the filter is skipped for that page.
- PDF ligature repair integrated into per-character extraction: Ligature corruption (`fi`→`!`, `fl`→`#`, `ff`→`"`) is now repaired inline during character iteration rather than as a separate post-processing pass, improving both accuracy and performance.
- PDF multi-column text extraction improved: Federal Register-style multi-column PDFs went from 69.9% to 90.7% F1 by using pdfium's text API, which naturally handles reading order.
- PDF table detection now requires ≥3 aligned columns, eliminating false positives from two-column text layouts (academic papers, newsletters)
- PDF table post-processing rejects tables with ≤2 columns, >50% long cells, or average cell length >50 chars
- PDF markdown rendering no longer drops content when pdfium returns zero-value baseline coordinates (fixes missing titles/authors in some LaTeX-generated PDFs)
- PaddleOCR backend validation now dynamically checks the plugin registry instead of hardcoding, preventing false "backend not registered" errors when the plugin is available (#403)
- WASM bindings now export `detectMimeFromBytes` and `getExtensionsForMime` MIME utility functions
- Node.js NAPI-RS binding correctly exposes `annotations` field on `ExtractionResult`
- Python output format validation tests updated to reflect `json` as a valid format (alias for `structured`)
- XLSX extraction with `output_format="markdown"` now produces markdown tables instead of plain text (#405)
- MCP tools with no parameters (`cache_stats`, `cache_clear`) now emit valid `inputSchema` with `{"type": "object", "properties": {}}` instead of `{"const": null}`, fixing Claude Code and other MCP clients that validate schema type (#406)
- Python `get_valid_ocr_backends()` now unconditionally includes `paddleocr` in the returned list, matching all other language bindings
- TypeScript E2E test generator now maps `extract_annotations` to `extractAnnotations` in `mapPdfConfig()`, fixing annotation assertion failures
- PHP `PdfConfig` now includes `extractAnnotations`, `topMarginFraction`, and `bottomMarginFraction` fields, restoring parity with the Rust core config
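
The filter fallbacks described in the Fixed entries (font size and margin) share one pattern: apply the filter, and if nothing survives, treat the filter's input data as unreliable and keep the unfiltered segments. A minimal Python sketch of the font-size case follows. The `Segment` type and `filter_by_font_size` function are hypothetical stand-ins for illustration, not the library's actual types; only the 4pt threshold comes from the entry above.

```python
from dataclasses import dataclass
from typing import List

# Threshold in points, as named in the changelog entry.
MIN_FONT_SIZE = 4.0


@dataclass
class Segment:
    """Hypothetical stand-in for an extracted text segment."""
    text: str
    font_size: float


def filter_by_font_size(
    segments: List[Segment], min_size: float = MIN_FONT_SIZE
) -> List[Segment]:
    """Drop segments below min_size, unless that would drop everything."""
    kept = [s for s in segments if s.font_size >= min_size]
    # Fallback: if every segment was removed, the reported font sizes are
    # probably wrong (e.g. unscaled), so return the input unfiltered.
    return kept if kept else list(segments)


# A page where every reported size is 1.0: the filter is skipped.
broken = [Segment("Title", 1.0), Segment("Body text", 1.0)]
broken_result = filter_by_font_size(broken)

# A normal page: only the tiny segment is removed.
mixed = [Segment("Body text", 10.0), Segment("noise", 2.0)]
mixed_result = filter_by_font_size(mixed)
```

The trade-off in this sketch is that a page consisting only of genuinely tiny text is also kept, which favors returning some content over returning none.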