kreuzberg-dev/kreuzberg v4.3.6 on GitHub

Added

Pdfium PdfParagraph object-based extraction: New markdown extraction path using pdfium's PdfParagraph::from_objects() for spatial text grouping, replacing raw page-object iteration. Provides accurate per-line baseline positions via into_lines() and styled text fragments with bold/italic/monospace detection.
Structure tree and content marks API in pdfium-render: New ExtractedBlock, ContentRole, and PdfParagraph types for tagged PDF semantic extraction. Structure tree headings are validated against font size and word count to prevent broken structure trees from misclassifying body text.
Modular markdown pipeline: Refactored PDF markdown rendering into focused modules — bridge.rs (pdfium API bridge), lines.rs (baseline grouping), paragraphs.rs (paragraph detection), classify.rs (heading/code classification), render.rs (inline markup), assembly.rs (table/image interleaving), pipeline.rs (orchestration).
Text encoding normalization: normalize_text_encoding() converts trailing soft hyphens to regular hyphens for word-rejoining, strips mid-word soft hyphens, and removes stray C0 control characters from PDF text.
Table post-processing validation: 10-stage validation — empty row removal, long cell rejection, data row detection, header extraction, column merging, dimension checks, column sparsity, overall density, content asymmetry, and cell normalization. Eliminates false positive table detections in non-table PDFs.
Font quality detection for OCR triggering: Added has_unicode_map_error() to pdfium-render's PdfPageTextChar. If more than 30% of the characters on a page have broken Unicode mappings, OCR fallback is triggered automatically.
Extended list prefix detection: Paragraph list detection now recognizes en dashes, em dashes, single-letter alphabetic prefixes, and roman numerals.
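
The text encoding normalization entry above can be illustrated with a minimal sketch. This is not the actual normalize_text_encoding() implementation. The function name normalize_text_sketch is hypothetical, and two details are assumptions: that "trailing" means a soft hyphen directly before a line break or the end of the text, and that tab, LF, and CR are the C0 control characters worth keeping.

```rust
/// U+00AD SOFT HYPHEN, often left behind in text extracted from PDFs.
const SOFT_HYPHEN: char = '\u{00AD}';

/// Hypothetical sketch of the normalization described in the release notes.
/// It is NOT kreuzberg's real `normalize_text_encoding()`.
fn normalize_text_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            SOFT_HYPHEN => {
                // Assumption: a soft hyphen is "trailing" when it is the last
                // character before a line break or the end of the text.
                let next = chars.peek().copied();
                if matches!(next, None | Some('\n') | Some('\r')) {
                    // Keep a visible hyphen so a later step can rejoin the word.
                    out.push('-');
                }
                // A mid-word soft hyphen is dropped entirely.
            }
            // Assumption: these whitespace controls are kept as-is.
            '\t' | '\n' | '\r' => out.push(c),
            // Any other C0 control character (U+0000..U+001F) is removed.
            c if (c as u32) < 0x20 => {}
            _ => out.push(c),
        }
    }
    out
}
```

Under these assumptions, a mid-word soft hyphen disappears, while one at the end of a line becomes a regular hyphen that a later word-rejoining step can act on.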

Fixed

UTF-8 panic in PDF list detection (#398): Fixed with proper CRLF-aware newline advancement and char boundary guards.
PaddleOCR backend not respected in Python bindings (#399): The paddleocr/paddle-ocr backends are now handled correctly.
Ruby gem missing sorbet-runtime at runtime (#400): sorbet-runtime is now declared as a runtime dependency.
DOCX extractor panic on multi-byte UTF-8 page boundaries (#401): Fixed with char-boundary-safe insertion.
Tesseract TSV level mapping off-by-one: Fixed parse_tsv_to_elements to include word-level entries.
OCR elements dropped in image OCR path: Now passes through elements parsed from Tesseract TSV output.
Node.js djot_content field missing: Fixed mapping in JsExtractionResult.
Pipeline test race conditions: Replaced manual mutex with #[serial] from serial_test.
OCR cache deserialization failure: Schema changes now result in a graceful cache miss. Stale cache entries are deleted rather than causing a crash.
PDF table detection false positives: Precision improved from 50% to 100% via post_process_table() validation.
Baseline tolerance drift in PDF line grouping: Now anchored to first segment's font size per line.
Paragraph gap detection: Changed to 25th percentile (Q1) for robustness against outlier-tight spacings.
Python OutputFormat.Structured missing: Added structured/json output format to PyO3 bindings and Python enum.
Python ChunkingConfig.chunker_type missing: Exposed chunker_type parameter in Python binding.
Python e2e config tests: Fixed build_config() to construct HierarchyConfig and PageConfig from fixture dicts.
Go CI tar extraction conflicts: Disabled Go caching in test job to prevent cache conflicts.
E2e generator gaps: Implemented plugin API test generators for WASM-Deno, WASM-Workers, and C#.
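
Several fixes above (#398, #401) mention char boundary guards. As a rough illustration of the technique, and not the project's actual code, the sketch below inserts a marker into a String at a byte offset that may fall inside a multi-byte UTF-8 character. The helper name insert_at_char_boundary is hypothetical, and snapping backward (rather than forward) to the nearest boundary is an assumption.

```rust
/// Hypothetical helper: insert `marker` at `byte_offset`, or at the closest
/// preceding char boundary if the offset lands inside a multi-byte character.
/// Returns the byte index actually used.
fn insert_at_char_boundary(text: &mut String, byte_offset: usize, marker: &str) -> usize {
    // Clamp to the end of the string so an oversized offset appends.
    let mut idx = byte_offset.min(text.len());

    // `String::insert_str` panics when `idx` is not on a char boundary,
    // so step back first. Index 0 is always a boundary, so this terminates.
    while !text.is_char_boundary(idx) {
        idx -= 1;
    }

    text.insert_str(idx, marker);
    idx
}
```

For example, in "héllo" the character 'é' occupies bytes 1 and 2, so a request to insert at byte 2 is moved back to byte 1 instead of panicking.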

See full changelog: https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md