Added
- Pdfium `PdfParagraph` object-based extraction: New markdown extraction path using pdfium's `PdfParagraph::from_objects()` for spatial text grouping, replacing raw page-object iteration. Provides accurate per-line baseline positions via `into_lines()` and styled text fragments with bold/italic/monospace detection.
- Structure tree and content marks API in pdfium-render: New `ExtractedBlock`, `ContentRole`, and `PdfParagraph` types for tagged PDF semantic extraction. Structure tree headings are validated against font size and word count to prevent broken structure trees from misclassifying body text.
- Modular markdown pipeline: Refactored PDF markdown rendering into focused modules: `bridge.rs` (pdfium API bridge), `lines.rs` (baseline grouping), `paragraphs.rs` (paragraph detection), `classify.rs` (heading/code classification), `render.rs` (inline markup), `assembly.rs` (table/image interleaving), `pipeline.rs` (orchestration).
- Text encoding normalization: `normalize_text_encoding()` converts trailing soft hyphens to regular hyphens for word-rejoining, strips mid-word soft hyphens, and removes stray C0 control characters from PDF text.
- Table post-processing validation: A 10-stage validation pass (empty row removal, long cell rejection, data row detection, header extraction, column merging, dimension checks, column sparsity, overall density, content asymmetry, and cell normalization) eliminates false positive table detections in non-table PDFs.
- Font quality detection for OCR triggering: Added `has_unicode_map_error()` to pdfium-render's `PdfPageTextChar`. If more than 30% of the characters on a page have broken Unicode mappings, OCR fallback is triggered automatically.
- Extended list prefix detection: Paragraph list detection now recognizes en dashes, em dashes, single-letter alphabetic prefixes, and roman numerals.
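The text encoding normalization entry above can be illustrated with a minimal sketch. This is not the project's `normalize_text_encoding()`; the function name `normalize_text_encoding_sketch` is hypothetical, and it assumes that "trailing" means a soft hyphen directly followed by a line break or the end of input, and that tab, line feed, and carriage return are the C0 characters worth keeping.

```rust
/// U+00AD SOFT HYPHEN, often left behind by PDF text extraction.
const SOFT_HYPHEN: char = '\u{00AD}';

/// Hypothetical sketch of soft-hyphen and control-character cleanup.
/// Assumption: a soft hyphen is "trailing" when it is followed by a
/// line break or by the end of the input.
fn normalize_text_encoding_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            SOFT_HYPHEN => {
                let next = chars.peek().copied();
                let at_line_end = matches!(next, None | Some('\n') | Some('\r'));
                if at_line_end {
                    // Keep a visible hyphen so later stages can rejoin the word.
                    out.push('-');
                }
                // A mid-word soft hyphen is dropped entirely.
            }
            // Assumption: these whitespace controls are meaningful and kept.
            '\n' | '\r' | '\t' => out.push(c),
            // Remaining C0 control characters (U+0000..U+001F) are removed.
            c if (c as u32) < 0x20 => {}
            _ => out.push(c),
        }
    }
    out
}
```

With this sketch, a mid-word soft hyphen disappears while one at the end of a line becomes a regular hyphen that a later stage could use to rejoin the split word.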
Fixed
- UTF-8 panic in PDF list detection (#398): Fixed with proper CRLF-aware newline advancement and char boundary guards.
- PaddleOCR backend not respected in Python bindings (#399): `paddleocr`/`paddle-ocr` backends now correctly handled.
- Ruby gem missing `sorbet-runtime` at runtime (#400): Promoted to a runtime dependency.
- DOCX extractor panic on multi-byte UTF-8 page boundaries (#401): Fixed with char-boundary-safe insertion.
- Tesseract TSV level mapping off-by-one: Fixed `parse_tsv_to_elements` to include word-level entries.
- OCR elements dropped in image OCR path: Now passes through elements parsed from Tesseract TSV output.
- Node.js `djot_content` field missing: Fixed mapping in `JsExtractionResult`.
- Pipeline test race conditions: Replaced manual mutex with `#[serial]` from `serial_test`.
- OCR cache deserialization failure: Graceful cache miss on schema changes; stale cache entries are deleted instead of crashing.
- PDF table detection false positives: Precision improved from 50% to 100% via `post_process_table()` validation.
- Baseline tolerance drift in PDF line grouping: Now anchored to first segment's font size per line.
- Paragraph gap detection: Changed to 25th percentile (Q1) for robustness against outlier-tight spacings.
- Python `OutputFormat.Structured` missing: Added structured/json output format to PyO3 bindings and Python enum.
- Python `ChunkingConfig.chunker_type` missing: Exposed `chunker_type` parameter in Python binding.
- Python e2e config tests: Fixed `build_config()` to construct `HierarchyConfig` and `PageConfig` from fixture dicts.
- Go CI tar extraction conflicts: Disabled Go caching in test job to prevent cache conflicts.
- E2e generator gaps: Implemented plugin API test generators for WASM-Deno, WASM-Workers, and C#.
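Two of the fixes above (#398 and #401) involve panics caused by slicing or inserting at a byte offset that falls inside a multi-byte UTF-8 character. The sketch below shows one common way to guard against that using the standard library's `str::is_char_boundary`. The helper names `snap_to_char_boundary` and `insert_at_or_before` are hypothetical, and snapping backwards to the previous boundary is an assumption for illustration; the actual fixes may choose a different strategy.

```rust
/// Hypothetical helper: returns the largest char boundary that is
/// less than or equal to `index`, clamped to the string length.
fn snap_to_char_boundary(s: &str, index: usize) -> usize {
    let mut i = index.min(s.len());
    // Offset 0 is always a boundary, so this loop cannot underflow.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Hypothetical helper: inserts `marker` at `byte_offset`, or at the
/// nearest preceding char boundary if the offset splits a character.
/// Returns the byte position actually used.
fn insert_at_or_before(text: &mut String, byte_offset: usize, marker: &str) -> usize {
    let pos = snap_to_char_boundary(text, byte_offset);
    // `insert_str` panics on a non-boundary index; `pos` is guaranteed safe.
    text.insert_str(pos, marker);
    pos
}
```

Calling `insert_at_or_before` on `"h\u{00E9}llo"` with offset 2, which lands in the middle of the two-byte `é`, inserts at offset 1 instead of panicking.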
See full changelog: https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md