Added
Blank Page Detection
- `is_blank` field on `PageInfo` and `PageContent`: Pages with fewer than 3 non-whitespace characters and no tables or images are flagged as blank. Detection uses a two-phase approach: text-only analysis during extraction, then refinement after table/image assignment. Available across all 9 language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir, WASM). Closes #378.
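The rule above can be sketched roughly as follows. This is an illustration only, not Kreuzberg's implementation: the function names, the constant name, and the parameters are hypothetical. Only the threshold of 3 non-whitespace characters and the two-phase order come from the entry above.

```python
MIN_NON_WHITESPACE_CHARS = 3  # threshold stated in the changelog entry


def is_blank_text_only(text: str) -> bool:
    """Phase 1 (hypothetical helper): text-only check, run during extraction."""
    non_whitespace = sum(1 for ch in text if not ch.isspace())
    return non_whitespace < MIN_NON_WHITESPACE_CHARS


def refine_is_blank(text_blank: bool, table_count: int, image_count: int) -> bool:
    """Phase 2 (hypothetical helper): a page holding any table or image is not blank."""
    return text_blank and table_count == 0 and image_count == 0
```

Under this sketch, a page whose text is only whitespace but which contains an image is marked blank after phase 1 and then cleared in phase 2.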
PaddleOCR Backend
- PaddleOCR backend via ONNX Runtime: New OCR backend (`kreuzberg-paddle-ocr`) using PaddlePaddle's PP-OCRv4 models converted to ONNX format, run via ONNX Runtime. Supports 6 languages (English, Chinese, Japanese, Korean, German, French) with automatic model downloading and caching. Provides superior CJK recognition compared to Tesseract.
- PaddleOCR support in all bindings: Available across Python, Rust, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, and Elixir bindings via the `paddle-ocr` feature flag.
- PaddleOCR CLI support: The `kreuzberg-cli` binary supports `--ocr-backend paddle-ocr` for PaddleOCR extraction.
Unified OCR Element Output
- Structured OCR element data: Extraction results now include `OcrElement` data with bounding geometry (rectangles and quadrilaterals), per-element confidence scores, rotation information, and hierarchical levels (word, line, block, page). Available from both PaddleOCR and Tesseract backends.
Shared ONNX Runtime Discovery
- `ort_discovery` module: Finds ONNX Runtime shared libraries across platforms, shared between PaddleOCR and future ONNX-based backends.
Document Structure Output
- `DocumentStructure` support across all bindings: Added structured document output with the `include_document_structure` configuration option across Python, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, Elixir, and WASM bindings.
Native DOC/PPT Extraction
- OLE/CFB-based extraction: Added native DOC and PPT extraction via OLE/CFB binary parsing. Legacy Office formats no longer require any external tools.
musl Linux Support
- Re-enabled musl targets: Added `x86_64-unknown-linux-musl` and `aarch64-unknown-linux-musl` targets for CLI binaries, Python wheels (musllinux), and Node.js native bindings. Resolves the glibc 2.38+ requirement for prebuilt CLI binaries on older distros like Ubuntu 22.04 (#364).
Fixed
MSG Extraction Hang on Large Attachments (#372)
- Fixed `.msg` (Outlook) extraction hanging indefinitely on files with large attachments. Replaced the `msg_parser` crate with direct OLE/CFB parsing using the `cfb` crate; attachment binary data is now read directly without hex-encoding overhead.
- Added lenient FAT padding for MSG files with truncated sector tables produced by some Outlook versions.
Rotated PDF Text Extraction
- Fixed text extraction returning empty content for PDFs with 90° or 270° page rotation. Kreuzberg now strips `/Rotate` entries from page dictionaries before loading, restoring correct text extraction for all rotation angles.
CSV and Excel Extraction Quality
- Fixed CSV extraction producing near-zero quality scores (0.024) by outputting proper delimited text instead of debug format.
- Fixed Excel extraction producing low quality scores (0.22) by outputting clean tab/newline-delimited cell text.
XML Extraction Quality
- Improved XML text extraction to better handle namespaced elements, CDATA sections, and mixed content, improving quality scores.
WASM Table Extraction
- Fixed WASM adapter not recognizing the `page_number` field (snake_case) from Rust FFI, causing table data to be silently dropped in Deno and Cloudflare Workers tests.
DOCX Formatting Output (#376)
- Fixed DOCX extraction producing plain text instead of formatted markdown. Bold, italic, underline, strikethrough, and hyperlinks are now rendered with proper markdown markers.
- Fixed heading hierarchy: Title style maps to `#`, Heading1 to `##`, through Heading5+ clamped at `######`.
- Fixed bullet lists, numbered lists, and nested list indentation.
- Fixed tables missing from markdown output. Tables are now interleaved with paragraphs in document order and rendered as markdown pipe tables.
- Fixed table cell formatting being stripped — bold/italic inside table cells is now preserved.
- Added 16 integration tests covering formatting, headings, lists, tables, and document structure.
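The heading mapping described above can be illustrated with a short sketch. The function name and the string-based style matching are assumptions made for illustration and are not taken from the Kreuzberg source. Only the mapping itself (Title to one hash, Heading1 to two, Heading5 and deeper clamped at six) follows the entry.

```python
def heading_prefix(style: str) -> str:
    """Return the markdown heading marker for a DOCX paragraph style name.

    Hypothetical helper: returns an empty string for non-heading styles.
    """
    if style == "Title":
        return "#"
    suffix = style[len("Heading"):]
    if style.startswith("Heading") and suffix.isdigit():
        level = max(1, int(suffix))
        # Heading1 -> 2 hashes, ..., Heading5 and deeper -> 6 hashes (clamped).
        return "#" * min(level + 1, 6)
    return ""
```

Clamping at six keeps the output valid, since markdown defines no heading level deeper than `######`.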
Typst Table Content Extraction
- Fixed Typst `extract_table_content` double-counting the opening parenthesis, which caused the table parser to consume all remaining document content after a `#table()` call.
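To show why the parenthesis count matters, here is a simplified depth-counting scan. It is a sketch under stated assumptions, not the actual `extract_table_content` code: `find_call_end` is a hypothetical name, and the scan ignores string literals and Typst content blocks. If the depth started at 1 and the opening parenthesis were then counted again, the depth would never return to 0 and the scan would run to the end of the document.

```python
def find_call_end(source: str, open_index: int) -> int:
    """Return the index just past the parenthesis matching source[open_index]."""
    if source[open_index] != "(":
        raise ValueError("open_index must point at '('")
    depth = 0
    for i in range(open_index, len(source)):
        ch = source[i]
        if ch == "(":
            depth += 1  # the opening parenthesis is counted here, exactly once
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unbalanced parentheses")


src = "#table((1), (2)) rest"
remaining = src[find_call_end(src, src.index("(")):]
```

Here `remaining` holds the text that follows the call, so the content after the table stays available to the rest of the parser.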
PaddleOCR Recognition Model
- Fixed PaddleOCR recognition model failing to load with `ShapeInferenceError` on ONNX Runtime 1.23.x.
- Fixed incorrect detection model filename in Docker and CI action.
Python Bindings
- Fixed `OcrConfig` constructor silently ignoring `paddle_ocr_config` and `element_config` keyword arguments.
- Fixed keyword extraction results being silently dropped in Python bindings. Closes #379.
TypeScript/Node.js Bindings
- Fixed PaddleOCR config and element config being silently dropped by the NAPI-RS binding layer.
- Fixed `ocr_elements` missing from extraction result conversion.
Ruby Bindings
- Fixed `kreuzberg-pdfium-render` vendored crate not included in gemspec.
- Fixed PaddleOCR config and element config not being parsed in Ruby binding config layer.
- Fixed `ocr_elements` missing from Ruby extraction result conversion.
Go Bindings
- Fixed `PdfMetadata` deserialization failing when keyword extraction produces object arrays instead of simple strings.
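The general idea of accepting both keyword shapes can be sketched as below. This is written in Python for illustration and is not the Go binding's code: the function name and the `text` key are hypothetical, since the entry does not describe the shape of the keyword objects.

```python
from typing import Any


def keyword_texts(raw: list[Any], text_key: str = "text") -> list[str]:
    """Accept keywords as plain strings or as objects carrying a text field.

    The default key name is an assumption, not the library's schema.
    """
    texts = []
    for item in raw:
        if isinstance(item, str):
            texts.append(item)
        elif isinstance(item, dict) and isinstance(item.get(text_key), str):
            texts.append(item[text_key])
        else:
            raise TypeError(f"unsupported keyword entry: {item!r}")
    return texts
```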
C# Bindings
- Fixed keyword extraction data being inaccessible: `ExtractedKeywords` was marked `[JsonIgnore]` and excluded from metadata serialization.
PHP Bindings
- Fixed `document`, `elements`, and `ocrElements` properties inaccessible on `ExtractionResult`.
- Fixed `ExtractionConfig::toArray()` not serializing `include_document_structure`.
- Fixed wrapper function names for document extractor management.
- Added missing OCR backend management functions.
- Fixed `page_count` metadata key mismatch.
Elixir Bindings
- Fixed NIF config parser not forwarding `include_document_structure`, `result_format`, `output_format`, `html_options`, `max_concurrent_extractions`, and `security_limits` options.
- Added missing document extractor management NIFs.
CI
- Fixed PHP E2E tests not actually running in CI.
Changed
Build System
- Bumped ONNX Runtime from 1.23.2 to 1.24.1 across CI, Docker images, and documentation.
- Bumped vendored Tesseract from 5.5.1 to 5.5.2.
- Bumped vendored Leptonica from 1.86.0 to 1.87.0.
Removed
LibreOffice Dependency
- LibreOffice is no longer required: Legacy `.doc` and `.ppt` files are now extracted natively via OLE/CFB parsing. LibreOffice has been removed from Docker images, CI pipelines, and system dependency requirements, reducing the full Docker image size by ~500-800MB.
msg_parser Dependency
- Replaced `msg_parser` crate with direct CFB parsing for MSG extraction.
Guten OCR Backend
- Removed all references to the unused Guten OCR backend from Node.js and PHP bindings.