github kreuzberg-dev/kreuzberg v4.3.0

5 hours ago

Added

Blank Page Detection

  • is_blank field on PageInfo and PageContent: Pages with fewer than 3 non-whitespace characters and no tables or images are flagged as blank. Detection uses a two-phase approach: text-only analysis during extraction, then refinement after table/image assignment. Available across all 9 language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir, WASM). Closes #378.
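
The two-phase rule described above can be sketched as follows. This is a minimal illustration, not kreuzberg's implementation: the `Page` class, its field names other than `is_blank`, and the helper function names are hypothetical stand-ins; only the threshold of 3 non-whitespace characters and the "no tables or images" condition come from the release note.

```python
from dataclasses import dataclass

# From the release note: fewer than 3 non-whitespace characters counts as blank.
BLANK_CHAR_THRESHOLD = 3


@dataclass
class Page:
    """Hypothetical stand-in for a page record (not kreuzberg's PageContent)."""
    text: str
    table_count: int = 0
    image_count: int = 0
    is_blank: bool = False


def text_looks_blank(text: str) -> bool:
    """Return True when the text has fewer than 3 non-whitespace characters."""
    non_whitespace = sum(1 for ch in text if not ch.isspace())
    return non_whitespace < BLANK_CHAR_THRESHOLD


def mark_blank_phase_one(page: Page) -> None:
    """Phase 1: text-only analysis, run during extraction."""
    page.is_blank = text_looks_blank(page.text)


def refine_blank_phase_two(page: Page) -> None:
    """Phase 2: after tables and images are assigned, a page that carries
    either one is no longer considered blank."""
    page.is_blank = (
        page.is_blank and page.table_count == 0 and page.image_count == 0
    )
```

Under these assumptions, a scanned page with no text but one image would be flagged in phase one and cleared in phase two.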

PaddleOCR Backend

  • PaddleOCR backend via ONNX Runtime: New OCR backend (kreuzberg-paddle-ocr) using PaddlePaddle's PP-OCRv4 models converted to ONNX format, run via ONNX Runtime. Supports 6 languages (English, Chinese, Japanese, Korean, German, French) with automatic model downloading and caching. Provides superior CJK recognition compared to Tesseract.
  • PaddleOCR support in all bindings: Available across Python, Rust, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, and Elixir bindings via the paddle-ocr feature flag.
  • PaddleOCR CLI support: The kreuzberg-cli binary supports --ocr-backend paddle-ocr for PaddleOCR extraction.

Unified OCR Element Output

  • Structured OCR element data: Extraction results now include OcrElement data with bounding geometry (rectangles and quadrilaterals), per-element confidence scores, rotation information, and hierarchical levels (word, line, block, page). Available from both PaddleOCR and Tesseract backends.
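
To make the shape of this data concrete, here is a rough sketch of what an element record with these properties could look like and how a caller might filter it. The class name `OcrElementSketch`, every field name, and the 0.0 to 1.0 confidence range are assumptions for illustration; the real `OcrElement` type in each binding may be organized differently.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class OcrElementSketch:
    """Hypothetical record mirroring the properties named in the release note."""
    text: str
    level: str                 # one of "word", "line", "block", "page"
    confidence: float          # assumed to lie in 0.0 to 1.0
    quad: List[Point]          # four corners; a rectangle is the axis-aligned case
    rotation_degrees: float = 0.0


def confident_words(elements: List[OcrElementSketch],
                    min_confidence: float = 0.8) -> List[str]:
    """Keep the text of word-level elements at or above a confidence cutoff."""
    return [
        e.text
        for e in elements
        if e.level == "word" and e.confidence >= min_confidence
    ]
```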

Shared ONNX Runtime Discovery

  • ort_discovery module: Finds ONNX Runtime shared libraries across platforms, shared between PaddleOCR and future ONNX-based backends.

Document Structure Output

  • DocumentStructure support across all bindings: Added structured document output with include_document_structure configuration option across Python, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, Elixir, and WASM bindings.

Native DOC/PPT Extraction

  • OLE/CFB-based extraction: Added native DOC and PPT extraction via OLE/CFB binary parsing. Legacy Office formats no longer require any external tools.

musl Linux Support

  • Re-enabled musl targets: Added x86_64-unknown-linux-musl and aarch64-unknown-linux-musl targets for CLI binaries, Python wheels (musllinux), and Node.js native bindings. Resolves the glibc 2.38+ requirement that prevented prebuilt CLI binaries from running on older distros such as Ubuntu 22.04 (#364).

Fixed

MSG Extraction Hang on Large Attachments (#372)

  • Fixed .msg (Outlook) extraction hanging indefinitely on files with large attachments. Replaced the msg_parser crate with direct OLE/CFB parsing using the cfb crate. Attachment binary data is now read directly, without hex-encoding overhead.
  • Added lenient FAT padding for MSG files with truncated sector tables produced by some Outlook versions.

Rotated PDF Text Extraction

  • Fixed text extraction returning empty content for PDFs with 90° or 270° page rotation. Kreuzberg now strips /Rotate entries from page dictionaries before loading, restoring correct text extraction for all rotation angles.

CSV and Excel Extraction Quality

  • Fixed CSV extraction producing near-zero quality scores (0.024) by outputting proper delimited text instead of debug format.
  • Fixed Excel extraction producing low quality scores (0.22) by outputting clean tab/newline-delimited cell text.
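
As an illustration of the difference, the sketch below renders rows of cells as tab/newline-delimited text rather than as a debug representation of the underlying data structure. The function name `rows_to_text` and the choice to render empty cells as empty strings are assumptions; the release note does not describe how kreuzberg handles empty cells.

```python
from typing import Iterable, Optional, Sequence, Union

Cell = Optional[Union[str, int, float]]


def rows_to_text(rows: Iterable[Sequence[Cell]]) -> str:
    """Join cells with tabs and rows with newlines.

    Empty cells (None) become empty strings, which is an assumption made
    for this sketch.
    """
    return "\n".join(
        "\t".join("" if cell is None else str(cell) for cell in row)
        for row in rows
    )


rows = [["name", "qty"], ["apple", 3], [None, 7]]
clean_output = rows_to_text(rows)   # plain cell text, one row per line
debug_output = repr(rows)           # brackets and quotes leak into the text
```

Text in the second form is dominated by punctuation, which is one plausible reason a quality score computed over it would be low.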

XML Extraction Quality

  • Improved XML text extraction to better handle namespaced elements, CDATA sections, and mixed content, improving quality scores.

WASM Table Extraction

  • Fixed WASM adapter not recognizing page_number field (snake_case) from Rust FFI, causing table data to be silently dropped in Deno and Cloudflare Workers tests.

DOCX Formatting Output (#376)

  • Fixed DOCX extraction producing plain text instead of formatted markdown. Bold, italic, underline, strikethrough, and hyperlinks are now rendered with proper markdown markers.
  • Fixed heading hierarchy: Title style maps to #, Heading1 to ##, through Heading5+ clamped at ######.
  • Fixed bullet lists, numbered lists, and nested list indentation.
  • Fixed tables missing from markdown output. Tables are now interleaved with paragraphs in document order and rendered as markdown pipe tables.
  • Fixed table cell formatting being stripped: bold and italic text inside table cells is now preserved.
  • Added 16 integration tests covering formatting, headings, lists, tables, and document structure.
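
The heading rule above (Title to #, Heading1 to ##, deeper levels clamped at ######) can be sketched as a small mapping function. This is an illustrative sketch only: the function name `heading_prefix` and the assumption that style names look exactly like `Title` and `HeadingN` are not taken from kreuzberg's code.

```python
import re
from typing import Optional

MAX_MARKDOWN_LEVEL = 6  # markdown has no heading deeper than ######


def heading_prefix(style_name: str) -> Optional[str]:
    """Map a DOCX paragraph style name to a markdown heading prefix.

    Returns None for styles that are not headings.
    """
    if style_name == "Title":
        return "#"
    match = re.fullmatch(r"Heading(\d+)", style_name)
    if match is None:
        return None
    # Heading1 sits one level below Title, so shift by one and clamp at 6.
    level = min(int(match.group(1)) + 1, MAX_MARKDOWN_LEVEL)
    return "#" * level
```

Shifting every `HeadingN` down by one level keeps the document title as the only top-level heading.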

Typst Table Content Extraction

  • Fixed Typst extract_table_content double-counting the opening parenthesis, which caused the table parser to consume all remaining document content after a #table() call.
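
A depth-counting scan shows why this kind of bug swallows the rest of the document. The sketch below is not the actual `extract_table_content` code; `find_call_end` is a hypothetical helper, and it deliberately ignores parentheses inside strings and comments. If the depth counter were initialized to 1 and the scan then also counted the opening parenthesis, the depth would never return to 0 and the scan would run to the end of the input.

```python
from typing import Optional


def find_call_end(source: str, open_index: int) -> Optional[int]:
    """Return the index just past the ')' that matches the '(' at open_index.

    Returns None when the parentheses are unbalanced.
    """
    if source[open_index] != "(":
        raise ValueError("open_index must point at '('")
    depth = 0  # start at 0 because the loop itself counts the opening '('
    for i in range(open_index, len(source)):
        ch = source[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i + 1
    return None


src = "#table(columns: (1fr, 1fr), [a], [b])\nText after the table."
start = src.index("(")
end = find_call_end(src, start)
table_args = src[start + 1 : end - 1]   # content between the outer parentheses
remainder = src[end:]                   # document content left for later parsing
```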

PaddleOCR Recognition Model

  • Fixed PaddleOCR recognition model failing to load with ShapeInferenceError on ONNX Runtime 1.23.x.
  • Fixed incorrect detection model filename in Docker and CI action.

Python Bindings

  • Fixed OcrConfig constructor silently ignoring paddle_ocr_config and element_config keyword arguments.
  • Fixed keyword extraction results being silently dropped in Python bindings. Closes #379.

TypeScript/Node.js Bindings

  • Fixed PaddleOCR config and element config being silently dropped by the NAPI-RS binding layer.
  • Fixed ocr_elements missing from extraction result conversion.

Ruby Bindings

  • Fixed kreuzberg-pdfium-render vendored crate not included in gemspec.
  • Fixed PaddleOCR config and element config not being parsed in Ruby binding config layer.
  • Fixed ocr_elements missing from Ruby extraction result conversion.

Go Bindings

  • Fixed PdfMetadata deserialization failing when keyword extraction produces object arrays instead of simple strings.

C# Bindings

  • Fixed keyword extraction data being inaccessible: ExtractedKeywords was marked [JsonIgnore] and was therefore excluded from metadata serialization.

PHP Bindings

  • Fixed document, elements, and ocrElements properties being inaccessible on ExtractionResult.
  • Fixed ExtractionConfig::toArray() not serializing include_document_structure.
  • Fixed wrapper function names for document extractor management.
  • Added missing OCR backend management functions.
  • Fixed page_count metadata key mismatch.

Elixir Bindings

  • Fixed NIF config parser not forwarding include_document_structure, result_format, output_format, html_options, max_concurrent_extractions, and security_limits options.
  • Added missing document extractor management NIFs.

CI

  • Fixed PHP E2E tests not actually running in CI.

Changed

Build System

  • Bumped ONNX Runtime from 1.23.2 to 1.24.1 across CI, Docker images, and documentation.
  • Bumped vendored Tesseract from 5.5.1 to 5.5.2.
  • Bumped vendored Leptonica from 1.86.0 to 1.87.0.

Removed

LibreOffice Dependency

  • LibreOffice is no longer required: Legacy .doc and .ppt files are now extracted natively via OLE/CFB parsing. LibreOffice has been removed from Docker images, CI pipelines, and system dependency requirements, reducing the full Docker image size by ~500-800MB.

msg_parser Dependency

  • Replaced msg_parser crate with direct CFB parsing for MSG extraction.

Guten OCR Backend

  • Removed all references to the unused Guten OCR backend from Node.js and PHP bindings.
