github kreuzberg-dev/kreuzberg v4.4.0

Added

  • R language bindings: Added kreuzberg R package via extendr with full extraction API (sync/async, batch, bytes), typed error conditions, S3 result class with accessors, config discovery, OCR/chunking configuration, plugin system, and 32 documentation snippets.
  • PHP async extraction: Non-blocking extraction via DeferredResult pattern with Tokio thread pool. Includes extractFileAsync(), extractBytesAsync(), batchExtractFilesAsync(), batchExtractBytesAsync() across OOP, procedural, and static APIs. Framework bridges for Amp v3+ (AmpBridge) and ReactPHP (ReactBridge).
  • C FFI distribution: Official C shared library (libkreuzberg) with cbindgen-generated header, CMake packaging (find_package(kreuzberg)), pkg-config support, and prebuilt binaries for Linux x86_64/aarch64, macOS arm64, and Windows x86_64. Includes full API reference documentation and test coverage.
  • Go FFI bindings: Go package (packages/go/v4) consuming the C FFI shared library with prebuilt binaries published as GitHub release assets for all four platforms.
  • C as 13th e2e test language: The e2e-generator now produces C test files exercising the FFI API, with 15 passing test cases.
  • R distribution via r-universe: Switched R package distribution from CRAN to r-universe for faster release cycles and easier native compilation.
  • WASM native OCR (ocr-wasm feature): Tesseract OCR compiled directly into the WASM binary via kreuzberg-tesseract, enabling OCR in all environments (Browser, Node.js, Deno, Bun) without browser-specific APIs. Supports 43 languages with tessdata downloaded from CDN into memory.
  • WASM Node.js/Deno PDFium support: PDFium initialization now works in Node.js and Deno by loading the WASM module from the filesystem. Configurable via KREUZBERG_PDFIUM_PATH environment variable.
  • WASM full-feature build: OCR, Excel, and archive extraction are now enabled by default in the WASM package. All wasm-pack build targets include the ocr-wasm feature.
  • WASM Excel extraction (excel-wasm feature): Calamine-based Excel/spreadsheet extraction available in WASM without requiring Tokio runtime.
  • WASM archive extraction: ZIP, TAR, 7z, and GZIP archive extraction now available in WASM via synchronous extractor implementations.
  • WASM PDF annotations: PDF annotations (text notes, highlights, links, stamps) are now exposed in the WASM TypeScript API via the annotations field on ExtractionResult.

Fixed

  • DOCX equations not extracted: OMML math content was completely ignored by the DOCX parser, causing all equation text to be silently dropped. Math runs are now extracted as regular text.
  • DOCX line breaks ignored: <w:br/> elements were not handled, causing adjacent text segments to merge. Line breaks now insert whitespace.
  • PPTX/PPSX table content lost: Tables were rendered as HTML without whitespace between tags, causing the entire table to tokenize as a single unreadable blob. Tables now render as markdown pipe tables with proper cell separation.
  • PPTX/PPSX/PPTM image markers pollute text: Image references injected spurious numeric tokens into extracted content. Image markers now use a clean ![image]() format.
  • DOCX image markers pollute text: Drawing references injected spurious numeric tokens. Changed to ![alt](image).
  • EPUB double-lossy conversion: XHTML content was converted through an XHTML-to-markdown-to-plain-text pipeline, losing content at each stage. Replaced with direct roxmltree traversal that extracts text content from XHTML elements without intermediate markdown.
  • Excel float formatting drops numeric precision: format_cell_to_string() formatted whole-number floats as "1.0" instead of "1", causing numeric token mismatches in quality scoring.
  • HTML metadata extraction pollutes content: The extract_metadata option was left enabled, causing YAML frontmatter to be prepended to the content string. Set extract_metadata = false in the metadata extraction path.
  • Markdown extractor loses tokens through AST reconstruction: Now returns raw text content directly (after frontmatter extraction) while still parsing the AST for table and image extraction.
  • SVG text extraction includes element prefixes: SVG extraction now targets only text-bearing elements without prefixes.
  • XML ground truth uses raw source: Regenerated all 20 ground truth files.
  • Elixir benchmark UTF-8 locale: Erlang VM running with latin1 native encoding corrupted UTF-8 strings from Rust NIFs. Added ERL_LIBS path configuration.
  • WASM OCR not working (enableOcr() regression): The function now bridges both JS-side and Rust-side registries so OCR works end-to-end.
  • WASM tessdata CDN URL returns 404: Updated to use the official tesseract-ocr/tessdata_fast GitHub repository.
  • XML UTF-16 parsing fails on files with odd byte count: The decoder now truncates to the nearest even byte boundary.
  • R bindings crash on strings with embedded NUL bytes: NUL bytes are now stripped before passing strings to R.
  • R bindings %||% operator incompatible with R < 4.4: Added a package-local polyfill for backwards compatibility.
  • API returns HTTP 500 for unsupported file formats (#414): UnsupportedFormat errors are now mapped to HTTP 400 with a clear UnsupportedFormatError response.
  • PDF markdown extraction missing headings/bold for flat structure trees (#391): Pages with font size variation but no heading tags are now enriched via K-means font-size clustering.
  • PaddleOCR backend not found when using backend="paddleocr" (#403): The OCR backend registry now resolves the "paddleocr" alias to the canonical "paddle-ocr" name.
  • WASM metadata serialization: Switched from serde_wasm_bindgen to serde_json + JSON.parse() for output serialization.
  • WASM config deserialization: Config keys are now converted to snake_case before passing to the WASM boundary.
  • WASM PDFium module loading: The build script now locates and copies the actual PDFium ESM module from the Cargo build output.
  • Email header extraction loses display names: From, To, CC, and BCC fields now use "Display Name" <email@example.com> format when a display name is available.
  • Email date header normalized to RFC 3339: Now preserves the raw Date header value and only falls back to RFC 3339 normalization when unavailable.
  • Docker builds fail due to missing snippet-runner exclusion: Added snippet-runner to the sed exclusion patterns in all three Dockerfiles.
  • WASM Deno e2e tests skip OCR fixtures: The e2e generator now calls enableOcr() after initWasm() in every generated test file.
  • WASM Deno e2e tests ignore pages config: Added mapPageConfig() to the test helper template.
  • C FFI NULL callback crash: Reject NULL callback function pointers in plugin registration to prevent segfaults.
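
The Excel float formatting fix listed above can be illustrated with a minimal sketch. The real format_cell_to_string() lives in the Rust core and its body is not shown in these notes, so the Python function below (format_cell, a hypothetical stand-in) only demonstrates the intended behaviour: whole-number floats render without a trailing ".0".

```python
def format_cell(value: float) -> str:
    """Render a spreadsheet float the way a reader would expect to see it.

    Hypothetical stand-in for the Rust format_cell_to_string(); the name and
    signature are assumptions made for illustration only.
    """
    # float.is_integer() is False for inf and nan, so those fall through
    # to the default string conversion below.
    if value.is_integer():
        return str(int(value))
    return str(value)
```

With this rule, a cell holding 1.0 yields the token "1", while 1.5 stays "1.5", which is what keeps numeric tokens aligned with the ground truth during quality scoring.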
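
The XML UTF-16 fix listed above (truncating to the nearest even byte boundary) can be sketched as follows. This is an illustration in Python rather than the actual Rust decoder, and the function name decode_utf16_lenient is an assumption.

```python
def decode_utf16_lenient(data: bytes) -> str:
    """Decode UTF-16 bytes, dropping a dangling trailing byte first.

    Illustrative only: the function name is hypothetical and the real
    decoder is implemented in Rust.
    """
    # UTF-16 code units are two bytes wide, so an odd-length buffer ends
    # with one byte that can never form a complete unit.
    even_len = len(data) - (len(data) % 2)
    # The "utf-16" codec honours a byte order mark when one is present.
    return data[:even_len].decode("utf-16", errors="replace")
```

A strict decode of the untruncated buffer would fail with a "truncated data" error; trimming the final byte lets the rest of the document through.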
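
The K-means font-size clustering mentioned in the PDF markdown fix (#391) can be sketched roughly as below. The notes do not describe the actual algorithm details, so everything here is an assumption: the number of clusters, the evenly spaced initialization, and the rule that the most common cluster is body text while larger clusters become headings.

```python
from collections import Counter


def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's k-means on scalars; returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    if k <= 1 or lo == hi:
        return [sum(values) / len(values)], [0] * len(values)
    # Deterministic init: spread centroids evenly across the value range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every value.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: mean of each cluster (empty clusters keep their centroid).
        updated = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def tag_headings(spans, k=3):
    """spans: list of (text, font_size) pairs. Returns markdown lines.

    Assumption: the most populous cluster is body text, and every cluster
    with a larger centroid is a heading (largest size = level 1).
    """
    if not spans:
        return []
    sizes = [size for _, size in spans]
    centroids, labels = kmeans_1d(sizes, min(k, len(set(sizes))))
    body = Counter(labels).most_common(1)[0][0]
    heading_clusters = sorted(
        (c for c in set(labels) if centroids[c] > centroids[body]),
        key=lambda c: -centroids[c],
    )
    level = {cluster: i + 1 for i, cluster in enumerate(heading_clusters)}
    return [
        f"{'#' * level[label]} {text}" if label in level else text
        for (text, _), label in zip(spans, labels)
    ]
```

For a page with 24 pt, 16 pt, and 11 pt text and no structure tags, this sketch would promote the 24 pt span to a level-1 heading and the 16 pt span to level 2, leaving the 11 pt spans as body text.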
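
The WASM config deserialization fix above converts config keys to snake_case before they cross the WASM boundary. A minimal sketch of such a conversion, assuming camelCase input keys and nested objects, might look like this. The key names in the usage note are illustrative and not taken from the kreuzberg config schema.

```python
import re

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(key: str) -> str:
    """Convert a camelCase key to snake_case; snake_case input is unchanged."""
    return _CAMEL_BOUNDARY.sub("_", key).lower()


def snake_case_keys(value):
    """Recursively rewrite dict keys, leaving non-dict values untouched."""
    if isinstance(value, dict):
        return {to_snake_case(k): snake_case_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [snake_case_keys(item) for item in value]
    return value
```

For example, a hypothetical config of {"forceOcr": True, "chunking": {"maxChars": 100}} would become {"force_ocr": True, "chunking": {"max_chars": 100}}.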

Removed

  • polars dependency: Removed unused polars crate and table_from_arrow_to_markdown dead code from the excel feature.
