Added
- R language bindings: Added the kreuzberg R package via extendr with full extraction API (sync/async, batch, bytes), typed error conditions, S3 result class with accessors, config discovery, OCR/chunking configuration, plugin system, and 32 documentation snippets.
- PHP async extraction: Non-blocking extraction via the `DeferredResult` pattern with a Tokio thread pool. Includes `extractFileAsync()`, `extractBytesAsync()`, `batchExtractFilesAsync()`, and `batchExtractBytesAsync()` across the OOP, procedural, and static APIs. Framework bridges for Amp v3+ (`AmpBridge`) and ReactPHP (`ReactBridge`).
- C FFI distribution: Official C shared library (`libkreuzberg`) with cbindgen-generated header, CMake packaging (`find_package(kreuzberg)`), pkg-config support, and prebuilt binaries for Linux x86_64/aarch64, macOS arm64, and Windows x86_64. Includes full API reference documentation and test coverage.
- Go FFI bindings: Go package (`packages/go/v4`) consuming the C FFI shared library, with prebuilt binaries published as GitHub release assets for all four platforms.
- C as 13th e2e test language: The e2e-generator now produces C test files exercising the FFI API, with 15 passing test cases.
- R distribution via r-universe: Switched R package distribution from CRAN to r-universe for faster release cycles and easier native compilation.
- WASM native OCR (`ocr-wasm` feature): Tesseract OCR compiled directly into the WASM binary via `kreuzberg-tesseract`, enabling OCR in all environments (Browser, Node.js, Deno, Bun) without browser-specific APIs. Supports 43 languages with tessdata downloaded from CDN into memory.
- WASM Node.js/Deno PDFium support: PDFium initialization now works in Node.js and Deno by loading the WASM module from the filesystem. Configurable via the `KREUZBERG_PDFIUM_PATH` environment variable.
- WASM full-feature build: OCR, Excel, and archive extraction are now enabled by default in the WASM package. All `wasm-pack build` targets include the `ocr-wasm` feature.
- WASM Excel extraction (`excel-wasm` feature): Calamine-based Excel/spreadsheet extraction available in WASM without requiring the Tokio runtime.
- WASM archive extraction: ZIP, TAR, 7z, and GZIP archive extraction now available in WASM via synchronous extractor implementations.
- WASM PDF annotations: PDF annotations (text notes, highlights, links, stamps) are now exposed in the WASM TypeScript API via the `annotations` field on `ExtractionResult`.
Fixed
- DOCX equations not extracted: OMML math content was completely ignored by the DOCX parser, causing all equation text to be silently dropped. Math runs are now extracted as regular text.
- DOCX line breaks ignored: `<w:br/>` elements were not handled, causing adjacent text segments to merge. Line breaks now insert whitespace.
- PPTX/PPSX table content lost: Tables were rendered as HTML without whitespace between tags, causing the entire table to tokenize as a single unreadable blob. Tables now render as markdown pipe tables with proper cell separation.
- PPTX/PPSX/PPTM image markers pollute text: Image references injected spurious numeric tokens into extracted content. Image markers now use a clean `![image]()` format.
- DOCX image markers pollute text: Drawing references injected spurious numeric tokens. The marker format was changed.
- EPUB double-lossy conversion: XHTML content was converted through an XHTML-to-markdown-to-plain-text pipeline, losing content at each stage. Replaced with direct `roxmltree` traversal that extracts text content from XHTML elements without intermediate markdown.
- Excel float formatting drops numeric precision: `format_cell_to_string()` formatted whole-number floats as `"1.0"` instead of `"1"`, causing numeric token mismatches in quality scoring.
- HTML metadata extraction pollutes content: The `extract_metadata` option was left enabled, causing YAML frontmatter to be prepended to the content string. Set `extract_metadata = false` in the metadata extraction path.
- Markdown extractor loses tokens through AST reconstruction: Now returns raw text content directly (after frontmatter extraction) while still parsing the AST for table and image extraction.
- SVG text extraction includes element prefixes: SVG extraction now targets only text-bearing elements without prefixes.
- XML ground truth uses raw source: Regenerated all 20 ground truth files.
- Elixir benchmark UTF-8 locale: Erlang VM running with `latin1` native encoding corrupted UTF-8 strings from Rust NIFs. Added `ERL_LIBS` path configuration.
- WASM OCR not working (`enableOcr()` regression): The function now bridges both JS-side and Rust-side registries so OCR works end-to-end.
- WASM tessdata CDN URL returns 404: Updated to use the official `tesseract-ocr/tessdata_fast` GitHub repository.
- XML UTF-16 parsing fails on files with odd byte count: The decoder now truncates to the nearest even byte boundary.
- R bindings crash on strings with embedded NUL bytes: NUL bytes are now stripped before passing strings to R.
- R bindings `%||%` operator incompatible with R < 4.4: Added a package-local polyfill for backwards compatibility.
- API returns HTTP 500 for unsupported file formats (#414): `UnsupportedFormat` errors are now mapped to HTTP 400 with a clear `UnsupportedFormatError` response.
- PDF markdown extraction missing headings/bold for flat structure trees (#391): Pages with font size variation but no heading tags are now enriched via K-means font-size clustering.
- PaddleOCR backend not found when using `backend="paddleocr"` (#403): The OCR backend registry now resolves the `"paddleocr"` alias to the canonical `"paddle-ocr"` name.
- WASM metadata serialization: Switched from `serde_wasm_bindgen` to `serde_json` + `JSON.parse()` for output serialization.
- WASM config deserialization: Config keys are now converted to snake_case before passing to the WASM boundary.
- WASM PDFium module loading: The build script now locates and copies the actual PDFium ESM module from the Cargo build output.
- Email header extraction loses display names: From, To, CC, and BCC fields now use `"Display Name" <email@example.com>` format when a display name is available.
- Email date header normalized to RFC 3339: Now preserves the raw `Date` header value and only falls back to RFC 3339 normalization when unavailable.
- Docker builds fail due to missing snippet-runner exclusion: Added `snippet-runner` to the sed exclusion patterns in all three Dockerfiles.
- WASM Deno e2e tests skip OCR fixtures: The e2e generator now calls `enableOcr()` after `initWasm()` in every generated test file.
- WASM Deno e2e tests ignore pages config: Added `mapPageConfig()` to the test helper template.
- C FFI NULL callback crash: Reject NULL callback function pointers in plugin registration to prevent segfaults.
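
A few of the fixes above describe small, self-contained behaviors. The Python sketch below illustrates four of them: whole-number float formatting, UTF-16 truncation to an even byte boundary, OCR backend alias resolution, and email address formatting. It is not the library's implementation (the core is Rust). All function names here (`format_cell`, `decode_utf16_lenient`, `resolve_backend`, `format_address`) are hypothetical, and the little-endian default and the bare-address fallback are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical alias table; only the "paddleocr" -> "paddle-ocr" mapping
# comes from the changelog entry.
BACKEND_ALIASES = {"paddleocr": "paddle-ocr"}


def format_cell(value: float) -> str:
    """Render whole-number floats without a trailing '.0' (1.0 -> '1')."""
    if value.is_integer():  # False for inf and NaN, so those fall through
        return str(int(value))
    return str(value)


def decode_utf16_lenient(data: bytes, encoding: str = "utf-16-le") -> str:
    """Drop a dangling final byte so the length is even, then decode.

    The little-endian default is an assumption for this sketch.
    """
    even_len = len(data) - (len(data) % 2)
    return data[:even_len].decode(encoding)


def resolve_backend(name: str) -> str:
    """Map a known alias to its canonical backend name; pass others through."""
    return BACKEND_ALIASES.get(name, name)


def format_address(display_name: Optional[str], email: str) -> str:
    """Use '"Display Name" <email>' when a display name is available.

    Falling back to the bare address is an assumption for this sketch.
    """
    if display_name:
        return f'"{display_name}" <{email}>'
    return email
```

For instance, `format_cell(1.0)` yields `"1"` while `format_cell(1.5)` stays `"1.5"`, and `resolve_backend("paddleocr")` yields `"paddle-ocr"`.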
Removed
- `polars` dependency: Removed unused `polars` crate and `table_from_arrow_to_markdown` dead code from the `excel` feature.