kreuzberg-dev/kreuzberg v4.0.0-rc.6 on GitHub

[4.0.0-rc.6] - 2025-12-07

FFI Bindings:

Added core feature for kreuzberg-ffi without embeddings support
- Provides lightweight FFI build option excluding ONNX Runtime dependency
- Enables Windows MinGW compatibility for Go bindings
- Includes HTML processing and all document extraction features
- Use --no-default-features --features core for MinGW builds
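
As a rough illustration of how code can adapt to a build made with --no-default-features --features core, the sketch below checks a Cargo feature at compile time. The feature name "embeddings" and both function names are assumptions made for this example; they are not taken from the kreuzberg-ffi source.

```rust
/// Reports whether this build was compiled with embeddings support.
/// The feature name "embeddings" is an assumption for illustration.
#[allow(unexpected_cfgs)]
pub fn embeddings_enabled() -> bool {
    // cfg! is resolved at compile time from the enabled Cargo features.
    cfg!(feature = "embeddings")
}

/// Hypothetical helper that names the kind of build for diagnostics.
pub fn describe_build() -> &'static str {
    if embeddings_enabled() {
        "full (with embeddings)"
    } else {
        "core (no embeddings)"
    }
}
```

In a build without the assumed feature, describe_build() reports the core variant, which is the configuration the MinGW notes above describe.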

ODT Extraction:

Fixed ODT table extraction producing duplicate content
- Table cells were being extracted twice: once as markdown tables (correct) and again as raw paragraphs (incorrect)
- Root cause: XML traversal using .descendants() included nested table cell content as document-level text
- Solution: Changed to only process direct children of <office:text> element, isolating table content
- Impact: ODT extraction now produces clean output without cell duplication
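
The difference between the two traversals can be sketched with a toy element tree. The Node type and function names below are hypothetical stand-ins, not the XML parser or code kreuzberg uses; only the element names (office:text, text:p, table:table) come from the OpenDocument format.

```rust
/// Minimal stand-in for an XML element; not the parser type kreuzberg uses.
#[derive(Debug, Clone)]
pub struct Node {
    pub tag: &'static str,
    pub text: &'static str,
    pub children: Vec<Node>,
}

impl Node {
    pub fn new(tag: &'static str, text: &'static str, children: Vec<Node>) -> Self {
        Node { tag, text, children }
    }
}

/// Collects every text:p anywhere below `node`, including those inside
/// table cells (the behaviour that caused duplicated cell content).
pub fn paragraphs_from_descendants(node: &Node, out: &mut Vec<&'static str>) {
    for child in &node.children {
        if child.tag == "text:p" {
            out.push(child.text);
        }
        paragraphs_from_descendants(child, out);
    }
}

/// Collects only text:p elements that are direct children of `node`,
/// so paragraphs nested in tables are left to the table handler.
pub fn paragraphs_from_direct_children(node: &Node) -> Vec<&'static str> {
    node.children
        .iter()
        .filter(|c| c.tag == "text:p")
        .map(|c| c.text)
        .collect()
}

/// Builds a small document: one paragraph followed by a one-cell table.
pub fn sample_office_text() -> Node {
    let cell = Node::new(
        "table:table-cell",
        "",
        vec![Node::new("text:p", "cell A1", vec![])],
    );
    let row = Node::new("table:table-row", "", vec![cell]);
    let table = Node::new("table:table", "", vec![row]);
    Node::new(
        "office:text",
        "",
        vec![Node::new("text:p", "intro", vec![]), table],
    )
}
```

On the sample document, the descendant walk yields both "intro" and "cell A1", while the direct-children walk yields only "intro".
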
Enhanced ODT metadata extraction to match Office Open XML capabilities
- Added comprehensive metadata extraction from meta.xml (OpenDocument standard)
- New OdtProperties struct supports all OpenDocument metadata fields
- Extracts: title, subject, creator, initial-creator, keywords, description, dates, language
- Document statistics: page count, word count, character count, paragraph count, table count, image count
- Metadata extraction now consistent between ODT, DOCX, XLSX, and PPTX formats
- Impact: ODT files now provide rich metadata comparable to other Office formats

Go Bindings:

Fixed Windows MinGW builds by disabling embeddings feature
- ONNX Runtime for Windows ships only MSVC .lib files, which are incompatible with MinGW
- Go bindings on Windows now use core feature (no embeddings)
- Full features (including embeddings) remain available on Linux, macOS, and Windows MSVC
Fixed test execution to use test_documents instead of .kreuzberg cache
- Ensures reproducible test runs without relying on user cache directory
- Improves CI/CD reliability and test isolation

CI/CD Infrastructure:

Upgraded upload-artifact from v4 to v5 for compatibility with download-artifact@v6
- Fixes artifact version mismatch causing benchmark and CI failures
- Affects 10 workflow files with 42 total changes
- Resolves "artifact not found" errors in multi-job workflows
Fixed RUSTFLAGS handling in setup-onnx-runtime action
- Now appends to existing RUSTFLAGS instead of overwriting
- Preserves -C target-feature=+crt-static for Windows GNU builds
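
The append-instead-of-overwrite behaviour can be sketched as a small pure function. The real action is a CI step rather than Rust code, so this is only an illustration; the function name and the -L path used in the usage note are hypothetical.

```rust
/// Returns the new RUSTFLAGS value: `extra` appended to `existing` when
/// `existing` is set and non-empty, otherwise just `extra`.
pub fn append_rustflags(existing: Option<&str>, extra: &str) -> String {
    match existing.map(str::trim) {
        // Keep whatever was already set, then add the new flags.
        Some(current) if !current.is_empty() => format!("{current} {extra}"),
        // Nothing usable was set, so the new flags stand alone.
        _ => extra.to_string(),
    }
}
```

For example, appending a hypothetical "-L native=/opt/onnxruntime/lib" to an existing "-C target-feature=+crt-static" keeps the static CRT flag in front of the new linker search path.
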
Fixed Go Windows CI artifact download path causing linker failures
- Changed download path from target to . to prevent double-nesting (target/target/...)
- Linker can now find libkreuzberg_ffi.dll at correct location
- Added debug logging to show directory structure after artifact download
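
The double-nesting can be reproduced with a plain path join, assuming the artifact stores its files under a leading target/ directory. The release/ segment and the function name below are hypothetical and used only to make the example concrete.

```rust
use std::path::{Path, PathBuf};

/// Joins the download directory with the relative path stored inside the
/// artifact, mirroring where an extracted file would end up on disk.
pub fn extracted_path(download_dir: &str, path_in_artifact: &str) -> PathBuf {
    Path::new(download_dir).join(path_in_artifact)
}
```

With a download directory of "target", a file stored as "target/release/libkreuzberg_ffi.dll" lands under target/target/release/, whereas a download directory of "." places it under ./target/release/ where the linker looks.
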
Aligned all workflows to Java 25
- Updated from Java 24 to 25 across all CI and publish workflows
- Resolves "release version 25 not supported" compilation errors
- Affects ci-validate, ci-java, publish, and benchmarks workflows

Ruby Bindings:

Fixed rb-sys links conflict in gem build
- Removed rb-sys vendoring, now uses version 0.9.119 from crates.io
- Resolves Cargo error: "package rb-sys links to native library rb, but it conflicts with previous package"
- Allows Cargo to unify rb-sys dependency across magnus and kreuzberg-rb

C# E2E Tests:

Fixed OCR tests failing with empty content
- Added render_config_expression function to C# E2E generator
- Tests now pass proper OCR config JSON instead of null
- Regenerated all C# tests with tesseract backend configuration
Fixed metadata "contains" assertion when a single expected value is checked against an array
- Extended ValueContains method to handle value-in-array case
- Fixes sheet_names metadata assertions in Excel tests
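
The value-in-array case can be sketched as follows. The actual fix lives in the C# ValueContains method; the Value enum and value_contains function here are hypothetical, and the handling of the other cases (array against array, text against text) is an assumption about what such an assertion helper would typically do.

```rust
/// Simplified stand-in for a JSON-like metadata value.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Text(String),
    List(Vec<Value>),
}

/// Returns true when `actual` contains `expected`.
pub fn value_contains(actual: &Value, expected: &Value) -> bool {
    match (actual, expected) {
        // Array vs array: every expected item must be present.
        (Value::List(items), Value::List(wanted)) => {
            wanted.iter().all(|w| items.contains(w))
        }
        // Array vs single value: the case the fix above adds.
        (Value::List(items), single) => items.contains(single),
        // Text vs text: substring match.
        (Value::Text(a), Value::Text(e)) => a.contains(e.as_str()),
        // A text value cannot contain a list.
        (Value::Text(_), Value::List(_)) => false,
    }
}
```

With sheet_names modelled as a list holding "Sheet1" and "Sheet2", checking for the single value "Sheet1" succeeds, and checking for "Sheet3" fails.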

Python Bindings: