[4.0.0-rc.6] - 2025-12-07
Release Candidate 6 - FFI Core Feature & CI/Build Improvements
New Features
FFI Bindings:
- Added `core` feature for kreuzberg-ffi without embeddings support
- Provides lightweight FFI build option excluding ONNX Runtime dependency
- Enables Windows MinGW compatibility for Go bindings
- Includes HTML processing and all document extraction features
- Use `--no-default-features --features core` for MinGW builds
Bug Fixes
ODT Extraction:
- Fixed ODT table extraction producing duplicate content
- Table cells were being extracted twice: once as markdown tables (correct) and again as raw paragraphs (incorrect)
- Root cause: XML traversal using `.descendants()` included nested table cell content as document-level text
- Solution: Changed to only process direct children of `<office:text>` element, isolating table content
- Impact: ODT extraction now produces clean output without cell duplication
- Enhanced ODT metadata extraction to match Office Open XML capabilities
- Added comprehensive metadata extraction from `meta.xml` (OpenDocument standard)
- New `OdtProperties` struct supports all OpenDocument metadata fields
- Extracts: title, subject, creator, initial-creator, keywords, description, dates, language
- Document statistics: page count, word count, character count, paragraph count, table count, image count
- Metadata extraction now consistent between ODT, DOCX, XLSX, and PPTX formats
- Impact: ODT files now provide rich metadata comparable to other Office formats
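The duplicate-content fix for ODT tables comes down to the difference between walking every descendant and walking only direct children. The sketch below illustrates that difference on a tiny hand-built tree. The `Node` type and all function names are hypothetical stand-ins for illustration, not kreuzberg's actual extractor code or its XML library.

```rust
// Minimal stand-in for a parsed XML tree (hypothetical, not kreuzberg's type).
struct Node {
    tag: &'static str,
    text: &'static str,
    children: Vec<Node>,
}

impl Node {
    fn new(tag: &'static str, text: &'static str, children: Vec<Node>) -> Self {
        Node { tag, text, children }
    }
}

/// Collects paragraph text from every descendant. This also picks up
/// paragraphs nested inside table cells, which is the duplicating behaviour.
fn paragraphs_from_descendants(node: &Node, out: &mut Vec<String>) {
    for child in &node.children {
        if child.tag == "text:p" {
            out.push(child.text.to_string());
        }
        paragraphs_from_descendants(child, out);
    }
}

/// Collects paragraph text from direct children only, leaving table
/// content to whatever code renders the table itself.
fn paragraphs_from_direct_children(node: &Node) -> Vec<String> {
    node.children
        .iter()
        .filter(|child| child.tag == "text:p")
        .map(|child| child.text.to_string())
        .collect()
}

/// Builds an <office:text> element holding one paragraph and one
/// single-cell table.
fn sample_document() -> Node {
    Node::new(
        "office:text",
        "",
        vec![
            Node::new("text:p", "Intro", vec![]),
            Node::new(
                "table:table",
                "",
                vec![Node::new(
                    "table:table-row",
                    "",
                    vec![Node::new(
                        "table:table-cell",
                        "",
                        vec![Node::new("text:p", "Cell A1", vec![])],
                    )],
                )],
            ),
        ],
    )
}
```

With the descendant walk, the cell paragraph is returned alongside the top-level paragraph, so it would appear once in the rendered table and again as loose text. The direct-children walk returns only the top-level paragraph.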
Go Bindings:
- Fixed Windows MinGW builds by disabling embeddings feature
- Windows ONNX Runtime only provides MSVC .lib files incompatible with MinGW
- Go bindings on Windows now use `core` feature (no embeddings)
- Full features (including embeddings) remain available on Linux, macOS, and Windows MSVC
- Fixed test execution to use `test_documents` instead of `.kreuzberg` cache
- Ensures reproducible test runs without relying on user cache directory
- Improves CI/CD reliability and test isolation
CI/CD Infrastructure:
- Upgraded `upload-artifact` from v4 to v5 for compatibility with `download-artifact@v6`
- Fixes artifact version mismatch causing benchmark and CI failures
- Affects 10 workflow files with 42 total changes
- Resolves "artifact not found" errors in multi-job workflows
- Fixed RUSTFLAGS handling in `setup-onnx-runtime` action
- Now appends to existing RUSTFLAGS instead of overwriting
- Preserves `-C target-feature=+crt-static` for Windows GNU builds
- Fixed Go Windows CI artifact download path causing linker failures
- Changed download path from `target` to `.` to prevent double-nesting (`target/target/...`)
- Linker can now find libkreuzberg_ffi.dll at correct location
- Added debug logging to show directory structure after artifact download
- Aligned all workflows to Java 25
- Updated from Java 24 to 25 across all CI and publish workflows
- Resolves "release version 25 not supported" compilation errors
- Affects ci-validate, ci-java, publish, and benchmarks workflows
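The RUSTFLAGS fix above is an append-instead-of-overwrite change. The action itself is a CI step whose implementation is not shown here, so the following is only a Rust sketch of the same logic. The function name `append_rustflags` and the example linker path are assumptions made for illustration.

```rust
/// Returns the new RUSTFLAGS value: the extra flags appended to whatever
/// was already set, instead of replacing it. An unset or blank existing
/// value yields just the extra flags.
fn append_rustflags(existing: Option<&str>, extra: &str) -> String {
    match existing.map(str::trim) {
        Some(current) if !current.is_empty() => format!("{} {}", current, extra),
        _ => extra.to_string(),
    }
}
```

Called with `Some("-C target-feature=+crt-static")` and a hypothetical `-L native=/opt/onnx/lib`, this keeps the static CRT flag in front of the new linker search path. An overwrite would silently drop it, which is the failure mode the entry describes for Windows GNU builds.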
Ruby Bindings:
- Fixed rb-sys links conflict in gem build
- Removed rb-sys vendoring; now uses version 0.9.119 from crates.io
- Resolves Cargo error: "package rb-sys links to native library rb, but it conflicts with previous package"
- Allows Cargo to unify rb-sys dependency across magnus and kreuzberg-rb
C# E2E Tests:
- Fixed OCR tests failing with empty content
- Added render_config_expression function to C# E2E generator
- Tests now pass proper OCR config JSON instead of null
- Regenerated all C# tests with tesseract backend configuration
- Fixed metadata array contains assertion for single value in array
- Extended ValueContains method to handle value-in-array case
- Fixes sheet_names metadata assertions in Excel tests
Python Bindings:
- Fixed missing format_type in text extraction metadata
- TypstExtractor and LatexExtractor incorrectly claimed text/plain MIME type
- Removed text/plain from both extractors' supported types
- PlainTextExtractor now correctly handles text/plain with proper TextMetadata
- Metadata now includes format_type, line_count, word_count, character_count
- Added unit test for Metadata serialization to verify format field flattening