github kreuzberg-dev/kreuzberg v4.0.0-rc.6

pre-release, one day ago

[4.0.0-rc.6] - 2025-12-07

Release Candidate 6 - FFI Core Feature & CI/Build Improvements

New Features

FFI Bindings:

  • Added core feature for kreuzberg-ffi without embeddings support
    • Provides lightweight FFI build option excluding ONNX Runtime dependency
    • Enables Windows MinGW compatibility for Go bindings
    • Includes HTML processing and all document extraction features
    • Use --no-default-features --features core for MinGW builds
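A minimal sketch of how a crate can tell the two build flavors apart at compile time. The feature name `embeddings` is taken from the notes above; the function `build_flavor` and the labels it returns are hypothetical and are not part of kreuzberg-ffi.

```rust
/// Reports which flavor this binary was compiled as.
/// `cfg!` is evaluated at compile time, so the embeddings branch is only
/// taken when the crate was built with the `embeddings` feature enabled.
fn build_flavor() -> &'static str {
    if cfg!(feature = "embeddings") {
        "full" // hypothetical label: built with embeddings (needs ONNX Runtime)
    } else {
        "core" // hypothetical label: e.g. --no-default-features --features core
    }
}
```

Compiled without the `embeddings` feature, as in the MinGW case described above, `build_flavor()` returns "core".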

Bug Fixes

ODT Extraction:

  • Fixed ODT table extraction producing duplicate content
    • Table cells were being extracted twice: once as markdown tables (correct) and again as raw paragraphs (incorrect)
    • Root cause: XML traversal using .descendants() included nested table cell content as document-level text
    • Solution: Changed to only process direct children of <office:text> element, isolating table content
    • Impact: ODT extraction now produces clean output without cell duplication
  • Enhanced ODT metadata extraction to match Office Open XML capabilities
    • Added comprehensive metadata extraction from meta.xml (OpenDocument standard)
    • New OdtProperties struct supports all OpenDocument metadata fields
    • Extracts: title, subject, creator, initial-creator, keywords, description, dates, language
    • Document statistics: page count, word count, character count, paragraph count, table count, image count
    • Metadata extraction now consistent between ODT, DOCX, XLSX, and PPTX formats
    • Impact: ODT files now provide rich metadata comparable to other Office formats
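The table duplication fix described above comes down to walking only the direct children of the document body instead of every descendant. The sketch below illustrates that difference on a toy tree. The `Node` type and all function names are hypothetical stand-ins, not the extractor's actual code, and the table rendering is reduced to a single markdown row.

```rust
/// Toy stand-in for the children of <office:text> (hypothetical type).
enum Node {
    /// A text paragraph.
    Paragraph(String),
    /// A single-row table; each cell holds its own nested nodes.
    Table(Vec<Vec<Node>>),
}

/// Collects paragraph text from every descendant, including table cells.
fn paragraphs_from_descendants(nodes: &[Node], out: &mut Vec<String>) {
    for node in nodes {
        match node {
            Node::Paragraph(text) => out.push(text.clone()),
            Node::Table(cells) => {
                for cell in cells {
                    paragraphs_from_descendants(cell, out);
                }
            }
        }
    }
}

/// Renders a table as one markdown row, e.g. "| A | B |".
fn render_table(cells: &[Vec<Node>]) -> String {
    let texts: Vec<String> = cells
        .iter()
        .map(|cell| {
            let mut parts = Vec::new();
            paragraphs_from_descendants(cell, &mut parts);
            parts.join(" ")
        })
        .collect();
    format!("| {} |", texts.join(" | "))
}

/// Old behavior: tables are rendered, then every descendant paragraph is
/// emitted as well, so cell text shows up a second time.
fn extract_buggy(body: &[Node]) -> Vec<String> {
    let mut out = Vec::new();
    for node in body {
        if let Node::Table(cells) = node {
            out.push(render_table(cells));
        }
    }
    paragraphs_from_descendants(body, &mut out);
    out
}

/// New behavior: only direct children are visited, and each one is
/// emitted exactly once, as a paragraph or as a rendered table.
fn extract_fixed(body: &[Node]) -> Vec<String> {
    body.iter()
        .map(|node| match node {
            Node::Paragraph(text) => text.clone(),
            Node::Table(cells) => render_table(cells),
        })
        .collect()
}
```

For a body holding one paragraph followed by a two-cell table, `extract_fixed` yields two entries, while `extract_buggy` yields four because both cell texts are repeated as raw paragraphs.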

Go Bindings:

  • Fixed Windows MinGW builds by disabling embeddings feature
    • Windows ONNX Runtime only provides MSVC .lib files incompatible with MinGW
    • Go bindings on Windows now use core feature (no embeddings)
    • Full features (including embeddings) remain available on Linux, macOS, and Windows MSVC
  • Fixed test execution to use test_documents instead of .kreuzberg cache
    • Ensures reproducible test runs without relying on user cache directory
    • Improves CI/CD reliability and test isolation

CI/CD Infrastructure:

  • Upgraded upload-artifact from v4 to v5 for compatibility with download-artifact@v6
    • Fixes artifact version mismatch causing benchmark and CI failures
    • Affects 10 workflow files with 42 total changes
    • Resolves "artifact not found" errors in multi-job workflows
  • Fixed RUSTFLAGS handling in setup-onnx-runtime action
    • Now appends to existing RUSTFLAGS instead of overwriting
    • Preserves -C target-feature=+crt-static for Windows GNU builds
  • Fixed Go Windows CI artifact download path causing linker failures
    • Changed download path from target to . to prevent double-nesting (target/target/...)
    • Linker can now find libkreuzberg_ffi.dll at correct location
    • Added debug logging to show directory structure after artifact download
  • Aligned all workflows to Java 25
    • Updated from Java 24 to 25 across all CI and publish workflows
    • Resolves "release version 25 not supported" compilation errors
    • Affects ci-validate, ci-java, publish, and benchmarks workflows
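The RUSTFLAGS fix above is a matter of appending to the existing value rather than assigning over it. These notes do not show how the setup-onnx-runtime action implements this, so the following is only a sketch of the logic; the function name `append_rustflags` and the linker flag in the usage note are hypothetical.

```rust
/// Returns the new RUSTFLAGS value: whatever was already set (if anything),
/// followed by the extra flags. An unset or blank variable yields just `extra`.
fn append_rustflags(existing: Option<&str>, extra: &str) -> String {
    match existing.map(str::trim) {
        // Keep the caller's flags and add ours after a single space.
        Some(current) if !current.is_empty() => format!("{} {}", current, extra),
        // Nothing to preserve.
        _ => extra.to_string(),
    }
}
```

With an existing value of `-C target-feature=+crt-static` and a hypothetical extra flag `-L native=/opt/onnxruntime/lib`, the result keeps the static CRT flag in front of the new one, which is the behavior the Windows GNU builds depend on.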

Ruby Bindings:

  • Fixed rb-sys links conflict in gem build
    • Removed rb-sys vendoring, now uses version 0.9.119 from crates.io
    • Resolves Cargo error: "package rb-sys links to native library rb, but it conflicts with previous package"
    • Allows Cargo to unify rb-sys dependency across magnus and kreuzberg-rb

C# E2E Tests:

  • Fixed OCR tests failing with empty content
    • Added render_config_expression function to C# E2E generator
    • Tests now pass proper OCR config JSON instead of null
    • Regenerated all C# tests with tesseract backend configuration
  • Fixed the metadata array "contains" assertion when the expected value is a single element of the array
    • Extended ValueContains method to handle value-in-array case
    • Fixes sheet_names metadata assertions in Excel tests
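The value-in-array case mentioned above can be sketched as follows. The real ValueContains helper lives in the C# test code and is not reproduced in these notes; this is a Rust illustration of the matching logic only. The `Value` type, the function `value_contains`, and the substring rule for plain text are all assumptions.

```rust
/// Hypothetical, minimal model of a metadata value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Text(String),
    List(Vec<Value>),
}

/// Returns true when `actual` contains `expected`.
fn value_contains(actual: &Value, expected: &Value) -> bool {
    match (actual, expected) {
        // Array vs. array: every expected item must be present.
        (Value::List(items), Value::List(wanted)) => {
            wanted.iter().all(|w| items.contains(w))
        }
        // Array vs. single value: the value-in-array case.
        (Value::List(items), single) => items.contains(single),
        // Text vs. text: substring match (an assumption for this sketch).
        (Value::Text(a), Value::Text(e)) => a.contains(e.as_str()),
        // Text can never contain an array.
        _ => false,
    }
}
```

A `sheet_names` list holding "Sheet1" and "Sheet2" then satisfies a check for the single value "Sheet1" and fails a check for "Sheet3".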

Python Bindings:

  • Fixed missing format_type in text extraction metadata
    • TypstExtractor and LatexExtractor incorrectly claimed text/plain MIME type
    • Removed text/plain from both extractors' supported types
    • PlainTextExtractor now correctly handles text/plain with proper TextMetadata
    • Metadata now includes format_type, line_count, word_count, character_count
    • Added unit test for Metadata serialization to verify format field flattening
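As an illustration of the plain-text counts listed above, here is one way such values could be derived from a string. The field names come from these notes, but the struct layout, the function `text_metadata`, and the counting rules (lines split on newlines, words split on whitespace, characters counted as Unicode scalar values) are assumptions and may differ from what PlainTextExtractor does. The `format_type` field is left out of the sketch.

```rust
/// Hypothetical container for the plain-text counts.
struct TextMetadata {
    line_count: usize,
    word_count: usize,
    character_count: usize,
}

/// Computes the counts for `text` under the assumptions stated above.
fn text_metadata(text: &str) -> TextMetadata {
    TextMetadata {
        // `lines` splits on "\n" and "\r\n"; a trailing newline adds no extra line.
        line_count: text.lines().count(),
        // Any run of Unicode whitespace separates words.
        word_count: text.split_whitespace().count(),
        // Counts Unicode scalar values, not bytes.
        character_count: text.chars().count(),
    }
}
```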
