github kreuzberg-dev/kreuzberg v4.2.2

5 hours ago

Changed

API Alignment - 1:1 Parity Across All Bindings

  • Strict API parity enforcement: All 9 language bindings now have exact 1:1 field parity with Rust core
    • Verification script (scripts/verify_api_parity.py) now runs in STRICT mode, failing on ANY field differences
    • Added to ci-validate.yaml workflow to prevent future API drift

PHP Bindings

  • ExtractionConfig alignment: Removed 5 non-canonical fields and fixed defaults
    • Removed: embedding, extractImages, extractTables, preserveFormatting, outputEncoding
    • Fixed defaults: useCache → true, enableQualityProcessing → true, maxConcurrentExtractions → null
    • Updated ExtractionConfigBuilder to match canonical API
    • All 16 fields now match Rust canonical source exactly

Go Bindings

  • ExtractionResult alignment: Removed Success field (not in Rust canonical)
  • PageInfo alignment: Removed Visible and ContentType fields (not in Rust canonical)
  • Updated 14 test files to remove references to removed fields

Ruby Bindings

  • Default value fix: Changed enable_quality_processing default from false to true to match Rust

Java Bindings

  • Default value fix: Changed enableQualityProcessing default from false to true to match Rust

TypeScript Bindings

  • Type exports cleanup: Removed non-existent type exports (EmbeddingConfig, EmbeddingModelType, HierarchyConfig, ImagePreprocessingConfig) from index.ts

Fixed

Elixir Bindings

  • Hex package compilation: Fixed force_build: true causing production installs to fail (#333)
    • Changed to force_build: Mix.env() in [:test, :dev] to only build from source in development
    • Production installs now correctly use precompiled NIF binaries from GitHub releases

Docker Images

  • Tesseract OCR plugin initialization: Fixed "OCR backend 'tesseract' not registered" error in published Docker images
  • Embeddings plugin initialization: Fixed "Failed to initialize embedding model" error

API

  • JSON error responses: Added custom JsonApi extractor for consistent JSON error responses
  • OpenAPI schema improvements: Enhanced schema validation constraints
  • Chunking validation: Added validation that overlap must be less than max_characters
  • Embed validation: Added validation that all text entries must be non-empty strings
  • Default embedding model: EmbeddingConfig.model now defaults to "balanced" preset

CI/CD

  • Schemathesis API contract testing: Added schemathesis to Docker CI workflow

Rust Core

  • XLSX OOM with Excel Solver files: Fixed out-of-memory issue when processing XLSX files with sparse data at extreme cell positions (#331)

Don't miss a new kreuzberg release

NewReleases is sending notifications on new releases.