github kreuzberg-dev/kreuzberg v4.2.7

[4.2.7] - 2026-02-01

Added

API

  • OpenAPI schema for /extract endpoint: Implemented utoipa::ToSchema on all extraction result types (ExtractionResult, Metadata, Chunk, ExtractedImage, Element, DjotContent, PageContent, Table, and all nested types), enabling full OpenAPI documentation for the extraction endpoint
  • Unified ChunkingConfig: Merged internal chunking config into a single ChunkingConfig struct with canonical field names (max_characters, overlap) and serde aliases (max_chars, max_overlap) for backwards compatibility. Added trim and chunker_type fields. ChunkerType enum is no longer feature-gated behind chunking

OCR

  • KREUZBERG_OCR_LANGUAGE="all" support: Setting the language to "all" or "*" automatically detects and uses all installed Tesseract languages from the tessdata directory, eliminating manual enumeration (#344)
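
To make the "all" / "*" behaviour concrete, the following is a minimal sketch, not Kreuzberg's actual implementation. It assumes that installed languages are discovered by listing `<lang>.traineddata` files in the tessdata directory and that the result is joined with `+`, the separator Tesseract accepts for multiple languages. The function name `resolve_ocr_languages` is hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not part of the Kreuzberg API): expands "all" or "*"
/// into every language that has a `<lang>.traineddata` file in `tessdata_dir`.
/// Any other value is returned unchanged.
fn resolve_ocr_languages(requested: &str, tessdata_dir: &Path) -> io::Result<String> {
    if requested != "all" && requested != "*" {
        return Ok(requested.to_string());
    }

    let mut langs: Vec<String> = Vec::new();
    for entry in fs::read_dir(tessdata_dir)? {
        let path = entry?.path();
        // Assumption: only regular files ending in `.traineddata` count as languages.
        let is_traineddata = path.extension().and_then(|e| e.to_str()) == Some("traineddata");
        if path.is_file() && is_traineddata {
            if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) {
                langs.push(stem.to_string());
            }
        }
    }

    // Sort so the result does not depend on directory iteration order.
    langs.sort();
    Ok(langs.join("+"))
}
```

A real implementation may also filter out non-language data files; this sketch does not attempt that.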

Fixed

Ruby Bindings

  • Cow<'static, str> type conversions: Fixed Magnus bindings to properly convert Cow<'static, str> fields (mime_type, format, colorspace) using .as_ref() instead of passing directly to FFI methods
  • Vendor workspace bytes dependency: Added bytes to the Ruby vendor workspace Cargo.toml via the vendoring script, fixing workspace dependency resolution failures
  • Tempfile GC in batch test: Kept Tempfile references alive in batch_operations_spec.rb to prevent garbage collection before batch_extract_files_sync reads them

Python Bindings

  • Runtime ExtractedImage import: Defined ExtractedImage, Metadata, OutputFormat, and ResultFormat as Python-level runtime types instead of importing from compiled Rust bindings (these are stub-only types, not #[pyclass] exports)
  • Overhauled _internal_bindings.pyi type stubs: Exhaustive audit against Rust source to ensure all types, fields, and optionality match exactly
  • Removed duplicate types.py: Deleted kreuzberg/types.py which contained 43 duplicate type definitions conflicting with _internal_bindings.pyi
  • Consolidated duplicate test files: Merged unique tests from test_embeddings_advanced.py, test_images_extraction.py, test_tables_extraction.py into their canonical counterparts and deleted the duplicates

C# Bindings

  • Attributes deserialization on ARM64: Added AttributesDictionaryConverter to handle both array-of-arrays and object JSON formats for LinkMetadata.Attributes and HtmlImageMetadata.Attributes
  • Overhauled type definitions from audit against Rust source
  • Fixed keyword deserialization: Properly discriminate between simple string keywords and extracted keyword objects

Java Bindings

  • Test timeout prevention: Added @Timeout(60) to all concurrency and async test methods
  • Surefire timeout reduction: Reduced forkedProcessTimeoutInSeconds from 3600s to 600s
  • Overhauled type definitions from audit against Rust source

TypeScript Bindings

  • Overhauled type definitions from audit against NAPI-RS Rust source

PHP Bindings

  • Overhauled type definitions from audit against ext-php-rs Rust source

Go Bindings

  • Overhauled type definitions from audit against Rust source
  • Consolidated config tests

Elixir Bindings

  • Overhauled all struct types from audit against Rust source: Exhaustive audit of every Elixir struct against the Rust core types to ensure field-level correctness
  • Added new struct modules matching Rust types: ChunkMetadata, Keyword, PageHierarchy, DjotContent, PageStructure, ErrorMetadata, ImagePreprocessingMetadata
  • Fixed all test files: Updated 11 test files to match new struct field names

Performance

  • Cow<'static, str> for static string fields: Eliminated heap allocations for values that are always string literals
  • RST parser allocation reduction: Replaced Vec<char> collects with direct iterator usage
  • Vec for small metadata maps: Replaced HashMap with Vec for small attribute collections
  • bytes::Bytes for binary data: Zero-copy cloning of large image buffers
  • AHashMap for hot-path maps: Faster hashing in metadata and extraction pipelines
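
The `Cow<'static, str>` item above, and the related `.as_ref()` fix in the Ruby bindings, can be illustrated with a small standard-library sketch. The `ImageInfo` struct and both functions are hypothetical stand-ins; only the field names `format` and `colorspace` come from these notes.

```rust
use std::borrow::Cow;

/// Hypothetical record; the real Kreuzberg types have more fields.
struct ImageInfo {
    format: Cow<'static, str>,
    colorspace: Cow<'static, str>,
}

/// Known formats borrow a string literal, so no heap allocation happens.
/// Unknown formats fall back to an owned, allocated `String`.
fn format_name(ext: &str) -> Cow<'static, str> {
    match ext {
        "png" => Cow::Borrowed("PNG"),
        "jpg" | "jpeg" => Cow::Borrowed("JPEG"),
        other => Cow::Owned(other.to_uppercase()),
    }
}

/// Stand-in for a binding-layer function that needs plain `&str` values.
fn describe(image: &ImageInfo) -> String {
    // `.as_ref()` yields `&str` for both the borrowed and the owned variant.
    let format: &str = image.format.as_ref();
    let colorspace: &str = image.colorspace.as_ref();
    format!("{} ({})", format, colorspace)
}
```

Callers that only read the value never need to know which variant they hold, which is why converting at the binding boundary with `.as_ref()` is enough.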

Changed

  • Dependency update: Bumped html-to-markdown-rs from 2.24.1 to 2.24.3
