[4.2.7] - 2026-02-01
Added
API
- OpenAPI schema for `/extract` endpoint: Implemented `utoipa::ToSchema` on all extraction result types (`ExtractionResult`, `Metadata`, `Chunk`, `ExtractedImage`, `Element`, `DjotContent`, `PageContent`, `Table`, and all nested types), enabling full OpenAPI documentation for the extraction endpoint
- Unified `ChunkingConfig`: Merged internal chunking config into a single `ChunkingConfig` struct with canonical field names (`max_characters`, `overlap`) and serde aliases (`max_chars`, `max_overlap`) for backwards compatibility. Added `trim` and `chunker_type` fields. The `ChunkerType` enum is no longer feature-gated behind `chunking`
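The alias behaviour described for `ChunkingConfig` can be sketched without any dependencies. The entry says the real struct uses serde aliases (the `#[serde(alias = "...")]` attribute); the block below is only a std-only stand-in for the name resolution. `ChunkingConfigSketch`, `canonical_key`, and `config_from_pairs` are hypothetical names for illustration, and the struct omits the `trim` and `chunker_type` fields.

```rust
/// Hypothetical helper: maps a legacy key to its canonical field name.
/// Key names come from the changelog entry; the function itself is illustrative.
fn canonical_key(key: &str) -> &str {
    match key {
        "max_chars" => "max_characters",
        "max_overlap" => "overlap",
        other => other,
    }
}

/// Reduced stand-in for the unified config (assumption: numeric fields).
#[derive(Debug, Default, PartialEq)]
struct ChunkingConfigSketch {
    max_characters: usize,
    overlap: usize,
}

/// Builds a config from (key, value) pairs, accepting old and new names.
fn config_from_pairs(pairs: &[(&str, usize)]) -> ChunkingConfigSketch {
    let mut config = ChunkingConfigSketch::default();
    for &(key, value) in pairs {
        match canonical_key(key) {
            "max_characters" => config.max_characters = value,
            "overlap" => config.overlap = value,
            _ => {} // unknown keys are ignored in this sketch
        }
    }
    config
}
```

Under these assumptions, `config_from_pairs(&[("max_chars", 1000), ("max_overlap", 200)])` and the same call with `max_characters` and `overlap` produce equal configs, which is the backwards-compatibility property the aliases are meant to provide.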
OCR
- `KREUZBERG_OCR_LANGUAGE="all"` support: Setting the language to `"all"` or `"*"` automatically detects and uses all installed Tesseract languages from the tessdata directory, eliminating manual enumeration (#344)
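As a rough illustration of what this detection might involve, the sketch below derives a Tesseract language string from a list of tessdata file names. It assumes languages are identified by `*.traineddata` files and combined with `+` (Tesseract's multi-language syntax); skipping `osd` is an assumption, and `languages_from_tessdata` and `resolve_language` are hypothetical names, not the library's API.

```rust
/// Hypothetical helper: turns tessdata file names into a "+"-joined language list.
fn languages_from_tessdata(file_names: &[&str]) -> String {
    let mut langs: Vec<&str> = file_names
        .iter()
        .copied()
        // Keep only trained-data files and drop the extension.
        .filter_map(|name| name.strip_suffix(".traineddata"))
        // Assumption: `osd` (orientation/script detection) is not a language.
        .filter(|lang| !lang.is_empty() && *lang != "osd")
        .collect();
    langs.sort_unstable();
    langs.dedup();
    langs.join("+")
}

/// Resolves the configured language: "all" or "*" expands to every
/// detected language, anything else is passed through unchanged.
fn resolve_language(setting: &str, tessdata_files: &[&str]) -> String {
    match setting {
        "all" | "*" => languages_from_tessdata(tessdata_files),
        other => other.to_string(),
    }
}
```

In a real setup the file names would come from listing the tessdata directory (for example with `std::fs::read_dir`); they are passed in here so the logic stays testable.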
Fixed
Ruby Bindings
- Cow<'static, str> type conversions: Fixed Magnus bindings to properly convert `Cow<'static, str>` fields (`mime_type`, `format`, `colorspace`) using `.as_ref()` instead of passing directly to FFI methods
- Vendor workspace `bytes` dependency: Added `bytes` to the Ruby vendor workspace Cargo.toml via the vendoring script, fixing workspace dependency resolution failures
- Tempfile GC in batch test: Kept `Tempfile` references alive in `batch_operations_spec.rb` to prevent garbage collection before `batch_extract_files_sync` reads them
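The `Cow<'static, str>` conversion fix can be illustrated with the standard library alone. In this minimal sketch, `ImageInfo` and `ffi_string_len` are hypothetical stand-ins (the latter for a binding-side function that only accepts `&str`); no Magnus API is used or implied.

```rust
use std::borrow::Cow;

/// Hypothetical struct with a field that is usually a string literal.
struct ImageInfo {
    format: Cow<'static, str>,
}

/// Stand-in for an FFI-facing function that only accepts `&str`.
fn ffi_string_len(value: &str) -> usize {
    value.len()
}

/// Static literal: borrowed, so no heap allocation is needed.
fn png_info() -> ImageInfo {
    ImageInfo { format: Cow::Borrowed("png") }
}

/// Runtime value: owned, for the rare case the string is not a literal.
fn custom_info(name: String) -> ImageInfo {
    ImageInfo { format: Cow::Owned(name) }
}
```

Calling `ffi_string_len(info.format.as_ref())` works for both variants, because `.as_ref()` yields a plain `&str` regardless of whether the `Cow` is borrowed or owned.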
Python Bindings
- Runtime `ExtractedImage` import: Defined `ExtractedImage`, `Metadata`, `OutputFormat`, and `ResultFormat` as Python-level runtime types instead of importing from compiled Rust bindings (these are stub-only types, not `#[pyclass]` exports)
- Overhauled `_internal_bindings.pyi` type stubs: Exhaustive audit against Rust source to ensure all types, fields, and optionality match exactly
- Removed duplicate `types.py`: Deleted `kreuzberg/types.py` which contained 43 duplicate type definitions conflicting with `_internal_bindings.pyi`
- Consolidated duplicate test files: Merged unique tests from `test_embeddings_advanced.py`, `test_images_extraction.py`, `test_tables_extraction.py` into their canonical counterparts and deleted the duplicates
C# Bindings
- Attributes deserialization on ARM64: Added `AttributesDictionaryConverter` to handle both array-of-arrays and object JSON formats for `LinkMetadata.Attributes` and `HtmlImageMetadata.Attributes`
- Overhauled type definitions from audit against Rust source
- Fixed keyword deserialization: Properly discriminate between simple string keywords and extracted keyword objects
Java Bindings
- Test timeout prevention: Added `@Timeout(60)` to all concurrency and async test methods
- Surefire timeout reduction: Reduced `forkedProcessTimeoutInSeconds` from 3600s to 600s
- Overhauled type definitions from audit against Rust source
TypeScript Bindings
- Overhauled type definitions from audit against NAPI-RS Rust source
PHP Bindings
- Overhauled type definitions from audit against ext-php-rs Rust source
Go Bindings
- Overhauled type definitions from audit against Rust source
- Consolidated config tests
Elixir Bindings
- Overhauled all struct types from audit against Rust source: Exhaustive audit of every Elixir struct against the Rust core types to ensure field-level correctness
- Added new struct modules matching Rust types: ChunkMetadata, Keyword, PageHierarchy, DjotContent, PageStructure, ErrorMetadata, ImagePreprocessingMetadata
- Fixed all test files: Updated 11 test files to match new struct field names
Performance
- Cow<'static, str> for static string fields: Eliminated heap allocations for values that are always string literals
- RST parser allocation reduction: Replaced `Vec<char>` collects with direct iterator usage
- Vec for small metadata maps: Replaced `HashMap` with `Vec` for small attribute collections
- bytes::Bytes for binary data: Zero-copy cloning of large image buffers
- AHashMap for hot-path maps: Faster hashing in metadata and extraction pipelines
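Two of the techniques listed above can be sketched with the standard library only. This is not the project's parser or metadata code: `is_underline` and `get_attr` are hypothetical functions, chosen to show walking `chars()` directly instead of collecting into a `Vec<char>`, and using a `Vec` of pairs with linear search in place of a `HashMap` for a handful of attributes.

```rust
/// Checks whether a line is a run of one repeated punctuation character,
/// as in an RST-style section underline ("=====").
/// Walks the `chars()` iterator directly, so no `Vec<char>` is allocated.
fn is_underline(line: &str) -> bool {
    let mut chars = line.trim_end().chars();
    match chars.next() {
        Some(first) if first.is_ascii_punctuation() => chars.all(|c| c == first),
        _ => false,
    }
}

/// Linear lookup in a small attribute list stored as a `Vec` of pairs.
/// For a handful of entries this avoids hashing and keeps data contiguous.
fn get_attr<'a>(attrs: &'a [(String, String)], key: &str) -> Option<&'a str> {
    attrs
        .iter()
        .find(|(k, _)| k.as_str() == key)
        .map(|(_, v)| v.as_str())
}
```

The linear scan only pays off while collections stay small; for larger or hot-path maps a hashed map remains the better fit, which matches the separate AHashMap entry.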
Changed
- Dependency update: Bumped `html-to-markdown-rs` from 2.24.1 to 2.24.3