[4.6.3] - 2026-03-27
Added
- Tower service layer (
servicemodule): ComposableExtractionServiceimplementingtower::Servicewith configurable middleware layers (tracing, metrics, timeout, concurrency limit). Newtower-servicefeature flag, auto-enabled byapiandmcp.ExtractionServiceBuilderprovides ergonomic layer composition. - Semantic OpenTelemetry conventions (
telemetrymodule): Formalkreuzberg.*attribute namespace with 30+ span attributes, metric names, and operation/stage constants. - Extraction metrics: 11 OTel metric instruments (counters, histograms, gauge) covering extraction totals, durations, cache hits/misses, pipeline stages, OCR, and concurrent extractions. Feature-gated behind
otel. - InstrumentedExtractor wrapper: Automatic per-extractor tracing spans and metrics without per-extractor annotations. Injected at registry dispatch when
otelfeature is enabled.
Improved
- Deeper instrumentation: Pipeline post-processing stages, individual processor execution, OCR operations, and RT-DETR layout model inference now have semantic spans and duration metrics.
- API and MCP servers use ExtractionService: Both consumers now route extractions through the Tower service stack.
- Unified config merge: JSON config merge logic deduplicated between CLI and MCP.
- API server hardening: Added response compression (gzip/brotli/zstd), panic recovery, request-ID correlation, and sensitive header redaction via tower-http middleware.
Changed
- Removed per-extractor
#[instrument]annotations: 29 manual annotations replaced by the automaticInstrumentedExtractorwrapper. - Span attribute names migrated to
kreuzberg.*namespace:extraction.filename->kreuzberg.document.filename,extraction.mime_type->kreuzberg.document.mime_type, etc.
Fixed
- EPUB spine semantics refactor (#594): Richer OPF package model preserves manifest fallback chains, guide references, and non-linear spine items. Navigation chrome stripped from output. Malformed guide references now produce warnings instead of hard failures.
- DOCX image extraction for
<a:blip>with child elements (#591): Images with high-quality settings were not extracted. Now handlesEvent::Startfor<a:blip>. - OCR table extraction returned empty results via pipeline path (#593): Layout detection and table propagation fixed for both code paths.
- Missing
chunker_typefield in bindings (#592): Exposed across Python, TypeScript/WASM, Go, C#, PHP bindings. - Full API parity across all 10 bindings: Added
max_archive_depthto all bindings. Added missing typed config classes foracceleration,email,layout,concurrencywhere needed. - Node Windows publish failure: Prepare script fallback replaced with cross-platform
node -e. - CI Validate path triggers broadened: Covers
docs/**,biome.json,.task/**, and other lintable paths. - Publish pipeline ORT bundling: Configurable
strategyinput (system/bundled) onsetup-onnx-runtimeaction. Publish jobs now usestrategy: bundled. - C FFI CI missing ORT setup: Added
setup-onnx-runtimestep toci-c-ffi.yaml.