github kreuzberg-dev/kreuzberg v4.6.3

9 hours ago

[4.6.3] - 2026-03-27

Added

  • Tower service layer (service module): Composable ExtractionService implementing tower::Service with configurable middleware layers (tracing, metrics, timeout, concurrency limit). New tower-service feature flag, auto-enabled by api and mcp. ExtractionServiceBuilder provides ergonomic layer composition.
  • Semantic OpenTelemetry conventions (telemetry module): Formal kreuzberg.* attribute namespace with 30+ span attributes, metric names, and operation/stage constants.
  • Extraction metrics: 11 OTel metric instruments (counters, histograms, gauge) covering extraction totals, durations, cache hits/misses, pipeline stages, OCR, and concurrent extractions. Feature-gated behind otel.
  • InstrumentedExtractor wrapper: Automatic per-extractor tracing spans and metrics without per-extractor annotations. Injected at registry dispatch when otel feature is enabled.

Improved

  • Deeper instrumentation: Pipeline post-processing stages, individual processor execution, OCR operations, and RT-DETR layout model inference now have semantic spans and duration metrics.
  • API and MCP servers use ExtractionService: Both consumers now route extractions through the Tower service stack.
  • Unified config merge: JSON config merge logic deduplicated between CLI and MCP.
  • API server hardening: Added response compression (gzip/brotli/zstd), panic recovery, request-ID correlation, and sensitive header redaction via tower-http middleware.

Changed

  • Removed per-extractor #[instrument] annotations: 29 manual annotations replaced by the automatic InstrumentedExtractor wrapper.
  • Span attribute names migrated to kreuzberg.* namespace: extraction.filename -> kreuzberg.document.filename, extraction.mime_type -> kreuzberg.document.mime_type, etc.

Fixed

  • EPUB spine semantics refactor (#594): Richer OPF package model preserves manifest fallback chains, guide references, and non-linear spine items. Navigation chrome stripped from output. Malformed guide references now produce warnings instead of hard failures.
  • DOCX image extraction for <a:blip> with child elements (#591): Images with high-quality settings were not extracted. Now handles Event::Start for <a:blip>.
  • OCR table extraction returned empty results via pipeline path (#593): Layout detection and table propagation fixed for both code paths.
  • Missing chunker_type field in bindings (#592): Exposed across Python, TypeScript/WASM, Go, C#, PHP bindings.
  • Full API parity across all 10 bindings: Added max_archive_depth to all bindings. Added missing typed config classes for acceleration, email, layout, concurrency where needed.
  • Node Windows publish failure: Prepare script fallback replaced with cross-platform node -e.
  • CI Validate path triggers broadened: Covers docs/**, biome.json, .task/**, and other lintable paths.
  • Publish pipeline ORT bundling: Configurable strategy input (system/bundled) on setup-onnx-runtime action. Publish jobs now use strategy: bundled.
  • C FFI CI missing ORT setup: Added setup-onnx-runtime step to ci-c-ffi.yaml.

Don't miss a new kreuzberg release

NewReleases is sending notifications on new releases.