github kreuzberg-dev/kreuzberg v5.0.0-rc.22

pre-release14 hours ago

v5.0.0-rc.22

Release candidate. Bumps alef to 0.25.39 and regenerates all bindings; see entries below.

Removed

  • Orphan docs/reference/api-gleam.md left behind by dc268c1c0e chore: drop Gleam binding entirely. CI Docs task docs:build:strict aborted on 8 unresolved link reference warnings from this file (e.g. [\BatchFileItem`], [0.0, 1.0], ["https://example.com"]` patterns the strict link checker reads as broken markdown references). The Gleam binding itself was removed earlier; this is just nav cleanup.
  • java: removed stale hand-authored trait-bridge overlay in packages/java/src/main/java/dev/kreuzberg/ (9 files: IPostProcessor, IOcrBackend, IValidator, IEmbeddingBackend, OcrBackendBridge, PostProcessorBridge, ValidatorBridge, EmbeddingBackendBridge, package-info). Alef's trait-bridge code generation now produces these files correctly in the canonical packages/java/dev/kreuzberg/ location; the overlay was leftover from before alef handled trait bridges and carried only stale unused imports and formatting differences. Both trees were being compiled by Maven's sourceDirectory=basedir config, triggering duplicate-class errors and checkstyle violations. The canonical generated copies are now the sole source.

Fixed

  • alef: bump alef_version to 0.25.39 and regenerate all bindings. Picks up four fixes that affect the kreuzberg surface. (1) Kotlin/Android DTO instance methods (e.g. ExtractionConfig, RedactionConfig, ServerConfig, PaddleOcrConfig withers) were emitted as broken top-level functions after the data-class constructor ) with snake_case locals and references to non-existent native methods, breaking compilation; they are now emitted inside the class body as camelCase member stubs that throw UnsupportedOperationException with reconstruct-via-Builder guidance. (2) Java stops emitting throwing UnsupportedOperationException stubs for Self-returning DTO/enum methods (no JNI/FFM symbol exists for them yet — absence beats a stub that compiles but misleads), and restores the true default for boxed @Nullable Boolean #[serde(default)] fields (e.g. ChunkingConfig.trim, EmbeddingConfig.normalize) so omitted JSON no longer surfaces null from the accessor. (3) The new OcrBackend::emits_structured_markdown vtable method is now wired through the full FFI/native bridge surface (FFI struct + initializers, C header, Go/Zig/Dart/JNI/node bridges), fixing the rc.20 missing field emits_structured_markdown in initializer of KreuzbergOcrBackendVTable native-build break. (4) Enum associated static-factory methods are now collected and surfaced. Also carries the 0.25.36 cfg-variant dedup completeness (pyo3/napi/wasm/zig/dart FRB mirror) and the Java marshal_optional_bytes.jinja template registration. (5) Duplicate-definition build breaks unmasked once the FFI vtable compiled past emits_structured_markdown: the swift Rust crate emitted pub fn token_counter_noop twice for own-block opaque types (TokenCounter, E0428), and the PHP #[php_impl] block emitted download_model/default_model_name/known_models three times each from a ner-onnx cfg-pair plus an unconditional re-export (E0592) — alef 0.25.39 consolidates noop emission to a single source and deduplicates the PHP Rust surface. Also includes 0.25.38's napi map-return wrapping and pyo3 enum-factory-staticmethod handling.

  • alef: bump alef_version to 0.25.35 and regenerate all bindings, fixing two binding-surface regressions that broke every language e2e suite on rc.20. (1) The async embedding/reranking entry points (embed_texts_async, rerank_async) vanished from generated bindings — the real implementations are generic / alef(skip)-marked and re-exported at the crate root under one cfg, while a no-ORT stub lives under the disjoint not(...) cfg; alef's extractor dropped the re-export and a dedup-by-name pass then collapsed the remaining pair, so the symbol disappeared whenever the surviving entry's cfg was inactive (embedTextsAsync is not a function, missing kreuzberg_embed_texts_async FFI symbol, :nif_not_loaded, etc.). alef 0.25.34's pub-use-clears-skip + dedup-by-(name,cfg) extractor fixes restore both cfg-gated variants. (2) Those paired variants then produced duplicate-method compile errors in single-surface bindings (Java Kreuzberg.java had 14 identical-signature duplicates; C#/Go/Kotlin/Swift/Dart/PHP/Ruby/Elixir + the JNI shim likewise); alef 0.25.35's shared codegen::fn_dedup collapses same-named cfg-variant functions to one host method for those backends while Rust-cfg-gated backends (FFI/napi/pyo3/wasm) keep both #[cfg] arms. Also carries the alef PHP Default::default gating, Swift exclude_types forwarder filtering, extendr extendr_module! cfg-gated registration, Java checkstyle/import cleanups, and the Dart token_count u32→i64 cast.

  • ci: copy crates/kreuzberg-candle-ocr into the musl build images. The crate is an optional path-dep of the kreuzberg crate (candle-ocr family of features) but was never COPY'd into docker/Dockerfile.musl-build, so cargo build aborted at workspace resolution with failed to read /build/crates/kreuzberg-candle-ocr/Cargo.toml: No such file or directory. The same gap existed in docker/Dockerfile.musl-ffi and docker/Dockerfile.musl-rustler, which both depend on kreuzberg with the full feature set; all three now copy the crate alongside kreuzberg-paddle-ocr. The Cargo.toml strip sed blocks were already candle-ocr-safe (they never reference it), so the workspace member is preserved consistently.

  • alef: bump alef_version to 0.25.30 and regenerate. Adds excluded_default_features = ["heic"] to the [crates.dart] and [crates.swift] blocks so the generated packages/{dart,swift}/rust/Cargo.toml default = [...] arrays no longer pull heiclibheif-sys has no cross-compile story for iOS or Android NDK targets, which made CI Mobile (cargo check kreuzberg-dart on aarch64-apple-ios / aarch64-apple-ios-sim / aarch64-linux-android / x86_64-linux-android) and the rc.20 Publish Build Swift package job fail with pkg-config has not been configured to support cross-compilation. The heic feature itself is preserved as a forwarding entry so desktop callers can opt in via --features heic. The regen also picks up alef 0.25.28–0.25.30 fixes: Java PMD-violation suppressions, removal of generated-file alef:hash headers, PHP module-entry version sync, and the per-field Default fallback for core types without Default.

  • crates/kreuzberg-ffi/tests/vtable_bytes_len.rs initializer for KreuzbergOcrBackendVTable was missing the new emits_structured_markdown field added in rc.20, which made CI Rust on macos-latest and ubuntu-24.04-arm fail with E0063 missing field. The test now constructs the full vtable; no production code change required.

  • scripts/publish/update-homebrew-formula.sh now ensures depends_on "libheif" is present in the tap formula at publish time. Homebrew bottle builds were failing during rc.18/rc.19 because libheif-sys could not locate libheif.pc via PKG_CONFIG_PATH in the brew sandbox (the formula declared only cmake/pkg-config/rust as build deps). The injection is idempotent and self-healing for any future formula drift.

  • ci: set MACOSX_DEPLOYMENT_TARGET=15.0 for the macOS Python wheel build (cibw-environment-macos in publish.yaml). macos-latest is now macOS 15 (Sequoia), and the Homebrew HEIF/AV1 dylibs libheif bundles (libx265, libheif, libaom, libvmaf, libsharpyuv, libde265) carry LC_BUILD_VERSION minos 15.0. With the wheel tagged macosx_11_0_arm64, delocate aborted the rc.20 run with DelocationError: Library dependencies do not satisfy target MacOS version 11.0. Pinning to 15.0 tags the wheel macosx_15_0_arm64 so the bundled dylibs satisfy the check. Only arm64 macOS is built (Intel dropped in rc.20+), and 15.0 is already the effective minimum imposed by the libheif dylibs.

Don't miss a new kreuzberg release

NewReleases is sending notifications on new releases.