github kreuzberg-dev/kreuzberg v4.0.0-rc.7

pre-release, one hour ago

Added

  • WebAssembly bindings (@kreuzberg/wasm npm package) for browser, Cloudflare Workers, Deno, and Bun
    • Full TypeScript API with sync and async extraction methods
    • Multi-threading support via wasm-bindgen-rayon and SharedArrayBuffer
    • Batch processing with batchExtractBytes() and batchExtractBytesSync()
    • Plugin system for custom post-processors, validators, and OCR backends
    • MIME type detection and configuration management
    • Comprehensive unit tests (95%+ coverage on core modules)
    • Production-ready error handling with detailed error messages
  • RTF extractor now builds structured tables (markdown + cells) and parses RTF \info metadata (authors, dates, counts), bringing parity with DOCX/ODT fixtures.
  • New pandoc-generated RTF fixtures with embedded metadata for word_sample, lorem_ipsum, and extraction_test to validate cross-format extraction.
  • Page tracking and metadata redesign (#226)
    • Per-page content extraction with PageContent type
    • Byte-accurate page boundaries with PageBoundary type for O(1) lookups
    • Detailed per-page metadata with PageInfo type (dimensions, counts, visibility)
    • Unified page structure tracking with PageStructure type
    • PageConfig for controlling page extraction behavior
    • Automatic chunk-to-page mapping with first_page/last_page in ChunkMetadata
    • Format-specific support:
      • PDF: Full byte-accurate tracking with O(1) performance
      • PPTX: Slide boundary tracking
      • DOCX: Best-effort page break detection
    • Page markers in combined text for LLM context awareness
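
The chunk-to-page mapping listed above can be illustrated with a small Python sketch. It is not Kreuzberg's implementation: the Boundary class is a hypothetical stand-in for PageBoundary, its field names are assumptions, and the lookup uses a plain binary search (O(log n)) rather than whatever structure gives the library its O(1) lookups.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    """Hypothetical stand-in for PageBoundary; field names are assumptions."""
    byte_start: int   # inclusive UTF-8 byte offset into the combined text
    byte_end: int     # exclusive UTF-8 byte offset
    page_number: int  # 1-indexed page number


def page_for_offset(boundaries: list[Boundary], offset: int) -> int:
    """Return the page containing a byte offset (boundaries sorted by byte_start)."""
    starts = [b.byte_start for b in boundaries]
    index = bisect_right(starts, offset) - 1
    if index < 0 or offset >= boundaries[index].byte_end:
        raise ValueError(f"offset {offset} is outside every page boundary")
    return boundaries[index].page_number


def pages_for_chunk(
    boundaries: list[Boundary], byte_start: int, byte_end: int
) -> tuple[int, int]:
    """Return (first_page, last_page) for a chunk covering [byte_start, byte_end)."""
    first_page = page_for_offset(boundaries, byte_start)
    # The end offset is exclusive, so look up the last byte the chunk contains.
    last_page = page_for_offset(boundaries, max(byte_start, byte_end - 1))
    return first_page, last_page


boundaries = [Boundary(0, 100, 1), Boundary(100, 250, 2), Boundary(250, 300, 3)]
first_page, last_page = pages_for_chunk(boundaries, 90, 120)
```

A chunk that straddles a page break gets different first_page and last_page values, which is the information the new ChunkMetadata fields are described as carrying.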

Changed

  • BREAKING: ChunkMetadata field renames for byte-accurate tracking (#226)
    • char_start → byte_start (UTF-8 byte offset)
    • char_end → byte_end (UTF-8 byte offset)
    • Existing code using char_start/char_end must be updated
    • See migration guide for details
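
The rename matters because character offsets and UTF-8 byte offsets diverge as soon as the text contains a non-ASCII character. The Python sketch below shows the difference; the helper names are illustrative and not part of Kreuzberg's API, and the migration guide remains the authoritative reference.

```python
def char_to_byte_offset(text: str, char_offset: int) -> int:
    """Convert a character offset into the equivalent UTF-8 byte offset."""
    return len(text[:char_offset].encode("utf-8"))


def slice_by_bytes(text: str, byte_start: int, byte_end: int) -> str:
    """Return the substring covered by the UTF-8 byte range [byte_start, byte_end)."""
    return text.encode("utf-8")[byte_start:byte_end].decode("utf-8")


text = "na\u00efve caf\u00e9"  # "naïve café": two characters take two bytes each

# The second word starts at character 6 but at byte 7.
char_start, char_end = 6, 10
byte_start = char_to_byte_offset(text, char_start)
byte_end = char_to_byte_offset(text, char_end)

word_from_chars = text[char_start:char_end]
word_from_bytes = slice_by_bytes(text, byte_start, byte_end)
```

Code that previously sliced a Python str with char_start/char_end would, under this reading, need to slice the UTF-8 encoded bytes with byte_start/byte_end instead, as slice_by_bytes does.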

Fixed

  • Comprehensive lint cleanup across the crate and tests (clippy warnings resolved).
  • Publish workflow now tolerates apt-managed RubyGems installations by skipping unsupported gem update --system during gem rebuild and installs a fallback .NET SDK when the runner lacks dotnet.
  • Docker publish now skips pushing when the target version tag already exists, avoiding redundant builds for released images.
  • Docker tag existence is checked upfront before any publish work, and per-variant publish jobs are skipped early when the version is already present.
  • Added preflight checks for CLI, Go, and Rust crates to skip build/publish when the release artifacts already exist.
  • Maven publishing now uses Sonatype Central's central-publishing-maven-plugin with auto-publish/wait and Central user-token credentials, replacing the legacy OSSRH endpoint.
  • Maven Central publish timeout increased from 30 minutes to 2 hours to accommodate the slower validation/publishing process.
  • Python wheels are now built with the manylinux: auto parameter (it was incorrectly set to manylinux2014, which is not a valid maturin-action value), fixing PyPI's rejection of uploads with linux_x86_64 platform tags.
  • manylinux wheel builds now detect container type (CentOS vs Debian) and set correct OPENSSL_LIB_DIR paths (/usr/lib64 for CentOS, /usr/lib/x86_64-linux-gnu for Debian) to avoid openssl-sys build failures in maturin builds.
  • Ruby Gemfile.lock now includes x86_64-linux platform for CI compatibility on Linux runners.
  • Ruby gem corruption fixed by excluding .fastembed_cache (567MB of embedding models) and target directories from gemspec fallback path.
  • Java Panama FFM SIGSEGV crashes on macOS ARM64 fixed by adding explicit padding fields to FFI structs (CExtractionResult and CBatchResult) to ensure struct alignment matches between Rust and Java.
  • TypeScript E2E test type error fixed in smoke.spec.ts by using proper expectation object format.
  • Node.js benchmarks now have tsx as workspace dev dependency and root-level typecheck script.
  • C# compilation errors (CS0136, CS0128, CS0165) resolved by fixing variable shadowing in e2e/csharp/Helpers.cs.
  • Python CI timeout issues resolved by marking slow office document tests with @pytest.mark.slow and skipping them in CI.
  • Go CI tests enhanced with comprehensive verbose logging and platform-specific diagnostics for better debugging.
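
The struct-alignment fix for the Java Panama FFM crashes can be illustrated with Python's ctypes. The two structs below are hypothetical and do not reproduce the fields of CExtractionResult or CBatchResult. The sketch only shows why explicit padding helps: the field offsets are then written out in the declaration instead of being left to each side's alignment rules.

```python
import ctypes


class ImplicitPadding(ctypes.Structure):
    # Hypothetical layout: the platform ABI decides how many padding
    # bytes sit between the 1-byte flag and the 8-byte length.
    _fields_ = [
        ("success", ctypes.c_uint8),
        ("length", ctypes.c_uint64),
    ]


class ExplicitPadding(ctypes.Structure):
    # Same data, but the 7 padding bytes are declared as a field, so
    # `length` sits at offset 8 wherever the struct is declared this way.
    _fields_ = [
        ("success", ctypes.c_uint8),
        ("_padding", ctypes.c_uint8 * 7),
        ("length", ctypes.c_uint64),
    ]


implicit_offset = ImplicitPadding.length.offset  # platform-dependent
explicit_offset = ExplicitPadding.length.offset
explicit_size = ctypes.sizeof(ExplicitPadding)

print(f"implicit padding: length at offset {implicit_offset}")
print(f"explicit padding: length at offset {explicit_offset}, size {explicit_size}")
```

If one side of an FFI boundary assumes padding that the other side does not declare, the two disagree about where each field lives, which is the kind of mismatch the padding fix addresses.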
