Added
- WebAssembly bindings (`@kreuzberg/wasm` npm package) for browser, Cloudflare Workers, Deno, and Bun
  - Full TypeScript API with sync and async extraction methods
  - Multi-threading support via `wasm-bindgen-rayon` and SharedArrayBuffer
  - Batch processing with `batchExtractBytes()` and `batchExtractBytesSync()`
  - Plugin system for custom post-processors, validators, and OCR backends
  - MIME type detection and configuration management
  - Comprehensive unit tests (95%+ coverage on core modules)
  - Production-ready error handling with detailed error messages
- RTF extractor now builds structured tables (markdown + cells) and parses RTF `\info` metadata (authors, dates, counts), bringing parity with DOCX/ODT fixtures.
- New pandoc-generated RTF fixtures with embedded metadata for `word_sample`, `lorem_ipsum`, and `extraction_test` to validate cross-format extraction.
- Page tracking and metadata redesign (#226)
  - Per-page content extraction with `PageContent` type
  - Byte-accurate page boundaries with `PageBoundary` type for O(1) lookups
  - Detailed per-page metadata with `PageInfo` type (dimensions, counts, visibility)
  - Unified page structure tracking with `PageStructure` type
  - `PageConfig` for controlling page extraction behavior
  - Automatic chunk-to-page mapping with `first_page`/`last_page` in `ChunkMetadata`
  - Format-specific support:
    - PDF: Full byte-accurate tracking with O(1) performance
    - PPTX: Slide boundary tracking
    - DOCX: Best-effort page break detection
  - Page markers in combined text for LLM context awareness
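
To illustrate the idea behind chunk-to-page mapping, here is a minimal sketch that derives a first and last page from a chunk's UTF-8 byte range. It is not Kreuzberg's implementation: `PageSpan` and `pages_for_chunk` are hypothetical names, the field layout is assumed, and the lookup uses a binary search (O(log n)) rather than the O(1) lookup described in this release.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class PageSpan:
    """Hypothetical stand-in for a page boundary: a half-open UTF-8 byte range."""
    page_number: int  # 1-based page (or slide) number
    byte_start: int   # inclusive byte offset into the combined text
    byte_end: int     # exclusive byte offset into the combined text


def pages_for_chunk(pages, byte_start, byte_end):
    """Return (first_page, last_page) for a chunk's byte range, or None.

    Assumes `pages` is sorted by byte_start, contiguous, and non-overlapping.
    """
    if not pages or byte_start >= byte_end:
        return None
    starts = [p.byte_start for p in pages]
    # Index of the last page that starts at or before the given offset.
    first = bisect_right(starts, byte_start) - 1
    last = bisect_right(starts, byte_end - 1) - 1
    if first < 0 or byte_end > pages[-1].byte_end:
        return None  # chunk falls outside the tracked byte range
    return pages[first].page_number, pages[last].page_number


pages = [PageSpan(1, 0, 100), PageSpan(2, 100, 250), PageSpan(3, 250, 400)]
print(pages_for_chunk(pages, 90, 120))   # chunk straddles pages 1 and 2
print(pages_for_chunk(pages, 250, 300))  # chunk lies entirely on page 3
```

A chunk that crosses a page break reports different first and last pages, which is the situation `first_page`/`last_page` are meant to capture.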
Changed
- BREAKING: `ChunkMetadata` field renames for byte-accurate tracking (#226)
  - `char_start` → `byte_start` (UTF-8 byte offset)
  - `char_end` → `byte_end` (UTF-8 byte offset)
  - Existing code using `char_start`/`char_end` must be updated
  - See migration guide for details
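
The rename matters because character indices and UTF-8 byte offsets diverge as soon as the text contains a multi-byte character. The following sketch shows the difference in plain Python; `slice_by_bytes` and `char_to_byte_offset` are illustrative helpers and not part of the Kreuzberg API.

```python
def slice_by_bytes(text: str, byte_start: int, byte_end: int) -> str:
    """Slice `text` using UTF-8 byte offsets rather than character indices."""
    return text.encode("utf-8")[byte_start:byte_end].decode("utf-8")


def char_to_byte_offset(text: str, char_index: int) -> int:
    """Convert a character index into the equivalent UTF-8 byte offset."""
    return len(text[:char_index].encode("utf-8"))


text = "café menu"
# "menu" starts at character 5, but "é" occupies two bytes in UTF-8,
# so the byte offset of the same position is one larger.
byte_start = char_to_byte_offset(text, 5)
byte_end = char_to_byte_offset(text, 9)
print(byte_start, byte_end)
print(slice_by_bytes(text, byte_start, byte_end))
```

Code that previously indexed a `str` directly with the old character offsets should instead slice the encoded bytes, as `slice_by_bytes` does, or convert the offsets first.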
Fixed
- Comprehensive lint cleanup across the crate and tests (clippy warnings resolved).
- Publish workflow now tolerates apt-managed RubyGems installations by skipping unsupported `gem update --system` during gem rebuild and installs a fallback .NET SDK when the runner lacks `dotnet`.
- Docker publish now skips pushing when the target version tag already exists, avoiding redundant builds for released images.
- Docker tag existence is checked upfront before any publish work, and per-variant publish jobs are skipped early when the version is already present.
- Added preflight checks for CLI, Go, and Rust crates to skip build/publish when the release artifacts already exist.
- Maven publishing now uses Sonatype Central's `central-publishing-maven-plugin` with auto-publish/wait and Central user-token credentials, replacing the legacy OSSRH endpoint.
- Maven Central publish timeout increased from 30 minutes to 2 hours to accommodate the slower validation/publishing process.
- Python wheels are now built with the `manylinux: auto` parameter (was incorrectly set to `manylinux2014`, which is not a valid maturin-action value), fixing PyPI upload rejection of `linux_x86_64` platform tags.
- manylinux wheel builds now detect container type (CentOS vs Debian) and set correct `OPENSSL_LIB_DIR` paths (`/usr/lib64` for CentOS, `/usr/lib/x86_64-linux-gnu` for Debian) to avoid openssl-sys build failures in maturin builds.
- Ruby Gemfile.lock now includes x86_64-linux platform for CI compatibility on Linux runners.
- Ruby gem corruption fixed by excluding .fastembed_cache (567MB of embedding models) and target directories from gemspec fallback path.
- Java Panama FFM SIGSEGV crashes on macOS ARM64 fixed by adding explicit padding fields to FFI structs (CExtractionResult and CBatchResult) to ensure struct alignment matches between Rust and Java.
- TypeScript E2E test type error fixed in smoke.spec.ts by using proper expectation object format.
- Node.js benchmarks now have tsx as workspace dev dependency and root-level typecheck script.
- C# compilation errors (CS0136, CS0128, CS0165) resolved by fixing variable shadowing in e2e/csharp/Helpers.cs.
- Python CI timeout issues resolved by marking slow office document tests with @pytest.mark.slow and skipping them in CI.
- Go CI tests enhanced with comprehensive verbose logging and platform-specific diagnostics for better debugging.
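
As an illustration of the container detection mentioned in the manylinux fix above, the sketch below picks an `OPENSSL_LIB_DIR` value from the contents of `/etc/os-release`. The actual workflow's detection method is not described in this changelog, so the parsing approach, the function name `openssl_lib_dir`, and the treatment of `rhel` and `ubuntu` as equivalents of CentOS and Debian are all assumptions. Only the two paths come from the entry itself.

```python
def openssl_lib_dir(os_release_text: str) -> str:
    """Pick an OPENSSL_LIB_DIR for x86_64 from os-release contents (assumed approach)."""
    fields = {}
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    # ID names the distribution; ID_LIKE lists the families it derives from.
    ids = {fields.get("ID", "")} | set(fields.get("ID_LIKE", "").split())
    if ids & {"centos", "rhel"}:
        return "/usr/lib64"
    if ids & {"debian", "ubuntu"}:
        return "/usr/lib/x86_64-linux-gnu"
    raise ValueError(f"unrecognised distribution: {sorted(ids)}")


centos = 'NAME="CentOS Linux"\nID="centos"\nID_LIKE="rhel fedora"\n'
debian = 'PRETTY_NAME="Debian GNU/Linux 12"\nID=debian\n'
print(openssl_lib_dir(centos))
print(openssl_lib_dir(debian))
```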