github kreuzberg-dev/kreuzberg v4.4.5

11 hours ago

Fixed

  • PDF markdown garbles positioned text (#431): PDFs with positioned/tabular text (CVs, addresses, data tables) had their line breaks destroyed during paragraph grouping. Added page-level positioned text detection: when fewer than 30% of lines on a page reach the right margin, short lines are split into separate paragraphs to preserve the document's visual structure.
  • Node worker pool password bug: extractFileInWorker was passing the password argument as mime_type to extract_file_sync, meaning passwords were never applied and MIME detection could break. Password is now correctly injected into config.pdf_options.passwords.
  • WASM camelCase config deserialization: JS consumers send camelCase config keys (e.g. includeDocumentStructure) but serde expects snake_case. Added camel_to_snake transform in parse_config() so config fields are properly deserialized. Fixes document structure extraction returning empty results via WASM.
  • PHP 8.5 array coercion on macOS: On PHP 8.5 + macOS, ext-php-rs coerces #[php_class] return values to arrays instead of objects. Added normalizeExtractionResult() wrapper that transparently converts arrays via ExtractionResult::fromArray().
  • PHP 8.5 support: Upgraded ext-php-rs to 0.15.6 for PHP 8.5 compatibility.
  • Vendoring scripts missing path deps: Ruby and R vendoring scripts failed when workspace dependencies used path instead of version.
  • WASM Deno OCR test hang: OCR tests hung indefinitely on WASM Deno because Tesseract synchronous initialization blocks the single-threaded runtime.
  • pdfium-render clippy lints: Fixed clippy warnings in kreuzberg-pdfium-render crate.
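
To illustrate the positioned-text heuristic from #431, here is a minimal Python sketch. It is not the library's implementation. The 30% threshold comes from the note above. Everything else is an assumption made for illustration: the `Line` type, the `REACH_TOLERANCE` value, estimating the right margin as the widest line on the page, and joining all lines on a flowing page into a single paragraph.

```python
from dataclasses import dataclass
from typing import List

POSITIONED_THRESHOLD = 0.30  # from the release note: "fewer than 30% of lines"
REACH_TOLERANCE = 0.95       # assumption: within 5% of the margin counts as reaching it


@dataclass
class Line:
    """Hypothetical line record: its text and the x-coordinate of its right edge."""
    text: str
    x_end: float


def reaches_margin(line: Line, right_margin: float) -> bool:
    return line.x_end >= right_margin * REACH_TOLERANCE


def is_positioned_page(lines: List[Line]) -> bool:
    """True when fewer than 30% of the page's lines reach the right margin."""
    if not lines:
        return False
    # Assumption: the widest line on the page approximates the right margin.
    right_margin = max(line.x_end for line in lines)
    reaching = sum(1 for line in lines if reaches_margin(line, right_margin))
    return reaching / len(lines) < POSITIONED_THRESHOLD


def group_paragraphs(lines: List[Line]) -> List[str]:
    """On positioned pages, end a paragraph after every short line."""
    if not lines:
        return []
    right_margin = max(line.x_end for line in lines)
    positioned = is_positioned_page(lines)
    paragraphs: List[str] = []
    current: List[str] = []
    for line in lines:
        current.append(line.text)
        if positioned and not reaches_margin(line, right_margin):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Flowing pages (and any trailing full-width lines) end up here.
        paragraphs.append(" ".join(current))
    return paragraphs
```

The flowing-page branch is deliberately simplistic. It stands in for whatever paragraph grouping the extractor normally applies when a page is not detected as positioned.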
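
The camelCase fix for WASM can be pictured as a key-renaming pass that runs before deserialization. The sketch below is in Python for brevity; the actual transform lives in parse_config() and may differ. The helper `snake_case_keys` and the assumption that nested objects and arrays are also rewritten are illustrative, not confirmed by the release note.

```python
import re
from typing import Any

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def camel_to_snake(key: str) -> str:
    """Convert a camelCase key to snake_case; snake_case input is unchanged."""
    return _BOUNDARY.sub("_", key).lower()


def snake_case_keys(value: Any) -> Any:
    """Hypothetical helper: rename keys recursively in dicts and lists."""
    if isinstance(value, dict):
        return {camel_to_snake(k): snake_case_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [snake_case_keys(v) for v in value]
    return value
```

With this pass in place, a JS-style config such as `{"includeDocumentStructure": true}` becomes `{"include_document_structure": true}`, which matches the snake_case field names that serde expects.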

Added

  • CLI --pdf-password flag: New --pdf-password option on extract and batch commands for encrypted PDF support.
  • MCP pdf_password parameter: Added pdf_password field to extract_file, extract_bytes, and batch_extract_files MCP tool params.
  • API pdf_password multipart field: The HTTP API extract endpoint now accepts a pdf_password multipart field for encrypted PDFs.
  • PdfConfig Default impl: Added Default implementation for PdfConfig to support ergonomic config construction.
  • E2E password-protected PDF fixture: Added pdf_password_protected fixture testing copy-protected PDF extraction across all bindings.

Changed

  • All binding crates linted in pre-commit: Removed clippy exclusions for kreuzberg-php, kreuzberg-node, and kreuzberg-wasm.
  • golangci-lint v2.11.3: Upgraded from v2.9.0.

Full Changelog: v4.4.4...v4.4.5