Fixed
- PDF markdown garbles positioned text (#431): PDFs with positioned/tabular text (CVs, addresses, data tables) had their line breaks destroyed during paragraph grouping. Added page-level positioned text detection: when fewer than 30% of lines on a page reach the right margin, short lines are split into separate paragraphs to preserve the document's visual structure.
- Node worker pool password bug: `extractFileInWorker` was passing the `password` argument as `mime_type` to `extract_file_sync`, meaning passwords were never applied and MIME detection could break. Password is now correctly injected into `config.pdf_options.passwords`.
- WASM camelCase config deserialization: JS consumers send camelCase config keys (e.g. `includeDocumentStructure`) but `serde` expects snake_case. Added `camel_to_snake` transform in `parse_config()` so config fields are properly deserialized. Fixes document structure extraction returning empty results via WASM.
- PHP 8.5 array coercion on macOS: On PHP 8.5 + macOS, ext-php-rs coerces `#[php_class]` return values to arrays instead of objects. Added `normalizeExtractionResult()` wrapper that transparently converts arrays via `ExtractionResult::fromArray()`.
- PHP 8.5 support: Upgraded ext-php-rs to 0.15.6 for PHP 8.5 compatibility.
- Vendoring scripts missing path deps: Ruby and R vendoring scripts failed when workspace dependencies use `path` instead of `version`.
- WASM Deno OCR test hang: OCR tests hung indefinitely on WASM Deno because Tesseract synchronous initialization blocks the single-threaded runtime.
- pdfium-render clippy lints: Fixed clippy warnings in kreuzberg-pdfium-render crate.
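
To illustrate the camelCase fix listed above, here is a minimal sketch of a key transform of that kind. It is not the code shipped in `parse_config()`: the function body, its ASCII-only handling, and its treatment of consecutive capitals are all assumptions made for the example.

```rust
/// Convert a camelCase key to snake_case (ASCII only).
///
/// Illustrative sketch, not the project's implementation. Every ASCII
/// uppercase letter starts a new word, so an acronym such as "OCR" would be
/// split letter by letter; a real transform may treat acronyms differently.
///
/// Example: `camel_to_snake("includeDocumentStructure")` yields
/// `"include_document_structure"`.
fn camel_to_snake(key: &str) -> String {
    let mut out = String::with_capacity(key.len() + 4);
    for (i, ch) in key.chars().enumerate() {
        if ch.is_ascii_uppercase() {
            // Insert a separator before every capital except a leading one.
            if i > 0 {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
        } else {
            // Lowercase letters, digits, and existing underscores pass through.
            out.push(ch);
        }
    }
    out
}
```

Keys that are already snake_case pass through unchanged, so a transform like this can be applied to every incoming key without checking which convention the caller used.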
Added
- CLI `--pdf-password` flag: New `--pdf-password` option on `extract` and `batch` commands for encrypted PDF support.
- MCP `pdf_password` parameter: Added `pdf_password` field to `extract_file`, `extract_bytes`, and `batch_extract_files` MCP tool params.
- API `pdf_password` multipart field: The HTTP API extract endpoint now accepts a `pdf_password` multipart field for encrypted PDFs.
- `PdfConfig` `Default` impl: Added `Default` implementation for `PdfConfig` to support ergonomic config construction.
- E2E password-protected PDF fixture: Added `pdf_password_protected` fixture testing copy-protected PDF extraction across all bindings.
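
As a rough picture of what a `Default` impl buys callers, the sketch below uses a stand-in struct rather than the real `PdfConfig`. The `extract_images` field is hypothetical, and the type of `passwords` is assumed; only the field name `passwords` comes from these notes.

```rust
/// Stand-in for illustration only; the real `PdfConfig` defines its own fields.
#[derive(Debug, Clone, Default, PartialEq)]
struct PdfConfig {
    /// Hypothetical option, included only to show a field left at its default.
    extract_images: bool,
    /// Passwords to try on encrypted PDFs (type assumed for this sketch).
    passwords: Option<Vec<String>>,
}

/// Build a config that sets only the password; struct update syntax fills
/// every other field from `Default`.
fn config_with_password(password: &str) -> PdfConfig {
    PdfConfig {
        passwords: Some(vec![password.to_string()]),
        ..Default::default()
    }
}
```

With a `Default` impl available, callers name only the fields they care about, and fields added to the struct later do not break existing construction sites.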
Changed
- All binding crates linted in pre-commit: Removed clippy exclusions for kreuzberg-php, kreuzberg-node, and kreuzberg-wasm.
- golangci-lint v2.11.3: Upgraded from v2.9.0.
Full Changelog: v4.4.4...v4.4.5