kreuzberg-dev/kreuzberg v4.4.3 on GitHub

Added

PDF image placeholder toggle: New inject_placeholders option on ImageExtractionConfig (default: true). Set to false to extract images as data without injecting ![image](...) references into the markdown content.

Fixed

Token reduction not applied (#436): Token reduction config was accepted but never executed during extraction. The pipeline now applies reduce_tokens() when token_reduction.mode is configured.
Nested HTML table extraction: Nested HTML tables now extract correctly with proper cell data and markdown rendering, using the visitor-based table extraction API from html-to-markdown-rs.
hOCR plain text output: hOCR conversion now correctly produces plain text when OutputFormat::Plain is requested, instead of silently falling back to Markdown.
PDF garbled text for positioned/tabular content (#431): PDF text extraction now detects X-position gaps between consecutive characters and inserts spaces when the gap exceeds 0.8 × avg_font_size.
Chunk page metadata drift with overlap (#439): Chunk byte offsets are now computed via pointer arithmetic from the source text, fixing cumulative drift that caused chunks to report incorrect page numbers when overlap was enabled.
Node.js metadata casing: Standardized all Metadata and EmailMetadata fields to camelCase in the Node.js/TypeScript bindings. Also corrected the pluralization of the authors and keywords fields.
WASM build failure on Windows CI: CMake try-compile checks on Windows used the host MSVC compiler (cl.exe), which rejected GCC/Clang flags like -Wno-implicit-function-declaration. Added CMAKE_TRY_COMPILE_TARGET_TYPE=STATIC_LIBRARY to WASM cross-compilation builds.
WASM OCR build panic when git/patch unavailable: The tesseract WASM patch application panicked when both the git apply and patch commands failed. Added programmatic C++ source fixups as a fallback, applying all necessary changes via idempotent string replacements.
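
The X-position gap heuristic from the PDF garbled text fix (#431) can be illustrated with a minimal sketch. This is not kreuzberg's actual code: the PositionedChar type, the join_with_gap_spaces function, measuring the gap from the previous glyph's right edge, and averaging the font size over the whole run are all assumptions made for illustration. Only the 0.8 × avg_font_size threshold comes from the release note.

```rust
/// A positioned glyph as a PDF text extractor might report it.
/// Hypothetical type for illustration; not kreuzberg's internal representation.
#[derive(Debug, Clone, Copy)]
struct PositionedChar {
    ch: char,
    /// X coordinate of the glyph's left edge.
    x: f32,
    /// Horizontal advance of the glyph.
    width: f32,
    font_size: f32,
}

/// Threshold factor taken from the release note (0.8 × avg_font_size).
const GAP_FACTOR: f32 = 0.8;

/// Joins characters on one line, inserting a space wherever the horizontal
/// gap between consecutive glyphs exceeds GAP_FACTOR × average font size.
/// Assumption: the gap is measured from the previous glyph's right edge
/// to the next glyph's left edge.
fn join_with_gap_spaces(chars: &[PositionedChar]) -> String {
    if chars.is_empty() {
        return String::new();
    }

    // Assumption: the average is taken over the whole run of characters.
    let avg_font_size =
        chars.iter().map(|c| c.font_size).sum::<f32>() / chars.len() as f32;
    let threshold = GAP_FACTOR * avg_font_size;

    let mut out = String::with_capacity(chars.len());
    let mut prev_right: Option<f32> = None;

    for c in chars {
        if let Some(right) = prev_right {
            let gap = c.x - right;
            // Skip the insertion if a space is already present on either side.
            if gap > threshold && !out.ends_with(' ') && c.ch != ' ' {
                out.push(' ');
            }
        }
        out.push(c.ch);
        prev_right = Some(c.x + c.width);
    }
    out
}
```

With 10 pt glyphs that are 5 units wide, a 20-unit jump between "B" and "C" exceeds the 8-unit threshold, so the run is emitted as two words; a 2-unit jump leaves the characters joined.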