kreuzberg-dev/kreuzberg v4.7.2 on GitHub

What's Changed

Added

E2E generator published mode — Generate standalone test apps against published registry versions for all 12 language bindings

Changed

Global model cache (#641) — Models now download to a platform-appropriate global cache directory instead of per-directory .kreuzberg/ folders.
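
As a rough illustration of what "platform-appropriate" usually means, the sketch below resolves a per-user cache directory using common OS conventions. The function name `global_cache_dir`, the `app_name` default, and the exact paths are assumptions for illustration only. The release note does not state which directories kreuzberg actually uses.

```python
import os
import sys
from pathlib import Path


def global_cache_dir(app_name: str = "kreuzberg") -> Path:
    """Return a conventional per-user cache directory for app_name.

    Assumption: this follows common OS conventions. The library's real
    resolution logic and directory names may differ.
    """
    home = Path.home()
    if sys.platform == "win32":
        # Per-user, non-roaming application data on Windows.
        base = Path(os.environ.get("LOCALAPPDATA") or home / "AppData" / "Local")
    elif sys.platform == "darwin":
        # Conventional cache location on macOS.
        base = home / "Library" / "Caches"
    else:
        # XDG Base Directory convention on Linux and other Unix-likes.
        base = Path(os.environ.get("XDG_CACHE_HOME") or home / ".cache")
    return base / app_name


if __name__ == "__main__":
    print(global_cache_dir())
```

A shared location like this lets several projects reuse one downloaded copy of a model, which a per-directory .kreuzberg/ folder cannot do.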

Fixed

Leptonica DPI crash (#606) — Images with 0 DPI caused a C++ exception abort during preprocessing. The DPI is now validated and set to 72 when invalid, before preprocessing runs. Also disabled C++ exception handling on Windows MSVC builds.
Embedded HTML in PDF text layers — PDFs with raw HTML in the text layer produced escaped garbage. Such HTML is now detected and converted to clean markdown.
Code classification false positives — Layout model sometimes classified regular prose as Code blocks. Added prose guard.
PageBreak rendering as separators — PageBreak elements rendered as ----- in output. Now treated as structural metadata.
Node.js ExtractionResult.children missing at runtime — Field was in TypeScript definitions but absent from runtime NAPI object.
Node.js disable_ocr config not respected — disableOcr: true still produced OCR content for images.
C# Serialization class inaccessible — Class had insufficient access level in published NuGet package.
Java PdfAnnotation missing getters — Added getContent() and getPageNumber() methods.
Java Table missing getters — Added getCells(), getMarkdown(), and getPageNumber() methods.
PaddleOCR angle classification crash (#643) — Fixed input dimensions for V2 angle classifier model.
Centralized concurrency controls — Fixed 5 places bypassing resolve_thread_budget().
Chunk page numbers missing (#636) — Fixed first_page/last_page being null when chunking was configured.
Ruby OCR backend — Added missing ocr_internal_document field.
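
The guard described in the Leptonica DPI fix (#606) above can be pictured as a small normalization step that runs before any image preprocessing. This is a minimal sketch, not the project's code: `normalize_dpi` and `DEFAULT_DPI` are hypothetical names, and treating missing or negative values as invalid is an assumption that goes beyond the 0 DPI case named in the note.

```python
DEFAULT_DPI = 72  # fallback value named in the release note


def normalize_dpi(dpi):
    """Return a usable DPI, substituting DEFAULT_DPI for invalid input.

    Assumption: None and non-positive values are treated as invalid.
    The release note itself only mentions images reporting 0 DPI.
    """
    if dpi is None or dpi <= 0:
        return DEFAULT_DPI
    return dpi


if __name__ == "__main__":
    for reported in (0, None, 300):
        print(reported, "->", normalize_dpi(reported))
```

Validating the value up front keeps a zero from reaching any later step that scales or divides by the resolution.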

Full Changelog: v4.7.1...v4.7.2