What's Changed
Added
- E2E generator published mode — Generate standalone test apps against published registry versions for all 12 language bindings
Changed
- Global model cache (#641) — Models now download to a platform-appropriate global cache directory instead of per-directory `.kreuzberg/` folders.
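
For readers wondering what "platform-appropriate" usually means, here is a minimal sketch of resolving a per-user cache directory by platform. The helper name `global_cache_dir`, the `models` subfolder, and the exact base paths are assumptions for illustration only; they are not taken from the library and may differ from what it actually uses.

```python
import os
import sys
from pathlib import Path


def global_cache_dir(app_name: str = "kreuzberg") -> Path:
    """Return a per-user cache directory following common platform conventions.

    Hypothetical helper: the directory layout is an assumption, not the
    path the library is known to use.
    """
    home = Path.home()
    if sys.platform == "win32":
        # Windows: per-user, non-roaming application data.
        base = Path(os.environ.get("LOCALAPPDATA") or home / "AppData" / "Local")
    elif sys.platform == "darwin":
        # macOS: per-user cache folder.
        base = home / "Library" / "Caches"
    else:
        # Linux and other Unix systems: XDG Base Directory convention.
        base = Path(os.environ.get("XDG_CACHE_HOME") or home / ".cache")
    return base / app_name / "models"


if __name__ == "__main__":
    print(global_cache_dir())
```

A shared location like this lets every project on the machine reuse one download instead of fetching models again per working directory.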
Fixed
- Leptonica DPI crash (#606) — Images with 0 DPI caused C++ exception abort during preprocessing. Now validates and fixes DPI to 72 before preprocessing. Also disabled C++ exception handling on Windows MSVC builds.
- Embedded HTML in PDF text layers — PDFs with raw HTML in text layer produced escaped garbage. Now detected and converted to clean markdown.
- Code classification false positives — Layout model sometimes classified regular prose as Code blocks. Added prose guard.
- PageBreak rendering as separators — PageBreak elements rendered as `-----` in output. Now treated as structural metadata.
- Node.js `ExtractionResult.children` missing at runtime — Field was in TypeScript definitions but absent from runtime NAPI object.
- Node.js `disable_ocr` config not respected — `disableOcr: true` still produced OCR content for images.
- C# `Serialization` class inaccessible — Class had insufficient access level in published NuGet package.
- Java `PdfAnnotation` missing getters — Added `getContent()` and `getPageNumber()` methods.
- Java `Table` missing getters — Added `getCells()`, `getMarkdown()`, and `getPageNumber()` methods.
- PaddleOCR angle classification crash (#643) — Fixed input dimensions for V2 angle classifier model.
- Centralized concurrency controls — Fixed 5 places bypassing `resolve_thread_budget()`.
- Chunk page numbers missing (#636) — Fixed `first_page`/`last_page` being null when chunking was configured.
- Ruby OCR backend — Added missing `ocr_internal_document` field.
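
As an illustration of the DPI validation described in the Leptonica fix (#606), the sketch below replaces missing or non-positive resolution values with a default of 72 before any preprocessing would run. The function name `normalize_dpi` and the choice to leave valid values untouched are assumptions; the actual fix lives in the native preprocessing code and may behave differently.

```python
DEFAULT_DPI = 72  # fallback value named in the release note


def normalize_dpi(x_dpi, y_dpi, default=DEFAULT_DPI):
    """Return (x, y) resolution with missing or non-positive values replaced.

    Hypothetical helper: valid values pass through unchanged, which is an
    assumption about the intended behavior.
    """
    def fix(value):
        # None, 0, and negative values are all treated as invalid.
        return value if value and value > 0 else default

    return fix(x_dpi), fix(y_dpi)


if __name__ == "__main__":
    print(normalize_dpi(0, 0))
    print(normalize_dpi(300, 0))
```

Validating before handing the image to the native library keeps the failure out of C++ code, where an uncaught exception aborts the whole process.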
Full Changelog: v4.7.1...v4.7.2