github kreuzberg-dev/kreuzberg v4.7.4

8 hours ago

Added

  • Re-added the --layout boolean CLI flag for enabling layout detection (use --layout to enable it with model defaults, --layout false to explicitly disable it)
  • arXiv watermark/sidebar noise filtering for academic PDFs — strips LaTeX sidebar identifiers from extracted text
  • Second-tier cross-page repeating text detection — catches conference headers and journal running titles that repeat on >70% of pages but appear outside the margin zone
  • Figure/picture text suppression — text inside layout-detected Picture regions is now marked as page furniture and excluded from body output
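
The cross-page repeating text detection described above can be illustrated with a minimal sketch. This is not kreuzberg's implementation: the function names, the whitespace/case normalization, and the representation of a page as a list of text lines are all assumptions made for illustration. Only the "more than 70% of pages" threshold comes from the release notes.

```python
from collections import Counter


def normalize(line):
    # Assumed normalization: collapse whitespace and ignore case so that
    # trivially different copies of a running title compare equal.
    return " ".join(line.split()).casefold()


def find_repeating_lines(pages, threshold=0.7):
    """Return normalized lines that occur on more than `threshold` of pages.

    `pages` is a list of pages, each page a list of text lines (hypothetical
    input shape, not kreuzberg's internal data model).
    """
    if not pages:
        return set()
    counts = Counter()
    for lines in pages:
        # Count each distinct line at most once per page.
        counts.update({normalize(line) for line in lines if line.strip()})
    return {text for text, n in counts.items() if n / len(pages) > threshold}


def strip_repeating(pages, threshold=0.7):
    """Drop lines flagged as cross-page repeats from every page."""
    repeated = find_repeating_lines(pages, threshold)
    return [
        [line for line in lines if normalize(line) not in repeated]
        for lines in pages
    ]
```

With four pages where a journal title appears on three of them (75%), `find_repeating_lines` flags the title and `strip_repeating` removes it, leaving the body lines untouched. A real pipeline would likely also consider position and fuzzy matching, which this sketch leaves out.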

Fixed

  • Figure-internal text leaking into body output — Text from inside figures and diagrams (e.g., diagram labels, axis text) was incorrectly included in the extracted body content, sometimes promoted to headings. The layout detection pipeline now suppresses text paragraphs classified as Picture regions.
  • CLI tests now correctly reference --content-format instead of the deprecated --output-format
  • Empty image references in PDF markdown/HTML output — PDFs with embedded images produced empty ![]() references in markdown and <img src="" alt=""> in HTML output. The PDF structure pipeline now extracts actual image pixel data via pdfium and populates document images, producing proper ![](image_N.png) references.
  • Invalid extractFromFile config in documentation — Demo code in the TypeScript API reference included invalid configuration parameters that caused runtime errors.
  • WASM build failure with extern "C-unwind" — The LLVM WASM backend does not support cleanupret instructions generated by extern "C-unwind" FFI blocks. Added ffi_extern! macro that uses extern "C-unwind" on native targets (for C++ exception safety) and extern "C" on WASM.
  • Go module tag format — Go module tags now use the correct packages/go/v4/vX.Y.Z format matching the module path in go.mod, plus the legacy packages/go/vX.Y.Z format for backwards compatibility. Backfilled tags for all stable releases.
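
As an illustration of the two Go module tag formats mentioned above, the following sketch derives both tag names from a release version. The helper name `go_module_tags` and the `module_dir` parameter are hypothetical and not part of kreuzberg's release tooling. The sketch only handles plain stable versions of the form X.Y.Z, and it assumes the major-version path segment applies only from v2 onward, as Go modules require.

```python
import re

# Matches stable versions such as "4.7.4" or "v4.7.4" (no pre-release suffix).
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def go_module_tags(version, module_dir="packages/go"):
    """Return the Go module tags for a release, primary format first.

    Hypothetical helper: `module_dir` is the subdirectory holding go.mod.
    """
    match = SEMVER.match(version)
    if match is None:
        raise ValueError(f"not a stable X.Y.Z version: {version!r}")
    major, minor, patch = match.groups()
    bare = f"v{major}.{minor}.{patch}"
    tags = []
    if int(major) >= 2:
        # Format matching a module path that ends in /vN.
        tags.append(f"{module_dir}/v{major}/{bare}")
    # Legacy format kept for backwards compatibility.
    tags.append(f"{module_dir}/{bare}")
    return tags
```

For this release, `go_module_tags("4.7.4")` yields `packages/go/v4/v4.7.4` followed by `packages/go/v4.7.4`.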

Changed

  • CLI documentation updated with all missing extraction override flags (--layout-table-model, --disable-ocr, --cache-namespace, --cache-ttl-secs)
