github kreuzberg-dev/kreuzberg v4.8.1

4 hours ago

Added

  • Styled HTML output — New HtmlOutputConfig on ExtractionConfig with 5 built-in themes (default, github, dark, light, unstyled), semantic kb-* CSS class hooks on every structural element, CSS custom properties (--kb-*), custom CSS injection (inline or file), and configurable class prefix. The existing Html output format is upgraded in-place when html_output is set (#633, #665)
  • 5 new CLI flags: --html-theme, --html-css, --html-css-file, --html-class-prefix, --html-no-embed-css — any flag implicitly sets --content-format html
  • HtmlOutputConfig and HtmlTheme types exposed in Rust public API

Changed

  • Vendored yake-rust 1.0.3 into kreuzberg core, removing external dependency
    • Fixes #676: BacktrackLimitExceeded panic on large files (10+ MB) by replacing regex-based sentence splitting with memchr-based approach
    • Expanded YAKE stopwords from 34 to 64 languages using kreuzberg's unified stopwords module
    • Removed 6 transitive dependencies (yake-rust, segtok, fancy-regex, streaming-stats, hashbrown, levenshtein)
  • Styled HTML renderer included in the html feature (no separate html-styled feature gate)
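
For readers curious about the #676 change, here is a minimal sketch of sentence splitting by byte scanning rather than by regex. It is not the vendored code: the function name `split_sentences`, the terminator set (`.`, `!`, `?`) and the "followed by whitespace" rule are assumptions for illustration, and a plain loop over the bytes stands in for the `memchr` search so the example needs only the standard library. A scan like this visits each byte once and never backtracks, which is why it cannot hit a backtrack limit on large inputs.

```rust
/// Split `text` into sentences with a single forward pass over its bytes.
/// Hypothetical helper: a boundary is a run of '.', '!' or '?' that is
/// followed by ASCII whitespace or by the end of the input.
fn split_sentences(text: &str) -> Vec<&str> {
    let bytes = text.as_bytes();
    let is_terminator = |b: u8| matches!(b, b'.' | b'!' | b'?');
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut i = 0;

    while i < bytes.len() {
        if is_terminator(bytes[i]) {
            // Consume the whole run of terminators ("?!", "...").
            let mut end = i + 1;
            while end < bytes.len() && is_terminator(bytes[end]) {
                end += 1;
            }
            // Only split when the run ends a word, so "1.5" stays intact.
            if end == bytes.len() || bytes[end].is_ascii_whitespace() {
                // `start` and `end` always sit right after an ASCII byte (or at 0),
                // so slicing here can never land inside a multi-byte character.
                let sentence = text[start..end].trim();
                if !sentence.is_empty() {
                    sentences.push(sentence);
                }
                start = end;
            }
            i = end;
        } else {
            i += 1;
        }
    }

    // Keep any trailing text that has no terminator.
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}
```

Calling `split_sentences("Hello world. Version 1.5 is out! Done?")` yields three sentences, with the decimal point in `1.5` left alone.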

Fixed

  • PPTX: panic on non-char-boundary during page boundary recomputation — byte offsets could land inside multi-byte UTF-8 characters (e.g. U+2026), causing a panic when slicing content (#674)
  • PDF: include_headers / include_footers flags ignored by layout-model furniture stripping — when a layout-detection model classified paragraphs as PageHeader or PageFooter, they were unconditionally stripped as furniture regardless of ContentFilterConfig flag values. Setting strip_repeating_text=false with include_headers=true now correctly preserves those regions (#670)
  • PDF: heuristic table detector misclassifies body text as tables on slide-like PDFs — PowerPoint-exported PDFs with column-like text gaps produced false-positive 2–3 row "tables" whose bounding boxes covered the entire page, suppressing all body text from the structured extraction pipeline. Tables with ≤3 rows spanning >50% of the page height are now rejected as false positives
  • PPTX: ImageExtractionConfig.inject_placeholders silently ignored — setting inject_placeholders=false now correctly suppresses ![alt](target) image references in PPTX markdown output (#671, #677)
  • DOCX/HTML/DocBook/LaTeX/RST: inject_placeholders config ignored — all extractors now honour ImageExtractionConfig.inject_placeholders to suppress image reference injection when set to false
  • PPTX public API cleanup: extract_pptx_from_path and extract_pptx_from_bytes now accept &PptxExtractionOptions instead of 6 positional parameters
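
Two of the fixes above reduce to small rules that are easy to state in code. The sketch below is illustrative only and is not taken from the kreuzberg source: `snap_to_char_boundary` and `is_false_positive_table` are hypothetical names, and the parameters (a row count plus table and page heights in the same unit) are assumptions. The first function shows one way to avoid the #674 class of panic by moving a byte offset back to the nearest UTF-8 character boundary before slicing. The second encodes the stated table rule: at most 3 rows and more than 50% of the page height.

```rust
/// Move `index` back to the nearest UTF-8 character boundary in `s`.
/// Offsets past the end are clamped to `s.len()`.
fn snap_to_char_boundary(s: &str, index: usize) -> usize {
    let mut i = index.min(s.len());
    // Offset 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Reject a detected table that has at most 3 rows yet spans more than
/// half of the page height (assumed rule shape, heights in the same unit).
fn is_false_positive_table(row_count: usize, table_height: f32, page_height: f32) -> bool {
    page_height > 0.0 && row_count <= 3 && table_height / page_height > 0.5
}
```

For example, in `"a…b"` the ellipsis (U+2026) occupies bytes 1 to 3, so slicing at offset 2 directly would panic, while `&s[..snap_to_char_boundary(s, 2)]` safely returns `"a"`.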
