Added
- Styled HTML output: new `HtmlOutputConfig` on `ExtractionConfig` with 5 built-in themes (`default`, `github`, `dark`, `light`, `unstyled`), semantic `kb-*` CSS class hooks on every structural element, CSS custom properties (`--kb-*`), custom CSS injection (inline or file), and a configurable class prefix. The existing `Html` output format is upgraded in place when `html_output` is set (#633, #665)
- 5 new CLI flags: `--html-theme`, `--html-css`, `--html-css-file`, `--html-class-prefix`, `--html-no-embed-css`. Any of these flags implicitly sets `--content-format html`
- `HtmlOutputConfig` and `HtmlTheme` types exposed in the Rust public API
Changed
- Vendored yake-rust 1.0.3 into kreuzberg core, removing external dependency
  - Fixes #676: `BacktrackLimitExceeded` panic on large files (10+ MB) by replacing regex-based sentence splitting with a memchr-based approach
  - Expanded YAKE stopwords from 34 to 64 languages using kreuzberg's unified stopwords module
  - Removed 6 transitive dependencies (`yake-rust`, `segtok`, `fancy-regex`, `streaming-stats`, `hashbrown`, `levenshtein`)
- Styled HTML renderer included in the `html` feature (no separate `html-styled` feature gate)
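For readers curious why the #676 change removes the backtracking failure mode, here is a minimal sketch of sentence splitting by a single forward byte scan. It is not kreuzberg's implementation: the function name `split_sentences` and the terminator rule are assumptions made for illustration, and it uses only the standard library as a stand-in for the `memchr` crate.

```rust
/// Hypothetical helper: split `text` into sentences by scanning bytes for
/// `.`, `!` or `?` followed by ASCII whitespace or the end of input.
/// One linear pass, so there is no regex engine and no backtracking limit.
fn split_sentences(text: &str) -> Vec<&str> {
    let bytes = text.as_bytes();
    let mut sentences = Vec::new();
    let mut start = 0;

    for i in 0..bytes.len() {
        let is_terminator = matches!(bytes[i], b'.' | b'!' | b'?');
        let at_boundary = i + 1 == bytes.len() || bytes[i + 1].is_ascii_whitespace();

        if is_terminator && at_boundary {
            // Terminators are single-byte ASCII, so `i + 1` is always a
            // valid char boundary and this slice cannot panic.
            let sentence = text[start..=i].trim();
            if !sentence.is_empty() {
                sentences.push(sentence);
            }
            start = i + 1;
        }
    }

    // Keep any trailing text that has no terminator.
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}
```

Because the scan never revisits a byte, its cost grows linearly with input size, which is the property that matters for 10+ MB files.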
Fixed
- PPTX: panic on a non-char boundary during page boundary recomputation. Byte offsets could land inside multi-byte UTF-8 characters (e.g. `…`, U+2026), causing a panic when slicing content (#674)
- PDF: `include_headers`/`include_footers` flags ignored by layout-model furniture stripping. When a layout-detection model classified paragraphs as `PageHeader` or `PageFooter`, they were unconditionally stripped as furniture regardless of the `ContentFilterConfig` flag values. Setting `strip_repeating_text=false` with `include_headers=true` now correctly preserves those regions (#670)
- PDF: heuristic table detector misclassifies body text as tables on slide-like PDFs. PowerPoint-exported PDFs with column-like text gaps produced false-positive 2–3 row "tables" whose bounding boxes covered the entire page, suppressing all body text from the structured extraction pipeline. Tables with ≤3 rows spanning >50% of the page height are now rejected as false positives
- PPTX: `ImageExtractionConfig.inject_placeholders` silently ignored. Setting `inject_placeholders=false` now correctly suppresses image references in PPTX markdown output (#671, #677)
- DOCX/HTML/DocBook/LaTeX/RST: `inject_placeholders` config ignored. All extractors now honour `ImageExtractionConfig.inject_placeholders` to suppress image reference injection when set to `false`
- PPTX public API cleanup: `extract_pptx_from_path` and `extract_pptx_from_bytes` now accept `&PptxExtractionOptions` instead of 6 positional parameters
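The PPTX char-boundary panic (#674) is a general Rust pitfall: slicing a `&str` at a byte offset that falls inside a multi-byte character panics. The sketch below shows one common guard, assuming offsets should snap back to the previous boundary. The helper name `clamp_to_char_boundary` is hypothetical and is not taken from the kreuzberg source.

```rust
/// Hypothetical helper: move `offset` back to the nearest UTF-8 char
/// boundary at or before it, so that `&s[..result]` never panics.
fn clamp_to_char_boundary(s: &str, offset: usize) -> usize {
    // Offsets past the end are clamped to the string length first.
    let mut idx = offset.min(s.len());
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}
```

For example, in `"ab…cd"` the ellipsis occupies bytes 2 to 4, so `&s[..3]` would panic, whereas `&s[..clamp_to_char_boundary(s, 3)]` yields `"ab"`.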