github kreuzberg-dev/kreuzberg v4.3.8

9 hours ago

Added

  • MDX format support (mdx feature): Extract text from .mdx files, stripping JSX/import/export syntax while preserving markdown content, frontmatter, tables, and code fences
  • List supported formats API (#404): Query all supported file extensions and MIME types via list_supported_formats() in Rust, GET /formats REST endpoint, list_formats MCP tool, or kreuzberg formats CLI subcommand

Fixed

  • PDF ligature corruption in CM/Type1 fonts: Added contextual ligature repair for PDFs with broken ToUnicode CMaps where pdfium doesn't flag encoding errors. Fixes corrupted text like di!erentdifferent, o"cesoffices, #nancialfinancial in LaTeX-generated PDFs. Uses vowel/consonant heuristic to disambiguate ambiguous ligature mappings. Applied to both structure tree and heuristic extraction paths.
  • PDF dehyphenation across line boundaries: Added paragraph-level dehyphenation that rejoins words broken across PDF line breaks (e.g. soft waresoftware, recog nitionrecognition). Handles both explicit trailing hyphens (Case 1) and implicit breaks where pdfium strips the hyphen (Case 2, using full-line detection). Applied to both structure tree and heuristic extraction paths.
  • PDF page markers missing in Markdown and OCR output (#412): Page markers (insert_page_markers / marker_format) were not inserted when using Markdown output format or OCR extraction since the 4.3.5 pipeline rewrite. Fixed by threading the marker format through the markdown assembly pipeline and OCR page joining. Djot output inherits markers automatically.
  • PDF Djot/HTML output quality parity: Djot and HTML output formats now use the same high-quality structural extraction pipeline as Markdown (headings, tables, bold/italic, dehyphenation). Previously these formats fell back to plain text split into paragraphs, losing all document structure.
  • PDF sidebar text pollution: Widened the margin band for sidebar character filtering from 5% to 6.5% of page width, fixing cases where rotated sidebar text (e.g. arXiv identifiers) leaked into extracted content.
  • Node.js PDF config options not passed to native binding: Fixed extractAnnotations, hierarchy, topMarginFraction, and bottomMarginFraction PDF config fields being silently dropped by the TypeScript config normalizer, causing PDF annotation extraction to always return undefined in the Node.js binding.

Don't miss a new kreuzberg release

NewReleases is sending notifications on new releases.