Added
- MDX format support (`mdx` feature): Extract text from `.mdx` files, stripping JSX/import/export syntax while preserving markdown content, frontmatter, tables, and code fences.
- List supported formats API (#404): Query all supported file extensions and MIME types via `list_supported_formats()` in Rust, the `GET /formats` REST endpoint, the `list_formats` MCP tool, or the `kreuzberg formats` CLI subcommand.
Fixed
- PDF ligature corruption in CM/Type1 fonts: Added contextual ligature repair for PDFs with broken ToUnicode CMaps where pdfium doesn't flag encoding errors. Fixes corrupted text like `di!erent` → `different`, `o"ces` → `offices`, `#nancial` → `financial` in LaTeX-generated PDFs. Uses vowel/consonant heuristic to disambiguate ambiguous ligature mappings. Applied to both structure tree and heuristic extraction paths.
- PDF dehyphenation across line boundaries: Added paragraph-level dehyphenation that rejoins words broken across PDF line breaks (e.g. `soft ware` → `software`, `recog nition` → `recognition`). Handles both explicit trailing hyphens (Case 1) and implicit breaks where pdfium strips the hyphen (Case 2, using full-line detection). Applied to both structure tree and heuristic extraction paths.
- PDF page markers missing in Markdown and OCR output (#412): Page markers (`insert_page_markers` / `marker_format`) were not inserted when using Markdown output format or OCR extraction since the 4.3.5 pipeline rewrite. Fixed by threading the marker format through the markdown assembly pipeline and OCR page joining. Djot output inherits markers automatically.
- PDF Djot/HTML output quality parity: Djot and HTML output formats now use the same high-quality structural extraction pipeline as Markdown (headings, tables, bold/italic, dehyphenation). Previously these formats fell back to plain text split into paragraphs, losing all document structure.
- PDF sidebar text pollution: Widened the margin band for sidebar character filtering from 5% to 6.5% of page width, fixing cases where rotated sidebar text (e.g. arXiv identifiers) leaked into extracted content.
- Node.js PDF config options not passed to native binding: Fixed `extractAnnotations`, `hierarchy`, `topMarginFraction`, and `bottomMarginFraction` PDF config fields being silently dropped by the TypeScript config normalizer, causing PDF annotation extraction to always return `undefined` in the Node.js binding.
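
For readers curious how a vowel/consonant heuristic like the one in the ligature fix above might work, here is a minimal Python sketch. It is illustrative only and is not the library's implementation: the `CANDIDATES` table, the `repair_ligatures` name, and the rule itself (pick the ligature ending in a consonant when a vowel follows, otherwise the one ending in `i`) are assumptions chosen to reproduce the three examples in the entry. It also assumes the text is already known to come from an affected font, since the same characters are legitimate punctuation elsewhere.

```python
import re

VOWELS = set("aeiou")

# Hypothetical candidate table: each corrupted character maps to
# (ligature ending in a consonant, ligature ending in a vowel).
# The real mappings depend on the font's broken ToUnicode CMap.
CANDIDATES = {
    "!": ("ff", "ffi"),
    '"': ("ff", "ffi"),
    "#": ("fl", "fi"),
}

# Only touch a marker that is directly followed by a lowercase letter,
# so sentence-final punctuation such as "Done!" is left alone.
_MARKER = re.compile(r'([!"#])(?=[a-z])')


def repair_ligatures(text: str) -> str:
    """Replace corrupted ligature markers using the following character."""

    def pick(match: re.Match) -> str:
        consonant_ending, vowel_ending = CANDIDATES[match.group(1)]
        # The lookahead guarantees a character exists at match.end().
        next_char = text[match.end()]
        # A following vowel suggests the ligature ended in a consonant;
        # a following consonant suggests it ended in "i".
        return consonant_ending if next_char in VOWELS else vowel_ending

    return _MARKER.sub(pick, text)
```

Under these assumptions, `repair_ligatures("di!erent")` gives `different`. A marker that opens a quotation (a `"` directly before a word) would be misrepaired by this sketch, which is why some form of per-font gating is assumed.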
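
The explicit-hyphen case (Case 1) of the dehyphenation fix above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the library's code: the `dehyphenate` name and the lowercase-continuation rule are hypothetical, and Case 2 (implicit breaks found through full-line detection) is omitted because it needs layout information that plain strings do not carry.

```python
def dehyphenate(lines: list[str]) -> str:
    """Join the lines of one paragraph, rejoining words split by a trailing hyphen."""
    paragraph = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not paragraph:
            paragraph = line
            continue
        # A trailing hyphen counts as a word break only if a letter precedes it.
        broken_word = (
            paragraph.endswith("-")
            and len(paragraph) >= 2
            and paragraph[-2].isalpha()
        )
        if broken_word and line[0].islower():
            # Assumed rule: a lowercase continuation means the hyphen was
            # inserted by line breaking, so drop it.
            paragraph = paragraph[:-1] + line
        elif broken_word:
            # Uppercase or non-letter continuation: keep the hyphen
            # (e.g. "Anglo-" + "Saxon").
            paragraph += line
        else:
            paragraph += " " + line
    return paragraph
```

A genuine compound split at its hyphen (`well-` followed by `known`) would lose the hyphen under this rule; telling the two cases apart needs a dictionary or other signal, which this sketch does not attempt.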