Added
- PDF image placeholder toggle: New `inject_placeholders` option on `ImageExtractionConfig` (default: `true`). Set to `false` to extract images as data without injecting references into the markdown content.
Fixed
- Token reduction not applied (#436): Token reduction config was accepted but never executed during extraction. The pipeline now applies `reduce_tokens()` when `token_reduction.mode` is configured.
- Nested HTML table extraction: Nested HTML tables now extract correctly with proper cell data and markdown rendering, using the visitor-based table extraction API from html-to-markdown-rs.
- hOCR plain text output: hOCR conversion now correctly produces plain text when `OutputFormat::Plain` is requested, instead of silently falling back to Markdown.
- PDF garbled text for positioned/tabular content (#431): PDF text extraction now detects X-position gaps between consecutive characters and inserts spaces when the gap exceeds `0.8 × avg_font_size`.
- Chunk page metadata drift with overlap (#439): Chunk byte offsets are now computed via pointer arithmetic from the source text, fixing cumulative drift that caused chunks to report incorrect page numbers when overlap is enabled.
- Node.js metadata casing: Standardized all `Metadata` and `EmailMetadata` fields to `camelCase` in the Node.js/TypeScript bindings. Also corrected pluralization for `authors` and `keywords`.
- WASM build failure on Windows CI: CMake try-compile checks on Windows used the host MSVC compiler (`cl.exe`), which rejected GCC/Clang flags like `-Wno-implicit-function-declaration`. Added `CMAKE_TRY_COMPILE_TARGET_TYPE=STATIC_LIBRARY` to WASM cross-compilation builds.
- WASM OCR build panic when `git`/`patch` unavailable: The tesseract WASM patch application panicked when both `git apply` and `patch` commands failed. Added programmatic C++ source fixups as a fallback, applying all necessary changes via idempotent string replacements.
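
Two of the fixes above (#431 and #439) can be illustrated in isolation. The following is a minimal sketch, not the library's implementation: the `Glyph` type and the functions `join_with_gap_spaces` and `byte_offset_in` are hypothetical names introduced here. It also assumes that the gap is measured from the right edge of the previous glyph to the left edge of the next one, and that the average font size is taken over the glyphs passed in; the changelog does not specify either detail.

```rust
/// A positioned glyph as a PDF text layer might report it.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy)]
pub struct Glyph {
    pub ch: char,
    pub x: f32,         // left edge of the glyph
    pub width: f32,     // advance width of the glyph
    pub font_size: f32, // font size used for this glyph
}

/// Joins glyphs into a string, inserting a space wherever the horizontal
/// gap between consecutive glyphs exceeds 0.8 × the average font size.
pub fn join_with_gap_spaces(glyphs: &[Glyph]) -> String {
    if glyphs.is_empty() {
        return String::new();
    }
    // Assumption: the average is taken over this run of glyphs.
    let avg_font_size =
        glyphs.iter().map(|g| g.font_size).sum::<f32>() / glyphs.len() as f32;
    let threshold = 0.8 * avg_font_size;

    let mut out = String::with_capacity(glyphs.len());
    for (i, g) in glyphs.iter().enumerate() {
        if i > 0 {
            let prev = &glyphs[i - 1];
            // Assumption: gap = next left edge minus previous right edge.
            let gap = g.x - (prev.x + prev.width);
            // Avoid doubling a space that is already present in the text.
            if gap > threshold && g.ch != ' ' && !out.ends_with(' ') {
                out.push(' ');
            }
        }
        out.push(g.ch);
    }
    out
}

/// Returns the byte offset of `chunk` inside `source`, derived from the
/// two slices' addresses rather than from accumulated chunk lengths.
/// Returns `None` if `chunk` is not a sub-slice of `source`.
pub fn byte_offset_in(source: &str, chunk: &str) -> Option<usize> {
    let src_start = source.as_ptr() as usize;
    let chunk_start = chunk.as_ptr() as usize;
    let offset = chunk_start.checked_sub(src_start)?;
    let end = offset.checked_add(chunk.len())?;
    if end <= source.len() {
        Some(offset)
    } else {
        None
    }
}
```

Because the offset comes from where each chunk actually sits in the source buffer, overlapping chunks cannot accumulate error the way a running total of chunk lengths can, which is consistent with the drift described in #439.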