Added
- Re-added `--layout` boolean CLI flag for easy layout detection enablement (use `--layout` to enable with model defaults, `--layout false` to explicitly disable)
- arXiv watermark/sidebar noise filtering for academic PDFs — strips LaTeX sidebar identifiers from extracted text
- Second-tier cross-page repeating text detection — catches conference headers and journal running titles that repeat on >70% of pages but appear outside the margin zone
- Figure/picture text suppression — text inside layout-detected Picture regions is now marked as page furniture and excluded from body output
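
The cross-page repeating text check listed above can be illustrated with a minimal sketch. This is not the project's implementation: the function name `repeating_lines`, the whitespace-only normalization, and the page representation (a list of text lines per page) are assumptions made for illustration. Only the "more than 70% of pages" threshold comes from the entry itself.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper: returns every line that appears on more than
/// `threshold` (a fraction between 0 and 1) of the given pages.
fn repeating_lines(pages: &[Vec<&str>], threshold: f64) -> HashSet<String> {
    let mut page_counts: HashMap<String, usize> = HashMap::new();

    for page in pages {
        // Count each trimmed, non-empty line at most once per page.
        let unique: HashSet<String> = page
            .iter()
            .map(|line| line.trim().to_string())
            .filter(|line| !line.is_empty())
            .collect();
        for line in unique {
            *page_counts.entry(line).or_insert(0) += 1;
        }
    }

    let total = pages.len() as f64;
    page_counts
        .into_iter()
        .filter(|(_, count)| total > 0.0 && (*count as f64) / total > threshold)
        .map(|(line, _)| line)
        .collect()
}

fn main() {
    let pages = vec![
        vec!["Proc. of ExampleConf 2024", "1 Introduction", "Body text A"],
        vec!["Proc. of ExampleConf 2024", "Body text B"],
        vec!["Proc. of ExampleConf 2024", "Body text C"],
        vec!["Body text D"],
    ];

    // The header appears on 3 of 4 pages (75%), which exceeds the 70% threshold.
    let furniture = repeating_lines(&pages, 0.7);
    for line in &furniture {
        println!("page furniture: {line}");
    }
}
```

Counting per page rather than per occurrence keeps a line that repeats many times within a single page from being mistaken for a running header.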
Fixed
- Figure-internal text leaking into body output — Text from inside figures and diagrams (e.g., diagram labels, axis text) was incorrectly included in the extracted body content, sometimes promoted to headings. The layout detection pipeline now suppresses text paragraphs classified as Picture regions.
- CLI tests now correctly reference `--content-format` instead of deprecated `--output-format`
- Empty image references in PDF markdown/HTML output — PDFs with embedded images produced empty `![]()` references in markdown and `<img src="" alt="">` in HTML output. The PDF structure pipeline now extracts actual image pixel data via pdfium and populates document images, producing proper references.
- Invalid `extractFromFile` config in documentation — Demo code in the TypeScript API reference included invalid configuration parameters that caused runtime errors.
- WASM build failure with `extern "C-unwind"` — The LLVM WASM backend does not support `cleanupret` instructions generated by `extern "C-unwind"` FFI blocks. Added `ffi_extern!` macro that uses `extern "C-unwind"` on native targets (for C++ exception safety) and `extern "C"` on WASM.
- Go module tag format — Go module tags now use the correct `packages/go/v4/vX.Y.Z` format matching the module path in `go.mod`, plus the legacy `packages/go/vX.Y.Z` format for backwards compatibility. Backfilled tags for all stable releases.
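
The ABI switch described in the WASM build fix above could look roughly like the following. This is a hypothetical reconstruction, not the project's actual `ffi_extern!` definition: the macro body, the `target_arch = "wasm32"` predicate, and the use of the C library's `abs` as the foreign function are all assumptions chosen to keep the sketch self-contained. It uses the `unsafe extern` block syntax, which requires Rust 1.82 or newer.

```rust
use std::ffi::c_int;

// Hypothetical sketch of an ABI-selecting macro: the same foreign item
// declarations are emitted twice, and `cfg` keeps exactly one copy.
macro_rules! ffi_extern {
    ($($items:tt)*) => {
        // Native targets: allow foreign (e.g. C++) unwinding to cross the boundary.
        #[cfg(not(target_arch = "wasm32"))]
        unsafe extern "C-unwind" {
            $($items)*
        }

        // WASM targets: fall back to the plain C ABI.
        #[cfg(target_arch = "wasm32")]
        unsafe extern "C" {
            $($items)*
        }
    };
}

ffi_extern! {
    // `abs` from the C standard library, used here only as a stand-in
    // for a real FFI binding.
    fn abs(x: c_int) -> c_int;
}

fn main() {
    // Foreign functions declared this way are unsafe to call.
    let value = unsafe { abs(-3) };
    println!("abs(-3) = {value}");
}
```

Keeping the declarations in one place means the two ABIs cannot drift apart; only the ABI string differs between targets.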
Changed
- CLI documentation updated with all missing extraction override flags (`--layout-table-model`, `--disable-ocr`, `--cache-namespace`, `--cache-ttl-secs`)