github yfedoseev/pdf_oxide v0.3.71
v0.3.71 | Spec-alignment & extraction-leadership release — the renderer gains tiling patterns, Type 3 fonts, and mesh shadings; the markdown converter gains first-class tables, images, links, headings, nested lists, running header/footer removal, and footnotes; plus accurate per-glyph coordinates on the word/span APIs, correct PDF/X ICC validation, and a structure-tree parser that no longer drops large trees.

one hour ago

Added

  • Renderer spec alignment (ISO 32000-1) — the CPU rasteriser now paints several previously-unsupported constructs: tiling patterns (PatternType 1, §8.7.3), Type 3 font glyphs (CharProcs executed under the font matrix with d0/d1, §9.6.5), mesh shadings (free-form and lattice-form Gouraud triangle meshes and Coons/tensor patches — types 4/5/6/7 — plus function-based type 1, §8.7.4.5), text rendering modes 4–7 (glyph-outline clip accumulation across BT/ET, §9.3.6), and colour-key masking (/Mask [ranges], §8.9.6.4). JPEG 2000 images with chroma-subsampled components are now upsampled and decoded rather than skipped.
  • First-class tables in the markdown/HTML converters — the pipeline converter renders detected tables directly (pipe tables with header rows and colspan handling), replacing the fragile text-post-processing path.
  • Images, links, and document structure in markdown — figures are emitted as ![](…), /Link annotations become [text](uri) / <a href> (with a safe-scheme gate), heading hierarchy is inferred as #######, indentation-based nested lists are preserved, cross-page running headers/footers are detected and filtered, and superscript-marker + page-bottom footnotes become [^n] references.
  • Hybrid-reference files (/XRefStm, §7.5.8.4) — a classic trailer's cross-reference-stream supplement is now parsed and merged, so hybrid PDFs resolve all objects.

Fixed

  • Per-glyph coordinates in extract_words / extract_spans / extract_text_lines drifted on CID/Type 0 fonts (#780, part 2) — these APIs reconstructed each glyph's x-position by summing nominal advance widths, which omits the ISO 32000-1 §9.4.3 TJ-array kerning, so positions drifted cumulatively along a line (up to tens of points) versus extract_chars. Each glyph's x now comes from the accurate content-stream position (matching extract_chars and Poppler's pdftotext -bbox); on the reporter's repro, glyphs within 0.5 pt of the reference went from 15 % to 97 %. Word segmentation is unchanged (the char-width array is untouched), so complex-script extraction does not regress. Thanks @ankursri494 for the report and reproducer.
  • Valid ICC profiles reported as [XCOLOR-005] … not a valid stream in validate_pdf_x (#797) — an ICCBased colour space embeds its profile as a stream (§8.6.5.5, [ /ICCBased stream ]), but the validator only accepted a bare dictionary and flagged every conforming profile (including the Ghent Workgroup PDF/X-4 suite). It now reads /N from the stream dictionary. Thanks @takoportal for the detailed report and repro.
  • Structure-tree parsing dropped large trees under a hard-coded budget (#801)parse_structure_tree imposed a 200 ms wall-clock budget and a 10 000-element cap and returned no structure tree at all when either was exceeded (e.g. the 756-page ISO 32000-1 specification), which is non-deterministic across machines and silently loses data. The default now parses the complete tree; callers that need to bound the work can opt in via the new parse_structure_tree_with_budget(&doc, Option<Duration>) (and doc.structure_tree_with_budget(…)). The redundant post-parse size check is removed. Thanks @bjorn3 for the report and proposed API.
  • Inter-word spaces dropped on justified TJ-positioned text (#803) — on documents whose words are positioned with TJ/Td offsets in embedded Type 0 / Identity-H subset fonts (e.g. the 214-page ISO 21111-10 standard), whole runs extracted glued together — All rights reserved came out as Allrightsreserved. The word-gap detector derives its threshold from the font's space-glyph advance, but under Identity-H character code 0x20 maps to CID 32 — an arbitrary glyph, not the space (ISO 32000-2 §9.7.5.2, §9.10.2: the space is reached through the font's CMap/ToUnicode, never code 0x20). Reading that ~0.56 em glyph advance as the space width inflated the threshold so far that genuine ~0.25 em word gaps fell below it and were suppressed. Identity-encoded Type 0 fonts now fall back to the 0.25 em typographic default; non-Identity CMaps that legitimately place a space at 0x20 still use their explicit /W entry. Thanks @Goldziher for the precise report and geometry.
  • Numeric median selection — heading/base-font-size statistics now use select_nth_unstable_by (exact O(n)) instead of a full sort.

Thanks to @ankursri494 (#780), @takoportal (#797), @bjorn3 (#801), and @Goldziher (#803) for reporting the issues that drove this release.


Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform Architecture Archive
Linux x86_64 (glibc) pdf_oxide-linux-x86_64-*.tar.gz
Linux x86_64 (musl) pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux ARM64 pdf_oxide-linux-aarch64-*.tar.gz
macOS x86_64 (Intel) pdf_oxide-macos-x86_64-*.tar.gz
macOS ARM64 (Apple Silicon) pdf_oxide-macos-aarch64-*.tar.gz
Windows x86_64 pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.

Don't miss a new pdf_oxide release

NewReleases is sending notifications on new releases.