Added
- Renderer spec alignment (ISO 32000-1): the CPU rasteriser now paints several previously unsupported constructs, namely tiling patterns (PatternType 1, §8.7.3), Type 3 font glyphs (CharProcs executed under the font matrix with `d0`/`d1`, §9.6.5), mesh shadings (free-form and lattice-form Gouraud triangle meshes and Coons/tensor patches, types 4/5/6/7, plus function-based type 1, §8.7.4.5), text rendering modes 4–7 (glyph-outline clip accumulation across `BT`/`ET`, §9.3.6), and colour-key masking (`/Mask [ranges]`, §8.9.6.4). JPEG 2000 images with chroma-subsampled components are now upsampled and decoded rather than skipped.
- First-class tables in the markdown/HTML converters: the pipeline converter renders detected tables directly (pipe tables with header rows and colspan handling), replacing the fragile text-post-processing path.
- Images, links, and document structure in markdown: figures are emitted as images, `/Link` annotations become `[text](uri)` / `<a href>` (with a safe-scheme gate), heading hierarchy is inferred as `#`–`######`, indentation-based nested lists are preserved, cross-page running headers/footers are detected and filtered, and superscript-marker + page-bottom footnotes become `[^n]` references.
- Hybrid-reference files (`/XRefStm`, §7.5.8.4): a classic trailer's cross-reference-stream supplement is now parsed and merged, so hybrid PDFs resolve all objects.
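The safe-scheme gate mentioned for link conversion can be pictured with a minimal sketch. This is not the converter's actual code: the allow-list, the function names `is_safe_uri` and `link_to_markdown`, and the fallback to plain text are all assumptions made for illustration.

```rust
/// Hypothetical allow-list; the converter's real list is not stated in these notes.
const SAFE_SCHEMES: [&str; 3] = ["http", "https", "mailto"];

/// Returns true when `uri` begins with an allow-listed scheme followed by ':'.
fn is_safe_uri(uri: &str) -> bool {
    match uri.trim().split_once(':') {
        Some((scheme, _)) => SAFE_SCHEMES
            .iter()
            .any(|safe| scheme.eq_ignore_ascii_case(safe)),
        // No scheme at all (e.g. a relative reference): rejected in this sketch.
        None => false,
    }
}

/// Renders a markdown link, or just the link text when the URI fails the gate.
fn link_to_markdown(text: &str, uri: &str) -> String {
    if is_safe_uri(uri) {
        format!("[{}]({})", text, uri.trim())
    } else {
        text.to_string()
    }
}
```

Under these assumptions, `link_to_markdown("spec", "https://example.com")` yields `[spec](https://example.com)`, while a `javascript:` URI degrades to the bare text `spec`.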
Fixed
- Per-glyph coordinates in `extract_words` / `extract_spans` / `extract_text_lines` drifted on CID/Type 0 fonts (#780, part 2): these APIs reconstructed each glyph's x-position by summing nominal advance widths, which omits the ISO 32000-1 §9.4.3 `TJ`-array kerning, so positions drifted cumulatively along a line (up to tens of points) versus `extract_chars`. Each glyph's x now comes from the accurate content-stream position (matching `extract_chars` and Poppler's `pdftotext -bbox`); on the reporter's repro, glyphs within 0.5 pt of the reference went from 15 % to 97 %. Word segmentation is unchanged (the char-width array is untouched), so complex-script extraction does not regress. Thanks @ankursri494 for the report and reproducer.
- Valid ICC profiles reported as `[XCOLOR-005] … not a valid stream` in `validate_pdf_x` (#797): an ICCBased colour space embeds its profile as a stream (§8.6.5.5, `[ /ICCBased stream ]`), but the validator only accepted a bare dictionary and flagged every conforming profile (including the Ghent Workgroup PDF/X-4 suite). It now reads `/N` from the stream dictionary. Thanks @takoportal for the detailed report and repro.
- Structure-tree parsing dropped large trees under a hard-coded budget (#801): `parse_structure_tree` imposed a 200 ms wall-clock budget and a 10 000-element cap and returned no structure tree at all when either was exceeded (e.g. the 756-page ISO 32000-1 specification), which is non-deterministic across machines and silently loses data. The default now parses the complete tree; callers that need to bound the work can opt in via the new `parse_structure_tree_with_budget(&doc, Option<Duration>)` (and `doc.structure_tree_with_budget(…)`). The redundant post-parse size check is removed. Thanks @bjorn3 for the report and proposed API.
- Inter-word spaces dropped on justified `TJ`-positioned text (#803): on documents whose words are positioned with `TJ`/`Td` offsets in embedded Type 0 / Identity-H subset fonts (e.g. the 214-page ISO 21111-10 standard), whole runs were extracted glued together, so `All rights reserved` came out as `Allrightsreserved`. The word-gap detector derives its threshold from the font's space-glyph advance, but under Identity-H character code `0x20` maps to CID 32, an arbitrary glyph rather than the space (ISO 32000-2 §9.7.5.2, §9.10.2: the space is reached through the font's CMap/ToUnicode, never code `0x20`). Reading that ~0.56 em glyph advance as the space width inflated the threshold so far that genuine ~0.25 em word gaps fell below it and were suppressed. Identity-encoded Type 0 fonts now fall back to the 0.25 em typographic default; non-Identity CMaps that legitimately place a space at `0x20` still use their explicit `/W` entry. Thanks @Goldziher for the precise report and geometry.
- Numeric median selection: heading/base-font-size statistics now use `select_nth_unstable_by` (exact O(n)) instead of a full sort.
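The median-selection change in the last item can be sketched with the standard library alone. The function name `median_font_size` and the choice of the upper middle element for even-length input are assumptions of this sketch, not details taken from the crate.

```rust
/// Median of a set of font sizes via selection instead of a full sort.
/// For an even count this returns the upper of the two middle values
/// (an assumption of this sketch). The slice is reordered in place.
fn median_font_size(sizes: &mut [f32]) -> Option<f32> {
    if sizes.is_empty() {
        return None;
    }
    let mid = sizes.len() / 2;
    // `select_nth_unstable_by` places the element that would sit at `mid`
    // in sorted order at that index, without sorting the rest.
    // `total_cmp` gives f32 a total order, so NaN cannot cause a panic.
    let (_, median, _) = sizes.select_nth_unstable_by(mid, |a, b| a.total_cmp(b));
    Some(*median)
}
```

Selection only partitions the slice around the requested index, which is why it avoids the O(n log n) cost of sorting when a single order statistic is all that is needed.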
Thanks to @ankursri494 (#780), @takoportal (#797), @bjorn3 (#801), and @Goldziher (#803) for reporting the issues that drove this release.
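To make the #780 drift concrete, here is a small sketch of how glyph origins diverge when `TJ` adjustments are ignored. It is an illustration only: the `TjItem` type and `glyph_origins` function are hypothetical, and character spacing, word spacing, and horizontal scaling are assumed to be at their defaults (0, 0, 100 %).

```rust
/// One element of a `TJ` array: a glyph with its nominal advance, or a
/// positioning adjustment. Both are in thousandths of a text-space unit.
enum TjItem {
    Glyph { advance: f32 },
    Adjust(f32),
}

/// Returns each glyph's x origin in points, starting from x = 0.
/// With `apply_adjustments == false` this mimics the old behaviour of
/// summing nominal advances only.
fn glyph_origins(items: &[TjItem], font_size: f32, apply_adjustments: bool) -> Vec<f32> {
    let mut x = 0.0;
    let mut origins = Vec::new();
    for item in items {
        match item {
            TjItem::Glyph { advance } => {
                origins.push(x);
                x += advance / 1000.0 * font_size;
            }
            TjItem::Adjust(amount) => {
                if apply_adjustments {
                    // ISO 32000-1 §9.4.3: the number is subtracted from the
                    // current position, so negative values widen the gap.
                    x -= amount / 1000.0 * font_size;
                }
            }
        }
    }
    origins
}
```

For three glyphs of advance 500 separated by adjustments of -250 at a 10 pt font size, the adjusted origins are 0, 7.5, and 15 pt, while summing advances alone gives 0, 5, and 10 pt. The error grows by 2.5 pt at every gap, which is the cumulative drift described in the fix.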
Installation
Rust (crates.io)
cargo add pdf_oxide

Python (PyPI)
pip install pdf_oxide

JavaScript/WASM (npm)
npm install pdf-oxide-wasm

CLI (Homebrew)
brew install yfedoseev/tap/pdf-oxide

CLI (Scoop, Windows)
scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)
curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)
cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)
cargo install pdf_oxide_mcp

Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
Platform Support
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | `pdf_oxide-linux-x86_64-*.tar.gz` |
| Linux | x86_64 (musl) | `pdf_oxide-linux-x86_64-musl-*.tar.gz` |
| Linux | ARM64 | `pdf_oxide-linux-aarch64-*.tar.gz` |
| macOS | x86_64 (Intel) | `pdf_oxide-macos-x86_64-*.tar.gz` |
| macOS | ARM64 (Apple Silicon) | `pdf_oxide-macos-aarch64-*.tar.gz` |
| Windows | x86_64 | `pdf_oxide-windows-x86_64-*.zip` |
Changelog
See CHANGELOG.md for full details.