Features
- `extract_page_text()` Single-Call DTO (#268) — New `PageText` struct returns spans, characters, and page dimensions from a single extraction pass, eliminating redundant content stream parsing. Available across Rust, Python, and WASM.
- Column-Aware Reading Order (#270) — New `extract_spans_with_reading_order()` method accepts a `ReadingOrder` parameter. `ReadingOrder::ColumnAware` uses XY-Cut spatial partitioning to detect columns and read each column top-to-bottom, fixing garbled text for multi-column PDFs.
- Per-Character Bounding Boxes from Font Metrics (#269) — `TextSpan` now carries per-glyph advance widths captured during extraction. `to_chars()` produces accurate per-character bounding boxes using font metrics instead of uniform width division. Available as `span.char_widths` in Python and `span.charWidths` in WASM (omitted when empty).
- `is_monospace` Flag on TextSpan/TextChar (#271) — Exposes the PDF font descriptor FixedPitch bit, with fallback name heuristic (Courier, Consolas, Mono, Fixed). Eliminates the need for fragile font-name string matching.
- `Pdf::from_bytes()` Constructor (#252) — Opens existing PDFs from in-memory bytes without requiring a file path. Available across Rust, Python (`Pdf.from_bytes(data)`), and WASM (`WasmPdf.fromBytes(data)`).
- Path Operations in Python (#261) — `extract_paths()` now includes an `operations` list with individual path commands (move_to, line_to, curve_to, rectangle, close_path) and their coordinates. WASM `extractPaths()` also aligned.
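To illustrate why per-glyph advance widths matter for #269, here is a minimal, self-contained sketch that compares uniform width division with placement by advance width. It does not call pdf_oxide and is not the library's implementation; `CharBox`, `boxes_uniform`, `boxes_from_advances`, and the sample advances are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    """Horizontal extent of one character (hypothetical type for this sketch)."""
    char: str
    x0: float
    x1: float


def boxes_uniform(text: str, x0: float, width: float) -> list[CharBox]:
    """Split the span width evenly across characters (uniform division)."""
    if not text:
        return []
    step = width / len(text)
    return [CharBox(c, x0 + i * step, x0 + (i + 1) * step) for i, c in enumerate(text)]


def boxes_from_advances(text: str, x0: float, advances: list[float]) -> list[CharBox]:
    """Place each character using its own advance width."""
    boxes = []
    cursor = x0
    for c, adv in zip(text, advances):
        boxes.append(CharBox(c, cursor, cursor + adv))
        cursor += adv
    return boxes


# Made-up advances in points: a wide "W" followed by two narrow glyphs.
text = "Wil"
advances = [9.0, 3.0, 3.0]
uniform = boxes_uniform(text, 100.0, sum(advances))
metric = boxes_from_advances(text, 100.0, advances)
```

With these numbers, uniform division starts the "i" at x = 105, inside the "W" glyph, while the advance-based boxes start it at x = 109. Both methods agree on the right edge of the span.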
Bug Fixes
- Fixed panic on multi-byte UTF-8 in debug log slicing (#251) — Replaced raw byte-offset string slices with char-boundary-safe helpers, preventing panics when extracting text from CJK/emoji PDFs with debug logging enabled.
- Fixed markdown spacing around styled text (#273) — Markdown output no longer merges words across annotation/style span boundaries (e.g., "visitwww.example.comto" → "visit www.example.com to").
- Fixed Form XObject /Matrix application (#266) — Text extraction now correctly applies Form XObject transformation matrices and wraps in implicit q/Q save/restore per PDF spec Section 8.10.1.
- Fixed text matrix advance for rotated text (#266) — Replaced incorrect `total_width / text_matrix.d.abs()` division (divide-by-zero for 90° rotation) with correct `Tm_new = T(tx, 0) × Tm` per ISO 32000-1 Section 9.4.4.
- Fixed prescan CTM loss for deeply nested text (#267) — Replaced backward 4KB scan with forward CTM tracking across the full content stream, capturing outer scaling transforms for text in streams >256KB (e.g., chart axis labels).
- Fixed prescan dropping marked content (BDC/BMC) for tagged PDFs — The forward CTM scan now includes preceding BDC/BMC operators and following EMC operators in region boundaries, preserving MCID, ActualText, and artifact tagging for tagged PDFs in large content streams.
- Fixed deduplication dropping distinct characters (#253) — `deduplicate_overlapping_chars` now checks character identity, not just position. Distinct characters close together (e.g., space followed by 'r' at 1.5pt) are no longer incorrectly removed.
- Fixed text dropped with font-size-as-Tm-scale pattern (#254) — Corrected TD/T* matrix multiplication order per ISO 32000-1 Section 9.4.2. PDFs using `/F1 1 Tf` + scaled `Tm` (common in InDesign, LaTeX) no longer silently lose lines. Also tightened containment filter to require text identity match.
- Fixed markdown merging words in single-word BT/ET blocks (#260) — `to_markdown()` now detects horizontal gaps between consecutive same-line spans and inserts spaces, matching `extract_text()` behavior. Fixes PDFs generated by PDFKit.NET/DocuSign.
- Fixed CLI merge creating blank documents (#262) — `merge_from`/`merge_from_bytes` now properly imports page objects with deep recursive copy of all dependent objects (content streams, fonts, images), remapping indirect references.
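The text matrix advance fix (#266) can be checked by hand. The sketch below is a standalone illustration of `Tm_new = T(tx, 0) × Tm` using the PDF `[a b c d e f]` matrix convention. It is not taken from the library source; `Matrix`, `multiply`, and `advance` are hypothetical helpers, and `tx` is assumed to be the already-computed horizontal displacement in text space.

```python
from typing import NamedTuple


class Matrix(NamedTuple):
    """PDF transformation matrix [a b c d e f] (row-vector convention)."""
    a: float
    b: float
    c: float
    d: float
    e: float
    f: float


def multiply(m1: Matrix, m2: Matrix) -> Matrix:
    """Return m1 x m2, i.e. m1 is applied first, then m2."""
    return Matrix(
        m1.a * m2.a + m1.b * m2.c,
        m1.a * m2.b + m1.b * m2.d,
        m1.c * m2.a + m1.d * m2.c,
        m1.c * m2.b + m1.d * m2.d,
        m1.e * m2.a + m1.f * m2.c + m2.e,
        m1.e * m2.b + m1.f * m2.d + m2.f,
    )


def advance(tm: Matrix, tx: float) -> Matrix:
    """Advance the text matrix by tx: Tm_new = T(tx, 0) x Tm."""
    return multiply(Matrix(1.0, 0.0, 0.0, 1.0, tx, 0.0), tm)


# Text rotated 90 degrees counter-clockwise, positioned at (100, 200).
rotated = Matrix(0.0, 1.0, -1.0, 0.0, 100.0, 200.0)
advanced = advance(rotated, 12.0)

# Unrotated text at the same position, for comparison.
upright = advance(Matrix(1.0, 0.0, 0.0, 1.0, 100.0, 200.0), 12.0)
```

For the rotated matrix the `d` component is 0, so any formula that divides by `d` breaks down. The matrix product instead moves the origin from (100, 200) to (100, 212), which is an advance along the rotated baseline. The upright case moves to (112, 200) as expected.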
Dependencies
- pyo3 0.27.2 → 0.28.2 — Added `skip_from_py_object`/`from_py_object` annotations per new `FromPyObject` opt-in requirement.
- clap 4.5.60 → 4.6.0
- codecov/codecov-action 5 → 6
Breaking Changes (WASM only)
- WASM JSON field names now use camelCase — `TextSpan`, `TextChar`, `PageText`, `TextBlock`, and `TextLine` serialized fields changed from snake_case to camelCase (e.g., `font_name` → `fontName`, `font_size` → `fontSize`, `is_italic` → `isItalic`, `page_width` → `pageWidth`) when the `wasm` feature is enabled. This aligns with JavaScript naming conventions. Rust JSON serialization via serde is only affected when the `wasm` feature is enabled. Python uses PyO3 getters and is unaffected.
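If you have JSON saved from an earlier version (for example, test fixtures or cached output), one way to bring it in line with the new field names is to rename the keys recursively. The following is a rough sketch, not part of pdf_oxide; `snake_to_camel` and `camelize_keys` are hypothetical helpers, and the sample payload uses only field names mentioned above rather than the full serialized shape.

```python
import re


def snake_to_camel(name: str) -> str:
    """Convert a snake_case key such as 'font_name' to camelCase ('fontName')."""
    return re.sub(r"_([a-z0-9])", lambda m: m.group(1).upper(), name)


def camelize_keys(value):
    """Recursively rename dict keys; lists are walked, other values returned as-is."""
    if isinstance(value, dict):
        return {snake_to_camel(k): camelize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [camelize_keys(v) for v in value]
    return value


# Illustrative payload in the old snake_case style.
old_page = {
    "page_width": 612.0,
    "spans": [{"font_name": "Helvetica", "font_size": 12.0, "is_italic": False}],
}
new_page = camelize_keys(old_page)
```

Keys without underscores, such as `spans` here, pass through unchanged, so the helper can be run on data that is already camelCase.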
🏆 Community Contributors
🥇 @Goldziher — Thank you for the comprehensive feature requests (#252, #268, #269, #270, #271) that shaped the text extraction improvements in this release. Your detailed issue reports with code examples and spec references made implementation straightforward! 🚀
🥈 @bsickler — Thank you for the Form XObject matrix fix (#266) and prescan CTM rewrite (#267). These are critical correctness fixes for text extraction in rotated documents and large content streams! 🚀
🥉 @hansmrtn — Thank you for the UTF-8 panic fix (#251). This prevents crashes for any user processing non-ASCII PDFs with debug logging! 🚀
🏅 @jorlow — Thank you for the markdown spacing fix (#273). Clean, well-tested fix for a common user-facing issue! 🚀
🏅 @willywg — Thank you for exposing path operations in Python (#261), giving downstream tools access to individual vector path commands! 🚀
🏅 @titusz — Thank you for reporting the character deduplication (#253) and Tm-scale text dropping (#254) bugs with clear root cause analysis! 🚀
🏅 @oscmejia — Thank you for reporting the markdown word merging issue (#260) with a clear reproduction case! 🚀
🏅 @Inklikdevteam — Thank you for reporting the CLI merge blank pages bug (#262)! 🚀
Installation
Rust (crates.io)
`cargo add pdf_oxide`
Python (PyPI)
`pip install pdf_oxide`
JavaScript/WASM (npm)
`npm install pdf-oxide-wasm`
CLI (Homebrew)
`brew install yfedoseev/tap/pdf-oxide`
CLI (Scoop — Windows)
`scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide`
`scoop install pdf-oxide`
CLI (Shell installer)
`curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh`
CLI (cargo-binstall)
`cargo binstall pdf_oxide_cli`
MCP Server (for AI assistants)
`cargo install pdf_oxide_mcp`
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
Platform Support
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | `pdf_oxide-linux-x86_64-*.tar.gz` |
| Linux | x86_64 (musl) | `pdf_oxide-linux-x86_64-musl-*.tar.gz` |
| Linux | ARM64 | `pdf_oxide-linux-aarch64-*.tar.gz` |
| macOS | x86_64 (Intel) | `pdf_oxide-macos-x86_64-*.tar.gz` |
| macOS | ARM64 (Apple Silicon) | `pdf_oxide-macos-aarch64-*.tar.gz` |
| Windows | x86_64 | `pdf_oxide-windows-x86_64-*.zip` |
Changelog
See CHANGELOG.md for full details.