Added
- Content-stream order exposed on extracted spans and words (#779): extract_words and extract_text_lines now carry the originating span's sequence (the content-stream emission order). It is surfaced idiomatically on the word/span types of every language binding via the new C-ABI accessor pdf_oxide_word_get_sequence: Python, Node.js and WASM, Go, the JVM (Java/Kotlin/Scala/Clojure), C#, Ruby, PHP, C and C++, Objective-C, Swift, Dart, R, Julia, Zig, and Elixir. This lets consumers tell genuinely consecutive draw calls apart from ones that are spatially close but distant in the stream (e.g. table cells vs. overlays), independent of the final reading order. Thanks @ankursri494 for the request.
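As a rough illustration of how a consumer might use this value, the sketch below groups words that are adjacent in reading order into runs whose sequence numbers are also adjacent. It does not call pdf_oxide: the Word dataclass, its sequence field name, the stream_runs helper, and the sample numbers are all hypothetical stand-ins, and the exact attribute name in each binding may differ.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for an extracted word; not the pdf_oxide type."""
    text: str
    sequence: int  # content-stream emission order of the originating span


def stream_runs(words, max_step=1):
    """Group words (given in reading order) into runs of stream-adjacent words.

    A word joins the current run when its sequence equals the previous word's
    (same span) or follows it by at most max_step; otherwise a new run starts.
    """
    runs = []
    for word in words:
        if runs:
            step = word.sequence - runs[-1][-1].sequence
            if 0 <= step <= max_step:
                runs[-1].append(word)
                continue
        runs.append([word])
    return runs


# Made-up data: an overlay ("DRAFT") sits between two table cells on the page
# but was drawn much later in the content stream.
reading_order = [Word("Name", 4), Word("Qty", 5), Word("DRAFT", 42), Word("Total", 6)]
runs = [[w.text for w in run] for run in stream_runs(reading_order)]
```

Here the overlay ends up in its own run even though it is spatially between the cells, which is the distinction the reading order alone cannot provide.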
Fixed
- A word split by a spurious space when its glyph runs overlap slightly (#791): a single word drawn as two adjacent same-font runs whose glyphs overlap by a fraction of a point (ordinary tight kerning, e.g. (PLANAL) then (TINA) positioned just inside PLANAL's right edge) was extracted as PLANAL TINA. The plain-text assembler now recognises this case by a negative inter-run gap, same font/weight/style, word characters on both sides, real (varying) per-glyph metrics, and the absence of a lowercase→uppercase word boundary. It joins the runs with no inserted space, reconstructing PLANALTINA, matching pdftotext / PyMuPDF / lopdf on the same file. The spans are left unmerged, so page layout, reading order, and table detection are unaffected. Thanks @schelip for the report and minimal repro.
- extract_text --format lines merged table/form cells across column gaps inconsistently (#792): a flat 50 pt column-gap threshold made cell splitting depend on how wide each row's words happened to be, so a header row of short values (CEP/Cidade/UF) split into one line per cell while the value row directly below it (73751-452/PLANALTINA/GO, wider words, same gutters) merged into a single line. The threshold in line clustering is now font-relative ((font_size × 3).max(30 pt)), so rows sharing the same columns split the same way. Thanks @schelip for the report.
- Span-derived APIs reported unresolved (alias) font names (#780, part 1): extract_spans, extract_words, and extract_text_lines reported the page's /Resources/Font alias (e.g. F1) rather than the resolved /BaseFont (e.g. Helvetica, CIDFont+F1). They now resolve to the base font, matching extract_chars and pdfminer.six / pdfplumber. (The second part of #780, per-glyph coordinate drift on CID/Type0 fonts in extract_words, is tracked for a follow-up release.) Thanks @ankursri494 for the report.
- Cased and caseless non-Latin prose no longer mis-detected as spatial tables: the no-rulings table detector's prose-paragraph guard now recognises sentence boundaries in cased non-Latin scripts and treats the Bengali/Devanagari danda (।, ॥) as a sentence terminator, so complex-script running prose that happens to align into columns is not extracted as a table grid.
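The two text-assembly fixes above (#791 and #792) can be sketched roughly as follows. This is a minimal illustration built only from the conditions and formula stated in these notes, not pdf_oxide's actual code: the Run dataclass, the varying_glyph_widths flag, the should_join and split_row helpers, and all coordinates are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical stand-in for a text run; coordinates are in points."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    font: str = "Helvetica"  # stands for the font/weight/style triple
    font_size: float = 10.0
    varying_glyph_widths: bool = True  # assumed flag for "real" per-glyph metrics


def should_join(left, right):
    """Return True when two adjacent runs look like one word (#791 conditions)."""
    if right.x0 - left.x1 >= 0:  # needs a negative inter-run gap (overlap)
        return False
    if left.font != right.font:
        return False
    if not (left.varying_glyph_widths and right.varying_glyph_widths):
        return False
    a, b = left.text[-1:], right.text[:1]
    if not (a.isalnum() and b.isalnum()):  # word characters on both sides
        return False
    if a.islower() and b.isupper():  # lowercase→uppercase is a word boundary
        return False
    return True


def column_gap_threshold(font_size):
    """Font-relative threshold from #792: (font_size × 3).max(30 pt)."""
    return max(font_size * 3.0, 30.0)


def split_row(runs, threshold):
    """Split one row into cells wherever the horizontal gap exceeds threshold."""
    cells, current, prev = [], [], None
    for run in sorted(runs, key=lambda r: r.x0):
        if prev is not None and run.x0 - prev.x1 > threshold:
            cells.append(" ".join(current))
            current = []
        current.append(run.text)
        prev = run
    if current:
        cells.append(" ".join(current))
    return cells


# Made-up geometry: both rows share the same column starts (50, 140, 230).
header = [Run("CEP", 50, 68), Run("Cidade", 140, 172), Run("UF", 230, 242)]
values = [Run("73751-452", 50, 98), Run("PLANALTINA", 140, 196), Run("GO", 230, 243)]

flat_header = split_row(header, 50.0)  # old flat threshold
flat_values = split_row(values, 50.0)
relative_values = split_row(values, column_gap_threshold(10.0))
joined = should_join(Run("PLANAL", 140, 176.0), Run("TINA", 175.6, 196))
```

With these sample numbers the flat 50 pt threshold splits the header row into three cells but leaves the value row as one line, while the font-relative threshold (30 pt at a 10 pt font size) splits both rows the same way.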
Thanks to @schelip (#791, #792) and @ankursri494 (#779, #780) for reporting the issues that drove this release.
Installation
Rust (crates.io)
cargo add pdf_oxide
Python (PyPI)
pip install pdf_oxide
JavaScript/WASM (npm)
npm install pdf-oxide-wasm
CLI (Homebrew)
brew install yfedoseev/tap/pdf-oxide
CLI (Scoop, Windows)
scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide
CLI (Shell installer)
curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh
CLI (cargo-binstall)
cargo binstall pdf_oxide_cli
MCP Server (for AI assistants)
cargo install pdf_oxide_mcp
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
Platform Support
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
Changelog
See CHANGELOG.md for full details.