github yfedoseev/pdf_oxide v0.3.70
v0.3.70 | Extraction-fidelity release — kerning-split words rejoined in plain text, table/form line cells split consistently regardless of word width, resolved `/BaseFont` names on the span/word APIs, and content-stream order exposed on extracted spans and words.

2 days ago

Added

  • Content-stream order exposed on extracted spans and words (#779): extract_words and extract_text_lines now carry the originating span's sequence (its content-stream emission order). It is surfaced idiomatically, via the new C-ABI accessor pdf_oxide_word_get_sequence, on the word/span types of every language binding: Python, Node.js and WASM, Go, the JVM (Java/Kotlin/Scala/Clojure), C#, Ruby, PHP, C and C++, Objective-C, Swift, Dart, R, Julia, Zig, and Elixir. This lets consumers tell genuinely consecutive draw calls apart from spatially-close-but-stream-distant ones (e.g. table cells vs. overlays), independent of the final reading order. Thanks @ankursri494 for the request.
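
As a rough illustration of how a consumer might use the sequence value, here is a minimal sketch. The `Word` struct, its field names, and the `u32` type are hypothetical stand-ins, since these notes do not show the binding types; only the idea (words carry their originating span's emission order) comes from the entry above.

```rust
/// Hypothetical stand-in for an extracted word; the real pdf_oxide types may differ.
#[derive(Debug, Clone, PartialEq)]
struct Word {
    text: String,
    /// Content-stream emission order of the originating span (type assumed).
    sequence: u32,
}

/// True when two words come from the same span or from directly consecutive spans.
fn stream_adjacent(a: &Word, b: &Word) -> bool {
    a.sequence.abs_diff(b.sequence) <= 1
}

/// Given words already sorted in reading order, list the neighbouring pairs
/// that sit next to each other on the page but are far apart in the stream.
fn stream_distant_neighbours(words: &[Word]) -> Vec<(&str, &str)> {
    words
        .windows(2)
        .filter(|pair| !stream_adjacent(&pair[0], &pair[1]))
        .map(|pair| (pair[0].text.as_str(), pair[1].text.as_str()))
        .collect()
}
```

A pair flagged by `stream_distant_neighbours` would be a candidate for an overlay or a separate table cell rather than continuous text; the "at most one apart" cutoff is an arbitrary choice for the sketch.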

Fixed

  • A word split by a spurious space when its glyph runs overlap slightly (#791) — a single word drawn as two adjacent same-font runs whose glyphs overlap by a fraction of a point (ordinary tight kerning, e.g. (PLANAL) then (TINA) positioned just inside PLANAL's right edge) was extracted as PLANAL TINA. The plain-text assembler now recognises this case — a negative inter-run gap, same font/weight/style, word characters on both sides, real (varying) per-glyph metrics, and not a lowercase→uppercase word boundary — and joins the runs with no inserted space, reconstructing PLANALTINA, matching pdftotext / PyMuPDF / lopdf on the same file. The spans are left unmerged, so page layout, reading order, and table detection are unaffected. Thanks @schelip for the report and minimal repro.
  • extract_text --format lines merged table/form cells across column gaps inconsistently (#792) — a flat 50 pt column-gap threshold made cell splitting depend on how wide each row's words happened to be, so a header row of short values (CEP/Cidade/UF) split into one line per cell while the value row directly below it (73751-452/PLANALTINA/GO, wider words, same gutters) merged into a single line. The threshold in line clustering is now font-relative ((font_size × 3).max(30 pt)), so rows sharing the same columns split the same way. Thanks @schelip for the report.
  • Span-derived APIs reported unresolved (alias) font names (#780, part 1): extract_spans, extract_words, and extract_text_lines reported the page's /Resources/Font alias (e.g. F1) rather than the resolved /BaseFont (e.g. Helvetica, CIDFont+F1). They now resolve to the base font, matching extract_chars and pdfminer.six / pdfplumber. (The second part of #780, per-glyph coordinate drift on CID/Type0 fonts in extract_words, is tracked for a follow-up release.) Thanks @ankursri494 for the report.
  • Cased and caseless non-Latin prose no longer mis-detected as spatial tables: the no-rulings table detector's prose-paragraph guard now recognises sentence boundaries in cased non-Latin scripts and treats the Bengali/Devanagari danda (।, ॥) as a sentence terminator, so complex-script running prose that happens to align into columns is not extracted as a table grid.
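
The join conditions listed for #791 can be read as a single predicate over two adjacent runs. The sketch below is one possible reading, not the library's implementation: the `Run` struct and its fields are hypothetical, "word characters" is approximated with `is_alphanumeric`, and "real (varying) per-glyph metrics" is assumed to mean that the glyph widths within each run are not all identical.

```rust
/// Hypothetical stand-in for a text run; field names are illustrative only.
struct Run {
    text: String,
    font: String,
    bold: bool,
    italic: bool,
    x_start: f32,
    x_end: f32,
    glyph_widths: Vec<f32>,
}

/// Assumed meaning of "real (varying) metrics": at least two glyph widths differ.
fn has_varying_metrics(widths: &[f32]) -> bool {
    widths.windows(2).any(|w| (w[0] - w[1]).abs() > f32::EPSILON)
}

/// True when `right` should be appended to `left` with no inserted space.
fn should_join_without_space(left: &Run, right: &Run) -> bool {
    // Characters on either side of the seam; empty runs are never joined.
    let (last, first) = match (left.text.chars().last(), right.text.chars().next()) {
        (Some(l), Some(f)) => (l, f),
        _ => return false,
    };
    let gap = right.x_start - left.x_end;
    gap < 0.0 // the runs overlap slightly
        && left.font == right.font
        && left.bold == right.bold
        && left.italic == right.italic
        && last.is_alphanumeric()
        && first.is_alphanumeric()
        && has_varying_metrics(&left.glyph_widths)
        && has_varying_metrics(&right.glyph_widths)
        && !(last.is_lowercase() && first.is_uppercase()) // keep fooBar-style boundaries
}
```

Under these assumptions, PLANAL followed by TINA starting just inside PLANAL's right edge satisfies every condition and is joined, while the same pair with a positive gap is left to the normal space-insertion logic.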

Thanks to @schelip (#791, #792) and @ankursri494 (#779, #780) for reporting the issues that drove this release.
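
To see why the font-relative threshold from #792 makes rows split consistently, here is a small sketch. Only the formula `(font_size × 3).max(30 pt)` comes from the notes; the `count_cells` helper, the strict "greater than" comparison, and the sample coordinates are assumptions made for illustration.

```rust
/// Column-gap threshold from the notes: three times the font size, floored at 30 pt.
fn column_gap_threshold(font_size: f32) -> f32 {
    (font_size * 3.0).max(30.0)
}

/// Hypothetical helper: counts the cells in one row of (x_start, x_end) word
/// boxes, sorted left to right, starting a new cell wherever the horizontal
/// gap between neighbouring words exceeds `threshold`.
fn count_cells(word_boxes: &[(f32, f32)], threshold: f32) -> usize {
    if word_boxes.is_empty() {
        return 0;
    }
    1 + word_boxes
        .windows(2)
        .filter(|w| w[1].0 - w[0].1 > threshold)
        .count()
}
```

With columns starting at x = 50, 150, and 250, a header row of short words leaves gaps of about 82 pt and 70 pt, while a value row of wider words leaves 48 pt and 38 pt. A flat 50 pt threshold splits the first row into three cells and merges the second into one; at a 10 pt font size the font-relative threshold is 30 pt, so both rows split into three.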


Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform   Architecture            Archive
Linux      x86_64 (glibc)          pdf_oxide-linux-x86_64-*.tar.gz
Linux      x86_64 (musl)           pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux      ARM64                   pdf_oxide-linux-aarch64-*.tar.gz
macOS      x86_64 (Intel)          pdf_oxide-macos-x86_64-*.tar.gz
macOS      ARM64 (Apple Silicon)   pdf_oxide-macos-aarch64-*.tar.gz
Windows    x86_64                  pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.
