github Michaelliv/markit v0.3.0
v0.3.0 — PDF converter rewrite

one month ago

PDF converter rewrite

Rewrote the PDF converter from scratch with mupdf (native WASM).

What's new

  • Table detection — vector line extraction + raycasting places text into markdown tables
  • Diagram filtering — block diagrams (sparse grids, repeated labels) are excluded from table detection
  • Multi-column layout — two-column documents (legal docs, datasheets) read in correct order
  • Header/footer stripping — repeated running headers removed across pages
  • Image extraction — diagrams cropped and saved as PNGs when imageDir is provided
  • CTM tracking — content stream coordinate transforms applied correctly
  • Agent skill: npx skills add Michaelliv/markit
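
To illustrate the CTM tracking item above, here is a minimal sketch of how a current transformation matrix (CTM) is usually maintained while walking a PDF content stream. It follows the PDF specification's matrix convention (`[a b c d e f]`, with `cm` pre-multiplying the current matrix and `q`/`Q` saving and restoring it). The `Op` type, `trackCtm`, and the `point` pseudo-operator are hypothetical names invented for this example. They are not markit's or mupdf's API, and the real converter may structure this differently.

```typescript
// A PDF transformation matrix [a b c d e f] (PDF spec convention).
type Matrix = [number, number, number, number, number, number];

const IDENTITY: Matrix = [1, 0, 0, 1, 0, 0];

// Returns m × n in PDF's row-vector convention: m is applied first, then n.
function multiply(m: Matrix, n: Matrix): Matrix {
  return [
    m[0] * n[0] + m[1] * n[2],
    m[0] * n[1] + m[1] * n[3],
    m[2] * n[0] + m[3] * n[2],
    m[2] * n[1] + m[3] * n[3],
    m[4] * n[0] + m[5] * n[2] + n[4],
    m[4] * n[1] + m[5] * n[3] + n[5],
  ];
}

// Maps a point from the current user space into page space.
function applyToPoint(m: Matrix, x: number, y: number): [number, number] {
  return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}

// Hypothetical, simplified operator model (an assumption for this sketch).
// "point" stands in for any operator that emits a coordinate (text, line, image).
type Op =
  | { op: "q" }
  | { op: "Q" }
  | { op: "cm"; matrix: Matrix }
  | { op: "point"; x: number; y: number };

function trackCtm(ops: Op[]): Array<[number, number]> {
  let ctm: Matrix = IDENTITY;
  const stack: Matrix[] = [];
  const out: Array<[number, number]> = [];
  for (const o of ops) {
    switch (o.op) {
      case "q": // save graphics state
        stack.push(ctm);
        break;
      case "Q": // restore graphics state (ignore unbalanced Q)
        ctm = stack.pop() ?? ctm;
        break;
      case "cm": // concatenate: the new matrix is applied before the old CTM
        ctm = multiply(o.matrix, ctm);
        break;
      case "point":
        out.push(applyToPoint(ctm, o.x, o.y));
        break;
    }
  }
  return out;
}

const pagePoints = trackCtm([
  { op: "q" },
  { op: "cm", matrix: [1, 0, 0, 1, 100, 50] }, // translate by (100, 50)
  { op: "cm", matrix: [2, 0, 0, 2, 0, 0] },    // scale by 2 inside the translated space
  { op: "point", x: 10, y: 10 },               // lands at (120, 70) in page space
  { op: "Q" },
  { op: "point", x: 10, y: 10 },               // CTM restored, stays at (10, 10)
]);
console.log(pagePoints);
```

Getting the multiplication order wrong (post-multiplying instead of pre-multiplying on `cm`) would place the first point at (220, 120) instead of (120, 70). Errors of this kind typically show up as text landing in the wrong table cell or column.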

Performance

| PDF | Pages | Time |
|---|---|---|
| Bitcoin whitepaper | 9 | 26ms |
| US Constitution | 16 | 56ms |
| Intel PCH datasheet | 224 | 640ms |
| NXP S32K3xx datasheet | 164 | 1.9s |

Testing

58 tests across 4 test files covering grid detection, rendering, extraction, and column detection. Validated against Intel, NXP, Microchip, and Bitcoin whitepaper PDFs.
