PDF converter rewrite
Rewrote the PDF converter from scratch with mupdf (native WASM).
What's new
- Table detection — vector line extraction + raycasting places text into markdown tables
- Diagram filtering — block diagrams (sparse grids, repeated labels) are excluded from table detection
- Multi-column layout — two-column documents (legal docs, datasheets) read in correct order
- Header/footer stripping — repeated running headers removed across pages
- Image extraction — diagrams cropped and saved as PNGs when `imageDir` is provided
- CTM tracking — content stream coordinate transforms applied correctly
- Agent skill — `npx skills add Michaelliv/markit`
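
To illustrate the CTM tracking item, here is a minimal sketch of how a current transformation matrix is typically maintained while walking a PDF content stream. It is not the converter's actual code. The `Op` type and the `toPageSpace` function are hypothetical, and the `point` operator is a stand-in for any drawing operator. The sketch relies only on standard PDF semantics: `q` saves the graphics state, `Q` restores it, and `cm` premultiplies its matrix onto the CTM.

```typescript
// PDF affine matrix [a b c d e f], the operand layout of the `cm` operator.
type Matrix = [number, number, number, number, number, number];
type Point = { x: number; y: number };

const IDENTITY: Matrix = [1, 0, 0, 1, 0, 0];

// Row-vector product m x n: applying the result equals applying m, then n.
function multiply(m: Matrix, n: Matrix): Matrix {
  const [a1, b1, c1, d1, e1, f1] = m;
  const [a2, b2, c2, d2, e2, f2] = n;
  return [
    a1 * a2 + b1 * c2,
    a1 * b2 + b1 * d2,
    c1 * a2 + d1 * c2,
    c1 * b2 + d1 * d2,
    e1 * a2 + f1 * c2 + e2,
    e1 * b2 + f1 * d2 + f2,
  ];
}

// Map a point through a matrix: x' = a*x + c*y + e, y' = b*x + d*y + f.
function apply(m: Matrix, p: Point): Point {
  const [a, b, c, d, e, f] = m;
  return { x: a * p.x + c * p.y + e, y: b * p.x + d * p.y + f };
}

// Hypothetical, simplified operator representation (an assumption for this sketch).
type Op =
  | { op: "q" }
  | { op: "Q" }
  | { op: "cm"; matrix: Matrix }
  | { op: "point"; at: Point };

// Walk the operators and return every point in page space.
function toPageSpace(ops: Op[]): Point[] {
  let ctm: Matrix = IDENTITY;
  const stack: Matrix[] = [];
  const out: Point[] = [];
  for (const o of ops) {
    switch (o.op) {
      case "q": // save graphics state
        stack.push(ctm);
        break;
      case "Q": // restore; an unbalanced Q leaves the CTM unchanged
        ctm = stack.pop() ?? ctm;
        break;
      case "cm": // the new matrix is premultiplied onto the current CTM
        ctm = multiply(o.matrix, ctm);
        break;
      case "point":
        out.push(apply(ctm, o.at));
        break;
    }
  }
  return out;
}
```

The order of multiplication matters: a `cm` issued later is applied to coordinates first, so nested transforms inside a `q`/`Q` pair compose from the innermost outward. Skipping this step leaves lines and text in different coordinate spaces, which would break any geometry-based table detection.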
Performance
| Document | Pages | Time |
|---|---|---|
| Bitcoin whitepaper | 9 | 26ms |
| US Constitution | 16 | 56ms |
| Intel PCH datasheet | 224 | 640ms |
| NXP S32K3xx datasheet | 164 | 1.9s |
Testing
58 tests across 4 test files covering grid detection, rendering, extraction, and column detection. Validated against Intel, NXP, Microchip, and Bitcoin whitepaper PDFs.