Fixes
- OCR memory usage reduced 60-78%: Restructured the OCR batch rendering loop to render-and-encode one page at a time instead of holding all decoded RGB buffers simultaneously. A 98-page scanned PDF dropped from 4.6GB to 1.9GB peak RSS (batch_size=4), and from 3.3GB to 713MB (batch_size=1). Batch size now adapts to available system memory on Linux and macOS.
- PDF control character encoding artifacts: PDFs with broken ToUnicode font mappings that produce U+0002 (STX) and other control characters where hyphens should appear now have these replaced with hyphens when between word characters, or stripped otherwise. Fixes garbled output like
re\x02labelling→re-labelling.