What's New in v4.3.3
PaddleOCR Multi-Language Support (#388)
- 106+ language support via 12 script families: PaddleOCR recognition models now cover english, chinese (simplified+traditional+japanese), latin, korean, east slavic (cyrillic), thai, greek, arabic, devanagari, tamil, telugu, and kannada script families.
- Per-family recognition model architecture: Shared detection/classification models with per-family recognition models and dictionaries, downloaded on demand from HuggingFace.
- Engine pool for concurrent multi-language OCR: Replaced single-engine architecture with a per-family engine pool, enabling concurrent OCR across different languages.
- Backend-agnostic --ocr-language CLI flag: Works with all OCR backends (tesseract, paddle-ocr, easyocr).
- SHA256 checksum verification: All model downloads are verified against embedded checksums.
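
To illustrate the per-family engine pool and the checksum verification listed above, here is a minimal Python sketch. It is not the project's implementation: `EnginePool`, `verify_sha256`, and the factory callback are hypothetical names chosen for this example, and the engine objects are stand-ins for real recognition models.

```python
import hashlib
import threading
from typing import Callable, Dict


def verify_sha256(data: bytes, expected_hex: str) -> bool:
    """Return True when the SHA256 digest of `data` matches `expected_hex`."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


class EnginePool:
    """Lazily creates and caches one engine per script family (hypothetical)."""

    def __init__(self, factory: Callable[[str], object]) -> None:
        self._factory = factory              # builds an engine for a family name
        self._engines: Dict[str, object] = {}
        self._lock = threading.Lock()        # guards the cache across threads

    def get(self, family: str) -> object:
        # Create the engine on first request, then reuse it for later calls.
        with self._lock:
            engine = self._engines.get(family)
            if engine is None:
                engine = self._factory(family)
                self._engines[family] = engine
            return engine
```

In this sketch, requests for different families receive different engine objects, so OCR for one language does not have to wait on an engine configured for another.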
Centralized Image OCR Processing
- Shared process_images_with_ocr function for all document extractors (DOCX, PPTX, Jupyter, Markdown).
Jupyter Notebook Image Extraction
- Base64 image decoding from notebook cell outputs with OCR support.
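
As a rough sketch of what decoding images from notebook cell outputs involves, the Python example below walks an nbformat v4 notebook and decodes base64 image payloads. The function name `extract_notebook_images` is hypothetical and this is not the project's code; the decoded bytes would then be handed to whatever OCR backend is configured.

```python
import base64
import json
from typing import List, Tuple


def extract_notebook_images(notebook_json: str) -> List[Tuple[str, bytes]]:
    """Return (mime_type, raw_bytes) for each base64 image in cell outputs."""
    notebook = json.loads(notebook_json)
    images: List[Tuple[str, bytes]] = []
    for cell in notebook.get("cells", []):
        # Markdown cells have no "outputs" key, so default to an empty list.
        for output in cell.get("outputs", []):
            for mime, payload in output.get("data", {}).items():
                # SVG output is stored as text, not base64, so skip it here.
                if not mime.startswith("image/") or mime == "image/svg+xml":
                    continue
                # nbformat allows multi-line strings to be stored as lists.
                if isinstance(payload, list):
                    payload = "".join(payload)
                images.append((mime, base64.b64decode(payload)))
    return images
```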
Markdown Data URI Image Extraction
- Data URI image decoding with OCR support for embedded images.
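
The following Python sketch shows one way embedded data URI images could be pulled out of Markdown before OCR. The regular expression and the name `extract_data_uri_images` are assumptions made for this example; it only handles base64-encoded `image/*` data URIs in standard `![alt](...)` syntax and is not the project's parser.

```python
import base64
import re
from typing import List, Tuple

# Matches ![alt](data:image/<subtype>;base64,<payload>) in Markdown text.
DATA_URI_IMAGE = re.compile(
    r"!\[[^\]]*\]\(data:(image/[A-Za-z0-9.+-]+);base64,([A-Za-z0-9+/=]+)\)"
)


def extract_data_uri_images(markdown: str) -> List[Tuple[str, bytes]]:
    """Return (mime_type, raw_bytes) for each base64 data URI image."""
    return [
        (match.group(1), base64.b64decode(match.group(2)))
        for match in DATA_URI_IMAGE.finditer(markdown)
    ]
```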
DOCX Full Extraction Pipeline (#387)
- DocumentStructure generation, pages field population, OCR on embedded images.
- Typed metadata fields, style-based heading detection, markdown formatting.
- Performance optimizations: eliminated 3x code duplication, removed unnecessary clones.
Fixed
- LaTeX zero-arg command handling preventing silent text loss.
- Structured data is_text_field false positives from substring matching.
- PaddleOCR CrnnNet recognition height fixed from 48 to 32 pixels.
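
To show why substring matching can produce false positives in a check like is_text_field, here is a small Python sketch. The keyword set and both function names are hypothetical, and token matching is only one possible remedy; the actual fix may differ (see CHANGELOG.md).

```python
import re

# Hypothetical keyword set; the real list is not given in these notes.
TEXT_KEYWORDS = {"title", "description", "text", "name", "content"}


def is_text_field_substring(key: str) -> bool:
    """Substring check: 'name' inside 'filename' counts as a match."""
    lowered = key.lower()
    return any(keyword in lowered for keyword in TEXT_KEYWORDS)


def is_text_field_tokens(key: str) -> bool:
    """Token check: the key is split on non-alphanumerics and compared whole."""
    tokens = re.split(r"[^a-z0-9]+", key.lower())
    return any(token in TEXT_KEYWORDS for token in tokens)
```

With these definitions, a key such as `filename` matches under the substring check but not under the token check, while `user_name` matches under both.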
See CHANGELOG.md for full details.