github kreuzberg-dev/kreuzberg v4.3.3

What's New in v4.3.3

PaddleOCR Multi-Language Support (#388)

  • 106+ language support via 12 script families: PaddleOCR recognition models now cover the English, Chinese (Simplified, Traditional, and Japanese), Latin, Korean, East Slavic (Cyrillic), Thai, Greek, Arabic, Devanagari, Tamil, Telugu, and Kannada script families.
  • Per-family recognition model architecture: Shared detection/classification models with per-family recognition models and dictionaries, downloaded on demand from HuggingFace.
  • Engine pool for concurrent multi-language OCR: Replaced single-engine architecture with a per-family engine pool, enabling concurrent OCR across different languages.
  • Backend-agnostic --ocr-language CLI flag: Works with all OCR backends (tesseract, paddle-ocr, easyocr).
  • SHA256 checksum verification: All model downloads verified against embedded checksums.
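
The per-family engine pool described above can be pictured with a minimal Python sketch. It is not kreuzberg's implementation: the `EnginePool` class, the `LANGUAGE_TO_FAMILY` table, and the factory callback are all hypothetical names chosen for illustration. The idea shown is that languages sharing a script family reuse one lazily created recognition engine, while different families get separate engines.

```python
import threading
from typing import Callable, Dict

# Hypothetical language-code -> script-family table; the real mapping
# used by the release is not given in these notes.
LANGUAGE_TO_FAMILY = {
    "en": "english",
    "de": "latin",
    "fr": "latin",
    "ru": "east_slavic",
    "ko": "korean",
    "hi": "devanagari",
}


class EnginePool:
    """Lazily creates and caches one recognition engine per script family."""

    def __init__(self, factory: Callable[[str], object]) -> None:
        self._factory = factory            # builds an engine for a family
        self._engines: Dict[str, object] = {}
        self._lock = threading.Lock()      # guards the cache across threads

    def get(self, language: str) -> object:
        # Raises KeyError for languages missing from the table.
        family = LANGUAGE_TO_FAMILY[language]
        with self._lock:
            # For simplicity the lock is held while the engine is built.
            if family not in self._engines:
                self._engines[family] = self._factory(family)
            return self._engines[family]


created = []  # records which families actually triggered engine creation


def fake_factory(family: str) -> dict:
    # Stand-in for loading a recognition model and its dictionary.
    created.append(family)
    return {"family": family}


pool = EnginePool(fake_factory)
german = pool.get("de")
french = pool.get("fr")    # same family as German, so the engine is reused
russian = pool.get("ru")   # different family, so a second engine is built
```

Because each family has its own engine object, requests for different languages do not have to queue behind a single shared engine once the engines exist.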

Centralized Image OCR Processing

  • Shared process_images_with_ocr function for all document extractors (DOCX, PPTX, Jupyter, Markdown).

Jupyter Notebook Image Extraction

  • Base64 image decoding from notebook cell outputs with OCR support.
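
As a rough illustration of what decoding images from notebook outputs involves, the Python sketch below walks an nbformat v4 notebook and decodes base64 `image/png` and `image/jpeg` payloads from each output's `data` bundle. The function name `extract_notebook_images` is hypothetical and the OCR step is omitted; this is not the extractor shipped in the release.

```python
import base64
import json
from typing import List, Tuple

# Raster MIME types that notebooks store as base64 text.
BASE64_IMAGE_TYPES = {"image/png", "image/jpeg"}


def extract_notebook_images(notebook_json: str) -> List[Tuple[str, bytes]]:
    """Return (mime_type, raw_bytes) for every base64 image in cell outputs."""
    notebook = json.loads(notebook_json)
    images = []
    for cell in notebook.get("cells", []):
        # Only code cells carry outputs; other cells yield an empty list.
        for output in cell.get("outputs", []):
            for mime, payload in output.get("data", {}).items():
                if mime not in BASE64_IMAGE_TYPES:
                    continue
                # Multi-line strings may be stored as a list of lines.
                if isinstance(payload, list):
                    payload = "".join(payload)
                images.append((mime, base64.b64decode(payload)))
    return images


# Build a tiny notebook with one image output and one plain-text output.
PNG_BYTES = b"\x89PNG\r\n\x1a\n" + b"demo"
sample = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "outputs": [
                {
                    "output_type": "display_data",
                    "data": {
                        "image/png": base64.b64encode(PNG_BYTES).decode("ascii"),
                        "text/plain": ["<Figure>"],
                    },
                }
            ],
        },
    ],
})

images = extract_notebook_images(sample)
```

The decoded bytes would then be handed to whichever OCR backend is configured.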

Markdown Data URI Image Extraction

  • Data URI image decoding with OCR support for embedded images.
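
The sketch below shows one way embedded data URI images could be pulled out of Markdown in Python. The regular expression and the `extract_data_uri_images` name are assumptions made for this example, it handles only base64-encoded `image/*` URIs in standard `![alt](...)` syntax, and it does not reflect the parser actually used in the release.

```python
import base64
import re
from typing import List, Tuple

# Matches ![alt](data:image/<subtype>;base64,<payload>) and captures
# the MIME type and the base64 payload.
DATA_URI_IMAGE = re.compile(
    r"!\[[^\]]*\]\(data:(image/[\w.+-]+);base64,([A-Za-z0-9+/=]+)\)"
)


def extract_data_uri_images(markdown: str) -> List[Tuple[str, bytes]]:
    """Return (mime_type, raw_bytes) for each base64 data URI image."""
    return [
        (match.group(1), base64.b64decode(match.group(2)))
        for match in DATA_URI_IMAGE.finditer(markdown)
    ]


payload = base64.b64encode(b"GIF89a-demo").decode("ascii")
markdown = (
    f"Intro ![logo](data:image/gif;base64,{payload}) "
    "and ![remote](https://example.com/a.png)"
)

# Only the embedded image is returned; the remote URL is ignored.
result = extract_data_uri_images(markdown)
```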

DOCX Full Extraction Pipeline (#387)

  • DocumentStructure generation, pages field population, OCR on embedded images.
  • Typed metadata fields, style-based heading detection, markdown formatting.
  • Performance optimizations: eliminated 3x code duplication, removed unnecessary clones.
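
Style-based heading detection can be sketched as a mapping from paragraph style names to Markdown heading levels. The Python example below assumes paragraphs arrive as hypothetical `(style_name, text)` pairs and that headings use Word's conventional "Title" and "Heading N" style names; the rules kreuzberg actually applies are not described in these notes.

```python
import re
from typing import Iterable, Optional, Tuple

# Accepts both display names ("Heading 2") and style IDs ("Heading2").
HEADING_STYLE = re.compile(r"^heading\s*(\d)$", re.IGNORECASE)


def heading_level(style_name: str) -> Optional[int]:
    """Map a paragraph style name to a Markdown heading level, if any."""
    name = style_name.strip()
    if name.lower() == "title":
        return 1
    match = HEADING_STYLE.match(name)
    if match:
        level = int(match.group(1))
        # Markdown supports levels 1-6; anything else is treated as body text.
        return level if 1 <= level <= 6 else None
    return None


def to_markdown(paragraphs: Iterable[Tuple[str, str]]) -> str:
    """Render (style_name, text) pairs as Markdown paragraphs and headings."""
    lines = []
    for style, text in paragraphs:
        level = heading_level(style)
        lines.append(f"{'#' * level} {text}" if level else text)
    return "\n\n".join(lines)


document = [
    ("Title", "Report"),
    ("Heading 2", "Scope"),
    ("Normal", "Body text."),
    ("heading3", "Details"),
]
markdown_output = to_markdown(document)
```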

Fixed

  • LaTeX zero-arg command handling that caused silent text loss.
  • Structured data is_text_field false positives from substring matching.
  • PaddleOCR CrnnNet recognition height fixed from 48 to 32 pixels.
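
To illustrate the kind of false positive that substring matching produces when classifying field names, here is a small Python comparison. The keyword set and both function names are hypothetical, and the token-based check is one possible remedy rather than the fix applied in kreuzberg.

```python
import re

# Hypothetical keywords that mark a field as carrying human-readable text.
TEXT_KEYWORDS = {"text", "title", "name", "description"}


def is_text_field_substring(key: str) -> bool:
    """Naive check: any keyword appearing anywhere inside the key."""
    lowered = key.lower()
    return any(keyword in lowered for keyword in TEXT_KEYWORDS)


def is_text_field_tokens(key: str) -> bool:
    """Stricter check: a keyword must match a whole token of the key."""
    tokens = re.split(r"[^a-z0-9]+", key.lower())
    return any(token in TEXT_KEYWORDS for token in tokens)


# "context_id" contains the substring "text" but is not a text field.
naive = is_text_field_substring("context_id")
strict = is_text_field_tokens("context_id")
genuine = is_text_field_tokens("display_name")
```

Here the naive check flags `context_id` while the token-based check does not, and a genuine field such as `display_name` is still recognized.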

See CHANGELOG.md for full details.
