github kreuzberg-dev/kreuzberg v2.0.0

11 months ago

Enhancements:

  • add sync methods (feature)
  • add detection of corrupt searchable text in PDFs, with automatic OCR fallback (feature)
  • add metadata extraction using Pandoc (feature)
  • add multi-sheet spreadsheet (Excel etc.) extraction (feature)
  • add language, psm and max_processes keyword arguments (enhancement; api)
  • restricted the typing-extensions dependency to Python 3.10 and below (enhancement; dependencies)
  • added compatibility with multiple event loops by switching from asyncio to anyio (enhancement; compatibility)
  • added managed worker processes for Pandoc and Tesseract using anyio.to_process (enhancement; performance)
  • replaced xlsx2csv with python-calamine and improved the implementation to extract all sheets in a workbook (enhancement; performance)
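
The OCR fallback listed above can be pictured roughly as follows. This is a minimal sketch and not kreuzberg's implementation: the names `looks_corrupt`, `extract_with_fallback`, `read_text_layer` and `run_ocr`, the character heuristic, and the 5% threshold are all assumptions made for illustration.

```python
import re
from typing import Callable

# Hypothetical heuristic: control characters and U+FFFD replacement
# characters are treated as signs of a broken text layer.
_SUSPICIOUS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffd]")


def looks_corrupt(text: str, max_ratio: float = 0.05) -> bool:
    """Return True when the text is empty or has too many suspicious characters."""
    if not text.strip():
        return True
    return len(_SUSPICIOUS.findall(text)) / len(text) > max_ratio


def extract_with_fallback(
    read_text_layer: Callable[[], str],
    run_ocr: Callable[[], str],
    *,
    force_ocr: bool = False,
) -> str:
    """Prefer the embedded text layer; use OCR when forced or when the layer looks corrupt."""
    if not force_ocr:
        text = read_text_layer()
        if not looks_corrupt(text):
            return text
    return run_ocr()
```

Passing the two extraction steps as callables keeps the sketch independent of any PDF or OCR library; in a real pipeline they would wrap the actual text-layer reader and the OCR call.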

Breaking Changes:

  • updated ExtractionResult to include metadata (breaking change; api)
  • changed force_ocr to a kwarg (breaking change; api)
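
To see why both items can break existing callers, here is a hedged illustration. It assumes the result is a tuple-like type and that "kwarg" means keyword-only; the `ExtractionResult` below is a stand-in with assumed field names, and `extract_text` is a hypothetical toy function, not the library's API.

```python
from typing import Any, NamedTuple


class ExtractionResult(NamedTuple):
    """Hypothetical stand-in for the result type; field names are assumptions."""

    content: str
    mime_type: str
    metadata: dict[str, Any]  # stands in for the newly added field


def extract_text(data: bytes, mime_type: str, *, force_ocr: bool = False) -> ExtractionResult:
    """Toy extractor: the bare `*` makes force_ocr keyword-only."""
    content = data.decode("utf-8", errors="replace")
    return ExtractionResult(content, mime_type, {"forced_ocr": force_ocr})


result = extract_text(b"hello", "text/plain", force_ocr=True)

# Attribute access keeps working when a field is added.
print(result.content, result.metadata)

# Two-name tuple unpacking no longer matches a three-field result.
try:
    content, mime = result
except ValueError as exc:
    print("unpacking failed:", exc)

# Passing force_ocr positionally is rejected by a keyword-only signature.
try:
    extract_text(b"hello", "text/plain", True)
except TypeError as exc:
    print("positional call failed:", exc)
```

Under these assumptions, callers that access fields by name and pass `force_ocr=...` explicitly are unaffected, while code that unpacks the result into two names or passes the flag positionally needs updating.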

Internal:

  • split the _extractors namespace into smaller packages and reorganized source code
  • add matrix tests against all supported Python versions (internal; testing)
  • refined ruff rules to enhance linting strictness
  • increase test coverage to >=99%
