github kreuzberg-dev/kreuzberg v3.9.0

5 months ago

New Features

Automatic Document Type Detection (#88)

Kreuzberg can now classify documents automatically during extraction, identifying the document type (contracts, forms, invoices, receipts, reports).

Key features:

  • Text-based and vision-based classification modes
  • Multi-language support via Google Translate integration
  • Configurable confidence thresholds
  • New configuration options in ExtractionConfig
  • New result fields in ExtractionResult

Installation:

pip install "kreuzberg[auto-classify-document-type]"
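
To make the idea of text-based classification with a configurable confidence threshold concrete, here is a minimal standalone sketch. It is not Kreuzberg's implementation or API: the keyword lists, the `Classification` type, and the `classify_text` function are all hypothetical, and these notes do not name the actual `ExtractionConfig` options or `ExtractionResult` fields, so consult the classification guide for those.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical keyword lists for illustration only; not taken from Kreuzberg.
KEYWORDS: dict[str, tuple[str, ...]] = {
    "invoice": ("invoice", "amount due", "bill to"),
    "receipt": ("receipt", "cashier", "change"),
    "contract": ("agreement", "party", "hereby"),
}


@dataclass(frozen=True)
class Classification:
    """Hypothetical result: a label (or None) plus a confidence in [0, 1]."""

    document_type: str | None
    confidence: float


def classify_text(text: str, threshold: float = 0.5) -> Classification:
    """Score each label by keyword hits and apply a confidence threshold."""
    lowered = text.lower()
    hits = {
        label: sum(lowered.count(keyword) for keyword in keywords)
        for label, keywords in KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        # No keyword matched at all, so there is nothing to report.
        return Classification(None, 0.0)

    best = max(hits, key=hits.get)
    confidence = hits[best] / total  # share of all hits won by the best label
    if confidence < threshold:
        # Below the threshold: keep the score but withhold the label.
        return Classification(None, confidence)
    return Classification(best, confidence)


result = classify_text("Invoice #12: amount due 40 EUR, bill to ACME")
print(result.document_type, result.confidence)
```

Raising `threshold` trades coverage for precision: ambiguous documents come back with `document_type=None` rather than a weak guess.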

DeepSource Integration

Added .deepsource.toml configuration for automated code quality analysis.

Bug Fixes

  • Fixed PDF extraction when no OCR backend is available
  • Updated entity extraction test to use frozenset of tuples for proper comparison
  • Fixed config handling for dataclasses with slots=True
  • Resolved coverage configuration and cleanup issues in test runs

Improvements

  • Enhanced CI/CD pipeline with retry logic for flaky steps across all platforms
  • Improved test coverage gathering and cleanup procedures
  • Updated dependencies in uv.lock

Documentation

  • Added comprehensive guide for document classification feature
  • Updated all relevant documentation sections to include the new feature
  • Enhanced API reference with new configuration options

Dependencies

  • New optional dependency group: auto-classify-document-type
    • deep-translator: For multi-language support
    • pandas: For data processing in classification
