New Features
Automatic Document Type Detection (#88)
Kreuzberg now includes automatic document classification capabilities, allowing you to identify document types (contracts, forms, invoices, receipts, reports) during extraction.
Key features:
- Text-based and vision-based classification modes
- Multi-language support via Google Translate integration
- Configurable confidence thresholds
- New configuration options in ExtractionConfig
- New result fields in ExtractionResult
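To make the idea of text-based classification with a configurable confidence threshold concrete, here is a minimal standalone sketch. It does not use Kreuzberg and is not Kreuzberg's implementation: the keyword lists, the `Classification` type, and the `classify_text` function are all hypothetical names chosen for illustration, and the scoring rule (share of keyword hits) is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical keyword lists, for illustration only; not taken from Kreuzberg.
KEYWORDS: dict[str, tuple[str, ...]] = {
    "invoice": ("invoice", "amount due", "bill to"),
    "receipt": ("receipt", "cash", "change"),
    "contract": ("agreement", "party", "hereby"),
}


@dataclass(frozen=True)
class Classification:
    document_type: str | None  # None when no type clears the threshold
    confidence: float


def classify_text(text: str, threshold: float = 0.5) -> Classification:
    """Score each type by its share of all keyword hits, then apply the threshold."""
    lowered = text.lower()
    hits = {
        doc_type: sum(lowered.count(word) for word in words)
        for doc_type, words in KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        # No keyword matched at all, so there is nothing to classify on.
        return Classification(None, 0.0)
    best = max(hits, key=hits.get)
    confidence = hits[best] / total
    if confidence < threshold:
        # Below the threshold the type is withheld, but the score is still reported.
        return Classification(None, confidence)
    return Classification(best, confidence)
```

Raising `threshold` trades coverage for precision: ambiguous documents come back with `document_type=None` instead of a low-confidence guess.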
Installation:
pip install "kreuzberg[auto-classify-document-type]"
DeepSource Integration
Added .deepsource.toml configuration for automated code quality analysis.
Bug Fixes
- Fixed PDF extraction when no OCR backend is available
- Updated entity extraction test to use frozenset of tuples for proper comparison
- Fixed config handling for dataclasses with slots=True
- Resolved coverage configuration and cleanup issues in test runs
Improvements
- Enhanced CI/CD pipeline with retry logic for flaky steps across all platforms
- Improved test coverage gathering and cleanup procedures
- Updated dependencies in uv.lock
Documentation
- Added comprehensive guide for document classification feature
- Updated all relevant documentation sections to include the new feature
- Enhanced API reference with new configuration options
Dependencies
- New optional dependency group: auto-classify-document-type
  - deep-translator: For multi-language support
  - pandas: For data processing in classification