🐛 Bug Fixes
Critical Regression Fixes
- Fixed PDF extraction failures that caused "ExceptionGroup: unhandled errors in a TaskGroup" errors
- Fixed XLS file extraction failures with "File not found 'xl/_rels/workbook.xml.rels'" errors
- Fixed Tesseract OCR configuration to handle both enum and integer PSM values
Test Suite Improvements
- Fixed DataFrame API compatibility: converted tests to use Polars instead of Pandas for consistency
- Fixed config file loading for arbitrary TOML files with [tool.kreuzberg] sections
- Fixed API config caching with nested dictionaries
🔧 Technical Details
Root Cause Analysis
- XLS Error: SpreadSheetExtractor was hardcoding the .xlsx extension for all spreadsheet files, causing python-calamine to fail when parsing .xls files
- Tesseract PSM Error: Code expected PSM as an enum with a .value attribute, but configuration provided integers
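
The two root causes above can be sketched roughly as follows. This is not the library's code: `PSMMode`, `psm_to_int`, `MIME_TO_EXTENSION`, and `extension_for` are hypothetical names, the MIME table is an illustrative subset, and the `.xlsx` fallback is an assumption that simply mirrors the earlier hardcoded behaviour.

```python
from __future__ import annotations

from enum import Enum


class PSMMode(Enum):
    """Hypothetical stand-in for a Tesseract page segmentation mode enum."""

    AUTO = 3
    SINGLE_BLOCK = 6


def psm_to_int(psm: PSMMode | int) -> int:
    """Accept either an enum member or a plain integer PSM value."""
    if isinstance(psm, Enum):
        # Enum members carry the numeric mode in .value.
        return int(psm.value)
    # Integers (for example from a config file) have no .value attribute.
    return int(psm)


# Illustrative subset of spreadsheet MIME types (assumption, not exhaustive).
MIME_TO_EXTENSION = {
    "application/vnd.ms-excel": ".xls",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
    "application/vnd.oasis.opendocument.spreadsheet": ".ods",
}


def extension_for(mime_type: str) -> str:
    """Pick the file extension from the MIME type instead of assuming .xlsx."""
    return MIME_TO_EXTENSION.get(mime_type, ".xlsx")
```

With a mapping like this, a legacy Excel file is handed to the parser with an `.xls` suffix, so the parser does not look for the `xl/_rels/workbook.xml.rels` entry that exists only in `.xlsx` archives.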
Changes Made
- Added MIME type to file extension mapping in SpreadSheetExtractor
- Updated Tesseract OCR to handle both enum and integer PSM values
- Ensured consistent use of Polars DataFrames throughout the codebase (except GMFT, which uses Pandas internally)
- Fixed configuration loading for non-standard TOML file names
- Added hashable conversion for nested config dictionaries in API caching
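
One way to picture the hashable conversion for nested config dictionaries is the minimal sketch below. The helper name `to_hashable` is hypothetical, and the code assumes dictionary keys are mutually comparable (for example, all strings); the actual implementation may differ.

```python
from typing import Any, Hashable


def to_hashable(value: Any) -> Hashable:
    """Recursively convert dicts, lists, and sets into hashable equivalents.

    Hypothetical helper: the result can be used as a cache key, which a
    plain nested dict cannot, because dicts and lists are unhashable.
    """
    if isinstance(value, dict):
        # Sort by key so that insertion order does not change the cache key.
        return tuple(
            sorted((key, to_hashable(item)) for key, item in value.items())
        )
    if isinstance(value, (list, tuple)):
        return tuple(to_hashable(item) for item in value)
    if isinstance(value, set):
        return frozenset(to_hashable(item) for item in value)
    # Scalars such as str, int, bool, and None are already hashable.
    return value


first = to_hashable({"ocr": {"languages": ["eng", "deu"]}, "force": True})
second = to_hashable({"force": True, "ocr": {"languages": ["eng", "deu"]}})
assert first == second and hash(first) == hash(second)
```

Because lists and tuples both become tuples, this sketch treats them as the same key; that is usually acceptable for configuration values but is a design choice worth noting.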
📝 Testing
- Added comprehensive regression tests using actual user data files
- Added API tests for Docker container configuration patterns
- All existing tests continue to pass
🔄 Compatibility
This release maintains full backwards compatibility while fixing critical regressions introduced after v3.12.
What's Changed
- chore(deps): bump actions/setup-python from 5 to 6 by @dependabot[bot] in #124
- fix: resolve regression in PDF extraction and XLS file handling by @Goldziher in #127
Full Changelog: v3.13.2...v3.13.3