github kreuzberg-dev/kreuzberg v3.13.3

3 months ago

🐛 Bug Fixes

Critical Regression Fixes

  • Fixed PDF extraction failures that raised "ExceptionGroup: unhandled errors in a TaskGroup"
  • Fixed XLS file extraction failures that raised "File not found 'xl/_rels/workbook.xml.rels'"
  • Fixed Tesseract OCR configuration to handle both enum and integer PSM values
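The PSM fix can be pictured as a small normalization step that accepts either form. The sketch below is illustrative only: `PSMMode` and `psm_to_int` are hypothetical names, not Kreuzberg's actual API, and the enum lists only a few of Tesseract's page segmentation modes (valid values are 0 through 13).

```python
from enum import Enum
from typing import Union


class PSMMode(Enum):
    """Hypothetical stand-in for the library's PSM enum (subset of modes)."""

    AUTO = 3          # fully automatic page segmentation, no OSD
    SINGLE_BLOCK = 6  # assume a single uniform block of text
    SINGLE_LINE = 7   # treat the image as a single text line


def psm_to_int(psm: Union[PSMMode, int]) -> int:
    """Return the integer PSM value whether an enum member or an int is given."""
    if isinstance(psm, PSMMode):
        return psm.value
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(psm, int) and not isinstance(psm, bool):
        if 0 <= psm <= 13:
            return psm
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    raise TypeError(f"Unsupported PSM type: {type(psm).__name__}")
```

Normalizing once at the configuration boundary means downstream code never needs to touch `.value` and works the same for both input styles.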

Test Suite Improvements

  • Fixed DataFrame API compatibility: converted tests to use Polars instead of Pandas for consistency
  • Fixed config file loading for arbitrary TOML files with [tool.kreuzberg] sections
  • Fixed API config caching with nested dictionaries
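As a rough illustration of reading a [tool.kreuzberg] section from a TOML file with any name, the sketch below uses the standard-library `tomllib` (Python 3.11+, with a fallback to the `tomli` backport). The function names and the `force_ocr` key in the usage note are assumptions for illustration, not Kreuzberg's actual loader.

```python
from pathlib import Path
from typing import Any

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:  # older interpreters: use the tomli backport
    import tomli as tomllib


def tool_section_from_text(text: str) -> dict[str, Any]:
    """Parse TOML text and return the [tool.kreuzberg] table (empty if absent)."""
    data = tomllib.loads(text)
    return data.get("tool", {}).get("kreuzberg", {})


def tool_section_from_file(path: Path) -> dict[str, Any]:
    """Load the section from any TOML file; the file name is not inspected."""
    return tool_section_from_text(path.read_text(encoding="utf-8"))
```

For example, `tool_section_from_text('[tool.kreuzberg]\nforce_ocr = true\n')` returns `{"force_ocr": True}`, and a file without the section yields an empty dict rather than an error.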

🔧 Technical Details

Root Cause Analysis

  1. XLS Error: SpreadSheetExtractor hardcoded the .xlsx extension for all spreadsheet files, which caused python-calamine to fail when parsing .xls files
  2. Tesseract PSM Error: The code expected PSM to be an enum with a .value attribute, but the configuration supplied plain integers
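To make the XLS root cause concrete, here is a minimal sketch of choosing a file extension from the MIME type instead of hardcoding .xlsx. The table and the `extension_for_mime` helper are hypothetical and cover only a few common spreadsheet types; the mapping actually used in SpreadSheetExtractor may differ.

```python
# Hypothetical MIME type -> extension table (keys stored in lowercase).
SPREADSHEET_EXTENSIONS = {
    "application/vnd.ms-excel": ".xls",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
    "application/vnd.ms-excel.sheet.macroenabled.12": ".xlsm",
    "application/vnd.ms-excel.sheet.binary.macroenabled.12": ".xlsb",
    "application/vnd.oasis.opendocument.spreadsheet": ".ods",
}


def extension_for_mime(mime_type: str) -> str:
    """Return the extension for a spreadsheet MIME type.

    Raises ValueError for unknown types instead of silently assuming .xlsx,
    which is the kind of default that hid the original bug.
    """
    # Drop any parameters such as "; charset=..." and normalize case.
    normalized = mime_type.split(";", 1)[0].strip().lower()
    try:
        return SPREADSHEET_EXTENSIONS[normalized]
    except KeyError:
        raise ValueError(f"Unsupported spreadsheet MIME type: {mime_type!r}") from None
```

With a mapping like this, a temporary file written for a legacy workbook ends in .xls, so a parser that selects its reader by extension does not look for .xlsx internals such as xl/_rels/workbook.xml.rels.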

Changes Made

  • Added MIME type to file extension mapping in SpreadSheetExtractor
  • Updated Tesseract OCR to handle both enum and integer PSM values
  • Ensured consistent use of Polars DataFrames throughout codebase (except GMFT which uses Pandas internally)
  • Fixed configuration loading for non-standard TOML file names
  • Added hashable conversion for nested config dictionaries in API caching
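The hashable conversion in the last item can be sketched as a recursive function that turns nested dicts and lists into tuples, so a configuration can serve as a cache key. This is an assumption about the general technique, not the project's actual code; `to_hashable` is a hypothetical name, and dict keys are assumed to be mutually comparable (for example, all strings).

```python
from typing import Any, Hashable


def to_hashable(value: Any) -> Hashable:
    """Recursively convert dicts, lists, and sets into hashable equivalents."""
    if isinstance(value, dict):
        # Sort by key so two dicts with the same content but different
        # insertion order produce the same cache key.
        return tuple(
            (key, to_hashable(item))
            for key, item in sorted(value.items(), key=lambda kv: kv[0])
        )
    if isinstance(value, (list, tuple)):
        return tuple(to_hashable(item) for item in value)
    if isinstance(value, (set, frozenset)):
        return frozenset(to_hashable(item) for item in value)
    return value  # assumed already hashable (str, int, bool, None, ...)
```

The result can be used directly as a dictionary key or passed to a function wrapped in `functools.lru_cache`, both of which reject raw nested dicts with `TypeError: unhashable type: 'dict'`.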

📝 Testing

  • Added comprehensive regression tests using actual user data files
  • Added API tests for Docker container configuration patterns
  • All existing tests continue to pass

🔄 Compatibility

This release maintains full backwards compatibility while fixing critical regressions introduced after v3.12.

What's Changed

  • chore(deps): bump actions/setup-python from 5 to 6 by @dependabot[bot] in #124
  • fix: resolve regression in PDF extraction and XLS file handling by @Goldziher in #127

Full Changelog: v3.13.2...v3.13.3
