github kreuzberg-dev/kreuzberg v4.0.0-rc.22

pre-release, 7 hours ago

Added

  • PHP bindings - New PHP extension with comprehensive FFI bindings
    • PHP E2E test suite - Generated 65 comprehensive E2E tests from fixtures
      • Email extraction tests
      • HTML processing tests
      • Image extraction tests
      • OCR functionality tests (5 scenarios)
      • Office document tests (16 formats)
      • PDF extraction tests (16 scenarios)
      • Plugin API tests (14 API functions)
      • Smoke tests (7 formats)
      • Structured data tests (JSON/YAML)
      • XML extraction tests
  • Root composer.json - Added composer.json at repository root for Packagist publishing
  • HTML metadata extraction - Rich structured metadata from HTML documents
    • Headers extraction with hierarchy (level, text, id, depth, html_offset)
    • Links extraction with type classification (anchor, internal, external, email, phone)
    • Images extraction with dimensions and type detection (data-uri, inline-svg, external, relative)
    • Structured data extraction (JSON-LD, Microdata, RDFa)
    • New fields: language, text_direction, meta_tags
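The link categories listed above (anchor, internal, external, email, phone) can be illustrated with a small classifier. This is only a sketch of the idea, not kreuzberg's actual implementation: the names `LinkType`, `classify_link`, and `page_host` are hypothetical, and the rules (prefix checks plus a host comparison) are an assumption about how such a classification could work.

```rust
/// Link categories named in the release notes.
/// The enum itself is hypothetical, not kreuzberg's real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LinkType {
    Anchor,
    Internal,
    External,
    Email,
    Phone,
}

/// Hypothetical classifier: decides the category from the `href` alone.
/// `page_host` is the host of the document being parsed, e.g. "example.com".
/// Simplification: ports and userinfo in the URL are not handled.
fn classify_link(href: &str, page_host: &str) -> LinkType {
    let lower = href.trim().to_ascii_lowercase();

    if lower.starts_with('#') {
        // Same-page fragment such as "#install".
        LinkType::Anchor
    } else if lower.starts_with("mailto:") {
        LinkType::Email
    } else if lower.starts_with("tel:") {
        LinkType::Phone
    } else if let Some(rest) = lower
        .strip_prefix("https://")
        .or_else(|| lower.strip_prefix("http://"))
        .or_else(|| lower.strip_prefix("//"))
    {
        // The host is everything before the first '/', '?' or '#'.
        let host = rest
            .split(|c: char| c == '/' || c == '?' || c == '#')
            .next()
            .unwrap_or("");
        if host == page_host.to_ascii_lowercase() {
            LinkType::Internal
        } else {
            LinkType::External
        }
    } else {
        // Relative paths stay on the same site.
        LinkType::Internal
    }
}
```

Under these assumptions, `classify_link("/about", "example.com")` is treated as internal, while `classify_link("https://other.org/", "example.com")` is treated as external.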

Fixed

  • C# target framework - Changed from net10.0 (preview) to net8.0 LTS
    • .NET 10 preview caused NuGet restore hangs
    • .NET 8 is the latest stable LTS version with FFM API support
  • Homebrew check timeout - Added timeouts to prevent 55+ minute hangs
    • Job timeout: 5 minutes
    • Step timeout: 3 minutes
    • Command timeout: 120 seconds
  • Documentation - Standardized all README badges and removed AI-generated content
    • Consistent blue badge colors across all language bindings
    • Added Packagist badge to PHP README
    • Removed emojis and marketing language
    • Converted all relative links to absolute GitHub URLs
  • Ruby vendor script - Added missing workspace dependency inlining for lzma-rust2 and parking_lot

Changed

  • Version sync - Updated scripts/sync_versions.py to include root composer.json
  • BREAKING: HTML metadata structure - Replaced YAML frontmatter parsing with single-pass metadata extraction
    • keywords: Changed from Option<String> (comma-separated) to Vec<String> (array)
    • canonical: Renamed to canonical_url for clarity
    • Open Graph fields: Consolidated og_* fields into open_graph: BTreeMap<String, String>
    • Twitter Card fields: Consolidated twitter_* fields into twitter_card: BTreeMap<String, String>
    • New structured types: headers: Vec, links: Vec, images: Vec, structured_data: Vec
    • Migration guide: See docs/migration/v4.0-html-metadata.md for upgrade instructions
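The field changes above can be sketched as a conversion from the old shape to the new one. This is an illustration under stated assumptions, not a substitute for the migration guide: the struct names `OldHtmlMetadata` and `NewHtmlMetadata`, the specific `og_*` and `twitter_*` fields shown, and the map keys (the field name with its prefix removed) are all hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical pre-rc.22 shape, reduced to fields the notes mention.
/// The concrete og_* / twitter_* fields are assumed examples.
#[derive(Debug, Default)]
struct OldHtmlMetadata {
    keywords: Option<String>, // comma-separated
    canonical: Option<String>,
    og_title: Option<String>,
    og_description: Option<String>,
    twitter_title: Option<String>,
}

/// Hypothetical post-rc.22 shape, following the field names in the notes.
#[derive(Debug, Default)]
struct NewHtmlMetadata {
    keywords: Vec<String>,
    canonical_url: Option<String>,
    open_graph: BTreeMap<String, String>,
    twitter_card: BTreeMap<String, String>,
}

/// Converts the old shape to the new one.
/// Map keys drop the og_/twitter_ prefix; that naming is an assumption.
fn migrate(old: OldHtmlMetadata) -> NewHtmlMetadata {
    // Split the comma-separated string, trimming and dropping empty entries.
    let keywords: Vec<String> = old
        .keywords
        .as_deref()
        .unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|k| !k.is_empty())
        .map(String::from)
        .collect();

    let mut open_graph = BTreeMap::new();
    if let Some(v) = old.og_title {
        open_graph.insert("title".to_string(), v);
    }
    if let Some(v) = old.og_description {
        open_graph.insert("description".to_string(), v);
    }

    let mut twitter_card = BTreeMap::new();
    if let Some(v) = old.twitter_title {
        twitter_card.insert("title".to_string(), v);
    }

    NewHtmlMetadata {
        keywords,
        canonical_url: old.canonical, // renamed field
        open_graph,
        twitter_card,
    }
}
```

Callers that previously read a single `og_*` field would, under this sketch, look the value up in the map instead, for example `meta.open_graph.get("title")`.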
