Added
- PHP bindings - New PHP extension with comprehensive FFI bindings
- PHP E2E test suite - Generated 65 comprehensive E2E tests from fixtures
  - Email extraction tests
  - HTML processing tests
  - Image extraction tests
  - OCR functionality tests (5 scenarios)
  - Office document tests (16 formats)
  - PDF extraction tests (16 scenarios)
  - Plugin API tests (14 API functions)
  - Smoke tests (7 formats)
  - Structured data tests (JSON/YAML)
  - XML extraction tests
- Root composer.json - Added composer.json at repository root for Packagist publishing
- HTML metadata extraction - Rich structured metadata from HTML documents
  - Headers extraction with hierarchy (level, text, id, depth, html_offset)
  - Links extraction with type classification (anchor, internal, external, email, phone)
  - Images extraction with dimensions and type detection (data-uri, inline-svg, external, relative)
  - Structured data extraction (JSON-LD, Microdata, RDFa)
  - New fields: language, text_direction, meta_tags
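
The link and image categories listed above can be illustrated with a minimal sketch. This is not the library's implementation: the function names `classify_link` and `classify_image`, their parameters, and the exact rules are assumptions made for illustration, using only the category names from this entry.

```python
from urllib.parse import urlparse


def classify_link(href: str, page_host: str) -> str:
    """Map an href to one of: anchor, internal, external, email, phone.

    Hypothetical helper; the rules below are assumptions, not the
    extractor's actual logic.
    """
    href = href.strip()
    lowered = href.lower()
    if lowered.startswith("mailto:"):
        return "email"
    if lowered.startswith("tel:"):
        return "phone"
    if href.startswith("#"):
        return "anchor"
    # hostname is None for relative URLs such as "/docs" or "page.html"
    host = urlparse(href).hostname or ""
    if not host or host == page_host.lower():
        return "internal"
    return "external"


def classify_image(tag: str, src: str = "") -> str:
    """Map an image element to one of: data-uri, inline-svg, external, relative.

    Hypothetical helper; `tag` is the element name ("img" or "svg").
    """
    if tag.lower() == "svg":
        return "inline-svg"
    src = src.strip()
    if src.lower().startswith("data:"):
        return "data-uri"
    # Absolute http(s) URLs and protocol-relative URLs point off-document
    if urlparse(src).scheme in ("http", "https") or src.startswith("//"):
        return "external"
    return "relative"
```

For example, `classify_link("/docs", "example.com")` yields `internal`, and `classify_image("img", "img/logo.png")` yields `relative`.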
Fixed
- C# target framework - Changed from net10.0 (preview) to net8.0 LTS
  - .NET 10 preview caused NuGet restore hangs
  - .NET 8 is the latest stable LTS release
- Homebrew check timeout - Added timeouts to prevent 55+ minute hangs
  - Job timeout: 5 minutes
  - Step timeout: 3 minutes
  - Command timeout: 120 seconds
- Documentation - Standardized all README badges and removed AI-generated content
  - Consistent blue badge colors across all language bindings
  - Added Packagist badge to PHP README
  - Removed emojis and marketing language
  - Converted all relative links to absolute GitHub URLs
- Ruby vendor script - Added missing workspace dependency inlining for lzma-rust2 and parking_lot
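
The innermost layer of the Homebrew check timeouts listed above (a 120-second limit per command) can be sketched as follows. The job and step limits are CI settings and are not shown. This sketch assumes the command is launched from Python; the actual check may enforce the limit differently, and `run_with_timeout` is a hypothetical helper.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds=120):
    """Run cmd and return its exit code, or None if it exceeded the limit.

    Hypothetical helper; 120 seconds mirrors the command timeout in the
    changelog entry.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising
        return None
    return completed.returncode
```

A caller can treat `None` as a failed check and stop immediately instead of waiting for the step or job limit to expire.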
Changed
- Version sync - Updated scripts/sync_versions.py to include root composer.json
- BREAKING: HTML metadata structure - Replaced YAML frontmatter parsing with single-pass metadata extraction
  - keywords: Changed from Option<String> (comma-separated) to Vec<String> (array)
  - canonical: Renamed to canonical_url for clarity
  - Open Graph fields: Consolidated og_* fields into open_graph: BTreeMap<String, String>
  - Twitter Card fields: Consolidated twitter_* fields into twitter_card: BTreeMap<String, String>
  - New structured types: headers: Vec, links: Vec, images: Vec, structured_data: Vec
  - Migration guide: See docs/migration/v4.0-html-metadata.md for upgrade instructions
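
As a rough illustration of the breaking change above, the sketch below converts a metadata mapping in the old shape to the new one. It assumes the metadata is available as a plain dictionary. The old key names (for example `og_title`) are inferred from the `og_*` and `twitter_*` patterns, and stripping the prefix for the new map keys is an assumption. `migrate_html_metadata` is a hypothetical helper; the migration guide remains the authoritative reference.

```python
def migrate_html_metadata(old: dict) -> dict:
    """Convert an old-style HTML metadata mapping to the new layout.

    Illustrative only: key names inside open_graph and twitter_card are
    assumed, not taken from the library.
    """
    new = {}
    open_graph = {}
    twitter_card = {}
    for key, value in old.items():
        if key.startswith("og_"):
            open_graph[key[len("og_"):]] = value
        elif key.startswith("twitter_"):
            twitter_card[key[len("twitter_"):]] = value
        elif key == "canonical":
            new["canonical_url"] = value  # renamed field
        elif key == "keywords":
            continue  # converted to a list below
        else:
            new[key] = value

    # keywords: comma-separated string (or missing) becomes a list
    raw_keywords = old.get("keywords") or ""
    new["keywords"] = [k.strip() for k in raw_keywords.split(",") if k.strip()]

    # Sort keys to mirror the ordered iteration of a BTreeMap
    new["open_graph"] = dict(sorted(open_graph.items()))
    new["twitter_card"] = dict(sorted(twitter_card.items()))
    return new
```

Code that previously read `metadata["og_title"]` would, under these assumptions, read `metadata["open_graph"]["title"]` after the upgrade.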