Fixed
ODT List and Section Extraction
- Fixed ODT extractor not handling `text:list` and `text:section` elements. Documents containing bulleted/numbered lists or sections returned empty content.
UTF-16 EML Parsing
- Fixed EML files encoded in UTF-16 (LE/BE, with or without BOM) returning empty content. The parser now detects UTF-16 via BOM markers and heuristic byte-pattern analysis, then transcodes the input to UTF-8 before parsing.
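
The detection approach described above can be sketched roughly as follows. This is a minimal illustration using only the Rust standard library, not the actual Kreuzberg implementation: the names `Utf16Order`, `detect_utf16`, and `transcode_to_utf8` are hypothetical, and the 512-byte sample size and NUL-ratio thresholds are assumptions chosen for the example.

```rust
/// Byte order of a UTF-16 stream (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Utf16Order {
    Little,
    Big,
}

/// Guess whether `bytes` is UTF-16.
/// Returns the byte order and the number of BOM bytes to skip.
fn detect_utf16(bytes: &[u8]) -> Option<(Utf16Order, usize)> {
    // 1. An explicit BOM is decisive.
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return Some((Utf16Order::Little, 2));
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return Some((Utf16Order::Big, 2));
    }

    // 2. Heuristic for BOM-less input: mostly-ASCII text (such as email
    //    headers) encoded as UTF-16 has a NUL in every other byte.
    //    The sample size and thresholds below are assumptions.
    let sample = &bytes[..bytes.len().min(512)];
    let (mut pairs, mut even_nuls, mut odd_nuls) = (0usize, 0usize, 0usize);
    for pair in sample.chunks_exact(2) {
        pairs += 1;
        if pair[0] == 0 {
            even_nuls += 1;
        }
        if pair[1] == 0 {
            odd_nuls += 1;
        }
    }
    if pairs < 2 {
        return None;
    }
    // Little-endian: the high (second) byte of each unit is usually NUL.
    if odd_nuls * 2 > pairs && even_nuls * 10 < pairs {
        return Some((Utf16Order::Little, 0));
    }
    // Big-endian: the high (first) byte of each unit is usually NUL.
    if even_nuls * 2 > pairs && odd_nuls * 10 < pairs {
        return Some((Utf16Order::Big, 0));
    }
    None
}

/// Transcode UTF-16 input to a UTF-8 `String`.
/// Returns `None` when the input does not look like UTF-16.
fn transcode_to_utf8(bytes: &[u8]) -> Option<String> {
    let (order, skip) = detect_utf16(bytes)?;
    let units: Vec<u16> = bytes[skip..]
        .chunks_exact(2)
        .map(|p| match order {
            Utf16Order::Little => u16::from_le_bytes([p[0], p[1]]),
            Utf16Order::Big => u16::from_be_bytes([p[0], p[1]]),
        })
        .collect();
    // Invalid surrogates become U+FFFD instead of failing the whole message.
    Some(String::from_utf16_lossy(&units))
}
```

In a sketch like this, a caller would try `transcode_to_utf8` first and fall back to treating the raw bytes as UTF-8 or ASCII when it returns `None`. The NUL-byte heuristic only works for text dominated by ASCII-range characters, which is typically true of email headers.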
Email Attachment Metadata Serialization
- Fixed email extraction inserting a comma-joined string `"attachments"` into the `additional` metadata HashMap, which via `#[serde(flatten)]` overwrote the structured `EmailMetadata.attachments` array. This caused deserialization failures in Go, C#, and other typed bindings when processing emails with attachments.
WASM Office Document Support (DOCX, PPTX, ODT)
- Office documents now extract correctly in WASM builds.
WASM PDF Support in Non-Browser Runtimes
- PDFium auto-initializes in all WASM runtimes (Node.js, Bun, Deno).
Elixir PageBoundary JSON Serialization
- Added missing `@derive Jason.Encoder` to `PageBoundary`, `PageInfo`, and `PageStructure` structs.
Pre-built CLI Binary Missing MCP Command
- The build script now enables all features for standalone CLI binaries. Fixes #369.
PDF Error Handling Regression
- Corrupted PDFs now correctly return errors instead of empty results.
Added
Agent Skill for AI Coding Assistants
- Added `skills/kreuzberg/SKILL.md` following the Agent Skills open standard.
MIME Type Mappings
- Added `.docbook` and `.jats` file extension mappings.
Changed
API Parity
- Added `security_limits` field to all 9 language bindings for API parity with the Rust core.
See CHANGELOG.md for full details.