github kreuzberg-dev/kreuzberg v4.2.15

Fixed

ODT List and Section Extraction

  • Fixed ODT extractor not handling text:list and text:section elements. Documents containing bulleted/numbered lists or sections returned empty content.

UTF-16 EML Parsing

  • Fixed EML files encoded in UTF-16 (LE/BE, with or without BOM) returning empty content. The parser now detects UTF-16 via BOM markers and heuristic byte-pattern analysis, then transcodes the input to UTF-8 before parsing.
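
    A minimal, std-only sketch of BOM-plus-heuristic UTF-16 detection, to illustrate the idea. The names (Utf16Order, detect_utf16, utf16_to_utf8), the 512-byte sample size, and the 70%/10% thresholds are assumptions for illustration, not kreuzberg's actual implementation.

```rust
/// Byte order of a UTF-16 stream.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Utf16Order {
    Le,
    Be,
}

/// Guess whether `bytes` is UTF-16.
/// Returns the byte order and the number of BOM bytes to skip.
fn detect_utf16(bytes: &[u8]) -> Option<(Utf16Order, usize)> {
    // 1. A BOM, when present, decides the byte order directly.
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return Some((Utf16Order::Le, 2));
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return Some((Utf16Order::Be, 2));
    }

    // 2. Heuristic (assumed): mostly-ASCII text such as email headers has a
    //    zero byte in every other position when encoded as UTF-16.
    let sample = &bytes[..bytes.len().min(512)];
    let pairs = sample.len() / 2;
    if pairs < 4 {
        return None; // too short to judge
    }
    let mut zero_first = 0usize; // zero in the first byte of a pair (BE pattern)
    let mut zero_second = 0usize; // zero in the second byte of a pair (LE pattern)
    for pair in sample.chunks_exact(2) {
        if pair[0] == 0 {
            zero_first += 1;
        }
        if pair[1] == 0 {
            zero_second += 1;
        }
    }
    // Assumed thresholds: at least 70% zeros on one side, at most 10% on the other.
    if zero_second * 10 >= pairs * 7 && zero_first * 10 <= pairs {
        Some((Utf16Order::Le, 0))
    } else if zero_first * 10 >= pairs * 7 && zero_second * 10 <= pairs {
        Some((Utf16Order::Be, 0))
    } else {
        None
    }
}

/// Transcode UTF-16 input to a UTF-8 `String`; returns `None` when the input
/// does not look like UTF-16. Invalid surrogates become U+FFFD.
fn utf16_to_utf8(bytes: &[u8]) -> Option<String> {
    let (order, skip) = detect_utf16(bytes)?;
    let units: Vec<u16> = bytes[skip..]
        .chunks_exact(2)
        .map(|p| match order {
            Utf16Order::Le => u16::from_le_bytes([p[0], p[1]]),
            Utf16Order::Be => u16::from_be_bytes([p[0], p[1]]),
        })
        .collect();
    Some(String::from_utf16_lossy(&units))
}
```

    In this sketch a caller would try utf16_to_utf8 first and fall back to the raw bytes when it returns None, so plain UTF-8 or ASCII messages pass through unchanged.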

Email Attachment Metadata Serialization

  • Fixed email extraction inserting a comma-joined string "attachments" into the additional metadata HashMap, which via #[serde(flatten)] overwrote the structured EmailMetadata.attachments array. This caused deserialization failures in Go, C#, and other typed bindings when processing emails with attachments.
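
    The key collision described above can be modeled without serde. The following std-only sketch assumes that flattened entries are written into the same object after the struct's own fields, so a duplicate key replaces the earlier value. Value, flatten, and insert_additional are hypothetical names, and the guard is one possible way to avoid the collision, not necessarily the fix shipped in this release.

```rust
use std::collections::BTreeMap;

/// A tiny stand-in for a JSON value: a string or an array of strings.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Text(String),
    List(Vec<String>),
}

/// Model of flattening: the struct's own fields are written first, then every
/// entry of the catch-all map goes into the same object. A duplicate key
/// replaces the earlier entry.
fn flatten(
    structured: &[(&str, Value)],
    additional: &BTreeMap<String, Value>,
) -> BTreeMap<String, Value> {
    let mut out = BTreeMap::new();
    for (key, value) in structured {
        out.insert(key.to_string(), value.clone());
    }
    for (key, value) in additional {
        out.insert(key.clone(), value.clone());
    }
    out
}

/// Hypothetical guard: refuse to add a catch-all entry whose key is already
/// used by a structured field. Returns whether the entry was inserted.
fn insert_additional(
    additional: &mut BTreeMap<String, Value>,
    reserved: &[&str],
    key: &str,
    value: Value,
) -> bool {
    if reserved.contains(&key) {
        return false;
    }
    additional.insert(key.to_string(), value);
    true
}
```

    With an "attachments" string in the catch-all map, flatten yields a string where typed consumers expect an array, which matches the failure mode reported for the Go and C# bindings.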

WASM Office Document Support (DOCX, PPTX, ODT)

  • Office documents now extract correctly in WASM builds.

WASM PDF Support in Non-Browser Runtimes

  • PDFium auto-initializes in all WASM runtimes (Node.js, Bun, Deno).

Elixir PageBoundary JSON Serialization

  • Added missing @derive Jason.Encoder to PageBoundary, PageInfo, and PageStructure structs.

Pre-built CLI Binary Missing MCP Command

  • The build script now enables all features for standalone CLI binaries. Fixes #369.

PDF Error Handling Regression

  • Corrupted PDFs now correctly return errors instead of empty results.

Added

Agent Skill for AI Coding Assistants

  • Added skills/kreuzberg/SKILL.md following the Agent Skills open standard.

MIME Type Mappings

  • Added .docbook and .jats file extension mappings.

Changed

API Parity

  • Added security_limits field to all 9 language bindings for API parity with Rust core.

See CHANGELOG.md for full details.
