github kreuzberg-dev/kreuzberg v4.1.0
Release v4.1.0


[4.1.0] - 2026-01-21

Added

API

  • POST /chunk endpoint: New text chunking endpoint for breaking text into smaller pieces
    • Accepts JSON body with text, chunker_type (text/markdown), and optional config
    • Returns chunks with byte offsets, indices, and metadata
    • Configuration options: max_characters (default: 2000), overlap (default: 100), trim (default: true)
    • Supports both text and markdown chunking strategies
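
As a rough illustration of the request shape described above, the following Python sketch builds (but does not send) a POST /chunk request using only the standard library. The field names text, chunker_type, max_characters, overlap, and trim come from these notes. The base URL, the nesting of the options under a "config" key, and the helper name build_chunk_request are assumptions made for the example and are not confirmed by the release.

```python
import json
import urllib.request

# Assumption: a Kreuzberg API server is reachable here; host and port are hypothetical.
BASE_URL = "http://localhost:8000"


def build_chunk_request(text, chunker_type="text", max_characters=2000,
                        overlap=100, trim=True, base_url=BASE_URL):
    """Build (but do not send) a POST /chunk request.

    Hypothetical helper: the defaults mirror the values listed in the
    release notes, not a verified client API.
    """
    if chunker_type not in ("text", "markdown"):
        raise ValueError("chunker_type must be 'text' or 'markdown'")
    payload = {
        "text": text,
        "chunker_type": chunker_type,
        # Assumption: the options are nested under a "config" key.
        "config": {
            "max_characters": max_characters,
            "overlap": overlap,
            "trim": trim,
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/chunk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_chunk_request("# Title\n\nSome long text...",
                                  chunker_type="markdown")
    # Sending needs a running server, so it is left commented out:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response))
    print(request.get_method(), request.full_url)
```

The response is described only as chunks with byte offsets, indices, and metadata, so the sketch does not guess at response field names; inspect the returned JSON against a running server before relying on any particular key.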

Core

  • Djot markup format support: New .djot file extraction with comprehensive Djot syntax support
    • Full parser implementation with structured representation via DjotContent type
    • Feature-gated behind djot feature flag (enabled by default)
    • 39 comprehensive tests covering Unicode, tables, roundtrip conversion, and edge cases
  • Content output format configuration: New ContentFormat enum for configurable text output formatting
    • Converts extracted content from any supported file format to Plain, Markdown, Djot, or HTML
    • CLI support with --content-format flag and KREUZBERG_CONTENT_FORMAT environment variable
  • Element-based output format: New ResultFormat::ElementBased option provides Unstructured.io-compatible semantic element extraction
    • Extracts structured elements: titles, paragraphs, lists, tables, images, page breaks, headings, code blocks, block quotes, headers, footers
    • Each element includes rich metadata: bounding boxes, page numbers, confidence scores, hierarchy information

Language Bindings

  • All language bindings (Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, WASM) updated with:
    • Content format configuration support
    • Result format configuration for element-based output
    • DjotContent and element types

Documentation

  • Djot format documentation: New format reference and usage examples
  • Migration guides: New documentation for Unstructured.io users

Changed

Codebase

  • Major refactoring for maintainability: Split 22 large monolithic files into 110+ focused modules

Fixed

CI/CD

  • Ruby macOS builds: Fixed unused imports causing compilation failures
  • TypeScript tests on ARM64: Fixed module resolution error
  • Go Windows builds: Disabled incompatible verbose linker flags
  • PHP Windows builds: Added documentation for cargo fingerprint cache corruption issues

Documentation

  • MkDocs build: Fixed broken benchmark documentation links

Python

  • Type exports: Fixed missing type exports in kreuzberg.types.__all__

Elixir

  • DOCX keyword extraction: Fixed FunctionClauseError when extracting DOCX files with keywords metadata

See CHANGELOG.md for full details.
