[4.1.0] - 2026-01-21
Added
API
- POST /chunk endpoint: New text chunking endpoint for breaking text into smaller pieces
  - Accepts JSON body with `text`, `chunker_type` (text/markdown), and optional `config`
  - Returns chunks with byte offsets, indices, and metadata
  - Configuration options: `max_characters` (default: 2000), `overlap` (default: 100), `trim` (default: true)
  - Supports both text and markdown chunking strategies
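
As a rough illustration of the request described above, the following Python sketch builds a JSON body for `POST /chunk`. The field names (`text`, `chunker_type`, `config`, `max_characters`, `overlap`, `trim`) and the defaults come from these notes. Nesting the three options under `config`, the base URL, and the helper names `build_chunk_request` and `send` are assumptions made for the example, not a confirmed request schema.

```python
import json
import urllib.request

# Hypothetical address: adjust to wherever the API server is actually running.
BASE_URL = "http://localhost:8000"


def build_chunk_request(text, chunker_type="text",
                        max_characters=2000, overlap=100, trim=True):
    """Return (url, json_payload) for a POST /chunk call.

    Defaults mirror the values listed in the release notes. Placing the
    options inside a nested "config" object is an assumption.
    """
    if chunker_type not in ("text", "markdown"):
        raise ValueError("chunker_type must be 'text' or 'markdown'")
    body = {
        "text": text,
        "chunker_type": chunker_type,
        "config": {
            "max_characters": max_characters,
            "overlap": overlap,
            "trim": trim,
        },
    }
    return f"{BASE_URL}/chunk", json.dumps(body)


def send(url, payload):
    """POST the payload as JSON and return the decoded response body."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


url, payload = build_chunk_request(
    "# Title\n\nSome body text.",
    chunker_type="markdown",
    max_characters=500,
    overlap=50,
)
print(url)
print(payload)
```

Calling `send(url, payload)` against a running server should return the chunks. The exact shape of the response (offsets, indices, metadata) is not spelled out here, so the sketch does not assume one.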
Core
- Djot markup format support: New `.djot` file extraction with comprehensive Djot syntax support
  - Full parser implementation with structured representation via `DjotContent` type
  - Feature-gated behind `djot` feature flag (enabled by default)
  - 39 comprehensive tests covering Unicode, tables, roundtrip conversion, and edge cases
- Content output format configuration: New `ContentFormat` enum for configurable text output formatting
  - Converts extracted content from any file format to Plain, Markdown, Djot, or HTML
  - CLI support with `--content-format` flag and `KREUZBERG_CONTENT_FORMAT` environment variable
- Element-based output format: New `ResultFormat::ElementBased` option provides Unstructured.io-compatible semantic element extraction
  - Extracts structured elements: titles, paragraphs, lists, tables, images, page breaks, headings, code blocks, block quotes, headers, footers
  - Each element includes rich metadata: bounding boxes, page numbers, confidence scores, hierarchy information
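
To make the content format option above more concrete, here is a minimal Python sketch of how a caller might resolve the output format from a flag value or the `KREUZBERG_CONTENT_FORMAT` environment variable. The four variants come from these notes. The lowercase string spellings, the flag-over-environment precedence, the `PLAIN` fallback, and the helper name `resolve_content_format` are all assumptions for illustration and may not match the actual CLI behavior.

```python
import os
from enum import Enum


class ContentFormat(Enum):
    # Variant names follow the release notes; the string values are assumed.
    PLAIN = "plain"
    MARKDOWN = "markdown"
    DJOT = "djot"
    HTML = "html"


def resolve_content_format(cli_value=None, env=None,
                           default=ContentFormat.PLAIN):
    """Pick a format from the flag value, else the environment variable.

    Precedence (flag first) and the default are assumptions of this sketch.
    """
    env = os.environ if env is None else env
    raw = cli_value or env.get("KREUZBERG_CONTENT_FORMAT")
    if raw is None:
        return default
    # Raises ValueError for spellings outside the four assumed values.
    return ContentFormat(raw.strip().lower())


print(resolve_content_format("djot", env={}))
print(resolve_content_format(None, env={"KREUZBERG_CONTENT_FORMAT": "HTML"}))
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.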
Language Bindings
- All language bindings (Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, WASM) updated with:
  - Content format configuration support
  - Result format configuration for element-based output
  - `DjotContent` and element types
Documentation
- Djot format documentation: New format reference and usage examples
- Migration guides: New documentation for Unstructured.io users
Changed
Codebase
- Major refactoring for maintainability: Split 22 large monolithic files into 110+ focused modules
Fixed
CI/CD
- Ruby macOS builds: Fixed unused imports causing compilation failures
- TypeScript tests on ARM64: Fixed module resolution error
- Go Windows builds: Disabled incompatible verbose linker flags
- PHP Windows builds: Added documentation for cargo fingerprint cache corruption issues
Documentation
- MkDocs build: Fixed broken benchmark documentation links
Python
- Type exports: Fixed missing type exports in `kreuzberg.types.__all__`
Elixir
- DOCX keyword extraction: Fixed `FunctionClauseError` when extracting DOCX files with keywords metadata
See CHANGELOG.md for full details.