This release focuses on native MarkItDown document processing, Qwen throughput and DFlash stability, adaptive Burst Decode throughput, Gemma 4 unified multimodal support, and broad cache/server reliability fixes.
Highlights
- Added native MarkItDown document processing. Chat file uploads can now be converted through MarkItDown, including PDF, DOCX, PPTX, TXT, and Markdown inputs, with OpenAI-style
file_datasupport. - Added selectable PDF processing engines. PDFs can use either MarkItDown conversion or VLM OCR from Integration settings.
- Recovered Qwen throughput after the 0.4.0 regression. Qwen VLM/MTP batching and singleton decode handling were updated so single-row decode no longer falls into the slower batched cache path.
- Improved Qwen DFlash stability. DFlash Qwen target ops now stay pointed at the real text wrapper after the mlx-lm pipeline wrapper update, and idle DFlash engines are isolated across model switches.
- Added adaptive Burst Decode. oMLX can now coalesce multiple decode steps per executor hand-off to improve fast single-request decode throughput, with bounded responsiveness and Off / Light / Balanced / Aggressive controls.
- Added Gemma 4 unified audio input support. Gemma 4 unified models can accept audio input alongside image inputs, with suppress-token handling for multimodal placeholders.
- Improved long-context cache reliability. SSD cache pending-write saturation is tuned by block size and model KV size, transient writer backlog waits before dropping blocks, and hot-cache memory is reclaimed after model unload.
- Improved model and server controls. Server-wide context window caps, comma-separated bind addresses, embedding context fallback, and better engine teardown behavior are now covered.
Performance
- Burst Decode further reduces per-token executor overhead on fast local decode paths. Tokens may arrive in small bursts; the default Balanced mode can be changed from Global Settings -> Advanced.
- Internal
tg512measurements show the main Qwen regression recovered while keeping Gemma performance stable:
Fixes
- Fixed Qwen VLM/MTP decode behavior by keeping singleton cache layers out of
BatchKVCacheuntil a second request is actually appended. - Fixed Qwen DFlash output corruption after model switches by patching dflash-mlx Qwen target wrapper detection and unloading other idle DFlash engines before loading a new DFlash model. (#1707)
- Fixed prompt-prefix token seeding in
BatchGeneratorso penalty processors see cached prompt tokens correctly. Thanks to @Yukon for identifying the missing token buffer seed. - Fixed generation recovery for MLX
__next_prime overflowerrors by resetting decode state and retrying affected requests serially. (#1725) - Fixed chunked prefill admission so prefilling requests count against the configured concurrency cap. (#1704)
- Fixed SSD cache compatibility across model and cache-layout switches so valid persisted blocks are not deleted for other models or layouts.
- Fixed SSD cache write saturation for long-context workloads by tuning pending-write capacity from real block/KV size and waiting through transient writer backlog. by @cfbraun (#1627)
- Fixed SSD cache hit decode overhead by materializing restored cache backing arrays before decode starts.
- Fixed scalar mRoPE cache offsets for cached VLM prefixes.
- Fixed hot-cache memory retained after model unload and made the admin hot-cache clear action reclaim orphaned hot-cache owners and MLX buffers. by @khsd6327 (#1713)
- Fixed engine close fallback paths so SSD cache managers are still released when shutdown/deep reset raises.
- Fixed stuck engine teardown by treating long teardown stalls as fatal so a supervisor can restart from a clean process.
- Fixed embedding context length handling so
/v1/embeddingsuses request limits, configured context caps, or the model's own context length instead of falling back to 512 tokens. by @JimStenstrom (#1718) - Fixed non-ASCII API keys returning 500; invalid credentials now return 401. by @richgoodson (#1719)
- Fixed
response_formatdowngrade visibility by returning a client-visibleWarningheader when grammar-constrained output cannot be enforced. by @richgoodson (#1564) - Fixed raw tool-call JSON recovery when model output contains tabs or newlines inside string values.
- Added Hermes-style tool call parsing and streaming support. by @scubamount (#1596)
- Fixed Gemma 4 MoE multi-turn tool-call corruption by stripping stray tool-call close markers before history is re-rendered. by @kreeger (#1665)
- Fixed Gemma special tokens leaking into API output. by @JimStenstrom (#1698)
- Fixed multimodal oQ precision by preserving protected vision and audio tensors as float32 where needed. by @dodams258 (#1682)
- Fixed STT language handling so ISO language codes are preserved for backends that expect codes, while Qwen3-ASR-style backends still receive language names. (#1733)
- Fixed mlx-audio resample export compatibility for input audio.
- Fixed decoder-aware streaming detokenization and kept the fallback path compatible with older tokenizer wrappers.
- Fixed Gemma 4 unified routing, prompt kwargs preservation, assistant drafter acceptance, and suppress-token handling across scheduler, DFlash, and VLM MTP paths.
- Bumped the mlx-vlm pin to include Gemma 4 shared-KV/load fixes, Qwen quantized KV prompt-state fixes, Qwen3-VL visual mask alignment, Phi 3.5 VL EOS fixes, and prior unified audio/MTP fixes.
- Fixed small-system memory behavior so sub-24GB Apple Silicon systems use the small-system reserve path.
- Fixed idle CPU overhead while models are loaded and guarded idle wakeups for partial engine cores.
- Fixed interactive Claude model picker launch behavior when tier models are configured. by @fparrav (#1638)
- Fixed prerelease Homebrew formula updates so prerelease tags do not update the stable formula.
- Fixed packaging documentation for the staged app path and stale DMG build wording. by @cfbraun (#1458)
macOS App and Admin UI
- Added all DFlash settings to the native model config screen, including verify mode, draft window and sink sizes, quantization controls, in-memory cache entries, and SSD cache size.
- Added the VLM MTP toggle to model settings and gated overlapping speculative controls while VLM MTP is enabled. by @jabagawee (#1654)
- Added the server-wide context window cap to the admin settings UI.
- Added the Burst Decode setting to Global Settings -> Advanced.
- Added support for comma-separated bind addresses in the Host setting, including validation and alias detection. by @fqx (#1606)
- Fixed the chat composer at 768px tablet portrait width so the sidebar no longer covers it. by @pigeonstorm (#1699)
- Fixed the Settings menu so it stays available when the server is stopped.
- Fixed the login page so Auto theme honors the system dark-mode preference. by @monroewilliams (#1728)
- Fixed localized Memory Guard strings so placeholder interpolation no longer leaves stale tokens or duplicated units. by @fqx (#1730)
- Improved process naming so the server appears as
omlx-server. by @iamckun (#1658)
New Contributors
Thank you to everyone making their first contribution in this release:
@jabagawee, @Cmerrill1713, @kreeger, @sje397, @dodams258, @pigeonstorm, @JimStenstrom, @jackwh, @iamckun, @Kistaro.
Full Changelog: v0.4.1...v0.4.2
