This release candidate focuses on native MarkItDown document processing, Qwen throughput recovery (x1.48), Gemma 4 unified multimodal support, and cache, scheduler, and server stability improvements.
Highlights
- Added native MarkItDown document processing. Chat file uploads can now be converted through MarkItDown, including PDF, DOCX, PPTX, TXT, and Markdown inputs, with OpenAI-style
file_datasupport. - Added selectable PDF processing engines. PDFs can use either MarkItDown conversion or VLM OCR from Integration settings (currently web dashboard only).
- Recovered Qwen throughput after the 0.4.0 regression. Qwen VLM/MTP batching and singleton decode handling were updated so single-row decode no longer falls into the slower batched cache path.
- Added Gemma 4 unified audio input support. Gemma 4 unified models can now accept audio input alongside image inputs, with suppress-token handling for multimodal placeholders.
- Improved model and server controls. A server-wide context window cap policy was added, embedding requests now respect the effective model context length, and server processes now show an
omlx-serverprocess title.
Performance
Internal tg512 measurements show the main Qwen regression recovered while keeping Gemma performance stable:
Fixes
- Fixed Qwen VLM/MTP decode behavior by keeping singleton cache layers out of
BatchKVCacheuntil a second request is actually appended. - Fixed prompt-prefix token seeding in
BatchGeneratorso penalty processors see cached prompt tokens correctly. Thanks to @Yukon for identifying the missing token buffer seed. - Fixed SSD cache compatibility across model and cache-layout switches so valid persisted blocks are not deleted for other models or layouts.
- Fixed SSD cache pressure handling by unlinking LRU files outside the bounded write queue and preserving capped-eviction observability. by @cfbraun (#1451)
- Fixed cache-store backpressure and aborted prefill cleanup so new prefills wait safely while cache cleanup is full.
- Fixed an engine-pool acquire-vs-use eviction race and active-request counter leak for embedding and rerank engines. by @Cmerrill1713 (#1668)
- Fixed Gemma 4 MoE multi-turn tool-call corruption by stripping stray tool-call close markers before history is re-rendered. by @kreeger (#1665)
- Fixed Gemma special tokens leaking into API output. by @JimStenstrom (#1698)
- Fixed raw tool-call JSON recovery when model output contains tabs or newlines inside string values.
- Added Hermes-style tool call parsing and streaming support. by @scubamount (#1596)
- Fixed
response_formatdowngrade visibility by returning a client-visibleWarningheader when grammar-constrained output cannot be enforced. by @richgoodson (#1564) - Fixed embedding context length handling so
/v1/embeddingsuses the effective model context window instead of falling back to 512 tokens. by @jackwh (#1694) - Fixed multimodal oQ precision by preserving protected vision and audio tensors as float32 where needed. by @dodams258 (#1682)
- Fixed mlx-audio resample export compatibility for input audio.
- Fixed decoder-aware streaming detokenization and kept the fallback path compatible with older tokenizer wrappers.
- Fixed Gemma 4 unified routing, prompt kwargs preservation, assistant drafter acceptance, and suppress-token handling across scheduler, DFlash, and VLM MTP paths.
- Fixed small-system memory behavior so sub-24GB Apple Silicon systems use the small-system reserve path.
- Fixed idle CPU overhead while models are loaded and guarded idle wakeups for partial engine cores.
- Fixed interactive Claude model picker launch behavior when tier models are configured. by @fparrav (#1638)
- Fixed prerelease Homebrew formula updates so prerelease tags do not update the stable formula.
- Fixed packaging documentation for the staged app path and stale DMG build wording. by @cfbraun (#1458)
macOS App
- Added all DFlash settings to the native model config screen, including verify mode, draft window and sink sizes, quantization controls, in-memory cache entries, and SSD cache size.
- Added the VLM MTP toggle to model settings and gated overlapping speculative controls while VLM MTP is enabled. by @jabagawee (#1654)
- Added the server-wide context window cap to the admin settings UI.
- Fixed the chat composer at 768px tablet portrait width so the sidebar no longer covers it. by @pigeonstorm (#1699)
- Fixed the Settings menu so it stays available when the server is stopped.
- Improved process naming so the server appears as
omlx-server. by @iamckun (#1658)
New Contributors
Thank you to everyone making their first contribution in this release:
@jabagawee, @Cmerrill1713, @kreeger, @sje397, @dodams258, @pigeonstorm, @JimStenstrom, @jackwh, @iamckun, @Kistaro.