jundot/omlx v0.4.2 on GitHub

This release focuses on native MarkItDown document processing, Qwen throughput and DFlash stability, adaptive Burst Decode throughput, Gemma 4 unified multimodal support, and broad cache/server reliability fixes.

Highlights

Added native MarkItDown document processing. Chat file uploads can now be converted through MarkItDown, including PDF, DOCX, PPTX, TXT, and Markdown inputs, with OpenAI-style file_data support.
Added selectable PDF processing engines. PDFs can use either MarkItDown conversion or VLM OCR from Integration settings.
Recovered Qwen throughput after the 0.4.0 regression. Qwen VLM/MTP batching and singleton decode handling were updated so single-row decode no longer falls into the slower batched cache path.
Improved Qwen DFlash stability. DFlash Qwen target ops now stay pointed at the real text wrapper after the mlx-lm pipeline wrapper update, and idle DFlash engines are isolated across model switches.
Added adaptive Burst Decode. oMLX can now coalesce multiple decode steps per executor hand-off to improve fast single-request decode throughput, with bounded responsiveness and Off / Light / Balanced / Aggressive controls.
Added Gemma 4 unified audio input support. Gemma 4 unified models can accept audio input alongside image inputs, with suppress-token handling for multimodal placeholders.
Improved long-context cache reliability. SSD cache pending-write saturation is tuned by block size and model KV size, transient writer backlog waits before dropping blocks, and hot-cache memory is reclaimed after model unload.
Improved model and server controls. Server-wide context window caps, comma-separated bind addresses, embedding context fallback, and better engine teardown behavior are now covered.

Performance

Burst Decode further reduces per-token executor overhead on fast local decode paths. Tokens may arrive in small bursts; the default Balanced mode can be changed from Global Settings -> Advanced.
Internal tg512 measurements show the main Qwen regression recovered while keeping Gemma performance stable:

Fixes

Fixed Qwen VLM/MTP decode behavior by keeping singleton cache layers out of BatchKVCache until a second request is actually appended.
Fixed Qwen DFlash output corruption after model switches by patching dflash-mlx Qwen target wrapper detection and unloading other idle DFlash engines before loading a new DFlash model. (#1707)
Fixed prompt-prefix token seeding in BatchGenerator so penalty processors see cached prompt tokens correctly. Thanks to @Yukon for identifying the missing token buffer seed.
Fixed generation recovery for MLX __next_prime overflow errors by resetting decode state and retrying affected requests serially. (#1725)
Fixed chunked prefill admission so prefilling requests count against the configured concurrency cap. (#1704)
Fixed SSD cache compatibility across model and cache-layout switches so valid persisted blocks are not deleted for other models or layouts.
Fixed SSD cache write saturation for long-context workloads by tuning pending-write capacity from real block/KV size and waiting through transient writer backlog. by @cfbraun (#1627)
Fixed SSD cache hit decode overhead by materializing restored cache backing arrays before decode starts.
Fixed scalar mRoPE cache offsets for cached VLM prefixes.
Fixed hot-cache memory retained after model unload and made the admin hot-cache clear action reclaim orphaned hot-cache owners and MLX buffers. by @khsd6327 (#1713)
Fixed engine close fallback paths so SSD cache managers are still released when shutdown/deep reset raises.
Fixed stuck engine teardown by treating long teardown stalls as fatal so a supervisor can restart from a clean process.
Fixed embedding context length handling so /v1/embeddings uses request limits, configured context caps, or the model's own context length instead of falling back to 512 tokens. by @JimStenstrom (#1718)
Fixed non-ASCII API keys returning 500; invalid credentials now return 401. by @richgoodson (#1719)
Fixed response_format downgrade visibility by returning a client-visible Warning header when grammar-constrained output cannot be enforced. by @richgoodson (#1564)
Fixed raw tool-call JSON recovery when model output contains tabs or newlines inside string values.
Added Hermes-style tool call parsing and streaming support. by @scubamount (#1596)
Fixed Gemma 4 MoE multi-turn tool-call corruption by stripping stray tool-call close markers before history is re-rendered. by @kreeger (#1665)
Fixed Gemma special tokens leaking into API output. by @JimStenstrom (#1698)
Fixed multimodal oQ precision by preserving protected vision and audio tensors as float32 where needed. by @dodams258 (#1682)
Fixed STT language handling so ISO language codes are preserved for backends that expect codes, while Qwen3-ASR-style backends still receive language names. (#1733)
Fixed mlx-audio resample export compatibility for input audio.
Fixed decoder-aware streaming detokenization and kept the fallback path compatible with older tokenizer wrappers.
Fixed Gemma 4 unified routing, prompt kwargs preservation, assistant drafter acceptance, and suppress-token handling across scheduler, DFlash, and VLM MTP paths.
Bumped the mlx-vlm pin to include Gemma 4 shared-KV/load fixes, Qwen quantized KV prompt-state fixes, Qwen3-VL visual mask alignment, Phi 3.5 VL EOS fixes, and prior unified audio/MTP fixes.
Fixed small-system memory behavior so sub-24GB Apple Silicon systems use the small-system reserve path.
Fixed idle CPU overhead while models are loaded and guarded idle wakeups for partial engine cores.
Fixed interactive Claude model picker launch behavior when tier models are configured. by @fparrav (#1638)
Fixed prerelease Homebrew formula updates so prerelease tags do not update the stable formula.
Fixed packaging documentation for the staged app path and stale DMG build wording. by @cfbraun (#1458)

macOS App and Admin UI

Added all DFlash settings to the native model config screen, including verify mode, draft window and sink sizes, quantization controls, in-memory cache entries, and SSD cache size.
Added the VLM MTP toggle to model settings and gated overlapping speculative controls while VLM MTP is enabled. by @jabagawee (#1654)
Added the server-wide context window cap to the admin settings UI.
Added the Burst Decode setting to Global Settings -> Advanced.
Added support for comma-separated bind addresses in the Host setting, including validation and alias detection. by @fqx (#1606)
Fixed the chat composer at 768px tablet portrait width so the sidebar no longer covers it. by @pigeonstorm (#1699)
Fixed the Settings menu so it stays available when the server is stopped.
Fixed the login page so Auto theme honors the system dark-mode preference. by @monroewilliams (#1728)
Fixed localized Memory Guard strings so placeholder interpolation no longer leaves stale tokens or duplicated units. by @fqx (#1730)
Improved process naming so the server appears as omlx-server. by @iamckun (#1658)

New Contributors

Thank you to everyone making their first contribution in this release:

@jabagawee, @Cmerrill1713, @kreeger, @sje397, @dodams258, @pigeonstorm, @JimStenstrom, @jackwh, @iamckun, @Kistaro.

Full Changelog: v0.4.1...v0.4.2

jundot/omlx v0.4.2 0.4.2 on GitHub

Highlights

Performance

Fixes

macOS App and Admin UI

New Contributors

jundot/omlx v0.4.2
0.4.2

on GitHub