github jundot/omlx v0.4.2
0.4.2

5 hours ago

This release focuses on native MarkItDown document processing, Qwen throughput and DFlash stability, adaptive Burst Decode throughput, Gemma 4 unified multimodal support, and broad cache/server reliability fixes.

Highlights

  • Added native MarkItDown document processing. Chat file uploads can now be converted through MarkItDown, including PDF, DOCX, PPTX, TXT, and Markdown inputs, with OpenAI-style file_data support.
  • Added selectable PDF processing engines. PDFs can use either MarkItDown conversion or VLM OCR from Integration settings.
  • Recovered Qwen throughput after the 0.4.0 regression. Qwen VLM/MTP batching and singleton decode handling were updated so single-row decode no longer falls into the slower batched cache path.
  • Improved Qwen DFlash stability. DFlash Qwen target ops now stay pointed at the real text wrapper after the mlx-lm pipeline wrapper update, and idle DFlash engines are isolated across model switches.
  • Added adaptive Burst Decode. oMLX can now coalesce multiple decode steps per executor hand-off to improve fast single-request decode throughput, with bounded responsiveness and Off / Light / Balanced / Aggressive controls.
  • Added Gemma 4 unified audio input support. Gemma 4 unified models can accept audio input alongside image inputs, with suppress-token handling for multimodal placeholders.
  • Improved long-context cache reliability. SSD cache pending-write saturation is tuned by block size and model KV size, transient writer backlog waits before dropping blocks, and hot-cache memory is reclaimed after model unload.
  • Improved model and server controls. Server-wide context window caps, comma-separated bind addresses, embedding context fallback, and better engine teardown behavior are now covered.

Performance

  • Burst Decode further reduces per-token executor overhead on fast local decode paths. Tokens may arrive in small bursts; the default Balanced mode can be changed from Global Settings -> Advanced.
  • Internal tg512 measurements show the main Qwen regression recovered while keeping Gemma performance stable:

Screenshot 2026-06-09 at 01 25 30

Fixes

  • Fixed Qwen VLM/MTP decode behavior by keeping singleton cache layers out of BatchKVCache until a second request is actually appended.
  • Fixed Qwen DFlash output corruption after model switches by patching dflash-mlx Qwen target wrapper detection and unloading other idle DFlash engines before loading a new DFlash model. (#1707)
  • Fixed prompt-prefix token seeding in BatchGenerator so penalty processors see cached prompt tokens correctly. Thanks to @Yukon for identifying the missing token buffer seed.
  • Fixed generation recovery for MLX __next_prime overflow errors by resetting decode state and retrying affected requests serially. (#1725)
  • Fixed chunked prefill admission so prefilling requests count against the configured concurrency cap. (#1704)
  • Fixed SSD cache compatibility across model and cache-layout switches so valid persisted blocks are not deleted for other models or layouts.
  • Fixed SSD cache write saturation for long-context workloads by tuning pending-write capacity from real block/KV size and waiting through transient writer backlog. by @cfbraun (#1627)
  • Fixed SSD cache hit decode overhead by materializing restored cache backing arrays before decode starts.
  • Fixed scalar mRoPE cache offsets for cached VLM prefixes.
  • Fixed hot-cache memory retained after model unload and made the admin hot-cache clear action reclaim orphaned hot-cache owners and MLX buffers. by @khsd6327 (#1713)
  • Fixed engine close fallback paths so SSD cache managers are still released when shutdown/deep reset raises.
  • Fixed stuck engine teardown by treating long teardown stalls as fatal so a supervisor can restart from a clean process.
  • Fixed embedding context length handling so /v1/embeddings uses request limits, configured context caps, or the model's own context length instead of falling back to 512 tokens. by @JimStenstrom (#1718)
  • Fixed non-ASCII API keys returning 500; invalid credentials now return 401. by @richgoodson (#1719)
  • Fixed response_format downgrade visibility by returning a client-visible Warning header when grammar-constrained output cannot be enforced. by @richgoodson (#1564)
  • Fixed raw tool-call JSON recovery when model output contains tabs or newlines inside string values.
  • Added Hermes-style tool call parsing and streaming support. by @scubamount (#1596)
  • Fixed Gemma 4 MoE multi-turn tool-call corruption by stripping stray tool-call close markers before history is re-rendered. by @kreeger (#1665)
  • Fixed Gemma special tokens leaking into API output. by @JimStenstrom (#1698)
  • Fixed multimodal oQ precision by preserving protected vision and audio tensors as float32 where needed. by @dodams258 (#1682)
  • Fixed STT language handling so ISO language codes are preserved for backends that expect codes, while Qwen3-ASR-style backends still receive language names. (#1733)
  • Fixed mlx-audio resample export compatibility for input audio.
  • Fixed decoder-aware streaming detokenization and kept the fallback path compatible with older tokenizer wrappers.
  • Fixed Gemma 4 unified routing, prompt kwargs preservation, assistant drafter acceptance, and suppress-token handling across scheduler, DFlash, and VLM MTP paths.
  • Bumped the mlx-vlm pin to include Gemma 4 shared-KV/load fixes, Qwen quantized KV prompt-state fixes, Qwen3-VL visual mask alignment, Phi 3.5 VL EOS fixes, and prior unified audio/MTP fixes.
  • Fixed small-system memory behavior so sub-24GB Apple Silicon systems use the small-system reserve path.
  • Fixed idle CPU overhead while models are loaded and guarded idle wakeups for partial engine cores.
  • Fixed interactive Claude model picker launch behavior when tier models are configured. by @fparrav (#1638)
  • Fixed prerelease Homebrew formula updates so prerelease tags do not update the stable formula.
  • Fixed packaging documentation for the staged app path and stale DMG build wording. by @cfbraun (#1458)

macOS App and Admin UI

  • Added all DFlash settings to the native model config screen, including verify mode, draft window and sink sizes, quantization controls, in-memory cache entries, and SSD cache size.
  • Added the VLM MTP toggle to model settings and gated overlapping speculative controls while VLM MTP is enabled. by @jabagawee (#1654)
  • Added the server-wide context window cap to the admin settings UI.
  • Added the Burst Decode setting to Global Settings -> Advanced.
  • Added support for comma-separated bind addresses in the Host setting, including validation and alias detection. by @fqx (#1606)
  • Fixed the chat composer at 768px tablet portrait width so the sidebar no longer covers it. by @pigeonstorm (#1699)
  • Fixed the Settings menu so it stays available when the server is stopped.
  • Fixed the login page so Auto theme honors the system dark-mode preference. by @monroewilliams (#1728)
  • Fixed localized Memory Guard strings so placeholder interpolation no longer leaves stale tokens or duplicated units. by @fqx (#1730)
  • Improved process naming so the server appears as omlx-server. by @iamckun (#1658)

New Contributors

Thank you to everyone making their first contribution in this release:

@jabagawee, @Cmerrill1713, @kreeger, @sje397, @dodams258, @pigeonstorm, @JimStenstrom, @jackwh, @iamckun, @Kistaro.

Full Changelog: v0.4.1...v0.4.2

Don't miss a new omlx release

NewReleases is sending notifications on new releases.