jundot/omlx v0.4.4rc1 on GitHub

This release candidate focuses on DiffusionGemma support, DeepSeek V4 oQ/MTP support, and major cache-reuse correctness improvements for agent workloads.

Highlights

Added DiffusionGemma support via @Blaizzy's mlx-vlm, currently without cache. oMLX can now serve DiffusionGemma models through the mlx-vlm path.
Added DeepSeek V4 oQ quantization and MTP support. oMLX now supports fractional oQ levels, pre-quantized DeepSeek V4 oQ tensors, and safer DeepSeek V4 MTP loading and rollback behavior.
Dramatically improved agent cache reuse for Gemma and Qwen models. Paged SSD cache and prefix-cache correctness fixes now prevent stale layer-cache reuse, restore prefixes for rotating-family models more safely, and strip superseded rotating-tip payloads, so agent-style repeated prompts reuse the cache much more reliably. by @cfbraun in #1815 and @hojin12312 in #1807
Improved Memory Guard and preflight accounting. Scheduler preflight now avoids counting hot-cache bytes incorrectly, includes chunk KV growth, mirrors MLX SDPA fallback behavior more closely, and handles tracked model memory before load admission.
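
As a rough illustration of the stale layer-cache fix described in the highlights above, the sketch below shows one way a cached block can be tagged with a signature of the per-layer cache layout and rejected when that layout no longer matches. This is not oMLX's code: the function names, the entry dictionary shape, and the cache-type strings are all hypothetical, chosen only to show the idea.

```python
import hashlib
import json
from typing import Optional, Sequence


def layer_cache_signature(layer_cache_types: Sequence[str]) -> str:
    """Hash the ordered list of per-layer cache type names (hypothetical scheme)."""
    payload = json.dumps(list(layer_cache_types)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def load_block_if_fresh(entry: dict, current_signature: str) -> Optional[bytes]:
    """Return the cached payload only if it was written under the same layout.

    `entry` is an assumed on-disk record with "signature" and "payload" keys.
    A mismatch means the block is stale, so the caller should drop it and
    recompute instead of reusing it.
    """
    if entry.get("signature") != current_signature:
        return None
    return entry["payload"]


# Example: a block written under one layout is rejected under another.
old_sig = layer_cache_signature(["KVCache", "KVCache"])
new_sig = layer_cache_signature(["KVCache", "RotatingKVCache"])
stored = {"signature": old_sig, "payload": b"kv-block"}
fresh = load_block_if_fresh(stored, old_sig)
stale = load_block_if_fresh(stored, new_sig)
```

Comparing a signature on read, rather than trusting the prompt-prefix hash alone, is what lets a cache treat a changed layer layout as a miss instead of returning an incompatible block.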

Improvements and Fixes

Added VLM MTP support with an external Qwen MTP drafter. VLM models can now use a separate Qwen MTP draft model for speculative decoding, with admin UI support for the new drafter settings. by @imi4u36d in #1791
Added a scheduler-facing output parser abstraction for Harmony, Gemma 4, and Cohere2 MoE style streamed output/tool parsing.
Fixed VLM MTP hidden-output preservation and made MTP draft rollback atomic.
Fixed DeepSeek V4 MTP rollback/loading and rejected unsupported DeepSeek V4 fp16 oQ configurations.
Added fractional oQ levels and support for pre-quantized DeepSeek V4 oQ tensors.
Fixed TurboQuant cache handling after hybrid cache restore and chunked prefill cache insertion. (#1793)
Fixed benchmark loading so VLM MTP benchmark paths are not forced through LM-only loading. by @imi4u36d in #1813
Added a benchmark option to force mlx-lm loading when needed.
Fixed paged SSD cache invalidation for stale layer cache signatures. by @cfbraun in #1815
Fixed prefix-cache restore for rotating-family models and stripped superseded rotating-tip payloads. by @hojin12312 in #1807
Added a prefix-cache divergence probe for always-miss diagnosis. by @popfido in #1784
Fixed scheduler preflight accounting for hot-cache bytes, chunk KV growth, SDPA fallback, and scheduler memory-guard test doubles. (#1796, #1797)
Fixed engine-pool settle waits so other serving engines are not delayed unnecessarily. by @JimStenstrom in #1785
Fixed logits_processors rows being dropped to None during batch merge. by @efortin in #1799
Improved API-key safety: rejected keys are now logged as fingerprints rather than raw values, and non-ASCII configured API keys are rejected at validation time. by @richgoodson in #1751 and #1804
Fixed admin global profile form synchronization. (#1816)
Refined speculative draft model dropdowns and removed the experimental label from VLM MTP.
Added cohere2_moe support via mlx-vlm. (#1809)
Improved native embedding, reranker, DFlash, and audio model paths, including BGE/XLM-R/BERT serving, DFlash memory/cache accounting, TTS language forwarding, Gemma 4 Unified discovery, and NeMo ASR discovery.
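
For readers unfamiliar with the API-key changes listed above, here is a minimal sketch of the two ideas: logging a short hash of a rejected key instead of the key itself, and refusing non-ASCII keys when configuration is validated. It is an assumption-laden illustration, not oMLX's implementation; the function names, the `sha256:` prefix, and the 8-character length are made up for the example.

```python
import hashlib


def fingerprint_key(key: str, length: int = 8) -> str:
    """Return a short, non-reversible fingerprint suitable for log lines.

    Hypothetical helper: hashes the key with SHA-256 and keeps a prefix,
    so the raw key never appears in logs.
    """
    digest = hashlib.sha256(key.encode("utf-8", errors="replace")).hexdigest()
    return f"sha256:{digest[:length]}"


def validate_configured_key(key: str) -> str:
    """Reject empty or non-ASCII keys at configuration time (hypothetical)."""
    if not key:
        raise ValueError("API key must not be empty")
    if not key.isascii():
        raise ValueError("API key must contain only ASCII characters")
    return key


# Example: the log line carries only the fingerprint of the rejected key.
rejected = "not-the-real-key"
log_line = f"rejected API key {fingerprint_key(rejected)}"
```

Validating at configuration time means a bad key fails fast at startup rather than surfacing later as a confusing header-encoding error on the first request.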

Thanks

Thanks to @richgoodson, @paalolav, @JimStenstrom, @apetersson, @popfido, @FaisalFehad, @scaryrawr, @hojin12312, @efortin, @imi4u36d, and @cfbraun for the reports and fixes that shaped this release.

New Contributors

Thank you to @paalolav, @hojin12312, and @efortin for making their first contributions in this release.

Full Changelog: v0.4.3...v0.4.4rc1
