github jundot/omlx v0.4.4rc1
8 hours ago

This release candidate focuses on DiffusionGemma support, DeepSeek V4 oQ/MTP support, and major cache-reuse correctness improvements for agent workloads.

Highlights

  • Added DiffusionGemma support via @Blaizzy's mlx-vlm. oMLX can now serve DiffusionGemma models through the mlx-vlm path, currently without caching.
  • Added DeepSeek V4 oQ quantization and MTP support. oMLX now supports fractional oQ levels, pre-quantized DeepSeek V4 oQ tensors, and safer DeepSeek V4 MTP loading and rollback behavior.
  • Dramatically improved agent cache reuse for Gemma and Qwen models. Paged SSD cache and prefix-cache correctness fixes now prevent stale layer-cache reuse, handle rotating-family prefix restore more safely, and strip superseded rotating-tip payloads so agent-style repeated prompts can reuse cache much more reliably. by @cfbraun in #1815 and @hojin12312 in #1807
  • Improved Memory Guard and preflight accounting. Scheduler preflight now avoids counting hot-cache bytes incorrectly, includes chunk KV growth, mirrors MLX SDPA fallback behavior more closely, and handles tracked model memory before load admission.
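To make the cache-reuse highlight concrete, here is a minimal sketch of why prefix stability matters for agent-style repeated prompts. It is not oMLX's implementation: the helper name `reusable_prefix_tokens` and the `block_size` of 16 are assumptions chosen for illustration. It only shows the general idea that a paged prefix cache can reuse whole blocks up to the first token where the new prompt diverges from the cached one.

```python
from typing import Sequence


def reusable_prefix_tokens(
    cached: Sequence[int], prompt: Sequence[int], block_size: int = 16
) -> int:
    """Return how many leading prompt tokens could be served from cache.

    Hypothetical helper (not part of oMLX): counts the shared token prefix,
    then rounds down to a whole number of cache blocks.
    """
    shared = 0
    for cached_token, prompt_token in zip(cached, prompt):
        if cached_token != prompt_token:
            break  # first divergence: nothing after this point is reusable
        shared += 1
    return (shared // block_size) * block_size


# An agent turn that repeats the previous prompt and appends new tokens.
previous = list(range(40))
current = previous + [900, 901, 902]
print(reusable_prefix_tokens(previous, current))  # 32 (two full blocks of 16)

# A single changed token at the start makes the whole prefix unusable.
print(reusable_prefix_tokens(previous, [999] + previous[1:]))  # 0
```

In this simplified model, an early divergence (for example a changed system prompt or timestamp) forces a full recompute, which is the kind of always-miss behavior a divergence probe is meant to help diagnose.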

Improvements and Fixes

  • Added VLM MTP support with an external Qwen MTP drafter. VLM models can now use a separate Qwen MTP draft model for speculative decoding, with admin UI support for the new drafter settings. by @imi4u36d in #1791
  • Added a scheduler-facing output parser abstraction for Harmony, Gemma 4, and Cohere2 MoE style streamed output/tool parsing.
  • Fixed VLM MTP hidden-output preservation and made MTP draft rollback atomic.
  • Fixed DeepSeek V4 MTP rollback/loading and rejected unsupported DeepSeek V4 fp16 oQ configurations.
  • Added fractional oQ levels and support for pre-quantized DeepSeek V4 oQ tensors.
  • Fixed TurboQuant cache handling after hybrid cache restore and chunked prefill cache insertion. (#1793)
  • Fixed benchmark loading so VLM MTP benchmark paths are not forced through LM-only loading. by @imi4u36d in #1813
  • Added a benchmark option to force mlx-lm loading when needed.
  • Fixed paged SSD cache invalidation for stale layer cache signatures. by @cfbraun in #1815
  • Fixed prefix-cache restore for rotating-family models and stripped superseded rotating-tip payloads. by @hojin12312 in #1807
  • Added a prefix-cache divergence probe for always-miss diagnosis. by @popfido in #1784
  • Fixed scheduler preflight accounting for hot-cache bytes, chunk KV growth, SDPA fallback, and scheduler memory-guard test doubles. (#1796, #1797)
  • Fixed engine-pool settle waits so other serving engines are not delayed unnecessarily. by @JimStenstrom in #1785
  • Fixed logits_processors rows that are dropped to None during batch merge. by @efortin in #1799
  • Improved API-key safety: rejected keys are now fingerprinted in logs, and non-ASCII configured API keys are rejected at validation time. by @richgoodson in #1751 and #1804
  • Fixed admin global profile form synchronization. (#1816)
  • Refined speculative draft model dropdowns and removed the experimental label from VLM MTP.
  • Added cohere2_moe support via mlx-vlm. (#1809)
  • Improved native embedding, reranker, DFlash, and audio model paths, including BGE/XLM-R/BERT serving, DFlash memory/cache accounting, TTS language forwarding, Gemma 4 Unified discovery, and NeMo ASR discovery.
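The API-key safety item above can be illustrated with a short sketch. This is an assumption-laden example rather than oMLX's actual code: the function names `key_fingerprint` and `validate_configured_key`, the choice of SHA-256, and the 12-character fingerprint length are all hypothetical. It shows one common way to log a stable identifier for a rejected key without writing the key itself, and to reject non-ASCII keys when configuration is validated.

```python
import hashlib


def key_fingerprint(api_key: str, length: int = 12) -> str:
    """Return a short, stable hash prefix that is safe to write to logs.

    Hypothetical helper: the raw key never appears in the output.
    """
    digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    return digest[:length]


def validate_configured_key(api_key: str) -> str:
    """Reject empty or non-ASCII keys at configuration time (illustrative)."""
    if not api_key:
        raise ValueError("API key must not be empty")
    if not api_key.isascii():
        raise ValueError("API key must contain only ASCII characters")
    return api_key


# Log a fingerprint instead of the rejected key itself.
rejected = "not-the-right-key"
print(f"rejected API key (fingerprint={key_fingerprint(rejected)})")

# Validation fails early for a key containing a non-ASCII character.
try:
    validate_configured_key("cl\u00e9-secr\u00e8te")
except ValueError as exc:
    print(f"invalid configuration: {exc}")
```

Checking at validation time surfaces a bad key when the configuration is loaded, rather than later as a confusing authentication failure.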

Thanks

Thanks to @richgoodson, @paalolav, @JimStenstrom, @apetersson, @popfido, @FaisalFehad, @scaryrawr, @hojin12312, @efortin, @imi4u36d, and @cfbraun for the reports and fixes that shaped this release.

New Contributors

Thank you to @paalolav, @hojin12312, and @efortin for making their first contributions in this release.

Full Changelog: v0.4.3...v0.4.4rc1
