This release rolls up the 0.4.4 release candidates and focuses on MiniMax M3 support, DiffusionGemma support, DeepSeek V4 oQ/MTP support, stronger macOS 27 compatibility, safer MTP batching, cache-reuse correctness, and Memory Guard hardening.
Highlights
- Early support for MiniMax M3 via the upstream mlx-vlm PR. oMLX now tracks the not-yet-merged MiniMax M3 work from Blaizzy/mlx-vlm#1374, originally contributed by @ivanfioravanti, so MiniMax M3 / MiniMax M3 VL can be tried before that support lands upstream. This includes native-text VLM adaptation, MiniMax position handling, sparse-attention left-padding fixes, tool-call marker handling, and related prefix/cache support.
- Added DiffusionGemma and expanded speculative decoding support. oMLX can now serve DiffusionGemma through the mlx-vlm path, and VLM MTP can use an external Qwen MTP drafter.
- Stronger macOS 27 compatibility. oMLX now uses a macOS memory stats compatibility layer for newer `HOST_VM_INFO64` layouts, keeping Memory Guard decisions and admin memory telemetry stable on newer macOS releases. (#1749, #1835)
- Added DeepSeek V4 oQ quantization and MTP support. This includes fractional oQ levels, pre-quantized DeepSeek V4 oQ tensors, and safer DeepSeek V4 MTP loading and rollback behavior.
- Improved agent cache reuse and cache correctness. Paged SSD cache, prefix-cache restore, rotating-family cache handling, and MiniMax M3 partial-cache resume are now safer for repeated agent-style workloads. by @cfbraun in #1815 and @hojin12312 in #1807
- Made native MTP batching safer. Native MTP decode now realigns batch rows and defers unsafe late-join rows, avoiding speculative batching across mismatched cache positions. by @efortin in #1824 and @richgoodson in #1845
- Strengthened Memory Guard and hot-cache behavior. oMLX now has better preflight accounting, binding-ceiling diagnostics, and hot-cache pressure handling. by @cfbraun in #1452 and @isaac-cf-wong in #1863
- Improved Gemma 4, Harmony, Codex App, and Hermes integration behavior. Tool-call parsing is more robust, malformed Harmony channels are preserved, Codex App Desktop launch is available, and Hermes now launches through the correct `hermes chat` flow. by @richgoodson in #1854, @jimicze in #1852, and @fparrav in #1878
Improvements and Fixes
- Added MiniMax M3 native-text VLM support, sparse-attention patching, position-id handling, output parsing, tool-call filtering, and cache/type-handler support.
- Exposed nested VLM language models through the oQ sanitize-plan proxy so MiniMax-style nested VLMs can be quantized. by @gilby in #1881
- Added VLM MTP support with an external Qwen MTP drafter and fixed VLM MTP benchmark / mRoPE adapter routing paths. by @imi4u36d in #1791, #1813, and #1839
- Fixed the native VLM MTP drafter picker for `qwen3_5_mtp` models. by @chenqianhe in #1860
- Enabled tool calling on the serial diffusion lane. by @scubamount in #1837
- Fixed SSD cache invalidation for stale layer-cache signatures and rotating-tip cache payload handling. by @cfbraun in #1815 and @hojin12312 in #1807
- Added a prefix-cache divergence probe and improved DFlash cached-token reporting / pre-load admission accounting. by @popfido in #1784, #1768, and #1766
- Fixed engine-pool settle waits so other serving engines are not delayed unnecessarily. by @JimStenstrom in #1785
- Fixed DFlash prefill Memory Guard enforcement on the primary path. by @JimStenstrom in #1770
- Fixed Gemma 4 MCP-namespaced and single-quoted tool calls. by @richgoodson in #1854
- Fixed `/v1/completions` `thinking_budget` forwarding and hardening. by @richgoodson in #1844 and @efortin in #1821
- Fixed row-aligned samplers/logits processors after batch-row removal. by @efortin in #1824 and @richgoodson in #1845
- Fixed non-ASCII configured API-key validation and safer rejected-key logging. by @richgoodson in #1804 and #1751
- Added faithful BGE serving on MLX and improved native embedding/reranker behavior. by @paalolav in #1767
- Added TTS language forwarding and NeMo ASR model detection. by @apetersson in #1773 and @scaryrawr in #1742
- Fixed Gemma 4 Unified detection as VLM. by @FaisalFehad in #1744
- Added Codex App Desktop integration and fixed Hermes launch command handling. by @jimicze in #1852 and @fparrav in #1878
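
The row-alignment fix listed above (samplers and logits processors after batch-row removal) concerns per-row state that has to shrink in step with the batch. The sketch below only illustrates that idea and is not oMLX's code: `remove_rows` and the dictionary-of-lists batch state are hypothetical names chosen for the example.

```python
def remove_rows(batch_state: dict[str, list], finished: set[int]) -> dict[str, list]:
    """Drop finished row indices from every per-row list using one shared keep-list.

    Hypothetical helper: filtering every list with the same indices is what
    keeps row i's sampler paired with row i's request after a removal.
    """
    lengths = {len(values) for values in batch_state.values()}
    if len(lengths) > 1:
        raise ValueError("per-row lists are already misaligned")
    row_count = lengths.pop() if lengths else 0
    keep = [i for i in range(row_count) if i not in finished]
    return {name: [values[i] for i in keep] for name, values in batch_state.items()}


# Illustrative batch of three rows; the values are placeholders, not real objects.
state = {
    "request_ids": ["a", "b", "c"],
    "samplers": ["greedy", "top_p", "top_k"],
    "logits_processors": [[], ["repetition_penalty"], []],
}
state = remove_rows(state, finished={1})
```

After the call, row `"c"` is still paired with `"top_k"`. Filtering only `request_ids` and leaving the other lists untouched would have paired it with `"top_p"`, which is the kind of mismatch a row-aligned fix guards against.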
Changes Since 0.4.4rc2
- Fixed MiniMax M3 partial-cache resume for long-context workloads by trimming partial cache hits back to a safe 2048-token boundary before resuming prefill. (#1888)
- Fixed a cache-reuse timing issue for coding-agent style follow-up requests, where a new request arriving immediately after the previous one could miss reusable cache because the prior cache write had not finished yet.
- Limited SSD cache preloading to the amount of hot-cache space that can actually be used, avoiding unnecessary preload work under memory pressure.
- Improved scheduler recovery when cache loading stalls, so blocked admissions are cleared and new requests can resume normally.
- Changed MarkItDown's "Show as model integration" setting to default off, while keeping attachment preprocessing enabled by default.
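
The MiniMax M3 partial-cache fix above trims a partial hit back to a 2048-token boundary before prefill resumes. A minimal sketch of that arithmetic follows, assuming "boundary" means the nearest lower multiple of 2048. The function names are hypothetical and the actual oMLX logic may differ.

```python
BOUNDARY = 2048  # boundary size quoted in the release note


def trim_partial_hit(cached_tokens: int, boundary: int = BOUNDARY) -> int:
    """Round a partial cache hit down to a multiple of `boundary` (assumed meaning)."""
    if cached_tokens < 0 or boundary <= 0:
        raise ValueError("cached_tokens must be >= 0 and boundary must be > 0")
    return (cached_tokens // boundary) * boundary


def plan_resume(prompt_tokens: int, cached_tokens: int) -> tuple[int, int]:
    """Return (tokens reused from cache, tokens left to prefill)."""
    reused = trim_partial_hit(min(cached_tokens, prompt_tokens))
    return reused, prompt_tokens - reused


# A 10000-token prompt with a 5000-token partial hit reuses 4096 tokens
# and prefills the remaining 5904.
reused, remaining = plan_resume(prompt_tokens=10_000, cached_tokens=5_000)
```

Trimming gives up some cached tokens (904 in this example) in exchange for resuming from an aligned position.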
Thanks
Special thanks to @ivanfioravanti for the initial MiniMax M3 support PR in mlx-vlm, and to @Blaizzy for the mlx-vlm work that makes this path possible.
Thanks to @bbongtree1004, @richgoodson, @paalolav, @JimStenstrom, @apetersson, @popfido, @FaisalFehad, @scaryrawr, @hojin12312, @efortin, @imi4u36d, @cfbraun, @fparrav, @gilby, @jimicze, @isaac-cf-wong, @scubamount, and @chenqianhe for the reports and fixes that shaped this release.
New Contributors
Thank you to @paalolav, @hojin12312, @efortin, @jimicze, @isaac-cf-wong, and @gilby for making their first contributions since 0.4.3.
Full Changelog: v0.4.3...v0.4.4