This development release focuses on major prefill speedups for GLM-5.2 and MiniMax M3 through custom kernels, API-visible model presets/profiles, and VLM/cache hardening after 0.4.4.
Highlights
- Major prefill speedups for GLM-5.2 and MiniMax M3 through custom kernels. oMLX now includes GLM MoE DSA / Sparse MLA native kernels and MiniMax M3 sparse-attention acceleration, with build support in the Python package and macOS app. (#1984)
- API-visible model profiles and refreshed global presets. Profiles can now be exposed as `<model>:<profile>` or `<alias>:<profile>` in `/v1/models` and served through the same loaded engine, while the built-in presets now include MiniMax-M3 and GLM-5.2. by @pablomoralesm in #1838
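As a rough illustration of the profile naming scheme above, the sketch below splits `/v1/models` entries of the form `<model>:<profile>` into a base name and a profile. The sample payload, the profile names, and the helpers `split_profile_id` and `group_profiles` are hypothetical. The sketch also assumes the profile is whatever follows the last colon, which may not match how oMLX resolves IDs internally.

```python
from typing import Dict, List, Optional, Tuple


def split_profile_id(model_id: str) -> Tuple[str, Optional[str]]:
    """Split '<model>:<profile>' into (model, profile).

    Returns (model_id, None) when there is no usable ':<profile>' suffix.
    Assumption: the profile is the text after the LAST colon.
    """
    base, sep, profile = model_id.rpartition(":")
    if not sep or not base or not profile:
        return model_id, None
    return base, profile


# Hypothetical /v1/models payload in the OpenAI-compatible list shape.
# The ids and profile names are made up for illustration.
sample_models_response = {
    "object": "list",
    "data": [
        {"id": "GLM-5.2-oQ4", "object": "model"},
        {"id": "GLM-5.2-oQ4:coding", "object": "model"},
        {"id": "m3:long-context", "object": "model"},
    ],
}


def group_profiles(payload: dict) -> Dict[str, List[str]]:
    """Map each base model (or alias) to the profiles listed for it."""
    grouped: Dict[str, List[str]] = {}
    for entry in payload["data"]:
        base, profile = split_profile_id(entry["id"])
        grouped.setdefault(base, [])
        if profile is not None:
            grouped[base].append(profile)
    return grouped


if __name__ == "__main__":
    print(group_profiles(sample_models_response))
```

An OpenAI-compatible client would presumably select a profile by passing the full `<model>:<profile>` ID as the `model` field of a request.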
Performance Snapshot
Model: GLM-5.2-oQ4 (418.1 GB)
Machine: Mac Studio, M3 Ultra, 512 GB unified memory
Workload: single request, 128 generated tokens (PP = prompt processing, i.e. prefill throughput; TG = token generation throughput)
| Context | Baseline PP | oMLX 0.4.5 PP | PP vs baseline | Baseline TG | oMLX 0.4.5 TG | TG vs baseline |
|---|---|---|---|---|---|---|
| 1k | 186.8 tok/s | 187.7 tok/s | 1.00x (+0.5%) | 15.6 tok/s | 15.6 tok/s | +0.0% |
| 4k | 187.4 tok/s | 212.2 tok/s | 1.13x (+13.2%) | 14.7 tok/s | 14.9 tok/s | +1.4% |
| 8k | 164.1 tok/s | 192.8 tok/s | 1.17x (+17.5%) | 14.4 tok/s | 14.9 tok/s | +3.5% |
| 16k | 128.1 tok/s | 178.9 tok/s | 1.40x (+39.7%) | 14.4 tok/s | 14.7 tok/s | +2.1% |
| 32k | 87.7 tok/s | 174.4 tok/s | 1.99x (+98.9%) | 14.1 tok/s | 14.5 tok/s | +2.8% |
Model: MiniMax-M3-oQ3 (187.3 GB)
Workload: single request, 128 generated tokens
| Context | Baseline PP | oMLX 0.4.5 PP | PP vs baseline | Baseline TG | oMLX 0.4.5 TG | TG vs baseline |
|---|---|---|---|---|---|---|
| 1k | 325.3 tok/s | 349.6 tok/s | 1.07x (+7.5%) | 28.5 tok/s | 29.7 tok/s | +4.2% |
| 4k | 351.0 tok/s | 359.4 tok/s | 1.02x (+2.4%) | 20.1 tok/s | 20.4 tok/s | +1.5% |
| 8k | 332.1 tok/s | 343.8 tok/s | 1.04x (+3.5%) | 20.1 tok/s | 20.0 tok/s | -0.5% |
| 16k | 293.7 tok/s | 340.9 tok/s | 1.16x (+16.1%) | 19.0 tok/s | 19.7 tok/s | +3.7% |
| 32k | 228.1 tok/s | 327.1 tok/s | 1.43x (+43.4%) | 18.8 tok/s | 19.0 tok/s | +1.1% |
| 64k | 158.8 tok/s | 307.7 tok/s | 1.94x (+93.8%) | 16.0 tok/s | 17.5 tok/s | +9.4% |
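The "vs baseline" columns in the tables above can be reproduced from the raw throughput figures. The following is a minimal sketch of that arithmetic. The helper names `format_speedup` and `format_delta` are illustrative and not part of oMLX, and the rounding (two decimals for the ratio, one for the percentage) is inferred from the table values.

```python
def format_speedup(baseline_tps: float, new_tps: float) -> str:
    """Format a throughput change as 'R.RRx (+P.P%)', as in the PP columns."""
    ratio = new_tps / baseline_tps
    return f"{ratio:.2f}x ({(ratio - 1) * 100:+.1f}%)"


def format_delta(baseline_tps: float, new_tps: float) -> str:
    """Format a throughput change as a signed percentage, as in the TG columns."""
    return f"{(new_tps / baseline_tps - 1) * 100:+.1f}%"


if __name__ == "__main__":
    # GLM-5.2-oQ4, 32k context, prefill throughput from the first table
    print(format_speedup(87.7, 174.4))   # 1.99x (+98.9%)
    # MiniMax-M3-oQ3, 64k context, prefill throughput from the second table
    print(format_speedup(158.8, 307.7))  # 1.94x (+93.8%)
    # MiniMax-M3-oQ3, 8k context, generation throughput
    print(format_delta(20.1, 20.0))      # -0.5%
```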
New Features
- Added GLM-5.2 bundled Sparse MLA DSA custom kernels, including DSA indexer, exact block attention, q8 V-up, and fused MoE support. (#1984)
- Added MiniMax M3 sparse-attention acceleration and adaptive long-prefill sizing.
- Added API-visible model profiles for OpenAI-compatible clients, with web and macOS UI support. by @pablomoralesm in #1838
- Updated global model presets for MiniMax-M3 and GLM-5.2.
- Added Brazilian Portuguese admin UI localization. by @victor-torres in #1919
- Added Gemma 4 QAT model support in the quantization tool. by @kreeger in #1690
- Added native Qwen2ForCausalLM embedding serving for models such as jina-code-embeddings and gte-Qwen2. by @JimStenstrom in #1720
- Modernized macOS app internals with native SwiftUI controls and Observation-based view models. by @Stv-X in #1891 and #1952
Bug Fixes
- Fixed head_dim=256 long-context prefill OOM by routing eligible prefill through the tiled SDPA256 path. by @StevePierce in #2025
- Fixed false VLM preflight rejections by counting actual image tokens instead of charging every image at the max-pixels ceiling. by @fqx in #1994
- Fixed VLM teardown memory reclaim by dropping wrapper/model references before final MLX reclaim. by @zwcf5200 in #2010
- Fixed SSD cache limit enforcement across model switches and composite `CacheList` / nested `nstate` SSD serialization. by @apcooley in #1939
- Fixed unsafe in-flight model unload races, tuned tiered Memory Guard thresholds, and improved MiniMax M3 long-generation cache materialization.
- Fixed Gemma 4 parenthesized `call:name(...)` tool calls. by @richgoodson in #1886
- Fixed Cohere2 MoE streamed tool arguments with literal control characters and unsafe BPE streaming detokenization. by @ttapper in #1931
- Fixed `/v1/responses` system-message fallback and missing reasoning output. by @imi4u36d in #1923
- Fixed MCP stdio configs with `cwd`. by @JimStenstrom in #1987
- Fixed CLI bootstrap base-path loading for non-default installs. by @bspaulding in #1936
- Fixed Gemma4 Unified oQ sanitize proxy handling for audio-capable VLM checkpoints.
- Fixed Gemma4 E2B/E4B shared-KV VLM checkpoint loading so affected models no longer fall back to text-only LLM loading.
- Fixed Gemma E4B streaming output leaking raw `<pad>` / `<eos>` stop tokens.
- Fixed MiniMax M3 oQ sanitize paths for proxy sensitivity and compatibility patch ordering.
- Fixed admin/macOS UI issues including clipped chat action buttons, stale auto-start toggle state, nested local model display names, and the About docs link. by @shreyash0k in #2000 and @ryan-gustafson in #1949
New Contributors
Thank you to @pablomoralesm, @ttapper, @victor-torres, @bspaulding, @ryan-gustafson, @apcooley, @shreyash0k, @StevePierce, and @zwcf5200 for making their first contributions since 0.4.4.
Full Changelog: v0.4.4...v0.4.5.dev1