This is the stable 0.3.9 release, consolidating the 0.3.9.dev1, 0.3.9.dev2, and 0.3.9rc1 pre-releases plus the post-rc stabilization fixes. Huge thanks to everyone who filed issues and sent PRs since 0.3.8. If you hit a bug, please open an issue.
Highlights
Native MTP (Multi-Token Prediction) for Qwen3.5 / 3.6, Gemma 4, and DeepSeek-V4
Turn it on per model in admin settings and supported models predict multiple tokens at once for faster decode. Off by default. Gemma 4 gets MTP on the vision path, so image + text requests decode noticeably faster too.
Source PRs: ml-explore/mlx-lm#990 (Qwen3.5 / 3.6, @AirRunner), Blaizzy/mlx-lm#15 (DeepSeek-V4, @0xClandestine), and @Blaizzy's mlx-vlm for Gemma 4. oQ preserves mtp.* weights via a -mtp suffix on quantized output dirs; pre-converted oQ MTP models are at huggingface.co/Jundot.
DeepSeek V4 Pro / Flash support, including SSD cache
Full V4 model + PoolingCache / BatchPoolingCache ported from ml-explore/mlx-lm#1192 by @Blaizzy, tested against mlx-community/deepseek-v4. Highlights:
- F8_E8M0 / fp8 quant branch wired into mlx_lm.utils.load_model.
- SSD + prefix cache for V4: the cache type interface was generalized from 2-tuple (keys, values) to N-tuple state (PoolingCache.state is (buf_kv, buf_gate, pooled)), new on-disk format paged_ssd_cache v3. Without this, V4 sessions silently corrupted across prefix-cache hits.
- V4 tool calling end-to-end: DSML-format parsing + emission on OpenAI / Anthropic endpoints, so V4 Pro / Flash drives Claude Code, Codex, and OpenClaw with no extra config.
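As a rough illustration of the 2-tuple to N-tuple generalization, here is a minimal sketch of a length-prefixed layout that round-trips any number of state buffers. This is not the actual paged_ssd_cache format: the helpers pack_state and unpack_state and the byte layout are hypothetical, and only the three-part (buf_kv, buf_gate, pooled) shape comes from the notes above.

```python
import struct


def pack_state(state):
    """Serialize an N-tuple of byte buffers: a count, then size-prefixed parts."""
    out = [struct.pack("<I", len(state))]      # number of parts (hypothetical header)
    for buf in state:
        out.append(struct.pack("<Q", len(buf)))  # byte length of this part
        out.append(buf)
    return b"".join(out)


def unpack_state(blob):
    """Inverse of pack_state: recover the tuple without assuming its arity."""
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    parts = []
    for _ in range(count):
        (size,) = struct.unpack_from("<Q", blob, offset)
        offset += 8
        parts.append(blob[offset:offset + size])
        offset += size
    return tuple(parts)


# A classic KV cache has two parts; a pooling-style cache has three.
kv_state = (b"keys", b"values")
pooling_state = (b"buf_kv", b"buf_gate", b"pooled")

assert unpack_state(pack_state(kv_state)) == kv_state
assert unpack_state(pack_state(pooling_state)) == pooling_state
```

A layout that hard-codes exactly two buffers would have nowhere to put the third element, which is the kind of mismatch an arity-agnostic format avoids.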
DFlash now supports Gemma 4
Gemma 4 runs on the DFlash engine (thanks @bstnxbt's dflash-mlx), so the model lineup matches the rest of the pool. The admin quantization picker lights up every DFlash option including an FP16 draft-model boost (#880, thanks @deepsweet), with a configurable prefix cache size (#1120, thanks @yilmazorhan) and draft_window_size / draft_sink_size / verify_mode model settings (#1276).
Chunked prefill (#1224)
A long-context prompt no longer blocks decode for other in-flight requests: prefill advances one chunk per scheduler step, so concurrent requests keep streaming tokens through it. Off by default, toggleable from admin. Thanks @drumtorben.
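The scheduling idea can be sketched as a toy loop, assuming a simple policy of one prefill chunk per step plus one decode token for every request that is already past prefill. The Request class, the step function, and the chunk size are hypothetical and only model the behavior described above, not oMLX's scheduler.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0   # prompt tokens processed so far
    generated: int = 0   # tokens decoded so far

    @property
    def in_prefill(self):
        return self.prefilled < self.prompt_len

    @property
    def done(self):
        return not self.in_prefill and self.generated >= self.max_new_tokens


def step(requests, chunk_size):
    """One scheduler step: decode for ready requests, then one prefill chunk."""
    events = []
    # Requests that already finished prefill each emit one token.
    for r in requests:
        if not r.in_prefill and not r.done:
            r.generated += 1
            events.append((r.name, "decode"))
    # At most one request advances its prefill by a single chunk.
    for r in requests:
        if r.in_prefill:
            r.prefilled = min(r.prompt_len, r.prefilled + chunk_size)
            events.append((r.name, "prefill"))
            break
    return events


short = Request("short", prompt_len=4, max_new_tokens=3, prefilled=4)
long = Request("long", prompt_len=10, max_new_tokens=1)

log = []
while not (short.done and long.done):
    log.append(step([short, long], chunk_size=4))

# The short request streams a token in each step where the long prompt prefills.
assert log[0] == [("short", "decode"), ("long", "prefill")]
```

In this toy run the short request finishes all three of its tokens while the long prompt is still being prefilled in chunks, instead of waiting for the whole prompt.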
Major stability improvements on low-memory Macs
oMLX is far more resilient on tight-memory machines. A new memory enforcer measures the same phys_footprint metric the OS uses for jetsam decisions and applies prefill admission control, so the server declines work before it would be killed instead of crashing under pressure. Backed by a hot-cache eviction race fix (#1298), parallelized SSD↔hot block preloading (#1301), per-model cache hit-rate visibility (#1183, all thanks @ivaniguarans), and a real-time memory bar on the admin dashboard (#1278, thanks @beamivalice). oQ can also auto-build a proxy model when the source can't fit in RAM, so large checkpoints are quantizable on smaller boxes (#1136).
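Prefill admission control can be pictured with a small sketch, assuming the enforcer compares the current footprint plus an estimated prefill cost against a budget. The MemoryEnforcer class here, its fields, and the numbers are hypothetical; the real enforcer reads phys_footprint from the OS, whereas this sketch simply takes the footprint as an argument.

```python
from dataclasses import dataclass

GiB = 1024 ** 3


@dataclass
class MemoryEnforcer:
    limit_bytes: int         # hypothetical budget kept below the kill threshold
    headroom_bytes: int = 0  # safety margin that must stay free

    def admit(self, footprint_bytes: int, estimated_prefill_bytes: int) -> bool:
        """Admit a prefill only if the projected footprint stays within budget."""
        projected = footprint_bytes + estimated_prefill_bytes
        return projected + self.headroom_bytes <= self.limit_bytes


enforcer = MemoryEnforcer(limit_bytes=16 * GiB, headroom_bytes=1 * GiB)

# 10 GiB in use + 4 GiB prefill + 1 GiB headroom fits in 16 GiB: admitted.
assert enforcer.admit(10 * GiB, 4 * GiB)
# 10 GiB in use + 6 GiB prefill + 1 GiB headroom exceeds 16 GiB: declined.
assert not enforcer.admit(10 * GiB, 6 * GiB)
```

Declining at admission time is what lets a server return an error for one request rather than being killed with every in-flight request.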
ParoQuant support
Adds ParoQuant plus a pluggable custom-quantization loader so additional quant methods plug in without forking the loader path; all load call sites route through the dispatcher (#209, thanks @liang2kl).
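A pluggable loader of this kind is commonly built as a registry plus a single dispatch function. The sketch below assumes that shape; register_quant_loader, load_model, and the string return values are hypothetical stand-ins, not the loader API added in #209.

```python
from typing import Callable, Dict

# Registry mapping a quantization method name to its loader function.
_LOADERS: Dict[str, Callable[[str], str]] = {}


def register_quant_loader(method: str):
    """Decorator that registers a loader under a quantization method name."""
    def decorator(fn):
        _LOADERS[method] = fn
        return fn
    return decorator


def load_model(path: str, quant_method: str = "default") -> str:
    """Single entry point: every call site goes through this dispatcher."""
    try:
        loader = _LOADERS[quant_method]
    except KeyError:
        raise ValueError(f"no loader registered for {quant_method!r}") from None
    return loader(path)


@register_quant_loader("default")
def _load_default(path: str) -> str:
    return f"default:{path}"      # placeholder for a real model object


@register_quant_loader("paroquant")
def _load_paroquant(path: str) -> str:
    return f"paroquant:{path}"    # placeholder for a real model object


assert load_model("models/a") == "default:models/a"
assert load_model("models/b", quant_method="paroquant") == "paroquant:models/b"
```

With this structure, supporting another quant method means registering one more function rather than editing each place that loads a model.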
New Features
- One-command coding agents: omlx launch <claude|codex|opencode|openclaw|pi|copilot|hermes> wires env + model and execs into the agent via a curses TUI picker (#998 @fparrav, #1085 @scaryrawr, #1250 @shannonsands).
- Chat multi-tasking: run multiple admin chats in parallel (#1231, @beamivalice).
- Admin "Restart Server" button, admin-auth gated (#1194, @jasonpaulso).
- Native reasoning in the Responses API survives tool-call round-trips (#1245, @a4501150).
- /v1/audio/transcriptions: max_tokens (#1163, @thornad), STT language forwarding (#1184, @Bortlesboat), word_timestamps (#1214, @alexferrao).
- /v1/mcp/execute accepts tool or tool_name (#1285, @mvanhorn); total_time in non-streaming usage (#1269, @richgoodson); streaming completion summary log (#1170, @Lifto).
- Downloaded models saved under {owner}/{model} subfolders (#1188).
- Inline chat rename (#1196) and per-message copy button (#1208, @beamivalice); sortable, mobile-friendly Browse Models (#830, @omniwired); log-viewer min display level (#1251, @fqx).
- omlx launch extras: forward CLI args to claude (#1223) / codex (#1255, @EdenGottlieb); respect PI_CODING_AGENT_DIR (#1282, @SuperGregM).
- Periodic update re-check from the health timer (#1088, @jprado); menubar port sync (#1034); active model activity visibility (#1104, @apetersson).
- Auto-unload on settings change; chunk-form SSE keepalive default; Anthropic server-side tool defs accepted and dropped before inference.
- Spanish (#996) and French (#989) admin localizations; pre-saved oQ sensitivity maps loadable from the source folder (#1295, @deepsweet).
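For the /v1/mcp/execute item above, accepting either field name usually comes down to a small normalization step. The sketch below is an assumption about how that could look: resolve_tool_name is a hypothetical helper, and the precedence (tool_name first, then tool) and the error message are illustrative choices, not documented behavior.

```python
def resolve_tool_name(body: dict) -> str:
    """Return the tool to run from a request body that may use either key."""
    # Assumed precedence: prefer "tool_name", fall back to "tool".
    name = body.get("tool_name") or body.get("tool")
    if not isinstance(name, str) or not name:
        raise ValueError("request must include 'tool' or 'tool_name'")
    return name


assert resolve_tool_name({"tool": "search"}) == "search"
assert resolve_tool_name({"tool_name": "search"}) == "search"
```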
Bug Fixes
- Native MTP batch-reshape corruption: when a second request joined a single-sequence MTP generation, the surviving sequence could emit one corrupted token. The cache is now reconciled to a standard-resumable state on reshape, building on @aljen's "reset mtp state across batch reshapes" change. Plus: MTP head attach decoupled from mtp_enabled, VLM inference-path wiring (#1320), greedy detection via real sampler temperature (#1336), correct backbone/sampler timing (#1337), MoE/MTPLX sanitize guards (#1147, @ivaniguarans).
- Honor user stop strings end-to-end; Llama-4 ChunkedKVCache at batch=1 (#1152, @aeyeopsdev); HTTP request hang on engine-loop errors (#1315); chunked-prefill RuntimeError surfaced as a request error.
- Scheduler: worker mx-buffer lock (#1106), cross-thread generation_stream sync (#1156, @a4501150), lock-free admin snapshot.
- Cache: async_eval store materialization (#1146), hot-cache byte accounting (#1171, @lobsterbuko), shutdown flush to SSD (#1101), prefill-only partial-mode invalidation guard (#1119, @blightbow).
- DFlash: per-model L2 SSD cache cap (#1326), MTP pre-load patch + no n_confirmed leak (#1318), symmetric speculative toggle (#1227, @fish0710).
- Server / API: Anthropic tool_use stream emission (#845, @mrtkrcm); <think> recovery on unclosed tags and preserve_thinking history (#1329); Gemma 4 interleaved tool call + reasoning (#1028, @fabiopili); Responses cached-tokens (#1008, @hardlycharred); streaming abort engine ref (#1168, @manaskarra); engine-pool original load error surfaced (#1283, @contrapuntal).
- HF downloader: keep finalized shards + actually stop on cancel (#1284, @mvanhorn), cross-origin redirect follow (#1339).
- Embeddings / OCR / STT: ModernBERT 3D mean-pool (#1038), Qwen3 VL processor compat (#1039, @penumbrazz), GLM-OCR / dots_ocr on transformers 5.5+, KMMLU answer mapping (#1161, @khsd6327).
- xgrammar: 0.1.34+ registry API + macOS dylib (#1042, #1043, @crienzo).
- Admin / desktop: chat line-break + inline edit (#1206), mobile chat input bar (#1218, @Luunae), Jinja chat.html (#1201, @felk-dev), left-panel icon toggle (#1165, @bogdan-copocean), DMG model-dir persistence (#1114, @gltanaka), bind-host-aware health checks (#878, @DKev), health-check session reuse (#1211, @arthware-dev), bench upload error sanitized (#1192, @jasonpaulso).
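To make the <think> recovery item above concrete, here is a minimal sketch of one plausible policy: when the closing tag never arrives, treat everything after the opening tag as reasoning. The split_thinking function and this policy are assumptions for illustration; the actual handling in #1329 may differ.

```python
def split_thinking(text: str):
    """Split model output into (reasoning, content), tolerating an unclosed tag."""
    open_tag, close_tag = "<think>", "</think>"
    start = text.find(open_tag)
    if start == -1:
        return "", text                      # no reasoning block at all
    body_start = start + len(open_tag)
    end = text.find(close_tag, body_start)
    if end == -1:
        # Unclosed tag: assume the rest of the output is reasoning.
        return text[body_start:].strip(), text[:start].strip()
    reasoning = text[body_start:end].strip()
    content = (text[:start] + text[end + len(close_tag):]).strip()
    return reasoning, content


assert split_thinking("<think>plan</think>answer") == ("plan", "answer")
assert split_thinking("<think>still thinking") == ("still thinking", "")
```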
Dependency Updates
- dflash-mlx 0.1.3 → 0.1.7, mlx-vlm 191d7c8 → f96138e (Gemma 4 MTP batching), mistral-common >=1.10 (#1116, @pmarreck), uv.lock gitignored (#1209, @deepsweet).
Documentation
- Chinese (#1198, @JasonYeYuhe) and other translated READMEs (ko, ja, es, fr) synced; Spanish phrasing smoothed (#1110, @fparrav).
New Contributors
Thank you to everyone making their first contribution in 0.3.9:
@fparrav, @hardlycharred, @crienzo, @penumbrazz, @fabiopili, @liang2kl, @scaryrawr, @mrtkrcm, @DKev, @ivaniguarans, @Bortlesboat, @thornad, @Lifto, @yilmazorhan, @aeyeopsdev, @a4501150, @manaskarra, @lobsterbuko, @bogdan-copocean, @felk-dev, @fish0710, @beamivalice, @gltanaka, @pmarreck, @khsd6327, @drumtorben, @mvanhorn, @richgoodson, @contrapuntal, @omniwired, @shannonsands, @alexferrao, @EdenGottlieb, @SuperGregM, @jprado, @Luunae, @arthware-dev, @apetersson, @fqx, @JasonYeYuhe.
Special thanks to @Blaizzy (DeepSeek V4 + Gemma 4 MTP via mlx-vlm) and @bstnxbt (dflash-mlx) for the upstream work this release builds on.