This is a dev build with features planned for the 0.3.9 release. I'll keep responding to PRs and issues. If you hit a bug, please open an issue. Thanks for your patience.
Highlights
DeepSeek V4 Pro / Flash support, including SSD cache
What's new for V4 specifically:
- Full V4 model + `PoolingCache`/`BatchPoolingCache` ported from ml-explore/mlx-lm#1192 by @Blaizzy.
- Tested against the models published at mlx-community/deepseek-v4.
- F8_E8M0 / fp8 quant branch wired into `mlx_lm.utils.load_model` via a function-replacement patch.
- AutoTokenizer wrapper that retries with `PreTrainedConfig()` for transformers ≤ 5.7.0 (becomes dead code once transformers ships native V4 support).
- SSD + prefix cache for V4: this required generalizing the cache type interface from 2-tuple `(keys, values)` state to N-tuple state, since `PoolingCache.state` is `(buf_kv, buf_gate, pooled)`. The new format is `paged_ssd_cache` v3 (safetensors). Without this, V4 sessions silently corrupted across prefix-cache hits.
- V4 tool calling end-to-end: DSML-format parsing + emission on the OpenAI / Anthropic endpoints, with dict-shaped `tool_call.arguments` accepted from the chat template, so V4 Pro / Flash can drive Claude Code, Codex, and OpenClaw without extra config.
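The 2-tuple → N-tuple cache-state generalization above can be sketched as follows. This is a minimal illustration, not the real mlx-lm or oMLX code; the helper names (`save_cache_state`, `restore_cache_state`) are hypothetical, and only the arity problem is shown.

```python
def save_cache_state(cache):
    """Flatten a cache layer's state into a dict of named buffers.

    The old code did `keys, values = cache.state`, which raises
    ValueError for a 3-tuple state like PoolingCache's
    (buf_kv, buf_gate, pooled). Treating state as an N-tuple keeps
    every cache type serializable with one code path.
    """
    return {f"buf_{i}": buf for i, buf in enumerate(cache.state)}


def restore_cache_state(cache, blobs):
    # Rebuild the tuple in index order, whatever its arity.
    cache.state = tuple(blobs[f"buf_{i}"] for i in range(len(blobs)))
```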
Native MTP (Multi-Token Prediction) for Qwen3.5 / 3.6 and DeepSeek-V4
Enable it per model in admin settings; supported models then predict multiple tokens per step for faster decode. Off by default.
Source PRs:
- ml-explore/mlx-lm#990, Qwen3.5 / 3.6 MTP by @AirRunner
- Blaizzy/mlx-lm#15, DeepSeek-V4 MTP by @0xClandestine
Additional integration work:
- oQ preserves `mtp.*` weights via a new `-mtp` suffix on quantized output dirs.
- mlx-vlm `sanitize` patched to keep the MTP norm shift + `language_model` prefix so quantized VLM checkpoints load with correct per-tensor bits.
- Pre-converted oQ MTP models for testing are available at huggingface.co/Jundot.
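The `mtp.*` preservation above boils down to an exception in the weight filter plus a distinguishing directory suffix. A minimal sketch with hypothetical function names (the actual oQ filtering logic differs):

```python
def output_dir_name(base: str, has_mtp: bool) -> str:
    # Quantized checkpoints that retain mtp.* weights get a "-mtp"
    # suffix so MTP and plain variants of a model can coexist.
    return f"{base}-mtp" if has_mtp else base


def filter_weights(weights: dict, keep_mtp: bool) -> dict:
    # Previously mtp.* tensors were stripped like other auxiliary
    # heads; keeping them is what lets the server run native
    # multi-token prediction on the quantized checkpoint.
    return {
        k: v for k, v in weights.items()
        if keep_mtp or not k.startswith("mtp.")
    }
```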
One-command Claude Code on oMLX: `omlx launch claude`
Also supports codex, opencode, openclaw, and pi. One launcher, one TUI, every supported coding agent points at oMLX with no extra config.
No more hand-copying model names, base URLs, and auth tokens into shell rc files. Just type:

```
omlx launch claude
```

A minimal inline TUI lists every model oMLX knows about, with loaded models (●) sorted to the top, context size next to each, arrow keys or j/k to move, Enter to pick, q/ESC to bail. Once you pick, oMLX sets `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, model tiers, and the context window for you and execs straight into Claude Code. `PYTHONHOME` / `PYTHONPATH` are scrubbed first so hooks don't inherit a polluted env.
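The final hand-off can be sketched roughly like this. The endpoint, token value, and variable set are placeholders; the real launcher also derives model tiers and the context window from the selected model's metadata.

```python
import os


def build_claude_env(model: str, port: int) -> dict:
    """Build the environment Claude Code will inherit (sketch)."""
    env = dict(os.environ)
    # Scrub Python-specific vars so Claude Code hooks don't inherit
    # the launcher's interpreter environment.
    for var in ("PYTHONHOME", "PYTHONPATH"):
        env.pop(var, None)
    env["ANTHROPIC_BASE_URL"] = f"http://127.0.0.1:{port}"
    env["ANTHROPIC_AUTH_TOKEN"] = "omlx-local"  # placeholder token
    env["ANTHROPIC_MODEL"] = model
    return env


def launch_claude(model: str, port: int) -> None:
    # exec replaces the launcher process entirely; on success this
    # call never returns, and Claude Code takes over the terminal.
    os.execvpe("claude", ["claude"], build_claude_env(model, port))
```

Using `exec` rather than spawning a child means there is no leftover launcher process and signals (Ctrl-C, window resize) go straight to the agent.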
The admin dashboard's old "Other Integrations" section is also gone, replaced by an Applications list with real brand logos and a single omlx launch <tool> command per row. OpenClaw keeps its inline tools-profile selector (Minimal / Coding / Messaging / Full).
Thanks to @fparrav for the original PR (#998) bringing this in.
New Features
- Auto-unload on settings change: the loaded model is auto-unloaded when settings require a reload, instead of silently running with a stale config.
- dflash-mlx 0.1.5.1: bumped from 0.1.3, with admin cache controls (`runtime_context`, prefix cache L1+L2, verify-specialized int4 qmm).
- Spanish (es) admin localization (#996). Thanks @fparrav.
- PP indicator during spec-prefill on the admin dashboard (#1032).
- Menubar port sync: menubar now reads `settings.json` on start so the menu and server agree on the active port (#1034).
- SSE keepalive in chunk form by default for OpenClaw / WorkBuddy compatibility.
- Anthropic server-side tool defs: accept them on the wire and drop them before inference instead of erroring out.
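The server-side tool-defs change amounts to filtering the request rather than rejecting it. A minimal sketch; the assumption here is that client-executed tools carry an `input_schema` while Anthropic's server-side tools are identified by a versioned `type` field, and the function name is hypothetical:

```python
def strip_server_side_tools(request: dict) -> dict:
    """Drop server-side tool definitions before inference (sketch).

    Client tools declare an `input_schema`; server-side tools (web
    search, code execution, ...) do not. A local engine can't execute
    the latter, so they are dropped instead of failing the request.
    """
    out = dict(request)  # leave the caller's request untouched
    out["tools"] = [
        t for t in request.get("tools", []) if "input_schema" in t
    ]
    return out
```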
Bug Fixes
- `<think>` recovery when the model never closes the tag: text inside an unclosed `<think>` is now surfaced as content instead of being dropped.
- dflash `<think>` prefix on reasoning model output (#1068).
- Periodic `mx.clear_cache` during long decodes: gated on accumulated buffer bytes so it only fires when there's something to free, with a clearer normalize log message.
- Async store path: drop worker-thread `mx.eval` calls in `store_cache`; defer boundary-snapshot cleanup until the async store finishes; bypass the continuity check for blocks with a boundary snapshot.
- xgrammar 0.1.34+ registry API for parser discovery (#1042). Thanks @crienzo.
- xgrammar macOS dylib loading fix (#1043). Thanks @crienzo. Follow-up: switch the RECORD patch to Ruby `File.open` instead of `sh echo` for cleaner Homebrew formulas.
- ModernBERT 500: mean-pool 3D embedding outputs (#1038).
- Qwen3 VL embedding processor compat (#1039). Thanks @penumbrazz.
- Gemma 4 interleaved tool call + reasoning (#1028). Thanks @fabiopili.
- Admin navbar in light mode (#1046) and lower-GPU navbar (drop `backdrop-blur`).
- Responses API: populate `input_tokens_details.cached_tokens` with the real value (was a dummy in #1008). Thanks @hardlycharred for the initial PR.
- `omlx launch` polish after #998: English TUI hint, fix `test_integrations` registry count, restore `shellQuote` on the dashboard launch command, drop unused imports.
- Curses model picker improvements: viewport scrolling, PgUp/PgDn/Home/End, `KEY_RESIZE` handling, ▲/▼ indicators when there are items above or below the viewport.
- Skip community benchmark upload when experimental features are active.
- Demote per-call patch counters to DEBUG.
- Note the dflash single-stream constraint in the admin description.
- Rename misleading "fallback max seq_len" debug log.
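The unclosed-`<think>` recovery above can be sketched as a small splitter. This is an illustrative reconstruction, not the real parser, which operates on a streaming token buffer rather than a complete string:

```python
def split_think(text: str) -> tuple[str, str]:
    """Split model output into (reasoning, content).

    If the model opens <think> but the closing tag never arrives,
    the buffered text is surfaced as content rather than being
    dropped along with the unterminated reasoning block.
    """
    if not text.startswith("<think>"):
        return "", text
    body = text[len("<think>"):]
    if "</think>" in body:
        reasoning, content = body.split("</think>", 1)
        return reasoning.strip(), content.lstrip()
    # Unclosed tag: return everything as content so nothing is lost.
    return "", body
```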
Dependency Updates
- dflash-mlx 0.1.3 → 0.1.5.1
New Contributors
- @fparrav: `omlx launch claude` TUI + Applications dashboard redesign (#998) and Spanish admin localization (#996)
- @hardlycharred: Responses usage field (#1008)
- @crienzo: xgrammar 0.1.34+ registry API and macOS dylib loading (#1042, #1043)
- @penumbrazz: Qwen3 VL embedding processor compat (#1039)
- @fabiopili: Gemma 4 interleaved tool call + reasoning (#1028)
