This is a dev build with features planned for the 0.3.9 release. I'll keep responding to PRs and issues. If you hit a bug, please open an issue. Thanks for your patience.
Highlights
DeepSeek V4 Pro / Flash support, including SSD cache
What's new for V4 specifically:
- Full V4 model + `PoolingCache`/`BatchPoolingCache` ported from ml-explore/mlx-lm#1192 by @Blaizzy.
- Tested against the models published at mlx-community/deepseek-v4.
- F8_E8M0 / fp8 quant branch wired into `mlx_lm.utils.load_model` via a function-replacement patch.
- AutoTokenizer wrapper that retries with `PreTrainedConfig()` for transformers ≤ 5.7.0 (becomes dead code once transformers ships native V4 support).
- SSD + prefix cache for V4: this required generalizing the cache type interface from 2-tuple `(keys, values)` state to N-tuple state, since `PoolingCache.state` is `(buf_kv, buf_gate, pooled)`. The new format is `paged_ssd_cache` v3 (safetensors). Without this, V4 sessions silently corrupted across prefix-cache hits.
- V4 tool calling end-to-end: DSML-format parsing + emission on the OpenAI / Anthropic endpoints, with dict-shaped `tool_call.arguments` accepted from the chat template, so V4 Pro / Flash can drive Claude Code, Codex, and OpenClaw without extra config.
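The 2-tuple → N-tuple cache-state generalization above can be sketched as follows. This is a minimal illustration, not the real mlx-lm or oMLX code; the helper names (`save_cache_state`, `restore_cache_state`) are hypothetical, and only the arity problem is shown.

```python
def save_cache_state(cache):
    """Flatten a cache layer's state into a dict of named buffers.

    The old code did `keys, values = cache.state`, which raises
    ValueError for a 3-tuple state like PoolingCache's
    (buf_kv, buf_gate, pooled). Treating state as an N-tuple keeps
    every cache type serializable with one code path.
    """
    return {f"buf_{i}": buf for i, buf in enumerate(cache.state)}


def restore_cache_state(cache, blobs):
    # Rebuild the tuple in index order, whatever its arity.
    cache.state = tuple(blobs[f"buf_{i}"] for i in range(len(blobs)))
```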
Native MTP (Multi-Token Prediction) for Qwen3.5 / 3.6 and DeepSeek-V4
Enable it per model in admin settings; supported models then predict multiple tokens per step for faster decode. Off by default.
Source PRs:
- ml-explore/mlx-lm#990, Qwen3.5 / 3.6 MTP by @AirRunner
- Blaizzy/mlx-lm#15, DeepSeek-V4 MTP by @0xClandestine
Additional integration work:
- oQ preserves `mtp.*` weights via a new `-mtp` suffix on quantized output dirs.
- mlx-vlm `sanitize` patched to keep the MTP norm shift + `language_model` prefix so quantized VLM checkpoints load with correct per-tensor bits.
- Pre-converted oQ MTP models for testing are available at huggingface.co/Jundot.
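The `mtp.*` preservation above boils down to an exception in the weight filter plus a distinguishing directory suffix. A minimal sketch with hypothetical function names (the actual oQ filtering logic differs):

```python
def output_dir_name(base: str, has_mtp: bool) -> str:
    # Quantized checkpoints that retain mtp.* weights get a "-mtp"
    # suffix so MTP and plain variants of a model can coexist.
    return f"{base}-mtp" if has_mtp else base


def filter_weights(weights: dict, keep_mtp: bool) -> dict:
    # Previously mtp.* tensors were stripped like other auxiliary
    # heads; keeping them is what lets the server run native
    # multi-token prediction on the quantized checkpoint.
    return {
        k: v for k, v in weights.items()
        if keep_mtp or not k.startswith("mtp.")
    }
```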
One-command Claude Code on oMLX: `omlx launch claude`
Also supports codex, opencode, openclaw, and pi. One launcher, one TUI, every supported coding agent points at oMLX with no extra config.
No more hand-copying model names, base URLs, and auth tokens into shell rc files. Just type:

```
omlx launch claude
```

A minimal inline TUI lists every model oMLX knows about, with loaded models (●) sorted to the top, context size next to each, arrow keys or j/k to move, Enter to pick, q/ESC to bail. Once you pick, oMLX sets `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, model tiers, and the context window for you and execs straight into Claude Code. `PYTHONHOME` / `PYTHONPATH` are scrubbed first so hooks don't inherit a polluted env.
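The final hand-off can be sketched roughly like this. The endpoint, token value, and variable set are placeholders; the real launcher also derives model tiers and the context window from the selected model's metadata.

```python
import os


def build_claude_env(model: str, port: int) -> dict:
    """Build the environment Claude Code will inherit (sketch)."""
    env = dict(os.environ)
    # Scrub Python-specific vars so Claude Code hooks don't inherit
    # the launcher's interpreter environment.
    for var in ("PYTHONHOME", "PYTHONPATH"):
        env.pop(var, None)
    env["ANTHROPIC_BASE_URL"] = f"http://127.0.0.1:{port}"
    env["ANTHROPIC_AUTH_TOKEN"] = "omlx-local"  # placeholder token
    env["ANTHROPIC_MODEL"] = model
    return env


def launch_claude(model: str, port: int) -> None:
    # exec replaces the launcher process entirely; on success this
    # call never returns, and Claude Code takes over the terminal.
    os.execvpe("claude", ["claude"], build_claude_env(model, port))
```

Using `exec` rather than spawning a child means there is no leftover launcher process and signals (Ctrl-C, window resize) go straight to the agent.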
The admin dashboard's old "Other Integrations" section is also gone, replaced by an Applications list with real brand logos and a single omlx launch <tool> command per row. OpenClaw keeps its inline tools-profile selector (Minimal / Coding / Messaging / Full).
Thanks to @fparrav for the original PR (#998) bringing this in.
New Features
- Auto-unload on settings change: the loaded model is auto-unloaded when settings require a reload, instead of silently running with a stale config.
- dflash-mlx 0.1.5.1: bumped from 0.1.3, with admin cache controls (`runtime_context`, prefix cache L1+L2, verify-specialized int4 qmm).
- Spanish (es) admin localization (#996). Thanks @fparrav.
- PP indicator during spec-prefill on the admin dashboard (#1032).
- Menubar port sync: menubar now reads `settings.json` on start so the menu and server agree on the active port (#1034).
- SSE keepalive in chunk form by default for OpenClaw / WorkBuddy compatibility.
- Anthropic server-side tool defs: accept them on the wire and drop them before inference instead of erroring out.
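The server-side tool-defs change amounts to filtering the request rather than rejecting it. A minimal sketch; the assumption here is that client-executed tools carry an `input_schema` while Anthropic's server-side tools are identified by a versioned `type` field, and the function name is hypothetical:

```python
def strip_server_side_tools(request: dict) -> dict:
    """Drop server-side tool definitions before inference (sketch).

    Client tools declare an `input_schema`; server-side tools (web
    search, code execution, ...) do not. A local engine can't execute
    the latter, so they are dropped instead of failing the request.
    """
    out = dict(request)  # leave the caller's request untouched
    out["tools"] = [
        t for t in request.get("tools", []) if "input_schema" in t
    ]
    return out
```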
Bug Fixes
- `<think>` recovery when the model never closes the tag: text inside an unclosed `<think>` is now surfaced as content instead of being dropped.
- dflash `<think>` prefix on reasoning model output (#1068).
- Periodic `mx.clear_cache` during long decodes: gated on accumulated buffer bytes so it only fires when there's something to free, with a clearer normalize log message.
- Async store path: drop worker-thread `mx.eval` calls in `store_cache`; defer boundary-snapshot cleanup until the async store finishes; bypass the continuity check for blocks with a boundary snapshot.
- xgrammar 0.1.34+ registry API for parser discovery (#1042). Thanks @crienzo.
- xgrammar macOS dylib loading fix (#1043). Thanks @crienzo. Follow-up: switch the RECORD patch to Ruby `File.open` instead of `sh echo` for cleaner Homebrew formulas.
- ModernBERT 500: mean-pool 3D embedding outputs (#1038).
- Qwen3 VL embedding processor compat (#1039). Thanks @penumbrazz.
- Gemma 4 interleaved tool call + reasoning (#1028). Thanks @fabiopili.
- Admin navbar in light mode (#1046) and lower-GPU navbar (drop `backdrop-blur`).
- Responses API: populate `input_tokens_details.cached_tokens` with the real value (was a dummy in #1008). Thanks @hardlycharred for the initial PR.
- `omlx launch` polish after #998: English TUI hint, fix `test_integrations` registry count, restore `shellQuote` on the dashboard launch command, drop unused imports.
- Curses model picker improvements: viewport scrolling, PgUp/PgDn/Home/End, `KEY_RESIZE` handling, ▲/▼ indicators when there are items above or below the viewport.
- Skip community benchmark upload when experimental features are active.
- Demote per-call patch counters to DEBUG.
- Note the dflash single-stream constraint in the admin description.
- Rename misleading "fallback max seq_len" debug log.
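The unclosed-`<think>` recovery above can be sketched as a small splitter. This is an illustrative reconstruction, not the real parser, which operates on a streaming token buffer rather than a complete string:

```python
def split_think(text: str) -> tuple[str, str]:
    """Split model output into (reasoning, content).

    If the model opens <think> but the closing tag never arrives,
    the buffered text is surfaced as content rather than being
    dropped along with the unterminated reasoning block.
    """
    if not text.startswith("<think>"):
        return "", text
    body = text[len("<think>"):]
    if "</think>" in body:
        reasoning, content = body.split("</think>", 1)
        return reasoning.strip(), content.lstrip()
    # Unclosed tag: return everything as content so nothing is lost.
    return "", body
```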
Dependency Updates
- dflash-mlx 0.1.3 → 0.1.5.1
New Contributors
- @fparrav: `omlx launch claude` TUI + Applications dashboard redesign (#998) and Spanish admin localization (#996)
- @hardlycharred: Responses usage field (#1008)
- @crienzo: xgrammar 0.1.34+ registry API and macOS dylib loading (#1042, #1043)
- @penumbrazz: Qwen3 VL embedding processor compat (#1039)
- @fabiopili: Gemma 4 interleaved tool call + reasoning (#1028)
