jundot/omlx v0.3.9.dev1

Pre-release · 10 hours ago

This is a dev build with features planned for the 0.3.9 release. I'll keep responding to PRs and issues. If you hit a bug, please open an issue. Thanks for your patience.

Highlights

DeepSeek V4 Pro / Flash support, including SSD cache

What's new for V4 specifically:

  • Full V4 model + PoolingCache / BatchPoolingCache ported from ml-explore/mlx-lm#1192 by @Blaizzy.
  • F8_E8M0 / fp8 quant branch wired into mlx_lm.utils.load_model via function-replacement patch.
  • AutoTokenizer wrapper that retries with PreTrainedConfig() for transformers ≤5.7.0 (becomes dead code once transformers ships native V4 support).
  • SSD + prefix cache for V4: this required generalizing the cache type interface from a 2-tuple (keys, values) state to an N-tuple state, since PoolingCache.state is (buf_kv, buf_gate, pooled). The new on-disk format is paged_ssd_cache v3 (safetensors). Without this, V4 sessions were silently corrupted across prefix-cache hits.
  • V4 tool calling end-to-end: DSML-format parsing + emission on OpenAI / Anthropic endpoints, dict-shaped tool_call.arguments accepted from the chat template, so V4 Pro / Flash can drive Claude Code, Codex, and OpenClaw without extra config.
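The 2-tuple to N-tuple generalization above can be sketched in a few lines. This is an illustrative mock, not oMLX's actual API: the class and function names (`KVCache`, `PoolingCache`, `serialize_state`) are assumptions, and the point is only that a serializer written against a fixed `(keys, values)` pair silently drops extra state members, while one that iterates the whole tuple preserves them.

```python
# Hypothetical sketch of the 2-tuple -> N-tuple cache-state generalization.
# Names (KVCache, PoolingCache, serialize_state) are illustrative, not oMLX's API.

def serialize_state(cache):
    """Flatten a cache's state tuple, whatever its arity."""
    state = cache.state  # may be (keys, values), (buf_kv, buf_gate, pooled), ...
    if not isinstance(state, tuple):
        state = (state,)
    return {f"state_{i}": buf for i, buf in enumerate(state)}

class KVCache:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values
    @property
    def state(self):
        return (self.keys, self.values)

class PoolingCache:
    def __init__(self, buf_kv, buf_gate, pooled):
        self.buf_kv, self.buf_gate, self.pooled = buf_kv, buf_gate, pooled
    @property
    def state(self):
        # Three members: a serializer hard-coded to (keys, values)
        # would drop `pooled` here, corrupting restored sessions.
        return (self.buf_kv, self.buf_gate, self.pooled)
```

Treating the state as an opaque N-tuple means new cache types with more buffers round-trip through the SSD cache without further serializer changes.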

Native MTP (Multi-Token Prediction) for Qwen3.5 / 3.6 and DeepSeek-V4

Turn it on per model in admin settings; supported models then predict multiple tokens per decode step for faster generation. Off by default.

Source PRs:

Additional integration work:

  • oQ preserves mtp.* weights via a new -mtp suffix on quantized output dirs.
  • mlx-vlm sanitize patched to keep the MTP norm shift + language_model prefix so quantized VLM checkpoints load with correct per-tensor bits.

One-command Claude Code on oMLX: omlx launch claude

Also supports codex, opencode, openclaw, and pi. One launcher, one TUI, every supported coding agent points at oMLX with no extra config.

No more hand-copying model names, base URLs, and auth tokens into shell rc files. Just type:

omlx launch claude

(Screenshot: Applications dashboard with omlx launch commands)

A minimal inline TUI lists every model oMLX knows about, with loaded models (●) sorted to the top and the context size next to each. Arrow keys or j/k move, Enter picks, q/ESC bails. Once you pick, oMLX sets ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, the model tiers, and the context window for you, then execs straight into Claude Code. PYTHONHOME / PYTHONPATH are scrubbed first so hooks don't inherit a polluted env.
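The env handoff described above can be sketched in a few lines. This is a hedged illustration of the general pattern, not the launcher's actual code; the function name, URL, and token value are made up, and the final exec is shown as a comment since it replaces the running process.

```python
# Sketch of the environment handoff a launcher like `omlx launch claude`
# performs after a model is picked. Names and values here are assumptions.
import os

def build_claude_env(base_url: str, token: str) -> dict:
    env = dict(os.environ)
    # Scrub Python-specific vars so Claude Code hooks don't inherit them.
    for var in ("PYTHONHOME", "PYTHONPATH"):
        env.pop(var, None)
    # Point the Anthropic SDK at the local oMLX server.
    env["ANTHROPIC_BASE_URL"] = base_url
    env["ANTHROPIC_AUTH_TOKEN"] = token
    return env

env = build_claude_env("http://127.0.0.1:8080", "local-token")
# A real launcher would now replace itself with the agent process:
# os.execvpe("claude", ["claude"], env)
```

Using exec (rather than a subprocess) means the TUI disappears cleanly and the coding agent owns the terminal, with no wrapper process left behind.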

The admin dashboard's old "Other Integrations" section is also gone, replaced by an Applications list with real brand logos and a single omlx launch <tool> command per row. OpenClaw keeps its inline tools-profile selector (Minimal / Coding / Messaging / Full).

Thanks to @fparrav for the original PR (#998) bringing this in.


New Features

  • Auto-unload on settings change: the loaded model is automatically unloaded when a settings change requires a reload, instead of silently running with stale config.
  • dflash-mlx 0.1.5.1: bumped from 0.1.3, with admin cache controls (runtime_context, prefix cache L1+L2, verify-specialized int4 qmm).
  • Spanish (es) admin localization (#996). Thanks @fparrav.
  • PP indicator during spec-prefill on the admin dashboard (#1032).
  • Menubar port sync: menubar now reads settings.json on start so the menu and server agree on the active port (#1034).
  • SSE keepalives are sent in chunk form by default, for OpenClaw / WorkBuddy compatibility.
  • Anthropic server-side tool defs: accept them on the wire and drop them before inference instead of erroring out.
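The server-side tool handling in the last bullet amounts to a filter on the request's tool list. The sketch below is illustrative only: the function name is hypothetical, and the tool-type strings are examples of Anthropic's versioned server-side tool types, not an exhaustive or authoritative list.

```python
# Hedged sketch: accept Anthropic server-side tool definitions on the wire,
# drop them before inference. Field and type names here are illustrative.

SERVER_SIDE_TOOL_TYPES = {"web_search_20250305", "computer_20250124"}  # examples only

def filter_tools(tools):
    """Keep only client-defined tools; silently drop server-side ones."""
    kept = []
    for tool in tools or []:
        if tool.get("type") in SERVER_SIDE_TOOL_TYPES:
            continue  # previously this would have been rejected with an error
        kept.append(tool)
    return kept

request_tools = [
    {"name": "get_weather", "input_schema": {"type": "object"}},
    {"type": "web_search_20250305", "name": "web_search"},
]
client_tools = filter_tools(request_tools)
```

Dropping instead of erroring lets clients that always attach server-side tools (as some agents do) talk to a local server that cannot execute them.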

Bug Fixes

  • <think> recovery when the model never closes the tag: text inside an unclosed <think> is now surfaced as content instead of being dropped.
  • dflash <think> prefix on reasoning model output (#1068).
  • Periodic mx.clear_cache during long decodes: gated on accumulated buffer bytes so it only fires when there's something to free, with a clearer normalize log message.
  • Async store path: drop worker-thread mx.eval calls in store_cache; defer boundary-snapshot cleanup until the async store finishes; bypass continuity check for blocks with a boundary snapshot.
  • xgrammar 0.1.34+ registry API for parser discovery (#1042). Thanks @crienzo.
  • xgrammar macOS dylib loading fix (#1043). Thanks @crienzo. Follow-up: switch the RECORD patch to Ruby File.open instead of sh echo for cleaner Homebrew formulas.
  • ModernBERT 500: mean-pool 3D embedding outputs (#1038).
  • Qwen3 VL embedding processor compat (#1039). Thanks @penumbrazz.
  • Gemma 4 interleaved tool call + reasoning (#1028). Thanks @fabiopili.
  • Admin navbar in light mode (#1046) and lower-GPU navbar (drop backdrop-blur).
  • Responses API: populate input_tokens_details.cached_tokens with the real value (was a dummy in #1008). Thanks @hardlycharred for the initial PR.
  • omlx launch polish after #998: English TUI hint, fix the test_integrations registry count, restore shellQuote on the dashboard launch command, drop unused imports.
  • Curses model picker improvements: viewport scrolling, PgUp/PgDn/Home/End, KEY_RESIZE handling, ▲/▼ indicators when there are items above or below the viewport.
  • Skip community benchmark upload when experimental features are active.
  • Demote per-call patch counters to DEBUG.
  • Note the dflash single-stream constraint in the admin description.
  • Rename misleading "fallback max seq_len" debug log.
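The unclosed-`<think>` fix at the top of the list is easy to picture with a small parser sketch. This is an illustrative take, not the actual oMLX parser: the function name is hypothetical, and the behavior shown matches the changelog entry — text inside an unclosed `<think>` is surfaced as content rather than dropped.

```python
# Illustrative sketch of unclosed-<think> recovery. When the model opens a
# <think> block and never emits </think>, the text is surfaced as content
# instead of being discarded. Function name is hypothetical.

def split_think(text: str) -> tuple[str, str]:
    """Return (reasoning, content); never silently drop model output."""
    open_tag, close_tag = "<think>", "</think>"
    if not text.startswith(open_tag):
        return "", text
    body = text[len(open_tag):]
    if close_tag in body:
        reasoning, _, content = body.partition(close_tag)
        return reasoning.strip(), content.strip()
    # Unclosed tag: surface the text as content rather than dropping it.
    return "", body.strip()
```

The key line is the final fallback: before the fix, a response that never closed the tag would lose everything after `<think>`.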

Dependency Updates

  • dflash-mlx 0.1.3 → 0.1.5.1
