jundot/omlx v0.3.9 on GitHub

This is the stable 0.3.9 release, consolidating the 0.3.9.dev1, 0.3.9.dev2, and 0.3.9rc1 pre-releases plus the post-rc stabilization fixes. Huge thanks to everyone who filed issues and sent PRs since 0.3.8. If you hit a bug, please open an issue.

Highlights

Native MTP (Multi-Token Prediction) for Qwen3.5 / 3.6, Gemma 4, and DeepSeek-V4

Enable it per model in the admin settings, and supported models predict multiple tokens per decode step for faster generation. Off by default. Gemma 4 gets MTP on the vision path, so image + text requests decode noticeably faster too.

Source PRs: ml-explore/mlx-lm#990 (Qwen3.5 / 3.6, @AirRunner), Blaizzy/mlx-lm#15 (DeepSeek-V4, @0xClandestine), and @Blaizzy's mlx-vlm for Gemma 4. oQ preserves mtp.* weights via a -mtp suffix on quantized output dirs; pre-converted oQ MTP models are at huggingface.co/Jundot.
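
As a rough illustration of the naming convention above, the sketch below appends a -mtp suffix to an output directory when the checkpoint contains mtp.* weights. The function name quantized_output_dir and its arguments are hypothetical and are not oQ's actual API; only the mtp.* prefix and the -mtp suffix come from these notes.

```python
from pathlib import Path
from typing import Iterable


def quantized_output_dir(base_dir: str, weight_names: Iterable[str]) -> Path:
    """Return the output dir, adding '-mtp' if any weight lives under 'mtp.'.

    Hypothetical helper: only the 'mtp.*' prefix and the '-mtp' suffix are
    taken from the release notes; the rest is an assumption.
    """
    path = Path(base_dir)
    has_mtp = any(name.startswith("mtp.") for name in weight_names)
    # Avoid doubling the suffix if the caller already included it.
    if has_mtp and not path.name.endswith("-mtp"):
        path = path.with_name(path.name + "-mtp")
    return path
```

A checkpoint without any mtp.* weights keeps its directory name unchanged under this sketch.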

DeepSeek V4 Pro / Flash support, including SSD cache

Full V4 model + PoolingCache / BatchPoolingCache ported from ml-explore/mlx-lm#1192 by @Blaizzy, tested against mlx-community/deepseek-v4. Key pieces:

F8_E8M0 / fp8 quant branch wired into mlx_lm.utils.load_model.
SSD + prefix cache for V4: the cache type interface was generalized from a 2-tuple (keys, values) to an N-tuple state (PoolingCache.state is (buf_kv, buf_gate, pooled)), with a new on-disk format, paged_ssd_cache v3. Without this, V4 sessions were silently corrupted across prefix-cache hits.
V4 tool calling end-to-end: DSML-format parsing + emission on OpenAI / Anthropic endpoints, so V4 Pro / Flash drives Claude Code, Codex, and OpenClaw with no extra config.
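
To make the 2-tuple to N-tuple generalization concrete, here is a minimal sketch of a store format that records how many tensors a cache state holds instead of assuming exactly (keys, values). The functions pack_state and unpack_state and the record layout are hypothetical; this is not the actual paged_ssd_cache v3 format, and plain bytes stand in for arrays.

```python
from typing import Any, Dict, Sequence, Tuple


def pack_state(state: Sequence[bytes]) -> Dict[str, Any]:
    """Pack an N-element cache state into a flat record (hypothetical layout).

    Storing the element count lets a 2-tuple (keys, values) and a 3-tuple
    such as (buf_kv, buf_gate, pooled) share one code path.
    """
    record: Dict[str, Any] = {"num_tensors": len(state)}
    for i, tensor in enumerate(state):
        record[f"tensor_{i}"] = tensor
    return record


def unpack_state(record: Dict[str, Any]) -> Tuple[bytes, ...]:
    """Rebuild the state tuple with exactly as many elements as were stored."""
    return tuple(record[f"tensor_{i}"] for i in range(record["num_tensors"]))
```

By contrast, code that unpacks with `keys, values = state` raises ValueError on a three-element state, so an interface hard-wired to two elements cannot carry a pooling-style state.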

DFlash now supports Gemma 4

Gemma 4 runs on the DFlash engine (thanks @bstnxbt's dflash-mlx), so the model lineup matches the rest of the pool. The admin quantization picker lights up every DFlash option including an FP16 draft-model boost (#880, thanks @deepsweet), with a configurable prefix cache size (#1120, thanks @yilmazorhan) and draft_window_size / draft_sink_size / verify_mode model settings (#1276).

Chunked prefill (#1224)

A long-context prompt no longer blocks decode for other in-flight requests: prefill advances one chunk per scheduler step, so concurrent requests keep streaming tokens through it. Off by default, toggleable from admin. Thanks @drumtorben.
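
The scheduling idea can be sketched as a toy loop: each step advances at most one prefill chunk and lets every already-prefilled request emit one token. The Request class, scheduler_step, and the chunk size are all hypothetical names for illustration; oMLX's real scheduler is not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    name: str
    prompt_tokens_left: int  # prompt tokens not yet prefilled
    max_new_tokens: int
    generated: int = 0

    @property
    def prefilling(self) -> bool:
        return self.prompt_tokens_left > 0

    @property
    def done(self) -> bool:
        return not self.prefilling and self.generated >= self.max_new_tokens


def scheduler_step(requests: List[Request], chunk_size: int) -> List[str]:
    """Run one toy step: one prefill chunk, then one token per ready request."""
    events: List[str] = []
    just_prefilled: Optional[Request] = None

    # Advance the first request that still needs prefill by a single chunk.
    for req in requests:
        if req.prefilling:
            req.prompt_tokens_left -= min(chunk_size, req.prompt_tokens_left)
            events.append(f"prefill:{req.name}")
            just_prefilled = req
            break

    # Requests that finished prefill in an earlier step each decode one token.
    for req in requests:
        if req is just_prefilled or req.prefilling or req.done:
            continue
        req.generated += 1
        events.append(f"decode:{req.name}")
    return events
```

With a 4096-token prompt and a chunk size of 1024, a second request that is already decoding keeps producing one token per step while the long prompt is still being prefilled.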

Major stability improvements on low-memory Macs

oMLX is far more resilient on tight-memory machines. A new memory enforcer measures the same phys_footprint metric the OS uses for jetsam decisions and applies prefill admission control, so the server declines work before it would be killed instead of crashing under pressure. This is backed by a hot-cache eviction race fix (#1298), parallelized SSD↔hot block preloading (#1301), and per-model cache hit-rate visibility (#1183), all thanks to @ivaniguarans, plus a real-time memory bar on the admin dashboard (#1278, thanks @beamivalice). oQ can also auto-build a proxy model when the source can't fit in RAM, so large checkpoints are quantizable on smaller boxes (#1136).
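
Prefill admission control can be pictured as a simple pre-check: estimate what the prefill will cost and decline it if the projected footprint would exceed the limit. The sketch below is an assumption-laden illustration; admit_prefill, its parameters, and the headroom idea are hypothetical, and the footprint reader is injected rather than read from the OS.

```python
from typing import Callable


def admit_prefill(
    read_footprint_bytes: Callable[[], int],
    estimated_prefill_bytes: int,
    limit_bytes: int,
    headroom_bytes: int = 0,
) -> bool:
    """Return True if the prefill is expected to fit under the memory limit.

    Hypothetical policy: current footprint + estimated prefill cost + safety
    headroom must stay at or below the limit, otherwise the work is declined.
    """
    projected = read_footprint_bytes() + estimated_prefill_bytes
    return projected + headroom_bytes <= limit_bytes


GiB = 1024 ** 3
# 12 GiB in use, 2 GiB prefill, 1 GiB headroom, 16 GiB limit: fits.
fits = admit_prefill(lambda: 12 * GiB, 2 * GiB, 16 * GiB, headroom_bytes=1 * GiB)
# 14 GiB in use with the same request: would exceed the limit, so decline.
declined = not admit_prefill(lambda: 14 * GiB, 2 * GiB, 16 * GiB, headroom_bytes=1 * GiB)
```

Declining before the allocation happens is what lets a server return an error to one request instead of being killed by the OS.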

ParoQuant support

Adds ParoQuant plus a pluggable custom-quantization loader, so additional quant methods can be added without forking the loader path; all load call sites route through the dispatcher (#209, thanks @liang2kl).
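
A pluggable loader of this kind is commonly built as a registry keyed by quantization method, with a single dispatch function that every call site uses. The sketch below shows that general pattern only; register_loader, load_model, and the stub loaders are hypothetical and do not mirror oMLX's real function names or signatures.

```python
from typing import Any, Callable, Dict

Loader = Callable[[str], Any]
_LOADERS: Dict[str, Loader] = {}


def register_loader(method: str) -> Callable[[Loader], Loader]:
    """Decorator that registers a loader for one quantization method."""
    def decorator(fn: Loader) -> Loader:
        _LOADERS[method] = fn
        return fn
    return decorator


def load_model(path: str, quant_method: str = "default") -> Any:
    """Single entry point: look up the loader for the method and call it."""
    try:
        loader = _LOADERS[quant_method]
    except KeyError:
        raise ValueError(f"no loader registered for {quant_method!r}") from None
    return loader(path)


@register_loader("default")
def _load_default(path: str) -> Any:
    return ("default", path)  # stand-in for a real model object


@register_loader("paroquant")
def _load_paroquant(path: str) -> Any:
    return ("paroquant", path)  # stand-in for a real model object
```

Adding a new method then means registering one more function, with no change to the dispatcher or its callers.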

New Features

One-command coding agents: omlx launch <claude|codex|opencode|openclaw|pi|copilot|hermes> wires env + model and execs into the agent via a curses TUI picker (#998 @fparrav, #1085 @scaryrawr, #1250 @shannonsands).
Chat multi-tasking: run multiple admin chats in parallel (#1231, @beamivalice).
Admin "Restart Server" button, admin-auth gated (#1194, @jasonpaulso).
Native reasoning in the Responses API survives tool-call round-trips (#1245, @a4501150).
/v1/audio/transcriptions: max_tokens (#1163, @thornad), STT language forwarding (#1184, @Bortlesboat), word_timestamps (#1214, @alexferrao).
/v1/mcp/execute accepts tool or tool_name (#1285, @mvanhorn); total_time in non-streaming usage (#1269, @richgoodson); streaming completion summary log (#1170, @Lifto).
Downloaded models saved under {owner}/{model} subfolders (#1188).
Inline chat rename (#1196) and per-message copy button (#1208, @beamivalice); sortable, mobile-friendly Browse Models (#830, @omniwired); log-viewer min display level (#1251, @fqx).
omlx launch extras: forward CLI args to claude (#1223) / codex (#1255, @EdenGottlieb); respect PI_CODING_AGENT_DIR (#1282, @SuperGregM).
Periodic update re-check from the health timer (#1088, @jprado); menubar port sync (#1034); active model activity visibility (#1104, @apetersson).
Auto-unload on settings change; chunk-form SSE keepalive default; Anthropic server-side tool defs accepted and dropped before inference.
Spanish (#996) and French (#989) admin localizations; pre-saved oQ sensitivity maps loadable from the source folder (#1295, @deepsweet).
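
For the {owner}/{model} download layout listed above, a minimal sketch of mapping a repository id to its folder might look like this. model_download_dir and the models root argument are hypothetical; the only detail taken from these notes is the {owner}/{model} subfolder shape.

```python
from pathlib import Path


def model_download_dir(models_root: str, repo_id: str) -> Path:
    """Map an 'owner/model' repo id to <models_root>/<owner>/<model>.

    Hypothetical helper: rejects ids that are not exactly two non-empty parts.
    """
    owner, sep, model = repo_id.partition("/")
    if not sep or not owner or not model or "/" in model:
        raise ValueError(f"expected 'owner/model', got {repo_id!r}")
    return Path(models_root) / owner / model
```

Keeping the owner in the path avoids collisions between same-named models published by different owners.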

Bug Fixes

Native MTP batch-reshape corruption: when a second request joined a single-sequence MTP generation, the surviving sequence could emit one corrupted token. The cache is now reconciled to a standard-resumable state on reshape, building on @aljen's "reset mtp state across batch reshapes" change. Plus: MTP head attach decoupled from mtp_enabled, VLM inference-path wiring (#1320), greedy detection via real sampler temperature (#1336), correct backbone/sampler timing (#1337), MoE/MTPLX sanitize guards (#1147, @ivaniguarans).
Honor user stop strings end-to-end; Llama-4 ChunkedKVCache at batch=1 (#1152, @aeyeopsdev); HTTP request hang on engine-loop errors (#1315); chunked-prefill RuntimeError surfaced as a request error.
Scheduler: worker mx-buffer lock (#1106), cross-thread generation_stream sync (#1156, @a4501150), lock-free admin snapshot.
Cache: async_eval store materialization (#1146), hot-cache byte accounting (#1171, @lobsterbuko), shutdown flush to SSD (#1101), prefill-only partial-mode invalidation guard (#1119, @blightbow).
DFlash: per-model L2 SSD cache cap (#1326), MTP pre-load patch + no n_confirmed leak (#1318), symmetric speculative toggle (#1227, @fish0710).
Server / API: Anthropic tool_use stream emission (#845, @mrtkrcm); <think> recovery on unclosed tags and preserve_thinking history (#1329); Gemma 4 interleaved tool call + reasoning (#1028, @fabiopili); Responses cached-tokens (#1008, @hardlycharred); streaming abort engine ref (#1168, @manaskarra); engine-pool original load error surfaced (#1283, @contrapuntal).
HF downloader: keep finalized shards + actually stop on cancel (#1284, @mvanhorn), cross-origin redirect follow (#1339).
Embeddings / OCR / STT: ModernBERT 3D mean-pool (#1038), Qwen3 VL processor compat (#1039, @penumbrazz), GLM-OCR / dots_ocr on transformers 5.5+, KMMLU answer mapping (#1161, @khsd6327).
xgrammar: 0.1.34+ registry API + macOS dylib (#1042, #1043, @crienzo).
Admin / desktop: chat line-break + inline edit (#1206), mobile chat input bar (#1218, @Luunae), Jinja chat.html (#1201, @felk-dev), left-panel icon toggle (#1165, @bogdan-copocean), DMG model-dir persistence (#1114, @gltanaka), bind-host-aware health checks (#878, @DKev), health-check session reuse (#1211, @arthware-dev), bench upload error sanitized (#1192, @jasonpaulso).

Dependency Updates

dflash-mlx 0.1.3 → 0.1.7, mlx-vlm 191d7c8 → f96138e (Gemma 4 MTP batching), mistral-common >=1.10 (#1116, @pmarreck), uv.lock gitignored (#1209, @deepsweet).

Documentation

Chinese (#1198, @JasonYeYuhe) and other translated READMEs (ko, ja, es, fr) synced; Spanish phrasing smoothed (#1110, @fparrav).

New Contributors

Thank you to everyone making their first contribution in 0.3.9:

@fparrav, @hardlycharred, @crienzo, @penumbrazz, @fabiopili, @liang2kl, @scaryrawr, @mrtkrcm, @DKev, @ivaniguarans, @Bortlesboat, @thornad, @Lifto, @yilmazorhan, @aeyeopsdev, @a4501150, @manaskarra, @lobsterbuko, @bogdan-copocean, @felk-dev, @fish0710, @beamivalice, @gltanaka, @pmarreck, @khsd6327, @drumtorben, @mvanhorn, @richgoodson, @contrapuntal, @omniwired, @shannonsands, @alexferrao, @EdenGottlieb, @SuperGregM, @jprado, @Luunae, @arthware-dev, @apetersson, @fqx, @JasonYeYuhe.

Special thanks to @Blaizzy (DeepSeek V4 + Gemma 4 MTP via mlx-vlm) and @bstnxbt (dflash-mlx) for the upstream work this release builds on.