github jundot/omlx v0.3.9

3 hours ago

This is the stable 0.3.9 release, consolidating the 0.3.9.dev1, 0.3.9.dev2, and 0.3.9rc1 pre-releases plus the post-rc stabilization fixes. Huge thanks to everyone who filed issues and sent PRs since 0.3.8. If you hit a bug, please open an issue.

Highlights

Native MTP (Multi-Token Prediction) for Qwen3.5 / 3.6, Gemma 4, and DeepSeek-V4

Enable it per model in admin settings, and supported models predict multiple tokens per step for faster decode. Off by default. Gemma 4 gets MTP on the vision path, so image + text requests decode noticeably faster too.

Source PRs: ml-explore/mlx-lm#990 (Qwen3.5 / 3.6, @AirRunner), Blaizzy/mlx-lm#15 (DeepSeek-V4, @0xClandestine), and @Blaizzy's mlx-vlm for Gemma 4. oQ preserves mtp.* weights via a -mtp suffix on quantized output dirs; pre-converted oQ MTP models are at huggingface.co/Jundot.

DeepSeek V4 Pro / Flash support, including SSD cache

Full V4 model + PoolingCache / BatchPoolingCache ported from ml-explore/mlx-lm#1192 by @Blaizzy, tested against mlx-community/deepseek-v4. Highlights:

  • F8_E8M0 / fp8 quant branch wired into mlx_lm.utils.load_model.
  • SSD + prefix cache for V4: the cache type interface was generalized from a 2-tuple (keys, values) to an N-tuple state (PoolingCache.state is (buf_kv, buf_gate, pooled)), and the on-disk format moves to paged_ssd_cache v3. Without this, V4 sessions were silently corrupted across prefix-cache hits.
  • V4 tool calling end-to-end: DSML-format parsing + emission on OpenAI / Anthropic endpoints, so V4 Pro / Flash drives Claude Code, Codex, and OpenClaw with no extra config.
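
To picture why the 2-tuple to N-tuple change matters for the SSD cache, here is a minimal sketch. It is not the paged_ssd_cache v3 format: `pack_state`, `unpack_state`, and the byte layout are hypothetical, chosen only to show that a format recording the buffer count can round-trip a three-buffer state, whereas one hard-coded to (keys, values) would drop a buffer.

```python
import struct
from typing import Sequence, Tuple


def pack_state(state: Sequence[bytes]) -> bytes:
    """Serialize a cache state of any arity.

    Hypothetical layout: u32 buffer count, then for each buffer
    a u64 byte length followed by the raw payload.
    """
    parts = [struct.pack("<I", len(state))]
    for buf in state:
        parts.append(struct.pack("<Q", len(buf)))
        parts.append(buf)
    return b"".join(parts)


def unpack_state(blob: bytes) -> Tuple[bytes, ...]:
    """Inverse of pack_state: recover every buffer, however many there are."""
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    buffers = []
    for _ in range(count):
        (size,) = struct.unpack_from("<Q", blob, offset)
        offset += 8
        buffers.append(blob[offset:offset + size])
        offset += size
    return tuple(buffers)


# Stand-ins for real tensors: a classic KV state and a pooling-style state.
kv_state = (b"keys", b"values")
pooling_state = (b"buf_kv", b"buf_gate", b"pooled")

restored_kv = unpack_state(pack_state(kv_state))
restored_pooling = unpack_state(pack_state(pooling_state))
```

Both states survive the round trip because the reader never assumes exactly two buffers.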

DFlash now supports Gemma 4

Gemma 4 runs on the DFlash engine (thanks to @bstnxbt's dflash-mlx), so the model lineup matches the rest of the pool. The admin quantization picker lights up every DFlash option including an FP16 draft-model boost (#880, thanks @deepsweet), with a configurable prefix cache size (#1120, thanks @yilmazorhan) and draft_window_size / draft_sink_size / verify_mode model settings (#1276).

Chunked prefill (#1224)

A long-context prompt no longer blocks decode for other in-flight requests: prefill advances one chunk per scheduler step, so concurrent requests keep streaming tokens through it. Off by default, toggleable from admin. Thanks @drumtorben.
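
The scheduling idea can be sketched in a few lines. This is a toy model, not oMLX's scheduler: `Request`, `step`, and `chunk_size` are hypothetical names, and token counts stand in for real prefill and decode work. Each step, a request still in prefill advances by at most one chunk while requests already decoding emit one token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    name: str
    prompt_len: int       # total prompt tokens to prefill
    max_new_tokens: int   # tokens to generate after prefill
    prefilled: int = 0
    generated: int = 0

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len

    @property
    def done(self) -> bool:
        return not self.in_prefill and self.generated >= self.max_new_tokens


def step(requests: List[Request], chunk_size: int) -> List[str]:
    """Run one scheduler step and return a log of what each request did."""
    events = []
    for req in requests:
        if req.done:
            continue
        if req.in_prefill:
            # Prefill advances by at most one chunk per step.
            n = min(chunk_size, req.prompt_len - req.prefilled)
            req.prefilled += n
            events.append(f"{req.name}:prefill+{n}")
        else:
            # Decoding requests keep emitting a token every step.
            req.generated += 1
            events.append(f"{req.name}:token")
    return events


long_req = Request("long", prompt_len=10, max_new_tokens=1)
short_req = Request("short", prompt_len=0, max_new_tokens=3)
log = [step([long_req, short_req], chunk_size=4) for _ in range(3)]
```

In this toy run the short request finishes all three tokens during the three steps the long prompt needs to prefill, instead of waiting behind it.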

Major stability improvements on low-memory Macs

oMLX is far more resilient on tight-memory machines. A new memory enforcer measures the same phys_footprint metric the OS uses for jetsam decisions and applies prefill admission control, so the server declines work before it would be killed instead of crashing under pressure. This is backed by a hot-cache eviction race fix (#1298), parallelized SSD↔hot block preloading (#1301), and per-model cache hit-rate visibility (#1183), all thanks to @ivaniguarans, plus a real-time memory bar on the admin dashboard (#1278, thanks @beamivalice). oQ can also auto-build a proxy model when the source can't fit in RAM, so large checkpoints are quantizable on smaller boxes (#1136).
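
As a rough illustration of prefill admission control, the sketch below declines a request when the current footprint plus an estimated prefill cost would exceed a budget. All names (`admit_prefill`, `bytes_per_token`, `headroom`) and the linear cost estimate are assumptions for illustration; the footprint is passed in as a plain number rather than read from the OS, and this is not the enforcer's actual policy.

```python
GIB = 1024 ** 3


def estimate_prefill_bytes(prompt_tokens: int, bytes_per_token: int) -> int:
    """Hypothetical linear estimate of the memory a prefill will add."""
    return prompt_tokens * bytes_per_token


def admit_prefill(
    current_footprint: int,
    limit: int,
    prompt_tokens: int,
    bytes_per_token: int,
    headroom: float = 0.10,
) -> bool:
    """Admit the prefill only if it fits under the limit minus a safety margin."""
    budget = int(limit * (1.0 - headroom))
    needed = estimate_prefill_bytes(prompt_tokens, bytes_per_token)
    return current_footprint + needed <= budget


# An 8000-token prompt at an assumed 128 KiB per token on a 16 GiB budget.
fits = admit_prefill(10 * GIB, 16 * GIB, 8000, 128 * 1024)
declined = admit_prefill(14 * GIB, 16 * GIB, 8000, 128 * 1024)
```

The point of checking before prefill starts is that a declined request can be reported as an error to the client, while exceeding the OS limit mid-prefill would take the whole server down.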

ParoQuant support

Adds ParoQuant plus a pluggable custom-quantization loader so additional quant methods plug in without forking the loader path; all load call sites route through the dispatcher (#209, thanks @liang2kl).
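
A pluggable loader of this kind is commonly built as a registry plus a single dispatch function. The sketch below shows that pattern only; `register_loader`, `dispatch_load`, and the string return values are hypothetical and do not reflect oMLX's actual loader API.

```python
from typing import Callable, Dict, Optional

Loader = Callable[[str], str]

# Registry mapping a quantization method name to its loader.
_LOADERS: Dict[str, Loader] = {}


def register_loader(method: str) -> Callable[[Loader], Loader]:
    """Decorator that plugs a loader in under a method name."""
    def decorator(fn: Loader) -> Loader:
        _LOADERS[method] = fn
        return fn
    return decorator


def dispatch_load(path: str, quant_method: Optional[str] = None) -> str:
    """Single entry point: every call site goes through here."""
    key = quant_method or "default"
    loader = _LOADERS.get(key)
    if loader is None:
        raise ValueError(f"no loader registered for quant method {key!r}")
    return loader(path)


@register_loader("default")
def _load_default(path: str) -> str:
    # Placeholder for the standard load path.
    return f"default:{path}"


@register_loader("paroquant")
def _load_paroquant(path: str) -> str:
    # Placeholder for a method-specific load path.
    return f"paroquant:{path}"
```

Adding another quant method then means registering one more function, with no change to `dispatch_load` or its callers.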


New Features

Bug Fixes

  • Native MTP batch-reshape corruption: when a second request joined a single-sequence MTP generation, the surviving sequence could emit one corrupted token. The cache is now reconciled to a standard-resumable state on reshape, building on @aljen's "reset mtp state across batch reshapes" change. Also fixed: MTP head attach decoupled from mtp_enabled, VLM inference-path wiring (#1320), greedy detection via real sampler temperature (#1336), correct backbone/sampler timing (#1337), MoE/MTPLX sanitize guards (#1147, @ivaniguarans).
  • Honor user stop strings end-to-end; Llama-4 ChunkedKVCache at batch=1 (#1152, @aeyeopsdev); HTTP request hang on engine-loop errors (#1315); chunked-prefill RuntimeError surfaced as a request error.
  • Scheduler: worker mx-buffer lock (#1106), cross-thread generation_stream sync (#1156, @a4501150), lock-free admin snapshot.
  • Cache: async_eval store materialization (#1146), hot-cache byte accounting (#1171, @lobsterbuko), shutdown flush to SSD (#1101), prefill-only partial-mode invalidation guard (#1119, @blightbow).
  • DFlash: per-model L2 SSD cache cap (#1326), MTP pre-load patch + no n_confirmed leak (#1318), symmetric speculative toggle (#1227, @fish0710).
  • Server / API: Anthropic tool_use stream emission (#845, @mrtkrcm); <think> recovery on unclosed tags and preserve_thinking history (#1329); Gemma 4 interleaved tool call + reasoning (#1028, @fabiopili); Responses cached-tokens (#1008, @hardlycharred); streaming abort engine ref (#1168, @manaskarra); engine-pool original load error surfaced (#1283, @contrapuntal).
  • HF downloader: keep finalized shards + actually stop on cancel (#1284, @mvanhorn), cross-origin redirect follow (#1339).
  • Embeddings / OCR / STT: ModernBERT 3D mean-pool (#1038), Qwen3 VL processor compat (#1039, @penumbrazz), GLM-OCR / dots_ocr on transformers 5.5+, KMMLU answer mapping (#1161, @khsd6327).
  • xgrammar: 0.1.34+ registry API + macOS dylib (#1042, #1043, @crienzo).
  • Admin / desktop: chat line-break + inline edit (#1206), mobile chat input bar (#1218, @Luunae), Jinja chat.html (#1201, @felk-dev), left-panel icon toggle (#1165, @bogdan-copocean), DMG model-dir persistence (#1114, @gltanaka), bind-host-aware health checks (#878, @DKev), health-check session reuse (#1211, @arthware-dev), bench upload error sanitized (#1192, @jasonpaulso).
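
For the `<think>` recovery item in the Server / API fixes above, one plausible recovery rule is to treat an unclosed `<think>` as reasoning that runs to the end of the output. The sketch below implements that rule only; `split_thinking` is a hypothetical helper and the behavior shipped in #1329 may differ.

```python
from typing import Tuple

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_thinking(text: str) -> Tuple[str, str]:
    """Split model output into (reasoning, answer).

    Assumed recovery rule: if <think> is never closed, everything after
    it counts as reasoning and only the text before it is the answer.
    """
    start = text.find(OPEN_TAG)
    if start == -1:
        return "", text
    before = text[:start]
    rest = text[start + len(OPEN_TAG):]
    end = rest.find(CLOSE_TAG)
    if end == -1:
        # Unclosed tag: recover instead of leaking the tag into the answer.
        return rest.strip(), before.strip()
    reasoning = rest[:end]
    after = rest[end + len(CLOSE_TAG):]
    return reasoning.strip(), (before + after).strip()


closed = split_thinking("<think>plan</think>Hello")
unclosed = split_thinking("<think>plan that never closes")
plain = split_thinking("Hi")
```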

Dependency Updates

  • dflash-mlx 0.1.3 → 0.1.7, mlx-vlm 191d7c8 → f96138e (Gemma 4 MTP batching), mistral-common >=1.10 (#1116, @pmarreck), uv.lock gitignored (#1209, @deepsweet).

Documentation

New Contributors

Thank you to everyone making their first contribution in 0.3.9:

@fparrav, @hardlycharred, @crienzo, @penumbrazz, @fabiopili, @liang2kl, @scaryrawr, @mrtkrcm, @DKev, @ivaniguarans, @Bortlesboat, @thornad, @Lifto, @yilmazorhan, @aeyeopsdev, @a4501150, @manaskarra, @lobsterbuko, @bogdan-copocean, @felk-dev, @fish0710, @beamivalice, @gltanaka, @pmarreck, @khsd6327, @drumtorben, @mvanhorn, @richgoodson, @contrapuntal, @omniwired, @shannonsands, @alexferrao, @EdenGottlieb, @SuperGregM, @jprado, @Luunae, @arthware-dev, @apetersson, @fqx, @JasonYeYuhe.

Special thanks to @Blaizzy (DeepSeek V4 + Gemma 4 MTP via mlx-vlm) and @bstnxbt (dflash-mlx) for the upstream work this release builds on.
