This is the second dev build on the road to 0.3.9. Mostly cleanup and contributor PRs that landed since dev1, plus a couple of new features. If you hit a bug, please open an issue.
## Highlights
### Gemma 4 MTP support, thanks to @Blaizzy's mlx-vlm
Gemma 4 image + text requests now decode noticeably faster. Flip the toggle in admin and the model predicts multiple tokens per step on the vision path, just like the text-side MTP shipped in dev1. Off by default so existing setups behave the same until you opt in.
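Conceptually, the speedup comes from accepting a prefix of the head's draft tokens on each step. A minimal sketch of that acceptance rule; the function name, token shapes, and ratio bookkeeping are illustrative, not oMLX's internals:

```python
def accept_mtp_drafts(draft_tokens, base_tokens):
    """Greedy acceptance sketch for multi-token prediction (MTP).

    The MTP head proposes several tokens per step; the base model's own
    next-token choices verify them, and the longest agreeing prefix is
    kept. Illustrative only -- not oMLX's actual decode loop.
    """
    accepted = []
    for draft, base in zip(draft_tokens, base_tokens):
        if draft != base:
            break
        accepted.append(draft)
    # Per-step acceptance ratio, the kind of stat speculative logs track.
    ratio = len(accepted) / len(draft_tokens) if draft_tokens else 0.0
    return accepted, ratio
```

The more of the draft the base model agrees with, the more tokens land per forward pass, which is where the decode speedup comes from.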
### DFlash now supports Gemma 4, thanks to @bstnxbt's dflash-mlx
You can now run Gemma 4 on the DFlash engine, so the model lineup matches the rest of the pool instead of falling back when DFlash is selected. The admin quantization picker also lights up every option for DFlash (including an FP16 draft model boost, #880, thanks @deepsweet), and DFlash gets a configurable prefix cache size so longer-running sessions don't churn entries (#1120, thanks @yilmazorhan).
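A bounded prefix cache like the one made configurable in #1120 is essentially an LRU map. A minimal sketch, assuming a `max_entries` knob; the class and method names are illustrative, not DFlash's actual API:

```python
from collections import OrderedDict

class PrefixCache:
    """LRU sketch of a bounded prefix cache (illustrative names).

    With a configurable max_entries, long-running sessions evict only
    the least recently used prefix instead of churning every entry.
    """
    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, prefix: tuple):
        if prefix not in self._entries:
            return None
        self._entries.move_to_end(prefix)      # mark as recently used
        return self._entries[prefix]

    def put(self, prefix: tuple, kv_state) -> None:
        self._entries[prefix] = kv_state
        self._entries.move_to_end(prefix)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the LRU entry
```

Raising `max_entries` trades memory for fewer evictions on long-running sessions.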
### Copilot CLI integration for `omlx launch` (#1085)
`omlx launch copilot` now joins `claude` / `codex` / `opencode` / `openclaw` / `pi`. Same flow as the others: pick a model from the curses TUI, oMLX wires the env and execs into the CLI. The Applications dashboard picks up a copilot row automatically. Thanks @scaryrawr.
### Admin "Restart Server" button (#1194)
Settings → Server Restart now restarts the server from the admin UI without going back to the menubar or shell. The restart endpoint is admin-auth gated. Thanks @jasonpaulso.
### oQ proxy auto-build when model exceeds RAM (#1136)
oQ sensitivity measurement now auto-builds a proxy model when the source model can't fit in RAM, so large checkpoints are quantizable on smaller boxes without manual proxy prep. mlx-lm is patched during the auto-build to keep `_LazyTensorIndex` honest about FP8 / I8 dequant (#1126, thanks @ivaniguarans), and the OOM guard surfaces a clear error instead of getting stuck.
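The decision flow can be sketched as below; `plan_sensitivity_run`, the proxy ratio, and the byte-count inputs are all illustrative, not oQ's actual API:

```python
import math

def plan_sensitivity_run(model_bytes: int, available_ram_bytes: int,
                         proxy_ratio: float = 0.25) -> dict:
    """Decide how to run sensitivity measurement for a checkpoint.

    Hypothetical sketch of the behavior described above: a model that
    fits in RAM is measured directly; otherwise a smaller proxy is
    auto-built, and an OOM guard fails loudly instead of hanging.
    """
    if model_bytes <= available_ram_bytes:
        return {"mode": "direct", "proxy_bytes": None}
    proxy_bytes = math.ceil(model_bytes * proxy_ratio)
    if proxy_bytes > available_ram_bytes:
        # OOM guard: surface a clear error instead of getting stuck.
        raise MemoryError(
            f"model needs ~{model_bytes} B, proxy ~{proxy_bytes} B, "
            f"but only {available_ram_bytes} B of RAM is available"
        )
    return {"mode": "proxy", "proxy_bytes": proxy_bytes}
```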
### ParoQuant + pluggable custom quantization loader (#209)
Adds support for ParoQuant and a small dispatcher so additional quant methods plug in without forking the loader path. All load call sites (`omlx serve`, `omlx chat`, admin "Load model") route through the dispatcher, and experimental engine toggles are gated for `paroquant` models in admin. Thanks @liang2kl.
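A pluggable loader of this shape is usually a name-to-callable registry. A minimal sketch with hypothetical names (`register_quant_loader`, `load_model`), not oMLX's real entry points:

```python
from typing import Callable, Dict

# Illustrative dispatcher: quant-method name -> loader function.
_QUANT_LOADERS: Dict[str, Callable[[str], str]] = {}

def register_quant_loader(method: str):
    """Register a loader for a quant method without forking the load path."""
    def deco(fn):
        _QUANT_LOADERS[method] = fn
        return fn
    return deco

def load_model(path: str, method: str = "default") -> str:
    """Single entry point all call sites route through."""
    if method not in _QUANT_LOADERS:
        raise ValueError(f"unknown quantization method: {method!r}")
    return _QUANT_LOADERS[method](path)

@register_quant_loader("default")
def _load_default(path: str) -> str:
    return f"default:{path}"

@register_quant_loader("paroquant")
def _load_paroquant(path: str) -> str:
    return f"paroquant:{path}"
```

New methods then become one registration each, with no edits to the call sites.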
## New Features
- `/v1/audio/transcriptions` honors `max_tokens` for length-bounded transcripts (#1163). Thanks @thornad.
- STT language forwarded to mlx-audio so non-English transcripts stop falling back to English defaults; ISO codes are normalized to lowercase full names server-side (#1184). Thanks @Bortlesboat.
- French admin localization added to the language picker (#989).
- Save downloaded models under an `{owner}/{model}` subfolder so the model dir mirrors the HuggingFace / ModelScope namespace instead of flat names (#1188).
- Inline chat rename from the sidebar history (#1196).
- Copy button on each chat message with the button moved to the top of the bubble (#1208). Thanks @beamivalice.
- Active model activity visibility improvements on the admin dashboard (#1104). Thanks @apetersson.
- Streaming completion summary log for `/v1/chat/completions` and `/v1/completions` so server logs cover streamed requests, not just non-streaming ones (#1170). Thanks @Lifto.
- VLM MTP per-request acceptance stats in the speculative observability logs.
- Resolve symlinks in the app bundle CLI launcher so `omlx` works correctly when symlinked into `/usr/local/bin` etc.
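The `{owner}/{model}` save layout from the list above reduces to a small path helper. A sketch under assumed names; oMLX's real resolver is not this code:

```python
from pathlib import PurePosixPath

def model_save_dir(model_root: str, repo_id: str) -> PurePosixPath:
    """Mirror the hub namespace on disk: 'owner/model' -> <root>/owner/model.

    Illustrative helper for the layout described above; the validation
    rules here are assumptions, not oMLX's actual checks.
    """
    owner, _, name = repo_id.partition("/")
    if not owner or not name or "/" in name:
        raise ValueError(f"expected 'owner/model', got {repo_id!r}")
    return PurePosixPath(model_root) / owner / name
```

Keeping the on-disk tree shaped like the hub namespace avoids collisions between same-named models from different owners.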
## Bug Fixes
- Honor user stop strings end-to-end through server + scheduler; previously they were dropped between the API layer and the generation loop.
- Llama-4 ChunkedKVCache at batch=1 (#1152). Thanks @aeyeopsdev.
- Scheduler worker mx-buffer access locked against inference-thread mx ops (#1106).
- Cross-thread `generation_stream` sync, TTS bfloat16, and F821 `model_dir` fixes (#1156). Thanks @a4501150.
- Lock-free admin snapshot so the scheduler admin endpoint doesn't block under load.
- Engine-core shutdown dispatch routes `scheduler.shutdown` through `_mlx_executor` instead of blocking the main loop.
- Streaming abort engine reference kept across the abort path so `/v1/...` cancellation doesn't NPE (#1168). Thanks @manaskarra.
- Hot cache byte accounting (#1171). Thanks @lobsterbuko.
- Hot cache flush to SSD on shutdown without dropping blocks (#1101). Thanks @ivaniguarans.
- Cache-store materialization uses `async_eval` so the store doesn't stall the inference thread (#1146). Thanks @ivaniguarans.
- Partial mode cache invalidation prevented for prefill-only requests (#1119). Thanks @blightbow.
- `<think>` tag handling for Qwen3.6 nested-visual sanitize on HF-source load and oQ.
- NaN sensitivity scores in `_measure_sensitivity_from_quantized_model` are now guarded.
- `_TrackedTensor` Ellipsis support + hard-fail discovery skip (#1204).
- Specprefill `get_prefill_tracker` UnboundLocalError (#1197). Thanks @fish0710.
- MTP acceptance ratio + residual sample aligned with the sampler distribution.
- MoE sanitize guards against missing MTP expert weights (#1147). Thanks @ivaniguarans.
- MTP head attachment now runs regardless of `mtp_enabled` state so toggling it off doesn't strand the head.
- MTP runtime patches applied during oQ sensitivity measurement so MTP-aware quant decisions use the right model.
- Strip `mtp.*` tensors in the oQ streaming path when `preserve_mtp=false` so non-MTP outputs don't ship dead weights.
- MoE expert stacking skipped when the MTP layer is dense (MTPLX), plus sanitize coverage tests.
- Anthropic stream emission around `tool_use` (#845). Thanks @mrtkrcm.
- OCR processors routed around torch-gated `AutoImageProcessor` so GLM-OCR / dots_ocr load on transformers 5.5+.
- xgrammar specprefill fallback path: local import resolves an UnboundLocalError (#1197).
- Admin memory input parsing strips percent signs end-to-end so `"50 %"`-shaped input doesn't error.
- Admin left panel icon toggling (#1165). Thanks @bogdan-copocean.
- Broken Jinja `{{ api_key | tojson }}` in `chat.html` (#1201). Thanks @felk-dev.
- Chat message line-break rendering + textarea inline-edit area (#1206). Thanks @beamivalice.
- `copilot_model` included in the integrations settings response so the admin dashboard reflects the active model.
- Spanish translations smoothed out for natural phrasing (#1110). Thanks @fparrav.
- DMG model directory changes persist across restarts (#1114). Thanks @gltanaka.
- Preferences syncs the model dir only when changed so unrelated preference edits don't trigger a full rescan.
- omlx_app health checks + local URLs honor the bind host (#878). Thanks @DKev.
- omlx_app `update_model_dir_runtime` honors the bind host during runtime model-dir changes.
- KMMLU eval answer mapping corrected for one-indexed labels (#1161). Thanks @khsd6327.
- Bench upload error sanitized so a Cloudflare challenge HTML response doesn't leak into the admin UI (#1192). Thanks @jasonpaulso.
- mlx-lm patched in oQ auto-built sensitivity proxy (#1203). Thanks @deepsweet.
- `python-multipart` moved to main dependencies so form-data endpoints work out of the box.
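For the stop-string fix above, the subtle case is a stop string split across streamed chunks. A hedged sketch of one way to handle it; this is illustrative, not oMLX's actual generation loop:

```python
def stream_with_stops(chunks, stop_strings):
    """Stream text chunks, truncating at the first stop string.

    Holds back just enough text to catch a stop string that straddles
    a chunk boundary, emits the safe prefix, and stops at the match.
    """
    holdback = max((len(s) for s in stop_strings), default=1) - 1
    buf = ""
    for chunk in chunks:
        buf += chunk
        for stop in stop_strings:
            idx = buf.find(stop)
            if idx != -1:
                yield buf[:idx]    # emit text before the stop, then halt
                return
        if len(buf) > holdback:
            # Emit everything except the tail that might start a stop.
            yield buf[:-holdback] if holdback else buf
            buf = buf[-holdback:] if holdback else ""
    yield buf                      # no stop seen: flush the remainder
```

Without the holdback, a stop like `END` arriving as `...E` + `ND...` would leak through to the client before the match is noticed.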
## Documentation
- Chinese translation synced with latest upstream changes (#1198). Thanks @JasonYeYuhe.
- Translated READMEs (zh, ko, ja, es, fr) synced with the English upstream.
## Dependency Updates
- mlx-vlm `191d7c8` → `f96138e` for Gemma 4 MTP server batching support.
- mistral-common floor pinned to `>=1.10` so transformers 5.x `WhisperProcessor` loads cleanly (#1116). Thanks @pmarreck.
- `uv.lock` is now gitignored (#1209). Thanks @deepsweet.
## New Contributors
- @liang2kl: ParoQuant + custom quantization method loader (#209)
- @scaryrawr: Copilot CLI integration for `omlx launch` (#1085)
- @mrtkrcm: Anthropic stream emission corner cases around `tool_use` (#845)
- @DKev: omlx_app health checks and local URLs honor bind host (#878)
- @ivaniguarans: oQ FP8/I8 dequant + OOM guard, MoE MTP sanitize fixes, cache-store async_eval, shutdown flush (#1101, #1112, #1126, #1146, #1147)
- @Bortlesboat: forward STT language to mlx-audio (#1184)
- @thornad: `/v1/audio/transcriptions` `max_tokens` support (#1163)
- @Lifto: streaming completion summary log (#1170)
- @yilmazorhan: DFlash configurable prefix cache `max_entries` (#1120)
- @aeyeopsdev: Llama-4 ChunkedKVCache batch=1 compat (#1152)
- @a4501150: cross-thread generation_stream sync, TTS bfloat16 (#1156)
- @manaskarra: streaming abort engine reference (#1168)
- @lobsterbuko: hot cache byte accounting (#1171)
- @bogdan-copocean: admin left panel icon toggling (#1165)
- @felk-dev: broken Jinja `chat.html` fix (#1201)
- @fish0710: specprefill UnboundLocalError (#1197)
- @beamivalice: copy button + line-break rendering (#1206, #1208)
- @gltanaka: DMG model directory persistence (#1114)
- @pmarreck: pin `mistral-common>=1.10` (#1116)
- @khsd6327: KMMLU answer mapping (#1161)