jundot/omlx v0.3.9.dev2

Pre-release · 10 hours ago

This is the second dev build on the road to 0.3.9. Mostly cleanup and contributor PRs that landed since dev1, plus a couple of new features. If you hit a bug, please open an issue.

Highlights

Gemma 4 MTP support, thanks to @Blaizzy's mlx-vlm

Gemma 4 image + text requests now decode noticeably faster. Flip the MTP toggle in admin and the model predicts multiple tokens per step on the vision path, just like the text-side MTP that shipped in dev1. It's off by default, so existing setups behave the same until you opt in.

DFlash now supports Gemma 4, thanks to @bstnxbt's dflash-mlx

You can now run Gemma 4 on the DFlash engine, so the model lineup matches the rest of the pool instead of falling back when DFlash is selected. The admin quantization picker also lights up every option for DFlash (including an FP16 draft model boost, #880, thanks @deepsweet), and DFlash gets a configurable prefix cache size so longer-running sessions don't churn entries (#1120, thanks @yilmazorhan).

Copilot CLI integration for omlx launch (#1085)

omlx launch copilot now joins claude / codex / opencode / openclaw / pi. Same flow as the others — pick a model from the curses TUI, oMLX wires the env, execs into the CLI. The Applications dashboard picks up a copilot row automatically. Thanks @scaryrawr.

Admin "Restart Server" button (#1194)

Settings → Server Restart now restarts the server from the admin UI without going back to the menubar or shell. The restart endpoint is admin-auth gated. Thanks @jasonpaulso.
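
For scripted restarts, here's a minimal sketch of hitting that endpoint; the /admin/restart path, the port, and the Bearer-token header are placeholders for illustration, not omlx's documented route. Check the admin settings for the real values.

    import requests

    # All of these values are hypothetical; substitute your own.
    resp = requests.post(
        "http://localhost:8000/admin/restart",
        headers={"Authorization": "Bearer <admin-api-key>"},
        timeout=10,
    )
    resp.raise_for_status()  # a non-2xx status means the restart was rejected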

oQ proxy auto-build when model exceeds RAM (#1136)

oQ sensitivity measurement now auto-builds a proxy model when the source model can't fit in RAM, so large checkpoints are quantizable on smaller boxes without manual proxy prep. mlx-lm is patched during the auto-build to keep _LazyTensorIndex honest about FP8 / I8 dequant (#1126, thanks @ivaniguarans), and the OOM guard surfaces a clear error instead of getting stuck.
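
To make the control flow concrete, here's a sketch of the fits-in-RAM guard described above; every name is hypothetical, and psutil stands in for however omlx actually sizes available memory.

    import psutil  # illustration only

    def pick_sensitivity_model(model_path, estimated_bytes, build_proxy):
        """Return a model path that fits in RAM for sensitivity measurement."""
        if estimated_bytes <= psutil.virtual_memory().available:
            # The full checkpoint fits: measure sensitivity on it directly.
            return model_path
        # Too big for this box: build a reduced proxy model rather than
        # letting the measurement pass OOM partway through.
        return build_proxy(model_path)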

ParoQuant + pluggable custom quantization loader (#209)

Adds support for ParoQuant and a small dispatcher so additional quant methods plug in without forking the loader path. All load call sites (omlx serve, omlx chat, admin "Load model") route through the dispatcher, and experimental engine toggles are gated for paroquant models in admin. Thanks @liang2kl.
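
As an illustration of the pattern (not omlx's actual code), a pluggable quant-method dispatcher can be little more than a registry keyed by method name:

    # Hypothetical names throughout; this just sketches the registry pattern.
    _QUANT_LOADERS = {}

    def register_quant_loader(method):
        """Decorator that registers a loader for a quantization method."""
        def wrap(fn):
            _QUANT_LOADERS[method] = fn
            return fn
        return wrap

    @register_quant_loader("paroquant")
    def load_paroquant(model_path):
        ...  # method-specific weight loading

    def load_quantized(method, model_path):
        if method not in _QUANT_LOADERS:
            raise ValueError(f"unknown quantization method: {method}")
        return _QUANT_LOADERS[method](model_path)

The point of the indirection is that omlx serve, omlx chat, and the admin loader all call one entry point, and new methods register themselves without touching the call sites.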


New Features

  • /v1/audio/transcriptions honors max_tokens for length-bounded transcripts (#1163); see the request sketch after this list. Thanks @thornad.
  • STT language is forwarded to mlx-audio so non-English transcripts stop falling back to English defaults; ISO codes are normalized to lowercase full language names server-side (#1184). Thanks @Bortlesboat.
  • French admin localization added to the language picker (#989).
  • Downloaded models are saved under an {owner}/{model} subfolder so the model dir mirrors the HuggingFace / ModelScope namespace instead of using flat names (#1188).
  • Inline chat rename from the sidebar history (#1196).
  • Copy button on each chat message with the button moved to the top of the bubble (#1208). Thanks @beamivalice.
  • Clearer visibility of active model activity on the admin dashboard (#1104). Thanks @apetersson.
  • Streaming completion summary log for /v1/chat/completions and /v1/completions so server logs cover streamed requests, not just non-streaming (#1170). Thanks @Lifto.
  • VLM MTP per-request acceptance stats in the speculative observability logs.
  • Resolve symlinks in the app bundle CLI launcher so omlx works correctly when symlinked into /usr/local/bin etc.
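
For the max_tokens transcription change above, a request sketch; the host, port, and model name are placeholders, and max_tokens is assumed to travel as an ordinary multipart form field next to the file.

    import requests

    with open("meeting.wav", "rb") as f:
        resp = requests.post(
            "http://localhost:8000/v1/audio/transcriptions",  # placeholder host/port
            files={"file": ("meeting.wav", f, "audio/wav")},
            data={
                "model": "whisper-large-v3",  # placeholder model name
                "max_tokens": 64,             # cap the transcript length (#1163)
            },
        )
    print(resp.json()["text"])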

Bug Fixes

  • User stop strings are now honored end-to-end through the server and scheduler; previously they were dropped between the API layer and the generation loop. See the request sketch after this list.
  • Llama-4 ChunkedKVCache at batch=1 (#1152). Thanks @aeyeopsdev.
  • Scheduler worker mx-buffer access locked against inference-thread mx ops (#1106).
  • Cross-thread generation_stream sync, TTS bfloat16, and F821 model_dir (#1156). Thanks @a4501150.
  • Lock-free admin snapshot so the scheduler admin endpoint doesn't block under load.
  • Engine-core shutdown dispatch routes scheduler.shutdown through _mlx_executor instead of blocking the main loop.
  • Streaming abort engine reference kept across the abort path so /v1/... cancellation doesn't NPE (#1168). Thanks @manaskarra.
  • Hot cache byte accounting (#1171). Thanks @lobsterbuko.
  • Hot cache flush to SSD on shutdown without dropping blocks (#1101). Thanks @ivaniguarans.
  • Cache-store materialization uses async_eval so the store doesn't stall the inference thread (#1146). Thanks @ivaniguarans.
  • Partial mode cache invalidation prevented for prefill-only requests (#1119). Thanks @blightbow.
  • <think> tag handling fixed in the Qwen3.6 nested-visual sanitize path, both on HF-source load and in oQ.
  • NaN sensitivity scores in _measure_sensitivity_from_quantized_model are now guarded.
  • _TrackedTensor Ellipsis support + hard-fail discovery skip (#1204).
  • Specprefill get_prefill_tracker UnboundLocalError (#1197). Thanks @fish0710.
  • MTP acceptance ratio + residual sample aligned with the sampler distribution.
  • MoE sanitize guards against missing MTP expert weights (#1147). Thanks @ivaniguarans.
  • MTP head attachment now runs regardless of mtp_enabled state so toggling it off doesn't strand the head.
  • MTP runtime patches applied during oQ sensitivity measurement so MTP-aware quant decisions use the right model.
  • Strip mtp.* tensors in the oQ streaming path when preserve_mtp=false so non-MTP outputs don't ship dead weights.
  • MoE expert stacking skipped when MTP layer is dense (MTPLX) plus sanitize coverage tests.
  • Anthropic stream emission around tool_use (#845). Thanks @mrtkrcm.
  • OCR processors routed around torch-gated AutoImageProcessor so GLM-OCR / dots_ocr load on transformers 5.5+.
  • A local import in the xgrammar specprefill fallback path resolves an UnboundLocalError (#1197).
  • Percent signs are stripped when parsing admin memory input, so values like "50 %" no longer error.
  • Admin left panel icon toggling (#1165). Thanks @bogdan-copocean.
  • Broken Jinja {{ api_key | tojson }} in chat.html (#1201). Thanks @felk-dev.
  • Chat message line-break rendering + textarea inline-edit area (#1206). Thanks @beamivalice.
  • copilot_model included in the integrations settings response so the admin dashboard reflects the active model.
  • Spanish translations smoothed out for natural phrasing (#1110). Thanks @fparrav.
  • DMG model directory changes persist across restarts (#1114). Thanks @gltanaka.
  • Preferences syncs the model dir only when changed so unrelated preference edits don't trigger a full rescan.
  • omlx_app health checks + local URLs honor the bind host (#878). Thanks @DKev.
  • omlx_app update_model_dir_runtime honors the bind host during runtime model-dir changes.
  • KMMLU eval answer mapping corrected for one-indexed labels (#1161). Thanks @khsd6327.
  • Bench upload error sanitized so a Cloudflare challenge HTML response doesn't leak into the admin UI (#1192). Thanks @jasonpaulso.
  • mlx-lm patched in oQ auto-built sensitivity proxy (#1203). Thanks @deepsweet.
  • python-multipart moved to main dependencies so form-data endpoints work out of the box.
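
For the stop-string fix at the top of this list, a request sketch using the standard OpenAI-style stop parameter; host, port, and model name are placeholders.

    import requests

    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # placeholder host/port
        json={
            "model": "my-model",  # placeholder
            "messages": [{"role": "user", "content": "Count to ten."}],
            "stop": ["7"],  # user stop strings now reach the generation loop
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])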

Documentation

  • Chinese translation synced with latest upstream changes (#1198). Thanks @JasonYeYuhe.
  • Translated READMEs (zh, ko, ja, es, fr) synced with the English upstream.

Dependency Updates

  • mlx-vlm 191d7c8 → f96138e for Gemma 4 MTP server batching support
  • mistral-common floor pinned to >=1.10 so transformers 5.x WhisperProcessor loads cleanly (#1116). Thanks @pmarreck.
  • uv.lock is now gitignored (#1209). Thanks @deepsweet.

New Contributors
