Highlights
Model settings preset system
One-click vendor-recommended defaults replace the tedious 20-parameter setup dance. The model settings modal now has a unified [ Preset | Global | Model ] scope toggle, and ships with a curated `omlx_preset.json` bundle covering Qwen 3.5/6 (r/nr × general/code), Gemma 4, MiniMax M2.7, gpt-oss-120b, GLM-5/5.1, Llama 4, and Mistral Small 4. Values are copied from each vendor's official recommendation.
A refresh button pulls the latest bundle from omlx.ai, so when a new model lands the presets update in place without a server upgrade. Built on the per-model profile + global template data layer shipped in #853 (thanks @sxc562586657).
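For readers wondering how the three scopes interact: a minimal sketch of the merge, assuming model-specific values override global ones, which in turn override preset defaults (the precedence order, the merge helper, and the parameter names here are illustrative assumptions, not oMLX's actual implementation):

```python
def effective_settings(preset: dict, global_overrides: dict, model_overrides: dict) -> dict:
    """Merge the three scopes: Model overrides win over Global, which wins over Preset.

    Key names are illustrative; the real key set is the ~20 sampler
    parameters the modal exposes.
    """
    merged = dict(preset)            # vendor-recommended baseline
    merged.update(global_overrides)  # admin-wide template
    merged.update(model_overrides)   # per-model profile wins last
    return merged

preset = {"temperature": 0.7, "top_p": 0.8, "top_k": 20}
global_overrides = {"top_p": 0.95}
model_overrides = {"temperature": 0.6}

print(effective_settings(preset, global_overrides, model_overrides))
# {'temperature': 0.6, 'top_p': 0.95, 'top_k': 20}
```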
New Features
- Model settings profile/template data layer with atomic JSON persistence and full CRUD HTTP API (#853)
- Bundled `omlx_preset.json` + `POST /api/presets/refresh` proxy route to fetch updates from omlx.ai
- Qwen 3.6+ thinking preserved across turns on both endpoints: auto-set `preserve_thinking=True` gated on per-model template detection (#856), server-side `<think>` reconstruction from client-provided `reasoning_content` / Anthropic `thinking` blocks (#814), and a native `message.reasoning_content` field path for supporting templates to avoid the whitespace round-trip (#884)
- Global idle timeout dropdown (None / 15m / 30m / 1h / 2h / 8h / 24h) in admin Resource Management. Per-model `ttl_seconds` still wins, pinned models are exempt, and changes live-apply without a restart (#868)
- `hot_cache_only` toggle to run the KV cache entirely in RAM with zero SSD I/O. Closes #605 (#864)
- Qwen3-VL reranker and embedding auto-detection; `/v1/rerank` now accepts `{text, image}` dicts (URL, base64, or local path). Closes #877
- StatusKit Auto-Fix for Tahoe 26.x menubar visibility: one click flips `isAllowed` in the ControlCenter group container plist and restarts ControlCenter, with atomic write + backup + Full Disk Access deep-link. A Bartender-aware conflict dialog was also added
- One-shot post-launch status-item recreate for the Tahoe registration race, About panel polish, a dedicated menubar log at `~/Library/Application Support/oMLX/logs/menubar.log`, and an `omlx diagnose menubar` CLI
- 4 new intelligence benchmarks (BBQ, MathQA, MMLU-Pro, SafetyBench) with per-category UI grouping (Knowledge / Commonsense & Reasoning / Math / Coding / Safety & Alignment). Thanks @michal-stengg (#837)
- `reasoning_content` accepted on the OpenAI `Message` request model; Anthropic `thinking` blocks preserved on the native tool-calling assistant branch (previously dropped) (#814, #884)
- 1% granularity on the Memory Limit and Cold Cache Limit sliders
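To make the new multimodal rerank documents concrete, here is a hedged sketch of building a `/v1/rerank` payload that mixes plain text with `{text, image}` dicts. The `{text, image}` shape and the URL / base64 / local-path image values come from these notes; the `model`/`query`/`documents` envelope and the model name are assumptions modeled on common rerank APIs:

```python
import json

def make_rerank_request(query: str, docs: list) -> dict:
    """Build a /v1/rerank payload mixing plain-text and {text, image} documents.

    Plain strings are wrapped as {"text": ...}; dicts are passed through.
    The envelope fields here are assumptions, not confirmed by the notes.
    """
    documents = []
    for d in docs:
        if isinstance(d, str):
            documents.append({"text": d})
        else:
            documents.append(d)  # already a {text, image} dict
    return {"model": "qwen3-vl-reranker", "query": query, "documents": documents}

payload = make_rerank_request(
    "red sports car",
    [
        "A sedan parked in a garage.",
        {"text": "Product photo", "image": "https://example.com/car.jpg"},
        {"text": "Local capture", "image": "/tmp/car.png"},
    ],
)
print(json.dumps(payload, indent=2))
```

The same shape should work whether the `image` value is a URL, a base64 string, or a local path, per the feature note above.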
Bug Fixes
- Fix oQ quantizing some Qwen3.5 DeltaNet weights that should have stayed in higher precision, causing a quality regression. 5 regression tests added (#913)
- Fix admin model card being unreadable in dark mode. Thanks @miaobuao (#890)
- Stop the `RotatingKVCache` `base_size` warning from spamming on every Gemma 4 request. It's expected sliding-window behavior, so it's DEBUG-level now. Thanks @fqx (#910)
- Fix `RuntimeError: There is no Stream(gpu, 0)` crash on Qwen3.5-family models. Thanks @ysys143 (#891)
- Bump mlx-lm to v0.31.3 and mlx to 0.31.2, picking up upstream fixes for Gemma 4, Mistral, tool parsers, and Metal stream handling (ml-explore/mlx-lm#1090)
- Restore per-field color chips and add a "matched preset" pill so the active preset is obvious at a glance (#897)
- Fix Qwen3.6 MoE losing its SpecPrefill speedup (11.87 → 25.9 tok/s on 35B-A3B-4bit-DWQ). Gemma 3, Qwen3, EXAONE4, DOTS1, and others also benefit. Thanks @mrtkrcm (#846)
- Fix whitespace drift in `<think>` reconstruction on Qwen 3.6+ templates, improving prefix cache reuse across multi-turn reasoning (#884)
- Fix `hot_cache_only` either crashing with Metal GPU panics or silently disabling caching entirely (#864)
- Fix crash on non-mRoPE VLMs (Gemma 4, Pixtral, LLaVA) on vision-feature cache hits (#881)
- Fix Gemma 4 SpecPrefill compatibility, completing @apetersson's initial fix. Closes #668 (#851)
- Fix `preserve_thinking=True` being auto-applied to templates that don't support it, plus 23 new detection tests (#856)
- Return 422 instead of crashing with 500 when a client sends `tool_calls.arguments` in a malformed format. Closes #854
- Fix tool calls being silently dropped when the `qwen3_coder` parser choked on non-Python-literal argument values; falls back to regex now. Closes #882
- Fix MiniMax losing parallel tool calls when a single block contained multiple `<invoke>`s
- Fix per-model Serving Stats filter zeroing counters when models were addressed by alias. Closes #875
- Fix the "Custom" profile badge appearing when only `ttl_seconds` differed from defaults. TTL is now treated as an operational setting (#868)
- Fix chat composer height not resetting after submit and user markdown overflowing the chat pane. Thanks @chulanpro5 (#887)
- Fix menubar visibility false-positives triggered by fullscreen video or slideshows
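The whitespace-drift fix above matters because prefix caching compares raw token prefixes: if the server's reconstructed `<think>` block differs from what the template originally emitted by even one newline, the cached multi-turn prefix no longer matches. A minimal sketch of the idea, with the caveat that the exact delimiters and spacing oMLX's templates use are assumptions here:

```python
def reconstruct_think(reasoning_content: str, answer: str) -> str:
    """Rebuild an assistant turn with its <think> block for the next request.

    Emitting the block byte-for-byte the way the template originally produced
    it (no extra stripping, no added newlines) keeps the multi-turn prefix
    identical, so the server's prefix cache can reuse it. The delimiter
    placement shown is illustrative, not oMLX's actual template.
    """
    return f"<think>\n{reasoning_content}\n</think>\n\n{answer}"

# Reconstructing the same inputs must yield the identical string; any drift
# means a prefix-cache miss and a full re-prefill on the next turn.
turn = reconstruct_think("The user wants X.", "Here is X.")
assert turn == reconstruct_think("The user wants X.", "Here is X.")
```

This is also why the native `message.reasoning_content` path in the New Features list helps: skipping the string round-trip removes the opportunity for drift entirely.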
New Contributors
- @mrtkrcm: Qwen3.6 MoE SpecPrefill (#846)
- @michal-stengg: intelligence benchmarks (#837)
- @RepublicOfKorokke, @fqx: `hot_cache_only` (#701 / #864), `RotatingKVCache` log level (#910)
- @apetersson: Gemma 4 SpecPrefill initial fix (#851)
- @chulanpro5: chat composer fix (#887)
- @ysys143: Metal stream executor fix (#891)
- @miaobuao: model card dark mode styles (#890)
