jundot/omlx v0.3.7rc1 on GitHub

This is a release candidate for testing before the official v0.3.7 release. It may contain bugs. If you run into any issues, please open an issue.

Highlights

Model settings preset system

One-click vendor-recommended defaults replace the tedious 20-parameter setup dance. The model settings modal now has a unified [ Preset | Global | Model ] scope toggle, and ships with a curated omlx_preset.json bundle covering Qwen 3.5/6 (r/nr × general/code), Gemma 4, MiniMax M2.7, gpt-oss-120b, GLM-5/5.1, Llama 4, and Mistral Small 4. Values are copied from each vendor's official recommendation.

A refresh button pulls the latest bundle from omlx.ai, so when a new model lands the presets update in place without a server upgrade. Built on the per-model profile + global template data layer shipped in #853 (thanks @sxc562586657).

New Features

Model settings profile/template data layer with atomic JSON persistence and full CRUD HTTP API (#853)
Bundled omlx_preset.json + POST /api/presets/refresh proxy route to fetch updates from omlx.ai
Qwen 3.6+ thinking preserved across turns on both endpoints: auto-set preserve_thinking=True gated on per-model template detection (#856), server-side <think> reconstruction from client-provided reasoning_content / Anthropic thinking blocks (#814), and native message.reasoning_content field path for supporting templates to avoid the whitespace round-trip (#884)
Global idle timeout dropdown (None / 15m / 30m / 1h / 2h / 8h / 24h) in admin Resource Management. Per-model ttl_seconds still wins, pinned models exempt, live-applies without restart (#868)
hot_cache_only toggle to run KV cache entirely in RAM with zero SSD I/O. Closes #605 (#864)
Qwen3-VL reranker and embedding auto-detection; /v1/rerank now accepts {text, image} dicts (URL, base64, local path). Closes #877
StatusKit Auto-Fix for Tahoe 26.x menubar visibility: one-click flips isAllowed in the ControlCenter group container plist + restarts ControlCenter, with atomic write + backup + Full Disk Access deep-link. Bartender-aware conflict dialog also added
One-shot post-launch status-item recreate for the Tahoe registration race, About panel polish, dedicated menubar log at ~/Library/Application Support/oMLX/logs/menubar.log, and omlx diagnose menubar CLI
4 new intelligence benchmarks (BBQ, MathQA, MMLU-Pro, SafetyBench) with per-category UI grouping (Knowledge / Commonsense & Reasoning / Math / Coding / Safety & Alignment). Thanks @michal-stengg (#837)
reasoning_content accepted on OpenAI Message request model; Anthropic thinking blocks preserved on the native tool-calling assistant branch (previously dropped) (#814, #884)
1% granularity on Memory Limit and Cold Cache Limit sliders

Bug Fixes

Fix Qwen3.6 MoE SpecPrefill silently degrading to no-op. The llama extractor assumed Qwen3-Next's gate-split q_proj and dropped per-head q_norm. Dedicated _qwen36_extract_queries path restores the speedup (11.87 → 25.9 tok/s on Qwen3.6-35B-A3B-4bit-DWQ). Gemma 3 text / Qwen3 vanilla / EXAONE4 / DOTS1 / Klear / Apertus also benefit. Idea and measurements by @mrtkrcm (#846)
Fix whitespace round-trip in <think> reconstruction on Qwen 3.6+ templates by using the native message.reasoning_content field, improving prefix cache reuse on multi-turn reasoning (#884)
Fix hot_cache_only crashing or disabling caching entirely: the paged_ssd_cache_dir=None path hit an early-return guard that turned off all caching, and storing live mx.array inference objects in hot cache entries caused Metal GPU kernel panics on large caches (#864)
Fix cached VLM prefill crashing with AttributeError on non-mRoPE models (Gemma 4, Pixtral, LLaVA) when the vision-feature cache served a hit (#881)
Fix Gemma 4 SpecPrefill compatibility. Cherry-picks @apetersson's initial fix and completes the Gemma 4-specific handling. Closes #668 (#851)
Fix preserve_thinking=True auto-set applying to templates that don't support the kwarg, plus 23 missing detect_preserve_thinking unit tests from #814 (#856)
Fix 500 on chat template render when a client echoed back tool_calls.arguments as a non-JSON-object value (python repr, truncated JSON, empty string). Validated at ingress now, 422 instead of 500. Closes #854
Fix tool call silently dropped when mlx-lm's qwen3_coder parser raised SyntaxError on ast.literal_eval for non-python-literal argument values. Falls back to regex parser with a warning. Closes #882
Fix MiniMax parallel tool calls dropped when a single <minimax:tool_call> block contained multiple <invoke>s (mlx-lm bumped to a401730)
Fix per-model Serving Stats filter zeroing every counter for model IDs addressed by alias (model_alias, Anthropic names via /v1/messages, integration aliases from #634). Closes #875
Fix spurious "Custom" profile badge caused by ttl_seconds being compared as a universal profile field. TTL is now an excluded operational setting (#868)
Fix composer height not resetting after submit and user markdown overflowing the chat pane. Thanks @chulanpro5 (#887)
Fix menubar visibility false-positives from fullscreen video / slideshows by dropping the mid-session probe; relies on launch-time probe + one-shot recreate only

New Contributors

@mrtkrcm: Qwen3.6 MoE SpecPrefill (#846)
@michal-stengg: intelligence benchmarks (#837)
@RepublicOfKorokke, @fqx: hot_cache_only (#701 / #864)
@apetersson: Gemma 4 SpecPrefill initial fix (#851)
@chulanpro5: chat composer fix (#887)