Bug Fixes
- fix TTL expiration unloading models with active in-flight requests — all engine types (LLM, VLM, embedding, reranker, STT, TTS, STS) now report active request count so TTL check skips busy engines (#522)
- fix VLM mRoPE position state lost during prefill — multi-turn conversations on Qwen2-VL/Qwen2.5-VL could produce degraded output (#531)
- fix race condition between snapshot writer thread and cleanup
- fix overly greedy tool call extraction in the thinking fallback; tightened regex to prevent false matches (#484)
- fix model aliases not resolving in audio endpoints (#525)
- fix missing mlx-audio optional deps for TTS/STT/STS (#515)
- fix force_lmbenchmark loading failing on VLM-only models (#487)
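
The TTL fix (#522) makes the expiry check skip engines that still have requests in flight. A minimal sketch of that idea, assuming a hypothetical `Engine` record with `last_used` and `active_requests` fields; these names are illustrative and are not taken from the project's code:

```python
from dataclasses import dataclass


@dataclass
class Engine:
    # Hypothetical fields, for illustration only.
    name: str
    ttl_seconds: float
    last_used: float          # timestamp of the last completed request
    active_requests: int = 0  # requests currently in flight


def engines_to_unload(engines, now):
    """Return names of engines that are past their TTL and have no in-flight requests."""
    return [
        e.name
        for e in engines
        if e.active_requests == 0 and now - e.last_used >= e.ttl_seconds
    ]


engines = [
    Engine("llm", ttl_seconds=60, last_used=0, active_requests=2),  # expired but busy
    Engine("tts", ttl_seconds=60, last_used=0),                     # expired and idle
    Engine("vlm", ttl_seconds=60, last_used=100),                   # not yet expired
]
expired = engines_to_unload(engines, now=120)
```

Here only the idle, expired engine is selected; the busy one is kept loaded even though its TTL has elapsed.
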
Improvements
- make xgrammar optional — auto-detects install method (pip vs uv) and shows correct install command
- enable faulthandler for native crash diagnostics (#511, #520)
- re-download notice toggle in HF uploader
- oQ: update descriptions to reflect current implementation, temporarily disable enhanced quantization UI
- deps: bump mlx-vlm to 9db27b5
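
For context on the faulthandler change (#511, #520): Python's standard-library `faulthandler` module dumps Python tracebacks when the process receives a fatal signal such as SIGSEGV. A minimal sketch of enabling it follows; the helper name `enable_crash_diagnostics` is hypothetical, and where the project actually enables the module is not stated in these notes:

```python
import faulthandler
import sys


def enable_crash_diagnostics(stream=None):
    """Enable faulthandler for all threads.

    `stream` must be backed by a real file descriptor and must stay open
    for as long as faulthandler is enabled. Defaults to sys.stderr.
    """
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()
```

The same behavior can be turned on without code changes by setting the `PYTHONFAULTHANDLER` environment variable or running with `python -X faulthandler`.
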
New Contributors
- @latent-variable made their first contribution in #517