This is a release candidate for testing before the official v0.3.5 release. It may contain bugs. If you run into problems, please open an issue.
## Highlights

### DFlash speculative decoding
Integrated dflash-mlx (block diffusion speculative decoding, arXiv:2602.06036) as an experimental engine option. A small draft model proposes 16 tokens at once via block diffusion, which the target model verifies in a single forward pass; all accepted tokens are lossless. Enable it per model in the admin panel under Experimental Features. Based on @bstnxbt's MLX port of DFlash.
- Qwen3.5 family supported (4B, 9B, 27B, 35B-A3B)
- Temperature sampling support (greedy + stochastic)
- Automatic fallback to BatchedEngine/VLMBatchedEngine when context exceeds `DFLASH_MAX_CTX` (default 4096)
- Streaming with proper CJK/UTF-8 detokenization
- Generation metrics logging (tok/s, acceptance ratio, cycles)
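The verify step described above (draft proposes a block, target checks it in one forward pass) can be sketched as follows. This is an illustrative simplification, not the dflash-mlx implementation: it accepts the longest drafted prefix the target agrees with, then substitutes the target's own token at the first disagreement, so output quality matches the target model.

```python
import numpy as np

def verify_block(draft_tokens, target_logits, temperature=0.0, rng=None):
    """Accept the longest prefix of a drafted block that the target model
    agrees with. `target_logits` has shape (block_len, vocab_size), produced
    by one target forward pass over the drafted tokens. Greedy acceptance
    when temperature == 0; simple stochastic sampling otherwise.
    (Hypothetical sketch only.)"""
    rng = rng or np.random.default_rng()
    accepted = []
    for i, tok in enumerate(draft_tokens):
        logits = target_logits[i]
        if temperature == 0.0:
            target_tok = int(np.argmax(logits))
        else:
            scaled = logits / temperature
            probs = np.exp(scaled - np.max(scaled))
            probs /= probs.sum()
            target_tok = int(rng.choice(len(probs), p=probs))
        if tok == target_tok:
            accepted.append(tok)
        else:
            # First disagreement: keep the target's token and stop the block.
            accepted.append(target_tok)
            break
    return accepted
```

With greedy acceptance, every emitted token is exactly what the target model would have produced on its own, which is why accepted tokens are lossless; the speedup comes from verifying the whole block in one pass instead of decoding token by token.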
### VLM prefix cache reuse for multi-image conversations
VLM prefix cache keys are now segmented by image turn instead of treating all image state as a single cache key. Adding a new image in a multi-turn conversation invalidates the cache only from that point onward, keeping earlier text and image turns cached. Cache efficiency improved from 14% to 76% in real multimodal agent workflows. By @latent-variable (#637)
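The per-turn keying idea can be sketched with a rolling hash: each turn's key covers the conversation prefix up to and including that turn, so appending a new image turn produces new keys only from that turn onward. This is a hypothetical illustration of the keying scheme, not the server's actual cache code; the `turns` tuple layout is an assumption.

```python
import hashlib

def segmented_cache_keys(turns):
    """One cache key per conversation turn, each hashing the rolling prefix
    up to that turn. `turns` is a list of (role, text, image_bytes_or_None).
    Earlier keys are unchanged when new turns are appended, so their cached
    prefill state stays reusable. (Illustrative sketch.)"""
    keys = []
    h = hashlib.sha256()
    for role, text, image in turns:
        h.update(role.encode())
        h.update(text.encode())
        if image is not None:
            h.update(image)  # each image turn gets its own segment
        keys.append(h.copy().hexdigest())
    return keys
```

Because the keys are prefix hashes, a lookup can walk them from the end and reuse the longest cached prefix, which is what keeps earlier text and image turns warm when a new image arrives.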
## New Features
- DFlash speculative decoding engine with admin UI controls
- Enable Thinking toggle with auto-detection in model settings (#748)
- Voice cloning support for TTS via `ref_audio`/`ref_text` in `/v1/audio/speech` by @ethannortharc (#676)
- VoiceDesign generation params for TTS by @jnchaba (#678)
- Dynamic reasoning parser dropdown from xgrammar by @leuski (#607)
- Pi integration by @latent-variable (#588)
- HF Hub cache directory auto-discovery by @AlexWorland (#466)
- Configurable server aliases for dashboard API URL hints by @jasonpaulso (#634)
- Network proxy and TLS configuration settings by @applesauce49 (#703)
- Cache probe endpoint for prompt prefix lookup (#720)
- Microsoft Harrier OSS model support including quantization by @kyr0 (#723)
- Unlock auto-scrolling in chat UI by @kyr0 (#722)
- Per-model runtime cache size display in admin panel by @garnetlyx (#744)
- `structuredContent` fallback in MCP tool results
- Per-request mRoPE position tracking for batched VLM decode
## Bug Fixes
- Fix IOKit prepare count underflow with concurrent completions by @Chuhan1112 (#648)
- Fix TurboQuant KV cache conversion after prefill by @Landon-Molt (#717)
- Fix Harmony analysis channel in non-streaming output by @jaredlockhart (#695)
- Fix chat overscroll and markdown table rendering by @kyr0 (#707)
- Fix Homebrew formula tokenizers binary wheel build on macOS 15+ by @BRlin-o (#746)
- Fix macOS app reopen and termination delegate methods (#725)
- Fix per-image vision feature caching and VLM cache key cleanup
- Protect active-request models from LRU eviction
- Fix Metal memory release verification before updating pool tracking
- Fix Ollama-compatible usage aliases by @apetersson (#652)
- Fix VLM SpecPrefill threshold forwarding by @yizhang (#658)
- Fix streaming chat rendering and viewport containment
- Fix Enter key during CJK IME composition (#656)
- Fix Metal buffer cache clear after VLM vision encoding (#667)
- Fix SSE keepalive wrapper exception handling (#677)
- Fix image_url content parts in message extraction (#671)
- Fix Gemma 4 tool call parsing with a robust fallback parser (#666)
- Fix `_cancelled_requests` unbounded growth in boundary snapshot store
- Fix benchmark batch test crash with DFlashEngine (#759)
- Fix per-block meta_states for RotatingKVCache SSD storage
## Dependencies
- Add `dflash-mlx` as a required dependency (pinned to jundot/dflash-mlx@8e1df22)
- Bump mlx-vlm to 3472132
## New Contributors
- @latent-variable — VLM prefix cache reuse for multi-image conversations (#637), Pi integration (#588)
- @Chuhan1112 — Fix IOKit prepare count underflow (#648)
- @jasonpaulso — Configurable server aliases (#634)
- @leuski — Dynamic reasoning parser dropdown (#607)
- @AlexWorland — HF Hub cache auto-discovery (#466)
- @ethannortharc — Voice cloning for TTS (#676)
- @apetersson — Ollama usage aliases (#652)
- @jaredlockhart — Harmony non-streaming reasoning fix (#695)
- @applesauce49 — Network proxy/TLS settings (#703)
- @kyr0 — Harrier OSS support (#723), scroll unlock (#722), chat UI fixes (#707)
- @Landon-Molt — TurboQuant KV cache fix (#717)
- @garnetlyx — Per-model cache size display (#744)
- @BRlin-o — Homebrew formula fix (#746)
- @jnchaba — VoiceDesign TTS params (#678)
- @yizhang — VLM SpecPrefill threshold fix (#658)
Full changelog: v0.3.5.dev1...v0.3.5