jundot/omlx v0.3.5-rc1


This is a release candidate for testing ahead of the official v0.3.5 release and may contain bugs. If you run into problems, please open an issue.

Highlights

DFlash speculative decoding

Integrated dflash-mlx (block diffusion speculative decoding, arXiv:2602.06036) as an experimental engine option. A small draft model proposes 16 tokens at once via block diffusion, and the target model verifies them in a single forward pass. Verification is lossless: accepted tokens match what the target model alone would have produced. Enable it per model in the admin panel under Experimental Features. Based on @bstnxbt's MLX port of DFlash.

  • Qwen3.5 family supported (4B, 9B, 27B, 35B-A3B)
  • Temperature sampling support (greedy + stochastic)
  • Auto fallback to BatchedEngine/VLMBatchedEngine when context exceeds DFLASH_MAX_CTX (default 4096)
  • Streaming with proper CJK/UTF-8 detokenization
  • Generation metrics logging (tok/s, acceptance ratio, cycles)
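
The draft-and-verify cycle above can be sketched with toy deterministic models. This is a minimal illustration of lossless block verification, not omlx or MLX code; `target_next`, `draft_block`, and the block size constant are illustrative assumptions.

```python
BLOCK = 16  # draft block size, mirroring the 16-token blocks mentioned above

def target_next(prefix):
    # Deterministic toy "target model": next token from a rolling hash.
    return (sum(prefix) * 31 + len(prefix)) % 100

def draft_block(prefix, n=BLOCK):
    # Toy "draft model": agrees with the target except every 5th token.
    out, ctx = [], list(prefix)
    for i in range(n):
        tok = target_next(ctx) if i % 5 != 4 else 0
        out.append(tok)
        ctx.append(tok)
    return out

def verify(prefix, draft):
    # One "verification pass": accept the longest prefix of draft tokens
    # the target agrees with, then emit the target's own token, so the
    # output is identical to what the target alone would generate.
    ctx, accepted = list(prefix), []
    for tok in draft:
        want = target_next(ctx)
        if tok != want:
            accepted.append(want)  # correction token on first mismatch
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when all accepted
    return accepted

prefix = [1, 2, 3]
out = verify(prefix, draft_block(prefix))
# Losslessness check: every emitted token matches the target's own choice.
check = list(prefix)
for tok in out:
    assert tok == target_next(check)
    check.append(tok)
```

Because several tokens can be accepted per target forward pass, throughput scales with the acceptance ratio, which is why the engine logs it alongside tok/s.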

VLM prefix cache reuse for multi-image conversations

VLM prefix cache keying is now segmented by image turn instead of treating all image state as one cache key. Adding a new image in a multi-turn conversation only invalidates from that point onward, keeping earlier text and image turns cached. Cache efficiency improved from 14% to 76% in real multimodal agent workflows. by @latent-variable (#637)
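
The segmentation idea can be sketched as a chain of per-turn keys: each turn's key is derived from the previous key plus that turn's content, so appending a new image changes only the keys from that turn onward. Function and key names here are illustrative assumptions, not omlx's actual cache implementation.

```python
import hashlib

def turn_key(prev_key: str, turn: dict) -> str:
    # Chain-hash one conversation turn onto the previous turn's key.
    h = hashlib.sha256()
    h.update(prev_key.encode())
    h.update(repr(sorted(turn.items())).encode())
    return h.hexdigest()

def prefix_keys(conversation):
    """One chained key per turn; a matching prefix of keys identifies
    cached state that can be reused verbatim."""
    keys, prev = [], ""
    for turn in conversation:
        prev = turn_key(prev, turn)
        keys.append(prev)
    return keys

convo = [{"text": "describe", "image": "a.png"},
         {"text": "and this?", "image": "b.png"}]
old = prefix_keys(convo)
new = prefix_keys(convo + [{"text": "compare", "image": "c.png"}])
# Earlier turns keep identical keys, so their cached KV state is reused;
# only the new turn misses the cache.
assert new[:2] == old
```

With a single monolithic key, any new image would change the whole key and invalidate everything, which matches the low reuse rate the release notes report before this change.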

New Features

  • DFlash speculative decoding engine with admin UI controls
  • Enable Thinking toggle with auto-detection in model settings (#748)
  • Voice cloning support for TTS via ref_audio/ref_text in /v1/audio/speech by @ethannortharc (#676)
  • VoiceDesign generation params for TTS by @jnchaba (#678)
  • Dynamic reasoning parser dropdown from xgrammar by @leuski (#607)
  • Pi integration by @latent-variable (#588)
  • HF Hub cache directory auto-discovery by @AlexWorland (#466)
  • Configurable server aliases for dashboard API URL hints by @jasonpaulso (#634)
  • Network proxy and TLS configuration settings by @applesauce49 (#703)
  • Cache probe endpoint for prompt prefix lookup (#720)
  • Microsoft Harrier OSS model support including quantization by @kyr0 (#723)
  • Unlock auto-scrolling in chat UI by @kyr0 (#722)
  • Per-model runtime cache size display in admin panel by @garnetlyx (#744)
  • structuredContent fallback in MCP tool results
  • Per-request mRoPE position tracking for batched VLM decode
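
The cache probe endpoint above answers a longest-cached-prefix question; a minimal sketch of that lookup is shown below. The function name and list-of-prefixes representation are assumptions for illustration, not omlx internals.

```python
def longest_cached_prefix(prompt_tokens, cached_prefixes):
    # Return the length of the longest cached token prefix that matches
    # the start of the new prompt (0 if nothing matches).
    best = 0
    for prefix in cached_prefixes:
        n = len(prefix)
        if n <= len(prompt_tokens) and prompt_tokens[:n] == prefix:
            best = max(best, n)
    return best

cache = [[1, 2], [1, 2, 3, 4], [9, 9]]
assert longest_cached_prefix([1, 2, 3, 4, 5], cache) == 4
assert longest_cached_prefix([7], cache) == 0
```

A client can use such a probe to estimate how much of a prompt will hit the cache before committing to a request.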

Bug Fixes

  • Fix IOKit prepare count underflow with concurrent completions by @Chuhan1112 (#648)
  • Fix TurboQuant KV cache conversion after prefill by @Landon-Molt (#717)
  • Fix Harmony analysis channel in non-streaming output by @jaredlockhart (#695)
  • Fix chat overscroll and markdown table rendering by @kyr0 (#707)
  • Fix Homebrew formula tokenizers binary wheel build on macOS 15+ by @BRlin-o (#746)
  • Fix macOS app reopen and termination delegate methods (#725)
  • Fix per-image vision feature caching and VLM cache key cleanup
  • Protect active-request models from LRU eviction
  • Verify Metal memory release before updating pool tracking
  • Fix Ollama-compatible usage aliases by @apetersson (#652)
  • Fix VLM SpecPrefill threshold forwarding by @yizhang (#658)
  • Fix streaming chat rendering and viewport containment
  • Fix Enter key during CJK IME composition (#656)
  • Fix Metal buffer cache clear after VLM vision encoding (#667)
  • Fix SSE keepalive wrapper exception handling (#677)
  • Fix image_url content parts in message extraction (#671)
  • Add robust fallback parser for Gemma 4 tool calls (#666)
  • Fix unbounded growth of _cancelled_requests in boundary snapshot store
  • Fix benchmark batch test crash with DFlashEngine (#759)
  • Fix per-block meta_states for RotatingKVCache SSD storage
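
The `_cancelled_requests` fix above implies bounded bookkeeping of the kind sketched below: cap the set and evict the oldest entries so it cannot grow without bound. The class name and cap value are illustrative assumptions, not the actual omlx fix.

```python
from collections import OrderedDict

class BoundedCancelledSet:
    """Remembers recently cancelled request IDs up to a fixed cap."""

    def __init__(self, max_size=1024):
        self.max_size = max_size
        self._items = OrderedDict()

    def add(self, request_id):
        # Insert (or refresh) the ID, then evict oldest entries past the cap.
        self._items[request_id] = None
        self._items.move_to_end(request_id)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop oldest

    def __contains__(self, request_id):
        return request_id in self._items

s = BoundedCancelledSet(max_size=2)
for rid in ("a", "b", "c"):
    s.add(rid)
assert "a" not in s and "b" in s and "c" in s
```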

Dependencies

New Contributors

Full changelog: v0.3.5.dev1...v0.3.5
