jundot/omlx v0.3.5-rc1


This is a release candidate for testing ahead of the official v0.3.5 release and may contain bugs. If you run into problems, please open an issue.

Highlights

DFlash speculative decoding

Integrated dflash-mlx (block diffusion speculative decoding, arXiv:2602.06036) as an experimental engine option. A small draft model proposes 16 tokens at once via block diffusion, and the target model verifies them in a single forward pass. Verification is lossless: accepted tokens match what the target model alone would have produced. Enable it per model in the admin panel under Experimental Features. Based on @bstnxbt's MLX port of DFlash.

  • Qwen3.5 family supported (4B, 9B, 27B, 35B-A3B)
  • Temperature sampling support (greedy + stochastic)
  • Auto fallback to BatchedEngine/VLMBatchedEngine when context exceeds DFLASH_MAX_CTX (default 4096)
  • Streaming with proper CJK/UTF-8 detokenization
  • Generation metrics logging (tok/s, acceptance ratio, cycles)
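
The draft-and-verify cycle above can be sketched with toy deterministic models. This is a minimal illustration of lossless block verification, not omlx or MLX code; `target_next`, `draft_block`, and the block size constant are illustrative assumptions.

```python
BLOCK = 16  # draft block size, mirroring the 16-token blocks mentioned above

def target_next(prefix):
    # Deterministic toy "target model": next token from a rolling hash.
    return (sum(prefix) * 31 + len(prefix)) % 100

def draft_block(prefix, n=BLOCK):
    # Toy "draft model": agrees with the target except every 5th token.
    out, ctx = [], list(prefix)
    for i in range(n):
        tok = target_next(ctx) if i % 5 != 4 else 0
        out.append(tok)
        ctx.append(tok)
    return out

def verify(prefix, draft):
    # One "verification pass": accept the longest prefix of draft tokens
    # the target agrees with, then emit the target's own token, so the
    # output is identical to what the target alone would generate.
    ctx, accepted = list(prefix), []
    for tok in draft:
        want = target_next(ctx)
        if tok != want:
            accepted.append(want)  # correction token on first mismatch
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when all accepted
    return accepted

prefix = [1, 2, 3]
out = verify(prefix, draft_block(prefix))
# Losslessness check: every emitted token matches the target's own choice.
check = list(prefix)
for tok in out:
    assert tok == target_next(check)
    check.append(tok)
```

Because several tokens can be accepted per target forward pass, throughput scales with the acceptance ratio, which is why the engine logs it alongside tok/s.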

VLM prefix cache reuse for multi-image conversations

VLM prefix cache keying is now segmented by image turn instead of treating all image state as one cache key. Adding a new image in a multi-turn conversation only invalidates from that point onward, keeping earlier text and image turns cached. Cache efficiency improved from 14% to 76% in real multimodal agent workflows. by @latent-variable (#637)
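
The segmentation idea can be sketched as a chain of per-turn keys: each turn's key is derived from the previous key plus that turn's content, so appending a new image changes only the keys from that turn onward. Function and key names here are illustrative assumptions, not omlx's actual cache implementation.

```python
import hashlib

def turn_key(prev_key: str, turn: dict) -> str:
    # Chain-hash one conversation turn onto the previous turn's key.
    h = hashlib.sha256()
    h.update(prev_key.encode())
    h.update(repr(sorted(turn.items())).encode())
    return h.hexdigest()

def prefix_keys(conversation):
    """One chained key per turn; a matching prefix of keys identifies
    cached state that can be reused verbatim."""
    keys, prev = [], ""
    for turn in conversation:
        prev = turn_key(prev, turn)
        keys.append(prev)
    return keys

convo = [{"text": "describe", "image": "a.png"},
         {"text": "and this?", "image": "b.png"}]
old = prefix_keys(convo)
new = prefix_keys(convo + [{"text": "compare", "image": "c.png"}])
# Earlier turns keep identical keys, so their cached KV state is reused;
# only the new turn misses the cache.
assert new[:2] == old
```

With a single monolithic key, any new image would change the whole key and invalidate everything, which matches the low reuse rate the release notes report before this change.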

New Features

  • DFlash speculative decoding engine with admin UI controls
  • Enable Thinking toggle with auto-detection in model settings (#748)
  • Voice cloning support for TTS via ref_audio/ref_text in /v1/audio/speech by @ethannortharc (#676)
  • VoiceDesign generation params for TTS by @jnchaba (#678)
  • Dynamic reasoning parser dropdown from xgrammar by @leuski (#607)
  • Pi integration by @latent-variable (#588)
  • HF Hub cache directory auto-discovery by @AlexWorland (#466)
  • Configurable server aliases for dashboard API URL hints by @jasonpaulso (#634)
  • Network proxy and TLS configuration settings by @applesauce49 (#703)
  • Cache probe endpoint for prompt prefix lookup (#720)
  • Microsoft Harrier OSS model support including quantization by @kyr0 (#723)
  • Unlock auto-scrolling in chat UI by @kyr0 (#722)
  • Per-model runtime cache size display in admin panel by @garnetlyx (#744)
  • structuredContent fallback in MCP tool results
  • Per-request mRoPE position tracking for batched VLM decode
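
The cache probe endpoint above answers a longest-cached-prefix question; a minimal sketch of that lookup is shown below. The function name and list-of-prefixes representation are assumptions for illustration, not omlx internals.

```python
def longest_cached_prefix(prompt_tokens, cached_prefixes):
    # Return the length of the longest cached token prefix that matches
    # the start of the new prompt (0 if nothing matches).
    best = 0
    for prefix in cached_prefixes:
        n = len(prefix)
        if n <= len(prompt_tokens) and prompt_tokens[:n] == prefix:
            best = max(best, n)
    return best

cache = [[1, 2], [1, 2, 3, 4], [9, 9]]
assert longest_cached_prefix([1, 2, 3, 4, 5], cache) == 4
assert longest_cached_prefix([7], cache) == 0
```

A client can use such a probe to estimate how much of a prompt will hit the cache before committing to a request.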

Bug Fixes

  • Fix IOKit prepare count underflow with concurrent completions by @Chuhan1112 (#648)
  • Fix TurboQuant KV cache conversion after prefill by @Landon-Molt (#717)
  • Fix Harmony analysis channel in non-streaming output by @jaredlockhart (#695)
  • Fix chat overscroll and markdown table rendering by @kyr0 (#707)
  • Fix Homebrew formula tokenizers binary wheel build on macOS 15+ by @BRlin-o (#746)
  • Fix macOS app reopen and termination delegate methods (#725)
  • Fix per-image vision feature caching and VLM cache key cleanup
  • Protect active-request models from LRU eviction
  • Verify Metal memory release before updating pool tracking
  • Fix Ollama-compatible usage aliases by @apetersson (#652)
  • Fix VLM SpecPrefill threshold forwarding by @yizhang (#658)
  • Fix streaming chat rendering and viewport containment
  • Fix Enter key during CJK IME composition (#656)
  • Fix Metal buffer cache clear after VLM vision encoding (#667)
  • Fix SSE keepalive wrapper exception handling (#677)
  • Fix image_url content parts in message extraction (#671)
  • Add robust fallback parser for Gemma 4 tool calls (#666)
  • Fix unbounded growth of _cancelled_requests in boundary snapshot store
  • Fix benchmark batch test crash with DFlashEngine (#759)
  • Fix per-block meta_states for RotatingKVCache SSD storage
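
The `_cancelled_requests` fix above implies bounded bookkeeping of the kind sketched below: cap the set and evict the oldest entries so it cannot grow without bound. The class name and cap value are illustrative assumptions, not the actual omlx fix.

```python
from collections import OrderedDict

class BoundedCancelledSet:
    """Remembers recently cancelled request IDs up to a fixed cap."""

    def __init__(self, max_size=1024):
        self.max_size = max_size
        self._items = OrderedDict()

    def add(self, request_id):
        # Insert (or refresh) the ID, then evict oldest entries past the cap.
        self._items[request_id] = None
        self._items.move_to_end(request_id)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop oldest

    def __contains__(self, request_id):
        return request_id in self._items

s = BoundedCancelledSet(max_size=2)
for rid in ("a", "b", "c"):
    s.add(rid)
assert "a" not in s and "b" in s and "c" in s
```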

Dependencies

New Contributors

Full changelog: v0.3.5.dev1...v0.3.5
