Highlights
Improved Gemma 4 support
Upgraded mlx-lm to 4469ad4 (BatchGenerator refactor) and mlx-vlm to 90732bd. Gemma 4 vision, audio, and MoE model support. Multi-image vision tower crash fix for different resolutions. Gemma 4 reasoning parser and agentic tool calling by @TipKnuckle (#565). Added omlx-specific customizations for Gemma 4 VLM continuous batching compatibility.
TurboQuant: near-zero overhead decode
Rewrote BatchTurboQuantKVCache as a TurboQuantKVCache subclass instead of wrapping with delegation. Combined with @Blaizzy's new fused Metal kernels (score+softmax+value in 1 dispatch), decode overhead dropped from 43% to 8% vs baseline.
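The difference between wrapping with delegation and subclassing can be sketched with toy classes. Everything here is hypothetical: `ToyKVCache`, `DelegatingBatchCache`, and `SubclassedBatchCache` are illustrative stand-ins, not the real `TurboQuantKVCache` or `BatchTurboQuantKVCache` code.

```python
class ToyKVCache:
    """Hypothetical stand-in for a single-sequence KV cache."""

    def __init__(self):
        self.entries = []

    def update(self, key, value):
        # Append one (key, value) pair and return the new length.
        self.entries.append((key, value))
        return len(self.entries)


class DelegatingBatchCache:
    """Wrapper style: holds an inner cache and forwards each call to it."""

    def __init__(self):
        self._inner = ToyKVCache()

    def update(self, key, value):
        # Every method of the inner cache needs a forwarding stub like this.
        return self._inner.update(key, value)


class SubclassedBatchCache(ToyKVCache):
    """Subclass style: inherits update() and adds only batch-specific logic."""

    def filter(self, keep):
        # Keep only the entries at the given indices.
        self.entries = [self.entries[i] for i in keep]


wrapped = DelegatingBatchCache()
subclassed = SubclassedBatchCache()
wrapped.update("k0", "v0")
subclassed.update("k0", "v0")
```

One practical difference in this sketch: the subclass passes `isinstance(cache, ToyKVCache)` checks and has no forwarding layer to keep in sync, while the wrapper does neither.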
Fixed double-softmax bug in hybrid attention (#556) that caused models to lose focus and loop. Fixed continuous batching shape mismatch (#559) when multiple requests join the batch.
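To illustrate why a second softmax hurts, here is a small pure-Python sketch with made-up attention scores. It is not the oMLX attention code; it only shows that applying softmax to weights that are already normalized flattens the distribution, so the dominant token loses most of its weight.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative scores: the first token should dominate.
scores = [4.0, 1.0, 0.0, -1.0]

once = softmax(scores)   # correct: sharply peaked on the first token
twice = softmax(once)    # bug: inputs now lie in [0, 1], so the output is nearly flat

print("single softmax:", [round(p, 3) for p in once])
print("double softmax:", [round(p, 3) for p in twice])
```

Both results still sum to 1, which is part of why this kind of bug is easy to miss.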
Qwen3.5-4B-MLX-4bit, 8k context, 3-bit TQ:
| | baseline | TurboQuant | ratio |
|---|---|---|---|
| decode | 117.9 tok/s | 109.6 tok/s | 0.93x |
| peak mem | 5.19 GB | 4.90 GB | -5.6% |
| KV cache | 0.30 GB | 0.10 GB | -67% |
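The ratio column can be reproduced from the two measured columns. This is only an arithmetic check on the numbers in the table; the dictionary keys are made up for the example.

```python
baseline = {"decode_tok_s": 117.9, "peak_mem_gb": 5.19, "kv_cache_gb": 0.30}
turboquant = {"decode_tok_s": 109.6, "peak_mem_gb": 4.90, "kv_cache_gb": 0.10}


def pct_change(before, after):
    # Relative change from `before` to `after`, in percent.
    return (after - before) / before * 100


decode_ratio = turboquant["decode_tok_s"] / baseline["decode_tok_s"]
peak_mem_change = pct_change(baseline["peak_mem_gb"], turboquant["peak_mem_gb"])
kv_cache_change = pct_change(baseline["kv_cache_gb"], turboquant["kv_cache_gb"])

print(f"decode ratio: {decode_ratio:.2f}x")   # 0.93x
print(f"peak mem: {peak_mem_change:.1f}%")    # -5.6%
print(f"KV cache: {kv_cache_change:.0f}%")    # -67%
```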
Continuous batching with TurboQuant
TurboQuant now works with multiple concurrent requests. Batch operations (merge/extract/extend/filter) handle quantized state correctly across batch size changes.
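As a rough picture of what "handling quantized state across batch size changes" involves, the sketch below keeps each request's quantized codes together with its scale while the batch grows and shrinks. `ToyBatchQuantCache`, `QuantRow`, and the simple symmetric quantizer are assumptions made for illustration; the real TurboQuant format and cache API are different.

```python
from dataclasses import dataclass, field


@dataclass
class QuantRow:
    codes: list   # integer codes for one request
    scale: float  # per-request scale needed to dequantize the codes


def quantize(values, bits=3):
    # Toy symmetric quantizer: map values to integers in [-levels, levels].
    levels = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in values) or 1.0
    scale = peak / levels
    return QuantRow([round(v / scale) for v in values], scale)


@dataclass
class ToyBatchQuantCache:
    rows: list = field(default_factory=list)

    @classmethod
    def merge(cls, caches):
        # Combine several caches into one batch, preserving row order.
        return cls([row for cache in caches for row in cache.rows])

    def extract(self, index):
        # Pull one request out as its own single-row cache.
        return ToyBatchQuantCache([self.rows[index]])

    def extend(self, other):
        # New requests join the running batch.
        self.rows.extend(other.rows)

    def filter(self, keep):
        # Finished requests leave; codes and scales move together.
        self.rows = [self.rows[i] for i in keep]
```

The point of the sketch is that every operation moves codes and scale as one unit, so a row never ends up dequantized with another request's scale.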
New Features
- Vision feature cache for multi-turn image reuse
- MCP tool call loop, engine status timeline in chat UI by @rayone (#509)
- Gemma 4 reasoning parser and agentic tool calling by @TipKnuckle (#565)
- Auto-update Homebrew formula on tag push
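The idea behind the vision feature cache listed above can be sketched as a small LRU keyed by a hash of the image bytes, so a multi-turn chat that resends the same image skips the vision encoder. `ToyVisionFeatureCache`, its `max_items` parameter, and the `encode` callable are hypothetical; oMLX's actual cache keys and eviction policy may differ.

```python
import hashlib
from collections import OrderedDict


class ToyVisionFeatureCache:
    """Hypothetical LRU cache from image content to encoder features."""

    def __init__(self, encode, max_items=8):
        self._encode = encode          # callable: image bytes -> features
        self._max_items = max_items
        self._store = OrderedDict()

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            # Cache hit: mark as most recently used and reuse the features.
            self._store.move_to_end(key)
            return self._store[key]
        # Cache miss: run the (expensive) encoder once and remember the result.
        features = self._encode(image_bytes)
        self._store[key] = features
        if len(self._store) > self._max_items:
            self._store.popitem(last=False)  # evict the least recently used
        return features
```

Hashing the content rather than a URL or filename means the same image is recognized even if it arrives under a different name on a later turn.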
Bug Fixes
- Fix VLM batched decode degeneration via mlx-lm decode model
- Fix grammar constrained generation for new BatchGenerator pipelining
- Fix Gemma 4 multi-image vision tower crash on different resolutions
- Preserve image_url parts in Gemma 4 message extractor
- Fix VLM sanitize proxy missing audio_tower attribute
- Fix null num_experts in oQ for Gemma 4 dense models (#554)
- Fix bfloat16 audio in TTS wav conversion (#551)
- Bypass proxy for local oMLX health checks by @MKuBMax (#558)
- Fix chat UI: hide `_ui:false` messages, remove stray `</think>` on abort
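For the local health check proxy bypass listed above, a minimal standard-library sketch of the general technique looks like this. It is not the code from #558: `LOCAL_HOSTS`, `is_local`, and `opener_for` are names invented for the example, and the `/health` path and port in the comment are placeholders.

```python
import urllib.request
from urllib.parse import urlsplit

# Hosts treated as local for this example.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_local(url):
    # True when the URL points at the loopback interface.
    return urlsplit(url).hostname in LOCAL_HOSTS


def opener_for(url):
    if is_local(url):
        # An empty ProxyHandler mapping disables proxies, including those
        # picked up from HTTP_PROXY / HTTPS_PROXY environment variables.
        return urllib.request.build_opener(urllib.request.ProxyHandler({}))
    # Everything else keeps the default proxy behaviour.
    return urllib.request.build_opener()


# Example (placeholder URL, not executed here):
# opener_for("http://127.0.0.1:8000/health").open("http://127.0.0.1:8000/health", timeout=2)
```

Without the bypass, a system-wide proxy setting can route a loopback request through the proxy, which typically cannot reach the local server and makes a healthy process look down.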
Dependencies
- Bump mlx-lm to 4469ad4 (BatchGenerator refactor + Gemma 4)
- Bump mlx-vlm to 90732bd (fused TurboQuant Metal kernels)
New Contributors
- @MKuBMax — Bypass proxy for local health checks (#558)
- @rayone — MCP tool call loop, engine status timeline, and chat UI polish (#509)
Full changelog: v0.3.2...v0.3.3