jundot/omlx v0.3.3 on GitHub

Highlights

Improved Gemma 4 support

Upgraded mlx-lm to 4469ad4 (BatchGenerator refactor) and mlx-vlm to 90732bd. Added support for Gemma 4 vision, audio, and MoE models. Fixed a vision tower crash when multiple images have different resolutions. Gemma 4 reasoning parser and agentic tool calling by @TipKnuckle (#565). Added oMLX-specific customizations so Gemma 4 VLMs stay compatible with continuous batching.

TurboQuant: near-zero overhead decode

Rewrote BatchTurboQuantKVCache as a TurboQuantKVCache subclass instead of a wrapper that delegates to it. Combined with @Blaizzy's new fused Metal kernels (score+softmax+value in a single dispatch), decode overhead dropped from 43% to 8% vs baseline.
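The difference between a delegating wrapper and a subclass can be sketched as below. This is a minimal illustration only: `SingleCache`, `WrappedBatchCache`, and `SubclassedBatchCache` are hypothetical stand-ins, not oMLX or mlx-vlm classes, and the release notes do not say how much of the speedup came from this change versus the fused kernels.

```python
class SingleCache:
    """Hypothetical single-sequence cache; tracks how many tokens are stored."""

    def __init__(self) -> None:
        self.offset = 0

    def update(self, n_tokens: int) -> int:
        self.offset += n_tokens
        return self.offset


class WrappedBatchCache:
    """Delegation style: holds an inner cache and forwards each call to it."""

    def __init__(self) -> None:
        self._inner = SingleCache()

    @property
    def offset(self) -> int:
        return self._inner.offset

    def update(self, n_tokens: int) -> int:
        # Every call goes through this extra forwarding hop.
        return self._inner.update(n_tokens)


class SubclassedBatchCache(SingleCache):
    """Subclass style: inherits update() unchanged and adds batch-only state."""

    def __init__(self, batch_size: int = 1) -> None:
        super().__init__()
        self.batch_size = batch_size


wrapped = WrappedBatchCache()
subclassed = SubclassedBatchCache(batch_size=2)
for cache in (wrapped, subclassed):
    cache.update(3)
```

Both objects end up in the same state, but the subclass is itself a `SingleCache`, so code that checks the cache type or calls inherited methods needs no forwarding layer.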

Fixed double-softmax bug in hybrid attention (#556) that caused models to lose focus and loop. Fixed continuous batching shape mismatch (#559) when multiple requests join the batch.
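To see why a double softmax is harmful in general, the sketch below applies softmax once and then twice to the same scores. It is a generic illustration in plain Python, not the oMLX code path; the score values are made up, and the release notes do not describe where the second softmax was applied.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [4.0, 1.0, 0.0, -1.0]  # made-up attention scores
once = softmax(scores)          # sharply peaked on the first position
twice = softmax(once)           # inputs now lie in [0, 1], so the result is nearly flat

print("once: ", [round(p, 3) for p in once])
print("twice:", [round(p, 3) for p in twice])
```

Because the second softmax only sees values between 0 and 1, no weight can exceed another by more than a factor of e, so attention spreads almost evenly over all positions instead of focusing on the relevant ones.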

Qwen3.5-4B-MLX-4bit, 8k context, 3-bit TQ:

metric	baseline	TurboQuant	ratio / change
decode	117.9 tok/s	109.6 tok/s	0.93x
peak mem	5.19 GB	4.90 GB	-5.6%
KV cache	0.30 GB	0.10 GB	-67%
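The last column of the benchmark table can be reproduced from the baseline and TurboQuant columns. The snippet below is just that arithmetic; the dictionary keys are names chosen for this example.

```python
baseline = {"decode_tok_s": 117.9, "peak_mem_gb": 5.19, "kv_cache_gb": 0.30}
turboquant = {"decode_tok_s": 109.6, "peak_mem_gb": 4.90, "kv_cache_gb": 0.10}


def pct_change(new: float, old: float) -> float:
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


# Decode speed is reported as a ratio, memory figures as percent change.
decode_ratio = turboquant["decode_tok_s"] / baseline["decode_tok_s"]
peak_mem_change = pct_change(turboquant["peak_mem_gb"], baseline["peak_mem_gb"])
kv_cache_change = pct_change(turboquant["kv_cache_gb"], baseline["kv_cache_gb"])

print(f"decode   {decode_ratio:.2f}x")
print(f"peak mem {peak_mem_change:+.1f}%")
print(f"KV cache {kv_cache_change:+.0f}%")
```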

Continuous batching with TurboQuant

TurboQuant now works with multiple concurrent requests. Batch operations (merge/extract/extend/filter) handle quantized state correctly across batch size changes.
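As a rough sketch of what handling quantized state across batch size changes involves, the toy class below keeps integer codes and their per-request scales aligned through merge, filter, and extract. `QuantizedBatch` and its layout are assumptions made for illustration; the real TurboQuant cache stores its state differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuantizedBatch:
    """Hypothetical quantized state: one row of codes and one scale per request."""

    codes: List[List[int]]
    scales: List[float]

    def merge(self, other: "QuantizedBatch") -> "QuantizedBatch":
        # New requests join the batch: codes and scales grow together.
        return QuantizedBatch(self.codes + other.codes, self.scales + other.scales)

    def filter(self, keep: List[int]) -> "QuantizedBatch":
        # Finished requests leave: keep the same rows in both fields.
        return QuantizedBatch(
            [self.codes[i] for i in keep], [self.scales[i] for i in keep]
        )

    def extract(self, index: int) -> "QuantizedBatch":
        # Pull one request out as a batch of size 1.
        return self.filter([index])

    def dequantize(self) -> List[List[float]]:
        return [[c * s for c in row] for row, s in zip(self.codes, self.scales)]


running = QuantizedBatch(codes=[[1, -2, 3]], scales=[0.5])
joining = QuantizedBatch(codes=[[4, 0, -1]], scales=[0.25])
batch = running.merge(joining)   # batch size 1 -> 2
survivors = batch.filter([1])    # first request finished, batch size 2 -> 1
```

The point of the sketch is the invariant: if codes and scales ever go out of step during a batch size change, every surviving request dequantizes to wrong values.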

New Features

Vision feature cache for multi-turn image reuse
MCP tool call loop, engine status timeline in chat UI by @rayone (#509)
Gemma 4 reasoning parser and agentic tool calling by @TipKnuckle (#565)
Auto-update Homebrew formula on tag push
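One way a vision feature cache for multi-turn image reuse could work is to key encoded features by a hash of the image bytes, so an image resent in a later turn skips the vision tower. The sketch below assumes that design; `VisionFeatureCache`, `fake_encode`, and the LRU eviction policy are hypothetical and not taken from the oMLX source.

```python
import hashlib
from collections import OrderedDict


class VisionFeatureCache:
    """Hypothetical LRU cache mapping image content to encoded features."""

    def __init__(self, encode, max_entries: int = 8) -> None:
        self._encode = encode          # expensive vision encoder (assumed callable)
        self._max_entries = max_entries
        self._store = OrderedDict()

    def get(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)     # mark as most recently used
            return self._store[key]
        features = self._encode(image_bytes)
        self._store[key] = features
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry
        return features


encode_calls = []


def fake_encode(image_bytes: bytes):
    """Stand-in for a vision tower; records each call and returns dummy features."""
    encode_calls.append(len(image_bytes))
    return [float(b) for b in image_bytes[:4]]


cache = VisionFeatureCache(fake_encode, max_entries=2)
turn1 = cache.get(b"image-A")
turn2 = cache.get(b"image-A")  # same image in a later turn: served from cache
```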

Bug Fixes

Fix VLM batched decode degeneration via mlx-lm decode model
Fix grammar constrained generation for new BatchGenerator pipelining
Fix Gemma 4 multi-image vision tower crash on different resolutions
Fix Gemma 4 message extractor dropping image_url parts
Fix VLM sanitize proxy missing audio_tower attribute
Fix null num_experts in oQ for Gemma 4 dense models (#554)
Fix bfloat16 audio in TTS wav conversion (#551)
Bypass proxy for local oMLX health checks by @MKuBMax (#558)
Fix chat UI: hide _ui:false messages, remove stray </think> on abort
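For the proxy bypass on local health checks, a standard-library sketch of the general technique is shown below. The URL, port, and `/health` path are assumptions for illustration, and this is not the code from #558. An empty `ProxyHandler({})` tells `urllib` to ignore proxy settings such as `HTTP_PROXY`.

```python
import urllib.request


def local_health_check(url: str = "http://127.0.0.1:8000/health",
                       timeout: float = 2.0) -> bool:
    """Return True if the local server answers 200, never going through a proxy."""
    # An empty mapping disables proxies picked up from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, timeouts, and refused connections all land here.
        return False
```

Without the bypass, a system-wide proxy can intercept requests to `127.0.0.1` and make a healthy local server look unreachable.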

Dependencies

Bump mlx-lm to 4469ad4 (BatchGenerator refactor + Gemma 4)
Bump mlx-vlm to 90732bd (fused TurboQuant Metal kernels)

New Contributors

@MKuBMax — Bypass proxy for local health checks (#558)
@rayone — MCP tool call loop, engine status timeline, and chat UI polish (#509)

Full changelog: v0.3.2...v0.3.3