github jundot/omlx v0.3.3

19 hours ago

Highlights

Improved Gemma 4 support

Upgraded mlx-lm to 4469ad4 (BatchGenerator refactor) and mlx-vlm to 90732bd. Added support for Gemma 4 vision, audio, and MoE models. Fixed a vision tower crash when multiple images have different resolutions. Gemma 4 reasoning parser and agentic tool calling by @TipKnuckle (#565). Added oMLX-specific customizations for Gemma 4 VLM continuous batching compatibility.

TurboQuant: near-zero overhead decode

Rewrote BatchTurboQuantKVCache as a TurboQuantKVCache subclass instead of wrapping with delegation. Combined with @Blaizzy's new fused Metal kernels (score+softmax+value in 1 dispatch), decode overhead dropped from 43% to 8% vs baseline.
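The difference between the two designs can be sketched roughly as follows. This is a minimal illustration, not oMLX's code: the class names `KVCache`, `DelegatingBatchCache`, and `SubclassBatchCache` are hypothetical stand-ins, and the cache contents are plain Python lists rather than quantized tensors.

```python
class KVCache:
    """Hypothetical single-request cache (stand-in, not oMLX's class)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def update(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        return self.keys, self.values

    def size(self):
        return len(self.keys)


class DelegatingBatchCache:
    """Wrapper design: every call is forwarded to an inner cache."""

    def __init__(self):
        self._inner = KVCache()

    def update(self, k, v):
        # Extra Python-level hop on every decode step.
        return self._inner.update(k, v)

    def size(self):
        return self._inner.size()


class SubclassBatchCache(KVCache):
    """Subclass design: inherits update/size and adds only the new logic."""

    def filter(self, keep):
        # Keep only the entries at the given positions.
        self.keys = [self.keys[i] for i in keep]
        self.values = [self.values[i] for i in keep]


cache = SubclassBatchCache()
cache.update("k0", "v0")
cache.update("k1", "v1")
cache.filter([1])
```

In this sketch the subclass is accepted anywhere a `KVCache` is expected and calls the inherited methods directly, while the wrapper is a separate type that forwards each call. Whether either property explains the measured speedup is an assumption here; the notes attribute the gain to the rewrite combined with the fused kernels.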

Fixed double-softmax bug in hybrid attention (#556) that caused models to lose focus and loop. Fixed continuous batching shape mismatch (#559) when multiple requests join the batch.
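To see why a double softmax makes a model lose focus, here is a small standalone sketch in plain Python. The scores are made-up numbers and the code does not reproduce the hybrid attention path from #556; it only shows the numerical effect of normalizing attention weights twice.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores: one token is clearly the most relevant.
scores = [4.0, 1.0, 0.0, -1.0]

once = softmax(scores)   # intended attention weights
twice = softmax(once)    # the bug: softmax applied to weights again
```

With these scores the top weight falls from about 0.93 to about 0.45. The second softmax only sees inputs between 0 and 1, so no weight can be more than e times larger than another, and the distribution ends up close to uniform.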

Qwen3.5-4B-MLX-4bit, 8k context, 3-bit TQ:

metric      baseline      TurboQuant    vs baseline
decode      117.9 tok/s   109.6 tok/s   0.93x
peak mem    5.19 GB       4.90 GB       -5.6%
KV cache    0.30 GB       0.10 GB       -67%
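The last column can be reproduced from the first two. The snippet below is only a check of that arithmetic, using the figures from the table; the variable names are illustrative.

```python
baseline = {"decode_tok_s": 117.9, "peak_mem_gb": 5.19, "kv_cache_gb": 0.30}
turboquant = {"decode_tok_s": 109.6, "peak_mem_gb": 4.90, "kv_cache_gb": 0.10}


def percent_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# Throughput is reported as a ratio, memory figures as percent change.
decode_ratio = turboquant["decode_tok_s"] / baseline["decode_tok_s"]
peak_mem_change = percent_change(turboquant["peak_mem_gb"], baseline["peak_mem_gb"])
kv_cache_change = percent_change(turboquant["kv_cache_gb"], baseline["kv_cache_gb"])
```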

Continuous batching with TurboQuant

TurboQuant now works with multiple concurrent requests. Batch operations (merge/extract/extend/filter) handle quantized state correctly across batch size changes.
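A toy sketch of what "handling quantized state across batch size changes" involves: each request's codes must stay paired with its own quantization metadata when requests join or leave the batch. Everything here is an assumption for illustration. `QuantizedBatchCache` is a hypothetical class, and plain uniform 3-bit quantization stands in for TurboQuant's actual scheme, which is different.

```python
from dataclasses import dataclass, field

BITS = 3
LEVELS = (1 << BITS) - 1  # 7 steps between the smallest and largest code


def quantize(values):
    """Uniform 3-bit quantization (stand-in): returns (codes, scale, offset)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / LEVELS or 1.0  # avoid a zero scale for constant input
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, offset):
    return [c * scale + offset for c in codes]


@dataclass
class QuantizedBatchCache:
    """Hypothetical batch cache holding one quantized entry per request."""

    entries: list = field(default_factory=list)  # (codes, scale, offset)

    def extend(self, other):
        # New requests join the batch and bring their metadata with them.
        self.entries.extend(other.entries)

    def filter(self, keep):
        # Drop finished requests; codes and scales move together.
        self.entries = [self.entries[i] for i in keep]

    def extract(self, index):
        # Pull one request out as its own single-entry cache.
        return QuantizedBatchCache([self.entries[index]])

    def dequantized(self, index):
        return dequantize(*self.entries[index])


batch = QuantizedBatchCache([quantize([0.0, 0.5, 1.0])])
batch.extend(QuantizedBatchCache([quantize([10.0, 20.0, 30.0])]))
batch.filter([1])  # the first request has finished
restored = batch.dequantized(0)
```

If the scale of the first request were reused for the second after the filter, the restored values would be off by an order of magnitude, which is the kind of mismatch batch operations on quantized state have to avoid.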

New Features

  • Vision feature cache for multi-turn image reuse
  • MCP tool call loop, engine status timeline in chat UI by @rayone (#509)
  • Gemma 4 reasoning parser and agentic tool calling by @TipKnuckle (#565)
  • Auto-update Homebrew formula on tag push
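The vision feature cache listed above could work along these lines: encode an image once, then reuse the features when the same image appears in a later turn. This is a sketch under assumptions. The notes do not say how oMLX keys or evicts entries, so the SHA-256 key, the LRU policy, `VisionFeatureCache`, and `fake_encoder` are all hypothetical.

```python
import hashlib
from collections import OrderedDict


class VisionFeatureCache:
    """Hypothetical LRU cache mapping image bytes to encoder features."""

    def __init__(self, encode, max_items=8):
        self._encode = encode        # the expensive vision-tower call
        self._max_items = max_items
        self._store = OrderedDict()
        self.misses = 0

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        self.misses += 1
        features = self._encode(image_bytes)
        self._store[key] = features
        if len(self._store) > self._max_items:
            self._store.popitem(last=False)  # evict least recently used
        return features


def fake_encoder(image_bytes):
    # Stand-in for a vision tower; returns a tiny "feature vector".
    return [len(image_bytes), sum(image_bytes) % 251]


cache = VisionFeatureCache(fake_encoder)
image = b"\x89PNG fake image payload"
turn_1 = cache.get(image)  # first turn: the encoder runs
turn_2 = cache.get(image)  # follow-up turn: served from the cache
```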

Bug Fixes

  • Fix VLM batched decode degeneration via mlx-lm decode model
  • Fix grammar constrained generation for new BatchGenerator pipelining
  • Fix Gemma 4 multi-image vision tower crash on different resolutions
  • Fix Gemma 4 message extractor dropping image_url parts
  • Fix VLM sanitize proxy missing audio_tower attribute
  • Fix null num_experts in oQ for Gemma 4 dense models (#554)
  • Fix bfloat16 audio in TTS wav conversion (#551)
  • Bypass proxy for local oMLX health checks by @MKuBMax (#558)
  • Fix chat UI: hide _ui:false messages, remove stray </think> on abort
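For the bfloat16 audio fix listed above, the notes do not describe the actual change in #551. A plausible reading is that bfloat16 samples have to be widened to float32 before conversion to 16-bit PCM. The stdlib-only sketch below shows that conversion; the function names and the 24000 Hz sample rate are assumptions, not values from oMLX.

```python
import io
import struct
import wave


def bfloat16_bits_to_float(bits):
    """Widen one bfloat16 (given as a 16-bit int) to a Python float.

    bfloat16 is the upper 16 bits of an IEEE-754 float32, so padding
    with 16 zero bits yields the exact float32 value.
    """
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def float_to_pcm16(sample):
    """Clip to [-1, 1] and scale to a signed 16-bit PCM value."""
    clipped = max(-1.0, min(1.0, sample))
    return int(round(clipped * 32767))


def bfloat16_to_wav_bytes(raw_bits, sample_rate=24000):
    """Build a mono 16-bit WAV file in memory from bfloat16 samples."""
    pcm = [float_to_pcm16(bfloat16_bits_to_float(b)) for b in raw_bits]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)  # assumed rate, not from the notes
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()


# In bfloat16: 0x3F80 is 1.0, 0xBF00 is -0.5, 0x0000 is 0.0
wav_bytes = bfloat16_to_wav_bytes([0x3F80, 0xBF00, 0x0000])
```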

Dependencies

  • Bump mlx-lm to 4469ad4 (BatchGenerator refactor + Gemma 4)
  • Bump mlx-vlm to 90732bd (fused TurboQuant Metal kernels)

New Contributors

  • @MKuBMax — Bypass proxy for local health checks (#558)
  • @rayone — MCP tool call loop, engine status timeline, and chat UI polish (#509)

Full changelog: v0.3.2...v0.3.3
