jundot/omlx v0.3.2


Highlights

Gemma 4 support

Bumped mlx-vlm to 43b9b20 which adds Gemma 4 vision, audio, and MoE model support. Also includes chunked prefill fixes for KV-shared models.

TurboQuant is back

Based on @Blaizzy's mlx-vlm TurboQuant integration. Imports mlx-vlm's multi-codec engine directly — Prod, MSE, Polar, and Split codecs with fractional bit support (e.g. 3.5-bit).
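Fractional bit-widths are commonly realized by splitting elements between two neighboring integer widths; whether the Split codec works exactly this way is an assumption, but the arithmetic behind an effective 3.5-bit rate is simple:

```python
def effective_bits(lo_bits: int, hi_bits: int, hi_fraction: float) -> float:
    """Average storage cost per element when `hi_fraction` of the elements
    use the wider codec and the rest use the narrower one."""
    return lo_bits * (1.0 - hi_fraction) + hi_bits * hi_fraction

# Half the elements at 3-bit and half at 4-bit average to 3.5 bits/element.
print(effective_bits(3, 4, 0.5))  # → 3.5
```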

I built BatchTurboQuantKVCache on top of mlx-vlm's single-request TurboQuantKVCache for omlx's continuous batching scheduler. During prefill, the KV cache is quantized immediately to reduce peak memory; during decode, new tokens are buffered in fp16 and batch-quantized every 32 tokens, with hybrid attention computed over the fp16 buffer plus the quantized state.
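The decode-time buffering described above can be sketched as follows. This is a minimal illustration assuming a simple per-token absmax int8 codec; the class name and codec are hypothetical, and only the 32-token group size mirrors the description — the real implementation uses TurboQuant's multi-codec engine.

```python
import numpy as np

class HybridKVBufferSketch:
    """Illustrative decode-time KV buffering: recent tokens stay in an
    fp16 tail buffer and are quantized in batches once the buffer fills.
    Attention would then run over dequantized history + the fp16 tail."""
    GROUP = 32  # decode tokens are batch-quantized every 32 tokens

    def __init__(self, head_dim):
        self.head_dim = head_dim
        self.fp16_tail = []   # unquantized recent tokens (fp16 vectors)
        self.q_store = []     # quantized history: (int8 values, scale)

    def append(self, kv_vec):
        self.fp16_tail.append(kv_vec.astype(np.float16))
        if len(self.fp16_tail) == self.GROUP:
            self._flush()

    def _flush(self):
        # Batch-quantize the whole fp16 buffer with a per-token absmax scale.
        for v in self.fp16_tail:
            scale = float(np.abs(v).max()) / 127.0
            scale = scale if scale > 0 else 1.0
            self.q_store.append((np.round(v / scale).astype(np.int8), scale))
        self.fp16_tail.clear()

    def full_keys(self):
        # Hybrid view: dequantized history concatenated with the fp16 tail.
        deq = [q.astype(np.float16) * np.float16(s) for q, s in self.q_store]
        rows = deq + self.fp16_tail
        if not rows:
            return np.empty((0, self.head_dim), dtype=np.float16)
        return np.stack(rows)
```

During prefill the same codec would be applied to the whole prompt at once, which is why peak memory drops: the fp16 KV tensor for the full context never materializes.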

Enable it per model from the admin dashboard settings or via model_settings.json.
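A model_settings.json entry might look like the sketch below. The key names (`kv_cache`, `turboquant`, `bits`, `codec`) are assumptions based on the feature description above, not a documented schema — the admin dashboard is the authoritative interface.

```json
{
  "qwen3.5-27b-4bit": {
    "kv_cache": {
      "turboquant": true,
      "bits": 3,
      "codec": "prod"
    }
  }
}
```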

Qwen3.5-27B-4bit, 3-bit TQ:

| Metric | 32k baseline | 32k TQ | 128k baseline | 128k TQ |
| --- | --- | --- | --- | --- |
| KV cache mem | 2.14 GB | 0.54 GB (-75%) | 8.14 GB | 1.70 GB (-79%) |
| Peak mem | 22.47 GB | 21.11 GB (-1.4 GB) | 37.66 GB | 33.55 GB (-4.1 GB) |
| Prefill | 362 tok/s | 353 tok/s | 238 tok/s | 226 tok/s |
| Decode | 28.4 tok/s | 17.9 tok/s | 19.4 tok/s | 7.3 tok/s |

Peak memory savings scale with context length. Decode speed tradeoff is inherent to quantized KV attention — TQ is designed for memory-constrained long context, not speed.
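As a back-of-envelope check on the table: a 3-bit codec stores 3/16 of the fp16 KV payload, so the ideal saving is ~81%; quantization scales and the fp16 decode buffer plausibly account for the gap down to the measured -75%/-79%.

```python
# Ideal saving from replacing 16-bit KV entries with 3-bit entries.
fp16_bits = 16
tq_bits = 3
ideal_saving = 1 - tq_bits / fp16_bits   # 0.8125, i.e. ~81%

# Measured 32k-context numbers from the table above.
baseline_gb, tq_gb = 2.14, 0.54
measured_saving = 1 - tq_gb / baseline_gb
print(f"ideal: {ideal_saving:.0%}, measured: {measured_saving:.0%}")
```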

Bug Fixes

  • Fix EXIF orientation not applied when loading images in VLM engine
  • Fix chunked prefill for Gemma 4 KV-shared models

Dependencies

  • Bump mlx-vlm to 43b9b20 (Gemma 4, TurboQuant)

Full changelog: v0.3.1...v0.3.2
