Highlights
Gemma 4 support
Bumped mlx-vlm to 43b9b20, which adds Gemma 4 vision, audio, and MoE model support, along with chunked-prefill fixes for KV-shared models.
TurboQuant is back
Based on @Blaizzy's TurboQuant integration in mlx-vlm. omlx imports mlx-vlm's multi-codec engine directly: Prod, MSE, Polar, and Split codecs, with fractional bit-width support (e.g. 3.5-bit).
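Fractional bit-widths like 3.5-bit are typically realized by mixing two integer widths across quantization groups. A rough sketch of the effective-bits arithmetic (illustrative only, not mlx-vlm's actual codec logic):

```python
# Illustrative: a 3.5-bit effective width from mixing 3-bit and 4-bit groups.
# Generic arithmetic, not mlx-vlm's Split codec implementation.
def effective_bits(group_bits):
    """Average storage cost per value across quantization groups."""
    return sum(group_bits) / len(group_bits)

# half the groups at 3 bits, half at 4 bits -> 3.5 bits per value on average
groups = [3, 4] * 8
print(effective_bits(groups))  # 3.5
```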
I built BatchTurboQuantKVCache on top of mlx-vlm's single-request TurboQuantKVCache to support omlx's continuous batching scheduler. The KV cache is quantized immediately during prefill to reduce peak memory; during decode, tokens are quantized in batches of 32, and hybrid attention runs over the buffered fp16 tail plus the quantized state.
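The buffered-flush idea above can be sketched as follows. This is a minimal, framework-free sketch: numpy stands in for mlx arrays, simple affine int8 quantization stands in for the TurboQuant codecs, and names like `FLUSH_EVERY` are illustrative, not the real API.

```python
import numpy as np

FLUSH_EVERY = 32  # decode tokens buffered in fp16 before batch-quantizing

class SketchKVCache:
    def __init__(self):
        self.quantized = []    # list of (q, scale, zero) chunks
        self.fp16_buffer = []  # recent decode tokens kept in fp16

    def _quantize(self, x):
        # per-chunk affine uint8 quantization (stand-in for a TQ codec)
        lo, hi = x.min(), x.max()
        scale = (hi - lo) / 255.0 or 1.0
        q = np.round((x - lo) / scale).astype(np.uint8)
        return q, scale, lo

    def _dequantize(self, q, scale, zero):
        return q.astype(np.float16) * scale + zero

    def prefill(self, keys):
        # prefill tokens are quantized immediately to cap peak memory
        self.quantized.append(self._quantize(keys))

    def append_decode(self, token_kv):
        self.fp16_buffer.append(token_kv)
        if len(self.fp16_buffer) >= FLUSH_EVERY:
            # batch-quantize the buffered decode tokens in one shot
            self.quantized.append(self._quantize(np.stack(self.fp16_buffer)))
            self.fp16_buffer.clear()

    def full_state(self):
        # "hybrid attention" reads both representations: dequantized
        # chunks plus the still-fp16 tail
        parts = [self._dequantize(*c).reshape(-1, c[0].shape[-1])
                 for c in self.quantized]
        if self.fp16_buffer:
            parts.append(np.stack(self.fp16_buffer).astype(np.float16))
        return np.concatenate(parts)
```

After a 100-token prefill and 40 decode steps, the cache holds two quantized chunks (prefill plus one flush of 32) and an 8-token fp16 tail.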
Enable it from the per-model settings in the admin dashboard or via model_settings.json.
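For reference, a per-model entry might look like the following. This is a hypothetical sketch: the model ID and key names (`kv_cache_quantization`, `kv_cache_bits`) are assumptions, and the actual model_settings.json schema may differ.

```json
{
  "Qwen3.5-27B-4bit": {
    "kv_cache_quantization": "turboquant",
    "kv_cache_bits": 3
  }
}
```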
Benchmarks for Qwen3.5-27B-4bit with 3-bit TQ:
| | 32k baseline | 32k TQ | 128k baseline | 128k TQ |
|---|---|---|---|---|
| KV cache mem | 2.14 GB | 0.54 GB (-75%) | 8.14 GB | 1.70 GB (-79%) |
| Peak mem | 22.47 GB | 21.11 GB (-1.4 GB) | 37.66 GB | 33.55 GB (-4.1 GB) |
| Prefill | 362 tok/s | 353 tok/s | 238 tok/s | 226 tok/s |
| Decode | 28.4 tok/s | 17.9 tok/s | 19.4 tok/s | 7.3 tok/s |
Peak memory savings scale with context length. The decode slowdown is inherent to quantized-KV attention: TQ is designed for memory-constrained long-context workloads, not raw speed.
Bug Fixes
- Fix EXIF orientation not applied when loading images in VLM engine
- Fix chunked prefill for Gemma 4 KV-shared models
Dependencies
- Bump mlx-vlm to 43b9b20 (Gemma 4, TurboQuant)
Full changelog: v0.3.1...v0.3.2