Highlights
Gemma 4 support
Bumped mlx-vlm to 43b9b20, which adds Gemma 4 vision, audio, and MoE model support, along with chunked-prefill fixes for KV-shared models.
TurboQuant is back
Based on @Blaizzy's TurboQuant integration in mlx-vlm. omlx imports mlx-vlm's multi-codec engine directly: Prod, MSE, Polar, and Split codecs, with fractional bit-width support (e.g. 3.5-bit).
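Fractional bit-widths like 3.5-bit are typically realized by mixing two integer widths across quantization groups. A rough sketch of the effective-bits arithmetic (illustrative only, not mlx-vlm's actual codec logic):

```python
# Illustrative: a 3.5-bit effective width from mixing 3-bit and 4-bit groups.
# Generic arithmetic, not mlx-vlm's Split codec implementation.
def effective_bits(group_bits):
    """Average storage cost per value across quantization groups."""
    return sum(group_bits) / len(group_bits)

# half the groups at 3 bits, half at 4 bits -> 3.5 bits per value on average
groups = [3, 4] * 8
print(effective_bits(groups))  # 3.5
```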
I built BatchTurboQuantKVCache on top of mlx-vlm's single-request TurboQuantKVCache to support omlx's continuous batching scheduler. The KV cache is quantized immediately during prefill to reduce peak memory; during decode, tokens are quantized in batches of 32, and hybrid attention runs over the buffered fp16 tail plus the quantized state.
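The buffered-flush idea above can be sketched as follows. This is a minimal, framework-free sketch: numpy stands in for mlx arrays, simple affine int8 quantization stands in for the TurboQuant codecs, and names like `FLUSH_EVERY` are illustrative, not the real API.

```python
import numpy as np

FLUSH_EVERY = 32  # decode tokens buffered in fp16 before batch-quantizing

class SketchKVCache:
    def __init__(self):
        self.quantized = []    # list of (q, scale, zero) chunks
        self.fp16_buffer = []  # recent decode tokens kept in fp16

    def _quantize(self, x):
        # per-chunk affine uint8 quantization (stand-in for a TQ codec)
        lo, hi = x.min(), x.max()
        scale = (hi - lo) / 255.0 or 1.0
        q = np.round((x - lo) / scale).astype(np.uint8)
        return q, scale, lo

    def _dequantize(self, q, scale, zero):
        return q.astype(np.float16) * scale + zero

    def prefill(self, keys):
        # prefill tokens are quantized immediately to cap peak memory
        self.quantized.append(self._quantize(keys))

    def append_decode(self, token_kv):
        self.fp16_buffer.append(token_kv)
        if len(self.fp16_buffer) >= FLUSH_EVERY:
            # batch-quantize the buffered decode tokens in one shot
            self.quantized.append(self._quantize(np.stack(self.fp16_buffer)))
            self.fp16_buffer.clear()

    def full_state(self):
        # "hybrid attention" reads both representations: dequantized
        # chunks plus the still-fp16 tail
        parts = [self._dequantize(*c).reshape(-1, c[0].shape[-1])
                 for c in self.quantized]
        if self.fp16_buffer:
            parts.append(np.stack(self.fp16_buffer).astype(np.float16))
        return np.concatenate(parts)
```

After a 100-token prefill and 40 decode steps, the cache holds two quantized chunks (prefill plus one flush of 32) and an 8-token fp16 tail.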
Enable it from the per-model settings in the admin dashboard or via model_settings.json.
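For reference, a per-model entry might look like the following. This is a hypothetical sketch: the model ID and key names (`kv_cache_quantization`, `kv_cache_bits`) are assumptions, and the actual model_settings.json schema may differ.

```json
{
  "Qwen3.5-27B-4bit": {
    "kv_cache_quantization": "turboquant",
    "kv_cache_bits": 3
  }
}
```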
Benchmarks for Qwen3.5-27B-4bit with 3-bit TQ:
| | 32k baseline | 32k TQ | 128k baseline | 128k TQ |
|---|---|---|---|---|
| KV cache mem | 2.14 GB | 0.54 GB (-75%) | 8.14 GB | 1.70 GB (-79%) |
| Peak mem | 22.47 GB | 21.11 GB (-1.4 GB) | 37.66 GB | 33.55 GB (-4.1 GB) |
| Prefill | 362 tok/s | 353 tok/s | 238 tok/s | 226 tok/s |
| Decode | 28.4 tok/s | 17.9 tok/s | 19.4 tok/s | 7.3 tok/s |
Peak memory savings scale with context length. The decode slowdown is inherent to quantized-KV attention: TQ is designed for memory-constrained long-context workloads, not raw speed.
Bug Fixes
- Fix EXIF orientation not applied when loading images in VLM engine
- Fix chunked prefill for Gemma 4 KV-shared models
Dependencies
- Bump mlx-vlm to 43b9b20 (Gemma 4, TurboQuant)
Full changelog: v0.3.1...v0.3.2