github jundot/omlx v0.2.21

v0.2.21 Release Notes

Highlights

TurboQuant KV Cache (experimental)

This is an experimental feature and may not work correctly in all scenarios.

Codebook-quantized KV cache that compresses key-value states during generation. Based on TurboQuant — random orthogonal rotation + Beta distribution codebook + boundary-based scalar quantization.

How it works: Prefill runs at full fp16 speed (no quality loss). At the first decode token, the accumulated KV cache is quantized to 3-bit or 4-bit codebook indices. Decode attention uses a fused 2-pass Flash Attention Metal kernel that reads directly from packed indices — no dequantization, no fp16 intermediate tensors.
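The rotate-then-quantize step described above can be sketched in a few lines of NumPy. This is an illustrative approximation, not the fused Metal kernel: the release uses a Beta-distribution codebook, while this sketch substitutes per-tensor quantile levels, and all function names here are made up.

```python
import numpy as np

def random_rotation(d, seed=0):
    # Random orthogonal matrix via QR of a Gaussian (illustrative choice)
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize_kv(kv, bits=3, seed=0):
    """Sketch of rotate -> boundary-based scalar quantize -> store indices.
    kv: (tokens, head_dim) float array. Returns (indices, codebook, rotation)."""
    d = kv.shape[-1]
    rot = random_rotation(d, seed)
    x = kv @ rot                                  # rotation spreads energy across dims
    levels = 2 ** bits
    # Illustrative codebook from data quantiles (the real one is Beta-derived)
    codebook = np.quantile(x, (np.arange(levels) + 0.5) / levels)
    boundaries = (codebook[:-1] + codebook[1:]) / 2
    idx = np.searchsorted(boundaries, x).astype(np.uint8)  # boundary-based assignment
    return idx, codebook, rot

def dequantize_kv(idx, codebook, rot):
    # Inverse rotation of an orthogonal matrix is its transpose
    return codebook[idx] @ rot.T
```

In the actual kernel the decode path never materializes the dequantized tensor; attention reads the packed indices directly.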

Needle-in-Haystack (Qwen3.5-35B-A3B, 3-bit TurboQuant)

Context  Baseline KV  TurboQuant KV  Reduction
32K      735 MB       195 MB         73%
64K      1407 MB      327 MB         77%
128K     2749 MB      589 MB         79%
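As a quick sanity check, the reduction column follows directly from the two memory columns. Note the residual size stays above the ideal 3/16 ≈ 19% of fp16, presumably reflecting codebook and packing overhead.

```python
# Reduction implied by the table: 1 - TurboQuant / baseline, in percent
baseline = {"32K": 735, "64K": 1407, "128K": 2749}   # MB, fp16 KV cache
turbo    = {"32K": 195, "64K": 327,  "128K": 589}    # MB, 3-bit TurboQuant
reduction = {k: round(100 * (1 - turbo[k] / baseline[k])) for k in baseline}
print(reduction)  # {'32K': 73, '64K': 77, '128K': 79}
```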

Performance

Model            Prefill  Decode
Qwen3.5-35B-A3B  95%      87%
Qwen3.5-27B      97%      95%

Speeds are percentages of fp16 baseline performance.

Enable it via admin UI → model settings → experimental features → TurboQuant KV Cache.

oQ+ — enhanced quantization with GPTQ weight optimization

oQ+ adds Hessian-based GPTQ error compensation on top of oQ's sensitivity-driven mixed-precision system. Standard quantization rounds each weight independently. GPTQ processes columns sequentially and adjusts remaining weights to compensate for rounding error, guided by the inverse Hessian of calibration inputs. The result is the same mlx-lm compatible format — identical structure, identical inference speed — but with substantially lower output error.
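The column-sequential compensation can be sketched as follows. This is a generic GPTQ-style sketch under simplifying assumptions (dense inverse Hessian, uniform grid, no column reordering), not omlx's implementation; all names are illustrative.

```python
import numpy as np

def gptq_quantize(W, X, grid, damp=0.01):
    """GPTQ-style sketch: quantize columns left to right, pushing each column's
    rounding error into the not-yet-quantized columns via the inverse Hessian.
    W: (rows, cols) weights; X: (samples, cols) calibration inputs; grid: 1D levels."""
    cols = W.shape[1]
    H = X.T @ X / len(X)
    H += damp * np.mean(np.diag(H)) * np.eye(cols)   # damping for invertibility
    Hinv = np.linalg.inv(H)
    W = W.copy()
    Q = np.zeros_like(W)
    for i in range(cols):
        # Round column i to the nearest grid point
        q = grid[np.abs(W[:, [i]] - grid).argmin(axis=1)]
        Q[:, i] = q
        # Compensate: adjust remaining columns to absorb the rounding error
        err = (W[:, i] - q) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q
```

On correlated calibration data this typically yields lower output error `‖X(W − Q)ᵀ‖` than rounding every weight independently, while the stored format is unchanged.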

For MoE models, routed experts (90%+ of total parameters) are processed with a batched algorithm: all experts in a layer share the same Hessian, so column-by-column optimization runs on all experts simultaneously. For Qwen3.5-35B-A3B (256 experts × 40 layers), this takes ~6 minutes versus ~90 minutes sequentially, a 15x speedup with identical results.
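The batched-expert variant is the same column loop vectorized over a leading expert axis, with the shared inverse Hessian computed once per layer. Again a hypothetical sketch, not omlx's code:

```python
import numpy as np

def gptq_quantize_experts(W, Hinv, grid):
    """Batched GPTQ sketch. W: (experts, rows, cols); Hinv: shared (cols, cols)
    inverse Hessian. Each column step updates every expert simultaneously."""
    W = W.copy()
    Q = np.zeros_like(W)
    cols = W.shape[-1]
    for i in range(cols):
        # Nearest grid point per weight, across all experts at once
        q = grid[np.abs(W[..., i:i + 1] - grid).argmin(axis=-1)]   # (experts, rows)
        Q[..., i] = q
        err = (W[..., i] - q) / Hinv[i, i]
        W[..., i + 1:] -= err[..., None] * Hinv[i, i + 1:]
    return Q
```

Because the per-expert arithmetic is unchanged, the batched pass produces exactly the same quantized weights as processing each expert sequentially, which is where the claimed speedup without result drift comes from.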

Supported architectures: Qwen3.5 MoE/dense, MiniMax-M2.5, GLM, Step-3.5, Nemotron-Cascade, Llama/Mistral, VLM (vision weights kept fp16). See the oQ documentation for details.

Bug Fixes

  • Fix VLM cache proxy to use the authoritative mx.array offset instead of the unreliable _offset shortcut
  • Fix monotonic _offset handling for BatchRotatingKVCache in the VLM proxy
  • Fix disk-full logging for SSD cache writes (#342)
  • Fix LoRA adapter directories appearing in model discovery and admin downloads (#356)
  • Fix generation memory guard with Metal cache cleanup on OOM failure (#372)
  • Fix function_call_output to accept list/dict input and serialize it to a JSON string (#367)
  • Fix download popup menu z-index in accuracy benchmark UI (#370)
  • Fix oq_calibration_data.json missing from package data (#374)
  • Fix chat_template_kwargs merging in eval to prevent a duplicate-kwarg error
  • Fix TemplateResponse calls for Starlette 1.0 compatibility (#351)
  • Fix Homebrew formula HEAD install support

Full changelog: v0.2.20...v0.2.21
