github jundot/omlx v0.2.21

v0.2.21 Release Notes

Highlights

TurboQuant KV Cache (experimental)

This is an experimental feature and may not work correctly in all scenarios.

Codebook-quantized KV cache that compresses key-value states during generation. Based on TurboQuant — random orthogonal rotation + Beta distribution codebook + boundary-based scalar quantization.

How it works: Prefill runs at full fp16 speed (no quality loss). At the first decode token, the accumulated KV cache is quantized to 3-bit or 4-bit codebook indices. Decode attention uses a fused 2-pass Flash Attention Metal kernel that reads directly from packed indices — no dequantization, no fp16 intermediate tensors.
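The rotate-then-quantize step described above can be sketched in a few lines of NumPy. This is an illustrative approximation, not the fused Metal kernel: the release uses a Beta-distribution codebook, while this sketch substitutes per-tensor quantile levels, and all function names here are made up.

```python
import numpy as np

def random_rotation(d, seed=0):
    # Random orthogonal matrix via QR of a Gaussian (illustrative choice)
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize_kv(kv, bits=3, seed=0):
    """Sketch of rotate -> boundary-based scalar quantize -> store indices.
    kv: (tokens, head_dim) float array. Returns (indices, codebook, rotation)."""
    d = kv.shape[-1]
    rot = random_rotation(d, seed)
    x = kv @ rot                                  # rotation spreads energy across dims
    levels = 2 ** bits
    # Illustrative codebook from data quantiles (the real one is Beta-derived)
    codebook = np.quantile(x, (np.arange(levels) + 0.5) / levels)
    boundaries = (codebook[:-1] + codebook[1:]) / 2
    idx = np.searchsorted(boundaries, x).astype(np.uint8)  # boundary-based assignment
    return idx, codebook, rot

def dequantize_kv(idx, codebook, rot):
    # Inverse rotation of an orthogonal matrix is its transpose
    return codebook[idx] @ rot.T
```

In the actual kernel the decode path never materializes the dequantized tensor; attention reads the packed indices directly.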

Needle-in-Haystack (Qwen3.5-35B-A3B, 3-bit TurboQuant)

Context  Baseline KV  TurboQuant KV  Reduction
32K      735 MB       195 MB         73%
64K      1407 MB      327 MB         77%
128K     2749 MB      589 MB         79%
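As a quick sanity check, the reduction column follows directly from the two memory columns. Note the residual size stays above the ideal 3/16 ≈ 19% of fp16, presumably reflecting codebook and packing overhead.

```python
# Reduction implied by the table: 1 - TurboQuant / baseline, in percent
baseline = {"32K": 735, "64K": 1407, "128K": 2749}   # MB, fp16 KV cache
turbo    = {"32K": 195, "64K": 327,  "128K": 589}    # MB, 3-bit TurboQuant
reduction = {k: round(100 * (1 - turbo[k] / baseline[k])) for k in baseline}
print(reduction)  # {'32K': 73, '64K': 77, '128K': 79}
```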

Performance

Model            Prefill  Decode
Qwen3.5-35B-A3B  95%      87%
Qwen3.5-27B      97%      95%

Speeds are percentages of fp16 baseline performance.

Enable it via admin UI → model settings → experimental features → TurboQuant KV Cache.

oQ+ — enhanced quantization with GPTQ weight optimization

oQ+ adds Hessian-based GPTQ error compensation on top of oQ's sensitivity-driven mixed-precision system. Standard quantization rounds each weight independently. GPTQ processes columns sequentially and adjusts remaining weights to compensate for rounding error, guided by the inverse Hessian of calibration inputs. The result is the same mlx-lm compatible format — identical structure, identical inference speed — but with substantially lower output error.
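The column-sequential compensation can be sketched as follows. This is a generic GPTQ-style sketch under simplifying assumptions (dense inverse Hessian, uniform grid, no column reordering), not omlx's implementation; all names are illustrative.

```python
import numpy as np

def gptq_quantize(W, X, grid, damp=0.01):
    """GPTQ-style sketch: quantize columns left to right, pushing each column's
    rounding error into the not-yet-quantized columns via the inverse Hessian.
    W: (rows, cols) weights; X: (samples, cols) calibration inputs; grid: 1D levels."""
    cols = W.shape[1]
    H = X.T @ X / len(X)
    H += damp * np.mean(np.diag(H)) * np.eye(cols)   # damping for invertibility
    Hinv = np.linalg.inv(H)
    W = W.copy()
    Q = np.zeros_like(W)
    for i in range(cols):
        # Round column i to the nearest grid point
        q = grid[np.abs(W[:, [i]] - grid).argmin(axis=1)]
        Q[:, i] = q
        # Compensate: adjust remaining columns to absorb the rounding error
        err = (W[:, i] - q) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q
```

On correlated calibration data this typically yields lower output error `‖X(W − Q)ᵀ‖` than rounding every weight independently, while the stored format is unchanged.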

For MoE models, routed experts (90%+ of total parameters) are processed with a batched algorithm: all experts in a layer share the same Hessian, so column-by-column optimization runs on all experts simultaneously. For Qwen3.5-35B-A3B (256 experts × 40 layers), this takes ~6 minutes versus ~90 minutes sequentially, a 15x speedup with identical results.
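The batched-expert variant is the same column loop vectorized over a leading expert axis, with the shared inverse Hessian computed once per layer. Again a hypothetical sketch, not omlx's code:

```python
import numpy as np

def gptq_quantize_experts(W, Hinv, grid):
    """Batched GPTQ sketch. W: (experts, rows, cols); Hinv: shared (cols, cols)
    inverse Hessian. Each column step updates every expert simultaneously."""
    W = W.copy()
    Q = np.zeros_like(W)
    cols = W.shape[-1]
    for i in range(cols):
        # Nearest grid point per weight, across all experts at once
        q = grid[np.abs(W[..., i:i + 1] - grid).argmin(axis=-1)]   # (experts, rows)
        Q[..., i] = q
        err = (W[..., i] - q) / Hinv[i, i]
        W[..., i + 1:] -= err[..., None] * Hinv[i, i + 1:]
    return Q
```

Because the per-expert arithmetic is unchanged, the batched pass produces exactly the same quantized weights as processing each expert sequentially, which is where the claimed speedup without result drift comes from.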

Supported architectures: Qwen3.5 MoE/dense, MiniMax-M2.5, GLM, Step-3.5, Nemotron-Cascade, Llama/Mistral, VLM (vision weights kept fp16). See the oQ documentation for details.

Bug Fixes

  • Fix VLM cache proxy to use the authoritative mx.array offset instead of the unreliable _offset shortcut
  • Fix monotonic _offset handling for BatchRotatingKVCache in the VLM proxy
  • Fix disk-full logging for SSD cache writes (#342)
  • Fix LoRA adapter directories appearing in model discovery and admin downloads (#356)
  • Fix generation memory guard with Metal cache cleanup on OOM failure (#372)
  • Fix function_call_output to accept list/dict input and serialize it to a JSON string (#367)
  • Fix download popup menu z-index in accuracy benchmark UI (#370)
  • Fix oq_calibration_data.json missing from package data (#374)
  • Fix chat_template_kwargs merging in eval to prevent a duplicate-kwarg error
  • Fix TemplateResponse calls for Starlette 1.0 compatibility (#351)
  • Fix Homebrew formula HEAD install support

Full changelog: v0.2.20...v0.2.21
