Updating to 0.2.23 is strongly recommended. 0.2.22 contains critical bugs that cause crashes and memory issues with long contexts and concurrent requests. Sorry for the trouble.
v0.2.23 Release Notes
Critical Bug Fixes
- Fix Metal buffer accumulation during prefill causing crashes — 0.2.22 disabled buffer clearing between prefill chunks, causing GPU memory to build up across chunks until the Metal driver crashes. This affected all devices but was especially severe on machines with less memory. (#410, #412, #421)
- Fix TTFT spikes from stale Metal buffers between requests — Freed buffers accumulated in the Metal buffer pool across requests, forcing expensive emergency GC during the next prefill. (#411)
- Fix KVCache offset mismatch in cache reconstruction — Stored meta_state offset could exceed actual tensor length after partial prefix match, causing broadcast_shapes errors on hybrid attention models (Qwen3.5) at concurrency > 1. (#409)
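
The KVCache offset mismatch in the last item can be illustrated with a minimal sketch. This is not the project's actual code: `CacheEntry`, `reconcile_offset`, and the list-backed `keys` field are hypothetical names used only for illustration. The one assumption taken from the note is that a stored offset can be larger than the sequence length actually restored after a partial prefix match. Clamping the offset is one plausible way to restore that invariant; the real fix may work differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheEntry:
    """Hypothetical stand-in for one reconstructed KV cache layer."""
    keys: List[int]      # one element per cached token (placeholder for a tensor)
    stored_offset: int   # offset recorded in the saved metadata


def reconcile_offset(entry: CacheEntry) -> int:
    """Return an offset that never exceeds the cached sequence length."""
    actual_len = len(entry.keys)
    return min(entry.stored_offset, actual_len)


# Metadata claims 128 tokens, but only 96 matched the prefix and were restored.
entry = CacheEntry(keys=list(range(96)), stored_offset=128)
print(reconcile_offset(entry))  # prints 96
```

When the stored offset is already within bounds, the function returns it unchanged, so the check is safe to apply on every reconstruction.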
Bug Fixes
- Fix MoE router gate quantization causing model load failure
- Fix TurboQuant KV cache conversion missing in cache-merge prefill path (#422)
- Disable experimental TurboQuant feature pending further optimization
Improvements
- oQ: Enhance bit allocation strategy
- oQ: Enable enhanced quantization for Nemotron-H models