jundot/omlx v0.2.23 on GitHub

Updating to 0.2.23 is strongly recommended. 0.2.22 contains critical bugs that cause crashes and memory issues with long contexts and concurrent requests. Sorry for the trouble.

v0.2.23 Release Notes

Critical Bug Fixes

Fix Metal buffer accumulation during prefill causing crashes: 0.2.22 disabled buffer clearing between prefill chunks, so GPU memory built up across chunks until the Metal driver crashed. This affected all devices but was especially severe on machines with less memory. (#410, #412, #421)
Fix time-to-first-token (TTFT) spikes from stale Metal buffers between requests: freed buffers accumulated in the Metal buffer pool across requests, forcing an expensive emergency garbage collection during the next prefill. (#411)
Fix KVCache offset mismatch in cache reconstruction: the stored meta_state offset could exceed the actual tensor length after a partial prefix match, causing broadcast_shapes errors on hybrid attention models (Qwen3.5) at concurrency > 1. (#409)
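
To illustrate the first fix above, here is a toy sketch of why clearing buffers between prefill chunks bounds peak memory. It is not oMLX's implementation: `ToyBufferPool`, `chunked`, and `prefill` are hypothetical names, and the pool is a plain counter standing in for the Metal buffer pool, assuming each token needs one unit of scratch memory.

```python
from typing import List, Sequence


def chunked(tokens: Sequence[int], chunk_size: int) -> List[Sequence[int]]:
    """Split a prompt into fixed-size prefill chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


class ToyBufferPool:
    """Hypothetical stand-in for a GPU buffer pool; not a real Metal or MLX API."""

    def __init__(self) -> None:
        self.held = 0  # units of scratch memory currently cached in the pool
        self.peak = 0  # highest value `held` has reached so far

    def allocate(self, units: int) -> None:
        self.held += units
        self.peak = max(self.peak, self.held)

    def clear(self) -> None:
        self.held = 0


def prefill(tokens: Sequence[int], chunk_size: int,
            pool: ToyBufferPool, clear_between_chunks: bool) -> int:
    """Run a simulated chunked prefill and return the pool's peak usage."""
    for chunk in chunked(tokens, chunk_size):
        # Assumption: each token costs one unit of scratch memory.
        pool.allocate(len(chunk))
        if clear_between_chunks:
            # Release cached buffers before the next chunk is processed.
            pool.clear()
    return pool.peak


prompt = list(range(10))
peak_without_clearing = prefill(prompt, 4, ToyBufferPool(), clear_between_chunks=False)
peak_with_clearing = prefill(prompt, 4, ToyBufferPool(), clear_between_chunks=True)
```

In this model, peak usage without clearing grows with the full prompt length, while with clearing it stays at one chunk's worth. In an MLX-based server the clearing step might map to a call such as `mx.clear_cache()`; these notes do not say which call oMLX uses.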

Bug Fixes

Fix MoE router gate quantization causing model load failure
Fix TurboQuant KV cache conversion missing in cache-merge prefill path (#422)
Disable experimental TurboQuant feature pending further optimization

Improvements

oQ: Enhance bit allocation strategy
oQ: Enable enhanced quantization for Nemotron-H models