github jundot/omlx v0.2.23

latest release: v0.2.24.dev1
8 hours ago

Updating to 0.2.23 is strongly recommended. 0.2.22 contains critical bugs that cause crashes and memory issues with long contexts and concurrent requests. Sorry for the trouble.

v0.2.23 Release Notes

Critical Bug Fixes

  • Fix Metal buffer accumulation during prefill causing crashes — 0.2.22 disabled buffer clearing between prefill chunks, causing GPU memory to build up across chunks until the Metal driver crashes. This affected all devices but was especially severe on machines with less memory. (#410, #412, #421)
  • Fix TTFT spikes from stale Metal buffers between requests — Freed buffers accumulated in the Metal buffer pool across requests, forcing expensive emergency GC during the next prefill. (#411)
  • Fix KVCache offset mismatch in cache reconstruction — Stored meta_state offset could exceed actual tensor length after partial prefix match, causing broadcast_shapes errors on hybrid attention models (Qwen3.5) at concurrency > 1. (#409)
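
The KVCache offset fix above can be illustrated with a minimal sketch. This is not omlx's actual code: `CachedLayer`, `reconcile_offset`, and `reconstruct` are hypothetical names, and plain Python lists stand in for tensors. It shows only the general idea of clamping a stored offset to the data that is really available after a partial prefix match, so that offset and tensor length agree before any shape-sensitive operation runs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CachedLayer:
    """Hypothetical stand-in for one layer of a stored KV cache."""
    keys: List[int]      # placeholder for the key tensor, one entry per cached token
    stored_offset: int   # offset recorded in metadata when the cache was saved


def reconcile_offset(layer: CachedLayer, matched_tokens: int) -> int:
    """Return an offset that never exceeds the data actually available.

    matched_tokens is the length of the prefix that matched the new request.
    """
    available = min(len(layer.keys), matched_tokens)
    return min(layer.stored_offset, available)


def reconstruct(layer: CachedLayer, matched_tokens: int) -> CachedLayer:
    """Rebuild a cache entry whose offset and tensor length agree."""
    offset = reconcile_offset(layer, matched_tokens)
    return CachedLayer(keys=layer.keys[:offset], stored_offset=offset)
```

For example, a layer saved with 100 tokens and a stored offset of 100, reused for a request that matches only the first 64 tokens, would be rebuilt with 64 entries and an offset of 64 instead of keeping the stale value.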

Bug Fixes

  • Fix MoE router gate quantization causing model load failure
  • Fix TurboQuant KV cache conversion missing in cache-merge prefill path (#422)
  • Disable experimental TurboQuant feature pending further optimization

Improvements

  • oQ: Enhance bit allocation strategy
  • oQ: Enable enhanced quantization for Nemotron-H models
