github jundot/omlx v0.3.11

7 hours ago

This release focuses on stability work rather than new features (apologies to everyone whose PRs are still in the queue). I'm well aware the low-memory configurations have been especially fragile, and I treat fixing that as a higher priority than anything else right now.

The headline of 0.3.11 is a full rewrite of the memory guard. It now reads the user's live available memory and the chosen preference (Safe / Balanced / Aggressive) and uses as much memory as it safely can without errors — adapting in real time as other apps come and go. To everyone who has hit OOMs or kernel panics, I'm sorry for the trouble. Please give this version a try.

Memory guard rewrite

  • The two confusing sliders are gone. Memory Limit (Total) and Memory Limit (Models Only) are removed entirely. The admin UI now exposes a single dropdown next to the Memory guard toggle, with the options Safe / Balanced / Aggressive, that picks how much headroom to leave for the OS and other apps.
  • Hard ceiling now considers everything. It is min(static_reserve, live_available, metal_cap). The dynamic part re-reads psutil.virtual_memory().available every poll, so oMLX shrinks when other apps grab memory and grows again when they release it. The Metal part reads iogpu.wired_limit_mb (or Apple's max_recommended_working_set_size when the sysctl is unset), so oMLX never plans an allocation Metal would refuse (#1397 inspired this direction — thanks @mikael-johansson).
  • Adaptive prefill throttle. When memory enters the caution zone, prefill chunks step down through 1024 → 512 → 256 → 128 instead of crashing.
  • Self-raising Metal wired limit. At startup oMLX raises its own per-process wired limit to the chosen ceiling via mx.set_wired_limit, so Metal accepts allocations within the ceiling. If the kernel's iogpu.wired_limit_mb setting is below what oMLX wants, the admin dashboard prints a red bold warning with the exact sudo sysctl iogpu.wired_limit_mb=N command to copy.
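The ceiling formula and the prefill step-down described above can be sketched roughly as follows. This is a minimal illustration, not oMLX's actual code: the per-tier fractions, the 0.85 caution threshold, and every function name here are assumptions made for the example. In a real process the live figure would come from psutil.virtual_memory().available and the Metal cap from iogpu.wired_limit_mb; here they are passed in as plain numbers so the sketch runs anywhere.

```python
GIB = 1024 ** 3

# Hypothetical usable fraction of total RAM per tier; the release notes do not
# state the real values behind Safe / Balanced / Aggressive.
USABLE_FRACTION = {"safe": 0.70, "balanced": 0.80, "aggressive": 0.90}

# Prefill chunk sizes named in the notes, largest first.
CHUNK_STEPS = (1024, 512, 256, 128)


def hard_ceiling(total_bytes, live_available_bytes, metal_cap_bytes, tier="balanced"):
    """Return min(static_reserve, live_available, metal_cap) in bytes."""
    # Assumption: the static term is a fixed share of total RAM chosen by tier.
    static_reserve = int(total_bytes * USABLE_FRACTION[tier])
    return min(static_reserve, live_available_bytes, metal_cap_bytes)


def in_caution_zone(used_bytes, ceiling_bytes, caution_fraction=0.85):
    """True once usage reaches a (hypothetical) fraction of the ceiling."""
    return used_bytes >= caution_fraction * ceiling_bytes


def next_chunk_size(current, caution):
    """Step the prefill chunk size down one level while in the caution zone."""
    if not caution:
        # Stepping back up is not described in the notes, so it is not modelled.
        return current
    for step in CHUNK_STEPS:
        if step < current:
            return step
    return CHUNK_STEPS[-1]  # already at the smallest chunk size
```

With 64 GiB of RAM, 40 GiB live-available, and a 48 GiB Metal cap, hard_ceiling returns 40 GiB on the Balanced tier because the live term is the smallest. At 36 GiB in use, in_caution_zone is true under the assumed threshold, so next_chunk_size(1024, True) steps down to 512.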

[Screenshot: Memory guard tier dropdown with live breakdown and the sysctl-raise hint]
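The sysctl-raise hint amounts to a comparison between the kernel limit and the ceiling oMLX wants. A rough sketch of that check is below; the function name and the treatment of an unset sysctl (passed as None, producing no hint) are assumptions for illustration, and only the command string itself comes from the notes.

```python
from typing import Optional


def wired_limit_hint(kernel_limit_mb: Optional[int], wanted_mb: int) -> Optional[str]:
    """Return the sysctl command to show, or None when no raise is needed.

    kernel_limit_mb is the current iogpu.wired_limit_mb value. None stands for
    "unset" in this sketch, in which case no hint is produced (an assumption).
    """
    if kernel_limit_mb is None or kernel_limit_mb >= wanted_mb:
        return None
    return f"sudo sysctl iogpu.wired_limit_mb={wanted_mb}"
```

For example, a kernel limit of 24576 MB with a wanted ceiling of 40960 MB yields the string to copy, while a limit at or above the wanted value yields nothing.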

Bug Fixes

  • Stuck PP requests after force-stop: when prefill was force-stopped because the hard limit was exceeded, the request slot was never released and the dashboard showed it as still running. State is now cleaned up on the RuntimeError path (#1405).
  • VLM losing vision after dflash hot-swap: for quantized MTP models with MTP disabled, the multimodal projector was stripped on engine restart. The MTPModule is now attached regardless of mtp_enabled so vision survives (#1404).
  • Qwen3.6 VLMs failing to load without MTP heads: the mlx-vlm MoE sanitize wasn't applied to VLM variants without MTP, so weights mismatched. Sanitize is now applied universally (#1412, thanks @samfenwick).
  • Gemma 4 multi-image cache miss: per-image cache lookup didn't fall back to the whole-request key, so multi-image prompts re-encoded every image on every request. Lookup now mirrors single-image with a whole-request fallback (#1417).
  • Embedding engine memory growth: Qwen3-Embedding-0.6B-4bit-DWQ accumulated MLX cache between requests until the process was killed. MLX cache is now cleared per request, not only when the engine goes idle (#684).
  • VLM MTP transient memory spikes: cache wasn't being cleared between MTP rounds, so peak memory could climb above the ceiling mid-generation. It is now cleared every round.
  • Admin model_dirs not propagating at runtime: changing the model directory through the admin UI didn't reach OQManager / HFUploader, so they kept writing to the old path. Both now receive the live value.
  • Noisy [memcheck:*] prefill log: the chunk-end memory check spammed the log on every chunk. It now logs only when memory enters the caution zone.
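To make the Gemma 4 multi-image fix concrete, here is a small sketch of a per-image lookup that falls back to a whole-request key. The key scheme (SHA-256 over image bytes), the dict-based cache, and all names are hypothetical; the notes only say that the per-image lookup now has a whole-request fallback.

```python
import hashlib


def image_key(image_bytes):
    """Hypothetical per-image cache key."""
    return hashlib.sha256(image_bytes).hexdigest()


def request_key(images):
    """Hypothetical whole-request key: a hash over the ordered image hashes."""
    h = hashlib.sha256()
    for img in images:
        h.update(hashlib.sha256(img).digest())
    return h.hexdigest()


def lookup_features(cache, images):
    """Try per-image entries first, then fall back to the whole-request entry.

    Returns a list of cached features, or None when both lookups miss and the
    images have to be re-encoded.
    """
    per_image = [cache.get(image_key(img)) for img in images]
    if all(feat is not None for feat in per_image):
        return per_image
    # Without this fallback, a request cached only under its whole-request key
    # would miss here and every image would be re-encoded.
    return cache.get(request_key(images))
```

The fallback is the part that matters: without it, a multi-image prompt cached under the whole-request key never hits.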

Observability

  • ssd_write_drops counter: surfaces how often the SSD write queue saturates and drops blocks under load, which used to be invisible (#1406, thanks @ivaniguarans).
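As an illustration of what such a counter measures, the sketch below wraps a bounded queue and counts blocks that are dropped when it is full. The class name and structure are hypothetical and do not describe oMLX's internals; only the counter name ssd_write_drops is taken from the notes.

```python
import queue
import threading


class SSDWriteQueue:
    """Bounded write queue that drops blocks when full and counts the drops."""

    def __init__(self, maxsize):
        self._q = queue.Queue(maxsize=maxsize)
        self._lock = threading.Lock()
        self.ssd_write_drops = 0

    def submit(self, block):
        """Enqueue without blocking; return False and count a drop if full."""
        try:
            self._q.put_nowait(block)
            return True
        except queue.Full:
            with self._lock:  # keep the increment safe across producer threads
                self.ssd_write_drops += 1
            return False
```

A counter of this kind turns silent drops into a number that can be graphed or alerted on.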

New Contributors

Thank you to everyone making their first contribution in 0.3.11:

@samfenwick, @mikael-johansson.
