github jundot/omlx v0.4.3

6 hours ago

This hotfix release focuses on macOS 27 compatibility, throughput recovery for Qwopus and other affected models, and Memory Guard optimization and correctness fixes.

Highlights

  • Added macOS 27 beta compatibility. oMLX now handles the larger HOST_VM_INFO64 response shape used by macOS 27 and avoids fragile psutil memory-stat paths on macOS. (#1748, #1749)
  • Fixed slow streaming decode on Qwopus and related models. Active Memory Guard polling no longer calls MLX/Metal telemetry from the background thread during active requests, removing a major source of decode stalls. (#1745)
  • Improved decode performance. In a single-run Qwen 3.6-35B-A3B tg512 check, throughput improved from 77.5 to 79.0 tok/s (+1.9%) compared with 0.4.2.
  • Improved per-model MTP behavior. MTP decode eligibility is now stored on each loaded model instance, so loading a non-MTP model later no longer disables MTP decode on an already-loaded Qwen/Qwopus MTP model. (#1758)
  • Optimized Memory Guard preflight estimates. TurboQuant KV, hybrid cache models, fused SDPA, and tiled SDPA scratch memory are now accounted for more accurately, reducing false rejections and avoiding unsafe underestimates. (#1763, #1764)
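
The Memory Guard polling change can be pictured roughly as follows. This is a minimal sketch of the general pattern only, not oMLX's actual code: the names `MemorySample`, `SampleBoard`, and `over_limit`, and the staleness cutoff, are all hypothetical. The scheduler thread records memory samples as it works, and the background enforcer reads the cached sample instead of querying runtime telemetry itself.

```python
import threading
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemorySample:
    """One memory reading recorded by the scheduler (hypothetical type)."""
    active_bytes: int
    recorded_at: float


class SampleBoard:
    """Thread-safe holder for the most recent scheduler-recorded sample."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Optional[MemorySample] = None

    def record(self, active_bytes: int) -> None:
        # Called from the scheduler thread, which already interacts with
        # the runtime as part of serving the request.
        with self._lock:
            self._latest = MemorySample(active_bytes, time.monotonic())

    def latest(self) -> Optional[MemorySample]:
        with self._lock:
            return self._latest


def over_limit(board: SampleBoard, limit_bytes: int, max_age_s: float = 5.0) -> bool:
    """Enforcer-side check that reads only the cached sample.

    Assumption for this sketch: a missing or stale sample is treated as
    "no decision" (False) rather than triggering a telemetry query.
    """
    sample = board.latest()
    if sample is None:
        return False
    if time.monotonic() - sample.recorded_at > max_age_s:
        return False
    return sample.active_bytes > limit_bytes
```

In this arrangement the enforcer loop never touches the telemetry API during an active request, which is the property the fix above describes; how oMLX handles missing or stale samples is not stated in these notes.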

Improvements and Fixes

  • Fixed Memory Guard active-request polling so scheduler-recorded MLX memory samples are reused instead of querying MLX telemetry from the enforcer loop.
  • Fixed macOS memory detection so system memory and process enforcement remain stable when HOST_VM_INFO64 sizing changes on macOS 27 beta.
  • Fixed TurboQuant KV preflight accounting so Memory Guard no longer overestimates KV peak memory by several times on TurboQuant-enabled models. (#1763)
  • Fixed preflight support for hybrid ArraysCache models with TurboQuant enabled.
  • Fixed fused SDPA memory estimation so MLX fused attention is treated as linear-memory for all head_dim values where applicable. by @fqx (#1764)
  • Added tiled SDPA scratch accounting for high-head-dimension prefill paths so large VLM/Qwen-style models are guarded more accurately.
  • Fixed prefill Memory Guard errors to return a client-visible failure path instead of surfacing as an internal server failure.
  • Fixed DFlash fallback scheduler resolution and bumped dflash-mlx for the Qwen wrapper compatibility fix.
  • Fixed Llama 4 batch cache offsets. (#1752)
  • Fixed max_completion_tokens handling as an alias for max_tokens. (#1759)
  • Fixed Harmony encoding loading by retrying transient tokenizer/encoding load failures.
  • Fixed stored MarkItDown file placeholders so existing uploaded-file references remain usable after 0.4.2. (#1750)
  • Fixed logits_processors=None handling to avoid mlx-lm crashes. by @monroewilliams (#1747)
  • Added Thaw menu bar manager support. by @youvegotmoxie (#1743)
  • Bumped the mlx-lm, mlx-vlm, and dflash-mlx pins to include upstream compatibility fixes used by this hotfix.
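
Two of the request-handling fixes above (#1759 and #1747) amount to input normalization. The sketch below illustrates the idea under stated assumptions: the function names, the default value, and the choice to prefer `max_completion_tokens` when both fields are present are hypothetical and not taken from the oMLX source.

```python
from typing import Any, Dict, List, Optional


def resolve_max_tokens(body: Dict[str, Any], default: int = 256) -> int:
    """Return the generation limit, accepting either field name.

    Assumption: max_completion_tokens wins if both are supplied, and
    `default` applies when neither is set.
    """
    for key in ("max_completion_tokens", "max_tokens"):
        value = body.get(key)
        if value is not None:
            return int(value)
    return default


def normalize_logits_processors(processors: Optional[List[Any]]) -> List[Any]:
    """Map None to an empty list so downstream code can always iterate."""
    return [] if processors is None else list(processors)
```

For example, `resolve_max_tokens({"max_completion_tokens": 64})` and `resolve_max_tokens({"max_tokens": 64})` both yield 64 in this sketch, and `normalize_logits_processors(None)` yields an empty list instead of passing `None` through to the generation call.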

Thanks

Thanks to @Collinw24, @ritbl, @orangeseasun205, @smkzw, @fqx, @monroewilliams, and @youvegotmoxie for the reports and fixes that shaped this release.

New Contributors

Thank you to @youvegotmoxie for making their first contribution in this release.

Full Changelog: v0.4.2...v0.4.3
