jundot/omlx v0.4.3 on GitHub

This hotfix release focuses on macOS 27 compatibility, throughput recovery for Qwopus and other affected models, and Memory Guard optimization and correctness fixes.

Highlights

Added macOS 27 beta compatibility. oMLX now handles the larger HOST_VM_INFO64 response shape used by macOS 27 and avoids fragile psutil memory-stat paths on macOS. (#1748, #1749)
Fixed slow streaming decode on Qwopus and related models. Active Memory Guard polling no longer calls MLX/Metal telemetry from the background thread during active requests, removing a major source of decode stalls. (#1745)
Improved decode performance. In a single-run Qwen 3.6-35B-A3B tg512 check, throughput improved from 77.5 to 79.0 tok/s (+1.9%) compared with 0.4.2.
Improved per-model MTP behavior. MTP decode eligibility is now stored on each loaded model instance, so loading a non-MTP model later no longer disables MTP decode on an already-loaded Qwen/Qwopus MTP model. (#1758)
Optimized Memory Guard preflight estimates. TurboQuant KV, hybrid cache models, fused SDPA, and tiled SDPA scratch memory are now accounted for more accurately, reducing false rejections and avoiding unsafe underestimates. (#1763, #1764)
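
As a quick sanity check on the decode throughput figure quoted above, the relative change can be recomputed from the two tok/s values. This is only an arithmetic sketch; the helper name `relative_change_percent` is made up for illustration and is not part of oMLX.

```python
def relative_change_percent(before: float, after: float) -> float:
    """Return the percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


# Figures quoted in the highlight above (tok/s, single run, tg512).
change = relative_change_percent(77.5, 79.0)
print(f"{change:+.1f}%")  # prints "+1.9%"
```

Because the numbers come from a single run, the difference is best read as indicative rather than as a measured average.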

Improvements and Fixes

Fixed Memory Guard active-request polling so scheduler-recorded MLX memory samples are reused instead of querying MLX telemetry from the enforcer loop.
Fixed macOS memory detection so system memory and process enforcement remain stable when HOST_VM_INFO64 sizing changes on macOS 27 beta.
Fixed TurboQuant KV preflight accounting so Memory Guard no longer overestimates KV peak memory by several times on TurboQuant-enabled models. (#1763)
Fixed preflight support for hybrid ArraysCache models with TurboQuant enabled.
Fixed fused SDPA memory estimation so MLX fused attention is treated as linear-memory for all head_dim values where applicable. by @fqx (#1764)
Added tiled SDPA scratch accounting for high-head-dimension prefill paths so large VLM/Qwen-style models are guarded more accurately.
Fixed prefill Memory Guard errors so they return a client-visible failure instead of surfacing as an internal server failure.
Fixed DFlash fallback scheduler resolution and bumped dflash-mlx for the Qwen wrapper compatibility fix.
Fixed Llama 4 batch cache offsets. (#1752)
Fixed max_completion_tokens handling as an alias for max_tokens. (#1759)
Fixed Harmony encoding loading by retrying transient tokenizer/encoding load failures.
Fixed stored MarkItDown file placeholders so existing uploaded-file references remain usable after 0.4.2. (#1750)
Fixed logits_processors=None handling to avoid mlx-lm crashes. by @monroewilliams (#1747)
Added Thaw menu bar manager support. by @youvegotmoxie (#1743)
Bumped the mlx-lm, mlx-vlm, and dflash-mlx pins to include upstream compatibility fixes used by this hotfix.
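
To illustrate what treating max_completion_tokens as an alias for max_tokens can look like on the request side, here is a minimal sketch. It is not the oMLX implementation: the function name `resolve_max_tokens` is hypothetical, and the precedence rule (the newer name wins when both are supplied) is an assumption that the release notes do not state.

```python
from typing import Any, Mapping, Optional


def resolve_max_tokens(request: Mapping[str, Any]) -> Optional[int]:
    """Pick the generation limit from a request body (hypothetical helper).

    Assumption: `max_completion_tokens` takes precedence over `max_tokens`
    when both are present; the release notes do not specify this.
    """
    for key in ("max_completion_tokens", "max_tokens"):
        value = request.get(key)
        if value is not None:
            return int(value)
    # Neither field was supplied, so the caller applies its own default.
    return None
```

With this shape, a client that sends only max_completion_tokens and a client that sends only max_tokens both end up with the same limit.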

Thanks

Thanks to @Collinw24, @ritbl, @orangeseasun205, @smkzw, @fqx, @monroewilliams, and @youvegotmoxie for the reports and fixes that shaped this release.

New Contributors

Thank you to @youvegotmoxie for making their first contribution in this release.

Full Changelog: v0.4.2...v0.4.3
