jundot/omlx v0.4.4.dev1 on GitHub

This development release improves memory-pressure handling, paged SSD cache reliability, native embedding/reranker serving, DFlash memory/cache accounting, and audio model support.

Improved emergency memory handling for pinned workloads. Active requests are aborted only as a last resort, when memory exceeds the real ceiling.
Improved paged SSD cache write-back reliability. Dirty hot-cache blocks now fall back to inline SSD writes instead of being dropped.
Improved API-key log safety. Rejected API keys are logged as fingerprints instead of raw values. by @richgoodson in #1751
Improved native BGE/XLM-R/BERT serving. bf16 reranker loads, embedding eval mode, and CLS pooling are handled correctly. by @paalolav in #1767
Improved DFlash prefill memory guarding. DFlash primary mode now applies the prefill memory guard before admission. by @JimStenstrom in #1770
Improved native embedding and reranker inference. Native paths now match shared serving behavior more closely.
Improved DFlash preflight memory safety. Unsafe MLX telemetry calls were removed from the preflight guard path.
Added TTS language forwarding. The audio speech language field now reaches mlx-audio lang_code. by @apetersson in #1773
Improved DFlash cache accounting. Prefix-cache hits are reported in prompt_tokens_details.cached_tokens. by @popfido in #1768
Fixed TTS argument forwarding. TTS engine argument order is preserved when language is forwarded.
Improved Gemma 4 Unified discovery. gemma4_unified models are detected as VLMs even without vision_config. by @FaisalFehad in #1744
Improved NeMo ASR discovery. NeMo ASR models are detected as speech-to-text models. by @scaryrawr in #1742
Improved pre-load memory admission. Tracked model memory now participates in LRU eviction decisions before loading another model. by @popfido in #1766
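
For clients that want to see the DFlash cache accounting change, here is a minimal sketch of reading prompt_tokens_details.cached_tokens from a response's usage object. It assumes the usage object has already been parsed into a dict and follows the common OpenAI-style layout (prompt_tokens alongside prompt_tokens_details). The helper name cached_token_stats and the sample numbers are hypothetical and are not taken from omlx.

```python
from typing import Any, Mapping


def cached_token_stats(usage: Mapping[str, Any]) -> tuple[int, int, float]:
    """Return (prompt_tokens, cached_tokens, cached_fraction) from a usage dict.

    Hypothetical helper: the field names follow the OpenAI-style usage layout,
    which is an assumption about the response shape, not omlx's own code.
    """
    prompt_tokens = int(usage.get("prompt_tokens") or 0)
    # The details object may be missing or null when nothing was cached.
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = int(details.get("cached_tokens") or 0)
    # Avoid dividing by zero for empty or malformed usage objects.
    fraction = cached_tokens / prompt_tokens if prompt_tokens else 0.0
    return prompt_tokens, cached_tokens, fraction


# Illustrative payload only; these numbers are made up, not real server output.
sample_usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 64,
    "total_tokens": 1264,
    "prompt_tokens_details": {"cached_tokens": 900},
}

prompt, cached, fraction = cached_token_stats(sample_usage)
print(f"{cached}/{prompt} prompt tokens served from cache ({fraction:.0%})")
```

Treating a missing or null details object as zero cached tokens keeps the same client code working against servers or modes that do not report prefix-cache hits.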
