GitHub jundot/omlx v0.4.4.dev1
0.4.4.dev1

pre-release, 6 hours ago

This development release improves memory-pressure handling, paged SSD cache reliability, native embedding/reranker serving, DFlash memory/cache accounting, and audio model support.

  • Improved emergency memory handling for pinned workloads. Active requests are only aborted as a last resort when memory exceeds the real ceiling.
  • Improved paged SSD cache write-back reliability. Dirty hot-cache blocks now fall back to inline SSD writes instead of being dropped.
  • Improved API-key log safety. Rejected API keys are logged as fingerprints instead of raw values. by @richgoodson in #1751
  • Improved native BGE/XLM-R/BERT serving. bf16 reranker loads, embedding eval mode, and CLS pooling are handled correctly. by @paalolav in #1767
  • Improved DFlash prefill memory guarding. DFlash primary mode now applies the prefill memory guard before admission. by @JimStenstrom in #1770
  • Improved native embedding and reranker inference. Native paths now match shared serving behavior more closely.
  • Improved DFlash preflight memory safety. Unsafe MLX telemetry calls were removed from the preflight guard path.
  • Added TTS language forwarding. The audio speech language field now reaches mlx-audio lang_code. by @apetersson in #1773
  • Improved DFlash cache accounting. Prefix-cache hits are reported in prompt_tokens_details.cached_tokens. by @popfido in #1768
  • Fixed TTS argument forwarding. TTS engine argument order is preserved when language is forwarded.
  • Improved Gemma 4 Unified discovery. gemma4_unified models are detected as VLMs even without vision_config. by @FaisalFehad in #1744
  • Improved NeMo ASR discovery. NeMo ASR models are detected as speech-to-text models. by @scaryrawr in #1742
  • Improved pre-load memory admission. Tracked model memory now participates in LRU eviction decisions before loading another model. by @popfido in #1766
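
As an illustration of the API-key log safety item above, the following is a minimal sketch of logging a rejected key by fingerprint instead of by raw value. The release notes do not say how omlx derives the fingerprint, so the truncated SHA-256 digest, the function names, and the logger name here are all assumptions made for the example.

```python
import hashlib
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("auth_sketch")


def key_fingerprint(api_key: str, length: int = 12) -> str:
    """Return a short, non-reversible identifier for an API key.

    Assumption: a truncated SHA-256 hex digest. The scheme omlx uses
    is not described in the release notes.
    """
    digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    return digest[:length]


def log_rejected_key(api_key: str) -> str:
    """Log a rejected key by fingerprint only and return the logged message."""
    message = f"rejected API key (sha256 fingerprint {key_fingerprint(api_key)})"
    logger.warning(message)
    return message
```

The same key always maps to the same fingerprint, so repeated rejections can still be correlated in logs without the secret itself being written to disk.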
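
The pre-load memory admission item above can be pictured with a small sketch: before another model is loaded, the tracked memory of already-loaded models is summed, and the least recently used models are selected for eviction until the new model fits. This is not omlx's implementation; `plan_evictions`, the byte budget, and the use of an `OrderedDict` as the LRU structure are hypothetical choices for illustration.

```python
from collections import OrderedDict


def plan_evictions(loaded, incoming_bytes, budget_bytes):
    """Return model names to evict, least recently used first.

    `loaded` is an OrderedDict mapping model name -> tracked bytes,
    ordered from least to most recently used (an assumption of this sketch).
    Raises MemoryError if the incoming model exceeds the budget on its own.
    """
    if incoming_bytes > budget_bytes:
        raise MemoryError("model exceeds the memory budget on its own")

    used = sum(loaded.values())
    evict = []
    for name, size in loaded.items():
        # Stop as soon as the incoming model fits alongside what remains.
        if used + incoming_bytes <= budget_bytes:
            break
        evict.append(name)
        used -= size
    return evict
```

For example, with models of 4, 3, and 2 units tracked against a budget of 12, admitting a 5-unit model only requires evicting the least recently used one.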
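
For the DFlash cache accounting item above, a client could read the reported prefix-cache hits roughly as follows. Only the `prompt_tokens_details.cached_tokens` path comes from the release notes; the surrounding `usage` dictionary and its `prompt_tokens` field are assumed to follow the common OpenAI-style usage payload, and the helper names are hypothetical.

```python
def cached_prompt_tokens(usage: dict) -> int:
    """Return prefix-cache hits from a usage payload, or 0 if the field is absent."""
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from the prefix cache."""
    # Assumption: the payload carries an OpenAI-style `prompt_tokens` count.
    prompt_tokens = usage.get("prompt_tokens") or 0
    if prompt_tokens <= 0:
        return 0.0
    return cached_prompt_tokens(usage) / prompt_tokens
```

Treating a missing or null field as zero keeps the helper usable against servers or modes that do not report cache hits.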
