## What's Changed
- Fix save/load of CacheList by @angeloskath in #886
- Share model by @angeloskath in #871
- Fix mixed quant predicates for MLA models by @spicyneuron in #892
- Add JoyAI LLM Flash by @kernelpool in #894
- perplexity: add --trust-remote-code option by @ivanfioravanti in #896
- server: add usage.prompt_tokens_details.cached_tokens to json response by @percontation in #849
- Fix qwen3.5 casting to fp32 by @awni in #902
- Fix sharded rms norm in MiniMax M2.5 by @angeloskath in #898
- Bump for next version by @awni in #904
- Add tie_word_embeddings modulars in mistral and qwen3 moe by @Goekdeniz-Guelmez in #889
- Allow reading LFM2 models nested rope params by @ykhrustalev in #908
- Improve the cache size limits by @angeloskath in #906
- Make the cache limits more friendly by @angeloskath in #910
- Add `mx.clear_cache()` to piecewise prompt processing in server. by @N8python in #917
- Add filter guard to ArraysCache.nbytes property by @f1yn in #918
- Add tokens to eval to avoid large graphs when they are not used by @awni in #924
- Clear the cache during batch generation by @awni in #926
- Fix qwen3.5 sanitize by @awni in #928
- step3p5: use rotating cache for sliding attention layers by @lyonsno in #949
- Proposal: `--prefill-step-size` as cmd line argument for speed/memory usage trade-off by @Abioy in #943
- fix: `convert()` uses incorrect defaults for quantization mode by @spicyneuron in #935
- Bump minor by @angeloskath in #954
- Ensure normalization does not promote to fp32 by @angeloskath in #951
- Better caching in the server by @angeloskath in #911
- Adds tensor parallelism for Qwen 3.5 by @angeloskath in #957
## New Contributors
- @spicyneuron made their first contribution in #892
- @ykhrustalev made their first contribution in #908
- @f1yn made their first contribution in #918
- @lyonsno made their first contribution in #949
- @Abioy made their first contribution in #943
Full Changelog: v0.30.7...v0.31.0