## What's Changed
- Fix save/load of CacheList by @angeloskath in #886
- Share model by @angeloskath in #871
- Fix mixed quant predicates for MLA models by @spicyneuron in #892
- Add JoyAI LLM Flash by @kernelpool in #894
- perplexity: add --trust-remote-code option by @ivanfioravanti in #896
- server: add usage.prompt_tokens_details.cached_tokens to json response by @percontation in #849
- Fix qwen3.5 casting to fp32 by @awni in #902
- Fix sharded rms norm in MiniMax M2.5 by @angeloskath in #898
- Bump for next version by @awni in #904
- Add tie_word_embeddings modulars in mistral and qwen3 moe by @Goekdeniz-Guelmez in #889
- Allow reading LFM2 models nested rope params by @ykhrustalev in #908
- Improve the cache size limits by @angeloskath in #906
- Make the cache limits more friendly by @angeloskath in #910
- Add `mx.clear_cache()` to piecewise prompt processing in server. by @N8python in #917
- Add filter guard to ArraysCache.nbytes property by @f1yn in #918
- Add tokens to eval to avoid large graphs when they are not used by @awni in #924
- Clear the cache during batch generation by @awni in #926
- Fix qwen3.5 sanitize by @awni in #928
- step3p5: use rotating cache for sliding attention layers by @lyonsno in #949
- Proposal: `--prefill-step-size` as cmd line argument for speed/memory usage trade-off by @Abioy in #943
- fix: `convert()` uses incorrect defaults for quantization mode by @spicyneuron in #935
- Bump minor by @angeloskath in #954
- Ensure normalization does not promote to fp32 by @angeloskath in #951
- Better caching in the server by @angeloskath in #911
- Adds tensor parallelism for Qwen 3.5 by @angeloskath in #957
## New Contributors
- @spicyneuron made their first contribution in #892
- @ykhrustalev made their first contribution in #908
- @f1yn made their first contribution in #918
- @lyonsno made their first contribution in #949
- @Abioy made their first contribution in #943
Full Changelog: v0.30.7...v0.31.0