New features
- Qwen2-VL support
- Idefics 3/SmolVLM support
- 🔥 6x prompt performance boost (all benchmarks faster than or comparable to MLX and llama.cpp)!
- 🗂️ More efficient non-PagedAttention KV cache implementation!
- Public tokenization API
Python wheels
The wheels now support Windows, Linux, and macOS on both x86_64 and aarch64.
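With prebuilt wheels for these platforms, installation no longer requires a local Rust toolchain. A minimal sketch, assuming the base PyPI package is named `mistralrs` (accelerator-specific variants such as `mistralrs-cuda` or `mistralrs-metal` may be the appropriate choice for your hardware):

```shell
# Install the prebuilt wheel matching the current platform (CPU build).
pip install mistralrs
```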
MSRV
1.79.0
What's Changed
- Update Dockerfile by @Reckon-11 in #895
- Add the Qwen2-VL model by @EricLBuehler in #894
- ISQ for mistralrs-bench by @EricLBuehler in #902
- Use tokenizers v0.20 by @EricLBuehler in #904
- Fix metal sdpa for v stride by @EricLBuehler in #905
- Better parsing of the image path by @EricLBuehler in #906
- Add some Metal kernels for HQQ dequant by @EricLBuehler in #907
- Handle assistant messages with 'tool_calls' by @Jeadie in #824
- Attention-fused softmax for Metal by @EricLBuehler in #908
- Metal qmatmul mat-mat product (5.4x performance increase) by @EricLBuehler in #909
- Support --dtype in mistralrs bench by @EricLBuehler in #911
- Metal: Use mtl resource shared to avoid one copy by @EricLBuehler in #914
- Preallocated KV cache by @EricLBuehler in #916
- Fixes for kv cache grow by @EricLBuehler in #917
- Don't always compile with fp8, bf16 for cuda by @EricLBuehler in #920
- Expand attnmask on cuda by @EricLBuehler in #923
- Faster CUDA prompt speeds by @EricLBuehler in #925
- Paged Attention alibi support by @EricLBuehler in #926
- Default to SDPA for faster VLlama PP T/s by @EricLBuehler in #927
- VLlama vision model ISQ support by @EricLBuehler in #928
- Support fp8 on Metal by @EricLBuehler in #930
- Bump rustls from 0.23.15 to 0.23.18 by @dependabot in #932
- Calculate perplexity of ISQ models by @EricLBuehler in #931
- Integrate fast MLX kernel for SDPA with long seqlen by @EricLBuehler in #933
- Always cast image to rgb8 for qwenvl2 by @EricLBuehler in #936
- Fix etag missing in hf hub by @EricLBuehler in #934
- Fix some examples for vllama 3.2 by @EricLBuehler in #937
- Improve memory efficiency of vllama by @EricLBuehler in #938
- Implement the Idefics 3 models (Idefics 3, SmolVLM-Instruct) by @EricLBuehler in #939
- Expose a public tokenization API by @EricLBuehler in #940
- Prepare for v0.3.4 by @EricLBuehler in #942
New Contributors
- @Reckon-11 made their first contribution in #895
Full Changelog: v0.3.2...v0.3.4