New features
- Qwen2-VL support
- Idefics 3/SmolVLM support
- 🔥 6x prompt performance boost (all benchmarks faster than or comparable to MLX and llama.cpp)!
- 🗂️ More efficient non-PagedAttention KV cache implementation!
- Public tokenization API
Python wheels
The wheels now support Windows, Linux, and macOS on both x86_64 and aarch64.
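With prebuilt wheels for these platforms, installation no longer requires a local Rust toolchain. A minimal sketch, assuming the base PyPI package is named `mistralrs` (accelerator-specific variants such as `mistralrs-cuda` or `mistralrs-metal` may be the appropriate choice for your hardware):

```shell
# Install the prebuilt wheel matching the current platform (CPU build).
pip install mistralrs
```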
MSRV
1.79.0
What's Changed
- Update Dockerfile by @Reckon-11 in #895
- Add the Qwen2-VL model by @EricLBuehler in #894
- ISQ for mistralrs-bench by @EricLBuehler in #902
- Use tokenizers v0.20 by @EricLBuehler in #904
- Fix metal sdpa for v stride by @EricLBuehler in #905
- Better parsing of the image path by @EricLBuehler in #906
- Add some Metal kernels for HQQ dequant by @EricLBuehler in #907
- Handle assistant messages with 'tool_calls' by @Jeadie in #824
- Attention-fused softmax for Metal by @EricLBuehler in #908
- Metal qmatmul mat-mat product (5.4x performance increase) by @EricLBuehler in #909
- Support --dtype in mistralrs bench by @EricLBuehler in #911
- Metal: Use mtl resource shared to avoid one copy by @EricLBuehler in #914
- Preallocated KV cache by @EricLBuehler in #916
- Fixes for kv cache grow by @EricLBuehler in #917
- Don't always compile with fp8, bf16 for cuda by @EricLBuehler in #920
- Expand attnmask on cuda by @EricLBuehler in #923
- Faster CUDA prompt speeds by @EricLBuehler in #925
- Paged Attention alibi support by @EricLBuehler in #926
- Default to SDPA for faster VLlama PP T/s by @EricLBuehler in #927
- VLlama vision model ISQ support by @EricLBuehler in #928
- Support fp8 on Metal by @EricLBuehler in #930
- Bump rustls from 0.23.15 to 0.23.18 by @dependabot in #932
- Calculate perplexity of ISQ models by @EricLBuehler in #931
- Integrate fast MLX kernel for SDPA with long seqlen by @EricLBuehler in #933
- Always cast image to rgb8 for qwenvl2 by @EricLBuehler in #936
- Fix etag missing in hf hub by @EricLBuehler in #934
- Fix some examples for vllama 3.2 by @EricLBuehler in #937
- Improve memory efficiency of vllama by @EricLBuehler in #938
- Implement the Idefics 3 models (Idefics 3, SmolVLM-Instruct) by @EricLBuehler in #939
- Expose a public tokenization API by @EricLBuehler in #940
- Prepare for v0.3.4 by @EricLBuehler in #942
New Contributors
- @Reckon-11 made their first contribution in #895
Full Changelog: v0.3.2...v0.3.4