What's Changed
- [V1] Reuse V0's memory_profiling util for gpu worker memory profiling by @yeqcharlotte in #19312
- [Bugfix] Fix benchmark_moe.py by @gty111 in #19016
- Use xla flag to improve the quantized model performance by @vanbasten23 in #19303
- Fix docs/mkdocs/hooks/remove_announcement.py by @hmellor in #19382
- [Frontend] Make use_tqdm accept a callable for custom progress bars by @reidliu41 in #19357 (see the usage sketch after this list)
- [Core] Use tuple for kv cache group block ids by @njhill in #19175
- [Bugfix] Fix modelscope token passed in by @Potabk in #19389
- [Core] Batch multi modal input using pinned memory by @lgeiger in #19169
- Add security warning to bug report template by @russellb in #19365
- [Misc] refactor neuron_multimodal and profiling by @reidliu41 in #19397
- Add clear documentation around the impact of debugging flag by @annapendleton in #19369
- Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. by @louie-tsai in #17930
- Revert "[v1] Add fp32 support to v1 engine through flex attn" by @Isotr0py in #19404
- [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword use_irope by @YUNQIUGUO in #19134
- [BugFix][CPU] Fix CPU CI by ignoring collection of test_pixtral by @bigPYJ1151 in #19411
- Simplify ep kernels installation by @youkaichao in #19412
- [Misc] Slight improvement of the BNB by @jeejeelee in #19418
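
The use_tqdm change above (#19357) lets LLM.generate take a callable instead of a bool to build its progress bar. Below is a minimal sketch of that usage; the custom_pbar helper and the tqdm options it sets are illustrative assumptions, not code from the PR.

```python
from tqdm import tqdm

from vllm import LLM, SamplingParams


def custom_pbar(*args, **kwargs):
    # Hypothetical helper: keep whatever arguments vLLM passes, but tweak the
    # refresh interval and bar colour before building a normal tqdm bar.
    kwargs.setdefault("mininterval", 0.5)
    kwargs.setdefault("colour", "green")
    return tqdm(*args, **kwargs)


llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=32)

# Assumption based on the PR title: use_tqdm now also accepts a callable that
# is invoked in place of tqdm to create the progress bar.
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    params,
    use_tqdm=custom_pbar,
)
for out in outputs:
    print(out.outputs[0].text)
```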
New Contributors
- @annapendleton made their first contribution in #19369
- @louie-tsai made their first contribution in #17930
- @YUNQIUGUO made their first contribution in #19134
Full Changelog: v0.9.1rc1...v0.9.1rc2