What's Changed
- [V1] Reuse V0's memory_profiling util for gpu worker memory profiling by @yeqcharlotte in #19312
- [Bugfix] Fix benchmark_moe.py by @gty111 in #19016
- Use xla flag to improve the quantized model performance by @vanbasten23 in #19303
- Fix docs/mkdocs/hooks/remove_announcement.py by @hmellor in #19382
- [Frontend] Make use_tqdm accept a callable for custom progress bars by @reidliu41 in #19357 (see the usage sketch after this list)
- [Core] Use tuple for kv cache group block ids by @njhill in #19175
- [Bugfix] Fix modelscope token passed in by @Potabk in #19389
- [Core] Batch multi modal input using pinned memory by @lgeiger in #19169
- Add security warning to bug report template by @russellb in #19365
- [Misc] refactor neuron_multimodal and profiling by @reidliu41 in #19397
- Add clear documentation around the impact of debugging flag by @annapendleton in #19369
- Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. by @louie-tsai in #17930
- Revert "[v1] Add fp32 support to v1 engine through flex attn" by @Isotr0py in #19404
- [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword use_irope by @YUNQIUGUO in #19134
- [BugFix][CPU] Fix CPU CI by ignoring collection of test_pixtral by @bigPYJ1151 in #19411
- Simplify ep kernels installation by @youkaichao in #19412
- [Misc] Slight improvement of the BNB by @jeejeelee in #19418
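
The use_tqdm change above (#19357) lets LLM.generate take a callable instead of a bool to build its progress bar. Below is a minimal sketch of that usage; the custom_pbar helper and the tqdm options it sets are illustrative assumptions, not code from the PR.

```python
from tqdm import tqdm

from vllm import LLM, SamplingParams


def custom_pbar(*args, **kwargs):
    # Hypothetical helper: keep whatever arguments vLLM passes, but tweak the
    # refresh interval and bar colour before building a normal tqdm bar.
    kwargs.setdefault("mininterval", 0.5)
    kwargs.setdefault("colour", "green")
    return tqdm(*args, **kwargs)


llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=32)

# Assumption based on the PR title: use_tqdm now also accepts a callable that
# is invoked in place of tqdm to create the progress bar.
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    params,
    use_tqdm=custom_pbar,
)
for out in outputs:
    print(out.outputs[0].text)
```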
New Contributors
- @annapendleton made their first contribution in #19369
- @louie-tsai made their first contribution in #17930
- @YUNQIUGUO made their first contribution in #19134
Full Changelog: v0.9.1rc1...v0.9.1rc2