## Highlights
- Add initial TPU integration (#5292)
- Fix crashes when using FlashAttention backend (#5478)
- Fix issues when using num_devices < num_available_devices (#5473); a minimal sketch of the idea follows below
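As context for the #5473 highlight, here is a minimal sketch of a "stateless" device count: re-derive the count from `CUDA_VISIBLE_DEVICES` on every call instead of trusting a cached, context-initializing query, so runtime changes to the variable are honored. The helper name and the `torch` fallback below are illustrative assumptions, not vLLM's actual implementation.

```python
# Hypothetical sketch only; not vLLM's actual cuda_device_count_stateless.
import os


def device_count_stateless() -> int:
    """Count visible CUDA devices without relying on cached CUDA state."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is not None:
        # An empty value hides every device; otherwise count the listed IDs.
        return 0 if visible.strip() == "" else len(visible.split(","))
    # Fallback for the unset case: ask the driver (assumed for illustration).
    import torch

    return torch.cuda.device_count()
```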
## What's Changed
- [CI/Build] Add `is_quant_method_supported` to control quantization test configurations by @mgoin in #5253
- Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" by @simon-mo in #5463
- [CI] Upgrade codespell version. by @rkooo567 in #5381
- [Hardware] Initial TPU integration by @WoosukKwon in #5292
- [Bugfix] Add device assertion to TorchSDPA by @bigPYJ1151 in #5402
- [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests by @khluu in #5464
- [Kernel] Vectorized FP8 quantize kernel by @comaniac in #5396 (see the FP8 sketch after this list)
- [Bugfix] TYPE_CHECKING for MultiModalData by @kimdwkimdw in #5444
- [Frontend] [Core] Support for sharded tensorized models by @tjohnson31415 in #4990
- [misc] add hint for AttributeError by @youkaichao in #5462
- [Doc] Update debug docs by @DarkLight1337 in #5438
- [Bugfix] Fix typo in scheduler.py (requeset -> request) by @mgoin in #5470
- [Frontend] Add "input speed" to tqdm postfix alongside output speed by @mgoin in #5425 (see the tqdm sketch after this list)
- [Bugfix] Fix wrong multi_modal_input format for CPU runner by @Isotr0py in #5451
- [Core][Distributed] add coordinator to reduce code duplication in tp and pp by @youkaichao in #5293
- [ci] Use sccache to build images by @khluu in #5419
- [Bugfix] If the content starts with ":" (response of ping), client should i… by @sywangyi in #5303
- [Kernel] `w4a16` support for `compressed-tensors` by @dsikka in #5385
- [CI/Build][REDO] Add `is_quant_method_supported` to control quantization test configurations by @mgoin in #5466
- [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 by @wenyujin333 in #5497
- [Hardware][Intel] Optimize CPU backend and add more performance tips by @bigPYJ1151 in #4971
- [Docs] Add 4th meetup slides by @WoosukKwon in #5509
- [Misc] Add vLLM version getter to utils by @DarkLight1337 in #5098
- [CI/Build] Simplify OpenAI server setup in tests by @DarkLight1337 in #5100
- [Doc] Update LLaVA docs by @DarkLight1337 in #5437
- [Kernel] Factor out epilogues from cutlass kernels by @tlrmchlsmth in #5391
- [MISC] Remove FP8 warning by @comaniac in #5472
- Separate dev requirements into lint and test by @Yard1 in #5474
- Revert "[Core] Remove unnecessary copies in flash attn backend" by @Yard1 in #5478
- [misc] fix format.sh by @youkaichao in #5511
- [CI/Build] Disable test_fp8.py by @tlrmchlsmth in #5508
- [Kernel] Disable CUTLASS kernels for fp8 by @tlrmchlsmth in #5505
- Add `cuda_device_count_stateless` by @Yard1 in #5473
- [Hardware][Intel] Support CPU inference with AVX2 ISA by @DamonFool in #5452
- [Bugfix] Typo fix by @AllenDou in #5507
- bump version to v0.5.0.post1 by @simon-mo in #5522
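For the vectorized FP8 quantize kernel noted above (#5396), here is a hedged sketch of the underlying per-tensor quantization math in PyTorch. It illustrates only the arithmetic, not the CUDA kernel's vectorized implementation; the function name and per-tensor scaling scheme are assumptions.

```python
# Illustrative per-tensor FP8 (e4m3) quantization sketch; not vLLM's kernel.
import torch


def fp8_quantize(x: torch.Tensor):
    finfo = torch.finfo(torch.float8_e4m3fn)  # e4m3 max magnitude: 448.0
    # Pick one scale so the largest element maps to the FP8 maximum.
    scale = x.abs().max().clamp(min=1e-12) / finfo.max
    q = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return q, scale  # dequantize with q.float() * scale
```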
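And for the "input speed" progress-bar change noted above (#5425), a self-contained sketch of the general technique using tqdm's `set_postfix_str`; the request data and metric labels are made up for illustration.

```python
# Hypothetical example of showing input/output token speeds in a tqdm postfix.
import time

from tqdm import tqdm

requests = [(128, 32)] * 10  # (prompt tokens, generated tokens) per request
start = time.perf_counter()
in_total = out_total = 0

with tqdm(total=len(requests), desc="Processed prompts") as pbar:
    for in_toks, out_toks in requests:
        time.sleep(0.01)  # stand-in for actual generation work
        in_total += in_toks
        out_total += out_toks
        elapsed = time.perf_counter() - start
        pbar.set_postfix_str(
            f"input: {in_total / elapsed:.1f} toks/s, "
            f"output: {out_total / elapsed:.1f} toks/s"
        )
        pbar.update(1)
```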
## New Contributors
- @kimdwkimdw made their first contribution in #5444
- @sywangyi made their first contribution in #5303
Full Changelog: v0.5.0...v0.5.0.post1