Major Changes
- ❗Planned breaking change❗: we plan to remove beam search in the next few releases (see #6226). This release emits a deprecation warning whenever beam search is enabled for a request (a minimal sketch of the affected pattern follows this list). Please voice your concerns in the RFC if you have a valid use case for beam search in vLLM.
- The release has moved to a Python-version-agnostic wheel (#6394): a single wheel can be installed on all Python versions that vLLM supports.
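A minimal sketch of the request pattern that now triggers the deprecation warning, assuming the SamplingParams-based beam search API of this release; the model name and prompt are placeholders:

```python
# Hedged sketch: beam search is still requested through SamplingParams in
# this release; the deprecation warning fires when use_beam_search=True.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Beam search explores best_of > 1 beams with greedy (temperature=0) scoring.
params = SamplingParams(use_beam_search=True, best_of=4, temperature=0.0)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```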
Highlights
Model Support
- Add PaliGemma (#5189), Fuyu-8B (#3924); a usage sketch follows this list
- Support for soft-tuned prompts (#4645)
- A new guide for adding multi-modal plugins (#6205)
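A rough usage sketch for the newly added vision-language models, assuming the dict-based multi-modal input format documented for this release; the prompt format is model-specific and the image path is a placeholder, so consult the multi-modal docs for the authoritative details:

```python
# Hedged sketch: multi-modal inference takes a prompt plus a PIL image
# under the "multi_modal_data" key in this release's input format.
from PIL import Image
from vllm import LLM

llm = LLM(model="google/paligemma-3b-mix-224")

image = Image.open("example.jpg")  # placeholder image
outputs = llm.generate({
    "prompt": "caption en",               # PaliGemma-style caption prompt
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```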
Hardware
- AMD: unify CUDA_VISIBLE_DEVICES usage (#6352)
Performance
- ZeroMQ fallback for broadcasting large objects (#6183)
- Simplify code to support pipeline parallel (#6406)
- Turn off CUTLASS scaled_mm for Ada Lovelace (#6384)
- Use CUTLASS kernels for the FP8 layers with Bias (#6270); a short FP8 usage sketch follows this list
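The CUTLASS changes above are kernel-level and require no request changes; a minimal sketch of running an FP8 model they apply to, using dynamic quantization of a placeholder checkpoint (requires FP8-capable hardware such as Hopper or Ada Lovelace):

```python
# Hedged sketch: the FP8 CUTLASS kernel selection happens internally;
# users only opt into FP8 quantization at model load time.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", quantization="fp8")  # placeholder model
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```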
Features
- Enabling bonus token in speculative decoding for KV cache based models (#5765)
- Medusa Implementation with Top-1 proposer (#4978)
- An experimental vLLM CLI for serving and querying an OpenAI-compatible server (#5090); a query sketch follows this list
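Once a server is running (the experimental CLI exposes a `vllm serve`-style entry point; flags may change), it can be queried with the standard OpenAI client. The port and API key below are the usual defaults and placeholders, not guarantees:

```python
# Hedged sketch: querying a locally running OpenAI-compatible vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="facebook/opt-125m",  # must match the served model
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```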
Others
- Add support for multi-node on CI (#5955)
- Benchmark: add H100 suite (#6047)
- [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362)
- Build some nightly wheels (#6380)
What's Changed
- Update wheel builds to strip debug by @simon-mo in #6161
- Fix release wheel build env var by @simon-mo in #6162
- Move release wheel env var to Dockerfile instead by @simon-mo in #6163
- [Doc] Reorganize Supported Models by Type by @ywang96 in #6167
- [Doc] Move guide for multimodal model and other improvements by @DarkLight1337 in #6168
- [Model] Add PaliGemma by @ywang96 in #5189
- add benchmark for fix length input and output by @haichuan1221 in #5857
- [ Misc ] Support Fp8 via `llm-compressor` by @robertgshaw2-neuralmagic in #6110
- [misc][frontend] log all available endpoints by @youkaichao in #6195
- do not exclude `object` field in CompletionStreamResponse by @kczimm in #6196
- [Bugfix] Fix benchmark args for randomly sampled dataset by @haichuan1221 in #5947
- [Kernel] reloading fused_moe config on the last chunk by @avshalomman in #6210
- [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) by @afeldman-nm in #4888
- [Bugfix] use diskcache in outlines _get_guide #5436 by @ericperfect in #6203
- [Bugfix] Mamba cache CUDA Graph padding by @tomeras91 in #6214
- Add FlashInfer to default Dockerfile by @simon-mo in #6172
- [hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability by @youkaichao in #6216
- [core][distributed] fix ray worker rank assignment by @youkaichao in #6235
- [Bugfix][TPU] Add missing None to model input by @WoosukKwon in #6245
- [Bugfix][TPU] Fix outlines installation in TPU Dockerfile by @WoosukKwon in #6256
- Add support for multi-node on CI by @khluu in #5955
- [CORE] Adding support for insertion of soft-tuned prompts by @SwapnilDreams100 in #4645
- [Docs] Docs update for Pipeline Parallel by @andoorve in #6222
- [Bugfix] Fix needs_scalar_to_array logic check by @qibaoyuan in #6238
- [Speculative Decoding] Medusa Implementation with Top-1 proposer by @abhigoyal1997 in #4978
- [core][distributed] add zmq fallback for broadcasting large objects by @youkaichao in #6183
- [Bugfix][TPU] Add prompt adapter methods to TPUExecutor by @WoosukKwon in #6279
- [Doc] Guide for adding multi-modal plugins by @DarkLight1337 in #6205
- [Bugfix] Support 2D input shape in MoE layer by @WoosukKwon in #6287
- [Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. by @tdoublep in #6303
- [CI/Build] Enable mypy typing for remaining folders by @bmuskalla in #6268
- [Bugfix] OpenVINOExecutor abstractmethod error by @park12sj in #6296
- [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models by @sroy745 in #5765
- [Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor by @WoosukKwon in #6313
- [Doc] Remove comments incorrectly copied from another project by @daquexian in #6286
- [Doc] Update description of vLLM support for CPUs by @DamonFool in #6003
- [BugFix]: set outlines pkg version by @xiangyang-95 in #6262
- [Bugfix] Fix snapshot download in serving benchmark by @ywang96 in #6318
- [Misc] refactor(config): clean up unused code by @aniaan in #6320
- [BugFix]: fix engine timeout due to request abort by @pushan01 in #6255
- [Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. by @tdoublep in #6326
- [BugFix] get_and_reset only when scheduler outputs are not empty by @mzusman in #6266
- [ Misc ] Refactor Marlin Python Utilities by @robertgshaw2-neuralmagic in #6082
- Benchmark: add H100 suite by @simon-mo in #6047
- [bug fix] Fix llava next feature size calculation. by @xwjiang2010 in #6339
- [doc] update pipeline parallel in readme by @youkaichao in #6347
- [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy by @KuntaiDu in #5362
- [ BugFix ] Prompt Logprobs Detokenization by @robertgshaw2-neuralmagic in #6223
- [Misc] Remove flashinfer warning, add flashinfer tests to CI by @LiuXiaoxuanPKU in #6351
- [distributed][misc] keep consistent with how pytorch finds libcudart.so by @youkaichao in #6346
- [Bugfix] Fix usage stats logging exception warning with OpenVINO by @helena-intel in #6349
- [Model][Phi3-Small] Remove scipy from blocksparse_attention by @mgoin in #6343
- [CI/Build] (2/2) Switching AMD CI to store images in Docker Hub by @adityagoel14 in #6350
- [ROCm][AMD][Bugfix] unify CUDA_VISIBLE_DEVICES usage in vllm to get device count and fixed navi3x by @hongxiayang in #6352
- [ Misc ] Remove separate bias add by @robertgshaw2-neuralmagic in #6353
- [Misc][Bugfix] Update transformers for tokenizer issue by @ywang96 in #6364
- [ Misc ] Support Models With Bias in `compressed-tensors` integration by @robertgshaw2-neuralmagic in #6356
- [Bugfix] Fix dtype mismatch in PaliGemma by @DarkLight1337 in #6367
- [Build/CI] Checking/Waiting for the GPU's clean state by @Alexei-V-Ivanov-AMD in #6379
- [Misc] add fixture to guided processor tests by @kevinbu233 in #6341
- [ci] Add grouped tests & mark tests to run by default for fastcheck pipeline by @khluu in #6365
- [ci] Add GHA workflows to enable full CI run by @khluu in #6381
- [MISC] Upgrade dependency to PyTorch 2.3.1 by @comaniac in #5327
- Build some nightly wheels by default by @simon-mo in #6380
- Fix release-pipeline.yaml by @simon-mo in #6388
- Fix interpolation in release pipeline by @simon-mo in #6389
- Fix release pipeline's -e flag by @simon-mo in #6390
- [Bugfix] Fix illegal memory access in FP8 MoE kernel by @comaniac in #6382
- [Misc] Add generated git commit hash as `vllm.__commit__` by @mgoin in #6386
- Fix release pipeline's dir permission by @simon-mo in #6391
- [Bugfix][TPU] Fix megacore setting for v5e-litepod by @WoosukKwon in #6397
- [ci] Fix wording for GH bot by @khluu in #6398
- [Doc] Fix Typo in Doc by @esaliya in #6392
- [Bugfix] Fix hard-coded value of x in context_attention_fwd by @tdoublep in #6373
- [Docs] Clean up latest news by @WoosukKwon in #6401
- [ci] try to add multi-node tests by @youkaichao in #6280
- Updating LM Format Enforcer version to v10.3 by @noamgat in #6411
- [ Misc ] More Cleanup of Marlin by @robertgshaw2-neuralmagic in #6359
- [Misc] Add deprecation warning for beam search by @WoosukKwon in #6402
- [ Misc ] Apply MoE Refactor to Qwen2 + Deepseekv2 To Support Fp8 by @robertgshaw2-neuralmagic in #6417
- [Model] Initialize Fuyu-8B support by @Isotr0py in #3924
- Remove unnecessary trailing period in spec_decode.rst by @terrytangyuan in #6405
- [Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace by @tlrmchlsmth in #6384
- [ci][build] fix commit id by @youkaichao in #6420
- [ Misc ] Enable Quantizing All Layers of DeepSeekv2 by @robertgshaw2-neuralmagic in #6423
- [Feature] vLLM CLI for serving and querying OpenAI compatible server by @EthanqX in #5090
- [Doc] xpu backend requires running setvars.sh by @rscohn2 in #6393
- [CI/Build] Cross python wheel by @robertgshaw2-neuralmagic in #6394
- [Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' by @lxline in #6428
- Report usage for beam search by @simon-mo in #6404
- Add FUNDING.yml by @simon-mo in #6435
- [BugFix] BatchResponseData body should be optional by @zifeitong in #6345
- [Doc] add env docs for flashinfer backend by @DefTruth in #6437
- [core][distributed] simplify code to support pipeline parallel by @youkaichao in #6406
- [Bugfix] Convert image to RGB by default by @DarkLight1337 in #6430
- [doc][misc] doc update by @youkaichao in #6439
- [VLM] Minor space optimization for `ClipVisionModel` by @ywang96 in #6436
- [doc][distributed] add suggestion for distributed inference by @youkaichao in #6418
- [Kernel] Use CUTLASS kernels for the FP8 layers with Bias by @tlrmchlsmth in #6270
- [Misc] Use 0.0.9 version for flashinfer by @Pernekhan in #6447
- [Bugfix] Add custom Triton cache manager to resolve MoE MP issue by @tdoublep in #6140
- [Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF by @tdoublep in #6409
- bump version to v0.5.2 by @simon-mo in #6433
- [misc][distributed] fix pp missing layer condition by @youkaichao in #6446
New Contributors
- @haichuan1221 made their first contribution in #5857
- @kczimm made their first contribution in #6196
- @ericperfect made their first contribution in #6203
- @qibaoyuan made their first contribution in #6238
- @abhigoyal1997 made their first contribution in #4978
- @bmuskalla made their first contribution in #6268
- @park12sj made their first contribution in #6296
- @daquexian made their first contribution in #6286
- @xiangyang-95 made their first contribution in #6262
- @aniaan made their first contribution in #6320
- @pushan01 made their first contribution in #6255
- @helena-intel made their first contribution in #6349
- @adityagoel14 made their first contribution in #6350
- @kevinbu233 made their first contribution in #6341
- @esaliya made their first contribution in #6392
- @EthanqX made their first contribution in #5090
- @rscohn2 made their first contribution in #6393
- @lxline made their first contribution in #6428
Full Changelog: v0.5.1...v0.5.2