Major Changes
- ❗Planned breaking change❗: we plan to remove beam search in the next few releases (see #6226). This release emits a deprecation warning whenever beam search is enabled for a request (a minimal sketch of the affected pattern follows this list). Please voice your concerns in the RFC if you have a valid use case for beam search in vLLM.
- The release has moved to a Python-version-agnostic wheel (#6394): a single wheel can be installed on all Python versions that vLLM supports.
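A minimal sketch of the request pattern that now triggers the deprecation warning, assuming the SamplingParams-based beam search API of this release; the model name and prompt are placeholders:

```python
# Hedged sketch: beam search is still requested through SamplingParams in
# this release; the deprecation warning fires when use_beam_search=True.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Beam search explores best_of > 1 beams with greedy (temperature=0) scoring.
params = SamplingParams(use_beam_search=True, best_of=4, temperature=0.0)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```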
Highlights
Model Support
- Add PaliGemma (#5189), Fuyu-8B (#3924); a usage sketch follows this list
- Support for soft-tuned prompts (#4645)
- A new guide for adding multi-modal plugins (#6205)
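A rough usage sketch for the newly added vision-language models, assuming the dict-based multi-modal input format documented for this release; the prompt format is model-specific and the image path is a placeholder, so consult the multi-modal docs for the authoritative details:

```python
# Hedged sketch: multi-modal inference takes a prompt plus a PIL image
# under the "multi_modal_data" key in this release's input format.
from PIL import Image
from vllm import LLM

llm = LLM(model="google/paligemma-3b-mix-224")

image = Image.open("example.jpg")  # placeholder image
outputs = llm.generate({
    "prompt": "caption en",               # PaliGemma-style caption prompt
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```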
Hardware
- AMD: unify CUDA_VISIBLE_DEVICES usage (#6352)
Performance
- ZeroMQ fallback for broadcasting large objects (#6183)
- Simplify code to support pipeline parallel (#6406)
- Turn off CUTLASS scaled_mm for Ada Lovelace (#6384)
- Use CUTLASS kernels for the FP8 layers with Bias (#6270); a short FP8 usage sketch follows this list
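The CUTLASS changes above are kernel-level and require no request changes; a minimal sketch of running an FP8 model they apply to, using dynamic quantization of a placeholder checkpoint (requires FP8-capable hardware such as Hopper or Ada Lovelace):

```python
# Hedged sketch: the FP8 CUTLASS kernel selection happens internally;
# users only opt into FP8 quantization at model load time.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", quantization="fp8")  # placeholder model
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```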
Features
- Enabling bonus token in speculative decoding for KV cache based models (#5765)
- Medusa Implementation with Top-1 proposer (#4978)
- An experimental vLLM CLI for serving and querying an OpenAI-compatible server (#5090); a query sketch follows this list
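Once a server is running (the experimental CLI exposes a `vllm serve`-style entry point; flags may change), it can be queried with the standard OpenAI client. The port and API key below are the usual defaults and placeholders, not guarantees:

```python
# Hedged sketch: querying a locally running OpenAI-compatible vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="facebook/opt-125m",  # must match the served model
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```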
Others
- Add support for multi-node on CI (#5955)
- Benchmark: add H100 suite (#6047)
- [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362)
- Build some nightly wheels (#6380)
What's Changed
- Update wheel builds to strip debug by @simon-mo in #6161
- Fix release wheel build env var by @simon-mo in #6162
- Move release wheel env var to Dockerfile instead by @simon-mo in #6163
- [Doc] Reorganize Supported Models by Type by @ywang96 in #6167
- [Doc] Move guide for multimodal model and other improvements by @DarkLight1337 in #6168
- [Model] Add PaliGemma by @ywang96 in #5189
- add benchmark for fix length input and output by @haichuan1221 in #5857
- [ Misc ] Support Fp8 via `llm-compressor` by @robertgshaw2-neuralmagic in #6110
- [misc][frontend] log all available endpoints by @youkaichao in #6195
- do not exclude `object` field in CompletionStreamResponse by @kczimm in #6196
- [Bugfix] Fix benchmark args for randomly sampled dataset by @haichuan1221 in #5947
- [Kernel] reloading fused_moe config on the last chunk by @avshalomman in #6210
- [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) by @afeldman-nm in #4888
- [Bugfix] use diskcache in outlines _get_guide #5436 by @ericperfect in #6203
- [Bugfix] Mamba cache CUDA Graph padding by @tomeras91 in #6214
- Add FlashInfer to default Dockerfile by @simon-mo in #6172
- [hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability by @youkaichao in #6216
- [core][distributed] fix ray worker rank assignment by @youkaichao in #6235
- [Bugfix][TPU] Add missing None to model input by @WoosukKwon in #6245
- [Bugfix][TPU] Fix outlines installation in TPU Dockerfile by @WoosukKwon in #6256
- Add support for multi-node on CI by @khluu in #5955
- [CORE] Adding support for insertion of soft-tuned prompts by @SwapnilDreams100 in #4645
- [Docs] Docs update for Pipeline Parallel by @andoorve in #6222
- [Bugfix] Fix needs_scalar_to_array logic check by @qibaoyuan in #6238
- [Speculative Decoding] Medusa Implementation with Top-1 proposer by @abhigoyal1997 in #4978
- [core][distributed] add zmq fallback for broadcasting large objects by @youkaichao in #6183
- [Bugfix][TPU] Add prompt adapter methods to TPUExecutor by @WoosukKwon in #6279
- [Doc] Guide for adding multi-modal plugins by @DarkLight1337 in #6205
- [Bugfix] Support 2D input shape in MoE layer by @WoosukKwon in #6287
- [Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. by @tdoublep in #6303
- [CI/Build] Enable mypy typing for remaining folders by @bmuskalla in #6268
- [Bugfix] OpenVINOExecutor abstractmethod error by @park12sj in #6296
- [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models by @sroy745 in #5765
- [Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor by @WoosukKwon in #6313
- [Doc] Remove comments incorrectly copied from another project by @daquexian in #6286
- [Doc] Update description of vLLM support for CPUs by @DamonFool in #6003
- [BugFix]: set outlines pkg version by @xiangyang-95 in #6262
- [Bugfix] Fix snapshot download in serving benchmark by @ywang96 in #6318
- [Misc] refactor(config): clean up unused code by @aniaan in #6320
- [BugFix]: fix engine timeout due to request abort by @pushan01 in #6255
- [Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. by @tdoublep in #6326
- [BugFix] get_and_reset only when scheduler outputs are not empty by @mzusman in #6266
- [ Misc ] Refactor Marlin Python Utilities by @robertgshaw2-neuralmagic in #6082
- Benchmark: add H100 suite by @simon-mo in #6047
- [bug fix] Fix llava next feature size calculation. by @xwjiang2010 in #6339
- [doc] update pipeline parallel in readme by @youkaichao in #6347
- [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy by @KuntaiDu in #5362
- [ BugFix ] Prompt Logprobs Detokenization by @robertgshaw2-neuralmagic in #6223
- [Misc] Remove flashinfer warning, add flashinfer tests to CI by @LiuXiaoxuanPKU in #6351
- [distributed][misc] keep consistent with how pytorch finds libcudart.so by @youkaichao in #6346
- [Bugfix] Fix usage stats logging exception warning with OpenVINO by @helena-intel in #6349
- [Model][Phi3-Small] Remove scipy from blocksparse_attention by @mgoin in #6343
- [CI/Build] (2/2) Switching AMD CI to store images in Docker Hub by @adityagoel14 in #6350
- [ROCm][AMD][Bugfix] unify CUDA_VISIBLE_DEVICES usage in vllm to get device count and fixed navi3x by @hongxiayang in #6352
- [ Misc ] Remove separate bias add by @robertgshaw2-neuralmagic in #6353
- [Misc][Bugfix] Update transformers for tokenizer issue by @ywang96 in #6364
- [ Misc ] Support Models With Bias in `compressed-tensors` integration by @robertgshaw2-neuralmagic in #6356
- [Bugfix] Fix dtype mismatch in PaliGemma by @DarkLight1337 in #6367
- [Build/CI] Checking/Waiting for the GPU's clean state by @Alexei-V-Ivanov-AMD in #6379
- [Misc] add fixture to guided processor tests by @kevinbu233 in #6341
- [ci] Add grouped tests & mark tests to run by default for fastcheck pipeline by @khluu in #6365
- [ci] Add GHA workflows to enable full CI run by @khluu in #6381
- [MISC] Upgrade dependency to PyTorch 2.3.1 by @comaniac in #5327
- Build some nightly wheels by default by @simon-mo in #6380
- Fix release-pipeline.yaml by @simon-mo in #6388
- Fix interpolation in release pipeline by @simon-mo in #6389
- Fix release pipeline's -e flag by @simon-mo in #6390
- [Bugfix] Fix illegal memory access in FP8 MoE kernel by @comaniac in #6382
- [Misc] Add generated git commit hash as `vllm.__commit__` by @mgoin in #6386
- Fix release pipeline's dir permission by @simon-mo in #6391
- [Bugfix][TPU] Fix megacore setting for v5e-litepod by @WoosukKwon in #6397
- [ci] Fix wording for GH bot by @khluu in #6398
- [Doc] Fix Typo in Doc by @esaliya in #6392
- [Bugfix] Fix hard-coded value of x in context_attention_fwd by @tdoublep in #6373
- [Docs] Clean up latest news by @WoosukKwon in #6401
- [ci] try to add multi-node tests by @youkaichao in #6280
- Updating LM Format Enforcer version to v10.3 by @noamgat in #6411
- [ Misc ] More Cleanup of Marlin by @robertgshaw2-neuralmagic in #6359
- [Misc] Add deprecation warning for beam search by @WoosukKwon in #6402
- [ Misc ] Apply MoE Refactor to Qwen2 + Deepseekv2 To Support Fp8 by @robertgshaw2-neuralmagic in #6417
- [Model] Initialize Fuyu-8B support by @Isotr0py in #3924
- Remove unnecessary trailing period in spec_decode.rst by @terrytangyuan in #6405
- [Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace by @tlrmchlsmth in #6384
- [ci][build] fix commit id by @youkaichao in #6420
- [ Misc ] Enable Quantizing All Layers of DeepSeekv2 by @robertgshaw2-neuralmagic in #6423
- [Feature] vLLM CLI for serving and querying OpenAI compatible server by @EthanqX in #5090
- [Doc] xpu backend requires running setvars.sh by @rscohn2 in #6393
- [CI/Build] Cross python wheel by @robertgshaw2-neuralmagic in #6394
- [Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' by @lxline in #6428
- Report usage for beam search by @simon-mo in #6404
- Add FUNDING.yml by @simon-mo in #6435
- [BugFix] BatchResponseData body should be optional by @zifeitong in #6345
- [Doc] add env docs for flashinfer backend by @DefTruth in #6437
- [core][distributed] simplify code to support pipeline parallel by @youkaichao in #6406
- [Bugfix] Convert image to RGB by default by @DarkLight1337 in #6430
- [doc][misc] doc update by @youkaichao in #6439
- [VLM] Minor space optimization for `ClipVisionModel` by @ywang96 in #6436
- [doc][distributed] add suggestion for distributed inference by @youkaichao in #6418
- [Kernel] Use CUTLASS kernels for the FP8 layers with Bias by @tlrmchlsmth in #6270
- [Misc] Use 0.0.9 version for flashinfer by @Pernekhan in #6447
- [Bugfix] Add custom Triton cache manager to resolve MoE MP issue by @tdoublep in #6140
- [Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF by @tdoublep in #6409
- bump version to v0.5.2 by @simon-mo in #6433
- [misc][distributed] fix pp missing layer condition by @youkaichao in #6446
New Contributors
- @haichuan1221 made their first contribution in #5857
- @kczimm made their first contribution in #6196
- @ericperfect made their first contribution in #6203
- @qibaoyuan made their first contribution in #6238
- @abhigoyal1997 made their first contribution in #4978
- @bmuskalla made their first contribution in #6268
- @park12sj made their first contribution in #6296
- @daquexian made their first contribution in #6286
- @xiangyang-95 made their first contribution in #6262
- @aniaan made their first contribution in #6320
- @pushan01 made their first contribution in #6255
- @helena-intel made their first contribution in #6349
- @adityagoel14 made their first contribution in #6350
- @kevinbu233 made their first contribution in #6341
- @esaliya made their first contribution in #6392
- @EthanqX made their first contribution in #5090
- @rscohn2 made their first contribution in #6393
- @lxline made their first contribution in #6428
Full Changelog: v0.5.1...v0.5.2