vllm-project/vllm v0.5.5

Highlights

Performance Update

  • We introduced a new mode that schedules multiple GPU steps in advance, reducing CPU overhead (#7000, #7387, #7452, #7703). Initial results show a 20% improvement in QPS for a single GPU running 8B and 30B models. You can set --num-scheduler-steps 8 as a parameter to the API server (via vllm serve) or AsyncLLMEngine; see the sketch after this list. We are working on expanding coverage to the LLM class and aim to turn it on by default
  • Various enhancements:
    • Use the flashinfer sampling kernel when available, leading to a 7% decoding throughput speedup (#7137)
    • Reduce Python allocations, leading to a 24% throughput speedup (#7162, #7364)
    • Improvements to the zeromq-based decoupled frontend (#7570, #7716, #7484)
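
A minimal sketch of enabling the multi-step scheduling option above (the model name is a placeholder, and the Python engine-argument name is assumed to mirror the --num-scheduler-steps CLI flag):

    # Shell: pass the flag to the OpenAI-compatible server.
    #   vllm serve meta-llama/Meta-Llama-3-8B-Instruct --num-scheduler-steps 8

    # Python: the same option via AsyncLLMEngine.
    from vllm import AsyncEngineArgs, AsyncLLMEngine

    engine_args = AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
        num_scheduler_steps=8,  # schedule 8 GPU steps per scheduler invocation
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)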

Model Support

  • Support Jamba 1.5 (#7415, #7601, #6739)
  • Support for the first audio model, UltravoxModel (#7615, #7446)
  • Improvements to vision models:
    • Support image embeddings as input (#6613); see the example after this list
    • Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
  • Support loading GGUF models (#5191), including with tensor parallelism (#7520)
  • Progress on encoder/decoder models: support for serving encoder/decoder models (#7258) and architecture for cross-attention (#4942)
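
The image-embedding input path noted above can be exercised through the offline LLM API. A hedged sketch, assuming the model accepts an <image> placeholder in the prompt and a precomputed embedding tensor of the model's expected shape (the model name and tensor shape are illustrative only):

    import torch
    from vllm import LLM

    llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # placeholder vision-language model

    # Precomputed image features; the shape is model-specific and dummy here.
    image_embeds = torch.randn(576, 4096)

    outputs = llm.generate({
        "prompt": "USER: <image>\nWhat is shown in this image? ASSISTANT:",
        "multi_modal_data": {"image": image_embeds},
    })
    print(outputs[0].outputs[0].text)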

Hardware Support

  • AMD: add FP8 linear layer for ROCm (#7210)
  • Enhancements to TPU support: load-time W8A16 quantization (#7005), optimized RoPE (#7635), and multi-host inference support (#7457)
  • Intel: various refactoring for worker, executor, and model runner (#7686, #7712)

Others

  • Optimize prefix caching performance (#7193)
  • Speculative decoding
    • Use target model max length as default for draft model (#7706)
    • EAGLE Implementation with Top-1 proposer (#6830)
  • Entrypoints
    • A new chat method in the LLM class (#5049); see the sketch after this list
    • Support embeddings in the run_batch API (#7132)
    • Support prompt_logprobs in Chat Completion (#7453)
  • Quantizations
    • Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
    • Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)
  • torch.compile: register custom ops for kernels (#7591, #7594, #7536)
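
For the new chat entrypoint in the LLM class mentioned above, a minimal sketch (the model name is a placeholder; messages are assumed to follow the OpenAI-style role/content convention):

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain multi-step scheduling in one sentence."},
    ]

    # chat() applies the model's chat template before generation.
    outputs = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=128))
    print(outputs[0].outputs[0].text)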

What's Changed

  • [ci][frontend] deduplicate tests by @youkaichao in #7101
  • [Doc] [SpecDecode] Update MLPSpeculator documentation by @tdoublep in #7100
  • [Bugfix] Specify device when loading LoRA and embedding tensors by @jischein in #7129
  • [MISC] Use non-blocking transfer in prepare_input by @comaniac in #7172
  • [Core] Support loading GGUF model by @Isotr0py in #5191
  • [Build] Add initial conditional testing spec by @simon-mo in #6841
  • [LoRA] Relax LoRA condition by @jeejeelee in #7146
  • [Model] Support SigLIP encoder and alternative decoders for LLaVA models by @DarkLight1337 in #7153
  • [BugFix] Fix DeepSeek remote code by @dsikka in #7178
  • [ BugFix ] Fix ZMQ when VLLM_PORT is set by @robertgshaw2-neuralmagic in #7205
  • [Bugfix] add gguf dependency by @kpapis in #7198
  • [SpecDecode] [Minor] Fix spec decode sampler tests by @LiuXiaoxuanPKU in #7183
  • [Kernel] Add per-tensor and per-token AZP epilogues by @ProExpertProg in #5941
  • [Core] Optimize evictor-v2 performance by @xiaobochen123 in #7193
  • [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) by @afeldman-nm in #4942
  • [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading by @mgoin in #7225
  • [BugFix] Overhaul async request cancellation by @njhill in #7111
  • [Doc] Mock new dependencies for documentation by @ywang96 in #7245
  • [BUGFIX]: top_k is expected to be an integer. by @Atllkks10 in #7227
  • [Frontend] Gracefully handle missing chat template and fix CI failure by @DarkLight1337 in #7238
  • [distributed][misc] add specialized method for cuda platform by @youkaichao in #7249
  • [Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 by @dsikka in #5874
  • [ BugFix ] Move zmq frontend to IPC instead of TCP by @robertgshaw2-neuralmagic in #7222
  • Fixes typo in function name by @rafvasq in #7275
  • [Bugfix] Fix input processor for InternVL2 model by @Isotr0py in #7164
  • [OpenVINO] migrate to latest dependencies versions by @ilya-lavrenov in #7251
  • [Doc] add online speculative decoding example by @stas00 in #7243
  • [BugFix] Fix frontend multiprocessing hang by @maxdebayser in #7217
  • [Bugfix][FP8] Fix dynamic FP8 Marlin quantization by @mgoin in #7219
  • [ci] Make building wheels per commit optional by @khluu in #7278
  • [Bugfix] Fix gptq failure on T4s by @LucasWilkinson in #7264
  • [FrontEnd] Make merge_async_iterators is_cancelled arg optional by @njhill in #7282
  • [Doc] Update supported_hardware.rst by @mgoin in #7276
  • [Kernel] Fix Flashinfer Correctness by @LiuXiaoxuanPKU in #7284
  • [Misc] Fix typos in scheduler.py by @ruisearch42 in #7285
  • [Frontend] remove max_num_batched_tokens limit for lora by @NiuBlibing in #7288
  • [Bugfix] Fix LoRA with PP by @andoorve in #7292
  • [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 by @jeejeelee in #7273
  • [Bugfix][Kernel] Increased atol to fix failing tests by @ProExpertProg in #7305
  • [Frontend] Kill the server on engine death by @joerunde in #6594
  • [Bugfix][fast] Fix the get_num_blocks_touched logic by @zachzzc in #6849
  • [Doc] Put collect_env issue output in a block by @mgoin in #7310
  • [CI/Build] Dockerfile.cpu improvements by @dtrifiro in #7298
  • [Bugfix] Fix new Llama3.1 GGUF model loading by @Isotr0py in #7269
  • [Misc] Temporarily resolve the error of BitAndBytes by @jeejeelee in #7308
  • Add Skywork AI as Sponsor by @simon-mo in #7314
  • [TPU] Add Load-time W8A16 quantization for TPU Backend by @lsy323 in #7005
  • [Core] Support serving encoder/decoder models by @DarkLight1337 in #7258
  • [TPU] Fix dockerfile.tpu by @WoosukKwon in #7331
  • [Performance] Optimize e2e overheads: Reduce python allocations by @alexm-neuralmagic in #7162
  • [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary by @tjohnson31415 in #7218
  • [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace by @SolitaryThinker in #6971
  • [Core] Streamline stream termination in AsyncLLMEngine by @njhill in #7336
  • [Model][Jamba] Mamba cache single buffer by @mzusman in #6739
  • [VLM][Doc] Add stop_token_ids to InternVL example by @Isotr0py in #7354
  • [Performance] e2e overheads reduction: Small followup diff by @alexm-neuralmagic in #7364
  • [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder by @alexm-neuralmagic in #7360
  • [Frontend] Support embeddings in the run_batch API by @pooyadavoodi in #7132
  • [Bugfix] Fix ITL recording in serving benchmark by @ywang96 in #7372
  • [Core] Add span metrics for model_forward, scheduler and sampler time by @sfc-gh-mkeralapura in #7089
  • [Bugfix] Fix PerTensorScaleParameter weight loading for fused models by @dsikka in #7376
  • [Misc] Add numpy implementation of compute_slot_mapping by @Yard1 in #7377
  • [Core] Fix edge case in chunked prefill + block manager v2 by @cadedaniel in #7380
  • [Bugfix] Fix phi3v batch inference when images have different aspect ratio by @Isotr0py in #7392
  • [TPU] Use mark_dynamic to reduce compilation time by @WoosukKwon in #7340
  • Updating LM Format Enforcer version to v0.10.6 by @noamgat in #7189
  • [core] [2/N] refactor worker_base input preparation for multi-step by @SolitaryThinker in #7387
  • [CI/Build] build on empty device for better dev experience by @tomeras91 in #4773
  • [Doc] add instructions about building vLLM with VLLM_TARGET_DEVICE=empty by @tomeras91 in #7403
  • [misc] add commit id in collect env by @youkaichao in #7405
  • [Docs] Update readme by @simon-mo in #7316
  • [CI/Build] Minor refactoring for vLLM assets by @ywang96 in #7407
  • [Kernel] Flashinfer correctness fix for v0.1.3 by @LiuXiaoxuanPKU in #7319
  • [Core][VLM] Support image embeddings as input by @ywang96 in #6613
  • [Frontend] Disallow passing model as both argument and option by @DarkLight1337 in #7347
  • [CI/Build] bump Dockerfile.neuron image base, use public ECR by @dtrifiro in #6832
  • [Bugfix] Fix logit soft cap in flash-attn backend by @WoosukKwon in #7425
  • [ci] Entrypoints run upon changes in vllm/ by @khluu in #7423
  • [ci] Cancel fastcheck run when PR is marked ready by @khluu in #7427
  • [ci] Cancel fastcheck when PR is ready by @khluu in #7433
  • [Misc] Use scalar type to dispatch to different gptq_marlin kernels by @LucasWilkinson in #7323
  • [Core] Consolidate GB constant and enable float GB arguments by @DarkLight1337 in #7416
  • [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel by @jon-chuang in #7208
  • [Bugfix] Handle PackageNotFoundError when checking for xpu version by @sasha0552 in #7398
  • [CI/Build] bump minimum cmake version by @dtrifiro in #6999
  • [Core] Shut down aDAG workers with clean async llm engine exit by @ruisearch42 in #7224
  • [mypy] Misc. typing improvements by @DarkLight1337 in #7417
  • [Misc] improve logits processors logging message by @aw632 in #7435
  • [ci] Remove fast check cancel workflow by @khluu in #7455
  • [Bugfix] Fix weight loading for Chameleon when TP>1 by @DarkLight1337 in #7410
  • [hardware] unify usage of is_tpu to current_platform.is_tpu() by @youkaichao in #7102
  • [TPU] Suppress import custom_ops warning by @WoosukKwon in #7458
  • Revert "[Doc] Update supported_hardware.rst (#7276)" by @WoosukKwon in #7467
  • [Frontend][Core] Add plumbing to support audio language models by @petersalas in #7446
  • [Misc] Update LM Eval Tolerance by @dsikka in #7473
  • [Misc] Update gptq_marlin to use new vLLMParameters by @dsikka in #7281
  • [Misc] Update Fused MoE weight loading by @dsikka in #7334
  • [Misc] Update awq and awq_marlin to use vLLMParameters by @dsikka in #7422
  • Announce NVIDIA Meetup by @simon-mo in #7483
  • [frontend] spawn engine process from api server process by @youkaichao in #7484
  • [Misc] compressed-tensors code reuse by @kylesayrs in #7277
  • [misc][plugin] add plugin system implementation by @youkaichao in #7426
  • [TPU] Support multi-host inference by @WoosukKwon in #7457
  • [Bugfix][CI] Import ray under guard by @WoosukKwon in #7486
  • [CI/Build]Reduce the time consumption for LoRA tests by @jeejeelee in #7396
  • [misc][ci] fix cpu test with plugins by @youkaichao in #7489
  • [Bugfix][Docs] Update list of mock imports by @DarkLight1337 in #7493
  • [doc] update test script to include cudagraph by @youkaichao in #7501
  • Fix empty output when temp is too low by @CatherineSue in #2937
  • [ci] fix model tests by @youkaichao in #7507
  • [Bugfix][Frontend] Disable embedding API for chat models by @QwertyJack in #7504
  • [Misc] Deprecation Warning when setting --engine-use-ray by @wallashss in #7424
  • [VLM][Core] Support profiling with multiple multi-modal inputs per prompt by @DarkLight1337 in #7126
  • [core] [3/N] multi-step args and sequence.py by @SolitaryThinker in #7452
  • [TPU] Set per-rank XLA cache by @WoosukKwon in #7533
  • [Misc] Revert compressed-tensors code reuse by @kylesayrs in #7521
  • llama_index serving integration documentation by @pavanjava in #6973
  • [Bugfix][TPU] Correct env variable for XLA cache path by @WoosukKwon in #7544
  • [Bugfix] update neuron for version > 0.5.0 by @omrishiv in #7175
  • [Misc] Update dockerfile for CPU to cover protobuf installation by @PHILO-HE in #7182
  • [Bugfix] Fix default weight loading for scalars by @mgoin in #7534
  • [Bugfix][Harmless] Fix hardcoded float16 dtype for model_is_embedding by @mgoin in #7566
  • [Misc] Add quantization config support for speculative model. by @ShangmingCai in #7343
  • [Feature]: Add OpenAI server prompt_logprobs support #6508 by @gnpinkert in #7453
  • [ci/test] rearrange tests and make adag test soft fail by @youkaichao in #7572
  • Chat method for offline llm by @nunjunj in #5049
  • [CI] Move quantization cpu offload tests out of fastcheck by @mgoin in #7574
  • [Misc/Testing] Use torch.testing.assert_close by @jon-chuang in #7324
  • register custom op for flash attn and use from torch.ops by @youkaichao in #7536
  • [Core] Use uvloop with zmq-decoupled front-end by @njhill in #7570
  • [CI] Fix crashes of performance benchmark by @KuntaiDu in #7500
  • [Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method by @gongdao123 in #7513
  • support tqdm in notebooks by @fzyzcjy in #7510
  • [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm by @charlifu in #7210
  • [Kernel] W8A16 Int8 inside FusedMoE by @mzusman in #7415
  • [Kernel] Add tuned triton configs for ExpertsInt8 by @mgoin in #7601
  • [spec decode] [4/N] Move update_flash_attn_metadata to attn backend by @SolitaryThinker in #7571
  • [Core] Fix tracking of model forward time to the span traces in case of PP>1 by @sfc-gh-mkeralapura in #7440
  • [Doc] Add docs for llmcompressor INT8 and FP8 checkpoints by @mgoin in #7444
  • [Doc] Update quantization supported hardware table by @mgoin in #7595
  • [Kernel] register punica functions as torch ops by @bnellnm in #7591
  • [Kernel][Misc] dynamo support for ScalarType by @bnellnm in #7594
  • [Kernel] fix types used in aqlm and ggml kernels to support dynamo by @bnellnm in #7596
  • [Model] Align nemotron config with final HF state and fix lm-eval-small by @mgoin in #7611
  • [Bugfix] Fix custom_ar support check by @bnellnm in #7617
  • .[Build/CI] Enabling passing AMD tests. by @Alexei-V-Ivanov-AMD in #7610
  • [Bugfix] Clear engine reference in AsyncEngineRPCServer by @ruisearch42 in #7618
  • [aDAG] Unflake aDAG + PP tests by @rkooo567 in #7600
  • [Bugfix] add >= 1.0 constraint for openai dependency by @metasyn in #7612
  • [misc] use nvml to get consistent device name by @youkaichao in #7582
  • [ci][test] fix engine/logger test by @youkaichao in #7621
  • [core][misc] update libcudart finding by @youkaichao in #7620
  • [Model] Pipeline parallel support for JAIS by @mrbesher in #7603
  • [ci][test] allow longer wait time for api server by @youkaichao in #7629
  • [Misc]Fix BitAndBytes exception messages by @jeejeelee in #7626
  • [VLM] Refactor MultiModalConfig initialization and profiling by @ywang96 in #7530
  • [TPU] Skip creating empty tensor by @WoosukKwon in #7630
  • [TPU] Use mark_dynamic only for dummy run by @WoosukKwon in #7634
  • [TPU] Optimize RoPE forward_native2 by @WoosukKwon in #7636
  • [ Bugfix ] Fix Prometheus Metrics With zeromq Frontend by @robertgshaw2-neuralmagic in #7279
  • [CI/Build] Add text-only test for Qwen models by @alex-jw-brooks in #7475
  • [Misc] Refactor Llama3 RoPE initialization by @WoosukKwon in #7637
  • [Core] Optimize SPMD architecture with delta + serialization optimization by @rkooo567 in #7109
  • [Core] Use flashinfer sampling kernel when available by @peng1999 in #7137
  • fix xpu build by @jikunshang in #7644
  • [Misc] Remove Gemma RoPE by @WoosukKwon in #7638
  • [MISC] Add prefix cache hit rate to metrics by @comaniac in #7606
  • [Bugfix] fix lora_dtype value type in arg_utils.py - part 2 by @c3-ali in #5428
  • [core] Multi Step Scheduling by @SolitaryThinker in #7000
  • [Core] Support tensor parallelism for GGUF quantization by @Isotr0py in #7520
  • [Bugfix] Don't disable existing loggers by @a-ys in #7664
  • [TPU] Fix redundant input tensor cloning by @WoosukKwon in #7660
  • [Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding by @tjohnson31415 in #7665
  • [doc] fix doc build error caused by msgspec by @youkaichao in #7659
  • [Speculative Decoding] Fixing hidden states handling in batch expansion by @abhigoyal1997 in #7508
  • [ci] Install Buildkite test suite analysis by @khluu in #7667
  • [Bugfix] support tie_word_embeddings for all models by @zijian-hu in #5724
  • [CI] Organizing performance benchmark files by @KuntaiDu in #7616
  • [misc] add nvidia related library in collect env by @youkaichao in #7674
  • [XPU] fallback to native implementation for xpu custom op by @jianyizh in #7670
  • [misc][cuda] add warning for pynvml user by @youkaichao in #7675
  • [Core] Refactor executor classes to make it easier to inherit GPUExecutor by @jikunshang in #7673
  • [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel by @LucasWilkinson in #7174
  • [OpenVINO] Updated documentation by @ilya-lavrenov in #7687
  • [VLM][Model] Add test for InternViT vision encoder by @Isotr0py in #7409
  • [Hardware] [Intel GPU] refactor xpu worker/executor by @jikunshang in #7686
  • [CI/Build] Pin OpenTelemetry versions and make availability errors clearer by @ronensc in #7266
  • [Misc] Add jinja2 as an explicit build requirement by @LucasWilkinson in #7695
  • [Core] Add AttentionState abstraction by @Yard1 in #7663
  • [Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) by @jikunshang in #7685
  • [ci][test] adjust max wait time for cpu offloading test by @youkaichao in #7709
  • [Core] Pipe worker_class_fn argument in Executor by @Yard1 in #7707
  • [ci] try to log process using the port to debug the port usage by @youkaichao in #7711
  • [Model] Add AWQ quantization support for InternVL2 model by @Isotr0py in #7187
  • [Doc] Section for Multimodal Language Models by @ywang96 in #7719
  • [mypy] Enable following imports for entrypoints by @DarkLight1337 in #7248
  • [Bugfix] Mirror jinja2 in pyproject.toml by @sasha0552 in #7723
  • [BugFix] Avoid premature async generator exit and raise all exception variations by @njhill in #7698
  • [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] by @learninmou in #7509
  • [Bugfix][Hardware][CPU] Fix mm_limits initialization for CPU backend by @Isotr0py in #7735
  • [Spec Decoding] Use target model max length as default for draft model by @njhill in #7706
  • [Bugfix] chat method add_generation_prompt param by @brian14708 in #7734
  • [Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend by @robertgshaw2-neuralmagic in #7394
  • [Bugfix] Pass PYTHONPATH from setup.py to CMake by @sasha0552 in #7730
  • [multi-step] Raise error if not using async engine by @SolitaryThinker in #7703
  • [Frontend] Improve Startup Failure UX by @robertgshaw2-neuralmagic in #7716
  • [misc] Add Torch profiler support by @SolitaryThinker in #7451
  • [Model] Add UltravoxModel and UltravoxConfig by @petersalas in #7615
  • [ci] [multi-step] narrow multi-step test dependency paths by @SolitaryThinker in #7760
  • [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by @dsikka in #7527
  • [distributed][misc] error on same VLLM_HOST_IP setting by @youkaichao in #7756
  • [AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility by @gshtras in #7477
  • [Kernel] Replaced blockReduce[...] functions with cub::BlockReduce by @ProExpertProg in #7233
  • [Model] Fix Phi-3.5-vision-instruct 'num_crops' issue by @zifeitong in #7710
  • [Bug][Frontend] Improve ZMQ client robustness by @joerunde in #7443
  • Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" by @mgoin in #7764
  • [TPU] Avoid initializing TPU runtime in is_tpu by @WoosukKwon in #7763
  • [ci] refine dependency for distributed tests by @youkaichao in #7776
  • [Misc] Use torch.compile for GemmaRMSNorm by @WoosukKwon in #7642
  • [Speculative Decoding] EAGLE Implementation with Top-1 proposer by @abhigoyal1997 in #6830
  • Fix ShardedStateLoader for vllm fp8 quantization by @sfc-gh-zhwang in #7708
  • [Bugfix] Don't build machete on cuda <12.0 by @LucasWilkinson in #7757
  • [Misc] update fp8 to use vLLMParameter by @dsikka in #7437
  • [Bugfix] spec decode handle None entries in topk args in create_sequence_group_output by @tjohnson31415 in #7232
  • [Misc] Enhance prefix-caching benchmark tool by @Jeffwan in #6568
  • [Doc] Fix incorrect docs from #7615 by @petersalas in #7788
  • [Bugfix] Use LoadFormat values as choices for vllm serve --load-format by @mgoin in #7784
  • [ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args by @khluu in #7705
  • [Misc] fix typo in triton import warning by @lsy323 in #7794
  • [Frontend] error suppression cleanup by @joerunde in #7786
  • [Ray backend] Better error when pg topology is bad. by @rkooo567 in #7584
  • [Hardware][Intel GPU] refactor xpu_model_runner, fix xpu tensor parallel by @jikunshang in #7712
  • [misc] Add Torch profiler support for CPU-only devices by @DamonFool in #7806
  • [BugFix] Fix server crash on empty prompt by @maxdebayser in #7746
  • [github][misc] promote asking llm first by @youkaichao in #7809
  • [Misc] Update marlin to use vLLMParameters by @dsikka in #7803
  • Bump version to v0.5.5 by @simon-mo in #7823

New Contributors

Full Changelog: v0.5.4...v0.5.5
