github vllm-project/vllm v0.5.4


Highlights

Model Support

  • Enhanced pipeline parallelism support for DeepSeek v2 (#6519), Qwen (#6974), Qwen2 (#6924), and Nemotron (#6863)
  • Enhanced vision language model support for InternVL2 (#6514, #7067), BLIP-2 (#5920), and MiniCPM-V (#4087, #7122); see the offline inference sketch after this list
  • Added H2O Danube3-4b (#6451)
  • Added Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#6611)
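
For the expanded vision-language support, offline inference follows the consolidated VLM example format from this release: a PIL image is passed alongside the prompt via multi_modal_data. A minimal sketch, assuming BLIP-2's usual Question/Answer prompt style; the image path and prompt template are illustrative, and other models (e.g. MiniCPM-V, InternVL2) use their own placeholder formats:

```python
# Minimal offline vision-language sketch for this release's consolidated VLM
# examples. The image path and BLIP-2 style prompt are illustrative only.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Salesforce/blip2-opt-2.7b")  # BLIP-2 support added in #5920
image = Image.open("example.jpg").convert("RGB")

outputs = llm.generate(
    {
        "prompt": "Question: What is shown in this image? Answer:",
        "multi_modal_data": {"image": image},  # PIL image passed next to the prompt
    },
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```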

Hardware Support

  • TPU enhancements: collective communication, TP for async engine, faster compile time (#6891, #6933, #6856, #6813, #5871)
  • Intel CPU: enabled multiprocessing and tensor parallelism (#6125); a configuration sketch follows this list
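
As a rough illustration of the new CPU tensor parallelism, the sketch below shards a model across two CPU workers through the multiprocessing executor; the environment variable and executor setting follow the CPU backend docs and should be treated as assumptions for your installed version, and the model is only an example:

```python
# Hedged sketch of tensor parallelism on the Intel CPU backend (#6125).
# Env var and executor backend are assumptions based on the CPU backend docs.
import os

os.environ.setdefault("VLLM_CPU_KVCACHE_SPACE", "40")  # KV cache space per worker, in GiB

from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-1.3b",
    tensor_parallel_size=2,             # shard across two CPU workers
    distributed_executor_backend="mp",  # multiprocessing executor (assumed for CPU TP)
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```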

Performance

We are progressing along our quest to quickly improve performance. Each of the following PRs contributed some improvements, and we anticipate more enhancements in the next release.

  • Separated the OpenAI server's HTTP request handling from the model inference loop using ZeroMQ, bringing a 20% improvement in time to first token and a 2x improvement in inter-token latency (#6883); a generic sketch of the pattern follows this list
  • Used Python's native array data structure to speed up padding, bringing a 15% throughput improvement in large-batch scenarios (#6779)
  • Reduced unnecessary compute when logprobs=None, cutting the latency of computing log probabilities from ~30 ms to ~5 ms in large-batch scenarios (#6532)
  • Optimized the get_seqs function for a 2% throughput improvement (#7051)
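
The first item above splits the API server and the engine into separate processes that talk over ZeroMQ, so HTTP parsing and response serialization no longer compete for Python time with the inference loop. A generic pyzmq sketch of that frontend/backend split (not vLLM's actual RPC protocol; the socket address and message shape are made up for illustration):

```python
# Illustrative frontend/backend split over ZeroMQ, mirroring the pattern in
# #6883: one process handles requests, the other runs inference. The socket
# address and JSON message format are invented for this example.
import zmq

ADDR = "ipc:///tmp/demo_rpc"

def backend() -> None:
    """Engine-side loop: receive a request, 'run inference', reply."""
    sock = zmq.Context().socket(zmq.REP)
    sock.bind(ADDR)
    while True:
        req = sock.recv_json()
        # A real engine would run the model here and stream token deltas back.
        sock.send_json({"text": "echo: " + req["prompt"]})

def frontend() -> None:
    """API-server side: forward a request and print the reply."""
    sock = zmq.Context().socket(zmq.REQ)
    sock.connect(ADDR)
    sock.send_json({"prompt": "Hello", "max_tokens": 8})
    print(sock.recv_json())
```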

Production Features

  • Enhancements to speculative decoding: FlashInfer in DraftModelRunner (#6926), observability (#6963), and benchmarks (#6964); a configuration sketch follows this list
  • Refactor the punica kernel based on Triton (#5036)
  • Support for guided decoding for offline LLM (#6878)
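
For the speculative-decoding work, a hedged configuration sketch using the engine arguments of the 0.5.x line; the OPT target/draft pairing is illustrative (a small draft sharing the target's tokenizer), not a performance recommendation:

```python
# Hedged speculative-decoding configuration for the 0.5.x engine arguments;
# model choices are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="facebook/opt-125m",  # small draft model proposing tokens
    num_speculative_tokens=5,               # tokens proposed per step before verification
    use_v2_block_manager=True,              # required by speculative decoding in 0.5.x
)
out = llm.generate(["The future of AI is"], SamplingParams(temperature=0.0, max_tokens=32))
print(out[0].outputs[0].text)
```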

Quantization

  • Support for W4A8 quantization in vLLM (#5218)
  • Tuned FP8 and INT8 Kernels for Ada Lovelace and SM75 T4 (#6677, #6996, #6848)
  • Support for reading bitsandbytes pre-quantized models (#5753); see the loading sketch after this list
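
Reading a bitsandbytes pre-quantized checkpoint (#5753) is driven by the load format and quantization flags; a short sketch, assuming the flag values documented for the bitsandbytes integration, with an illustrative pre-quantized model ID:

```python
# Hedged sketch of loading a bitsandbytes pre-quantized checkpoint (#5753).
# The model ID is illustrative; the flag values follow the bitsandbytes docs
# and are assumptions for this release. Note that #6846 temporarily enforces
# eager mode for bnb-quantized models.
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/tinyllama-bnb-4bit",  # example pre-quantized 4-bit checkpoint
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
print(llm.generate(["Write a haiku about GPUs."],
                   SamplingParams(max_tokens=32))[0].outputs[0].text)
```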

What's Changed

  • [Docs] Announce llama3.1 support by @WoosukKwon in #6688
  • [doc][distributed] fix doc argument order by @youkaichao in #6691
  • [Bugfix] Fix a log error in chunked prefill by @WoosukKwon in #6694
  • [BugFix] Fix RoPE error in Llama 3.1 by @WoosukKwon in #6693
  • Bump version to 0.5.3.post1 by @simon-mo in #6696
  • [Misc] Add ignored layers for fp8 quantization by @mgoin in #6657
  • [Frontend] Add Usage data in each chunk for chat_serving. #6540 by @yecohn in #6652
  • [Model] Pipeline Parallel Support for DeepSeek v2 by @tjohnson31415 in #6519
  • Bump transformers version for Llama 3.1 hotfix and patch Chameleon by @ywang96 in #6690
  • [build] relax wheel size limit by @youkaichao in #6704
  • [CI] Add smoke test for non-uniform AutoFP8 quantization by @mgoin in #6702
  • [Bugfix] StatLoggers: cache spec decode metrics when they get collected. by @tdoublep in #6645
  • [bitsandbytes]: support read bnb pre-quantized model by @thesues in #5753
  • [Bugfix] fix flashinfer cudagraph capture for PP by @SolitaryThinker in #6708
  • [SpecDecoding] Update MLPSpeculator CI tests to use smaller model by @njhill in #6714
  • [Bugfix] Fix token padding for chameleon by @ywang96 in #6724
  • [Docs][ROCm] Detailed instructions to build from source by @WoosukKwon in #6680
  • [Build/CI] Update run-amd-test.sh. Enable Docker Hub login. by @Alexei-V-Ivanov-AMD in #6711
  • [Bugfix]fix modelscope compatible issue by @liuyhwangyh in #6730
  • Adding f-string to validation error which is missing by @luizanao in #6748
  • [Bugfix] Fix speculative decode seeded test by @njhill in #6743
  • [Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. by @AllenDou in #6686
  • [Frontend] split run_server into build_server and run_server by @dtrifiro in #6740
  • [Kernels] Add fp8 support to reshape_and_cache_flash by @Yard1 in #6667
  • [Core] Tweaks to model runner/input builder developer APIs by @Yard1 in #6712
  • [Bugfix] Bump transformers to 4.43.2 by @mgoin in #6752
  • [Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users by @hongxiayang in #6754
  • [core][distributed] fix zmq hang by @youkaichao in #6759
  • [Frontend] Represent tokens with identifiable strings by @ezliu in #6626
  • [Model] Adding support for MiniCPM-V by @HwwwwwwwH in #4087
  • [Bugfix] Fix decode tokens w. CUDA graph by @comaniac in #6757
  • [Bugfix] Fix awq_marlin and gptq_marlin flags by @alexm-neuralmagic in #6745
  • [Bugfix] Fix encoding_format in examples/openai_embedding_client.py by @CatherineSue in #6755
  • [Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V by @HwwwwwwwH in #6787
  • [ Misc ] fp8-marlin channelwise via compressed-tensors by @robertgshaw2-neuralmagic in #6524
  • [Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints by @mgoin in #6761
  • [Bugfix] Add synchronize to prevent possible data race by @tlrmchlsmth in #6788
  • [Doc] Add documentations for nightly benchmarks by @KuntaiDu in #6412
  • [Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors by @LucasWilkinson in #6798
  • [doc][distributed] improve multinode serving doc by @youkaichao in #6804
  • [Docs] Publish 5th meetup slides by @WoosukKwon in #6799
  • [Core] Fix ray forward_dag error mssg by @rkooo567 in #6792
  • [ci][distributed] fix flaky tests by @youkaichao in #6806
  • [ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check by @khluu in #6810
  • Fix ReplicatedLinear weight loading by @qingquansong in #6793
  • [Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. by @eaplatanios in #6770
  • [Core] Use array to speedup padding by @peng1999 in #6779
  • [doc][debugging] add known issues for hangs by @youkaichao in #6816
  • [Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) by @mgoin in #6611
  • [Bugfix][Kernel] Promote another index to int64_t by @tlrmchlsmth in #6838
  • [Build/CI][ROCm] Minor simplification to Dockerfile.rocm by @WoosukKwon in #6811
  • [Misc][TPU] Support TPU in initialize_ray_cluster by @WoosukKwon in #6812
  • [Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation by @bigPYJ1151 in #6125
  • [Doc] Add Nemotron to supported model docs by @mgoin in #6843
  • [Doc] Update SkyPilot doc for wrong indents and instructions for update service by @Michaelvll in #4283
  • Update README.md by @gurpreet-dhami in #6847
  • enforce eager mode with bnb quantization temporarily by @chenqianfzh in #6846
  • [TPU] Support collective communications in XLA devices by @WoosukKwon in #6813
  • [Frontend] Factor out code for running uvicorn by @DarkLight1337 in #6828
  • [Bug Fix] Illegal memory access, FP8 Llama 3.1 405b by @LucasWilkinson in #6852
  • [Bugfix]: Fix Tensorizer test failures by @sangstar in #6835
  • [ROCm] Upgrade PyTorch nightly version by @WoosukKwon in #6845
  • [Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron by @omrishiv in #6844
  • [Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba by @tomeras91 in #6784
  • [Model] H2O Danube3-4b by @g-eoj in #6451
  • [Hardware][TPU] Implement tensor parallelism with Ray by @WoosukKwon in #5871
  • [Doc] Add missing mock import to docs conf.py by @hmellor in #6834
  • [Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor by @tjohnson31415 in #6802
  • [Misc][VLM][Doc] Consolidate offline examples for vision language models by @ywang96 in #6858
  • [Bugfix] Fix VLM example typo by @ywang96 in #6859
  • [bugfix] make args.stream work by @WrRan in #6831
  • [CI/Build][Doc] Update CI and Doc for VLM example changes by @ywang96 in #6860
  • [Model] Initial support for BLIP-2 by @DarkLight1337 in #5920
  • [Docs] Add RunLLM chat widget by @cw75 in #6857
  • [TPU] Reduce compilation time & Upgrade PyTorch XLA version by @WoosukKwon in #6856
  • [Kernel] Increase precision of GPTQ/AWQ Marlin kernel by @alexm-neuralmagic in #6795
  • Add Nemotron to PP_SUPPORTED_MODELS by @mgoin in #6863
  • [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 by @zeyugao in #6871
  • [Model] Initialize support for InternVL2 series models by @Isotr0py in #6514
  • [Kernel] Tuned FP8 Kernels for Ada Lovelace by @varun-sundar-rabindranath in #6677
  • [Core] Reduce unnecessary compute when logprobs=None by @peng1999 in #6532
  • [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel by @tlrmchlsmth in #6901
  • [TPU] Add TPU tensor parallelism to async engine by @etwk in #6891
  • [Bugfix] Allow vllm to still work if triton is not installed. by @tdoublep in #6786
  • [Frontend] New allowed_token_ids decoding request parameter by @njhill in #6753
  • [Kernel] Remove unused variables in awq/gemm_kernels.cu by @tlrmchlsmth in #6908
  • [ci] GHA workflow to remove ready label upon "/notready" comment by @khluu in #6921
  • [Kernel] Fix marlin divide-by-zero warnings by @tlrmchlsmth in #6904
  • [Kernel] Tuned int8 kernels for Ada Lovelace by @varun-sundar-rabindranath in #6848
  • [TPU] Fix greedy decoding by @WoosukKwon in #6933
  • [Bugfix] Fix PaliGemma MMP by @ywang96 in #6930
  • [Doc] Super tiny fix doc typo by @fzyzcjy in #6949
  • [BugFix] Fix use of per-request seed with pipeline parallel by @njhill in #6698
  • [Kernel] Squash a few more warnings by @tlrmchlsmth in #6914
  • [OpenVINO] Updated OpenVINO requirements and build docs by @ilya-lavrenov in #6948
  • [Bugfix] Fix tensorizer memory profiling bug during testing by @sangstar in #6881
  • [Kernel] Remove scaled_fp8_quant kernel padding footgun by @tlrmchlsmth in #6842
  • [core][misc] improve free_finished_seq_groups by @youkaichao in #6865
  • [Build] Temporarily Disable Kernels and LoRA tests by @simon-mo in #6961
  • [Nightly benchmarking suite] Remove pkill python from run benchmark suite by @cadedaniel in #6965
  • [CI] [nightly benchmark] Do not re-download sharegpt dataset if exists by @cadedaniel in #6706
  • [Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding by @cadedaniel in #6964
  • [mypy] Enable following imports for some directories by @DarkLight1337 in #6681
  • [Bugfix] Fix broadcasting logic for multi_modal_kwargs by @DarkLight1337 in #6836
  • [CI/Build] Fix mypy errors by @DarkLight1337 in #6968
  • [Bugfix][TPU] Set readonly=True for non-root devices by @WoosukKwon in #6980
  • [Bugfix] fix logit processor excceed vocab size issue by @FeiDeng in #6927
  • Support W4A8 quantization for vllm by @HandH1998 in #5218
  • [Bugfix] Clean up MiniCPM-V by @HwwwwwwwH in #6939
  • [Bugfix] Fix feature size calculation for LLaVA-NeXT by @DarkLight1337 in #6982
  • [Model] use FusedMoE layer in Jamba by @avshalomman in #6935
  • [MISC] Introduce pipeline parallelism partition strategies by @comaniac in #6920
  • [Bugfix] Support cpu offloading with quant_method.process_weights_after_loading by @mgoin in #6960
  • [Kernel] Enable FP8 Cutlass for Ada Lovelace by @varun-sundar-rabindranath in #6950
  • [Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) by @varun-sundar-rabindranath in #6996
  • [Misc] Add compressed-tensors to optimized quant list by @mgoin in #7006
  • Revert "[Frontend] Factor out code for running uvicorn" by @simon-mo in #7012
  • [Kernel][RFC] Refactor the punica kernel based on Triton by @jeejeelee in #5036
  • [Model] Pipeline parallel support for Qwen2 by @xuyi in #6924
  • [Bugfix][TPU] Do not use torch.Generator for TPUs by @WoosukKwon in #6981
  • [Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings by @tjohnson31415 in #6758
  • PP comm optimization: replace send with partial send + allgather by @aurickq in #6695
  • [Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user by @zifeitong in #6954
  • [core][scheduler] simplify and improve scheduler by @youkaichao in #6867
  • [Build/CI] Fixing Docker Hub quota issue. by @Alexei-V-Ivanov-AMD in #7043
  • [CI/Build] Update torch to 2.4 by @SageMoore in #6951
  • [Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm by @Isotr0py in #6992
  • [CI/Build] Remove sparseml requirement from testing by @mgoin in #7037
  • [Bugfix] Lower gemma's unloaded_params exception to warning by @mgoin in #7002
  • [Models] Support Qwen model with PP by @andoorve in #6974
  • Update run-amd-test.sh by @okakarpa in #7044
  • [Misc] Support attention logits soft-capping with flash-attn by @WoosukKwon in #7022
  • [CI/Build][Bugfix] Fix CUTLASS header-only line by @tlrmchlsmth in #7034
  • [Performance] Optimize get_seqs by @WoosukKwon in #7051
  • [Kernel] Fix input for flashinfer prefill wrapper. by @LiuXiaoxuanPKU in #7008
  • [mypy] Speed up mypy checking by @DarkLight1337 in #7056
  • [ci][distributed] try to fix pp test by @youkaichao in #7054
  • Fix tracing.py by @bong-furiosa in #7065
  • [cuda][misc] remove error_on_invalid_device_count_status by @youkaichao in #7069
  • [Core] Comment out unused code in sampler by @peng1999 in #7023
  • [Hardware][Intel CPU] Update torch 2.4.0 for CPU backend by @DamonFool in #6931
  • [ci] set timeout for test_oot_registration.py by @youkaichao in #7082
  • [CI/Build] Add support for Python 3.12 by @mgoin in #7035
  • [Misc] Disambiguate quantized types via a new ScalarType by @LucasWilkinson in #6396
  • [Core] Pipeline parallel with Ray ADAG by @ruisearch42 in #6837
  • [Misc] Revive to use loopback address for driver IP by @ruisearch42 in #7091
  • [misc] add a flag to enable compile by @youkaichao in #7092
  • [ Frontend ] Multiprocessing for OpenAI Server with zeromq by @robertgshaw2-neuralmagic in #6883
  • [ci][distributed] shorten wait time if server hangs by @youkaichao in #7098
  • [Frontend] Factor out chat message parsing by @DarkLight1337 in #7055
  • [ci][distributed] merge distributed test commands by @youkaichao in #7097
  • [ci][distributed] disable ray dag tests by @youkaichao in #7099
  • [Model] Refactor and decouple weight loading logic for InternVL2 model by @Isotr0py in #7067
  • [Bugfix] Fix block table for seqs that have prefix cache hits by @zachzzc in #7018
  • [LoRA] ReplicatedLinear support LoRA by @jeejeelee in #7081
  • [CI] Temporarily turn off H100 performance benchmark by @KuntaiDu in #7104
  • [ci][test] finalize fork_new_process_for_each_test by @youkaichao in #7114
  • [Frontend] Warn if user max_model_len is greater than derived max_model_len by @fialhocoelho in #7080
  • Support for guided decoding for offline LLM by @kevinbu233 in #6878
  • [misc] add zmq in collect env by @youkaichao in #7119
  • [core][misc] simply output processing with shortcut for non-parallel sampling and non-beam search usecase by @youkaichao in #7117
  • [Model]Refactor MiniCPMV by @jeejeelee in #7020
  • [Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator by @tdoublep in #7105
  • [misc][distributed] improve libcudart.so finding by @youkaichao in #7127
  • Clean up remaining Punica C information by @jeejeelee in #7027
  • [Model] Add multi-image support for minicpmv offline inference by @HwwwwwwwH in #7122
  • [Frontend] Reapply "Factor out code for running uvicorn" by @DarkLight1337 in #7095
  • [Model] SiglipVisionModel ported from transformers by @ChristopherCho in #6942
  • [Speculative decoding] Add periodic log with time spent in proposal/scoring/verification by @cadedaniel in #6963
  • [SpecDecode] Support FlashInfer in DraftModelRunner by @bong-furiosa in #6926
  • [BugFix] Use IP4 localhost form for zmq bind by @njhill in #7163
  • [BugFix] Use args.trust_remote_code by @VastoLorde95 in #7121
  • [Misc] Fix typo in GroupCoordinator.recv() by @ruisearch42 in #7167
  • [Kernel] Update CUTLASS to 3.5.1 by @tlrmchlsmth in #7085
  • [CI/Build] Suppress divide-by-zero and missing return statement warnings by @tlrmchlsmth in #7001
  • [Bugfix][CI/Build] Fix CUTLASS FetchContent by @tlrmchlsmth in #7171
  • bump version to v0.5.4 by @simon-mo in #7139

New Contributors

Full Changelog: v0.5.3...v0.5.4
