Highlights
Model Support
- Enhanced pipeline parallelism support for DeepSeek v2 (#6519), Qwen (#6974), Qwen2 (#6924), and Nemotron (#6863)
- Enhanced vision language model support for InternVL2 (#6514, #7067), BLIP-2 (#5920), and MiniCPM-V (#4087, #7122); see the offline inference sketch after this list
- Added H2O Danube3-4b (#6451)
- Added Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#6611)
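
As a quick orientation for the newly supported VLMs, here is a hedged sketch of offline image inference. It follows the pattern of the consolidated offline VLM examples (#6858); the model choice, prompt template, and image path are illustrative only, and the exact prompt format is model-specific.

```python
# Hedged sketch of offline inference with one of the newly supported VLMs.
# Model choice, prompt template, and image path are illustrative only; the
# prompt format is model-specific (see the consolidated examples from #6858).
from PIL import Image
from vllm import LLM

llm = LLM(model="Salesforce/blip2-opt-2.7b")  # BLIP-2 support added in #5920

image = Image.open("example.jpg")             # any local RGB image
outputs = llm.generate({
    "prompt": "Question: What is shown in this image? Answer:",
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```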
Hardware Support
- TPU enhancements: collective communication, TP for async engine, faster compile time (#6891, #6933, #6856, #6813, #5871)
- Intel CPU: enable multiprocessing and tensor parallelism (#6125)
Performance
We are progressing along our quest to quickly improve performance. Each of the following PRs contributed some improvements, and we anticipate more enhancements in the next release.
- Decoupled the OpenAI server's HTTP request handling from the model inference loop using `zeromq`, bringing a 20% speedup in time to first token and a 2x speedup in inter-token latency; see the sketch after this list. (#6883)
- Used Python's native `array` data structure to speed up padding, bringing a 15% throughput improvement in large-batch-size scenarios. (#6779)
- Reduced unnecessary compute when `logprobs=None`, cutting the latency of fetching log probs from ~30 ms to ~5 ms in large-batch-size scenarios. (#6532)
- Optimized the `get_seqs` function, bringing a 2% throughput improvement. (#7051)
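
To make the first item concrete, here is a minimal, self-contained sketch of the process-decoupling pattern. It is not vLLM's actual implementation; the socket address, message schema, and stand-in generation step are made up for illustration.

```python
# Minimal sketch of the decoupling idea behind #6883, NOT vLLM's actual code.
# The front end and the "engine" run as separate processes and exchange
# requests/results over ZeroMQ (pyzmq), so HTTP handling and serialization
# never block the inference loop. Address and message schema are made up.
import multiprocessing
import zmq

ENDPOINT = "tcp://127.0.0.1:5555"  # arbitrary local port for this sketch

def engine_loop() -> None:
    """Engine process: receive prompts, 'generate', and reply with results."""
    sock = zmq.Context().socket(zmq.REP)
    sock.bind(ENDPOINT)
    while True:
        request = sock.recv_json()                          # e.g. {"prompt": "..."}
        sock.send_json({"text": request["prompt"][::-1]})   # stand-in generation

def frontend(prompt: str) -> str:
    """Front-end process: forward one request to the engine, return its reply."""
    sock = zmq.Context().socket(zmq.REQ)
    sock.connect(ENDPOINT)
    sock.send_json({"prompt": prompt})
    return sock.recv_json()["text"]

if __name__ == "__main__":
    multiprocessing.Process(target=engine_loop, daemon=True).start()
    print(frontend("hello"))  # prints "olleh"
```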
Production Features
- Enhancements to speculative decoding: FlashInfer in DraftModelRunner (#6926), observability (#6963), and benchmarks (#6964)
- Refactor the punica kernel based on Triton (#5036)
- Support for guided decoding for offline LLM (#6878)
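
A hedged sketch of the new offline guided decoding path follows. The `GuidedDecodingRequest` class and the `guided_options_request` argument are assumptions based on #6878 and may differ in exact name or import path; the model and choice list are illustrative.

```python
# Hedged sketch of offline guided decoding (#6878). GuidedDecodingRequest and
# the guided_options_request argument are assumptions based on that PR and may
# differ in name or import path; the model and choice list are illustrative.
from vllm import LLM, SamplingParams
from vllm.model_executor.guided_decoding.guided_fields import GuidedDecodingRequest

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

# Constrain the completion to one of a fixed set of strings.
guided = GuidedDecodingRequest(guided_choice=["positive", "negative"])

outputs = llm.generate(
    "Sentiment of 'vLLM keeps getting faster': ",
    sampling_params=SamplingParams(temperature=0.0, max_tokens=5),
    guided_options_request=guided,
)
print(outputs[0].outputs[0].text)
```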
Quantization
- Support W4A8 quantization for vllm (#5218)
- Tuned FP8 and INT8 Kernels for Ada Lovelace and SM75 T4 (#6677, #6996, #6848)
- Support reading bitsandbytes pre-quantized model (#5753)
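
Loading a pre-quantized bitsandbytes checkpoint looks roughly like the sketch below; the model name is only an example, and passing `enforce_eager=True` mirrors the temporary restriction from #6846.

```python
# Sketch of loading a bitsandbytes pre-quantized checkpoint (#5753). The model
# name is only an example; any bnb 4-bit pre-quantized checkpoint should work.
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/tinyllama-bnb-4bit",  # example pre-quantized checkpoint
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    enforce_eager=True,                  # mirrors the temporary bnb restriction from #6846
)
out = llm.generate("Hello, my name is", SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```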
What's Changed
- [Docs] Announce llama3.1 support by @WoosukKwon in #6688
- [doc][distributed] fix doc argument order by @youkaichao in #6691
- [Bugfix] Fix a log error in chunked prefill by @WoosukKwon in #6694
- [BugFix] Fix RoPE error in Llama 3.1 by @WoosukKwon in #6693
- Bump version to 0.5.3.post1 by @simon-mo in #6696
- [Misc] Add ignored layers for `fp8` quantization by @mgoin in #6657
- [Frontend] Add Usage data in each chunk for chat_serving. #6540 by @yecohn in #6652
- [Model] Pipeline Parallel Support for DeepSeek v2 by @tjohnson31415 in #6519
- Bump `transformers` version for Llama 3.1 hotfix and patch Chameleon by @ywang96 in #6690
- [build] relax wheel size limit by @youkaichao in #6704
- [CI] Add smoke test for non-uniform AutoFP8 quantization by @mgoin in #6702
- [Bugfix] StatLoggers: cache spec decode metrics when they get collected. by @tdoublep in #6645
- [bitsandbytes]: support read bnb pre-quantized model by @thesues in #5753
- [Bugfix] fix flashinfer cudagraph capture for PP by @SolitaryThinker in #6708
- [SpecDecoding] Update MLPSpeculator CI tests to use smaller model by @njhill in #6714
- [Bugfix] Fix token padding for chameleon by @ywang96 in #6724
- [Docs][ROCm] Detailed instructions to build from source by @WoosukKwon in #6680
- [Build/CI] Update run-amd-test.sh. Enable Docker Hub login. by @Alexei-V-Ivanov-AMD in #6711
- [Bugfix] fix modelscope compatibility issue by @liuyhwangyh in #6730
- Adding f-string to validation error which is missing by @luizanao in #6748
- [Bugfix] Fix speculative decode seeded test by @njhill in #6743
- [Bugfix] Miscalculated latency led to inaccurate time_to_first_token_seconds by @AllenDou in #6686
- [Frontend] split run_server into build_server and run_server by @dtrifiro in #6740
- [Kernels] Add fp8 support to `reshape_and_cache_flash` by @Yard1 in #6667
- [Core] Tweaks to model runner/input builder developer APIs by @Yard1 in #6712
- [Bugfix] Bump transformers to 4.43.2 by @mgoin in #6752
- [Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users by @hongxiayang in #6754
- [core][distributed] fix zmq hang by @youkaichao in #6759
- [Frontend] Represent tokens with identifiable strings by @ezliu in #6626
- [Model] Adding support for MiniCPM-V by @HwwwwwwwH in #4087
- [Bugfix] Fix decode tokens w. CUDA graph by @comaniac in #6757
- [Bugfix] Fix awq_marlin and gptq_marlin flags by @alexm-neuralmagic in #6745
- [Bugfix] Fix encoding_format in examples/openai_embedding_client.py by @CatherineSue in #6755
- [Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V by @HwwwwwwwH in #6787
- [ Misc ] `fp8-marlin` channelwise via `compressed-tensors` by @robertgshaw2-neuralmagic in #6524
- [Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints by @mgoin in #6761
- [Bugfix] Add synchronize to prevent possible data race by @tlrmchlsmth in #6788
- [Doc] Add documentations for nightly benchmarks by @KuntaiDu in #6412
- [Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors by @LucasWilkinson in #6798
- [doc][distributed] improve multinode serving doc by @youkaichao in #6804
- [Docs] Publish 5th meetup slides by @WoosukKwon in #6799
- [Core] Fix ray forward_dag error mssg by @rkooo567 in #6792
- [ci][distributed] fix flaky tests by @youkaichao in #6806
- [ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check by @khluu in #6810
- Fix ReplicatedLinear weight loading by @qingquansong in #6793
- [Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. by @eaplatanios in #6770
- [Core] Use array to speedup padding by @peng1999 in #6779
- [doc][debugging] add known issues for hangs by @youkaichao in #6816
- [Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) by @mgoin in #6611
- [Bugfix][Kernel] Promote another index to int64_t by @tlrmchlsmth in #6838
- [Build/CI][ROCm] Minor simplification to Dockerfile.rocm by @WoosukKwon in #6811
- [Misc][TPU] Support TPU in initialize_ray_cluster by @WoosukKwon in #6812
- [Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation by @bigPYJ1151 in #6125
- [Doc] Add Nemotron to supported model docs by @mgoin in #6843
- [Doc] Update SkyPilot doc for wrong indents and instructions for update service by @Michaelvll in #4283
- Update README.md by @gurpreet-dhami in #6847
- enforce eager mode with bnb quantization temporarily by @chenqianfzh in #6846
- [TPU] Support collective communications in XLA devices by @WoosukKwon in #6813
- [Frontend] Factor out code for running uvicorn by @DarkLight1337 in #6828
- [Bug Fix] Illegal memory access, FP8 Llama 3.1 405b by @LucasWilkinson in #6852
- [Bugfix]: Fix Tensorizer test failures by @sangstar in #6835
- [ROCm] Upgrade PyTorch nightly version by @WoosukKwon in #6845
- [Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron by @omrishiv in #6844
- [Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba by @tomeras91 in #6784
- [Model] H2O Danube3-4b by @g-eoj in #6451
- [Hardware][TPU] Implement tensor parallelism with Ray by @WoosukKwon in #5871
- [Doc] Add missing mock import to docs `conf.py` by @hmellor in #6834
- [Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor by @tjohnson31415 in #6802
- [Misc][VLM][Doc] Consolidate offline examples for vision language models by @ywang96 in #6858
- [Bugfix] Fix VLM example typo by @ywang96 in #6859
- [bugfix] make args.stream work by @WrRan in #6831
- [CI/Build][Doc] Update CI and Doc for VLM example changes by @ywang96 in #6860
- [Model] Initial support for BLIP-2 by @DarkLight1337 in #5920
- [Docs] Add RunLLM chat widget by @cw75 in #6857
- [TPU] Reduce compilation time & Upgrade PyTorch XLA version by @WoosukKwon in #6856
- [Kernel] Increase precision of GPTQ/AWQ Marlin kernel by @alexm-neuralmagic in #6795
- Add Nemotron to PP_SUPPORTED_MODELS by @mgoin in #6863
- [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 by @zeyugao in #6871
- [Model] Initialize support for InternVL2 series models by @Isotr0py in #6514
- [Kernel] Tuned FP8 Kernels for Ada Lovelace by @varun-sundar-rabindranath in #6677
- [Core] Reduce unnecessary compute when logprobs=None by @peng1999 in #6532
- [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel by @tlrmchlsmth in #6901
- [TPU] Add TPU tensor parallelism to async engine by @etwk in #6891
- [Bugfix] Allow vllm to still work if triton is not installed. by @tdoublep in #6786
- [Frontend] New `allowed_token_ids` decoding request parameter by @njhill in #6753
- [Kernel] Remove unused variables in awq/gemm_kernels.cu by @tlrmchlsmth in #6908
- [ci] GHA workflow to remove ready label upon "/notready" comment by @khluu in #6921
- [Kernel] Fix marlin divide-by-zero warnings by @tlrmchlsmth in #6904
- [Kernel] Tuned int8 kernels for Ada Lovelace by @varun-sundar-rabindranath in #6848
- [TPU] Fix greedy decoding by @WoosukKwon in #6933
- [Bugfix] Fix PaliGemma MMP by @ywang96 in #6930
- [Doc] Super tiny fix doc typo by @fzyzcjy in #6949
- [BugFix] Fix use of per-request seed with pipeline parallel by @njhill in #6698
- [Kernel] Squash a few more warnings by @tlrmchlsmth in #6914
- [OpenVINO] Updated OpenVINO requirements and build docs by @ilya-lavrenov in #6948
- [Bugfix] Fix tensorizer memory profiling bug during testing by @sangstar in #6881
- [Kernel] Remove scaled_fp8_quant kernel padding footgun by @tlrmchlsmth in #6842
- [core][misc] improve free_finished_seq_groups by @youkaichao in #6865
- [Build] Temporarily Disable Kernels and LoRA tests by @simon-mo in #6961
- [Nightly benchmarking suite] Remove pkill python from run benchmark suite by @cadedaniel in #6965
- [CI] [nightly benchmark] Do not re-download sharegpt dataset if exists by @cadedaniel in #6706
- [Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding by @cadedaniel in #6964
- [mypy] Enable following imports for some directories by @DarkLight1337 in #6681
- [Bugfix] Fix broadcasting logic for `multi_modal_kwargs` by @DarkLight1337 in #6836
- [CI/Build] Fix mypy errors by @DarkLight1337 in #6968
- [Bugfix][TPU] Set readonly=True for non-root devices by @WoosukKwon in #6980
- [Bugfix] fix logit processor exceed vocab size issue by @FeiDeng in #6927
- Support W4A8 quantization for vllm by @HandH1998 in #5218
- [Bugfix] Clean up MiniCPM-V by @HwwwwwwwH in #6939
- [Bugfix] Fix feature size calculation for LLaVA-NeXT by @DarkLight1337 in #6982
- [Model] use FusedMoE layer in Jamba by @avshalomman in #6935
- [MISC] Introduce pipeline parallelism partition strategies by @comaniac in #6920
- [Bugfix] Support cpu offloading with quant_method.process_weights_after_loading by @mgoin in #6960
- [Kernel] Enable FP8 Cutlass for Ada Lovelace by @varun-sundar-rabindranath in #6950
- [Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) by @varun-sundar-rabindranath in #6996
- [Misc] Add compressed-tensors to optimized quant list by @mgoin in #7006
- Revert "[Frontend] Factor out code for running uvicorn" by @simon-mo in #7012
- [Kernel][RFC] Refactor the punica kernel based on Triton by @jeejeelee in #5036
- [Model] Pipeline parallel support for Qwen2 by @xuyi in #6924
- [Bugfix][TPU] Do not use torch.Generator for TPUs by @WoosukKwon in #6981
- [Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings by @tjohnson31415 in #6758
- PP comm optimization: replace send with partial send + allgather by @aurickq in #6695
- [Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user by @zifeitong in #6954
- [core][scheduler] simplify and improve scheduler by @youkaichao in #6867
- [Build/CI] Fixing Docker Hub quota issue. by @Alexei-V-Ivanov-AMD in #7043
- [CI/Build] Update torch to 2.4 by @SageMoore in #6951
- [Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm by @Isotr0py in #6992
- [CI/Build] Remove sparseml requirement from testing by @mgoin in #7037
- [Bugfix] Lower gemma's unloaded_params exception to warning by @mgoin in #7002
- [Models] Support Qwen model with PP by @andoorve in #6974
- Update run-amd-test.sh by @okakarpa in #7044
- [Misc] Support attention logits soft-capping with flash-attn by @WoosukKwon in #7022
- [CI/Build][Bugfix] Fix CUTLASS header-only line by @tlrmchlsmth in #7034
- [Performance] Optimize `get_seqs` by @WoosukKwon in #7051
- [Kernel] Fix input for flashinfer prefill wrapper. by @LiuXiaoxuanPKU in #7008
- [mypy] Speed up mypy checking by @DarkLight1337 in #7056
- [ci][distributed] try to fix pp test by @youkaichao in #7054
- Fix tracing.py by @bong-furiosa in #7065
- [cuda][misc] remove error_on_invalid_device_count_status by @youkaichao in #7069
- [Core] Comment out unused code in sampler by @peng1999 in #7023
- [Hardware][Intel CPU] Update torch 2.4.0 for CPU backend by @DamonFool in #6931
- [ci] set timeout for test_oot_registration.py by @youkaichao in #7082
- [CI/Build] Add support for Python 3.12 by @mgoin in #7035
- [Misc] Disambiguate quantized types via a new ScalarType by @LucasWilkinson in #6396
- [Core] Pipeline parallel with Ray ADAG by @ruisearch42 in #6837
- [Misc] Revive to use loopback address for driver IP by @ruisearch42 in #7091
- [misc] add a flag to enable compile by @youkaichao in #7092
- [ Frontend ] Multiprocessing for OpenAI Server with `zeromq` by @robertgshaw2-neuralmagic in #6883
- [ci][distributed] shorten wait time if server hangs by @youkaichao in #7098
- [Frontend] Factor out chat message parsing by @DarkLight1337 in #7055
- [ci][distributed] merge distributed test commands by @youkaichao in #7097
- [ci][distributed] disable ray dag tests by @youkaichao in #7099
- [Model] Refactor and decouple weight loading logic for InternVL2 model by @Isotr0py in #7067
- [Bugfix] Fix block table for seqs that have prefix cache hits by @zachzzc in #7018
- [LoRA] ReplicatedLinear support LoRA by @jeejeelee in #7081
- [CI] Temporarily turn off H100 performance benchmark by @KuntaiDu in #7104
- [ci][test] finalize fork_new_process_for_each_test by @youkaichao in #7114
- [Frontend] Warn if user `max_model_len` is greater than derived `max_model_len` by @fialhocoelho in #7080
- Support for guided decoding for offline LLM by @kevinbu233 in #6878
- [misc] add zmq in collect env by @youkaichao in #7119
- [core][misc] simplify output processing with a shortcut for the non-parallel sampling and non-beam search use case by @youkaichao in #7117
- [Model] Refactor MiniCPMV by @jeejeelee in #7020
- [Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator by @tdoublep in #7105
- [misc][distributed] improve libcudart.so finding by @youkaichao in #7127
- Clean up remaining Punica C information by @jeejeelee in #7027
- [Model] Add multi-image support for minicpmv offline inference by @HwwwwwwwH in #7122
- [Frontend] Reapply "Factor out code for running uvicorn" by @DarkLight1337 in #7095
- [Model] SiglipVisionModel ported from transformers by @ChristopherCho in #6942
- [Speculative decoding] Add periodic log with time spent in proposal/scoring/verification by @cadedaniel in #6963
- [SpecDecode] Support FlashInfer in DraftModelRunner by @bong-furiosa in #6926
- [BugFix] Use IP4 localhost form for zmq bind by @njhill in #7163
- [BugFix] Use args.trust_remote_code by @VastoLorde95 in #7121
- [Misc] Fix typo in GroupCoordinator.recv() by @ruisearch42 in #7167
- [Kernel] Update CUTLASS to 3.5.1 by @tlrmchlsmth in #7085
- [CI/Build] Suppress divide-by-zero and missing return statement warnings by @tlrmchlsmth in #7001
- [Bugfix][CI/Build] Fix CUTLASS FetchContent by @tlrmchlsmth in #7171
- bump version to v0.5.4 by @simon-mo in #7139
New Contributors
- @yecohn made their first contribution in #6652
- @thesues made their first contribution in #5753
- @luizanao made their first contribution in #6748
- @ezliu made their first contribution in #6626
- @HwwwwwwwH made their first contribution in #4087
- @LucasWilkinson made their first contribution in #6798
- @qingquansong made their first contribution in #6793
- @eaplatanios made their first contribution in #6770
- @gurpreet-dhami made their first contribution in #6847
- @omrishiv made their first contribution in #6844
- @cw75 made their first contribution in #6857
- @zeyugao made their first contribution in #6871
- @etwk made their first contribution in #6891
- @fzyzcjy made their first contribution in #6949
- @FeiDeng made their first contribution in #6927
- @HandH1998 made their first contribution in #5218
- @xuyi made their first contribution in #6924
- @bong-furiosa made their first contribution in #7065
- @zachzzc made their first contribution in #7018
- @fialhocoelho made their first contribution in #7080
- @ChristopherCho made their first contribution in #6942
- @VastoLorde95 made their first contribution in #7121
Full Changelog: v0.5.3...v0.5.4