## Highlights
### Model Support
#### LLM
- Added support for Falcon (#5069)
- Added support for IBM Granite Code models (#4636)
- Added blocksparse flash attention kernel and Phi-3-Small model (#4799)
- Added Snowflake Arctic model implementation (#4652, #4889, #4690)
- Added support for dynamic RoPE scaling (#4638); see the example below
- Added support for long context LoRA (#4787)
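For the dynamic RoPE scaling entry above, a minimal sketch of the new engine argument, assuming it accepts a Hugging Face style scaling dict (model name and factor are illustrative):

```python
from vllm import LLM, SamplingParams

# Assumption: the `rope_scaling` engine argument takes an HF-style dict,
# here requesting dynamic NTK scaling to roughly double the native context.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    rope_scaling={"type": "dynamic", "factor": 2.0},
)

outputs = llm.generate(
    ["Summarize the following long document: ..."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```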
#### Embedding Models
- Initial support for Embedding API with e5-mistral-7b-instruct (#3734); see the example below
- Cross-attention KV caching and memory-management towards encoder-decoder model support (#4837)
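A minimal offline sketch of the new Embedding API, assuming `LLM.encode()` returns one embedding result per prompt:

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct")

# Assumption: encode() returns per-prompt outputs whose `outputs.embedding`
# field is a flat list of floats.
outputs = llm.encode(["query: how do I serve an embedding model with vLLM?"])
print(len(outputs[0].outputs.embedding))
```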
#### Vision Language Model
- Add base class for vision-language models (#4809)
- Consolidate prompt arguments to LLM engines (#4328)
- LLaVA model refactor (#4910)
### Hardware Support
#### AMD
- Add fused_moe Triton configs for MI300X (#4951)
- Add support for Punica kernels (#3140)
- Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)
### Production Engine
#### Batch API
- Support OpenAI batch file format (#4794)
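A sketch of the workflow, assuming the batch runner added in #4794 is exposed as `vllm.entrypoints.openai.run_batch` and reads/writes OpenAI-style JSONL files (module path and flags below are assumptions):

```python
import json

# One request per line, following the OpenAI batch file format.
requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Llama-2-7b-chat-hf",
            "messages": [{"role": "user", "content": "Hello!"}],
        },
    },
]

with open("batch_input.jsonl", "w") as f:
    for request in requests:
        f.write(json.dumps(request) + "\n")

# Assumed invocation of the batch runner (no API server required):
#   python -m vllm.entrypoints.openai.run_batch \
#       -i batch_input.jsonl -o batch_output.jsonl \
#       --model meta-llama/Llama-2-7b-chat-hf
```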
#### Making Ray Optional
- Add `MultiprocessingGPUExecutor` (#4539); see the sketch below
- Eliminate parallel worker per-step task scheduling overhead (#4894)
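A sketch of running multi-GPU inference without Ray via the new multiprocessing executor; the `distributed_executor_backend` argument name is taken from the related ParallelConfig change (#4816) and should be treated as an assumption:

```python
from vllm import LLM

# Assumption: distributed_executor_backend="mp" selects the
# MultiprocessingGPUExecutor; "ray" remains the Ray-based executor.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    tensor_parallel_size=2,
    distributed_executor_backend="mp",
)
```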
#### Automatic Prefix Caching
- Accelerating the hashing function by avoiding deep copies (#4696)
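The hashing speedup is internal, but it only matters when automatic prefix caching is turned on; a minimal sketch using the existing `enable_prefix_caching` engine argument:

```python
from vllm import LLM, SamplingParams

# Prefix caching reuses KV blocks for prompts that share a common prefix;
# the faster hashing in #4696 reduces the per-request overhead of this path.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)

shared_prefix = "You are a helpful assistant. Answer concisely.\n\n"
prompts = [shared_prefix + "Q: What is KV caching?\nA:",
           shared_prefix + "Q: What is paged attention?\nA:"]
print(llm.generate(prompts, SamplingParams(max_tokens=32)))
```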
#### Speculative Decoding
- CUDA graph support (#4295)
- Enable TP>1 speculative decoding (#4840)
- Improve n-gram efficiency (#4724)
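These changes all flow through the same speculative-decoding engine arguments; a sketch of draft-model speculation with tensor parallelism greater than 1 (argument names and the v2 block-manager requirement are assumptions for this release line):

```python
from vllm import LLM

# Assumptions: `speculative_model` / `num_speculative_tokens` configure the
# draft model, and speculative decoding still requires the v2 block manager.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=2,          # TP>1 speculative decoding (#4840)
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)
```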
### Performance Optimization
#### Quantization
- Add GPTQ Marlin 2:4 sparse structured support (#4790)
- Initial Activation Quantization Support (#4525)
- Marlin 2:4 prefill performance improvement (about 25% better on average) (#4983)
- Automatically Detect SparseML models (#5119)
#### Better Attention Kernel
- Use flash-attn for decoding (#3648)
#### FP8
- Improve FP8 linear layer performance (#4691)
- Add w8a8 CUTLASS kernels (#4749)
- Support for CUTLASS kernels in CUDA graphs (#4954)
- Load FP8 kv-cache scaling factors from checkpoints (#4893)
- Make static FP8 scaling more robust (#4570)
- Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
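Most of these FP8 changes surface through two engine arguments; a hedged sketch for Hopper-class GPUs (the argument values are assumptions):

```python
from vllm import LLM

# Assumptions: quantization="fp8" enables the FP8 linear layers and
# kv_cache_dtype="fp8" stores the KV cache in float8_e4m3; checkpoint-provided
# scaling factors (#4893) are loaded from the model files when present.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="fp8",
    kv_cache_dtype="fp8",
)
```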
#### Optimize Distributed Communication
- change python dict to pytorch tensor (#4607)
- change python dict to pytorch tensor for blocks to swap (#4659)
- improve p2p access check (#4992)
- remove vllm-nccl (#5091)
- support both cpu and device tensor in broadcast tensor dict (#4660)
### Extensible Architecture
#### Pipeline Parallelism
- refactor custom allreduce to support multiple tp groups (#4754)
- refactor pynccl to hold multiple communicators (#4591)
- Support PP PyNCCL Groups (#4988)
## What's Changed
- Disable cuda version check in vllm-openai image by @zhaoyang-star in #4530
- [Bugfix] Fix `asyncio.Task` not being subscriptable by @DarkLight1337 in #4623
- [CI] use ccache actions properly in release workflow by @simon-mo in #4629
- [CI] Add retry for agent lost by @cadedaniel in #4633
- Update lm-format-enforcer to 0.10.1 by @noamgat in #4631
- [Kernel] Make static FP8 scaling more robust by @pcmoritz in #4570
- [Core][Optimization] change python dict to pytorch tensor by @youkaichao in #4607
- [Build/CI] Fixing 'docker run' to re-enable AMD CI tests. by @Alexei-V-Ivanov-AMD in #4642
- [Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora by @FurtherAI in #4609
- [Core][Optimization] change copy-on-write from dict[int, list] to list by @youkaichao in #4648
- [Bug fix][Core] fixup ngram not setup correctly by @leiwen83 in #4551
- [Core][Distributed] support both cpu and device tensor in broadcast tensor dict by @youkaichao in #4660
- [Core] Optimize sampler get_logprobs by @rkooo567 in #4594
- [CI] Make mistral tests pass by @rkooo567 in #4596
- [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi by @DefTruth in #4573
- [Misc] Add `get_name` method to attention backends by @WoosukKwon in #4685
- [Core] Faster startup for LoRA enabled models by @Yard1 in #4634
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap by @youkaichao in #4659
- [CI/Test] fix swap test for multi gpu by @youkaichao in #4689
- [Misc] Use vllm-flash-attn instead of flash-attn by @WoosukKwon in #4686
- [Dynamic Spec Decoding] Auto-disable by the running queue size by @comaniac in #4592
- [Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs by @cadedaniel in #4672
- [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin by @alexm-neuralmagic in #4626
- [Frontend] add tok/s speed metric to llm class when using tqdm by @MahmoudAshraf97 in #4400
- [Frontend] Move async logic outside of constructor by @DarkLight1337 in #4674
- [Misc] Remove unnecessary ModelRunner imports by @WoosukKwon in #4703
- [Misc] Set block size at initialization & Fix test_model_runner by @WoosukKwon in #4705
- [ROCm] Add support for Punica kernels on AMD GPUs by @kliuae in #3140
- [Bugfix] Fix CLI arguments in OpenAI server docs by @DarkLight1337 in #4709
- [Bugfix] Update grafana.json by @robertgshaw2-neuralmagic in #4711
- [Bugfix] Add logs for all model dtype casting by @mgoin in #4717
- [Model] Snowflake arctic model implementation by @sfc-gh-hazhang in #4652
- [Kernel] [FP8] Improve FP8 linear layer performance by @pcmoritz in #4691
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support by @comaniac in #4535
- [Core][Distributed] refactor pynccl to hold multiple communicators by @youkaichao in #4591
- [Misc] Keep only one implementation of the create_dummy_prompt function. by @AllenDou in #4716
- chunked-prefill-doc-syntax by @simon-mo in #4603
- [Core] fix type annotation for `swap_blocks` by @jikunshang in #4726
- [Misc] Apply a couple g++ cleanups by @stevegrubb in #4719
- [Core] Fix circular reference which leaked llm instance in local dev env by @rkooo567 in #4737
- [Bugfix] Fix CLI arguments in OpenAI server docs by @AllenDou in #4729
- [Speculative decoding] CUDA graph support by @heeju-kim2 in #4295
- [CI] Nits for bad initialization of SeqGroup in testing by @robertgshaw2-neuralmagic in #4748
- [Core][Test] fix function name typo in custom allreduce by @youkaichao in #4750
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API by @CatherineSue in #3734
- [Model] Add support for IBM Granite Code models by @yikangshen in #4636
- [CI/Build] Tweak Marlin Nondeterminism Issues In CI by @robertgshaw2-neuralmagic in #4713
- [CORE] Improvement in ranks code by @SwapnilDreams100 in #4718
- [Core][Distributed] refactor custom allreduce to support multiple tp groups by @youkaichao in #4754
- [CI/Build] Move `test_utils.py` to `tests/utils.py` by @DarkLight1337 in #4425
- [Scheduler] Warning upon preemption and Swapping by @rkooo567 in #4647
- [Misc] Enhance attention selector by @WoosukKwon in #4751
- [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 by @sangstar in #4208
- [Speculative decoding] Improve n-gram efficiency by @comaniac in #4724
- [Kernel] Use flash-attn for decoding by @skrider in #3648
- [Bugfix] Fix dynamic FP8 quantization for Mixtral by @pcmoritz in #4793
- [Doc] Shorten README by removing supported model list by @zhuohan123 in #4796
- [Doc] Add API reference for offline inference by @DarkLight1337 in #4710
- [Doc] Add meetups to the doc by @zhuohan123 in #4798
- [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies by @KuntaiDu in #4696
- [Bugfix][Doc] Fix CI failure in docs by @DarkLight1337 in #4804
- [Core] Add MultiprocessingGPUExecutor by @njhill in #4539
- Add 4th meetup announcement to readme by @simon-mo in #4817
- Revert "[Kernel] Use flash-attn for decoding (#3648)" by @rkooo567 in #4820
- [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API by @rkooo567 in #4681
- [CI/Build] Further decouple HuggingFace implementation from ours during tests by @DarkLight1337 in #4166
- [Bugfix] Properly set distributed_executor_backend in ParallelConfig by @zifeitong in #4816
- [Doc] Highlight the fourth meetup in the README by @zhuohan123 in #4842
- [Frontend] Re-enable custom roles in Chat Completions API by @DarkLight1337 in #4758
- [Frontend] Support OpenAI batch file format by @wuisawesome in #4794
- [Core] Implement sharded state loader by @aurickq in #4690
- [Speculative decoding][Re-take] Enable TP>1 speculative decoding by @comaniac in #4840
- Add marlin unit tests and marlin benchmark script by @alexm-neuralmagic in #4815
- [Kernel] add bfloat16 support for gptq marlin kernel by @jinzhen-lin in #4788
- [docs] Fix typo in examples filename openi -> openai by @wuisawesome in #4864
- [Frontend] Separate OpenAI Batch Runner usage from API Server by @wuisawesome in #4851
- [Bugfix] Bypass authorization API token for preflight requests by @dulacp in #4862
- Add GPTQ Marlin 2:4 sparse structured support by @alexm-neuralmagic in #4790
- Add JSON output support for benchmark_latency and benchmark_throughput by @simon-mo in #4848
- [ROCm][AMD][Bugfix] adding a missing triton autotune config by @hongxiayang in #4845
- [Core][Distributed] remove graph mode function by @youkaichao in #4818
- [Misc] remove old comments by @youkaichao in #4866
- [Kernel] Add punica dimension for Qwen1.5-32B LoRA by @Silencioo in #4850
- [Kernel] Add w8a8 CUTLASS kernels by @tlrmchlsmth in #4749
- [Bugfix] Fix FP8 KV cache support by @WoosukKwon in #4869
- Support to serve vLLM on Kubernetes with LWS by @kerthcet in #4829
- [Frontend] OpenAI API server: Do not add bos token by default when encoding by @bofenghuang in #4688
- [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests by @Alexei-V-Ivanov-AMD in #4797
- [Bugfix] fix rope error when load models with different dtypes by @jinzhen-lin in #4835
- Sync huggingface modifications of qwen Moe model by @eigen2017 in #4774
- [Doc] Update Ray Data distributed offline inference example by @Yard1 in #4871
- [Bugfix] Relax tiktoken to >= 0.6.0 by @mgoin in #4890
- [ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used by @alexeykondrat in #4658
- [Lora] Support long context lora by @rkooo567 in #4787
- [Bugfix][Model] Add base class for vision-language models by @DarkLight1337 in #4809
- [Kernel] Add marlin_24 unit tests by @alexm-neuralmagic in #4901
- [Kernel] Add flash-attn back by @WoosukKwon in #4907
- [Model] LLaVA model refactor by @DarkLight1337 in #4910
- Remove marlin warning by @alexm-neuralmagic in #4918
- [Bugfix]: Fix communication Timeout error in safety-constrained distributed System by @ZwwWayne in #4914
- [Build/CI] Enabling AMD Entrypoints Test by @Alexei-V-Ivanov-AMD in #4834
- [Bugfix] Fix dummy weight for fp8 by @mzusman in #4916
- [Core] Sharded State Loader download from HF by @aurickq in #4889
- [Doc]Add documentation to benchmarking script when running TGI by @KuntaiDu in #4920
- [Core] Fix scheduler considering "no LoRA" as "LoRA" by @Yard1 in #4897
- [Model] add rope_scaling support for qwen2 by @hzhwcmhf in #4930
- [Model] Add Phi-2 LoRA support by @Isotr0py in #4886
- [Docs] Add acknowledgment for sponsors by @simon-mo in #4925
- [CI/Build] Codespell ignore `build/` directory by @mgoin in #4945
- [Bugfix] Fix flag name for `max_seq_len_to_capture` by @kerthcet in #4935
- [Bugfix][Kernel] Add head size check for attention backend selection by @Isotr0py in #4944
- [Frontend] Dynamic RoPE scaling by @sasha0552 in #4638
- [CI/Build] Enforce style for C++ and CUDA code with `clang-format` by @mgoin in #4722
- [misc] remove comments that were supposed to be removed by @rkooo567 in #4977
- [Kernel] Fixup for CUTLASS kernels in CUDA graphs by @tlrmchlsmth in #4954
- [Misc] Load FP8 kv-cache scaling factors from checkpoints by @comaniac in #4893
- [Model] LoRA gptbigcode implementation by @raywanb in #3949
- [Core] Eliminate parallel worker per-step task scheduling overhead by @njhill in #4894
- [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig by @pcmoritz in #4991
- [Misc] Take user preference in attention selector by @comaniac in #4960
- Marlin 24 prefill performance improvement (about 25% better on average) by @alexm-neuralmagic in #4983
- [Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined by @LetianLee in #5009
- [Core][1/N] Support PP PyNCCL Groups by @andoorve in #4988
- [Kernel] Initial Activation Quantization Support by @dsikka in #4525
- [Core]: Option To Use Prompt Token Ids Inside Logits Processor by @kezouke in #4985
- [Doc] add ccache guide in doc by @youkaichao in #5012
- [Bugfix] Fix Mistral v0.3 Weight Loading by @robertgshaw2-neuralmagic in #5005
- [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in #4764
- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model by @linxihui in #4799
- [Misc] add logging level env var by @youkaichao in #5045
- [Dynamic Spec Decoding] Minor fix for disabling speculative decoding by @LiuXiaoxuanPKU in #5000
- [Misc] Make Serving Benchmark More User-friendly by @ywang96 in #5044
- [Bugfix / Core] Prefix Caching Guards (merged with main) by @zhuohan123 in #4846
- [Core] Allow AQLM on Pascal by @sasha0552 in #5058
- [Model] Add support for falcon-11B by @Isotr0py in #5069
- [Core] Sliding window for block manager v2 by @mmoskal in #4545
- [BugFix] Fix Embedding Models with TP>1 by @robertgshaw2-neuralmagic in #5075
- [Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X by @divakar-amd in #4951
- [Docs] Add Dropbox as sponsors by @simon-mo in #5089
- [Core] Consolidate prompt arguments to LLM engines by @DarkLight1337 in #4328
- [Bugfix] Remove the last EOS token unless explicitly specified by @jsato8094 in #5077
- [Misc] add gpu_memory_utilization arg by @pandyamarut in #5079
- [Core][Optimization] remove vllm-nccl by @youkaichao in #5091
- [Bugfix] Fix arguments passed to `Sequence` in stop checker test by @DarkLight1337 in #5092
- [Core][Distributed] improve p2p access check by @youkaichao in #4992
- [Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) by @afeldman-nm in #4837
- [Doc]Replace deprecated flag in readme by @ronensc in #4526
- [Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` by @DarkLight1337 in #5096
- [Bugfix][CI/Build] Fix codespell failing to skip files in `git diff` by @DarkLight1337 in #5097
- [Core] Avoid the need to pass `None` values to `Sequence.inputs` by @DarkLight1337 in #5099
- [Bugfix] logprobs is not compatible with the OpenAI spec #4795 by @Etelis in #5031
- [Doc][Build] update after removing vllm-nccl by @youkaichao in #5103
- [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter by @alexm-neuralmagic in #5108
- [CI/Build] Docker cleanup functionality for amd servers by @okakarpa in #5112
- [BUGFIX] [FRONTEND] Correct chat logprobs by @br3no in #5029
- [Bugfix] Automatically Detect SparseML models by @robertgshaw2-neuralmagic in #5119
- [CI/Build] increase wheel size limit to 200 MB by @youkaichao in #5130
- [Misc] remove duplicate definition of `seq_lens_tensor` in model_runner.py by @ita9naiwa in #5129
- [Doc] Use intersphinx and update entrypoints docs by @DarkLight1337 in #5125
- add doc about serving option on dstack by @deep-diver in #3074
- Bump version to v0.4.3 by @simon-mo in #5046
- [Build] Disable sm_90a in cu11 by @simon-mo in #5141
- [Bugfix] Avoid Warnings in SparseML Activation Quantization by @robertgshaw2-neuralmagic in #5120
- [Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) by @alexm-neuralmagic in #5136
- [Model] Support MAP-NEO model by @xingweiqu in #5081
- Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" by @simon-mo in #5149
- [Misc]: optimize eager mode host time by @functionxu123 in #4196
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script by @comaniac in #5039
- [Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support by @njhill in #5171
- [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels by @tlrmchlsmth in #5168
## New Contributors
- @MahmoudAshraf97 made their first contribution in #4400
- @sfc-gh-hazhang made their first contribution in #4652
- @stevegrubb made their first contribution in #4719
- @heeju-kim2 made their first contribution in #4295
- @yikangshen made their first contribution in #4636
- @KuntaiDu made their first contribution in #4696
- @wuisawesome made their first contribution in #4794
- @aurickq made their first contribution in #4690
- @jinzhen-lin made their first contribution in #4788
- @dulacp made their first contribution in #4862
- @Silencioo made their first contribution in #4850
- @tlrmchlsmth made their first contribution in #4749
- @kerthcet made their first contribution in #4829
- @bofenghuang made their first contribution in #4688
- @eigen2017 made their first contribution in #4774
- @alexeykondrat made their first contribution in #4658
- @ZwwWayne made their first contribution in #4914
- @mzusman made their first contribution in #4916
- @hzhwcmhf made their first contribution in #4930
- @raywanb made their first contribution in #3949
- @LetianLee made their first contribution in #5009
- @dsikka made their first contribution in #4525
- @kezouke made their first contribution in #4985
- @linxihui made their first contribution in #4799
- @divakar-amd made their first contribution in #4951
- @pandyamarut made their first contribution in #5079
- @afeldman-nm made their first contribution in #4837
- @Etelis made their first contribution in #5031
- @okakarpa made their first contribution in #5112
- @deep-diver made their first contribution in #3074
- @xingweiqu made their first contribution in #5081
- @functionxu123 made their first contribution in #4196
Full Changelog: v0.4.2...v0.4.3