## Highlights
### Model Support
#### LLM
- Added support for Falcon (#5069)
- Added support for IBM Granite Code models (#4636)
- Added blocksparse flash attention kernel and Phi-3-Small model (#4799)
- Added Snowflake Arctic model implementation (#4652, #4889, #4690)
- Added support for dynamic RoPE scaling (#4638); see the example below
- Added support for long context LoRA (#4787)
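For the dynamic RoPE scaling entry above, a minimal sketch of the new engine argument, assuming it accepts a Hugging Face style scaling dict (model name and factor are illustrative):

```python
from vllm import LLM, SamplingParams

# Assumption: the `rope_scaling` engine argument takes an HF-style dict,
# here requesting dynamic NTK scaling to roughly double the native context.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    rope_scaling={"type": "dynamic", "factor": 2.0},
)

outputs = llm.generate(
    ["Summarize the following long document: ..."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```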
#### Embedding Models
- Initial support for Embedding API with e5-mistral-7b-instruct (#3734); see the example below
- Cross-attention KV caching and memory-management towards encoder-decoder model support (#4837)
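A minimal offline sketch of the new Embedding API, assuming `LLM.encode()` returns one embedding result per prompt:

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct")

# Assumption: encode() returns per-prompt outputs whose `outputs.embedding`
# field is a flat list of floats.
outputs = llm.encode(["query: how do I serve an embedding model with vLLM?"])
print(len(outputs[0].outputs.embedding))
```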
#### Vision Language Model
- Add base class for vision-language models (#4809)
- Consolidate prompt arguments to LLM engines (#4328)
- LLaVA model refactor (#4910)
### Hardware Support
#### AMD
- Add fused_moe Triton configs for MI300X (#4951)
- Add support for Punica kernels (#3140)
- Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)
### Production Engine
#### Batch API
- Support OpenAI batch file format (#4794)
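A sketch of the workflow, assuming the batch runner added in #4794 is exposed as `vllm.entrypoints.openai.run_batch` and reads/writes OpenAI-style JSONL files (module path and flags below are assumptions):

```python
import json

# One request per line, following the OpenAI batch file format.
requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Llama-2-7b-chat-hf",
            "messages": [{"role": "user", "content": "Hello!"}],
        },
    },
]

with open("batch_input.jsonl", "w") as f:
    for request in requests:
        f.write(json.dumps(request) + "\n")

# Assumed invocation of the batch runner (no API server required):
#   python -m vllm.entrypoints.openai.run_batch \
#       -i batch_input.jsonl -o batch_output.jsonl \
#       --model meta-llama/Llama-2-7b-chat-hf
```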
#### Making Ray Optional
- Add `MultiprocessingGPUExecutor` (#4539); see the sketch below
- Eliminate parallel worker per-step task scheduling overhead (#4894)
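A sketch of running multi-GPU inference without Ray via the new multiprocessing executor; the `distributed_executor_backend` argument name is taken from the related ParallelConfig change (#4816) and should be treated as an assumption:

```python
from vllm import LLM

# Assumption: distributed_executor_backend="mp" selects the
# MultiprocessingGPUExecutor; "ray" remains the Ray-based executor.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    tensor_parallel_size=2,
    distributed_executor_backend="mp",
)
```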
#### Automatic Prefix Caching
- Accelerating the hashing function by avoiding deep copies (#4696)
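The hashing speedup is internal, but it only matters when automatic prefix caching is turned on; a minimal sketch using the existing `enable_prefix_caching` engine argument:

```python
from vllm import LLM, SamplingParams

# Prefix caching reuses KV blocks for prompts that share a common prefix;
# the faster hashing in #4696 reduces the per-request overhead of this path.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)

shared_prefix = "You are a helpful assistant. Answer concisely.\n\n"
prompts = [shared_prefix + "Q: What is KV caching?\nA:",
           shared_prefix + "Q: What is paged attention?\nA:"]
print(llm.generate(prompts, SamplingParams(max_tokens=32)))
```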
#### Speculative Decoding
- CUDA graph support (#4295)
- Enable TP>1 speculative decoding (#4840)
- Improve n-gram efficiency (#4724)
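These changes all flow through the same speculative-decoding engine arguments; a sketch of draft-model speculation with tensor parallelism greater than 1 (argument names and the v2 block-manager requirement are assumptions for this release line):

```python
from vllm import LLM

# Assumptions: `speculative_model` / `num_speculative_tokens` configure the
# draft model, and speculative decoding still requires the v2 block manager.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=2,          # TP>1 speculative decoding (#4840)
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)
```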
### Performance Optimization
#### Quantization
- Add GPTQ Marlin 2:4 sparse structured support (#4790)
- Initial Activation Quantization Support (#4525)
- Marlin 2:4 prefill performance improvement (about 25% better on average) (#4983)
- Automatically Detect SparseML models (#5119)
#### Better Attention Kernel
- Use flash-attn for decoding (#3648)
#### FP8
- Improve FP8 linear layer performance (#4691)
- Add w8a8 CUTLASS kernels (#4749)
- Support for CUTLASS kernels in CUDA graphs (#4954)
- Load FP8 kv-cache scaling factors from checkpoints (#4893)
- Make static FP8 scaling more robust (#4570)
- Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
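Most of these FP8 changes surface through two engine arguments; a hedged sketch for Hopper-class GPUs (the argument values are assumptions):

```python
from vllm import LLM

# Assumptions: quantization="fp8" enables the FP8 linear layers and
# kv_cache_dtype="fp8" stores the KV cache in float8_e4m3; checkpoint-provided
# scaling factors (#4893) are loaded from the model files when present.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="fp8",
    kv_cache_dtype="fp8",
)
```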
#### Optimize Distributed Communication
- change python dict to pytorch tensor (#4607)
- change python dict to pytorch tensor for blocks to swap (#4659)
- improve p2p access check (#4992)
- remove vllm-nccl (#5091)
- support both cpu and device tensor in broadcast tensor dict (#4660)
### Extensible Architecture
#### Pipeline Parallelism
- refactor custom allreduce to support multiple tp groups (#4754)
- refactor pynccl to hold multiple communicators (#4591)
- Support PP PyNCCL Groups (#4988)
## What's Changed
- Disable cuda version check in vllm-openai image by @zhaoyang-star in #4530
- [Bugfix] Fix `asyncio.Task` not being subscriptable by @DarkLight1337 in #4623
- [CI] use ccache actions properly in release workflow by @simon-mo in #4629
- [CI] Add retry for agent lost by @cadedaniel in #4633
- Update lm-format-enforcer to 0.10.1 by @noamgat in #4631
- [Kernel] Make static FP8 scaling more robust by @pcmoritz in #4570
- [Core][Optimization] change python dict to pytorch tensor by @youkaichao in #4607
- [Build/CI] Fixing 'docker run' to re-enable AMD CI tests. by @Alexei-V-Ivanov-AMD in #4642
- [Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora by @FurtherAI in #4609
- [Core][Optimization] change copy-on-write from dict[int, list] to list by @youkaichao in #4648
- [Bug fix][Core] fixup ngram not setup correctly by @leiwen83 in #4551
- [Core][Distributed] support both cpu and device tensor in broadcast tensor dict by @youkaichao in #4660
- [Core] Optimize sampler get_logprobs by @rkooo567 in #4594
- [CI] Make mistral tests pass by @rkooo567 in #4596
- [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi by @DefTruth in #4573
- [Misc] Add `get_name` method to attention backends by @WoosukKwon in #4685
- [Core] Faster startup for LoRA enabled models by @Yard1 in #4634
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap by @youkaichao in #4659
- [CI/Test] fix swap test for multi gpu by @youkaichao in #4689
- [Misc] Use vllm-flash-attn instead of flash-attn by @WoosukKwon in #4686
- [Dynamic Spec Decoding] Auto-disable by the running queue size by @comaniac in #4592
- [Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs by @cadedaniel in #4672
- [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin by @alexm-neuralmagic in #4626
- [Frontend] add tok/s speed metric to llm class when using tqdm by @MahmoudAshraf97 in #4400
- [Frontend] Move async logic outside of constructor by @DarkLight1337 in #4674
- [Misc] Remove unnecessary ModelRunner imports by @WoosukKwon in #4703
- [Misc] Set block size at initialization & Fix test_model_runner by @WoosukKwon in #4705
- [ROCm] Add support for Punica kernels on AMD GPUs by @kliuae in #3140
- [Bugfix] Fix CLI arguments in OpenAI server docs by @DarkLight1337 in #4709
- [Bugfix] Update grafana.json by @robertgshaw2-neuralmagic in #4711
- [Bugfix] Add logs for all model dtype casting by @mgoin in #4717
- [Model] Snowflake arctic model implementation by @sfc-gh-hazhang in #4652
- [Kernel] [FP8] Improve FP8 linear layer performance by @pcmoritz in #4691
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support by @comaniac in #4535
- [Core][Distributed] refactor pynccl to hold multiple communicators by @youkaichao in #4591
- [Misc] Keep only one implementation of the create_dummy_prompt function. by @AllenDou in #4716
- chunked-prefill-doc-syntax by @simon-mo in #4603
- [Core] fix type annotation for `swap_blocks` by @jikunshang in #4726
- [Misc] Apply a couple g++ cleanups by @stevegrubb in #4719
- [Core] Fix circular reference which leaked llm instance in local dev env by @rkooo567 in #4737
- [Bugfix] Fix CLI arguments in OpenAI server docs by @AllenDou in #4729
- [Speculative decoding] CUDA graph support by @heeju-kim2 in #4295
- [CI] Nits for bad initialization of SeqGroup in testing by @robertgshaw2-neuralmagic in #4748
- [Core][Test] fix function name typo in custom allreduce by @youkaichao in #4750
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API by @CatherineSue in #3734
- [Model] Add support for IBM Granite Code models by @yikangshen in #4636
- [CI/Build] Tweak Marlin Nondeterminism Issues In CI by @robertgshaw2-neuralmagic in #4713
- [CORE] Improvement in ranks code by @SwapnilDreams100 in #4718
- [Core][Distributed] refactor custom allreduce to support multiple tp groups by @youkaichao in #4754
- [CI/Build] Move `test_utils.py` to `tests/utils.py` by @DarkLight1337 in #4425
- [Scheduler] Warning upon preemption and Swapping by @rkooo567 in #4647
- [Misc] Enhance attention selector by @WoosukKwon in #4751
- [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 by @sangstar in #4208
- [Speculative decoding] Improve n-gram efficiency by @comaniac in #4724
- [Kernel] Use flash-attn for decoding by @skrider in #3648
- [Bugfix] Fix dynamic FP8 quantization for Mixtral by @pcmoritz in #4793
- [Doc] Shorten README by removing supported model list by @zhuohan123 in #4796
- [Doc] Add API reference for offline inference by @DarkLight1337 in #4710
- [Doc] Add meetups to the doc by @zhuohan123 in #4798
- [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies by @KuntaiDu in #4696
- [Bugfix][Doc] Fix CI failure in docs by @DarkLight1337 in #4804
- [Core] Add MultiprocessingGPUExecutor by @njhill in #4539
- Add 4th meetup announcement to readme by @simon-mo in #4817
- Revert "[Kernel] Use flash-attn for decoding (#3648)" by @rkooo567 in #4820
- [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API by @rkooo567 in #4681
- [CI/Build] Further decouple HuggingFace implementation from ours during tests by @DarkLight1337 in #4166
- [Bugfix] Properly set distributed_executor_backend in ParallelConfig by @zifeitong in #4816
- [Doc] Highlight the fourth meetup in the README by @zhuohan123 in #4842
- [Frontend] Re-enable custom roles in Chat Completions API by @DarkLight1337 in #4758
- [Frontend] Support OpenAI batch file format by @wuisawesome in #4794
- [Core] Implement sharded state loader by @aurickq in #4690
- [Speculative decoding][Re-take] Enable TP>1 speculative decoding by @comaniac in #4840
- Add marlin unit tests and marlin benchmark script by @alexm-neuralmagic in #4815
- [Kernel] add bfloat16 support for gptq marlin kernel by @jinzhen-lin in #4788
- [docs] Fix typo in examples filename openi -> openai by @wuisawesome in #4864
- [Frontend] Separate OpenAI Batch Runner usage from API Server by @wuisawesome in #4851
- [Bugfix] Bypass authorization API token for preflight requests by @dulacp in #4862
- Add GPTQ Marlin 2:4 sparse structured support by @alexm-neuralmagic in #4790
- Add JSON output support for benchmark_latency and benchmark_throughput by @simon-mo in #4848
- [ROCm][AMD][Bugfix] adding a missing triton autotune config by @hongxiayang in #4845
- [Core][Distributed] remove graph mode function by @youkaichao in #4818
- [Misc] remove old comments by @youkaichao in #4866
- [Kernel] Add punica dimension for Qwen1.5-32B LoRA by @Silencioo in #4850
- [Kernel] Add w8a8 CUTLASS kernels by @tlrmchlsmth in #4749
- [Bugfix] Fix FP8 KV cache support by @WoosukKwon in #4869
- Support to serve vLLM on Kubernetes with LWS by @kerthcet in #4829
- [Frontend] OpenAI API server: Do not add bos token by default when encoding by @bofenghuang in #4688
- [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests by @Alexei-V-Ivanov-AMD in #4797
- [Bugfix] fix rope error when load models with different dtypes by @jinzhen-lin in #4835
- Sync huggingface modifications of qwen Moe model by @eigen2017 in #4774
- [Doc] Update Ray Data distributed offline inference example by @Yard1 in #4871
- [Bugfix] Relax tiktoken to >= 0.6.0 by @mgoin in #4890
- [ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used by @alexeykondrat in #4658
- [Lora] Support long context lora by @rkooo567 in #4787
- [Bugfix][Model] Add base class for vision-language models by @DarkLight1337 in #4809
- [Kernel] Add marlin_24 unit tests by @alexm-neuralmagic in #4901
- [Kernel] Add flash-attn back by @WoosukKwon in #4907
- [Model] LLaVA model refactor by @DarkLight1337 in #4910
- Remove marlin warning by @alexm-neuralmagic in #4918
- [Bugfix]: Fix communication Timeout error in safety-constrained distributed System by @ZwwWayne in #4914
- [Build/CI] Enabling AMD Entrypoints Test by @Alexei-V-Ivanov-AMD in #4834
- [Bugfix] Fix dummy weight for fp8 by @mzusman in #4916
- [Core] Sharded State Loader download from HF by @aurickq in #4889
- [Doc]Add documentation to benchmarking script when running TGI by @KuntaiDu in #4920
- [Core] Fix scheduler considering "no LoRA" as "LoRA" by @Yard1 in #4897
- [Model] add rope_scaling support for qwen2 by @hzhwcmhf in #4930
- [Model] Add Phi-2 LoRA support by @Isotr0py in #4886
- [Docs] Add acknowledgment for sponsors by @simon-mo in #4925
- [CI/Build] Codespell ignore `build/` directory by @mgoin in #4945
- [Bugfix] Fix flag name for `max_seq_len_to_capture` by @kerthcet in #4935
- [Bugfix][Kernel] Add head size check for attention backend selection by @Isotr0py in #4944
- [Frontend] Dynamic RoPE scaling by @sasha0552 in #4638
- [CI/Build] Enforce style for C++ and CUDA code with `clang-format` by @mgoin in #4722
- [misc] remove comments that were supposed to be removed by @rkooo567 in #4977
- [Kernel] Fixup for CUTLASS kernels in CUDA graphs by @tlrmchlsmth in #4954
- [Misc] Load FP8 kv-cache scaling factors from checkpoints by @comaniac in #4893
- [Model] LoRA gptbigcode implementation by @raywanb in #3949
- [Core] Eliminate parallel worker per-step task scheduling overhead by @njhill in #4894
- [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig by @pcmoritz in #4991
- [Misc] Take user preference in attention selector by @comaniac in #4960
- Marlin 24 prefill performance improvement (about 25% better on average) by @alexm-neuralmagic in #4983
- [Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined by @LetianLee in #5009
- [Core][1/N] Support PP PyNCCL Groups by @andoorve in #4988
- [Kernel] Initial Activation Quantization Support by @dsikka in #4525
- [Core]: Option To Use Prompt Token Ids Inside Logits Processor by @kezouke in #4985
- [Doc] add ccache guide in doc by @youkaichao in #5012
- [Bugfix] Fix Mistral v0.3 Weight Loading by @robertgshaw2-neuralmagic in #5005
- [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in #4764
- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model by @linxihui in #4799
- [Misc] add logging level env var by @youkaichao in #5045
- [Dynamic Spec Decoding] Minor fix for disabling speculative decoding by @LiuXiaoxuanPKU in #5000
- [Misc] Make Serving Benchmark More User-friendly by @ywang96 in #5044
- [Bugfix / Core] Prefix Caching Guards (merged with main) by @zhuohan123 in #4846
- [Core] Allow AQLM on Pascal by @sasha0552 in #5058
- [Model] Add support for falcon-11B by @Isotr0py in #5069
- [Core] Sliding window for block manager v2 by @mmoskal in #4545
- [BugFix] Fix Embedding Models with TP>1 by @robertgshaw2-neuralmagic in #5075
- [Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X by @divakar-amd in #4951
- [Docs] Add Dropbox as sponsors by @simon-mo in #5089
- [Core] Consolidate prompt arguments to LLM engines by @DarkLight1337 in #4328
- [Bugfix] Remove the last EOS token unless explicitly specified by @jsato8094 in #5077
- [Misc] add gpu_memory_utilization arg by @pandyamarut in #5079
- [Core][Optimization] remove vllm-nccl by @youkaichao in #5091
- [Bugfix] Fix arguments passed to `Sequence` in stop checker test by @DarkLight1337 in #5092
- [Core][Distributed] improve p2p access check by @youkaichao in #4992
- [Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) by @afeldman-nm in #4837
- [Doc]Replace deprecated flag in readme by @ronensc in #4526
- [Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` by @DarkLight1337 in #5096
- [Bugfix][CI/Build] Fix codespell failing to skip files in `git diff` by @DarkLight1337 in #5097
- [Core] Avoid the need to pass `None` values to `Sequence.inputs` by @DarkLight1337 in #5099
- [Bugfix] logprobs is not compatible with the OpenAI spec #4795 by @Etelis in #5031
- [Doc][Build] update after removing vllm-nccl by @youkaichao in #5103
- [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter by @alexm-neuralmagic in #5108
- [CI/Build] Docker cleanup functionality for amd servers by @okakarpa in #5112
- [BUGFIX] [FRONTEND] Correct chat logprobs by @br3no in #5029
- [Bugfix] Automatically Detect SparseML models by @robertgshaw2-neuralmagic in #5119
- [CI/Build] increase wheel size limit to 200 MB by @youkaichao in #5130
- [Misc] remove duplicate definition of `seq_lens_tensor` in model_runner.py by @ita9naiwa in #5129
- [Doc] Use intersphinx and update entrypoints docs by @DarkLight1337 in #5125
- add doc about serving option on dstack by @deep-diver in #3074
- Bump version to v0.4.3 by @simon-mo in #5046
- [Build] Disable sm_90a in cu11 by @simon-mo in #5141
- [Bugfix] Avoid Warnings in SparseML Activation Quantization by @robertgshaw2-neuralmagic in #5120
- [Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) by @alexm-neuralmagic in #5136
- [Model] Support MAP-NEO model by @xingweiqu in #5081
- Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" by @simon-mo in #5149
- [Misc]: optimize eager mode host time by @functionxu123 in #4196
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script by @comaniac in #5039
- [Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support by @njhill in #5171
- [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels by @tlrmchlsmth in #5168
## New Contributors
- @MahmoudAshraf97 made their first contribution in #4400
- @sfc-gh-hazhang made their first contribution in #4652
- @stevegrubb made their first contribution in #4719
- @heeju-kim2 made their first contribution in #4295
- @yikangshen made their first contribution in #4636
- @KuntaiDu made their first contribution in #4696
- @wuisawesome made their first contribution in #4794
- @aurickq made their first contribution in #4690
- @jinzhen-lin made their first contribution in #4788
- @dulacp made their first contribution in #4862
- @Silencioo made their first contribution in #4850
- @tlrmchlsmth made their first contribution in #4749
- @kerthcet made their first contribution in #4829
- @bofenghuang made their first contribution in #4688
- @eigen2017 made their first contribution in #4774
- @alexeykondrat made their first contribution in #4658
- @ZwwWayne made their first contribution in #4914
- @mzusman made their first contribution in #4916
- @hzhwcmhf made their first contribution in #4930
- @raywanb made their first contribution in #3949
- @LetianLee made their first contribution in #5009
- @dsikka made their first contribution in #4525
- @kezouke made their first contribution in #4985
- @linxihui made their first contribution in #4799
- @divakar-amd made their first contribution in #4951
- @pandyamarut made their first contribution in #5079
- @afeldman-nm made their first contribution in #4837
- @Etelis made their first contribution in #5031
- @okakarpa made their first contribution in #5112
- @deep-diver made their first contribution in #3074
- @xingweiqu made their first contribution in #5081
- @functionxu123 made their first contribution in #4196
Full Changelog: v0.4.2...v0.4.3