## Highlights

### Features
- Support and enhance Command R+ (#3829), MiniCPM (#3893), Meta Llama 3 (#4175, #4182), Mixtral 8x22B (#4073, #4002)
- Support private model registration and update our support policy (#3871, #3948)
- Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
- Add option to use LM Format Enforcer for guided decoding (#3868)
- Add option to make tokenizer and detokenizer initialization optional (#3748)
- Add option to load models using `tensorizer` (#3476)
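The guided-decoding option above is selected per request on the OpenAI-compatible server. A minimal sketch of building such a request body; the `guided_json` and `guided_decoding_backend` field names are vLLM extensions to the OpenAI API (treated as assumptions here), and the schema and model name are purely illustrative:

```python
import json

# Hypothetical JSON schema the model output must conform to.
answer_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

def build_guided_request(prompt: str, schema: dict) -> str:
    """Build an OpenAI-style /v1/completions payload asking vLLM to
    constrain decoding with LM Format Enforcer (assumed field names)."""
    payload = {
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative
        "prompt": prompt,
        "max_tokens": 64,
        # vLLM extensions: constrain output to the schema, and select
        # the LM Format Enforcer backend added in #3868 (assumed value).
        "guided_json": schema,
        "guided_decoding_backend": "lm-format-enforcer",
    }
    return json.dumps(payload)

body = build_guided_request("Describe Tokyo as JSON:", answer_schema)
```

The payload would then be POSTed to the server's `/v1/completions` endpoint; constrained decoding guarantees the completion parses against the schema.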

### Enhancements
- vLLM is now mostly type-checked by `mypy` (#3816, #4006, #4161, #4043)
- Progress towards chunked prefill scheduler (#3550, #3853, #4280, #3884)
- Progress towards speculative decoding (#3250, #3706, #3894)
- Initial FP8 support with dynamic per-tensor scaling (#4118)
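Dynamic per-tensor scaling means one scale factor is recomputed per tensor from its current absolute maximum, so values fit the narrow e4m3fn range. A pure-Python sketch of the idea (448.0 is the largest finite e4m3fn value; the helper names are illustrative, not vLLM's API, and real FP8 kernels also round to the nearest representable code, which is omitted here):

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3fn

def dynamic_per_tensor_scale(values: list[float]) -> float:
    """One scale per tensor, recomputed from the live data (hence 'dynamic')."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0

def quantize(values: list[float], scale: float) -> list[float]:
    """Divide by the scale and clamp into the representable range."""
    return [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in values]

def dequantize(quantized: list[float], scale: float) -> list[float]:
    """Recover the original magnitudes by multiplying the scale back in."""
    return [v * scale for v in quantized]

x = [0.5, -896.0, 224.0]
s = dynamic_per_tensor_scale(x)  # amax 896 / 448 = 2.0
q = quantize(x, s)               # [0.25, -448.0, 112.0]
y = dequantize(q, s)             # round-trips exactly for these values
```

Because the scale tracks the live activation range, no calibration pass is needed; the trade-off is the extra `amax` reduction at runtime.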

### Hardware
- Intel CPU inference backend is added (#3993, #3634)
- AMD backend is enhanced with Triton kernel and e4m3fn KV cache (#3643, #3290)

## What's Changed
- [Kernel] Layernorm performance optimization by @mawong-amd in #3662
- [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
- [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
- [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
- [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
- [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
- [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
- [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
- [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
- [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
- [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
- [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
- [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
- [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
- Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
- [Bugfix] Add `__init__.py` files for `vllm/core/block/` and `vllm/spec_decode/` by @mgoin in #3798
- [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803
- [Speculative decoding] Adding configuration object for speculative decoding by @cadedaniel in #3706
- [BugFix] Use different mechanism to get vllm version in `is_cpu()` by @njhill in #3804
- [Doc] Update README.md by @robertgshaw2-neuralmagic in #3806
- [Doc] Update contribution guidelines for better onboarding by @michaelfeil in #3819
- [3/N] Refactor scheduler for chunked prefill scheduling by @rkooo567 in #3550
- Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) by @AdrianAbeyta in #3290
- [Misc] Publish 3rd meetup slides by @WoosukKwon in #3835
- Fixes the argument for local_tokenizer_group by @sighingnow in #3754
- [Core] Enable hf_transfer by default if available by @michaelfeil in #3817
- [Bugfix] Add kv_scale input parameter to CPU backend by @WoosukKwon in #3840
- [Core] [Frontend] Make detokenization optional by @mgerstgrasser in #3749
- [Bugfix] Fix args in benchmark_serving by @CatherineSue in #3836
- [Benchmark] Refactor sample_requests in benchmark_throughput by @gty111 in #3613
- [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 by @youkaichao in #3805
- [Hardware][CPU] Update cpu torch to match default of 2.2.1 by @mgoin in #3854
- [Model] Cohere CommandR+ by @saurabhdash2512 in #3829
- [Core] improve robustness of pynccl by @youkaichao in #3860
- [Doc] Add asynchronous engine arguments to documentation by @SeanGallen in #3810
- [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels by @youkaichao in #3859
- [Misc] Add pytest marker to opt-out of global test cleanup by @cadedaniel in #3863
- [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py by @cadedaniel in #3864
- [Bugfix] Fixing requirements.txt by @noamgat in #3865
- [Misc] Define common requirements by @WoosukKwon in #3841
- Add option to completion API to truncate prompt tokens by @tdoublep in #3144
- [Chunked Prefill][4/n] Chunked prefill scheduler. by @rkooo567 in #3853
- [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism by @Isotr0py in #3869
- [CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark by @youkaichao in #3889
- [Core] enable out-of-tree model register by @youkaichao in #3871
- [WIP][Core] latency optimization by @youkaichao in #3890
- [Bugfix] Fix Llava inference with Tensor Parallelism. by @Isotr0py in #3883
- [Model] add minicpm by @SUDA-HLT-ywfang in #3893
- [Bugfix] Added Command-R GPTQ support by @egortolmachev in #3849
- [Bugfix] Enable proper `attention_bias` usage in Llama model configuration by @Ki6an in #3767
- [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations by @mawong-amd in #3782
- [BugFix][Model] Fix commandr RoPE max_position_embeddings by @esmeetu in #3919
- [Core] separate distributed_init from worker by @youkaichao in #3904
- [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" by @cadedaniel in #3837
- [Bugfix] Fix KeyError on loading GPT-NeoX by @jsato8094 in #3925
- [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm by @jpvillam-amd in #3643
- [Misc] Avoid loading incorrect LoRA config by @jeejeelee in #3777
- [Benchmark] Add cpu options to bench scripts by @PZD-CHINA in #3915
- [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable by @zhaotyer in #3955
- [Bugfix] Fix logits processor when prompt_logprobs is not None by @huyiwen in #3899
- [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty by @tjohnson31415 in #3876
- [Bugfix][ROCm] Add numba to Dockerfile.rocm by @WoosukKwon in #3962
- [Model][AMD] ROCm support for 256 head dims for Gemma by @jamestwhedbee in #3972
- [Doc] Add doc to state our model support policy by @youkaichao in #3948
- [Bugfix] Remove key sorting for `guided_json` parameter in OpenAI-compatible server by @dmarasco in #3945
- [Doc] Fix getting started to use publicly available model by @fpaupier in #3963
- [Bugfix] handle hf_config with architectures == None by @tjohnson31415 in #3982
- [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators by @youkaichao in #3950
- [Core][5/N] Fully working chunked prefill e2e by @rkooo567 in #3884
- [Core][Model] Use torch.compile to accelerate layernorm in commandr by @youkaichao in #3985
- [Test] Add xformer and flash attn tests by @rkooo567 in #3961
- [Misc] refactor ops and cache_ops layer by @jikunshang in #3913
- [Doc][Installation] delete python setup.py develop by @youkaichao in #3989
- [Kernel] Fused MoE Config for Mixtral 8x22 by @ywang96 in #4002
- fix-bgmv-kernel-640 by @kingljl in #4007
- [Hardware][Intel] Isolate CPUModelRunner and ModelRunner for better maintenance by @bigPYJ1151 in #3824
- [Core] Set `linear_weights` directly on the layer by @Yard1 in #3977
- [Core][Distributed] make init_distributed_environment compatible with init_process_group by @youkaichao in #4014
- Fix echo/logprob OpenAI completion bug by @dylanwhawk in #3441
- [Kernel] Add extra punica sizes to support bigger vocabs by @Yard1 in #4015
- [BugFix] Fix handling of stop strings and stop token ids by @njhill in #3672
- [Doc] Add typing hints / mypy types cleanup by @michaelfeil in #3816
- [Core] Support LoRA on quantized models by @jeejeelee in #4012
- [Frontend][Core] Move `merge_async_iterators` to utils by @DarkLight1337 in #4026
- [Test] Test multiple attn backends for chunked prefill by @rkooo567 in #4023
- [Bugfix] fix type hint for py 3.8 by @youkaichao in #4036
- [Misc] Fix typo in scheduler.py by @zhuohan123 in #4022
- [mypy] Add mypy type annotation part 1 by @rkooo567 in #4006
- [Core] fix custom allreduce default value by @youkaichao in #4040
- Fix triton compilation issue by @Bellk17 in #3984
- [Bugfix] Fix LoRA bug by @jeejeelee in #4032
- [CI/Test] expand ruff and yapf for all supported python version by @youkaichao in #4037
- [Bugfix] More type hint fixes for py 3.8 by @dylanwhawk in #4039
- [Core][Distributed] improve logging for init dist by @youkaichao in #4042
- [Bugfix] fix_log_time_in_metrics by @zspo in #4050
- [Bugfix] fix_small_bug_in_neuron_executor by @zspo in #4051
- [Kernel] Add punica dimension for Baichuan-13B by @jeejeelee in #4053
- [Frontend][Core] feat: Add model loading using `tensorizer` by @sangstar in #3476
- [Core] avoid too many cuda contexts by caching p2p test by @youkaichao in #4021
- [BugFix] Fix tensorizer extra in setup.py by @njhill in #4072
- [Docs] document that mixtral 8x22b is supported by @simon-mo in #4073
- [Misc] Upgrade triton to 2.2.0 by @esmeetu in #4061
- [Bugfix] Fix filelock version requirement by @zhuohan123 in #4075
- [Misc][Minor] Fix CPU block num log in CPUExecutor. by @bigPYJ1151 in #4088
- [Core] Simplifications to executor classes by @njhill in #4071
- [Doc] Add better clarity for tensorizer usage by @sangstar in #4090
- [Bugfix] Fix ray workers profiling with nsight by @rickyyx in #4095
- [Typing] Fix Sequence type GenericAlias only available after Python 3.9. by @rkooo567 in #4092
- [Core] Fix engine-use-ray broken by @rkooo567 in #4105
- LM Format Enforcer Guided Decoding Support by @noamgat in #3868
- [Core] Refactor model loading code by @Yard1 in #4097
- [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine by @cadedaniel in #3894
- [Misc] [CI] Fix CI failure caught after merge by @cadedaniel in #4126
- [CI] Move CPU/AMD tests to after wait by @cadedaniel in #4123
- [Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication by @youkaichao in #4024
- [Bugfix] fix output parsing error for trtllm backend by @elinx in #4137
- [Kernel] Add punica dimension for Swallow-MS-7B LoRA by @ucciicci in #4134
- [Typing] Mypy typing part 2 by @rkooo567 in #4043
- [Core] Add integrity check during initialization; add test for it by @youkaichao in #4155
- Allow model to be served under multiple names by @hmellor in #2894
- [Bugfix] Get available quantization methods from quantization registry by @mgoin in #4098
- [Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill by @mmoskal in #4128
- [Docs] document that Meta Llama 3 is supported by @simon-mo in #4175
- [Bugfix] Support logprobs when using guided_json and other constrained decoding fields by @jamestwhedbee in #4149
- [Misc] Bump transformers to latest version by @njhill in #4176
- [CI/CD] add neuron docker and ci test scripts by @liangfu in #3571
- [Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974) by @agt in #4159
- [Core] add an option to log every function call for debugging hang/crash in distributed inference by @youkaichao in #4079
- Support eos_token_id from generation_config.json by @simon-mo in #4182
- [Bugfix] Fix LoRA loading check by @jeejeelee in #4138
- Bump version of 0.4.1 by @simon-mo in #4177
- [Misc] fix docstrings by @UranusSeven in #4191
- [Bugfix][Core] Restore logging of stats in the async engine by @ronensc in #4150
- [Misc] add nccl in collect env by @youkaichao in #4211
- Pass `tokenizer_revision` when getting tokenizer in openai serving by @chiragjn in #4214
- [Bugfix] Add fix for JSON whitespace by @ayusher in #4189
- Fix missing docs and out-of-sync `EngineArgs` by @hmellor in #4219
- [Kernel][FP8] Initial support with dynamic per-tensor scaling by @comaniac in #4118
- [Frontend] multiple sampling params support by @nunjunj in #3570
- Updating lm-format-enforcer version and adding links to decoding libraries in docs by @noamgat in #4222
- Don't show default value for flags in `EngineArgs` by @hmellor in #4223
- [Doc] Update the page on adding new models by @YeFD in #4236
- Make initialization of tokenizer and detokenizer optional by @GeauxEric in #3748
- [AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring by @hongxiayang in #4129
- [Core][Distributed] fix _is_full_nvlink detection by @youkaichao in #4233
- [Misc] Add vision language model support to CPU backend by @Isotr0py in #3968
- [Bugfix] Fix type annotations in CPU model runner by @WoosukKwon in #4256
- [Frontend] Enable support for CPU backend in AsyncLLMEngine. by @sighingnow in #3993
- [Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter by @alexm-nm in #4217
- Add example scripts to documentation by @hmellor in #4225
- [Core] Scheduler perf fix by @rkooo567 in #4270
- [Doc] Update the SkyPilot doc with serving and Llama-3 by @Michaelvll in #4276
- [Core][Distributed] use absolute path for library file by @youkaichao in #4271
- Fix `autodoc` directives by @hmellor in #4272
- [Mypy] Part 3: fix typing for nested directories for most of the directory by @rkooo567 in #4161
- [Core] Some simplification of WorkerWrapper changes by @njhill in #4183
- [Core] Scheduling optimization 2 by @rkooo567 in #4280
- [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. by @cadedaniel in #3951
- [Bugfix] Fixing max token error message for openai compatible server by @jgordley in #4016
- [Bugfix] Add init_cached_hf_modules to RayWorkerWrapper by @DefTruth in #4286
- [Core][Logging] Add last frame information for better debugging by @youkaichao in #4278
- [CI] Add ccache for wheel builds job by @simon-mo in #4281
- AQLM CUDA support by @jaemzfleming in #3287
- [Bugfix][Frontend] Raise exception when file-like chat template fails to be opened by @DarkLight1337 in #4292
- [Kernel] FP8 support for MoE kernel / Mixtral by @pcmoritz in #4244
- [Bugfix] fixed fp8 conflict with aqlm by @robertgshaw2-neuralmagic in #4307
- [Core][Distributed] use cpu/gloo to initialize pynccl by @youkaichao in #4248
- [CI][Build] change pynvml to nvidia-ml-py by @youkaichao in #4302
- [Misc] Reduce supported Punica dtypes by @WoosukKwon in #4304

## New Contributors
- @mawong-amd made their first contribution in #3662
- @Qubitium made their first contribution in #3689
- @bigPYJ1151 made their first contribution in #3634
- @A-Mahla made their first contribution in #3788
- @AdrianAbeyta made their first contribution in #3290
- @mgerstgrasser made their first contribution in #3749
- @CatherineSue made their first contribution in #3836
- @saurabhdash2512 made their first contribution in #3829
- @SeanGallen made their first contribution in #3810
- @SUDA-HLT-ywfang made their first contribution in #3893
- @egortolmachev made their first contribution in #3849
- @Ki6an made their first contribution in #3767
- @jsato8094 made their first contribution in #3925
- @jpvillam-amd made their first contribution in #3643
- @PZD-CHINA made their first contribution in #3915
- @zhaotyer made their first contribution in #3955
- @huyiwen made their first contribution in #3899
- @dmarasco made their first contribution in #3945
- @fpaupier made their first contribution in #3963
- @kingljl made their first contribution in #4007
- @DarkLight1337 made their first contribution in #4026
- @Bellk17 made their first contribution in #3984
- @sangstar made their first contribution in #3476
- @rickyyx made their first contribution in #4095
- @elinx made their first contribution in #4137
- @ucciicci made their first contribution in #4134
- @mmoskal made their first contribution in #4128
- @agt made their first contribution in #4159
- @ayusher made their first contribution in #4189
- @nunjunj made their first contribution in #3570
- @YeFD made their first contribution in #4236
- @GeauxEric made their first contribution in #3748
- @alexm-nm made their first contribution in #4217
- @jgordley made their first contribution in #4016
- @DefTruth made their first contribution in #4286
- @jaemzfleming made their first contribution in #3287

**Full Changelog**: v0.4.0...v0.4.1