## Highlights

### Features
- Support and enhance Command R+ (#3829), MiniCPM (#3893), Meta Llama 3 (#4175, #4182), Mixtral 8x22B (#4073, #4002)
- Support private model registration and update our support policy (#3871, #3948)
- Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
- Add option to use LM Format Enforcer for guided decoding (#3868)
- Add option to make tokenizer and detokenizer initialization optional (#3748)
- Add option to load models using `tensorizer` (#3476)
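The guided-decoding option above is selected per request on the OpenAI-compatible server. A minimal sketch of building such a request body; the `guided_json` and `guided_decoding_backend` field names are vLLM extensions to the OpenAI API (treated as assumptions here), and the schema and model name are purely illustrative:

```python
import json

# Hypothetical JSON schema the model output must conform to.
answer_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

def build_guided_request(prompt: str, schema: dict) -> str:
    """Build an OpenAI-style /v1/completions payload asking vLLM to
    constrain decoding with LM Format Enforcer (assumed field names)."""
    payload = {
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative
        "prompt": prompt,
        "max_tokens": 64,
        # vLLM extensions: constrain output to the schema, and select
        # the LM Format Enforcer backend added in #3868 (assumed value).
        "guided_json": schema,
        "guided_decoding_backend": "lm-format-enforcer",
    }
    return json.dumps(payload)

body = build_guided_request("Describe Tokyo as JSON:", answer_schema)
```

The payload would then be POSTed to the server's `/v1/completions` endpoint; constrained decoding guarantees the completion parses against the schema.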

### Enhancements
- vLLM is now mostly type-checked by `mypy` (#3816, #4006, #4161, #4043)
- Progress towards chunked prefill scheduler (#3550, #3853, #4280, #3884)
- Progress towards speculative decoding (#3250, #3706, #3894)
- Initial FP8 support with dynamic per-tensor scaling (#4118)
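Dynamic per-tensor scaling means one scale factor is recomputed per tensor from its current absolute maximum, so values fit the narrow e4m3fn range. A pure-Python sketch of the idea (448.0 is the largest finite e4m3fn value; the helper names are illustrative, not vLLM's API, and real FP8 kernels also round to the nearest representable code, which is omitted here):

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3fn

def dynamic_per_tensor_scale(values: list[float]) -> float:
    """One scale per tensor, recomputed from the live data (hence 'dynamic')."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0

def quantize(values: list[float], scale: float) -> list[float]:
    """Divide by the scale and clamp into the representable range."""
    return [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in values]

def dequantize(quantized: list[float], scale: float) -> list[float]:
    """Recover the original magnitudes by multiplying the scale back in."""
    return [v * scale for v in quantized]

x = [0.5, -896.0, 224.0]
s = dynamic_per_tensor_scale(x)  # amax 896 / 448 = 2.0
q = quantize(x, s)               # [0.25, -448.0, 112.0]
y = dequantize(q, s)             # round-trips exactly for these values
```

Because the scale tracks the live activation range, no calibration pass is needed; the trade-off is the extra `amax` reduction at runtime.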

### Hardware
- Intel CPU inference backend is added (#3993, #3634)
- AMD backend is enhanced with Triton kernel and e4m3fn KV cache (#3643, #3290)

## What's Changed
- [Kernel] Layernorm performance optimization by @mawong-amd in #3662
- [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
- [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
- [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
- [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
- [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
- [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
- [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
- [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
- [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
- [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
- [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
- [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
- [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
- Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
- [Bugfix] Add `__init__.py` files for `vllm/core/block/` and `vllm/spec_decode/` by @mgoin in #3798
- [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803
- [Speculative decoding] Adding configuration object for speculative decoding by @cadedaniel in #3706
- [BugFix] Use different mechanism to get vllm version in `is_cpu()` by @njhill in #3804
- [Doc] Update README.md by @robertgshaw2-neuralmagic in #3806
- [Doc] Update contribution guidelines for better onboarding by @michaelfeil in #3819
- [3/N] Refactor scheduler for chunked prefill scheduling by @rkooo567 in #3550
- Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) by @AdrianAbeyta in #3290
- [Misc] Publish 3rd meetup slides by @WoosukKwon in #3835
- Fixes the argument for local_tokenizer_group by @sighingnow in #3754
- [Core] Enable hf_transfer by default if available by @michaelfeil in #3817
- [Bugfix] Add kv_scale input parameter to CPU backend by @WoosukKwon in #3840
- [Core] [Frontend] Make detokenization optional by @mgerstgrasser in #3749
- [Bugfix] Fix args in benchmark_serving by @CatherineSue in #3836
- [Benchmark] Refactor sample_requests in benchmark_throughput by @gty111 in #3613
- [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 by @youkaichao in #3805
- [Hardware][CPU] Update cpu torch to match default of 2.2.1 by @mgoin in #3854
- [Model] Cohere CommandR+ by @saurabhdash2512 in #3829
- [Core] improve robustness of pynccl by @youkaichao in #3860
- [Doc] Add asynchronous engine arguments to documentation by @SeanGallen in #3810
- [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels by @youkaichao in #3859
- [Misc] Add pytest marker to opt-out of global test cleanup by @cadedaniel in #3863
- [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py by @cadedaniel in #3864
- [Bugfix] Fixing requirements.txt by @noamgat in #3865
- [Misc] Define common requirements by @WoosukKwon in #3841
- Add option to completion API to truncate prompt tokens by @tdoublep in #3144
- [Chunked Prefill][4/n] Chunked prefill scheduler. by @rkooo567 in #3853
- [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism by @Isotr0py in #3869
- [CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark by @youkaichao in #3889
- [Core] enable out-of-tree model register by @youkaichao in #3871
- [WIP][Core] latency optimization by @youkaichao in #3890
- [Bugfix] Fix Llava inference with Tensor Parallelism. by @Isotr0py in #3883
- [Model] add minicpm by @SUDA-HLT-ywfang in #3893
- [Bugfix] Added Command-R GPTQ support by @egortolmachev in #3849
- [Bugfix] Enable proper `attention_bias` usage in Llama model configuration by @Ki6an in #3767
- [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations by @mawong-amd in #3782
- [BugFix][Model] Fix commandr RoPE max_position_embeddings by @esmeetu in #3919
- [Core] separate distributed_init from worker by @youkaichao in #3904
- [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" by @cadedaniel in #3837
- [Bugfix] Fix KeyError on loading GPT-NeoX by @jsato8094 in #3925
- [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm by @jpvillam-amd in #3643
- [Misc] Avoid loading incorrect LoRA config by @jeejeelee in #3777
- [Benchmark] Add cpu options to bench scripts by @PZD-CHINA in #3915
- [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable by @zhaotyer in #3955
- [Bugfix] Fix logits processor when prompt_logprobs is not None by @huyiwen in #3899
- [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty by @tjohnson31415 in #3876
- [Bugfix][ROCm] Add numba to Dockerfile.rocm by @WoosukKwon in #3962
- [Model][AMD] ROCm support for 256 head dims for Gemma by @jamestwhedbee in #3972
- [Doc] Add doc to state our model support policy by @youkaichao in #3948
- [Bugfix] Remove key sorting for `guided_json` parameter in OpenAI-compatible server by @dmarasco in #3945
- [Doc] Fix getting started to use publicly available model by @fpaupier in #3963
- [Bugfix] handle hf_config with architectures == None by @tjohnson31415 in #3982
- [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators by @youkaichao in #3950
- [Core][5/N] Fully working chunked prefill e2e by @rkooo567 in #3884
- [Core][Model] Use torch.compile to accelerate layernorm in commandr by @youkaichao in #3985
- [Test] Add xformer and flash attn tests by @rkooo567 in #3961
- [Misc] refactor ops and cache_ops layer by @jikunshang in #3913
- [Doc][Installation] delete python setup.py develop by @youkaichao in #3989
- [Kernel] Fused MoE Config for Mixtral 8x22 by @ywang96 in #4002
- fix-bgmv-kernel-640 by @kingljl in #4007
- [Hardware][Intel] Isolate CPUModelRunner and ModelRunner for better maintenance by @bigPYJ1151 in #3824
- [Core] Set `linear_weights` directly on the layer by @Yard1 in #3977
- [Core][Distributed] make init_distributed_environment compatible with init_process_group by @youkaichao in #4014
- Fix echo/logprob OpenAI completion bug by @dylanwhawk in #3441
- [Kernel] Add extra punica sizes to support bigger vocabs by @Yard1 in #4015
- [BugFix] Fix handling of stop strings and stop token ids by @njhill in #3672
- [Doc] Add typing hints / mypy types cleanup by @michaelfeil in #3816
- [Core] Support LoRA on quantized models by @jeejeelee in #4012
- [Frontend][Core] Move `merge_async_iterators` to utils by @DarkLight1337 in #4026
- [Test] Test multiple attn backends for chunked prefill by @rkooo567 in #4023
- [Bugfix] fix type hint for py 3.8 by @youkaichao in #4036
- [Misc] Fix typo in scheduler.py by @zhuohan123 in #4022
- [mypy] Add mypy type annotation part 1 by @rkooo567 in #4006
- [Core] fix custom allreduce default value by @youkaichao in #4040
- Fix triton compilation issue by @Bellk17 in #3984
- [Bugfix] Fix LoRA bug by @jeejeelee in #4032
- [CI/Test] expand ruff and yapf for all supported python version by @youkaichao in #4037
- [Bugfix] More type hint fixes for py 3.8 by @dylanwhawk in #4039
- [Core][Distributed] improve logging for init dist by @youkaichao in #4042
- [Bugfix] fix_log_time_in_metrics by @zspo in #4050
- [Bugfix] fix_small_bug_in_neuron_executor by @zspo in #4051
- [Kernel] Add punica dimension for Baichuan-13B by @jeejeelee in #4053
- [Frontend][Core] feat: Add model loading using `tensorizer` by @sangstar in #3476
- [Core] avoid too many cuda contexts by caching p2p test by @youkaichao in #4021
- [BugFix] Fix tensorizer extra in setup.py by @njhill in #4072
- [Docs] document that mixtral 8x22b is supported by @simon-mo in #4073
- [Misc] Upgrade triton to 2.2.0 by @esmeetu in #4061
- [Bugfix] Fix filelock version requirement by @zhuohan123 in #4075
- [Misc][Minor] Fix CPU block num log in CPUExecutor. by @bigPYJ1151 in #4088
- [Core] Simplifications to executor classes by @njhill in #4071
- [Doc] Add better clarity for tensorizer usage by @sangstar in #4090
- [Bugfix] Fix ray workers profiling with nsight by @rickyyx in #4095
- [Typing] Fix Sequence type GenericAlias only available after Python 3.9. by @rkooo567 in #4092
- [Core] Fix engine-use-ray broken by @rkooo567 in #4105
- LM Format Enforcer Guided Decoding Support by @noamgat in #3868
- [Core] Refactor model loading code by @Yard1 in #4097
- [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine by @cadedaniel in #3894
- [Misc] [CI] Fix CI failure caught after merge by @cadedaniel in #4126
- [CI] Move CPU/AMD tests to after wait by @cadedaniel in #4123
- [Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication by @youkaichao in #4024
- [Bugfix] fix output parsing error for trtllm backend by @elinx in #4137
- [Kernel] Add punica dimension for Swallow-MS-7B LoRA by @ucciicci in #4134
- [Typing] Mypy typing part 2 by @rkooo567 in #4043
- [Core] Add integrity check during initialization; add test for it by @youkaichao in #4155
- Allow model to be served under multiple names by @hmellor in #2894
- [Bugfix] Get available quantization methods from quantization registry by @mgoin in #4098
- [Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill by @mmoskal in #4128
- [Docs] document that Meta Llama 3 is supported by @simon-mo in #4175
- [Bugfix] Support logprobs when using guided_json and other constrained decoding fields by @jamestwhedbee in #4149
- [Misc] Bump transformers to latest version by @njhill in #4176
- [CI/CD] add neuron docker and ci test scripts by @liangfu in #3571
- [Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974) by @agt in #4159
- [Core] add an option to log every function call for debugging hang/crash in distributed inference by @youkaichao in #4079
- Support eos_token_id from generation_config.json by @simon-mo in #4182
- [Bugfix] Fix LoRA loading check by @jeejeelee in #4138
- Bump version of 0.4.1 by @simon-mo in #4177
- [Misc] fix docstrings by @UranusSeven in #4191
- [Bugfix][Core] Restore logging of stats in the async engine by @ronensc in #4150
- [Misc] add nccl in collect env by @youkaichao in #4211
- Pass `tokenizer_revision` when getting tokenizer in openai serving by @chiragjn in #4214
- [Bugfix] Add fix for JSON whitespace by @ayusher in #4189
- Fix missing docs and out-of-sync `EngineArgs` by @hmellor in #4219
- [Kernel][FP8] Initial support with dynamic per-tensor scaling by @comaniac in #4118
- [Frontend] multiple sampling params support by @nunjunj in #3570
- Updating lm-format-enforcer version and adding links to decoding libraries in docs by @noamgat in #4222
- Don't show default value for flags in `EngineArgs` by @hmellor in #4223
- [Doc] Update the page on adding new models by @YeFD in #4236
- Make initialization of tokenizer and detokenizer optional by @GeauxEric in #3748
- [AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring by @hongxiayang in #4129
- [Core][Distributed] fix _is_full_nvlink detection by @youkaichao in #4233
- [Misc] Add vision language model support to CPU backend by @Isotr0py in #3968
- [Bugfix] Fix type annotations in CPU model runner by @WoosukKwon in #4256
- [Frontend] Enable support for CPU backend in AsyncLLMEngine. by @sighingnow in #3993
- [Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter by @alexm-nm in #4217
- Add example scripts to documentation by @hmellor in #4225
- [Core] Scheduler perf fix by @rkooo567 in #4270
- [Doc] Update the SkyPilot doc with serving and Llama-3 by @Michaelvll in #4276
- [Core][Distributed] use absolute path for library file by @youkaichao in #4271
- Fix `autodoc` directives by @hmellor in #4272
- [Mypy] Part 3: fix typing for nested directories for most of the directory by @rkooo567 in #4161
- [Core] Some simplification of WorkerWrapper changes by @njhill in #4183
- [Core] Scheduling optimization 2 by @rkooo567 in #4280
- [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. by @cadedaniel in #3951
- [Bugfix] Fixing max token error message for openai compatible server by @jgordley in #4016
- [Bugfix] Add init_cached_hf_modules to RayWorkerWrapper by @DefTruth in #4286
- [Core][Logging] Add last frame information for better debugging by @youkaichao in #4278
- [CI] Add ccache for wheel builds job by @simon-mo in #4281
- AQLM CUDA support by @jaemzfleming in #3287
- [Bugfix][Frontend] Raise exception when file-like chat template fails to be opened by @DarkLight1337 in #4292
- [Kernel] FP8 support for MoE kernel / Mixtral by @pcmoritz in #4244
- [Bugfix] fixed fp8 conflict with aqlm by @robertgshaw2-neuralmagic in #4307
- [Core][Distributed] use cpu/gloo to initialize pynccl by @youkaichao in #4248
- [CI][Build] change pynvml to nvidia-ml-py by @youkaichao in #4302
- [Misc] Reduce supported Punica dtypes by @WoosukKwon in #4304

## New Contributors
- @mawong-amd made their first contribution in #3662
- @Qubitium made their first contribution in #3689
- @bigPYJ1151 made their first contribution in #3634
- @A-Mahla made their first contribution in #3788
- @AdrianAbeyta made their first contribution in #3290
- @mgerstgrasser made their first contribution in #3749
- @CatherineSue made their first contribution in #3836
- @saurabhdash2512 made their first contribution in #3829
- @SeanGallen made their first contribution in #3810
- @SUDA-HLT-ywfang made their first contribution in #3893
- @egortolmachev made their first contribution in #3849
- @Ki6an made their first contribution in #3767
- @jsato8094 made their first contribution in #3925
- @jpvillam-amd made their first contribution in #3643
- @PZD-CHINA made their first contribution in #3915
- @zhaotyer made their first contribution in #3955
- @huyiwen made their first contribution in #3899
- @dmarasco made their first contribution in #3945
- @fpaupier made their first contribution in #3963
- @kingljl made their first contribution in #4007
- @DarkLight1337 made their first contribution in #4026
- @Bellk17 made their first contribution in #3984
- @sangstar made their first contribution in #3476
- @rickyyx made their first contribution in #4095
- @elinx made their first contribution in #4137
- @ucciicci made their first contribution in #4134
- @mmoskal made their first contribution in #4128
- @agt made their first contribution in #4159
- @ayusher made their first contribution in #4189
- @nunjunj made their first contribution in #3570
- @YeFD made their first contribution in #4236
- @GeauxEric made their first contribution in #3748
- @alexm-nm made their first contribution in #4217
- @jgordley made their first contribution in #4016
- @DefTruth made their first contribution in #4286
- @jaemzfleming made their first contribution in #3287

**Full Changelog**: v0.4.0...v0.4.1