vllm-project/vllm v0.4.1


Highlights

Features

  • Support and enhance Command R+ (#3829), MiniCPM (#3893), Meta Llama 3 (#4175, #4182), and Mixtral 8x22B (#4073, #4002)
  • Support private model registration, and update our model support policy (#3871, #3948)
  • Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
  • Add option to use LM Format Enforcer for guided decoding (#3868)
  • Add option to skip tokenizer and detokenizer initialization (#3748)
  • Add option to load models with tensorizer (#3476)
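As an illustration of the new guided-decoding option, here is a minimal sketch of a request to the OpenAI-compatible server that constrains output to a JSON schema and selects the LM Format Enforcer backend. This is a hedged example, not the canonical usage: the `guided_json` and `guided_decoding_backend` request fields are assumed to be accepted as extra body parameters in this release, the server address comes from a hypothetical `VLLM_SERVER` environment variable, and the model name is only a placeholder.

```python
# Sketch: JSON-constrained completion via the OpenAI-compatible server,
# using the LM Format Enforcer backend added in #3868.
# Assumptions: a vLLM server is reachable at $VLLM_SERVER, and the
# `guided_json` / `guided_decoding_backend` fields are passed through.
import json
import os
import urllib.request

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    "prompt": "Give me a person as JSON:",
    "max_tokens": 64,
    "guided_json": schema,                            # constrain to schema
    "guided_decoding_backend": "lm-format-enforcer",  # new in this release
}

server = os.environ.get("VLLM_SERVER")
if server:  # only send the request when a server is actually configured
    req = urllib.request.Request(
        f"{server}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["text"])
```

Without `guided_decoding_backend`, the server falls back to its default guided-decoding implementation; the field simply selects which library enforces the schema.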

Enhancements

Hardware

  • An Intel CPU inference backend was added (#3993, #3634)
  • The AMD backend gained a Triton flash-attention kernel and fp8 (e4m3fn) KV cache support (#3643, #3290)
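The CPU backend reuses the same user-facing API as the GPU path. A minimal sketch of offline inference on it, under stated assumptions: vLLM is built for CPU (`VLLM_TARGET_DEVICE=cpu`), the `VLLM_CPU_KVCACHE_SPACE` environment variable sets how many GiB the backend reserves for the KV cache, and the model name is only an example. The demo itself is opt-in via a hypothetical `RUN_VLLM_CPU_DEMO` flag so the sketch can be read without a CPU build installed.

```python
# Sketch: offline inference with the new Intel CPU backend (#3634, #3993).
# Assumptions: a CPU build of vLLM, and VLLM_CPU_KVCACHE_SPACE (GiB of
# KV-cache memory) honored by the backend at engine start-up.
import os

# Reserve 8 GiB of KV-cache space before the engine initializes.
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"

def run_demo() -> str:
    # Same API as the GPU path; only the build target differs.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # example model
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(max_tokens=16))
    return outputs[0].outputs[0].text

if os.environ.get("RUN_VLLM_CPU_DEMO"):  # opt-in: needs a CPU build of vLLM
    print(run_demo())
```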

What's Changed

  • [Kernel] Layernorm performance optimization by @mawong-amd in #3662
  • [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
  • [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
  • [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
  • [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
  • [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
  • [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
  • [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
  • [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
  • [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
  • [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
  • [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
  • [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
  • [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
  • Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
  • [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ by @mgoin in #3798
  • [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803
  • [Speculative decoding] Adding configuration object for speculative decoding by @cadedaniel in #3706
  • [BugFix] Use different mechanism to get vllm version in is_cpu() by @njhill in #3804
  • [Doc] Update README.md by @robertgshaw2-neuralmagic in #3806
  • [Doc] Update contribution guidelines for better onboarding by @michaelfeil in #3819
  • [3/N] Refactor scheduler for chunked prefill scheduling by @rkooo567 in #3550
  • Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) by @AdrianAbeyta in #3290
  • [Misc] Publish 3rd meetup slides by @WoosukKwon in #3835
  • Fixes the argument for local_tokenizer_group by @sighingnow in #3754
  • [Core] Enable hf_transfer by default if available by @michaelfeil in #3817
  • [Bugfix] Add kv_scale input parameter to CPU backend by @WoosukKwon in #3840
  • [Core] [Frontend] Make detokenization optional by @mgerstgrasser in #3749
  • [Bugfix] Fix args in benchmark_serving by @CatherineSue in #3836
  • [Benchmark] Refactor sample_requests in benchmark_throughput by @gty111 in #3613
  • [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 by @youkaichao in #3805
  • [Hardware][CPU] Update cpu torch to match default of 2.2.1 by @mgoin in #3854
  • [Model] Cohere CommandR+ by @saurabhdash2512 in #3829
  • [Core] improve robustness of pynccl by @youkaichao in #3860
  • [Doc]Add asynchronous engine arguments to documentation. by @SeanGallen in #3810
  • [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels by @youkaichao in #3859
  • [Misc] Add pytest marker to opt-out of global test cleanup by @cadedaniel in #3863
  • [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py by @cadedaniel in #3864
  • [Bugfix] Fixing requirements.txt by @noamgat in #3865
  • [Misc] Define common requirements by @WoosukKwon in #3841
  • Add option to completion API to truncate prompt tokens by @tdoublep in #3144
  • [Chunked Prefill][4/n] Chunked prefill scheduler. by @rkooo567 in #3853
  • [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism by @Isotr0py in #3869
  • [CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark by @youkaichao in #3889
  • [Core] enable out-of-tree model register by @youkaichao in #3871
  • [WIP][Core] latency optimization by @youkaichao in #3890
  • [Bugfix] Fix Llava inference with Tensor Parallelism. by @Isotr0py in #3883
  • [Model] add minicpm by @SUDA-HLT-ywfang in #3893
  • [Bugfix] Added Command-R GPTQ support by @egortolmachev in #3849
  • [Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration by @Ki6an in #3767
  • [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations by @mawong-amd in #3782
  • [BugFix][Model] Fix commandr RoPE max_position_embeddings by @esmeetu in #3919
  • [Core] separate distributed_init from worker by @youkaichao in #3904
  • [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" by @cadedaniel in #3837
  • [Bugfix] Fix KeyError on loading GPT-NeoX by @jsato8094 in #3925
  • [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm by @jpvillam-amd in #3643
  • [Misc] Avoid loading incorrect LoRA config by @jeejeelee in #3777
  • [Benchmark] Add cpu options to bench scripts by @PZD-CHINA in #3915
  • [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable by @zhaotyer in #3955
  • [Bugfix] Fix logits processor when prompt_logprobs is not None by @huyiwen in #3899
  • [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty by @tjohnson31415 in #3876
  • [Bugfix][ROCm] Add numba to Dockerfile.rocm by @WoosukKwon in #3962
  • [Model][AMD] ROCm support for 256 head dims for Gemma by @jamestwhedbee in #3972
  • [Doc] Add doc to state our model support policy by @youkaichao in #3948
  • [Bugfix] Remove key sorting for guided_json parameter in OpenAi compatible Server by @dmarasco in #3945
  • [Doc] Fix getting stared to use publicly available model by @fpaupier in #3963
  • [Bugfix] handle hf_config with architectures == None by @tjohnson31415 in #3982
  • [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators by @youkaichao in #3950
  • [Core][5/N] Fully working chunked prefill e2e by @rkooo567 in #3884
  • [Core][Model] Use torch.compile to accelerate layernorm in commandr by @youkaichao in #3985
  • [Test] Add xformer and flash attn tests by @rkooo567 in #3961
  • [Misc] refactor ops and cache_ops layer by @jikunshang in #3913
  • [Doc][Installation] delete python setup.py develop by @youkaichao in #3989
  • [Kernel] Fused MoE Config for Mixtral 8x22 by @ywang96 in #4002
  • fix-bgmv-kernel-640 by @kingljl in #4007
  • [Hardware][Intel] Isolate CPUModelRunner and ModelRunner for better maintenance by @bigPYJ1151 in #3824
  • [Core] Set linear_weights directly on the layer by @Yard1 in #3977
  • [Core][Distributed] make init_distributed_environment compatible with init_process_group by @youkaichao in #4014
  • Fix echo/logprob OpenAI completion bug by @dylanwhawk in #3441
  • [Kernel] Add extra punica sizes to support bigger vocabs by @Yard1 in #4015
  • [BugFix] Fix handling of stop strings and stop token ids by @njhill in #3672
  • [Doc] Add typing hints / mypy types cleanup by @michaelfeil in #3816
  • [Core] Support LoRA on quantized models by @jeejeelee in #4012
  • [Frontend][Core] Move merge_async_iterators to utils by @DarkLight1337 in #4026
  • [Test] Test multiple attn backend for chunked prefill. by @rkooo567 in #4023
  • [Bugfix] fix type hint for py 3.8 by @youkaichao in #4036
  • [Misc] Fix typo in scheduler.py by @zhuohan123 in #4022
  • [mypy] Add mypy type annotation part 1 by @rkooo567 in #4006
  • [Core] fix custom allreduce default value by @youkaichao in #4040
  • Fix triton compilation issue by @Bellk17 in #3984
  • [Bugfix] Fix LoRA bug by @jeejeelee in #4032
  • [CI/Test] expand ruff and yapf for all supported python version by @youkaichao in #4037
  • [Bugfix] More type hint fixes for py 3.8 by @dylanwhawk in #4039
  • [Core][Distributed] improve logging for init dist by @youkaichao in #4042
  • [Bugfix] fix_log_time_in_metrics by @zspo in #4050
  • [Bugfix] fix_small_bug_in_neuron_executor by @zspo in #4051
  • [Kernel] Add punica dimension for Baichuan-13B by @jeejeelee in #4053
  • [Frontend] [Core] feat: Add model loading using tensorizer by @sangstar in #3476
  • [Core] avoid too many cuda context by caching p2p test by @youkaichao in #4021
  • [BugFix] Fix tensorizer extra in setup.py by @njhill in #4072
  • [Docs] document that mixtral 8x22b is supported by @simon-mo in #4073
  • [Misc] Upgrade triton to 2.2.0 by @esmeetu in #4061
  • [Bugfix] Fix filelock version requirement by @zhuohan123 in #4075
  • [Misc][Minor] Fix CPU block num log in CPUExecutor. by @bigPYJ1151 in #4088
  • [Core] Simplifications to executor classes by @njhill in #4071
  • [Doc] Add better clarity for tensorizer usage by @sangstar in #4090
  • [Bugfix] Fix ray workers profiling with nsight by @rickyyx in #4095
  • [Typing] Fix Sequence type GenericAlias only available after Python 3.9. by @rkooo567 in #4092
  • [Core] Fix engine-use-ray broken by @rkooo567 in #4105
  • LM Format Enforcer Guided Decoding Support by @noamgat in #3868
  • [Core] Refactor model loading code by @Yard1 in #4097
  • [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine by @cadedaniel in #3894
  • [Misc] [CI] Fix CI failure caught after merge by @cadedaniel in #4126
  • [CI] Move CPU/AMD tests to after wait by @cadedaniel in #4123
  • [Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication by @youkaichao in #4024
  • [Bugfix] fix output parsing error for trtllm backend by @elinx in #4137
  • [Kernel] Add punica dimension for Swallow-MS-7B LoRA by @ucciicci in #4134
  • [Typing] Mypy typing part 2 by @rkooo567 in #4043
  • [Core] Add integrity check during initialization; add test for it by @youkaichao in #4155
  • Allow model to be served under multiple names by @hmellor in #2894
  • [Bugfix] Get available quantization methods from quantization registry by @mgoin in #4098
  • [Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill by @mmoskal in #4128
  • [Docs] document that Meta Llama 3 is supported by @simon-mo in #4175
  • [Bugfix] Support logprobs when using guided_json and other constrained decoding fields by @jamestwhedbee in #4149
  • [Misc] Bump transformers to latest version by @njhill in #4176
  • [CI/CD] add neuron docker and ci test scripts by @liangfu in #3571
  • [Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974) by @agt in #4159
  • [Core] add an option to log every function call to for debugging hang/crash in distributed inference by @youkaichao in #4079
  • Support eos_token_id from generation_config.json by @simon-mo in #4182
  • [Bugfix] Fix LoRA loading check by @jeejeelee in #4138
  • Bump version of 0.4.1 by @simon-mo in #4177
  • [Misc] fix docstrings by @UranusSeven in #4191
  • [Bugfix][Core] Restore logging of stats in the async engine by @ronensc in #4150
  • [Misc] add nccl in collect env by @youkaichao in #4211
  • Pass tokenizer_revision when getting tokenizer in openai serving by @chiragjn in #4214
  • [Bugfix] Add fix for JSON whitespace by @ayusher in #4189
  • Fix missing docs and out of sync EngineArgs by @hmellor in #4219
  • [Kernel][FP8] Initial support with dynamic per-tensor scaling by @comaniac in #4118
  • [Frontend] multiple sampling params support by @nunjunj in #3570
  • Updating lm-format-enforcer version and adding links to decoding libraries in docs by @noamgat in #4222
  • Don't show default value for flags in EngineArgs by @hmellor in #4223
  • [Doc]: Update the page of adding new models by @YeFD in #4236
  • Make initialization of tokenizer and detokenizer optional by @GeauxEric in #3748
  • [AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring by @hongxiayang in #4129
  • [Core][Distributed] fix _is_full_nvlink detection by @youkaichao in #4233
  • [Misc] Add vision language model support to CPU backend by @Isotr0py in #3968
  • [Bugfix] Fix type annotations in CPU model runner by @WoosukKwon in #4256
  • [Frontend] Enable support for CPU backend in AsyncLLMEngine. by @sighingnow in #3993
  • [Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter by @alexm-nm in #4217
  • Add example scripts to documentation by @hmellor in #4225
  • [Core] Scheduler perf fix by @rkooo567 in #4270
  • [Doc] Update the SkyPilot doc with serving and Llama-3 by @Michaelvll in #4276
  • [Core][Distributed] use absolute path for library file by @youkaichao in #4271
  • Fix autodoc directives by @hmellor in #4272
  • [Mypy] Part 3 fix typing for nested directories for most of directory by @rkooo567 in #4161
  • [Core] Some simplification of WorkerWrapper changes by @njhill in #4183
  • [Core] Scheduling optimization 2 by @rkooo567 in #4280
  • [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. by @cadedaniel in #3951
  • [Bugfix] Fixing max token error message for openai compatible server by @jgordley in #4016
  • [Bugfix] Add init_cached_hf_modules to RayWorkerWrapper by @DefTruth in #4286
  • [Core][Logging] Add last frame information for better debugging by @youkaichao in #4278
  • [CI] Add ccache for wheel builds job by @simon-mo in #4281
  • AQLM CUDA support by @jaemzfleming in #3287
  • [Bugfix][Frontend] Raise exception when file-like chat template fails to be opened by @DarkLight1337 in #4292
  • [Kernel] FP8 support for MoE kernel / Mixtral by @pcmoritz in #4244
  • [Bugfix] fixed fp8 conflict with aqlm by @robertgshaw2-neuralmagic in #4307
  • [Core][Distributed] use cpu/gloo to initialize pynccl by @youkaichao in #4248
  • [CI][Build] change pynvml to nvidia-ml-py by @youkaichao in #4302
  • [Misc] Reduce supported Punica dtypes by @WoosukKwon in #4304

New Contributors

Full Changelog: v0.4.0...v0.4.1
