vllm-project/vllm v0.4.0


Major changes

Models

  • New model support in this release includes Command-R (#3433), Qwen2 MoE (#3346), DBRX (#3660), XVerse (#3610), Jais (#3183), and initial vision language model support (#3042).

Production features

  • Automatic prefix caching (#2762, #3703), which lets long system prompts be automatically cached and reused across requests. Use the flag --enable-prefix-caching to turn it on (see the sketch after this list).
  • Support for json_object in the OpenAI-compatible server for arbitrary JSON output, a --use-delay flag to improve time to first token across many requests, and a min_tokens sampling parameter to suppress EOS until a minimum length is reached.
  • Progress on the chunked prefill scheduler (#3236, #3538) and speculative decoding (#3103).
  • The custom all-reduce kernel has been re-enabled after more robustness fixes.
  • Replaced the cupy dependency due to its bugs.
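
As a minimal sketch of the first two items above (the model name, port, and server command line are illustrative assumptions, not prescribed by this release), prefix caching is enabled either on the offline LLM engine or via --enable-prefix-caching on the OpenAI-compatible server, while json_object output and min_tokens can be requested through the standard openai client; min_tokens is a vLLM-specific sampling parameter, so passing it via the client's extra_body is an assumption about the server-side protocol:

    # Offline engine: enable automatic prefix caching so the shared long system
    # prompt is cached and reused across the two requests below.
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative model
              enable_prefix_caching=True)
    system = "You are a support agent for ExampleCorp. <long shared policy text> "
    outputs = llm.generate(
        [system + "User: reset my password", system + "User: close my account"],
        SamplingParams(max_tokens=64, min_tokens=8),  # min_tokens suppresses early EOS
    )

    # OpenAI-compatible server, e.g. started with:
    #   python -m vllm.entrypoints.openai.api_server \
    #       --model mistralai/Mistral-7B-Instruct-v0.2 --enable-prefix-caching
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        messages=[{"role": "user", "content": "Describe vLLM as a JSON object."}],
        response_format={"type": "json_object"},  # arbitrary-JSON mode (#3211)
        extra_body={"min_tokens": 16},            # vLLM-specific parameter (#3124)
    )
    print(resp.choices[0].message.content)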

Hardware

  • Improved Neuron support for AWS Inferentia.
  • CMake-based build system for extensibility.

Ecosystem

  • Extensive serving benchmark refactoring (#3277)
  • Usage statistics collection (#2852); see the opt-out sketch after this list.
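
Below is a small sketch of opting out of usage-stats collection before vLLM is imported. The VLLM_NO_USAGE_STATS and DO_NOT_TRACK environment variables and the ~/.config/vllm/do_not_track marker file are the opt-out mechanisms described in vLLM's usage-stats documentation; their exact availability in this particular release is an assumption:

    # Opt out of usage-stats collection; set these before the first `import vllm`
    # so the usage reporter never starts. (Assumed opt-out knobs, see lead-in.)
    import os
    from pathlib import Path

    os.environ["VLLM_NO_USAGE_STATS"] = "1"  # assumed vLLM-specific opt-out
    os.environ["DO_NOT_TRACK"] = "1"         # generic do-not-track convention

    # Alternatively, a persistent marker file (assumed path):
    marker_dir = Path("~/.config/vllm").expanduser()
    marker_dir.mkdir(parents=True, exist_ok=True)
    (marker_dir / "do_not_track").touch()

    import vllm  # noqa: E402  -- imported after the opt-out settings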

What's Changed

  • Allow user to choose log level via --log-level instead of fixed 'info' by @AllenDou in #3109
  • Reorder kv dtype check to avoid nvcc not found error on AMD platform by @cloudhan in #3104
  • Add Automatic Prefix Caching by @SageMoore in #2762
  • Add vLLM version info to logs and openai API server by @jasonacox in #3161
  • [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark by @zhuohan123 in #3158
  • Make it easy to profile workers with nsight by @pcmoritz in #3162
  • [DOC] add setup document to support neuron backend by @liangfu in #2777
  • [Minor Fix] Remove unused code in benchmark_prefix_caching.py by @gty111 in #3171
  • Add document for vllm paged attention kernel. by @pian13131 in #2978
  • enable --gpu-memory-utilization in benchmark_throughput.py by @AllenDou in #3175
  • [Minor fix] The domain dns.google may cause a socket.gaierror exception by @ttbachyinsda in #3176
  • Push logprob generation to LLMEngine by @Yard1 in #3065
  • Add health check, make async Engine more robust by @Yard1 in #3015
  • Fix the openai benchmarking requests to work with latest OpenAI apis by @wangchen615 in #2992
  • [ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs by @hongxiayang in #3123
  • Store eos_token_id in Sequence for easy access by @njhill in #3166
  • [Fix] Avoid pickling entire LLMEngine for Ray workers by @njhill in #3207
  • [Tests] Add block manager and scheduler tests by @rkooo567 in #3108
  • [Testing] Fix core tests by @cadedaniel in #3224
  • A simple addition of dynamic_ncols=True by @chujiezheng in #3242
  • Add GPTQ support for Gemma by @TechxGenus in #3200
  • Update requirements-dev.txt to include package for benchmarking scripts. by @wangchen615 in #3181
  • Separate attention backends by @WoosukKwon in #3005
  • Measure model memory usage by @mgoin in #3120
  • Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) by @jacobthebanana in #3263
  • Fix auto prefix bug by @ElizaWszola in #3239
  • Connect engine healthcheck to openai server by @njhill in #3260
  • Feature add lora support for Qwen2 by @whyiug in #3177
  • [Minor Fix] Fix comments in benchmark_serving by @gty111 in #3252
  • [Docs] Fix Unmocked Imports by @ywang96 in #3275
  • [FIX] Make flash_attn optional by @WoosukKwon in #3269
  • Move model filelocks from /tmp/ to ~/.cache/vllm/locks/ dir by @mgoin in #3241
  • [FIX] Fix prefix test error on main by @zhuohan123 in #3286
  • [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling by @cadedaniel in #3103
  • Enhance lora tests with more layer and rank variations by @tterrysun in #3243
  • [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA by @dllehr-amd in #3262
  • [BugFix] Fix get tokenizer when using ray by @esmeetu in #3301
  • [Fix] Fix best_of behavior when n=1 by @njhill in #3298
  • Re-enable the 80 char line width limit by @zhuohan123 in #3305
  • [docs] Add LoRA support information for models by @pcmoritz in #3299
  • Add distributed model executor abstraction by @zhuohan123 in #3191
  • [ROCm] Fix warp and lane calculation in blockReduceSum by @kliuae in #3321
  • Support Mistral Model Inference with transformers-neuronx by @DAIZHENWEI in #3153
  • docs: Add BentoML deployment doc by @Sherlock113 in #3336
  • Fixes #1556 double free by @br3no in #3347
  • Add kernel for GeGLU with approximate GELU by @WoosukKwon in #3337
  • [Fix] fix quantization arg when using marlin by @DreamTeamWangbowen in #3319
  • add hf_transfer to requirements.txt by @RonanKMcGovern in #3031
  • fix bias in if, ambiguous by @hliuca in #3259
  • [Minor Fix] Use cupy-cuda11x in CUDA 11.8 build by @chenxu2048 in #3256
  • Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. by @orsharir in #3350
  • Add batched RoPE kernel by @tterrysun in #3095
  • Fix lint by @Yard1 in #3388
  • [FIX] Simpler fix for async engine running on ray by @zhuohan123 in #3371
  • [Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion by @simon-mo in #3383
  • Allow user to choose which vLLM metrics to display in Grafana by @AllenDou in #3393
  • [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 by @youkaichao in #3389
  • Install flash_attn in Docker image by @tdoublep in #3396
  • Add args for mTLS support by @declark1 in #3410
  • [issue templates] add some issue templates by @youkaichao in #3412
  • Fix assertion failure in Qwen 1.5 with prefix caching enabled by @chenxu2048 in #3373
  • fix marlin config repr by @qeternity in #3414
  • Feature: dynamic shared mem moe_align_block_size_kernel by @akhoroshev in #3376
  • [Misc] add HOST_IP env var by @youkaichao in #3419
  • Add chat templates for Falcon by @Dinghow in #3420
  • Add chat templates for ChatGLM by @Dinghow in #3418
  • Fix dist.broadcast stall without group argument by @GindaChen in #3408
  • Fix tie_word_embeddings for Qwen2. by @fyabc in #3344
  • [Fix] Add args for mTLS support by @declark1 in #3430
  • Fixes the misuse/mixuse of time.time()/time.monotonic() by @sighingnow in #3220
  • [Misc] add error message in non linux platform by @youkaichao in #3438
  • Fix issue templates by @hmellor in #3436
  • fix document error for value and v_vec illustration by @laneeeee in #3421
  • Asynchronous tokenization by @Yard1 in #2879
  • Removed Extraneous Print Message From OAI Server by @robertgshaw2-neuralmagic in #3440
  • [Misc] PR templates by @youkaichao in #3413
  • Fixes the incorrect argument in the prefix-prefill test cases by @sighingnow in #3246
  • Replace lstrip() with removeprefix() to fix Ruff linter warning by @ronensc in #2958
  • Fix Baichuan chat template by @Dinghow in #3340
  • [Misc] fix line length for entire codebase by @simon-mo in #3444
  • Support arbitrary json_object in OpenAI and Context Free Grammar by @simon-mo in #3211
  • Fix setup.py neuron-ls issue by @simon-mo in #2671
  • [Misc] Define from_dict and to_dict in InputMetadata by @WoosukKwon in #3452
  • [CI] Shard tests for LoRA and Kernels to speed up by @simon-mo in #3445
  • [Bugfix] Make moe_align_block_size AMD-compatible by @WoosukKwon in #3470
  • CI: Add ROCm Docker Build by @simon-mo in #2886
  • [Testing] Add test_config.py to CI by @cadedaniel in #3437
  • [CI/Build] Fix Bad Import In Test by @robertgshaw2-neuralmagic in #3473
  • [Misc] Fix PR Template by @zhuohan123 in #3478
  • Cmake based build system by @bnellnm in #2830
  • [Core] Zero-copy asdict for InputMetadata by @Yard1 in #3475
  • [Misc] Update README for the Third vLLM Meetup by @zhuohan123 in #3479
  • [Core] Cache some utils by @Yard1 in #3474
  • [Core] print error before deadlock by @youkaichao in #3459
  • [Doc] Add docs about OpenAI compatible server by @simon-mo in #3288
  • [BugFix] Avoid initializing CUDA too early by @njhill in #3487
  • Update dockerfile with ModelScope support by @ifsheldon in #3429
  • [Doc] minor fix to neuron-installation.rst by @jimburtoft in #3505
  • Revert "[Core] Cache some utils" by @simon-mo in #3507
  • [Doc] minor fix of spelling in amd-installation.rst by @jimburtoft in #3506
  • Use lru_cache for some environment detection utils by @simon-mo in #3508
  • [PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled by @ElizaWszola in #3357
  • [Core] Add generic typing to LRUCache by @njhill in #3511
  • [Misc] Remove cache stream and cache events by @WoosukKwon in #3461
  • Abort when nvcc command is not found in the PATH by @AllenDou in #3527
  • Check for _is_cuda() in compute_num_jobs by @bnellnm in #3481
  • [Bugfix] Fix ROCm support in CMakeLists.txt by @jamestwhedbee in #3534
  • [1/n] Triton sampling kernel by @Yard1 in #3186
  • [1/n][Chunked Prefill] Refactor input query shapes by @rkooo567 in #3236
  • Migrate logits computation and gather to model_runner by @esmeetu in #3233
  • [BugFix] Hot fix in setup.py for neuron build by @zhuohan123 in #3537
  • [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor by @ElizaWszola in #3431
  • Fix 1D query issue from _prune_hidden_states by @rkooo567 in #3539
  • [🚀 Ready to be merged] Added support for Jais models by @grandiose-pizza in #3183
  • [Misc][Log] Add log for tokenizer length not equal to vocabulary size by @esmeetu in #3500
  • [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config by @WoosukKwon in #3551
  • [BugFix] gemma loading after quantization or LoRA. by @taeminlee in #3553
  • [Bugfix][Model] Fix Qwen2 by @esmeetu in #3554
  • [Hardware][Neuron] Refactor neuron support by @zhuohan123 in #3471
  • Some fixes for custom allreduce kernels by @hanzhi713 in #2760
  • Dynamic scheduler delay to improve ITL performance by @tdoublep in #3279
  • [Core] Improve detokenization performance for prefill by @Yard1 in #3469
  • [Bugfix] use SoftLockFile instead of LockFile by @kota-iizuka in #3578
  • [Misc] Fix BLOOM copyright notice by @WoosukKwon in #3591
  • [Misc] Bump transformers version by @ywang96 in #3592
  • [BugFix] Fix Falcon tied embeddings by @WoosukKwon in #3590
  • [BugFix] 1D query fix for MoE models by @njhill in #3597
  • [CI] typo fix: is_hip --> is_hip() by @youkaichao in #3595
  • [CI/Build] respect the common environment variable MAX_JOBS by @youkaichao in #3600
  • [CI/Build] fix flaky test by @youkaichao in #3602
  • [BugFix] minor fix: method typo in rotary_embedding.py file, get_device() -> device by @jikunshang in #3604
  • [Bugfix] Revert "[Bugfix] use SoftLockFile instead of LockFile (#3578)" by @WoosukKwon in #3599
  • [Model] Add starcoder2 awq support by @shaonianyr in #3569
  • [Core] Refactor Attention Take 2 by @WoosukKwon in #3462
  • [Bugfix] fix automatic prefix args and add log info by @gty111 in #3608
  • [CI] Try introducing isort. by @rkooo567 in #3495
  • [Core] Adding token ranks along with logprobs by @SwapnilDreams100 in #3516
  • feat: implement the min_tokens sampling parameter by @tjohnson31415 in #3124
  • [Bugfix] API stream returning two stops by @dylanwhawk in #3450
  • hotfix isort on logprobs ranks pr by @simon-mo in #3622
  • [Feature] Add vision language model support. by @xwjiang2010 in #3042
  • Optimize _get_ranks in Sampler by @Yard1 in #3623
  • [Misc] Include matched stop string/token in responses by @njhill in #2976
  • Enable more models to inference based on LoRA by @jeejeelee in #3382
  • [Bugfix] Fix ipv6 address parsing bug by @liiliiliil in #3641
  • [BugFix] Fix ipv4 address parsing regression by @njhill in #3645
  • [Kernel] support non-zero cuda devices in punica kernels by @jeejeelee in #3636
  • [Doc]add lora support by @jeejeelee in #3649
  • [Misc] Minor fix in KVCache type by @WoosukKwon in #3652
  • [Core] remove cupy dependency by @youkaichao in #3625
  • [Bugfix] More faithful implementation of Gemma by @WoosukKwon in #3653
  • [Bugfix] [Hotfix] fix nccl library name by @youkaichao in #3661
  • [Model] Add support for DBRX by @megha95 in #3660
  • [Misc] add the "download-dir" option to the latency/throughput benchmarks by @AmadeusChan in #3621
  • feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark by @ywang96 in #3277
  • Add support for Cohere's Command-R model by @zeppombal in #3433
  • [Docs] Add Command-R to supported models by @WoosukKwon in #3669
  • [Model] Fix and clean commandr by @esmeetu in #3671
  • [Model] Add support for xverse by @hxer7963 in #3610
  • [CI/Build] update default number of jobs and nvcc threads to avoid overloading the system by @youkaichao in #3675
  • [Kernel] Add Triton MoE kernel configs for DBRX + A100 by @WoosukKwon in #3679
  • [Core] [Bugfix] Refactor block manager subsystem for better testability by @cadedaniel in #3492
  • [Model] Add support for Qwen2MoeModel by @wenyujin333 in #3346
  • [Kernel] DBRX Triton MoE kernel H100 by @ywang96 in #3692
  • [2/N] Chunked prefill data update by @rkooo567 in #3538
  • [Bugfix] Update neuron_executor.py to add optional vision_language_config. by @adamrb in #3695
  • fix benchmark format reporting in buildkite by @simon-mo in #3693
  • [CI] Add test case to run examples scripts by @simon-mo in #3638
  • [Core] Support multi-node inference(eager and cuda graph) by @esmeetu in #3686
  • [Kernel] Add MoE Triton kernel configs for A100 40GB by @WoosukKwon in #3700
  • [Bugfix] Set enable_prefix_caching=True in prefix caching example by @WoosukKwon in #3703
  • fix logging msg for block manager by @simon-mo in #3701
  • [Core] fix del of communicator by @youkaichao in #3702
  • [Benchmark] Change mii to use persistent deployment and support tensor parallel by @IKACE in #3628
  • bump version to v0.4.0 by @simon-mo in #3705
  • Revert "bump version to v0.4.0" by @youkaichao in #3708
  • [Test] Make model tests run again and remove --forked from pytest by @rkooo567 in #3631
  • [Misc] Minor type annotation fix by @WoosukKwon in #3716
  • [Core][Test] move local_rank to the last arg with default value to keep api compatible by @youkaichao in #3711
  • add ccache to docker build image by @simon-mo in #3704
  • Usage Stats Collection by @yhu422 in #2852
  • [BugFix] Fix tokenizer out of vocab size by @esmeetu in #3685
  • [BugFix][Frontend] Fix completion logprobs=0 error by @esmeetu in #3731
  • [Bugfix] Command-R Max Model Length by @ywang96 in #3727
  • bump version to v0.4.0 by @simon-mo in #3712
  • [ROCm][Bugfix] Fixed several bugs related to rccl path and attention selector logic by @hongxiayang in #3699
  • usage lib get version another way by @simon-mo in #3735
  • [BugFix] Use consistent logger everywhere by @njhill in #3738
  • [Core][Bugfix] cache len of tokenizer by @youkaichao in #3741
  • Fix build when nvtools is missing by @bnellnm in #3698
  • CMake build elf without PTX by @simon-mo in #3739

New Contributors

Full Changelog: v0.3.3...v0.4.0
