vllm-project/vllm v0.4.0


Major changes

Models

  • New model support in this release includes Command-R (#3433), Qwen2 MoE (#3346), DBRX (#3660), XVerse (#3610), Jais (#3183), and initial vision language model support (#3042).

Production features

  • Automatic prefix caching (#2762, #3703), which lets long system prompts be automatically cached and reused across requests. Use the flag --enable-prefix-caching to turn it on (see the sketch after this list).
  • Support for json_object in the OpenAI-compatible server for arbitrary JSON output, a --use-delay flag to improve time to first token across many requests, and a min_tokens sampling parameter to suppress EOS until a minimum length is reached.
  • Progress on the chunked prefill scheduler (#3236, #3538) and speculative decoding (#3103).
  • The custom all-reduce kernel has been re-enabled after more robustness fixes.
  • Replaced the cupy dependency due to its bugs.
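
As a minimal sketch of the first two items above (the model name, port, and server command line are illustrative assumptions, not prescribed by this release), prefix caching is enabled either on the offline LLM engine or via --enable-prefix-caching on the OpenAI-compatible server, while json_object output and min_tokens can be requested through the standard openai client; min_tokens is a vLLM-specific sampling parameter, so passing it via the client's extra_body is an assumption about the server-side protocol:

    # Offline engine: enable automatic prefix caching so the shared long system
    # prompt is cached and reused across the two requests below.
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative model
              enable_prefix_caching=True)
    system = "You are a support agent for ExampleCorp. <long shared policy text> "
    outputs = llm.generate(
        [system + "User: reset my password", system + "User: close my account"],
        SamplingParams(max_tokens=64, min_tokens=8),  # min_tokens suppresses early EOS
    )

    # OpenAI-compatible server, e.g. started with:
    #   python -m vllm.entrypoints.openai.api_server \
    #       --model mistralai/Mistral-7B-Instruct-v0.2 --enable-prefix-caching
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        messages=[{"role": "user", "content": "Describe vLLM as a JSON object."}],
        response_format={"type": "json_object"},  # arbitrary-JSON mode (#3211)
        extra_body={"min_tokens": 16},            # vLLM-specific parameter (#3124)
    )
    print(resp.choices[0].message.content)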

Hardware

  • Improved Neuron support for AWS Inferentia.
  • CMake-based build system for extensibility.

Ecosystem

  • Extensive serving benchmark refactoring (#3277)
  • Usage statistics collection (#2852); see the opt-out sketch after this list.
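
Below is a small sketch of opting out of usage-stats collection before vLLM is imported. The VLLM_NO_USAGE_STATS and DO_NOT_TRACK environment variables and the ~/.config/vllm/do_not_track marker file are the opt-out mechanisms described in vLLM's usage-stats documentation; their exact availability in this particular release is an assumption:

    # Opt out of usage-stats collection; set these before the first `import vllm`
    # so the usage reporter never starts. (Assumed opt-out knobs, see lead-in.)
    import os
    from pathlib import Path

    os.environ["VLLM_NO_USAGE_STATS"] = "1"  # assumed vLLM-specific opt-out
    os.environ["DO_NOT_TRACK"] = "1"         # generic do-not-track convention

    # Alternatively, a persistent marker file (assumed path):
    marker_dir = Path("~/.config/vllm").expanduser()
    marker_dir.mkdir(parents=True, exist_ok=True)
    (marker_dir / "do_not_track").touch()

    import vllm  # noqa: E402  -- imported after the opt-out settings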

What's Changed

  • Allow user to choose log level via --log-level instead of fixed 'info' by @AllenDou in #3109
  • Reorder kv dtype check to avoid nvcc not found error on AMD platform by @cloudhan in #3104
  • Add Automatic Prefix Caching by @SageMoore in #2762
  • Add vLLM version info to logs and openai API server by @jasonacox in #3161
  • [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark by @zhuohan123 in #3158
  • Make it easy to profile workers with nsight by @pcmoritz in #3162
  • [DOC] add setup document to support neuron backend by @liangfu in #2777
  • [Minor Fix] Remove unused code in benchmark_prefix_caching.py by @gty111 in #3171
  • Add document for vllm paged attention kernel. by @pian13131 in #2978
  • enable --gpu-memory-utilization in benchmark_throughput.py by @AllenDou in #3175
  • [Minor fix] The domain dns.google may cause a socket.gaierror exception by @ttbachyinsda in #3176
  • Push logprob generation to LLMEngine by @Yard1 in #3065
  • Add health check, make async Engine more robust by @Yard1 in #3015
  • Fix the openai benchmarking requests to work with latest OpenAI apis by @wangchen615 in #2992
  • [ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs by @hongxiayang in #3123
  • Store eos_token_id in Sequence for easy access by @njhill in #3166
  • [Fix] Avoid pickling entire LLMEngine for Ray workers by @njhill in #3207
  • [Tests] Add block manager and scheduler tests by @rkooo567 in #3108
  • [Testing] Fix core tests by @cadedaniel in #3224
  • A simple addition of dynamic_ncols=True by @chujiezheng in #3242
  • Add GPTQ support for Gemma by @TechxGenus in #3200
  • Update requirements-dev.txt to include package for benchmarking scripts. by @wangchen615 in #3181
  • Separate attention backends by @WoosukKwon in #3005
  • Measure model memory usage by @mgoin in #3120
  • Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) by @jacobthebanana in #3263
  • Fix auto prefix bug by @ElizaWszola in #3239
  • Connect engine healthcheck to openai server by @njhill in #3260
  • Feature add lora support for Qwen2 by @whyiug in #3177
  • [Minor Fix] Fix comments in benchmark_serving by @gty111 in #3252
  • [Docs] Fix Unmocked Imports by @ywang96 in #3275
  • [FIX] Make flash_attn optional by @WoosukKwon in #3269
  • Move model filelocks from /tmp/ to ~/.cache/vllm/locks/ dir by @mgoin in #3241
  • [FIX] Fix prefix test error on main by @zhuohan123 in #3286
  • [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling by @cadedaniel in #3103
  • Enhance lora tests with more layer and rank variations by @tterrysun in #3243
  • [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA by @dllehr-amd in #3262
  • [BugFix] Fix get tokenizer when using ray by @esmeetu in #3301
  • [Fix] Fix best_of behavior when n=1 by @njhill in #3298
  • Re-enable the 80 char line width limit by @zhuohan123 in #3305
  • [docs] Add LoRA support information for models by @pcmoritz in #3299
  • Add distributed model executor abstraction by @zhuohan123 in #3191
  • [ROCm] Fix warp and lane calculation in blockReduceSum by @kliuae in #3321
  • Support Mistral Model Inference with transformers-neuronx by @DAIZHENWEI in #3153
  • docs: Add BentoML deployment doc by @Sherlock113 in #3336
  • Fixes #1556 double free by @br3no in #3347
  • Add kernel for GeGLU with approximate GELU by @WoosukKwon in #3337
  • [Fix] fix quantization arg when using marlin by @DreamTeamWangbowen in #3319
  • add hf_transfer to requirements.txt by @RonanKMcGovern in #3031
  • fix bias in if, ambiguous by @hliuca in #3259
  • [Minor Fix] Use cupy-cuda11x in CUDA 11.8 build by @chenxu2048 in #3256
  • Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. by @orsharir in #3350
  • Add batched RoPE kernel by @tterrysun in #3095
  • Fix lint by @Yard1 in #3388
  • [FIX] Simpler fix for async engine running on ray by @zhuohan123 in #3371
  • [Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion by @simon-mo in #3383
  • Allow user to choose which vLLM metrics to display in Grafana by @AllenDou in #3393
  • [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 by @youkaichao in #3389
  • Install flash_attn in Docker image by @tdoublep in #3396
  • Add args for mTLS support by @declark1 in #3410
  • [issue templates] add some issue templates by @youkaichao in #3412
  • Fix assertion failure in Qwen 1.5 with prefix caching enabled by @chenxu2048 in #3373
  • fix marlin config repr by @qeternity in #3414
  • Feature: dynamic shared mem moe_align_block_size_kernel by @akhoroshev in #3376
  • [Misc] add HOST_IP env var by @youkaichao in #3419
  • Add chat templates for Falcon by @Dinghow in #3420
  • Add chat templates for ChatGLM by @Dinghow in #3418
  • Fix dist.broadcast stall without group argument by @GindaChen in #3408
  • Fix tie_word_embeddings for Qwen2. by @fyabc in #3344
  • [Fix] Add args for mTLS support by @declark1 in #3430
  • Fixes the misuse/mixuse of time.time()/time.monotonic() by @sighingnow in #3220
  • [Misc] add error message in non linux platform by @youkaichao in #3438
  • Fix issue templates by @hmellor in #3436
  • fix document error for value and v_vec illustration by @laneeeee in #3421
  • Asynchronous tokenization by @Yard1 in #2879
  • Removed Extraneous Print Message From OAI Server by @robertgshaw2-neuralmagic in #3440
  • [Misc] PR templates by @youkaichao in #3413
  • Fixes the incorrect argument in the prefix-prefill test cases by @sighingnow in #3246
  • Replace lstrip() with removeprefix() to fix Ruff linter warning by @ronensc in #2958
  • Fix Baichuan chat template by @Dinghow in #3340
  • [Misc] fix line length for entire codebase by @simon-mo in #3444
  • Support arbitrary json_object in OpenAI and Context Free Grammar by @simon-mo in #3211
  • Fix setup.py neuron-ls issue by @simon-mo in #2671
  • [Misc] Define from_dict and to_dict in InputMetadata by @WoosukKwon in #3452
  • [CI] Shard tests for LoRA and Kernels to speed up by @simon-mo in #3445
  • [Bugfix] Make moe_align_block_size AMD-compatible by @WoosukKwon in #3470
  • CI: Add ROCm Docker Build by @simon-mo in #2886
  • [Testing] Add test_config.py to CI by @cadedaniel in #3437
  • [CI/Build] Fix Bad Import In Test by @robertgshaw2-neuralmagic in #3473
  • [Misc] Fix PR Template by @zhuohan123 in #3478
  • Cmake based build system by @bnellnm in #2830
  • [Core] Zero-copy asdict for InputMetadata by @Yard1 in #3475
  • [Misc] Update README for the Third vLLM Meetup by @zhuohan123 in #3479
  • [Core] Cache some utils by @Yard1 in #3474
  • [Core] print error before deadlock by @youkaichao in #3459
  • [Doc] Add docs about OpenAI compatible server by @simon-mo in #3288
  • [BugFix] Avoid initializing CUDA too early by @njhill in #3487
  • Update dockerfile with ModelScope support by @ifsheldon in #3429
  • [Doc] minor fix to neuron-installation.rst by @jimburtoft in #3505
  • Revert "[Core] Cache some utils" by @simon-mo in #3507
  • [Doc] minor fix of spelling in amd-installation.rst by @jimburtoft in #3506
  • Use lru_cache for some environment detection utils by @simon-mo in #3508
  • [PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled by @ElizaWszola in #3357
  • [Core] Add generic typing to LRUCache by @njhill in #3511
  • [Misc] Remove cache stream and cache events by @WoosukKwon in #3461
  • Abort when nvcc command is not found in the PATH by @AllenDou in #3527
  • Check for _is_cuda() in compute_num_jobs by @bnellnm in #3481
  • [Bugfix] Fix ROCm support in CMakeLists.txt by @jamestwhedbee in #3534
  • [1/n] Triton sampling kernel by @Yard1 in #3186
  • [1/n][Chunked Prefill] Refactor input query shapes by @rkooo567 in #3236
  • Migrate logits computation and gather to model_runner by @esmeetu in #3233
  • [BugFix] Hot fix in setup.py for neuron build by @zhuohan123 in #3537
  • [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor by @ElizaWszola in #3431
  • Fix 1D query issue from _prune_hidden_states by @rkooo567 in #3539
  • [🚀 Ready to be merged] Added support for Jais models by @grandiose-pizza in #3183
  • [Misc][Log] Add log for tokenizer length not equal to vocabulary size by @esmeetu in #3500
  • [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config by @WoosukKwon in #3551
  • [BugFix] gemma loading after quantization or LoRA. by @taeminlee in #3553
  • [Bugfix][Model] Fix Qwen2 by @esmeetu in #3554
  • [Hardware][Neuron] Refactor neuron support by @zhuohan123 in #3471
  • Some fixes for custom allreduce kernels by @hanzhi713 in #2760
  • Dynamic scheduler delay to improve ITL performance by @tdoublep in #3279
  • [Core] Improve detokenization performance for prefill by @Yard1 in #3469
  • [Bugfix] use SoftLockFile instead of LockFile by @kota-iizuka in #3578
  • [Misc] Fix BLOOM copyright notice by @WoosukKwon in #3591
  • [Misc] Bump transformers version by @ywang96 in #3592
  • [BugFix] Fix Falcon tied embeddings by @WoosukKwon in #3590
  • [BugFix] 1D query fix for MoE models by @njhill in #3597
  • [CI] typo fix: is_hip --> is_hip() by @youkaichao in #3595
  • [CI/Build] respect the common environment variable MAX_JOBS by @youkaichao in #3600
  • [CI/Build] fix flaky test by @youkaichao in #3602
  • [BugFix] minor fix: method typo in rotary_embedding.py file, get_device() -> device by @jikunshang in #3604
  • [Bugfix] Revert "[Bugfix] use SoftLockFile instead of LockFile (#3578)" by @WoosukKwon in #3599
  • [Model] Add starcoder2 awq support by @shaonianyr in #3569
  • [Core] Refactor Attention Take 2 by @WoosukKwon in #3462
  • [Bugfix] fix automatic prefix args and add log info by @gty111 in #3608
  • [CI] Try introducing isort. by @rkooo567 in #3495
  • [Core] Adding token ranks along with logprobs by @SwapnilDreams100 in #3516
  • feat: implement the min_tokens sampling parameter by @tjohnson31415 in #3124
  • [Bugfix] API stream returning two stops by @dylanwhawk in #3450
  • hotfix isort on logprobs ranks pr by @simon-mo in #3622
  • [Feature] Add vision language model support. by @xwjiang2010 in #3042
  • Optimize _get_ranks in Sampler by @Yard1 in #3623
  • [Misc] Include matched stop string/token in responses by @njhill in #2976
  • Enable more models to inference based on LoRA by @jeejeelee in #3382
  • [Bugfix] Fix ipv6 address parsing bug by @liiliiliil in #3641
  • [BugFix] Fix ipv4 address parsing regression by @njhill in #3645
  • [Kernel] support non-zero cuda devices in punica kernels by @jeejeelee in #3636
  • [Doc]add lora support by @jeejeelee in #3649
  • [Misc] Minor fix in KVCache type by @WoosukKwon in #3652
  • [Core] remove cupy dependency by @youkaichao in #3625
  • [Bugfix] More faithful implementation of Gemma by @WoosukKwon in #3653
  • [Bugfix] [Hotfix] fix nccl library name by @youkaichao in #3661
  • [Model] Add support for DBRX by @megha95 in #3660
  • [Misc] add the "download-dir" option to the latency/throughput benchmarks by @AmadeusChan in #3621
  • feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark by @ywang96 in #3277
  • Add support for Cohere's Command-R model by @zeppombal in #3433
  • [Docs] Add Command-R to supported models by @WoosukKwon in #3669
  • [Model] Fix and clean commandr by @esmeetu in #3671
  • [Model] Add support for xverse by @hxer7963 in #3610
  • [CI/Build] update default number of jobs and nvcc threads to avoid overloading the system by @youkaichao in #3675
  • [Kernel] Add Triton MoE kernel configs for DBRX + A100 by @WoosukKwon in #3679
  • [Core] [Bugfix] Refactor block manager subsystem for better testability by @cadedaniel in #3492
  • [Model] Add support for Qwen2MoeModel by @wenyujin333 in #3346
  • [Kernel] DBRX Triton MoE kernel H100 by @ywang96 in #3692
  • [2/N] Chunked prefill data update by @rkooo567 in #3538
  • [Bugfix] Update neuron_executor.py to add optional vision_language_config. by @adamrb in #3695
  • fix benchmark format reporting in buildkite by @simon-mo in #3693
  • [CI] Add test case to run examples scripts by @simon-mo in #3638
  • [Core] Support multi-node inference(eager and cuda graph) by @esmeetu in #3686
  • [Kernel] Add MoE Triton kernel configs for A100 40GB by @WoosukKwon in #3700
  • [Bugfix] Set enable_prefix_caching=True in prefix caching example by @WoosukKwon in #3703
  • fix logging msg for block manager by @simon-mo in #3701
  • [Core] fix del of communicator by @youkaichao in #3702
  • [Benchmark] Change mii to use persistent deployment and support tensor parallel by @IKACE in #3628
  • bump version to v0.4.0 by @simon-mo in #3705
  • Revert "bump version to v0.4.0" by @youkaichao in #3708
  • [Test] Make model tests run again and remove --forked from pytest by @rkooo567 in #3631
  • [Misc] Minor type annotation fix by @WoosukKwon in #3716
  • [Core][Test] move local_rank to the last arg with default value to keep api compatible by @youkaichao in #3711
  • add ccache to docker build image by @simon-mo in #3704
  • Usage Stats Collection by @yhu422 in #2852
  • [BugFix] Fix tokenizer out of vocab size by @esmeetu in #3685
  • [BugFix][Frontend] Fix completion logprobs=0 error by @esmeetu in #3731
  • [Bugfix] Command-R Max Model Length by @ywang96 in #3727
  • bump version to v0.4.0 by @simon-mo in #3712
  • [ROCm][Bugfix] Fixed several bugs related to rccl path and attention selector logic by @hongxiayang in #3699
  • usage lib get version another way by @simon-mo in #3735
  • [BugFix] Use consistent logger everywhere by @njhill in #3738
  • [Core][Bugfix] cache len of tokenizer by @youkaichao in #3741
  • Fix build when nvtools is missing by @bnellnm in #3698
  • CMake build elf without PTX by @simon-mo in #3739

New Contributors

Full Changelog: v0.3.3...v0.4.0
