## What's Changed
- [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label by @KuntaiDu in #5073
- [CI/Build] Disable LLaVA-NeXT CPU test by @DarkLight1337 in #5529
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue by @tlrmchlsmth in #5516
- [Misc] Fix arg names by @AllenDou in #5524
- [ Misc ] Rs/compressed tensors cleanup by @robertgshaw2-neuralmagic in #5432
- [Kernel] Suppress mma.sp warning on CUDA 12.5 and later by @tlrmchlsmth in #5401
- [mis] fix flaky test of test_cuda_device_count_stateless by @youkaichao in #5546
- [Core] Remove duplicate processing in async engine by @DarkLight1337 in #5525
- [misc][distributed] fix benign error in `is_in_the_same_node` by @youkaichao in #5512
- [Docs] Add ZhenFund as a Sponsor by @simon-mo in #5548
- [Doc] Update documentation on Tensorizer by @sangstar in #5471
- [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models by @tdoublep in #5460
- [Bugfix] Fix typo in Pallas backend by @WoosukKwon in #5558
- [Core][Distributed] improve p2p cache generation by @youkaichao in #5528
- Add ccache to amd by @simon-mo in #5555
- [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in #5364
- [mypy] Enable type checking for test directory by @DarkLight1337 in #5017
- [CI/Build] Test both text and token IDs in batched OpenAI Completions API by @DarkLight1337 in #5568
- [misc] Do not allow to use lora with chunked prefill. by @rkooo567 in #5538
- add gptq_marlin test for bug report #5088 by @alexm-neuralmagic in #5145
- [BugFix] Don't start a Ray cluster when not using Ray by @njhill in #5570
- [Fix] Correct OpenAI batch response format by @zifeitong in #5554
- Add basic correctness 2 GPU tests to 4 GPU pipeline by @Yard1 in #5518
- [CI][BugFix] Flip is_quant_method_supported condition by @mgoin in #5577
- [build][misc] limit numpy version by @youkaichao in #5582
- [Doc] add debugging tips for crash and multi-node debugging by @youkaichao in #5581
- Fix w8a8 benchmark and add Llama-3-8B by @comaniac in #5562
- [Model] Rename Phi3 rope scaling type by @garg-amit in #5595
- Correct alignment in the seq_len diagram. by @CharlesRiggins in #5592
- [Kernel] `compressed-tensors` marlin 24 support by @dsikka in #5435
- [Misc] use AutoTokenizer for benchmark serving when vLLM not installed by @zhyncs in #5588
- [Hardware][Intel GPU]Add Initial Intel GPU(XPU) inference backend by @jikunshang in #3814
- [CI/BUILD] Support non-AVX512 vLLM building and testing by @DamonFool in #5574
- [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard by @KuntaiDu in #5571
- [bugfix][distributed] fix 16 gpus local rank arrangement by @youkaichao in #5604
- [Optimization] use a pool to reuse LogicalTokenBlock.token_ids by @youkaichao in #5584
- [Bugfix] Fix KV head calculation for MPT models when using GQA by @bfontain in #5142
- [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py by @zifeitong in #5606
- [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier by @sroy745 in #5131
- [Model] Initialize Phi-3-vision support by @Isotr0py in #4986
- [Kernel] Add punica dimensions for Granite 13b by @joerunde in #5559
- [misc][typo] fix typo by @youkaichao in #5620
- [Misc] Fix typo by @DarkLight1337 in #5618
- [CI] Avoid naming different metrics with the same name in performance benchmark by @KuntaiDu in #5615
- [bugfix][distributed] do not error if two processes do not agree on p2p capability by @youkaichao in #5612
- [Misc] Remove import from transformers logging by @CatherineSue in #5625
- [CI/Build][Misc] Update Pytest Marker for VLMs by @ywang96 in #5623
- [ci] Deprecate original CI template by @khluu in #5624
- [Misc] Add OpenTelemetry support by @ronensc in #4687
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization by @dsikka in #5542
- [ci] Setup Release pipeline and build release wheels with cache by @khluu in #5610
- [Model] LoRA support added for command-r by @sergey-tinkoff in #5178
- [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties by @tdoublep in #5639
- [Doc] Added cerebrium as Integration option by @milo157 in #5553
- [Bugfix] Fix CUDA version check for mma warning suppression by @tlrmchlsmth in #5642
- [Bugfix] Fix w8a8 benchmarks for int8 case by @tlrmchlsmth in #5643
- [Bugfix] Fix Phi-3 Long RoPE scaling implementation by @ShukantPal in #5628
- [Bugfix] Added test for sampling repetition penalty bug. by @tdoublep in #5659
- [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices by @hongxiayang in #5641
- [misc][distributed] use localhost for single-node by @youkaichao in #5619
- [Model] Add FP8 kv cache for Qwen2 by @mgoin in #5656
- [Bugfix] Fix sampling_params passed incorrectly in Phi3v example by @Isotr0py in #5684
- [Misc]Add param max-model-len in benchmark_latency.py by @DearPlanet in #5629
- [CI/Build] Add tqdm to dependencies by @DarkLight1337 in #5680
- [ci] Add A100 queue into AWS CI template by @khluu in #5648
- [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py by @mgoin in #5688
- [ci][distributed] add tests for custom allreduce by @youkaichao in #5689
- [Bugfix] AsyncLLMEngine hangs with asyncio.run by @zifeitong in #5654
- [Doc] Update docker references by @rafvasq in #5614
- [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes by @dsikka in #5650
- [ci] Limit num gpus if specified for A100 by @khluu in #5694
- [Misc] Improve conftest by @DarkLight1337 in #5681
- [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors by @ywang96 in #5703
- [Kernel] Update Cutlass int8 kernel configs for SM90 by @varun-sundar-rabindranath in #5514
- [Model] Port over CLIPVisionModel for VLMs by @ywang96 in #5591
- [Kernel] Update Cutlass int8 kernel configs for SM80 by @varun-sundar-rabindranath in #5275
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels by @tlrmchlsmth in #5715
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names by @mgoin in #5718
- [distributed][misc] use fork by default for mp by @youkaichao in #5669
- [Model] MLPSpeculator speculative decoding support by @JRosenkranz in #4947
- [Kernel] Add punica dimension for Qwen2 LoRA by @jinzhen-lin in #5441
- [BugFix] Fix test_phi3v.py by @CatherineSue in #5725
- [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora by @jeejeelee in #5665
- [Core][Distributed] add shm broadcast by @youkaichao in #5399
- [Kernel][CPU] Add Quick `gelu` to CPU by @ywang96 in #5717
- [Doc] Documentation on supported hardware for quantization methods by @mgoin in #5745
- [BugFix] exclude version 1.15.0 for modelscope by @zhyncs in #5668
- [ci][test] fix ca test in main by @youkaichao in #5746
- [LoRA] Add support for pinning lora adapters in the LRU cache by @rohithkrn in #5603
- [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline by @jikunshang in #5616
- [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs by @DamonFool in #5710
- [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py by @zifeitong in #5756
- [Bugfix] Fix pin_lora error in TPU executor by @WoosukKwon in #5760
- [Docs][TPU] Add installation tip for TPU by @WoosukKwon in #5761
- [core][distributed] improve shared memory broadcast by @youkaichao in #5754
- [BugFix] [Kernel] Add Cutlass2x fallback kernels by @varun-sundar-rabindranath in #5744
- [Distributed] Add send and recv helpers by @andoorve in #5719
- [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement by @Isotr0py in #5772
- [doc][faq] add warning to download models for every nodes by @youkaichao in #5783
- [Doc] Add "Suggest edit" button to doc pages by @mgoin in #5789
- [Doc] Add Phi-3-medium to list of supported models by @mgoin in #5788
- [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args by @CatherineSue in #5795
- [ci] Remove aws template by @khluu in #5757
- [Doc] Add notice about breaking changes to VLMs by @DarkLight1337 in #5818
- [Speculative Decoding] Support draft model on different tensor-parallel size than target model by @wooyeonlee0 in #5414
- [Misc] Remove useless code in cpu_worker by @DamonFool in #5824
- [Core] Add fault tolerance for `RayTokenizerGroupPool` by @Yard1 in #5748
- [doc][distributed] add both gloo and nccl tests by @youkaichao in #5834
- [CI/Build] Add unit testing for FlexibleArgumentParser by @mgoin in #5798
- [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` by @dsikka in #5794
- [Hardware][TPU] Refactor TPU backend by @WoosukKwon in #5831
- [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes by @mawong-amd in #5422
- [Hardware][TPU] Raise errors for unsupported sampling params by @WoosukKwon in #5850
- [CI/Build] Add E2E tests for MLPSpeculator by @tdoublep in #5791
- [Bugfix] Fix assertion in NeuronExecutor by @aws-patlange in #5841
- [Core] Refactor Worker and ModelRunner to consolidate control plane communication by @stephanie-wang in #5408
- [Misc][Doc] Add Example of using OpenAI Server with VLM by @ywang96 in #5832
- [bugfix][distributed] fix shm broadcast when the queue size is full by @youkaichao in #5801
- [Bugfix] Fix embedding to support 2D inputs by @WoosukKwon in #5829
- [Bugfix][TPU] Fix KV cache size calculation by @WoosukKwon in #5860
- [CI/Build] Refactor image test assets by @DarkLight1337 in #5821
- [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` by @ProExpertProg in #5560
- [Frontend] Add tokenize/detokenize endpoints by @sasha0552 in #5054
- [Hardware][TPU] Support parallel sampling & Swapping by @WoosukKwon in #5855
- [Bugfix][TPU] Fix CPU cache allocation by @WoosukKwon in #5869
- Support CPU inference with VSX PowerPC ISA by @ChipKerchner in #5652
- [doc] update usage of env var to avoid conflict by @youkaichao in #5873
- [Misc] Add example for LLaVA-NeXT by @ywang96 in #5879
- [BugFix] Fix cuda graph for MLPSpeculator by @njhill in #5875
- [Doc] Add note about context length in Phi-3-Vision example by @DarkLight1337 in #5887
- [VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly by @xwjiang2010 in #5880
- [Model] Add base class for LoRA-supported models by @DarkLight1337 in #5018
- [Bugfix] Fix img_sizes Parsing in Phi3-Vision by @ywang96 in #5888
- [CI/Build] [1/3] Reorganize entrypoints tests by @DarkLight1337 in #5526
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision by @DarkLight1337 in #5896
- [doc][misc] add note for Kubernetes users by @youkaichao in #5916
- [BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` by @njhill in #5876
- [BugFix] Fix `min_tokens` behaviour for multiple eos tokens by @njhill in #5849
- [CI/Build] Fix Args for `_get_logits_warper` in Sampler Test by @ywang96 in #5922
- [Model] Add Gemma 2 by @WoosukKwon in #5908 (see the usage sketch after this list)
- [core][misc] remove logical block by @youkaichao in #5882
- [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X by @divakar-amd in #5932
- [Hardware][TPU] Optimize KV cache swapping by @WoosukKwon in #5878
- [VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. by @xwjiang2010 in #5905
- [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner by @Isotr0py in #5956
- [Core] Registry for processing model inputs by @DarkLight1337 in #5214
- Unmark fused_moe config json file as executable by @tlrmchlsmth in #5960
- [Hardware][Intel] OpenVINO vLLM backend by @ilya-lavrenov in #5379
- [Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high by @tdoublep in #5894
- [CI/Build] [2/3] Reorganize entrypoints tests by @DarkLight1337 in #5904
- [Distributed] Make it clear that % should not be in tensor dict keys. by @xwjiang2010 in #5927
- [Spec Decode] Introduce DraftModelRunner by @comaniac in #5799
- [Bugfix] Fix compute datatype for cutlass 3.x epilogues by @tlrmchlsmth in #5931
- [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) by @robertgshaw2-neuralmagic in #5928
- [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 by @robertgshaw2-neuralmagic in #5921
- Support Deepseek-V2 by @zwd003 in #4650
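
As a quick way to exercise one of the headline additions above, here is a minimal sketch of loading a Gemma 2 model (added in #5908) through vLLM's offline `LLM` API. The model ID is an assumption for illustration and is not taken from the changelog; any Gemma 2 checkpoint supported by this release should work the same way.

```python
from vllm import LLM, SamplingParams

# Minimal sketch of trying the newly added Gemma 2 support (#5908).
# Assumption: "google/gemma-2-9b-it" stands in for any supported Gemma 2 checkpoint.
llm = LLM(model="google/gemma-2-9b-it")
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# Generate a single completion and print the text of the first output.
outputs = llm.generate(["The three laws of robotics are"], sampling_params)
print(outputs[0].outputs[0].text)
```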
## New Contributors
- @garg-amit made their first contribution in #5595
- @CharlesRiggins made their first contribution in #5592
- @bfontain made their first contribution in #5142
- @sergey-tinkoff made their first contribution in #5178
- @milo157 made their first contribution in #5553
- @ShukantPal made their first contribution in #5628
- @rafvasq made their first contribution in #5614
- @JRosenkranz made their first contribution in #4947
- @rohithkrn made their first contribution in #5603
- @wooyeonlee0 made their first contribution in #5414
- @aws-patlange made their first contribution in #5841
- @stephanie-wang made their first contribution in #5408
- @ProExpertProg made their first contribution in #5560
- @ChipKerchner made their first contribution in #5652
- @ilya-lavrenov made their first contribution in #5379
Full Changelog: v0.5.0.post1...vtest