## What's Changed
- [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label by @KuntaiDu in #5073
- [CI/Build] Disable LLaVA-NeXT CPU test by @DarkLight1337 in #5529
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue by @tlrmchlsmth in #5516
- [Misc] Fix arg names by @AllenDou in #5524
- [ Misc ] Rs/compressed tensors cleanup by @robertgshaw2-neuralmagic in #5432
- [Kernel] Suppress mma.sp warning on CUDA 12.5 and later by @tlrmchlsmth in #5401
- [mis] fix flaky test of test_cuda_device_count_stateless by @youkaichao in #5546
- [Core] Remove duplicate processing in async engine by @DarkLight1337 in #5525
- [misc][distributed] fix benign error in `is_in_the_same_node` by @youkaichao in #5512
- [Docs] Add ZhenFund as a Sponsor by @simon-mo in #5548
- [Doc] Update documentation on Tensorizer by @sangstar in #5471
- [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models by @tdoublep in #5460
- [Bugfix] Fix typo in Pallas backend by @WoosukKwon in #5558
- [Core][Distributed] improve p2p cache generation by @youkaichao in #5528
- Add ccache to amd by @simon-mo in #5555
- [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in #5364
- [mypy] Enable type checking for test directory by @DarkLight1337 in #5017
- [CI/Build] Test both text and token IDs in batched OpenAI Completions API by @DarkLight1337 in #5568
- [misc] Do not allow to use lora with chunked prefill. by @rkooo567 in #5538
- add gptq_marlin test for bug report #5088 by @alexm-neuralmagic in #5145
- [BugFix] Don't start a Ray cluster when not using Ray by @njhill in #5570
- [Fix] Correct OpenAI batch response format by @zifeitong in #5554
- Add basic correctness 2 GPU tests to 4 GPU pipeline by @Yard1 in #5518
- [CI][BugFix] Flip is_quant_method_supported condition by @mgoin in #5577
- [build][misc] limit numpy version by @youkaichao in #5582
- [Doc] add debugging tips for crash and multi-node debugging by @youkaichao in #5581
- Fix w8a8 benchmark and add Llama-3-8B by @comaniac in #5562
- [Model] Rename Phi3 rope scaling type by @garg-amit in #5595
- Correct alignment in the seq_len diagram. by @CharlesRiggins in #5592
- [Kernel] `compressed-tensors` marlin 24 support by @dsikka in #5435
- [Misc] use AutoTokenizer for benchmark serving when vLLM not installed by @zhyncs in #5588
- [Hardware][Intel GPU]Add Initial Intel GPU(XPU) inference backend by @jikunshang in #3814
- [CI/BUILD] Support non-AVX512 vLLM building and testing by @DamonFool in #5574
- [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard by @KuntaiDu in #5571
- [bugfix][distributed] fix 16 gpus local rank arrangement by @youkaichao in #5604
- [Optimization] use a pool to reuse LogicalTokenBlock.token_ids by @youkaichao in #5584
- [Bugfix] Fix KV head calculation for MPT models when using GQA by @bfontain in #5142
- [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py by @zifeitong in #5606
- [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier by @sroy745 in #5131
- [Model] Initialize Phi-3-vision support by @Isotr0py in #4986
- [Kernel] Add punica dimensions for Granite 13b by @joerunde in #5559
- [misc][typo] fix typo by @youkaichao in #5620
- [Misc] Fix typo by @DarkLight1337 in #5618
- [CI] Avoid naming different metrics with the same name in performance benchmark by @KuntaiDu in #5615
- [bugfix][distributed] do not error if two processes do not agree on p2p capability by @youkaichao in #5612
- [Misc] Remove import from transformers logging by @CatherineSue in #5625
- [CI/Build][Misc] Update Pytest Marker for VLMs by @ywang96 in #5623
- [ci] Deprecate original CI template by @khluu in #5624
- [Misc] Add OpenTelemetry support by @ronensc in #4687
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization by @dsikka in #5542
- [ci] Setup Release pipeline and build release wheels with cache by @khluu in #5610
- [Model] LoRA support added for command-r by @sergey-tinkoff in #5178
- [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties by @tdoublep in #5639
- [Doc] Added cerebrium as Integration option by @milo157 in #5553
- [Bugfix] Fix CUDA version check for mma warning suppression by @tlrmchlsmth in #5642
- [Bugfix] Fix w8a8 benchmarks for int8 case by @tlrmchlsmth in #5643
- [Bugfix] Fix Phi-3 Long RoPE scaling implementation by @ShukantPal in #5628
- [Bugfix] Added test for sampling repetition penalty bug. by @tdoublep in #5659
- [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices by @hongxiayang in #5641
- [misc][distributed] use localhost for single-node by @youkaichao in #5619
- [Model] Add FP8 kv cache for Qwen2 by @mgoin in #5656
- [Bugfix] Fix sampling_params passed incorrectly in Phi3v example by @Isotr0py in #5684
- [Misc]Add param max-model-len in benchmark_latency.py by @DearPlanet in #5629
- [CI/Build] Add tqdm to dependencies by @DarkLight1337 in #5680
- [ci] Add A100 queue into AWS CI template by @khluu in #5648
- [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py by @mgoin in #5688
- [ci][distributed] add tests for custom allreduce by @youkaichao in #5689
- [Bugfix] AsyncLLMEngine hangs with asyncio.run by @zifeitong in #5654
- [Doc] Update docker references by @rafvasq in #5614
- [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes by @dsikka in #5650
- [ci] Limit num gpus if specified for A100 by @khluu in #5694
- [Misc] Improve conftest by @DarkLight1337 in #5681
- [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors by @ywang96 in #5703
- [Kernel] Update Cutlass int8 kernel configs for SM90 by @varun-sundar-rabindranath in #5514
- [Model] Port over CLIPVisionModel for VLMs by @ywang96 in #5591
- [Kernel] Update Cutlass int8 kernel configs for SM80 by @varun-sundar-rabindranath in #5275
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels by @tlrmchlsmth in #5715
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names by @mgoin in #5718
- [distributed][misc] use fork by default for mp by @youkaichao in #5669
- [Model] MLPSpeculator speculative decoding support by @JRosenkranz in #4947
- [Kernel] Add punica dimension for Qwen2 LoRA by @jinzhen-lin in #5441
- [BugFix] Fix test_phi3v.py by @CatherineSue in #5725
- [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora by @jeejeelee in #5665
- [Core][Distributed] add shm broadcast by @youkaichao in #5399
- [Kernel][CPU] Add Quick `gelu` to CPU by @ywang96 in #5717
- [Doc] Documentation on supported hardware for quantization methods by @mgoin in #5745
- [BugFix] exclude version 1.15.0 for modelscope by @zhyncs in #5668
- [ci][test] fix ca test in main by @youkaichao in #5746
- [LoRA] Add support for pinning lora adapters in the LRU cache by @rohithkrn in #5603
- [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline by @jikunshang in #5616
- [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs by @DamonFool in #5710
- [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py by @zifeitong in #5756
- [Bugfix] Fix pin_lora error in TPU executor by @WoosukKwon in #5760
- [Docs][TPU] Add installation tip for TPU by @WoosukKwon in #5761
- [core][distributed] improve shared memory broadcast by @youkaichao in #5754
- [BugFix] [Kernel] Add Cutlass2x fallback kernels by @varun-sundar-rabindranath in #5744
- [Distributed] Add send and recv helpers by @andoorve in #5719
- [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement by @Isotr0py in #5772
- [doc][faq] add warning to download models for every nodes by @youkaichao in #5783
- [Doc] Add "Suggest edit" button to doc pages by @mgoin in #5789
- [Doc] Add Phi-3-medium to list of supported models by @mgoin in #5788
- [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args by @CatherineSue in #5795
- [ci] Remove aws template by @khluu in #5757
- [Doc] Add notice about breaking changes to VLMs by @DarkLight1337 in #5818
- [Speculative Decoding] Support draft model on different tensor-parallel size than target model by @wooyeonlee0 in #5414
- [Misc] Remove useless code in cpu_worker by @DamonFool in #5824
- [Core] Add fault tolerance for `RayTokenizerGroupPool` by @Yard1 in #5748
- [doc][distributed] add both gloo and nccl tests by @youkaichao in #5834
- [CI/Build] Add unit testing for FlexibleArgumentParser by @mgoin in #5798
- [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` by @dsikka in #5794
- [Hardware][TPU] Refactor TPU backend by @WoosukKwon in #5831
- [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes by @mawong-amd in #5422
- [Hardware][TPU] Raise errors for unsupported sampling params by @WoosukKwon in #5850
- [CI/Build] Add E2E tests for MLPSpeculator by @tdoublep in #5791
- [Bugfix] Fix assertion in NeuronExecutor by @aws-patlange in #5841
- [Core] Refactor Worker and ModelRunner to consolidate control plane communication by @stephanie-wang in #5408
- [Misc][Doc] Add Example of using OpenAI Server with VLM by @ywang96 in #5832
- [bugfix][distributed] fix shm broadcast when the queue size is full by @youkaichao in #5801
- [Bugfix] Fix embedding to support 2D inputs by @WoosukKwon in #5829
- [Bugfix][TPU] Fix KV cache size calculation by @WoosukKwon in #5860
- [CI/Build] Refactor image test assets by @DarkLight1337 in #5821
- [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` by @ProExpertProg in #5560
- [Frontend] Add tokenize/detokenize endpoints by @sasha0552 in #5054
- [Hardware][TPU] Support parallel sampling & Swapping by @WoosukKwon in #5855
- [Bugfix][TPU] Fix CPU cache allocation by @WoosukKwon in #5869
- Support CPU inference with VSX PowerPC ISA by @ChipKerchner in #5652
- [doc] update usage of env var to avoid conflict by @youkaichao in #5873
- [Misc] Add example for LLaVA-NeXT by @ywang96 in #5879
- [BugFix] Fix cuda graph for MLPSpeculator by @njhill in #5875
- [Doc] Add note about context length in Phi-3-Vision example by @DarkLight1337 in #5887
- [VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly by @xwjiang2010 in #5880
- [Model] Add base class for LoRA-supported models by @DarkLight1337 in #5018
- [Bugfix] Fix img_sizes Parsing in Phi3-Vision by @ywang96 in #5888
- [CI/Build] [1/3] Reorganize entrypoints tests by @DarkLight1337 in #5526
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision by @DarkLight1337 in #5896
- [doc][misc] add note for Kubernetes users by @youkaichao in #5916
- [BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` by @njhill in #5876
- [BugFix] Fix `min_tokens` behaviour for multiple eos tokens by @njhill in #5849
- [CI/Build] Fix Args for `_get_logits_warper` in Sampler Test by @ywang96 in #5922
- [Model] Add Gemma 2 by @WoosukKwon in #5908 (see the usage sketch after this list)
- [core][misc] remove logical block by @youkaichao in #5882
- [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X by @divakar-amd in #5932
- [Hardware][TPU] Optimize KV cache swapping by @WoosukKwon in #5878
- [VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. by @xwjiang2010 in #5905
- [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner by @Isotr0py in #5956
- [Core] Registry for processing model inputs by @DarkLight1337 in #5214
- Unmark fused_moe config json file as executable by @tlrmchlsmth in #5960
- [Hardware][Intel] OpenVINO vLLM backend by @ilya-lavrenov in #5379
- [Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high by @tdoublep in #5894
- [CI/Build] [2/3] Reorganize entrypoints tests by @DarkLight1337 in #5904
- [Distributed] Make it clear that % should not be in tensor dict keys. by @xwjiang2010 in #5927
- [Spec Decode] Introduce DraftModelRunner by @comaniac in #5799
- [Bugfix] Fix compute datatype for cutlass 3.x epilogues by @tlrmchlsmth in #5931
- [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) by @robertgshaw2-neuralmagic in #5928
- [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 by @robertgshaw2-neuralmagic in #5921
- Support Deepseek-V2 by @zwd003 in #4650
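
As a quick way to exercise one of the headline additions above, here is a minimal sketch of loading a Gemma 2 model (added in #5908) through vLLM's offline `LLM` API. The model ID is an assumption for illustration and is not taken from the changelog; any Gemma 2 checkpoint supported by this release should work the same way.

```python
from vllm import LLM, SamplingParams

# Minimal sketch of trying the newly added Gemma 2 support (#5908).
# Assumption: "google/gemma-2-9b-it" stands in for any supported Gemma 2 checkpoint.
llm = LLM(model="google/gemma-2-9b-it")
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# Generate a single completion and print the text of the first output.
outputs = llm.generate(["The three laws of robotics are"], sampling_params)
print(outputs[0].outputs[0].text)
```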
## New Contributors
- @garg-amit made their first contribution in #5595
- @CharlesRiggins made their first contribution in #5592
- @bfontain made their first contribution in #5142
- @sergey-tinkoff made their first contribution in #5178
- @milo157 made their first contribution in #5553
- @ShukantPal made their first contribution in #5628
- @rafvasq made their first contribution in #5614
- @JRosenkranz made their first contribution in #4947
- @rohithkrn made their first contribution in #5603
- @wooyeonlee0 made their first contribution in #5414
- @aws-patlange made their first contribution in #5841
- @stephanie-wang made their first contribution in #5408
- @ProExpertProg made their first contribution in #5560
- @ChipKerchner made their first contribution in #5652
- @ilya-lavrenov made their first contribution in #5379
Full Changelog: v0.5.0.post1...vtest