vllm-project/vllm · vtest

Pre-release · 4 months ago

What's Changed

  • [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with perf-benchmarks label by @KuntaiDu in #5073
  • [CI/Build] Disable LLaVA-NeXT CPU test by @DarkLight1337 in #5529
  • [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue by @tlrmchlsmth in #5516
  • [Misc] Fix arg names by @AllenDou in #5524
  • [ Misc ] Rs/compressed tensors cleanup by @robertgshaw2-neuralmagic in #5432
  • [Kernel] Suppress mma.sp warning on CUDA 12.5 and later by @tlrmchlsmth in #5401
  • [Misc] fix flaky test of test_cuda_device_count_stateless by @youkaichao in #5546
  • [Core] Remove duplicate processing in async engine by @DarkLight1337 in #5525
  • [misc][distributed] fix benign error in is_in_the_same_node by @youkaichao in #5512
  • [Docs] Add ZhenFund as a Sponsor by @simon-mo in #5548
  • [Doc] Update documentation on Tensorizer by @sangstar in #5471
  • [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models by @tdoublep in #5460
  • [Bugfix] Fix typo in Pallas backend by @WoosukKwon in #5558
  • [Core][Distributed] improve p2p cache generation by @youkaichao in #5528
  • Add ccache to amd by @simon-mo in #5555
  • [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in #5364
  • [mypy] Enable type checking for test directory by @DarkLight1337 in #5017
  • [CI/Build] Test both text and token IDs in batched OpenAI Completions API by @DarkLight1337 in #5568
  • [misc] Do not allow to use lora with chunked prefill. by @rkooo567 in #5538
  • add gptq_marlin test for bug report #5088 by @alexm-neuralmagic in #5145
  • [BugFix] Don't start a Ray cluster when not using Ray by @njhill in #5570
  • [Fix] Correct OpenAI batch response format by @zifeitong in #5554
  • Add basic correctness 2 GPU tests to 4 GPU pipeline by @Yard1 in #5518
  • [CI][BugFix] Flip is_quant_method_supported condition by @mgoin in #5577
  • [build][misc] limit numpy version by @youkaichao in #5582
  • [Doc] add debugging tips for crash and multi-node debugging by @youkaichao in #5581
  • Fix w8a8 benchmark and add Llama-3-8B by @comaniac in #5562
  • [Model] Rename Phi3 rope scaling type by @garg-amit in #5595
  • Correct alignment in the seq_len diagram. by @CharlesRiggins in #5592
  • [Kernel] compressed-tensors marlin 24 support by @dsikka in #5435
  • [Misc] use AutoTokenizer for benchmark serving when vLLM not installed by @zhyncs in #5588
  • [Hardware][Intel GPU]Add Initial Intel GPU(XPU) inference backend by @jikunshang in #3814
  • [CI/BUILD] Support non-AVX512 vLLM building and testing by @DamonFool in #5574
  • [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard by @KuntaiDu in #5571
  • [bugfix][distributed] fix 16 gpus local rank arrangement by @youkaichao in #5604
  • [Optimization] use a pool to reuse LogicalTokenBlock.token_ids by @youkaichao in #5584
  • [Bugfix] Fix KV head calculation for MPT models when using GQA by @bfontain in #5142
  • [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py by @zifeitong in #5606
  • [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier by @sroy745 in #5131
  • [Model] Initialize Phi-3-vision support by @Isotr0py in #4986
  • [Kernel] Add punica dimensions for Granite 13b by @joerunde in #5559
  • [misc][typo] fix typo by @youkaichao in #5620
  • [Misc] Fix typo by @DarkLight1337 in #5618
  • [CI] Avoid naming different metrics with the same name in performance benchmark by @KuntaiDu in #5615
  • [bugfix][distributed] do not error if two processes do not agree on p2p capability by @youkaichao in #5612
  • [Misc] Remove import from transformers logging by @CatherineSue in #5625
  • [CI/Build][Misc] Update Pytest Marker for VLMs by @ywang96 in #5623
  • [ci] Deprecate original CI template by @khluu in #5624
  • [Misc] Add OpenTelemetry support by @ronensc in #4687
  • [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization by @dsikka in #5542
  • [ci] Setup Release pipeline and build release wheels with cache by @khluu in #5610
  • [Model] LoRA support added for command-r by @sergey-tinkoff in #5178
  • [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties by @tdoublep in #5639
  • [Doc] Added cerebrium as Integration option by @milo157 in #5553
  • [Bugfix] Fix CUDA version check for mma warning suppression by @tlrmchlsmth in #5642
  • [Bugfix] Fix w8a8 benchmarks for int8 case by @tlrmchlsmth in #5643
  • [Bugfix] Fix Phi-3 Long RoPE scaling implementation by @ShukantPal in #5628
  • [Bugfix] Added test for sampling repetition penalty bug. by @tdoublep in #5659
  • [Bugfix][CI/Build][AMD][ROCm] Fixed the cmake build bug which generates garbage on certain devices by @hongxiayang in #5641
  • [misc][distributed] use localhost for single-node by @youkaichao in #5619
  • [Model] Add FP8 kv cache for Qwen2 by @mgoin in #5656
  • [Bugfix] Fix sampling_params passed incorrectly in Phi3v example by @Isotr0py in #5684
  • [Misc]Add param max-model-len in benchmark_latency.py by @DearPlanet in #5629
  • [CI/Build] Add tqdm to dependencies by @DarkLight1337 in #5680
  • [ci] Add A100 queue into AWS CI template by @khluu in #5648
  • [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py by @mgoin in #5688
  • [ci][distributed] add tests for custom allreduce by @youkaichao in #5689
  • [Bugfix] AsyncLLMEngine hangs with asyncio.run by @zifeitong in #5654
  • [Doc] Update docker references by @rafvasq in #5614
  • [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes by @dsikka in #5650
  • [ci] Limit num gpus if specified for A100 by @khluu in #5694
  • [Misc] Improve conftest by @DarkLight1337 in #5681
  • [Bugfix][Doc] Fix Duplicate Explicit Target Name Errors by @ywang96 in #5703
  • [Kernel] Update Cutlass int8 kernel configs for SM90 by @varun-sundar-rabindranath in #5514
  • [Model] Port over CLIPVisionModel for VLMs by @ywang96 in #5591
  • [Kernel] Update Cutlass int8 kernel configs for SM80 by @varun-sundar-rabindranath in #5275
  • [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels by @tlrmchlsmth in #5715
  • [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names by @mgoin in #5718 (see the sketch after this list)
  • [distributed][misc] use fork by default for mp by @youkaichao in #5669
  • [Model] MLPSpeculator speculative decoding support by @JRosenkranz in #4947
  • [Kernel] Add punica dimension for Qwen2 LoRA by @jinzhen-lin in #5441
  • [BugFix] Fix test_phi3v.py by @CatherineSue in #5725
  • [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora by @jeejeelee in #5665
  • [Core][Distributed] add shm broadcast by @youkaichao in #5399
  • [Kernel][CPU] Add Quick gelu to CPU by @ywang96 in #5717
  • [Doc] Documentation on supported hardware for quantization methods by @mgoin in #5745
  • [BugFix] exclude version 1.15.0 for modelscope by @zhyncs in #5668
  • [ci][test] fix ca test in main by @youkaichao in #5746
  • [LoRA] Add support for pinning lora adapters in the LRU cache by @rohithkrn in #5603
  • [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline by @jikunshang in #5616
  • [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs by @DamonFool in #5710
  • [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py by @zifeitong in #5756
  • [Bugfix] Fix pin_lora error in TPU executor by @WoosukKwon in #5760
  • [Docs][TPU] Add installation tip for TPU by @WoosukKwon in #5761
  • [core][distributed] improve shared memory broadcast by @youkaichao in #5754
  • [BugFix] [Kernel] Add Cutlass2x fallback kernels by @varun-sundar-rabindranath in #5744
  • [Distributed] Add send and recv helpers by @andoorve in #5719
  • [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement by @Isotr0py in #5772
  • [doc][faq] add warning to download models for every node by @youkaichao in #5783
  • [Doc] Add "Suggest edit" button to doc pages by @mgoin in #5789
  • [Doc] Add Phi-3-medium to list of supported models by @mgoin in #5788
  • [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args by @CatherineSue in #5795
  • [ci] Remove aws template by @khluu in #5757
  • [Doc] Add notice about breaking changes to VLMs by @DarkLight1337 in #5818
  • [Speculative Decoding] Support draft model on different tensor-parallel size than target model by @wooyeonlee0 in #5414
  • [Misc] Remove useless code in cpu_worker by @DamonFool in #5824
  • [Core] Add fault tolerance for RayTokenizerGroupPool by @Yard1 in #5748
  • [doc][distributed] add both gloo and nccl tests by @youkaichao in #5834
  • [CI/Build] Add unit testing for FlexibleArgumentParser by @mgoin in #5798
  • [Misc] Update w4a16 compressed-tensors support to include w8a16 by @dsikka in #5794
  • [Hardware][TPU] Refactor TPU backend by @WoosukKwon in #5831
  • [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes by @mawong-amd in #5422
  • [Hardware][TPU] Raise errors for unsupported sampling params by @WoosukKwon in #5850
  • [CI/Build] Add E2E tests for MLPSpeculator by @tdoublep in #5791
  • [Bugfix] Fix assertion in NeuronExecutor by @aws-patlange in #5841
  • [Core] Refactor Worker and ModelRunner to consolidate control plane communication by @stephanie-wang in #5408
  • [Misc][Doc] Add Example of using OpenAI Server with VLM by @ywang96 in #5832
  • [bugfix][distributed] fix shm broadcast when the queue size is full by @youkaichao in #5801
  • [Bugfix] Fix embedding to support 2D inputs by @WoosukKwon in #5829
  • [Bugfix][TPU] Fix KV cache size calculation by @WoosukKwon in #5860
  • [CI/Build] Refactor image test assets by @DarkLight1337 in #5821
  • [Kernel] Adding bias epilogue support for cutlass_scaled_mm by @ProExpertProg in #5560
  • [Frontend] Add tokenize/detokenize endpoints by @sasha0552 in #5054 (see the sketch after this list)
  • [Hardware][TPU] Support parallel sampling & Swapping by @WoosukKwon in #5855
  • [Bugfix][TPU] Fix CPU cache allocation by @WoosukKwon in #5869
  • Support CPU inference with VSX PowerPC ISA by @ChipKerchner in #5652
  • [doc] update usage of env var to avoid conflict by @youkaichao in #5873
  • [Misc] Add example for LLaVA-NeXT by @ywang96 in #5879
  • [BugFix] Fix cuda graph for MLPSpeculator by @njhill in #5875
  • [Doc] Add note about context length in Phi-3-Vision example by @DarkLight1337 in #5887
  • [VLM][Bugfix] Make sure that multi_modal_kwargs is broadcasted properly by @xwjiang2010 in #5880
  • [Model] Add base class for LoRA-supported models by @DarkLight1337 in #5018
  • [Bugfix] Fix img_sizes Parsing in Phi3-Vision by @ywang96 in #5888
  • [CI/Build] [1/3] Reorganize entrypoints tests by @DarkLight1337 in #5526
  • [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision by @DarkLight1337 in #5896
  • [doc][misc] add note for Kubernetes users by @youkaichao in #5916
  • [BugFix] Fix MLPSpeculator handling of num_speculative_tokens by @njhill in #5876
  • [BugFix] Fix min_tokens behaviour for multiple eos tokens by @njhill in #5849
  • [CI/Build] Fix Args for _get_logits_warper in Sampler Test by @ywang96 in #5922
  • [Model] Add Gemma 2 by @WoosukKwon in #5908
  • [core][misc] remove logical block by @youkaichao in #5882
  • [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X by @divakar-amd in #5932
  • [Hardware][TPU] Optimize KV cache swapping by @WoosukKwon in #5878
  • [VLM][BugFix] Make sure that multi_modal_kwargs can broadcast properly with ring buffer. by @xwjiang2010 in #5905
  • [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner by @Isotr0py in #5956
  • [Core] Registry for processing model inputs by @DarkLight1337 in #5214
  • Unmark fused_moe config json file as executable by @tlrmchlsmth in #5960
  • [Hardware][Intel] OpenVINO vLLM backend by @ilya-lavrenov in #5379
  • [Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high by @tdoublep in #5894
  • [CI/Build] [2/3] Reorganize entrypoints tests by @DarkLight1337 in #5904
  • [Distributed] Make it clear that % should not be in tensor dict keys. by @xwjiang2010 in #5927
  • [Spec Decode] Introduce DraftModelRunner by @comaniac in #5799
  • [Bugfix] Fix compute datatype for cutlass 3.x epilogues by @tlrmchlsmth in #5931
  • [ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) by @robertgshaw2-neuralmagic in #5928
  • [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 by @robertgshaw2-neuralmagic in #5921
  • Support Deepseek-V2 by @zwd003 in #4650
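
Two of the changes above lend themselves to short illustrations. First, #5718 adds a FlexibleArgumentParser so CLI flags accept both underscore and dash spellings. The snippet below is a minimal sketch of that idea in plain argparse, not vLLM's actual implementation; the class name and normalization details are assumptions made here for illustration.

```python
# Minimal sketch (not vLLM's implementation) of the idea behind #5718:
# accept both "--max_model_len" and "--max-model-len" by normalizing
# underscores to dashes in flag names before parsing.
import argparse
import sys
from typing import List, Optional


class UnderscoreDashParser(argparse.ArgumentParser):
    """Hypothetical parser that treats '_' and '-' in flag names as equivalent."""

    def parse_args(self, args: Optional[List[str]] = None, namespace=None):
        if args is None:
            args = sys.argv[1:]
        normalized = []
        for arg in args:
            if arg.startswith("--"):
                # Normalize only the flag name; keep any "=value" part intact.
                name, sep, value = arg.partition("=")
                normalized.append(name.replace("_", "-") + sep + value)
            else:
                normalized.append(arg)
        return super().parse_args(normalized, namespace)


if __name__ == "__main__":
    parser = UnderscoreDashParser()
    parser.add_argument("--max-model-len", type=int, default=None)
    # Both spellings resolve to the same destination (args.max_model_len).
    print(parser.parse_args(["--max_model_len=4096"]))
    print(parser.parse_args(["--max-model-len", "2048"]))
```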
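
Second, #5054 adds tokenize/detokenize endpoints to the OpenAI-compatible server. The example below assumes a server running locally and a served model name; the exact request and response field names may differ between vLLM versions.

```python
# Hedged sketch of calling the tokenize/detokenize endpoints added in #5054.
# Assumes a vLLM OpenAI-compatible server on localhost:8000; the field names
# ("prompt", "tokens") are assumptions and may vary by version.
import requests

BASE_URL = "http://localhost:8000"  # assumed server address
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed served model name

# Ask the server to tokenize a prompt.
tokenize_resp = requests.post(
    f"{BASE_URL}/tokenize",
    json={"model": MODEL, "prompt": "Hello, world!"},
)
tokenize_resp.raise_for_status()
tokens = tokenize_resp.json().get("tokens", [])
print("token ids:", tokens)

# Round-trip the token ids back into text.
detokenize_resp = requests.post(
    f"{BASE_URL}/detokenize",
    json={"model": MODEL, "tokens": tokens},
)
detokenize_resp.raise_for_status()
print("text:", detokenize_resp.json().get("prompt"))
```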

New Contributors

Full Changelog: v0.5.0.post1...vtest
