vllm v0.4.3


Highlights

Model Support

LLM

  • Added support for Falcon-11B (#5069)
  • Added support for IBM Granite Code models (#4636)
  • Added blocksparse flash attention kernel and Phi-3-Small model (#4799)
  • Added Snowflake Arctic model implementation (#4652, #4889, #4690)
  • Supported dynamic RoPE scaling (#4638; example below)
  • Supported long-context LoRA (#4787)
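
Dynamic RoPE scaling is configured at engine construction time. A minimal sketch, assuming the rope_scaling engine argument from #4638 accepts a Hugging Face-style dict (the exact keys depend on the model):

```python
from vllm import LLM, SamplingParams

# Apply dynamic RoPE scaling on top of the model's native context length.
# The argument shape is an assumption; it mirrors the Hugging Face
# `rope_scaling` config entry.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    rope_scaling={"type": "dynamic", "factor": 2.0},
)

outputs = llm.generate(
    ["Summarize the theory of relativity in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```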

Embedding Models

  • Initial support for the Embedding API with e5-mistral-7b-instruct (#3734; example below)
  • Cross-attention KV caching and memory-management towards encoder-decoder model support (#4837)
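
The Embedding API is exposed through the offline LLM entrypoint. A minimal sketch, assuming the LLM.encode() method and output layout added in #3734:

```python
from vllm import LLM

# e5-mistral-7b-instruct is the reference model for the new Embedding API.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True)

prompts = [
    "query: how do I enable prefix caching in vLLM?",
    "passage: vLLM is a fast and easy-to-use library for LLM inference.",
]

# encode() returns one result per prompt carrying the pooled embedding vector.
outputs = llm.encode(prompts)
for output in outputs:
    print(len(output.outputs.embedding))  # 4096-dim for this model
```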

Visual Language Model

  • Add base class for vision-language models (#4809)
  • Consolidate prompt arguments to LLM engines (#4328)
  • LLaVA model refactor (#4910)

Hardware Support

AMD

  • Add fused_moe Triton configs for MI300X (#4951)
  • Add support for Punica kernels (#3140)
  • Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)

Production Engine

Batch API

  • Support OpenAI batch file format (#4794; example below)
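
The batch runner consumes a JSONL file in the OpenAI batch input format and writes one JSONL result per request. A minimal sketch, assuming the vllm.entrypoints.openai.run_batch module and its -i/-o/--model flags from #4794:

```python
import json
import subprocess
import sys

# One request per line, in the OpenAI batch input format.
requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [{"role": "user", "content": "Hello!"}],
        },
    },
]
with open("batch_input.jsonl", "w") as f:
    for request in requests:
        f.write(json.dumps(request) + "\n")

# Run the batch offline; results are written line by line to the output file.
subprocess.run(
    [
        sys.executable, "-m", "vllm.entrypoints.openai.run_batch",
        "-i", "batch_input.jsonl",
        "-o", "batch_output.jsonl",
        "--model", "meta-llama/Meta-Llama-3-8B-Instruct",
    ],
    check=True,
)
```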

Making Ray Optional

  • Add MultiprocessingGPUExecutor (#4539; example below)
  • Eliminate parallel worker per-step task scheduling overhead (#4894)
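
With the multiprocessing executor, single-node tensor parallelism no longer requires a Ray installation. A minimal sketch, assuming the distributed_executor_backend engine argument accepts "mp" (Ray remains the choice for multi-node deployments):

```python
from vllm import LLM

# Tensor parallelism across 2 GPUs on a single node without Ray,
# using the new MultiprocessingGPUExecutor.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,
    distributed_executor_backend="mp",
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```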

Automatic Prefix Caching

  • Accelerating the hashing function by avoiding deep copies (#4696; enabling the feature is sketched below)
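
The hashing speedup applies to automatic prefix caching, which remains opt-in. A minimal sketch of enabling it, assuming the enable_prefix_caching engine argument:

```python
from vllm import LLM, SamplingParams

# Opt in to automatic prefix caching: prompts that share a prefix
# (e.g. a long system prompt) reuse already-computed KV blocks.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
)

shared_prefix = "You are a terse assistant. Answer in one sentence.\n\n"
prompts = [
    shared_prefix + "What is KV caching?",
    shared_prefix + "What is tensor parallelism?",
]
for out in llm.generate(prompts, SamplingParams(max_tokens=32)):
    print(out.outputs[0].text)
```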

Speculative Decoding

  • CUDA graph support (#4295)
  • Enable TP>1 speculative decoding (#4840)
  • Improve n-gram efficiency (#4724; example below)
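
Speculative decoding is driven by engine arguments. A minimal sketch of n-gram (prompt-lookup) speculation, assuming the speculative_model="[ngram]", num_speculative_tokens, ngram_prompt_lookup_max, and use_v2_block_manager arguments available in this release:

```python
from vllm import LLM, SamplingParams

# N-gram speculation looks draft tokens up in the prompt itself,
# so no separate draft model is required.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="[ngram]",
    num_speculative_tokens=4,
    ngram_prompt_lookup_max=3,
    use_v2_block_manager=True,  # speculative decoding requires block manager v2
)
out = llm.generate("The quick brown fox jumps over the", SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```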

Performance Optimization

Quantization

  • Add GPTQ Marlin 2:4 sparse structured support (#4790; example below)
  • Initial Activation Quantization Support (#4525)
  • Marlin 2:4 prefill performance improvement (about 25% better on average) (#4983)
  • Automatically Detect SparseML models (#5119)
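
Quantized checkpoints are normally detected from the model config, but the method can also be pinned explicitly. A minimal sketch, assuming the quantization engine argument and a 4-bit GPTQ checkpoint that the Marlin kernels can serve (the "gptq_marlin" method string is an assumption for this release):

```python
from vllm import LLM

# The quantization method is usually auto-detected from the checkpoint's
# quantization config; passing `quantization` pins the choice explicitly.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",
    quantization="gptq_marlin",
)
print(llm.generate("Hello")[0].outputs[0].text)
```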

Better Attention Kernel

  • Use flash-attn for decoding (#3648; example below)
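
The attention backend is chosen automatically, but it can be forced through an environment variable. A minimal sketch, assuming the VLLM_ATTENTION_BACKEND variable read by the attention selector:

```python
import os

# Force the flash-attn backend before vLLM picks one; leave the variable
# unset to fall back to automatic selection.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM  # import after setting the variable

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
print(llm.generate("Attention is")[0].outputs[0].text)
```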

FP8

  • Improve FP8 linear layer performance (#4691)
  • Add w8a8 CUTLASS kernels (#4749)
  • Support for CUTLASS kernels in CUDA graphs (#4954)
  • Load FP8 kv-cache scaling factors from checkpoints (#4893; example below)
  • Make static FP8 scaling more robust (#4570)
  • Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
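
FP8 weights and the FP8 KV cache are controlled by separate engine arguments. A minimal sketch, assuming quantization="fp8" and kv_cache_dtype="fp8" on an FP8-capable GPU (dynamic scaling is used when no checkpoint scaling factors are available):

```python
from vllm import LLM

# Quantize linear-layer weights to FP8 at load time and keep the KV cache
# in FP8 as well; needs a GPU with native FP8 support (e.g. H100).
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="fp8",
    kv_cache_dtype="fp8",
)
print(llm.generate("FP8 inference")[0].outputs[0].text)
```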

Optimize Distributed Communication

  • change python dict to pytorch tensor (#4607)
  • change python dict to pytorch tensor for blocks to swap (#4659)
  • improve p2p access check (#4992)
  • remove vllm-nccl (#5091)
  • support both cpu and device tensor in broadcast tensor dict (#4660)

Extensible Architecture

Pipeline Parallelism

  • refactor custom allreduce to support multiple tp groups (#4754)
  • refactor pynccl to hold multiple communicators (#4591)
  • Support PP PyNCCL Groups (#4988)

What's Changed

  • Disable cuda version check in vllm-openai image by @zhaoyang-star in #4530
  • [Bugfix] Fix asyncio.Task not being subscriptable by @DarkLight1337 in #4623
  • [CI] use ccache actions properly in release workflow by @simon-mo in #4629
  • [CI] Add retry for agent lost by @cadedaniel in #4633
  • Update lm-format-enforcer to 0.10.1 by @noamgat in #4631
  • [Kernel] Make static FP8 scaling more robust by @pcmoritz in #4570
  • [Core][Optimization] change python dict to pytorch tensor by @youkaichao in #4607
  • [Build/CI] Fixing 'docker run' to re-enable AMD CI tests. by @Alexei-V-Ivanov-AMD in #4642
  • [Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora by @FurtherAI in #4609
  • [Core][Optimization] change copy-on-write from dict[int, list] to list by @youkaichao in #4648
  • [Bug fix][Core] fixup ngram not setup correctly by @leiwen83 in #4551
  • [Core][Distributed] support both cpu and device tensor in broadcast tensor dict by @youkaichao in #4660
  • [Core] Optimize sampler get_logprobs by @rkooo567 in #4594
  • [CI] Make mistral tests pass by @rkooo567 in #4596
  • [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi by @DefTruth in #4573
  • [Misc] Add get_name method to attention backends by @WoosukKwon in #4685
  • [Core] Faster startup for LoRA enabled models by @Yard1 in #4634
  • [Core][Optimization] change python dict to pytorch tensor for blocks to swap by @youkaichao in #4659
  • [CI/Test] fix swap test for multi gpu by @youkaichao in #4689
  • [Misc] Use vllm-flash-attn instead of flash-attn by @WoosukKwon in #4686
  • [Dynamic Spec Decoding] Auto-disable by the running queue size by @comaniac in #4592
  • [Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs by @cadedaniel in #4672
  • [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin by @alexm-neuralmagic in #4626
  • [Frontend] add tok/s speed metric to llm class when using tqdm by @MahmoudAshraf97 in #4400
  • [Frontend] Move async logic outside of constructor by @DarkLight1337 in #4674
  • [Misc] Remove unnecessary ModelRunner imports by @WoosukKwon in #4703
  • [Misc] Set block size at initialization & Fix test_model_runner by @WoosukKwon in #4705
  • [ROCm] Add support for Punica kernels on AMD GPUs by @kliuae in #3140
  • [Bugfix] Fix CLI arguments in OpenAI server docs by @DarkLight1337 in #4709
  • [Bugfix] Update grafana.json by @robertgshaw2-neuralmagic in #4711
  • [Bugfix] Add logs for all model dtype casting by @mgoin in #4717
  • [Model] Snowflake arctic model implementation by @sfc-gh-hazhang in #4652
  • [Kernel] [FP8] Improve FP8 linear layer performance by @pcmoritz in #4691
  • [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support by @comaniac in #4535
  • [Core][Distributed] refactor pynccl to hold multiple communicators by @youkaichao in #4591
  • [Misc] Keep only one implementation of the create_dummy_prompt function. by @AllenDou in #4716
  • chunked-prefill-doc-syntax by @simon-mo in #4603
  • [Core]fix type annotation for swap_blocks by @jikunshang in #4726
  • [Misc] Apply a couple g++ cleanups by @stevegrubb in #4719
  • [Core] Fix circular reference which leaked llm instance in local dev env by @rkooo567 in #4737
  • [Bugfix] Fix CLI arguments in OpenAI server docs by @AllenDou in #4729
  • [Speculative decoding] CUDA graph support by @heeju-kim2 in #4295
  • [CI] Nits for bad initialization of SeqGroup in testing by @robertgshaw2-neuralmagic in #4748
  • [Core][Test] fix function name typo in custom allreduce by @youkaichao in #4750
  • [Model][Misc] Add e5-mistral-7b-instruct and Embedding API by @CatherineSue in #3734
  • [Model] Add support for IBM Granite Code models by @yikangshen in #4636
  • [CI/Build] Tweak Marlin Nondeterminism Issues In CI by @robertgshaw2-neuralmagic in #4713
  • [CORE] Improvement in ranks code by @SwapnilDreams100 in #4718
  • [Core][Distributed] refactor custom allreduce to support multiple tp groups by @youkaichao in #4754
  • [CI/Build] Move test_utils.py to tests/utils.py by @DarkLight1337 in #4425
  • [Scheduler] Warning upon preemption and Swapping by @rkooo567 in #4647
  • [Misc] Enhance attention selector by @WoosukKwon in #4751
  • [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update tensorizer to version 2.9.0 by @sangstar in #4208
  • [Speculative decoding] Improve n-gram efficiency by @comaniac in #4724
  • [Kernel] Use flash-attn for decoding by @skrider in #3648
  • [Bugfix] Fix dynamic FP8 quantization for Mixtral by @pcmoritz in #4793
  • [Doc] Shorten README by removing supported model list by @zhuohan123 in #4796
  • [Doc] Add API reference for offline inference by @DarkLight1337 in #4710
  • [Doc] Add meetups to the doc by @zhuohan123 in #4798
  • [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies by @KuntaiDu in #4696
  • [Bugfix][Doc] Fix CI failure in docs by @DarkLight1337 in #4804
  • [Core] Add MultiprocessingGPUExecutor by @njhill in #4539
  • Add 4th meetup announcement to readme by @simon-mo in #4817
  • Revert "[Kernel] Use flash-attn for decoding (#3648)" by @rkooo567 in #4820
  • [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API by @rkooo567 in #4681
  • [CI/Build] Further decouple HuggingFace implementation from ours during tests by @DarkLight1337 in #4166
  • [Bugfix] Properly set distributed_executor_backend in ParallelConfig by @zifeitong in #4816
  • [Doc] Highlight the fourth meetup in the README by @zhuohan123 in #4842
  • [Frontend] Re-enable custom roles in Chat Completions API by @DarkLight1337 in #4758
  • [Frontend] Support OpenAI batch file format by @wuisawesome in #4794
  • [Core] Implement sharded state loader by @aurickq in #4690
  • [Speculative decoding][Re-take] Enable TP>1 speculative decoding by @comaniac in #4840
  • Add marlin unit tests and marlin benchmark script by @alexm-neuralmagic in #4815
  • [Kernel] add bfloat16 support for gptq marlin kernel by @jinzhen-lin in #4788
  • [docs] Fix typo in examples filename openi -> openai by @wuisawesome in #4864
  • [Frontend] Separate OpenAI Batch Runner usage from API Server by @wuisawesome in #4851
  • [Bugfix] Bypass authorization API token for preflight requests by @dulacp in #4862
  • Add GPTQ Marlin 2:4 sparse structured support by @alexm-neuralmagic in #4790
  • Add JSON output support for benchmark_latency and benchmark_throughput by @simon-mo in #4848
  • [ROCm][AMD][Bugfix] adding a missing triton autotune config by @hongxiayang in #4845
  • [Core][Distributed] remove graph mode function by @youkaichao in #4818
  • [Misc] remove old comments by @youkaichao in #4866
  • [Kernel] Add punica dimension for Qwen1.5-32B LoRA by @Silencioo in #4850
  • [Kernel] Add w8a8 CUTLASS kernels by @tlrmchlsmth in #4749
  • [Bugfix] Fix FP8 KV cache support by @WoosukKwon in #4869
  • Support to serve vLLM on Kubernetes with LWS by @kerthcet in #4829
  • [Frontend] OpenAI API server: Do not add bos token by default when encoding by @bofenghuang in #4688
  • [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests by @Alexei-V-Ivanov-AMD in #4797
  • [Bugfix] fix rope error when load models with different dtypes by @jinzhen-lin in #4835
  • Sync huggingface modifications of qwen Moe model by @eigen2017 in #4774
  • [Doc] Update Ray Data distributed offline inference example by @Yard1 in #4871
  • [Bugfix] Relax tiktoken to >= 0.6.0 by @mgoin in #4890
  • [ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used by @alexeykondrat in #4658
  • [Lora] Support long context lora by @rkooo567 in #4787
  • [Bugfix][Model] Add base class for vision-language models by @DarkLight1337 in #4809
  • [Kernel] Add marlin_24 unit tests by @alexm-neuralmagic in #4901
  • [Kernel] Add flash-attn back by @WoosukKwon in #4907
  • [Model] LLaVA model refactor by @DarkLight1337 in #4910
  • Remove marlin warning by @alexm-neuralmagic in #4918
  • [Bugfix]: Fix communication Timeout error in safety-constrained distributed System by @ZwwWayne in #4914
  • [Build/CI] Enabling AMD Entrypoints Test by @Alexei-V-Ivanov-AMD in #4834
  • [Bugfix] Fix dummy weight for fp8 by @mzusman in #4916
  • [Core] Sharded State Loader download from HF by @aurickq in #4889
  • [Doc]Add documentation to benchmarking script when running TGI by @KuntaiDu in #4920
  • [Core] Fix scheduler considering "no LoRA" as "LoRA" by @Yard1 in #4897
  • [Model] add rope_scaling support for qwen2 by @hzhwcmhf in #4930
  • [Model] Add Phi-2 LoRA support by @Isotr0py in #4886
  • [Docs] Add acknowledgment for sponsors by @simon-mo in #4925
  • [CI/Build] Codespell ignore build/ directory by @mgoin in #4945
  • [Bugfix] Fix flag name for max_seq_len_to_capture by @kerthcet in #4935
  • [Bugfix][Kernel] Add head size check for attention backend selection by @Isotr0py in #4944
  • [Frontend] Dynamic RoPE scaling by @sasha0552 in #4638
  • [CI/Build] Enforce style for C++ and CUDA code with clang-format by @mgoin in #4722
  • [misc] remove comments that were supposed to be removed by @rkooo567 in #4977
  • [Kernel] Fixup for CUTLASS kernels in CUDA graphs by @tlrmchlsmth in #4954
  • [Misc] Load FP8 kv-cache scaling factors from checkpoints by @comaniac in #4893
  • [Model] LoRA gptbigcode implementation by @raywanb in #3949
  • [Core] Eliminate parallel worker per-step task scheduling overhead by @njhill in #4894
  • [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig by @pcmoritz in #4991
  • [Misc] Take user preference in attention selector by @comaniac in #4960
  • Marlin 24 prefill performance improvement (about 25% better on average) by @alexm-neuralmagic in #4983
  • [Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined by @LetianLee in #5009
  • [Core][1/N] Support PP PyNCCL Groups by @andoorve in #4988
  • [Kernel] Initial Activation Quantization Support by @dsikka in #4525
  • [Core]: Option To Use Prompt Token Ids Inside Logits Processor by @kezouke in #4985
  • [Doc] add ccache guide in doc by @youkaichao in #5012
  • [Bugfix] Fix Mistral v0.3 Weight Loading by @robertgshaw2-neuralmagic in #5005
  • [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in #4764
  • [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model by @linxihui in #4799
  • [Misc] add logging level env var by @youkaichao in #5045
  • [Dynamic Spec Decoding] Minor fix for disabling speculative decoding by @LiuXiaoxuanPKU in #5000
  • [Misc] Make Serving Benchmark More User-friendly by @ywang96 in #5044
  • [Bugfix / Core] Prefix Caching Guards (merged with main) by @zhuohan123 in #4846
  • [Core] Allow AQLM on Pascal by @sasha0552 in #5058
  • [Model] Add support for falcon-11B by @Isotr0py in #5069
  • [Core] Sliding window for block manager v2 by @mmoskal in #4545
  • [BugFix] Fix Embedding Models with TP>1 by @robertgshaw2-neuralmagic in #5075
  • [Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X by @divakar-amd in #4951
  • [Docs] Add Dropbox as sponsors by @simon-mo in #5089
  • [Core] Consolidate prompt arguments to LLM engines by @DarkLight1337 in #4328
  • [Bugfix] Remove the last EOS token unless explicitly specified by @jsato8094 in #5077
  • [Misc] add gpu_memory_utilization arg by @pandyamarut in #5079
  • [Core][Optimization] remove vllm-nccl by @youkaichao in #5091
  • [Bugfix] Fix arguments passed to Sequence in stop checker test by @DarkLight1337 in #5092
  • [Core][Distributed] improve p2p access check by @youkaichao in #4992
  • [Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) by @afeldman-nm in #4837
  • [Doc]Replace deprecated flag in readme by @ronensc in #4526
  • [Bugfix][CI/Build] Fix test and improve code for merge_async_iterators by @DarkLight1337 in #5096
  • [Bugfix][CI/Build] Fix codespell failing to skip files in git diff by @DarkLight1337 in #5097
  • [Core] Avoid the need to pass None values to Sequence.inputs by @DarkLight1337 in #5099
  • [Bugfix] logprobs is not compatible with the OpenAI spec #4795 by @Etelis in #5031
  • [Doc][Build] update after removing vllm-nccl by @youkaichao in #5103
  • [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter by @alexm-neuralmagic in #5108
  • [CI/Build] Docker cleanup functionality for amd servers by @okakarpa in #5112
  • [BUGFIX] [FRONTEND] Correct chat logprobs by @br3no in #5029
  • [Bugfix] Automatically Detect SparseML models by @robertgshaw2-neuralmagic in #5119
  • [CI/Build] increase wheel size limit to 200 MB by @youkaichao in #5130
  • [Misc] remove duplicate definition of seq_lens_tensor in model_runner.py by @ita9naiwa in #5129
  • [Doc] Use intersphinx and update entrypoints docs by @DarkLight1337 in #5125
  • add doc about serving option on dstack by @deep-diver in #3074
  • Bump version to v0.4.3 by @simon-mo in #5046
  • [Build] Disable sm_90a in cu11 by @simon-mo in #5141
  • [Bugfix] Avoid Warnings in SparseML Activation Quantization by @robertgshaw2-neuralmagic in #5120
  • [Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) by @alexm-neuralmagic in #5136
  • [Model] Support MAP-NEO model by @xingweiqu in #5081
  • Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" by @simon-mo in #5149
  • [Misc]: optimize eager mode host time by @functionxu123 in #4196
  • [Model] Enable FP8 QKV in MoE and refine kernel tuning script by @comaniac in #5039
  • [Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support by @njhill in #5171
  • [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels by @tlrmchlsmth in #5168

New Contributors

Full Changelog: v0.4.2...v0.4.3
