vllm-project/vllm v0.5.5

Highlights

Performance Update

  • We introduced a new mode that schedules multiple GPU steps in advance, reducing CPU overhead (#7000, #7387, #7452, #7703). Initial results show a 20% improvement in QPS for a single GPU running 8B and 30B models. You can set --num-scheduler-steps 8 as a parameter to the API server (via vllm serve) or AsyncLLMEngine; see the sketch after this list. We are working on expanding coverage to the LLM class and aim to turn it on by default
  • Various enhancements:
    • Use the flashinfer sampling kernel when available, leading to a 7% decoding throughput speedup (#7137)
    • Reduce Python allocations, leading to a 24% throughput speedup (#7162, #7364)
    • Improvements to the zeromq-based decoupled frontend (#7570, #7716, #7484)
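
A minimal sketch of enabling the multi-step scheduling option above (the model name is a placeholder, and the Python engine-argument name is assumed to mirror the --num-scheduler-steps CLI flag):

    # Shell: pass the flag to the OpenAI-compatible server.
    #   vllm serve meta-llama/Meta-Llama-3-8B-Instruct --num-scheduler-steps 8

    # Python: the same option via AsyncLLMEngine.
    from vllm import AsyncEngineArgs, AsyncLLMEngine

    engine_args = AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
        num_scheduler_steps=8,  # schedule 8 GPU steps per scheduler invocation
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)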

Model Support

  • Support Jamba 1.5 (#7415, #7601, #6739)
  • Support for the first audio model, UltravoxModel (#7615, #7446)
  • Improvements to vision models:
    • Support image embeddings as input (#6613); see the example after this list
    • Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
  • Support loading GGUF models (#5191), including with tensor parallelism (#7520)
  • Progress on encoder/decoder models: support for serving encoder/decoder models (#7258) and architecture for cross-attention (#4942)
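
The image-embedding input path noted above can be exercised through the offline LLM API. A hedged sketch, assuming the model accepts an <image> placeholder in the prompt and a precomputed embedding tensor of the model's expected shape (the model name and tensor shape are illustrative only):

    import torch
    from vllm import LLM

    llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # placeholder vision-language model

    # Precomputed image features; the shape is model-specific and dummy here.
    image_embeds = torch.randn(576, 4096)

    outputs = llm.generate({
        "prompt": "USER: <image>\nWhat is shown in this image? ASSISTANT:",
        "multi_modal_data": {"image": image_embeds},
    })
    print(outputs[0].outputs[0].text)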

Hardware Support

  • AMD: add FP8 linear layer for ROCm (#7210)
  • Enhancements to TPU support: load-time W8A16 quantization (#7005), optimized RoPE (#7635), and multi-host inference support (#7457)
  • Intel: various refactoring for worker, executor, and model runner (#7686, #7712)

Others

  • Optimize prefix caching performance (#7193)
  • Speculative decoding
    • Use target model max length as default for draft model (#7706)
    • EAGLE Implementation with Top-1 proposer (#6830)
  • Entrypoints
    • A new chat method in the LLM class (#5049); see the sketch after this list
    • Support embeddings in the run_batch API (#7132)
    • Support prompt_logprobs in Chat Completion (#7453)
  • Quantizations
    • Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
    • Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)
  • torch.compile: register custom ops for kernels (#7591, #7594, #7536)
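
For the new chat entrypoint in the LLM class mentioned above, a minimal sketch (the model name is a placeholder; messages are assumed to follow the OpenAI-style role/content convention):

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain multi-step scheduling in one sentence."},
    ]

    # chat() applies the model's chat template before generation.
    outputs = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=128))
    print(outputs[0].outputs[0].text)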

What's Changed

  • [ci][frontend] deduplicate tests by @youkaichao in #7101
  • [Doc] [SpecDecode] Update MLPSpeculator documentation by @tdoublep in #7100
  • [Bugfix] Specify device when loading LoRA and embedding tensors by @jischein in #7129
  • [MISC] Use non-blocking transfer in prepare_input by @comaniac in #7172
  • [Core] Support loading GGUF model by @Isotr0py in #5191
  • [Build] Add initial conditional testing spec by @simon-mo in #6841
  • [LoRA] Relax LoRA condition by @jeejeelee in #7146
  • [Model] Support SigLIP encoder and alternative decoders for LLaVA models by @DarkLight1337 in #7153
  • [BugFix] Fix DeepSeek remote code by @dsikka in #7178
  • [ BugFix ] Fix ZMQ when VLLM_PORT is set by @robertgshaw2-neuralmagic in #7205
  • [Bugfix] add gguf dependency by @kpapis in #7198
  • [SpecDecode] [Minor] Fix spec decode sampler tests by @LiuXiaoxuanPKU in #7183
  • [Kernel] Add per-tensor and per-token AZP epilogues by @ProExpertProg in #5941
  • [Core] Optimize evictor-v2 performance by @xiaobochen123 in #7193
  • [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) by @afeldman-nm in #4942
  • [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading by @mgoin in #7225
  • [BugFix] Overhaul async request cancellation by @njhill in #7111
  • [Doc] Mock new dependencies for documentation by @ywang96 in #7245
  • [BUGFIX]: top_k is expected to be an integer. by @Atllkks10 in #7227
  • [Frontend] Gracefully handle missing chat template and fix CI failure by @DarkLight1337 in #7238
  • [distributed][misc] add specialized method for cuda platform by @youkaichao in #7249
  • [Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 by @dsikka in #5874
  • [ BugFix ] Move zmq frontend to IPC instead of TCP by @robertgshaw2-neuralmagic in #7222
  • Fixes typo in function name by @rafvasq in #7275
  • [Bugfix] Fix input processor for InternVL2 model by @Isotr0py in #7164
  • [OpenVINO] migrate to latest dependencies versions by @ilya-lavrenov in #7251
  • [Doc] add online speculative decoding example by @stas00 in #7243
  • [BugFix] Fix frontend multiprocessing hang by @maxdebayser in #7217
  • [Bugfix][FP8] Fix dynamic FP8 Marlin quantization by @mgoin in #7219
  • [ci] Make building wheels per commit optional by @khluu in #7278
  • [Bugfix] Fix gptq failure on T4s by @LucasWilkinson in #7264
  • [FrontEnd] Make merge_async_iterators is_cancelled arg optional by @njhill in #7282
  • [Doc] Update supported_hardware.rst by @mgoin in #7276
  • [Kernel] Fix Flashinfer Correctness by @LiuXiaoxuanPKU in #7284
  • [Misc] Fix typos in scheduler.py by @ruisearch42 in #7285
  • [Frontend] remove max_num_batched_tokens limit for lora by @NiuBlibing in #7288
  • [Bugfix] Fix LoRA with PP by @andoorve in #7292
  • [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 by @jeejeelee in #7273
  • [Bugfix][Kernel] Increased atol to fix failing tests by @ProExpertProg in #7305
  • [Frontend] Kill the server on engine death by @joerunde in #6594
  • [Bugfix][fast] Fix the get_num_blocks_touched logic by @zachzzc in #6849
  • [Doc] Put collect_env issue output in a block by @mgoin in #7310
  • [CI/Build] Dockerfile.cpu improvements by @dtrifiro in #7298
  • [Bugfix] Fix new Llama3.1 GGUF model loading by @Isotr0py in #7269
  • [Misc] Temporarily resolve the error of BitAndBytes by @jeejeelee in #7308
  • Add Skywork AI as Sponsor by @simon-mo in #7314
  • [TPU] Add Load-time W8A16 quantization for TPU Backend by @lsy323 in #7005
  • [Core] Support serving encoder/decoder models by @DarkLight1337 in #7258
  • [TPU] Fix dockerfile.tpu by @WoosukKwon in #7331
  • [Performance] Optimize e2e overheads: Reduce python allocations by @alexm-neuralmagic in #7162
  • [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary by @tjohnson31415 in #7218
  • [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace by @SolitaryThinker in #6971
  • [Core] Streamline stream termination in AsyncLLMEngine by @njhill in #7336
  • [Model][Jamba] Mamba cache single buffer by @mzusman in #6739
  • [VLM][Doc] Add stop_token_ids to InternVL example by @Isotr0py in #7354
  • [Performance] e2e overheads reduction: Small followup diff by @alexm-neuralmagic in #7364
  • [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder by @alexm-neuralmagic in #7360
  • [Frontend] Support embeddings in the run_batch API by @pooyadavoodi in #7132
  • [Bugfix] Fix ITL recording in serving benchmark by @ywang96 in #7372
  • [Core] Add span metrics for model_forward, scheduler and sampler time by @sfc-gh-mkeralapura in #7089
  • [Bugfix] Fix PerTensorScaleParameter weight loading for fused models by @dsikka in #7376
  • [Misc] Add numpy implementation of compute_slot_mapping by @Yard1 in #7377
  • [Core] Fix edge case in chunked prefill + block manager v2 by @cadedaniel in #7380
  • [Bugfix] Fix phi3v batch inference when images have different aspect ratio by @Isotr0py in #7392
  • [TPU] Use mark_dynamic to reduce compilation time by @WoosukKwon in #7340
  • Updating LM Format Enforcer version to v0.10.6 by @noamgat in #7189
  • [core] [2/N] refactor worker_base input preparation for multi-step by @SolitaryThinker in #7387
  • [CI/Build] build on empty device for better dev experience by @tomeras91 in #4773
  • [Doc] add instructions about building vLLM with VLLM_TARGET_DEVICE=empty by @tomeras91 in #7403
  • [misc] add commit id in collect env by @youkaichao in #7405
  • [Docs] Update readme by @simon-mo in #7316
  • [CI/Build] Minor refactoring for vLLM assets by @ywang96 in #7407
  • [Kernel] Flashinfer correctness fix for v0.1.3 by @LiuXiaoxuanPKU in #7319
  • [Core][VLM] Support image embeddings as input by @ywang96 in #6613
  • [Frontend] Disallow passing model as both argument and option by @DarkLight1337 in #7347
  • [CI/Build] bump Dockerfile.neuron image base, use public ECR by @dtrifiro in #6832
  • [Bugfix] Fix logit soft cap in flash-attn backend by @WoosukKwon in #7425
  • [ci] Entrypoints run upon changes in vllm/ by @khluu in #7423
  • [ci] Cancel fastcheck run when PR is marked ready by @khluu in #7427
  • [ci] Cancel fastcheck when PR is ready by @khluu in #7433
  • [Misc] Use scalar type to dispatch to different gptq_marlin kernels by @LucasWilkinson in #7323
  • [Core] Consolidate GB constant and enable float GB arguments by @DarkLight1337 in #7416
  • [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel by @jon-chuang in #7208
  • [Bugfix] Handle PackageNotFoundError when checking for xpu version by @sasha0552 in #7398
  • [CI/Build] bump minimum cmake version by @dtrifiro in #6999
  • [Core] Shut down aDAG workers with clean async llm engine exit by @ruisearch42 in #7224
  • [mypy] Misc. typing improvements by @DarkLight1337 in #7417
  • [Misc] improve logits processors logging message by @aw632 in #7435
  • [ci] Remove fast check cancel workflow by @khluu in #7455
  • [Bugfix] Fix weight loading for Chameleon when TP>1 by @DarkLight1337 in #7410
  • [hardware] unify usage of is_tpu to current_platform.is_tpu() by @youkaichao in #7102
  • [TPU] Suppress import custom_ops warning by @WoosukKwon in #7458
  • Revert "[Doc] Update supported_hardware.rst (#7276)" by @WoosukKwon in #7467
  • [Frontend][Core] Add plumbing to support audio language models by @petersalas in #7446
  • [Misc] Update LM Eval Tolerance by @dsikka in #7473
  • [Misc] Update gptq_marlin to use new vLLMParameters by @dsikka in #7281
  • [Misc] Update Fused MoE weight loading by @dsikka in #7334
  • [Misc] Update awq and awq_marlin to use vLLMParameters by @dsikka in #7422
  • Announce NVIDIA Meetup by @simon-mo in #7483
  • [frontend] spawn engine process from api server process by @youkaichao in #7484
  • [Misc] compressed-tensors code reuse by @kylesayrs in #7277
  • [misc][plugin] add plugin system implementation by @youkaichao in #7426
  • [TPU] Support multi-host inference by @WoosukKwon in #7457
  • [Bugfix][CI] Import ray under guard by @WoosukKwon in #7486
  • [CI/Build]Reduce the time consumption for LoRA tests by @jeejeelee in #7396
  • [misc][ci] fix cpu test with plugins by @youkaichao in #7489
  • [Bugfix][Docs] Update list of mock imports by @DarkLight1337 in #7493
  • [doc] update test script to include cudagraph by @youkaichao in #7501
  • Fix empty output when temp is too low by @CatherineSue in #2937
  • [ci] fix model tests by @youkaichao in #7507
  • [Bugfix][Frontend] Disable embedding API for chat models by @QwertyJack in #7504
  • [Misc] Deprecation Warning when setting --engine-use-ray by @wallashss in #7424
  • [VLM][Core] Support profiling with multiple multi-modal inputs per prompt by @DarkLight1337 in #7126
  • [core] [3/N] multi-step args and sequence.py by @SolitaryThinker in #7452
  • [TPU] Set per-rank XLA cache by @WoosukKwon in #7533
  • [Misc] Revert compressed-tensors code reuse by @kylesayrs in #7521
  • llama_index serving integration documentation by @pavanjava in #6973
  • [Bugfix][TPU] Correct env variable for XLA cache path by @WoosukKwon in #7544
  • [Bugfix] update neuron for version > 0.5.0 by @omrishiv in #7175
  • [Misc] Update dockerfile for CPU to cover protobuf installation by @PHILO-HE in #7182
  • [Bugfix] Fix default weight loading for scalars by @mgoin in #7534
  • [Bugfix][Harmless] Fix hardcoded float16 dtype for model_is_embedding by @mgoin in #7566
  • [Misc] Add quantization config support for speculative model. by @ShangmingCai in #7343
  • [Feature]: Add OpenAI server prompt_logprobs support #6508 by @gnpinkert in #7453
  • [ci/test] rearrange tests and make adag test soft fail by @youkaichao in #7572
  • Chat method for offline llm by @nunjunj in #5049
  • [CI] Move quantization cpu offload tests out of fastcheck by @mgoin in #7574
  • [Misc/Testing] Use torch.testing.assert_close by @jon-chuang in #7324
  • register custom op for flash attn and use from torch.ops by @youkaichao in #7536
  • [Core] Use uvloop with zmq-decoupled front-end by @njhill in #7570
  • [CI] Fix crashes of performance benchmark by @KuntaiDu in #7500
  • [Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method by @gongdao123 in #7513
  • support tqdm in notebooks by @fzyzcjy in #7510
  • [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm by @charlifu in #7210
  • [Kernel] W8A16 Int8 inside FusedMoE by @mzusman in #7415
  • [Kernel] Add tuned triton configs for ExpertsInt8 by @mgoin in #7601
  • [spec decode] [4/N] Move update_flash_attn_metadata to attn backend by @SolitaryThinker in #7571
  • [Core] Fix tracking of model forward time to the span traces in case of PP>1 by @sfc-gh-mkeralapura in #7440
  • [Doc] Add docs for llmcompressor INT8 and FP8 checkpoints by @mgoin in #7444
  • [Doc] Update quantization supported hardware table by @mgoin in #7595
  • [Kernel] register punica functions as torch ops by @bnellnm in #7591
  • [Kernel][Misc] dynamo support for ScalarType by @bnellnm in #7594
  • [Kernel] fix types used in aqlm and ggml kernels to support dynamo by @bnellnm in #7596
  • [Model] Align nemotron config with final HF state and fix lm-eval-small by @mgoin in #7611
  • [Bugfix] Fix custom_ar support check by @bnellnm in #7617
  • .[Build/CI] Enabling passing AMD tests. by @Alexei-V-Ivanov-AMD in #7610
  • [Bugfix] Clear engine reference in AsyncEngineRPCServer by @ruisearch42 in #7618
  • [aDAG] Unflake aDAG + PP tests by @rkooo567 in #7600
  • [Bugfix] add >= 1.0 constraint for openai dependency by @metasyn in #7612
  • [misc] use nvml to get consistent device name by @youkaichao in #7582
  • [ci][test] fix engine/logger test by @youkaichao in #7621
  • [core][misc] update libcudart finding by @youkaichao in #7620
  • [Model] Pipeline parallel support for JAIS by @mrbesher in #7603
  • [ci][test] allow longer wait time for api server by @youkaichao in #7629
  • [Misc]Fix BitAndBytes exception messages by @jeejeelee in #7626
  • [VLM] Refactor MultiModalConfig initialization and profiling by @ywang96 in #7530
  • [TPU] Skip creating empty tensor by @WoosukKwon in #7630
  • [TPU] Use mark_dynamic only for dummy run by @WoosukKwon in #7634
  • [TPU] Optimize RoPE forward_native2 by @WoosukKwon in #7636
  • [ Bugfix ] Fix Prometheus Metrics With zeromq Frontend by @robertgshaw2-neuralmagic in #7279
  • [CI/Build] Add text-only test for Qwen models by @alex-jw-brooks in #7475
  • [Misc] Refactor Llama3 RoPE initialization by @WoosukKwon in #7637
  • [Core] Optimize SPMD architecture with delta + serialization optimization by @rkooo567 in #7109
  • [Core] Use flashinfer sampling kernel when available by @peng1999 in #7137
  • fix xpu build by @jikunshang in #7644
  • [Misc] Remove Gemma RoPE by @WoosukKwon in #7638
  • [MISC] Add prefix cache hit rate to metrics by @comaniac in #7606
  • [Bugfix] fix lora_dtype value type in arg_utils.py - part 2 by @c3-ali in #5428
  • [core] Multi Step Scheduling by @SolitaryThinker in #7000
  • [Core] Support tensor parallelism for GGUF quantization by @Isotr0py in #7520
  • [Bugfix] Don't disable existing loggers by @a-ys in #7664
  • [TPU] Fix redundant input tensor cloning by @WoosukKwon in #7660
  • [Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding by @tjohnson31415 in #7665
  • [doc] fix doc build error caused by msgspec by @youkaichao in #7659
  • [Speculative Decoding] Fixing hidden states handling in batch expansion by @abhigoyal1997 in #7508
  • [ci] Install Buildkite test suite analysis by @khluu in #7667
  • [Bugfix] support tie_word_embeddings for all models by @zijian-hu in #5724
  • [CI] Organizing performance benchmark files by @KuntaiDu in #7616
  • [misc] add nvidia related library in collect env by @youkaichao in #7674
  • [XPU] fallback to native implementation for xpu custom op by @jianyizh in #7670
  • [misc][cuda] add warning for pynvml user by @youkaichao in #7675
  • [Core] Refactor executor classes to make it easier to inherit GPUExecutor by @jikunshang in #7673
  • [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel by @LucasWilkinson in #7174
  • [OpenVINO] Updated documentation by @ilya-lavrenov in #7687
  • [VLM][Model] Add test for InternViT vision encoder by @Isotr0py in #7409
  • [Hardware] [Intel GPU] refactor xpu worker/executor by @jikunshang in #7686
  • [CI/Build] Pin OpenTelemetry versions and make availability errors clearer by @ronensc in #7266
  • [Misc] Add jinja2 as an explicit build requirement by @LucasWilkinson in #7695
  • [Core] Add AttentionState abstraction by @Yard1 in #7663
  • [Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) by @jikunshang in #7685
  • [ci][test] adjust max wait time for cpu offloading test by @youkaichao in #7709
  • [Core] Pipe worker_class_fn argument in Executor by @Yard1 in #7707
  • [ci] try to log process using the port to debug the port usage by @youkaichao in #7711
  • [Model] Add AWQ quantization support for InternVL2 model by @Isotr0py in #7187
  • [Doc] Section for Multimodal Language Models by @ywang96 in #7719
  • [mypy] Enable following imports for entrypoints by @DarkLight1337 in #7248
  • [Bugfix] Mirror jinja2 in pyproject.toml by @sasha0552 in #7723
  • [BugFix] Avoid premature async generator exit and raise all exception variations by @njhill in #7698
  • [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] by @learninmou in #7509
  • [Bugfix][Hardware][CPU] Fix mm_limits initialization for CPU backend by @Isotr0py in #7735
  • [Spec Decoding] Use target model max length as default for draft model by @njhill in #7706
  • [Bugfix] chat method add_generation_prompt param by @brian14708 in #7734
  • [Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend by @robertgshaw2-neuralmagic in #7394
  • [Bugfix] Pass PYTHONPATH from setup.py to CMake by @sasha0552 in #7730
  • [multi-step] Raise error if not using async engine by @SolitaryThinker in #7703
  • [Frontend] Improve Startup Failure UX by @robertgshaw2-neuralmagic in #7716
  • [misc] Add Torch profiler support by @SolitaryThinker in #7451
  • [Model] Add UltravoxModel and UltravoxConfig by @petersalas in #7615
  • [ci] [multi-step] narrow multi-step test dependency paths by @SolitaryThinker in #7760
  • [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by @dsikka in #7527
  • [distributed][misc] error on same VLLM_HOST_IP setting by @youkaichao in #7756
  • [AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility by @gshtras in #7477
  • [Kernel] Replaced blockReduce[...] functions with cub::BlockReduce by @ProExpertProg in #7233
  • [Model] Fix Phi-3.5-vision-instruct 'num_crops' issue by @zifeitong in #7710
  • [Bug][Frontend] Improve ZMQ client robustness by @joerunde in #7443
  • Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" by @mgoin in #7764
  • [TPU] Avoid initializing TPU runtime in is_tpu by @WoosukKwon in #7763
  • [ci] refine dependency for distributed tests by @youkaichao in #7776
  • [Misc] Use torch.compile for GemmaRMSNorm by @WoosukKwon in #7642
  • [Speculative Decoding] EAGLE Implementation with Top-1 proposer by @abhigoyal1997 in #6830
  • Fix ShardedStateLoader for vllm fp8 quantization by @sfc-gh-zhwang in #7708
  • [Bugfix] Don't build machete on cuda <12.0 by @LucasWilkinson in #7757
  • [Misc] update fp8 to use vLLMParameter by @dsikka in #7437
  • [Bugfix] spec decode handle None entries in topk args in create_sequence_group_output by @tjohnson31415 in #7232
  • [Misc] Enhance prefix-caching benchmark tool by @Jeffwan in #6568
  • [Doc] Fix incorrect docs from #7615 by @petersalas in #7788
  • [Bugfix] Use LoadFormat values as choices for vllm serve --load-format by @mgoin in #7784
  • [ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args by @khluu in #7705
  • [Misc] fix typo in triton import warning by @lsy323 in #7794
  • [Frontend] error suppression cleanup by @joerunde in #7786
  • [Ray backend] Better error when pg topology is bad. by @rkooo567 in #7584
  • [Hardware][Intel GPU] refactor xpu_model_runner, fix xpu tensor parallel by @jikunshang in #7712
  • [misc] Add Torch profiler support for CPU-only devices by @DamonFool in #7806
  • [BugFix] Fix server crash on empty prompt by @maxdebayser in #7746
  • [github][misc] promote asking llm first by @youkaichao in #7809
  • [Misc] Update marlin to use vLLMParameters by @dsikka in #7803
  • Bump version to v0.5.5 by @simon-mo in #7823

New Contributors

Full Changelog: v0.5.4...v0.5.5
