What's Changed
- [ci][frontend] deduplicate tests by @youkaichao in #7101
- [Doc] [SpecDecode] Update MLPSpeculator documentation by @tdoublep in #7100
- [Bugfix] Specify device when loading LoRA and embedding tensors by @jischein in #7129
- [MISC] Use non-blocking transfer in prepare_input by @comaniac in #7172
- [Core] Support loading GGUF model by @Isotr0py in #5191
- [Build] Add initial conditional testing spec by @simon-mo in #6841
- [LoRA] Relax LoRA condition by @jeejeelee in #7146
- [Model] Support SigLIP encoder and alternative decoders for LLaVA models by @DarkLight1337 in #7153
- [BugFix] Fix DeepSeek remote code by @dsikka in #7178
- [BugFix] Fix ZMQ when `VLLM_PORT` is set by @robertgshaw2-neuralmagic in #7205
- [Bugfix] add gguf dependency by @kpapis in #7198
- [SpecDecode] [Minor] Fix spec decode sampler tests by @LiuXiaoxuanPKU in #7183
- [Kernel] Add per-tensor and per-token AZP epilogues by @ProExpertProg in #5941
- [Core] Optimize evictor-v2 performance by @xiaobochen123 in #7193
- [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) by @afeldman-nm in #4942
- [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading by @mgoin in #7225
- [BugFix] Overhaul async request cancellation by @njhill in #7111
- [Doc] Mock new dependencies for documentation by @ywang96 in #7245
- [BUGFIX]: top_k is expected to be an integer. by @Atllkks10 in #7227
- [Frontend] Gracefully handle missing chat template and fix CI failure by @DarkLight1337 in #7238
- [distributed][misc] add specialized method for cuda platform by @youkaichao in #7249
- [Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` by @dsikka in #5874
- [BugFix] Move `zmq` frontend to IPC instead of TCP by @robertgshaw2-neuralmagic in #7222
- Fixes typo in function name by @rafvasq in #7275
- [Bugfix] Fix input processor for InternVL2 model by @Isotr0py in #7164
- [OpenVINO] migrate to latest dependencies versions by @ilya-lavrenov in #7251
- [Doc] add online speculative decoding example by @stas00 in #7243
- [BugFix] Fix frontend multiprocessing hang by @maxdebayser in #7217
- [Bugfix][FP8] Fix dynamic FP8 Marlin quantization by @mgoin in #7219
- [ci] Make building wheels per commit optional by @khluu in #7278
- [Bugfix] Fix gptq failure on T4s by @LucasWilkinson in #7264
- [FrontEnd] Make `merge_async_iterators` `is_cancelled` arg optional by @njhill in #7282
- [Doc] Update supported_hardware.rst by @mgoin in #7276
- [Kernel] Fix Flashinfer Correctness by @LiuXiaoxuanPKU in #7284
- [Misc] Fix typos in scheduler.py by @ruisearch42 in #7285
- [Frontend] remove max_num_batched_tokens limit for lora by @NiuBlibing in #7288
- [Bugfix] Fix LoRA with PP by @andoorve in #7292
- [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 by @jeejeelee in #7273
- [Bugfix][Kernel] Increased atol to fix failing tests by @ProExpertProg in #7305
- [Frontend] Kill the server on engine death by @joerunde in #6594
- [Bugfix][fast] Fix the get_num_blocks_touched logic by @zachzzc in #6849
- [Doc] Put collect_env issue output in a block by @mgoin in #7310
- [CI/Build] Dockerfile.cpu improvements by @dtrifiro in #7298
- [Bugfix] Fix new Llama3.1 GGUF model loading by @Isotr0py in #7269
- [Misc] Temporarily resolve the error of BitAndBytes by @jeejeelee in #7308
- Add Skywork AI as Sponsor by @simon-mo in #7314
- [TPU] Add Load-time W8A16 quantization for TPU Backend by @lsy323 in #7005
- [Core] Support serving encoder/decoder models by @DarkLight1337 in #7258
- [TPU] Fix dockerfile.tpu by @WoosukKwon in #7331
- [Performance] Optimize e2e overheads: Reduce python allocations by @alexm-neuralmagic in #7162
- [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary by @tjohnson31415 in #7218
- [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace by @SolitaryThinker in #6971
- [Core] Streamline stream termination in `AsyncLLMEngine` by @njhill in #7336
- [Model][Jamba] Mamba cache single buffer by @mzusman in #6739
- [VLM][Doc] Add `stop_token_ids` to InternVL example by @Isotr0py in #7354
- [Performance] e2e overheads reduction: Small followup diff by @alexm-neuralmagic in #7364
- [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder by @alexm-neuralmagic in #7360
- [Frontend] Support embeddings in the run_batch API by @pooyadavoodi in #7132
- [Bugfix] Fix ITL recording in serving benchmark by @ywang96 in #7372
- [Core] Add span metrics for model_forward, scheduler and sampler time by @sfc-gh-mkeralapura in #7089
- [Bugfix] Fix `PerTensorScaleParameter` weight loading for fused models by @dsikka in #7376
- [Misc] Add numpy implementation of `compute_slot_mapping` by @Yard1 in #7377
- [Core] Fix edge case in chunked prefill + block manager v2 by @cadedaniel in #7380
- [Bugfix] Fix phi3v batch inference when images have different aspect ratio by @Isotr0py in #7392
- [TPU] Use mark_dynamic to reduce compilation time by @WoosukKwon in #7340
- Updating LM Format Enforcer version to v0.10.6 by @noamgat in #7189
- [core] [2/N] refactor worker_base input preparation for multi-step by @SolitaryThinker in #7387
- [CI/Build] build on empty device for better dev experience by @tomeras91 in #4773
- [Doc] add instructions about building vLLM with VLLM_TARGET_DEVICE=empty by @tomeras91 in #7403
- [misc] add commit id in collect env by @youkaichao in #7405
- [Docs] Update readme by @simon-mo in #7316
- [CI/Build] Minor refactoring for vLLM assets by @ywang96 in #7407
- [Kernel] Flashinfer correctness fix for v0.1.3 by @LiuXiaoxuanPKU in #7319
- [Core][VLM] Support image embeddings as input by @ywang96 in #6613
- [Frontend] Disallow passing `model` as both argument and option by @DarkLight1337 in #7347
- [CI/Build] bump Dockerfile.neuron image base, use public ECR by @dtrifiro in #6832
- [Bugfix] Fix logit soft cap in flash-attn backend by @WoosukKwon in #7425
- [ci] Entrypoints run upon changes in vllm/ by @khluu in #7423
- [ci] Cancel fastcheck run when PR is marked ready by @khluu in #7427
- [ci] Cancel fastcheck when PR is ready by @khluu in #7433
- [Misc] Use scalar type to dispatch to different `gptq_marlin` kernels by @LucasWilkinson in #7323
- [Core] Consolidate `GB` constant and enable float GB arguments by @DarkLight1337 in #7416
- [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel by @jon-chuang in #7208
- [Bugfix] Handle PackageNotFoundError when checking for xpu version by @sasha0552 in #7398
- [CI/Build] bump minimum cmake version by @dtrifiro in #6999
- [Core] Shut down aDAG workers with clean async llm engine exit by @ruisearch42 in #7224
- [mypy] Misc. typing improvements by @DarkLight1337 in #7417
- [Misc] improve logits processors logging message by @aw632 in #7435
- [ci] Remove fast check cancel workflow by @khluu in #7455
- [Bugfix] Fix weight loading for Chameleon when TP>1 by @DarkLight1337 in #7410
- [hardware] unify usage of is_tpu to current_platform.is_tpu() by @youkaichao in #7102
- [TPU] Suppress import custom_ops warning by @WoosukKwon in #7458
- Revert "[Doc] Update supported_hardware.rst (#7276)" by @WoosukKwon in #7467
- [Frontend][Core] Add plumbing to support audio language models by @petersalas in #7446
- [Misc] Update LM Eval Tolerance by @dsikka in #7473
- [Misc] Update `gptq_marlin` to use new vLLMParameters by @dsikka in #7281
- [Misc] Update Fused MoE weight loading by @dsikka in #7334
- [Misc] Update `awq` and `awq_marlin` to use `vLLMParameters` by @dsikka in #7422
- Announce NVIDIA Meetup by @simon-mo in #7483
- [frontend] spawn engine process from api server process by @youkaichao in #7484
- [Misc] `compressed-tensors` code reuse by @kylesayrs in #7277
- [misc][plugin] add plugin system implementation by @youkaichao in #7426
- [TPU] Support multi-host inference by @WoosukKwon in #7457
- [Bugfix][CI] Import ray under guard by @WoosukKwon in #7486
- [CI/Build]Reduce the time consumption for LoRA tests by @jeejeelee in #7396
- [misc][ci] fix cpu test with plugins by @youkaichao in #7489
- [Bugfix][Docs] Update list of mock imports by @DarkLight1337 in #7493
- [doc] update test script to include cudagraph by @youkaichao in #7501
- Fix empty output when temp is too low by @CatherineSue in #2937
- [ci] fix model tests by @youkaichao in #7507
- [Bugfix][Frontend] Disable embedding API for chat models by @QwertyJack in #7504
- [Misc] Deprecation Warning when setting --engine-use-ray by @wallashss in #7424
- [VLM][Core] Support profiling with multiple multi-modal inputs per prompt by @DarkLight1337 in #7126
- [core] [3/N] multi-step args and sequence.py by @SolitaryThinker in #7452
- [TPU] Set per-rank XLA cache by @WoosukKwon in #7533
- [Misc] Revert `compressed-tensors` code reuse by @kylesayrs in #7521
- llama_index serving integration documentation by @pavanjava in #6973
- [Bugfix][TPU] Correct env variable for XLA cache path by @WoosukKwon in #7544
- [Bugfix] update neuron for version > 0.5.0 by @omrishiv in #7175
- [Misc] Update dockerfile for CPU to cover protobuf installation by @PHILO-HE in #7182
- [Bugfix] Fix default weight loading for scalars by @mgoin in #7534
- [Bugfix][Harmless] Fix hardcoded float16 dtype for model_is_embedding by @mgoin in #7566
- [Misc] Add quantization config support for speculative model. by @ShangmingCai in #7343
- [Feature]: Add OpenAI server prompt_logprobs support #6508 by @gnpinkert in #7453
- [ci/test] rearrange tests and make adag test soft fail by @youkaichao in #7572
- Chat method for offline llm by @nunjunj in #5049
- [CI] Move quantization cpu offload tests out of fastcheck by @mgoin in #7574
- [Misc/Testing] Use `torch.testing.assert_close` by @jon-chuang in #7324
- register custom op for flash attn and use from torch.ops by @youkaichao in #7536
- [Core] Use uvloop with zmq-decoupled front-end by @njhill in #7570
- [CI] Fix crashes of performance benchmark by @KuntaiDu in #7500
- [Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method by @gongdao123 in #7513
- support tqdm in notebooks by @fzyzcjy in #7510
- [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm by @charlifu in #7210
- [Kernel] W8A16 Int8 inside FusedMoE by @mzusman in #7415
- [Kernel] Add tuned triton configs for ExpertsInt8 by @mgoin in #7601
- [spec decode] [4/N] Move update_flash_attn_metadata to attn backend by @SolitaryThinker in #7571
- [Core] Fix tracking of model forward time to the span traces in case of PP>1 by @sfc-gh-mkeralapura in #7440
- [Doc] Add docs for llmcompressor INT8 and FP8 checkpoints by @mgoin in #7444
- [Doc] Update quantization supported hardware table by @mgoin in #7595
- [Kernel] register punica functions as torch ops by @bnellnm in #7591
- [Kernel][Misc] dynamo support for ScalarType by @bnellnm in #7594
- [Kernel] fix types used in aqlm and ggml kernels to support dynamo by @bnellnm in #7596
- [Model] Align nemotron config with final HF state and fix lm-eval-small by @mgoin in #7611
- [Bugfix] Fix custom_ar support check by @bnellnm in #7617
- [Build/CI] Enabling passing AMD tests. by @Alexei-V-Ivanov-AMD in #7610
- [Bugfix] Clear engine reference in AsyncEngineRPCServer by @ruisearch42 in #7618
- [aDAG] Unflake aDAG + PP tests by @rkooo567 in #7600
- [Bugfix] add >= 1.0 constraint for openai dependency by @metasyn in #7612
- [misc] use nvml to get consistent device name by @youkaichao in #7582
- [ci][test] fix engine/logger test by @youkaichao in #7621
- [core][misc] update libcudart finding by @youkaichao in #7620
- [Model] Pipeline parallel support for JAIS by @mrbesher in #7603
- [ci][test] allow longer wait time for api server by @youkaichao in #7629
- [Misc]Fix BitAndBytes exception messages by @jeejeelee in #7626
- [VLM] Refactor `MultiModalConfig` initialization and profiling by @ywang96 in #7530
- [TPU] Skip creating empty tensor by @WoosukKwon in #7630
- [TPU] Use mark_dynamic only for dummy run by @WoosukKwon in #7634
- [TPU] Optimize RoPE forward_native2 by @WoosukKwon in #7636
- [Bugfix] Fix Prometheus Metrics With `zeromq` Frontend by @robertgshaw2-neuralmagic in #7279
- [CI/Build] Add text-only test for Qwen models by @alex-jw-brooks in #7475
- [Misc] Refactor Llama3 RoPE initialization by @WoosukKwon in #7637
- [Core] Optimize SPMD architecture with delta + serialization optimization by @rkooo567 in #7109
- [Core] Use flashinfer sampling kernel when available by @peng1999 in #7137
- fix xpu build by @jikunshang in #7644
- [Misc] Remove Gemma RoPE by @WoosukKwon in #7638
- [MISC] Add prefix cache hit rate to metrics by @comaniac in #7606
- [Bugfix] fix lora_dtype value type in arg_utils.py - part 2 by @c3-ali in #5428
- [core] Multi Step Scheduling by @SolitaryThinker in #7000
- [Core] Support tensor parallelism for GGUF quantization by @Isotr0py in #7520
- [Bugfix] Don't disable existing loggers by @a-ys in #7664
- [TPU] Fix redundant input tensor cloning by @WoosukKwon in #7660
- [Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding by @tjohnson31415 in #7665
- [doc] fix doc build error caused by msgspec by @youkaichao in #7659
- [Speculative Decoding] Fixing hidden states handling in batch expansion by @abhigoyal1997 in #7508
- [ci] Install Buildkite test suite analysis by @khluu in #7667
- [Bugfix] support `tie_word_embeddings` for all models by @zijian-hu in #5724
- [CI] Organizing performance benchmark files by @KuntaiDu in #7616
- [misc] add nvidia related library in collect env by @youkaichao in #7674
- [XPU] fallback to native implementation for xpu custom op by @jianyizh in #7670
- [misc][cuda] add warning for pynvml user by @youkaichao in #7675
- [Core] Refactor executor classes to make it easier to inherit GPUExecutor by @jikunshang in #7673
- [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel by @LucasWilkinson in #7174
- [OpenVINO] Updated documentation by @ilya-lavrenov in #7687
- [VLM][Model] Add test for InternViT vision encoder by @Isotr0py in #7409
- [Hardware] [Intel GPU] refactor xpu worker/executor by @jikunshang in #7686
- [CI/Build] Pin OpenTelemetry versions and make availability errors clearer by @ronensc in #7266
- [Misc] Add jinja2 as an explicit build requirement by @LucasWilkinson in #7695
- [Core] Add `AttentionState` abstraction by @Yard1 in #7663
- [Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) by @jikunshang in #7685
- [ci][test] adjust max wait time for cpu offloading test by @youkaichao in #7709
- [Core] Pipe `worker_class_fn` argument in Executor by @Yard1 in #7707
- [ci] try to log process using the port to debug the port usage by @youkaichao in #7711
- [Model] Add AWQ quantization support for InternVL2 model by @Isotr0py in #7187
- [Doc] Section for Multimodal Language Models by @ywang96 in #7719
- [mypy] Enable following imports for entrypoints by @DarkLight1337 in #7248
- [Bugfix] Mirror jinja2 in pyproject.toml by @sasha0552 in #7723
- [BugFix] Avoid premature async generator exit and raise all exception variations by @njhill in #7698
- [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] by @learninmou in #7509
- [Bugfix][Hardware][CPU] Fix `mm_limits` initialization for CPU backend by @Isotr0py in #7735
- [Spec Decoding] Use target model max length as default for draft model by @njhill in #7706
- [Bugfix] chat method add_generation_prompt param by @brian14708 in #7734
- [Bugfix][Frontend] Fix Issues Under High Load With `zeromq` Frontend by @robertgshaw2-neuralmagic in #7394
- [Bugfix] Pass PYTHONPATH from setup.py to CMake by @sasha0552 in #7730
- [multi-step] Raise error if not using async engine by @SolitaryThinker in #7703
- [Frontend] Improve Startup Failure UX by @robertgshaw2-neuralmagic in #7716
- [misc] Add Torch profiler support by @SolitaryThinker in #7451
- [Model] Add UltravoxModel and UltravoxConfig by @petersalas in #7615
- [ci] [multi-step] narrow multi-step test dependency paths by @SolitaryThinker in #7760
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by @dsikka in #7527
- [distributed][misc] error on same VLLM_HOST_IP setting by @youkaichao in #7756
- [AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility by @gshtras in #7477
- [Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` by @ProExpertProg in #7233
- [Model] Fix Phi-3.5-vision-instruct 'num_crops' issue by @zifeitong in #7710
- [Bug][Frontend] Improve ZMQ client robustness by @joerunde in #7443
- Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" by @mgoin in #7764
- [TPU] Avoid initializing TPU runtime in is_tpu by @WoosukKwon in #7763
- [ci] refine dependency for distributed tests by @youkaichao in #7776
- [Misc] Use torch.compile for GemmaRMSNorm by @WoosukKwon in #7642
- [Speculative Decoding] EAGLE Implementation with Top-1 proposer by @abhigoyal1997 in #6830
- Fix ShardedStateLoader for vllm fp8 quantization by @sfc-gh-zhwang in #7708
- [Bugfix] Don't build machete on cuda <12.0 by @LucasWilkinson in #7757
- [Misc] update fp8 to use `vLLMParameter` by @dsikka in #7437
- [Bugfix] spec decode handle None entries in topk args in create_sequence_group_output by @tjohnson31415 in #7232
- [Misc] Enhance prefix-caching benchmark tool by @Jeffwan in #6568
- [Doc] Fix incorrect docs from #7615 by @petersalas in #7788
- [Bugfix] Use LoadFormat values as choices for `vllm serve --load-format` by @mgoin in #7784
- [ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args by @khluu in #7705
- [Misc] fix typo in triton import warning by @lsy323 in #7794
- [Frontend] error suppression cleanup by @joerunde in #7786
- [Ray backend] Better error when pg topology is bad. by @rkooo567 in #7584
- [Hardware][Intel GPU] refactor xpu_model_runner, fix xpu tensor parallel by @jikunshang in #7712
- [misc] Add Torch profiler support for CPU-only devices by @DamonFool in #7806
- [BugFix] Fix server crash on empty prompt by @maxdebayser in #7746
- [github][misc] promote asking llm first by @youkaichao in #7809
- [Misc] Update `marlin` to use vLLMParameters by @dsikka in #7803
- Bump version to v0.5.5 by @simon-mo in #7823
New Contributors
- @jischein made their first contribution in #7129
- @kpapis made their first contribution in #7198
- @xiaobochen123 made their first contribution in #7193
- @Atllkks10 made their first contribution in #7227
- @stas00 made their first contribution in #7243
- @maxdebayser made their first contribution in #7217
- @NiuBlibing made their first contribution in #7288
- @lsy323 made their first contribution in #7005
- @pooyadavoodi made their first contribution in #7132
- @sfc-gh-mkeralapura made their first contribution in #7089
- @jon-chuang made their first contribution in #7208
- @aw632 made their first contribution in #7435
- @petersalas made their first contribution in #7446
- @kylesayrs made their first contribution in #7277
- @QwertyJack made their first contribution in #7504
- @wallashss made their first contribution in #7424
- @pavanjava made their first contribution in #6973
- @PHILO-HE made their first contribution in #7182
- @gnpinkert made their first contribution in #7453
- @gongdao123 made their first contribution in #7513
- @charlifu made their first contribution in #7210
- @metasyn made their first contribution in #7612
- @mrbesher made their first contribution in #7603
- @alex-jw-brooks made their first contribution in #7475
- @a-ys made their first contribution in #7664
- @zijian-hu made their first contribution in #5724
- @jianyizh made their first contribution in #7670
- @learninmou made their first contribution in #7509
- @brian14708 made their first contribution in #7734
- @sfc-gh-zhwang made their first contribution in #7708
Full Changelog: v0.5.4...v0.5.5