vllm-project/vllm v0.8.0rc1

Note: vLLM no longer sets the global seed (#14274). Please set the seed parameter if you need to reproduce your results.
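
For example, a minimal sketch of pinning the seed explicitly under the new default; the model name and seed values here are illustrative, not taken from the release notes:

```python
from vllm import LLM, SamplingParams

# Engine-level seed: vLLM no longer sets a global seed by default (#14274),
# so pass it explicitly when you need reproducible runs.
llm = LLM(model="facebook/opt-125m", seed=0)

# Per-request seed for sampling, useful when temperature > 0.
params = SamplingParams(temperature=0.8, seed=0)

outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```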

What's Changed

  • Update pre-commit's isort version to remove warnings by @hmellor in #13614
  • [V1][Minor] Print KV cache size in token counts by @WoosukKwon in #13596
  • fix neuron performance issue by @ajayvohra2005 in #13589
  • [Frontend] Add backend-specific options for guided decoding by @joerunde in #13505
  • [Bugfix] Fix max_num_batched_tokens for MLA by @mgoin in #13620
  • [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth by @lingfanyu in #13245
  • Add llmaz as another integration by @kerthcet in #13643
  • [Misc] Adding script to setup ray for multi-node vllm deployments by @Edwinhr716 in #12913
  • [NVIDIA] Fix an issue to use current stream for the nvfp4 quant by @kaixih in #13632
  • Use pre-commit to update requirements-test.txt by @hmellor in #13617
  • [Bugfix] Add mm_processor_kwargs to chat-related protocols by @ywang96 in #13644
  • [V1][Sampler] Avoid an operation during temperature application by @njhill in #13587
  • Missing comment explaining VDR variable in GGUF kernels by @SzymonOzog in #13290
  • [FEATURE] Enables /score endpoint for embedding models by @gmarinho2 in #12846
  • [ci] Fix metrics test model path by @khluu in #13635
  • [Kernel]Add streamK for block-quantized CUTLASS kernels by @Hongbosherlock in #12978
  • [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation by @Isotr0py in #13586
  • fix typo of grafana dashboard, with correct datasource by @johnzheng1975 in #13668
  • [Attention] MLA with chunked prefill by @LucasWilkinson in #12639
  • [Misc] Fix yapf linting tools etc not running on pre-commit by @Isotr0py in #13695
  • docs: Add a note on full CI run in contributing guide by @terrytangyuan in #13646
  • [HTTP Server] Make model param optional in request by @youngkent in #13568
  • [Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… by @WangErXiao in #13672
  • [Misc] Capture and log the time of loading weights by @waltforme in #13666
  • [ROCM] fix native attention function call by @gongdao123 in #13650
  • [Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA by @2015aroras in #13687
  • [Misc] Bump compressed-tensors by @dsikka in #13619
  • [Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len by @WangErXiao in #13691
  • [v1] Support allowed_token_ids in v1 Sampler by @houseroad in #13210
  • [Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler by @JenZhao in #13594
  • Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size by @fabianlim in #13660
  • [V1][Metrics] Support vllm:cache_config_info by @markmc in #13299
  • [Metrics] Add --show-hidden-metrics-for-version CLI arg by @markmc in #13295
  • [Misc] Reduce LoRA-related static variable by @jeejeelee in #13166
  • [CI/Build] Fix pre-commit errors by @DarkLight1337 in #13696
  • [core] set up data parallel communication by @youkaichao in #13591
  • [ci] fix linter by @youkaichao in #13701
  • Support SSL Key Rotation in HTTP Server by @youngkent in #13495
  • [NVIDIA] Support nvfp4 cutlass gemm by @kaixih in #13571
  • [V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths by @SageMoore in #13095
  • [ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm by @gshtras in #13231
  • [Doc] Dockerfile instructions for optional dependencies and dev transformers by @DarkLight1337 in #13699
  • [Bugfix] Fix boolean conversion for OpenVINO env variable by @helena-intel in #13615
  • [XPU]fix setuptools version for xpu by @yma11 in #13548
  • [CI/Build] fix uv caching in Dockerfile by @dtrifiro in #13611
  • [CI/Build] Fix pre-commit errors from #13571 by @ywang96 in #13709
  • [BugFix] Minor: logger import in attention backend by @andylolu2 in #13706
  • [ci] Use env var to control whether to use S3 bucket in CI by @khluu in #13634
  • [Quant] BaiChuan SupportsQuant by @kylesayrs in #13710
  • [LMM] Implement merged multimodal processor for whisper by @Isotr0py in #13278
  • [Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms by @njhill in #13688
  • [Misc] Deprecate --dataset from benchmark_serving.py by @ywang96 in #13708
  • [v1] torchrun compatibility by @youkaichao in #13642
  • [V1][BugFix] Fix engine core client shutdown hangs by @njhill in #13298
  • Fix some issues with benchmark data output by @huydhn in #13641
  • [ci] Add logic to change model to S3 path only when S3 CI env var is on by @khluu in #13727
  • [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #13721
  • [model][refactor] remove cuda hard code in models and layers by @MengqingCao in #13658
  • [Bugfix] fix(logging): add missing opening square bracket by @bufferoverflow in #13011
  • [CI/Build] add python-json-logger to requirements-common by @bufferoverflow in #12842
  • Expert Parallelism (EP) Support for DeepSeek Models by @cakeng in #12583
  • [BugFix] Illegal memory access for MoE On H20 by @Abatom in #13693
  • [Misc][Docs] Raise error when flashinfer is not installed and VLLM_ATTENTION_BACKEND is set by @NickLucche in #12513
  • [V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) by @afeldman-nm in #10980
  • Revert "[V1][Core] Fix memory issue with logits & sampling" by @ywang96 in #13775
  • Fix precommit fail in fused_moe intermediate_cache2 chunking by @mgoin in #13772
  • [Misc] Clean Up EngineArgs.create_engine_config by @robertgshaw2-redhat in #13734
  • [Misc][Chore] Clean Up AsyncOutputProcessing Logs by @robertgshaw2-redhat in #13780
  • Remove unused kwargs from model definitions by @hmellor in #13555
  • [Doc] arg_utils.py: fixed a typo by @eli-b in #13785
  • [Misc] set single whitespace between log sentences by @cjackal in #13771
  • [Bugfix][Quantization] Fix FP8 + EP by @tlrmchlsmth in #13784
  • [Misc][Attention][Quantization] init property earlier by @wangxiyuan in #13733
  • [V1][Metrics] Implement vllm:lora_requests_info metric by @markmc in #13504
  • [Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" by @LucasWilkinson in #13802
  • [Bugfix] Support MLA for CompressedTensorsWNA16 by @mgoin in #13725
  • Fix CompressedTensorsWNA16MoE with grouped scales by @mgoin in #13769
  • [Core] LoRA V1 - Add add/pin/list/remove_lora functions by @varun-sundar-rabindranath in #13705
  • [Misc] Check that the model can be inspected upon registration by @DarkLight1337 in #13743
  • [Core] xgrammar: Expand list of unsupported jsonschema keywords by @russellb in #13783
  • [Bugfix] Modify modelscope api usage in transformer_utils by @shen-shanshan in #13807
  • [misc] Clean up ray compiled graph type hints by @ruisearch42 in #13731
  • [Feature] Support KV cache offloading and disagg prefill with LMCache connector. by @YaoJiayi in #12953
  • [ROCm][Quantization][Kernel] Using HIP FP8 header by @gshtras in #12593
  • [CI/Build] Fix V1 LoRA failure by @jeejeelee in #13767
  • [Misc]Clarify Error Handling for Non-existent Model Paths and HF Repo IDs by @Chen-0210 in #13724
  • [Bugfix] Initialize attention bias on the same device as Query/Key/Value by @edwardzjl in #13468
  • [Bugfix] Flush TunableOp results before worker processes are destroyed. by @naromero77amd in #13623
  • [Bugfix] Fix deepseek-vl2 inference with more than 2 images by @Isotr0py in #13818
  • Fix /v1/audio/transcriptions Bad Request Error by @HermitSun in #13811
  • [Bugfix] Revert inspection code in #13743 by @DarkLight1337 in #13832
  • Fix string parsing error by @Chen-0210 in #13825
  • [Neuron] Add custom_ops for neuron backend by @liangfu in #13246
  • Fix failing MyGemma2Embedding test by @hmellor in #13820
  • [Model] Support Grok1 by @mgoin in #13795
  • DeepSeek V2/V3/R1 only place lm_head on last pp rank by @hmellor in #13833
  • [misc] Show driver IP info when Ray fails to allocate driver worker by @ruisearch42 in #13858
  • [V1][Spec Decode] Change Spec Decode Rejection Sampling API by @LiuXiaoxuanPKU in #13729
  • [Misc]Code Cleanup by @noemotiovon in #13859
  • [Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues by @henrylhtsang in #13797
  • Improve pipeline partitioning by @hmellor in #13839
  • [Doc] fix the incorrect module path of tensorize_vllm_model by @tianyuzhou95 in #13863
  • [ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms by @SageMoore in #13844
  • [v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine by @sethkimmel3 in #13837
  • [Misc] Improve LoRA spelling by @jeejeelee in #13831
  • [Misc] Fix input processing for Ultravox by @ywang96 in #13871
  • [Bugfix] Add test example for Ultravox v0.5 by @DarkLight1337 in #13890
  • Add comments on accessing kv_cache and attn_metadata by @hmellor in #13887
  • [Bugfix] Handle None parameters in Mistral function calls. by @fgreinacher in #13786
  • [Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor by @b8zhong in #13736
  • [Bugfix] Do not crash V0 engine on input errors by @joerunde in #13101
  • [Bugfix] Update expected token counts for Ultravox tests by @DarkLight1337 in #13895
  • [TPU] use torch2.6 with whl package by @Chenyaaang in #13860
  • [Misc] fixed qwen_vl_utils parameter error by @chaunceyjiang in #13906
  • [Bugfix] Backend option to disable xgrammar any_whitespace by @wallashss in #12744
  • [BugFix] Make FP8 Linear compatible with torch.compile by @WoosukKwon in #13918
  • [Kernel] FlashMLA integration by @LucasWilkinson in #13747
  • [ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined by @HollowMan6 in #13851
  • Use CUDA 12.4 as default for release and nightly wheels by @mgoin in #12098
  • [misc] Rename Ray ADAG to Compiled Graph by @ruisearch42 in #13928
  • [ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding by @SageMoore in #13922
  • [V1][Metrics] Handle preemptions by @markmc in #13169
  • [CI/Build] Add examples/ directory to be labelled by mergify by @b8zhong in #13944
  • [Misc] fixed 'required' is an invalid argument for positionals by @chaunceyjiang in #13948
  • [PP] Correct cache size check by @zhengy001 in #13873
  • Fix test_block_fp8.py test for MoE by @mgoin in #13915
  • [VLM] Support multimodal inputs for Florence-2 models by @Isotr0py in #13320
  • [Model] Deepseek GGUF support by @SzymonOzog in #13167
  • Update quickstart.md by @observerw in #13958
  • Deduplicate .pre-commit-config.yaml's exclude by @hmellor in #13967
  • [bugfix] Fix profiling for RayDistributedExecutor by @ruisearch42 in #13945
  • Update LMFE version to v0.10.11 to support new versions of transforme… by @noamgat in #13930
  • [Bugfix] Fix qwen2.5-vl overflow issue by @Isotr0py in #13968
  • [VLM] Generalized prompt updates for multi-modal processor by @DarkLight1337 in #13964
  • [Attention] MLA support for V1 by @chenyang78 in #13789
  • Bump azure/setup-helm from 4.2.0 to 4.3.0 by @dependabot in #13742
  • [VLM] Deprecate legacy input mapper for OOT multimodal models by @DarkLight1337 in #13979
  • [ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups by @SageMoore in #13970
  • [V1][Minor] Minor cleanup for GPU Model Runner by @WoosukKwon in #13983
  • [core] Perf improvement for DSv3 on AMD GPUs by @qli88 in #13718
  • [Attention] Flash MLA for V1 by @LucasWilkinson in #13867
  • [Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict by @benchislett in #13626
  • [Misc] Print FusedMoE detail info by @jeejeelee in #13974
  • [V1]SupportsV0Only protocol for model definitions by @ywang96 in #13959
  • [Bugfix] Check that number of images matches number of <|image|> tokens with mllama by @tjohnson31415 in #13911
  • [Doc] Move multimodal Embedding API example to Online Serving page by @DarkLight1337 in #14017
  • [Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) by @hasB4K in #13987
  • Use smaller embedding model when not testing model specifically by @hmellor in #13891
  • [Hardware][Intel-Gaudi] Regional compilation support by @Kacper-Pietkun in #13213
  • [V1][Minor] Restore V1 compatibility with LLMEngine class by @Ryp in #13090
  • Update AutoAWQ docs by @hmellor in #14042
  • [Bugfix] Fix MoeWNA16Method activation by @jeejeelee in #14024
  • [VLM][Bugfix] Enable specifying prompt target via index by @DarkLight1337 in #14038
  • [Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series by @LouieYang in #14031
  • [Doc] Fix ROCm documentation by @b8zhong in #14041
  • Fix entrypoint tests for embedding models by @hmellor in #14052
  • [V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU by @vanbasten23 in #13379
  • [v1] Cleanup the BlockTable in InputBatch by @heheda12345 in #13977
  • Add RELEASE.md by @atalman in #13926
  • [v1] Move block pool operations to a separate class by @heheda12345 in #13973
  • [core] Bump ray to 2.43 by @ruisearch42 in #13994
  • [torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass by @ProExpertProg in #10902
  • [Docs] Add pipeline_parallel_size to optimization docs by @b8zhong in #14059
  • [Bugfix] Add file lock for ModelScope download by @jeejeelee in #14060
  • [Misc][Kernel]: Add GPTQAllSpark Quantization by @wyajieha in #12931
  • [Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor by @bigPYJ1151 in #14053
  • [Documentation] Add more deployment guide for Kubernetes deployment by @KuntaiDu in #13841
  • [Doc] Consolidate whisper and florence2 examples by @Isotr0py in #14050
  • [V1][Minor] Do not print attn backend twice by @WoosukKwon in #13985
  • [ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class by @SageMoore in #14065
  • [v1][Bugfix] Only cache blocks that are not in the prefix cache by @heheda12345 in #14073
  • [v1] Add __repr__ to KVCacheBlock to avoid recursive print by @heheda12345 in #14081
  • [Model] Add LoRA support for TransformersModel by @jeejeelee in #13770
  • [Misc] Accurately capture the time of loading weights by @waltforme in #14063
  • [Doc] Source building add clone step by @qux-bbb in #14086
  • [v0][structured output] Support reasoning output by @gaocegege in #12955
  • Update deprecated Python 3.8 typing by @hmellor in #13971
  • [Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure by @realShengYao in #14051
  • [Misc] typo find in deepseek_v2 by @noooop in #14106
  • [Misc][Platform] Move use allgather to platform by @MengqingCao in #14010
  • [Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 by @comaniac in #13921
  • [V1] Refactor parallel sampling support by @markmc in #13774
  • Improve the docs for TransformersModel by @hmellor in #14147
  • [ROCm] Faster Custom Paged Attention kernels by @tjtanaa in #12348
  • Fix head_dim not existing in all model configs (Transformers backend) by @hmellor in #14141
  • [V0][Metrics] Remove unimplemented vllm:tokens_total by @markmc in #14134
  • [V0][Metrics] Deprecate some KV/prefix cache metrics by @markmc in #14136
  • [V1] Simplify stats logging by @njhill in #14082
  • [WIP][V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics by @markmc in #14055
  • [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 by @mgoin in #14100
  • [Kernel] Optimize moe intermediate_cache usage by @mgoin in #13625
  • [Docs] Add GPTQModel by @Qubitium in #14056
  • [v1] Add comments to the new ragged paged attention Pallas kernel by @vanbasten23 in #14155
  • [Model] Add support for GraniteMoeShared models by @tjohnson31415 in #13313
  • [core] moe fp8 block quant tuning support by @divakar-amd in #14068
  • [Misc] Remove lru_cache in NvmlCudaPlatform by @comaniac in #14156
  • [core] Pass all driver env vars to ray workers unless excluded by @ruisearch42 in #14099
  • Use math.prod instead of np.prod for trivial ops by @zhanwenchen in #14142
  • Fix benchmark_moe.py tuning for CUDA devices by @mgoin in #14164
  • [platform] add debug logging during inferring the device type by @youkaichao in #14195
  • [sleep mode] error out with expandable_segments by @youkaichao in #14189
  • [doc] add "Failed to infer device type" to faq by @youkaichao in #14200
  • [Bugfix] Restrict MacOS CPU detection by @mgoin in #14210
  • [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs by @njhill in #13869
  • [V0][Metrics] Deprecate some questionable request time metrics by @markmc in #14135
  • [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py by @lk-chen in #14161
  • add cutlass support for blackwell fp8 gemm by @kushanam in #13798
  • [TPU][Profiler] Support start_profile/stop_profile in TPU worker by @lsy323 in #13988
  • Fix performance when --generation-config is not None by @hmellor in #14223
  • [Frontend] Do prompt_logprobs clamping for chat as well as completions by @hmellor in #14225
  • [Docs] Update Dockerfile dependency image by @mgoin in #14215
  • [v1][Metrics] Add design doc by @markmc in #12745
  • Serialize using safetensors for KV caches by @KuntaiDu in #14228
  • Clean up unused padding_idx variables across many model definitions by @tlrmchlsmth in #13240
  • [ROCm] Disable a few more kernel tests that are broken on ROCm by @SageMoore in #14145
  • [V1][TPU] TPU multimodal model support for ragged attention by @mgoin in #14158
  • [misc] announce china meetup by @youkaichao in #14248
  • Moved numba from common requirements to cuda/rocm specific requirements by @npanpaliya in #14199
  • Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 by @mgoin in #14157
  • [Bugfix] Fix gptq_marlin for deepseek-v3 by @rainkert in #13750
  • [V1][Bugfix] Do not reset prefix caching metrics by @comaniac in #14235
  • [Model] New model support for Phi-4-multimodal-instruct by @congcongchen123 in #14119
  • [V1] EP/TP MoE + DP Attention by @tlrmchlsmth in #13931
  • [platforms] improve rocm debugging info by @youkaichao in #14257
  • Temporarily disable test_awq_gemm_opcheck by @mgoin in #14251
  • [Frontend] Allow return_tokens_as_token_ids to be passed as a request param by @benchislett in #14066
  • [Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing by @ywang96 in #14256
  • [Bugfix][V1] Fix allowed_token_ids for v1 Sampler by @houseroad in #14169
  • [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID by @iacolippo in #14217
  • [Doc] [3/N] Refer code examples for common cases in dev multimodal processor by @DarkLight1337 in #14278
  • Small update for external_launcher backend docs by @zhe-thoughts in #14288
  • [V1][Frontend] Add Testing For V1 Runtime Parameters by @robertgshaw2-redhat in #14159
  • [LoRA] Remove linear hack outside transformers backend by @Isotr0py in #14177
  • [Misc] Add Qwen2MoeForCausalLM moe tuning support by @jeejeelee in #14276
  • [Doc] Fixed typo in prefix_caching.md by @DaividFrank in #14293
  • [Bugfix] Fix broken vision language example by @Isotr0py in #14292
  • [Docs] Add Meta Slides by @simon-mo in #14297
  • [V1][Minor] Remove obsolete FIXME comment by @njhill in #14304
  • Deprecate best_of Sampling Parameter in anticipation for vLLM V1 by @vincent-4 in #13997
  • [V1][BugFix] Fix for mixed top_k batch by @njhill in #14301
  • [misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env by @yangsijia-serena in #14267 (usage sketch after this list)
  • [V1][Easy] Add empty allowed_token_ids in the v1 sampler test by @houseroad in #14308
  • [Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch by @pyc96 in #14237
  • [Bugfix] Remove num_tokens_across_dp by @tlrmchlsmth in #14302
  • [BugFix] Fix prefix caching V0 MLA by @LucasWilkinson in #14255
  • [CI/Build] Use spawn multiprocessing mode for V1 test pipeline by @russellb in #14243
  • Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM by @mgoin in #13917
  • [Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation by @terrytangyuan in #13850
  • [BugFix] MLA + V1, illegal memory access and accuracy issues by @LucasWilkinson in #14253
  • [misc] Mention ray list nodes command to troubleshoot ray issues by @ruisearch42 in #14318
  • [Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 by @gaocegege in #14114
  • [V1] LoRA - Enable more V1 tests by @varun-sundar-rabindranath in #14315
  • [Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention by @NickLucche in #11301
  • [Hardware] Update the flash attn tag to support Blackwell by @pavanimajety in #14244
  • [Model] Update Paligemma multimodal processing with PromptUpdate by @kylehh in #14015
  • [VLM] Support Pixtral-HF on V1 by @lk-chen in #14275
  • [Core] Optimizing cross-attention QKVParallelLinear computation by @NickLucche in #12325
  • [Frontend][Docs] Transcription API streaming by @NickLucche in #13301
  • [Doc] Update reasoning with stream example to use OpenAI library by @liuyanyi in #14077
  • [Doc] Correct beam_search using in generative_models.md by @upayuryeva in #14363
  • [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend by @tdoublep in #14152
  • [Bugfix][Core] fix abort_seq_group and memory leak when n>1 by @courage17340 in #14326
  • [Core] Don't use cache during multi-modal profiling by @DarkLight1337 in #14336
  • [Doc] Fix date typo in README.md by @jitseklomp in #14366
  • [RLHF] use worker_extension_cls for compatibility with V0 and V1 by @youkaichao in #14185
  • Reinstate best_of for V0 by @hmellor in #14356
  • Adding cpu inference with VXE ISA for s390x architecture by @dilipgb in #12613
  • Add authors to license header. by @tdoublep in #14371
  • Fix mla prefill context performance by @ZhongYingMatrix in #13897
  • [V1] Do not detokenize if sampling param detokenize is False by @hj-mistral in #14224
  • [Distributed] Add enable_expert_parallel arg by @tlrmchlsmth in #14305
  • [CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa by @mgoin in #13569
  • [CI] Disable spawn when running V1 Test by @tdoublep in #14345
  • [Kernel] Add needs_fixed_stride_order tag to most GEMMs by @tlrmchlsmth in #14306
  • [Bugfix] Fix use_direct_call condition in FusedMoE layer for by @tlrmchlsmth in #14382
  • [Bug] Fix Attention when ignored in by quant_method by @mgoin in #14313
  • [V1][Bugfix] Standardize quantized kv cache rejection for attention backends by @mgoin in #14221
  • [Docs] Add nsight guide to profiling docs by @mgoin in #14298
  • [Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue by @yaochengji in #14310
  • [Doc] Fix a typo by @dyli-google in #14385
  • [Bugfix] Correctly call cudaProfilerStop in benchmarks script by @b8zhong in #14183
  • [Perf] Reduce MLA CPU overheads in V1 by @LucasWilkinson in #14384
  • [FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object by @ProExpertProg in #14390
  • [BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs by @LucasWilkinson in #14396
  • [Bugfix] Fix JambaForCausalLM LoRA by @jeejeelee in #14370
  • [Build] Add nightly wheel fallback when latest commit wheel unavailable by @Isotr0py in #14358
  • OpenVINO: added CPU-like conditions by @ilya-lavrenov in #14338
  • [GH] Auto-apply multi-modality label to relevant PRs by @DarkLight1337 in #14402
  • correct wrong markdown syntax by @vincent-pli in #14414
  • [Bugfix] Further clean up LoRA test by @jeejeelee in #14422
  • [Bugfix] Clean up multi-modal processors by @DarkLight1337 in #14417
  • [Misc] Set default value of seed to None by @SmartManoj in #14274
  • [BUGFIX] Skip tokenization support for throughput benchmark by @maleksan85 in #12712
  • Fix missing kv_caches and attn_metadata in OpenVINOCausalLM by @hmellor in #14271
  • Use the optimized block sizes after tuning the kernel. by @vanbasten23 in #14329
  • [V1][Core] Support for Structured Outputs by @aarnphm in #12388
  • [Doc] Update prefix_caching.md to match the example image by @York-RDWang in #14420
  • [Benchmarks] Make detokenization optional in benchmark scripts by @JArnoldAMD in #11697
  • [Kernel] optimize performance of gptq marlin kernel when n is small by @jinzhen-lin in #14138
  • [Misc] Add Phi4-MM example by @jeejeelee in #14343
  • [v1] torch.compile integration explanation by @youkaichao in #14437
  • [V1] Eagerly remove finished requests from the batch by @njhill in #14388
  • [V1][Metrics] Fix traceback with preemptions+LoRA by @markmc in #14220
  • [Bugfix] Fix torch_xla which can't handle None seed introduced in #14274 by @yarongmu-google in #14459
  • [V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC by @afeldman-nm in #13949
  • [Bugfix][V1] Handle MLA in kv_cache_interface by @tlrmchlsmth in #14462
  • Revert "[Perf] Reduce MLA CPU overheads in V1 (#14384)" by @tlrmchlsmth in #14471
  • [Bugfix][Disaggregated] Add a check in send_kv_caches_and_hidden_states and fix the reshape of the KVCache by @hasB4K in #14369
  • [MISC][V1] Register process killing handler only in the main thread by @comaniac in #14380
  • [core] add extra_args to SamplingParams by @akeshet in #13300
  • [CI/Build] refactor: set timezone of container to UTC by @bufferoverflow in #12888
  • Default to generation_config from model by @hmellor in #12622
  • [Doc]add doc for Qwen models tool calling by @WangErXiao in #14478
  • [Doc] Added QwQ-32B to the supported models list in the reasoning out… by @WangErXiao in #14479
  • [Bugfix] Make the deviceprofiler include LoRA memory. by @jeejeelee in #14469
  • Add training doc signposting to TRL by @hmellor in #14439
  • [Build/BugFix] Fix hopper 12.8 build by @LucasWilkinson in #14354
  • Add RLHF document by @hmellor in #14482
  • [CI/Build] Use a fixed seed to avoid flaky tests by @DarkLight1337 in #14480
  • [V1] TPU - Add tensor parallel support via Ray by @alexm-redhat in #13618
  • [VLM] Add TP support for Phi-4-MM by @Isotr0py in #14453
  • [Misc] add use_tqdm_on_load to reduce logs by @aarnphm in #14407
  • [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #13776
  • [benchmarks] Add option to use unique jsonschema for each request by @russellb in #14457
  • [Misc] Don't run ruff at all on 3rd party libs by @DarkLight1337 in #14493
  • Move requirements into their own directory by @hmellor in #12547
  • [Bugfix] DeepSeek Accuracy by @LucasWilkinson in #14476
  • [Bugfix] Fix profiling OOM and decouple encoder multimodal profiling by @Isotr0py in #14361
  • Update CODEOWNERS for structured output by @russellb in #14496
  • [Misc] Upgrade to Python 3.9 typing for additional directories by @DarkLight1337 in #14492
  • [V1] Support bad_words in sampler by @22quinn in #13376
  • Revert "[V1][Core] Fix memory issue with logits & sampling" by @robertgshaw2-redhat in #14504
  • [Attention] Default to FlashMLA backend for MLA by @LucasWilkinson in #14451
  • [V1][TPU] Remove unnecessary padding for running on TPU. by @vanbasten23 in #14467
  • [Feat] Support chunked prefill for LMCache connector by @YaoJiayi in #14505
  • [Bugfix] Fix tqdm progress bar when SamplingParams.n > 1 by @yanyc428 in #12428
  • [Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work by @Isotr0py in #14498
  • [Hardware][TPU] Fix the recompiling issue in logits processor after warmup by @yaochengji in #14510
  • [Misc] Ensure out-of-tree quantization method recognize by cli args by @liuyanyi in #14328
  • [Bugfix] Wrong requirements path - rocm by @martinhoyer in #14527
  • [Feature] Consolidate performance benchmark datasets by @JenZhao in #14036
  • [Misc] Add log information for handle_process_request. by @chaunceyjiang in #14130
  • [Docs] Mention model_impl arg when explaining Transformers fallback by @hmellor in #14552
  • [Frontend] support image embeds by @chaunceyjiang in #13955
  • [Kernel] Add more dtype support for GGUF kernels by @SzymonOzog in #14043
  • [Doc] Update PaliGemma note to a warning by @DarkLight1337 in #14565
  • Correct capitalisation: Github -> GitHub by @hmellor in #14561
  • [V1][Bugfix] Fix handing of second_per_grid_ts for Qwen2-VL & Qwen2.5-VL by @ywang96 in #14548
  • Correct capitalisation: VLLM -> vLLM by @hmellor in #14562
  • [Docs] Make installation URLs nicer by @hmellor in #14556
  • [Bugfix][v1] fixed llava-hf/llava-1.5-7b-hf is broken on V1 by @chaunceyjiang in #14554
  • [Perf] Improve MLA on V1 by @simon-mo in #14540
  • [Minor] Update the tqdm bar for parallel sampling by @WoosukKwon in #14571
  • [V1] LoRA - Add triton kernels for V1 by @varun-sundar-rabindranath in #13096
  • Fix typo in benchmark_serving_structured_output.py by @russellb in #14566
  • [V1] Prevent xgrammar from breaking TPU support by @russellb in #14575
  • [Kernel] moe wna16 cuda kernel by @jinzhen-lin in #13321
  • [MISC][V1] Handle exception of current_platform.get_device_name() in arg_utils by @comaniac in #14379
  • [Neuron] Add Neuron device communicator for vLLM v1 by @gnovack in #14085
  • [neuron] add reshape_and_cache by @liangfu in #14391
  • [V1][PP] Do not block engine core when no requests to schedule by @comaniac in #14585
  • [Bugfix] Fix FP16 overflow for DeepSeek V2 by @Concurrensee in #13232
  • [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #14508
  • [Misc] Correct deepseek-vl2 chat template by @Isotr0py in #14558
  • [Perf]:Optimize qwen2-vl to reduce cudaMemcpyAsync by @cynthieye in #14377
  • [VLM] Cleanup siglip legacy code and fix broken paligemma multimodal processor by @Isotr0py in #14602
  • benchmarks: simplify test jsonschema by @russellb in #14567
  • dynamic distpatch of fp8 kernels by @jeffdaily in #14245
  • [Bugfix] Update --hf-overrides for Alibaba-NLP/gte-Qwen2 by @DarkLight1337 in #14609
  • Uninstall dependencies before installing requirements/tpu.txt by @richardsliu in #14586
  • [V1] Add regex structured output support with xgrammar by @russellb in #14590
  • docs: Add documentation for s390x cpu implementation by @dilipgb in #14198
  • [BugFix/Build] Fix sparse kernels not getting built on hopper by @LucasWilkinson in #14572
  • [Hardware][Intel GPU] upgrade IPEX dependency to 2.6.10. by @jikunshang in #14564
  • [V1] Remove cache from StructuredOutputManager by @russellb in #14622
  • fix some typos : supported_head_sizes by @hackty in #14627
  • [V1] Delay all xgrammar usage until needed by @russellb in #14616
  • Fix run_tpu_test by @richardsliu in #14641
  • [V1][TPU] Pad the block_table.shape[1] so the ragged paged attention can handle correctly by @vanbasten23 in #14597
  • [Bugfix][V1][PP] Only warmup sampler at last PP rank by @comaniac in #14643
  • [release] Add commands to clean up logs on TPU release node by @khluu in #14642
  • [Feature] Add vllm bench CLI by @randyjhc in #13993
  • [core][V1] pluggable scheduler by @joerunde in #14466
  • [Doc] Update benchmarks README by @JenZhao in #14646
  • [Model] Extend Ultravox to accept audio longer than 30s by @farzadab in #13631
  • [V1][Core] Support MistralTokenizer for Structured Output by @aarnphm in #14625
  • [Core] Refactor QKVCrossParallelLinear implementation to support BNB 4-bit quantization by @Isotr0py in #14545
  • [Kernel] GGUF MoE kernel by @SzymonOzog in #14613
  • [V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing by @benchislett in #14645
  • [Kernel] Add ModelOpt FP4 Checkpoint Support by @pavanimajety in #12520
  • [CPU] Upgrade CPU backend to torch-2.6 by @bigPYJ1151 in #13381
  • [ROCm][Bugfix] Ensure that the moe_wna16_gemm kernel is not built on ROCm platforms. by @SageMoore in #14629
  • [Model] Add support for Gemma 3 by @WoosukKwon in #14660
  • [Bugfix] Missing thumbnail from NVLM-D processor by @ameyanjarlekar in #14633
  • [ROCm] Enable chunked prefill/paged attention in MLA on ROCm by @SageMoore in #14316
  • [FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. by @tjtanaa in #14664
  • [BugFix][V1] Fix parallel sampling finishing/aborts by @njhill in #14512
  • [V1] Allow sliding window + prefix caching by @WoosukKwon in #13069
  • [release] Add force remove for TPU logs by @khluu in #14697
  • [bugfix] fixup warning message for plugged schedulers for v1 by @joerunde in #14700
  • Add ray[data] as tpu dependency by @richardsliu in #14691
  • [ROCm][FP8] Fix for adjustments needed only for fnuz by @gshtras in #14689
  • [BugFix][TritonMLA] Process weights after model loading for GGUF by @tywuAMD in #14555
  • [Config][Disaggregated] Add timeout configuration for the torch.store and add KVTransferConfig.kv_connector_extra_config by @hasB4K in #14367
  • [V1][TPU] Add assertion on multi-step-scheduler by @lsy323 in #14707
  • [Quant] BartModel SupportsQuant by @kylesayrs in #14699
  • [Quant] Bamba SupportsQuant by @kylesayrs in #14698
  • [Bugfix] Fix chunked prefill for GGUF by @SzymonOzog in #14666
  • [CI/Build] Delete ultravox LoRA test by @jeejeelee in #14730
  • [Bugfix] fix benchmark moe by @jeejeelee in #14653
  • [VLM] Support pan-and-scan for Gemma3 multi-modal processor by @DarkLight1337 in #14672
  • [VLM] Support loading InternVideo2.5 models as original InternVLChatModel by @Isotr0py in #14738
  • [Bugfix] Fix prompt format of GLM4V by @DarkLight1337 in #14539
  • [V1][Minor] Minor enhancements on scheduler by @WoosukKwon in #14732
  • [Misc] Clean up processor tests by @DarkLight1337 in #14771
  • [V1][Core] using cached vocab_size for Structured Outputs by @aarnphm in #14630
  • [V1] Detokenizer: Respect Stop Tokens + not include_stop_str_in_output by @afeldman-nm in #14624
  • [Attention] Remove slow setattr in MLA by @LucasWilkinson in #14769
  • [Doc] Fix typo in documentation by @yasu52 in #14783
  • [Doc] Fix small typo in Transformers fallback by @heheda12345 in #14791
  • [V1] TPU - Enable prefix caching by default by @alexm-redhat in #14773
  • forward fix PR 14245, restore build on ROCm 6.2 by @jeffdaily in #14709
  • [V1] Move OOM check into sampler run by @ywang96 in #14728
  • [V1] Temporarily disable FlashInfer Rejection Sampler by @WoosukKwon in #14788
  • [Kernel] LoRA - Enable CUDAGraphs for V1 by @varun-sundar-rabindranath in #14626
  • [Kernel] [V1] Further optimizations to ROCm (Triton) Backend to better handle GQA. by @tdoublep in #14431
  • [Bugfix][IPEX] Add VLLM_CPU_MOE_PREPACK to allow disabling MoE prepack when CPU does not support it by @gau-nernst in #14681
  • [ci] Reduce number of tests in fastcheck by @khluu in #14782
  • [Misc][Minor] Simplify SamplingParams.__post_init__() by @njhill in #14772
  • [Neuron] flatten test parameterization for neuron attention kernels by @liangfu in #14712
  • [Feature] Add visionarena offline support for benchmark_throughput by @JenZhao in #14654
  • [CI] Fix missing example model id in processor test by @ywang96 in #14787
  • [Attention] MLA get rid of materialization by @LucasWilkinson in #14770
  • [Bugfix][Kernel][CPU] Fix num_tokens in CPU rotary embedding kernel by @gau-nernst in #14667
  • [BugFix]Fix performance serving benchmark when enable profiling by @Potabk in #14737
  • [Misc] Clean up type annotation for SupportsMultiModal by @DarkLight1337 in #14794
  • [Bugfix] Fix small typo in the example of Streaming delimiter by @bravo325806 in #14793
  • [Misc] Gemma3ForConditionalGeneration supports LoRA by @jeejeelee in #14797
  • [V1][Minor] Minor code cleanup for scheduling metrics by @WoosukKwon in #14800
  • [Bugfix][W8A8] fixed cutlass block fp8 binding by @DefTruth in #14796
  • [VLM] Various cleanup and fixes by @DarkLight1337 in #14806
  • [BugFix]: properly catch templating error when preprocess input by @gcalmettes in #13976
  • [Bugfix] Fix Aria test loading by @DarkLight1337 in #14823
  • [V1] Fix vocab size calculation for structured output by @russellb in #14826
  • [Frontend] Fix log message to use http vs https by @russellb in #14774
  • [V1][Metrics] Updated list of deprecated metrics in v0.8 by @markmc in #14695
  • [Frontend] track server_load by @daniel-salib in #13950
  • [Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable for CUDA < 12.0 by @wyajieha in #14430
  • [release] Remove log cleanup commands from TPU job by @khluu in #14838
  • Re-enable the AMD Entrypoints Test by @Alexei-V-Ivanov-AMD in #14711
  • [Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies by @cyang49 in #14778
  • [V1] Fix model parameterization for structured output tests by @russellb in #14833
  • Update to torch==2.6.0 by @mgoin in #12721
  • [CI] Add TPU v1 test by @richardsliu in #14834
  • [Build/CI] Move ninja to common deps by @russellb in #14835
  • [Build/CI] Upgrade aiohttp to include CVE fix by @russellb in #14840
  • [Doc] More neutral K8s deployment guide by @terrytangyuan in #14084
  • [Bugfix] Fix torch_xla in V0 which can't handle None seed introduced … by @yarongmu-google in #14844
  • [Neuron][CI] update docker run command by @liangfu in #14829
  • [Bugfix][V1] Fix flashinfer sampling by @DefTruth in #14815
  • Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U… by @tlrmchlsmth in #14848
  • Disable outlines cache by default by @russellb in #14837
  • [Misc] Remove misleading message in gemma2 and gemma3 by @Isotr0py in #14850
  • [Misc][Easy] Annotate unused vars in the csrc files by @houseroad in #14798
  • [V1] V1 Enablement Oracle by @robertgshaw2-redhat in #13726
  • [Docs] Add new East Coast vLLM Meetup slides to README and meetups.md by @simon-mo in #14852
  • [CPU] Support FP8 KV cache by @bigPYJ1151 in #14741
  • [Attention] Get rid of mla cache alignment by @LucasWilkinson in #14842
  • [CI/Build] Delete LoRA bias test by @jeejeelee in #14849
  • [V1][Structured Output] calculate vocab_size eagerly by @aarnphm in #14851
  • [Doc] V1 user guide by @JenZhao in #13991
  • [Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes by @russellb in #14839
  • [Bugfix] EAGLE output norm bug by @luyuzhe111 in #14464
  • [VLM] Limit multimodal input cache by memory by @DarkLight1337 in #14805
  • [CI][Intel GPU] refine intel GPU ci docker build by @jikunshang in #14860
  • [Core] Expose API endpoint /is_sleeping by @waltforme in #14312
  • [VLM] Merged multi-modal processor for Pixtral by @Flechman in #12211
  • [Misc][Doc] Minor benchmark README update by @ywang96 in #14874
  • [VLM] Clean up Phi-4-MM ViT implementation by @Isotr0py in #14812
  • [V1] Remove V0 fallback for mistral-tokenizer by @ywang96 in #14873
  • [Kernel] Add more tuned configs by @simon-mo in #14877
  • [BugFix] Fix torch distributed stateless PG backend init by @njhill in #14870
  • [V1] [Spec Decode] Fix ngram tests by @LiuXiaoxuanPKU in #14878
  • [Bugfix] Limit profiling run sequence length by max_model_len by @kylesayrs in #14785
  • [Bugfix] Explicitly disable Phi-4-multimodal in V1 by @DarkLight1337 in #14889
  • Revert "[Bugfix] Limit profiling run sequence length by max_model_len (#14785) by @DarkLight1337 in #14892
  • [BugFix][V1] Fix overhead related to bad_words sampling when not in use by @njhill in #14894
  • [V1][BugFix] Detect interleaved sliding window attention by @WoosukKwon in #14896
  • [Misc] Catching Ray Compiled Graph PP test failures for V1 by @ruisearch42 in #14847
  • [Doc] Add guidance for using ccache with pip install -e . in doc by @vadiklyutiy in #14901
  • [V1] Enable Entrypoints Tests by @robertgshaw2-redhat in #14903
  • [CI] Fix Tool Calling Tests by @robertgshaw2-redhat in #14898
  • [CI/Build] Update defaults for test reproducibility by @DarkLight1337 in #14893
  • [V1] Optimize the overhead of rewinding by @WoosukKwon in #14905
  • [V1][Minor] Add repr to ConstantList by @WoosukKwon in #14907
  • [BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda context by @LucasWilkinson in #14910
  • [Misc] Replace os environ to monkeypatch in test suite by @t-sibiraj in #14516
  • [Benchmark] Do not save detailed info to json by default by @simon-mo in #14879
  • [V1] [Spec Decode] Support random sampling for spec decode by @LiuXiaoxuanPKU in #13933
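
As one illustration of the new FlashMLA option referenced above (#14267), here is a minimal sketch of selecting it via the VLLM_ATTENTION_BACKEND environment variable. The backend value "FLASHMLA" and the model shown are assumptions for illustration and should be checked against your installed version:

```python
import os

# Assumed backend name for the FlashMLA kernels added in #14267.
# Set the env var before importing vLLM so the backend choice is picked up.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHMLA"

from vllm import LLM

# Hypothetical example model; FlashMLA targets MLA-based models such as
# the DeepSeek family.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
print(llm.generate(["Hello"])[0].outputs[0].text)
```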

New Contributors

  • @ajayvohra2005 made their first contribution in #13589
  • @Edwinhr716 made their first contribution in #12913
  • @Hongbosherlock made their first contribution in #12978
  • @johnzheng1975 made their first contribution in #13668
  • @JenZhao made their first contribution in #13594
  • @bufferoverflow made their first contribution in #13011
  • @cakeng made their first contribution in #12583
  • @eli-b made their first contribution in #13785
  • @YaoJiayi made their first contribution in #12953
  • @edwardzjl made their first contribution in #13468
  • @naromero77amd made their first contribution in #13623
  • @henrylhtsang made their first contribution in #13797
  • @tianyuzhou95 made their first contribution in #13863
  • @b8zhong made their first contribution in #13736
  • @Chenyaaang made their first contribution in #13860
  • @observerw made their first contribution in #13958
  • @qli88 made their first contribution in #13718
  • @benchislett made their first contribution in #13626
  • @hasB4K made their first contribution in #13987
  • @Kacper-Pietkun made their first contribution in #13213
  • @Ryp made their first contribution in #13090
  • @LouieYang made their first contribution in #14031
  • @vanbasten23 made their first contribution in #13379
  • @atalman made their first contribution in #13926
  • @wyajieha made their first contribution in #12931
  • @qux-bbb made their first contribution in #14086
  • @realShengYao made their first contribution in #14051
  • @zhanwenchen made their first contribution in #14142
  • @rainkert made their first contribution in #13750
  • @congcongchen123 made their first contribution in #14119
  • @iacolippo made their first contribution in #14217
  • @zhe-thoughts made their first contribution in #14288
  • @DaividFrank made their first contribution in #14293
  • @vincent-4 made their first contribution in #13997
  • @yangsijia-serena made their first contribution in #14267
  • @pyc96 made their first contribution in #14237
  • @upayuryeva made their first contribution in #14363
  • @courage17340 made their first contribution in #14326
  • @dilipgb made their first contribution in #12613
  • @ZhongYingMatrix made their first contribution in #13897
  • @hj-mistral made their first contribution in #14224
  • @yaochengji made their first contribution in #14310
  • @dyli-google made their first contribution in #14385
  • @vincent-pli made their first contribution in #14414
  • @York-RDWang made their first contribution in #14420
  • @yarongmu-google made their first contribution in #14459
  • @22quinn made their first contribution in #13376
  • @yanyc428 made their first contribution in #12428
  • @martinhoyer made their first contribution in #14527
  • @gnovack made their first contribution in #14085
  • @cynthieye made their first contribution in #14377
  • @jeffdaily made their first contribution in #14245
  • @hackty made their first contribution in #14627
  • @randyjhc made their first contribution in #13993
  • @ameyanjarlekar made their first contribution in #14633
  • @tywuAMD made their first contribution in #14555
  • @yasu52 made their first contribution in #14783
  • @gau-nernst made their first contribution in #14681
  • @Potabk made their first contribution in #14737
  • @bravo325806 made their first contribution in #14793
  • @daniel-salib made their first contribution in #13950
  • @cyang49 made their first contribution in #14778
  • @luyuzhe111 made their first contribution in #14464
  • @Flechman made their first contribution in #12211
  • @vadiklyutiy made their first contribution in #14901
  • @t-sibiraj made their first contribution in #14516

Full Changelog: v0.7.3...v0.8.0rc1
