vllm-project/vllm v0.8.0rc1

Note: vLLM no longer sets the global seed (#14274). Please set the seed parameter if you need to reproduce your results.
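
For example, a minimal sketch of pinning the seed explicitly under the new default; the model name and seed values here are illustrative, not taken from the release notes:

```python
from vllm import LLM, SamplingParams

# Engine-level seed: vLLM no longer sets a global seed by default (#14274),
# so pass it explicitly when you need reproducible runs.
llm = LLM(model="facebook/opt-125m", seed=0)

# Per-request seed for sampling, useful when temperature > 0.
params = SamplingParams(temperature=0.8, seed=0)

outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```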

What's Changed

  • Update pre-commit's isort version to remove warnings by @hmellor in #13614
  • [V1][Minor] Print KV cache size in token counts by @WoosukKwon in #13596
  • fix neuron performance issue by @ajayvohra2005 in #13589
  • [Frontend] Add backend-specific options for guided decoding by @joerunde in #13505
  • [Bugfix] Fix max_num_batched_tokens for MLA by @mgoin in #13620
  • [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth by @lingfanyu in #13245
  • Add llmaz as another integration by @kerthcet in #13643
  • [Misc] Adding script to setup ray for multi-node vllm deployments by @Edwinhr716 in #12913
  • [NVIDIA] Fix an issue to use current stream for the nvfp4 quant by @kaixih in #13632
  • Use pre-commit to update requirements-test.txt by @hmellor in #13617
  • [Bugfix] Add mm_processor_kwargs to chat-related protocols by @ywang96 in #13644
  • [V1][Sampler] Avoid an operation during temperature application by @njhill in #13587
  • Missing comment explaining VDR variable in GGUF kernels by @SzymonOzog in #13290
  • [FEATURE] Enables /score endpoint for embedding models by @gmarinho2 in #12846
  • [ci] Fix metrics test model path by @khluu in #13635
  • [Kernel]Add streamK for block-quantized CUTLASS kernels by @Hongbosherlock in #12978
  • [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation by @Isotr0py in #13586
  • fix typo of grafana dashboard, with correct datasource by @johnzheng1975 in #13668
  • [Attention] MLA with chunked prefill by @LucasWilkinson in #12639
  • [Misc] Fix yapf linting tools etc not running on pre-commit by @Isotr0py in #13695
  • docs: Add a note on full CI run in contributing guide by @terrytangyuan in #13646
  • [HTTP Server] Make model param optional in request by @youngkent in #13568
  • [Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… by @WangErXiao in #13672
  • [Misc] Capture and log the time of loading weights by @waltforme in #13666
  • [ROCM] fix native attention function call by @gongdao123 in #13650
  • [Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA by @2015aroras in #13687
  • [Misc] Bump compressed-tensors by @dsikka in #13619
  • [Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len by @WangErXiao in #13691
  • [v1] Support allowed_token_ids in v1 Sampler by @houseroad in #13210
  • [Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler by @JenZhao in #13594
  • Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size by @fabianlim in #13660
  • [V1][Metrics] Support vllm:cache_config_info by @markmc in #13299
  • [Metrics] Add --show-hidden-metrics-for-version CLI arg by @markmc in #13295
  • [Misc] Reduce LoRA-related static variable by @jeejeelee in #13166
  • [CI/Build] Fix pre-commit errors by @DarkLight1337 in #13696
  • [core] set up data parallel communication by @youkaichao in #13591
  • [ci] fix linter by @youkaichao in #13701
  • Support SSL Key Rotation in HTTP Server by @youngkent in #13495
  • [NVIDIA] Support nvfp4 cutlass gemm by @kaixih in #13571
  • [V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths by @SageMoore in #13095
  • [ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm by @gshtras in #13231
  • [Doc] Dockerfile instructions for optional dependencies and dev transformers by @DarkLight1337 in #13699
  • [Bugfix] Fix boolean conversion for OpenVINO env variable by @helena-intel in #13615
  • [XPU]fix setuptools version for xpu by @yma11 in #13548
  • [CI/Build] fix uv caching in Dockerfile by @dtrifiro in #13611
  • [CI/Build] Fix pre-commit errors from #13571 by @ywang96 in #13709
  • [BugFix] Minor: logger import in attention backend by @andylolu2 in #13706
  • [ci] Use env var to control whether to use S3 bucket in CI by @khluu in #13634
  • [Quant] BaiChuan SupportsQuant by @kylesayrs in #13710
  • [LMM] Implement merged multimodal processor for whisper by @Isotr0py in #13278
  • [Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms by @njhill in #13688
  • [Misc] Deprecate --dataset from benchmark_serving.py by @ywang96 in #13708
  • [v1] torchrun compatibility by @youkaichao in #13642
  • [V1][BugFix] Fix engine core client shutdown hangs by @njhill in #13298
  • Fix some issues with benchmark data output by @huydhn in #13641
  • [ci] Add logic to change model to S3 path only when S3 CI env var is on by @khluu in #13727
  • [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #13721
  • [model][refactor] remove cuda hard code in models and layers by @MengqingCao in #13658
  • [Bugfix] fix(logging): add missing opening square bracket by @bufferoverflow in #13011
  • [CI/Build] add python-json-logger to requirements-common by @bufferoverflow in #12842
  • Expert Parallelism (EP) Support for DeepSeek Models by @cakeng in #12583
  • [BugFix] Illegal memory access for MoE On H20 by @Abatom in #13693
  • [Misc][Docs] Raise error when flashinfer is not installed and VLLM_ATTENTION_BACKEND is set by @NickLucche in #12513
  • [V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) by @afeldman-nm in #10980
  • Revert "[V1][Core] Fix memory issue with logits & sampling" by @ywang96 in #13775
  • Fix precommit fail in fused_moe intermediate_cache2 chunking by @mgoin in #13772
  • [Misc] Clean Up EngineArgs.create_engine_config by @robertgshaw2-redhat in #13734
  • [Misc][Chore] Clean Up AsyncOutputProcessing Logs by @robertgshaw2-redhat in #13780
  • Remove unused kwargs from model definitions by @hmellor in #13555
  • [Doc] arg_utils.py: fixed a typo by @eli-b in #13785
  • [Misc] set single whitespace between log sentences by @cjackal in #13771
  • [Bugfix][Quantization] Fix FP8 + EP by @tlrmchlsmth in #13784
  • [Misc][Attention][Quantization] init property earlier by @wangxiyuan in #13733
  • [V1][Metrics] Implement vllm:lora_requests_info metric by @markmc in #13504
  • [Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" by @LucasWilkinson in #13802
  • [Bugfix] Support MLA for CompressedTensorsWNA16 by @mgoin in #13725
  • Fix CompressedTensorsWNA16MoE with grouped scales by @mgoin in #13769
  • [Core] LoRA V1 - Add add/pin/list/remove_lora functions by @varun-sundar-rabindranath in #13705
  • [Misc] Check that the model can be inspected upon registration by @DarkLight1337 in #13743
  • [Core] xgrammar: Expand list of unsupported jsonschema keywords by @russellb in #13783
  • [Bugfix] Modify modelscope api usage in transformer_utils by @shen-shanshan in #13807
  • [misc] Clean up ray compiled graph type hints by @ruisearch42 in #13731
  • [Feature] Support KV cache offloading and disagg prefill with LMCache connector. by @YaoJiayi in #12953
  • [ROCm][Quantization][Kernel] Using HIP FP8 header by @gshtras in #12593
  • [CI/Build] Fix V1 LoRA failure by @jeejeelee in #13767
  • [Misc]Clarify Error Handling for Non-existent Model Paths and HF Repo IDs by @Chen-0210 in #13724
  • [Bugfix] Initialize attention bias on the same device as Query/Key/Value by @edwardzjl in #13468
  • [Bugfix] Flush TunableOp results before worker processes are destroyed. by @naromero77amd in #13623
  • [Bugfix] Fix deepseek-vl2 inference with more than 2 images by @Isotr0py in #13818
  • Fix /v1/audio/transcriptions Bad Request Error by @HermitSun in #13811
  • [Bugfix] Revert inspection code in #13743 by @DarkLight1337 in #13832
  • Fix string parsing error by @Chen-0210 in #13825
  • [Neuron] Add custom_ops for neuron backend by @liangfu in #13246
  • Fix failing MyGemma2Embedding test by @hmellor in #13820
  • [Model] Support Grok1 by @mgoin in #13795
  • DeepSeek V2/V3/R1 only place lm_head on last pp rank by @hmellor in #13833
  • [misc] Show driver IP info when Ray fails to allocate driver worker by @ruisearch42 in #13858
  • [V1][Spec Decode] Change Spec Decode Rejection Sampling API by @LiuXiaoxuanPKU in #13729
  • [Misc]Code Cleanup by @noemotiovon in #13859
  • [Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues by @henrylhtsang in #13797
  • Improve pipeline partitioning by @hmellor in #13839
  • [Doc] fix the incorrect module path of tensorize_vllm_model by @tianyuzhou95 in #13863
  • [ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms by @SageMoore in #13844
  • [v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine by @sethkimmel3 in #13837
  • [Misc] Improve LoRA spelling by @jeejeelee in #13831
  • [Misc] Fix input processing for Ultravox by @ywang96 in #13871
  • [Bugfix] Add test example for Ultravox v0.5 by @DarkLight1337 in #13890
  • Add comments on accessing kv_cache and attn_metadata by @hmellor in #13887
  • [Bugfix] Handle None parameters in Mistral function calls. by @fgreinacher in #13786
  • [Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor by @b8zhong in #13736
  • [Bugfix] Do not crash V0 engine on input errors by @joerunde in #13101
  • [Bugfix] Update expected token counts for Ultravox tests by @DarkLight1337 in #13895
  • [TPU] use torch2.6 with whl package by @Chenyaaang in #13860
  • [Misc] fixed qwen_vl_utils parameter error by @chaunceyjiang in #13906
  • [Bugfix] Backend option to disable xgrammar any_whitespace by @wallashss in #12744
  • [BugFix] Make FP8 Linear compatible with torch.compile by @WoosukKwon in #13918
  • [Kernel] FlashMLA integration by @LucasWilkinson in #13747
  • [ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined by @HollowMan6 in #13851
  • Use CUDA 12.4 as default for release and nightly wheels by @mgoin in #12098
  • [misc] Rename Ray ADAG to Compiled Graph by @ruisearch42 in #13928
  • [ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding by @SageMoore in #13922
  • [V1][Metrics] Handle preemptions by @markmc in #13169
  • [CI/Build] Add examples/ directory to be labelled by mergify by @b8zhong in #13944
  • [Misc] fixed 'required' is an invalid argument for positionals by @chaunceyjiang in #13948
  • [PP] Correct cache size check by @zhengy001 in #13873
  • Fix test_block_fp8.py test for MoE by @mgoin in #13915
  • [VLM] Support multimodal inputs for Florence-2 models by @Isotr0py in #13320
  • [Model] Deepseek GGUF support by @SzymonOzog in #13167
  • Update quickstart.md by @observerw in #13958
  • Deduplicate .pre-commit-config.yaml's exclude by @hmellor in #13967
  • [bugfix] Fix profiling for RayDistributedExecutor by @ruisearch42 in #13945
  • Update LMFE version to v0.10.11 to support new versions of transforme… by @noamgat in #13930
  • [Bugfix] Fix qwen2.5-vl overflow issue by @Isotr0py in #13968
  • [VLM] Generalized prompt updates for multi-modal processor by @DarkLight1337 in #13964
  • [Attention] MLA support for V1 by @chenyang78 in #13789
  • Bump azure/setup-helm from 4.2.0 to 4.3.0 by @dependabot in #13742
  • [VLM] Deprecate legacy input mapper for OOT multimodal models by @DarkLight1337 in #13979
  • [ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups by @SageMoore in #13970
  • [V1][Minor] Minor cleanup for GPU Model Runner by @WoosukKwon in #13983
  • [core] Perf improvement for DSv3 on AMD GPUs by @qli88 in #13718
  • [Attention] Flash MLA for V1 by @LucasWilkinson in #13867
  • [Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict by @benchislett in #13626
  • [Misc] Print FusedMoE detail info by @jeejeelee in #13974
  • [V1]SupportsV0Only protocol for model definitions by @ywang96 in #13959
  • [Bugfix] Check that number of images matches number of <|image|> tokens with mllama by @tjohnson31415 in #13911
  • [Doc] Move multimodal Embedding API example to Online Serving page by @DarkLight1337 in #14017
  • [Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) by @hasB4K in #13987
  • Use smaller embedding model when not testing model specifically by @hmellor in #13891
  • [Hardware][Intel-Gaudi] Regional compilation support by @Kacper-Pietkun in #13213
  • [V1][Minor] Restore V1 compatibility with LLMEngine class by @Ryp in #13090
  • Update AutoAWQ docs by @hmellor in #14042
  • [Bugfix] Fix MoeWNA16Method activation by @jeejeelee in #14024
  • [VLM][Bugfix] Enable specifying prompt target via index by @DarkLight1337 in #14038
  • [Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series by @LouieYang in #14031
  • [Doc] Fix ROCm documentation by @b8zhong in #14041
  • Fix entrypoint tests for embedding models by @hmellor in #14052
  • [V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU by @vanbasten23 in #13379
  • [v1] Cleanup the BlockTable in InputBatch by @heheda12345 in #13977
  • Add RELEASE.md by @atalman in #13926
  • [v1] Move block pool operations to a separate class by @heheda12345 in #13973
  • [core] Bump ray to 2.43 by @ruisearch42 in #13994
  • [torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass by @ProExpertProg in #10902
  • [Docs] Add pipeline_parallel_size to optimization docs by @b8zhong in #14059
  • [Bugfix] Add file lock for ModelScope download by @jeejeelee in #14060
  • [Misc][Kernel]: Add GPTQAllSpark Quantization by @wyajieha in #12931
  • [Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor by @bigPYJ1151 in #14053
  • [Documentation] Add more deployment guide for Kubernetes deployment by @KuntaiDu in #13841
  • [Doc] Consolidate whisper and florence2 examples by @Isotr0py in #14050
  • [V1][Minor] Do not print attn backend twice by @WoosukKwon in #13985
  • [ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class by @SageMoore in #14065
  • [v1][Bugfix] Only cache blocks that are not in the prefix cache by @heheda12345 in #14073
  • [v1] Add __repr__ to KVCacheBlock to avoid recursive print by @heheda12345 in #14081
  • [Model] Add LoRA support for TransformersModel by @jeejeelee in #13770
  • [Misc] Accurately capture the time of loading weights by @waltforme in #14063
  • [Doc] Source building add clone step by @qux-bbb in #14086
  • [v0][structured output] Support reasoning output by @gaocegege in #12955
  • Update deprecated Python 3.8 typing by @hmellor in #13971
  • [Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure by @realShengYao in #14051
  • [Misc] typo find in deepseek_v2 by @noooop in #14106
  • [Misc][Platform] Move use allgather to platform by @MengqingCao in #14010
  • [Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 by @comaniac in #13921
  • [V1] Refactor parallel sampling support by @markmc in #13774
  • Improve the docs for TransformersModel by @hmellor in #14147
  • [ROCm] Faster Custom Paged Attention kernels by @tjtanaa in #12348
  • Fix head_dim not existing in all model configs (Transformers backend) by @hmellor in #14141
  • [V0][Metrics] Remove unimplemented vllm:tokens_total by @markmc in #14134
  • [V0][Metrics] Deprecate some KV/prefix cache metrics by @markmc in #14136
  • [V1] Simplify stats logging by @njhill in #14082
  • [WIP][V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics by @markmc in #14055
  • [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 by @mgoin in #14100
  • [Kernel] Optimize moe intermediate_cache usage by @mgoin in #13625
  • [Docs] Add GPTQModel by @Qubitium in #14056
  • [v1] Add comments to the new ragged paged attention Pallas kernel by @vanbasten23 in #14155
  • [Model] Add support for GraniteMoeShared models by @tjohnson31415 in #13313
  • [core] moe fp8 block quant tuning support by @divakar-amd in #14068
  • [Misc] Remove lru_cache in NvmlCudaPlatform by @comaniac in #14156
  • [core] Pass all driver env vars to ray workers unless excluded by @ruisearch42 in #14099
  • Use math.prod instead of np.prod for trivial ops by @zhanwenchen in #14142
  • Fix benchmark_moe.py tuning for CUDA devices by @mgoin in #14164
  • [platform] add debug logging during inferring the device type by @youkaichao in #14195
  • [sleep mode] error out with expandable_segments by @youkaichao in #14189
  • [doc] add "Failed to infer device type" to faq by @youkaichao in #14200
  • [Bugfix] Restrict MacOS CPU detection by @mgoin in #14210
  • [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs by @njhill in #13869
  • [V0][Metrics] Deprecate some questionable request time metrics by @markmc in #14135
  • [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py by @lk-chen in #14161
  • add cutlass support for blackwell fp8 gemm by @kushanam in #13798
  • [TPU][Profiler] Support start_profile/stop_profile in TPU worker by @lsy323 in #13988
  • Fix performance when --generation-config is not None by @hmellor in #14223
  • [Frontend] Do prompt_logprobs clamping for chat as well as completions by @hmellor in #14225
  • [Docs] Update Dockerfile dependency image by @mgoin in #14215
  • [v1][Metrics] Add design doc by @markmc in #12745
  • Serialize using safetensors for KV caches by @KuntaiDu in #14228
  • Clean up unused padding_idx variables across many model definitions by @tlrmchlsmth in #13240
  • [ROCm] Disable a few more kernel tests that are broken on ROCm by @SageMoore in #14145
  • [V1][TPU] TPU multimodal model support for ragged attention by @mgoin in #14158
  • [misc] announce china meetup by @youkaichao in #14248
  • Moved numba from common requirements to cuda/rocm specific requirements by @npanpaliya in #14199
  • Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 by @mgoin in #14157
  • [Bugfix] Fix gptq_marlin for deepseek-v3 by @rainkert in #13750
  • [V1][Bugfix] Do not reset prefix caching metrics by @comaniac in #14235
  • [Model] New model support for Phi-4-multimodal-instruct by @congcongchen123 in #14119
  • [V1] EP/TP MoE + DP Attention by @tlrmchlsmth in #13931
  • [platforms] improve rocm debugging info by @youkaichao in #14257
  • Temporarily disable test_awq_gemm_opcheck by @mgoin in #14251
  • [Frontend] Allow return_tokens_as_token_ids to be passed as a request param by @benchislett in #14066
  • [Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing by @ywang96 in #14256
  • [Bugfix][V1] Fix allowed_token_ids for v1 Sampler by @houseroad in #14169
  • [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID by @iacolippo in #14217
  • [Doc] [3/N] Refer code examples for common cases in dev multimodal processor by @DarkLight1337 in #14278
  • Small update for external_launcher backend docs by @zhe-thoughts in #14288
  • [V1][Frontend] Add Testing For V1 Runtime Parameters by @robertgshaw2-redhat in #14159
  • [LoRA] Remove linear hack outside transformers backend by @Isotr0py in #14177
  • [Misc] Add Qwen2MoeForCausalLM moe tuning support by @jeejeelee in #14276
  • [Doc] Fixed typo in prefix_caching.md by @DaividFrank in #14293
  • [Bugfix] Fix broken vision language example by @Isotr0py in #14292
  • [Docs] Add Meta Slides by @simon-mo in #14297
  • [V1][Minor] Remove obsolete FIXME comment by @njhill in #14304
  • Deprecate best_of Sampling Parameter in anticipation for vLLM V1 by @vincent-4 in #13997
  • [V1][BugFix] Fix for mixed top_k batch by @njhill in #14301
  • [misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env by @yangsijia-serena in #14267 (usage sketch after this list)
  • [V1][Easy] Add empty allowed_token_ids in the v1 sampler test by @houseroad in #14308
  • [Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch by @pyc96 in #14237
  • [Bugfix] Remove num_tokens_across_dp by @tlrmchlsmth in #14302
  • [BugFix] Fix prefix caching V0 MLA by @LucasWilkinson in #14255
  • [CI/Build] Use spawn multiprocessing mode for V1 test pipeline by @russellb in #14243
  • Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM by @mgoin in #13917
  • [Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation by @terrytangyuan in #13850
  • [BugFix] MLA + V1, illegal memory access and accuracy issues by @LucasWilkinson in #14253
  • [misc] Mention ray list nodes command to troubleshoot ray issues by @ruisearch42 in #14318
  • [Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 by @gaocegege in #14114
  • [V1] LoRA - Enable more V1 tests by @varun-sundar-rabindranath in #14315
  • [Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention by @NickLucche in #11301
  • [Hardware] Update the flash attn tag to support Blackwell by @pavanimajety in #14244
  • [Model] Update Paligemma multimodal processing with PromptUpdate by @kylehh in #14015
  • [VLM] Support Pixtral-HF on V1 by @lk-chen in #14275
  • [Core] Optimizing cross-attention QKVParallelLinear computation by @NickLucche in #12325
  • [Frontend][Docs] Transcription API streaming by @NickLucche in #13301
  • [Doc] Update reasoning with stream example to use OpenAI library by @liuyanyi in #14077
  • [Doc] Correct beam_search using in generative_models.md by @upayuryeva in #14363
  • [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend by @tdoublep in #14152
  • [Bugfix][Core] fix abort_seq_group and memory leak when n>1 by @courage17340 in #14326
  • [Core] Don't use cache during multi-modal profiling by @DarkLight1337 in #14336
  • [Doc] Fix date typo in README.md by @jitseklomp in #14366
  • [RLHF] use worker_extension_cls for compatibility with V0 and V1 by @youkaichao in #14185
  • Reinstate best_of for V0 by @hmellor in #14356
  • Adding cpu inference with VXE ISA for s390x architecture by @dilipgb in #12613
  • Add authors to license header. by @tdoublep in #14371
  • Fix mla prefill context performance by @ZhongYingMatrix in #13897
  • [V1] Do not detokenize if sampling param detokenize is False by @hj-mistral in #14224
  • [Distributed] Add enable_expert_parallel arg by @tlrmchlsmth in #14305
  • [CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa by @mgoin in #13569
  • [CI] Disable spawn when running V1 Test by @tdoublep in #14345
  • [Kernel] Add needs_fixed_stride_order tag to most GEMMs by @tlrmchlsmth in #14306
  • [Bugfix] Fix use_direct_call condition in FusedMoE layer for by @tlrmchlsmth in #14382
  • [Bug] Fix Attention when ignored in by quant_method by @mgoin in #14313
  • [V1][Bugfix] Standardize quantized kv cache rejection for attention backends by @mgoin in #14221
  • [Docs] Add nsight guide to profiling docs by @mgoin in #14298
  • [Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue by @yaochengji in #14310
  • [Doc] Fix a typo by @dyli-google in #14385
  • [Bugfix] Correctly call cudaProfilerStop in benchmarks script by @b8zhong in #14183
  • [Perf] Reduce MLA CPU overheads in V1 by @LucasWilkinson in #14384
  • [FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object by @ProExpertProg in #14390
  • [BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs by @LucasWilkinson in #14396
  • [Bugfix] Fix JambaForCausalLM LoRA by @jeejeelee in #14370
  • [Build] Add nightly wheel fallback when latest commit wheel unavailable by @Isotr0py in #14358
  • OpenVINO: added CPU-like conditions by @ilya-lavrenov in #14338
  • [GH] Auto-apply multi-modality label to relevant PRs by @DarkLight1337 in #14402
  • correct wrong markdown syntax by @vincent-pli in #14414
  • [Bugfix] Further clean up LoRA test by @jeejeelee in #14422
  • [Bugfix] Clean up multi-modal processors by @DarkLight1337 in #14417
  • [Misc] Set default value of seed to None by @SmartManoj in #14274
  • [BUGFIX] Skip tokenization support for throughput benchmark by @maleksan85 in #12712
  • Fix missing kv_caches and attn_metadata in OpenVINOCausalLM by @hmellor in #14271
  • Use the optimized block sizes after tuning the kernel. by @vanbasten23 in #14329
  • [V1][Core] Support for Structured Outputs by @aarnphm in #12388
  • [Doc] Update prefix_caching.md to match the example image by @York-RDWang in #14420
  • [Benchmarks] Make detokenization optional in benchmark scripts by @JArnoldAMD in #11697
  • [Kernel] optimize performance of gptq marlin kernel when n is small by @jinzhen-lin in #14138
  • [Misc] Add Phi4-MM example by @jeejeelee in #14343
  • [v1] torch.compile integration explanation by @youkaichao in #14437
  • [V1] Eagerly remove finished requests from the batch by @njhill in #14388
  • [V1][Metrics] Fix traceback with preemptions+LoRA by @markmc in #14220
  • [Bugfix] Fix torch_xla which can't handle None seed introduced in #14274 by @yarongmu-google in #14459
  • [V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC by @afeldman-nm in #13949
  • [Bugfix][V1] Handle MLA in kv_cache_interface by @tlrmchlsmth in #14462
  • Revert "[Perf] Reduce MLA CPU overheads in V1 (#14384)" by @tlrmchlsmth in #14471
  • [Bugfix][Disaggregated] Add a check in send_kv_caches_and_hidden_states and fix the reshape of the KVCache by @hasB4K in #14369
  • [MISC][V1] Register process killing handler only in the main thread by @comaniac in #14380
  • [core] add extra_args to SamplingParams by @akeshet in #13300
  • [CI/Build] refactor: set timezone of container to UTC by @bufferoverflow in #12888
  • Default to generation_config from model by @hmellor in #12622
  • [Doc]add doc for Qwen models tool calling by @WangErXiao in #14478
  • [Doc] Added QwQ-32B to the supported models list in the reasoning out… by @WangErXiao in #14479
  • [Bugfix] Make the deviceprofiler include LoRA memory. by @jeejeelee in #14469
  • Add training doc signposting to TRL by @hmellor in #14439
  • [Build/BugFix] Fix hopper 12.8 build by @LucasWilkinson in #14354
  • Add RLHF document by @hmellor in #14482
  • [CI/Build] Use a fixed seed to avoid flaky tests by @DarkLight1337 in #14480
  • [V1] TPU - Add tensor parallel support via Ray by @alexm-redhat in #13618
  • [VLM] Add TP support for Phi-4-MM by @Isotr0py in #14453
  • [Misc] add use_tqdm_on_load to reduce logs by @aarnphm in #14407
  • [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #13776
  • [benchmarks] Add option to use unique jsonschema for each request by @russellb in #14457
  • [Misc] Don't run ruff at all on 3rd party libs by @DarkLight1337 in #14493
  • Move requirements into their own directory by @hmellor in #12547
  • [Bugfix] DeepSeek Accuracy by @LucasWilkinson in #14476
  • [Bugfix] Fix profiling OOM and decouple encoder multimodal profiling by @Isotr0py in #14361
  • Update CODEOWNERS for structured output by @russellb in #14496
  • [Misc] Upgrade to Python 3.9 typing for additional directories by @DarkLight1337 in #14492
  • [V1] Support bad_words in sampler by @22quinn in #13376
  • Revert "[V1][Core] Fix memory issue with logits & sampling" by @robertgshaw2-redhat in #14504
  • [Attention] Default to FlashMLA backend for MLA by @LucasWilkinson in #14451
  • [V1][TPU] Remove unnecessary padding for running on TPU. by @vanbasten23 in #14467
  • [Feat] Support chunked prefill for LMCache connector by @YaoJiayi in #14505
  • [Bugfix] Fix tqdm progress bar when SamplingParams.n > 1 by @yanyc428 in #12428
  • [Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work by @Isotr0py in #14498
  • [Hardware][TPU] Fix the recompiling issue in logits processor after warmup by @yaochengji in #14510
  • [Misc] Ensure out-of-tree quantization method recognize by cli args by @liuyanyi in #14328
  • [Bugfix] Wrong requirements path - rocm by @martinhoyer in #14527
  • [Feature] Consolidate performance benchmark datasets by @JenZhao in #14036
  • [Misc] Add log information for handle_process_request. by @chaunceyjiang in #14130
  • [Docs] Mention model_impl arg when explaining Transformers fallback by @hmellor in #14552
  • [Frontend] support image embeds by @chaunceyjiang in #13955
  • [Kernel] Add more dtype support for GGUF kernels by @SzymonOzog in #14043
  • [Doc] Update PaliGemma note to a warning by @DarkLight1337 in #14565
  • Correct capitalisation: Github -> GitHub by @hmellor in #14561
  • [V1][Bugfix] Fix handing of second_per_grid_ts for Qwen2-VL & Qwen2.5-VL by @ywang96 in #14548
  • Correct capitalisation: VLLM -> vLLM by @hmellor in #14562
  • [Docs] Make installation URLs nicer by @hmellor in #14556
  • [Bugfix][v1] fixed llava-hf/llava-1.5-7b-hf is broken on V1 by @chaunceyjiang in #14554
  • [Perf] Improve MLA on V1 by @simon-mo in #14540
  • [Minor] Update the tqdm bar for parallel sampling by @WoosukKwon in #14571
  • [V1] LoRA - Add triton kernels for V1 by @varun-sundar-rabindranath in #13096
  • Fix typo in benchmark_serving_structured_output.py by @russellb in #14566
  • [V1] Prevent xgrammar from breaking TPU support by @russellb in #14575
  • [Kernel] moe wna16 cuda kernel by @jinzhen-lin in #13321
  • [MISC][V1] Handle exception of current_platform.get_device_name() in arg_utils by @comaniac in #14379
  • [Neuron] Add Neuron device communicator for vLLM v1 by @gnovack in #14085
  • [neuron] add reshape_and_cache by @liangfu in #14391
  • [V1][PP] Do not block engine core when no requests to schedule by @comaniac in #14585
  • [Bugfix] Fix FP16 overflow for DeepSeek V2 by @Concurrensee in #13232
  • [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #14508
  • [Misc] Correct deepseek-vl2 chat template by @Isotr0py in #14558
  • [Perf]:Optimize qwen2-vl to reduce cudaMemcpyAsync by @cynthieye in #14377
  • [VLM] Cleanup siglip legacy code and fix broken paligemma multimodal processor by @Isotr0py in #14602
  • benchmarks: simplify test jsonschema by @russellb in #14567
  • dynamic distpatch of fp8 kernels by @jeffdaily in #14245
  • [Bugfix] Update --hf-overrides for Alibaba-NLP/gte-Qwen2 by @DarkLight1337 in #14609
  • Uninstall dependencies before installing requirements/tpu.txt by @richardsliu in #14586
  • [V1] Add regex structured output support with xgrammar by @russellb in #14590
  • docs: Add documentation for s390x cpu implementation by @dilipgb in #14198
  • [BugFix/Build] Fix sparse kernels not getting built on hopper by @LucasWilkinson in #14572
  • [Hardware][Intel GPU] upgrade IPEX dependency to 2.6.10. by @jikunshang in #14564
  • [V1] Remove cache from StructuredOutputManager by @russellb in #14622
  • fix some typos : supported_head_sizes by @hackty in #14627
  • [V1] Delay all xgrammar usage until needed by @russellb in #14616
  • Fix run_tpu_test by @richardsliu in #14641
  • [V1][TPU] Pad the block_table.shape[1] so the ragged paged attention can handle correctly by @vanbasten23 in #14597
  • [Bugfix][V1][PP] Only warmup sampler at last PP rank by @comaniac in #14643
  • [release] Add commands to clean up logs on TPU release node by @khluu in #14642
  • [Feature] Add vllm bench CLI by @randyjhc in #13993
  • [core][V1] pluggable scheduler by @joerunde in #14466
  • [Doc] Update benchmarks README by @JenZhao in #14646
  • [Model] Extend Ultravox to accept audio longer than 30s by @farzadab in #13631
  • [V1][Core] Support MistralTokenizer for Structured Output by @aarnphm in #14625
  • [Core] Refactor QKVCrossParallelLinear implementation to support BNB 4-bit quantization by @Isotr0py in #14545
  • [Kernel] GGUF MoE kernel by @SzymonOzog in #14613
  • [V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing by @benchislett in #14645
  • [Kernel] Add ModelOpt FP4 Checkpoint Support by @pavanimajety in #12520
  • [CPU] Upgrade CPU backend to torch-2.6 by @bigPYJ1151 in #13381
  • [ROCm][Bugfix] Ensure that the moe_wna16_gemm kernel is not built on ROCm platforms. by @SageMoore in #14629
  • [Model] Add support for Gemma 3 by @WoosukKwon in #14660
  • [Bugfix] Missing thumbnail from NVLM-D processor by @ameyanjarlekar in #14633
  • [ROCm] Enable chunked prefill/paged attention in MLA on ROCm by @SageMoore in #14316
  • [FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. by @tjtanaa in #14664
  • [BugFix][V1] Fix parallel sampling finishing/aborts by @njhill in #14512
  • [V1] Allow sliding window + prefix caching by @WoosukKwon in #13069
  • [release] Add force remove for TPU logs by @khluu in #14697
  • [bugfix] fixup warning message for plugged schedulers for v1 by @joerunde in #14700
  • Add ray[data] as tpu dependency by @richardsliu in #14691
  • [ROCm][FP8] Fix for adjustments needed only for fnuz by @gshtras in #14689
  • [BugFix][TritonMLA] Process weights after model loading for GGUF by @tywuAMD in #14555
  • [Config][Disaggregated] Add timeout configuration for the torch.store and add KVTransferConfig.kv_connector_extra_config by @hasB4K in #14367
  • [V1][TPU] Add assertion on multi-step-scheduler by @lsy323 in #14707
  • [Quant] BartModel SupportsQuant by @kylesayrs in #14699
  • [Quant] Bamba SupportsQuant by @kylesayrs in #14698
  • [Bugfix] Fix chunked prefill for GGUF by @SzymonOzog in #14666
  • [CI/Build] Delete ultravox LoRA test by @jeejeelee in #14730
  • [Bugfix] fix benchmark moe by @jeejeelee in #14653
  • [VLM] Support pan-and-scan for Gemma3 multi-modal processor by @DarkLight1337 in #14672
  • [VLM] Support loading InternVideo2.5 models as original InternVLChatModel by @Isotr0py in #14738
  • [Bugfix] Fix prompt format of GLM4V by @DarkLight1337 in #14539
  • [V1][Minor] Minor enhancements on scheduler by @WoosukKwon in #14732
  • [Misc] Clean up processor tests by @DarkLight1337 in #14771
  • [V1][Core] using cached vocab_size for Structured Outputs by @aarnphm in #14630
  • [V1] Detokenizer: Respect Stop Tokens + not include_stop_str_in_output by @afeldman-nm in #14624
  • [Attention] Remove slow setattr in MLA by @LucasWilkinson in #14769
  • [Doc] Fix typo in documentation by @yasu52 in #14783
  • [Doc] Fix small typo in Transformers fallback by @heheda12345 in #14791
  • [V1] TPU - Enable prefix caching by default by @alexm-redhat in #14773
  • forward fix PR 14245, restore build on ROCm 6.2 by @jeffdaily in #14709
  • [V1] Move OOM check into sampler run by @ywang96 in #14728
  • [V1] Temporarily disable FlashInfer Rejection Sampler by @WoosukKwon in #14788
  • [Kernel] LoRA - Enable CUDAGraphs for V1 by @varun-sundar-rabindranath in #14626
  • [Kernel] [V1] Further optimizations to ROCm (Triton) Backend to better handle GQA. by @tdoublep in #14431
  • [Bugfix][IPEX] Add VLLM_CPU_MOE_PREPACK to allow disabling MoE prepack when CPU does not support it by @gau-nernst in #14681
  • [ci] Reduce number of tests in fastcheck by @khluu in #14782
  • [Misc][Minor] Simplify SamplingParams.__post_init__() by @njhill in #14772
  • [Neuron] flatten test parameterization for neuron attention kernels by @liangfu in #14712
  • [Feature] Add visionarena offline support for benchmark_throughput by @JenZhao in #14654
  • [CI] Fix missing example model id in processor test by @ywang96 in #14787
  • [Attention] MLA get rid of materialization by @LucasWilkinson in #14770
  • [Bugfix][Kernel][CPU] Fix num_tokens in CPU rotary embedding kernel by @gau-nernst in #14667
  • [BugFix]Fix performance serving benchmark when enable profiling by @Potabk in #14737
  • [Misc] Clean up type annotation for SupportsMultiModal by @DarkLight1337 in #14794
  • [Bugfix] Fix small typo in the example of Streaming delimiter by @bravo325806 in #14793
  • [Misc] Gemma3ForConditionalGeneration supports LoRA by @jeejeelee in #14797
  • [V1][Minor] Minor code cleanup for scheduling metrics by @WoosukKwon in #14800
  • [Bugfix][W8A8] fixed cutlass block fp8 binding by @DefTruth in #14796
  • [VLM] Various cleanup and fixes by @DarkLight1337 in #14806
  • [BugFix]: properly catch templating error when preprocess input by @gcalmettes in #13976
  • [Bugfix] Fix Aria test loading by @DarkLight1337 in #14823
  • [V1] Fix vocab size calculation for structured output by @russellb in #14826
  • [Frontend] Fix log message to use http vs https by @russellb in #14774
  • [V1][Metrics] Updated list of deprecated metrics in v0.8 by @markmc in #14695
  • [Frontend] track server_load by @daniel-salib in #13950
  • [Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable for CUDA < 12.0 by @wyajieha in #14430
  • [release] Remove log cleanup commands from TPU job by @khluu in #14838
  • Re-enable the AMD Entrypoints Test by @Alexei-V-Ivanov-AMD in #14711
  • [Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies by @cyang49 in #14778
  • [V1] Fix model parameterization for structured output tests by @russellb in #14833
  • Update to torch==2.6.0 by @mgoin in #12721
  • [CI] Add TPU v1 test by @richardsliu in #14834
  • [Build/CI] Move ninja to common deps by @russellb in #14835
  • [Build/CI] Upgrade aiohttp to include CVE fix by @russellb in #14840
  • [Doc] More neutral K8s deployment guide by @terrytangyuan in #14084
  • [Bugfix] Fix torch_xla in V0 which can't handle None seed introduced … by @yarongmu-google in #14844
  • [Neuron][CI] update docker run command by @liangfu in #14829
  • [Bugfix][V1] Fix flashinfer sampling by @DefTruth in #14815
  • Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U… by @tlrmchlsmth in #14848
  • Disable outlines cache by default by @russellb in #14837
  • [Misc] Remove misleading message in gemma2 and gemma3 by @Isotr0py in #14850
  • [Misc][Easy] Annotate unused vars in the csrc files by @houseroad in #14798
  • [V1] V1 Enablement Oracle by @robertgshaw2-redhat in #13726
  • [Docs] Add new East Coast vLLM Meetup slides to README and meetups.md by @simon-mo in #14852
  • [CPU] Support FP8 KV cache by @bigPYJ1151 in #14741
  • [Attention] Get rid of mla cache alignment by @LucasWilkinson in #14842
  • [CI/Build] Delete LoRA bias test by @jeejeelee in #14849
  • [V1][Structured Output] calculate vocab_size eagerly by @aarnphm in #14851
  • [Doc] V1 user guide by @JenZhao in #13991
  • [Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes by @russellb in #14839
  • [Bugfix] EAGLE output norm bug by @luyuzhe111 in #14464
  • [VLM] Limit multimodal input cache by memory by @DarkLight1337 in #14805
  • [CI][Intel GPU] refine intel GPU ci docker build by @jikunshang in #14860
  • [Core] Expose API endpoint /is_sleeping by @waltforme in #14312
  • [VLM] Merged multi-modal processor for Pixtral by @Flechman in #12211
  • [Misc][Doc] Minor benchmark README update by @ywang96 in #14874
  • [VLM] Clean up Phi-4-MM ViT implementation by @Isotr0py in #14812
  • [V1] Remove V0 fallback for mistral-tokenizer by @ywang96 in #14873
  • [Kernel] Add more tuned configs by @simon-mo in #14877
  • [BugFix] Fix torch distributed stateless PG backend init by @njhill in #14870
  • [V1] [Spec Decode] Fix ngram tests by @LiuXiaoxuanPKU in #14878
  • [Bugfix] Limit profiling run sequence length by max_model_len by @kylesayrs in #14785
  • [Bugfix] Explicitly disable Phi-4-multimodal in V1 by @DarkLight1337 in #14889
  • Revert "[Bugfix] Limit profiling run sequence length by max_model_len (#14785) by @DarkLight1337 in #14892
  • [BugFix][V1] Fix overhead related to bad_words sampling when not in use by @njhill in #14894
  • [V1][BugFix] Detect interleaved sliding window attention by @WoosukKwon in #14896
  • [Misc] Catching Ray Compiled Graph PP test failures for V1 by @ruisearch42 in #14847
  • [Doc] Add guidance for using ccache with pip install -e . in doc by @vadiklyutiy in #14901
  • [V1] Enable Entrypoints Tests by @robertgshaw2-redhat in #14903
  • [CI] Fix Tool Calling Tests by @robertgshaw2-redhat in #14898
  • [CI/Build] Update defaults for test reproducibility by @DarkLight1337 in #14893
  • [V1] Optimize the overhead of rewinding by @WoosukKwon in #14905
  • [V1][Minor] Add repr to ConstantList by @WoosukKwon in #14907
  • [BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda context by @LucasWilkinson in #14910
  • [Misc] Replace os environ to monkeypatch in test suite by @t-sibiraj in #14516
  • [Benchmark] Do not save detailed info to json by default by @simon-mo in #14879
  • [V1] [Spec Decode] Support random sampling for spec decode by @LiuXiaoxuanPKU in #13933
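
As one illustration of the new FlashMLA option referenced above (#14267), here is a minimal sketch of selecting it via the VLLM_ATTENTION_BACKEND environment variable. The backend value "FLASHMLA" and the model shown are assumptions for illustration and should be checked against your installed version:

```python
import os

# Assumed backend name for the FlashMLA kernels added in #14267.
# Set the env var before importing vLLM so the backend choice is picked up.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHMLA"

from vllm import LLM

# Hypothetical example model; FlashMLA targets MLA-based models such as
# the DeepSeek family.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
print(llm.generate(["Hello"])[0].outputs[0].text)
```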

New Contributors

  • @ajayvohra2005 made their first contribution in #13589
  • @Edwinhr716 made their first contribution in #12913
  • @Hongbosherlock made their first contribution in #12978
  • @johnzheng1975 made their first contribution in #13668
  • @JenZhao made their first contribution in #13594
  • @bufferoverflow made their first contribution in #13011
  • @cakeng made their first contribution in #12583
  • @eli-b made their first contribution in #13785
  • @YaoJiayi made their first contribution in #12953
  • @edwardzjl made their first contribution in #13468
  • @naromero77amd made their first contribution in #13623
  • @henrylhtsang made their first contribution in #13797
  • @tianyuzhou95 made their first contribution in #13863
  • @b8zhong made their first contribution in #13736
  • @Chenyaaang made their first contribution in #13860
  • @observerw made their first contribution in #13958
  • @qli88 made their first contribution in #13718
  • @benchislett made their first contribution in #13626
  • @hasB4K made their first contribution in #13987
  • @Kacper-Pietkun made their first contribution in #13213
  • @Ryp made their first contribution in #13090
  • @LouieYang made their first contribution in #14031
  • @vanbasten23 made their first contribution in #13379
  • @atalman made their first contribution in #13926
  • @wyajieha made their first contribution in #12931
  • @qux-bbb made their first contribution in #14086
  • @realShengYao made their first contribution in #14051
  • @zhanwenchen made their first contribution in #14142
  • @rainkert made their first contribution in #13750
  • @congcongchen123 made their first contribution in #14119
  • @iacolippo made their first contribution in #14217
  • @zhe-thoughts made their first contribution in #14288
  • @DaividFrank made their first contribution in #14293
  • @vincent-4 made their first contribution in #13997
  • @yangsijia-serena made their first contribution in #14267
  • @pyc96 made their first contribution in #14237
  • @upayuryeva made their first contribution in #14363
  • @courage17340 made their first contribution in #14326
  • @dilipgb made their first contribution in #12613
  • @ZhongYingMatrix made their first contribution in #13897
  • @hj-mistral made their first contribution in #14224
  • @yaochengji made their first contribution in #14310
  • @dyli-google made their first contribution in #14385
  • @vincent-pli made their first contribution in #14414
  • @York-RDWang made their first contribution in #14420
  • @yarongmu-google made their first contribution in #14459
  • @22quinn made their first contribution in #13376
  • @yanyc428 made their first contribution in #12428
  • @martinhoyer made their first contribution in #14527
  • @gnovack made their first contribution in #14085
  • @cynthieye made their first contribution in #14377
  • @jeffdaily made their first contribution in #14245
  • @hackty made their first contribution in #14627
  • @randyjhc made their first contribution in #13993
  • @ameyanjarlekar made their first contribution in #14633
  • @tywuAMD made their first contribution in #14555
  • @yasu52 made their first contribution in #14783
  • @gau-nernst made their first contribution in #14681
  • @Potabk made their first contribution in #14737
  • @bravo325806 made their first contribution in #14793
  • @daniel-salib made their first contribution in #13950
  • @cyang49 made their first contribution in #14778
  • @luyuzhe111 made their first contribution in #14464
  • @Flechman made their first contribution in #12211
  • @vadiklyutiy made their first contribution in #14901
  • @t-sibiraj made their first contribution in #14516

Full Changelog: v0.7.3...v0.8.0rc1
