Note: vLLM no longer sets the global seed (#14274). Please set the `seed` parameter if you need to reproduce your results.
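
For example, a minimal sketch of pinning the seed explicitly (the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# Pass an explicit seed so results are reproducible; vLLM no longer
# seeds the global RNGs on your behalf.
llm = LLM(model="facebook/opt-125m", seed=42)

# A per-request seed on SamplingParams also pins the sampling RNG.
params = SamplingParams(temperature=0.8, seed=42)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```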
## What's Changed
- Update `pre-commit`'s `isort` version to remove warnings by @hmellor in #13614
- [V1][Minor] Print KV cache size in token counts by @WoosukKwon in #13596
- fix neuron performance issue by @ajayvohra2005 in #13589
- [Frontend] Add backend-specific options for guided decoding by @joerunde in #13505
- [Bugfix] Fix max_num_batched_tokens for MLA by @mgoin in #13620
- [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth by @lingfanyu in #13245
- Add llmaz as another integration by @kerthcet in #13643
- [Misc] Adding script to setup ray for multi-node vllm deployments by @Edwinhr716 in #12913
- [NVIDIA] Fix an issue to use current stream for the nvfp4 quant by @kaixih in #13632
- Use pre-commit to update `requirements-test.txt` by @hmellor in #13617
- [Bugfix] Add `mm_processor_kwargs` to chat-related protocols by @ywang96 in #13644
- [V1][Sampler] Avoid an operation during temperature application by @njhill in #13587
- Missing comment explaining VDR variable in GGUF kernels by @SzymonOzog in #13290
- [FEATURE] Enables /score endpoint for embedding models by @gmarinho2 in #12846
- [ci] Fix metrics test model path by @khluu in #13635
- [Kernel] Add streamK for block-quantized CUTLASS kernels by @Hongbosherlock in #12978
- [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation by @Isotr0py in #13586
- fix typo of grafana dashboard, with correct datasource by @johnzheng1975 in #13668
- [Attention] MLA with chunked prefill by @LucasWilkinson in #12639
- [Misc] Fix yapf linting tools etc not running on pre-commit by @Isotr0py in #13695
- docs: Add a note on full CI run in contributing guide by @terrytangyuan in #13646
- [HTTP Server] Make model param optional in request by @youngkent in #13568
- [Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… by @WangErXiao in #13672
- [Misc] Capture and log the time of loading weights by @waltforme in #13666
- [ROCM] fix native attention function call by @gongdao123 in #13650
- [Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA by @2015aroras in #13687
- [Misc] Bump compressed-tensors by @dsikka in #13619
- [Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len by @WangErXiao in #13691
- [v1] Support allowed_token_ids in v1 Sampler by @houseroad in #13210
- [Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler by @JenZhao in #13594
- Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size by @fabianlim in #13660
- [V1][Metrics] Support `vllm:cache_config_info` by @markmc in #13299
- [Metrics] Add `--show-hidden-metrics-for-version` CLI arg by @markmc in #13295
- [Misc] Reduce LoRA-related static variable by @jeejeelee in #13166
- [CI/Build] Fix pre-commit errors by @DarkLight1337 in #13696
- [core] set up data parallel communication by @youkaichao in #13591
- [ci] fix linter by @youkaichao in #13701
- Support SSL Key Rotation in HTTP Server by @youngkent in #13495
- [NVIDIA] Support nvfp4 cutlass gemm by @kaixih in #13571
- [V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths by @SageMoore in #13095
- [ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm by @gshtras in #13231
- [Doc] Dockerfile instructions for optional dependencies and dev transformers by @DarkLight1337 in #13699
- [Bugfix] Fix boolean conversion for OpenVINO env variable by @helena-intel in #13615
- [XPU] Fix setuptools version for xpu by @yma11 in #13548
- [CI/Build] fix uv caching in Dockerfile by @dtrifiro in #13611
- [CI/Build] Fix pre-commit errors from #13571 by @ywang96 in #13709
- [BugFix] Minor: logger import in attention backend by @andylolu2 in #13706
- [ci] Use env var to control whether to use S3 bucket in CI by @khluu in #13634
- [Quant] BaiChuan SupportsQuant by @kylesayrs in #13710
- [LMM] Implement merged multimodal processor for whisper by @Isotr0py in #13278
- [Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms by @njhill in #13688
- [Misc] Deprecate `--dataset` from `benchmark_serving.py` by @ywang96 in #13708
- [v1] torchrun compatibility by @youkaichao in #13642
- [V1][BugFix] Fix engine core client shutdown hangs by @njhill in #13298
- Fix some issues with benchmark data output by @huydhn in #13641
- [ci] Add logic to change model to S3 path only when S3 CI env var is on by @khluu in #13727
- [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #13721
- [model][refactor] remove cuda hard code in models and layers by @MengqingCao in #13658
- [Bugfix] fix(logging): add missing opening square bracket by @bufferoverflow in #13011
- [CI/Build] add python-json-logger to requirements-common by @bufferoverflow in #12842
- Expert Parallelism (EP) Support for DeepSeek Models by @cakeng in #12583
- [BugFix] Illegal memory access for MoE On H20 by @Abatom in #13693
- [Misc][Docs] Raise error when flashinfer is not installed and `VLLM_ATTENTION_BACKEND` is set by @NickLucche in #12513
- [V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) by @afeldman-nm in #10980
- Revert "[V1][Core] Fix memory issue with logits & sampling" by @ywang96 in #13775
- Fix precommit fail in fused_moe intermediate_cache2 chunking by @mgoin in #13772
- [Misc] Clean Up `EngineArgs.create_engine_config` by @robertgshaw2-redhat in #13734
- [Misc][Chore] Clean Up `AsyncOutputProcessing` Logs by @robertgshaw2-redhat in #13780
- Remove unused kwargs from model definitions by @hmellor in #13555
- [Doc] arg_utils.py: fixed a typo by @eli-b in #13785
- [Misc] set single whitespace between log sentences by @cjackal in #13771
- [Bugfix][Quantization] Fix FP8 + EP by @tlrmchlsmth in #13784
- [Misc][Attention][Quantization] init property earlier by @wangxiyuan in #13733
- [V1][Metrics] Implement vllm:lora_requests_info metric by @markmc in #13504
- [Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" by @LucasWilkinson in #13802
- [Bugfix] Support MLA for CompressedTensorsWNA16 by @mgoin in #13725
- Fix CompressedTensorsWNA16MoE with grouped scales by @mgoin in #13769
- [Core] LoRA V1 - Add add/pin/list/remove_lora functions by @varun-sundar-rabindranath in #13705
- [Misc] Check that the model can be inspected upon registration by @DarkLight1337 in #13743
- [Core] xgrammar: Expand list of unsupported jsonschema keywords by @russellb in #13783
- [Bugfix] Modify modelscope api usage in transformer_utils by @shen-shanshan in #13807
- [misc] Clean up ray compiled graph type hints by @ruisearch42 in #13731
- [Feature] Support KV cache offloading and disagg prefill with LMCache connector. by @YaoJiayi in #12953
- [ROCm][Quantization][Kernel] Using HIP FP8 header by @gshtras in #12593
- [CI/Build] Fix V1 LoRA failure by @jeejeelee in #13767
- [Misc] Clarify Error Handling for Non-existent Model Paths and HF Repo IDs by @Chen-0210 in #13724
- [Bugfix] Initialize attention bias on the same device as Query/Key/Value by @edwardzjl in #13468
- [Bugfix] Flush TunableOp results before worker processes are destroyed. by @naromero77amd in #13623
- [Bugfix] Fix deepseek-vl2 inference with more than 2 images by @Isotr0py in #13818
- Fix `/v1/audio/transcriptions` Bad Request Error by @HermitSun in #13811
- [Bugfix] Revert inspection code in #13743 by @DarkLight1337 in #13832
- Fix string parsing error by @Chen-0210 in #13825
- [Neuron] Add custom_ops for neuron backend by @liangfu in #13246
- Fix failing `MyGemma2Embedding` test by @hmellor in #13820
- [Model] Support Grok1 by @mgoin in #13795
- DeepSeek V2/V3/R1 only place `lm_head` on last pp rank by @hmellor in #13833
- [misc] Show driver IP info when Ray fails to allocate driver worker by @ruisearch42 in #13858
- [V1][Spec Decode] Change Spec Decode Rejection Sampling API by @LiuXiaoxuanPKU in #13729
- [Misc] Code Cleanup by @noemotiovon in #13859
- [Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues by @henrylhtsang in #13797
- Improve pipeline partitioning by @hmellor in #13839
- [Doc] fix the incorrect module path of tensorize_vllm_model by @tianyuzhou95 in #13863
- [ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms by @SageMoore in #13844
- [v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine by @sethkimmel3 in #13837
- [Misc] Improve LoRA spelling by @jeejeelee in #13831
- [Misc] Fix input processing for Ultravox by @ywang96 in #13871
- [Bugfix] Add test example for Ultravox v0.5 by @DarkLight1337 in #13890
- Add comments on accessing `kv_cache` and `attn_metadata` by @hmellor in #13887
- [Bugfix] Handle None parameters in Mistral function calls. by @fgreinacher in #13786
- [Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor by @b8zhong in #13736
- [Bugfix] Do not crash V0 engine on input errors by @joerunde in #13101
- [Bugfix] Update expected token counts for Ultravox tests by @DarkLight1337 in #13895
- [TPU] use torch2.6 with whl package by @Chenyaaang in #13860
- [Misc] fixed qwen_vl_utils parameter error by @chaunceyjiang in #13906
- [Bugfix] Backend option to disable xgrammar any_whitespace by @wallashss in #12744
- [BugFix] Make FP8 Linear compatible with torch.compile by @WoosukKwon in #13918
- [Kernel] FlashMLA integration by @LucasWilkinson in #13747
- [ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined by @HollowMan6 in #13851
- Use CUDA 12.4 as default for release and nightly wheels by @mgoin in #12098
- [misc] Rename Ray ADAG to Compiled Graph by @ruisearch42 in #13928
- [ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding by @SageMoore in #13922
- [V1][Metrics] Handle preemptions by @markmc in #13169
- [CI/Build] Add examples/ directory to be labelled by `mergify` by @b8zhong in #13944
- [Misc] fixed 'required' is an invalid argument for positionals by @chaunceyjiang in #13948
- [PP] Correct cache size check by @zhengy001 in #13873
- Fix test_block_fp8.py test for MoE by @mgoin in #13915
- [VLM] Support multimodal inputs for Florence-2 models by @Isotr0py in #13320
- [Model] Deepseek GGUF support by @SzymonOzog in #13167
- Update quickstart.md by @observerw in #13958
- Deduplicate `.pre-commit-config.yaml`'s `exclude` by @hmellor in #13967
- [bugfix] Fix profiling for RayDistributedExecutor by @ruisearch42 in #13945
- Update LMFE version to v0.10.11 to support new versions of transforme… by @noamgat in #13930
- [Bugfix] Fix qwen2.5-vl overflow issue by @Isotr0py in #13968
- [VLM] Generalized prompt updates for multi-modal processor by @DarkLight1337 in #13964
- [Attention] MLA support for V1 by @chenyang78 in #13789
- Bump azure/setup-helm from 4.2.0 to 4.3.0 by @dependabot in #13742
- [VLM] Deprecate legacy input mapper for OOT multimodal models by @DarkLight1337 in #13979
- [ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups by @SageMoore in #13970
- [V1][Minor] Minor cleanup for GPU Model Runner by @WoosukKwon in #13983
- [core] Perf improvement for DSv3 on AMD GPUs by @qli88 in #13718
- [Attention] Flash MLA for V1 by @LucasWilkinson in #13867
- [Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict by @benchislett in #13626
- [Misc] Print FusedMoE detail info by @jeejeelee in #13974
- [V1] `SupportsV0Only` protocol for model definitions by @ywang96 in #13959
- [Bugfix] Check that number of images matches number of `<|image|>` tokens with mllama by @tjohnson31415 in #13911
- [Doc] Move multimodal Embedding API example to Online Serving page by @DarkLight1337 in #14017
- [Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) by @hasB4K in #13987
- Use smaller embedding model when not testing model specifically by @hmellor in #13891
- [Hardware][Intel-Gaudi] Regional compilation support by @Kacper-Pietkun in #13213
- [V1][Minor] Restore V1 compatibility with LLMEngine class by @Ryp in #13090
- Update AutoAWQ docs by @hmellor in #14042
- [Bugfix] Fix MoeWNA16Method activation by @jeejeelee in #14024
- [VLM][Bugfix] Enable specifying prompt target via index by @DarkLight1337 in #14038
- [Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series by @LouieYang in #14031
- [Doc] Fix ROCm documentation by @b8zhong in #14041
- Fix entrypoint tests for embedding models by @hmellor in #14052
- [V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU by @vanbasten23 in #13379
- [v1] Cleanup the BlockTable in InputBatch by @heheda12345 in #13977
- Add RELEASE.md by @atalman in #13926
- [v1] Move block pool operations to a separate class by @heheda12345 in #13973
- [core] Bump ray to 2.43 by @ruisearch42 in #13994
- [torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass by @ProExpertProg in #10902
- [Docs] Add `pipeline_parallel_size` to optimization docs by @b8zhong in #14059
- [Bugfix] Add file lock for ModelScope download by @jeejeelee in #14060
- [Misc][Kernel]: Add GPTQAllSpark Quantization by @wyajieha in #12931
- [Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor by @bigPYJ1151 in #14053
- [Documentation] Add more deployment guide for Kubernetes deployment by @KuntaiDu in #13841
- [Doc] Consolidate `whisper` and `florence2` examples by @Isotr0py in #14050
- [V1][Minor] Do not print attn backend twice by @WoosukKwon in #13985
- [ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class by @SageMoore in #14065
- [v1][Bugfix] Only cache blocks that are not in the prefix cache by @heheda12345 in #14073
- [v1] Add `__repr__` to KVCacheBlock to avoid recursive print by @heheda12345 in #14081
- [Model] Add LoRA support for TransformersModel by @jeejeelee in #13770
- [Misc] Accurately capture the time of loading weights by @waltforme in #14063
- [Doc] Source building add clone step by @qux-bbb in #14086
- [v0][structured output] Support reasoning output by @gaocegege in #12955
- Update deprecated Python 3.8 typing by @hmellor in #13971
- [Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure by @realShengYao in #14051
- [Misc] typo find in deepseek_v2 by @noooop in #14106
- [Misc][Platform] Move use allgather to platform by @MengqingCao in #14010
- [Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 by @comaniac in #13921
- [V1] Refactor parallel sampling support by @markmc in #13774
- Improve the docs for `TransformersModel` by @hmellor in #14147
- [ROCm] Faster Custom Paged Attention kernels by @tjtanaa in #12348
- Fix `head_dim` not existing in all model configs (Transformers backend) by @hmellor in #14141
- [V0][Metrics] Remove unimplemented `vllm:tokens_total` by @markmc in #14134
- [V0][Metrics] Deprecate some KV/prefix cache metrics by @markmc in #14136
- [V1] Simplify stats logging by @njhill in #14082
- [V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics by @markmc in #14055
- [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 by @mgoin in #14100
- [Kernel] Optimize moe intermediate_cache usage by @mgoin in #13625
- [Docs] Add GPTQModel by @Qubitium in #14056
- [v1] Add comments to the new ragged paged attention Pallas kernel by @vanbasten23 in #14155
- [Model] Add support for GraniteMoeShared models by @tjohnson31415 in #13313
- [core] moe fp8 block quant tuning support by @divakar-amd in #14068
- [Misc] Remove lru_cache in NvmlCudaPlatform by @comaniac in #14156
- [core] Pass all driver env vars to ray workers unless excluded by @ruisearch42 in #14099
- Use math.prod instead of np.prod for trivial ops by @zhanwenchen in #14142
- Fix benchmark_moe.py tuning for CUDA devices by @mgoin in #14164
- [platform] add debug logging during inferring the device type by @youkaichao in #14195
- [sleep mode] error out with expandable_segments by @youkaichao in #14189
- [doc] add "Failed to infer device type" to faq by @youkaichao in #14200
- [Bugfix] Restrict MacOS CPU detection by @mgoin in #14210
- [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs by @njhill in #13869
- [V0][Metrics] Deprecate some questionable request time metrics by @markmc in #14135
- [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py by @lk-chen in #14161
- add cutlass support for blackwell fp8 gemm by @kushanam in #13798
- [TPU][Profiler] Support start_profile/stop_profile in TPU worker by @lsy323 in #13988
- Fix performance when `--generation-config` is not `None` by @hmellor in #14223
- [Frontend] Do `prompt_logprobs` clamping for chat as well as completions by @hmellor in #14225
- [Docs] Update Dockerfile dependency image by @mgoin in #14215
- [v1][Metrics] Add design doc by @markmc in #12745
- Serialize using safetensors for KV caches by @KuntaiDu in #14228
- Clean up unused padding_idx variables across many model definitions by @tlrmchlsmth in #13240
- [ROCm] Disable a few more kernel tests that are broken on ROCm by @SageMoore in #14145
- [V1][TPU] TPU multimodal model support for ragged attention by @mgoin in #14158
- [misc] announce china meetup by @youkaichao in #14248
- Moved numba from common requirements to cuda/rocm specific requirements by @npanpaliya in #14199
- Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 by @mgoin in #14157
- [Bugfix] Fix gptq_marlin for deepseek-v3 by @rainkert in #13750
- [V1][Bugfix] Do not reset prefix caching metrics by @comaniac in #14235
- [Model] New model support for Phi-4-multimodal-instruct by @congcongchen123 in #14119
- [V1] EP/TP MoE + DP Attention by @tlrmchlsmth in #13931
- [platforms] improve rocm debugging info by @youkaichao in #14257
- Temporarily disable test_awq_gemm_opcheck by @mgoin in #14251
- [Frontend] Allow return_tokens_as_token_ids to be passed as a request param by @benchislett in #14066
- [Misc][V1] Avoid using `envs.VLLM_USE_V1` in mm processing by @ywang96 in #14256
- [Bugfix][V1] Fix allowed_token_ids for v1 Sampler by @houseroad in #14169
- [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID by @iacolippo in #14217
- [Doc] [3/N] Refer code examples for common cases in dev multimodal processor by @DarkLight1337 in #14278
- Small update for external_launcher backend docs by @zhe-thoughts in #14288
- [V1][Frontend] Add Testing For V1 Runtime Parameters by @robertgshaw2-redhat in #14159
- [LoRA] Remove linear hack outside transformers backend by @Isotr0py in #14177
- [Misc] Add Qwen2MoeForCausalLM moe tuning support by @jeejeelee in #14276
- [Doc] Fixed typo in prefix_caching.md by @DaividFrank in #14293
- [Bugfix] Fix broken vision language example by @Isotr0py in #14292
- [Docs] Add Meta Slides by @simon-mo in #14297
- [V1][Minor] Remove obsolete FIXME comment by @njhill in #14304
- Deprecate `best_of` Sampling Parameter in anticipation for vLLM V1 by @vincent-4 in #13997
- [V1][BugFix] Fix for mixed top_k batch by @njhill in #14301
- [misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env by @yangsijia-serena in #14267
- [V1][Easy] Add empty allowed_token_ids in the v1 sampler test by @houseroad in #14308
- [Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch by @pyc96 in #14237
- [Bugfix] Remove num_tokens_across_dp by @tlrmchlsmth in #14302
- [BugFix] Fix prefix caching V0 MLA by @LucasWilkinson in #14255
- [CI/Build] Use spawn multiprocessing mode for V1 test pipeline by @russellb in #14243
- Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM by @mgoin in #13917
- [Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation by @terrytangyuan in #13850
- [BugFix] MLA + V1, illegal memory access and accuracy issues by @LucasWilkinson in #14253
- [misc] Mention `ray list nodes` command to troubleshoot ray issues by @ruisearch42 in #14318
- [Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 by @gaocegege in #14114
- [V1] LoRA - Enable more V1 tests by @varun-sundar-rabindranath in #14315
- [Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention by @NickLucche in #11301
- [Hardware] Update the flash attn tag to support Blackwell by @pavanimajety in #14244
- [Model] Update Paligemma multimodal processing with PromptUpdate by @kylehh in #14015
- [VLM] Support Pixtral-HF on V1 by @lk-chen in #14275
- [Core] Optimizing cross-attention `QKVParallelLinear` computation by @NickLucche in #12325
- [Frontend][Docs] Transcription API streaming by @NickLucche in #13301
- [Doc] Update reasoning with stream example to use OpenAI library by @liuyanyi in #14077
- [Doc] Correct beam_search usage in generative_models.md by @upayuryeva in #14363
- [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend by @tdoublep in #14152
- [Bugfix][Core] fix abort_seq_group and memory leak when n>1 by @courage17340 in #14326
- [Core] Don't use cache during multi-modal profiling by @DarkLight1337 in #14336
- [Doc] Fix date typo in README.md by @jitseklomp in #14366
- [RLHF] use worker_extension_cls for compatibility with V0 and V1 by @youkaichao in #14185
- Reinstate `best_of` for V0 by @hmellor in #14356
- Adding cpu inference with VXE ISA for s390x architecture by @dilipgb in #12613
- Add authors to license header. by @tdoublep in #14371
- Fix mla prefill context performance by @ZhongYingMatrix in #13897
- [V1] Do not detokenize if sampling param detokenize is False by @hj-mistral in #14224
- [Distributed] Add enable_expert_parallel arg by @tlrmchlsmth in #14305
- [CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa by @mgoin in #13569
- [CI] Disable spawn when running V1 Test by @tdoublep in #14345
- [Kernel] Add needs_fixed_stride_order tag to most GEMMs by @tlrmchlsmth in #14306
- [Bugfix] Fix use_direct_call condition in FusedMoE layer for by @tlrmchlsmth in #14382
- [Bug] Fix Attention when ignored in by quant_method by @mgoin in #14313
- [V1][Bugfix] Standardize quantized kv cache rejection for attention backends by @mgoin in #14221
- [Docs] Add nsight guide to profiling docs by @mgoin in #14298
- [Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue by @yaochengji in #14310
- [Doc] Fix a typo by @dyli-google in #14385
- [Bugfix] Correctly call `cudaProfilerStop` in benchmarks script by @b8zhong in #14183
- [Perf] Reduce MLA CPU overheads in V1 by @LucasWilkinson in #14384
- [FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object by @ProExpertProg in #14390
- [BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs by @LucasWilkinson in #14396
- [Bugfix] Fix JambaForCausalLM LoRA by @jeejeelee in #14370
- [Build] Add nightly wheel fallback when latest commit wheel unavailable by @Isotr0py in #14358
- OpenVINO: added CPU-like conditions by @ilya-lavrenov in #14338
- [GH] Auto-apply multi-modality label to relevant PRs by @DarkLight1337 in #14402
- correct wrong markdown syntax by @vincent-pli in #14414
- [Bugfix] Further clean up LoRA test by @jeejeelee in #14422
- [Bugfix] Clean up multi-modal processors by @DarkLight1337 in #14417
- [Misc] Set default value of seed to None by @SmartManoj in #14274
- [BUGFIX] Skip tokenization support for throughput benchmark by @maleksan85 in #12712
- Fix missing `kv_caches` and `attn_metadata` in `OpenVINOCausalLM` by @hmellor in #14271
- Use the optimized block sizes after tuning the kernel. by @vanbasten23 in #14329
- [V1][Core] Support for Structured Outputs by @aarnphm in #12388 (see the sketch after this list)
- [Doc] Update prefix_caching.md to match the example image by @York-RDWang in #14420
- [Benchmarks] Make detokenization optional in benchmark scripts by @JArnoldAMD in #11697
- [Kernel] optimize performance of gptq marlin kernel when n is small by @jinzhen-lin in #14138
- [Misc] Add Phi4-MM example by @jeejeelee in #14343
- [v1] torch.compile integration explanation by @youkaichao in #14437
- [V1] Eagerly remove finished requests from the batch by @njhill in #14388
- [V1][Metrics] Fix traceback with preemptions+LoRA by @markmc in #14220
- [Bugfix] Fix torch_xla which can't handle None seed introduced in #14274 by @yarongmu-google in #14459
- [V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC by @afeldman-nm in #13949
- [Bugfix][V1] Handle MLA in kv_cache_interface by @tlrmchlsmth in #14462
- Revert "[Perf] Reduce MLA CPU overheads in V1 (#14384)" by @tlrmchlsmth in #14471
- [Bugfix][Disaggregated] Add a check in send_kv_caches_and_hidden_states and fix the reshape of the KVCache by @hasB4K in #14369
- [MISC][V1] Register process killing handler only in the main thread by @comaniac in #14380
- [core] add `extra_args` to `SamplingParams` by @akeshet in #13300
- [CI/Build] refactor: set timezone of container to UTC by @bufferoverflow in #12888
- Default to `generation_config` from model by @hmellor in #12622
- [Doc] Add doc for Qwen models tool calling by @WangErXiao in #14478
- [Doc] Added QwQ-32B to the supported models list in the reasoning out… by @WangErXiao in #14479
- [Bugfix] Make the device profiler include LoRA memory. by @jeejeelee in #14469
- Add training doc signposting to TRL by @hmellor in #14439
- [Build/BugFix] Fix hopper 12.8 build by @LucasWilkinson in #14354
- Add RLHF document by @hmellor in #14482
- [CI/Build] Use a fixed seed to avoid flaky tests by @DarkLight1337 in #14480
- [V1] TPU - Add tensor parallel support via Ray by @alexm-redhat in #13618
- [VLM] Add TP support for Phi-4-MM by @Isotr0py in #14453
- [Misc] add `use_tqdm_on_load` to reduce logs by @aarnphm in #14407
- [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #13776
- [benchmarks] Add option to use unique jsonschema for each request by @russellb in #14457
- [Misc] Don't run ruff at all on 3rd party libs by @DarkLight1337 in #14493
- Move requirements into their own directory by @hmellor in #12547
- [Bugfix] DeepSeek Accuracy by @LucasWilkinson in #14476
- [Bugfix] Fix profiling OOM and decouple encoder multimodal profiling by @Isotr0py in #14361
- Update CODEOWNERS for structured output by @russellb in #14496
- [Misc] Upgrade to Python 3.9 typing for additional directories by @DarkLight1337 in #14492
- [V1] Support bad_words in sampler by @22quinn in #13376
- Revert "[V1][Core] Fix memory issue with logits & sampling" by @robertgshaw2-redhat in #14504
- [Attention] Default to FlashMLA backend for MLA by @LucasWilkinson in #14451
- [V1][TPU] Remove unnecessary padding for running on TPU. by @vanbasten23 in #14467
- [Feat] Support chunked prefill for LMCache connector by @YaoJiayi in #14505
- [Bugfix] Fix tqdm progress bar when SamplingParams.n > 1 by @yanyc428 in #12428
- [Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work by @Isotr0py in #14498
- [Hardware][TPU] Fix the recompiling issue in logits processor after warmup by @yaochengji in #14510
- [Misc] Ensure out-of-tree quantization method is recognized by cli args by @liuyanyi in #14328
- [Bugfix] Wrong requirements path - rocm by @martinhoyer in #14527
- [Feature] Consolidate performance benchmark datasets by @JenZhao in #14036
- [Misc] Add log information for handle_process_request. by @chaunceyjiang in #14130
- [Docs] Mention `model_impl` arg when explaining Transformers fallback by @hmellor in #14552
- [Frontend] support image embeds by @chaunceyjiang in #13955
- [Kernel] Add more dtype support for GGUF kernels by @SzymonOzog in #14043
- [Doc] Update PaliGemma note to a warning by @DarkLight1337 in #14565
- Correct capitalisation: `Github` -> `GitHub` by @hmellor in #14561
- [V1][Bugfix] Fix handling of `second_per_grid_ts` for Qwen2-VL & Qwen2.5-VL by @ywang96 in #14548
- Correct capitalisation: `VLLM` -> `vLLM` by @hmellor in #14562
- [Docs] Make installation URLs nicer by @hmellor in #14556
- [Bugfix][v1] fixed llava-hf/llava-1.5-7b-hf is broken on V1 by @chaunceyjiang in #14554
- [Perf] Improve MLA on V1 by @simon-mo in #14540
- [Minor] Update the tqdm bar for parallel sampling by @WoosukKwon in #14571
- [V1] LoRA - Add triton kernels for V1 by @varun-sundar-rabindranath in #13096
- Fix typo in benchmark_serving_structured_output.py by @russellb in #14566
- [V1] Prevent xgrammar from breaking TPU support by @russellb in #14575
- [Kernel] moe wna16 cuda kernel by @jinzhen-lin in #13321
- [MISC][V1] Handle exception of current_platform.get_device_name() in arg_utils by @comaniac in #14379
- [Neuron] Add Neuron device communicator for vLLM v1 by @gnovack in #14085
- [neuron] add reshape_and_cache by @liangfu in #14391
- [V1][PP] Do not block engine core when no requests to schedule by @comaniac in #14585
- [Bugfix] Fix FP16 overflow for DeepSeek V2 by @Concurrensee in #13232
- [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #14508
- [Misc] Correct deepseek-vl2 chat template by @Isotr0py in #14558
- [Perf]: Optimize qwen2-vl to reduce cudaMemcpyAsync by @cynthieye in #14377
- [VLM] Cleanup siglip legacy code and fix broken paligemma multimodal processor by @Isotr0py in #14602
- benchmarks: simplify test jsonschema by @russellb in #14567
- dynamic dispatch of fp8 kernels by @jeffdaily in #14245
- [Bugfix] Update `--hf-overrides` for `Alibaba-NLP/gte-Qwen2` by @DarkLight1337 in #14609
- Uninstall dependencies before installing requirements/tpu.txt by @richardsliu in #14586
- [V1] Add regex structured output support with xgrammar by @russellb in #14590
- docs: Add documentation for s390x cpu implementation by @dilipgb in #14198
- [BugFix/Build] Fix sparse kernels not getting built on hopper by @LucasWilkinson in #14572
- [Hardware][Intel GPU] upgrade IPEX dependency to 2.6.10. by @jikunshang in #14564
- [V1] Remove cache from StructuredOutputManager by @russellb in #14622
- fix some typos: supported_head_sizes by @hackty in #14627
- [V1] Delay all xgrammar usage until needed by @russellb in #14616
- Fix run_tpu_test by @richardsliu in #14641
- [V1][TPU] Pad the block_table.shape[1] so the ragged paged attention can handle correctly by @vanbasten23 in #14597
- [Bugfix][V1][PP] Only warmup sampler at last PP rank by @comaniac in #14643
- [release] Add commands to clean up logs on TPU release node by @khluu in #14642
- [Feature] Add `vllm bench` CLI by @randyjhc in #13993
- [core][V1] pluggable scheduler by @joerunde in #14466
- [Doc] Update benchmarks README by @JenZhao in #14646
- [Model] Extend Ultravox to accept audio longer than 30s by @farzadab in #13631
- [V1][Core] Support MistralTokenizer for Structured Output by @aarnphm in #14625
- [Core] Refactor `QKVCrossParallelLinear` implementation to support BNB 4-bit quantization by @Isotr0py in #14545
- [Kernel] GGUF MoE kernel by @SzymonOzog in #14613
- [V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing by @benchislett in #14645
- [Kernel] Add ModelOpt FP4 Checkpoint Support by @pavanimajety in #12520
- [CPU] Upgrade CPU backend to torch-2.6 by @bigPYJ1151 in #13381
- [ROCm][Bugfix] Ensure that the moe_wna16_gemm kernel is not built on ROCm platforms. by @SageMoore in #14629
- [Model] Add support for Gemma 3 by @WoosukKwon in #14660
- [Bugfix] Missing thumbnail from NVLM-D processor by @ameyanjarlekar in #14633
- [ROCm] Enable chunked prefill/paged attention in MLA on ROCm by @SageMoore in #14316
- [FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. by @tjtanaa in #14664
- [BugFix][V1] Fix parallel sampling finishing/aborts by @njhill in #14512
- [V1] Allow sliding window + prefix caching by @WoosukKwon in #13069
- [release] Add force remove for TPU logs by @khluu in #14697
- [bugfix] fixup warning message for plugged schedulers for v1 by @joerunde in #14700
- Add ray[data] as tpu dependency by @richardsliu in #14691
- [ROCm][FP8] Fix for adjustments needed only for fnuz by @gshtras in #14689
- [BugFix][TritonMLA] Process weights after model loading for GGUF by @tywuAMD in #14555
- [Config][Disaggregated] Add timeout configuration for the torch.store and add KVTransferConfig.kv_connector_extra_config by @hasB4K in #14367
- [V1][TPU] Add assertion on multi-step-scheduler by @lsy323 in #14707
- [Quant] BartModel SupportsQuant by @kylesayrs in #14699
- [Quant] Bamba SupportsQuant by @kylesayrs in #14698
- [Bugfix] Fix chunked prefill for GGUF by @SzymonOzog in #14666
- [CI/Build] Delete ultravox LoRA test by @jeejeelee in #14730
- [Bugfix] fix benchmark moe by @jeejeelee in #14653
- [VLM] Support pan-and-scan for Gemma3 multi-modal processor by @DarkLight1337 in #14672
- [VLM] Support loading InternVideo2.5 models as original InternVLChatModel by @Isotr0py in #14738
- [Bugfix] Fix prompt format of GLM4V by @DarkLight1337 in #14539
- [V1][Minor] Minor enhancements on scheduler by @WoosukKwon in #14732
- [Misc] Clean up processor tests by @DarkLight1337 in #14771
- [V1][Core] using cached vocab_size for Structured Outputs by @aarnphm in #14630
- [V1] Detokenizer: Respect Stop Tokens + not `include_stop_str_in_output` by @afeldman-nm in #14624
- [Attention] Remove slow setattr in MLA by @LucasWilkinson in #14769
- [Doc] Fix typo in documentation by @yasu52 in #14783
- [Doc] Fix small typo in Transformers fallback by @heheda12345 in #14791
- [V1] TPU - Enable prefix caching by default by @alexm-redhat in #14773
- forward fix PR 14245, restore build on ROCm 6.2 by @jeffdaily in #14709
- [V1] Move OOM check into sampler run by @ywang96 in #14728
- [V1] Temporarily disable FlashInfer Rejection Sampler by @WoosukKwon in #14788
- [Kernel] LoRA - Enable CUDAGraphs for V1 by @varun-sundar-rabindranath in #14626
- [Kernel] [V1] Further optimizations to ROCm (Triton) Backend to better handle GQA. by @tdoublep in #14431
- [Bugfix][IPEX] Add `VLLM_CPU_MOE_PREPACK` to allow disabling MoE prepack when CPU does not support it by @gau-nernst in #14681
- [ci] Reduce number of tests in fastcheck by @khluu in #14782
- [Misc][Minor] Simplify `SamplingParams.__post_init__()` by @njhill in #14772
- [Neuron] flatten test parameterization for neuron attention kernels by @liangfu in #14712
- [Feature] Add visionarena offline support for benchmark_throughput by @JenZhao in #14654
- [CI] Fix missing example model id in processor test by @ywang96 in #14787
- [Attention] MLA get rid of materialization by @LucasWilkinson in #14770
- [Bugfix][Kernel][CPU] Fix num_tokens in CPU rotary embedding kernel by @gau-nernst in #14667
- [BugFix] Fix performance serving benchmark when enabling profiling by @Potabk in #14737
- [Misc] Clean up type annotation for `SupportsMultiModal` by @DarkLight1337 in #14794
- [Bugfix] Fix small typo in the example of Streaming delimiter by @bravo325806 in #14793
- [Misc] Gemma3ForConditionalGeneration supports LoRA by @jeejeelee in #14797
- [V1][Minor] Minor code cleanup for scheduling metrics by @WoosukKwon in #14800
- [Bugfix][W8A8] fixed cutlass block fp8 binding by @DefTruth in #14796
- [VLM] Various cleanup and fixes by @DarkLight1337 in #14806
- [BugFix]: properly catch templating error when preprocess input by @gcalmettes in #13976
- [Bugfix] Fix Aria test loading by @DarkLight1337 in #14823
- [V1] Fix vocab size calculation for structured output by @russellb in #14826
- [Frontend] Fix log message to use http vs https by @russellb in #14774
- [V1][Metrics] Updated list of deprecated metrics in v0.8 by @markmc in #14695
- [Frontend] track server_load by @daniel-salib in #13950
- [Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable for CUDA < 12.0 by @wyajieha in #14430
- [release] Remove log cleanup commands from TPU job by @khluu in #14838
- Re-enable the AMD Entrypoints Test by @Alexei-V-Ivanov-AMD in #14711
- [Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies by @cyang49 in #14778
- [V1] Fix model parameterization for structured output tests by @russellb in #14833
- Update to torch==2.6.0 by @mgoin in #12721
- [CI] Add TPU v1 test by @richardsliu in #14834
- [Build/CI] Move ninja to common deps by @russellb in #14835
- [Build/CI] Upgrade aiohttp to include CVE fix by @russellb in #14840
- [Doc] More neutral K8s deployment guide by @terrytangyuan in #14084
- [Bugfix] Fix torch_xla in V0 which can't handle None seed introduced … by @yarongmu-google in #14844
- [Neuron][CI] update docker run command by @liangfu in #14829
- [Bugfix][V1] Fix flashinfer sampling by @DefTruth in #14815
- Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U… by @tlrmchlsmth in #14848
- Disable outlines cache by default by @russellb in #14837
- [Misc] Remove misleading message in gemma2 and gemma3 by @Isotr0py in #14850
- [Misc][Easy] Annotate unused vars in the csrc files by @houseroad in #14798
- [V1] V1 Enablement Oracle by @robertgshaw2-redhat in #13726
- [Docs] Add new East Coast vLLM Meetup slides to README and meetups.md by @simon-mo in #14852
- [CPU] Support FP8 KV cache by @bigPYJ1151 in #14741
- [Attention] Get rid of mla cache alignment by @LucasWilkinson in #14842
- [CI/Build] Delete LoRA bias test by @jeejeelee in #14849
- [V1][Structured Output] calculate vocab_size eagerly by @aarnphm in #14851
- [Doc] V1 user guide by @JenZhao in #13991
- [Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes by @russellb in #14839
- [Bugfix] EAGLE output norm bug by @luyuzhe111 in #14464
- [VLM] Limit multimodal input cache by memory by @DarkLight1337 in #14805
- [CI][Intel GPU] refine intel GPU ci docker build by @jikunshang in #14860
- [Core] Expose API endpoint `/is_sleeping` by @waltforme in #14312
- [VLM] Merged multi-modal processor for Pixtral by @Flechman in #12211
- [Misc][Doc] Minor benchmark README update by @ywang96 in #14874
- [VLM] Clean up Phi-4-MM ViT implementation by @Isotr0py in #14812
- [V1] Remove V0 fallback for mistral-tokenizer by @ywang96 in #14873
- [Kernel] Add more tuned configs by @simon-mo in #14877
- [BugFix] Fix torch distributed stateless PG backend init by @njhill in #14870
- [V1] [Spec Decode] Fix ngram tests by @LiuXiaoxuanPKU in #14878
- [Bugfix] Limit profiling run sequence length by max_model_len by @kylesayrs in #14785
- [Bugfix] Explicitly disable Phi-4-multimodal in V1 by @DarkLight1337 in #14889
- Revert "[Bugfix] Limit profiling run sequence length by max_model_len (#14785) by @DarkLight1337 in #14892
- [BugFix][V1] Fix overhead related to bad_words sampling when not in use by @njhill in #14894
- [V1][BugFix] Detect interleaved sliding window attention by @WoosukKwon in #14896
- [Misc] Catching Ray Compiled Graph PP test failures for V1 by @ruisearch42 in #14847
- [Doc] Add guidance for using `ccache` with `pip install -e .` in doc by @vadiklyutiy in #14901
- [V1] Enable Entrypoints Tests by @robertgshaw2-redhat in #14903
- [CI] Fix Tool Calling Tests by @robertgshaw2-redhat in #14898
- [CI/Build] Update defaults for test reproducibility by @DarkLight1337 in #14893
- [V1] Optimize the overhead of rewinding by @WoosukKwon in #14905
- [V1][Minor] Add repr to ConstantList by @WoosukKwon in #14907
- [BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda context by @LucasWilkinson in #14910
- [Misc] Replace os environ to monkeypatch in test suite by @t-sibiraj in #14516
- [Benchmark] Do not save detailed info to json by default by @simon-mo in #14879
- [V1] [Spec Decode] Support random sampling for spec decode by @LiuXiaoxuanPKU in #13933
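
As referenced above for #12388, below is a minimal sketch of using structured (guided) outputs; the JSON schema and model name are illustrative placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain generation to match a JSON schema via guided decoding.
person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(
    max_tokens=100,
    guided_decoding=GuidedDecodingParams(json=person_schema),
)
outputs = llm.generate(["Describe a person as JSON: "], params)
print(outputs[0].outputs[0].text)
```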
## New Contributors
- @ajayvohra2005 made their first contribution in #13589
- @Edwinhr716 made their first contribution in #12913
- @Hongbosherlock made their first contribution in #12978
- @johnzheng1975 made their first contribution in #13668
- @JenZhao made their first contribution in #13594
- @bufferoverflow made their first contribution in #13011
- @cakeng made their first contribution in #12583
- @eli-b made their first contribution in #13785
- @YaoJiayi made their first contribution in #12953
- @edwardzjl made their first contribution in #13468
- @naromero77amd made their first contribution in #13623
- @henrylhtsang made their first contribution in #13797
- @tianyuzhou95 made their first contribution in #13863
- @b8zhong made their first contribution in #13736
- @Chenyaaang made their first contribution in #13860
- @observerw made their first contribution in #13958
- @qli88 made their first contribution in #13718
- @benchislett made their first contribution in #13626
- @hasB4K made their first contribution in #13987
- @Kacper-Pietkun made their first contribution in #13213
- @Ryp made their first contribution in #13090
- @LouieYang made their first contribution in #14031
- @vanbasten23 made their first contribution in #13379
- @atalman made their first contribution in #13926
- @wyajieha made their first contribution in #12931
- @qux-bbb made their first contribution in #14086
- @realShengYao made their first contribution in #14051
- @zhanwenchen made their first contribution in #14142
- @rainkert made their first contribution in #13750
- @congcongchen123 made their first contribution in #14119
- @iacolippo made their first contribution in #14217
- @zhe-thoughts made their first contribution in #14288
- @DaividFrank made their first contribution in #14293
- @vincent-4 made their first contribution in #13997
- @yangsijia-serena made their first contribution in #14267
- @pyc96 made their first contribution in #14237
- @upayuryeva made their first contribution in #14363
- @courage17340 made their first contribution in #14326
- @dilipgb made their first contribution in #12613
- @ZhongYingMatrix made their first contribution in #13897
- @hj-mistral made their first contribution in #14224
- @yaochengji made their first contribution in #14310
- @dyli-google made their first contribution in #14385
- @vincent-pli made their first contribution in #14414
- @York-RDWang made their first contribution in #14420
- @yarongmu-google made their first contribution in #14459
- @22quinn made their first contribution in #13376
- @yanyc428 made their first contribution in #12428
- @martinhoyer made their first contribution in #14527
- @gnovack made their first contribution in #14085
- @cynthieye made their first contribution in #14377
- @jeffdaily made their first contribution in #14245
- @hackty made their first contribution in #14627
- @randyjhc made their first contribution in #13993
- @ameyanjarlekar made their first contribution in #14633
- @tywuAMD made their first contribution in #14555
- @yasu52 made their first contribution in #14783
- @gau-nernst made their first contribution in #14681
- @Potabk made their first contribution in #14737
- @bravo325806 made their first contribution in #14793
- @daniel-salib made their first contribution in #13950
- @cyang49 made their first contribution in #14778
- @luyuzhe111 made their first contribution in #14464
- @Flechman made their first contribution in #12211
- @vadiklyutiy made their first contribution in #14901
- @t-sibiraj made their first contribution in #14516
Full Changelog: v0.7.3...v0.8.0rc1