Highlights
This release features 649 commits from 215 contributors (82 new contributors!).
- vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change to environment dependencies.
- The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute CUDA 12.6 wheels as GitHub release artifacts.
- As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy.
- Enhanced NVIDIA Blackwell support. vLLM now ships with an initial set of optimized kernels for NVIDIA Blackwell, covering both attention and MLP.
  - You can use our docker image or install the FlashInfer nightly wheel (`pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl`), then set `VLLM_ATTENTION_BACKEND=FLASHINFER` for better performance; see the sketch after this list.
  - Upgraded support for the new FlashInfer main branch. (#15777)
  - Please check out #18153 for the full roadmap.
- Initial DP, EP, PD support for large-scale inference
  - EP: permute/unpermute kernels for MoE optimization (#14568), modularized fused experts with PPLX kernel integration (#15956), EP group and all2all interface (#18077)
  - DP: decouple engine process management and comms (#15977)
  - PD: NIXL integration (#17751), local attention optimization for NIXL (#18170), support for multiple KV connectors (#17564)
- Migrate docs from Sphinx to MkDocs (#18145, #18610, #18614, #18616, #18622, #18626, #18627, #18635, #18637, #18657, #18663, #18666, #18713)
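A minimal sketch of how the FlashInfer backend mentioned above can be selected from Python, assuming the wheel (or the docker image) is already installed; the environment variable has to be set before vLLM is imported, and the model name is only illustrative.

```python
import os

# Select the FlashInfer attention backend before importing vLLM.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
outputs = llm.generate(["Hello from Blackwell!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```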
Notable Changes
- Removal of CUDA 12.4 support due to PyTorch upgrade to 2.7.
- Change `top_k` to be disabled with `0` (still accept `-1` for now) (#17773)
- The seed is now set to `0` by default for the V1 engine, meaning that different vLLM runs now yield the same outputs even if `temperature > 0`. This does not modify the random state in user code, since workers run in separate processes unless `VLLM_USE_V1_MULTIPROCESSING=0`. (#17929, #18741) See the sketch after this list.
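A small sketch of the new defaults under the V1 engine: `top_k=0` now means top-k filtering is disabled, and because the seed defaults to 0, repeated runs of this script sample the same text even with `temperature > 0` (the model name is illustrative).

```python
from vllm import LLM, SamplingParams

# top_k=0 now disables top-k filtering (-1 is still accepted for now).
params = SamplingParams(temperature=0.8, top_k=0, max_tokens=32)

# With the V1 engine the seed defaults to 0, so two independent runs of this
# script produce identical samples despite temperature > 0.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```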
Model Enhancements
- Support MiMo-7B (#17433), MiniMax-VL-01 (#16328), Ovis 1.6 (#17861), Ovis 2 (#15826), GraniteMoeHybrid 4.0 (#17497), FalconH1* (#18406), LlamaGuard4 (#17315)
  - Please install the development version of `transformers` (from source) to use Falcon-H1.
- Embedding models: nomic-embed-text-v2-moe (#17785), new class of gte models (#17986)
- Progress in Hybrid Memory Allocator (#17394, #17479, #17474, #17483, #17193, #17946, #17945, #17999, #18001, #18593)
- DeepSeek: perf enhancement by moving more calls into the cuda-graph region (#17484, #17668), function calling support (#17784), MTP in V1 (#18435)
- Qwen2.5-1M: Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844)
- Qwen2.5-VL speed enhancement via rotary_emb optimization (#17973)
- InternVL-Qwen2.5 models now support video inputs (#18499)
Performance, Production and Scaling
- Support full CUDA graph in V1 (#16072); see the sketch after this list.
- Pipeline Parallelism: MultiprocExecutor support (#14219), `torchrun` support (#17827)
- Support sequence parallelism combined with pipeline parallelism (#18243)
- Async tensor parallelism using compilation pass (#17882)
- Perf: Use small max_num_batched_tokens for A100 (#17885)
- Fast Model Loading: Tensorizer support for V1 and LoRA (#17926)
- Multi-modality: Automatically cast multi-modal input dtype before transferring to the device (#18756)
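A hedged sketch of opting into the new full-CUDA-graph capture on V1 via the compilation config; the `full_cuda_graph` flag name is taken from #16072 and should be treated as an assumption to verify against your installed version.

```python
from vllm import LLM, SamplingParams

# Capture the whole forward pass in a single CUDA graph instead of the
# default piecewise capture (flag name assumed from #16072).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    compilation_config={"full_cuda_graph": True},
)
print(llm.generate(["Hi there"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```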
Security
- Prevent side-channel attacks via cache salting (#17045); see the request sketch after this list.
- Fix image hash collision in certain edge cases (#17378)
- Add `VLLM_ALLOW_INSECURE_SERIALIZATION` env var (#17490)
- Migrate to the `regex` library to prevent catastrophic backtracking (#18454, #18750)
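A hedged sketch of per-tenant cache salting against the OpenAI-compatible server, assuming the request-level `cache_salt` field introduced in #17045 (passed via `extra_body` here); prefix-cache blocks are then only shared among requests carrying the same salt.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Requests with different cache_salt values never share prefix-cache blocks,
# closing the cache-timing side channel between tenants.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    messages=[{"role": "user", "content": "Summarize the quarterly report."}],
    extra_body={"cache_salt": "tenant-a"},  # field name assumed from #17045
)
print(resp.choices[0].message.content)
```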
Features
- CLI: `deprecated=True` for CLI args (#17426)
- Frontend: progress bar for adding requests (#17525), `chat_template_kwargs` in `LLM.chat` (#17356), `/classify` endpoint (#17032), truncation control for embedding models (#14776), `cached_tokens` in response usage (#18149); see the `LLM.chat` sketch after this list
- LoRA: default local directory LoRA resolver plugin (#16855)
- Metrics: KV event publishing (#16750), API for accessing in-memory Prometheus metrics (#17010)
- Quantization: `nvidia/DeepSeek-R1-FP4` (#16362), Quark MXFP4 format (#16943), AutoRound (#17850), torchao models with `AOPerModuleConfig` (#17826), CUDA-graph-compatible V1 GGUF support (#18646)
- Reasoning: deprecate `--enable-reasoning` (#17452)
- Spec Decode: EAGLE shares the target model's input embedding (#17326), torch.compile & cudagraph for EAGLE (#17211), EAGLE3 (#17504), log accumulated metrics (#17913), Medusa (#17956)
- Structured Outputs: thinking compatibility (#16577), spec decoding (#14702), Qwen3 reasoning parser (#17466), `tool_choice: required` for Xgrammar (#17845), Structural Tag with Guidance backend (#17333)
- Transformers backend: named parameters (#16868), interleaved sliding window attention (#18494)
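A minimal sketch of the new `chat_template_kwargs` argument to `LLM.chat` (#17356), here toggling Qwen3's `enable_thinking` switch; which kwargs are honored depends entirely on the model's chat template, so treat `enable_thinking` as a model-specific assumption.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # illustrative model with a thinking switch
messages = [{"role": "user", "content": "Give a one-line summary of vLLM."}]

# Extra keyword arguments are forwarded to the chat template; here we turn
# the model's thinking mode off for this request.
outputs = llm.chat(
    messages,
    SamplingParams(max_tokens=64),
    chat_template_kwargs={"enable_thinking": False},
)
print(outputs[0].outputs[0].text)
```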
Hardware
- NVIDIA: cutlass support for blackwell fp8 blockwise gemm (#14383)
- TPU: Multi-LoRA implementation (#14238), default max-num-batched-tokens (#17508), V1 backend by default (#17673), top-logprobs (#17072)
- Neuron: NeuronxDistributedInference support (#15970), Speculative Decoding, Dynamic on-device sampling (#16357), Mistral Model (#18222), Multi-LoRA (#18284)
- AMD: Enable FP8 KV cache on V1 (#17870), tuned fused MoE configs for Qwen3 MoE on MI300X (#17535, #17530), AITER biased group topk (#17955), Block-Scaled GEMM (#14968), MLA (#17523), Custom Paged Attention on Radeon GPUs (#17004), fewer environment variables required on the command line (#17229)
- Extensibility: Make PiecewiseBackend pluggable and extendable (#18076)
Documentation
- Update quickstart and install for cu128 using `--torch-backend=auto` (#18505)
- NVIDIA TensorRT Model Optimizer (#17561)
- Usage of Qwen3 thinking (#18291)
Developer Facing
- Benchmark: Add single turn MTBench to Serving Bench (#17202)
- Usability: Decrease import time of `vllm.multimodal` (#18031)
- Code Format: Code formatting using `ruff format` (#17656, #18068, #18400)
- Readability:
- Process:
  - Propose a deprecation policy for the project (#17063)
- Testing: expanding torch nightly tests (#18004)
What's Changed
- Support loading transformers models with named parameters by @wuisawesome in #16868
- Add tuned triton fused_moe configs for Qwen3Moe by @mgoin in #17328
- [Benchmark] Add single turn MTBench to Serving Bench by @ekagra-ranjan in #17202
- [Optim] Compute multimodal hash only once per item by @DarkLight1337 in #17314
- implement Structural Tag with Guidance backend by @mmoskal in #17333
- [V1][Spec Decode] Make Eagle model arch config driven by @ekagra-ranjan in #17323
- [model] make llama4 compatible with pure dense layers by @luccafong in #17315
- [Bugfix] Fix `numel()` downcast in fused_layernorm_dynamic_per_token_quant.cu by @r-barnes in #17316
- Ignore `'<string>'` filepath by @zou3519 in #17330
- [Bugfix] Add contiguous call inside rope kernel wrapper by @timzsu in #17091
- [Misc] Add a Jinja template to support Mistral3 function calling by @chaunceyjiang in #17195
- [Model] support MiniMax-VL-01 model by @qscqesze in #16328
- [Misc] Move config fields to MultiModalConfig by @DarkLight1337 in #17343
- [Misc]Use a platform independent interface to obtain the device attributes by @ponix-j in #17100
- [Fix] Documentation spacing in compilation config help text by @Zerohertz in #17342
- [Build][Bugfix] Restrict setuptools version to <80 by @gshtras in #17320
- [Model] Ignore rotary embed load for Cohere model by @ekagra-ranjan in #17319
- Update docs requirements by @hmellor in #17379
- [Doc] Fix QWen3MOE info by @jeejeelee in #17381
- [Bugfix] Clean up MiniMax-VL and fix processing by @DarkLight1337 in #17354
- `pre-commit autoupdate` by @hmellor in #17380
- [Frontend] Support `chat_template_kwargs` in `LLM.chat` by @DarkLight1337 in #17356
- Transformers backend tweaks by @hmellor in #17365
- Fix: Spelling of inference by @a2q1p in #17387
- Improve literal dataclass field conversion to argparse argument by @hmellor in #17391
- [V1] Remove num_input_tokens from attn_metadata by @heheda12345 in #17193
- [Bugfix] add qwen3 reasoning-parser fix content is None when disable … by @mofanke in #17369
- fix gemma3 results all zero by @mayuyuace in #17364
- [Misc][ROCm] Exclude `cutlass_mla_decode` for ROCm build by @tywuAMD in #17289
- Enabling multi-group kernel tests. by @Alexei-V-Ivanov-AMD in #17115
- [Docs] Propose a deprecation policy for the project by @russellb in #17063
- [Doc][Typo] Fixing label in new model requests link in overview.md by @casinca in #17400
- [TPU][V1][CI] Replace `python3 setup.py develop` with standard `pip install --e` on TPU by @NickLucche in #17374
- [CI] Uses Python 3.11 for TPU by @aarnphm in #17359
- [CI/Build] Add retry mechanism for add-apt-repository by @reidliu41 in #17107
- [Bugfix] Fix Minicpm-O-int4 GPTQ model inference by @Isotr0py in #17397
- Simplify (and fix) passing of guided decoding backend options by @hmellor in #17008
- Remove Falcon3 2x7B from CI by @hmellor in #17404
- Fix: Python package installation for opentelmetry by @dilipgb in #17049
- [V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE by @luyuzhe111 in #17211
- Remove Bamba 9B from CI by @hmellor in #17407
- [V1][Feature] Enable Speculative Decoding with Structured Outputs by @benchislett in #14702
- [release] Always git fetch all to get latest tag on TPU release by @khluu in #17322
- Truncation control for embedding models by @gmarinho2 in #14776
- Update PyTorch to 2.7.0 by @huydhn in #16859
- Improve configs - `ModelConfig` by @hmellor in #17130
- Fix call to `logger.info_once` by @hmellor in #17416
- Fix some speculative decode tests with tl.dot by @huydhn in #17371
- Support LoRA for Mistral3 by @mgoin in #17428
- [Intel GPU] [CI]Fix XPU ci, setuptools >=80.0 have build issue by @jikunshang in #17298
- [Hardware][Intel GPU] Upgrade to torch 2.7 by @jikunshang in #17444
- [Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_client' by @chaunceyjiang in #17434
- [MODEL ADDITION] Ovis2 Model Addition by @mlinmg in #15826
- Make the _apply_rotary_emb compatible with dynamo by @houseroad in #17435
- [Misc] Remove deprecated files by @chaunceyjiang in #17447
- [V1][Bugfix]: vllm v1 verison metric num_gpu_blocks is None by @lengrongfu in #15755
- [TPU][V1][CI] Update regression test baseline for v6 CI by @NickLucche in #17064
- [Core] Prevent side-channel attacks via cache salting by @dr75 in #17045
- [V1][Metrics] add support for kv event publishing by @alec-flowers in #16750
- [Feature] The Qwen3 reasoning parser supports guided decoding by @chaunceyjiang in #17466
- [Docs] Add command for running mypy tests from CI by @russellb in #17475
- [Fix] Support passing args to logger by @aarnphm in #17425
- [Bugfix] Fixed mistral tokenizer path when pointing to file by @psav in #17457
- [V1] Allow turning off pickle fallback in vllm.v1.serial_utils by @russellb in #17427
- [Docs] Update optimization.md doc by @mgoin in #17482
- [BugFix] Fix authorization of openai_transcription_client.py by @hhy3 in #17321
- [Bugfix][ROCm] Restrict ray version due to a breaking release by @gshtras in #17480
- [doc] add install tips by @reidliu41 in #17373
- doc: fix bug report Github template formatting by @davidxia in #17486
- [v1][Spec Decode] Make sliding window compatible with eagle prefix caching by @heheda12345 in #17398
- Bump Compressed Tensors version to 0.9.4 by @rahul-tuli in #17478
- [Misc] Rename Audios -> Audio in Qwen2audio Processing by @alex-jw-brooks in #17507
- [CI][TPU] Skip Multimodal test by @lsy323 in #17488
- [Bugfix][ROCm] Fix import error on ROCm by @gshtras in #17495
- [Bugfix] Temporarily disable gptq_bitblas on ROCm by @nlzy in #17411
- [CI][TPU] Skip structured outputs+spec decode tests on TPU by @mgoin in #17510
- [CI][Bugfix] Fix failing V1 Test due to missing 'cache_salt' arg by @mgoin in #17500
- [CI/Build] Reorganize models tests by @DarkLight1337 in #17459
- FIxing the AMD test failures caused by PR#16457 by @Alexei-V-Ivanov-AMD in #17511
- [Build] Require setuptools >= 77.0.3 for PEP 639 by @russellb in #17389
- [ROCm] Effort to reduce the number of environment variables in command line by @hongxiayang in #17229
- [BugFix] fix speculative decoding memory leak when speculation is disabled by @noyoshi in #15506
- [BugFix] Fix mla cpu - missing 3 required positional arguments by @LucasWilkinson in #17494
- Avoid overwriting vllm_compile_cache.py by @youngkent in #17418
- [Core] Enable IPv6 with vllm.utils.make_zmq_socket() by @russellb in #16506
- [Misc] Optimize the Qwen3_ReasoningParser extract_reasoning_content by @chaunceyjiang in #17515
- Improve configs - `ObservabilityConfig` by @hmellor in #17453
- [Bugfix][Benchmarks] Allow benchmark of deepspeed-mii backend to select a model by @tishizaki in #17285
- [Frontend] Show progress bar for adding requests by @DarkLight1337 in #17525
- [Misc] Clean up test docstrings and names by @DarkLight1337 in #17521
- [FEAT] [ROCm]: Add Qwen/Qwen3-30B-A3B-FP8 fused moe config for MI300X by @tjtanaa in #17530
- Fix more broken speculative decode tests by @huydhn in #17450
- [doc] add streamlit integration by @reidliu41 in #17522
- [FEAT] [ROCm]: Add Qwen/Qwen3-235B-A22B-FP8 TP4 triton fused moe config by @tjtanaa in #17535
- [Feature][Frontend]: Deprecate --enable-reasoning by @chaunceyjiang in #17452
- [ROCm] remove unsupported archs from rocm triton flash-attention supported list by @hongxiayang in #17536
- [torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations by @SageMoore in #10867
- [Misc] refactor example - cpu_offload_lmcache by @reidliu41 in #17460
- [CI/Build] Remove `awscli` dependency by @DarkLight1337 in #17532
- Move the last arguments in `arg_utils.py` to be in their final groups by @hmellor in #17531
- [Model] Refactor Ovis2 to support original tokenizer by @Isotr0py in #17537
- [ROCm] update installation guide to include build aiter from source instructions by @hongxiayang in #17542
- [Misc]add configurable cuda graph size by @CXIAAAAA in #17201
- [Bugfix] Fix lint error by @DarkLight1337 in #17547
- [ROCM] Add gfx950 to the custom attention archs by @jpvillam-amd in #16034
- Remove duplicate code from dbrx.py by @sstamenk in #17550
- [Bug]change the position of cuda_graph_sizes in dataclasses by @CXIAAAAA in #17548
- [Misc][Tools][Benchmark] Publish script to auto tune server parameters by @Chenyaaang in #17207
- [V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE3 by @zixi-qi in #17504
- [Bugfix] Disable gptq_bitblas for <SM80 to fix GPTQ on V100/T4 by @mgoin in #17541
- [Doc] note that not all unit tests pass on CPU platforms by @davidxia in #17554
- [Attention] MLA move o_proj q_proj into cuda-graph region by @LucasWilkinson in #17484
- [CI] Actually run tests/kv_transfer/test_disagg.py in CI by @mgoin in #17555
- Check if bitblas is installed during support check by @mgoin in #17572
- [Misc] Continue refactoring model tests by @DarkLight1337 in #17573
- Fix PixtralHF missing spatial_merge_size by @mgoin in #17571
- Add `pt_load_map_location` to allow loading to cuda by @jerryzh168 in #16869
- [Bugifx] Remove TritonPlaceholder from sys.modules by @Isotr0py in #17317
- [Core] [Bugfix] Add Input Embeddings by @qthequartermasterman in #15428
- [BugFix] Fix Memory Leak by @robertgshaw2-redhat in #17567
- [Misc] Rename assets for testing by @DarkLight1337 in #17575
- add more pytorch related tests for torch nightly by @yangw-dev in #17422
- [doc] add the print result by @reidliu41 in #17584
- Automatically tell users that dict args must be valid JSON in CLI by @hmellor in #17577
- [Security] Fix image hash collision by @DarkLight1337 in #17378
- Support W8A8 INT8 MoE for compressed-tensors by @mgoin in #16745
- [doc] miss result by @reidliu41 in #17589
- [Misc] Clean up input processing by @DarkLight1337 in #17582
- [Bugfix] fix tmp_out and exp_sums dimensions by @hliuca in #17438
- [BugFix][Attention] Fix sliding window attention in V1 giving incorrect results by @LucasWilkinson in #17574
- permute/unpermute kernel for moe optimization by @CalebDu in #14568
- Add NVIDIA TensorRT Model Optimizer in vLLM documentation by @Edwardf0t1 in #17561
- [Hardware][AMD] Improve OAM device ID + llama4 Maverick MOE tuning by @xw285cornell in #16263
- [easy] Print number of needed GPUs in skip message by @zou3519 in #17594
- fix typo in logging by @ehartford in #17605
- [release] Add command to clean up Docker containers/images in TPU release machine by @khluu in #17606
- [Neuron][Build] Require setuptools >= 77.0.3 for PEP 639 by @liangfu in #17603
- Update test requirements to CUDA 12.8 by @22quinn in #17576
- [Quantizaton] [AMD] Add support for running DeepSeek int8 w8a8 MoE on ROCm by @rasmith in #17558
- [Frontend][TPU] Add TPU default max-num-batched-tokens based on device name by @Chenyaaang in #17508
- [Build/CI] Upgrade CUTLASS to 3.9.1 by @tlrmchlsmth in #17602
- [Bugfix][ROCm] Using device_type because on ROCm the API is still torch.cuda by @gshtras in #17601
- [Core] Gate `prompt_embeds` behind a feature flag by @DarkLight1337 in #17607
- [Bugfix] Fix broken Qwen2.5-omni tests by @Isotr0py in #17613
- [Misc] V0 fallback for `--enable-prompt-embeds` by @DarkLight1337 in #17615
- Add full API docs and improve the UX of navigating them by @hmellor in #17485
- [Bugfix] Prioritize dtype in root config before checking text config by @DarkLight1337 in #17629
- [Bugfix][Easy] Fix whitespace in shm_broadcast.py logging by @tlrmchlsmth in #17635
- [Bugfix] fix KeyError on top logprobs are special tokens by @chaunceyjiang in #17637
- [Build/CI] Upgrade CUTLASS to 3.9.2 by @tlrmchlsmth in #17641
- [Kernel] some optimizations for dense marlin and moe marlin by @jinzhen-lin in #16850
- [Doc] Fix broken cuda installation doc rendering by @Isotr0py in #17654
- Use git-path commit in hook by @thomasjpfan in #17616
- [Benchmarks] Remove invalid option under V1 engine by @russellb in #17651
- [BugFix] Increase timeout for startup failure test by @njhill in #17642
- [TPU] Enable gemma3-27b with TP>1 on multi-chips. by @vanbasten23 in #17335
- [TPU][V1] Add support for top-logprobs by @NickLucche in #17072
- [Bugfix] LoRA - Retire unused maxnreg LoRA kernel argument by @varun-sundar-rabindranath in #17677
- Update nm to rht in doc links + refine fp8 doc by @mgoin in #17678
- [Model] Add GraniteMoeHybrid 4.0 model by @s3woz in #17497
- [easy] Fix logspam on PiecewiseBackend errors by @zou3519 in #17138
- [Bugfix] Fixed prompt length for random dataset by @Xarbirus in #17408
- [Doc] Update notes for H2O-VL and Gemma3 by @DarkLight1337 in #17219
- [Misc] Fix ScalarType float4 naming by @LucasWilkinson in #17690
- Fix `dockerfilegraph` pre-commit hook by @hmellor in #17698
- [Bugfix] Fix triton import with local TritonPlaceholder by @MengqingCao in #17446
- [V1] Enable TPU V1 backend by default by @mgoin in #17673
- [V1][PP] Support PP for MultiprocExecutor by @bigPYJ1151 in #14219
- [v1] AttentionMetadata for each layer by @heheda12345 in #17394
- [Feat] Add deprecated=True to CLI args by @aarnphm in #17426
- [Docs] Use gh-file to add links to tool_calling.md by @windsonsea in #17709
- [v1] Introduce KVCacheBlocks as interface between Scheduler and KVCacheManager by @heheda12345 in #17479
- [doc] Add RAG Integration example by @reidliu41 in #17692
- [Bugfix] Fix modality limits in vision language example by @DarkLight1337 in #17721
- Make right sidebar more readable in "Supported Models" by @hmellor in #17723
- [TPU] Increase block size and reset block shapes by @bythew3i in #16458
- [Misc] Add Next Edit Prediction (NEP) datasets support in `benchmark_serving.py` by @dtransposed in #16839
- [Bugfix] Fix for the condition to accept empty encoder inputs for mllama by @gshtras in #17732
- [Kernel] Unified Triton kernel that doesn't distinguish between prefill + decode by @tdoublep in #16828
- Fix doc build performance by @hmellor in #17748
- [ROCm] fix num_stages for default moe config to avoid triton OutOfResource error by @hongxiayang in #17744
- Add logging for torch nightly version by @yangw-dev in #17669
- [Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Requests for Corresponding Kernels by @cyang49 in #17146
- Removed unused marlin cuda code by @mgoin in #17684
- [TPU] Add kernel test for moe_pallas by @mgoin in #17496
- Replace lm-eval bash script with pytest and use enforce_eager for faster CI by @mgoin in #17717
- [BugFix][Spec Decode] Fix hidden size mismatch between target and eagle head by @WoosukKwon in #17740
- [Misc] Split model loader by @jeejeelee in #17712
- [Misc] Use `apply_rotary_emb` from vllm_flash_attn for Qwen2-VL vision RoPE by @Isotr0py in #17726
- [Kernel] GGUF MoeVec kernel by @SzymonOzog in #16780
- [Kernel] Use fused rmsnorm for some models like qwen3 series by @Eviannn in #17735
- [Misc] Remove qlora_adapter_name_or_path by @jeejeelee in #17699
- Add NeuronxDistributedInference support, Speculative Decoding, Dynamic on-device sampling by @aws-satyajith in #16357
- [Frontend] Add missing chat templates for various MLLMs by @DarkLight1337 in #17758
- Fix test_memory_usage_no_spec by @sarckk in #17754
- Make key optional for rotary embedding by @sarckk in #17566
- [doc] update the issue link by @reidliu41 in #17782
- [ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attention by @gshtras in #17139
- Only depend on importlib-metadata for Python < 3.10 by @tiran in #17776
- [Bugfix] Fix Video IO error for short video by @Isotr0py in #17791
- Fix and simplify `deprecated=True` CLI `kwarg` by @hmellor in #17781
- [Bugfix] Fix missing lora name mapping for lora without prefix by @Isotr0py in #17793
- [Quantization] Quark MXFP4 format loading by @BowenBao in #16943
- [Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend by @Akshat-Tripathi in #14238
- [BugFix] Avoid secondary missing `MultiprocExecutor.workers` error by @njhill in #17811
- [Core][Feature] Input metadata dump on crash by @wallashss in #13407
- [Chore][Doc] uses model id determined from OpenAI client by @aarnphm in #17815
- Don't call the venv `vllm` by @hmellor in #17810
- [BugFix] Fix `--disable-log-stats` in V1 server mode by @njhill in #17600
- [Core] Support full cuda graph in v1 by @chanh in #16072
- Improve exception reporting in MP engine by @vmarkovtsev in #17800
- [Installation] OpenTelemetry version update by @Xarbirus in #17771
- Only log non-default CLI args for online serving by @hmellor in #17803
- [V1] Add VLLM_ALLOW_INSECURE_SERIALIZATION env var by @russellb in #17490
- [Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs by @amd-hhashemi in #17071
- [Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for POWER by @Akashcodes732 in #17153
- [Hardware][Intel-Gaudi] Support Automatic Prefix Caching on HPU by @adobrzyn in #17648
- [Frontend] Chat template fallbacks for multimodal models by @DarkLight1337 in #17805
- [Qwen3]add qwen3-235b-bf16 fused moe config on A100 by @Ximingwang-09 in #17715
- [Bugfix] Fix bad words for Mistral models by @qionghuang6 in #17753
- [Misc] support model prefix & add deepseek vl2 tiny fused moe config by @xsank in #17763
- [Bugfix] Fix tool call template validation for Mistral models by @RIckYuan999 in #17644
- [TPU] Fix the test_sampler by @bythew3i in #17820
- [Bugfix] Fix quark fp8 format loading on AMD GPUs by @fxmarty-amd in #12612
- [Doc] Fix a typo in the file name by @DarkLight1337 in #17836
- [Easy] Eliminate c10::optional usage in vllm/csrc by @houseroad in #17819
- [Misc] add chatbox integration by @reidliu41 in #17828
- Fix transient dependency error in docs build by @hmellor in #17848
- [Bugfix] `use_fast` failing to be propagated to Qwen2-VL image processor by @DarkLight1337 in #17838
- [Misc] Delete LoRA-related redundancy code by @jeejeelee in #17841
- [CI] Fix test_collective_rpc by @russellb in #17858
- [V1] Improve VLLM_ALLOW_INSECURE_SERIALIZATION logging by @russellb in #17860
- [Test] Attempt all TPU V1 tests, even if some of them fail. by @yarongmu-google in #17334
- [CI] Prune down lm-eval small tests by @mgoin in #17012
- Fix noisy warning for uncalibrated q_scale/p_scale by @mgoin in #17414
- Add cutlass support for blackwell fp8 blockwise gemm by @wenscarl in #14383
- [FEAT][ROCm]: Support AITER MLA on V1 Engine by @vllmellm in #17523
- [V1][Structured Output] Update llguidance (`>= 0.7.11`) to avoid AttributeError (no `StructTag`) by @shen-shanshan in #17839
- [Attention] MLA move rotary embedding to cuda-graph region by @LucasWilkinson in #17668
- [BUGFIX]: return fast when request requires prompt logprobs by @andyxning in #17251
- [Docs] Add Slides from NYC Meetup by @simon-mo in #17879
- [Doc] Update several links in reasoning_outputs.md by @windsonsea in #17846
- [Doc] remove visible token in doc by @yma11 in #17884
- [Bugfix][ROCm] Fix AITER MLA V1 by @vllmellm in #17880
- [Bugfix][CPU] Fix broken AVX2 CPU TP support by @Isotr0py in #17252
- Fix Whisper crash caused by invalid `max_num_batched_tokens` config by @inkcherry in #17853
- Change `top_k` to be disabled with `0` (still accept `-1` for now) by @hmellor in #17773
- [Misc] add dify integration by @reidliu41 in #17895
- [BugFix][AMD] Compatible patch for latest AITER(05/07/2025) by @qli88 in #17864
- [v1] Move block management logic from KVCacheManager to SpecializedManager by @heheda12345 in #17474
- [CI/Build] Automatically retry flaky tests by @DarkLight1337 in #17856
- Revert "[BugFix][AMD] Compatible patch for latest AITER(05/07/2025)" by @mgoin in #17910
- [Misc] Add references in ray_serve_deepseek example by @ruisearch42 in #17907
- [Misc] Auto fallback to float16 for pre-Ampere GPUs when detected bfloat16 config by @Isotr0py in #17265
- Update CT WNA16MarlinMoE integration by @mgoin in #16666
- Handle error when `str` passed to `/v1/audio/transcriptions` by @hmellor in #17909
- Add option to use torch._inductor.standalone_compile by @zou3519 in #17057
- [V1][Spec Decoding] Include bonus tokens in mean acceptance length by @markmc in #17908
- Improve configs - the rest! by @hmellor in #17562
- AMD conditional all test execution // new test groups by @Alexei-V-Ivanov-AMD in #17556
- [Hardware/NVIDIA/Kernel] [Functional Enablement] [1/N] Enable nvidia/DeepSeek-R1-FP4 Model by @pavanimajety in #16362
- [V1][Spec Decoding] Log accumulated metrics after system goes idle by @markmc in #17913
- fix broken test vllm:test_kernels - test_attention_selector.py::test_flash_attn by @tracelogfb in #17873
- Add missing content type headers to /ping and /health (#17036) by @edrevo in #17786
- Don't default construct `ModelConfig` when default constructing `VllmConfig` by @hmellor in #17943
- [Misc] remove --model from vllm serve usage by @reidliu41 in #17944
- [v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders by @heheda12345 in #17483
- [v1] Rename specialized_manager.py to single_type_kv_cache_manager.py by @heheda12345 in #17946
- [Kernel] fp4 marlin kernel by @jinzhen-lin in #17687
- [Bugfix] Add revision to `transformers.Auto*.from_pretrained` processors by @xinli-centml in #17948
- [Perf] Use small max_num_batched_tokens for A100 by @KuntaiDu in #17885
- fix amd triton mla path by @842974287 in #17871
- [Bugfix]: v1 engine - consider lora adapters in allowed_token_ids by @bbrowning in #17855
- [doc] update lora doc by @reidliu41 in #17936
- [Frontend] Add /classify endpoint by @frieda-huang in #17032
- [Misc] Add compressed-tensors NVFP4A16 emulation support by @dsikka in #17914
- [FP8][ROCm][Attention] Enable FP8 KV cache on ROCm for V1 by @gshtras in #17870
- [New Model]: nomic-embed-text-v2-moe by @noooop in #17785
- [Misc] not show --model in vllm serve --help by @reidliu41 in #16691
- [BugFix] [ROCm]: Bugfix and handle addition case of input for `rocm_aiter_rms_norm` by @tjtanaa in #17857
- [BUG] [ROCm] [MLA] Fix variable name bug due to change in variable name in PR #17483 by @tjtanaa in #17961
- [Model] Broadcast Ovis2 implementation to fit Ovis1.6 by @Isotr0py in #17861
- [misc] add instructions on how to install nvshmem/pplx/deepep by @youkaichao in #17964
- [Bugfix] validate grammar and throw 400 error instead of crashing the engine when xgrammar validation fails by @Jason-CKY in #17623
- [bugfix] fix the wrong parser by @reidliu41 in #17958
- [Bugfix] Fix pydantic.errors.PydanticUserError by @Potabk in #17962
- [Bugfix][TPU] Use np array when updating cache slot_mapping by @lsy323 in #17971
- [Fix] Benchmark `"EngineClient" has no attribute "model_config"` by @b8zhong in #17976
- [Feature] Support DeepSeekV3 Function Call by @Xu-Wenqing in #17784
- Correcting testcases in builkite job for IBM Power by @AaruniAggarwal in #17675
- [Misc] Improve modelscope import error by @jeejeelee in #17983
- Initialize the delta tool call fields explicitly by @maxdebayser in #17340
- [P/D] NIXL Integration by @robertgshaw2-redhat in #17751
- [Lora][Frontend]Add default local directory LoRA resolver plugin. by @jberkhahn in #16855
- Construct `KVTransferConfig` properly from Python instead of using JSON blobs without CLI by @hmellor in #17994
- [CI/Build] Fix TPU V1 Test mixed use of & and && across tests by @CAROLZXYZXY in #17968
- [Core] Use platform-agnostic device control for DP engine core by @jianzs in #17245
- Enabling "Weight Loading Multiple GPU Test - Large Models" by @Alexei-V-Ivanov-AMD in #18020
- [v1][KVCacheManager] Change prefix caching metric from counting blocks to counting tokens by @heheda12345 in #18003
- [Chore] Remove unused method by @robertgshaw2-redhat in #18024
- Enable standard language model for torhc nightly by @yangw-dev in #18004
- [CI] Make JSON output tests less likely to fail by @russellb in #17859
- [V1][Spec Decode] Eagle unit tests by @wwl2755 in #17350
- [Bugfix] Fix FBGEMM integration by @mgoin in #18002
- [Model] Support MiMo-7B inference with MTP by @bwshen-mi in #17433
- Update some more deprecated type hinting by @hmellor in #17998
- Use NVFP4 Marlin for CompressedTensorsW4A16Fp4 by @mgoin in #18000
- Remove noisy warnings from `SchedulerConfig` by @hmellor in #17995
- [ROCm] Skip tests for quantizations incompatible with ROCm by @hissu-hyvarinen in #17905
- Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support by @sighingnow in #11844
- [Misc] Slight spelling modification by @jeejeelee in #18039
- [ROCm]: Fix build from source failure with gcc14 and ROCm 6.3 by @arjunkathuria in #13779
- [Bugfix] Fixes for new marlin moe usage by @mgoin in #18017
- [Bugfix] Avoid repeatedly creating dummy data during engine startup by @DarkLight1337 in #17935
- [Feature][V1] Support `tool_choice: required` when using Xgrammar as the `StructuredOutputBackend`. by @chaunceyjiang in #17845
- cleanup invalid prints by @calvin0327 in #18050
- [BugFix] Fix 4-GPU RLHF tests by @njhill in #18007
- Fix Broken macro for cutlass moe by @drisspg in #18049
- [v1][KVCacheManager] Avoid full cache hit by controlling max_length by @heheda12345 in #17999
- [Bugfix][V1] Only get input embeddings w/ multi-modal models if first PP by @jinhuang12 in #17916
- [BugFix] Set default random seed to 0 for V1 by @WoosukKwon in #17929
- [Bugfix] Fix marlin moe fallback logic for llama4 by @mgoin in #18042
- [Benchmarks] Refactor run_structured_output_benchmarks.sh by @russellb in #17722
- Convert `.buildkite` to `ruff format` by @hmellor in #17656
- [Fix] check to make sure processor has chat templates by @aarnphm in #18047
- [doc] add download/list/delete HF model CLI usage by @reidliu41 in #17940
- Update deprecated type hinting in `model_executor/layers` by @hmellor in #18056
- Update deprecated type hinting in `vllm/profiler` by @hmellor in #18057
- Update deprecated type hinting in `vllm/transformers_utils` by @hmellor in #18058
- [CI] Set token permissions for reminder comment CI job by @russellb in #17728
- [CI] Add workflow permissions for helm CI job by @russellb in #17727
- [CI] Add token permissions for add-ready-label CI job by @russellb in #17730
- [CI] set token permissions for pre-commit CI job by @russellb in #17729
- [Bugfix] Fix entrypoints metrics tests by @DarkLight1337 in #18063
- Convert `benchmarks` to `ruff format` by @hmellor in #18068
- Give auto-merge label workflow permission to add labels to issues by @hmellor in #18078
- Update deprecated type hinting in `vllm/compilation` by @hmellor in #18072
- Update deprecated type hinting in `vllm/adapter_commons` by @hmellor in #18073
- [V1] DP scale-out (2/N): Decouple engine process management and comms by @njhill in #15977
- [Docs] Expand security doc with firewall info by @russellb in #18081
- [FEAT] [ROCm]: Add AITER Block-Scaled GEMM Feature by @vllmellm in #14968
- [v1][KVCacheManager] pass num_new_computed_tokens to kv cache manager by @heheda12345 in #18001
- [Fix] Support CUDAGraph capture for encoder-decoder on ROCm by @ProExpertProg in #18104
- [Hardware/NVIDIA/Modelopt] Fix modelopt forward method for v1 torch.compile by @pavanimajety in #18101
- [P/D] Add some more debug logs to `NixlConnector` by @njhill in #18102
- [Misc] Remove unused numpy tensor by @ywang96 in #18084
- [Bug]: Fix S3 model/tokenizer path resolution by @gilljon in #18083
- [core][distributed] add ep group and all2all interface by @youkaichao in #18077
- [Bugfix] Fix FP8 Marlin MoE and enable for compressed-tensors models by @mgoin in #18026
- [Bugfix][V1] Fix FlashInfer V1 backend using the wrong VllmConfig by @mgoin in #18086
- [FEAT] [ROCm] [V1]: Add AITER biased group topk for DeepSeekV3 by @vllmellm in #17955
- [AMD][torch.compile] Enable silu+fp8_quant fusion for rocm by @charlifu in #18082
- [BugFix][AMD] Compatible patch for AITER lib after 04/20 by @qli88 in #17912
- Fix broken example: examples/offline_inference/profiling at scheduler_config by @Ecthlion in #18117
- [Fix] Move "model_config" as keyword args in chat_utils.py by @lk-chen in #18098
- [Bugfix] fix moe marlin `topk_weight` loading by @jinzhen-lin in #18080
- [Bugfix][Example] make lmcache v0 work. by @majianpeng in #18051
- [New Model]: support GTE NewModel by @noooop in #17986
- [Bugfix] Fix entrypoints audio test failure by @DarkLight1337 in #18111
- [Model] Add packed_modules_mapping for Qwen3-MOE by @jeejeelee in #18118
- [Misc] replace does not exist model by @lengrongfu in #18119
- [Bugfix] Fix QKVCrossParallelLinear::sync_weight_attrs for PyTorch compile by @anko-intel in #17844
- [FEAT] [ROCm]: Add AITER CK 2 Stages MoE support by @tjtanaa in #17110
- [Bugfix] Fix LoRA test by @jeejeelee in #18123
- [Model] GritLM supports other attention backends by @DarkLight1337 in #18109
- [doc] add missing import by @reidliu41 in #18133
- Update deprecated type hinting in `vllm/lora` by @hmellor in #18128
- Update deprecated type hinting in `vllm/device_allocator` and `vllm/distributed` by @hmellor in #18126
- Update deprecated type hinting in `platform`, `plugins`, `triton_utils`, `vllm_flash_attn` by @hmellor in #18129
- [Bugfix] Fix chat utils tests by @DarkLight1337 in #18139
- [KVConnector] Keep KVTransferParams as a dict by @njhill in #18033
- [Doc] Update prefix cache metrics to counting tokens by @heheda12345 in #18138
- [V1][Spec Decode] Share input embedding of target model with EAGLE draft model to free ~1GB for llama 3 model by @ekagra-ranjan in #17326
- Modularize fused experts and integrate PPLX kernels by @bnellnm in #15956
- [CI] Disable Failing Tests by @robertgshaw2-redhat in #18165
- [Frontend] decrease import time of vllm.multimodal by @davidxia in #18031
- [Kernel] Have rotary embeddings support tensors by @LucasWilkinson in #18046
- [V1] Structured Outputs + Thinking compatibility by @aarnphm in #16577
- Add support for loading torchao models with `AOPerModuleConfig` by @jerryzh168 in #17826
- [CI] Fix race condition in test_kv_cache_events test by @russellb in #18169
- [V1] Support multiple kv connectors by @mgoin in #17564
- Upload vllm index for the rc builds by @atalman in #18173
- [Bugfix]: make most of `test_openai_schema.py` pass by @davidxia in #17664
- [v1] Support multiple KV cache groups in GPU model runner by @heheda12345 in #17945
- [V1][Metrics] Remove unused code by @markmc in #18158
- [Chore] astral's ty by @aarnphm in #18116
- [Misc] add lobe-chat support by @reidliu41 in #18177
- [Fix][ROCm] Enforce eager for all encoder-decoder models on ROCm by @ProExpertProg in #18154
- Update deprecated type hinting in `models` by @hmellor in #18132
- [Bugfix] Fix fp8 tests for triton_unified_attention for Triton 3.3 by @tdoublep in #18013
- Support custom implementations of VideoLoader backends. by @huachenheli in #18091
- [UT] Add ut for none hash by @andyxning in #17892
- [Model] Allow the use of sliding window in Qwen2 by @inkcherry in #17772
- [Bugfix] Fix FusedMoEPrepareAndFinalize for cuda-disalike backends by @MengqingCao in #18178
- [CI] don't skip fixed `test_kv_cache_events()` by @davidxia in #18183
- [V1] Update zmq socket creation in nixl connector by @russellb in #18148
- fix: typos by @omahs in #18151
- Update deprecated type hinting in `model_loader` by @hmellor in #18130
- add tools into TokenizeChatRequest by @hustxiayang in #18187
- [Kernel] [V1] Fix performance regression for triton unified attention by @tdoublep in #18161
- Adding "Basic Models Test" and "Multi-Modal Models Test (Extended) 3" in AMD Pipeline by @Alexei-V-Ivanov-AMD in #18106
- Improve examples rendering in docs and GitHub by @hmellor in #18203
- [Frontend] Fix chat template content format detection by @schoennenbeck in #18190
- [Bugfix]Change the exception thrown by call_hf_processor from RuntimeError to ValueError by @Abatom in #18181
- [Bugfix] [ROCm]: Remove assertion logic when using AITER fused moe in unquantizedMethod to reenable LLama4 BF16 by @tjtanaa in #18205
- [Misc] Avoid cuda graph log when sizes still match by @NickLucche in #18202
- Adding "AMD: Tensorizer Test" to amdproduction. by @Alexei-V-Ivanov-AMD in #18216
- [Bugfix] Fix test_eagle test by @luccafong in #18223
- [Build] Allow shipping PTX on a per-file basis by @LucasWilkinson in #18155
- [Bugfix] fix rotary embedding test for _get_padded_tensor_shape by @LucasWilkinson in #18229
- [Bugfix][ROCm] Use `chunked_prefill_paged_decode` as fallback for V1 attention on ROCm by @kliuae in #18093
- [Model] vLLM v1 supports Medusa by @skylee-01 in #17956
- Allow users to pass arbitrary JSON keys from CLI by @hmellor in #18208
- Throw better error for when running into k8s service discovery issue by @wseaton in #18209
- [Feature] Support Pipeline Parallism in torchrun SPMD offline inference for V1 by @luccafong in #17827
- [doc] fix multimodal example script by @davidxia in #18089
- [PERF] Speed up Qwen2.5-VL model by speed up rotary position embedding const… by @vadiklyutiy in #17973
- [Misc] Add Ray Prometheus logger to V1 by @eicherseiji in #17925
- [Misc] Consolidate Audio tests into multimodal common generation tests by @Isotr0py in #18214
- use ceil_div in cutlass block scaling shape check by @IwakuraRein in #17918
- [Fix] Fix typo in `resolve_hf_chat_template` by @fxmarty-amd in #18259
- [Model] Use autoweightloader for dbrx by @learner0810 in #18251
- [Misc][MacOS] fix bfloat16 error by @reidliu41 in #18249
- [BugFix] Fix multi async save in MultiConnector by @njhill in #18246
- [BugFix] Fix ordering of KVConnector finished send/rcv sets by @njhill in #18211
- [CI] Assign reviewer to mergify with changes to Tensorizer files by @sangstar in #18278
- [Sampler] Adapt to FlashInfer 0.2.3 sampler API by @abmfy in #15777
- [Bugfix] fix `an illegal memory access was encountered` of marlin kernel + act_order by @jinzhen-lin in #18245
- [Spec Decode] Don't fall back to V0 when spec decoding is enabled by @WoosukKwon in #18265
- [V1][P/D] Local attention optimization for NIXL by @mgoin in #18170
- Move cli args docs to its own page (#18228) by @strangiato in #18264
- [Misc] reformat the collect-env output by @reidliu41 in #18285
- [BugFix] Correct max_model_len derivation from config.json for Mistral format by @princepride in #17937
- [P/D][V1] Support dynamic loading of external KV connector implementations by @sdavidbd in #18142
- [Hardware][TPU] Optionally import for TPU backend by @lsy323 in #18269
- Update Dockerfile to build for Blackwell by @mgoin in #18095
- Fixed build on ppc64le due to openssl conflicts by @npanpaliya in #18262
- [Model] use AutoWeightsLoader for solar by @lengrongfu in #18113
- [MISC] fix typo by @andyxning in #18305
- Support sequence parallelism combined with pipeline parallelism by @cascade812 in #18243
- [doc] update reasoning doc by @reidliu41 in #18306
- [Model] Use sigmoid for single-label classification by @22quinn in #18313
- Fix copy-paste error in phi4mm image processing by @lifuhuang in #18315
- [Misc] add litellm integration by @reidliu41 in #18320
- [Doc] Add doc to explain the usage of Qwen3 thinking by @WangErXiao in #18291
- [Spec Decode][V0] Fix spec decode correctness test in V0 eagle/medusa by @wwl2755 in #18175
- Feature/vllm/input embedding completion api by @Nan2018 in #17590
- [Misc] extract parser.parse_args() by @reidliu41 in #18323
- [Build] Supports CUDA 12.6 and 11.8 after Blackwell Update by @simon-mo in #18316
- fix: Add type specifications for CLI arguments in tensorizer options by @googs1025 in #18314
- [BugFix] [Vul] Add missing `usedforsecurity=False` in MD5 hashing to enable FIPS by @shaoyuyoung in #18319
- [Doc] Fix prompt embedding examples by @Potabk in #18350
- [Doc] Move input-related docs to Features by @DarkLight1337 in #18353
- [BugFix] Fix handling of num_computed_tokens with connector by @njhill in #18232
- [Quantization] Pool model support bitsandbytes by @jeejeelee in #18087
- [Doc] Fix typo by @eladsegal in #18355
- [Frontend] add --quick option for vllm chat/complete by @reidliu41 in #18297
- [Feature]Add support for models quantized with AutoRound by @wenhuach21 in #17850
- Add fused MoE kernel tuning configs (fp8_w8a8) for DeepSeek V3/R1 on a single-node 8x NVIDIA H20 96GB setup by @sunyicode0012 in #18337
- [Misc] Fix typo by @Unprincess17 in #18330
- Neuron up mistral by @aws-satyajith in #18222
- fix CUDA_check redefinition in #17918 by @luccafong in #18287
- [neuron] fix authorization issue by @liangfu in #18364
- [Misc] Allow `AutoWeightsLoader` to skip loading weights with specific substr in name by @Isotr0py in #18358
- [Core] [Bugfix]: tensor parallel with prompt embeds by @Nan2018 in #18171
- [release] Change dockerhub username for TPU release by @khluu in #18389
- [Bugfix] fix adding bias twice in ipex GPTQ quantization by @rand-fly in #18363
- [doc] update env variable export by @reidliu41 in #18391
- [Misc] Add LoRA code owner by @jeejeelee in #18387
- Update cpu.txt by @princepride in #18398
- [CI] Add mteb testing to test the accuracy of the embedding model by @noooop in #17175
- [Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text by @wulipc in #18407
- [Misc] refactor prompt embedding examples by @reidliu41 in #18405
- [Minor] Rename quantization nvfp4 to modelopt_fp4 by @mgoin in #18356
- [Model] use AutoWeightsLoader for bloom by @calvin0327 in #18300
- [Kernel] update comment for KV shape in unified triton attn by @haochengxia in #18099
- fix:Build torch wheel inline rather than picking from nightly by @dilipgb in #18351
- [TPU] Re-enable the Pallas MoE kernel by @mgoin in #18025
- [Bugfix] config.head_dim is now explicitly set to None by @gshtras in #18432
- [Bug] Fix moe_sum signature by @bnellnm in #18440
- Revert "[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text (#18407)" by @DarkLight1337 in #18456
- [Bugfix][Failing Test] Fix nixl connector test when promt size < block size by @wwl2755 in #18429
- [Misc] MultiConnector._connectors type by @NickLucche in #18423
- [Frontend] deprecate `--device` arg by @kebe7jun in #18399
- [V1] Fix general plugins not loaded in engine for multiproc by @sarckk in #18326
- [Misc] refactor disaggregated-prefill-v1 example by @reidliu41 in #18474
- [Bugfix][Failing Test] Fix test_events.py by @rabi in #18460
- [MODEL] FalconH1 by @dhiaEddineRhaiem in #18406
- [Doc] fix arg docstring in linear layers by @giantcroc in #18410
- [Bugfix] Reduce moe_sum test size to avoid OOM by @bnellnm in #18484
- [Build] fix Dockerfile shell by @kebe7jun in #18402
- [Misc] Update deprecation message for `--enable-reasoning` by @Zerohertz in #18404
- [ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1 by @hyoon1 in #17004
- Revert "[v1] Support multiple KV cache groups in GPU model runner (#17945) by @markmc in #18459
- [FEAT][ROCm] Upgrade AITER MLA v1 backend by @vllmellm in #18338
- [Bugfix] Consistent ascii handling in tool parsers by @schoennenbeck in #17704
- [FalconH1] Fix output dtype in RMSNorm fallback path for Falcon-H1 (e.g. 0.5B) by @dhiaEddineRhaiem in #18500
- [MISC] update project urls in pyproject.toml by @andyxning in #18519
- [CI] Fix race condition with StatelessProcessGroup.barrier by @russellb in #18506
- Intialize io_thread_pool attribute in the beginning. by @rabi in #18331
- [Bugfix] Inconsistent token calculation compared to HF in llava family by @cyr0930 in #18479
- [BugFix][DP] Send DP wave completion only from `dp_rank==0` by @njhill in #18502
- [Bugfix][Model] Make Olmo2Model weight loading return loaded weights by @2015aroras in #18504
- [Bugfix] Fix LoRA test by @jeejeelee in #18518
- [Doc] Fix invalid JSON in example args by @DarkLight1337 in #18527
- [Neuron] Update Dockerfile.neuron to use latest neuron release (2.23) by @aws-satyajith in #18512
- Update default neuron config for speculation by @elaineyz in #18274
- Order sequence ids + config update to support specifying custom quantization layers by @elaineyz in #18279
- [Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text by @wulipc in #18526
- [Bugfix] Add kwargs to RequestOutput init to be forward compatible by @lk-chen in #18513
- [CI/Build] Update bamba test model location by @hmellor in #18544
- [Doc] Support --stream arg in openai_completion_client.py script by @googs1025 in #18388
- [Bugfix] Use random hidden states in dummy sampler run by @abmfy in #18543
- [Doc] Add stream flag for chat completion example by @calvin0327 in #18524
- [BugFix][CPU] Fix x86 SHM distributed module initialization by @bigPYJ1151 in #18536
- [Misc] improve Automatic Prefix Caching example by @reidliu41 in #18554
- [Misc] Call `ndarray.tobytes()` directly instead of `ndarray.data.tobytes()` by @lgeiger in #18347
- [Bugfix] make `test_openai_schema.py` pass by @davidxia in #18224
- [Platform] Move platform check to right place by @wangxiyuan in #18470
- [Compile][Platform] Make PiecewiseBackend pluggable and extendable by @MengqingCao in #18076
- [Build/CI] Fix CUDA 11.8 build by @tlrmchlsmth in #17679
- [Tool] Add NIXL installation script by @lk-chen in #18172
- [V1][Spec Decode][Bugfix] Load quantize weights for EAGLE by @ekagra-ranjan in #18290
- [Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser by @wukaixingxp in #17917
- [Frontend] [Core] Add Tensorizer support for V1, LoRA adapter serialization and deserialization by @sangstar in #17926
- [AMD] [P/D] Compute num gpus for ROCm correctly in run_accuracy_test.sh by @rasmith in #18568
- Re-submit: Fix: Proper RGBA -> RGB conversion for PIL images. by @huachenheli in #18569
- [V1][Spec Decoding] Use model_loader.get_model() to load models by @markmc in #18273
- Enable interleaved sliding window attention models for Transformers backend by @hmellor in #18494
- [Misc] refactor: simplify input validation and num_requests handling in _convert_v1_inputs by @googs1025 in #18482
- [BugFix] Increase TP execute_model timeout by @njhill in #18558
- [Bugfix] Set `KVTransferConfig.engine_id` in post_init by @lk-chen in #18576
- [Spec Decode] Make EAGLE3 draft token ID mapping optional by @benchislett in #18488
- [Neuron] Remove bypass on EAGLEConfig and add a test by @elaineyz in #18514
- [Bugfix][Benchmarks] Fix a benchmark of deepspeed-mii backend to use api_key by @tishizaki in #17291
- [Misc] Replace `cuda` hard code with `current_platform` by @shen-shanshan in #16983
- [Hardware] correct method signatures for HPU,ROCm,XPU by @andyxning in #18551
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal by @RonaldBXu in #18034
- [Feature]Add async tensor parallelism using compilation pass by @cascade812 in #17882
- [Doc] Update quickstart and install for cu128 using `--torch-backend=auto` by @mgoin in #18505
- [Feature][V1]: suupports cached_tokens in response usage by @chaunceyjiang in #18149
- [Bugfix] Add half type support in reshape_and_cache_cpu_impl on x86 cpu platform by @zzzyq in #18430
- Migrate docs from Sphinx to MkDocs by @hmellor in #18145
- Revert "[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (#18034)" by @DarkLight1337 in #18600
- [Bugfix][Model] Fix baichuan model loader for tp by @MengqingCao in #18597
- [V0][Bugfix] Fix parallel sampling performance regression when guided decoding is enabled by @shadeMe in #17731
- Add myself as docs code owner by @hmellor in #18605
- [Hardware][CPU] Update intel_extension_for_pytorch 2.7.0 and move to `requirements/cpu.txt` by @yankay in #18542
- [CI] fix kv_cache_type argument by @andyxning in #18594
- [Doc] Fix indent of contributing to vllm by @Zerohertz in #18611
- Replace `{func}` with mkdocs style links by @hmellor in #18610
- [CI/Build] Fix V1 flag being set in entrypoints tests by @DarkLight1337 in #18598
- Fix examples with code blocks in docs by @hmellor in #18609
- [Bugfix] Fix transformers model impl ignored for mixtral quant by @tristanleclercq in #18602
- Include private attributes in API documentation by @hmellor in #18614
- [Misc] add Haystack integration by @reidliu41 in #18601
- [Bugfix][Build/CI] Fixup CUDA compiler version check for CUDA_SUPPORTED_ARCHS by @simon-mo in #18579
- [Doc] Fix markdown list indentation for MkDocs rendering by @Zerohertz in #18620
- [Doc] Use a different color for the announcement by @DarkLight1337 in #18616
- Refactor pplx init logic to make it modular (prepare for deepep) by @youkaichao in #18200
- Fix figures in design doc by @hmellor in #18612
- [Docs] Change mkdocs to not use directory urls by @mgoin in #18622
- [v1] Redo "Support multiple KV cache groups in GPU model runner (#17945)" by @heheda12345 in #18593
- [Doc] fix list formatting by @davidxia in #18624
- [Doc] Fix top-level API links/docs by @DarkLight1337 in #18621
- [Doc] Avoid documenting dynamic / internal modules by @DarkLight1337 in #18626
- [Doc] Fix broken links and unlinked docs, add shortcuts to home sidebar by @DarkLight1337 in #18627
- [V1] Support Deepseek MTP by @YaoJiayi in #18435
- Use prebuilt FlashInfer x86_64 PyTorch 2.7 CUDA 12.8 wheel for CI by @huydhn in #18537
- [CI] Enable test_initialization to run on V1 by @mgoin in #16736
- [Doc] Update references to doc files by @DarkLight1337 in #18637
- [ModelOpt] Introduce VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE env var to control blockscale tensor allocation by @pavanimajety in #18160
- [Bugfix] Migrate to REGEX Library to prevent catastrophic backtracking by @Crucifixion-Fxl in #18454
- [Bugfix][Nixl] Fix Preemption Bug by @robertgshaw2-redhat in #18631
- config.py: Clarify that only local GGUF checkpoints are supported. by @MathieuBordere in #18623
- FIX MOE issue in AutoRound format by @wenhuach21 in #18586
- [V1][Spec Decode] Small refactors to improve eagle bookkeeping performance by @zixi-qi in #18424
- [Frontend] improve vllm serve --help display by @reidliu41 in #18643
- [Model] Add support for Qwen2.5-Omni-7B-AWQ (Qwen2_5OmniForConditionalGeneration) by @Nalkey in #18647
- [V1][Spec Decode] Support multi-layer eagle draft model by @zixi-qi in #18030
- [Doc] Update README links, mark external links by @DarkLight1337 in #18635
- [MISC][pre-commit] Add pre-commit check for triton import by @MengqingCao in #17716
- [Doc] Fix indentation problems in V0 Paged Attention docs by @DarkLight1337 in #18659
- [Doc] Add community links by @DarkLight1337 in #18657
- [Model] use AutoWeightsLoader for gpt2 by @ztang2370 in #18625
- [Doc] Reorganize user guide by @DarkLight1337 in #18661
- [CI/Build] `chmod +x` to `cleanup_pr_body.sh` by @DarkLight1337 in #18650
- [MISC] typo fix and clean import by @andyxning in #18664
- [BugFix] Fix import error for fused_moe by @wangxiyuan in #18642
- [CI] enforce import regex instead of re by @aarnphm in #18665
- fix(regression): clone from reference items by @aarnphm in #18662
- [CI/Build] fix permission denied issue by @reidliu41 in #18645
- [BugFix][Spec Decode] Improve Prefix Caching Logic in Speculative Decoding by @WoosukKwon in #18668
- [V1] Fix _pickle.PicklingError: Can't pickle <class 'transformers_modules.deepseek-ai.DeepSeek-V2-Lite... by @eicherseiji in #18640
- [MISC] correct signature for LoaderFunction by @andyxning in #18670
- [Misc] Replace `cuda` hard code with `current_platform` in Ray by @noemotiovon in #14668
- [Misc][ModelScope] Change to use runtime VLLM_USE_MODELSCOPE by @MengqingCao in #18655
- [VLM] Initialize video input support for InternVL models by @Isotr0py in #18499
- Speed up the `kernels/quantization/` tests by @mgoin in #18669
- [BUGFIX] catch subclass first for try...except by @andyxning in #18672
- [Misc] Reduce logs on startup by @DarkLight1337 in #18649
- [doc] fix broken links by @reidliu41 in #18671
- [doc] improve readability by @reidliu41 in #18675
- [Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment by @zzzyq in #18674
- [CI/build] fix no regex by @reidliu41 in #18676
- [Misc] small improve by @reidliu41 in #18680
- [Bugfix] Fix profiling dummy data for Pixtral by @DarkLight1337 in #18677
- [Core][Multimodal] Convert PIL Image to array without data copy when hashing by @lgeiger in #18682
- [CI/Build][Doc] Update `gte-Qwen2-1.5B-instruct` usage by @DarkLight1337 in #18683
- [Misc] Fixed the abnormally high TTFT issue in the PD disaggregation example by @zhaohaidao in #18644
- refactor: simplify request handler, use positive condition check for handler assignment by @googs1025 in #18690
- [Bugfix] Fix the lm_head in gpt_bigcode in lora mode by @maxdebayser in #6357
- [CI] add missing argument by @andyxning in #18694
- [GH] Add issue template for reporting CI failures by @DarkLight1337 in #18696
- [Doc] Fix issue template format by @DarkLight1337 in #18699
- [Bugfix] Fix Mistral-format models with sliding window by @DarkLight1337 in #18693
- [CI/Build] Replace `math.isclose` with `pytest.approx` by @DarkLight1337 in #18703
- [CI] fix dump_input for str type by @andyxning in #18697
- [Model] Add support for YARN in NemotronNAS models by @Naveassaf in #18427
- [CI/Build] Split pooling and generation extended language models tests in CI by @Isotr0py in #18705
- [Hardware][Intel-Gaudi] [CI/Build] Add tensor parallel size = 2 test to HPU CI by @ldurejko in #18709
- [Misc] add AutoGen integration by @reidliu41 in #18712
- [Bugfix]: handle hf-xet CAS error when loading Qwen3 weights in vLLM by @YanWuHao in #18701
- [Doc] Improve API docs by @DarkLight1337 in #18713
- [Doc] Move examples and further reorganize user guide by @DarkLight1337 in #18666
- [Bugfix] Fix Llama GGUF initialization by @DarkLight1337 in #18717
- [V1][Sampler] Improve performance of FlashInfer sampling by sampling logits instead of probs by @lgeiger in #18608
- Convert `examples` to `ruff-format` by @hmellor in #18400
- [Model][Gemma3] Simplify image input validation by @lgeiger in #18710
- [Misc] improve web section group title display by @reidliu41 in #18684
- [V1][Quantization] Add CUDA graph compatible v1 GGUF support by @Isotr0py in #18646
- [Model][Gemma3] Cast image pixel values already on CPU by @lgeiger in #18732
- [FEAT] [ROCm] Upgrade AITER Fused MoE kernels. by @vllmellm in #18271
- [Doc] Update OOT model docs by @DarkLight1337 in #18742
- [Doc] Update reproducibility doc and example by @DarkLight1337 in #18741
- [Misc] improve docs by @reidliu41 in #18734
- feat(rocm-support): support mamba2 on rocm by @almersawi in #18565
- [Hardware][Intel-Gaudi] [CI/Build] Fix multiple containers using the same name in run-hpu-test.sh by @ldurejko in #18752
- [Doc] cleanup deprecated flag for doc by @calvin0327 in #18715
- Minor fix about MooncakeStoreConnector by @maobaolong in #18721
- [Build] fix cpu build missing libtbbmalloc.so by @kebe7jun in #18744
- [BUG FIX] minicpm by @huangyuxiang03 in #18739
- [Doc] Convert Sphinx directives (`{class}`, `{meth}`, `{attr}`, ...) to MkDocs format for better documentation linking by @Zerohertz in #18663
- [CI/Build] Remove imports of built-in `re` by @DarkLight1337 in #18750
- [V1][Metrics] Add API for accessing in-memory Prometheus metrics by @markmc in #17010
- Disable prefix cache by default for benchmark by @cascade812 in #18639
- optimize get_kv_cache_torch_dtype by @chunxiaozheng in #18531
- [Core] Automatically cast multi-modal input dtype by @DarkLight1337 in #18756
- [Bugfix] Mistral tool calling when content is list by @mgoin in #18729
New Contributors
- @r-barnes made their first contribution in #17316
- @qscqesze made their first contribution in #16328
- @ponix-j made their first contribution in #17100
- @Zerohertz made their first contribution in #17342
- @a2q1p made their first contribution in #17387
- @mofanke made their first contribution in #17369
- @mayuyuace made their first contribution in #17364
- @casinca made their first contribution in #17400
- @mlinmg made their first contribution in #15826
- @alec-flowers made their first contribution in #16750
- @psav made their first contribution in #17457
- @nlzy made their first contribution in #17411
- @noyoshi made their first contribution in #15506
- @tishizaki made their first contribution in #17285
- @sstamenk made their first contribution in #17550
- @qthequartermasterman made their first contribution in #15428
- @CalebDu made their first contribution in #14568
- @Edwardf0t1 made their first contribution in #17561
- @xw285cornell made their first contribution in #16263
- @ehartford made their first contribution in #17605
- @thomasjpfan made their first contribution in #17616
- @s3woz made their first contribution in #17497
- @Xarbirus made their first contribution in #17408
- @bythew3i made their first contribution in #16458
- @dtransposed made their first contribution in #16839
- @BowenBao made their first contribution in #16943
- @vmarkovtsev made their first contribution in #17800
- @amd-hhashemi made their first contribution in #17071
- @qionghuang6 made their first contribution in #17753
- @RIckYuan999 made their first contribution in #17644
- @fxmarty-amd made their first contribution in #12612
- @inkcherry made their first contribution in #17853
- @tracelogfb made their first contribution in #17873
- @edrevo made their first contribution in #17786
- @xinli-centml made their first contribution in #17948
- @bbrowning made their first contribution in #17855
- @frieda-huang made their first contribution in #17032
- @Xu-Wenqing made their first contribution in #17784
- @bwshen-mi made their first contribution in #17433
- @arjunkathuria made their first contribution in #13779
- @calvin0327 made their first contribution in #18050
- @jinhuang12 made their first contribution in #17916
- @gilljon made their first contribution in #18083
- @Ecthlion made their first contribution in #18117
- @majianpeng made their first contribution in #18051
- @anko-intel made their first contribution in #17844
- @huachenheli made their first contribution in #18091
- @omahs made their first contribution in #18151
- @hustxiayang made their first contribution in #18187
- @eicherseiji made their first contribution in #17925
- @IwakuraRein made their first contribution in #17918
- @learner0810 made their first contribution in #18251
- @strangiato made their first contribution in #18264
- @princepride made their first contribution in #17937
- @sdavidbd made their first contribution in #18142
- @Nan2018 made their first contribution in #17590
- @googs1025 made their first contribution in #18314
- @shaoyuyoung made their first contribution in #18319
- @eladsegal made their first contribution in #18355
- @wenhuach21 made their first contribution in #17850
- @sunyicode0012 made their first contribution in #18337
- @Unprincess17 made their first contribution in #18330
- @rand-fly made their first contribution in #18363
- @rabi made their first contribution in #18460
- @giantcroc made their first contribution in #18410
- @hyoon1 made their first contribution in #17004
- @cyr0930 made their first contribution in #18479
- @elaineyz made their first contribution in #18274
- @lgeiger made their first contribution in #18347
- @RonaldBXu made their first contribution in #18034
- @zzzyq made their first contribution in #18430
- @shadeMe made their first contribution in #17731
- @Crucifixion-Fxl made their first contribution in #18454
- @MathieuBordere made their first contribution in #18623
- @Nalkey made their first contribution in #18647
- @ztang2370 made their first contribution in #18625
- @zhaohaidao made their first contribution in #18644
- @ldurejko made their first contribution in #18709
- @YanWuHao made their first contribution in #18701
- @almersawi made their first contribution in #18565
- @huangyuxiang03 made their first contribution in #18739
- @chunxiaozheng made their first contribution in #18531
Full Changelog: v0.8.5.post1...v0.9.0