Highlights
This release features 649 commits from 215 contributors (82 new contributors!).
- vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change to environment dependencies.
- The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute CUDA 12.6 wheels as GitHub release artifacts.
- As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy.
- Enhanced NVIDIA Blackwell support. vLLM now ships with an initial set of optimized kernels for NVIDIA Blackwell, covering both attention and MLP.
  - You can use our docker image or install the FlashInfer nightly wheel (`pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl`), then set `VLLM_ATTENTION_BACKEND=FLASHINFER` for better performance; see the sketch after this list.
  - Upgraded support for the new FlashInfer main branch. (#15777)
  - Please check out #18153 for the full roadmap.
- Initial DP, EP, PD support for large-scale inference
  - EP: permute/unpermute kernels for MoE optimization (#14568), modularized fused experts with PPLX kernel integration (#15956), EP group and all2all interface (#18077)
  - DP: decouple engine process management and comms (#15977)
  - PD: NIXL integration (#17751), local attention optimization for NIXL (#18170), support for multiple KV connectors (#17564)
- Migrate docs from Sphinx to MkDocs (#18145, #18610, #18614, #18616, #18622, #18626, #18627, #18635, #18637, #18657, #18663, #18666, #18713)
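A minimal sketch of how the FlashInfer backend mentioned above can be selected from Python, assuming the wheel (or the docker image) is already installed; the environment variable has to be set before vLLM is imported, and the model name is only illustrative.

```python
import os

# Select the FlashInfer attention backend before importing vLLM.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
outputs = llm.generate(["Hello from Blackwell!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```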
Notable Changes
- Removal of CUDA 12.4 support due to PyTorch upgrade to 2.7.
- Change `top_k` to be disabled with `0` (still accept `-1` for now) (#17773)
- The seed is now set to `0` by default for the V1 engine, meaning that different vLLM runs now yield the same outputs even if `temperature > 0`. This does not modify the random state in user code, since workers run in separate processes unless `VLLM_USE_V1_MULTIPROCESSING=0`. (#17929, #18741) See the sketch after this list.
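A small sketch of the new defaults under the V1 engine: `top_k=0` now means top-k filtering is disabled, and because the seed defaults to 0, repeated runs of this script sample the same text even with `temperature > 0` (the model name is illustrative).

```python
from vllm import LLM, SamplingParams

# top_k=0 now disables top-k filtering (-1 is still accepted for now).
params = SamplingParams(temperature=0.8, top_k=0, max_tokens=32)

# With the V1 engine the seed defaults to 0, so two independent runs of this
# script produce identical samples despite temperature > 0.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```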
Model Enhancements
- Support MiMo-7B (#17433), MiniMax-VL-01 (#16328), Ovis 1.6 (#17861), Ovis 2 (#15826), GraniteMoeHybrid 4.0 (#17497), FalconH1* (#18406), LlamaGuard4 (#17315)
  - Please install the development version of `transformers` (from source) to use Falcon-H1.
- Embedding models: nomic-embed-text-v2-moe (#17785), new class of gte models (#17986)
- Progress in Hybrid Memory Allocator (#17394, #17479, #17474, #17483, #17193, #17946, #17945, #17999, #18001, #18593)
- DeepSeek: perf enhancement by moving more calls into the cuda-graph region (#17484, #17668), function calling support (#17784), MTP in V1 (#18435)
- Qwen2.5-1M: Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844)
- Qwen2.5-VL speed enhancement via rotary_emb optimization (#17973)
- InternVL-Qwen2.5 models now support video inputs (#18499)
Performance, Production and Scaling
- Support full CUDA graph in V1 (#16072); see the sketch after this list.
- Pipeline Parallelism: MultiprocExecutor support (#14219), `torchrun` support (#17827)
- Support sequence parallelism combined with pipeline parallelism (#18243)
- Async tensor parallelism using compilation pass (#17882)
- Perf: Use small max_num_batched_tokens for A100 (#17885)
- Fast Model Loading: Tensorizer support for V1 and LoRA (#17926)
- Multi-modality: Automatically cast multi-modal input dtype before transferring to the device (#18756)
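A hedged sketch of opting into the new full-CUDA-graph capture on V1 via the compilation config; the `full_cuda_graph` flag name is taken from #16072 and should be treated as an assumption to verify against your installed version.

```python
from vllm import LLM, SamplingParams

# Capture the whole forward pass in a single CUDA graph instead of the
# default piecewise capture (flag name assumed from #16072).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    compilation_config={"full_cuda_graph": True},
)
print(llm.generate(["Hi there"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```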
Security
- Prevent side-channel attacks via cache salting (#17045); see the request sketch after this list.
- Fix image hash collision in certain edge cases (#17378)
- Add `VLLM_ALLOW_INSECURE_SERIALIZATION` env var (#17490)
- Migrate to the `regex` library to prevent catastrophic backtracking (#18454, #18750)
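A hedged sketch of per-tenant cache salting against the OpenAI-compatible server, assuming the request-level `cache_salt` field introduced in #17045 (passed via `extra_body` here); prefix-cache blocks are then only shared among requests carrying the same salt.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Requests with different cache_salt values never share prefix-cache blocks,
# closing the cache-timing side channel between tenants.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    messages=[{"role": "user", "content": "Summarize the quarterly report."}],
    extra_body={"cache_salt": "tenant-a"},  # field name assumed from #17045
)
print(resp.choices[0].message.content)
```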
Features
- CLI: `deprecated=True` for CLI args (#17426)
- Frontend: progress bar for adding requests (#17525), `chat_template_kwargs` in `LLM.chat` (#17356), `/classify` endpoint (#17032), truncation control for embedding models (#14776), `cached_tokens` in response usage (#18149); see the `LLM.chat` sketch after this list
- LoRA: default local directory LoRA resolver plugin (#16855)
- Metrics: KV event publishing (#16750), API for accessing in-memory Prometheus metrics (#17010)
- Quantization: `nvidia/DeepSeek-R1-FP4` (#16362), Quark MXFP4 format (#16943), AutoRound (#17850), torchao models with `AOPerModuleConfig` (#17826), CUDA-graph-compatible V1 GGUF support (#18646)
- Reasoning: deprecate `--enable-reasoning` (#17452)
- Spec Decode: EAGLE shares the target model's input embedding (#17326), torch.compile & cudagraph for EAGLE (#17211), EAGLE3 (#17504), log accumulated metrics (#17913), Medusa (#17956)
- Structured Outputs: thinking compatibility (#16577), spec decoding (#14702), Qwen3 reasoning parser (#17466), `tool_choice: required` for Xgrammar (#17845), Structural Tag with Guidance backend (#17333)
- Transformers backend: named parameters (#16868), interleaved sliding window attention (#18494)
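A minimal sketch of the new `chat_template_kwargs` argument to `LLM.chat` (#17356), here toggling Qwen3's `enable_thinking` switch; which kwargs are honored depends entirely on the model's chat template, so treat `enable_thinking` as a model-specific assumption.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # illustrative model with a thinking switch
messages = [{"role": "user", "content": "Give a one-line summary of vLLM."}]

# Extra keyword arguments are forwarded to the chat template; here we turn
# the model's thinking mode off for this request.
outputs = llm.chat(
    messages,
    SamplingParams(max_tokens=64),
    chat_template_kwargs={"enable_thinking": False},
)
print(outputs[0].outputs[0].text)
```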
Hardware
- NVIDIA: cutlass support for blackwell fp8 blockwise gemm (#14383)
- TPU: Multi-LoRA implementation (#14238), default max-num-batched-tokens (#17508), V1 backend by default (#17673), top-logprobs (#17072)
- Neuron: NeuronxDistributedInference support (#15970), Speculative Decoding, Dynamic on-device sampling (#16357), Mistral Model (#18222), Multi-LoRA (#18284)
- AMD: Enable FP8 KV cache on V1 (#17870), tuned fused MoE configs for Qwen3 MoE on MI300X (#17535, #17530), AITER biased group topk (#17955), Block-Scaled GEMM (#14968), MLA (#17523), Custom Paged Attention on Radeon GPUs (#17004), fewer environment variables required on the command line (#17229)
- Extensibility: Make PiecewiseBackend pluggable and extendable (#18076)
Documentation
- Update quickstart and install for cu128 using `--torch-backend=auto` (#18505)
- NVIDIA TensorRT Model Optimizer (#17561)
- Usage of Qwen3 thinking (#18291)
Developer Facing
- Benchmark: Add single turn MTBench to Serving Bench (#17202)
- Usability: Decrease import time of `vllm.multimodal` (#18031)
- Code Format: Code formatting using `ruff format` (#17656, #18068, #18400)
- Readability:
- Process:
  - Propose a deprecation policy for the project (#17063)
- Testing: expanding torch nightly tests (#18004)
What's Changed
- Support loading transformers models with named parameters by @wuisawesome in #16868
- Add tuned triton fused_moe configs for Qwen3Moe by @mgoin in #17328
- [Benchmark] Add single turn MTBench to Serving Bench by @ekagra-ranjan in #17202
- [Optim] Compute multimodal hash only once per item by @DarkLight1337 in #17314
- implement Structural Tag with Guidance backend by @mmoskal in #17333
- [V1][Spec Decode] Make Eagle model arch config driven by @ekagra-ranjan in #17323
- [model] make llama4 compatible with pure dense layers by @luccafong in #17315
- [Bugfix] Fix `numel()` downcast in fused_layernorm_dynamic_per_token_quant.cu by @r-barnes in #17316
- Ignore `'<string>'` filepath by @zou3519 in #17330
- [Bugfix] Add contiguous call inside rope kernel wrapper by @timzsu in #17091
- [Misc] Add a Jinja template to support Mistral3 function calling by @chaunceyjiang in #17195
- [Model] support MiniMax-VL-01 model by @qscqesze in #16328
- [Misc] Move config fields to MultiModalConfig by @DarkLight1337 in #17343
- [Misc]Use a platform independent interface to obtain the device attributes by @ponix-j in #17100
- [Fix] Documentation spacing in compilation config help text by @Zerohertz in #17342
- [Build][Bugfix] Restrict setuptools version to <80 by @gshtras in #17320
- [Model] Ignore rotary embed load for Cohere model by @ekagra-ranjan in #17319
- Update docs requirements by @hmellor in #17379
- [Doc] Fix QWen3MOE info by @jeejeelee in #17381
- [Bugfix] Clean up MiniMax-VL and fix processing by @DarkLight1337 in #17354
- `pre-commit autoupdate` by @hmellor in #17380
- [Frontend] Support `chat_template_kwargs` in `LLM.chat` by @DarkLight1337 in #17356
- Transformers backend tweaks by @hmellor in #17365
- Fix: Spelling of inference by @a2q1p in #17387
- Improve literal dataclass field conversion to argparse argument by @hmellor in #17391
- [V1] Remove num_input_tokens from attn_metadata by @heheda12345 in #17193
- [Bugfix] add qwen3 reasoning-parser fix content is None when disable … by @mofanke in #17369
- fix gemma3 results all zero by @mayuyuace in #17364
- [Misc][ROCm] Exclude `cutlass_mla_decode` for ROCm build by @tywuAMD in #17289
- Enabling multi-group kernel tests. by @Alexei-V-Ivanov-AMD in #17115
- [Docs] Propose a deprecation policy for the project by @russellb in #17063
- [Doc][Typo] Fixing label in new model requests link in overview.md by @casinca in #17400
- [TPU][V1][CI] Replace `python3 setup.py develop` with standard `pip install --e` on TPU by @NickLucche in #17374
- [CI] Uses Python 3.11 for TPU by @aarnphm in #17359
- [CI/Build] Add retry mechanism for add-apt-repository by @reidliu41 in #17107
- [Bugfix] Fix Minicpm-O-int4 GPTQ model inference by @Isotr0py in #17397
- Simplify (and fix) passing of guided decoding backend options by @hmellor in #17008
- Remove Falcon3 2x7B from CI by @hmellor in #17404
- Fix: Python package installation for opentelmetry by @dilipgb in #17049
- [V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE by @luyuzhe111 in #17211
- Remove Bamba 9B from CI by @hmellor in #17407
- [V1][Feature] Enable Speculative Decoding with Structured Outputs by @benchislett in #14702
- [release] Always git fetch all to get latest tag on TPU release by @khluu in #17322
- Truncation control for embedding models by @gmarinho2 in #14776
- Update PyTorch to 2.7.0 by @huydhn in #16859
- Improve configs - `ModelConfig` by @hmellor in #17130
- Fix call to `logger.info_once` by @hmellor in #17416
- Fix some speculative decode tests with tl.dot by @huydhn in #17371
- Support LoRA for Mistral3 by @mgoin in #17428
- [Intel GPU] [CI]Fix XPU ci, setuptools >=80.0 have build issue by @jikunshang in #17298
- [Hardware][Intel GPU] Upgrade to torch 2.7 by @jikunshang in #17444
- [Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_client' by @chaunceyjiang in #17434
- [MODEL ADDITION] Ovis2 Model Addition by @mlinmg in #15826
- Make the _apply_rotary_emb compatible with dynamo by @houseroad in #17435
- [Misc] Remove deprecated files by @chaunceyjiang in #17447
- [V1][Bugfix]: vllm v1 verison metric num_gpu_blocks is None by @lengrongfu in #15755
- [TPU][V1][CI] Update regression test baseline for v6 CI by @NickLucche in #17064
- [Core] Prevent side-channel attacks via cache salting by @dr75 in #17045
- [V1][Metrics] add support for kv event publishing by @alec-flowers in #16750
- [Feature] The Qwen3 reasoning parser supports guided decoding by @chaunceyjiang in #17466
- [Docs] Add command for running mypy tests from CI by @russellb in #17475
- [Fix] Support passing args to logger by @aarnphm in #17425
- [Bugfix] Fixed mistral tokenizer path when pointing to file by @psav in #17457
- [V1] Allow turning off pickle fallback in vllm.v1.serial_utils by @russellb in #17427
- [Docs] Update optimization.md doc by @mgoin in #17482
- [BugFix] Fix authorization of openai_transcription_client.py by @hhy3 in #17321
- [Bugfix][ROCm] Restrict ray version due to a breaking release by @gshtras in #17480
- [doc] add install tips by @reidliu41 in #17373
- doc: fix bug report Github template formatting by @davidxia in #17486
- [v1][Spec Decode] Make sliding window compatible with eagle prefix caching by @heheda12345 in #17398
- Bump Compressed Tensors version to 0.9.4 by @rahul-tuli in #17478
- [Misc] Rename Audios -> Audio in Qwen2audio Processing by @alex-jw-brooks in #17507
- [CI][TPU] Skip Multimodal test by @lsy323 in #17488
- [Bugfix][ROCm] Fix import error on ROCm by @gshtras in #17495
- [Bugfix] Temporarily disable gptq_bitblas on ROCm by @nlzy in #17411
- [CI][TPU] Skip structured outputs+spec decode tests on TPU by @mgoin in #17510
- [CI][Bugfix] Fix failing V1 Test due to missing 'cache_salt' arg by @mgoin in #17500
- [CI/Build] Reorganize models tests by @DarkLight1337 in #17459
- FIxing the AMD test failures caused by PR#16457 by @Alexei-V-Ivanov-AMD in #17511
- [Build] Require setuptools >= 77.0.3 for PEP 639 by @russellb in #17389
- [ROCm] Effort to reduce the number of environment variables in command line by @hongxiayang in #17229
- [BugFix] fix speculative decoding memory leak when speculation is disabled by @noyoshi in #15506
- [BugFix] Fix mla cpu - missing 3 required positional arguments by @LucasWilkinson in #17494
- Avoid overwriting vllm_compile_cache.py by @youngkent in #17418
- [Core] Enable IPv6 with vllm.utils.make_zmq_socket() by @russellb in #16506
- [Misc] Optimize the Qwen3_ReasoningParser extract_reasoning_content by @chaunceyjiang in #17515
- Improve configs - `ObservabilityConfig` by @hmellor in #17453
- [Bugfix][Benchmarks] Allow benchmark of deepspeed-mii backend to select a model by @tishizaki in #17285
- [Frontend] Show progress bar for adding requests by @DarkLight1337 in #17525
- [Misc] Clean up test docstrings and names by @DarkLight1337 in #17521
- [FEAT] [ROCm]: Add Qwen/Qwen3-30B-A3B-FP8 fused moe config for MI300X by @tjtanaa in #17530
- Fix more broken speculative decode tests by @huydhn in #17450
- [doc] add streamlit integration by @reidliu41 in #17522
- [FEAT] [ROCm]: Add Qwen/Qwen3-235B-A22B-FP8 TP4 triton fused moe config by @tjtanaa in #17535
- [Feature][Frontend]: Deprecate --enable-reasoning by @chaunceyjiang in #17452
- [ROCm] remove unsupported archs from rocm triton flash-attention supported list by @hongxiayang in #17536
- [torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations by @SageMoore in #10867
- [Misc] refactor example - cpu_offload_lmcache by @reidliu41 in #17460
- [CI/Build] Remove `awscli` dependency by @DarkLight1337 in #17532
- Move the last arguments in `arg_utils.py` to be in their final groups by @hmellor in #17531
- [Model] Refactor Ovis2 to support original tokenizer by @Isotr0py in #17537
- [ROCm] update installation guide to include build aiter from source instructions by @hongxiayang in #17542
- [Misc]add configurable cuda graph size by @CXIAAAAA in #17201
- [Bugfix] Fix lint error by @DarkLight1337 in #17547
- [ROCM] Add gfx950 to the custom attention archs by @jpvillam-amd in #16034
- Remove duplicate code from dbrx.py by @sstamenk in #17550
- [Bug]change the position of cuda_graph_sizes in dataclasses by @CXIAAAAA in #17548
- [Misc][Tools][Benchmark] Publish script to auto tune server parameters by @Chenyaaang in #17207
- [V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE3 by @zixi-qi in #17504
- [Bugfix] Disable gptq_bitblas for <SM80 to fix GPTQ on V100/T4 by @mgoin in #17541
- [Doc] note that not all unit tests pass on CPU platforms by @davidxia in #17554
- [Attention] MLA move o_proj q_proj into cuda-graph region by @LucasWilkinson in #17484
- [CI] Actually run tests/kv_transfer/test_disagg.py in CI by @mgoin in #17555
- Check if bitblas is installed during support check by @mgoin in #17572
- [Misc] Continue refactoring model tests by @DarkLight1337 in #17573
- Fix PixtralHF missing spatial_merge_size by @mgoin in #17571
- Add `pt_load_map_location` to allow loading to cuda by @jerryzh168 in #16869
- [Bugifx] Remove TritonPlaceholder from sys.modules by @Isotr0py in #17317
- [Core] [Bugfix] Add Input Embeddings by @qthequartermasterman in #15428
- [BugFix] Fix Memory Leak by @robertgshaw2-redhat in #17567
- [Misc] Rename assets for testing by @DarkLight1337 in #17575
- add more pytorch related tests for torch nightly by @yangw-dev in #17422
- [doc] add the print result by @reidliu41 in #17584
- Automatically tell users that dict args must be valid JSON in CLI by @hmellor in #17577
- [Security] Fix image hash collision by @DarkLight1337 in #17378
- Support W8A8 INT8 MoE for compressed-tensors by @mgoin in #16745
- [doc] miss result by @reidliu41 in #17589
- [Misc] Clean up input processing by @DarkLight1337 in #17582
- [Bugfix] fix tmp_out and exp_sums dimensions by @hliuca in #17438
- [BugFix][Attention] Fix sliding window attention in V1 giving incorrect results by @LucasWilkinson in #17574
- permute/unpermute kernel for moe optimization by @CalebDu in #14568
- Add NVIDIA TensorRT Model Optimizer in vLLM documentation by @Edwardf0t1 in #17561
- [Hardware][AMD] Improve OAM device ID + llama4 Maverick MOE tuning by @xw285cornell in #16263
- [easy] Print number of needed GPUs in skip message by @zou3519 in #17594
- fix typo in logging by @ehartford in #17605
- [release] Add command to clean up Docker containers/images in TPU release machine by @khluu in #17606
- [Neuron][Build] Require setuptools >= 77.0.3 for PEP 639 by @liangfu in #17603
- Update test requirements to CUDA 12.8 by @22quinn in #17576
- [Quantizaton] [AMD] Add support for running DeepSeek int8 w8a8 MoE on ROCm by @rasmith in #17558
- [Frontend][TPU] Add TPU default max-num-batched-tokens based on device name by @Chenyaaang in #17508
- [Build/CI] Upgrade CUTLASS to 3.9.1 by @tlrmchlsmth in #17602
- [Bugfix][ROCm] Using device_type because on ROCm the API is still torch.cuda by @gshtras in #17601
- [Core] Gate `prompt_embeds` behind a feature flag by @DarkLight1337 in #17607
- [Bugfix] Fix broken Qwen2.5-omni tests by @Isotr0py in #17613
- [Misc] V0 fallback for `--enable-prompt-embeds` by @DarkLight1337 in #17615
- Add full API docs and improve the UX of navigating them by @hmellor in #17485
- [Bugfix] Prioritize dtype in root config before checking text config by @DarkLight1337 in #17629
- [Bugfix][Easy] Fix whitespace in shm_broadcast.py logging by @tlrmchlsmth in #17635
- [Bugfix] fix KeyError on top logprobs are special tokens by @chaunceyjiang in #17637
- [Build/CI] Upgrade CUTLASS to 3.9.2 by @tlrmchlsmth in #17641
- [Kernel] some optimizations for dense marlin and moe marlin by @jinzhen-lin in #16850
- [Doc] Fix broken cuda installation doc rendering by @Isotr0py in #17654
- Use git-path commit in hook by @thomasjpfan in #17616
- [Benchmarks] Remove invalid option under V1 engine by @russellb in #17651
- [BugFix] Increase timeout for startup failure test by @njhill in #17642
- [TPU] Enable gemma3-27b with TP>1 on multi-chips. by @vanbasten23 in #17335
- [TPU][V1] Add support for top-logprobs by @NickLucche in #17072
- [Bugfix] LoRA - Retire unused maxnreg LoRA kernel argument by @varun-sundar-rabindranath in #17677
- Update nm to rht in doc links + refine fp8 doc by @mgoin in #17678
- [Model] Add GraniteMoeHybrid 4.0 model by @s3woz in #17497
- [easy] Fix logspam on PiecewiseBackend errors by @zou3519 in #17138
- [Bugfix] Fixed prompt length for random dataset by @Xarbirus in #17408
- [Doc] Update notes for H2O-VL and Gemma3 by @DarkLight1337 in #17219
- [Misc] Fix ScalarType float4 naming by @LucasWilkinson in #17690
- Fix `dockerfilegraph` pre-commit hook by @hmellor in #17698
- [Bugfix] Fix triton import with local TritonPlaceholder by @MengqingCao in #17446
- [V1] Enable TPU V1 backend by default by @mgoin in #17673
- [V1][PP] Support PP for MultiprocExecutor by @bigPYJ1151 in #14219
- [v1] AttentionMetadata for each layer by @heheda12345 in #17394
- [Feat] Add deprecated=True to CLI args by @aarnphm in #17426
- [Docs] Use gh-file to add links to tool_calling.md by @windsonsea in #17709
- [v1] Introduce KVCacheBlocks as interface between Scheduler and KVCacheManager by @heheda12345 in #17479
- [doc] Add RAG Integration example by @reidliu41 in #17692
- [Bugfix] Fix modality limits in vision language example by @DarkLight1337 in #17721
- Make right sidebar more readable in "Supported Models" by @hmellor in #17723
- [TPU] Increase block size and reset block shapes by @bythew3i in #16458
- [Misc] Add Next Edit Prediction (NEP) datasets support in `benchmark_serving.py` by @dtransposed in #16839
- [Bugfix] Fix for the condition to accept empty encoder inputs for mllama by @gshtras in #17732
- [Kernel] Unified Triton kernel that doesn't distinguish between prefill + decode by @tdoublep in #16828
- Fix doc build performance by @hmellor in #17748
- [ROCm] fix num_stages for default moe config to avoid triton OutOfResource error by @hongxiayang in #17744
- Add logging for torch nightly version by @yangw-dev in #17669
- [Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Requests for Corresponding Kernels by @cyang49 in #17146
- Removed unused marlin cuda code by @mgoin in #17684
- [TPU] Add kernel test for moe_pallas by @mgoin in #17496
- Replace lm-eval bash script with pytest and use enforce_eager for faster CI by @mgoin in #17717
- [BugFix][Spec Decode] Fix hidden size mismatch between target and eagle head by @WoosukKwon in #17740
- [Misc] Split model loader by @jeejeelee in #17712
- [Misc] Use `apply_rotary_emb` from vllm_flash_attn for Qwen2-VL vision RoPE by @Isotr0py in #17726
- [Kernel] GGUF MoeVec kernel by @SzymonOzog in #16780
- [Kernel] Use fused rmsnorm for some models like qwen3 series by @Eviannn in #17735
- [Misc] Remove qlora_adapter_name_or_path by @jeejeelee in #17699
- Add NeuronxDistributedInference support, Speculative Decoding, Dynamic on-device sampling by @aws-satyajith in #16357
- [Frontend] Add missing chat templates for various MLLMs by @DarkLight1337 in #17758
- Fix test_memory_usage_no_spec by @sarckk in #17754
- Make key optional for rotary embedding by @sarckk in #17566
- [doc] update the issue link by @reidliu41 in #17782
- [ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attention by @gshtras in #17139
- Only depend on importlib-metadata for Python < 3.10 by @tiran in #17776
- [Bugfix] Fix Video IO error for short video by @Isotr0py in #17791
- Fix and simplify `deprecated=True` CLI `kwarg` by @hmellor in #17781
- [Bugfix] Fix missing lora name mapping for lora without prefix by @Isotr0py in #17793
- [Quantization] Quark MXFP4 format loading by @BowenBao in #16943
- [Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend by @Akshat-Tripathi in #14238
- [BugFix] Avoid secondary missing `MultiprocExecutor.workers` error by @njhill in #17811
- [Core][Feature] Input metadata dump on crash by @wallashss in #13407
- [Chore][Doc] uses model id determined from OpenAI client by @aarnphm in #17815
- Don't call the venv `vllm` by @hmellor in #17810
- [BugFix] Fix `--disable-log-stats` in V1 server mode by @njhill in #17600
- [Core] Support full cuda graph in v1 by @chanh in #16072
- Improve exception reporting in MP engine by @vmarkovtsev in #17800
- [Installation] OpenTelemetry version update by @Xarbirus in #17771
- Only log non-default CLI args for online serving by @hmellor in #17803
- [V1] Add VLLM_ALLOW_INSECURE_SERIALIZATION env var by @russellb in #17490
- [Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs by @amd-hhashemi in #17071
- [Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for POWER by @Akashcodes732 in #17153
- [Hardware][Intel-Gaudi] Support Automatic Prefix Caching on HPU by @adobrzyn in #17648
- [Frontend] Chat template fallbacks for multimodal models by @DarkLight1337 in #17805
- [Qwen3]add qwen3-235b-bf16 fused moe config on A100 by @Ximingwang-09 in #17715
- [Bugfix] Fix bad words for Mistral models by @qionghuang6 in #17753
- [Misc] support model prefix & add deepseek vl2 tiny fused moe config by @xsank in #17763
- [Bugfix] Fix tool call template validation for Mistral models by @RIckYuan999 in #17644
- [TPU] Fix the test_sampler by @bythew3i in #17820
- [Bugfix] Fix quark fp8 format loading on AMD GPUs by @fxmarty-amd in #12612
- [Doc] Fix a typo in the file name by @DarkLight1337 in #17836
- [Easy] Eliminate c10::optional usage in vllm/csrc by @houseroad in #17819
- [Misc] add chatbox integration by @reidliu41 in #17828
- Fix transient dependency error in docs build by @hmellor in #17848
- [Bugfix] `use_fast` failing to be propagated to Qwen2-VL image processor by @DarkLight1337 in #17838
- [Misc] Delete LoRA-related redundancy code by @jeejeelee in #17841
- [CI] Fix test_collective_rpc by @russellb in #17858
- [V1] Improve VLLM_ALLOW_INSECURE_SERIALIZATION logging by @russellb in #17860
- [Test] Attempt all TPU V1 tests, even if some of them fail. by @yarongmu-google in #17334
- [CI] Prune down lm-eval small tests by @mgoin in #17012
- Fix noisy warning for uncalibrated q_scale/p_scale by @mgoin in #17414
- Add cutlass support for blackwell fp8 blockwise gemm by @wenscarl in #14383
- [FEAT][ROCm]: Support AITER MLA on V1 Engine by @vllmellm in #17523
- [V1][Structured Output] Update llguidance (`>= 0.7.11`) to avoid AttributeError (no `StructTag`) by @shen-shanshan in #17839
- [Attention] MLA move rotary embedding to cuda-graph region by @LucasWilkinson in #17668
- [BUGFIX]: return fast when request requires prompt logprobs by @andyxning in #17251
- [Docs] Add Slides from NYC Meetup by @simon-mo in #17879
- [Doc] Update several links in reasoning_outputs.md by @windsonsea in #17846
- [Doc] remove visible token in doc by @yma11 in #17884
- [Bugfix][ROCm] Fix AITER MLA V1 by @vllmellm in #17880
- [Bugfix][CPU] Fix broken AVX2 CPU TP support by @Isotr0py in #17252
- Fix Whisper crash caused by invalid `max_num_batched_tokens` config by @inkcherry in #17853
- Change `top_k` to be disabled with `0` (still accept `-1` for now) by @hmellor in #17773
- [Misc] add dify integration by @reidliu41 in #17895
- [BugFix][AMD] Compatible patch for latest AITER(05/07/2025) by @qli88 in #17864
- [v1] Move block management logic from KVCacheManager to SpecializedManager by @heheda12345 in #17474
- [CI/Build] Automatically retry flaky tests by @DarkLight1337 in #17856
- Revert "[BugFix][AMD] Compatible patch for latest AITER(05/07/2025)" by @mgoin in #17910
- [Misc] Add references in ray_serve_deepseek example by @ruisearch42 in #17907
- [Misc] Auto fallback to float16 for pre-Ampere GPUs when detected bfloat16 config by @Isotr0py in #17265
- Update CT WNA16MarlinMoE integration by @mgoin in #16666
- Handle error when `str` passed to `/v1/audio/transcriptions` by @hmellor in #17909
- Add option to use torch._inductor.standalone_compile by @zou3519 in #17057
- [V1][Spec Decoding] Include bonus tokens in mean acceptance length by @markmc in #17908
- Improve configs - the rest! by @hmellor in #17562
- AMD conditional all test execution // new test groups by @Alexei-V-Ivanov-AMD in #17556
- [Hardware/NVIDIA/Kernel] [Functional Enablement] [1/N] Enable nvidia/DeepSeek-R1-FP4 Model by @pavanimajety in #16362
- [V1][Spec Decoding] Log accumulated metrics after system goes idle by @markmc in #17913
- fix broken test vllm:test_kernels - test_attention_selector.py::test_flash_attn by @tracelogfb in #17873
- Add missing content type headers to /ping and /health (#17036) by @edrevo in #17786
- Don't default construct `ModelConfig` when default constructing `VllmConfig` by @hmellor in #17943
- [Misc] remove --model from vllm serve usage by @reidliu41 in #17944
- [v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders by @heheda12345 in #17483
- [v1] Rename specialized_manager.py to single_type_kv_cache_manager.py by @heheda12345 in #17946
- [Kernel] fp4 marlin kernel by @jinzhen-lin in #17687
- [Bugfix] Add revision to `transformers.Auto*.from_pretrained` processors by @xinli-centml in #17948
- [Perf] Use small max_num_batched_tokens for A100 by @KuntaiDu in #17885
- fix amd triton mla path by @842974287 in #17871
- [Bugfix]: v1 engine - consider lora adapters in allowed_token_ids by @bbrowning in #17855
- [doc] update lora doc by @reidliu41 in #17936
- [Frontend] Add /classify endpoint by @frieda-huang in #17032
- [Misc] Add compressed-tensors NVFP4A16 emulation support by @dsikka in #17914
- [FP8][ROCm][Attention] Enable FP8 KV cache on ROCm for V1 by @gshtras in #17870
- [New Model]: nomic-embed-text-v2-moe by @noooop in #17785
- [Misc] not show --model in vllm serve --help by @reidliu41 in #16691
- [BugFix] [ROCm]: Bugfix and handle addition case of input for `rocm_aiter_rms_norm` by @tjtanaa in #17857
- [BUG] [ROCm] [MLA] Fix variable name bug due to change in variable name in PR #17483 by @tjtanaa in #17961
- [Model] Broadcast Ovis2 implementation to fit Ovis1.6 by @Isotr0py in #17861
- [misc] add instructions on how to install nvshmem/pplx/deepep by @youkaichao in #17964
- [Bugfix] validate grammar and throw 400 error instead of crashing the engine when xgrammar validation fails by @Jason-CKY in #17623
- [bugfix] fix the wrong parser by @reidliu41 in #17958
- [Bugfix] Fix pydantic.errors.PydanticUserError by @Potabk in #17962
- [Bugfix][TPU] Use np array when updating cache slot_mapping by @lsy323 in #17971
- [Fix] Benchmark `"EngineClient" has no attribute "model_config"` by @b8zhong in #17976
- [Feature] Support DeepSeekV3 Function Call by @Xu-Wenqing in #17784
- Correcting testcases in builkite job for IBM Power by @AaruniAggarwal in #17675
- [Misc] Improve modelscope import error by @jeejeelee in #17983
- Initialize the delta tool call fields explicitly by @maxdebayser in #17340
- [P/D] NIXL Integration by @robertgshaw2-redhat in #17751
- [Lora][Frontend]Add default local directory LoRA resolver plugin. by @jberkhahn in #16855
- Construct `KVTransferConfig` properly from Python instead of using JSON blobs without CLI by @hmellor in #17994
- [CI/Build] Fix TPU V1 Test mixed use of & and && across tests by @CAROLZXYZXY in #17968
- [Core] Use platform-agnostic device control for DP engine core by @jianzs in #17245
- Enabling "Weight Loading Multiple GPU Test - Large Models" by @Alexei-V-Ivanov-AMD in #18020
- [v1][KVCacheManager] Change prefix caching metric from counting blocks to counting tokens by @heheda12345 in #18003
- [Chore] Remove unused method by @robertgshaw2-redhat in #18024
- Enable standard language model for torhc nightly by @yangw-dev in #18004
- [CI] Make JSON output tests less likely to fail by @russellb in #17859
- [V1][Spec Decode] Eagle unit tests by @wwl2755 in #17350
- [Bugfix] Fix FBGEMM integration by @mgoin in #18002
- [Model] Support MiMo-7B inference with MTP by @bwshen-mi in #17433
- Update some more deprecated type hinting by @hmellor in #17998
- Use NVFP4 Marlin for CompressedTensorsW4A16Fp4 by @mgoin in #18000
- Remove noisy warnings from `SchedulerConfig` by @hmellor in #17995
- [ROCm] Skip tests for quantizations incompatible with ROCm by @hissu-hyvarinen in #17905
- Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support by @sighingnow in #11844
- [Misc] Slight spelling modification by @jeejeelee in #18039
- [ROCm]: Fix build from source failure with gcc14 and ROCm 6.3 by @arjunkathuria in #13779
- [Bugfix] Fixes for new marlin moe usage by @mgoin in #18017
- [Bugfix] Avoid repeatedly creating dummy data during engine startup by @DarkLight1337 in #17935
- [Feature][V1] Support `tool_choice: required` when using Xgrammar as the `StructuredOutputBackend`. by @chaunceyjiang in #17845
- cleanup invalid prints by @calvin0327 in #18050
- [BugFix] Fix 4-GPU RLHF tests by @njhill in #18007
- Fix Broken macro for cutlass moe by @drisspg in #18049
- [v1][KVCacheManager] Avoid full cache hit by controlling max_length by @heheda12345 in #17999
- [Bugfix][V1] Only get input embeddings w/ multi-modal models if first PP by @jinhuang12 in #17916
- [BugFix] Set default random seed to 0 for V1 by @WoosukKwon in #17929
- [Bugfix] Fix marlin moe fallback logic for llama4 by @mgoin in #18042
- [Benchmarks] Refactor run_structured_output_benchmarks.sh by @russellb in #17722
- Convert `.buildkite` to `ruff format` by @hmellor in #17656
- [Fix] check to make sure processor has chat templates by @aarnphm in #18047
- [doc] add download/list/delete HF model CLI usage by @reidliu41 in #17940
- Update deprecated type hinting in `model_executor/layers` by @hmellor in #18056
- Update deprecated type hinting in `vllm/profiler` by @hmellor in #18057
- Update deprecated type hinting in `vllm/transformers_utils` by @hmellor in #18058
- [CI] Set token permissions for reminder comment CI job by @russellb in #17728
- [CI] Add workflow permissions for helm CI job by @russellb in #17727
- [CI] Add token permissions for add-ready-label CI job by @russellb in #17730
- [CI] set token permissions for pre-commit CI job by @russellb in #17729
- [Bugfix] Fix entrypoints metrics tests by @DarkLight1337 in #18063
- Convert `benchmarks` to `ruff format` by @hmellor in #18068
- Give auto-merge label workflow permission to add labels to issues by @hmellor in #18078
- Update deprecated type hinting in `vllm/compilation` by @hmellor in #18072
- Update deprecated type hinting in `vllm/adapter_commons` by @hmellor in #18073
- [V1] DP scale-out (2/N): Decouple engine process management and comms by @njhill in #15977
- [Docs] Expand security doc with firewall info by @russellb in #18081
- [FEAT] [ROCm]: Add AITER Block-Scaled GEMM Feature by @vllmellm in #14968
- [v1][KVCacheManager] pass num_new_computed_tokens to kv cache manager by @heheda12345 in #18001
- [Fix] Support CUDAGraph capture for encoder-decoder on ROCm by @ProExpertProg in #18104
- [Hardware/NVIDIA/Modelopt] Fix modelopt forward method for v1 torch.compile by @pavanimajety in #18101
- [P/D] Add some more debug logs to `NixlConnector` by @njhill in #18102
- [Misc] Remove unused numpy tensor by @ywang96 in #18084
- [Bug]: Fix S3 model/tokenizer path resolution by @gilljon in #18083
- [core][distributed] add ep group and all2all interface by @youkaichao in #18077
- [Bugfix] Fix FP8 Marlin MoE and enable for compressed-tensors models by @mgoin in #18026
- [Bugfix][V1] Fix FlashInfer V1 backend using the wrong VllmConfig by @mgoin in #18086
- [FEAT] [ROCm] [V1]: Add AITER biased group topk for DeepSeekV3 by @vllmellm in #17955
- [AMD][torch.compile] Enable silu+fp8_quant fusion for rocm by @charlifu in #18082
- [BugFix][AMD] Compatible patch for AITER lib after 04/20 by @qli88 in #17912
- Fix broken example: examples/offline_inference/profiling at scheduler_config by @Ecthlion in #18117
- [Fix] Move "model_config" as keyword args in chat_utils.py by @lk-chen in #18098
- [Bugfix] fix moe marlin `topk_weight` loading by @jinzhen-lin in #18080
- [Bugfix][Example] make lmcache v0 work. by @majianpeng in #18051
- [New Model]: support GTE NewModel by @noooop in #17986
- [Bugfix] Fix entrypoints audio test failure by @DarkLight1337 in #18111
- [Model] Add packed_modules_mapping for Qwen3-MOE by @jeejeelee in #18118
- [Misc] replace does not exist model by @lengrongfu in #18119
- [Bugfix] Fix QKVCrossParallelLinear::sync_weight_attrs for PyTorch compile by @anko-intel in #17844
- [FEAT] [ROCm]: Add AITER CK 2 Stages MoE support by @tjtanaa in #17110
- [Bugfix] Fix LoRA test by @jeejeelee in #18123
- [Model] GritLM supports other attention backends by @DarkLight1337 in #18109
- [doc] add missing import by @reidliu41 in #18133
- Update deprecated type hinting in `vllm/lora` by @hmellor in #18128
- Update deprecated type hinting in `vllm/device_allocator` and `vllm/distributed` by @hmellor in #18126
- Update deprecated type hinting in `platform`, `plugins`, `triton_utils`, `vllm_flash_attn` by @hmellor in #18129
- [Bugfix] Fix chat utils tests by @DarkLight1337 in #18139
- [KVConnector] Keep KVTransferParams as a dict by @njhill in #18033
- [Doc] Update prefix cache metrics to counting tokens by @heheda12345 in #18138
- [V1][Spec Decode] Share input embedding of target model with EAGLE draft model to free ~1GB for llama 3 model by @ekagra-ranjan in #17326
- Modularize fused experts and integrate PPLX kernels by @bnellnm in #15956
- [CI] Disable Failing Tests by @robertgshaw2-redhat in #18165
- [Frontend] decrease import time of vllm.multimodal by @davidxia in #18031
- [Kernel] Have rotary embeddings support tensors by @LucasWilkinson in #18046
- [V1] Structured Outputs + Thinking compatibility by @aarnphm in #16577
- Add support for loading torchao models with `AOPerModuleConfig` by @jerryzh168 in #17826
- [CI] Fix race condition in test_kv_cache_events test by @russellb in #18169
- [V1] Support multiple kv connectors by @mgoin in #17564
- Upload vllm index for the rc builds by @atalman in #18173
- [Bugfix]: make most of `test_openai_schema.py` pass by @davidxia in #17664
- [v1] Support multiple KV cache groups in GPU model runner by @heheda12345 in #17945
- [V1][Metrics] Remove unused code by @markmc in #18158
- [Chore] astral's ty by @aarnphm in #18116
- [Misc] add lobe-chat support by @reidliu41 in #18177
- [Fix][ROCm] Enforce eager for all encoder-decoder models on ROCm by @ProExpertProg in #18154
- Update deprecated type hinting in `models` by @hmellor in #18132
- [Bugfix] Fix fp8 tests for triton_unified_attention for Triton 3.3 by @tdoublep in #18013
- Support custom implementations of VideoLoader backends. by @huachenheli in #18091
- [UT] Add ut for none hash by @andyxning in #17892
- [Model] Allow the use of sliding window in Qwen2 by @inkcherry in #17772
- [Bugfix] Fix FusedMoEPrepareAndFinalize for cuda-disalike backends by @MengqingCao in #18178
- [CI] don't skip fixed `test_kv_cache_events()` by @davidxia in #18183
- [V1] Update zmq socket creation in nixl connector by @russellb in #18148
- fix: typos by @omahs in #18151
- Update deprecated type hinting in `model_loader` by @hmellor in #18130
- add tools into TokenizeChatRequest by @hustxiayang in #18187
- [Kernel] [V1] Fix performance regression for triton unified attention by @tdoublep in #18161
- Adding "Basic Models Test" and "Multi-Modal Models Test (Extended) 3" in AMD Pipeline by @Alexei-V-Ivanov-AMD in #18106
- Improve examples rendering in docs and GitHub by @hmellor in #18203
- [Frontend] Fix chat template content format detection by @schoennenbeck in #18190
- [Bugfix]Change the exception thrown by call_hf_processor from RuntimeError to ValueError by @Abatom in #18181
- [Bugfix] [ROCm]: Remove assertion logic when using AITER fused moe in unquantizedMethod to reenable LLama4 BF16 by @tjtanaa in #18205
- [Misc] Avoid cuda graph log when sizes still match by @NickLucche in #18202
- Adding "AMD: Tensorizer Test" to amdproduction. by @Alexei-V-Ivanov-AMD in #18216
- [Bugfix] Fix test_eagle test by @luccafong in #18223
- [Build] Allow shipping PTX on a per-file basis by @LucasWilkinson in #18155
- [Bugfix] fix rotary embedding test for _get_padded_tensor_shape by @LucasWilkinson in #18229
- [Bugfix][ROCm] Use `chunked_prefill_paged_decode` as fallback for V1 attention on ROCm by @kliuae in #18093
- [Model] vLLM v1 supports Medusa by @skylee-01 in #17956
- Allow users to pass arbitrary JSON keys from CLI by @hmellor in #18208
- Throw better error for when running into k8s service discovery issue by @wseaton in #18209
- [Feature] Support Pipeline Parallism in torchrun SPMD offline inference for V1 by @luccafong in #17827
- [doc] fix multimodal example script by @davidxia in #18089
- [PERF] Speed up Qwen2.5-VL model by speed up rotary position embedding const… by @vadiklyutiy in #17973
- [Misc] Add Ray Prometheus logger to V1 by @eicherseiji in #17925
- [Misc] Consolidate Audio tests into multimodal common generation tests by @Isotr0py in #18214
- use ceil_div in cutlass block scaling shape check by @IwakuraRein in #17918
- [Fix] Fix typo in `resolve_hf_chat_template` by @fxmarty-amd in #18259
- [Model] Use autoweightloader for dbrx by @learner0810 in #18251
- [Misc][MacOS] fix bfloat16 error by @reidliu41 in #18249
- [BugFix] Fix multi async save in MultiConnector by @njhill in #18246
- [BugFix] Fix ordering of KVConnector finished send/rcv sets by @njhill in #18211
- [CI] Assign reviewer to mergify with changes to Tensorizer files by @sangstar in #18278
- [Sampler] Adapt to FlashInfer 0.2.3 sampler API by @abmfy in #15777
- [Bugfix] fix `an illegal memory access was encountered` of marlin kernel + act_order by @jinzhen-lin in #18245
- [Spec Decode] Don't fall back to V0 when spec decoding is enabled by @WoosukKwon in #18265
- [V1][P/D] Local attention optimization for NIXL by @mgoin in #18170
- Move cli args docs to its own page (#18228) by @strangiato in #18264
- [Misc] reformat the collect-env output by @reidliu41 in #18285
- [BugFix] Correct max_model_len derivation from config.json for Mistral format by @princepride in #17937
- [P/D][V1] Support dynamic loading of external KV connector implementations by @sdavidbd in #18142
- [Hardware][TPU] Optionally import for TPU backend by @lsy323 in #18269
- Update Dockerfile to build for Blackwell by @mgoin in #18095
- Fixed build on ppc64le due to openssl conflicts by @npanpaliya in #18262
- [Model] use AutoWeightsLoader for solar by @lengrongfu in #18113
- [MISC] fix typo by @andyxning in #18305
- Support sequence parallelism combined with pipeline parallelism by @cascade812 in #18243
- [doc] update reasoning doc by @reidliu41 in #18306
- [Model] Use sigmoid for single-label classification by @22quinn in #18313
- Fix copy-paste error in phi4mm image processing by @lifuhuang in #18315
- [Misc] add litellm integration by @reidliu41 in #18320
- [Doc] Add doc to explain the usage of Qwen3 thinking by @WangErXiao in #18291
- [Spec Decode][V0] Fix spec decode correctness test in V0 eagle/medusa by @wwl2755 in #18175
- Feature/vllm/input embedding completion api by @Nan2018 in #17590
- [Misc] extract parser.parse_args() by @reidliu41 in #18323
- [Build] Supports CUDA 12.6 and 11.8 after Blackwell Update by @simon-mo in #18316
- fix: Add type specifications for CLI arguments in tensorizer options by @googs1025 in #18314
- [BugFix] [Vul] Add missing `usedforsecurity=False` in MD5 hashing to enable FIPS by @shaoyuyoung in #18319
- [Doc] Fix prompt embedding examples by @Potabk in #18350
- [Doc] Move input-related docs to Features by @DarkLight1337 in #18353
- [BugFix] Fix handling of num_computed_tokens with connector by @njhill in #18232
- [Quantization] Pool model support bitsandbytes by @jeejeelee in #18087
- [Doc] Fix typo by @eladsegal in #18355
- [Frontend] add --quick option for vllm chat/complete by @reidliu41 in #18297
- [Feature]Add support for models quantized with AutoRound by @wenhuach21 in #17850
- Add fused MoE kernel tuning configs (fp8_w8a8) for DeepSeek V3/R1 on a single-node 8x NVIDIA H20 96GB setup by @sunyicode0012 in #18337
- [Misc] Fix typo by @Unprincess17 in #18330
- Neuron up mistral by @aws-satyajith in #18222
- fix CUDA_check redefinition in #17918 by @luccafong in #18287
- [neuron] fix authorization issue by @liangfu in #18364
- [Misc] Allow `AutoWeightsLoader` to skip loading weights with specific substr in name by @Isotr0py in #18358
- [Core] [Bugfix]: tensor parallel with prompt embeds by @Nan2018 in #18171
- [release] Change dockerhub username for TPU release by @khluu in #18389
- [Bugfix] fix adding bias twice in ipex GPTQ quantization by @rand-fly in #18363
- [doc] update env variable export by @reidliu41 in #18391
- [Misc] Add LoRA code owner by @jeejeelee in #18387
- Update cpu.txt by @princepride in #18398
- [CI] Add mteb testing to test the accuracy of the embedding model by @noooop in #17175
- [Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text by @wulipc in #18407
- [Misc] refactor prompt embedding examples by @reidliu41 in #18405
- [Minor] Rename quantization nvfp4 to modelopt_fp4 by @mgoin in #18356
- [Model] use AutoWeightsLoader for bloom by @calvin0327 in #18300
- [Kernel] update comment for KV shape in unified triton attn by @haochengxia in #18099
- fix:Build torch wheel inline rather than picking from nightly by @dilipgb in #18351
- [TPU] Re-enable the Pallas MoE kernel by @mgoin in #18025
- [Bugfix] config.head_dim is now explicitly set to None by @gshtras in #18432
- [Bug] Fix moe_sum signature by @bnellnm in #18440
- Revert "[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text (#18407)" by @DarkLight1337 in #18456
- [Bugfix][Failing Test] Fix nixl connector test when promt size < block size by @wwl2755 in #18429
- [Misc] MultiConnector._connectors type by @NickLucche in #18423
- [Frontend] deprecate `--device` arg by @kebe7jun in #18399
- [V1] Fix general plugins not loaded in engine for multiproc by @sarckk in #18326
- [Misc] refactor disaggregated-prefill-v1 example by @reidliu41 in #18474
- [Bugfix][Failing Test] Fix test_events.py by @rabi in #18460
- [MODEL] FalconH1 by @dhiaEddineRhaiem in #18406
- [Doc] fix arg docstring in linear layers by @giantcroc in #18410
- [Bugfix] Reduce moe_sum test size to avoid OOM by @bnellnm in #18484
- [Build] fix Dockerfile shell by @kebe7jun in #18402
- [Misc] Update deprecation message for `--enable-reasoning` by @Zerohertz in #18404
- [ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1 by @hyoon1 in #17004
- Revert "[v1] Support multiple KV cache groups in GPU model runner (#17945) by @markmc in #18459
- [FEAT][ROCm] Upgrade AITER MLA v1 backend by @vllmellm in #18338
- [Bugfix] Consistent ascii handling in tool parsers by @schoennenbeck in #17704
- [FalconH1] Fix output dtype in RMSNorm fallback path for Falcon-H1 (e.g. 0.5B) by @dhiaEddineRhaiem in #18500
- [MISC] update project urls in pyproject.toml by @andyxning in #18519
- [CI] Fix race condition with StatelessProcessGroup.barrier by @russellb in #18506
- Intialize io_thread_pool attribute in the beginning. by @rabi in #18331
- [Bugfix] Inconsistent token calculation compared to HF in llava family by @cyr0930 in #18479
- [BugFix][DP] Send DP wave completion only from `dp_rank==0` by @njhill in #18502
- [Bugfix][Model] Make Olmo2Model weight loading return loaded weights by @2015aroras in #18504
- [Bugfix] Fix LoRA test by @jeejeelee in #18518
- [Doc] Fix invalid JSON in example args by @DarkLight1337 in #18527
- [Neuron] Update Dockerfile.neuron to use latest neuron release (2.23) by @aws-satyajith in #18512
- Update default neuron config for speculation by @elaineyz in #18274
- Order sequence ids + config update to support specifying custom quantization layers by @elaineyz in #18279
- [Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text by @wulipc in #18526
- [Bugfix] Add kwargs to RequestOutput init to be forward compatible by @lk-chen in #18513
- [CI/Build] Update bamba test model location by @hmellor in #18544
- [Doc] Support --stream arg in openai_completion_client.py script by @googs1025 in #18388
- [Bugfix] Use random hidden states in dummy sampler run by @abmfy in #18543
- [Doc] Add stream flag for chat completion example by @calvin0327 in #18524
- [BugFix][CPU] Fix x86 SHM distributed module initialization by @bigPYJ1151 in #18536
- [Misc] improve Automatic Prefix Caching example by @reidliu41 in #18554
- [Misc] Call `ndarray.tobytes()` directly instead of `ndarray.data.tobytes()` by @lgeiger in #18347
- [Bugfix] make `test_openai_schema.py` pass by @davidxia in #18224
- [Platform] Move platform check to right place by @wangxiyuan in #18470
- [Compile][Platform] Make PiecewiseBackend pluggable and extendable by @MengqingCao in #18076
- [Build/CI] Fix CUDA 11.8 build by @tlrmchlsmth in #17679
- [Tool] Add NIXL installation script by @lk-chen in #18172
- [V1][Spec Decode][Bugfix] Load quantize weights for EAGLE by @ekagra-ranjan in #18290
- [Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser by @wukaixingxp in #17917
- [Frontend] [Core] Add Tensorizer support for V1, LoRA adapter serialization and deserialization by @sangstar in #17926
- [AMD] [P/D] Compute num gpus for ROCm correctly in run_accuracy_test.sh by @rasmith in #18568
- Re-submit: Fix: Proper RGBA -> RGB conversion for PIL images. by @huachenheli in #18569
- [V1][Spec Decoding] Use model_loader.get_model() to load models by @markmc in #18273
- Enable interleaved sliding window attention models for Transformers backend by @hmellor in #18494
- [Misc] refactor: simplify input validation and num_requests handling in _convert_v1_inputs by @googs1025 in #18482
- [BugFix] Increase TP execute_model timeout by @njhill in #18558
- [Bugfix] Set `KVTransferConfig.engine_id` in post_init by @lk-chen in #18576
- [Spec Decode] Make EAGLE3 draft token ID mapping optional by @benchislett in #18488
- [Neuron] Remove bypass on EAGLEConfig and add a test by @elaineyz in #18514
- [Bugfix][Benchmarks] Fix a benchmark of deepspeed-mii backend to use api_key by @tishizaki in #17291
- [Misc] Replace `cuda` hard code with `current_platform` by @shen-shanshan in #16983
- [Hardware] correct method signatures for HPU,ROCm,XPU by @andyxning in #18551
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal by @RonaldBXu in #18034
- [Feature]Add async tensor parallelism using compilation pass by @cascade812 in #17882
- [Doc] Update quickstart and install for cu128 using `--torch-backend=auto` by @mgoin in #18505
- [Feature][V1]: suupports cached_tokens in response usage by @chaunceyjiang in #18149
- [Bugfix] Add half type support in reshape_and_cache_cpu_impl on x86 cpu platform by @zzzyq in #18430
- Migrate docs from Sphinx to MkDocs by @hmellor in #18145
- Revert "[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (#18034)" by @DarkLight1337 in #18600
- [Bugfix][Model] Fix baichuan model loader for tp by @MengqingCao in #18597
- [V0][Bugfix] Fix parallel sampling performance regression when guided decoding is enabled by @shadeMe in #17731
- Add myself as docs code owner by @hmellor in #18605
- [Hardware][CPU] Update intel_extension_for_pytorch 2.7.0 and move to `requirements/cpu.txt` by @yankay in #18542
- [CI] fix kv_cache_type argument by @andyxning in #18594
- [Doc] Fix indent of contributing to vllm by @Zerohertz in #18611
- Replace `{func}` with mkdocs style links by @hmellor in #18610
- [CI/Build] Fix V1 flag being set in entrypoints tests by @DarkLight1337 in #18598
- Fix examples with code blocks in docs by @hmellor in #18609
- [Bugfix] Fix transformers model impl ignored for mixtral quant by @tristanleclercq in #18602
- Include private attributes in API documentation by @hmellor in #18614
- [Misc] add Haystack integration by @reidliu41 in #18601
- [Bugfix][Build/CI] Fixup CUDA compiler version check for CUDA_SUPPORTED_ARCHS by @simon-mo in #18579
- [Doc] Fix markdown list indentation for MkDocs rendering by @Zerohertz in #18620
- [Doc] Use a different color for the announcement by @DarkLight1337 in #18616
- Refactor pplx init logic to make it modular (prepare for deepep) by @youkaichao in #18200
- Fix figures in design doc by @hmellor in #18612
- [Docs] Change mkdocs to not use directory urls by @mgoin in #18622
- [v1] Redo "Support multiple KV cache groups in GPU model runner (#17945)" by @heheda12345 in #18593
- [Doc] fix list formatting by @davidxia in #18624
- [Doc] Fix top-level API links/docs by @DarkLight1337 in #18621
- [Doc] Avoid documenting dynamic / internal modules by @DarkLight1337 in #18626
- [Doc] Fix broken links and unlinked docs, add shortcuts to home sidebar by @DarkLight1337 in #18627
- [V1] Support Deepseek MTP by @YaoJiayi in #18435
- Use prebuilt FlashInfer x86_64 PyTorch 2.7 CUDA 12.8 wheel for CI by @huydhn in #18537
- [CI] Enable test_initialization to run on V1 by @mgoin in #16736
- [Doc] Update references to doc files by @DarkLight1337 in #18637
- [ModelOpt] Introduce VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE env var to control blockscale tensor allocation by @pavanimajety in #18160
- [Bugfix] Migrate to REGEX Library to prevent catastrophic backtracking by @Crucifixion-Fxl in #18454
- [Bugfix][Nixl] Fix Preemption Bug by @robertgshaw2-redhat in #18631
- config.py: Clarify that only local GGUF checkpoints are supported. by @MathieuBordere in #18623
- FIX MOE issue in AutoRound format by @wenhuach21 in #18586
- [V1][Spec Decode] Small refactors to improve eagle bookkeeping performance by @zixi-qi in #18424
- [Frontend] improve vllm serve --help display by @reidliu41 in #18643
- [Model] Add support for Qwen2.5-Omni-7B-AWQ (Qwen2_5OmniForConditionalGeneration) by @Nalkey in #18647
- [V1][Spec Decode] Support multi-layer eagle draft model by @zixi-qi in #18030
- [Doc] Update README links, mark external links by @DarkLight1337 in #18635
- [MISC][pre-commit] Add pre-commit check for triton import by @MengqingCao in #17716
- [Doc] Fix indentation problems in V0 Paged Attention docs by @DarkLight1337 in #18659
- [Doc] Add community links by @DarkLight1337 in #18657
- [Model] use AutoWeightsLoader for gpt2 by @ztang2370 in #18625
- [Doc] Reorganize user guide by @DarkLight1337 in #18661
- [CI/Build] `chmod +x` to `cleanup_pr_body.sh` by @DarkLight1337 in #18650
- [MISC] typo fix and clean import by @andyxning in #18664
- [BugFix] Fix import error for fused_moe by @wangxiyuan in #18642
- [CI] enforce import regex instead of re by @aarnphm in #18665
- fix(regression): clone from reference items by @aarnphm in #18662
- [CI/Build] fix permission denied issue by @reidliu41 in #18645
- [BugFix][Spec Decode] Improve Prefix Caching Logic in Speculative Decoding by @WoosukKwon in #18668
- [V1] Fix _pickle.PicklingError: Can't pickle <class 'transformers_modules.deepseek-ai.DeepSeek-V2-Lite... by @eicherseiji in #18640
- [MISC] correct signature for LoaderFunction by @andyxning in #18670
- [Misc] Replace `cuda` hard code with `current_platform` in Ray by @noemotiovon in #14668
- [Misc][ModelScope] Change to use runtime VLLM_USE_MODELSCOPE by @MengqingCao in #18655
- [VLM] Initialize video input support for InternVL models by @Isotr0py in #18499
- Speed up the `kernels/quantization/` tests by @mgoin in #18669
- [BUGFIX] catch subclass first for try...except by @andyxning in #18672
- [Misc] Reduce logs on startup by @DarkLight1337 in #18649
- [doc] fix broken links by @reidliu41 in #18671
- [doc] improve readability by @reidliu41 in #18675
- [Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment by @zzzyq in #18674
- [CI/build] fix no regex by @reidliu41 in #18676
- [Misc] small improve by @reidliu41 in #18680
- [Bugfix] Fix profiling dummy data for Pixtral by @DarkLight1337 in #18677
- [Core][Multimodal] Convert PIL Image to array without data copy when hashing by @lgeiger in #18682
- [CI/Build][Doc] Update `gte-Qwen2-1.5B-instruct` usage by @DarkLight1337 in #18683
- [Misc] Fixed the abnormally high TTFT issue in the PD disaggregation example by @zhaohaidao in #18644
- refactor: simplify request handler, use positive condition check for handler assignment by @googs1025 in #18690
- [Bugfix] Fix the lm_head in gpt_bigcode in lora mode by @maxdebayser in #6357
- [CI] add missing argument by @andyxning in #18694
- [GH] Add issue template for reporting CI failures by @DarkLight1337 in #18696
- [Doc] Fix issue template format by @DarkLight1337 in #18699
- [Bugfix] Fix Mistral-format models with sliding window by @DarkLight1337 in #18693
- [CI/Build] Replace `math.isclose` with `pytest.approx` by @DarkLight1337 in #18703
- [CI] fix dump_input for str type by @andyxning in #18697
- [Model] Add support for YARN in NemotronNAS models by @Naveassaf in #18427
- [CI/Build] Split pooling and generation extended language models tests in CI by @Isotr0py in #18705
- [Hardware][Intel-Gaudi] [CI/Build] Add tensor parallel size = 2 test to HPU CI by @ldurejko in #18709
- [Misc] add AutoGen integration by @reidliu41 in #18712
- [Bugfix]: handle hf-xet CAS error when loading Qwen3 weights in vLLM by @YanWuHao in #18701
- [Doc] Improve API docs by @DarkLight1337 in #18713
- [Doc] Move examples and further reorganize user guide by @DarkLight1337 in #18666
- [Bugfix] Fix Llama GGUF initialization by @DarkLight1337 in #18717
- [V1][Sampler] Improve performance of FlashInfer sampling by sampling logits instead of probs by @lgeiger in #18608
- Convert `examples` to `ruff-format` by @hmellor in #18400
- [Model][Gemma3] Simplify image input validation by @lgeiger in #18710
- [Misc] improve web section group title display by @reidliu41 in #18684
- [V1][Quantization] Add CUDA graph compatible v1 GGUF support by @Isotr0py in #18646
- [Model][Gemma3] Cast image pixel values already on CPU by @lgeiger in #18732
- [FEAT] [ROCm] Upgrade AITER Fused MoE kernels. by @vllmellm in #18271
- [Doc] Update OOT model docs by @DarkLight1337 in #18742
- [Doc] Update reproducibility doc and example by @DarkLight1337 in #18741
- [Misc] improve docs by @reidliu41 in #18734
- feat(rocm-support): support mamba2 on rocm by @almersawi in #18565
- [Hardware][Intel-Gaudi] [CI/Build] Fix multiple containers using the same name in run-hpu-test.sh by @ldurejko in #18752
- [Doc] cleanup deprecated flag for doc by @calvin0327 in #18715
- Minor fix about MooncakeStoreConnector by @maobaolong in #18721
- [Build] fix cpu build missing libtbbmalloc.so by @kebe7jun in #18744
- [BUG FIX] minicpm by @huangyuxiang03 in #18739
- [Doc] Convert Sphinx directives (`{class}`, `{meth}`, `{attr}`, ...) to MkDocs format for better documentation linking by @Zerohertz in #18663
- [CI/Build] Remove imports of built-in `re` by @DarkLight1337 in #18750
- [V1][Metrics] Add API for accessing in-memory Prometheus metrics by @markmc in #17010
- Disable prefix cache by default for benchmark by @cascade812 in #18639
- optimize get_kv_cache_torch_dtype by @chunxiaozheng in #18531
- [Core] Automatically cast multi-modal input dtype by @DarkLight1337 in #18756
- [Bugfix] Mistral tool calling when content is list by @mgoin in #18729
New Contributors
- @r-barnes made their first contribution in #17316
- @qscqesze made their first contribution in #16328
- @ponix-j made their first contribution in #17100
- @Zerohertz made their first contribution in #17342
- @a2q1p made their first contribution in #17387
- @mofanke made their first contribution in #17369
- @mayuyuace made their first contribution in #17364
- @casinca made their first contribution in #17400
- @mlinmg made their first contribution in #15826
- @alec-flowers made their first contribution in #16750
- @psav made their first contribution in #17457
- @nlzy made their first contribution in #17411
- @noyoshi made their first contribution in #15506
- @tishizaki made their first contribution in #17285
- @sstamenk made their first contribution in #17550
- @qthequartermasterman made their first contribution in #15428
- @CalebDu made their first contribution in #14568
- @Edwardf0t1 made their first contribution in #17561
- @xw285cornell made their first contribution in #16263
- @ehartford made their first contribution in #17605
- @thomasjpfan made their first contribution in #17616
- @s3woz made their first contribution in #17497
- @Xarbirus made their first contribution in #17408
- @bythew3i made their first contribution in #16458
- @dtransposed made their first contribution in #16839
- @BowenBao made their first contribution in #16943
- @vmarkovtsev made their first contribution in #17800
- @amd-hhashemi made their first contribution in #17071
- @qionghuang6 made their first contribution in #17753
- @RIckYuan999 made their first contribution in #17644
- @fxmarty-amd made their first contribution in #12612
- @inkcherry made their first contribution in #17853
- @tracelogfb made their first contribution in #17873
- @edrevo made their first contribution in #17786
- @xinli-centml made their first contribution in #17948
- @bbrowning made their first contribution in #17855
- @frieda-huang made their first contribution in #17032
- @Xu-Wenqing made their first contribution in #17784
- @bwshen-mi made their first contribution in #17433
- @arjunkathuria made their first contribution in #13779
- @calvin0327 made their first contribution in #18050
- @jinhuang12 made their first contribution in #17916
- @gilljon made their first contribution in #18083
- @Ecthlion made their first contribution in #18117
- @majianpeng made their first contribution in #18051
- @anko-intel made their first contribution in #17844
- @huachenheli made their first contribution in #18091
- @omahs made their first contribution in #18151
- @hustxiayang made their first contribution in #18187
- @eicherseiji made their first contribution in #17925
- @IwakuraRein made their first contribution in #17918
- @learner0810 made their first contribution in #18251
- @strangiato made their first contribution in #18264
- @princepride made their first contribution in #17937
- @sdavidbd made their first contribution in #18142
- @Nan2018 made their first contribution in #17590
- @googs1025 made their first contribution in #18314
- @shaoyuyoung made their first contribution in #18319
- @eladsegal made their first contribution in #18355
- @wenhuach21 made their first contribution in #17850
- @sunyicode0012 made their first contribution in #18337
- @Unprincess17 made their first contribution in #18330
- @rand-fly made their first contribution in #18363
- @rabi made their first contribution in #18460
- @giantcroc made their first contribution in #18410
- @hyoon1 made their first contribution in #17004
- @cyr0930 made their first contribution in #18479
- @elaineyz made their first contribution in #18274
- @lgeiger made their first contribution in #18347
- @RonaldBXu made their first contribution in #18034
- @zzzyq made their first contribution in #18430
- @shadeMe made their first contribution in #17731
- @Crucifixion-Fxl made their first contribution in #18454
- @MathieuBordere made their first contribution in #18623
- @Nalkey made their first contribution in #18647
- @ztang2370 made their first contribution in #18625
- @zhaohaidao made their first contribution in #18644
- @ldurejko made their first contribution in #18709
- @YanWuHao made their first contribution in #18701
- @almersawi made their first contribution in #18565
- @huangyuxiang03 made their first contribution in #18739
- @chunxiaozheng made their first contribution in #18531
Full Changelog: v0.8.5.post1...v0.9.0