vLLM v0.17.0
Known Issue: If you are on CUDA 12.9+ and encounter a `CUBLAS_STATUS_INVALID_VALUE` error, this is caused by a CUDA library mismatch. To resolve it, try one of the following:
- Remove the path to system CUDA shared library files (e.g. `/usr/local/cuda`) from `LD_LIBRARY_PATH`, or simply `unset LD_LIBRARY_PATH`.
- Install vLLM with `uv pip install vllm --torch-backend=auto`.
- Install vLLM with `pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129` (change the CUDA version to match your system).
Highlights
This release features 699 commits from 272 contributors (48 new)!
- PyTorch 2.10 Upgrade: This release upgrades to PyTorch 2.10.0, a breaking change that requires updating environment dependencies.
- FlashAttention 4 Integration: vLLM now supports the FlashAttention 4 backend (#32974), bringing next-generation attention performance.
- Model Runner V2 Maturation: Model Runner V2 has reached a major milestone with Pipeline Parallel (#33960), Decode Context Parallel (#34179), Eagle3 speculative decoding with CUDA graphs (#35029, #35040), pooling model support (#35120), piecewise & mixed CUDA graph capture (#32771), DP+EP for spec decoding (#35294), and a new ModelState architecture. Design docs are now available (#35819).
- Qwen3.5 Model Family: Full support for the Qwen3.5 model family (#34110) featuring GDN (Gated Delta Networks), with FP8 quantization, MTP speculative decoding, and reasoning parser support.
- New `--performance-mode` Flag: A new `--performance-mode {balanced, interactivity, throughput}` flag (#34936) simplifies performance tuning for common deployment scenarios.
- Anthropic API Compatibility: Added support for Anthropic thinking blocks (#33671), the `count_tokens` API (#35588), `tool_choice=none` (#35835), and streaming/image handling fixes.
- Weight Offloading V2 with Prefetching: The weight offloader now hides onloading latency via prefetching (#29941), plus selective CPU weight offloading (#34535) and CPU offloading without pinned memory doubling (#32993).
- Elastic Expert Parallelism Milestone 2: Initial support for elastic expert parallelism enabling dynamic GPU scaling for MoE models (#34861).
- Quantized LoRA Adapters: Users can now load quantized LoRA adapters (e.g. QLoRA) directly (#30286); a usage sketch follows this list.
- Transformers v5 Compatibility: Extensive work to ensure compatibility with HuggingFace Transformers v5 across models and utilities.
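Below is a minimal sketch of the quantized-LoRA flow via the offline `LLM` API, assuming a QLoRA-style adapter saved on disk; the base model name and adapter path are placeholders, not real artifacts.

```python
# Minimal sketch: serving a quantized (QLoRA-style) LoRA adapter (#30286).
# The base model and adapter path below are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    enable_lora=True,
)

outputs = llm.generate(
    ["Summarize this release in one sentence."],
    SamplingParams(max_tokens=64),
    # lora_path points at a quantized adapter directory (placeholder).
    lora_request=LoRARequest("qlora-adapter", 1, "/path/to/qlora-adapter"),
)
print(outputs[0].outputs[0].text)
```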
Model Support
- New architectures: Qwen3.5 (#34110), COLQwen3 (#34398), ColModernVBERT (#34558), Ring 2.5 (#35102), skt/A.X-K1 (#32407), Ovis 2.6 (#34426), nvidia/llama-nemotron-embed-vl-1b-v2 (#35297), nvidia/llama-nemotron-rerank-vl-1b-v2 (#35735), nvidia/nemotron-colembed (#34574).
- ASR models: FunASR (#33247), FireRedASR2 (#35727), Qwen3-ASR realtime streaming (#34613).
- Multimodal: OpenPangu-VL video input (#34134), audio chunking for offline LLM (#34628), Parakeet audio encoder for nemotron-nano-vl (#35100), MiniCPM-o flagos (#34126).
- LoRA: LFM2 (#34921), Llama 4 Vision tower/connector (#35147), max vocab size increased to 258048 (#34773), quantized LoRA adapters (#30286).
- Task expansion: ColBERT extended to non-standard BERT backbones (#34170), multimodal scoring for late-interaction models (#34574).
- Performance: Qwen3.5 GDN projector fusion (#34697), FlashInfer cuDNN backend for Qwen3 VL ViT (#34580), Step3.5-Flash NVFP4 (#34478), Qwen3MoE tuned configs for H200 (#35457).
- Fixes: DeepSeek-VL V2 simplified loading (#35203), Qwen3/Qwen3.5 reasoning parser (#34779), Qwen2.5-Omni/Qwen3-Omni mixed-modality (#35368), Ernie4.5-VL garbled output (#35587), Qwen-VL tokenizer (#36140), Qwen-Omni audio cache (#35994), Nemotron-3-Nano NVFP4 accuracy with TP>1 (#34476).
Engine Core
- Model Runner V2: Pipeline Parallel (#33960), Decode Context Parallel (#34179), piecewise & mixed CUDA graphs (#32771), Eagle3 with CUDA graphs (#35029, #35040), pooling models (#35120), DP+EP for spec decoding (#35294), bad_words sampling (#33433), ModelState architecture (#35350, #35383, #35564, #35621, #35774), design docs (#35819).
- Weight offloading: V2 prefetching to hide latency (#29941), selective CPU weight offloading (#34535), CPU offloading without pinned memory doubling (#32993); a configuration sketch follows this list.
- Sleep level 0 mode with enqueue/wait pattern (#33195), pause/resume moved into engine (#34125).
- Fixes: allreduce_rms_fusion disabled by default with PP > 1 (#35424), DCP + FA3 crash (#35082), prefix caching for Mamba "all" mode (#34874), num_active_loras fix (#34119), async TP reduce-scatter reduction fix (#33088).
- Repetitive token pattern detection flags (#35451).
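For the weight-offloading items above, a minimal configuration sketch; it assumes the long-standing `cpu_offload_gb` engine argument is the entry point and that the V2 offloader's prefetching (#29941) needs no extra configuration.

```python
# Minimal sketch: CPU weight offloading via the cpu_offload_gb engine
# argument. Whether the V2 offloader with prefetching (#29941) is picked
# up automatically here is an assumption; the model name is a placeholder.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    cpu_offload_gb=4,  # keep up to ~4 GiB of weights in CPU memory
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```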
Kernel
- FlashAttention 4 integration (#32974).
- FlashInfer Sparse MLA backend (#33451).
- Triton-based top-k and top-p sampler kernels (#33538).
- Faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680).
- Optimized grouped topk kernel (#34206).
- TRTLLM DSV3 Router GEMM kernel, 6% batch-1 speedup (#34302).
- FA3 swizzle optimization (#34043).
- 256-bit LDG/STG activation kernels (#33022).
- TMA support for fused_moe_lora kernel (#32195).
- Helion kernel framework: silu_mul_fp8 kernel (#33373), autotuning infrastructure (#34025), num_tokens autotuning (#34185), fx tracing via HOP (#34390), GPU variant canonicalization (#34928).
- FlashInfer TRTLLM fused MoE non-gated FP8 & NVFP4 (#33506).
- Optimized sample_recovered_tokens kernel (#34974).
- KV cache update ops extraction from FlashInfer forward (#35422) and MLA backends (#34627).
Hardware & Performance
- NVIDIA: SM100 FMHA FP8 prefill for MLA (#31195), SM100 MXFP8 blockscaled grouped MM and quant kernels (#34448), SM100 Oink RMSNorm path (#31828), SM120 FP8 GEMM optimization (#34424), FlashInfer DeepGEMM swapAB on SM90 by default (#34924), DeepSeek R1 BF16 min-latency QKV GEMM with 0.5% E2E speedup (#34758), cuBLAS BF16 gate with FP32 output (#35121), FlashInfer AllReduce defaulting to the TRTLLM backend (#35793).
- AMD ROCm: AITER fused RoPE+KVCache (#33443), MXFP4 MoE weight pre-shuffling on gfx950 (#34192), bitsandbytes quantization (#34688), CK backend for MoE quantization (#34301), dynamic MXFP4 for DeepSeek V2 (#34157), GPT-OSS Quark format (#29008), GPT-OSS WMXFP4_AFP8 static scales (#30357), encoder/encoder-decoder on AITER (#35334), device capability derivation without CUDA init (#35069), `aiter` package renamed to `amd-aiter` (#35198).
- Intel XPU: CUDA graph support (#34482), GPUDirect RDMA via NIXL (#35270), TORCH_SDPA/TRITON_ATTN as ViT backend (#35010), vllm-xpu-kernels v0.1.3 (#35984).
- CPU: ARM BF16 cross-compilation (#33079), FP16 for s390x (#34116), KleidiAI INT8_W4A8 for all input dtypes (#34890), s390x vector intrinsics for attention (#34434), prefix caching for ppc64le (#35081), CPU release supports both AVX2 and AVX512 (#35466).
- Performance: Pipeline Parallel async send/recv 2.9% E2E throughput (#33368), pooling maxsim 13.9% throughput improvement (#35330), Triton ViT attention backend (#32183), Mamba1 kernel-level chunk alignment for prefix caching (#34798), detokenizer optimization (#32975), pooling model copy optimization 1.8% throughput (#35127).
Large Scale Serving
- Pipeline Parallel async send/recv, 2.9% throughput improvement (#33368).
- Elastic EP Milestone 2 (#34861).
- EPLB: Async rebalance algorithm (#30888), sync enforcement for NCCL backend (#35212).
- Native weight syncing API via IPC for RL workflows (#34171).
- Decode Context Parallel in Model Runner V2 (#34179).
- Ray env var propagation to workers (#34383).
- Breaking: KV load failure policy default changed from "recompute" to "fail" (#34896).
- Cross-node data parallelism message queue fix (#35429).
- NIXL: Token-based IPC API (#34175), version bound (#35495), NUMA core binding (#32365).
Speculative Decoding
- Nemotron-H MTP and Mamba speculative decoding (#33726).
- Eagle3 on Model Runner V2 with CUDA graphs (#35029, #35040), Eagle3 + disaggregated serving (#34529); a configuration sketch follows this list.
- Hidden states extraction system (#33736).
- `min_tokens` support with speculative decoding (#32642).
- Reduced TP communication for draft generation (#34049).
- MTP num_speculative_tokens > 1 with sparse MLA (#34552).
- Sparse MLA + MTP with full CUDA graphs (#34457).
- Spec decoding in Mamba cache align mode (#33705).
- DP+EP for spec decoding in Model Runner V2 (#35294).
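A minimal configuration sketch for the Eagle3 item above, assuming the dict-style `speculative_config` accepted by recent vLLM releases; the method string, draft-model path, and token count are illustrative assumptions.

```python
# Minimal sketch: Eagle3 speculative decoding via speculative_config.
# The target model, draft-model path, and "eagle3" method string are
# assumptions based on recent vLLM conventions, not verified values.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder target model
    speculative_config={
        "method": "eagle3",                # assumed method name
        "model": "/path/to/eagle3-draft",  # placeholder draft model
        "num_speculative_tokens": 3,       # illustrative value
    },
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```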
MoE Refactor
- MoERunner abstraction (#32344) with modular kernel architecture.
- MXFP4 Cutlass Experts to modular kernel (#34542), MXFP4 Marlin to modular kernel format (#34588), TRTLLM Kernels MK (#32564).
- MoEActivation enum (#33843).
- Improved default Triton fused MoE configs (#34846).
- Fused MoE + LoRA shared expert dual stream, 1.07x throughput (#34933).
- DSV3 QKVAProj GEMM custom op for torch.compile (#35751).
- Fix routing for models without expert groups (MiniMax-M2.1) (#34673).
torch.compile
- AOT compile with PyTorch 2.10 (#34155).
- AR+RMSNorm fusion by default at -O2 (#34299).
- SiLU+FP4 quant fusion by default at O1+ (#34718).
- Sequence parallelism threshold compile ranges (#28672).
- Various compile fixes: recursive pre_grad_passes (#34092), FakeTensorProp elimination (#34093), time discrepancy logging (#34912), artifact load errors (#35115), atomic artifact saving (#35117), pytree slice caching (#35308), fast_moe_cold_start undo for torch>=2.11 (#35475).
Quantization
- Quantized LoRA adapters (#30286).
- Per-head KV cache scales in attention selector (#34281).
- FP8 MoE bias for GPT-OSS (#34906).
- SM100 MXFP8 blockscaled grouped MM and quant kernels (#34448).
- Mixed precision support for ModelOpt (#35047).
- Llama-4 attention quantization (int8, fp8) (#34243).
- Sparse24 compressed tensors fix (#33446).
- KV scale loading fix for MLA models (#35430).
- Compressed tensors as ground-truth for quant strategies (#34254).
- AMD: CK backend for MoE (#34301), dynamic MXFP4 for DeepSeek V2 (#34157), bitsandbytes on ROCm (#34688), GPT-OSS Quark format (#29008).
- CPU: KleidiAI INT8_W4A8 for all input dtypes (#34890).
- Qwen3.5: FP8 weight loading fix (#35289), mlp.gate not quantizable (#35156).
- int4_w4a16 fused_moe benchmark and tuning (#34130).
- FlashInfer integrate mm_mxfp8 in ModelOpt MXFP8 (#35053).
API & Frontend
- Anthropic API: Thinking blocks (#33671), `count_tokens` (#35588), `tool_choice=none` (#35835), tool call streaming fix (#34887), base64 image handling (#35557); a usage sketch follows this list.
- Responses API: Structured outputs (#33709), reasoning_tokens fix (#33513), reasoning_part streaming events (#35184).
- UX: `--performance-mode {balanced, interactivity, throughput}` (#34936), `--moe-backend` for explicit kernel selection (#33807), `--language-model-only` for hybrid models (#34120), `--enforce-eager` clarification (#34523).
- Whisper automatic language detection (#34342).
- MFU Prometheus counters (#30950).
- Unrecognized environment variable warnings (#33581).
- `generation_config` max_tokens treated as a default rather than a ceiling (#34063).
- Structured output bugfix for completions (#35237).
- Structured output JSON feature validation (#33233).
- Validate non-text content in system messages (#34072).
- Explicit validation error for tool calls (#34438).
- IO Processor plugin simplification (#34236).
- Sparse embedding IO process plugin (#34214).
- Pooling entrypoint improvements (#35604).
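A minimal sketch of the Anthropic `count_tokens` support above, using the official Anthropic Python SDK pointed at a vLLM server; the base URL, API key, and served-model name are placeholders, and that vLLM exposes the exact route the SDK calls is an assumption.

```python
# Minimal sketch: counting tokens through vLLM's Anthropic-compatible
# API (#35588) with the official anthropic SDK. The base_url, api_key,
# and model name are placeholders; that vLLM serves the exact route the
# SDK calls is an assumption.
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:8000",  # vLLM server address (placeholder)
    api_key="EMPTY",                   # placeholder key
)

count = client.messages.count_tokens(
    model="meta-llama/Llama-3.1-8B-Instruct",  # served model (placeholder)
    messages=[{"role": "user", "content": "How many tokens is this?"}],
)
print(count.input_tokens)
```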
Security
- Fix SSRF bypass via backslash-@ URL parsing inconsistency (#34743).
Dependencies
- PyTorch 2.10.0 upgrade — breaking change requiring environment updates. ROCm torch also updated to official 2.10 release (#34387).
- OpenTelemetry libraries included by default (#34466).
- NIXL version upper bound added (#35495).
- mooncake-transfer-engine added to kv_connectors requirements (#34826).
- openai pinned to versions below 2.25.0.
- lm-eval bumped for Transformers v5 compatibility (#33994).
- mamba-ssm bumped for Transformers v5 (#34233).
- PyPI source distribution (sdist) now included (#35136).
- amd-quark package added for ROCm (#35658).
V0 Deprecation
- Removed per-request logits processors (#34400).
- Removed unused MM placeholders in request output (#34944).
- Removed Swin model (#35821).
- Scheduled v0.17 deprecations applied (#35441).
Transformers v5 Compatibility
- Model fixes: Qwen3VL (#34262), JAIS (#34264), MiniCPM-V, GLM-ASR, Qwen3.5.
- Xet high-performance mode (#35098).
- Custom processor import fixes (#35101, #35107).
- padding_index removal for compatibility (#35189).
- lm-eval (#33994) and mamba-ssm (#34233) version bumps.
New Contributors 🎉
- @2ez4bz made their first contribution in #33607
- @Alibaba-HZY made their first contribution in #35289
- @aykoppol made their first contribution in #35451
- @bhoomit made their first contribution in #34773
- @charlesashby made their first contribution in #34169
- @chengyinie made their first contribution in #35457
- @EdalatiAli made their first contribution in #34448
- @ehfd made their first contribution in #33992
- @flutist made their first contribution in #35838
- @fort726 made their first contribution in #32407
- @fynnsu made their first contribution in #33736
- @gante made their first contribution in #35281
- @hallerite made their first contribution in #35834
- @hujia177 made their first contribution in #34982
- @itayalroy made their first contribution in #34861
- @jasonozuzu-cohere made their first contribution in #34715
- @jcaip made their first contribution in #35327
- @jhaotingc made their first contribution in #34933
- @jjmiao1 made their first contribution in #35994
- @jonoillar made their first contribution in #34513
- @koush made their first contribution in #33646
- @lailoo made their first contribution in #35616
- @Laurawly made their first contribution in #31828
- @Li-Yongwen made their first contribution in #34336
- @lichuang made their first contribution in #34679
- @lin-shh made their first contribution in #35645
- @majian4work made their first contribution in #35466
- @ojhaanshika made their first contribution in #34986
- @PatrykWo made their first contribution in #35307
- @pi314ever made their first contribution in #35434
- @pkousha made their first contribution in #33839
- @pks made their first contribution in #35237
- @qianlihuang made their first contribution in #32642
- @simonreginis made their first contribution in #31025
- @stakeswky made their first contribution in #35230
- @SteadfastAsArt made their first contribution in #34888
- @stingoChen made their first contribution in #35352
- @sychen52 made their first contribution in #35047
- @thepushkarp made their first contribution in #32114
- @Tib-Gridello made their first contribution in #35423
- @umut-polat made their first contribution in #35510
- @voipmonitor made their first contribution in #35615
- @wangxingran222 made their first contribution in #33088
- @wenshuai-xiaomi made their first contribution in #34424
- @wjabbour made their first contribution in #35672
- @yashwantbezawada made their first contribution in #31057
- @yoonsnowdev made their first contribution in #35382
- @ZhongsJie made their first contribution in #35835