vLLM v0.17.0

Known Issue: If you are on CUDA 12.9+ and encounter a CUBLAS_STATUS_INVALID_VALUE error, the cause is a CUDA library version mismatch. To resolve it, try one of the following:

  1. Remove the path to system CUDA shared library files (e.g. /usr/local/cuda) from LD_LIBRARY_PATH, or simply unset LD_LIBRARY_PATH.
  2. Install vLLM with uv pip install vllm --torch-backend=auto.
  3. Install vLLM with pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129 (change the CUDA version to match your system).
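The first remedy can be checked programmatically. This is a small diagnostic sketch (not part of vLLM) that looks for system CUDA directories on LD_LIBRARY_PATH, the usual trigger for the mismatch:

```python
# Hedged diagnostic sketch: list LD_LIBRARY_PATH entries that look like system
# CUDA installs (e.g. /usr/local/cuda). If any appear, removing them, or
# unsetting LD_LIBRARY_PATH entirely, is remedy 1 above.
import os

ld_path = os.environ.get("LD_LIBRARY_PATH", "")
suspect = [p for p in ld_path.split(":") if p and "cuda" in p.lower()]
if suspect:
    print("System CUDA dirs on LD_LIBRARY_PATH (consider removing):", suspect)
else:
    print("LD_LIBRARY_PATH has no obvious system CUDA entries.")
```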

Highlights

This release features 699 commits from 272 contributors (48 new)!

  • PyTorch 2.10 Upgrade: This release upgrades to PyTorch 2.10.0, a breaking change: existing environments will need their dependencies updated.
  • FlashAttention 4 Integration: vLLM now supports the FlashAttention 4 backend (#32974), bringing next-generation attention performance.
  • Model Runner V2 Maturation: Model Runner V2 has reached a major milestone with Pipeline Parallel (#33960), Decode Context Parallel (#34179), Eagle3 speculative decoding with CUDA graphs (#35029, #35040), pooling model support (#35120), piecewise & mixed CUDA graph capture (#32771), DP+EP for spec decoding (#35294), and a new ModelState architecture. Design docs are now available (#35819).
  • Qwen3.5 Model Family: Full support for the Qwen3.5 model family (#34110) featuring GDN (Gated Delta Networks), with FP8 quantization, MTP speculative decoding, and reasoning parser support.
  • New --performance-mode Flag: A new --performance-mode {balanced, interactivity, throughput} flag (#34936) simplifies performance tuning for common deployment scenarios.
  • Anthropic API Compatibility: Added support for Anthropic thinking blocks (#33671), count_tokens API (#35588), tool_choice=none (#35835), and streaming/image handling fixes.
  • Weight Offloading V2 with Prefetching: The weight offloader now hides onloading latency via prefetching (#29941), plus selective CPU weight offloading (#34535) and CPU offloading without pinned memory doubling (#32993).
  • Elastic Expert Parallelism Milestone 2: Initial support for elastic expert parallelism enabling dynamic GPU scaling for MoE models (#34861).
  • Quantized LoRA Adapters: Users can now load quantized LoRA adapters (e.g. QLoRA) directly (#30286).
  • Transformers v5 Compatibility: Extensive work to ensure compatibility with HuggingFace Transformers v5 across models and utilities.

Model Support

  • New architectures: Qwen3.5 (#34110), COLQwen3 (#34398), ColModernVBERT (#34558), Ring 2.5 (#35102), skt/A.X-K1 (#32407), Ovis 2.6 (#34426), nvidia/llama-nemotron-embed-vl-1b-v2 (#35297), nvidia/llama-nemotron-rerank-vl-1b-v2 (#35735), nvidia/nemotron-colembed (#34574).
  • ASR models: FunASR (#33247), FireRedASR2 (#35727), Qwen3-ASR realtime streaming (#34613).
  • Multimodal: OpenPangu-VL video input (#34134), audio chunking for offline LLM (#34628), Parakeet audio encoder for nemotron-nano-vl (#35100), MiniCPM-o flagos (#34126).
  • LoRA: LFM2 (#34921), Llama 4 Vision tower/connector (#35147), max vocab size increased to 258048 (#34773), quantized LoRA adapters (#30286).
  • Task expansion: ColBERT extended to non-standard BERT backbones (#34170), multimodal scoring for late-interaction models (#34574).
  • Performance: Qwen3.5 GDN projector fusion (#34697), FlashInfer cuDNN backend for Qwen3 VL ViT (#34580), Step3.5-Flash NVFP4 (#34478), Qwen3MoE tuned configs for H200 (#35457).
  • Fixes: DeepSeek-VL V2 simplified loading (#35203), Qwen3/Qwen3.5 reasoning parser (#34779), Qwen2.5-Omni/Qwen3-Omni mixed-modality (#35368), Ernie4.5-VL garbled output (#35587), Qwen-VL tokenizer (#36140), Qwen-Omni audio cache (#35994), Nemotron-3-Nano NVFP4 accuracy with TP>1 (#34476).

Engine Core

  • Model Runner V2: Pipeline Parallel (#33960), Decode Context Parallel (#34179), piecewise & mixed CUDA graphs (#32771), Eagle3 with CUDA graphs (#35029, #35040), pooling models (#35120), DP+EP for spec decoding (#35294), bad_words sampling (#33433), ModelState architecture (#35350, #35383, #35564, #35621, #35774), design docs (#35819).
  • Weight offloading: V2 prefetching to hide latency (#29941), selective CPU weight offloading (#34535), CPU offloading without pinned memory doubling (#32993).
  • Sleep level 0 mode with enqueue/wait pattern (#33195), pause/resume moved into engine (#34125).
  • Fixes: allreduce_rms_fusion disabled by default with PP > 1 (#35424), DCP + FA3 crash (#35082), prefix caching for Mamba "all" mode (#34874), num_active_loras fix (#34119), async TP reduce-scatter reduction fix (#33088).
  • Repetitive token pattern detection flags (#35451).

Kernel

  • FlashAttention 4 integration (#32974).
  • FlashInfer Sparse MLA backend (#33451).
  • Triton-based top-k and top-p sampler kernels (#33538).
  • Faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680).
  • Optimized grouped topk kernel (#34206).
  • TRTLLM DSV3 Router GEMM kernel, 6% batch-1 speedup (#34302).
  • FA3 swizzle optimization (#34043).
  • 256-bit LDG/STG activation kernels (#33022).
  • TMA support for fused_moe_lora kernel (#32195).
  • Helion kernel framework: silu_mul_fp8 kernel (#33373), autotuning infrastructure (#34025), num_tokens autotuning (#34185), fx tracing via HOP (#34390), GPU variant canonicalization (#34928).
  • FlashInfer TRTLLM fused MoE non-gated FP8 & NVFP4 (#33506).
  • Optimized sample_recovered_tokens kernel (#34974).
  • KV cache update ops extraction from FlashInfer forward (#35422) and MLA backends (#34627).

Hardware & Performance

  • NVIDIA: SM100 FMHA FP8 prefill for MLA (#31195), SM100 MXFP8 blockscaled grouped MM and quant kernels (#34448), SM100 Oink RMSNorm path (#31828), SM120 FP8 GEMM optimization (#34424), FlashInfer DeepGEMM swapAB on SM90 by default (#34924), DeepSeek R1 BF16 min latency QKV GEMM 0.5% E2E speedup (#34758), Cublas BF16 gate with FP32 output (#35121), FlashInfer All Reduce default to TRTLLM backend (#35793).
  • AMD ROCm: AITER fused RoPE+KVCache (#33443), MXFP4 MoE weight pre-shuffling on gfx950 (#34192), bitsandbytes quantization (#34688), CK backend for MoE quantization (#34301), dynamic MXFP4 for DeepSeek V2 (#34157), GPT-OSS Quark format (#29008), GPT-OSS WMXFP4_AFP8 static scales (#30357), encoder/encoder-decoder on AITER (#35334), device capability derivation without CUDA init (#35069), aiter package renamed to amd-aiter (#35198).
  • Intel XPU: CUDA graph support (#34482), GPUDirect RDMA via NIXL (#35270), TORCH_SDPA/TRITON_ATTN as ViT backend (#35010), vllm-xpu-kernels v0.1.3 (#35984).
  • CPU: ARM BF16 cross-compilation (#33079), FP16 for s390x (#34116), KleidiAI INT8_W4A8 for all input dtypes (#34890), s390x vector intrinsics for attention (#34434), prefix caching for ppc64le (#35081), CPU release supports both AVX2 and AVX512 (#35466).
  • Performance: Pipeline Parallel async send/recv 2.9% E2E throughput (#33368), pooling maxsim 13.9% throughput improvement (#35330), Triton ViT attention backend (#32183), Mamba1 kernel-level chunk alignment for prefix caching (#34798), detokenizer optimization (#32975), pooling model copy optimization 1.8% throughput (#35127).

Large Scale Serving

  • Pipeline Parallel async send/recv, 2.9% throughput improvement (#33368).
  • Elastic EP Milestone 2 (#34861).
  • EPLB: Async rebalance algorithm (#30888), sync enforcement for NCCL backend (#35212).
  • Native weight syncing API via IPC for RL workflows (#34171).
  • Decode Context Parallel in Model Runner V2 (#34179).
  • Ray env var propagation to workers (#34383).
  • Breaking: KV load failure policy default changed from "recompute" to "fail" (#34896).
  • Cross-node data parallelism message queue fix (#35429).
  • NIXL: Token-based IPC API (#34175), version bound (#35495), NUMA core binding (#32365).

Speculative Decoding

  • Nemotron-H MTP and Mamba speculative decoding (#33726).
  • Eagle3 on Model Runner V2 with CUDA graphs (#35029, #35040), Eagle3 + disaggregated serving (#34529).
  • Hidden states extraction system (#33736).
  • min_tokens support with speculative decoding (#32642).
  • Reduced TP communication for draft generation (#34049).
  • MTP num_speculative_tokens > 1 with sparse MLA (#34552).
  • Sparse MLA + MTP with full CUDA graphs (#34457).
  • Spec decoding in Mamba cache align mode (#33705).
  • DP+EP for spec decoding in Model Runner V2 (#35294).
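A minimal sketch of the kind of speculative-decoding configuration these items enable. The dict keys follow vLLM's speculative_config format; the model names are placeholders, not real checkpoints:

```python
# Hedged sketch: an Eagle3 speculative-decoding config of the shape vLLM's
# speculative_config accepts. Model paths below are placeholders.
speculative_config = {
    "method": "eagle3",               # Eagle3 drafting, now runnable on Model Runner V2
    "model": "<eagle3-draft-model>",  # placeholder: path or HF id of the draft model
    "num_speculative_tokens": 3,      # how many draft tokens to propose per step
}

# Passing it to the engine requires a GPU and real weights, so the call is
# shown commented out, for shape only:
# from vllm import LLM
# llm = LLM(model="<target-model>", speculative_config=speculative_config)
```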

MoE Refactor

  • MoERunner abstraction (#32344) with modular kernel architecture.
  • MXFP4 Cutlass Experts to modular kernel (#34542), MXFP4 Marlin to modular kernel format (#34588), TRTLLM Kernels MK (#32564).
  • MoEActivation enum (#33843).
  • Improved default Triton fused MoE configs (#34846).
  • Fused MoE + LoRA shared expert dual stream, 1.07x throughput (#34933).
  • DSV3 QKVAProj GEMM custom op for torch.compile (#35751).
  • Fix routing for models without expert groups (MiniMax-M2.1) (#34673).

torch.compile

  • AOT compile with PyTorch 2.10 (#34155).
  • AR+RMSNorm fusion by default at -O2 (#34299).
  • SiLU+FP4 quant fusion by default at O1+ (#34718).
  • Sequence parallelism threshold compile ranges (#28672).
  • Various compile fixes: recursive pre_grad_passes (#34092), FakeTensorProp elimination (#34093), time discrepancy logging (#34912), artifact load errors (#35115), atomic artifact saving (#35117), pytree slice caching (#35308), fast_moe_cold_start undo for torch>=2.11 (#35475).

Quantization

  • Quantized LoRA adapters (#30286).
  • Per-head KV cache scales in attention selector (#34281).
  • FP8 MoE bias for GPT-OSS (#34906).
  • SM100 MXFP8 blockscaled grouped MM and quant kernels (#34448).
  • Mixed precision support for ModelOpt (#35047).
  • Llama-4 attention quantization (int8, fp8) (#34243).
  • Sparse24 compressed tensors fix (#33446).
  • KV scale loading fix for MLA models (#35430).
  • Compressed tensors as ground-truth for quant strategies (#34254).
  • AMD: CK backend for MoE (#34301), dynamic MXFP4 for DeepSeek V2 (#34157), bitsandbytes on ROCm (#34688), GPT-OSS Quark format (#29008).
  • CPU: KleidiAI INT8_W4A8 for all input dtypes (#34890).
  • Qwen3.5: FP8 weight loading fix (#35289), mlp.gate not quantizable (#35156).
  • int4_w4a16 fused_moe benchmark and tuning (#34130).
  • FlashInfer integrate mm_mxfp8 in ModelOpt MXFP8 (#35053).
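For the quantized LoRA adapter support above, a hedged sketch of the request-time loading path. enable_lora and LoRARequest are vLLM's standard multi-LoRA entry points; the base model and adapter path are placeholders, and the engine calls are commented out because they require a GPU:

```python
# Hedged sketch: loading a quantized (QLoRA-style) adapter at request time.
# Placeholders only; no real model ids or adapter paths.
base_model = "<base-model>"                  # placeholder HF model id
adapter_path = "/path/to/quantized-adapter"  # placeholder QLoRA adapter directory

# from vllm import LLM, SamplingParams
# from vllm.lora.request import LoRARequest
#
# llm = LLM(model=base_model, enable_lora=True)
# outputs = llm.generate(
#     ["Write a haiku about GPUs."],
#     SamplingParams(max_tokens=32),
#     lora_request=LoRARequest("qlora_adapter", 1, adapter_path),
# )
```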

API & Frontend

  • Anthropic API: Thinking blocks (#33671), count_tokens (#35588), tool_choice=none (#35835), tool call streaming fix (#34887), base64 image handling (#35557).
  • Responses API: Structured outputs (#33709), reasoning_tokens fix (#33513), reasoning_part streaming events (#35184).
  • UX: --performance-mode {balanced, interactivity, throughput} (#34936), --moe-backend for explicit kernel selection (#33807), --language-model-only for hybrid models (#34120), --enforce-eager clarification (#34523).
  • Whisper automatic language detection (#34342).
  • MFU Prometheus counters (#30950).
  • Unrecognized environment variable warnings (#33581).
  • generation_config max_tokens is now treated as a default, not a ceiling (#34063).
  • Structured output bugfix for completions (#35237).
  • Structured output JSON feature validation (#33233).
  • Validate non-text content in system messages (#34072).
  • Explicit validation error for tool calls (#34438).
  • IO Processor plugin simplification (#34236).
  • Sparse embedding IO process plugin (#34214).
  • Pooling entrypoint improvements (#35604).

Security

  • Fix SSRF bypass via backslash-@ URL parsing inconsistency (#34743).

Dependencies

  • PyTorch 2.10.0 upgrade, a breaking change requiring environment updates. ROCm torch also updated to the official 2.10 release (#34387).
  • OpenTelemetry libraries included by default (#34466).
  • Added an upper version bound for NIXL (#35495).
  • mooncake-transfer-engine added to kv_connectors requirements (#34826).
  • openai pinned to versions below 2.25.0.
  • lm-eval bumped for Transformers v5 compatibility (#33994).
  • mamba-ssm bumped for Transformers v5 (#34233).
  • PyPI source distribution (sdist) now included (#35136).
  • amd-quark package added for ROCm (#35658).

V0 Deprecation

  • Removed per-request logits processors (#34400).
  • Removed unused MM placeholders in request output (#34944).
  • Removed Swin model (#35821).
  • Scheduled v0.17 deprecations applied (#35441).

Transformers v5 Compatibility

  • Model fixes: Qwen3VL (#34262), JAIS (#34264), MiniCPM-V, GLM-ASR, Qwen3.5.
  • Xet high-performance mode (#35098).
  • Custom processor import fixes (#35101, #35107).
  • padding_index removal for compatibility (#35189).
  • lm-eval (#33994) and mamba-ssm (#34233) version bumps.

New Contributors 🎉
