vLLM v0.17.0

Known Issue: If you are on CUDA 12.9+ and encounter a CUBLAS_STATUS_INVALID_VALUE error, the cause is a CUDA library version mismatch. To resolve it, try one of the following:

  1. Remove the path to system CUDA shared library files (e.g. /usr/local/cuda) from LD_LIBRARY_PATH, or simply unset LD_LIBRARY_PATH.
  2. Install vLLM with uv pip install vllm --torch-backend=auto.
  3. Install vLLM with pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129 (change the CUDA version to match your system).
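The first remedy can be checked programmatically. This is a small diagnostic sketch (not part of vLLM) that looks for system CUDA directories on LD_LIBRARY_PATH, the usual trigger for the mismatch:

```python
# Hedged diagnostic sketch: list LD_LIBRARY_PATH entries that look like system
# CUDA installs (e.g. /usr/local/cuda). If any appear, removing them, or
# unsetting LD_LIBRARY_PATH entirely, is remedy 1 above.
import os

ld_path = os.environ.get("LD_LIBRARY_PATH", "")
suspect = [p for p in ld_path.split(":") if p and "cuda" in p.lower()]
if suspect:
    print("System CUDA dirs on LD_LIBRARY_PATH (consider removing):", suspect)
else:
    print("LD_LIBRARY_PATH has no obvious system CUDA entries.")
```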

Highlights

This release features 699 commits from 272 contributors (48 new)!

  • PyTorch 2.10 Upgrade: This release upgrades to PyTorch 2.10.0, a breaking change: existing environments will need their dependencies updated.
  • FlashAttention 4 Integration: vLLM now supports the FlashAttention 4 backend (#32974), bringing next-generation attention performance.
  • Model Runner V2 Maturation: Model Runner V2 has reached a major milestone with Pipeline Parallel (#33960), Decode Context Parallel (#34179), Eagle3 speculative decoding with CUDA graphs (#35029, #35040), pooling model support (#35120), piecewise & mixed CUDA graph capture (#32771), DP+EP for spec decoding (#35294), and a new ModelState architecture. Design docs are now available (#35819).
  • Qwen3.5 Model Family: Full support for the Qwen3.5 model family (#34110) featuring GDN (Gated Delta Networks), with FP8 quantization, MTP speculative decoding, and reasoning parser support.
  • New --performance-mode Flag: A new --performance-mode {balanced, interactivity, throughput} flag (#34936) simplifies performance tuning for common deployment scenarios.
  • Anthropic API Compatibility: Added support for Anthropic thinking blocks (#33671), count_tokens API (#35588), tool_choice=none (#35835), and streaming/image handling fixes.
  • Weight Offloading V2 with Prefetching: The weight offloader now hides onloading latency via prefetching (#29941), plus selective CPU weight offloading (#34535) and CPU offloading without pinned memory doubling (#32993).
  • Elastic Expert Parallelism Milestone 2: Initial support for elastic expert parallelism enabling dynamic GPU scaling for MoE models (#34861).
  • Quantized LoRA Adapters: Users can now load quantized LoRA adapters (e.g. QLoRA) directly (#30286).
  • Transformers v5 Compatibility: Extensive work to ensure compatibility with HuggingFace Transformers v5 across models and utilities.

Model Support

  • New architectures: Qwen3.5 (#34110), COLQwen3 (#34398), ColModernVBERT (#34558), Ring 2.5 (#35102), skt/A.X-K1 (#32407), Ovis 2.6 (#34426), nvidia/llama-nemotron-embed-vl-1b-v2 (#35297), nvidia/llama-nemotron-rerank-vl-1b-v2 (#35735), nvidia/nemotron-colembed (#34574).
  • ASR models: FunASR (#33247), FireRedASR2 (#35727), Qwen3-ASR realtime streaming (#34613).
  • Multimodal: OpenPangu-VL video input (#34134), audio chunking for offline LLM (#34628), Parakeet audio encoder for nemotron-nano-vl (#35100), MiniCPM-o flagos (#34126).
  • LoRA: LFM2 (#34921), Llama 4 Vision tower/connector (#35147), max vocab size increased to 258048 (#34773), quantized LoRA adapters (#30286).
  • Task expansion: ColBERT extended to non-standard BERT backbones (#34170), multimodal scoring for late-interaction models (#34574).
  • Performance: Qwen3.5 GDN projector fusion (#34697), FlashInfer cuDNN backend for Qwen3 VL ViT (#34580), Step3.5-Flash NVFP4 (#34478), Qwen3MoE tuned configs for H200 (#35457).
  • Fixes: DeepSeek-VL V2 simplified loading (#35203), Qwen3/Qwen3.5 reasoning parser (#34779), Qwen2.5-Omni/Qwen3-Omni mixed-modality (#35368), Ernie4.5-VL garbled output (#35587), Qwen-VL tokenizer (#36140), Qwen-Omni audio cache (#35994), Nemotron-3-Nano NVFP4 accuracy with TP>1 (#34476).

Engine Core

  • Model Runner V2: Pipeline Parallel (#33960), Decode Context Parallel (#34179), piecewise & mixed CUDA graphs (#32771), Eagle3 with CUDA graphs (#35029, #35040), pooling models (#35120), DP+EP for spec decoding (#35294), bad_words sampling (#33433), ModelState architecture (#35350, #35383, #35564, #35621, #35774), design docs (#35819).
  • Weight offloading: V2 prefetching to hide latency (#29941), selective CPU weight offloading (#34535), CPU offloading without pinned memory doubling (#32993).
  • Sleep level 0 mode with enqueue/wait pattern (#33195), pause/resume moved into engine (#34125).
  • Fixes: allreduce_rms_fusion disabled by default with PP > 1 (#35424), DCP + FA3 crash (#35082), prefix caching for Mamba "all" mode (#34874), num_active_loras fix (#34119), async TP reduce-scatter reduction fix (#33088).
  • Repetitive token pattern detection flags (#35451).

Kernel

  • FlashAttention 4 integration (#32974).
  • FlashInfer Sparse MLA backend (#33451).
  • Triton-based top-k and top-p sampler kernels (#33538).
  • Faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention (#33680).
  • Optimized grouped topk kernel (#34206).
  • TRTLLM DSV3 Router GEMM kernel, 6% batch-1 speedup (#34302).
  • FA3 swizzle optimization (#34043).
  • 256-bit LDG/STG activation kernels (#33022).
  • TMA support for fused_moe_lora kernel (#32195).
  • Helion kernel framework: silu_mul_fp8 kernel (#33373), autotuning infrastructure (#34025), num_tokens autotuning (#34185), fx tracing via HOP (#34390), GPU variant canonicalization (#34928).
  • FlashInfer TRTLLM fused MoE non-gated FP8 & NVFP4 (#33506).
  • Optimized sample_recovered_tokens kernel (#34974).
  • KV cache update ops extraction from FlashInfer forward (#35422) and MLA backends (#34627).

Hardware & Performance

  • NVIDIA: SM100 FMHA FP8 prefill for MLA (#31195), SM100 MXFP8 blockscaled grouped MM and quant kernels (#34448), SM100 Oink RMSNorm path (#31828), SM120 FP8 GEMM optimization (#34424), FlashInfer DeepGEMM swapAB on SM90 by default (#34924), DeepSeek R1 BF16 min latency QKV GEMM 0.5% E2E speedup (#34758), Cublas BF16 gate with FP32 output (#35121), FlashInfer All Reduce default to TRTLLM backend (#35793).
  • AMD ROCm: AITER fused RoPE+KVCache (#33443), MXFP4 MoE weight pre-shuffling on gfx950 (#34192), bitsandbytes quantization (#34688), CK backend for MoE quantization (#34301), dynamic MXFP4 for DeepSeek V2 (#34157), GPT-OSS Quark format (#29008), GPT-OSS WMXFP4_AFP8 static scales (#30357), encoder/encoder-decoder on AITER (#35334), device capability derivation without CUDA init (#35069), aiter package renamed to amd-aiter (#35198).
  • Intel XPU: CUDA graph support (#34482), GPUDirect RDMA via NIXL (#35270), TORCH_SDPA/TRITON_ATTN as ViT backend (#35010), vllm-xpu-kernels v0.1.3 (#35984).
  • CPU: ARM BF16 cross-compilation (#33079), FP16 for s390x (#34116), KleidiAI INT8_W4A8 for all input dtypes (#34890), s390x vector intrinsics for attention (#34434), prefix caching for ppc64le (#35081), CPU release supports both AVX2 and AVX512 (#35466).
  • Performance: Pipeline Parallel async send/recv 2.9% E2E throughput (#33368), pooling maxsim 13.9% throughput improvement (#35330), Triton ViT attention backend (#32183), Mamba1 kernel-level chunk alignment for prefix caching (#34798), detokenizer optimization (#32975), pooling model copy optimization 1.8% throughput (#35127).

Large Scale Serving

  • Pipeline Parallel async send/recv, 2.9% throughput improvement (#33368).
  • Elastic EP Milestone 2 (#34861).
  • EPLB: Async rebalance algorithm (#30888), sync enforcement for NCCL backend (#35212).
  • Native weight syncing API via IPC for RL workflows (#34171).
  • Decode Context Parallel in Model Runner V2 (#34179).
  • Ray env var propagation to workers (#34383).
  • Breaking: KV load failure policy default changed from "recompute" to "fail" (#34896).
  • Cross-node data parallelism message queue fix (#35429).
  • NIXL: Token-based IPC API (#34175), version bound (#35495), NUMA core binding (#32365).

Speculative Decoding

  • Nemotron-H MTP and Mamba speculative decoding (#33726).
  • Eagle3 on Model Runner V2 with CUDA graphs (#35029, #35040), Eagle3 + disaggregated serving (#34529).
  • Hidden states extraction system (#33736).
  • min_tokens support with speculative decoding (#32642).
  • Reduced TP communication for draft generation (#34049).
  • MTP num_speculative_tokens > 1 with sparse MLA (#34552).
  • Sparse MLA + MTP with full CUDA graphs (#34457).
  • Spec decoding in Mamba cache align mode (#33705).
  • DP+EP for spec decoding in Model Runner V2 (#35294).
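A minimal sketch of the kind of speculative-decoding configuration these items enable. The dict keys follow vLLM's speculative_config format; the model names are placeholders, not real checkpoints:

```python
# Hedged sketch: an Eagle3 speculative-decoding config of the shape vLLM's
# speculative_config accepts. Model paths below are placeholders.
speculative_config = {
    "method": "eagle3",               # Eagle3 drafting, now runnable on Model Runner V2
    "model": "<eagle3-draft-model>",  # placeholder: path or HF id of the draft model
    "num_speculative_tokens": 3,      # how many draft tokens to propose per step
}

# Passing it to the engine requires a GPU and real weights, so the call is
# shown commented out, for shape only:
# from vllm import LLM
# llm = LLM(model="<target-model>", speculative_config=speculative_config)
```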

MoE Refactor

  • MoERunner abstraction (#32344) with modular kernel architecture.
  • MXFP4 Cutlass Experts to modular kernel (#34542), MXFP4 Marlin to modular kernel format (#34588), TRTLLM Kernels MK (#32564).
  • MoEActivation enum (#33843).
  • Improved default Triton fused MoE configs (#34846).
  • Fused MoE + LoRA shared expert dual stream, 1.07x throughput (#34933).
  • DSV3 QKVAProj GEMM custom op for torch.compile (#35751).
  • Fix routing for models without expert groups (MiniMax-M2.1) (#34673).

torch.compile

  • AOT compile with PyTorch 2.10 (#34155).
  • AR+RMSNorm fusion by default at -O2 (#34299).
  • SiLU+FP4 quant fusion by default at O1+ (#34718).
  • Sequence parallelism threshold compile ranges (#28672).
  • Various compile fixes: recursive pre_grad_passes (#34092), FakeTensorProp elimination (#34093), time discrepancy logging (#34912), artifact load errors (#35115), atomic artifact saving (#35117), pytree slice caching (#35308), fast_moe_cold_start undo for torch>=2.11 (#35475).

Quantization

  • Quantized LoRA adapters (#30286).
  • Per-head KV cache scales in attention selector (#34281).
  • FP8 MoE bias for GPT-OSS (#34906).
  • SM100 MXFP8 blockscaled grouped MM and quant kernels (#34448).
  • Mixed precision support for ModelOpt (#35047).
  • Llama-4 attention quantization (int8, fp8) (#34243).
  • Sparse24 compressed tensors fix (#33446).
  • KV scale loading fix for MLA models (#35430).
  • Compressed tensors as ground-truth for quant strategies (#34254).
  • AMD: CK backend for MoE (#34301), dynamic MXFP4 for DeepSeek V2 (#34157), bitsandbytes on ROCm (#34688), GPT-OSS Quark format (#29008).
  • CPU: KleidiAI INT8_W4A8 for all input dtypes (#34890).
  • Qwen3.5: FP8 weight loading fix (#35289), mlp.gate not quantizable (#35156).
  • int4_w4a16 fused_moe benchmark and tuning (#34130).
  • FlashInfer integrate mm_mxfp8 in ModelOpt MXFP8 (#35053).
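For the quantized LoRA adapter support above, a hedged sketch of the request-time loading path. enable_lora and LoRARequest are vLLM's standard multi-LoRA entry points; the base model and adapter path are placeholders, and the engine calls are commented out because they require a GPU:

```python
# Hedged sketch: loading a quantized (QLoRA-style) adapter at request time.
# Placeholders only; no real model ids or adapter paths.
base_model = "<base-model>"                  # placeholder HF model id
adapter_path = "/path/to/quantized-adapter"  # placeholder QLoRA adapter directory

# from vllm import LLM, SamplingParams
# from vllm.lora.request import LoRARequest
#
# llm = LLM(model=base_model, enable_lora=True)
# outputs = llm.generate(
#     ["Write a haiku about GPUs."],
#     SamplingParams(max_tokens=32),
#     lora_request=LoRARequest("qlora_adapter", 1, adapter_path),
# )
```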

API & Frontend

  • Anthropic API: Thinking blocks (#33671), count_tokens (#35588), tool_choice=none (#35835), tool call streaming fix (#34887), base64 image handling (#35557).
  • Responses API: Structured outputs (#33709), reasoning_tokens fix (#33513), reasoning_part streaming events (#35184).
  • UX: --performance-mode {balanced, interactivity, throughput} (#34936), --moe-backend for explicit kernel selection (#33807), --language-model-only for hybrid models (#34120), --enforce-eager clarification (#34523).
  • Whisper automatic language detection (#34342).
  • MFU Prometheus counters (#30950).
  • Unrecognized environment variable warnings (#33581).
  • generation_config max_tokens is now treated as a default, not a ceiling (#34063).
  • Structured output bugfix for completions (#35237).
  • Structured output JSON feature validation (#33233).
  • Validate non-text content in system messages (#34072).
  • Explicit validation error for tool calls (#34438).
  • IO Processor plugin simplification (#34236).
  • Sparse embedding IO process plugin (#34214).
  • Pooling entrypoint improvements (#35604).

Security

  • Fix SSRF bypass via backslash-@ URL parsing inconsistency (#34743).

Dependencies

  • PyTorch 2.10.0 upgrade, a breaking change requiring environment updates. ROCm torch also updated to the official 2.10 release (#34387).
  • OpenTelemetry libraries included by default (#34466).
  • Added an upper version bound for NIXL (#35495).
  • mooncake-transfer-engine added to kv_connectors requirements (#34826).
  • openai pinned to versions below 2.25.0.
  • lm-eval bumped for Transformers v5 compatibility (#33994).
  • mamba-ssm bumped for Transformers v5 (#34233).
  • PyPI source distribution (sdist) now included (#35136).
  • amd-quark package added for ROCm (#35658).

V0 Deprecation

  • Removed per-request logits processors (#34400).
  • Removed unused MM placeholders in request output (#34944).
  • Removed Swin model (#35821).
  • Scheduled v0.17 deprecations applied (#35441).

Transformers v5 Compatibility

  • Model fixes: Qwen3VL (#34262), JAIS (#34264), MiniCPM-V, GLM-ASR, Qwen3.5.
  • Xet high-performance mode (#35098).
  • Custom processor import fixes (#35101, #35107).
  • padding_index removal for compatibility (#35189).
  • lm-eval (#33994) and mamba-ssm (#34233) version bumps.

New Contributors 🎉
