vLLM v0.13.0 Release Notes

Highlights

This release features 442 commits from 207 contributors (61 new contributors)!

Breaking Changes: This release includes deprecation removals, PassConfig flag renames, and attention configuration changes from environment variables to CLI arguments. Please review the breaking changes section carefully before upgrading.

Model Support

  • New models: BAGEL (AR only) (#28439), AudioFlamingo3 (#30539), JAIS 2 (#30188), latent MoE architecture support (#30203).
  • Tool parsers: DeepSeek-V3.2 (#29848), Gigachat 3 (#29905), Holo2 reasoning (#30048).
  • Model enhancements: Qwen3-VL embeddings support (#30037), Qwen3-VL EVS (Efficient Video Sampling) (#29752), DeepSeek V3.2 proper drop_thinking logic (#30490), DeepSeek V3.2 top-k fix (#27568).
  • Task expansion: Automatic TokenClassification model conversion (#30666), Ultravox v0.7 transformer projector (#30089).
  • Quantization: BitsAndBytes for Qwen3-Omni-MoE (#29896); a usage sketch follows this list.
  • Speculative decoding: Eagle/Eagle3 Transformers backend (#30340), Mamba selective_state_update spec decode (#29488).
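As a usage sketch for the BitsAndBytes item above: in-flight BnB quantization goes through the long-standing `quantization="bitsandbytes"` engine argument, so enabling it for the newly supported Qwen3-Omni-MoE family should look roughly as follows (the checkpoint name is illustrative):

```python
from vllm import LLM, SamplingParams

# Hedged sketch: in-flight BitsAndBytes quantization via the existing
# `quantization="bitsandbytes"` engine argument; the checkpoint name is
# illustrative for the Qwen3-Omni-MoE support added in #29896.
llm = LLM(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",  # illustrative model ID
    quantization="bitsandbytes",
)

outputs = llm.generate(
    ["Summarize the vLLM v0.13.0 highlights in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```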

Engine Core

  • Compilation: Conditional compilation via compile_ranges for selective kernel compilation (#24252).
  • Prefix caching: xxHash high-performance hash option (#29163); a selection sketch follows this list.
  • Attention: PrefixLM support for FlexAttention (#27938) and TritonAttention (#30386), CUDA graphs for 3D Triton attention (#28306), TRITON_MLA without prefix-caching (#29125).
  • Batch invariance: FA2 and LoRA batch-invariant support (#30018).
  • Pooling: Chunked prefill for all pooling tasks (#27145), multi-vector retrieval API (#26686).
  • Model Runner V2: Min-p sampling (#30171), NaN detection in logits (#30187).
  • Speculative decoding: Medusa GPU-CPU sync avoidance (#29723), async spec-decode improvements (#29624).
  • Whisper: Encoder batching (#29421), FULL_DECODE_ONLY CUDA graph (#30072).
  • Performance: Fused blockwise quant RMS norm (#27883), MoE LoRA loading reduction (#30243), encoder cache optimization (#30475), CPU KV offloading streams (#29013).
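On the prefix-caching item above, a hedged sketch: the block-hash algorithm is already selectable through the `prefix_caching_hash_algo` engine argument (historically `builtin` or `sha256`); we assume #29163 registers xxHash under a value along the lines of `"xxhash"`.

```python
from vllm import LLM

# Hedged sketch: `prefix_caching_hash_algo` is the existing knob for the
# prefix-cache block hash; the "xxhash" value is an assumption about the
# name registered by #29163.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
    prefix_caching_hash_algo="xxhash",  # assumed value from #29163
)
```

On the OpenAI-compatible server, the same knob is exposed as the `--prefix-caching-hash-algo` CLI flag.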

Hardware & Performance

  • NVIDIA Blackwell Ultra: SM103 (GB300) support with CUDA 13 (#30484).
  • DeepSeek optimizations (benchmarked on DeepSeek-V3.1):
    • DeepEP High-Throughput CUDA graph enabled by default: 5.3% higher throughput, 4.4% faster TTFT (#29558)
    • DeepGEMM fused layout kernel: 4.3% higher throughput, 10.7% faster TTFT (#29546)
    • DeepGEMM experts initialization: 3.9% faster TTFT (#30494)
    • group_topk kernel: 1.9% higher throughput, 2.1% faster TPOT (#30159)
    • Sparse prefill kernel for FP8 KV-cache in DeepSeek-V3.2 (#27532)
    • MLA FP8 optimization with ReduceScatterSum (#29795), direct k_nope/k_pe copy (#29710)
  • CPU: Whisper support (#30062), Arm Optimized Routines vectorized exp (#30068), x86 CPU wheel pipeline (#28848).
  • AMD ROCm: Aiter quantization kernels (#25552), torch.compile layernorm/silu + FP8 quant (#25693), Triton ScaledMM fallback (#26668), MXFP4 w4a4 inference (#29775).
  • Intel XPU: wNa16 compressed tensors (#29484).
  • Build: CUDA 13 aarch64 wheels (#30341), Docker kernel build stage (#29452), Ascend NPU Docker (#30015).

Large Scale Serving & Disaggregated Prefill/Decode

  • KV connectors: Mooncake Transfer Engine (#24718), cache reset via /reset_prefix_cache (#27170; a request sketch follows this list), KV events (#28309), failure recovery config (#26813).
  • NIXL: Compatibility checking in handshake (#29503), large batch proxy support (#28782).
  • EPLB: NVFP4 support (#29804), algorithm abstraction (#26471).
  • Multi-node: External launcher mode (#29833).
  • Hybrid allocator: Optional KV connector integration (#29805).
  • Performance: silu_mul_per_token_group_quant_fp8 kernel for DP/EP (#29470).
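For the cache-reset item above: the reset rides on the server's existing `/reset_prefix_cache` endpoint, which with #27170 also propagates through the KV connector. A minimal sketch, with the base URL assumed to be the local default:

```python
import requests

# POST to the running server's prefix-cache reset endpoint; with #27170
# the reset is forwarded to the KV connector as well. URL is assumed.
resp = requests.post("http://localhost:8000/reset_prefix_cache")
resp.raise_for_status()
print("prefix cache reset, status", resp.status_code)
```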

Quantization

  • New: W4A8 grouped GEMM on Hopper (#29691), online FP8 with streaming post-processing (#29196; see the sketch after this list), FP8 weight reloading for RLHF (#28480).
  • MoE + LoRA: AWQ Marlin (#30442) and GPTQ Marlin (#30254) support.
  • GGUF: MoE + GGUF restored for Qwen3 MoE (#30116), Qwen2 MoE (#30307), HF defaults override (#30118).
  • Compatibility: Transformers v5 RoPE support (#30046).
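For the online FP8 item above, the user-facing entry point is unchanged; a minimal sketch, assuming the streaming post-processing from #29196 sits behind the existing `quantization="fp8"` path:

```python
from vllm import LLM

# Online FP8: quantize an unquantized checkpoint at load time via the
# existing `quantization="fp8"` engine argument. We assume the streaming
# post-processing from #29196 is applied transparently on this path.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
)
```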

API & Frontend

  • Responses API: MCP type infrastructure (#30054), Browser/Container MCP tools (#29989), full MCP Python loop (#29798), extra body parameters (#30532).
  • Configuration: AttentionConfig replaces VLLM_ATTENTION_BACKEND env var (#26315).
  • Chat templates: DeepSeek-V3.2 (#29837), DeepSeek-V3.2 developer tools (#30040).
  • Anthropic API: Streaming fixes (#29971, #30266).
  • Embeddings: Binary format with encoding_format=bytes_only (#30249; a request sketch follows this list), multiple image/audio per request (#29988), tokenization_kwargs override (#29794).
  • Metrics: Prefill KV compute metric excluding cached tokens (#30189).
  • Profiling: Layer-wise NVTX (#29990), profiling CLI config (#29912).
  • UX: Better OOM errors (#28051), ModelConfig validation (#30213), distributed executor errors (#30140).
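For the binary-embeddings item above, a hedged request sketch using a raw HTTP call (the model ID and local base URL are assumptions; the exact binary framing of the response is server-defined):

```python
import requests

# Hedged sketch: request embeddings in the binary format added in #30249
# by setting encoding_format to "bytes_only". Model and URL are assumed.
payload = {
    "model": "intfloat/e5-mistral-7b-instruct",
    "input": ["vLLM v0.13.0 release notes"],
    "encoding_format": "bytes_only",
}
resp = requests.post("http://localhost:8000/v1/embeddings", json=payload)
resp.raise_for_status()
print(f"received {len(resp.content)} bytes of embedding payload")
```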

Dependencies

  • NVSHMEM 3.3.24 + CUDA 13 fix (#30149).
  • TPU tpu-inference 0.12.0 (#30221).

Breaking Changes & Deprecations

  1. PassConfig flags renamed per RFC #27995 (#29646)
  2. Attention env vars → CLI args: VLLM_ATTENTION_BACKEND replaced with --attention-backend (#26315); a migration sketch follows this list
  3. Removed -O.xx flag (#29991)
  4. Removed deprecated plugin/compilation fields (#30396)
  5. Removed deprecated task, seed, MM settings (#30397)
  6. Removed embed_input_ids/embed_multimodal fallbacks (#30458)
  7. Removed tokenizer setter (#30400)
  8. Deprecations: merge_by_field_config (#30035, #30170), --convert reward → --convert embed (#30463)
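A migration sketch for item 2. The CLI change itself is mechanical; for the offline API we assume the new flag is surfaced as an engine argument named `attention_backend` (the backend name `FLASHINFER` is purely illustrative):

```python
# Before (removed): backend selected via environment variable
#     VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve <model>
# Now (#26315): backend selected via CLI argument
#     vllm serve <model> --attention-backend FLASHINFER

# Offline API analogue; `attention_backend` as an engine-argument name is
# an assumption about how the CLI flag is exposed.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    attention_backend="FLASHINFER",  # assumed engine-arg name
)
```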

Full Changelog: v0.12.0...v0.13.0
