vLLM v0.13.0 Release Notes

Highlights

This release features 442 commits from 207 contributors (61 new contributors)!

Breaking Changes: This release includes deprecation removals, PassConfig flag renames, and attention configuration changes from environment variables to CLI arguments. Please review the breaking changes section carefully before upgrading.

Model Support

  • New models: BAGEL (AR only) (#28439), AudioFlamingo3 (#30539), JAIS 2 (#30188), latent MoE architecture support (#30203).
  • Tool parsers: DeepSeek-V3.2 (#29848), Gigachat 3 (#29905), Holo2 reasoning (#30048).
  • Model enhancements: Qwen3-VL embeddings support (#30037), Qwen3-VL EVS (Efficient Video Sampling) (#29752), DeepSeek V3.2 proper drop_thinking logic (#30490), DeepSeek V3.2 top-k fix (#27568).
  • Task expansion: Automatic TokenClassification model conversion (#30666), Ultravox v0.7 transformer projector (#30089).
  • Quantization: BitsAndBytes for Qwen3-Omni-MoE (#29896); a usage sketch follows this list.
  • Speculative decoding: Eagle/Eagle3 Transformers backend (#30340), Mamba selective_state_update spec decode (#29488).
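As a usage sketch for the BitsAndBytes item above: in-flight BnB quantization goes through the long-standing `quantization="bitsandbytes"` engine argument, so enabling it for the newly supported Qwen3-Omni-MoE family should look roughly as follows (the checkpoint name is illustrative):

```python
from vllm import LLM, SamplingParams

# Hedged sketch: in-flight BitsAndBytes quantization via the existing
# `quantization="bitsandbytes"` engine argument; the checkpoint name is
# illustrative for the Qwen3-Omni-MoE support added in #29896.
llm = LLM(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",  # illustrative model ID
    quantization="bitsandbytes",
)

outputs = llm.generate(
    ["Summarize the vLLM v0.13.0 highlights in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```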

Engine Core

  • Compilation: Conditional compilation via compile_ranges for selective kernel compilation (#24252).
  • Prefix caching: xxHash high-performance hash option (#29163); a selection sketch follows this list.
  • Attention: PrefixLM support for FlexAttention (#27938) and TritonAttention (#30386), CUDA graphs for 3D Triton attention (#28306), TRITON_MLA without prefix-caching (#29125).
  • Batch invariance: FA2 and LoRA batch-invariant support (#30018).
  • Pooling: Chunked prefill for all pooling tasks (#27145), multi-vector retrieval API (#26686).
  • Model Runner V2: Min-p sampling (#30171), NaN detection in logits (#30187).
  • Speculative decoding: Medusa GPU-CPU sync avoidance (#29723), async spec-decode improvements (#29624).
  • Whisper: Encoder batching (#29421), FULL_DECODE_ONLY CUDA graph (#30072).
  • Performance: Fused blockwise quant RMS norm (#27883), MoE LoRA loading reduction (#30243), encoder cache optimization (#30475), CPU KV offloading streams (#29013).
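On the prefix-caching item above, a hedged sketch: the block-hash algorithm is already selectable through the `prefix_caching_hash_algo` engine argument (historically `builtin` or `sha256`); we assume #29163 registers xxHash under a value along the lines of `"xxhash"`.

```python
from vllm import LLM

# Hedged sketch: `prefix_caching_hash_algo` is the existing knob for the
# prefix-cache block hash; the "xxhash" value is an assumption about the
# name registered by #29163.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
    prefix_caching_hash_algo="xxhash",  # assumed value from #29163
)
```

On the OpenAI-compatible server, the same knob is exposed as the `--prefix-caching-hash-algo` CLI flag.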

Hardware & Performance

  • NVIDIA Blackwell Ultra: SM103 (GB300) support with CUDA 13 (#30484).
  • DeepSeek optimizations (benchmarked on DeepSeek-V3.1):
    • DeepEP High-Throughput CUDA graph enabled by default: 5.3% higher throughput, 4.4% faster TTFT (#29558)
    • DeepGEMM fused layout kernel: 4.3% higher throughput, 10.7% faster TTFT (#29546)
    • DeepGEMM experts initialization: 3.9% faster TTFT (#30494)
    • group_topk kernel: 1.9% higher throughput, 2.1% faster TPOT (#30159)
    • Sparse prefill kernel for FP8 KV-cache in DeepSeek-V3.2 (#27532)
    • MLA FP8 optimization with ReduceScatterSum (#29795), direct k_nope/k_pe copy (#29710)
  • CPU: Whisper support (#30062), Arm Optimized Routines vectorized exp (#30068), x86 CPU wheel pipeline (#28848).
  • AMD ROCm: Aiter quantization kernels (#25552), torch.compile layernorm/silu + FP8 quant (#25693), Triton ScaledMM fallback (#26668), MXFP4 w4a4 inference (#29775).
  • Intel XPU: wNa16 compressed tensors (#29484).
  • Build: CUDA 13 aarch64 wheels (#30341), Docker kernel build stage (#29452), Ascend NPU Docker (#30015).

Large Scale Serving & Disaggregated Prefill/Decode

  • KV connectors: Mooncake Transfer Engine (#24718), cache reset via /reset_prefix_cache (#27170; a request sketch follows this list), KV events (#28309), failure recovery config (#26813).
  • NIXL: Compatibility checking in handshake (#29503), large batch proxy support (#28782).
  • EPLB: NVFP4 support (#29804), algorithm abstraction (#26471).
  • Multi-node: External launcher mode (#29833).
  • Hybrid allocator: Optional KV connector integration (#29805).
  • Performance: silu_mul_per_token_group_quant_fp8 kernel for DP/EP (#29470).
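For the cache-reset item above: the reset rides on the server's existing `/reset_prefix_cache` endpoint, which with #27170 also propagates through the KV connector. A minimal sketch, with the base URL assumed to be the local default:

```python
import requests

# POST to the running server's prefix-cache reset endpoint; with #27170
# the reset is forwarded to the KV connector as well. URL is assumed.
resp = requests.post("http://localhost:8000/reset_prefix_cache")
resp.raise_for_status()
print("prefix cache reset, status", resp.status_code)
```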

Quantization

  • New: W4A8 grouped GEMM on Hopper (#29691), online FP8 with streaming post-processing (#29196; see the sketch after this list), FP8 weight reloading for RLHF (#28480).
  • MoE + LoRA: AWQ Marlin (#30442) and GPTQ Marlin (#30254) support.
  • GGUF: MoE + GGUF restored for Qwen3 MoE (#30116), Qwen2 MoE (#30307), HF defaults override (#30118).
  • Compatibility: Transformers v5 RoPE support (#30046).
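For the online FP8 item above, the user-facing entry point is unchanged; a minimal sketch, assuming the streaming post-processing from #29196 sits behind the existing `quantization="fp8"` path:

```python
from vllm import LLM

# Online FP8: quantize an unquantized checkpoint at load time via the
# existing `quantization="fp8"` engine argument. We assume the streaming
# post-processing from #29196 is applied transparently on this path.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
)
```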

API & Frontend

  • Responses API: MCP type infrastructure (#30054), Browser/Container MCP tools (#29989), full MCP Python loop (#29798), extra body parameters (#30532).
  • Configuration: AttentionConfig replaces VLLM_ATTENTION_BACKEND env var (#26315).
  • Chat templates: DeepSeek-V3.2 (#29837), DeepSeek-V3.2 developer tools (#30040).
  • Anthropic API: Streaming fixes (#29971, #30266).
  • Embeddings: Binary format with encoding_format=bytes_only (#30249; a request sketch follows this list), multiple image/audio per request (#29988), tokenization_kwargs override (#29794).
  • Metrics: Prefill KV compute metric excluding cached tokens (#30189).
  • Profiling: Layer-wise NVTX (#29990), profiling CLI config (#29912).
  • UX: Better OOM errors (#28051), ModelConfig validation (#30213), distributed executor errors (#30140).
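For the binary-embeddings item above, a hedged request sketch using a raw HTTP call (the model ID and local base URL are assumptions; the exact binary framing of the response is server-defined):

```python
import requests

# Hedged sketch: request embeddings in the binary format added in #30249
# by setting encoding_format to "bytes_only". Model and URL are assumed.
payload = {
    "model": "intfloat/e5-mistral-7b-instruct",
    "input": ["vLLM v0.13.0 release notes"],
    "encoding_format": "bytes_only",
}
resp = requests.post("http://localhost:8000/v1/embeddings", json=payload)
resp.raise_for_status()
print(f"received {len(resp.content)} bytes of embedding payload")
```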

Dependencies

  • NVSHMEM 3.3.24 + CUDA 13 fix (#30149).
  • TPU tpu-inference 0.12.0 (#30221).

Breaking Changes & Deprecations

  1. PassConfig flags renamed per RFC #27995 (#29646)
  2. Attention env vars → CLI args: VLLM_ATTENTION_BACKEND replaced with --attention-backend (#26315); a migration sketch follows this list
  3. Removed -O.xx flag (#29991)
  4. Removed deprecated plugin/compilation fields (#30396)
  5. Removed deprecated task, seed, MM settings (#30397)
  6. Removed embed_input_ids/embed_multimodal fallbacks (#30458)
  7. Removed tokenizer setter (#30400)
  8. Deprecations: merge_by_field_config (#30035, #30170), --convert reward → --convert embed (#30463)
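A migration sketch for item 2. The CLI change itself is mechanical; for the offline API we assume the new flag is surfaced as an engine argument named `attention_backend` (the backend name `FLASHINFER` is purely illustrative):

```python
# Before (removed): backend selected via environment variable
#     VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve <model>
# Now (#26315): backend selected via CLI argument
#     vllm serve <model> --attention-backend FLASHINFER

# Offline API analogue; `attention_backend` as an engine-argument name is
# an assumption about how the CLI flag is exposed.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    attention_backend="FLASHINFER",  # assumed engine-arg name
)
```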

Full Changelog: v0.12.0...v0.13.0
