vLLM v0.13.0 Release Notes
Highlights
This release features 442 commits from 207 contributors (61 new contributors)!
Breaking Changes: This release includes deprecation removals, PassConfig flag renames, and attention configuration changes from environment variables to CLI arguments. Please review the breaking changes section carefully before upgrading.
Model Support
- New models: BAGEL (AR only) (#28439), AudioFlamingo3 (#30539), JAIS 2 (#30188), latent MoE architecture support (#30203).
- Tool parsers: DeepSeek-V3.2 (#29848), Gigachat 3 (#29905), Holo2 reasoning (#30048).
- Model enhancements: Qwen3-VL embeddings support (#30037), Qwen3-VL EVS (Efficient Video Sampling) (#29752), DeepSeek V3.2 proper `drop_thinking` logic (#30490), DeepSeek V3.2 top-k fix (#27568).
- Task expansion: Automatic TokenClassification model conversion (#30666), Ultravox v0.7 transformer projector (#30089).
- Quantization: BitsAndBytes for Qwen3-Omni-MoE (#29896).
- Speculative decoding: Eagle/Eagle3 Transformers backend (#30340), Mamba `selective_state_update` spec decode (#29488).
Engine Core
- Compilation: Conditional compilation via `compile_ranges` for selective kernel compilation (#24252); see the sketch after this list.
- Prefix caching: xxHash high-performance hash option (#29163).
- Attention: PrefixLM support for FlexAttention (#27938) and TritonAttention (#30386), CUDA graphs for 3D Triton attention (#28306),
`TRITON_MLA` without prefix caching (#29125).
- Batch invariance: FA2 and LoRA batch-invariant support (#30018).
- Pooling: Chunked prefill for ALL pooling tasks (#27145), multi-vector retrieval API (#26686).
- Model Runner V2: Min-p sampling (#30171), NaN detection in logits (#30187).
- Speculative decoding: Medusa GPU-CPU sync avoidance (#29723), async spec-decode improvements (#29624).
- Whisper: Encoder batching (#29421), `FULL_DECODE_ONLY` CUDA graph (#30072).
- Performance: Fused blockwise quant RMS norm (#27883), reduced MoE LoRA loading overhead (#30243), encoder cache optimization (#30475), CPU KV offloading streams (#29013).
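A hedged sketch of the two engine-core knobs above. The `compile_ranges` field on `CompilationConfig` and the `"xxhash"` algorithm value are assumptions inferred from #24252 and #29163; check the v0.13.0 docs for the exact spellings.

```python
# Hedged sketch: selective kernel compilation plus the new prefix-cache
# hash. The `compile_ranges` field and the "xxhash" value are assumed
# names; `enable_prefix_caching` and `prefix_caching_hash_algo` are
# existing engine arguments.
from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    # Compile kernels only for selected input-size ranges (assumed field).
    compilation_config=CompilationConfig(compile_ranges=[(1, 256)]),
    enable_prefix_caching=True,
    # Use the high-performance xxHash for prefix-cache block keys.
    prefix_caching_hash_algo="xxhash",
)
print(llm.generate("Hello, world!")[0].outputs[0].text)
```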
Hardware & Performance
- NVIDIA Blackwell Ultra: SM103 (GB300) support with CUDA 13 (#30484).
- DeepSeek optimizations (benchmarked on DeepSeek-V3.1):
- DeepEP High-Throughput CUDA graph enabled by default: 5.3% throughput, 4.4% TTFT improvement (#29558)
- DeepGEMM fused layout kernel: 4.3% throughput, 10.7% TTFT improvement (#29546)
- DeepGEMM experts initialization: 3.9% TTFT improvement (#30494)
- `group_topk` kernel: 1.9% throughput, 2.1% TPOT improvement (#30159)
- Sparse prefill kernel for FP8 KV-cache in DeepSeek-V3.2 (#27532)
- MLA FP8 optimization with ReduceScatterSum (#29795), direct k_nope/k_pe copy (#29710)
- CPU: Whisper support (#30062), Arm Optimized Routines vectorized exp (#30068), x86 CPU wheel pipeline (#28848).
- AMD ROCm: Aiter quantization kernels (#25552), torch.compile layernorm/silu + FP8 quant (#25693), Triton ScaledMM fallback (#26668), MXFP4 w4a4 inference (#29775).
- Intel XPU: wNa16 compressed tensors (#29484).
- Build: CUDA 13 aarch64 wheels (#30341), Docker kernel build stage (#29452), Ascend NPU Docker (#30015).
Large Scale Serving & Disaggregated Prefill/Decode
- KV connectors: Mooncake Transfer Engine (#24718), cache reset via `/reset_prefix_cache` (#27170; see the sketch after this list), KV events (#28309), failure recovery config (#26813).
- NIXL: Compatibility checking in handshake (#29503), large-batch proxy support (#28782).
- EPLB: NVFP4 support (#29804), algorithm abstraction (#26471).
- Multi-node: External launcher mode (#29833).
- Hybrid allocator: Optional KV connector integration (#29805).
- Performance: `silu_mul_per_token_group_quant_fp8` kernel for DP/EP (#29470).
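A minimal sketch of the cache-reset hook named above, assuming an OpenAI-compatible server already running on localhost:8000.

```python
# Hedged sketch: clear all prefix-cache blocks on a live server via the
# /reset_prefix_cache endpoint (#27170). The host and port are
# assumptions for a local `vllm serve` deployment.
import requests

resp = requests.post("http://localhost:8000/reset_prefix_cache")
resp.raise_for_status()
print("prefix cache reset, status", resp.status_code)
```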
Quantization
- New: W4A8 grouped GEMM on Hopper (#29691), online FP8 with streaming post-processing (#29196; see the sketch after this list), FP8 weight reloading for RLHF (#28480).
- MoE + LoRA: AWQ Marlin (#30442) and GPTQ Marlin (#30254) support.
- GGUF: MoE + GGUF restored for Qwen3 MoE (#30116), Qwen2 MoE (#30307), HF defaults override (#30118).
- Compatibility: Transformers v5 RoPE support (#30046).
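A minimal sketch of the online FP8 path, under the assumption that the streaming post-processing from #29196 engages automatically when FP8 quantization is requested on supported hardware.

```python
# Hedged sketch: online FP8 quantization of a BF16 checkpoint. Whether
# the streaming post-processing path (#29196) applies here automatically
# is an assumption; it may depend on model and GPU support.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
print(llm.generate("The capital of France is")[0].outputs[0].text)
```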
API & Frontend
- Responses API: MCP type infrastructure (#30054), Browser/Container MCP tools (#29989), full MCP Python loop (#29798), extra body parameters (#30532).
- Configuration: `AttentionConfig` replaces the `VLLM_ATTENTION_BACKEND` env var (#26315).
- Chat templates: DeepSeek-V3.2 (#29837), DeepSeek-V3.2 developer tools (#30040).
- Anthropic API: Streaming fixes (#29971, #30266).
- Embeddings: Binary format via `encoding_format=bytes_only` (#30249; see the sketch after this list), multiple images/audio clips per request (#29988), `tokenization_kwargs` override (#29794).
- Metrics: Prefill KV compute metric that excludes cached tokens (#30189).
- Profiling: Layer-wise NVTX (#29990), profiling CLI config (#29912).
- UX: Better OOM errors (#28051), ModelConfig validation (#30213), distributed executor errors (#30140).
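A minimal sketch of the new binary embedding format. The `bytes_only` value comes from #30249; the model name, server address, and response framing are assumptions, so inspect the payload before parsing it.

```python
# Hedged sketch: request embeddings in the binary `bytes_only` format
# (#30249) from a local OpenAI-compatible server. How the bytes are
# framed in the response is an assumption; check before decoding.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "intfloat/e5-small-v2",  # any served embedding model
        "input": "vLLM binary embeddings",
        "encoding_format": "bytes_only",
    },
)
resp.raise_for_status()
print(resp.headers.get("content-type"), len(resp.content), "bytes")
```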
Security
- Additional protection for CVE-2025-62164 (#30649).
Dependencies
Breaking Changes & Deprecations
- PassConfig flags renamed per RFC #27995 (#29646)
- Attention env vars → CLI args: `VLLM_ATTENTION_BACKEND` replaced with `--attention-backend` (#26315); see the migration sketch after this list
- Removed `-O.xx` flag (#29991)
- Removed deprecated plugin/compilation fields (#30396)
- Removed deprecated task, seed, MM settings (#30397)
- Removed `embed_input_ids`/`embed_multimodal` fallbacks (#30458)
- Removed tokenizer setter (#30400)
- Deprecations: `merge_by_field_config` (#30035, #30170), `--convert reward` → `--convert embed` (#30463)
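A migration sketch for the attention-backend change. The `--attention-backend` CLI flag comes from #26315; the Python keyword below merely mirrors it and is an assumption, as is the backend name.

```python
# Hedged migration sketch for #26315.
# Old (no longer honored in v0.13.0):
#   VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve <model>
# New CLI form:
#   vllm serve <model> --attention-backend FLASHINFER
# The `attention_backend` keyword below mirrors the CLI flag and is an
# assumed name; verify against the v0.13.0 engine-argument docs.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    attention_backend="FLASHINFER",
)
```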
New Contributors 🎉
- @ajpqs made their first contribution in #29905
- @amitz-nv made their first contribution in #29978
- @amrmahdi made their first contribution in #29452
- @andrewbriand made their first contribution in #29804
- @anker-c2 made their first contribution in #30344
- @AuruTus made their first contribution in #30182
- @avigny made their first contribution in #19425
- @Bhanu068 made their first contribution in #30254
- @Copilot made their first contribution in #29025
- @dbotwinick made their first contribution in #30583
- @dependabot[bot] made their first contribution in #30234
- @desertfire made their first contribution in #29919
- @dmitry-tokarev-nv made their first contribution in #30149
- @drslark made their first contribution in #30632
- @dtcccc made their first contribution in #24718
- @elizabetht made their first contribution in #28671
- @Elm8116 made their first contribution in #30068
- @gausah01 made their first contribution in #29604
- @gh-wf made their first contribution in #30285
- @hdlj-h made their first contribution in #30056
- @HF-001 made their first contribution in #30051
- @hzxuzhonghu made their first contribution in #29931
- @JaviS-Rei made their first contribution in #29882
- @johannesflommersfeld made their first contribution in #30390
- @KevinMusgrave made their first contribution in #30529
- @kitaekatt made their first contribution in #30408
- @lashahub made their first contribution in #30539
- @LuminolT made their first contribution in #29163
- @majiayu000 made their first contribution in #30615
- @MaoJianwei made their first contribution in #29797
- @Mercykid-bash made their first contribution in #26471
- @mgehre-amd made their first contribution in #30364
- @mivehk made their first contribution in #30512
- @mondaylord made their first contribution in #30671
- @noa-neria made their first contribution in #29320
- @PatrykSaffer made their first contribution in #30330
- @Peng-YM made their first contribution in #29074
- @realliujiaxu made their first contribution in #30059
- @redwrasse made their first contribution in #29261
- @Ri0S made their first contribution in #30532
- @sarathc-cerebras made their first contribution in #30188
- @scratch-ml made their first contribution in #30351
- @seokhyunan made their first contribution in #30648
- @shaharmor98 made their first contribution in #30203
- @taoyun951753 made their first contribution in #30037
- @tom-zju made their first contribution in #30057
- @tomtomjhj made their first contribution in #29692
- @vkuzo made their first contribution in #29196
- @vladnosiv made their first contribution in #30490
- @weiguihua2 made their first contribution in #30042
- @wenqiglantz made their first contribution in #30649
- @wkcn made their first contribution in #29879
- @wu-kan made their first contribution in #21804
- @wz1qqx made their first contribution in #30376
- @xyDong0223 made their first contribution in #30455
- @yifant-code made their first contribution in #30213
- @yjc9696 made their first contribution in #30040
- @yurekami made their first contribution in #30552
- @yuttian1 made their first contribution in #30102
- @ZhijianJiang made their first contribution in #30219
- @ZhiweiYan-96 made their first contribution in #29773
Full Changelog: v0.12.0...v0.13.0