v0.15.0

Highlights

This release features 335 commits from 158 contributors (39 new)!

Model Support

  • New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456).
  • LoRA expansion: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763).
  • Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), draft model support (#24322); a minimal offline example follows this list.
  • Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings (#14526).
  • Model enhancements: Voxtral streaming architecture (#32861), SharedFusedMoE for Qwen3MoE (#32082), dynamic resolution for Nemotron Nano VL (#32121), Molmo2 vision backbone quantization (#32385).
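
Below is a minimal offline sketch of the draft-model speculative decoding path, assuming the documented `speculative_config` dict form; both model ids are placeholders rather than pairings validated for this release.

```python
# Draft-model speculative decoding with the offline LLM entrypoint.
# The target/draft model ids are placeholders; any pair sharing a tokenizer works.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",         # target model (placeholder)
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # smaller draft model (placeholder)
        "num_speculative_tokens": 4,                  # tokens proposed per verification step
    },
)

outputs = llm.generate(
    ["Speculative decoding lets a small draft model propose tokens that"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```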

Engine Core

  • Async scheduling + Pipeline Parallelism: --async-scheduling now works with pipeline parallelism (#32359).
  • Mamba prefix caching: Block-aligned prefix caching for Mamba/hybrid models via --enable-prefix-caching --mamba-cache-mode align, achieving ~2x speedup by caching Mamba states directly (#30877); a usage sketch follows this list.
  • Session-based streaming input: New incremental input support for interactive workloads like ASR. Accepts async generators producing StreamingInput objects while maintaining KV cache alignment (#28973).
  • Model Runner V2: VLM support (#32546), architecture improvements.
  • LoRA: Inplace loading for memory efficiency (#31326).
  • AOT compilation: torch.compile inductor artifacts support (#25205).
  • Performance: Avoid redundant loads during KV cache offloading (#29087); separate attention computation from the KV cache update in the FlashAttention backend (#25954).
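
A minimal sketch of the Mamba prefix-caching setup, assuming a hybrid Mamba model id (placeholder); only the generic prefix-caching knob is shown from Python, with the CLI pairing quoted above reproduced in a comment.

```python
# Prefix caching for a Mamba/hybrid model. On the server CLI the notes pair
#   --enable-prefix-caching --mamba-cache-mode align
# (e.g. `vllm serve <hybrid-model> --enable-prefix-caching --mamba-cache-mode align`).
# The offline sketch below only sets the generic prefix-caching flag; check
# EngineArgs for the Python-level name of the Mamba cache mode option.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-granite/granite-4.0-tiny-preview",  # example hybrid Mamba+attention model (placeholder)
    enable_prefix_caching=True,
)

shared_prefix = "You are a concise assistant.\n\n"
prompts = [shared_prefix + q for q in ("What is vLLM?", "What is prefix caching?")]
# The second request reuses the cached state computed for the shared prefix.
print([o.outputs[0].text for o in llm.generate(prompts, SamplingParams(max_tokens=32))])
```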

Hardware & Performance

NVIDIA

  • Blackwell defaults: FlashInfer MLA is now the default MLA backend on Blackwell, with TRTLLM as the default prefill backend (#32615); a backend-override sketch follows this list.
  • MoE performance: 1.2-2% E2E throughput improvement via grouped topk kernel fusion (#32058), NVFP4 small-batch decoding improvement (#30885), faster cold start for MoEs with torch.compile (#32805).
  • FP4 kernel optimization: Up to 65% faster FP4 quantization on Blackwell (SM100F) using 256-bit loads, ~4% E2E throughput improvement (#32520).
  • Kernel improvements: topk_sigmoid kernel for MoE routing (#31246), atomics reduce counting for SplitK skinny GEMMs (#29843), fused cat+quant for FP8 KV cache in MLA (#32950).
  • torch.compile: SiluAndMul and QuantFP8 CustomOp compilation (#32806), Triton prefill attention performance (#32403).
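
The new defaults apply automatically on Blackwell; to pin a backend explicitly, vLLM honors the VLLM_ATTENTION_BACKEND environment variable. The backend identifier and model id below are assumptions to verify against your installed build.

```python
# Pin the MLA attention backend via environment variable instead of relying on
# the platform default. "FLASHINFER_MLA" is an assumed identifier; check the
# backend names registered in your vLLM build.
import os

from vllm import LLM

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER_MLA"  # assumed name; set before engine construction

# Placeholder MLA-style model; any DeepSeek-family model exercises the MLA path.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat", trust_remote_code=True)
print(llm.generate(["Hello"])[0].outputs[0].text)
```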

AMD ROCm

  • MoRI EP: High-performance all2all backend for Expert Parallel (#28664).
  • Attention improvements: Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#29887).
  • FP4 support: MLA projection GEMMs with dynamic quantization (#32238).
  • Consumer GPU support: Flash Attention Triton backend on RDNA3/RDNA4 (#32944).

Other Platforms

  • TPU: Pipeline parallelism support (#28506), backend option (#32438).
  • Intel XPU: AgRsAll2AllManager for distributed communication (#32654).
  • CPU: NUMA-aware acceleration for TP/DP inference on ARM (#32792), PyTorch 2.10 (#32869).
  • Whisper: torch.compile support (#30385).
  • WSL: Platform compatibility fix for Windows Subsystem for Linux (#32749).

Quantization

  • MXFP4: W4A16 support for compressed-tensors MoE models (#32285).
  • Non-gated MoE: Quantization support with Marlin, NVFP4 CUTLASS, FP8, INT8, and compressed-tensors (#32257).
  • Intel: Quantization Toolkit integration (#31716).
  • FP8 KV cache: Per-tensor and per-attention-head quantization via llmcompressor (#30141); a usage sketch follows this list.
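
A minimal sketch of the FP8 KV cache path, assuming the existing kv_cache_dtype engine argument; the model id is a placeholder, and the new per-tensor/per-attention-head scales come into play when loading a checkpoint produced by llmcompressor.

```python
# Store K/V in FP8 to shrink the cache. Checkpoints exported by llmcompressor
# can carry per-tensor or per-attention-head KV scales, which the engine loads
# when present. Model id is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # or an llmcompressor-quantized checkpoint
    kv_cache_dtype="fp8",
)
out = llm.generate(["The KV cache stores"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```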

API & Frontend

  • Responses API: Partial message generation (#32100), include_stop_str_in_output tuning (#32383), prompt_cache_key support (#32824).
  • OpenAI API: skip_special_tokens configuration (#32345); a client sketch follows this list.
  • Score endpoint: Flexible input formats with data_1/data_2 and queries/documents (#32577).
  • Render endpoints: New endpoints for prompt preprocessing (#32473).
  • Whisper API: avg_logprob and compression_ratio in verbose_json segments (#31059).
  • Security: FIPS 140-3 compliant hash option for enterprise/government users (#32386), --ssl-ciphers CLI argument (#30937).
  • UX improvements: Auto api_server_count based on dp_size (#32525), wheel variant auto-detection during install (#32948), custom profiler URI schemes (#32393).
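
A client-side sketch of the skip_special_tokens option, assuming it is passed through the OpenAI client's extra_body like other vLLM-specific sampling parameters; the server URL and model name are placeholders.

```python
# Chat completion against a local vLLM OpenAI-compatible server, forwarding the
# vLLM-specific skip_special_tokens knob via extra_body.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder server

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Reply in one sentence."}],
    extra_body={"skip_special_tokens": False},  # keep special tokens in the returned text
)
print(resp.choices[0].message.content)
```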

Dependencies

  • FlashInfer v0.6.1 (#30993)
  • Transformers 4.57.5 (#32287)
  • PyTorch 2.10 for CPU backend (#32869)
  • DeepGEMM bumped to a newer version (#32479)

Breaking Changes & Deprecations

  • Metrics: Removed the deprecated vllm:time_per_output_token_seconds metric; use vllm:inter_token_latency_seconds instead (#32661). A quick scrape check follows this list.
  • Environment variables: Removed previously deprecated environment variables (#32812).
  • Quantization: DeepSpeedFp8 removed (#32679), RTN removed (#32697), HQQ deprecated (#32681).
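
A quick scrape check for the metric rename, assuming the API server's standard /metrics Prometheus endpoint; the URL is a placeholder and the metric names are the ones quoted above.

```python
# Verify dashboards/alerts against the renamed latency metric.
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=5).text  # placeholder URL

assert "vllm:time_per_output_token_seconds" not in metrics, "old metric name still exported"
print("new metric exported:", "vllm:inter_token_latency_seconds" in metrics)
```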

Bug Fixes

  • Speculative decoding: Eagle draft_model_config fix (#31753).
  • DeepSeek: DeepSeek-V3.1 + DeepGEMM incompatible scale shapes fix (#32361).
  • Distributed: DP+MoE inference fix via CpuCommunicator (#31867), P/D with non-MoE DP fix (#33037).
  • EPLB: Possible deadlock fix (#32418).
  • NIXL: UCX memory leak fix by exporting UCX_MEM_MMAP_HOOK_MODE=none (#32181).
  • Structured output: Outlines byte fallback handling fix (#31391).

New Contributors 🎉

Full Changelog: v0.14.1...v0.15.0
