Highlights
This release features 335 commits from 158 contributors (39 new)!
Model Support
- New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456).
- LoRA expansion: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763).
- Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), draft model support (#24322; see the sketch after this list).
- Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings (#14526).
- Model enhancements: Voxtral streaming architecture (#32861), SharedFusedMoE for Qwen3MoE (#32082), dynamic resolution for Nemotron Nano VL (#32121), Molmo2 vision backbone quantization (#32385).
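The new draft-model path (#24322) is configured through `speculative_config`; a minimal sketch, assuming illustrative model names and draft-token count:

```python
# Hedged sketch: speculative decoding with a separate draft model (#24322).
# Model names and num_speculative_tokens are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",         # target model (assumed)
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # draft model (assumed)
        "num_speculative_tokens": 4,                  # draft tokens per step
    },
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```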
Engine Core
- Async scheduling + Pipeline Parallelism: `--async-scheduling` now works with pipeline parallelism (#32359).
- Mamba prefix caching: Block-aligned prefix caching for Mamba/hybrid models with `--enable-prefix-caching --mamba-cache-mode align`. Achieves ~2x speedup by caching Mamba states directly (#30877); see the sketch after this list.
- Session-based streaming input: New incremental input support for interactive workloads like ASR. Accepts async generators producing `StreamingInput` objects while maintaining KV cache alignment (#28973).
- Model Runner V2: VLM support (#32546), architecture improvements.
- LoRA: In-place loading for memory efficiency (#31326).
- AOT compilation: torch.compile inductor artifacts support (#25205).
- Performance: KV cache offloading redundant load prevention (#29087), FlashAttn attention/cache update separation (#25954).
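A minimal sketch of the Mamba prefix-caching flags referenced above, assuming the `mamba_cache_mode` engine argument mirrors the `--mamba-cache-mode` CLI flag and using an illustrative hybrid model:

```python
# Hedged sketch: block-aligned prefix caching for a Mamba/hybrid model
# (#30877). The model name is illustrative; mamba_cache_mode is assumed to
# mirror the --mamba-cache-mode CLI flag from the notes.
from vllm import LLM

llm = LLM(
    model="ibm-ai-platform/Bamba-9B",  # illustrative hybrid model (assumed)
    enable_prefix_caching=True,        # --enable-prefix-caching
    mamba_cache_mode="align",          # --mamba-cache-mode align (assumed kwarg)
)
```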
Hardware & Performance
NVIDIA
- Blackwell defaults: FlashInfer MLA is now the default MLA backend on Blackwell, with TRTLLM as default prefill (#32615).
- MoE performance: 1.2-2% E2E throughput improvement via grouped topk kernel fusion (#32058), NVFP4 small-batch decoding improvement (#30885), faster cold start for MoEs with torch.compile (#32805).
- FP4 kernel optimization: Up to 65% faster FP4 quantization on Blackwell (SM100F) using 256-bit loads, ~4% E2E throughput improvement (#32520).
- Kernel improvements: topk_sigmoid kernel for MoE routing (#31246), atomics reduce counting for SplitK skinny GEMMs (#29843), fused cat+quant for FP8 KV cache in MLA (#32950).
- torch.compile: SiluAndMul and QuantFP8 CustomOp compilation (#32806), improved Triton prefill attention performance (#32403).
AMD ROCm
- MoRI EP: High-performance all2all backend for Expert Parallel (#28664).
- Attention improvements: Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#29887).
- FP4 support: MLA projection GEMMs with dynamic quantization (#32238).
- Consumer GPU support: Flash Attention Triton backend on RDNA3/RDNA4 (#32944).
Other Platforms
- TPU: Pipeline parallelism support (#28506), backend option (#32438).
- Intel XPU: AgRsAll2AllManager for distributed communication (#32654).
- CPU: NUMA-aware acceleration for TP/DP inference on ARM (#32792), PyTorch 2.10 (#32869).
- Whisper: torch.compile support (#30385).
- WSL: Platform compatibility fix for Windows Subsystem for Linux (#32749).
Quantization
- MXFP4: W4A16 support for compressed-tensors MoE models (#32285).
- Non-gated MoE: Quantization support with Marlin, NVFP4 CUTLASS, FP8, INT8, and compressed-tensors (#32257).
- Intel: Quantization Toolkit integration (#31716).
- FP8 KV cache: Per-tensor and per-attention-head quantization via llmcompressor (#30141).
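A minimal sketch of consuming such a checkpoint, assuming an illustrative model name whose KV scales were produced with llmcompressor; `kv_cache_dtype` is vLLM's existing engine argument:

```python
# Hedged sketch: serving with an FP8 KV cache (#30141). The checkpoint name
# is an illustrative assumption; kv_cache_dtype is a pre-existing engine arg.
from vllm import LLM

llm = LLM(
    model="nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV",  # assumed checkpoint
    kv_cache_dtype="fp8",
)
```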
API & Frontend
- Responses API: Partial message generation (#32100), `include_stop_str_in_output` tuning (#32383), `prompt_cache_key` support (#32824).
- OpenAI API: `skip_special_tokens` configuration (#32345).
- Score endpoint: Flexible input formats with `data_1`/`data_2` and `queries`/`documents` (#32577); see the sketch after this list.
- Render endpoints: New endpoints for prompt preprocessing (#32473).
- Whisper API: `avg_logprob` and `compression_ratio` in `verbose_json` segments (#31059).
- Security: FIPS 140-3 compliant hash option for enterprise/government users (#32386), `--ssl-ciphers` CLI argument (#30937).
- UX improvements: Auto `api_server_count` based on `dp_size` (#32525), wheel variant auto-detection during install (#32948), custom profiler URI schemes (#32393).
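A minimal sketch of the score endpoint's new aliases referenced above; the `data_1`/`data_2` field names come from the notes, while the server URL and model name are assumptions:

```python
# Hedged sketch: POST /score with the new data_1/data_2 aliases (#32577).
# Server URL and model name are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "BAAI/bge-reranker-v2-m3",  # assumed reranker model
        "data_1": "What is vLLM?",           # single query
        "data_2": [                          # candidate documents
            "vLLM is a high-throughput LLM inference engine.",
            "Bananas are rich in potassium.",
        ],
    },
)
print(resp.json())
```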
Dependencies
- FlashInfer v0.6.1 (#30993)
- Transformers 4.57.5 (#32287)
- PyTorch 2.10 for CPU backend (#32869)
- Updated DeepGEMM (#32479)
Breaking Changes & Deprecations
- Metrics: Removed the deprecated `vllm:time_per_output_token_seconds` metric; use `vllm:inter_token_latency_seconds` instead (#32661). See the check after this list.
- Environment variables: Removed deprecated environment variables (#32812).
- Quantization: DeepSpeedFp8 removed (#32679), RTN removed (#32697), HQQ deprecated (#32681).
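A minimal sketch of the metrics check referenced above, scraping the API server's Prometheus endpoint (the URL is an assumption):

```python
# Hedged sketch: verify dashboards can move off the removed metric (#32661).
# The server URL is an assumption; /metrics is the Prometheus endpoint.
import requests

body = requests.get("http://localhost:8000/metrics").text
assert "vllm:time_per_output_token_seconds" not in body  # removed metric
assert "vllm:inter_token_latency_seconds" in body        # its replacement
```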
Bug Fixes
- Speculative decoding: Eagle `draft_model_config` fix (#31753).
- DeepSeek: DeepSeek-V3.1 + DeepGEMM incompatible scale shapes fix (#32361).
- Distributed: DP+MoE inference fix via `CpuCommunicator` (#31867), P/D with non-MoE DP fix (#33037).
- EPLB: Possible deadlock fix (#32418).
- NIXL: UCX memory leak fix by exporting `UCX_MEM_MMAP_HOOK_MODE=none` (#32181).
- Structured output: Outlines byte fallback handling fix (#31391).
New Contributors 🎉
- @YunzhuLu made their first contribution in #32126
- @emricksini-h made their first contribution in #30784
- @dsfaccini made their first contribution in #32289
- @ofirzaf made their first contribution in #32312
- @seekskyworld made their first contribution in #32321
- @brian033 made their first contribution in #31715
- @TomerBN-Nvidia made their first contribution in #32257
- @vanshilshah97 made their first contribution in #32448
- @George-Polya made their first contribution in #32385
- @T1mn made their first contribution in #32411
- @mritunjaysharma394 made their first contribution in #31492
- @randzero made their first contribution in #32511
- @DemingCheng made their first contribution in #32556
- @iboiko-habana made their first contribution in #32471
- @honglyua-il made their first contribution in #32462
- @hyeongyun0916 made their first contribution in #32473
- @DanielMe made their first contribution in #32560
- @netanel-haber made their first contribution in #32121
- @longregen made their first contribution in #28784
- @jasonyanwenl made their first contribution in #32749
- @Wauplin made their first contribution in #32788
- @ikaadil made their first contribution in #32775
- @alexsun07 made their first contribution in #28664
- @liranschour made their first contribution in #30207
- @AuYang261 made their first contribution in #32844
- @diviramon made their first contribution in #32393
- @RishabhSaini made their first contribution in #32884
- @MatteoFari made their first contribution in #32397
- @peakcrosser7 made their first contribution in #30877
- @orionr made their first contribution in #30443
- @marksverdhei made their first contribution in #32614
- @joninco made their first contribution in #32935
- @monajafi-amd made their first contribution in #32944
- @ruizcrp made their first contribution in #32988
- @sjhddh made their first contribution in #32983
- @HirokenOvo made their first contribution in #32646
- @Chenhao-Guan made their first contribution in #32763
- @joshuadeng made their first contribution in #28973
- @ZhanqiuHu made their first contribution in #33016
Full Changelog: v0.14.1...v0.15.0