v0.15.0

Highlights

This release features 335 commits from 158 contributors (39 new)!

Model Support

  • New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456).
  • LoRA expansion: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763).
  • Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), draft model support (#24322); a minimal offline example follows this list.
  • Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings (#14526).
  • Model enhancements: Voxtral streaming architecture (#32861), SharedFusedMoE for Qwen3MoE (#32082), dynamic resolution for Nemotron Nano VL (#32121), Molmo2 vision backbone quantization (#32385).
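
Below is a minimal offline sketch of the draft-model speculative decoding path, assuming the documented `speculative_config` dict form; both model ids are placeholders rather than pairings validated for this release.

```python
# Draft-model speculative decoding with the offline LLM entrypoint.
# The target/draft model ids are placeholders; any pair sharing a tokenizer works.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",         # target model (placeholder)
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # smaller draft model (placeholder)
        "num_speculative_tokens": 4,                  # tokens proposed per verification step
    },
)

outputs = llm.generate(
    ["Speculative decoding lets a small draft model propose tokens that"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```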

Engine Core

  • Async scheduling + Pipeline Parallelism: --async-scheduling now works with pipeline parallelism (#32359).
  • Mamba prefix caching: Block-aligned prefix caching for Mamba/hybrid models via --enable-prefix-caching --mamba-cache-mode align, achieving ~2x speedup by caching Mamba states directly (#30877); a usage sketch follows this list.
  • Session-based streaming input: New incremental input support for interactive workloads like ASR. Accepts async generators producing StreamingInput objects while maintaining KV cache alignment (#28973).
  • Model Runner V2: VLM support (#32546), architecture improvements.
  • LoRA: Inplace loading for memory efficiency (#31326).
  • AOT compilation: torch.compile inductor artifacts support (#25205).
  • Performance: Avoid redundant loads during KV cache offloading (#29087); separate attention computation from the KV cache update in the FlashAttention backend (#25954).
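
A minimal sketch of the Mamba prefix-caching setup, assuming a hybrid Mamba model id (placeholder); only the generic prefix-caching knob is shown from Python, with the CLI pairing quoted above reproduced in a comment.

```python
# Prefix caching for a Mamba/hybrid model. On the server CLI the notes pair
#   --enable-prefix-caching --mamba-cache-mode align
# (e.g. `vllm serve <hybrid-model> --enable-prefix-caching --mamba-cache-mode align`).
# The offline sketch below only sets the generic prefix-caching flag; check
# EngineArgs for the Python-level name of the Mamba cache mode option.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-granite/granite-4.0-tiny-preview",  # example hybrid Mamba+attention model (placeholder)
    enable_prefix_caching=True,
)

shared_prefix = "You are a concise assistant.\n\n"
prompts = [shared_prefix + q for q in ("What is vLLM?", "What is prefix caching?")]
# The second request reuses the cached state computed for the shared prefix.
print([o.outputs[0].text for o in llm.generate(prompts, SamplingParams(max_tokens=32))])
```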

Hardware & Performance

NVIDIA

  • Blackwell defaults: FlashInfer MLA is now the default MLA backend on Blackwell, with TRTLLM as the default prefill backend (#32615); a backend-override sketch follows this list.
  • MoE performance: 1.2-2% E2E throughput improvement via grouped topk kernel fusion (#32058), NVFP4 small-batch decoding improvement (#30885), faster cold start for MoEs with torch.compile (#32805).
  • FP4 kernel optimization: Up to 65% faster FP4 quantization on Blackwell (SM100F) using 256-bit loads, ~4% E2E throughput improvement (#32520).
  • Kernel improvements: topk_sigmoid kernel for MoE routing (#31246), atomics reduce counting for SplitK skinny GEMMs (#29843), fused cat+quant for FP8 KV cache in MLA (#32950).
  • torch.compile: SiluAndMul and QuantFP8 CustomOp compilation (#32806), Triton prefill attention performance (#32403).
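
The new defaults apply automatically on Blackwell; to pin a backend explicitly, vLLM honors the VLLM_ATTENTION_BACKEND environment variable. The backend identifier and model id below are assumptions to verify against your installed build.

```python
# Pin the MLA attention backend via environment variable instead of relying on
# the platform default. "FLASHINFER_MLA" is an assumed identifier; check the
# backend names registered in your vLLM build.
import os

from vllm import LLM

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER_MLA"  # assumed name; set before engine construction

# Placeholder MLA-style model; any DeepSeek-family model exercises the MLA path.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat", trust_remote_code=True)
print(llm.generate(["Hello"])[0].outputs[0].text)
```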

AMD ROCm

  • MoRI EP: High-performance all2all backend for Expert Parallel (#28664).
  • Attention improvements: Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#29887).
  • FP4 support: MLA projection GEMMs with dynamic quantization (#32238).
  • Consumer GPU support: Flash Attention Triton backend on RDNA3/RDNA4 (#32944).

Other Platforms

  • TPU: Pipeline parallelism support (#28506), backend option (#32438).
  • Intel XPU: AgRsAll2AllManager for distributed communication (#32654).
  • CPU: NUMA-aware acceleration for TP/DP inference on ARM (#32792), PyTorch 2.10 (#32869).
  • Whisper: torch.compile support (#30385).
  • WSL: Platform compatibility fix for Windows Subsystem for Linux (#32749).

Quantization

  • MXFP4: W4A16 support for compressed-tensors MoE models (#32285).
  • Non-gated MoE: Quantization support with Marlin, NVFP4 CUTLASS, FP8, INT8, and compressed-tensors (#32257).
  • Intel: Quantization Toolkit integration (#31716).
  • FP8 KV cache: Per-tensor and per-attention-head quantization via llmcompressor (#30141); a usage sketch follows this list.
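
A minimal sketch of the FP8 KV cache path, assuming the existing kv_cache_dtype engine argument; the model id is a placeholder, and the new per-tensor/per-attention-head scales come into play when loading a checkpoint produced by llmcompressor.

```python
# Store K/V in FP8 to shrink the cache. Checkpoints exported by llmcompressor
# can carry per-tensor or per-attention-head KV scales, which the engine loads
# when present. Model id is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # or an llmcompressor-quantized checkpoint
    kv_cache_dtype="fp8",
)
out = llm.generate(["The KV cache stores"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```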

API & Frontend

  • Responses API: Partial message generation (#32100), include_stop_str_in_output tuning (#32383), prompt_cache_key support (#32824).
  • OpenAI API: skip_special_tokens configuration (#32345); a client sketch follows this list.
  • Score endpoint: Flexible input formats with data_1/data_2 and queries/documents (#32577).
  • Render endpoints: New endpoints for prompt preprocessing (#32473).
  • Whisper API: avg_logprob and compression_ratio in verbose_json segments (#31059).
  • Security: FIPS 140-3 compliant hash option for enterprise/government users (#32386), --ssl-ciphers CLI argument (#30937).
  • UX improvements: Auto api_server_count based on dp_size (#32525), wheel variant auto-detection during install (#32948), custom profiler URI schemes (#32393).
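
A client-side sketch of the skip_special_tokens option, assuming it is passed through the OpenAI client's extra_body like other vLLM-specific sampling parameters; the server URL and model name are placeholders.

```python
# Chat completion against a local vLLM OpenAI-compatible server, forwarding the
# vLLM-specific skip_special_tokens knob via extra_body.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder server

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Reply in one sentence."}],
    extra_body={"skip_special_tokens": False},  # keep special tokens in the returned text
)
print(resp.choices[0].message.content)
```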

Dependencies

  • FlashInfer v0.6.1 (#30993)
  • Transformers 4.57.5 (#32287)
  • PyTorch 2.10 for CPU backend (#32869)
  • DeepGEMM bumped to a newer version (#32479)

Breaking Changes & Deprecations

  • Metrics: Removed the deprecated vllm:time_per_output_token_seconds metric; use vllm:inter_token_latency_seconds instead (#32661). A quick scrape check follows this list.
  • Environment variables: Removed previously deprecated environment variables (#32812).
  • Quantization: DeepSpeedFp8 removed (#32679), RTN removed (#32697), HQQ deprecated (#32681).
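
A quick scrape check for the metric rename, assuming the API server's standard /metrics Prometheus endpoint; the URL is a placeholder and the metric names are the ones quoted above.

```python
# Verify dashboards/alerts against the renamed latency metric.
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=5).text  # placeholder URL

assert "vllm:time_per_output_token_seconds" not in metrics, "old metric name still exported"
print("new metric exported:", "vllm:inter_token_latency_seconds" in metrics)
```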

Bug Fixes

  • Speculative decoding: Eagle draft_model_config fix (#31753).
  • DeepSeek: DeepSeek-V3.1 + DeepGEMM incompatible scale shapes fix (#32361).
  • Distributed: DP+MoE inference fix via CpuCommunicator (#31867), P/D with non-MoE DP fix (#33037).
  • EPLB: Possible deadlock fix (#32418).
  • NIXL: UCX memory leak fix by exporting UCX_MEM_MMAP_HOOK_MODE=none (#32181).
  • Structured output: Outlines byte fallback handling fix (#31391).

New Contributors 🎉

Full Changelog: v0.14.1...v0.15.0
