vLLM v0.18.0

Known issues

  • Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618)
  • If you previously hit CUBLAS_STATUS_INVALID_VALUE and had to apply a workaround in v0.17.0, you can now reinstall torch 2.10.0: PyTorch has published an updated wheel that fixes this bug.

Highlights

This release features 445 commits from 213 contributors (61 new)!

  • gRPC Serving Support: vLLM now supports gRPC serving via the new --grpc flag (#36169), enabling high-performance RPC-based serving alongside the existing HTTP/REST interface.
  • GPU-less Render Serving: New vllm launch render command (#36166, #34551) enables GPU-less preprocessing and rendering, allowing separation of multimodal preprocessing from GPU inference.
  • NGram GPU Speculative Decoding: NGram speculative decoding now runs on GPU and is compatible with the async scheduler (#29184), significantly reducing spec decode overhead.
  • KV Cache Offloading Improvements: Smart CPU offloading that stores only frequently-reused blocks (#35342), plus FlexKV as a new offloading backend (#34328) and support for multiple KV groups in offloading spec (#36610).
  • Elastic Expert Parallelism Milestone 2: NIXL-EP integration (#35627) enables dynamic GPU scaling for MoE experts, with new --enable-ep-weight-filter CLI option (#37351) for faster EP model loading.
  • FlashInfer 0.6.6: Updated FlashInfer dependency (#36768) with numerous performance and correctness improvements.
  • Responses API Streaming Tool Calls: The OpenAI Responses API now supports tool/function calling with streaming (#29947).
  • Online Beam Search for ASR: Beam search is now supported for encoder/decoder ASR models in both offline (#36153) and online (#36160) transcription.
  • Ray No Longer a Default Dependency: Ray has been removed as a default dependency (#36170) — install it explicitly if needed.
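
The new serving entry points above can be sketched as CLI invocations. The model name is a placeholder and the argument order for `vllm launch render` is illustrative; check `vllm --help` in this release for the exact spelling.

```shell
# HTTP/REST serving with the new gRPC interface enabled via the --grpc flag (#36169)
vllm serve <your-model> --grpc

# GPU-less rendering: run multimodal preprocessing on a machine without GPUs (#36166, #34551)
vllm launch render <your-model>
```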

Model Support

  • New architectures: Sarvam MoE (#33942), OLMo Hybrid (#32550), HyperCLOVAX-SEED-Think-32B VLM (#31471), HyperCLOVAX-SEED-Think-14B (#37107), Kimi-Audio-7B-Instruct (#36127), ColPali late-interaction retrieval (#36818), ERNIE pooling models (#36385).
  • Speculative decoding: Eagle3 for Qwen3.5 (#36658), Eagle3 for Kimi K2.5 MLA (#36361), Eagle for Mistral Large 3 with dense layers (#36163).
  • LoRA: Whisper LoRA (#29856), FP8 LoRA dense kernel (#35242).
  • Multimodal: Online use_audio_in_video (#36319), audio extraction from MP4 for Nemotron Nano VL (#35539), audio transcription for MP4/M4A/WebM (#35109), expose media_io_kwargs at runtime (#34778), fast media preprocessing for Nano Nemotron VL (#35657).
  • Compatibility: Gemma/Gemma2 inputs_embeds (#36787), SigLIP/CLIP Transformers v5 (#37200), fused expert weights in Transformers backend (#36997).
  • Performance: Qwen3 Next fused GDN kernel (#35777), LFM2 tuned H100 MoE configs (#36699).
  • Fixes: DeepSeek-V3.2 tokenizer space stripping (#37004), Qwen3.5 tool calling (#36774), Qwen3-VL timestamp mismatch (#36136), Qwen3-Next TP>1 weight sharding (#36242), Qwen3-ASR torch.compile (#35869), MiniCPM-V audio inference (#36751), MiniCPM-O 4.5 ViT attention (#34127), routed experts for hybrid models (#35744), Qwen2.5-Omni/Qwen3-Omni multi-video audio_in_video (#37147), DeepSeek-OCR empty images crash (#36670).

Engine Core

  • Model Runner V2: Probabilistic rejection sampling for spec decode (#35461), pooling models (#36019), extensible CUDA graph dispatch (#35959), WhisperModelState (#35790), XD-RoPE (#36817), model_state CUDA graph capture (#36544).
  • KV cache offloading: Reuse-frequency-gated CPU stores (#35342), FlexKV offloading backend (#34328), multiple KV groups (#36610), async scheduling fix (#33881).
  • Speculative decoding: NGram GPU implementation with async scheduler (#29184), fused EAGLE step slot mapping (#33503).
  • Performance: Remove busy loop from idle buffer readers (#28053), 2.7% E2E throughput for pooling via worker-side maxsim (#36159), 3.2% via batched maxsim (#36710), CUDA graph memory accounting during profiling (#30515), checkpoint prefetch to OS page cache (#36012), InstantTensor weight loader (#36139), sporadic stall fix via pin_memory removal (#37006).
  • Stability: VLM concurrent throughput degradation fix (#36557), DP deadlock fix (#35194), DeepSeek V3.2 OOM during CG profiling (#36691), Ray DP startup crash (#36665), NCCL rank calculation fix (#36940), zero-init MLA output buffers for NaN prevention (#37442), CUDA OOM fix (#35594).
  • Defaults: Cascade attention disabled by default (#36318).
  • Extensibility: OOT linear method registration (#35981), custom collective ops registration for non-CUDA platforms (#34760).
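
As a concrete illustration of the NGram speculative decoding item above, a minimal offline-API sketch might look as follows. The model name is a placeholder, and the exact `speculative_config` keys follow recent vLLM conventions but should be verified against this release's docs.

```python
# Minimal sketch: NGram speculative decoding with the offline LLM API.
# The ngram proposer now runs on the GPU and is compatible with the
# async scheduler (#29184). Key names below are illustrative.
speculative_config = {
    "method": "ngram",            # use n-gram prompt lookup as the draft model
    "num_speculative_tokens": 4,  # draft tokens proposed per decoding step
    "prompt_lookup_max": 4,       # longest n-gram matched against the prompt
}

# Launching the engine itself needs a GPU, so it is shown commented out:
# from vllm import LLM
# llm = LLM(model="<your-model>", speculative_config=speculative_config)
```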

Kernel

  • FA4 for MLA prefill (#34732).
  • FlashInfer Sparse MLA: FP8 KV cache support (#35891), CUDA graphs on ROCm (#35719), MTP lens > 1 on ROCm (#36681).
  • TRTLLM FP8 MoE modular kernel (#36307).
  • FP8 KV cache for Triton MLA decode (#34597).
  • FlashInfer MoE A2A kernel (#36022).
  • Remove chunking from FusedMoE for full batch processing (#34086).
  • CustomOp FusedRMSNormGated for torch.compile compatibility (#35877).
  • Mamba2 SSD prefill Triton kernel optimization (#35397).
  • DeepSeek-V3.2: Vectorized MLA query concat kernel (#34917), optimized FP8 KV cache gather for context parallel (#35290).
  • 320-dimension MLA head size support (#36161).
  • Packed recurrent fast path for decode (#36596).
  • EP scatter race condition fix (#34991).

Hardware & Performance

  • NVIDIA: FA4 for MLA prefill (#34732), DeepSeek-V3.2 MLA kernel optimizations (#34917, #35290).
  • AMD ROCm: Sparse MLA CUDA graphs (#35719), MTP lens > 1 in Sparse MLA (#36681), MLA with nhead<16 + FP8 KV for TP=8 (#35850), RoPE+KV cache fusion for AITER FA (#35786), AITER MLA CPU sync avoidance (#35765), Quark W4A8 MXFP4/FP8 (#35316), gfx1152/gfx1153 Krackan support (#36499), fused_topk_bias AITER optimization (#36253), skinny GEMM improvements (#34304), DeepEP in ROCm Dockerfile (#36086), startup OOM fix (#36720).
  • Intel XPU: Model Runner V2 enabled (#36078), MLA Sparse backend for DeepSeek V3.2 (#33230), LoRA via torch.compile (#36962), block FP8 MoE fallback (#36458), deepseek_scaling_rope fused kernel (#36612).
  • CPU: aarch64 int8 matmul via OneDNN upgrade (#36147), AMD Zen CPU backend via zentorch (#35970).
  • RISC-V: CPU backend support (#36578).
  • Performance: 5% E2E improvement for PD disaggregation scheduling (#35781), packed recurrent decode fast path (#36596), pooling model maxsim 2.7%+3.2% throughput (#36159, #36710).
  • torch.compile: FakeTensors instead of real GPU tensors for single-size compilation (#36093), non-contiguous fused RMSNorm + group quant (#36551), stop lazy compiling (#35472).

Large Scale Serving

  • Elastic EP Milestone 2: NIXL-EP integration (#35627), --enable-ep-weight-filter for faster EP loading (#37351).
  • PD Disaggregation: ~5% scheduler overhead reduction (#35781), KV transfer fix with spec decode (#35158), P/D for hybrid SSM-FA models via NIXL (#36687), PP for multimodal models on Transformers backend (#37057).
  • KV Connectors: HMA + NIXL connector (#35758), FlexKV offloading (#34328), worker→scheduler metadata (#31964), All-to-All DCP backend (#34883).
  • LMCache: Fault tolerance mechanism (#36586), memory leak fix (#35931), race condition fix (#35831), TP size for MLA multi-reader locking (#36129).
  • EP loading: Skip non-local expert weights (#37136).

Quantization

  • ModelOpt MXFP8 MoE support (#35986).
  • MXFP4 MoE routing simulation override for accuracy (#33595).
  • FP8 LoRA dense kernel (#35242).
  • ROCm: Quark W4A8 MXFP4/FP8 for LinearLayer (#35316), compressed-tensors fix for DeepSeek-R1 on MI300x (#36247).
  • Fixes: MLA crash with AWQ/GPTQ quantized models (#34695), score layer quantization for reranker models (#35849), GLM-4.1V non-default quantization (#36321), FP8 k_scale/v_scale loading for Qwen3-MoE (#35656).
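
Several of the items above interact with the FP8 KV cache path. Enabling it is a one-flag change (the model name is a placeholder; note the Qwen3.5-on-B200 accuracy caveat under Known issues):

```shell
# Serve with the KV cache stored in FP8, roughly halving KV memory vs fp16
vllm serve <your-model> --kv-cache-dtype fp8
```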

API & Frontend

  • gRPC: New --grpc flag for gRPC serving (#36169).
  • GPU-less serving: vllm launch render for preprocessing-only serving (#36166), vllm launch for GPU-less preprocessing (#34551).
  • Responses API: Streaming tool/function calling (#29947), reasoning item fixes (#34499, #36516).
  • Anthropic API: Accept redacted thinking blocks (#36992).
  • ASR: Online beam search transcriptions (#36160), offline beam search (#36153), audio transcription for MP4/M4A/WebM (#35109), realtime endpoint metrics (#35500).
  • Tool calling: Granite4 tool parser (#36827), Qwen3Coder anyOf double encoding fix (#36032).
  • New options: --distributed-timeout-seconds (#36047), --attention-backend auto (#35738), reasoning_effort=none (#36238), PyTorch profiler schedule (#35240).
  • Cohere Embed v2 API support (#37074).
  • Azure Blob Storage support for RunAI Model Streamer (#34614).
  • Graceful shutdown timeout for in-flight requests (#36666).
  • Fixes: tool_choice=required exceeding max_tokens crash (#36841), negative max_tokens with long prompts (#36789), concurrent classify/token_classify race (#36614), Anthropic billing header prefix cache miss (#36829), render endpoint crash for multimodal requests (#35684), xgrammar dtype mismatch on macOS CPU (#32384), minimax_m2 tool parser with stream interval > 1 (#35895).
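
A request-body sketch for the new streaming tool-call support in the Responses API (#29947). The weather tool and model name are made-up placeholders; field names follow the OpenAI Responses API schema.

```python
# Sketch of a Responses API request that streams tool calls (#29947).
# POST this JSON to the server's /v1/responses endpoint.
request_body = {
    "model": "<served-model-name>",
    "input": "What's the weather in Berlin?",
    "stream": True,  # tool-call deltas now arrive as stream events
    "tools": [
        {
            # The Responses API uses a flat tool schema (name at the top
            # level), unlike the nested Chat Completions format.
            "type": "function",
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
}
```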

Security

  • Respect user trust_remote_code setting in NemotronVL and KimiK25 (#36192).
  • Upgrade xgrammar for security fix (#36168).
  • Guard RLHF weight sync deserialization behind insecure serialization flag (#35928).

Dependencies

  • FlashInfer 0.6.6 (#36768).
  • Ray removed from default dependencies (#36170).
  • kaldi_native_fbank made optional (#35996).
  • OpenAI dependency bounded to 2.24.0 (#36471).
  • Deprecated items from v0.18 removed (#36470, #36006).
  • Mistral common v10 (#36971).

Breaking Changes

  1. Ray no longer a default dependency — install explicitly if needed (#36170).
  2. Deprecated items removed — items marked deprecated in v0.18 are now gone (#36470, #36006).
  3. Cascade attention disabled by default (#36318).
  4. swap_space parameter removed (V0 deprecation, #36216).
  5. Monolithic TRTLLM MoE disabled for renormalize routing — late fix cherry-picked (#37591).
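
Since Ray is no longer pulled in by default, multi-node or Ray-based deployments must install it explicitly. A plain pip install is the straightforward route (extras, if any apply to your setup, may vary):

```shell
pip install ray
```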

New Contributors 🎉
