vLLM v0.18.0
Known issues
- Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618)
- If you previously ran into `CUBLAS_STATUS_INVALID_VALUE` and had to use a workaround in v0.17.0, you can reinstall `torch 2.10.0`. PyTorch published an updated wheel that addresses this bug.
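If you applied that workaround, a forced reinstall of the patched wheel is enough; a minimal sketch (the exact index URL for your CUDA build may differ from pip's default):

```shell
# Reinstall the updated torch 2.10.0 wheel published by PyTorch
pip install --force-reinstall torch==2.10.0
```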
Highlights
This release features 445 commits from 213 contributors (61 new)!
- gRPC Serving Support: vLLM now supports gRPC serving via the new `--grpc` flag (#36169), enabling high-performance RPC-based serving alongside the existing HTTP/REST interface.
- GPU-less Render Serving: The new `vllm launch render` command (#36166, #34551) enables GPU-less preprocessing and rendering, allowing separation of multimodal preprocessing from GPU inference.
- NGram GPU Speculative Decoding: NGram speculative decoding now runs on GPU and is compatible with the async scheduler (#29184), significantly reducing spec decode overhead.
- KV Cache Offloading Improvements: Smart CPU offloading that stores only frequently-reused blocks (#35342), plus FlexKV as a new offloading backend (#34328) and support for multiple KV groups in offloading spec (#36610).
- Elastic Expert Parallelism Milestone 2: NIXL-EP integration (#35627) enables dynamic GPU scaling for MoE experts, with a new `--enable-ep-weight-filter` CLI option (#37351) for faster EP model loading.
- FlashInfer 0.6.6: Updated FlashInfer dependency (#36768) with numerous performance and correctness improvements.
- Responses API Streaming Tool Calls: The OpenAI Responses API now supports tool/function calling with streaming (#29947).
- Online Beam Search for ASR: Beam search support for encoder/decoder models, for both offline (#36153) and online (#36160) transcriptions.
- Ray No Longer a Default Dependency: Ray has been removed as a default dependency (#36170) — install it explicitly if needed.
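The two new serving entry points above can be sketched as follows; the model name and port are placeholders, and flag spellings should be checked against `vllm serve --help` for your installed version:

```shell
# Serve over gRPC instead of (or alongside) the HTTP/REST interface
vllm serve Qwen/Qwen2.5-7B-Instruct --grpc

# Run GPU-less multimodal preprocessing/rendering separately from inference
vllm launch render --port 8001
```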
Model Support
- New architectures: Sarvam MoE (#33942), OLMo Hybrid (#32550), HyperCLOVAX-SEED-Think-32B VLM (#31471), HyperCLOVAX-SEED-Think-14B (#37107), Kimi-Audio-7B-Instruct (#36127), ColPali late-interaction retrieval (#36818), ERNIE pooling models (#36385).
- Speculative decoding: Eagle3 for Qwen3.5 (#36658), Eagle3 for Kimi K2.5 MLA (#36361), Eagle for Mistral Large 3 with dense layers (#36163).
- LoRA: Whisper LoRA (#29856), FP8 LoRA dense kernel (#35242).
- Multimodal: Online use_audio_in_video (#36319), audio extraction from MP4 for Nemotron Nano VL (#35539), audio transcription for MP4/M4A/WebM (#35109), expose media_io_kwargs at runtime (#34778), fast media preprocessing for Nano Nemotron VL (#35657).
- Compatibility: Gemma/Gemma2 inputs_embeds (#36787), SigLIP/CLIP Transformers v5 (#37200), fused expert weights in Transformers backend (#36997).
- Performance: Qwen3 Next fused GDN kernel (#35777), LFM2 tuned H100 MoE configs (#36699).
- Fixes: DeepSeek-V3.2 tokenizer space stripping (#37004), Qwen3.5 tool calling (#36774), Qwen3-VL timestamp mismatch (#36136), Qwen3-Next TP>1 weight sharding (#36242), Qwen3-ASR torch.compile (#35869), MiniCPM-V audio inference (#36751), MiniCPM-O 4.5 ViT attention (#34127), routed experts for hybrid models (#35744), Qwen2.5-Omni/Qwen3-Omni multi-video audio_in_video (#37147), DeepSeek-OCR empty images crash (#36670).
Engine Core
- Model Runner V2: Probabilistic rejection sampling for spec decode (#35461), pooling models (#36019), extensible CUDA graph dispatch (#35959), WhisperModelState (#35790), XD-RoPE (#36817), model_state CUDA graph capture (#36544).
- KV cache offloading: Reuse-frequency-gated CPU stores (#35342), FlexKV offloading backend (#34328), multiple KV groups (#36610), async scheduling fix (#33881).
- Speculative decoding: NGram GPU implementation with async scheduler (#29184), fused EAGLE step slot mapping (#33503).
- Performance: Remove busy loop from idle buffer readers (#28053), 2.7% E2E throughput for pooling via worker-side maxsim (#36159), 3.2% via batched maxsim (#36710), CUDA graph memory accounting during profiling (#30515), checkpoint prefetch to OS page cache (#36012), InstantTensor weight loader (#36139), sporadic stall fix via pin_memory removal (#37006).
- Stability: VLM concurrent throughput degradation fix (#36557), DP deadlock fix (#35194), DeepSeek V3.2 OOM during CG profiling (#36691), Ray DP startup crash (#36665), NCCL rank calculation fix (#36940), zero-init MLA output buffers for NaN prevention (#37442), CUDA OOM fix (#35594).
- Defaults: Cascade attention disabled by default (#36318).
- Extensibility: OOT linear method registration (#35981), custom collective ops registration for non-CUDA platforms (#34760).
Kernel
- FA4 for MLA prefill (#34732).
- FlashInfer Sparse MLA: FP8 KV cache support (#35891), CUDA graphs on ROCm (#35719), MTP lens > 1 on ROCm (#36681).
- TRTLLM FP8 MoE modular kernel (#36307).
- FP8 KV cache for Triton MLA decode (#34597).
- FlashInfer MoE A2A kernel (#36022).
- Remove chunking from FusedMoE for full batch processing (#34086).
- CustomOp FusedRMSNormGated for torch.compile compatibility (#35877).
- Mamba2 SSD prefill Triton kernel optimization (#35397).
- DeepSeek-V3.2: Vectorized MLA query concat kernel (#34917), optimized FP8 KV cache gather for context parallel (#35290).
- 320-dimension MLA head size support (#36161).
- Packed recurrent fast path for decode (#36596).
- EP scatter race condition fix (#34991).
Hardware & Performance
- NVIDIA: FA4 for MLA prefill (#34732), DeepSeek-V3.2 MLA kernel optimizations (#34917, #35290).
- AMD ROCm: Sparse MLA CUDA graphs (#35719), MTP lens > 1 in Sparse MLA (#36681), MLA with nhead<16 + FP8 KV for TP=8 (#35850), RoPE+KV cache fusion for AITER FA (#35786), AITER MLA CPU sync avoidance (#35765), Quark W4A8 MXFP4/FP8 (#35316), gfx1152/gfx1153 Krackan support (#36499), fused_topk_bias AITER optimization (#36253), skinny GEMM improvements (#34304), DeepEP in ROCm Dockerfile (#36086), startup OOM fix (#36720).
- Intel XPU: Model Runner V2 enabled (#36078), MLA Sparse backend for DeepSeek V3.2 (#33230), LoRA via torch.compile (#36962), block FP8 MoE fallback (#36458), deepseek_scaling_rope fused kernel (#36612).
- CPU: aarch64 int8 matmul via OneDNN upgrade (#36147), AMD Zen CPU backend via zentorch (#35970).
- RISC-V: CPU backend support (#36578).
- Performance: 5% E2E improvement for PD disaggregation scheduling (#35781), packed recurrent decode fast path (#36596), pooling model maxsim 2.7%+3.2% throughput (#36159, #36710).
- torch.compile: FakeTensors instead of real GPU tensors for single-size compilation (#36093), non-contiguous fused RMSNorm + group quant (#36551), stop lazy compiling (#35472).
Large Scale Serving
- Elastic EP Milestone 2: NIXL-EP integration (#35627), `--enable-ep-weight-filter` for faster EP loading (#37351).
- PD Disaggregation: ~5% scheduler overhead reduction (#35781), KV transfer fix with spec decode (#35158), P/D for hybrid SSM-FA models via NIXL (#36687), PP for multimodal models on Transformers backend (#37057).
- KV Connectors: HMA + NIXL connector (#35758), FlexKV offloading (#34328), worker→scheduler metadata (#31964), All-to-All DCP backend (#34883).
- LMCache: Fault tolerance mechanism (#36586), memory leak fix (#35931), race condition fix (#35831), TP size for MLA multi-reader locking (#36129).
- EP loading: Skip non-local expert weights (#37136).
Quantization
- ModelOpt MXFP8 MoE support (#35986).
- MXFP4 MoE routing simulation override for accuracy (#33595).
- FP8 LoRA dense kernel (#35242).
- ROCm: Quark W4A8 MXFP4/FP8 for LinearLayer (#35316), compressed-tensors fix for DeepSeek-R1 on MI300x (#36247).
- Fixes: MLA crash with AWQ/GPTQ quantized models (#34695), score layer quantization for reranker models (#35849), GLM-4.1V non-default quantization (#36321), FP8 k_scale/v_scale loading for Qwen3-MoE (#35656).
API & Frontend
- gRPC: New `--grpc` flag for gRPC serving (#36169).
- GPU-less serving: `vllm launch render` for preprocessing-only serving (#36166), `vllm launch` for GPU-less preprocessing (#34551).
- Responses API: Streaming tool/function calling (#29947), reasoning item fixes (#34499, #36516).
- Anthropic API: Accept redacted thinking blocks (#36992).
- ASR: Online beam search transcriptions (#36160), offline beam search (#36153), audio transcription for MP4/M4A/WebM (#35109), realtime endpoint metrics (#35500).
- Tool calling: Granite4 tool parser (#36827), Qwen3Coder anyOf double encoding fix (#36032).
- New options: `--distributed-timeout-seconds` (#36047), `--attention-backend auto` (#35738), `reasoning_effort=none` (#36238), PyTorch profiler schedule (#35240).
- Cohere Embed v2 API support (#37074).
- Azure Blob Storage support for RunAI Model Streamer (#34614).
- Graceful shutdown timeout for in-flight requests (#36666).
- Fixes: tool_choice=required exceeding max_tokens crash (#36841), negative max_tokens with long prompts (#36789), concurrent classify/token_classify race (#36614), Anthropic billing header prefix cache miss (#36829), render endpoint crash for multimodal requests (#35684), xgrammar dtype mismatch on macOS CPU (#32384), minimax_m2 tool parser with stream interval > 1 (#35895).
Security
- Respect user `trust_remote_code` setting in NemotronVL and KimiK25 (#36192).
- Upgrade xgrammar for security fix (#36168).
- Guard RLHF weight sync deserialization behind insecure serialization flag (#35928).
Dependencies
- FlashInfer 0.6.6 (#36768).
- Ray removed from default dependencies (#36170).
- `kaldi_native_fbank` made optional (#35996).
- OpenAI dependency bounded to 2.24.0 (#36471).
- Deprecated items from v0.18 removed (#36470, #36006).
- Mistral common v10 (#36971).
Breaking Changes
- Ray no longer a default dependency — install explicitly if needed (#36170).
- Deprecated items removed — items deprecated in v0.18 have been removed (#36470, #36006).
- Cascade attention disabled by default (#36318).
- swap_space parameter removed (V0 deprecation, #36216).
- Monolithic TRTLLM MoE disabled for renormalize routing — late fix cherry-picked (#37591).
New Contributors 🎉
- @11happy made their first contribution in #35481
- @12010486 made their first contribution in #36782
- @abhishkh made their first contribution in #32454
- @AjAnubolu made their first contribution in #35976
- @alvinttang made their first contribution in #36397
- @amd-asalykov made their first contribution in #35093
- @amd-lalithnc made their first contribution in #35970
- @arlo-scitix made their first contribution in #36139
- @benenzhu made their first contribution in #36253
- @ChuanLi1101 made their first contribution in #35893
- @cluster2600 made their first contribution in #34882
- @cong-or made their first contribution in #36164
- @daje0601 made their first contribution in #29856
- @davzaman made their first contribution in #32441
- @eellison made their first contribution in #35877
- @fangyuchu made their first contribution in #35194
- @feiqiangs made their first contribution in #34328
- @fenypatel99 made their first contribution in #35240
- @gambletan made their first contribution in #36402
- @giulio-leone made their first contribution in #36937
- @gkswns0531 made their first contribution in #35849
- @grimulkan made their first contribution in #34597
- @hai-meh-cs made their first contribution in #36684
- @hasethuraman made their first contribution in #34614
- @Hongbin10 made their first contribution in #36713
- @jeonsworld made their first contribution in #34499
- @jjmiao1 made their first contribution in #35994
- @Kaonael made their first contribution in #36818
- @ketyi made their first contribution in #36670
- @KevinZonda made their first contribution in #36209
- @leo-cf-tian made their first contribution in #36022
- @lisperz made their first contribution in #34531
- @mitre88 made their first contribution in #35933
- @nkm-meta made their first contribution in #34760
- @nvnbagrov made their first contribution in #35657
- @rahul-sarvam made their first contribution in #33942
- @royyhuang made their first contribution in #35931
- @sbeurnier made their first contribution in #37006
- @seanmamasde made their first contribution in #35109
- @sergey-zinchenko made their first contribution in #35684
- @shaunkotek made their first contribution in #36149
- @shubhra made their first contribution in #36545
- @simone-dotolo made their first contribution in #36000
- @sladyn98 made their first contribution in #33503
- @slin1237 made their first contribution in #36938
- @SoluMilken made their first contribution in #36511
- @Srinivasoo7 made their first contribution in #35342
- @stecasta made their first contribution in #35871
- @sungsooha made their first contribution in #34883
- @SunMarc made their first contribution in #36896
- @TQCB made their first contribution in #36165
- @tunglinwood made their first contribution in #36127
- @tusharshetty61 made their first contribution in #36243
- @typer-J made their first contribution in #36578
- @weiguangli-io made their first contribution in #35815
- @wuxun-zhang made their first contribution in #33230
- @XingLiu1 made their first contribution in #35197
- @yanhong-lbh made their first contribution in #32550
- @yitingw1 made their first contribution in #36612
- @yuanheng-zhao made their first contribution in #36106
- @zihaoanllm made their first contribution in #35973