github vllm-project/vllm v0.11.0rc5
v0.11.0rc5 pre release

latest release: v0.11.0rc6
pre-releaseone day ago

Highlights

This release features 538 commits, 207 contributors (65 new contributors)!

  • This release completes the removal of V0 engine. V0 engine code including AsyncLLMEngine, LLMEngine, MQLLMEngine, all attention backends, and related components have been removed. V1 is the only engine in the codebase now.
  • This releases turns on FULL_AND_PIECEWISE as the CUDA graph mode default. This should provide better out of the box performance for most models, particularly fine-grained MoEs, while preserving compatibility with existing models supporting only PIECEWISE mode.

Model Support

  • New architectures: DeepSeek-V3.2-Exp (#25896), Qwen3-VL series (#24727), Qwen3-Next (#24526), OLMo3 (#24534), LongCat-Flash (#23991), Dots OCR (#24645), Ling2.0 (#24627), CWM (#25611).
  • Encoders: RADIO encoder support (#24595), Transformers backend support for encoder-only models (#25174).
  • Task expansion: BERT token classification/NER (#24872), multimodal models for pooling tasks (#24451).
  • Data parallel for vision encoders: InternVL (#23909), Qwen2-VL (#25445), Qwen3-VL (#24955).
  • Speculative decoding: EAGLE3 for MiniCPM3 (#24243) and GPT-OSS (#25246).
  • Features: Qwen3-VL text-only mode (#26000), EVS video token pruning (#22980), Mamba2 TP+quantization (#24593), MRoPE + YaRN (#25384), Whisper on XPU (#25123), LongCat-Flash-Chat tool calling (#24083).
  • Performance: GLM-4.1V 916ms TTFT reduction via fused RMSNorm (#24733), GLM-4 MoE SharedFusedMoE optimization (#24849), Qwen2.5-VL CUDA sync removal (#24741), Qwen3-VL Triton MRoPE kernel (#25055), FP8 checkpoints for Qwen3-Next (#25079).
  • Reasoning: SeedOSS reason parser (#24263).

Engine Core

  • KV cache offloading: CPU offloading with LRU management (#19848, #20075, #21448, #22595, #24251).
  • V1 features: Prompt embeddings (#24278), sharded state loading (#25308), FlexAttention sliding window (#24089), LLM.apply_model (#18465).
  • Hybrid allocator: Pipeline parallel (#23974), varying hidden sizes (#25101).
  • Async scheduling: Uniprocessor executor support (#24219).
  • Architecture: Tokenizer group removal (#24078), shared memory multimodal caching (#20452).
  • Attention: Hybrid SSM/Attention in Triton (#21197), FlashAttention 3 for ViT (#24347).
  • Performance: FlashInfer RoPE 2x speedup (#21126), fused Q/K RoPE 11% improvement (#24511, #25005), 8x spec decode overhead reduction (#24986), FlashInfer spec decode with 1.14x speedup (#25196), model info caching (#23558), inputs_embeds copy avoidance (#25739).
  • LoRA: Optimized weight loading (#25403).
  • Defaults: CUDA graph mode FULL_AND_PIECEWISE (#25444), Inductor standalone compile disabled (#25391).
  • torch.compile: CUDA graph Inductor partition integration (#24281).

Hardware & Performance

  • NVIDIA: FP8 FlashInfer MLA decode (#24705), BF16 fused MoE for Hopper/Blackwell expert parallel (#25503).
  • DeepGEMM: Enabled by default (#24462), 5.5% throughput improvement (#24783).
  • New architectures: RISC-V 64-bit (#22112), ARM non-x86 CPU (#25166), ARM 4-bit fused MoE (#23809).
  • AMD: ROCm 7.0 (#25178), GLM-4.5 MI300X tuning (#25703).
  • Intel XPU: MoE DP accuracy fix (#25465).

Large Scale Serving & Performance

  • Dual-Batch Overlap (DBO): Overlapping computation mechanism (#23693), DeepEP high throughput + prefill (#24845).
  • Data Parallelism: torchrun launcher (#24899), Ray placement groups (#25026), Triton DP/EP kernels (#24588).
  • EPLB: Hunyuan V1 (#23078), Mixtral (#22842), static placement (#23745), reduced overhead (#24573).
  • Disaggregated serving: KV transfer metrics (#22188), NIXL MLA latent dimension (#25902).
  • MoE: Shared expert overlap optimization (#24254), SiLU kernel for DeepSeek-R1 (#24054), Enable Allgather/ReduceScatter backend for NaiveAllToAll (#23964).
  • Distributed: NCCL symmetric memory with 3-4% throughput improvement (#24532), enabled by default for TP (#25070).

Quantization

  • FP8: Per-token-group quantization (#24342), hardware-accelerated instructions (#24757), torch.compile KV cache (#22758), paged attention update (#22222).
  • FP4: NVFP4 for dense models (#25609), Gemma3 (#22771), Llama 3.1 405B (#25135).
  • W4A8: Faster preprocessing (#23972).
  • Compressed tensors: Blocked FP8 for MoE (#25219).

API & Frontend

  • OpenAI: Prompt logprobs for all tokens (#24956), logprobs=-1 for full vocab (#25031), reasoning streaming events (#24938), Responses API MCP tools (#24628, #24985), health 503 on dead engine (#24897).
  • Multimodal: Media UUID caching (#23950), image path format (#25081).
  • Tool calling: XML parser for Qwen3-Coder (#25028), Hermes-style tokens (#25281).
  • CLI: --enable-logging (#25610), improved --help (#24903).
  • Config: Speculative model engine args (#25250), env validation (#24761), NVTX profiling (#25501), guided decoding backward compatibility (#25615, #25422).
  • Metrics: V1 TPOT histogram (#24015), hidden deprecated gpu_ metrics (#24245), KV cache GiB units (#25204, #25479).
  • UX: Removed misleading quantization warning (#25012).

Security

Dependencies

  • PyTorch 2.8 for CPU (#25652), FlashInfer 0.3.1 (#24470), CUDA 13 (#24599), ROCm 7.0 (#25178).
  • Build requirements: C++17 now enforced globally (#24823).
  • TPU: Deprecated xm.mark_step in favor of torch_xla.sync (#25254).

V0 Deprecation

What's Changed

New Contributors

Full Changelog: v0.10.2...v0.11.0rc5

Don't miss a new vllm release

NewReleases is sending notifications on new releases.