vLLM v0.20.1

This is a patch release on top of v0.20.0, focused primarily on DeepSeek V4 stabilization and performance improvements, along with several important bug fixes.

DeepSeek V4

  • Base model support (#41006).
  • Multi-stream pre-attention GEMM (#41061), a configurable pre-attention GEMM knob (#41443), and a tuned default for VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD (#41526); see the sketch after this list.
  • BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication (#40960).
  • PTX cvt instruction for faster FP32->FP4 conversion (#41015).
  • Integrated tile kernels (head_compute_mix_kernel) for optimized head computation (#41255).
  • Guarded the megamoe flag with pure TP (#41522).
  • Fixed a persistent topk cooperative deadlock at TopK=1024 (#41189) and an inter-CTA init race on RadixRowState (#41444), with persistent topk temporarily disabled as a workaround (#41442).
  • Fixed an import error caused by AOT compile cache loading (#41090).
  • Fixed a torch inductor error (#41135).
  • Fixed repeated RoPE cache initialization (#41148).
  • Fixed a missing type conversion for non-streaming tool calls in DSV3.2/V4 (#41198).
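
The pre-attention GEMM knob above is exposed through the VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD environment variable. Below is a minimal sketch of overriding it, assuming the variable is read at engine initialization and gates the multi-stream path by token count; the value, model name, and prompt are illustrative only, not the tuned default from #41526.

```python
import os

# Assumption: the threshold is a token count that gates when the multi-stream
# pre-attention GEMM path is taken; 256 is illustrative, not the tuned default.
os.environ["VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD"] = "256"

# Import vLLM only after setting the variable so the engine picks it up.
from vllm import LLM, SamplingParams

# Hypothetical model name, for illustration only.
llm = LLM(model="deepseek-ai/DeepSeek-V4")
outputs = llm.generate(
    ["Summarize multi-stream GEMM in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```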

Bug Fixes

  • Fixed max_num_batched_tokens not being captured in CUDA graphs (#40734).
  • Fixed num_gpu_blocks_override not being accounted for in max_model_len checks (#41069); both parameters appear in the configuration sketch after this list.
  • Auto-disable expandable_segments around the cumem memory pool (#40812).
  • Fixed BailingMoE linear layer (#40859) and MLA RoPE rotation for BailingMoE V2.5 (#41185).
  • Fixed reasoning parser kwargs not being passed to structured output (#41199).
  • [ROCm] Fixed input_ids and expert_map args for Quark W4A8 GPT-OSS (#41165).
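
For reference, a minimal sketch of the engine arguments touched by the scheduler and KV-cache fixes above, assuming they are passed through vllm.LLM; the model name and values are illustrative assumptions, not recommendations.

```python
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V4",  # hypothetical model, for illustration only
    max_model_len=8192,               # context length checked against the KV-cache budget
    max_num_batched_tokens=8192,      # per-step token cap, now captured in CUDA graphs (#40734)
    num_gpu_blocks_override=4096,     # manual KV-cache block count, now included in max_model_len checks (#41069)
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```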

List of contributors

@BugenZhao, @chaunceyjiang, @gau-nernst, @ghphotoframe, @Isotr0py, @jeejeelee, @khluu, @njhill, @Rohan138, @wzhao18, @youkaichao, @ywang96, @ZJY0516, @zixi-qi, @zyongye
