Highlights
New Model Support: GLM-5.2, LiquidAI LFM2.5, Kimi-K2.7-Code, Poolside Laguna-M.1, DiffusionGemma, Zyphra ZAYA1, MiMo-V2-ASR
DeepSeek-V4 on GB300 since Day 0: Serving DeepSeek-V4 on NVIDIA GB300 with SGLang delivers 5x higher throughput at the same interactivity (blog).
Waterfill & LPLB MoE load balancing: Two dispatch-time load-balancing methods for DeepEP expert parallelism: Waterfill for shared-expert dispatch and LPLB for redundant expert replicas, improving throughput for DeepSeek-V3/R1 and DeepSeek-V4 (blog).
KDA CuteDSL prefill kernel on Blackwell (SM100): New CuteDSL prefill kernel for Kimi-Linear (KDA), 1.08-1.52x faster than the Triton path via a reusable scratch workspace, plus a cuda-graph padding fix (#27488); see the Kimi-Linear cookbook.
Linear-attention prefix-cache memory savings: An int8 checkpoint pool stores recurrent states compactly in the Mamba radix cache, substantially increasing prefix-cache capacity for KDA / GDN models (#28185); the speculative conv-window intermediate cache is deduplicated with a sliding-window layout, halving its footprint with no numerical change (#28302).
LPLB: linear-programming load balancer for MoE expert parallelism: Balances token routing across redundant expert replicas by solving a per-layer LP; opt-in via --ep-dispatch-algorithm=lp, default behavior unchanged (#24515).
MSCCL++ integration & MNNVL allreduce fusion: MSCCL++ migrates to the upstream mscclpp Python package (Executor + DSL compiler) with auto-tuned collectives for TP=8 single-node and TP=16 two-node (#22734); FlashInfer fused allreduce + residual + RMSNorm re-enables an MNNVL backend behind --flashinfer-allreduce-fusion-backend (auto / trtllm / mnnvl), fixing the piecewise-CUDA-graph interaction (#23402).
Nemotron DP attention + MTP: Data-parallel attention for the hybrid Nemotron-H (Mamba2 + full attention + MoE), plus MTP support (#24955); see the Nemotron 3 Ultra cookbook.
AMD: breakable CUDA graph on ROCm/HIP: The breakable CUDA graph execution path now runs on AMD GPUs (#28173).
NVFP4 MoE for DeepSeek-V4: Adds an NVFP4 MoE quantization path for DeepSeek-V4 on Blackwell for higher MoE throughput; enable with --moe-runner-backend flashinfer_trtllm_routed (#25820); see the DeepSeek-V4 cookbook.
DeepSeek-V4 decode & quantization optimizations: FP8 group quantization now emits power-of-two (UE8M0) scales directly from the per-token group-quant kernel, dropping a separate rounding pass (#26766); MLA decode q-heads are padded to 64 under attention-TP so FlashMLA dispatches the ~2x cheaper head64 kernel instead of head128 (#27954); the MHC prenorm kernel is prewarmed at startup to remove the first-run JIT slowdown on a fresh server (#27986); and BF16 mixed-dtype compression states are supported on the C4 / C128 paths (#27277); see the DeepSeek-V4 cookbook.
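Several of the highlights above are opt-in behind server flags. As a rough sketch, the snippet below assembles one launch command per feature using only the flag names and values quoted in these notes. The `<model-path>` placeholder, the `OPT_IN_FLAGS` table, and the `launch_command` helper are illustrative assumptions, and a real deployment will need additional arguments (parallelism sizes, the expert-parallel backend, and so on) that are not covered here. Whether these flags can be combined in a single launch is not stated in the notes, so each feature gets its own command.

```python
import shlex

# Base invocation of the SGLang server; "<model-path>" is a placeholder.
BASE = ["python", "-m", "sglang.launch_server", "--model-path", "<model-path>"]

# Flag names and values are taken from the highlights above.
# The dictionary keys are hypothetical labels used only in this sketch.
OPT_IN_FLAGS = {
    "lplb": ["--ep-dispatch-algorithm=lp"],
    "nvfp4_moe": ["--moe-runner-backend", "flashinfer_trtllm_routed"],
    "mnnvl_allreduce_fusion": ["--flashinfer-allreduce-fusion-backend", "mnnvl"],
}


def launch_command(feature: str) -> str:
    """Return a shell-quoted launch command with one opt-in feature enabled."""
    return shlex.join(BASE + OPT_IN_FLAGS[feature])


if __name__ == "__main__":
    for name in OPT_IN_FLAGS:
        print(f"{name}: {launch_command(name)}")
```

Consult the linked cookbooks and PRs for the full set of arguments each feature expects.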
Full release notes by category below.
New Model Support
- GLM-5.2: #28437 (cookbook)
- LiquidAI LFM2.5: #27409 (cookbook)
- Kimi-K2.7-Code: #28064 (cookbook)
- Poolside Laguna-M.1: #28661, #28400 (cookbook)
- DiffusionGemma: #27824 (cookbook)
- Zyphra ZAYA1: #26347 (cookbook wip)
- MiMo-V2-ASR: #26278 (cookbook wip)
DeepSeek V4
- [NVIDIA] Support NVFP4 MoE for DeepSeek-V4: #25820
- [DeepSeek-V4] Fuse UE8M0 scale rounding into FP8 group quantization: #26766
- [NPU] Add Ascend NPU support for DeepSeek-V4: #25144
- Deepseek v4: support mixed dtype compression states: #27277
- [AMD] Feat: Add prefill context parallel support for deepseek v4 unified kv attention: #27928
- DeepSeek-V4 Online Compress support MTP: #26471
- [dsv4] Pad MLA decode q-heads to 64 (not full n_heads) for FlashMLA head64 kernel: #27954
- [dsv4] Prewarm MHC prenorm kernel at startup: #27986
- [LoRA] Support DSA indexer LoRA targets for GLM-5.1 / DeepSeek-V3.2-family models: #28110
- Add DeepSeek V4 MTP acceptance length checks: #28098
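To illustrate what "power-of-two (UE8M0) scales" means for FP8 group quantization (#26766), here is a minimal pure-Python sketch. It is not the SGLang kernel: the group size of 128, the E4M3 maximum of 448, and the function names are assumptions for illustration, and the real kernel may clamp or handle edge cases differently. The idea is that each group's scale is rounded up to the nearest power of two, so it can be stored as an exponent only.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def ue8m0_scale(amax: float) -> float:
    """Smallest power of two s such that amax / s <= FP8_E4M3_MAX."""
    if amax == 0.0:
        return 1.0  # any scale works for an all-zero group
    raw = amax / FP8_E4M3_MAX
    # frexp gives raw = mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(raw)
    if mantissa == 0.5:
        exponent -= 1  # raw is already an exact power of two
    return math.ldexp(1.0, exponent)


def group_scales(values, group_size=128):
    """Compute one power-of-two scale per contiguous group of values."""
    scales = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        scales.append(ue8m0_scale(amax))
    return scales
```

Emitting the rounded scale directly from the group-quant step is what lets a separate rounding pass be dropped.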
Speculative Decoding
- [Spec] Add sync-free fast_prefill_plan for EAGLE draft-extend CUDA graph: #28854
- [Spec] Support FlashInfer CUDA graph for EAGLE draft-extend: #28782
- [mtp] add rejection sampling for speculative decoding: #26312
- [NPU] Add MTP support for GLM-4.7-Flash: #28516
- Dflash add sliding window attention draft layer support: #27469
- Support Nemotron DP attention and MTP: #24955
- [Feature] [Ngram spec] Support ngram spec v2: #17260
Piecewise & Breakable CUDA Graph
- [AMD] Make breakable CUDA graph run on ROCm/HIP: #28173
- Dflash piecewise cuda graphs support: #27468
Attention Backends
- [Cookbook] Nemotron3-Ultra: Add mamba-backend and SSM dtype flags: #28675
- [Mamba][GDN] Deduplicate spec conv-window intermediate cache via sliding window layout: #28302
- [GDN][KDA][mem_cache] int8 checkpoint pool for the linear-attn prefix cache: #28185
- [diffusion] feat: use LocalAttention for mistral3 encoder: #28176
- [NPU] Add Gemma4 Sliding Window Attention support on Ascend backend: #26147
- [AMD] Fuse sigmoid + mul attention output gate into single Triton kernel: #27630
- [AMD] Enable fused GDN QKV split Triton kernel on HIP: #27583
- [KDA] Add CuteDSL Prefill Kernel on SM100: #27488
- [AMD] Add unified kv attention support in dpsk-v4: #27380
MoE & Expert Parallelism
- Add GB10 FP8 fused MoE Triton config: #25665
- Support asymmetric compressed-tensors MoE: #27690
- LPLB: linear-programming load balancer for MoE expert parallelism: #24515
- [AMD] Fuse sigmoid + mul into single Triton kernel for shared expert gating: #27636
- [quantization] NVFP4 MoE: split fused w13 gate/up global scales: #27588
- [Apple Silicon] [MLX] Fuse SwiGLU activation into gate gather_qmv for SwitchGLU MoE blocks: #26188
- [DeepSeek V3] Defer moe finalize and fuse it with main stream add: #27720
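These notes do not spell out how the dispatch-time balancers work internally. As rough intuition for splitting one expert's tokens across redundant replicas, the sketch below implements a generic water-filling allocation: tokens go to the least-loaded ranks first, raising them to a common level. This is a textbook technique, not SGLang's Waterfill or the LP solved by LPLB (#24515). The function name and the integer load model are assumptions.

```python
def waterfill_split(loads, tokens):
    """Split `tokens` across ranks so the maximum resulting load is minimized.

    `loads[i]` is the number of tokens rank i already holds.
    Returns a list with the number of extra tokens assigned to each rank.
    """
    if not loads:
        raise ValueError("need at least one rank")
    n = len(loads)
    order = sorted(range(n), key=loads.__getitem__)  # ranks, least loaded first

    # Find how many of the least-loaded ranks (k) end up below the water level.
    prefix = 0
    for k in range(1, n + 1):
        prefix += loads[order[k - 1]]
        total = prefix + tokens  # combined load of the k lowest ranks after filling
        if k == n or total <= loads[order[k]] * k:
            break

    # Spread the combined load evenly over those k ranks.
    base, extra = divmod(total, k)
    alloc = [0] * n
    for j, idx in enumerate(order[:k]):
        alloc[idx] = base + (1 if j < extra else 0) - loads[idx]
    return alloc
```

For example, with existing loads `[10, 2, 4]` and 6 new tokens, the two lighter ranks are both raised to 6 and the heaviest rank receives nothing. A full LP formulation would balance all experts and replicas jointly rather than one expert at a time.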
Quantization
- ✨ [llm][npu][quant] Add W8A8 MXFP8 quantization support for Qwen3 Dense on Ascend NPU: #22352
- Implement online nvfp4 quantization: #26083
Parallelism & Disaggregation
- Add Mooncake group semantics: #26574
- [cookbook] Laguna-M.1: add PD disaggregation section: #28737
- [2/n] [CP] Add context parallel strategy abstractions: #27313
- [AMD] Support unified_kv_triton for disaggregation: #27935
- Add bucketed multi-dir layout for NIXL file storage: #27672
- Add EPD disaggregated encode tracing: #25994
Scheduler & Runtime
- [core] Gate the overlap WAR barrier on forward reads to recover decode throughput: #28363
- [Feature] Add graceful scheduler shutdown; free hisparse host buffer on exit: #28779
- Support MPServer and embedded server for granian to enable multi tokenizer worker: #28573
- Add get_parallel(): a structured accessor for parallel-topology state: #28567
- Support GLM-4.7 function calling via structural tags: #28149
- Add SGLANG_ENABLE_WAR_BARRIER to force-enable the overlap scheduler WAR barrier on non-CUDA (e.g. AMD): #27967
- [router] Add request/TTFT/worker metrics + Grafana dashboard to experimental sgl-router: #27591
HiCache & Radix Cache
- [HiCache] Support hybrid pool staged H2D kernel: #28434
- [HiCache & Bench] add cache hit breakdown in bench_serving: #22053
- [HiCache] Asymmetric pool support direct backend: #28446
- Support HiCache for MiMo-V2 models (1/N): #27378
- [UnifiedTree] HybridModel launches HiCache via UnifiedTree by default: #27759
- [HiCache] Add opt-in LRU eviction to file storage backend (CP-aware): #26670
LoRA
- [diffusion] perf: merge LTX-2 stage-1 distilled LoRA into the base in original mode: #28594
Multimodal
- Report multimodal (image/audio/video) token counts in usage.prompt_tokens_details: #27122
- Eliminate CUDA syncs in VLM embed path: #26082
Model Support & Optimizations
- Add dflash gemma4 support: #27471
- [NPU] Add MiMo-V2-Flash manual testcases: #28223
- Add mimo best practice: #27665
- [AMD][Perf] Fuse QK RMSNorm + gate extraction Triton kernel for Qwen3.5 on HIP: #27656
- Mistral3 add tensor parallel support for diffusion text encoder: #25950
SGLang-Diffusion
- Shard hunyuan text tokens under sp: #28319
- Shard text when using sp in flux.1/2: #27066
- Use srt custom allreduce for tp groups: #28324
- Optimize causal conv3d vae padding: #28204
- Persist torch.compile inductor/triton cache across restarts: #28205
- FLUX: fuse FeedForward GELU into up-proj GEMM (cublasLt epilogue): #28166
- Use regional torch.compile (compile_repeated_blocks) for DiT of diffusers backend: #28193
- Add --warmup-mode enum server arg: #28184
- Enable spatial-shard vae decode across GPUs: #28071
- Enable vae parallel decode with cfg-parallel: #27875
- Optimize flux1 tensor parallel sharding: #27826
- Progressive resolution growing for Ideogram 4 via GPU DCT upsampling with up to 1.56× speedup: #27736
- Use fused w8a8 kernel for Ideogram4 weight-only linear as an opt-in: #27590
- Run LTX-2 VAE decode in channels_last_3d (faster decode, lower peak memory): #27431
- Rl: extract post-training weight apis into mixins and add tensor update/checker paths: #22817
AMD / ROCm
- [AMD][DFlash] Enable Fused KV Materialization: #27854
NPU / Ascend
- [NPU] Add head_dim=256 to _can_use_tnd whitelist: #28635
- [NPU] Add NPU fallback for fused Triton gating kernels: #28293
- [NPU] [DOC] Update server arguments to NPU support features page: #28083
CPU / Intel / XPU
- [MLX] Add Metal profiling hooks to server profiler: #28122
- [Intel GPU] Add sycl mrope pass for xpu device: #27646
Dependencies
- sgl-kernel 0.4.3 → 0.4.4: #28556
- tokenspeed_mla 0.1.1 → 0.1.7: #28116, #28759
- Ray minimum version → 2.55.1: #27724
- pytorch-xpu → 2.12: #27133
Security
No security-tagged PRs in this release.
All PRs included in this release: v0.5.13...v0.5.14
New Contributors
- @caiomcbr made their first contribution in #22734
- @LJL36 made their first contribution in #27550
- @xythink made their first contribution in #23802
- @gq112 made their first contribution in #25980
- @weizhoublue made their first contribution in #24401
- @BiggieW made their first contribution in #26320
- @yuchengliu1 made their first contribution in #27133
- @nbarzilie made their first contribution in #26908
- @ChengYao-amd made their first contribution in #26347
- @HZY-Wade made their first contribution in #26670
- @sigama-w made their first contribution in #27665
- @vinayK34 made their first contribution in #27157
- @luoroger37 made their first contribution in #27779
- @zalcit made their first contribution in #26278
- @xbfs made their first contribution in #26351
- @oulgen made their first contribution in #27967
- @Joectwm made their first contribution in #25994
- @liuxpro made their first contribution in #27866
- @Zhichenzzz made their first contribution in #24955
- @zqzten made their first contribution in #28013
- @Sunt-ing made their first contribution in #28088
- @Ibrahim2595 made their first contribution in #28078
- @evanderfff123-boop made their first contribution in #27913
- @JaredforReal made their first contribution in #26902
- @kingjameschan made their first contribution in #25975
- @HumphreySun98 made their first contribution in #26971
- @Zyann7 made their first contribution in #27122
- @DaZhUUU made their first contribution in #28043
- @prajjwal1 made their first contribution in #27588
- @cccccya made their first contribution in #28031
- @qinsir5522 made their first contribution in #28283
- @jinhaosong-source made their first contribution in #28004
- @feliang-git made their first contribution in #24515
- @jvmncs made their first contribution in #28002
- @Jyothirmaikottu made their first contribution in #28338
- @Qeeweew made their first contribution in #27328
- @okorzh-amd made their first contribution in #28486
- @stellaxcpeng made their first contribution in #28436
- @kangwangamd made their first contribution in #27815
- @lmyybh made their first contribution in #27553
- @joerowell made their first contribution in #28400
- @Talantan1102 made their first contribution in #25144
- @ashishdatta made their first contribution in #24082
- @jaybe1234 made their first contribution in #23910
- @shuwang21 made their first contribution in #28665
- @liyucheng09 made their first contribution in #26312
- @Zhangpch2021 made their first contribution in #23377
- @VarV0id made their first contribution in #26773
- @jeremyzhang866 made their first contribution in #26923
- @1e4ves made their first contribution in #28619
- @shihaoustc made their first contribution in #28718
- @pjdurden made their first contribution in #27430
- @EazyReal made their first contribution in #28802
- @Terry-Uv made their first contribution in #26880
- @yokinoshitayoki made their first contribution in #26574
Full Changelog: v0.5.13...v0.5.14