Highlights

New Model Support: GLM-5.2, LiquidAI LFM2.5, Kimi-K2.7-Code, Poolside Laguna-M.1, DiffusionGemma, Zyphra ZAYA1, MiMo-V2-ASR

DeepSeek-V4 on GB300 since Day 0: SGLang serves DeepSeek-V4 on NVIDIA GB300 with 5x higher throughput at the same interactivity (blog).

Waterfill & LPLB MoE load balancing: Two dispatch-time load-balancing methods for DeepEP expert parallelism: Waterfill for shared-expert dispatch and LPLB for redundant expert replicas, improving throughput for DeepSeek-V3/R1 and DeepSeek-V4 (blog).
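The general waterfilling idea can be illustrated with a small, self-contained sketch. This is not SGLang's implementation: it assumes each rank already carries a known routed-expert load and that a pool of shared-expert tokens may be placed on any rank. The function name and its inputs are hypothetical.

```python
def waterfill(loads, budget):
    """Spread `budget` extra tokens over ranks, raising the lowest loads first.

    loads:  current integer token load per rank (hypothetical input).
    budget: number of extra tokens that can go to any rank.
    Returns the extra tokens assigned to each rank; the result sums to budget.
    """
    n = len(loads)
    if n == 0:
        raise ValueError("need at least one rank")
    order = sorted(range(n), key=lambda i: loads[i])  # ranks, least loaded first
    extra = [0] * n
    remaining = budget
    level = loads[order[0]]  # current water level
    k = 1                    # number of ranks already at the water level
    while remaining > 0:
        if k < n:
            gap = loads[order[k]] - level
            need = gap * k
            if need <= remaining:
                # Raise the k lowest ranks up to the next rank's load.
                for i in order[:k]:
                    extra[i] += gap
                remaining -= need
                level += gap
                k += 1
                continue
        # Not enough budget to reach the next level (or all ranks are level):
        # split what is left as evenly as possible among the k lowest ranks.
        share, rem = divmod(remaining, k)
        for j, i in enumerate(order[:k]):
            extra[i] += share + (1 if j < rem else 0)
        remaining = 0
    return extra


if __name__ == "__main__":
    loads = [10, 4, 7]
    extra = waterfill(loads, 8)
    print(extra, [a + b for a, b in zip(loads, extra)])
```

With loads of 10, 4, and 7 tokens and 8 tokens to place, the two lightest ranks absorb all of the extra work and the heaviest rank receives none.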

KDA CuteDSL prefill kernel on Blackwell (SM100): New CuteDSL prefill kernel for Kimi-Linear (KDA), 1.08-1.52x faster than the Triton path via a reusable scratch workspace, plus a cuda-graph padding fix (#27488); see the Kimi-Linear cookbook.

Linear-attention prefix-cache memory savings: An int8 checkpoint pool stores recurrent states compactly in the Mamba radix cache, substantially increasing prefix-cache capacity for KDA / GDN models (#28185); the speculative conv-window intermediate cache is deduplicated with a sliding-window layout, halving its footprint with no numerical change (#28302).
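A toy example shows why a sliding-window layout removes duplication. The sketch assumes a causal-conv window of fixed width and one cached window per speculative draft token, with each window equal to the previous one shifted by a single token. All names are hypothetical, and the real cache stores tensors rather than Python lists.

```python
def naive_cache(history, drafts, width):
    """Store one full copy of the conv window after every draft token."""
    seq = list(history)
    windows = []
    for tok in drafts:
        seq.append(tok)
        windows.append(seq[-width:])
    return windows


def sliding_cache(history, drafts, width):
    """Store the overlapping windows once, as a single contiguous buffer."""
    if width < 2 or len(history) < width - 1:
        raise ValueError("need width >= 2 and at least width - 1 history tokens")
    return list(history[-(width - 1):]) + list(drafts)


def window_at(buffer, step, width):
    """Recover the window for draft step `step` as a slice of the shared buffer."""
    return buffer[step:step + width]


if __name__ == "__main__":
    history, drafts, width = [1, 2, 3, 4, 5], [6, 7, 8], 4
    naive = naive_cache(history, drafts, width)
    buf = sliding_cache(history, drafts, width)
    assert all(window_at(buf, t, width) == naive[t] for t in range(len(drafts)))
    print(sum(len(w) for w in naive), "entries naive vs", len(buf), "entries sliding")
```

The windows read back from the shared buffer are identical to the separately stored ones, which is why such a layout changes memory use but not numerics. The exact saving depends on the window width and the number of draft steps.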

LPLB: linear-programming load balancer for MoE expert parallelism: Balances token routing across redundant expert replicas by solving a per-layer LP; opt-in via --ep-dispatch-algorithm=lp, default behavior unchanged (#24515).
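One way to write such a per-layer LP is sketched below. The notation is an assumption for illustration and is not taken from the PR: n_e is the number of tokens routed to expert e, R(e) is the set of replicas of e, g(r) is the GPU hosting replica r, x_{e,r} is the number of tokens sent to replica r, and z is the largest per-GPU load.

```latex
\begin{aligned}
\min_{x,\,z}\quad & z \\
\text{s.t.}\quad & \sum_{r \in R(e)} x_{e,r} = n_e && \forall e, \\
& \sum_{e}\ \sum_{r \in R(e):\, g(r) = g} x_{e,r} \le z && \forall g, \\
& x_{e,r} \ge 0 && \forall e,\ r \in R(e).
\end{aligned}
```

Minimizing z pushes the busiest GPU's load down, so tokens for a hot expert are split across its replicas instead of piling onto one device.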

MSCCL++ integration & MNNVL allreduce fusion: MSCCL++ migrates to the upstream mscclpp Python package (Executor + DSL compiler) with auto-tuned collectives for TP=8 single-node and TP=16 two-node (#22734); FlashInfer fused allreduce + residual + RMSNorm re-enables an MNNVL backend behind --flashinfer-allreduce-fusion-backend (auto / trtllm / mnnvl), fixing the piecewise-CUDA-graph interaction (#23402).

Nemotron DP attention + MTP: Data-parallel attention for the hybrid Nemotron-H (Mamba2 + full attention + MoE), plus MTP support (#24955); see the Nemotron 3 Ultra cookbook.

AMD: breakable CUDA graph on ROCm/HIP: The breakable CUDA graph execution path now runs on AMD GPUs (#28173).

NVFP4 MoE for DeepSeek-V4: Adds an NVFP4 MoE quantization path for DeepSeek-V4 on Blackwell for higher MoE throughput; enable with --moe-runner-backend flashinfer_trtllm_routed (#25820); see the DeepSeek-V4 cookbook.
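The opt-in flags named in the highlights above can be collected into launch commands. The sketch below only assembles and prints command lines. The model path is a placeholder, and every other argument a real deployment needs (parallelism sizes, expert-parallel backend, and so on) is deliberately left out, so treat the output as a starting point and check the linked cookbooks.

```python
import shlex

MODEL_PATH = "path/to/model"  # placeholder, not a real checkpoint


def launch_command(*extra_flags):
    """Build a launch_server command line with the given extra flags."""
    base = ["python", "-m", "sglang.launch_server", "--model-path", MODEL_PATH]
    return shlex.join(base + list(extra_flags))


# Flag spellings are copied from the release notes; each command enables one feature.
examples = {
    "LPLB": launch_command("--ep-dispatch-algorithm=lp"),
    "MNNVL allreduce fusion": launch_command(
        "--flashinfer-allreduce-fusion-backend", "mnnvl"
    ),
    "NVFP4 MoE": launch_command("--moe-runner-backend", "flashinfer_trtllm_routed"),
}

if __name__ == "__main__":
    for name, cmd in examples.items():
        print(f"{name}: {cmd}")
```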

DeepSeek-V4 decode & quantization optimizations: FP8 group quantization now emits power-of-two (UE8M0) scales directly from the per-token group-quant kernel, dropping a separate rounding pass (#26766); MLA decode q-heads are padded to 64 under attention-TP so FlashMLA dispatches the ~2x cheaper head64 kernel instead of head128 (#27954); the MHC prenorm kernel is prewarmed at startup to remove the first-run JIT slowdown on a fresh server (#27986); and BF16 mixed-dtype compression states are supported on the C4 / C128 paths (#27277); see the DeepSeek-V4 cookbook.
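To make the UE8M0 point concrete, the sketch below computes a power-of-two scale per group in plain Python. It assumes the FP8 E4M3 format (largest finite value 448), a group size of 128, and rounding the scale up so that scaled values stay in range. These are illustrative assumptions, and the fused GPU kernel in #26766 may differ in its details.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def ue8m0_group_scales(values, group_size=128):
    """Return one power-of-two scale per group of `group_size` values.

    Each scale is the smallest power of two s with max|v| / s <= FP8_E4M3_MAX,
    so it can be stored as an exponent alone (the UE8M0 idea).
    """
    scales = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        if amax == 0.0:
            scales.append(1.0)  # all-zero group: any scale works
            continue
        raw = amax / FP8_E4M3_MAX
        # frexp gives raw = m * 2**e with 0.5 <= m < 1, so ceil(log2(raw)) is
        # e when m > 0.5 and e - 1 when raw is an exact power of two.
        m, e = math.frexp(raw)
        exponent = e - 1 if m == 0.5 else e
        scales.append(2.0 ** exponent)
    return scales


if __name__ == "__main__":
    print(ue8m0_group_scales([1.0, -0.25, 0.5, 0.0], group_size=4))
```

Computing the power-of-two scale in the same step as the group maximum is what makes a later, separate rounding pass over the scales unnecessary.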

Full release notes by category below.

New Model Support

GLM-5.2: #28437 (cookbook)
LiquidAI LFM2.5: #27409 (cookbook)
Kimi-K2.7-Code: #28064 (cookbook)
Poolside Laguna-M.1: #28661, #28400 (cookbook)
DiffusionGemma: #27824 (cookbook)
Zyphra ZAYA1: #26347 (cookbook wip)
MiMo-V2-ASR: #26278 (cookbook wip)

DeepSeek V4

[NVIDIA] Support NVFP4 MoE for DeepSeek-V4: #25820
[DeepSeek-V4] Fuse UE8M0 scale rounding into FP8 group quantization: #26766
[NPU] Add Ascend NPU support for DeepSeek-V4: #25144
DeepSeek-V4: support mixed dtype compression states: #27277
[AMD] Feat: Add prefill context parallel support for deepseek v4 unified kv attention: #27928
DeepSeek-V4 Online Compress support MTP: #26471
[dsv4] Pad MLA decode q-heads to 64 (not full n_heads) for FlashMLA head64 kernel: #27954
[dsv4] Prewarm MHC prenorm kernel at startup: #27986
[LoRA] Support DSA indexer LoRA targets for GLM-5.1 / DeepSeek-V3.2-family models: #28110
Add DeepSeek V4 MTP acceptance length checks: #28098

Speculative Decoding

[Spec] Add sync-free fast_prefill_plan for EAGLE draft-extend CUDA graph: #28854
[Spec] Support FlashInfer CUDA graph for EAGLE draft-extend: #28782
[mtp] add rejection sampling for speculative decoding: #26312
[NPU] Add MTP support for GLM-4.7-Flash: #28516
DFlash: add sliding window attention draft layer support: #27469
Support Nemotron DP attention and MTP: #24955
[Feature] [Ngram spec] Support ngram spec v2: #17260
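For readers unfamiliar with rejection sampling in speculative decoding, the textbook acceptance rule for a single draft token looks roughly like the sketch below. It illustrates the general technique only; the implementation in #26312 may be structured differently, and the function name and list-based distributions are hypothetical.

```python
import random


def speculative_accept(draft_token, p_target, q_draft, rng):
    """Accept or replace one draft token.

    p_target, q_draft: probability lists over the same vocabulary.
    Returns (token, accepted). The draft token is accepted with probability
    min(1, p/q); otherwise a token is resampled from max(0, p - q), normalized.
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    # If the residual is empty, fall back to sampling from the target directly.
    weights = residual if sum(residual) > 0.0 else p_target
    token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return token, False


if __name__ == "__main__":
    rng = random.Random(0)
    # The target puts all mass on token 0; the draft model proposed token 1.
    print(speculative_accept(1, [1.0, 0.0], [0.5, 0.5], rng))
```

The resampling step is what keeps the output distribution equal to the target model's, even when the draft model proposes tokens the target would rarely produce.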

Piecewise & Breakable CUDA Graph

[AMD] Make breakable CUDA graph run on ROCm/HIP: #28173
DFlash: support piecewise CUDA graphs: #27468

Attention Backends

[Cookbook] Nemotron3-Ultra: Add mamba-backend and SSM dtype flags: #28675
[Mamba][GDN] Deduplicate spec conv-window intermediate cache via sliding window layout: #28302
[GDN][KDA][mem_cache] int8 checkpoint pool for the linear-attn prefix cache: #28185
[diffusion] feat: use LocalAttention for mistral3 encoder: #28176
[NPU] Add Gemma4 Sliding Window Attention support on Ascend backend: #26147
[AMD] Fuse sigmoid + mul attention output gate into single Triton kernel: #27630
[AMD] Enable fused GDN QKV split Triton kernel on HIP: #27583
[KDA] Add CuteDSL Prefill Kernel on SM100: #27488
[AMD] Add unified kv attention support in DeepSeek-V4: #27380

MoE & Expert Parallelism

Add GB10 FP8 fused MoE Triton config: #25665
Support asymmetric compressed-tensors MoE: #27690
LPLB: linear-programming load balancer for MoE expert parallelism: #24515
[AMD] Fuse sigmoid + mul into single Triton kernel for shared expert gating: #27636
[quantization] NVFP4 MoE: split fused w13 gate/up global scales: #27588
[Apple Silicon] [MLX] Fuse SwiGLU activation into gate gather_qmv for SwitchGLU MoE blocks: #26188
[DeepSeek V3] Defer MoE finalize and fuse it with the main-stream add: #27720

Quantization

✨ [llm][npu][quant] Add W8A8 MXFP8 quantization support for Qwen3 Dense on Ascend NPU: #22352
Implement online nvfp4 quantization: #26083

Parallelism & Disaggregation

Add Mooncake group semantics: #26574
[cookbook] Laguna-M.1: add PD disaggregation section: #28737
[2/n] [CP] Add context parallel strategy abstractions: #27313
[AMD] Support unified_kv_triton for disaggregation: #27935
Add bucketed multi-dir layout for NIXL file storage: #27672
Add EPD disaggregated encode tracing: #25994

Scheduler & Runtime

[core] Gate the overlap WAR barrier on forward reads to recover decode throughput: #28363
[Feature] Add graceful scheduler shutdown; free hisparse host buffer on exit: #28779
Support MPServer and embedded server for granian to enable multi tokenizer worker: #28573
Add get_parallel(): a structured accessor for parallel-topology state: #28567
Support GLM-4.7 function calling via structural tags: #28149
Add SGLANG_ENABLE_WAR_BARRIER to force-enable the overlap scheduler WAR barrier on non-CUDA (e.g. AMD): #27967
[router] Add request/TTFT/worker metrics + Grafana dashboard to experimental sgl-router: #27591

HiCache & Radix Cache

[HiCache] Support hybrid pool staged H2D kernel: #28434
[HiCache & Bench] add cache hit breakdown in bench_serving: #22053
[HiCache] Asymmetric pool: support direct backend: #28446
Support HiCache for MiMo-V2 models (1/N): #27378
[UnifiedTree] HybridModel launches HiCache via UnifiedTree by default: #27759
[HiCache] Add opt-in LRU eviction to file storage backend (CP-aware): #26670

LoRA

[diffusion] perf: merge LTX-2 stage-1 distilled LoRA into the base in original mode: #28594

Multimodal

Report multimodal (image/audio/video) token counts in usage.prompt_tokens_details: #27122
Eliminate CUDA syncs in VLM embed path: #26082

Model Support & Optimizations

Add dflash gemma4 support: #27471
[NPU] Add MiMo-V2-Flash manual testcases: #28223
Add mimo best practice: #27665
[AMD][Perf] Fuse QK RMSNorm + gate extraction Triton kernel for Qwen3.5 on HIP: #27656
Mistral3 add tensor parallel support for diffusion text encoder: #25950

SGLang-Diffusion

Shard hunyuan text tokens under sp: #28319
Shard text when using sp in flux.1/2: #27066
Use srt custom allreduce for tp groups: #28324
Optimize causal conv3d vae padding: #28204
Persist torch.compile inductor/triton cache across restarts: #28205
FLUX: fuse FeedForward GELU into up-proj GEMM (cublasLt epilogue): #28166
Use regional torch.compile (compile_repeated_blocks) for DiT of diffusers backend: #28193
Add --warmup-mode enum server arg: #28184
Enable spatial-shard vae decode across GPUs: #28071
Enable vae parallel decode with cfg-parallel: #27875
Optimize flux1 tensor parallel sharding: #27826
Progressive resolution growing for Ideogram 4 via GPU DCT upsampling with up to 1.56× speedup: #27736
Use fused w8a8 kernel for Ideogram4 weight-only linear as an opt-in: #27590
Run LTX-2 VAE decode in channels_last_3d (faster decode, lower peak memory): #27431
RL: extract post-training weight APIs into mixins and add tensor update/checker paths: #22817

AMD / ROCm

[AMD][DFlash] Enable Fused KV Materialization: #27854

NPU / Ascend

[NPU] Add head_dim=256 to _can_use_tnd whitelist: #28635
[NPU] Add NPU fallback for fused Triton gating kernels: #28293
[NPU] [DOC] Update server arguments to NPU support features page: #28083

CPU / Intel / XPU

[MLX] Add Metal profiling hooks to server profiler: #28122
[Intel GPU] Add SYCL mrope pass for XPU device: #27646

Dependencies

sgl-kernel 0.4.3 → 0.4.4: #28556
tokenspeed_mla 0.1.1 → 0.1.7: #28116, #28759
Ray minimum version → 2.55.1: #27724
pytorch-xpu → 2.12: #27133

Security

No security-tagged PRs in this release.

All PRs included in this release: v0.5.13...v0.5.14

New Contributors

@caiomcbr made their first contribution in #22734
@LJL36 made their first contribution in #27550
@xythink made their first contribution in #23802
@gq112 made their first contribution in #25980
@weizhoublue made their first contribution in #24401
@BiggieW made their first contribution in #26320
@yuchengliu1 made their first contribution in #27133
@nbarzilie made their first contribution in #26908
@ChengYao-amd made their first contribution in #26347
@HZY-Wade made their first contribution in #26670
@sigama-w made their first contribution in #27665
@vinayK34 made their first contribution in #27157
@luoroger37 made their first contribution in #27779
@zalcit made their first contribution in #26278
@xbfs made their first contribution in #26351
@oulgen made their first contribution in #27967
@Joectwm made their first contribution in #25994
@liuxpro made their first contribution in #27866
@Zhichenzzz made their first contribution in #24955
@zqzten made their first contribution in #28013
@Sunt-ing made their first contribution in #28088
@Ibrahim2595 made their first contribution in #28078
@evanderfff123-boop made their first contribution in #27913
@JaredforReal made their first contribution in #26902
@kingjameschan made their first contribution in #25975
@HumphreySun98 made their first contribution in #26971
@Zyann7 made their first contribution in #27122
@DaZhUUU made their first contribution in #28043
@prajjwal1 made their first contribution in #27588
@cccccya made their first contribution in #28031
@qinsir5522 made their first contribution in #28283
@jinhaosong-source made their first contribution in #28004
@feliang-git made their first contribution in #24515
@jvmncs made their first contribution in #28002
@Jyothirmaikottu made their first contribution in #28338
@Qeeweew made their first contribution in #27328
@okorzh-amd made their first contribution in #28486
@stellaxcpeng made their first contribution in #28436
@kangwangamd made their first contribution in #27815
@lmyybh made their first contribution in #27553
@joerowell made their first contribution in #28400
@Talantan1102 made their first contribution in #25144
@ashishdatta made their first contribution in #24082
@jaybe1234 made their first contribution in #23910
@shuwang21 made their first contribution in #28665
@liyucheng09 made their first contribution in #26312
@Zhangpch2021 made their first contribution in #23377
@VarV0id made their first contribution in #26773
@jeremyzhang866 made their first contribution in #26923
@1e4ves made their first contribution in #28619
@shihaoustc made their first contribution in #28718
@pjdurden made their first contribution in #27430
@EazyReal made their first contribution in #28802
@Terry-Uv made their first contribution in #26880
@yokinoshitayoki made their first contribution in #26574

Full Changelog: v0.5.13...v0.5.14

sgl-project/sglang v0.5.14 on GitHub