github sgl-project/sglang v0.5.14

7 hours ago

Highlights

New Model Support: GLM-5.2, LiquidAI LFM2.5, Kimi-K2.7-Code, Poolside Laguna-M.1, DiffusionGemma, Zyphra ZAYA1, MiMo-V2-ASR

DeepSeek-V4 on GB300 since Day 0: Serving DeepSeek-V4 on NVIDIA GB300 with SGLang delivers 5x higher throughput at the same interactivity (blog).

Waterfill & LPLB MoE load balancing: Two dispatch-time load-balancing methods for DeepEP expert parallelism, Waterfill for shared-expert dispatch and LPLB for redundant expert replicas, improve throughput for DeepSeek-V3/R1 and DeepSeek-V4 (blog).

KDA CuteDSL prefill kernel on Blackwell (SM100): New CuteDSL prefill kernel for Kimi-Linear (KDA), 1.08-1.52x faster than the Triton path via a reusable scratch workspace, plus a cuda-graph padding fix (#27488); see the Kimi-Linear cookbook.

Linear-attention prefix-cache memory savings: An int8 checkpoint pool stores recurrent states compactly in the Mamba radix cache, substantially increasing prefix-cache capacity for KDA / GDN models (#28185); the speculative conv-window intermediate cache is deduplicated with a sliding-window layout, halving its footprint with no numerical change (#28302).
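To give a feel for why an int8 pool saves memory, here is a minimal sketch of symmetric int8 quantization of one recurrent-state vector with a single scale. The function names and the per-vector scaling scheme are assumptions for illustration only; the notes do not describe the actual layout or scaling used in #28185.

```python
def quantize_int8(state):
    """Symmetric int8 quantization with one scale per vector (illustrative assumption)."""
    amax = max((abs(x) for x in state), default=0.0)
    scale = amax / 127.0 if amax > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(x / scale))) for x in state]
    return q, scale


def dequantize_int8(q, scale):
    """Recover an approximation of the original values."""
    return [v * scale for v in q]


state = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(state)
restored = dequantize_int8(q, scale)
# Each stored value takes 1 byte instead of 2 (BF16) or 4 (FP32),
# at the cost of an error of at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(state, restored))
```

Under this scheme the round-trip error per element is bounded by `scale / 2`.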

LPLB: linear-programming load balancer for MoE expert parallelism: Balances token routing across redundant expert replicas by solving a per-layer LP; opt-in via --ep-dispatch-algorithm=lp, default behavior unchanged (#24515).
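The balancing idea can be illustrated on a toy case. Assume one expert has replicas on several ranks, each rank already carries some load, and the goal is to split the expert's tokens so that the highest resulting load is as low as possible. That small min-max problem has a closed-form water-filling solution, sketched below. This is not SGLang's solver: the objective, the function name `split_tokens`, and the inputs are assumptions, and the real LPLB solves a per-layer LP over all experts.

```python
def split_tokens(base_loads, tokens):
    """Split `tokens` across replicas so the raised load level is as low as possible.

    base_loads[i] is the load already on the rank hosting replica i (hypothetical input).
    Returns (allocation per replica, final water level).
    """
    order = sorted(range(len(base_loads)), key=lambda i: base_loads[i])
    sorted_loads = [base_loads[i] for i in order]
    level = float(sorted_loads[0])
    remaining = float(tokens)
    k = 1  # number of replicas currently being filled
    while k < len(sorted_loads):
        # Tokens needed to lift the k lowest replicas up to the next one.
        need = (sorted_loads[k] - level) * k
        if need >= remaining:
            break
        remaining -= need
        level = float(sorted_loads[k])
        k += 1
    level += remaining / k
    alloc = [0.0] * len(base_loads)
    for i in order[:k]:
        alloc[i] = level - base_loads[i]
    return alloc, level


alloc, level = split_tokens([10, 4, 7], tokens=6)
print(alloc, level)  # [0.0, 4.5, 1.5] 8.5
```

The busiest rank receives nothing, and the two lighter ranks are filled to a common level.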

MSCCL++ integration & MNNVL allreduce fusion: MSCCL++ migrates to the upstream mscclpp Python package (Executor + DSL compiler) with auto-tuned collectives for TP=8 single-node and TP=16 two-node (#22734); FlashInfer fused allreduce + residual + RMSNorm re-enables an MNNVL backend behind --flashinfer-allreduce-fusion-backend (auto / trtllm / mnnvl), fixing the piecewise-CUDA-graph interaction (#23402).

Nemotron DP attention + MTP: Data-parallel attention for the hybrid Nemotron-H (Mamba2 + full attention + MoE), plus MTP support (#24955); see the Nemotron 3 Ultra cookbook.

AMD: breakable CUDA graph on ROCm/HIP: The breakable CUDA graph execution path now runs on AMD GPUs (#28173).

NVFP4 MoE for DeepSeek-V4: Adds an NVFP4 MoE quantization path for DeepSeek-V4 on Blackwell for higher MoE throughput; enable with --moe-runner-backend flashinfer_trtllm_routed (#25820); see the DeepSeek-V4 cookbook.
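A sketch of how the flag might be passed when launching a server, assuming the usual `python -m sglang.launch_server --model-path ...` entry point. The helper `launch_command` and the model path are placeholders, not part of SGLang; the DeepSeek-V4 cookbook remains the reference for the full set of arguments.

```python
import shlex


def launch_command(model_path, extra_flags):
    """Build an argv list for an SGLang server launch (hypothetical helper)."""
    argv = ["python", "-m", "sglang.launch_server", "--model-path", model_path]
    for flag, value in extra_flags.items():
        argv += [flag, value]
    return argv


cmd = launch_command(
    "path/to/DeepSeek-V4",  # placeholder path, replace with a real checkpoint
    {"--moe-runner-backend": "flashinfer_trtllm_routed"},
)
print(shlex.join(cmd))
```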

DeepSeek-V4 decode & quantization optimizations: FP8 group quantization now emits power-of-two (UE8M0) scales directly from the per-token group-quant kernel, dropping a separate rounding pass (#26766); MLA decode q-heads are padded to 64 under attention-TP so FlashMLA dispatches the ~2x cheaper head64 kernel instead of head128 (#27954); the MHC prenorm kernel is prewarmed at startup to remove the first-run JIT slowdown on a fresh server (#27986); and BF16 mixed-dtype compression states are supported on the C4 / C128 paths (#27277); see the DeepSeek-V4 cookbook.
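The UE8M0 change can be pictured with a small sketch: a UE8M0 scale stores only an exponent, so each group's scale must be a power of two. The code below computes a per-group scale and rounds it up to the next power of two in the same step. The group size, the FP8 E4M3 maximum of 448, and the round-up direction are assumptions for illustration; the actual kernel in #26766 runs on the GPU and may differ in detail.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value


def ceil_pow2(x):
    """Smallest power of two that is >= x, for x > 0."""
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, 0.5 <= mantissa < 1
    if mantissa == 0.5:  # x is already an exact power of two
        exponent -= 1
    return math.ldexp(1.0, exponent)


def ue8m0_group_scales(values, group_size=128):
    """Return one power-of-two scale per group of `group_size` values."""
    scales = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        raw_scale = max(amax, 1e-10) / FP8_E4M3_MAX
        # Rounding up keeps amax / scale within the FP8 range.
        scales.append(ceil_pow2(raw_scale))
    return scales


print(ue8m0_group_scales([100.0] * 4 + [448.0] * 4, group_size=4))  # [0.25, 1.0]
```

Because the scale is rounded up, every value divided by its group scale still fits in the FP8 range.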

Full release notes by category below.

New Model Support

DeepSeek V4

  • [NVIDIA] Support NVFP4 MoE for DeepSeek-V4: #25820
  • [DeepSeek-V4] Fuse UE8M0 scale rounding into FP8 group quantization: #26766
  • [NPU] Add Ascend NPU support for DeepSeek-V4: #25144
  • Deepseek v4: support mixed dtype compression states: #27277
  • [AMD] Feat: Add prefill context parallel support for deepseek v4 unified kv attention: #27928
  • DeepSeek-V4 Online Compress support MTP: #26471
  • [dsv4] Pad MLA decode q-heads to 64 (not full n_heads) for FlashMLA head64 kernel: #27954
  • [dsv4] Prewarm MHC prenorm kernel at startup: #27986
  • [LoRA] Support DSA indexer LoRA targets for GLM-5.1 / DeepSeek-V3.2-family models: #28110
  • Add DeepSeek V4 MTP acceptance length checks: #28098

Speculative Decoding

  • [Spec] Add sync-free fast_prefill_plan for EAGLE draft-extend CUDA graph: #28854
  • [Spec] Support FlashInfer CUDA graph for EAGLE draft-extend: #28782
  • [mtp] add rejection sampling for speculative decoding: #26312
  • [NPU] Add MTP support for GLM-4.7-Flash: #28516
  • Dflash add sliding window attention draft layer support: #27469
  • Support Nemotron DP attention and MTP: #24955
  • [Feature] [Ngram spec] Support ngram spec v2: #17260
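
For readers unfamiliar with rejection sampling in speculative decoding, here is a sketch of the textbook single-token version: accept the draft token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q. This is the standard algorithm, not the implementation in #26312; the function name and the list-based distributions are illustrative assumptions.

```python
import random


def rejection_sample(draft_token, p_target, q_draft, rng=random):
    """Accept or replace one draft token.

    p_target and q_draft are probability lists over the vocabulary;
    draft_token was sampled from q_draft, so q_draft[draft_token] > 0.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    tokens = list(range(len(residual)))
    return rng.choices(tokens, weights=residual, k=1)[0], False


# The draft model put all mass on token 0, the target model on token 1,
# so the draft is always rejected and replaced by token 1.
print(rejection_sample(0, p_target=[0.0, 1.0], q_draft=[1.0, 0.0]))  # (1, False)
```

The accepted-or-resampled token follows the target distribution, which is why this scheme preserves output quality while using a cheaper draft model.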

Piecewise & Breakable CUDA Graph

  • [AMD] Make breakable CUDA graph run on ROCm/HIP: #28173
  • Dflash piecewise cuda graphs support: #27468

Attention Backends

  • [Cookbook] Nemotron3-Ultra: Add mamba-backend and SSM dtype flags: #28675
  • [Mamba][GDN] Deduplicate spec conv-window intermediate cache via sliding window layout: #28302
  • [GDN][KDA][mem_cache] int8 checkpoint pool for the linear-attn prefix cache: #28185
  • [diffusion] feat: use LocalAttention for mistral3 encoder: #28176
  • [NPU] Add Gemma4 Sliding Window Attention support on Ascend backend: #26147
  • [AMD] Fuse sigmoid + mul attention output gate into single Triton kernel: #27630
  • [AMD] Enable fused GDN QKV split Triton kernel on HIP: #27583
  • [KDA] Add CuteDSL Prefill Kernel on SM100: #27488
  • [AMD] Add unified kv attention support in dpsk-v4: #27380

MoE & Expert Parallelism

  • Add GB10 FP8 fused MoE Triton config: #25665
  • Support asymmetric compressed-tensors MoE: #27690
  • LPLB: linear-programming load balancer for MoE expert parallelism: #24515
  • [AMD] Fuse sigmoid + mul into single Triton kernel for shared expert gating: #27636
  • [quantization] NVFP4 MoE: split fused w13 gate/up global scales: #27588
  • [Apple Silicon] [MLX] Fuse SwiGLU activation into gate gather_qmv for SwitchGLU MoE blocks: #26188
  • [DeepSeek V3] Defer MoE finalize and fuse it with main stream add: #27720

Quantization

  • ✨ [llm][npu][quant] Add W8A8 MXFP8 quantization support for Qwen3 Dense on Ascend NPU: #22352
  • Implement online nvfp4 quantization: #26083

Parallelism & Disaggregation

  • Add Mooncake group semantics: #26574
  • [cookbook] Laguna-M.1: add PD disaggregation section: #28737
  • [2/n] [CP] Add context parallel strategy abstractions: #27313
  • [AMD] Support unified_kv_triton for disaggregation: #27935
  • Add bucketed multi-dir layout for NIXL file storage: #27672
  • Add EPD disaggregated encode tracing: #25994

Scheduler & Runtime

  • [core] Gate the overlap WAR barrier on forward reads to recover decode throughput: #28363
  • [Feature] Add graceful scheduler shutdown; free hisparse host buffer on exit: #28779
  • Support MPServer and embedded server for granian to enable multi-tokenizer workers: #28573
  • Add get_parallel(): a structured accessor for parallel-topology state: #28567
  • Support GLM-4.7 function calling via structural tags: #28149
  • Add SGLANG_ENABLE_WAR_BARRIER to force-enable the overlap scheduler WAR barrier on non-CUDA (e.g. AMD): #27967
  • [router] Add request/TTFT/worker metrics + Grafana dashboard to experimental sgl-router: #27591

HiCache & Radix Cache

  • [HiCache] Support hybrid pool staged H2D kernel: #28434
  • [HiCache & Bench] Add cache hit breakdown in bench_serving: #22053
  • [HiCache] Asymmetric pool support direct backend: #28446
  • Support HiCache for MiMo-V2 models (1/N): #27378
  • [UnifiedTree] HybridModel launches HiCache via UnifiedTree by default: #27759
  • [HiCache] Add opt-in LRU eviction to file storage backend (CP-aware): #26670

LoRA

  • [diffusion] perf: merge LTX-2 stage-1 distilled LoRA into the base in original mode: #28594

Multimodal

  • Report multimodal (image/audio/video) token counts in usage.prompt_tokens_details: #27122
  • Eliminate CUDA syncs in VLM embed path: #26082

Model Support & Optimizations

  • Add dflash gemma4 support: #27471
  • [NPU] Add MiMo-V2-Flash manual testcases: #28223
  • Add mimo best practice: #27665
  • [AMD][Perf] Fuse QK RMSNorm + gate extraction Triton kernel for Qwen3.5 on HIP: #27656
  • Mistral3 add tensor parallel support for diffusion text encoder: #25950

SGLang-Diffusion

  • Shard hunyuan text tokens under sp: #28319
  • Shard text when using sp in flux.1/2: #27066
  • Use srt custom allreduce for tp groups: #28324
  • Optimize causal conv3d vae padding: #28204
  • Persist torch.compile inductor/triton cache across restarts: #28205
  • FLUX: fuse FeedForward GELU into up-proj GEMM (cublasLt epilogue): #28166
  • Use regional torch.compile (compile_repeated_blocks) for DiT of diffusers backend: #28193
  • Add --warmup-mode enum server arg: #28184
  • Enable spatial-shard vae decode across GPUs: #28071
  • Enable vae parallel decode with cfg-parallel: #27875
  • Optimize flux1 tensor parallel sharding: #27826
  • Progressive resolution growing for Ideogram 4 via GPU DCT upsampling with up to 1.56× speedup: #27736
  • Use fused w8a8 kernel for Ideogram4 weight-only linear as an opt-in: #27590
  • Run LTX-2 VAE decode in channels_last_3d (faster decode, lower peak memory): #27431
  • RL: extract post-training weight APIs into mixins and add tensor update/checker paths: #22817

AMD / ROCm

  • [AMD][DFlash] Enable Fused KV Materialization: #27854

NPU / Ascend

  • [NPU] Add head_dim=256 to _can_use_tnd whitelist: #28635
  • [NPU] Add NPU fallback for fused Triton gating kernels: #28293
  • [NPU] [DOC] Update server arguments to NPU support features page: #28083

CPU / Intel / XPU

  • [MLX] Add Metal profiling hooks to server profiler: #28122
  • [Intel GPU] Add SYCL mrope pass for XPU device: #27646

Dependencies

  • sgl-kernel 0.4.3 → 0.4.4: #28556
  • tokenspeed_mla 0.1.1 → 0.1.7: #28116, #28759
  • Ray minimum version → 2.55.1: #27724
  • pytorch-xpu → 2.12: #27133

Security

No security-tagged PRs in this release.

All PRs included in this release: v0.5.13...v0.5.14

Full Changelog: v0.5.13...v0.5.14
