sgl-project/sglang v0.5.11

Highlights

  • CUDA 13 + Torch 2.11: Default CUDA version moves to 13.0 across SGLang, sgl-kernel, and Docker images, and PyTorch is upgraded from 2.9 to 2.11, modernizing the build matrix and unlocking newer kernels (see the sanity-check sketch after this list): #21247, #24162, #24183, #23593 (tracking issue #21498)

  • Speculative Decoding V2 by default: Spec V2 (with overlap scheduling to hide CPU overhead) is now the default, materially reducing per-step CPU cost for EAGLE/MTP/DFLASH paths: #21062

  • Decode Radix Cache for PD Disaggregation: Decode-side prefix caching now works under prefill/decode disaggregation, recovering radix-cache hit rates and TTFT savings for long shared prefixes in disaggregated deployments: #19746

  • Day-0 / New Model Support: Gemma 4, GLM-5.1, Qwen3.6, MiMo-V2.5 / V2.5-Pro, Ling-2.6-Flash, Mistral Medium 3.5, and Kimi-K2.6 — with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook: #21952, #23808, #23811, #23851, #23947, #23486, #23394

  • DFLASH Speculative Decoding: New high-throughput spec-decode kernel from the kernel community, expanded across model backends and AMD ROCm: #22077, #22358, #22342, #23553

  • FA3 Kernels from the Kernel Community: Drop-in FA3 kernels contributed by the community, integrated alongside FA4 to give users a high-performance option that's easy to maintain: #20796

  • LoRA support for DeepSeek-V3 and Kimi-K2: LoRA now works on the largest MLA-based MoE models, including DeepSeek-V3 MLA LoRA and Kimi K2 — enabling adapter-based fine-tuning of frontier-scale models: #22323, #22381

  • Context Parallel (CP) Enhancements: All-reduce + RMSNorm fusion under CP for end-to-end speedups, plus support for moe_dp_size = 1 paired with arbitrary attention_cp_size so MoE and attention parallelism can be tuned independently: #21249, #22003

  • FlashInfer CuteDSL MoE Runner Backend: New dedicated FlashInferCuteDslMoE layer for the standard FP4 MoE path, giving an additional high-performance fused-MoE option: #21339
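
For the CUDA 13 + Torch 2.11 highlight, a quick environment sanity check after upgrading (nothing here is SGLang-specific):

```python
# Confirm the upgraded build matrix (CUDA 13.0, PyTorch 2.11) after
# installing the new wheels or pulling the updated Docker image.
import torch

print("torch:", torch.__version__)           # expect a 2.11.x build
print("CUDA runtime:", torch.version.cuda)   # expect "13.0" on CUDA wheels
print("CUDA available:", torch.cuda.is_available())
```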

New Model Support

Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.

Speculative Decoding

  • DFLASH speculative decoding initial support: #22077
  • DFLASH enabled across additional model backends: #22358
  • DFLASH speculative decoding on AMD ROCm: #22342
  • Spec V2 enabled by default with overlap scheduling (see the launch sketch after this list): #21062
  • Penalty support for Spec V2 overlap scheduling: #22049
  • Adaptive speculative_num_steps for EAGLE topk=1: #21599
  • Allow piecewise CUDA graph with speculative decoding: #22128
  • Eagle3 / DFLASH aux hidden state capture during CUDA graph init fixed: #22836
  • Split accept_length into num_accepted_drafts / num_accepted_tokens: #23962
  • DFLASH speculative decoding documentation: #23553
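
Because Spec V2 is now the default scheduler path, turning on EAGLE-style speculation needs only the existing speculative-* arguments. A minimal offline-engine sketch; the target and draft model paths are placeholders, and DFLASH-specific argument spellings are not shown because they are not documented here:

```python
# Minimal sketch: EAGLE speculative decoding via the offline engine.
# Spec V2 with overlap scheduling is the default, so no extra opt-in flag
# is needed. Model and draft paths below are illustrative placeholders.
import sglang as sgl

engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",                      # placeholder target
    speculative_algorithm="EAGLE",                                      # EAGLE/EAGLE3; MTP and DFLASH are model-dependent
    speculative_draft_model_path="yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # placeholder draft
    speculative_num_steps=3,        # #21599 makes this adaptive for topk=1
    speculative_eagle_topk=1,
    speculative_num_draft_tokens=4,
)

out = engine.generate("The capital of France is", {"temperature": 0, "max_new_tokens": 32})
print(out["text"])
engine.shutdown()
```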

PD Disaggregation

  • Decode-side radix cache support (see the launch sketch after this list): #19746
  • Incremental transfer for Mooncake transfer engine: #24257
  • Allow PrefillDelayer in disaggregated-prefill mode: #23588
  • NIXL: heterogeneous TP KV transfer for non-MLA models (Step 1/2 for Qwen3.5): #22145
  • NIXL: Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5): #22240
  • Bug fixes for IntraNode NVLink, MTP-layer KV transfer, and disagg-prefill DP rank resolution: #23252, #23539, #22901, #22990
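
A hedged two-worker sketch of PD disaggregation with the Mooncake transfer engine; the ports and model path are placeholders, and real deployments need additional bootstrap and networking arguments that are omitted here:

```python
# Hedged sketch: one prefill worker and one decode worker.
# The radix cache is on by default, and #19746 extends its benefit to the
# decode side under disaggregation. Cluster/bootstrap flags are omitted.
import subprocess

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder

def launch(mode: str, port: int) -> subprocess.Popen:
    return subprocess.Popen([
        "python", "-m", "sglang.launch_server",
        "--model-path", MODEL,
        "--disaggregation-mode", mode,            # "prefill" or "decode"
        "--disaggregation-transfer-backend", "mooncake",
        "--port", str(port),
    ])

prefill = launch("prefill", 30000)
decode = launch("decode", 30001)
```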

Context Parallel & Parallelism

  • All-reduce fusion support under CP: #21249
  • moe_dp_size = 1 paired with arbitrary attention_cp_size (see the sketch after this list): #22003
  • All-reduce fusion enabled for DSA models: #22390
  • Replace all-reduce + dp_scatter with reduce_scatterv for DP attention: #22642
  • Optimized all-reduce in Step3p5 MoE layers: #22773
  • Pipeline parallelism on Intel XPU: #23472
  • OpenTelemetry tracing for pipeline parallelism: #23169
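
A sketch of tuning MoE and attention parallelism independently. The moe_dp_size and attention_cp_size spellings are taken from the PR title and are assumed to map onto engine arguments of the same name; verify against ServerArgs for your build:

```python
# Hedged sketch: context parallelism for attention with MoE data parallelism
# pinned to 1. The two argument names below are assumptions based on #22003.
import sglang as sgl

engine = sgl.Engine(
    model_path="Qwen/Qwen3-30B-A3B",  # placeholder MoE model
    tp_size=8,
    attention_cp_size=4,  # assumed name: CP degree for attention
    moe_dp_size=1,        # assumed name: MoE DP fixed at 1
)
```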

LoRA

  • DeepSeek-V3 MLA LoRA support and quantization-info refactor (see the serving sketch after this list): #22323
  • Kimi K2 LoRA support: #22381
  • LoRADrainer to address high P99 TTFT: #17913
  • Decoupled LoRA MoE backend with Marlin support: #21858
  • Virtual experts for LoRA MoE (1/n): #22122, #24007
  • CSGMV kernel offline auto-tuning: #20391
  • Triton sgemm speedup with better grid selection: #22386
  • Dual MoE CUDA graph capture for LoRA and non-LoRA batches: #22809
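
A hedged sketch of adapter serving on an MLA-based MoE model, which #22323 and #22381 make possible at DeepSeek-V3 / Kimi-K2 scale. The adapter name and path are placeholders, and the lora_paths / lora_path spellings follow the existing LoRA serving API:

```python
# Hedged sketch: route individual requests through a LoRA adapter.
# Model, adapter path, and tp_size are placeholders for a real deployment.
import sglang as sgl

engine = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V3",           # placeholder MLA MoE model
    tp_size=8,
    lora_paths={"my_adapter": "/path/to/adapter"},  # placeholder adapter
)

out = engine.generate(
    "Summarize the release notes:",
    {"max_new_tokens": 64},
    lora_path="my_adapter",  # pick the adapter per request
)
print(out["text"])
```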

Performance

  • FA3 kernels from the kernel community: #20796
  • Precompute FA3 scheduler_metadata to eliminate per-layer prepare cost: #21104
  • Precompute gemma_weight to avoid redundant add on every forward: #22673
  • Eliminate attention DtoD copy by passing pre-allocated output to FA: #21985
  • Skip KV cache in FA backend for embedding mode: #21971
  • O(1) RadixKey view for EAGLE bigram key: #23106
  • PCG inductor path optimization for FP8 models: #23227
  • Combo-kernels for horizontal fusion: #21977
  • Optimize Gemma4 VLM with PCG and fused RMSNorm + residual add + scalar: #24048
  • Restore torch.compile fusion for topk postprocessing: #21771
  • Reduce unnecessary kernels and copies in the NSA indexer: #22232

Observability

  • Pending token count surfaced in prefill log and get_load: #22480
  • OpenTelemetry tracing for speculative decoding: #19545
  • OpenTelemetry tracing for pipeline parallelism: #23169
  • OpenTelemetry tracing in DiffGenerator: #21254
  • Prometheus metrics endpoint for gRPC mode: #20801
  • HTTP sidecar endpoints and FlushCache gRPC RPC for gRPC mode: #22500
  • Raw KV cache pool token counts as Prometheus gauges: #22726
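
The new gauges are easiest to inspect by scraping the Prometheus endpoint. A minimal sketch, assuming a server launched with --enable-metrics on the default port; the exact metric names (for example, the KV-pool token counts from #22726) vary by version:

```python
# Hedged sketch: dump token-related Prometheus metrics from a running server.
import requests

body = requests.get("http://localhost:30000/metrics", timeout=5).text
for line in body.splitlines():
    # Skip HELP/TYPE comment lines; print anything that looks token-related.
    if not line.startswith("#") and "token" in line:
        print(line)
```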

SGLang-Diffusion

  • New model support: LTX-2.3 (#22182, #22667, #22869), ERNIE-Image (#22439), FLUX.2-small-decoder (#22414), JoyAI-Image-Edit (#22625), FLUX.1-dev ModelOpt NVFP4 (#22672), Qwen Image ModelOpt FP8 (#23155), Stable Diffusion 3 medium (#19225)
  • ModelOpt diffusion FP8 support for Flux1/Flux2 and Wan2.2: #22365
  • Standalone Rollout API + Denoising Environment Backpass + SP-Aligned Log-Prob for T2I post-training: #22604
  • Disaggregated diffusion: #21701
  • Dynamic batching v0: #18764
  • CPU platform support for SGLang Diffusion: #20816
  • AITER backends in Flux 2 pipeline (AMD): #22802
  • LTX-2 feed-forward tensor parallelism optimization: #23221
  • In-memory loading for URL/base64 image inputs, now the default: #23118
  • Mixed-resolution benchmark support: #20863
  • Auto-enable best parallel setting if unspecified: #22763

AMD

  • MiniMax-M2.5 optimizations (aiter biased grouped topk; fused FP8 KV cache write): #23611, #23620
  • Fused QK Gemma norm kernels, reducing four kernel launches to fewer: #23575
  • Fused all-reduce + RMSNorm simplification: #21986
  • GLM-5 / GLM-5.1 MXFP4 nightly accuracy + perf benchmarks (MI30x / MI35x): #21773, #22336
  • MTP for GLM-5-mxfp4: #23219
  • Aiter v0.1.12.post1 upgrade: #22264
  • DFLASH speculative decoding enabled on ROCm: #22342
  • Fix --page-size > 1 memory access fault with speculative decoding: #23596

NPU / Ascend

  • Ascend backend supports Qwen3 MoE attention CP: #21685
  • GLM-4.5V and GLM-4.7-Flash NPU support / fixes: #22961, #22509
  • MTP for Qwen3.5: #20918
  • TP communications compression for Qwen3 on NPU: #20520
  • Add support-new-models documentation for NPU: #23824
  • GGUF quantization for Ascend NPU (dense + MoE): #17883

CPU

  • GPTQ / AWQ 4-bit quantization on CPU: #22685
  • gemma4_rmsnorm_cpu kernel: #22842
  • Qwen3.5 model optimization for CPU: #19484
  • Apply routed scaling factor on output for biased grouped topk fusion: #22413
  • Fix extend_attention_cpu / flash_attn_varlen_func NaN for long sequences: #22434

Quantization

  • MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs: #19143 (later reverted in #23031, follow-up forthcoming)
  • NVFP4 KV cache: quantization strategy abstraction and kernel (see the sketch after this list): #21954
  • DeepSeek-R1-0528-w4a8 + DeepEP Low-Latency FP8 dispatch: #22316
  • MXFP8 sm100 path cleanup: #21881
  • GLM-5/5.1 MXFP4 checkpoint inference compatibility fix: #22543
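
A hedged sketch of KV-cache quantization through the existing flags. #21954 adds an NVFP4 strategy behind a new abstraction; its user-facing dtype string is not documented here, so the sketch sticks to the established FP8 option:

```python
# Hedged sketch: FP8 checkpoint with an FP8-quantized KV cache.
# "fp8_e4m3" is an existing kv_cache_dtype value; the NVFP4 KV-cache path
# from #21954 presumably adds another, not shown here.
import sglang as sgl

engine = sgl.Engine(
    model_path="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",  # placeholder FP8 model
    kv_cache_dtype="fp8_e4m3",
)
```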

All PRs included in this release: v0.5.10.post1...v0.5.11

