sgl-project/sglang v0.5.11

Highlights

  • CUDA 13 + Torch 2.11: Default CUDA version moves to 13.0 across SGLang, sgl-kernel, and Docker images, and PyTorch is upgraded from 2.9 to 2.11, modernizing the build matrix and unlocking newer kernels (see the sanity-check sketch after this list): #21247, #24162, #24183, #23593 (tracking issue #21498)

  • Speculative Decoding V2 by default: Spec V2 (with overlap scheduling to hide CPU overhead) is now the default, materially reducing per-step CPU cost for EAGLE/MTP/DFLASH paths: #21062

  • Decode Radix Cache for PD Disaggregation: Decode-side prefix caching now works under prefill/decode disaggregation, recovering radix-cache hit rates and TTFT savings for long shared prefixes in disaggregated deployments: #19746

  • Day-0 / New Model Support: Gemma 4, GLM-5.1, Qwen3.6, MiMo-V2.5 / V2.5-Pro, Ling-2.6-Flash, Mistral Medium 3.5, and Kimi-K2.6 — with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook: #21952, #23808, #23811, #23851, #23947, #23486, #23394

  • DFLASH Speculative Decoding: New high-throughput spec-decode kernel from the kernel community, expanded across model backends and AMD ROCm: #22077, #22358, #22342, #23553

  • FA3 Kernels from the Kernel Community: Drop-in FA3 kernels contributed by the community, integrated alongside FA4 to give users a high-performance option that's easy to maintain: #20796

  • LoRA support for DeepSeek-V3 and Kimi-K2: LoRA now works on the largest MLA-based MoE models, including DeepSeek-V3 MLA LoRA and Kimi K2 — enabling adapter-based fine-tuning of frontier-scale models: #22323, #22381

  • Context Parallel (CP) Enhancements: All-reduce + RMSNorm fusion under CP for end-to-end speedups, plus support for moe_dp_size = 1 paired with arbitrary attention_cp_size so MoE and attention parallelism can be tuned independently: #21249, #22003

  • FlashInfer CuteDSL MoE Runner Backend: New dedicated FlashInferCuteDslMoE layer for the standard FP4 MoE path, giving an additional high-performance fused-MoE option: #21339
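
For the CUDA 13 + Torch 2.11 highlight, a quick environment sanity check after upgrading (nothing here is SGLang-specific):

```python
# Confirm the upgraded build matrix (CUDA 13.0, PyTorch 2.11) after
# installing the new wheels or pulling the updated Docker image.
import torch

print("torch:", torch.__version__)           # expect a 2.11.x build
print("CUDA runtime:", torch.version.cuda)   # expect "13.0" on CUDA wheels
print("CUDA available:", torch.cuda.is_available())
```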

New Model Support

Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.

Speculative Decoding

  • DFLASH speculative decoding initial support: #22077
  • DFLASH enabled across additional model backends: #22358
  • DFLASH speculative decoding on AMD ROCm: #22342
  • Spec V2 enabled by default with overlap scheduling (see the launch sketch after this list): #21062
  • Penalty support for Spec V2 overlap scheduling: #22049
  • Adaptive speculative_num_steps for EAGLE topk=1: #21599
  • Allow piecewise CUDA graph with speculative decoding: #22128
  • Eagle3 / DFLASH aux hidden state capture during CUDA graph init fixed: #22836
  • Split accept_length into num_accepted_drafts / num_accepted_tokens: #23962
  • DFLASH speculative decoding documentation: #23553
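
Because Spec V2 is now the default scheduler path, turning on EAGLE-style speculation needs only the existing speculative-* arguments. A minimal offline-engine sketch; the target and draft model paths are placeholders, and DFLASH-specific argument spellings are not shown because they are not documented here:

```python
# Minimal sketch: EAGLE speculative decoding via the offline engine.
# Spec V2 with overlap scheduling is the default, so no extra opt-in flag
# is needed. Model and draft paths below are illustrative placeholders.
import sglang as sgl

engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",                      # placeholder target
    speculative_algorithm="EAGLE",                                      # EAGLE/EAGLE3; MTP and DFLASH are model-dependent
    speculative_draft_model_path="yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # placeholder draft
    speculative_num_steps=3,        # #21599 makes this adaptive for topk=1
    speculative_eagle_topk=1,
    speculative_num_draft_tokens=4,
)

out = engine.generate("The capital of France is", {"temperature": 0, "max_new_tokens": 32})
print(out["text"])
engine.shutdown()
```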

PD Disaggregation

  • Decode-side radix cache support (see the launch sketch after this list): #19746
  • Incremental transfer for Mooncake transfer engine: #24257
  • Allow PrefillDelayer in disaggregated-prefill mode: #23588
  • NIXL: heterogeneous TP KV transfer for non-MLA models (Step 1/2 for Qwen3.5): #22145
  • NIXL: Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5): #22240
  • Bug fixes for IntraNode NVLink, MTP-layer KV transfer, and disagg-prefill DP rank resolution: #23252, #23539, #22901, #22990
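
A hedged two-worker sketch of PD disaggregation with the Mooncake transfer engine; the ports and model path are placeholders, and real deployments need additional bootstrap and networking arguments that are omitted here:

```python
# Hedged sketch: one prefill worker and one decode worker.
# The radix cache is on by default, and #19746 extends its benefit to the
# decode side under disaggregation. Cluster/bootstrap flags are omitted.
import subprocess

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder

def launch(mode: str, port: int) -> subprocess.Popen:
    return subprocess.Popen([
        "python", "-m", "sglang.launch_server",
        "--model-path", MODEL,
        "--disaggregation-mode", mode,            # "prefill" or "decode"
        "--disaggregation-transfer-backend", "mooncake",
        "--port", str(port),
    ])

prefill = launch("prefill", 30000)
decode = launch("decode", 30001)
```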

Context Parallel & Parallelism

  • All-reduce fusion support under CP: #21249
  • moe_dp_size = 1 paired with arbitrary attention_cp_size (see the sketch after this list): #22003
  • All-reduce fusion enabled for DSA models: #22390
  • Replace all-reduce + dp_scatter with reduce_scatterv for DP attention: #22642
  • Optimized all-reduce in Step3p5 MoE layers: #22773
  • Pipeline parallelism on Intel XPU: #23472
  • OpenTelemetry tracing for pipeline parallelism: #23169
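
A sketch of tuning MoE and attention parallelism independently. The moe_dp_size and attention_cp_size spellings are taken from the PR title and are assumed to map onto engine arguments of the same name; verify against ServerArgs for your build:

```python
# Hedged sketch: context parallelism for attention with MoE data parallelism
# pinned to 1. The two argument names below are assumptions based on #22003.
import sglang as sgl

engine = sgl.Engine(
    model_path="Qwen/Qwen3-30B-A3B",  # placeholder MoE model
    tp_size=8,
    attention_cp_size=4,  # assumed name: CP degree for attention
    moe_dp_size=1,        # assumed name: MoE DP fixed at 1
)
```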

LoRA

  • DeepSeek-V3 MLA LoRA support and quantization-info refactor (see the serving sketch after this list): #22323
  • Kimi K2 LoRA support: #22381
  • LoRADrainer to address high P99 TTFT: #17913
  • Decoupled LoRA MoE backend with Marlin support: #21858
  • Virtual experts for LoRA MoE (1/n): #22122, #24007
  • CSGMV kernel offline auto-tuning: #20391
  • Triton sgemm speedup with better grid selection: #22386
  • Dual MoE CUDA graph capture for LoRA and non-LoRA batches: #22809
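
A hedged sketch of adapter serving on an MLA-based MoE model, which #22323 and #22381 make possible at DeepSeek-V3 / Kimi-K2 scale. The adapter name and path are placeholders, and the lora_paths / lora_path spellings follow the existing LoRA serving API:

```python
# Hedged sketch: route individual requests through a LoRA adapter.
# Model, adapter path, and tp_size are placeholders for a real deployment.
import sglang as sgl

engine = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V3",           # placeholder MLA MoE model
    tp_size=8,
    lora_paths={"my_adapter": "/path/to/adapter"},  # placeholder adapter
)

out = engine.generate(
    "Summarize the release notes:",
    {"max_new_tokens": 64},
    lora_path="my_adapter",  # pick the adapter per request
)
print(out["text"])
```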

Performance

  • FA3 kernels from the kernel community: #20796
  • Precompute FA3 scheduler_metadata to eliminate per-layer prepare cost: #21104
  • Precompute gemma_weight to avoid redundant add on every forward: #22673
  • Eliminate attention DtoD copy by passing pre-allocated output to FA: #21985
  • Skip KV cache in FA backend for embedding mode: #21971
  • O(1) RadixKey view for EAGLE bigram key: #23106
  • PCG inductor path optimization for FP8 models: #23227
  • Combo-kernels for horizontal fusion: #21977
  • Optimize Gemma4 VLM with PCG and fused RMSNorm + residual add + scalar: #24048
  • Restore torch.compile fusion for topk postprocessing: #21771
  • Reduce unnecessary kernels and copies in the NSA indexer: #22232

Observability

  • Pending token count surfaced in prefill log and get_load: #22480
  • OpenTelemetry tracing for speculative decoding: #19545
  • OpenTelemetry tracing for pipeline parallelism: #23169
  • OpenTelemetry tracing in DiffGenerator: #21254
  • Prometheus metrics endpoint for gRPC mode: #20801
  • HTTP sidecar endpoints and FlushCache gRPC RPC for gRPC mode: #22500
  • Raw KV cache pool token counts as Prometheus gauges: #22726
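
The new gauges are easiest to inspect by scraping the Prometheus endpoint. A minimal sketch, assuming a server launched with --enable-metrics on the default port; the exact metric names (for example, the KV-pool token counts from #22726) vary by version:

```python
# Hedged sketch: dump token-related Prometheus metrics from a running server.
import requests

body = requests.get("http://localhost:30000/metrics", timeout=5).text
for line in body.splitlines():
    # Skip HELP/TYPE comment lines; print anything that looks token-related.
    if not line.startswith("#") and "token" in line:
        print(line)
```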

SGLang-Diffusion

  • New model support: LTX-2.3 (#22182, #22667, #22869), ERNIE-Image (#22439), FLUX.2-small-decoder (#22414), JoyAI-Image-Edit (#22625), FLUX.1-dev ModelOpt NVFP4 (#22672), Qwen Image ModelOpt FP8 (#23155), Stable Diffusion 3 medium (#19225)
  • ModelOpt diffusion FP8 support for Flux1/Flux2 and Wan2.2: #22365
  • Standalone Rollout API + Denoising Environment Backpass + SP-Aligned Log-Prob for T2I post-training: #22604
  • Disaggregated diffusion: #21701
  • Dynamic batching v0: #18764
  • CPU platform support for SGLang Diffusion: #20816
  • AITER backends in Flux 2 pipeline (AMD): #22802
  • LTX-2 feed-forward tensor parallelism optimization: #23221
  • In-memory loading for URL/base64 image inputs, now the default: #23118
  • Mixed-resolution benchmark support: #20863
  • Auto-enable best parallel setting if unspecified: #22763

AMD

  • MiniMax-M2.5 optimizations (aiter biased grouped topk; fused FP8 KV cache write): #23611, #23620
  • Fused QK Gemma norm kernels, reducing four kernel launches to fewer: #23575
  • Fused all-reduce + RMSNorm simplification: #21986
  • GLM-5 / GLM-5.1 MXFP4 nightly accuracy + perf benchmarks (MI30x / MI35x): #21773, #22336
  • MTP for GLM-5-mxfp4: #23219
  • Aiter v0.1.12.post1 upgrade: #22264
  • DFLASH speculative decoding enabled on ROCm: #22342
  • Fix --page-size > 1 memory access fault with speculative decoding: #23596

NPU / Ascend

  • Ascend backend supports Qwen3 MoE attention CP: #21685
  • GLM-4.5V and GLM-4.7-Flash NPU support / fixes: #22961, #22509
  • MTP for Qwen3.5: #20918
  • TP communications compression for Qwen3 on NPU: #20520
  • Add support-new-models documentation for NPU: #23824
  • GGUF quantization for Ascend NPU (dense + MoE): #17883

CPU

  • GPTQ / AWQ 4-bit quantization on CPU: #22685
  • gemma4_rmsnorm_cpu kernel: #22842
  • Qwen3.5 model optimization for CPU: #19484
  • Apply routed scaling factor on output for biased grouped topk fusion: #22413
  • Fix extend_attention_cpu / flash_attn_varlen_func NaN for long sequences: #22434

Quantization

  • MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs: #19143 (later reverted in #23031, follow-up forthcoming)
  • NVFP4 KV cache: quantization strategy abstraction and kernel (see the sketch after this list): #21954
  • DeepSeek-R1-0528-w4a8 + DeepEP Low-Latency FP8 dispatch: #22316
  • MXFP8 sm100 path cleanup: #21881
  • GLM-5/5.1 MXFP4 checkpoint inference compatibility fix: #22543
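
A hedged sketch of KV-cache quantization through the existing flags. #21954 adds an NVFP4 strategy behind a new abstraction; its user-facing dtype string is not documented here, so the sketch sticks to the established FP8 option:

```python
# Hedged sketch: FP8 checkpoint with an FP8-quantized KV cache.
# "fp8_e4m3" is an existing kv_cache_dtype value; the NVFP4 KV-cache path
# from #21954 presumably adds another, not shown here.
import sglang as sgl

engine = sgl.Engine(
    model_path="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",  # placeholder FP8 model
    kv_cache_dtype="fp8_e4m3",
)
```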

All PRs included in this release: v0.5.10.post1...v0.5.11

