Highlights
- CUDA 13 + Torch 2.11: Default CUDA version moves to 13.0 across SGLang, sgl-kernel, and Docker images, and PyTorch is upgraded from 2.9 to 2.11 — modernizing the build matrix and unlocking newer kernels: #21247, #24162, #24183, #23593 (tracking issue #21498)
- Speculative Decoding V2 by default: Spec V2 (with overlap scheduling to hide CPU overhead) is now the default, materially reducing per-step CPU cost for the EAGLE/MTP/DFLASH paths: #21062
- Decode Radix Cache for PD Disaggregation: Decode-side prefix caching now works under prefill/decode disaggregation, recovering radix-cache hit rates and TTFT savings for long shared prefixes in disaggregated deployments: #19746
- Day-0 / New Model Support: Gemma 4, GLM-5.1, Qwen3.6, MiMo-V2.5 / V2.5-Pro, Ling-2.6-Flash, Mistral Medium 3.5, and Kimi-K2.6 — with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook: #21952, #23808, #23811, #23851, #23947, #23486, #23394
- DFLASH Speculative Decoding: New high-throughput spec-decode kernel from the kernel community, expanded across model backends and to AMD ROCm: #22077, #22358, #22342, #23553
- FA3 Kernels from the Kernel Community: Drop-in FA3 kernels contributed by the community, integrated alongside FA4 to give users a high-performance option that is easy to maintain: #20796
- LoRA support for DeepSeek-V3 and Kimi-K2: LoRA now works on the largest MLA-based MoE models, including DeepSeek-V3 MLA LoRA and Kimi K2 — enabling adapter-based fine-tuning of frontier-scale models: #22323, #22381
- Context Parallel (CP) Enhancements: All-reduce + RMSNorm fusion under CP for end-to-end speedups, plus support for `moe_dp_size = 1` paired with arbitrary `attention_cp_size`, so MoE and attention parallelism can be tuned independently: #21249, #22003
- FlashInfer CuteDSL MoE Runner Backend: New dedicated `FlashInferCuteDslMoE` layer for the standard FP4 MoE path, giving an additional high-performance fused-MoE option: #21339
New Model Support
Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.
- Gemma 4: #21952 (and follow-ups #22079, #24048, #22842; see cookbook)
- GLM-5.1: #22543, #23037 (see cookbook)
- Qwen3.6: #23486 (see cookbook)
- MiMo-V2.5 / MiMo-V2.5-Pro: #23808, #23811, #23851, #23945, #24118 (see cookbook)
- Ling-2.6-Flash: #23947 (see cookbook)
- Mistral Medium 3.5: see cookbook
- Kimi-K2.6: #23394, #23408 (see cookbook)
- Hunyuan v3 (Tencent, preview): #23533 (see cookbook)
- FLUX.1-dev ModelOpt NVFP4 (Diffusion): #22672 (see FLUX cookbook)
- FLUX.2-small-decoder (Diffusion): #22414 (see FLUX cookbook)
- Qwen Image ModelOpt FP8 (Diffusion): #23155 (see Qwen-Image cookbook)
- LTX-2.3 / LTX-2.3 two-stage / TI2V (Diffusion): #22182, #22667, #22869 (see LTX cookbook)
- Qwen3-ASR (chunk-based streaming): #22073, #22089
- Voxtral (Mistral speech-to-text): #21635
- Parakeet (NVIDIA Nemotron encoder): #23568
- Moss-VL: #23454
- SequenceClassification model architecture (powers the Score API): #22118
- Stable Diffusion 3 medium (Diffusion): #19225
- ERNIE-Image (Diffusion): #22439
- JoyAI-Image-Edit (Diffusion): #22625
Speculative Decoding
- DFLASH speculative decoding initial support: #22077
- DFLASH enabled across additional model backends: #22358
- DFLASH speculative decoding on AMD ROCm: #22342
- Spec V2 enabled by default with overlap scheduling: #21062
- Penalty support for Spec V2 overlap scheduling: #22049
- Adaptive `speculative_num_steps` for EAGLE topk=1: #21599
- Allow piecewise CUDA graph with speculative decoding: #22128
- Eagle3 / DFLASH aux hidden state capture during CUDA graph init fixed: #22836
- Split `accept_length` into `num_accepted_drafts` / `num_accepted_tokens`: #23962
- DFLASH speculative decoding documentation: #23553
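The `accept_length` split (#23962) follows the usual spec-decode accounting: each verification step accepts a matching prefix of the draft tokens and then always emits one bonus token sampled from the target model. A toy sketch of that bookkeeping (hypothetical names, greedy acceptance only — not SGLang's actual implementation):

```python
def count_accepted(draft_tokens, target_tokens):
    """Longest-prefix acceptance: accept draft tokens while they match
    the target model's tokens, then count one extra (bonus) token that
    the target model always contributes."""
    num_accepted_drafts = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        num_accepted_drafts += 1
    num_accepted_tokens = num_accepted_drafts + 1  # drafts + bonus token
    return num_accepted_drafts, num_accepted_tokens


# Two drafts match, so three tokens land this step.
print(count_accepted([5, 9, 2], [5, 9, 7, 4]))  # (2, 3)
```

Splitting the two counters keeps throughput metrics unambiguous: acceptance *rate* should be computed over drafts, while tokens-per-step includes the bonus token.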
PD Disaggregation
- Decode-side radix cache support: #19746
- Incremental transfer for Mooncake transfer engine: #24257
- Allow `PrefillDelayer` in disaggregated-prefill mode: #23588
- NIXL: heterogeneous TP KV transfer for non-MLA models (Step 1/2 for Qwen3.5): #22145
- NIXL: Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5): #22240
- Bug fixes for IntraNode NVLink, MTP-layer KV transfer, and disagg-prefill DP rank resolution: #23252, #23539, #22901, #22990
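The decode-side radix cache (#19746) matches an incoming request's token prefix against previously cached sequences. A toy token-level trie illustrating the lookup (SGLang's radix tree additionally compresses edges and stores KV-cache handles; all names here are hypothetical):

```python
class RadixNode:
    def __init__(self):
        self.children = {}  # token id -> RadixNode


class PrefixCache:
    """Toy prefix cache: one node per token, no eviction."""

    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens` —
        i.e. how many KV-cache entries could be reused."""
        node, length = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            length += 1
        return length
```

Recovering this lookup on the decode side is what restores TTFT savings for long shared prefixes under disaggregation: the matched prefix length is KV cache the decode worker no longer needs transferred or recomputed.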
Context Parallel & Parallelism
- All-reduce fusion support under CP: #21249
- `moe_dp_size = 1` paired with arbitrary `attention_cp_size`: #22003
- All-reduce fusion enabled for DSA models: #22390
- Replace all-reduce + `dp_scatter` with `reduce_scatterv` for DP attention: #22642
- Step3p5: optimize all-reduce in MoE layers: #22773
- Pipeline parallelism on Intel XPU: #23472
- OpenTelemetry tracing for pipeline parallelism: #23169
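The `reduce_scatterv` swap (#22642) works because a reduce-scatter leaves each rank with exactly the shard it would get from an all-reduce followed by a scatter, while moving less data. A pure-Python sketch of that equivalence (plain lists standing in for the real collectives; names hypothetical):

```python
def all_reduce_then_scatter(rank_buffers, splits):
    """The replaced pattern: sum everything on every rank, then
    scatter (possibly uneven) slices back out."""
    total = [sum(vals) for vals in zip(*rank_buffers)]
    out, offset = [], 0
    for n in splits:
        out.append(total[offset:offset + n])
        offset += n
    return out


def reduce_scatterv(rank_buffers, splits):
    """The replacement: each output slice is reduced directly,
    so no rank ever materializes the full reduced tensor."""
    out, offset = [], 0
    for n in splits:
        slice_sum = [sum(vals) for vals in
                     zip(*(buf[offset:offset + n] for buf in rank_buffers))]
        out.append(slice_sum)
        offset += n
    return out
```

The `v` (vector) variant matters for DP attention because ranks can hold different numbers of tokens, so the per-rank split sizes are uneven.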
LoRA
- DeepSeek-V3 MLA LoRA support and quantization-info refactor: #22323
- Kimi K2 LoRA support: #22381
- LoRADrainer to address high P99 TTFT: #17913
- Decoupled LoRA MoE backend with Marlin support: #21858
- Virtual experts for LoRA MoE (1/n): #22122, #24007
- CSGMV kernel offline auto-tuning: #20391
- Triton `sgemm` speedup with better grid selection: #22386
- Dual MoE CUDA graph capture for LoRA / non-LoRA batches: #22809
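All of the LoRA items above build on the same forward rule: the base projection plus a low-rank update, y = xW + (alpha/r) * xAB. A minimal pure-Python sketch of that rule (list-of-lists matrices and hypothetical names; real backends fuse this into batched kernels such as the CSGMV one tuned in #20391):

```python
def matmul(X, Y):
    """Naive dense matmul over list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]


def lora_forward(x, W, A, B, alpha):
    """y = xW + (alpha / r) * xAB, where A is d_in x r and B is r x d_out.

    The rank r is read from A's column count; alpha/r is the standard
    LoRA scaling applied to the low-rank delta.
    """
    r = len(A[0])
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * d for b, d in zip(brow, drow)]
            for brow, drow in zip(base, delta)]
```

Because the delta path is just two skinny matmuls per adapter, serving many adapters concurrently is cheap relative to the base projection — which is what makes adapter support on frontier-scale MoE models practical.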
Performance
- FA3 kernels from the kernel community: #20796
- Precompute FA3 `scheduler_metadata` to eliminate per-layer prepare cost: #21104
- Precompute `gemma_weight` to avoid a redundant add on every forward: #22673
- Eliminate attention DtoD copy by passing a pre-allocated output to FA: #21985
- Skip KV cache in FA backend for embedding mode: #21971
- O(1) `RadixKey` view for the EAGLE bigram key: #23106
- PCG inductor path optimization for FP8 models: #23227
- Combo-kernels for horizontal fusion: #21977
- Optimize Gemma4 VLM with PCG and fused RMSNorm + residual add + scalar: #24048
- Restore torch.compile fusion for topk postprocessing: #21771
- Reduce unnecessary kernels and copies in the NSA indexer: #22232
Observability
- Pending token count surfaced in the prefill log and `get_load`: #22480
- OpenTelemetry tracing for speculative decoding: #19545
- OpenTelemetry tracing for pipeline parallelism: #23169
- OpenTelemetry tracing in DiffGenerator: #21254
- Prometheus metrics endpoint for gRPC mode: #20801
- HTTP sidecar endpoints and FlushCache gRPC RPC for gRPC mode: #22500
- Raw KV cache pool token counts as Prometheus gauges: #22726
SGLang-Diffusion
- New model support: LTX-2.3 (#22182, #22667, #22869), ERNIE-Image (#22439), FLUX.2-small-decoder (#22414), JoyAI-Image-Edit (#22625), FLUX.1-dev ModelOpt NVFP4 (#22672), Qwen Image ModelOpt FP8 (#23155), Stable Diffusion 3 medium (#19225)
- ModelOpt diffusion FP8 support for Flux1/Flux2 and Wan2.2: #22365
- Standalone Rollout API + Denoising Environment Backpass + SP-Aligned Log-Prob for T2I post-training: #22604
- Disaggregated diffusion: #21701
- Dynamic batching v0: #18764
- CPU platform support for SGLang Diffusion: #20816
- AITER backends in Flux 2 pipeline (AMD): #22802
- LTX-2 feed-forward tensor parallelism optimization: #23221
- In-memory loading for URL/base64 image inputs (default): #23118
- Mixed-resolution benchmark support: #20863
- Auto-enable best parallel setting if unspecified: #22763
AMD
- MiniMax-M2.5 optimizations (aiter biased grouped topk; fused FP8 KV cache write): #23611, #23620
- Fused QK Gemma norm kernels (4 → fewer kernels): #23575
- Fused all-reduce + RMSNorm simplification: #21986
- GLM-5 / GLM-5.1 MXFP4 nightly accuracy + perf benchmarks (MI30x / MI35x): #21773, #22336
- MTP for GLM-5-mxfp4: #23219
- Aiter v0.1.12.post1 upgrade: #22264
- DFLASH speculative decoding enabled on ROCm: #22342
- Fix `--page-size > 1` memory access fault with speculative decoding: #23596
NPU / Ascend
- Ascend backend supports Qwen3 MoE attention CP: #21685
- GLM-4.5V and GLM-4.7-Flash NPU support / fixes: #22961, #22509
- MTP for Qwen3.5: #20918
- TP communications compression for Qwen3 on NPU: #20520
- Add support-new-models documentation for NPU: #23824
- GGUF quantization for Ascend NPU (dense + MoE): #17883
CPU
- GPTQ / AWQ 4-bit quantization on CPU: #22685
- `gemma4_rmsnorm_cpu` kernel: #22842
- Qwen3.5 model optimization for CPU: #19484
- Apply routed scaling factor on output for biased grouped topk fusion: #22413
- Fix `extend_attention_cpu` / `flash_attn_varlen_func` NaN for large sequences: #22434
Quantization
- MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs: #19143 (later reverted in #23031, follow-up forthcoming)
- NVFP4 KV cache: quantization strategy abstraction and kernel: #21954
- DeepSeek-R1-0528-w4a8 + DeepEP Low-Latency FP8 dispatch: #22316
- MXFP8 sm100 path cleanup: #21881
- GLM-5/5.1 MXFP4 checkpoint inference compatibility fix: #22543
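The formats above (MXFP4, NVFP4, FP8) differ in grid and scale granularity, but all share the same scale / round / dequantize structure. A toy symmetric per-tensor quantizer over an integer-like grid (an illustration of the structure only — not any of the listed kernels, which use floating-point grids and per-block scales):

```python
def quantize(values, num_bits=4):
    """Symmetric per-tensor quantization sketch: pick a scale so the
    largest magnitude maps to the top of the grid, then round and clamp."""
    qmax = 2 ** (num_bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # guard all-zero input
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    return [x * scale for x in q]
```

Round-trip error is bounded by half the scale per element, which is why per-block scales (as in the MX formats) tighten accuracy on outlier-heavy tensors: a smaller block shrinks the local max, and with it the scale.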
Dependencies
- Torch upgraded 2.9 → 2.11: #21247
- Default CUDA bumped to 13.0 across sglang, sgl-kernel, and Docker images: #21498 (tracking), #24162, #24183, #23593, #23119
- Flashinfer 0.6.7.post2 → 0.6.8.post1: #23281
- sgl-kernel bumped to 0.4.1.post1: #23720, #23733
- sgl-kernel bumped to 0.4.2: #24170
- Aiter v0.1.12.post1 (AMD): #22264
Security
- Fix for CVE-2026-5760: #23660
- Fix Trivy CVEs and cubin download 403s in Docker image: #22322
All PRs included in this release: v0.5.10.post1...v0.5.11
New Contributors
- @AethoceSora made their first contribution in #23426
- @AlbeeSo made their first contribution in #23710
- @alec-flowers made their first contribution in #24090
- @AlonKejzman made their first contribution in #23753
- @amacaskill made their first contribution in #22537
- @AndyLi429 made their first contribution in #21685
- @Baichuan7 made their first contribution in #23060
- @ccullen-cert made their first contribution in #23660
- @ChangLiu0709 made their first contribution in #22908
- @charlotte12l made their first contribution in #21983
- @chenkaiyue made their first contribution in #17195
- @chx96642264 made their first contribution in #22705
- @ColinZ22 made their first contribution in #22543
- @cyyc0310 made their first contribution in #22920
- @divyamagrawal06 made their first contribution in #23325
- @dyhsup made their first contribution in #22439
- @egvenediktov made their first contribution in #20520
- @erikwijmans made their first contribution in #21974
- @fengli1702 made their first contribution in #19143
- @fergusfinn made their first contribution in #21035
- @fortunecookiee made their first contribution in #20960
- @gxlvera made their first contribution in #19225
- @he-yufeng made their first contribution in #20739
- @Henson-Zh-Ali made their first contribution in #20522
- @icepoint666 made their first contribution in #22592
- @iridiumine made their first contribution in #20918
- @is-not made their first contribution in #18349
- @JasonHe-WQ made their first contribution in #21944
- @jh-nv made their first contribution in #21254
- @jiangyinzuo made their first contribution in #23169
- @JieTang66 made their first contribution in #23983
- @JoyFuture made their first contribution in #23808
- @jthakurH made their first contribution in #16793
- @kangyifei made their first contribution in #23241
- @kingkingleeljj made their first contribution in #20967
- @kkyyxhll made their first contribution in #23062
- @KrishnanPrash made their first contribution in #22175
- @lahmuller made their first contribution in #22625
- @lixuwei2333 made their first contribution in #22247
- @lkhl made their first contribution in #22431
- @loading66 made their first contribution in #22700
- @luccafong made their first contribution in #24165
- @mingyue300 made their first contribution in #21723
- @minosfuture made their first contribution in #23419
- @mispa-ms made their first contribution in #23097
- @mlleo made their first contribution in #23537
- @Napkin-AI made their first contribution in #23572
- @nvpohanh made their first contribution in #22852
- @officialasishkumar made their first contribution in #22600
- @opherlieber made their first contribution in #22547
- @ranjiewen made their first contribution in #21698
- @RichardoMrMu made their first contribution in #19545
- @robellliu-dev made their first contribution in #20835
- @SammLSH made their first contribution in #22089
- @Seven-Streams made their first contribution in #21722
- @shenxiul made their first contribution in #23327
- @siju-samuel made their first contribution in #23472
- @stepinto made their first contribution in #23478
- @tfhddd made their first contribution in #22029
- @vvagaytsev made their first contribution in #22363
- @WangHao-hw made their first contribution in #22778
- @Wen-xuan-Xu made their first contribution in #22923
- @xiaobochen-amd made their first contribution in #22626
- @yaya159456 made their first contribution in #21694
- @YMbmzy made their first contribution in #22049
- @yuki-brook made their first contribution in #18016
- @Zaire404 made their first contribution in #22982
- @ZeyuanChen2000 made their first contribution in #21543
- @zhaozx-cn made their first contribution in #22266
- @zhsurpass made their first contribution in #22697
- @zsj555 made their first contribution in #23454
Full Changelog: v0.5.10.post1...v0.5.11