sgl-project/sglang v0.5.13 on GitHub

Highlights

New Model Support:

Autoregressive: Nemotron 3 Ultra (Day-0, blog), Step-3.7-Flash, Command A+
Diffusion: Cosmos3, LingBot-World, SANA-WM, Ernie-Image, FLUX.2-Klein 4B/9B, Ideogram 4

Spec V2 is now the default speculative-decoding path: Tree drafting with topk > 1 is production-ready across the triton / FA3 / MLA / aiter backends, including page_size > 1 and Mamba/hybrid-linear models (#26997, #26972, #27463). Spec V1 is deprecated, with EAGLE/MTP now running on the unified V2 worker (#25464), and topk = 1 drafting is faster (#26397, #26424).

Lower per-step scheduler overhead: Unifying async value passing through FutureMap and moving prefill input transfer onto the forward stream reduce per-step launch overhead and improve stability under high concurrency (#25945, #25879, #26380).

Piecewise & Breakable CUDA Graph coverage: Piecewise (PCG) and Breakable (BCG) CUDA Graphs capture more of the model to cut per-step kernel-launch overhead. Coverage now extends to DSA models (PCG, #23351), Kimi-K2.5 (PCG, #26382), and DeepSeek V4 (BCG, #25195).

Faster Qwen 3.5 on Blackwell: New FlashInfer Gated DeltaNet (GDN) kernels and a CuTeDSL GDN prefill kernel speed up Qwen 3.5 on Blackwell GPUs: #22921, #23273, #26200.

HiCache for hybrid models by default: HybridModel (SWA/Mamba) launches HiCache through UnifiedTree by default, bringing hierarchical KV-cache offload to sliding-window and Mamba hybrids out of the box: #27759.

Heterogeneous CPU + GPU EPD disaggregation (with Intel): Offload VLM vision encoding onto Intel Xeon CPUs alongside GPUs, improving both P99 TTFT and request throughput by up to ~1.3x under load. (blog)

MoRI on AMD Instinct MI355X (with AMD): Cost-competitive DeepSeek-R1 disaggregated inference via AMD's MoRI communication library, at $0.169 per million tokens and 129 tok/s/user. (blog)

DeepSeek V4 context parallelism & sparse-attention kernels: Building on the v0.5.12 Day-0 path, v0.5.13 extends DeepSeek V4 to context-parallel serving and adds its sparse-attention kernels:

Context Parallel + MTP: #24934
Context Parallel + fused MoE kernel (non-DeepEP): #24947
Sparse FlashMLA via flash_mla_sparse_fwd: #25418
FP4 indexer support: #26209
SM120 support: #24692
DeepEP waterfill load balancing: #25391
MHC kernel warmup: #25810
Breakable CUDA Graph for DeepSeek V4: #25195
Backed by sgl-kernel 0.4.3 exposing sgl_kernel.flashmla: #26421, #26132

See the DeepSeek-V4 cookbook for tuned deployment commands.

SGLang-Diffusion — realtime & progressive resolution: OpenAI-style realtime video generation with msgpack frame streaming and a standalone browser WebUI (#26954, #26959), continuous camera controls + super-resolution controls (#27026, #27297), and progressive-resolution growing across FLUX / FLUX.2 / Qwen-Image / Wan / Z-Image (#27524).

Full release notes by category below.

New Model Support

Nemotron 3 Ultra (Day-0, kernel optimizations): #26733 (see cookbook)
Step-3.7-Flash: #26565 (see cookbook)
Cosmos3-Nano / Cosmos3-Super — T2V / I2V / T2I (Diffusion): #24994, #26492, #26926, #26950 (see cookbook)
Ernie-Image (Diffusion): #22439, #27195 (see cookbook)
LingBot-World — realtime / causal-DMD generation (Diffusion): #26954 (see cookbook)
SANA-WM — streaming + realtime (Diffusion): #27531 (see cookbook)
FLUX.2-Klein 4B / 9B (Diffusion): #25661 (see cookbook)
Ideogram 4 — FP8 / NVFP4 with tensor parallelism (Diffusion): #27279, #27379, #27393 (see cookbook)
Command A+ (Cohere 2 family): #26106, #27401 (cookbook page pending)

DeepSeek V4

Context Parallel + MTP: #24934
Context Parallel + fused MoE kernel (non-DeepEP): #24947
MHC kernel warmup: #25810
SM120 support: #24692
FP4 indexer support: #26209
Integrate flash_mla_sparse_fwd kernel: #25418
DeepEP waterfill support: #25391
Breakable CUDA Graph for DeepSeek V4: #25195

Speculative Decoding

Spec V2 is now the default speculative-decoding path
Tree speculative drafting (topk > 1) on Spec V2, including page_size > 1 and Mamba/hybrid-linear models, validated across the triton / FA3 / MLA / aiter backends: #26997, #26972, #27463
Spec V1 deprecated; EAGLE/MTP run on the unified V2 worker: #25464
Spec V2 extended to the FlashMLA backend: #24640
Adaptive speculative decoding: batch-size-aware num_steps + observability metrics: #24055, #25940
Faster topk = 1 drafting (skip full-vocab softmax + redundant cat/topk/sort/gather ops): #26397, #26424
Draft-extend CUDA Graph for the trtllm mha attention backend: #25002
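
Why the topk = 1 path can skip the full-vocab softmax: softmax preserves the ordering of the logits, so the single most likely draft token is the argmax of the raw logits. The sketch below is a plain-Python illustration of that property only. It is not the SGLang kernel code, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_via_softmax(logits):
    """Baseline: normalize the whole vocabulary, then pick the best index."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

def top1_via_argmax(logits):
    """Shortcut: softmax is monotonic, so argmax over raw logits is enough."""
    return max(range(len(logits)), key=logits.__getitem__)

# Toy logits standing in for one row of a draft model's output.
logits = [1.5, -0.3, 4.2, 4.1, 0.0]
assert top1_via_softmax(logits) == top1_via_argmax(logits)
```

With topk > 1 the draft scores of competing branches are compared against each other, so this shortcut is specific to the single-candidate case.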

Piecewise & Breakable CUDA Graph

PCG support for DSA models: #23351
PCG support for Kimi-K2.5: #26382
BCG support for DeepSeek V4: #25195

Context Parallelism

Support bs > 1 for prefill CP: #23269
Prefill CP for MLA models (Kimi K2.5, DeepSeek V3): #23292

Attention Backends

CuTeDSL MLA attention kernels from FlashInfer: #24737
Qwen 3.5: FlashInfer GDN kernels on Blackwell: #22921, #23273
Qwen 3.5: CuTeDSL GDN prefill kernel on Blackwell: #26200

Scheduler & Runtime

Overlap Scheduler: fewer CPU–GPU sync points
Unified async value passing through FutureMap; prefill input transfer moved onto the forward stream, reducing per-step launch overhead and improving stability under high concurrency: #25945, #25879, #26380
Runtime memory-safety checks enabled by default in CI to guard correctness continuously: #27461, #26335

HiCache & Radix Cache

HybridModel (SWA/Mamba) launches HiCache via UnifiedTree by default: #27759

PD Disaggregation

Optimistic prefill for better TTFT: #26780
Decode-side HiCache integration for incremental KV-cache transfer: #26227
HiSparse support for DeepSeek V4 with PD: #24880
Notify and cancel KV-cache transfer for aborted requests: #27372
Pipeline Parallelism (PP) + PD support for DeepSeek-V4: #24704
EPD disaggregation support for MiMo-V2: #24931
Tolerate KV pools without end_layer (Qwen3-Next disagg): #25476
Unstick decode aborts under prealloc pressure: #25561
Un-blacklist mooncake sessions when a probe succeeds: #25287
HiSparse + PD: support host memory-pool page > 1: #23606
Support regular worker discovery alongside PD workers in IGW mode: #25294

LoRA

Experimental fast LoRA path with experimental_sgl_trtllm MoE backend for FP8 and NVFP4 models: #27329
Remove synchronous .any().item() guard in the LoRA MoE prefill path: #25531
Share MoE LoRA info: #24160
More efficient pinned memory: #20876
Fix overlap loading for cancelled requests: #25413
Fix LoRA pool not appearing in /v1/loads: #25440

Multimodal

Gemma 4: encoder-free variant unifying text, vision, and audio in one model: #27167 (see cookbook)

SGLang-Diffusion

Realtime diffusion

OpenAI-style realtime video generation: #26954
Msgpack frame streaming + standalone browser WebUI: #26959
Continuous camera controls + super-resolution controls: #27026, #27297
Lossless RGB transport improvements: #27236

Progressive resolution

Progressive-resolution growing for image and video generation (FLUX, FLUX.2, Qwen Image, Wan, Z-Image): #27524

Memory & residency

Layerwise offload generalized beyond legacy DiT components: #24593
Memory-aware component load order: #25457
Encoder layerwise-offload defaults: #25517
Role-based component loading + stage affinity: #25168
Combined DiT CPU offload + layerwise offload: #26925

Quantization & backends

Ideogram 4 FP8 / NVFP4 support: #27279, #27379
FlashInfer TRTLLM as the default diffusion NVFP4 backend: #25523
Wan2.2 ModelOpt checkpoint updates: #25483, #25857

Performance

Cosmos3 serve / denoising / I2V / parallel-decode / fused QKNorm optimizations: #26926, #26973, #27037, #27041, #27084, #27096
LingBot realtime transport / SP cache path / camera-conditioning optimizations: #27023, #27297, #27383
USP varlen-FA fast path + replicated-KV prefix all-to-all optimization: #26318, #27143
UniPC scheduler GPU-sync removal: #27440

AMD / ROCm

[AMD][DSV4] DSV4 MTP graph + sparse triton attention optimizations: #26383
[AMD] DSV4 compressor optimization: #26208
[AMD] Enable shared-experts fusion with the new Kimi-K2.5-MXFP4 model: #25390
[AMD][aiter] Fix cuda_graph_kv_indices OOB under page_size > 1: #24587
[AMD] Upgrade AITER: #25896

NPU / Ascend

[NPU] Support chunked prefill for Qwen3.5 / Qwen3.6 models: #25839
[NPU] Use Triton split_qkvgate_gemma_rmsnorm_rope for Qwen3.5 and Qwen3-Next: #23925
[NPU] Support DeepSeek-OCR and DeepSeek-OCR-2: #25257
[Diffusion][NPU] Add attention backends for diffusion models on Ascend NPU: #23482
[Diffusion][NPU][Quant] Add MXFP4 quantization support for Wan2.2 on Ascend NPU: #22338
[Diffusion][NPU] Disaggregation diffusion-stage support for NPU: #25895

CPU / Intel / MUSA / MLX

[MLX] Support Qwen3.5 (dense) model: #25754
[CPU] Add support for Qwen3-VL and Qwen3-Omni: #12662
[CPU] Add GPT-OSS model optimization for CPU: #16775
[CPU] Faster KV-cache writes: #25874
[CPU] Explicitly enable AVX512 & AMX instruction sets: #26145
[Xeon] CPU CI enhancement for Intel Xeon platforms: #24649
[MUSA][Diffusion] Improve Wan model inference speed using torch.compile: #25256

Dependencies

transformers 5.6.0 → 5.8.1: #25451
flashinfer 0.6.11.post1 → 0.6.12: #26854
xgrammar 0.2.0 → 0.2.1: #25676
sgl-kernel 0.4.2.post2 → 0.4.3 (sgl_kernel.flashmla + DeepSeek V4 kernels): #26421, #26132
nvidia-cutlass-dsl 4.5.1 → 4.5.2: #26854
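
To confirm that an environment picked up these bumps, a small check along the following lines may help. It is a sketch under assumptions: the target versions come from the list above, but the distribution names (in particular `flashinfer-python` and `sgl-kernel`) are guesses at how each package is published and may need adjusting. The comparison is an exact string match, so builds with local suffixes will be reported as mismatches.

```python
from importlib.metadata import PackageNotFoundError, version

# Target versions from the Dependencies list. Distribution names are
# assumptions, not confirmed by the release notes.
EXPECTED = {
    "transformers": "5.8.1",
    "flashinfer-python": "0.6.12",
    "xgrammar": "0.2.1",
    "sgl-kernel": "0.4.3",
    "nvidia-cutlass-dsl": "4.5.2",
}

def installed_versions(names):
    """Return {name: installed version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

def find_mismatches(installed, expected=EXPECTED):
    """Return {name: (installed, expected)} for every package that differs."""
    return {
        name: (installed.get(name), want)
        for name, want in expected.items()
        if installed.get(name) != want
    }

if __name__ == "__main__":
    mismatches = find_mismatches(installed_versions(EXPECTED))
    for name, (have, want) in mismatches.items():
        print(f"{name}: installed {have}, release notes list {want}")
    if not mismatches:
        print("All listed dependencies match.")
```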

Security

No security-tagged PRs in this release.

All PRs included in this release: v0.5.12...v0.5.13

New Contributors

@jasonjk-park made their first contribution in #24999
@JoeLee314 made their first contribution in #25380
@zhengluo-nv made their first contribution in #24723
@yuychang made their first contribution in #25260
@amd-bishwoadhikari made their first contribution in #22371
@miamia0 made their first contribution in #25180
@Xia-Weiwen made their first contribution in #21668
@JINO-ROHIT made their first contribution in #25178
@Gruner-atero made their first contribution in #25293
@EanWang211123 made their first contribution in #23331
@nagisa-kunhah made their first contribution in #24640
@kflansburg made their first contribution in #25287
@xiaobao520123 made their first contribution in #25786
@zjd0112 made their first contribution in #25770
@jiayisunx made their first contribution in #25730
@rllin made their first contribution in #25807
@moehanabi made their first contribution in #21191
@abinggo made their first contribution in #24751
@alex0dd made their first contribution in #25661
@FredHuang99 made their first contribution in #25168
@zhangtao2-1 made their first contribution in #25600
@longxin9715 made their first contribution in #26069
@nadongjun made their first contribution in #24610
@hanwlax made their first contribution in #25856
@vuuihc made their first contribution in #26177
@yuhuiaws made their first contribution in #25404
@ckvermaAI made their first contribution in #23757
@chengchao23 made their first contribution in #19493
@0-693 made their first contribution in #21544
@vguduruTT made their first contribution in #22627
@jbschlosser made their first contribution in #25911
@nv-dmajchrowski made their first contribution in #24994
@jvzibro made their first contribution in #26470
@Shaoting-Feng made their first contribution in #24089
@yao-matrix made their first contribution in #25174
@hippothewild made their first contribution in #25973
@cquil11 made their first contribution in #26590
@jmamou made their first contribution in #24149
@SKRohit made their first contribution in #26257
@LucQueen made their first contribution in #22587
@whn09 made their first contribution in #25083
@BangBOOM made their first contribution in #26672
@fcczzz made their first contribution in #25880
@rbrugaro-amd made their first contribution in #25463
@akhoroshev made their first contribution in #26673
@KaisennHu made their first contribution in #24667
@mpdfdfl made their first contribution in #26303
@ntgiang71096 made their first contribution in #25669
@decajoin made their first contribution in #26925
@krishung5 made their first contribution in #25300
@AliceChenyy made their first contribution in #24692
@prakashkagitha made their first contribution in #25813
@thanhhao98 made their first contribution in #26384
@pengdurice made their first contribution in #26045
@Li-brua made their first contribution in #27028
@iyastreb made their first contribution in #26406
@Ronnie-Rui made their first contribution in #27011
@yeqcharlotte made their first contribution in #27085
@JonnyKong made their first contribution in #26768
@zhenghax made their first contribution in #26969
@zx3xyy made their first contribution in #26746
@ilia-iliev made their first contribution in #25292
@bowenwan6 made their first contribution in #26864
@akelch11 made their first contribution in #26859
@gogongxt made their first contribution in #27374
@256256mjw made their first contribution in #26444
@Fatemanx made their first contribution in #27279
@maodoudou168 made their first contribution in #24055
@L4-1024 made their first contribution in #26356
@Stella-17 made their first contribution in #27242
