github sgl-project/sglang v0.5.13


Highlights

New Model Support:

Spec V2 is now the default speculative-decoding path: Tree drafting with topk > 1 is production-ready across the triton / FA3 / MLA / aiter backends, including page_size > 1 and Mamba/hybrid-linear models (#26997, #26972, #27463). Spec V1 is deprecated, with EAGLE/MTP now running on the unified V2 worker (#25464), and topk = 1 drafting is faster (#26397, #26424).

Lower per-step scheduler overhead: Unifying async value passing through FutureMap and moving prefill input transfer onto the forward stream reduce per-step launch overhead and improve stability under high concurrency (#25945, #25879, #26380).

Piecewise & Breakable CUDA Graph coverage: Piecewise (PCG) and Breakable (BCG) CUDA Graph capture more of the model to cut per-step kernel-launch overhead, now extended to DSA models, Kimi-K2.5, and DeepSeek V4: #23351, #26382, #25195.

Faster Qwen 3.5 on Blackwell: New FlashInfer Gated DeltaNet (GDN) kernels and a CuTeDSL GDN prefill kernel speed up Qwen 3.5 on Blackwell GPUs: #22921, #23273, #26200.

HiCache for hybrid models by default: HybridModel (SWA/Mamba) launches HiCache through UnifiedTree by default, bringing hierarchical KV-cache offload to sliding-window and Mamba hybrids out of the box: #27759.

Heterogeneous CPU + GPU EPD disaggregation (with Intel): Offload VLM vision encoding onto Intel Xeon CPUs alongside GPUs, with gains of up to roughly 1.3x in P99 TTFT and request throughput under load. (blog)

MoRI on AMD Instinct MI355X (with AMD): Cost-competitive DeepSeek-R1 disaggregated inference via AMD's MoRI communication library, at $0.169 per million tokens and 129 tok/s/user. (blog)
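
To put the two MoRI figures on a common footing, the arithmetic below converts them into per-user-hour terms. It uses only the numbers quoted above. Reading the price as applying to a single continuous 129 tok/s stream is an assumption, and the function name is illustrative rather than part of SGLang.

```python
def cost_per_user_hour(usd_per_million_tokens: float, tokens_per_second: float) -> float:
    """Cost of one hour of continuous generation for a single user stream."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_million_tokens * tokens_per_hour / 1_000_000


# Figures quoted in the highlight above.
tokens_per_hour = 129 * 3600                 # 464,400 tokens per user-hour
hourly = cost_per_user_hour(0.169, 129)      # about $0.0785 per user-hour

print(f"{tokens_per_hour} tokens/hour, ${hourly:.4f} per user-hour")
```

Under that assumption, one user streaming continuously for an hour consumes a little under half a million tokens, which is why the hourly cost lands below eight cents.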

DeepSeek V4 — context parallelism & sparse-attention kernels: Building on the v0.5.12 Day-0 path, v0.5.13 extends DeepSeek-V4 to context-parallel serving and adds its sparse-attention kernels:

  • Context Parallel + MTP: #24934
  • Context Parallel + fused MoE kernel (non-DeepEP): #24947
  • Sparse FlashMLA via flash_mla_sparse_fwd: #25418
  • FP4 indexer support: #26209
  • SM120 support: #24692
  • DeepEP waterfill load balancing: #25391
  • MHC kernel warmup: #25810
  • Breakable CUDA Graph for DeepSeek V4: #25195
  • Backed by sgl-kernel 0.4.3 exposing sgl_kernel.flashmla: #26421, #26132

See the DeepSeek-V4 cookbook for tuned deployment commands.

SGLang-Diffusion — realtime & progressive resolution: OpenAI-style realtime video generation with msgpack frame streaming and a standalone browser WebUI (#26954, #26959), continuous camera controls + super-resolution controls (#27026, #27297), and progressive-resolution growing across FLUX / FLUX.2 / Qwen-Image / Wan / Z-Image (#27524).


Full release notes by category below.

New Model Support

DeepSeek V4

  • Context Parallel + MTP: #24934
  • Context Parallel + fused MoE kernel (non-DeepEP): #24947
  • MHC kernel warmup: #25810
  • SM120 support: #24692
  • FP4 indexer support: #26209
  • Integrate flash_mla_sparse_fwd kernel: #25418
  • DeepEP waterfill support: #25391
  • Breakable CUDA Graph for DeepSeek V4: #25195

Speculative Decoding

  • Spec V2 is now the default speculative-decoding path
  • Tree speculative drafting (topk > 1) on Spec V2 — page_size > 1 and Mamba/hybrid-linear, validated across triton / FA3 / MLA / aiter: #26997, #26972, #27463
  • Spec V1 deprecated; EAGLE/MTP run on the unified V2 worker: #25464
  • Spec V2 extended to the FlashMLA backend: #24640
  • Adaptive speculative decoding: batch-size-aware num_steps + observability metrics: #24055, #25940
  • Faster topk = 1 drafting (skip full-vocab softmax + redundant cat/topk/sort/gather ops): #26397, #26424
  • Draft-extend CUDA Graph for the trtllm mha attention backend: #25002
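
The topk = 1 speed-up listed above rests on a simple property: softmax is monotonic, so the highest-probability token is also the highest-logit token, and the normalization can be skipped when only one candidate is kept. The following is a minimal pure-Python illustration of that property, not SGLang's kernel code; all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_via_softmax(logits):
    """Pick the best token the slow way: normalize, then take the argmax."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def top1_via_argmax(logits):
    """Pick the best token directly from the raw logits."""
    return max(range(len(logits)), key=logits.__getitem__)


logits = [1.2, -0.7, 3.4, 0.1, 2.9]
assert top1_via_softmax(logits) == top1_via_argmax(logits) == 2
```

With topk > 1 the same shortcut still selects the same candidate set, but any step that needs the actual probabilities (for example, scoring tree branches) would still require the normalization.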

Piecewise & Breakable CUDA Graph

  • PCG support for DSA models: #23351
  • PCG support for Kimi-K2.5: #26382
  • BCG support for DeepSeek V4: #25195

Context Parallelism

  • Support bs > 1 for prefill CP: #23269
  • Prefill CP for MLA models (Kimi K2.5, DeepSeek V3): #23292

Attention Backends

  • CuTeDSL MLA attention kernels from FlashInfer: #24737
  • Qwen 3.5: FlashInfer GDN kernels on Blackwell: #22921, #23273
  • Qwen 3.5: CuTeDSL GDN prefill kernel on Blackwell: #26200

Scheduler & Runtime

  • Overlap Scheduler: fewer CPU–GPU sync points
  • Unified async value passing through FutureMap; prefill input transfer moved onto the forward stream, reducing per-step launch overhead and improving stability under high concurrency: #25945, #25879, #26380
  • Runtime memory-safety checks enabled by default in CI to guard correctness continuously: #27461, #26335

HiCache & Radix Cache

  • HybridModel (SWA/Mamba) launches HiCache via UnifiedTree by default: #27759
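
As a rough mental model of hierarchical KV-cache offload, the toy below keeps a small fast tier and spills least-recently-used entries to a larger slow tier instead of discarding them. It is a sketch under loud assumptions: the class, its methods, and the LRU policy are hypothetical and do not reflect HiCache's or UnifiedTree's actual data structures.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy hierarchical cache: a small fast tier that spills LRU entries to a slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()   # stands in for device memory
        self.slow = {}              # stands in for host memory

    def put(self, key, value):
        self.slow.pop(key, None)    # a key lives in exactly one tier
        self.fast[key] = value
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value   # offload instead of discarding

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:
            value = self.slow.pop(key)
            self.put(key, value)             # promote back to the fast tier
            return value
        return None


cache = TwoTierCache(fast_capacity=2)
for prefix in ("a", "b", "c"):
    cache.put(prefix, f"kv-{prefix}")

# "a" was least recently used, so it now lives in the slow tier.
assert "a" in cache.slow and list(cache.fast) == ["b", "c"]
assert cache.get("a") == "kv-a"              # slow-tier hit, promoted back
assert list(cache.fast) == ["c", "a"] and "b" in cache.slow
```

The point of the second tier is that a later request sharing the evicted prefix pays a copy instead of a full recompute.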

PD Disaggregation

  • Optimistic prefill for better TTFT: #26780
  • Decode-side HiCache integration for incremental KV-cache transfer: #26227
  • HiSparse support for DeepSeek V4 with PD: #24880
  • Notify and cancel KV-cache transfer for aborted requests: #27372
  • Pipeline Parallelism (PP) + PD support for DeepSeek-V4: #24704
  • EPD disaggregation support for MiMo-V2: #24931
  • Tolerate KV pools without end_layer (Qwen3-Next disagg): #25476
  • Unstick decode aborts under prealloc pressure: #25561
  • Un-blacklist mooncake sessions when a probe succeeds: #25287
  • HiSparse + PD: support host memory-pool page > 1: #23606
  • Support regular worker discovery alongside PD workers in IGW mode: #25294

LoRA

  • Experimental fast LoRA path with experimental_sgl_trtllm MoE backend for FP8 and NVFP4 models: #27329
  • Remove synchronous .any().item() guard in the LoRA MoE prefill path: #25531
  • Share MoE LoRA info: #24160
  • More efficient pinned memory: #20876
  • Fix overlap loading for cancelled requests: #25413
  • Fix LoRA pool not appearing in /v1/loads: #25440

Multimodal

  • Gemma 4: encoder-free variant unifying text, vision, and audio in one model: #27167 (see cookbook)

SGLang-Diffusion

Realtime diffusion

  • OpenAI-style realtime video generation: #26954
  • Msgpack frame streaming + standalone browser WebUI: #26959
  • Continuous camera controls + super-resolution controls: #27026, #27297
  • Lossless RGB transport improvements: #27236

Progressive resolution

  • Progressive-resolution growing for image and video generation (FLUX, FLUX.2, Qwen Image, Wan, Z-Image): #27524

Memory & residency

  • Layerwise offload generalized beyond legacy DiT components: #24593
  • Memory-aware component load order: #25457
  • Encoder layerwise-offload defaults: #25517
  • Role-based component loading + stage affinity: #25168
  • Combined DiT CPU offload + layerwise offload: #26925

Quantization & backends

  • Ideogram 4 FP8 / NVFP4 support: #27279, #27379
  • FlashInfer TRTLLM as the default diffusion NVFP4 backend: #25523
  • Wan2.2 ModelOpt checkpoint updates: #25483, #25857

Performance

AMD / ROCm

  • [AMD][DSV4] DSV4 MTP graph + sparse triton attention optimizations: #26383
  • [AMD] DSV4 compressor optimization: #26208
  • [AMD] Enable shared-experts fusion with the new Kimi-K2.5-MXFP4 model: #25390
  • [AMD][aiter] Fix cuda_graph_kv_indices OOB under page_size > 1: #24587
  • [AMD] Upgrade AITER: #25896

NPU / Ascend

  • [NPU] Support chunked prefill for Qwen3.5 / Qwen3.6 models: #25839
  • [NPU] Use Triton split_qkvgate_gemma_rmsnorm_rope for Qwen3.5 and Qwen3-Next: #23925
  • [NPU] Support DeepSeek-OCR and DeepSeek-OCR-2: #25257
  • [Diffusion][NPU] Add attention backends for diffusion models on Ascend NPU: #23482
  • [Diffusion][NPU][Quant] Add MXFP4 quantization support for Wan2.2 on Ascend NPU: #22338
  • [Diffusion][NPU] Disaggregation diffusion-stage support for NPU: #25895

CPU / Intel / MUSA / MLX

  • [MLX] Support Qwen3.5 (dense) model: #25754
  • [CPU] Add support for Qwen3-VL and Qwen3-Omni: #12662
  • [CPU] Add GPT-OSS model optimization for CPU: #16775
  • [CPU] Faster KV-cache writes: #25874
  • [CPU] Explicitly enable AVX512 & AMX instruction sets: #26145
  • [Xeon] CPU CI enhancement for Intel Xeon platforms: #24649
  • [MUSA][Diffusion] Improve Wan model inference speed using torch.compile: #25256

Dependencies

  • transformers 5.6.0 → 5.8.1: #25451
  • flashinfer 0.6.11.post1 → 0.6.12: #26854
  • xgrammar 0.2.0 → 0.2.1: #25676
  • sgl-kernel 0.4.2.post2 → 0.4.3 (sgl_kernel.flashmla + DeepSeek V4 kernels): #26421, #26132
  • nvidia-cutlass-dsl 4.5.1 → 4.5.2: #26854
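
To check an existing environment against the versions listed above, a small script along these lines may help. It is a sketch: the distribution names in EXPECTED are assumptions about how each package is published (in particular "flashinfer-python"), so adjust them to match your install. Only the standard-library importlib.metadata module is used.

```python
from importlib import metadata

# Target versions from the dependency list above. The distribution names are
# assumptions; adjust them if your environment uses different package names.
EXPECTED = {
    "transformers": "5.8.1",
    "flashinfer-python": "0.6.12",
    "xgrammar": "0.2.1",
    "sgl-kernel": "0.4.3",
    "nvidia-cutlass-dsl": "4.5.2",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(expected=EXPECTED):
    """Compare installed versions with the expected ones, one row per package."""
    rows = []
    for name, want in expected.items():
        have = installed_version(name)
        if have is None:
            status = "missing"
        elif have == want:
            status = "ok"
        else:
            status = "differs"
        rows.append((name, want, have, status))
    return rows


for name, want, have, status in report():
    print(f"{name:20s} expected {want:8s} installed {have or '-':14s} {status}")
```

The comparison is an exact string match, so a newer post-release build is reported as "differs" rather than treated as compatible.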

Security

No security-tagged PRs in this release.


All PRs included in this release: v0.5.12...v0.5.13

New Contributors

Full Changelog: v0.5.12...v0.5.13
