Highlights
New Model Support:
- Autoregressive: Nemotron 3 Ultra (Day-0, blog), Step-3.7-Flash, Command A+
- Diffusion: Cosmos3, LingBot-World, SANA-WM, Ernie-Image, FLUX.2-Klein 4B/9B, Ideogram 4
Spec V2 is now the default speculative-decoding path: Tree drafting with topk > 1 is production-ready across the triton / FA3 / MLA / aiter backends, including page_size > 1 and Mamba/hybrid-linear models (#26997, #26972, #27463). Spec V1 is deprecated, with EAGLE/MTP now running on the unified V2 worker (#25464), and topk = 1 drafting is faster (#26397, #26424).
Lower per-step scheduler overhead: Unified async value passing through FutureMap plus moving prefill input transfer onto the forward stream reduced per-step launch overhead and improved stability under high concurrency (#25945, #25879, #26380).
Piecewise & Breakable CUDA Graph coverage: Piecewise (PCG) and Breakable (BCG) CUDA Graph capture more of the model to cut per-step kernel-launch overhead, now extended to DSA models, Kimi-K2.5, and DeepSeek V4: #23351, #26382, #25195.
Faster Qwen 3.5 on Blackwell: New FlashInfer Gated DeltaNet (GDN) kernels and a CuTeDSL GDN prefill kernel speed up Qwen 3.5 on Blackwell GPUs: #22921, #23273, #26200.
HiCache for hybrid models by default: HybridModel (SWA/Mamba) launches HiCache through UnifiedTree by default, bringing hierarchical KV-cache offload to sliding-window and Mamba hybrids out of the box: #27759.
Heterogeneous CPU + GPU EPD disaggregation (with Intel): Offload VLM vision encoding onto Intel Xeon CPUs alongside GPUs, with up to ~1.3x P99 TTFT and request-throughput gains under load. (blog)
MoRI on AMD Instinct MI355X (with AMD): Cost-competitive DeepSeek-R1 disaggregated inference via AMD's MoRI communication library, $0.169 per million tokens at 129 tok/s/user. (blog)
DeepSeek V4 — context parallelism & sparse-attention kernels: Building on the v0.5.12 Day-0 path, v0.5.13 extends DeepSeek-V4 to context-parallel serving and adds its sparse-attention kernels:
- Context Parallel + MTP: #24934
- Context Parallel + fused MoE kernel (non-DeepEP): #24947
- Sparse FlashMLA via flash_mla_sparse_fwd: #25418
- FP4 indexer support: #26209
- SM120 support: #24692
- DeepEP waterfill load balancing: #25391
- MHC kernel warmup: #25810
- Breakable CUDA Graph for DeepSeek V4: #25195
- Backed by sgl-kernel 0.4.3 exposing sgl_kernel.flashmla: #26421, #26132
See the DeepSeek-V4 cookbook for tuned deployment commands.
SGLang-Diffusion — realtime & progressive resolution: OpenAI-style realtime video generation with msgpack frame streaming and a standalone browser WebUI (#26954, #26959), continuous camera controls + super-resolution controls (#27026, #27297), and progressive-resolution growing across FLUX / FLUX.2 / Qwen-Image / Wan / Z-Image (#27524).
Full release notes by category below.
New Model Support
- Nemotron 3 Ultra (Day-0, kernel optimizations): #26733 (see cookbook)
- Step-3.7-Flash: #26565 (see cookbook)
- Cosmos3-Nano / Cosmos3-Super — T2V / I2V / T2I (Diffusion): #24994, #26492, #26926, #26950 (see cookbook)
- Ernie-Image (Diffusion): #22439, #27195 (see cookbook)
- LingBot-World — realtime / causal-DMD generation (Diffusion): #26954 (see cookbook)
- SANA-WM — streaming + realtime (Diffusion): #27531 (see cookbook)
- FLUX.2-Klein 4B / 9B (Diffusion): #25661 (see cookbook)
- Ideogram 4 — FP8 / NVFP4 with tensor parallelism (Diffusion): #27279, #27379, #27393 (see cookbook)
- Command A+ (Cohere 2 family): #26106, #27401 (cookbook page pending)
DeepSeek V4
- Context Parallel + MTP: #24934
- Context Parallel + fused MoE kernel (non-DeepEP): #24947
- MHC kernel warmup: #25810
- SM120 support: #24692
- FP4 indexer support: #26209
- Integrate flash_mla_sparse_fwd kernel: #25418
- DeepEP waterfill support: #25391
- Breakable CUDA Graph for DeepSeek V4: #25195
Speculative Decoding
- Spec V2 is now the default speculative-decoding path
- Tree speculative drafting (topk > 1) on Spec V2, including page_size > 1 and Mamba/hybrid-linear models, validated across triton / FA3 / MLA / aiter: #26997, #26972, #27463
- Spec V1 deprecated; EAGLE/MTP run on the unified V2 worker: #25464
- Spec V2 extended to the FlashMLA backend: #24640
- Adaptive speculative decoding: batch-size-aware num_steps + observability metrics: #24055, #25940
- Faster topk = 1 drafting (skip full-vocab softmax + redundant cat/topk/sort/gather ops): #26397, #26424
- Draft-extend CUDA Graph for the trtllm mha attention backend: #25002
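For readers wondering why skipping the full-vocab softmax is safe when topk = 1: softmax is monotonic, so the highest-probability token is also the highest-logit token. The sketch below illustrates only that property in plain Python. It is not the SGLang implementation, and the function names (`greedy_token_via_softmax`, `greedy_token_direct`) are hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token_via_softmax(logits):
    """Hypothetical slow path: normalize the whole vocabulary, then pick the best token."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def greedy_token_direct(logits):
    """Hypothetical fast path: pick the best token straight from the raw logits."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy 4-token vocabulary; both paths select the same draft token.
logits = [0.3, 2.1, -1.0, 1.7]
assert greedy_token_via_softmax(logits) == greedy_token_direct(logits)
```

With topk > 1 the draft tree generally needs per-token scores to rank branches, so this shortcut is specific to the single-candidate case.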
Piecewise & Breakable CUDA Graph
- PCG support for DSA models: #23351
- PCG support for Kimi-K2.5: #26382
- BCG support for DeepSeek V4: #25195
Context Parallelism
Attention Backends
- CuTeDSL MLA attention kernels from FlashInfer: #24737
- Qwen 3.5: FlashInfer GDN kernels on Blackwell: #22921, #23273
- Qwen 3.5: CuTeDSL GDN prefill kernel on Blackwell: #26200
Scheduler & Runtime
- Overlap Scheduler: fewer CPU–GPU sync points
- Unified async value passing through FutureMap; prefill input transfer moved onto the forward stream, reducing per-step launch overhead and improving stability under high concurrency: #25945, #25879, #26380
- Runtime memory-safety checks enabled by default in CI to guard correctness continuously: #27461, #26335
HiCache & Radix Cache
- HybridModel (SWA/Mamba) launches HiCache via UnifiedTree by default: #27759
PD Disaggregation
- Optimistic prefill for better TTFT: #26780
- Decode-side HiCache integration for incremental KV-cache transfer: #26227
- HiSparse support for DeepSeek V4 with PD: #24880
- Notify and cancel KV-cache transfer for aborted requests: #27372
- Pipeline Parallelism (PP) + PD support for DeepSeek-V4: #24704
- EPD disaggregation support for MiMo-V2: #24931
- Tolerate KV pools without end_layer (Qwen3-Next disagg): #25476
- Unstick decode aborts under prealloc pressure: #25561
- Un-blacklist mooncake sessions when a probe succeeds: #25287
- HiSparse + PD: support host memory-pool page > 1: #23606
- Support regular worker discovery alongside PD workers in IGW mode: #25294
LoRA
- Experimental fast LoRA path with experimental_sgl_trtllm MoE backend for FP8 and NVFP4 models: #27329
- Remove synchronous .any().item() guard in the LoRA MoE prefill path: #25531
- Share MoE LoRA info: #24160
- More efficient pinned memory: #20876
- Fix overlap loading for cancelled requests: #25413
- Fix LoRA pool not appearing in /v1/loads: #25440
Multimodal
SGLang-Diffusion
Realtime diffusion
- OpenAI-style realtime video generation: #26954
- Msgpack frame streaming + standalone browser WebUI: #26959
- Continuous camera controls + super-resolution controls: #27026, #27297
- Lossless RGB transport improvements: #27236
Progressive resolution
- Progressive-resolution growing for image and video generation (FLUX, FLUX.2, Qwen Image, Wan, Z-Image): #27524
Memory & residency
- Layerwise offload generalized beyond legacy DiT components: #24593
- Memory-aware component load order: #25457
- Encoder layerwise-offload defaults: #25517
- Role-based component loading + stage affinity: #25168
- Combined DiT CPU offload + layerwise offload: #26925
Quantization & backends
- Ideogram 4 FP8 / NVFP4 support: #27279, #27379
- FlashInfer TRTLLM as the default diffusion NVFP4 backend: #25523
- Wan2.2 ModelOpt checkpoint updates: #25483, #25857
Performance
- Cosmos3 serve / denoising / I2V / parallel-decode / fused QKNorm optimizations: #26926, #26973, #27037, #27041, #27084, #27096
- LingBot realtime transport / SP cache path / camera-conditioning optimizations: #27023, #27297, #27383
- USP varlen-FA fast path + replicated-KV prefix all-to-all optimization: #26318, #27143
- UniPC scheduler GPU-sync removal: #27440
AMD / ROCm
- [AMD][DSV4] DSV4 MTP graph + sparse triton attention optimizations: #26383
- [AMD] DSV4 compressor optimization: #26208
- [AMD] Enable shared-experts fusion with the new Kimi-K2.5-MXFP4 model: #25390
- [AMD][aiter] Fix cuda_graph_kv_indices OOB under page_size > 1: #24587
- [AMD] Upgrade AITER: #25896
NPU / Ascend
- [NPU] Support chunked prefill for Qwen3.5 / Qwen3.6 models: #25839
- [NPU] Use Triton split_qkvgate_gemma_rmsnorm_rope for Qwen3.5 and Qwen3-Next: #23925
- [NPU] Support DeepSeek-OCR and DeepSeek-OCR-2: #25257
- [Diffusion][NPU] Add attention backends for diffusion models on Ascend NPU: #23482
- [Diffusion][NPU][Quant] Add MXFP4 quantization support for Wan2.2 on Ascend NPU: #22338
- [Diffusion][NPU] Disaggregation diffusion-stage support for NPU: #25895
CPU / Intel / MUSA / MLX
- [MLX] Support Qwen3.5 (dense) model: #25754
- [CPU] Add support for Qwen3-VL and Qwen3-Omni: #12662
- [CPU] Add GPT-OSS model optimization for CPU: #16775
- [CPU] Faster KV-cache writes: #25874
- [CPU] Explicitly enable AVX512 & AMX instruction sets: #26145
- [Xeon] CPU CI enhancement for Intel Xeon platforms: #24649
- [MUSA][Diffusion] Improve Wan model inference speed using torch.compile: #25256
Dependencies
- transformers 5.6.0 → 5.8.1: #25451
- flashinfer 0.6.11.post1 → 0.6.12: #26854
- xgrammar 0.2.0 → 0.2.1: #25676
- sgl-kernel 0.4.2.post2 → 0.4.3 (sgl_kernel.flashmla + DeepSeek V4 kernels): #26421, #26132
- nvidia-cutlass-dsl 4.5.1 → 4.5.2: #26854
Security
No security-tagged PRs in this release.
All PRs included in this release: v0.5.12...v0.5.13
New Contributors
- @jasonjk-park made their first contribution in #24999
- @JoeLee314 made their first contribution in #25380
- @zhengluo-nv made their first contribution in #24723
- @yuychang made their first contribution in #25260
- @amd-bishwoadhikari made their first contribution in #22371
- @miamia0 made their first contribution in #25180
- @Xia-Weiwen made their first contribution in #21668
- @JINO-ROHIT made their first contribution in #25178
- @Gruner-atero made their first contribution in #25293
- @EanWang211123 made their first contribution in #23331
- @nagisa-kunhah made their first contribution in #24640
- @kflansburg made their first contribution in #25287
- @xiaobao520123 made their first contribution in #25786
- @zjd0112 made their first contribution in #25770
- @jiayisunx made their first contribution in #25730
- @rllin made their first contribution in #25807
- @moehanabi made their first contribution in #21191
- @abinggo made their first contribution in #24751
- @alex0dd made their first contribution in #25661
- @FredHuang99 made their first contribution in #25168
- @zhangtao2-1 made their first contribution in #25600
- @longxin9715 made their first contribution in #26069
- @nadongjun made their first contribution in #24610
- @hanwlax made their first contribution in #25856
- @vuuihc made their first contribution in #26177
- @yuhuiaws made their first contribution in #25404
- @ckvermaAI made their first contribution in #23757
- @chengchao23 made their first contribution in #19493
- @0-693 made their first contribution in #21544
- @vguduruTT made their first contribution in #22627
- @jbschlosser made their first contribution in #25911
- @nv-dmajchrowski made their first contribution in #24994
- @jvzibro made their first contribution in #26470
- @Shaoting-Feng made their first contribution in #24089
- @yao-matrix made their first contribution in #25174
- @hippothewild made their first contribution in #25973
- @cquil11 made their first contribution in #26590
- @jmamou made their first contribution in #24149
- @SKRohit made their first contribution in #26257
- @LucQueen made their first contribution in #22587
- @whn09 made their first contribution in #25083
- @BangBOOM made their first contribution in #26672
- @fcczzz made their first contribution in #25880
- @rbrugaro-amd made their first contribution in #25463
- @akhoroshev made their first contribution in #26673
- @KaisennHu made their first contribution in #24667
- @mpdfdfl made their first contribution in #26303
- @ntgiang71096 made their first contribution in #25669
- @decajoin made their first contribution in #26925
- @krishung5 made their first contribution in #25300
- @AliceChenyy made their first contribution in #24692
- @prakashkagitha made their first contribution in #25813
- @thanhhao98 made their first contribution in #26384
- @pengdurice made their first contribution in #26045
- @Li-brua made their first contribution in #27028
- @iyastreb made their first contribution in #26406
- @Ronnie-Rui made their first contribution in #27011
- @yeqcharlotte made their first contribution in #27085
- @JonnyKong made their first contribution in #26768
- @zhenghax made their first contribution in #26969
- @zx3xyy made their first contribution in #26746
- @ilia-iliev made their first contribution in #25292
- @bowenwan6 made their first contribution in #26864
- @akelch11 made their first contribution in #26859
- @gogongxt made their first contribution in #27374
- @256256mjw made their first contribution in #26444
- @Fatemanx made their first contribution in #27279
- @maodoudou168 made their first contribution in #24055
- @L4-1024 made their first contribution in #26356
- @Stella-17 made their first contribution in #27242
Full Changelog: v0.5.12...v0.5.13