sgl-project/sglang v0.5.12

Highlights

  • DeepSeek V4 support: Full inference path for DeepSeek-V4 (#23882), including:

    Day-0 Features: #23882

    • Parallelism: Tensor Parallelism/Expert Parallelism/Context Parallelism/Data Parallel Attention
    • Hardware: Nvidia B300/B200/H200/H100/GB200/GB300, AMD MI35X
    • Prefill-Decode Disaggregation
    • HiSparse for offloading inactive KV cache to CPU memory
    • Reasoning parser and Tool Call Parser
    • DeepGemm and FlashMLA kernels for DeepSeek V4, including MegaMoE

    Post-Day-0 additions:

    • HiCache for DeepSeek V4 under unified Radix Tree [UnifiedTree]: #24691
    • W4A4 MegaMoE kernels — faster speed with negligible accuracy drop: #25052
    • Marlin/FlashInfer W4A8 MoE kernels on Hopper: #24816 #24986
    • Faster V2 fused compression kernels: #24890
    • TP16 support on H100/H20: #24949
    • Fused SiLU+clamp+FP8 quant kernel: #24897
    • Optimized MHC + DeepGemm pipeline (fused norm, fused hc_head): #24775
    • Non-standard chat template support for DSv4: #23915
    • Multi-detokenizer support: #24944
    • Pipeline Parallelism + PD support for DeepSeek-V4: #24700
    • A unified docker tag lmsysorg/sglang:v0.5.12 for all Nvidia GPUs

    See the LMSYS blog and the DeepSeek-V4 cookbook for more details; a minimal launch sketch follows at the end of these highlights.

  • TokenSpeed MLA attention backend (Blackwell, FP8 KV cache): New MLA prefill/decode kernels integrated as an attention backend on SM100, with FP8 KV cache support for low-latency MLA serving: #24925

  • DSv3.2 / GLM-5 FP4 low-latency perf: PDL enabled across DSv3.2 / GLM-5 kernels, torch.mm for the DeepSeek V3.2 indexer GEMM, and a reland of the Cute-DSL FP4 dense GEMM — materially trimming low-latency overheads on FP4 paths: #23965, #23856, #23590, #25311

  • New Model Support: DeepSeek V4 #23882, Intern-S2-Preview #24875, MiniCPM-V 4.6 #24855, Laguna-XS.2 #24204, Ring-2.6-1T #25360, and Gemma 4 MTP #24436 — with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook

  • HiCache + UnifiedRadixTree: HiCache framework support for UnifiedRadixTree (with SWA), HiCache for DeepSeek V4, SSD offload through Mooncake store, and stability fixes across cascade eviction, tombstone replay, and partial-match paths: #23316, #23391, #24691, #24277, #24943, #24972, #25068, #25277

  • Speculative Decoding V2 maturation: Adaptive Spec V2, EAGLE-3 SWA + newer drafters, Kimi K2.5 EAGLE-3 MLA, Gemma 3/4 + EAGLE-3, and an extensive naming / shape-handling refactor across draft-extend paths: #23336, #24663, #24664, #24826, #23976, #24859

  • CUDA 13 DeepEP migration: Gateway DeepEP source swapped from a community fork to deepseek-ai/DeepEP@hybrid-ep so DeepEP builds and runs cleanly on the CUDA 13 default; FlashInfer pinned at 0.6.11.post1 alongside a gpt-oss triton-kernel fix: #25113
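
As a quick orientation for the headline DeepSeek-V4 path, here is a minimal offline-inference sketch. It assumes the usual `sglang.Engine` kwargs carry over and uses `deepseek-ai/DeepSeek-V4` as a placeholder checkpoint id; the cookbook has the tuned, authoritative launch commands.

```python
# Minimal sketch (not the official recipe): offline DeepSeek-V4 inference
# through the sglang Engine API. The checkpoint id is assumed; tp_size is
# an illustrative value, not a tuned setting.
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V4",  # assumed HF repo id
    tp_size=8,                             # tensor parallelism across 8 GPUs
)

out = llm.generate(
    "Explain prefill-decode disaggregation in one paragraph.",
    {"temperature": 0.6, "max_new_tokens": 128},
)
print(out["text"])
llm.shutdown()
```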

New Model Support

Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.

  • DeepSeek V4: #23882
  • Intern-S2-Preview: #24875
  • MiniCPM-V 4.6: #24855
  • Laguna-XS.2: #24204
  • Ring-2.6-1T: #25360
  • Gemma 4 MTP: #24436

Speculative Decoding

  • TokenSpeed MLA prefill/decode kernels integrated as attention backend (FP8 KV cache, Blackwell): #24925
  • Adaptive Spec V2 (2/N): #23336
  • SWA support for EAGLE-3 drafter: #24664
  • Support newer EAGLE-3 drafters (see the sketch after this list): #24663
  • Kimi K2.5 EAGLE-3 MLA spec decoding: #24826
  • Gemma 3 / Gemma 4 + EAGLE-3 support: #23976
  • Spec V1 — split draft-extend into EagleDraftExtendInput: #24859
  • Custom speculative-algorithm registry: #23991
  • Spec-V2 overlap stale-state fix: #23456
  • trtllm decode kernel for draft extend: #24566
  • AMD: EAGLE on Qwen3.5 FP8/MXFP4 via aiter unified attention: #23146
  • Fix Kimi K2.5 MLA EAGLE + DP attention: #25033
  • Fix ngram metric off-by-1 in num_accepted_drafts_per_req_cpu: #24965
  • Fix frozen-KV MTP crash when bonus_tokens is None: #25204
  • Fix stuck-MTP on DSA models: #24635
  • Reduce specdec CPU overhead: #23321
  • Spec-decoding naming-convention rule + refactors: #24094, #25014, #25038, #24081, #24724, #24735, #24881, #25010, #25012, #25030, #25029, #25037, #25109
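
As a concrete reference for the EAGLE-3 items above, a hedged sketch of enabling an EAGLE-3 drafter via `sglang.Engine` kwargs. The kwarg names follow sglang's existing speculative-decoding server args; the target/drafter model ids are examples, and the numeric settings are illustrative rather than tuned.

```python
# Sketch: enabling an EAGLE-3 drafter. Kwarg names follow sglang's
# speculative-decoding server args; model and drafter ids are examples,
# and the numeric settings are illustrative rather than tuned.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    speculative_num_steps=5,          # draft steps per verification pass
    speculative_eagle_topk=8,         # tree branching factor per step
    speculative_num_draft_tokens=32,  # draft tokens verified per pass
)
print(llm.generate("Hello!", {"max_new_tokens": 32})["text"])
llm.shutdown()
```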

PD Disaggregation

  • DSv4 Flash disaggregation test: #24973
  • Unify DSv4 dispatch with SWA: #24888
  • DSv4 mooncake state_type branch: #24878
  • Hybrid state transfer refactor: #24932
  • Priority scheduling in PD mode fix: #25062
  • NIXL: staging buffer for heterogeneous-TP KV transfer: #22536
  • NIXL: async transfer: #23967
  • NIXL XPU: uint64 pointer overflow + mismatched P/D TP fixes: #24188, #24648
  • Mooncake: incremental transfer + SSD offload: #24257, #24277
  • Multi-node prefill bootstrap-port broadcast: #24378
  • Add retry-with-backoff for prefill bootstrap registration: #25125
  • PrefillDelayer: NCCL all-gather for cross-DP info sync: #24768
  • MORI-IO: state transfer + high-concurrency fixes: #22665
  • Per-room cleanup centralization; prevent update_status from cleared entries; fix abort update_status across KV backends: #24601, #24539, #24522
  • PD KV transfer metrics fix: #24416
  • SWA memory preallocation for disaggregated decode: #24857
  • IntraNode NVLink configuration docs: #23329
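
For orientation, a sketch of launching a single-node prefill/decode pair. The flags mirror sglang's existing disaggregation server args; the ports, GPU ids, and model id are placeholders, and a PD-aware router or load balancer would normally sit in front of the pair.

```python
# Sketch: one prefill server and one decode server on the same node.
# Flags mirror sglang's existing disaggregation args; ports, GPU ids,
# and the model id are placeholders.
import subprocess

MODEL = "deepseek-ai/DeepSeek-V4"  # assumed checkpoint id

prefill = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--disaggregation-mode", "prefill",
    "--disaggregation-transfer-backend", "mooncake",
    "--port", "30000", "--base-gpu-id", "0",
])
decode = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--disaggregation-mode", "decode",
    "--disaggregation-transfer-backend", "mooncake",
    "--port", "30001", "--base-gpu-id", "4",
])
prefill.wait()
decode.wait()
```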

HiCache & Radix Cache

  • HiCache framework for UnifiedRadixTree (enablement sketch after this list): #23316
  • SWA HiCache for unified radix cache: #23391
  • HiCache for DeepSeek V4 + nightly CI for DSA model: #24691, #25369, #25348
  • SSD offload through Mooncake store: #24277
  • HiSparse FP8 KV cache via flashmla_kv backend: #23013
  • Default storage prefetch timeout: #23309
  • UnifiedRadixCache device match semantics with HiCache: #25277
  • UnifiedTree partial match on evicted+backuped nodes: #24943
  • UnifiedTree tombstone lock release replay fix: #24972
  • UnifiedTree _cascade_evict leaf determination fix: #25068
  • UnifiedRadixTree align cache_empty_result with RadixTree: #24779
  • Mamba radix cache KV events; SWA radix cache events: #23678, #24718
  • SWA chunk req deferred fix; SWA component host hit fix: #24318, #25085
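
A hedged sketch of turning on hierarchical caching with a host-memory tier plus the Mooncake SSD tier from #24277. The kwarg names mirror sglang's existing hicache server args; the ratio is illustrative, not a tuned value.

```python
# Sketch: HiCache with a host-memory tier and Mooncake-backed SSD offload.
# Kwargs mirror sglang's hicache server args; the ratio is illustrative.
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V4",  # assumed checkpoint id
    enable_hierarchical_cache=True,        # host-memory KV tier
    hicache_ratio=2.0,                     # host pool sized at 2x device pool
    hicache_storage_backend="mooncake",    # SSD offload tier (#24277)
)
```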

LoRA

  • MLA attention LoRA (q_b_proj / kv_b_proj): #25001
  • CSGMV backend with virtual experts for MoE LoRA: #24007
  • MoE LoRA: remove CPU-GPU sync barriers and duplicate code (prefill optimize 2/n, 3/n): #24246, #24262
  • LoRADrainer for high P99 TTFT: #17913
  • qkv_proj buffer sizing when tp_size > num_key_value_heads: #24420
  • Torch-Native LoRA: embedding + graph optimization: #21885
  • Deterministic lora_id for multi-node --lora-paths: #24555
  • Fix broken sgemm_lora_a_graph_fwd due to invalid torch.mm(): #24760
  • Diffusion: fix RowParallel LoRA merged forwarding: #24410
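
For context on the section above, a sketch of serving multiple LoRA adapters and routing a request through one of them. The adapter names and paths are placeholders; `lora_paths` and the per-request `lora_path` argument follow sglang's existing multi-LoRA API.

```python
# Sketch: serving two LoRA adapters and selecting one per request.
# Adapter names/paths are placeholders.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    lora_paths={"sql": "/adapters/sql-lora", "chat": "/adapters/chat-lora"},
    max_loras_per_batch=4,
)
out = llm.generate(
    "Translate to SQL: users created this week",
    {"max_new_tokens": 64},
    lora_path="sql",  # route this request through the 'sql' adapter
)
print(out["text"])
llm.shutdown()
```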

Performance

  • TMA bulk-store set_mla_kv_buffer (up to 12× over baseline): #25311
  • Kimi tokenizer TTFT optimization: #25265
  • Avoid hidden-states D2H copy when return_hidden_states=false: #25155
  • DeepseekV2MoE: defer shared experts when routed kernel is non-mutating: #25279
  • SGLANG_OPT_FP8_WO_A_GEMM on by default (opt-out sketch after this list): #25181
  • --prefill-only-disable-kv-cache to skip KV pool allocation: #23675
  • Gemma 4 MoE: fused Q/K/V RMSNorm + per-expert FP8 ckpt loader: #24696
  • Gemma 4 VLM: PCG + fused RMSNorm + residual: #24048
  • MHC pipeline: DeepGemm + fused norm + fused hc_head: #24775
  • JIT custom all-reduce default; non-NVL follow-up: #24363, #24742
  • SGLANG_USE_JIT_ALL_REDUCE → SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: #24297
  • Eliminate logits H2D blocking copy: #24627
  • Cache empty MatchResult in RadixCache: #24470
  • Breakable CUDA graph for bs > 1: #24662
  • FA3: skip scheduler_metadata precompute under DP attention: #24632
  • aten::rms_norm / aten::mm.dtype registration in batch-invariant mode: #24459
  • Optimize Helios fused norm modulation: #24059
  • Z-Image packed QKV optimization: #24117
  • KDA prefill kernels: diagonal + recompute fuse: #24271
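
A brief sketch of opting out of the new SGLANG_OPT_FP8_WO_A_GEMM default (#25181) through the environment before engine startup. The accepted values ("0"/"1") are an assumption; check the server-args docs for the exact semantics.

```python
# Sketch: disabling the SGLANG_OPT_FP8_WO_A_GEMM default before startup.
# The "0"/"1" convention is assumed, not confirmed by the release notes.
import os
import sglang as sgl

os.environ["SGLANG_OPT_FP8_WO_A_GEMM"] = "0"  # opt out of the new default

llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-V4")  # assumed checkpoint id
```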

Observability

  • sglang:get_loads_duration_seconds Prometheus metric (scrape example after this list): #25163
  • Per-iteration forward-pass metrics via ZMQ PUB: #22789
  • SGLANG_TRACE_LEVEL env for startup trace level: #24716
  • fwd_occupancy metric in SchedulerStats + Prometheus collector: #24458
  • SWA / Mamba cache metrics: #24396
  • Mamba radix cache + SWA radix cache KV events: #23678, #24718
  • PD KV transfer metrics fix: #24416
  • CP allgather buffer registered with symmetric memory: #24040
  • Decode-side bootstrap/alloc metrics + non-int token-id filter: #24684
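
A small sketch of reading the new sglang:get_loads_duration_seconds metric (#25163) off the Prometheus endpoint. It assumes the server was launched with --enable-metrics so that /metrics is exposed; port and host are placeholders.

```python
# Sketch: scraping a single sglang Prometheus metric from /metrics.
# Assumes --enable-metrics was passed at server launch.
import requests

body = requests.get("http://localhost:30000/metrics", timeout=5).text
for line in body.splitlines():
    if line.startswith("sglang:get_loads_duration_seconds"):
        print(line)
```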

Frontend & API

  • /v1/tokenize chat-completion-style support (example after this list): #23981
  • Multi-detokenizer support: #24944
  • Structural tags for strict tool calling & reasoning across more models: #21722
  • Auto-detect reasoning / tool-call parser from chat template: #23952
  • Two-phase reasoning grammar + --enable-strict-thinking: #23953
  • OpenAI reasoning.enabled mapping to thinking + enable_thinking: #23951
  • Kimi-K2.5 bare-numeric tool-call IDs: #23950
  • Crusoe managed-inference backend: #20475
  • Azure Blob Storage connector (az:// and *.blob.core.windows.net): #23995
  • Adaptive queue-based prefill-delayer trigger: #23189
  • SGLANG_MAX_KV_CHUNK_CAPACITY env: #25120
  • SGLANG_RADIX_FORCE_MISS env: #24726, #24950
  • Reject repetition_penalty=0 in SamplingParams.verify(): #24874
  • --random-input-len for send_one.py: #24464
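
An illustrative call to the chat-completion-style /v1/tokenize endpoint (#23981). The payload shape here is an assumption modeled on chat-completions requests; consult the API reference for the final schema.

```python
# Sketch: chat-completion-style tokenize request. The JSON schema is an
# assumption; only the endpoint path comes from the release notes.
import requests

resp = requests.post(
    "http://localhost:30000/v1/tokenize",
    json={
        "model": "deepseek-ai/DeepSeek-V4",  # assumed model id
        "messages": [{"role": "user", "content": "Hello there"}],
    },
    timeout=5,
)
print(resp.json())
```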

SGLang-Diffusion

  • New model support: HunyuanVideo ModelOpt FP8 (#23199), Qwen Image ModelOpt FP8 (#23155)
  • CFG parallelism framework + multi-branch CFG for LTX-2: #23736
  • Initial dynamic batching: #18764
  • Performance-mode server args: #24491
  • dit_precision config respected (no hardcoded bf16): #24988
  • Cache-DiT: mount before torch.compile in native denoising: #25328
  • Z-Image Cache-DiT sequence-parallel override fix: #25305
  • USP: direct all-to-all collectives; NCCL deadlock fix for remainder seq lengths: #24366, #24694
  • FA3 varlen out argument handling: #24688
  • RowParallel LoRA merged forwarding fix: #24410
  • CFG communication: handle non-contiguous tensors: #24332
  • LTX-2.3 alignment with official + HQ denoising split passes: #24313, #24298
  • LTX-2 feed-forward TP optimization (#23221) + Hunyuan3D shape denoising / export chunks: #24287, #24358
  • Encoder result cache for default negative prompt: #24304
  • Channels-last 3D VAE convs by default; disable VAE CPU offload by default: #23200, #24315
  • Component attention-backend override CLI: #24320
  • AMD: online MXFP4 + FP8 diffusion quantization; aiter RMSNorm; temporal-unfolded batched Conv2D for ROCm VAE decode; dual-stream MoE: #21431, #24360, #22971, #24005, #24677
  • NPU: MXFP8 quantization for Wan2.2 (#20922, #24918); fused-operator E2E perf for Wan (#24028); selectable parallel VAE decode strategies (#23248); SANA fix (#24798); Z-Image negative-branch rotary embed CFG fix (#23538)
  • MUSA: sage attention backend (#24752)

AMD / ROCm

AMD-specific changes this cycle appear inline in the sections above (see the AMD entries under Speculative Decoding and SGLang-Diffusion).

NPU / Ascend

  • zbal support: #24575
  • Trinity-mini support (~90% accuracy): #18172
  • Shared-expert dual-stream optimization: #23827
  • Mamba-extra-buffer radix cache (Qwen3.5): #23891
  • MLA KV transfer in pipeline parallel: #23893
  • Multi-batch FIA ops: #20177
  • GLM-5 docs: DeepEP enabled by default: #23708
  • GLM-4.5V / GLM-4.7-Flash NPU support and fixes (carried over from earlier releases)
  • --disable-cuda-graph + MTP warmup fix: #23819
  • MRoPE position fix in Eagle Worker v2 with PlanStream: #23423
  • Z-Image negative-branch rotary embeddings for CFG: #23538
  • Wan quantization fix: #24540
  • causal_conv1d_update_v2 for performance: #24595
  • sgl-kernel-npu 2026.05.01 bump: #24951
  • Profiler revert + re-add: #24685, #24815
  • Doc / accuracy / FAQ work: #21537, #24658, #24676, #24777, #25114, #25130, #25268, #24668, #24918

CPU / Intel / MUSA / MLX / Apple Silicon

  • MUSA: FlashInfer sampling backend: #24978
  • MUSA: optimized kernels for piecewise CUDA graph: #23633
  • MUSA: optimized kernels for hot ops: #23255
  • MUSA: torchada 0.1.54 bump: #24592
  • MLX: on-the-fly --quantization mlx_q4 / mlx_q8 on Apple Silicon (sketch after this list): #24907
  • MLX: auto-detect MLX-format quantization_config dict: #25191
  • MLX: thread --quantization through MlxModelRunner in bench_one_batch: #25221
  • MLX: Apple Silicon Metal kernel support in sgl-kernel: #23449
  • sgl-kernel/cpu: w8a8 int8 model support for arm cpu: #16045
  • Intel CPU tests migrated to test/registered (re-applied after revert): #25139, #22670, #25044
  • Arm64 CPU Phase-1A CI bootstrap: #22123
  • XPU pipeline parallelism on Intel: #23472
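
A hedged sketch of the on-the-fly MLX quantization path from #24907 on Apple Silicon. The quantization value comes from the flag above; the model id is an example and the kwarg spelling mirrors --quantization.

```python
# Sketch: on-the-fly MLX 4-bit quantization on Apple Silicon (#24907).
# The model id is an example, not a recommendation.
import sglang as sgl

llm = sgl.Engine(
    model_path="mlx-community/Llama-3.2-3B-Instruct",  # example repo id
    quantization="mlx_q4",  # or "mlx_q8"
)
print(llm.generate("Hi!", {"max_new_tokens": 16})["text"])
llm.shutdown()
```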

Quantization & Kernels

  • NVFP4 hot-reload-safe weight loading (alias-when-same-shape): #25190
  • NVFP4: free unused source scales after weight processing: #25107
  • Cute-DSL NVFP4 quantization kernels: #23745
  • Cute-DSL FP4 dense GEMM (reland): #23590
  • DSv3.2 indexer GEMM via torch.mm: #23856
  • PDL for DSv3.2 / GLM-5 kernels: #23965
  • DSv4: W4A4 MegaMoE; W4(MXFP4)A16 on Hopper: #25052, #24986
  • FlashInfer SM90 cutlass MXFP4 MoE backend (W4A16) for GPT-OSS + DSv4: #24816
  • Port KV Compression V2 + fused SiLU+clamp+FP8 quant from DSV4 dev branch: #24890, #24897
  • BF16 EP-MoE for DeepGEMM: #17392
  • DeepGEMM deprecated in sgl-kernel; custom sgl-deep-gemm wheel + release workflow: #24268, #24348, #24385
  • TRT-LLM A2A dispatch: NaN sanitization in padding slots: #24850
  • TRT-LLM BF16 MoE for MTP: #24260
  • MegaMoE decoupled from DeepEP backend (subsequently reverted): #24884, #25317
  • DeepEP waterfill load balancing for shared-expert dispatch: #19290
  • DeepEP support for --enable-return-routed-experts: #16859

Dependencies

  • FlashInfer 0.6.8.post1 → 0.6.11 → 0.6.11.post1 (with intermediate revert): #24452, #25129, #25310, #25335
  • sgl-kernel 0.4.2.post1, 0.4.2.post2: #24457, #25326
  • sgl-kernel: SM90 flashmla compile fix: #24130
  • Custom sgl-deep-gemm wheel + release workflow: #24268, #24348, #24385
  • sgl-kernel-build x86 + arm merged into reusable workflow; disk-reclaim cleanup: #25135, #25206
  • DeepEP swapped from fzyzcjy fork to deepseek-ai/DeepEP@hybrid-ep (CUDA 13): #25113
  • Torch 2.11 Docker prep + dependency cleanup: #23593
  • nixl stub installation alongside nixl-cuXX binary: #24369
  • aarch64 cubin handling + masked-failure fix: #24234
  • H20 stage on CUDA 13: #24916
  • CUDA-13 kernel installation docs: #24181, #24516
  • FlashInfer autotune cache: #24156
  • FlashInfer workspace OOM fix: #24172
  • FlashInfer allreduce fusion disabled under deterministic inference: #24629
  • trtllm allreduce fusion with PDL: #23765
  • TRTLLM MHA routing fix for draft-extend: #24856
  • torchcodec → soundfile WAV fallback for trailing metadata: #24185
  • sgl-kernel-npu 2026.05.01: #24951

Security

No security-tagged PRs in this window.

All PRs included in this release: v0.5.11...v0.5.12

New Contributors

Full Changelog: v0.5.11...v0.5.12
