Highlights
- DeepSeek V4 support: Full inference path for DeepSeek-V4 (#23882), including:
Day-0 Features: #23882
- Parallelism: Tensor Parallelism/Expert Parallelism/Context Parallelism/Data Parallel Attention
- Hardware: Nvidia B300/B200/H200/H100/GB200/GB300, AMD MI35X
- Prefill-Decode Disaggregation
- HiSparse for offloading inactive KV cache to CPU memory
- Reasoning parser and Tool Call Parser
- DeepGemm and FlashMLA kernels for DeepSeek V4, including MegaMoE
Post-Day-0 additions:
- HiCache for DeepSeek V4 under unified Radix Tree [UnifiedTree]: #24691
- W4A4 MegaMoE kernels — faster speed with negligible accuracy drop: #25052
- Marlin/FlashInfer W4A8 MoE kernels on Hopper: #24816 #24986
- Faster V2 fused compression kernels: #24890
- TP16 support on H100/H20: #24949
- Fused SiLU+clamp+FP8 quant kernel: #24897
- Optimized MHC + DeepGemm pipeline (fused norm, fused hc_head): #24775
- Non-standard chat template support for DSv4: #23915
- Multi-detokenizer support: #24944
- Pipeline Parallelism + PD support for DeepSeek-V4: #24700
- A unified docker tag `lmsysorg/sglang:v0.5.12` for all Nvidia GPUs
See the LMSYS blog and the DeepSeek-V4 cookbook for more details.
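With a single image tag covering every Nvidia GPU in the support matrix above, deployment reduces to one pull-and-run flow. A minimal launch sketch (the model path is a placeholder, and the flags shown follow the project's usual Docker instructions; tune them for your hardware):

```shell
# Pull the unified image (one tag for all Nvidia GPUs in this release).
docker pull lmsysorg/sglang:v0.5.12

# Launch the server; --model-path below is a placeholder.
docker run --gpus all --shm-size 32g -p 30000:30000 \
    lmsysorg/sglang:v0.5.12 \
    python3 -m sglang.launch_server \
    --model-path <your-model> --host 0.0.0.0 --port 30000
```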
- TokenSpeed MLA attention backend (Blackwell, FP8 KV cache): New MLA prefill/decode kernels integrated as an attention backend on SM100, with FP8 KV cache support for low-latency MLA serving: #24925
- DSv3.2 / GLM-5 FP4 low-latency perf: PDL enabled across DSv3.2 / GLM-5 kernels, `torch.mm` for the DeepSeek V3.2 indexer GEMM, and a reland of the Cute-DSL FP4 dense GEMM, materially trimming low-latency overheads on FP4 paths: #23965, #23856, #23590, #25311
- New Model Support: DeepSeek V4 #23882, Intern-S2-Preview #24875, MiniCPM-V 4.6 #24855, Laguna-XS.2 #24204, Ring-2.6-1T #25360, and Gemma 4 MTP #24436, with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook
- HiCache + UnifiedRadixTree: HiCache framework support for UnifiedRadixTree (with SWA), HiCache for DeepSeek V4, SSD offload through Mooncake store, and stability fixes across cascade eviction, tombstone replay, and partial-match paths: #23316, #23391, #24691, #24277, #24943, #24972, #25068, #25277
- Speculative Decoding V2 maturation: Adaptive Spec V2, EAGLE-3 SWA + newer drafters, Kimi K2.5 EAGLE-3 MLA, Gemma 3/4 + EAGLE-3, and an extensive naming / shape-handling refactor across draft-extend paths: #23336, #24663, #24664, #24826, #23976, #24859
- CUDA 13 DeepEP migration: Gateway DeepEP source swapped from a community fork to `deepseek-ai/DeepEP@hybrid-ep` so that DeepEP builds and runs cleanly on the CUDA 13 default; FlashInfer pinned at 0.6.11.post1 alongside a gpt-oss triton-kernel fix: #25113
New Model Support
Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.
- DeepSeek V4 (see cookbook; LMSYS blog)
- Intern-S2-Preview: #24875, #25115, #25134 (see cookbook)
- MiniCPM-V 4.6: #24855, #24876, #24991, #24998 (see cookbook)
- Laguna-XS.2 (Poolside): #24204, #24730 (see cookbook)
- Ring-2.6-1T (InclusionAI, trillion-param reasoning): #25360, #25370 (see cookbook)
- Gemma 4 MTP (MTP head for Gemma 4): #24436, #24433
- Trinity-mini (Ascend NPU, ~90% accuracy): #18172
- HunyuanVideo ModelOpt FP8 (Diffusion): #23199
- Qwen Image ModelOpt FP8 (Diffusion): #23155
Speculative Decoding
- TokenSpeed MLA prefill/decode kernels integrated as attention backend (FP8 KV cache, Blackwell): #24925
- Adaptive Spec V2 (2/N): #23336
- SWA support for EAGLE-3 drafter: #24664
- Support newer EAGLE-3 drafters: #24663
- Kimi K2.5 EAGLE-3 MLA spec decoding: #24826
- Gemma 3 / Gemma 4 + EAGLE-3 support: #23976
- Spec V1: split draft-extend into `EagleDraftExtendInput`: #24859
- Custom speculative-algorithm registry: #23991
- Spec-V2 overlap stale-state fix: #23456
- `trtllm` decode kernel for draft extend: #24566
- AMD: EAGLE on Qwen3.5 FP8/MXFP4 via aiter unified attention: #23146
- Fix Kimi K2.5 MLA EAGLE + DP attention: #25033
- Fix `ngram` metric off-by-1 in `num_accepted_drafts_per_req_cpu`: #24965
- Fix frozen-KV MTP crash when `bonus_tokens` is None: #25204
- Fix stuck-MTP on DSA models: #24635
- Reduce specdec CPU overhead: #23321
- Spec-decoding naming-convention rule + refactors: #24094, #25014, #25038, #24081, #24724, #24735, #24881, #25010, #25012, #25030, #25029, #25037, #25109
PD Disaggregation
- DSv4 Flash disaggregation test: #24973
- Unify DSv4 dispatch with SWA: #24888
- DSv4 mooncake `state_type` branch: #24878
- Hybrid state transfer refactor: #24932
- Priority scheduling in PD mode fix: #25062
- NIXL: staging buffer for heterogeneous-TP KV transfer: #22536
- NIXL: async transfer: #23967
- NIXL XPU: uint64 pointer overflow + mismatched P/D TP fixes: #24188, #24648
- Mooncake: incremental transfer + SSD offload: #24257, #24277
- Multi-node prefill bootstrap-port broadcast: #24378
- Add retry-with-backoff for prefill bootstrap registration: #25125
- `PrefillDelayer`: NCCL all-gather for cross-DP info sync: #24768
- MORI-IO: state transfer + high-concurrency fixes: #22665
- Per-room cleanup centralization; prevent `update_status` from cleared entries; fix abort `update_status` across KV backends: #24601, #24539, #24522
- PD KV transfer metrics fix: #24416
- SWA memory preallocation for disaggregated decode: #24857
- IntraNode NVLink configuration docs: #23329
HiCache & Radix Cache
- HiCache framework for UnifiedRadixTree: #23316
- SWA HiCache for unified radix cache: #23391
- HiCache for DeepSeek V4 + nightly CI for DSA model: #24691, #25369, #25348
- SSD offload through Mooncake store: #24277
- HiSparse FP8 KV cache via flashmla_kv backend: #23013
- Default storage prefetch timeout: #23309
- UnifiedRadixCache device match semantics with HiCache: #25277
- UnifiedTree partial match on evicted+backuped nodes: #24943
- UnifiedTree tombstone lock release replay fix: #24972
- UnifiedTree `_cascade_evict` leaf determination fix: #25068
- UnifiedRadixTree: align `cache_empty_result` with RadixTree: #24779
- Mamba radix cache KV events; SWA radix cache events: #23678, #24718
- SWA chunk req deferred fix; SWA component host hit fix: #24318, #25085
LoRA
- MLA attention LoRA (q_b_proj / kv_b_proj): #25001
- CSGMV backend with virtual experts for MoE LoRA: #24007
- MoE LoRA: remove CPU-GPU sync barriers and duplicate code (prefill optimize 2/n, 3/n): #24246, #24262
- LoRADrainer for high P99 TTFT: #17913
- `qkv_proj` buffer sizing when `tp_size > num_key_value_heads`: #24420
- Torch-Native LoRA: embedding + graph optimization: #21885
- Deterministic `lora_id` for multi-node `--lora-paths`: #24555
- Fix broken `sgemm_lora_a_graph_fwd` due to invalid `torch.mm()`: #24760
- Diffusion: fix RowParallel LoRA merged forwarding: #24410
Performance
- TMA bulk-store `set_mla_kv_buffer` (up to 12× over baseline): #25311
- Kimi tokenizer TTFT optimization: #25265
- Avoid hidden-states D2H copy when `return_hidden_states=false`: #25155
- DeepseekV2MoE: defer shared experts when routed kernel is non-mutating: #25279
- `SGLANG_OPT_FP8_WO_A_GEMM` on by default: #25181
- `--prefill-only-disable-kv-cache` to skip KV pool allocation: #23675
- Gemma 4 MoE: fused Q/K/V RMSNorm + per-expert FP8 ckpt loader: #24696
- Gemma 4 VLM: PCG + fused RMSNorm + residual: #24048
- MHC pipeline: DeepGemm + fused norm + fused hc_head: #24775
- JIT custom all-reduce default; non-NVL follow-up: #24363, #24742
- `SGLANG_USE_JIT_ALL_REDUCE` → `SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2`: #24297
- Eliminate logits H2D blocking copy: #24627
- Cache empty `MatchResult` in RadixCache: #24470
- Breakable CUDA graph for `bs > 1`: #24662
- FA3: skip `scheduler_metadata` precompute under DP attention: #24632
- `aten::rms_norm` / `aten::mm.dtype` registration in batch-invariant mode: #24459
- Optimize Helios fused norm modulation: #24059
- Z-Image packed QKV optimization: #24117
- KDA prefill kernels: diagonal + recompute fuse: #24271
Observability
- `sglang:get_loads_duration_seconds` Prometheus metric: #25163
- Per-iteration forward-pass metrics via ZMQ PUB: #22789
- `SGLANG_TRACE_LEVEL` env for startup trace level: #24716
- `fwd_occupancy` metric in `SchedulerStats` + Prometheus collector: #24458
- SWA / Mamba cache metrics: #24396
- Mamba radix cache + SWA radix cache KV events: #23678, #24718
- PD KV transfer metrics fix: #24416
- CP allgather buffer registered with symmetric memory: #24040
- Decode-side bootstrap/alloc metrics + non-int token-id filter: #24684
Frontend & API
- `/v1/tokenize` chat-completion-style support: #23981
- Multi-detokenizer support: #24944
- Structural tags for strict tool calling & reasoning across more models: #21722
- Auto-detect reasoning / tool-call parser from chat template: #23952
- Two-phase reasoning grammar + `--enable-strict-thinking`: #23953
- OpenAI `reasoning.enabled` mapping to `thinking` + `enable_thinking`: #23951
- Kimi-K2.5 bare-numeric tool-call IDs: #23950
- Crusoe managed-inference backend: #20475
- Azure Blob Storage connector (`az://` and `*.blob.core.windows.net`): #23995
- Adaptive queue-based prefill-delayer trigger: #23189
- `SGLANG_MAX_KV_CHUNK_CAPACITY` env: #25120
- `SGLANG_RADIX_FORCE_MISS` env: #24726, #24950
- Reject `repetition_penalty=0` in `SamplingParams.verify()`: #24874
- `--random-input-len` for `send_one.py`: #24464
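The `repetition_penalty=0` rejection guards against a degenerate sampling configuration: under the common CTRL-style formulation, positive logits of previously generated tokens are divided by the penalty, so a zero value would mean division by zero. A minimal sketch of that formulation (an illustrative stand-in, not SGLang's actual code; the function name is hypothetical):

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """CTRL-style repetition penalty on raw logits.

    Positive logits of already-generated tokens are divided by the
    penalty and negative ones multiplied by it; penalty == 0 would
    require dividing by zero, hence rejecting it at validation time.
    """
    out = list(logits)
    for t in set(seen_token_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

# Tokens 0 and 1 were already generated; token 2 is left untouched.
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=1.5)
```

Values greater than 1 discourage repetition and values in (0, 1) encourage it, so 0 is the one value with no sensible interpretation.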
SGLang-Diffusion
- New model support: HunyuanVideo ModelOpt FP8 (#23199), Qwen Image ModelOpt FP8 (#23155)
- CFG parallelism framework + multi-branch CFG for LTX-2: #23736
- Initial dynamic batching: #18764
- Performance-mode server args: #24491
- `dit_precision` config respected (no hardcoded bf16): #24988
- Cache-DiT: mount before `torch.compile` in native denoising: #25328
- Z-Image Cache-DiT sequence-parallel override fix: #25305
- USP: direct all-to-all collectives; NCCL deadlock fix for remainder seq lengths: #24366, #24694
- FA3 varlen `out` argument handling: #24688
- RowParallel LoRA merged forwarding fix: #24410
- CFG communication: handle non-contiguous tensors: #24332
- LTX-2.3 alignment with official + HQ denoising split passes: #24313, #24298
- LTX-2 feed-forward TP optimization (#23221) + Hunyuan3D shape denoising / export chunks: #24287, #24358
- Encoder result cache for default negative prompt: #24304
- Channels-last 3D VAE convs by default; disable VAE CPU offload by default: #23200, #24315
- Component attention-backend override CLI: #24320
- AMD: online MXFP4 + FP8 diffusion quantization; aiter RMSNorm; temporal-unfolded batched Conv2D for ROCm VAE decode; dual-stream MoE: #21431, #24360, #22971, #24005, #24677
- NPU: MXFP8 quantization for Wan2.2 (#20922, #24918); fused-operator E2E perf for Wan (#24028); selectable parallel VAE decode strategies (#23248); SANA fix (#24798); Z-Image negative-branch rotary embed CFG fix (#23538)
- MUSA: sage attention backend (#24752)
AMD / ROCm
- DSv4 Flash / Pro nightly tests on MI35x ROCm 7.2: #24203, #24825, #25039
- NSA indexer fallbacks + preshuffle paged MQA + GLM-5 NSA TileLang: #24125, #23562, #25205
- `fp8` blockwise quantization combine for MoRI EP: #24879
- gfx950 + aiter `_skip_rope_for_aiter_fused_mla`: #24148
- aiter `fused_qk_rmsnorm` API shim (pre/post #2958): #24799
- TBO Spec-V2 `seq_lens_cpu` None handling: #24319
- Kimi-K2.6 nightly tests (MI30x / MI35x): #23848
- JIT kernel PR-CI through `run_suite.py`: #24987
- AMD JIT benches: clamp position + resolve-token-ids: #24209, #24210 (#25209, #25210)
- AMD CI hygiene (registration + cleanup + VRAM): #24569, #24572, #24586, #24612, #24614, #24615, #24665, #24924, #24981, #25112
- Docker: cache-dit 1.3.0 pin; `archive.ubuntu.com` fallback: #24924, #24407
NPU / Ascend
- `zbal` support: #24575
- Trinity-mini support (~90% accuracy): #18172
- Shared-expert dual-stream optimization: #23827
- Mamba-extra-buffer radix cache (Qwen3.5): #23891
- MLA KV transfer in pipeline parallel: #23893
- Multi-batch FIA ops: #20177
- GLM-5 docs: DeepEP enabled by default: #23708
- GLM-4.5V / GLM-4.7-Flash NPU support / fixes (carry-over): existing `--disable-cuda-graph` + MTP warmup fix: #23819
- MRoPE position fix in Eagle Worker v2 with PlanStream: #23423
- Z-Image negative-branch rotary embeddings for CFG: #23538
- Wan quantization fix: #24540
- `causal_conv1d_update_v2` for performance: #24595
- `sgl-kernel-npu` 2026.05.01 bump: #24951
- Profiler revert + re-add: #24685, #24815
- Doc / accuracy / FAQ work: #21537, #24658, #24676, #24777, #25114, #25130, #25268, #24668, #24918
CPU / Intel / MUSA / MLX / Apple Silicon
- MUSA: FlashInfer sampling backend: #24978
- MUSA: optimized kernels for piecewise CUDA graph: #23633
- MUSA: optimized kernels for hot ops: #23255
- MUSA: `torchada` 0.1.54 bump: #24592
- MLX: on-the-fly `--quantization mlx_q4` / `mlx_q8` on Apple Silicon: #24907
- MLX: auto-detect MLX-format `quantization_config` dict: #25191
- MLX: thread `--quantization` through `MlxModelRunner` in `bench_one_batch`: #25221
- MLX: Apple Silicon Metal kernel support in `sgl-kernel`: #23449
- `sgl-kernel/cpu`: w8a8 int8 model support for arm cpu: #16045
- Intel CPU tests migrated to `test/registered` (re-applied after revert): #25139, #22670, #25044
- Arm64 CPU Phase-1A CI bootstrap: #22123
- XPU pipeline parallelism on Intel: #23472
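The MLX quantization path above is driven entirely from the CLI. A usage sketch (the model path is a placeholder; `--quantization mlx_q4` / `mlx_q8` are the values named in the list above):

```shell
# On-the-fly 4-bit MLX quantization at server launch (Apple Silicon).
python3 -m sglang.launch_server \
    --model-path <your-model> \
    --quantization mlx_q4
```

For checkpoints that already carry an MLX-format `quantization_config` dict, the auto-detection change means the flag can be omitted.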
Quantization & Kernels
- NVFP4 hot-reload-safe weight loading (alias-when-same-shape): #25190
- NVFP4: free unused source scales after weight processing: #25107
- Cute-DSL NVFP4 quantization kernels: #23745
- Cute-DSL FP4 dense GEMM (reland): #23590
- DSv3.2 indexer GEMM via `torch.mm`: #23856
- PDL for DSv3.2 / GLM-5 kernels: #23965
- DSv4: W4A4 MegaMoE; W4(MXFP4)A16 on Hopper: #25052, #24986
- FlashInfer SM90 cutlass MXFP4 MoE backend (W4A16) for GPT-OSS + DSv4: #24816
- Port KV Compression V2 + fused SiLU+clamp+FP8 quant from DSV4 dev branch: #24890, #24897
- BF16 EP-MoE for DeepGEMM: #17392
- DeepGEMM deprecated in `sgl-kernel`; custom `sgl-deep-gemm` wheel + release workflow: #24268, #24348, #24385
- TRT-LLM A2A dispatch: NaN sanitization in padding slots: #24850
- TRT-LLM BF16 MoE for MTP: #24260
- MegaMoE decoupled from DeepEP backend (subsequently reverted): #24884, #25317
- DeepEP waterfill load balancing for shared-expert dispatch: #19290
- DeepEP support for `--enable-return-routed-experts`: #16859
Dependencies
- FlashInfer 0.6.8.post1 → 0.6.11 → 0.6.11.post1 (with intermediate revert): #24452, #25129, #25310, #25335
- `sgl-kernel` 0.4.2.post1, 0.4.2.post2: #24457, #25326
- `sgl-kernel`: SM90 flashmla compile fix: #24130
- Custom `sgl-deep-gemm` wheel + release workflow: #24268, #24348, #24385
- `sgl-kernel-build` x86 + arm merged into reusable workflow; disk-reclaim cleanup: #25135, #25206
- DeepEP swapped from `fzyzcjy` fork to `deepseek-ai/DeepEP@hybrid-ep` (CUDA 13): #25113
- Torch 2.11 Docker prep + dependency cleanup: #23593
- `nixl` stub installation alongside `nixl-cuXX` binary: #24369
- aarch64 cubin handling + masked-failure fix: #24234
- H20 stage on CUDA 13: #24916
- CUDA-13 kernel installation docs: #24181, #24516
- FlashInfer autotune cache: #24156
- FlashInfer workspace OOM fix: #24172
- FlashInfer allreduce fusion disabled under deterministic inference: #24629
- `trtllm` allreduce fusion with PDL: #23765
- TRTLLM MHA routing fix for draft-extend: #24856
- `torchcodec` → `soundfile` WAV fallback for trailing metadata: #24185
- `sgl-kernel-npu` 2026.05.01: #24951
Security
No security-tagged PRs in this window.
All PRs included in this release: v0.5.11...v0.5.12
New Contributors
- @Seven-Streams made their first contribution in #21722
- @stargazerZJ made their first contribution in #24344
- @Jianhong-Zhang made their first contribution in #24188
- @gh1595 made their first contribution in #24420
- @revanthreddy-hai made their first contribution in #24329
- @TallMessiWu made their first contribution in #20922
- @ranimandepudi made their first contribution in #22123
- @Joey-gvwal made their first contribution in #23255
- @fanxingran made their first contribution in #24129
- @xz-keg made their first contribution in #24604
- @zhongdaor-nv made their first contribution in #23678
- @chfeng-cs made their first contribution in #24434
- @sglang-npu-bot made their first contribution in #24815
- @brian030128 made their first contribution in #24217
- @tjdharamsi made their first contribution in #24871
- @sytianhe made their first contribution in #24716
- @Dogacel made their first contribution in #24663
- @tangcy98 made their first contribution in #24967
- @1pikachu made their first contribution in #22670
- @flutist made their first contribution in #24760
- @acheamponge made their first contribution in #20475
- @taegeonum made their first contribution in #25022
- @RulinJuice made their first contribution in #24874
- @lluki made their first contribution in #24671
- @Religious-J made their first contribution in #20930
- @ltcs11 made their first contribution in #24575
- @damahua made their first contribution in #24907
- @ziang663 made their first contribution in #25126
- @Emmanuel0612 made their first contribution in #25209
- @Jialin made their first contribution in #25234
- @jlee5814 made their first contribution in #25191
- @unseenmars made their first contribution in #24935
- @nano8259 made their first contribution in #25125
- @imp2002 made their first contribution in #24130
- @liuxianglong17 made their first contribution in #25080
Full Changelog: v0.5.11...v0.5.12