sgl-project/sglang v0.5.12

Highlights

  • DeepSeek V4 support: Full inference path for DeepSeek-V4 (#23882), including:

    Day-0 Features: #23882

    • Parallelism: Tensor Parallelism/Expert Parallelism/Context Parallelism/Data Parallel Attention
    • Hardware: Nvidia B300/B200/H200/H100/GB200/GB300, AMD MI35X
    • Prefill-Decode Disaggregation
    • HiSparse for offloading inactive KV cache to CPU memory
    • Reasoning parser and Tool Call Parser
    • DeepGemm and FlashMLA kernels for DeepSeek V4, including MegaMoE

    Post-Day-0 additions:

    • HiCache for DeepSeek V4 under unified Radix Tree [UnifiedTree]: #24691
    • W4A4 MegaMoE kernels — faster speed with negligible accuracy drop: #25052
    • Marlin/FlashInfer W4A8 MoE kernels on Hopper: #24816 #24986
    • Faster V2 fused compression kernels: #24890
    • TP16 support on H100/H20: #24949
    • Fused SiLU+clamp+FP8 quant kernel: #24897
    • Optimized MHC + DeepGemm pipeline (fused norm, fused hc_head): #24775
    • Non-standard chat template support for DSv4: #23915
    • Multi-detokenizer support: #24944
    • Pipeline Parallelism + PD support for DeepSeek-V4: #24700
    • A unified docker tag lmsysorg/sglang:v0.5.12 for all Nvidia GPUs

    See the LMSYS blog and the DeepSeek-V4 cookbook for more details; a minimal launch sketch follows at the end of these highlights.

  • TokenSpeed MLA attention backend (Blackwell, FP8 KV cache): New MLA prefill/decode kernels integrated as an attention backend on SM100, with FP8 KV cache support for low-latency MLA serving: #24925

  • DSv3.2 / GLM-5 FP4 low-latency perf: PDL enabled across DSv3.2 / GLM-5 kernels, torch.mm for the DeepSeek V3.2 indexer GEMM, and a reland of the Cute-DSL FP4 dense GEMM — materially trimming low-latency overheads on FP4 paths: #23965, #23856, #23590, #25311

  • New Model Support: DeepSeek V4 #23882, Intern-S2-Preview #24875, MiniCPM-V 4.6 #24855, Laguna-XS.2 #24204, Ring-2.6-1T #25360, and Gemma 4 MTP #24436 — with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook

  • HiCache + UnifiedRadixTree: HiCache framework support for UnifiedRadixTree (with SWA), HiCache for DeepSeek V4, SSD offload through Mooncake store, and stability fixes across cascade eviction, tombstone replay, and partial-match paths: #23316, #23391, #24691, #24277, #24943, #24972, #25068, #25277

  • Speculative Decoding V2 maturation: Adaptive Spec V2, EAGLE-3 SWA + newer drafters, Kimi K2.5 EAGLE-3 MLA, Gemma 3/4 + EAGLE-3, and an extensive naming / shape-handling refactor across draft-extend paths: #23336, #24663, #24664, #24826, #23976, #24859

  • CUDA 13 DeepEP migration: Gateway DeepEP source swapped from a community fork to deepseek-ai/DeepEP@hybrid-ep so DeepEP builds and runs cleanly on the CUDA 13 default; FlashInfer pinned at 0.6.11.post1 alongside a gpt-oss triton-kernel fix: #25113
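
As a quick orientation for the headline DeepSeek-V4 path, here is a minimal offline-inference sketch. It assumes the usual `sglang.Engine` kwargs carry over and uses `deepseek-ai/DeepSeek-V4` as a placeholder checkpoint id; the cookbook has the tuned, authoritative launch commands.

```python
# Minimal sketch (not the official recipe): offline DeepSeek-V4 inference
# through the sglang Engine API. The checkpoint id is assumed; tp_size is
# an illustrative value, not a tuned setting.
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V4",  # assumed HF repo id
    tp_size=8,                             # tensor parallelism across 8 GPUs
)

out = llm.generate(
    "Explain prefill-decode disaggregation in one paragraph.",
    {"temperature": 0.6, "max_new_tokens": 128},
)
print(out["text"])
llm.shutdown()
```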

New Model Support

Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.

  • DeepSeek V4: #23882
  • Intern-S2-Preview: #24875
  • MiniCPM-V 4.6: #24855
  • Laguna-XS.2: #24204
  • Ring-2.6-1T: #25360
  • Gemma 4 MTP: #24436

Speculative Decoding

  • TokenSpeed MLA prefill/decode kernels integrated as attention backend (FP8 KV cache, Blackwell): #24925
  • Adaptive Spec V2 (2/N): #23336
  • SWA support for EAGLE-3 drafter: #24664
  • Support newer EAGLE-3 drafters (see the sketch after this list): #24663
  • Kimi K2.5 EAGLE-3 MLA spec decoding: #24826
  • Gemma 3 / Gemma 4 + EAGLE-3 support: #23976
  • Spec V1 — split draft-extend into EagleDraftExtendInput: #24859
  • Custom speculative-algorithm registry: #23991
  • Spec-V2 overlap stale-state fix: #23456
  • trtllm decode kernel for draft extend: #24566
  • AMD: EAGLE on Qwen3.5 FP8/MXFP4 via aiter unified attention: #23146
  • Fix Kimi K2.5 MLA EAGLE + DP attention: #25033
  • Fix ngram metric off-by-1 in num_accepted_drafts_per_req_cpu: #24965
  • Fix frozen-KV MTP crash when bonus_tokens is None: #25204
  • Fix stuck-MTP on DSA models: #24635
  • Reduce specdec CPU overhead: #23321
  • Spec-decoding naming-convention rule + refactors: #24094, #25014, #25038, #24081, #24724, #24735, #24881, #25010, #25012, #25030, #25029, #25037, #25109
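
As a concrete reference for the EAGLE-3 items above, a hedged sketch of enabling an EAGLE-3 drafter via `sglang.Engine` kwargs. The kwarg names follow sglang's existing speculative-decoding server args; the target/drafter model ids are examples, and the numeric settings are illustrative rather than tuned.

```python
# Sketch: enabling an EAGLE-3 drafter. Kwarg names follow sglang's
# speculative-decoding server args; model and drafter ids are examples,
# and the numeric settings are illustrative rather than tuned.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    speculative_num_steps=5,          # draft steps per verification pass
    speculative_eagle_topk=8,         # tree branching factor per step
    speculative_num_draft_tokens=32,  # draft tokens verified per pass
)
print(llm.generate("Hello!", {"max_new_tokens": 32})["text"])
llm.shutdown()
```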

PD Disaggregation

  • DSv4 Flash disaggregation test: #24973
  • Unify DSv4 dispatch with SWA: #24888
  • DSv4 mooncake state_type branch: #24878
  • Hybrid state transfer refactor: #24932
  • Priority scheduling in PD mode fix: #25062
  • NIXL: staging buffer for heterogeneous-TP KV transfer: #22536
  • NIXL: async transfer: #23967
  • NIXL XPU: uint64 pointer overflow + mismatched P/D TP fixes: #24188, #24648
  • Mooncake: incremental transfer + SSD offload: #24257, #24277
  • Multi-node prefill bootstrap-port broadcast: #24378
  • Add retry-with-backoff for prefill bootstrap registration: #25125
  • PrefillDelayer: NCCL all-gather for cross-DP info sync: #24768
  • MORI-IO: state transfer + high-concurrency fixes: #22665
  • Per-room cleanup centralization; prevent update_status from cleared entries; fix abort update_status across KV backends: #24601, #24539, #24522
  • PD KV transfer metrics fix: #24416
  • SWA memory preallocation for disaggregated decode: #24857
  • IntraNode NVLink configuration docs: #23329
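
For orientation, a sketch of launching a single-node prefill/decode pair. The flags mirror sglang's existing disaggregation server args; the ports, GPU ids, and model id are placeholders, and a PD-aware router or load balancer would normally sit in front of the pair.

```python
# Sketch: one prefill server and one decode server on the same node.
# Flags mirror sglang's existing disaggregation args; ports, GPU ids,
# and the model id are placeholders.
import subprocess

MODEL = "deepseek-ai/DeepSeek-V4"  # assumed checkpoint id

prefill = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--disaggregation-mode", "prefill",
    "--disaggregation-transfer-backend", "mooncake",
    "--port", "30000", "--base-gpu-id", "0",
])
decode = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--disaggregation-mode", "decode",
    "--disaggregation-transfer-backend", "mooncake",
    "--port", "30001", "--base-gpu-id", "4",
])
prefill.wait()
decode.wait()
```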

HiCache & Radix Cache

  • HiCache framework for UnifiedRadixTree (enablement sketch after this list): #23316
  • SWA HiCache for unified radix cache: #23391
  • HiCache for DeepSeek V4 + nightly CI for DSA model: #24691, #25369, #25348
  • SSD offload through Mooncake store: #24277
  • HiSparse FP8 KV cache via flashmla_kv backend: #23013
  • Default storage prefetch timeout: #23309
  • UnifiedRadixCache device match semantics with HiCache: #25277
  • UnifiedTree partial match on evicted+backuped nodes: #24943
  • UnifiedTree tombstone lock release replay fix: #24972
  • UnifiedTree _cascade_evict leaf determination fix: #25068
  • UnifiedRadixTree align cache_empty_result with RadixTree: #24779
  • Mamba radix cache KV events; SWA radix cache events: #23678, #24718
  • SWA chunk req deferred fix; SWA component host hit fix: #24318, #25085
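
A hedged sketch of turning on hierarchical caching with a host-memory tier plus the Mooncake SSD tier from #24277. The kwarg names mirror sglang's existing hicache server args; the ratio is illustrative, not a tuned value.

```python
# Sketch: HiCache with a host-memory tier and Mooncake-backed SSD offload.
# Kwargs mirror sglang's hicache server args; the ratio is illustrative.
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V4",  # assumed checkpoint id
    enable_hierarchical_cache=True,        # host-memory KV tier
    hicache_ratio=2.0,                     # host pool sized at 2x device pool
    hicache_storage_backend="mooncake",    # SSD offload tier (#24277)
)
```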

LoRA

  • MLA attention LoRA (q_b_proj / kv_b_proj): #25001
  • CSGMV backend with virtual experts for MoE LoRA: #24007
  • MoE LoRA: remove CPU-GPU sync barriers and duplicate code (prefill optimize 2/n, 3/n): #24246, #24262
  • LoRADrainer for high P99 TTFT: #17913
  • qkv_proj buffer sizing when tp_size > num_key_value_heads: #24420
  • Torch-Native LoRA: embedding + graph optimization: #21885
  • Deterministic lora_id for multi-node --lora-paths: #24555
  • Fix broken sgemm_lora_a_graph_fwd due to invalid torch.mm(): #24760
  • Diffusion: fix RowParallel LoRA merged forwarding: #24410
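
For context on the section above, a sketch of serving multiple LoRA adapters and routing a request through one of them. The adapter names and paths are placeholders; `lora_paths` and the per-request `lora_path` argument follow sglang's existing multi-LoRA API.

```python
# Sketch: serving two LoRA adapters and selecting one per request.
# Adapter names/paths are placeholders.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    lora_paths={"sql": "/adapters/sql-lora", "chat": "/adapters/chat-lora"},
    max_loras_per_batch=4,
)
out = llm.generate(
    "Translate to SQL: users created this week",
    {"max_new_tokens": 64},
    lora_path="sql",  # route this request through the 'sql' adapter
)
print(out["text"])
llm.shutdown()
```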

Performance

  • TMA bulk-store set_mla_kv_buffer (up to 12× over baseline): #25311
  • Kimi tokenizer TTFT optimization: #25265
  • Avoid hidden-states D2H copy when return_hidden_states=false: #25155
  • DeepseekV2MoE: defer shared experts when routed kernel is non-mutating: #25279
  • SGLANG_OPT_FP8_WO_A_GEMM on by default (opt-out sketch after this list): #25181
  • --prefill-only-disable-kv-cache to skip KV pool allocation: #23675
  • Gemma 4 MoE: fused Q/K/V RMSNorm + per-expert FP8 ckpt loader: #24696
  • Gemma 4 VLM: PCG + fused RMSNorm + residual: #24048
  • MHC pipeline: DeepGemm + fused norm + fused hc_head: #24775
  • JIT custom all-reduce default; non-NVL follow-up: #24363, #24742
  • SGLANG_USE_JIT_ALL_REDUCE → SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: #24297
  • Eliminate logits H2D blocking copy: #24627
  • Cache empty MatchResult in RadixCache: #24470
  • Breakable CUDA graph for bs > 1: #24662
  • FA3: skip scheduler_metadata precompute under DP attention: #24632
  • aten::rms_norm / aten::mm.dtype registration in batch-invariant mode: #24459
  • Optimize Helios fused norm modulation: #24059
  • Z-Image packed QKV optimization: #24117
  • KDA prefill kernels: diagonal + recompute fuse: #24271
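
A brief sketch of opting out of the new SGLANG_OPT_FP8_WO_A_GEMM default (#25181) through the environment before engine startup. The accepted values ("0"/"1") are an assumption; check the server-args docs for the exact semantics.

```python
# Sketch: disabling the SGLANG_OPT_FP8_WO_A_GEMM default before startup.
# The "0"/"1" convention is assumed, not confirmed by the release notes.
import os
import sglang as sgl

os.environ["SGLANG_OPT_FP8_WO_A_GEMM"] = "0"  # opt out of the new default

llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-V4")  # assumed checkpoint id
```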

Observability

  • sglang:get_loads_duration_seconds Prometheus metric (scrape example after this list): #25163
  • Per-iteration forward-pass metrics via ZMQ PUB: #22789
  • SGLANG_TRACE_LEVEL env for startup trace level: #24716
  • fwd_occupancy metric in SchedulerStats + Prometheus collector: #24458
  • SWA / Mamba cache metrics: #24396
  • Mamba radix cache + SWA radix cache KV events: #23678, #24718
  • PD KV transfer metrics fix: #24416
  • CP allgather buffer registered with symmetric memory: #24040
  • Decode-side bootstrap/alloc metrics + non-int token-id filter: #24684
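
A small sketch of reading the new sglang:get_loads_duration_seconds metric (#25163) off the Prometheus endpoint. It assumes the server was launched with --enable-metrics so that /metrics is exposed; port and host are placeholders.

```python
# Sketch: scraping a single sglang Prometheus metric from /metrics.
# Assumes --enable-metrics was passed at server launch.
import requests

body = requests.get("http://localhost:30000/metrics", timeout=5).text
for line in body.splitlines():
    if line.startswith("sglang:get_loads_duration_seconds"):
        print(line)
```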

Frontend & API

  • /v1/tokenize chat-completion-style support (example after this list): #23981
  • Multi-detokenizer support: #24944
  • Structural tags for strict tool calling & reasoning across more models: #21722
  • Auto-detect reasoning / tool-call parser from chat template: #23952
  • Two-phase reasoning grammar + --enable-strict-thinking: #23953
  • OpenAI reasoning.enabled mapping to thinking + enable_thinking: #23951
  • Kimi-K2.5 bare-numeric tool-call IDs: #23950
  • Crusoe managed-inference backend: #20475
  • Azure Blob Storage connector (az:// and *.blob.core.windows.net): #23995
  • Adaptive queue-based prefill-delayer trigger: #23189
  • SGLANG_MAX_KV_CHUNK_CAPACITY env: #25120
  • SGLANG_RADIX_FORCE_MISS env: #24726, #24950
  • Reject repetition_penalty=0 in SamplingParams.verify(): #24874
  • --random-input-len for send_one.py: #24464
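
An illustrative call to the chat-completion-style /v1/tokenize endpoint (#23981). The payload shape here is an assumption modeled on chat-completions requests; consult the API reference for the final schema.

```python
# Sketch: chat-completion-style tokenize request. The JSON schema is an
# assumption; only the endpoint path comes from the release notes.
import requests

resp = requests.post(
    "http://localhost:30000/v1/tokenize",
    json={
        "model": "deepseek-ai/DeepSeek-V4",  # assumed model id
        "messages": [{"role": "user", "content": "Hello there"}],
    },
    timeout=5,
)
print(resp.json())
```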

SGLang-Diffusion

  • New model support: HunyuanVideo ModelOpt FP8 (#23199), Qwen Image ModelOpt FP8 (#23155)
  • CFG parallelism framework + multi-branch CFG for LTX-2: #23736
  • Initial dynamic batching: #18764
  • Performance-mode server args: #24491
  • dit_precision config respected (no hardcoded bf16): #24988
  • Cache-DiT: mount before torch.compile in native denoising: #25328
  • Z-Image Cache-DiT sequence-parallel override fix: #25305
  • USP: direct all-to-all collectives; NCCL deadlock fix for remainder seq lengths: #24366, #24694
  • FA3 varlen out argument handling: #24688
  • RowParallel LoRA merged forwarding fix: #24410
  • CFG communication: handle non-contiguous tensors: #24332
  • LTX-2.3 alignment with official + HQ denoising split passes: #24313, #24298
  • LTX-2 feed-forward TP optimization (#23221) + Hunyuan3D shape denoising / export chunks: #24287, #24358
  • Encoder result cache for default negative prompt: #24304
  • Channels-last 3D VAE convs by default; disable VAE CPU offload by default: #23200, #24315
  • Component attention-backend override CLI: #24320
  • AMD: online MXFP4 + FP8 diffusion quantization; aiter RMSNorm; temporal-unfolded batched Conv2D for ROCm VAE decode; dual-stream MoE: #21431, #24360, #22971, #24005, #24677
  • NPU: MXFP8 quantization for Wan2.2 (#20922, #24918); fused-operator E2E perf for Wan (#24028); selectable parallel VAE decode strategies (#23248); SANA fix (#24798); Z-Image negative-branch rotary embed CFG fix (#23538)
  • MUSA: sage attention backend (#24752)

AMD / ROCm

AMD-specific changes this cycle appear inline in the sections above (see the AMD entries under Speculative Decoding and SGLang-Diffusion).

NPU / Ascend

  • zbal support: #24575
  • Trinity-mini support (~90% accuracy): #18172
  • Shared-expert dual-stream optimization: #23827
  • Mamba-extra-buffer radix cache (Qwen3.5): #23891
  • MLA KV transfer in pipeline parallel: #23893
  • Multi-batch FIA ops: #20177
  • GLM-5 docs: DeepEP enabled by default: #23708
  • GLM-4.5V / GLM-4.7-Flash NPU support and fixes (carried over from earlier releases)
  • --disable-cuda-graph + MTP warmup fix: #23819
  • MRoPE position fix in Eagle Worker v2 with PlanStream: #23423
  • Z-Image negative-branch rotary embeddings for CFG: #23538
  • Wan quantization fix: #24540
  • causal_conv1d_update_v2 for performance: #24595
  • sgl-kernel-npu 2026.05.01 bump: #24951
  • Profiler revert + re-add: #24685, #24815
  • Doc / accuracy / FAQ work: #21537, #24658, #24676, #24777, #25114, #25130, #25268, #24668, #24918

CPU / Intel / MUSA / MLX / Apple Silicon

  • MUSA: FlashInfer sampling backend: #24978
  • MUSA: optimized kernels for piecewise CUDA graph: #23633
  • MUSA: optimized kernels for hot ops: #23255
  • MUSA: torchada 0.1.54 bump: #24592
  • MLX: on-the-fly --quantization mlx_q4 / mlx_q8 on Apple Silicon (sketch after this list): #24907
  • MLX: auto-detect MLX-format quantization_config dict: #25191
  • MLX: thread --quantization through MlxModelRunner in bench_one_batch: #25221
  • MLX: Apple Silicon Metal kernel support in sgl-kernel: #23449
  • sgl-kernel/cpu: w8a8 int8 model support for arm cpu: #16045
  • Intel CPU tests migrated to test/registered (re-applied after revert): #25139, #22670, #25044
  • Arm64 CPU Phase-1A CI bootstrap: #22123
  • XPU pipeline parallelism on Intel: #23472
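
A hedged sketch of the on-the-fly MLX quantization path from #24907 on Apple Silicon. The quantization value comes from the flag above; the model id is an example and the kwarg spelling mirrors --quantization.

```python
# Sketch: on-the-fly MLX 4-bit quantization on Apple Silicon (#24907).
# The model id is an example, not a recommendation.
import sglang as sgl

llm = sgl.Engine(
    model_path="mlx-community/Llama-3.2-3B-Instruct",  # example repo id
    quantization="mlx_q4",  # or "mlx_q8"
)
print(llm.generate("Hi!", {"max_new_tokens": 16})["text"])
llm.shutdown()
```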

Quantization & Kernels

  • NVFP4 hot-reload-safe weight loading (alias-when-same-shape): #25190
  • NVFP4: free unused source scales after weight processing: #25107
  • Cute-DSL NVFP4 quantization kernels: #23745
  • Cute-DSL FP4 dense GEMM (reland): #23590
  • DSv3.2 indexer GEMM via torch.mm: #23856
  • PDL for DSv3.2 / GLM-5 kernels: #23965
  • DSv4: W4A4 MegaMoE; W4(MXFP4)A16 on Hopper: #25052, #24986
  • FlashInfer SM90 cutlass MXFP4 MoE backend (W4A16) for GPT-OSS + DSv4: #24816
  • Port KV Compression V2 + fused SiLU+clamp+FP8 quant from DSV4 dev branch: #24890, #24897
  • BF16 EP-MoE for DeepGEMM: #17392
  • DeepGEMM deprecated in sgl-kernel; custom sgl-deep-gemm wheel + release workflow: #24268, #24348, #24385
  • TRT-LLM A2A dispatch: NaN sanitization in padding slots: #24850
  • TRT-LLM BF16 MoE for MTP: #24260
  • MegaMoE decoupled from DeepEP backend (subsequently reverted): #24884, #25317
  • DeepEP waterfill load balancing for shared-expert dispatch: #19290
  • DeepEP support for --enable-return-routed-experts: #16859

Dependencies

  • FlashInfer 0.6.8.post1 → 0.6.11 → 0.6.11.post1 (with intermediate revert): #24452, #25129, #25310, #25335
  • sgl-kernel 0.4.2.post1, 0.4.2.post2: #24457, #25326
  • sgl-kernel: SM90 flashmla compile fix: #24130
  • Custom sgl-deep-gemm wheel + release workflow: #24268, #24348, #24385
  • sgl-kernel-build x86 + arm merged into reusable workflow; disk-reclaim cleanup: #25135, #25206
  • DeepEP swapped from fzyzcjy fork to deepseek-ai/DeepEP@hybrid-ep (CUDA 13): #25113
  • Torch 2.11 Docker prep + dependency cleanup: #23593
  • nixl stub installation alongside nixl-cuXX binary: #24369
  • aarch64 cubin handling + masked-failure fix: #24234
  • H20 stage on CUDA 13: #24916
  • CUDA-13 kernel installation docs: #24181, #24516
  • FlashInfer autotune cache: #24156
  • FlashInfer workspace OOM fix: #24172
  • FlashInfer allreduce fusion disabled under deterministic inference: #24629
  • trtllm allreduce fusion with PDL: #23765
  • TRTLLM MHA routing fix for draft-extend: #24856
  • torchcodec → soundfile WAV fallback for trailing metadata: #24185
  • sgl-kernel-npu 2026.05.01: #24951

Security

No security-tagged PRs in this window.

All PRs included in this release: v0.5.11...v0.5.12

New Contributors

Full Changelog: v0.5.11...v0.5.12
