sgl-project/sglang v0.5.10rc0

Pre-release

Highlights

  • Piecewise CUDA Graph Enabled by Default: Piecewise CUDA graph capture is now the default execution mode, reducing memory overhead and improving throughput for models with complex control flow patterns: #16331

  • Elastic EP for Partial Failure Tolerance: Integrate Elastic NIXL-EP into SGLang, enabling partial failure tolerance for DeepSeek MoE deployments — when a GPU fails, the system redistributes expert weights and continues serving without full restart: #19248, #17374, #12068 blog

  • HiSparse for Sparse Attention: Integrate HiSparse sparse attention backend for efficient long-context inference with reduced compute through sparsity-aware attention: #20343

  • SGLang-Diffusion Update:

    • Model support: LTX-2, Hunyuan3D-2, Helios
    • Performance improvements: Qwen-Image and Z-Image sped up by up to 1.5x
    • New platform: macOS
    • New feature: improved diffusers backend performance by integrating all optimizations from Cache-DiT
    • Skills: a curated skill for developing and optimizing sglang-diffusion is available to explore
  • FlashInfer MXFP8 Kernel Support: Integrate FlashInfer mxfp8 kernels for GEMM and MoE operations, enabling mixed-precision FP8 inference with higher accuracy through microscaling for RL and general workloads: #19537

  • Transformers 5.3.0 Upgrade: Major upgrade from transformers 4.57.1 to 5.3.0, unlocking support for the latest model architectures and features from HuggingFace. The GLM-5 model is now supported in the standard image instead of a custom-built image: #17784

  • DeepSeek V3.2 / GLM-5 Optimization: GLM-5 runnable on main branch (with upgraded transformers). Fused Triton kernel for prefill KV cache fetching, NSA fuse store indexer for K cache, and configurable KV length threshold for sparse MLA attention at prefill — boosting throughput for long-context DeepSeek V3.2 and GLM-5 serving: #19319, #19148, #20062

  • Qwen3.5 GDN/KDA Optimization: Transpose linear attention state layout from [N, HV, K, V] to [N, HV, V, K] and fuse split/reshape/cat ops in GDN projection with Triton kernel, plus CuTeDSL KDA decode kernel support for improved Qwen3.5 performance: #20283, #21019, #21203

  • LoRA Support for MoE Layers: Add LoRA fine-tuning support for Mixture-of-Experts layers with JIT alignment kernels, fused Triton kernels, TP support, and auto-detection of LoRA target modules — enabling efficient adapter-based tuning on MoE models like DeepSeek: #19710, #19711, #14105, #21439

  • Prefill Context Parallel for MHA (Qwen3): Enable context parallelism during prefill for multi-head attention models like Qwen3 MoE, distributing long sequences across GPUs to reduce per-GPU memory and accelerate prefill: #18233

  • Flash Attention 4 Official Library Support: Upgrade to the official FlashAttention 4 package, bringing the latest attention optimizations and Blackwell GPU support: #20303

  • sglang-kernel 0.4.0: Major kernel package release with renamed package (sgl-kernel → sglang-kernel), consolidated kernels, and cleanup of deprecated ops: #20440

  • Native MLX Backend for Apple Silicon: Add native MLX execution backend enabling SGLang to run inference directly on Apple Silicon Macs without CUDA: #20342

New Model Support

  • Nemotron-3-Super (bf16/fp8/nvfp4): #20407, cookbook
  • Mistral Small 4 (Pixtral): #20708
  • GLM-5: Supported on main branch with transformers 5.3.0
  • Helios (Diffusion - Real-Time Long Video Generation): #19782
  • Hunyuan3D-2 (Diffusion): #18170
  • LTX-2 (Diffusion): #19295
  • MOVA (Diffusion): #19489, #20430
  • FireRed-Image-Edit (Diffusion): #20862

DeepSeek V3.2 / GLM-5 Optimization

  • Fused get_k_and_s Triton kernel for prefill KV cache fetching: #19319
  • Support NSA fuse store indexer K cache: #19148
  • SGLANG_NSA_DENSE_ATTN_KV_LEN_THRESHOLD environment variable to control the KV length threshold for applying the sparse MLA attention kernel at prefill: #20062
  • Change default setting of V3.2 nvfp4 on TP4: #20086
  • Fix NSA topk_indices_offset when prefill flashmla_sparse is used with FP8 KV cache: #20606
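The threshold above is configured purely through the environment before the server process starts; a minimal sketch (the value 2048 is an arbitrary illustration, not a documented default, and the exact switching semantics are defined in #20062 — in practice you would export the variable in your shell before running `python -m sglang.launch_server`):

```python
import os

# Set before sglang is imported/launched; 2048 is an illustrative value.
# The variable controls the KV length at which prefill switches between the
# dense and sparse MLA attention kernels (see #20062 for exact semantics).
os.environ["SGLANG_NSA_DENSE_ATTN_KV_LEN_THRESHOLD"] = "2048"

assert os.environ["SGLANG_NSA_DENSE_ATTN_KV_LEN_THRESHOLD"] == "2048"
```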

Qwen3.5 Optimization

  • GDN attention state layout transposed from [N, HV, K, V] to [N, HV, V, K]: #20283
  • Fuse split/reshape/cat ops in GDN projection with Triton kernel: #21019
  • CuTeDSL KDA decode kernel support: #21203
  • GDN packed decode support: #20627
  • Replace einops rearrange with torch.flatten in GatedDeltaNet: #20386
  • Mamba slice fix for Prefill TP != Decode TP: #20655
  • Fix broken pipeline parallelism layer splitting: #21070
  • Fix CP in-seq-split method for DeepSeek V3.2: #21192
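The layout change in the first bullet can be pictured with a plain NumPy sketch; the real code operates on torch tensors inside Triton kernels, and the shapes below are toy values. The assumption the transpose encodes is that the decode kernel reads the state one value channel at a time, so a V-major layout keeps those reads contiguous:

```python
import numpy as np

# Toy linear-attention state: N tokens, HV heads, K key dim, V value dim.
N, HV, K, V = 2, 4, 8, 16
state = np.zeros((N, HV, K, V), dtype=np.float32)

# The new layout swaps the last two axes: [N, HV, K, V] -> [N, HV, V, K].
state_t = np.ascontiguousarray(state.transpose(0, 1, 3, 2))
assert state_t.shape == (N, HV, V, K)

# In the V-major layout, the K entries for one value channel of one head
# form a contiguous row.
assert state_t[0, 0, 0].flags["C_CONTIGUOUS"]
```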

Performance

  • Piecewise CUDA graph enabled by default: #16331
  • FlashInfer MXFP8 kernels for GEMM and MoE: #19537
  • NCCL/RCCL pre-warming to reduce P99 TTFT cold-start latency: #20477
  • Overlap NSA-CP key all-gather with query computation for DeepSeek-V3.2: #20438
  • Pad max-num-requests in decode CUDA graph for higher coverage: #20978
  • Optimize waiting queue update with set usage: #20503
  • Avoid unnecessary GPU-CPU sync in eagle_info: #20266
  • Use Triton conv1d for non-contiguous input to avoid .contiguous() copy (Mamba): #20469
  • Replace einops rearrange with native torch ops in Kimi-Linear KDA path: #20396
  • Fused triton kernel for normal_decode_set_metadata (FlashAttn): #20778
  • Precompute SWA cache location: #20449
  • Use FlashInfer tinygemm for GPT-OSS MoE router on SM90+: #20755
  • Replace clamp_position and _resolve_future_token_ids with JIT kernel + platform dispatch: #20976, #20999
  • CUTLASS FP8 Blockwise GEMM improvement for SM120: #20887
  • Expose get_scheduler_metadata for FA3 decode optimization: #21103
  • Optimize diffusion Triton rotary embedding by processing multiple heads per token: #21387
  • Speed up Qwen select01 Triton modulation kernels (Diffusion): #21318

LoRA

  • LoRA support for MoE layers with JIT alignment kernel, fused Triton kernel, and TP: #19710, #19711, #14105
  • Auto-detect LoRA target modules: #21439
  • Use torch.addmm instead of separate mm and add_ for torch.native LoRA: #20562
  • Fix torch-native LoRA for multi-adapter case: #20564
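The torch.addmm change can be sketched as follows; the shapes and the rank-8 adapter are hypothetical, but the fused call is numerically equivalent to the separate mm and in-place add_ while issuing a single kernel:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 32)     # batch of hidden states
A = torch.randn(32, 8)     # LoRA down-projection (rank 8, illustrative)
B = torch.randn(8, 32)     # LoRA up-projection
base = torch.randn(4, 32)  # output of the frozen base layer

# Two-op version: an mm followed by an in-place add_.
out_two = base.clone()
out_two.add_((x @ A) @ B)

# Fused version: addmm computes base + (x @ A) @ B in one call.
out_fused = torch.addmm(base, x @ A, B)

assert torch.allclose(out_two, out_fused, atol=1e-5)
```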

SGLang-Diffusion

  • Model support: LTX-2 (#19295), Hunyuan3D-2 (#18170), Helios (#19782), FireRed-Image-Edit (#20862), MOVA (#19489, #20430)
  • Performance: Optimized Qwen-image with fused residual/layernorm/scale/shift/gate/select01 kernel (#20395), Z-Image with fused Triton rotary embedding and select01 kernels (#21387, #21318) — up to 1.5x speedup
  • Platform: macOS support for diffusion models (#19549, #20607)
  • Feature: Enhance diffusers backend by integrating all optimizations from Cache-DiT: #20361
  • NVFP4 support for Flux.2: #20137
  • AITER Sage attention backend: #20178
  • AITER GroupNorm for VAE decode on ROCm: #20170
  • Fused qknorm rope kernel and qknorm_across_heads optimization: #21440, #21503
  • Fix FLUX.1 output correctness: #21041
  • Fix Sana corrupted output by removing spurious QK norm layers: #20656
  • Fix Z-Image SP sharding for portrait and padded resolutions: #21042

Speculative Decoding

  • Reference-based speculative decoding refactor: #20393
  • Fix spec v1 token_ids_logprobs: #20718

Disaggregation (PD)

  • Fix infinite loop in decode resolve_pending_reqs: #20371
  • Non-blocking try_ensure_parallel_info in pending queue: #20785
  • Add retry interval in ensure_prefill_info: #20832
  • Fix health check false-positive in disagg is_fully_idle: #20756
  • Fix AssertionError crash in disagg prefill inflight queue with PP: #20686
  • Add kv_cache_dtype consistency check for PD disaggregation: #19407
  • Skip health check enqueue when PD disagg queues have backlog: #20191

HiCache

  • HiSparse for sparse attention: #20343
  • HybridCacheController for mamba state offloading: #20457
  • Release write-through lock_ref during decode: #20049
  • Check in-flight async ops in is_fully_idle() before attach/detach: #20746
  • Add check requiring hicache-storage-backend to be provided when enabling KV caching on the decode side in PD: #20732

VLM

  • Replace decord with torchcodec for video decoding: #20055
  • Replace soundfile+torchaudio with torchcodec AudioDecoder in load_audio: #20190
  • Fix GLM-V / GLM-OCR field detection for transformers 5.x and MTP omission fix: #21134
  • Fix Qwen3VL hang when --mm-enable-dp-encoder is enabled: #20759
  • Fix pos_emb layer TP issue when DP encoder enabled for Qwen3 VL: #20788
  • Compute M-RoPE positions for preprocessed VL inputs (gRPC): #21244

Bug Fixes

  • Fix streaming session with paged KV cache (SWA/MLA): #20070
  • Fix missing TTFT histogram for single-batch requests: #20122
  • Fix missing clone in hicache: #20130
  • Fix VRAM leak in overlap scheduling with structured output: #20697
  • Fix chunked prefill and KV cache leaks for streaming sessions: #20476
  • Fix token leak with logprob_start_len=0 in streaming sessions: #20557
  • Fix streaming logprobs corruption caused by shared mutable list reference: #21030
  • Fix non-streaming request abort failure when --enable-metrics is enabled: #20625
  • Fix /GET HTTP route when ollama endpoint is not set: #20494
  • Fix customized_info offset truncation: #21262
  • Fix UnboundLocalError when DetokenizerManager constructor fails: #21471
  • Fix benchmark generating empty prompts when random_input_len is small: #21492
  • Fix MxInt4 MoE returning wrong output variable: #21348
  • Fix result writer in tuning_block_wise_kernel.py, and add FP8 kernel config for L40: #20368
  • Fix scale_step_k computation in the fp8_kernel: #20819
  • Propagate grammar errors and improve llguidance backend: #20467
  • Fix /generate JSON serialization for non-finite top_logprobs: #20714
  • Fix Kimi K2.5 dp attention + spec decoding launch crash: #21391
  • Fix sessions with mm inputs: #21269
  • Fix write-through events not processed when scheduler is idle: #20560
  • Fix MTP prefill cuda graph logging: #20279
  • Fix bug when prefill delay and DP are enabled: #20134
  • Fix CP residual size mismatch crash when tp_size == attn_cp_size: #21170
  • Fix dbrx model bug: #21445
  • Add adjusted_filter_batch: #21260

Network / IPv6

  • Fix socket utilities and reserve_port for IPv6 dual-stack support: #20491
  • Use NetworkAddress for dist_init_method and loopback fallbacks: #20657
  • Add NetworkAddress abstraction for IPv6-safe address handling: #20306
  • Fix dual-stack socket handling: IPV6_V6ONLY, IPv4-first, is_port_available all-family check: #20643
  • Add --strict-ports option for predictable port assignment: #21320
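The dual-stack handling above largely comes down to clearing IPV6_V6ONLY on an IPv6 listening socket so it also accepts IPv4-mapped connections; a stdlib sketch, with an illustrative function name (note that not every platform allows clearing this option, and some default it differently):

```python
import socket

def make_dual_stack_listener(port: int) -> socket.socket:
    """Open one IPv6 socket that also accepts IPv4 clients (illustrative)."""
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # 0 = accept both IPv6 and IPv4-mapped connections on this socket.
    s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    s.bind(("::", port))
    s.listen()
    return s

listener = make_dual_stack_listener(0)  # port 0 = let the OS pick a free port
assert listener.getsockname()[1] > 0
listener.close()
```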

AMD Hardware

  • FP8 prefill integration with radix cache path for DeepSeek models: #20187
  • Add MHA FP8-KV support: #21253
  • Improve openai/gpt-oss performance: #21020
  • GPT-OSS decode performance optimization: #20392
  • Enable aiter unified attention for non-SWA models (Qwen3-VL): #20897
  • Add fused GemmaRMSNorm forward_hip for Qwen3.5: #21188
  • Integrate aiter's fused_topk for softmax scoring: #21421
  • Auto-select dispatch quantization type from MoE weight dtype (MoRI): #21040
  • Fix DeepSeek-V3.2 accuracy issue on MI355: #20840
  • Fix GPU fault when running DeepSeek-R1 with DP enabled: #20841
  • Lazy-import CuteDSL KDA kernel to fix AMD/ROCm startup crash: #21428

NPU/Ascend

  • Replace swiglu with custom kernel: #20192
  • Support Kimi-K2.5-w4a8 on Ascend: #20131
  • Support qwen image preprocess on NPU: #20189
  • Add new fusion operator DispatchFFNCombine: #20245
  • Support mamba cache transfer for NPU: #20364
  • NPU support for diffusion models with enable_torch_compile: #20687
  • Adapt w2 quant layer for Minimax2.5: #20905
  • Fix NZ performance bug for diffusion models: #20684

CPU Backend

  • Fix bug in AVX512 implementation of flash_attn_softmax: #20220

MUSA Platform

  • Enable Piecewise CUDA Graph support for MUSA platform: #20758
  • Support MUSA device in apply_vocab_mask: #21296

MPS (Apple Silicon)

  • Support sglang.check_env on MPS: #20753
  • Add StreamContext stub for MPS: #20782
  • Native MLX execution backend for Apple Silicon Mac: #20342

Documentation

  • Clarify that --chat-template is required for Qwen3-Reranker: #20596
  • Add GEMM backends table: #20213
  • Add DSA/NSA attention backend to support matrix: #20326
  • Add out-of-tree model integration guide: #21050
  • Fix quantization documentation: #20619
  • Add documents for encoder global mm cache: #20636
  • Update Nemotron example docs: #21416

Dependencies

  • sgl-kernel 0.3.21 → sglang-kernel 0.4.0: #20440
  • Flashinfer 0.6.3 → 0.6.6: #20480
  • Transformers 4.57.1 → 5.3.0: #17784
  • xgrammar 0.1.25 → 0.1.32: #21032
  • mooncake-transfer-engine 0.3.9 → 0.3.10: #20942
  • Flash Attention 4 (official release): #20303
  • sgl-fa4 bumped to 4.0.5: #20378
  • Diffusers 0.36.0 → 0.37.0: #20318

Security

  • Fix CVE-2026-3989: Replace unsafe pickle.loads with SafeUnpickler in replay_request_dump.py: #20904
  • Fix CVE-2026-3059 / CVE-2026-3060: Bind ZMQ sockets to localhost to prevent unauthenticated remote access (multimodal generation broker and encoder parallel disaggregation): #21435
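The pickle hardening in the first fix follows the standard restricted-Unpickler pattern from the Python docs; this sketch is not the actual SafeUnpickler from replay_request_dump.py, and the allowlist below is illustrative:

```python
import io
import pickle

# Only classes on an explicit allowlist may be reconstructed; anything else
# (e.g. a function smuggled in to gain code execution) raises immediately.
ALLOWED = {("builtins", "dict"), ("builtins", "list"), ("builtins", "str")}

class SafeUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"forbidden class {module}.{name}")
        return super().find_class(module, name)

def safe_loads(data: bytes):
    return SafeUnpickler(io.BytesIO(data)).load()

# Plain containers round-trip fine; they never touch find_class.
assert safe_loads(pickle.dumps({"req": ["a", "b"]})) == {"req": ["a", "b"]}
```

A payload referencing any global, such as `pickle.dumps(len)`, hits `find_class` and is rejected with `UnpicklingError` instead of being resolved.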

New Contributors

All PRs included in this release: v0.5.9...v0.5.10rc0

Full Changelog: v0.5.9...v0.5.10rc0
