sgl-project/sglang v0.5.10

Highlights

  • Piecewise CUDA Graph Enabled by Default: Piecewise CUDA graph capture is now the default execution mode, reducing memory overhead and improving throughput for models with complex control flow patterns: #16331

  • Elastic EP for Partial Failure Tolerance: Integrate Elastic NIXL-EP into SGLang, enabling partial failure tolerance for DeepSeek MoE deployments — when a GPU fails, the system redistributes expert weights and continues serving without full restart: #19248, #17374, #12068 blog

  • GPU Staging Buffer for PD Disaggregation: Gathers scattered head slices into contiguous memory for bulk RDMA transfer, reducing the RDMA request count on GQA models by ~1000x. TPS/GPU at high concurrency increased by ~5x with Prefill TP4 + Decode DEP4 on Qwen3.5: #19890

  • HiSparse for Sparse Attention: Integrate HiSparse sparse attention backend for efficient long-context inference with reduced compute through sparsity-aware attention: #20343

  • SGLang-Diffusion Update:

    • Model support: LTX-2, Hunyuan3D-2, Helios
    • Performance: up to 1.5x speedup on Qwen-image and Z-image
    • New platform: macOS
    • New feature: improved diffusers backend performance by integrating all optimizations from Cache-DiT
    • Skills: a curated skill is available for developing and optimizing SGLang-Diffusion

  • FlashInfer MXFP8 Kernel Support: Integrate FlashInfer mxfp8 kernels for GEMM and MoE operations, enabling mixed-precision FP8 inference with higher accuracy through microscaling for RL and general workloads: #19537

  • Transformers 5.3.0 Upgrade: Major upgrade from transformers 4.57.1 to 5.3.0, unlocking support for the latest model architectures and features from HuggingFace. The GLM-5 model is now supported in the standard image rather than a custom-built one: #17784

  • DeepSeek V3.2 / GLM-5 Optimization: GLM-5 is runnable on the main branch (with upgraded transformers). Fused Triton kernel for prefill KV cache fetching, NSA fused-store indexer for the K cache, TRT-LLM prefill/decode DSA kernels as the default on SM100/SM103, and IndexCache improving throughput by more than 10% under high load: #19319, #19148, #20062, #21914, #21405

  • Qwen3.5 GDN/KDA Optimization: Transpose linear attention state layout from [N, HV, K, V] to [N, HV, V, K] and fuse split/reshape/cat ops in GDN projection with Triton kernel, plus CuTeDSL KDA decode kernel support for improved Qwen3.5 performance: #20283, #21019, #21203

  • LoRA Support for MoE Layers: Add LoRA fine-tuning support for Mixture-of-Experts layers with JIT alignment kernels, fused Triton kernels, TP support, CUDA graph support, and auto-detection of LoRA target modules — enabling efficient adapter-based tuning on MoE models like DeepSeek: #19710, #19711, #14105, #21439, #21647

  • Prefill Context Parallel for MHA (Qwen3): Enable context parallelism during prefill for multi-head attention models like Qwen3 MoE, distributing long sequences across GPUs to reduce per-GPU memory and accelerate prefill: #18233

  • Flash Attention 4 Official Library Support: Upgrade to the official FlashAttention 4 package, bringing the latest attention optimizations and Blackwell GPU support: #20303

  • Skip-Softmax Attention for FlashInfer TRT-LLM Kernels: Reduce computation overhead in attention layers by skipping redundant softmax normalization: #19089

  • Speculative Decoding with FA4 Backend: Enable speculative decoding for the FA4 attention backend, combining speculative inference with next-generation flash attention for faster generation: #21080

  • MM Attention FA4 Default on SM100: Multi-modal attention now uses FA4 by default on Blackwell hardware for improved VLM performance: #21595

  • Stronger Transformers Modeling Backend: Enhanced transformers backend with full TP, PP, MoE, VLM support, and torch.compile compatibility: #19163

  • sglang-kernel 0.4.1: Major kernel package release with renamed package (sgl-kernel → sglang-kernel), consolidated kernels, and cleanup of deprecated ops: #20440, #22009

  • Native MLX Backend for Apple Silicon: Add native MLX execution backend enabling SGLang to run inference directly on Apple Silicon Macs without CUDA: #20342

New Model Support

  • Nemotron-3-Super (bf16/fp8/nvfp4): #20407, cookbook
  • Mistral Small 4 (Pixtral): #20708
  • LFM2-VL (Liquid Foundation Model 2 Vision-Language): #21230
  • Voxtral (speech-to-text): #21635
  • GLM-5: Supported on main branch with transformers 5.3.0
  • Helios (Diffusion - Real-Time Long Video Generation): #19782
  • Hunyuan3D-2 (Diffusion): #18170
  • LTX-2 (Diffusion): #19295
  • MOVA (Diffusion): #19489, #20430
  • FireRed-Image-Edit (Diffusion): #20862

DeepSeek V3.2 / GLM-5 Optimization

  • Fused get_k_and_s Triton kernel for prefill KV cache fetching: #19319
  • Support NSA fused-store indexer for the K cache: #19148
  • SGLANG_NSA_DENSE_ATTN_KV_LEN_THRESHOLD environment variable to control the KV-length threshold for applying the sparse MLA attention kernel at prefill: #20062
  • Support TRT-LLM prefill/decode DSA kernels as default for Blackwell (SM100/SM103): #21914, #21783
  • Enable IndexCache, improving throughput by more than 10% under high load: #21405
  • Change default setting of V3.2 nvfp4 on TP4: #20086

Qwen3.5 Optimization

  • GDN attention state layout transposed from [N, HV, K, V] to [N, HV, V, K]: #20283
  • Fuse split/reshape/cat ops in GDN projection with Triton kernel: #21019
  • CuTeDSL KDA decode kernel support: #21203
  • Fuse GDN kkt + solve_tril and KDA kernels: #21411, #21604
  • GDN packed decode support: #20627
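
The layout transpose above can be illustrated with a small numpy sketch (shapes and the einsum formulation are illustrative, not the actual kernel): at decode time the linear-attention output contracts the recurrent state with the query over the key dimension, and storing the state as [N, HV, V, K] turns that contraction into a row-contiguous matvec per output element.

```python
import numpy as np

# Hypothetical shapes: N requests, HV value heads, K key dim, V value dim.
N, HV, K, V = 2, 4, 8, 16
rng = np.random.default_rng(0)

state_kv = rng.standard_normal((N, HV, K, V))      # original [N, HV, K, V] layout
state_vk = state_kv.transpose(0, 1, 3, 2).copy()   # transposed [N, HV, V, K] layout
q = rng.standard_normal((N, HV, K))                # one decode-step query per head

# Decode output o[n,h,v] = sum_k q[n,h,k] * S[n,h,k,v].
o_kv = np.einsum("nhk,nhkv->nhv", q, state_kv)
# With the [V, K] layout the same contraction reads each state row contiguously:
o_vk = np.einsum("nhvk,nhk->nhv", state_vk, q)

assert np.allclose(o_kv, o_vk)
```

Both layouts are mathematically identical; the transpose only changes which dimension is contiguous in memory during the decode-time matvec.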

Performance

  • Piecewise CUDA graph enabled by default: #16331
  • FlashInfer MXFP8 kernels for GEMM and MoE: #19537
  • Skip-softmax attention for FlashInfer TRT-LLM kernels: #19089
  • NCCL/RCCL pre-warming to reduce P99 TTFT cold-start latency: #20477
  • Overlap NSA-CP key all-gather with query computation for DeepSeek-V3.2: #20438
  • CUTLASS FP8 Blockwise GEMM improvement for SM120: #20887
  • CUTLASS NVFP4 GEMM improvement for SM120: #21314
  • Enable multi-thread weight loading by default: #20289
  • Optimize CUDA IPC for multimodal transfer by caching IPC pool handles: #21418
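
The skip-softmax idea can be sketched in numpy under one common formulation (illustrative only, not the actual TRT-LLM kernel): instead of materializing a normalized softmax for each KV chunk, accumulate an unnormalized numerator and denominator with a running max and divide once at the end.

```python
import numpy as np

rng = np.random.default_rng(1)
L, D = 32, 8                       # illustrative sequence length and head dim
q = rng.standard_normal(D)
k = rng.standard_normal((L, D))
v = rng.standard_normal((L, D))

# Reference: full softmax attention in one pass.
s = k @ q
p = np.exp(s - s.max())
ref = (p / p.sum()) @ v

# Deferred normalization: per chunk, accumulate the unnormalized weighted sum
# of values and the running denominator, skipping per-chunk softmax passes.
m = -np.inf                        # running max for numerical stability
num = np.zeros(D)                  # unnormalized weighted sum of values
den = 0.0                          # running sum of exp scores
for start in range(0, L, 8):
    sc = k[start:start + 8] @ q
    m_new = max(m, sc.max())
    scale = np.exp(m - m_new)      # rescale previous accumulators
    e = np.exp(sc - m_new)
    num = num * scale + e @ v[start:start + 8]
    den = den * scale + e.sum()
    m = m_new

assert np.allclose(ref, num / den)
```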

LoRA

  • LoRA support for MoE layers with JIT alignment kernel, fused Triton kernel, and TP: #19710, #19711, #14105
  • Auto-detect LoRA target modules: #21439
  • LoRA support for CUDA graph: #21647
  • LoRA support for Qwen3-VL-30B-A3B and GPT-OSS 20B: #21469, #21570
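
A minimal numpy sketch of how low-rank adapters compose with routed experts (all names, shapes, and the scaling factor are hypothetical): each expert e carries its own A_e/B_e pair, and computing the adapter as two small GEMMs is equivalent to folding the low-rank delta into the expert weight.

```python
import numpy as np

rng = np.random.default_rng(2)
E, D_in, D_out, r = 4, 16, 32, 4   # experts, dims, LoRA rank (illustrative)
T = 8                              # tokens in the batch
W = rng.standard_normal((E, D_in, D_out))   # base expert weights
A = rng.standard_normal((E, D_in, r))       # per-expert LoRA down-projection
B = rng.standard_normal((E, r, D_out))      # per-expert LoRA up-projection
s = 2.0                                     # LoRA scaling

x = rng.standard_normal((T, D_in))
expert_ids = rng.integers(0, E, size=T)     # router's expert choice per token

# Adapted expert output via two small GEMMs per token:
y = np.stack([x[t] @ W[e] + s * (x[t] @ A[e]) @ B[e]
              for t, e in enumerate(expert_ids)])

# ...which equals folding the low-rank delta into the expert weight:
y_ref = np.stack([x[t] @ (W[e] + s * A[e] @ B[e])
                  for t, e in enumerate(expert_ids)])
assert np.allclose(y, y_ref)
```

Keeping A/B separate is what makes adapters cheap to swap and train; the fused/JIT kernels referenced above batch these small per-expert GEMMs efficiently.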

Elastic EP

  • Integrate Elastic NIXL-EP into SGLang: #19248
  • Back up Expert Weights in DRAM: #17374
  • Use GPU P2P to exchange expert weights during EPLB: #12068
  • Add EPLB rebalance support for Kimi K2.5: #21004

SGLang-Diffusion

  • Model support: LTX-2 (#19295), Hunyuan3D-2 (#18170), Helios (#19782), FireRed-Image-Edit (#20862), MOVA (#19489, #20430)
  • Performance: Optimized Qwen-image with fused residual/layernorm/scale/shift/gate/select01 kernel (#20395), Z-Image with fused Triton rotary embedding and select01 kernels (#21387, #21318) — up to 1.5x speedup
  • Platform: macOS support for diffusion models (#19549, #20607)
  • Feature: Enhance diffusers backend by integrating all optimizations from Cache-DiT: #20361
  • NVFP4 support for Flux.2: #20137
  • Diffusion norm fusion for Z-Image: #18762
  • LTX-2 two-stage pipeline support: #20707

Speculative Decoding

  • Reference-based speculative decoding refactor: #20393
  • Add FA4-based speculative decoding support: #21080
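
The draft-and-verify loop behind speculative decoding can be sketched with toy deterministic models (`draft_next` and `target_next` are hypothetical stand-ins, not SGLang APIs): the draft proposes k tokens cheaply, the target verifies them and keeps the matching prefix plus one token of its own, so the output is identical to plain target-only decoding.

```python
def draft_next(ctx):
    # Toy cheap draft model: guesses the next token as last + 1 (mod 5).
    return (ctx[-1] + 1) % 5

def target_next(ctx):
    # Toy target model: agrees with the draft except that 3 is followed by 0.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 5

def generate(model, ctx, n):
    # Plain autoregressive decoding, one model call per token.
    out = list(ctx)
    for _ in range(n):
        out.append(model(out))
    return out

def speculative_generate(ctx, n, k=3):
    out = list(ctx)
    end = len(out) + n
    while len(out) < end:
        # 1. Draft proposes k tokens autoregressively with the cheap model.
        prop = []
        for _ in range(k):
            prop.append(draft_next(out + prop))
        # 2. Target verifies every proposed position (one parallel pass in a
        #    real system; a loop in this toy), keeping the matching prefix
        #    plus one token of its own at the first mismatch or at the end.
        accepted = []
        for i in range(len(prop) + 1):
            t = target_next(out + accepted)
            accepted.append(t)
            if i == len(prop) or t != prop[i]:
                break
        out.extend(accepted)
    return out[:end]

# Speculative output matches target-only decoding exactly.
assert speculative_generate([0], 10) == generate(target_next, [0], 10)
```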

Disaggregation (PD)

  • GPU staging buffer with dynamic ring allocator for heterogeneous TP KV transfer: #19890
  • HiSparse direct cache transfer from Prefill to Decode DRAM: #21591
  • Non-blocking try_ensure_parallel_info in pending queue: #20785
  • Add kv_cache_dtype consistency check for PD disaggregation: #19407
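
The staging-buffer idea from #19890 can be sketched in numpy (pool layout and names are illustrative): rather than issuing one RDMA request per scattered page/head slice, gather the slices into a contiguous staging buffer first and send a single bulk transfer.

```python
import numpy as np

rng = np.random.default_rng(3)
num_pages, heads, head_dim = 64, 8, 128        # illustrative paged-KV layout
kv_pool = rng.standard_normal((num_pages, heads, head_dim))

# Pages belonging to one request are scattered across the pool.
page_ids = np.array([3, 17, 42, 5, 60])

# Naively, every (page, head) slice is its own RDMA request:
naive_requests = len(page_ids) * heads          # 40 small transfers

# Staging: gather the scattered slices into one contiguous buffer,
# then issue a single bulk transfer of `staging`.
staging = kv_pool[page_ids].reshape(-1)         # contiguous copy
bulk_requests = 1

# The receiver scatters the buffer back into its own pool layout.
restored = staging.reshape(len(page_ids), heads, head_dim)
assert np.array_equal(restored, kv_pool[page_ids])
assert bulk_requests < naive_requests
```

The extra device-side copy is cheap relative to the per-request overhead it removes, which is where the large reduction in RDMA request count comes from.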

HiCache

  • HiSparse for sparse attention: #20343
  • HybridCacheController for mamba state offloading: #20457
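
A hedged sketch of sparsity-aware attention (top-k key selection here is purely illustrative; HiSparse's actual selection policy may differ): attend only over the highest-scoring keys, cutting the softmax and weighted-sum cost from O(L) to O(k) for long contexts.

```python
import numpy as np

rng = np.random.default_rng(4)
L, D, topk = 128, 16, 16           # illustrative context length, dim, budget
q = rng.standard_normal(D)
k = rng.standard_normal((L, D))
v = rng.standard_normal((L, D))

scores = k @ q
# Keep only the top-k scoring keys instead of attending over all L.
sel = np.argpartition(scores, -topk)[-topk:]
p = np.exp(scores[sel] - scores[sel].max())
out = (p / p.sum()) @ v[sel]

assert out.shape == (D,)
# The selected keys are exactly the k largest-scoring ones.
assert np.isclose(scores[sel].min(), np.sort(scores)[-topk])
```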

VLM

  • Replace decord with torchcodec for video decoding: #20055
  • Replace soundfile+torchaudio with torchcodec AudioDecoder in load_audio: #20190
  • Chunk-aware ViT encoding with per-image cache and lazy device transfer: #22038
  • Compute M-RoPE positions for preprocessed VL inputs (gRPC): #21244

Bug Fixes

  • Fix streaming session with paged KV cache (SWA/MLA): #20070
  • Fix VRAM leak in overlap scheduling with structured output: #20697
  • Fix chunked prefill and KV cache leaks for streaming sessions: #20476
  • Fix streaming logprobs corruption caused by shared mutable list reference: #21030
  • Fix TRT-LLM MHA CUDA illegal address with EAGLE v2 + DP attention: #21649
  • Fix Mistral Small 4 config/weight format mismatch: #21620
  • Fix mamba cache leak when the adder fails to add a matched req: #21404
  • Propagate grammar errors and improve llguidance backend: #20467

Features

  • Add reasoning tokens usage: #15562
  • Add --stream-response-default-include-usage server flag: #16711
  • Subprocess liveness monitor to detect scheduler crashes: #18582
  • Score API — implement EngineScoreMixin: #21342
  • Direct model loading from object storage with RunAI Model Streamer: #17948
  • MFU metrics in Prometheus: #19395

Network / IPv6

  • Add NetworkAddress abstraction for IPv6-safe address handling: #20306
  • Fix socket utilities and reserve_port for IPv6 dual-stack support: #20491
  • Add --strict-ports option for predictable port assignment: #21320
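
The bracketing rule that makes host:port strings IPv6-safe can be shown with a small stdlib helper (`format_host_port` is hypothetical, not SGLang's API): IPv6 literals contain colons, so they must be wrapped in brackets before appending a port.

```python
import ipaddress

def format_host_port(host: str, port: int) -> str:
    """Bracket IPv6 literals in host:port strings; otherwise their colons
    are ambiguous with the port separator."""
    try:
        if isinstance(ipaddress.ip_address(host), ipaddress.IPv6Address):
            return f"[{host}]:{port}"
    except ValueError:
        pass  # a hostname, not an IP literal
    return f"{host}:{port}"

assert format_host_port("127.0.0.1", 30000) == "127.0.0.1:30000"
assert format_host_port("::1", 30000) == "[::1]:30000"
assert format_host_port("localhost", 30000) == "localhost:30000"
```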

AMD Hardware

  • FP8 prefill integration with radix cache path for DeepSeek models: #20187
  • Add MHA FP8-KV support: #21253
  • Support AMD MXFP4 Qwen3.5-397B-A17B model: #21234
  • Fused rope KV store: #21315
  • Optimize Qwen3-VL decode — fuse QK-norm + 3D mRoPE + KV cache write: #21458
  • Enable FP8 KV cache and FP8 attention kernel for NSA on MI300/MI355 with TileLang: #21511
  • Improve openai/gpt-oss performance: #21020

NPU/Ascend

  • Optimize GLM-5 with fused kernels: #18617
  • Support GLM-4.7-Flash on NPU: #21408
  • Replace swiglu with custom kernel: #20192
  • Support Kimi-K2.5-w4a8 on Ascend: #20131
  • NPU support for diffusion models with enable_torch_compile: #20687

CPU Backend

  • Add kernel apply_rotary_pos_emb_cpu for Qwen3-VL and Qwen3-Omni: #13121
  • Implement MXFP4 GEMM kernels for Intel AMX to support GPT-OSS series: #14385
  • Enable DeepSeek R1 inference on XPU [Intel GPU]: #18461

MPS (Apple Silicon)

  • Native MLX execution backend for Apple Silicon Macs: #20342
  • Fix Triton stub sub-module imports on Python 3.12+: #21551

Dependencies

  • sgl-kernel 0.3.21 → sglang-kernel 0.4.1: #20440, #22009
  • FlashInfer 0.6.3 → 0.6.7.post2: #20480, #22097
  • Transformers 4.57.1 → 5.3.0: #17784
  • xgrammar 0.1.25 → 0.1.32: #21032
  • mooncake-transfer-engine 0.3.9 → 0.3.10.post1: #20942, #21844
  • Flash Attention 4 (official release): #20303
  • Diffusers 0.36.0 → 0.37.0: #20318

Security

  • Fix CVE-2026-3989: Replace unsafe pickle.loads with SafeUnpickler in replay_request_dump.py: #20904
  • Fix CVE-2026-3059 / CVE-2026-3060: Bind ZMQ sockets to localhost to prevent unauthenticated remote access (multimodal generation broker and encoder parallel disaggregation): #21435
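
The SafeUnpickler fix follows the standard pickle-hardening recipe: subclass `pickle.Unpickler` and restrict `find_class` to an allowlist, so payloads that smuggle callables via `__reduce__` are rejected (the allowlist below is illustrative, not SGLang's actual list).

```python
import io
import pickle

class SafeUnpickler(pickle.Unpickler):
    """Allowlist unpickler: pickle.loads will execute arbitrary callables
    referenced via __reduce__, so only permit known-safe globals."""
    ALLOWED = {("builtins", "dict"), ("builtins", "list"),
               ("builtins", "str"), ("builtins", "int")}

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

def safe_loads(data: bytes):
    return SafeUnpickler(io.BytesIO(data)).load()

# Plain container data round-trips fine (no globals involved)...
assert safe_loads(pickle.dumps({"req": [1, 2, 3]})) == {"req": [1, 2, 3]}

# ...while a payload referencing a dangerous global is rejected.
try:
    safe_loads(b"cos\nsystem\n(S'echo pwned'\ntR.")  # os.system via protocol 0
    blocked = False
except pickle.UnpicklingError:
    blocked = True
assert blocked
```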

New Contributors

Full Changelog: v0.5.9...v0.5.10
