sgl-project/sglang v0.5.9

Highlights

  • LoRA Weight Loading Overlap with Computation: Overlap LoRA weight loading with computation during inference, reducing TTFT by ~78% and TPOT by ~34.88% with large adapters: #15512

  • TRT-LLM NSA Kernel Integration for DeepSeek V3.2: Integrate TRT-LLM DSA kernels for Native Sparse Attention, boosting DeepSeek V3.2 performance by 3x-5x on Blackwell platforms when trtllm is used for both --nsa-prefill-backend and --nsa-decode-backend (with a minor accuracy drop): #16758, #17662, #18389

  • Flashinfer All-to-All MoE Dispatcher: Add the Flashinfer all-to-all MoE dispatcher for efficient expert parallelism communication, enabling optimized routing in MoE models: #14668

  • FA4 (FP4 Attention) Support for Multimodal Encoder: Introduce FP4 attention backend and variable-length attention function for multimodal encoders, enabling lower-precision inference for vision-language models: #13539

  • Anthropic Compatible API Endpoint: Add native Anthropic API compatibility to SGLang, allowing direct integration with tools and clients built for the Anthropic API format: #18630

  • SGLang-Diffusion Advanced Optimizations: Production-ready improvements including token-level sequence sharding, parallel VAE decoding, fused kernels, Nunchaku and FP8 support, and multiple new models in the ComfyUI plugin: blog

  • Spec V2 Critical Bug Fix: Fix an out-of-index bug caused by torch garbage collection in speculative decoding v2, improving the reliability of speculative verification: #18958

  • Deploying DeepSeek on GB300 NVL72: Optimization work for long-context inference using prefill-decode disaggregation and other SGLang features on NVIDIA's latest GB300 platform: blog

  • Bump AITER Version to 0.1.10.post3: Adds support for FP8 prefill, FP8 decode, and FP8 KV cache

  • Commit-to-Version Lookup in docs.sglang.io: Easily find the earliest official version that includes a given PR or commit, streamlining release tracking for users and developers: #18450
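As a sketch of what a client request to the new Anthropic-compatible endpoint might look like (the endpoint path, headers, and served model name below are assumptions for illustration, not confirmed by #18630), a body in Anthropic's Messages format can be built like this:

```python
import json


def build_messages_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a request body in the Anthropic Messages API format.

    The model name is whatever the SGLang server was launched with;
    "my-model" below is a placeholder.
    """
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


body = build_messages_request("my-model", "Hello!")
print(json.dumps(body, indent=2))
```

With a local server running, this body would be POSTed to the messages endpoint (conventionally /v1/messages in Anthropic's API); whether SGLang also expects Anthropic-style headers such as anthropic-version is worth checking against #18630.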

New Model Support

SGLang-Diffusion

  • Support multiple new models in ComfyUI Plugin
  • Parallel Folding and Parallel VAE Decoding for faster image/video generation
  • Nunchaku and FP8 support for diffusion models
  • Sequence Sharding (token-level) replacing Frame Sharding for improved efficiency
  • LTX-2 support: #17495, #17496
  • MOVA model support: #17704
  • Cache-DiT optimizations and fused kernel improvements
  • Numerous bug fixes and refactors across the diffusion pipeline

Performance

  • Integrate TRT-LLM NSA kernels with up to 3-5x speedup on Blackwell: #16758, #17662, #18389
  • LoRA weight loading overlap reducing TTFT by ~78%: #15512
  • Flashinfer all-to-all MoE dispatcher: #14668
  • FA4 for multimodal encoder: #13539
  • Optimize GDN decode for Qwen3 Next: #17094
  • Tune fused MoE kernels for Llama-4-Scout, MiniMax M2: #17891, #18851, #18833
  • Symmetric memory pre-allocation to avoid fragmentation: #17089
  • Optimize fused_moe Triton kernel TMA: #18782
  • Fused Triton kernel for Ernie4.5-VL rotary embedding: #18856
  • Support MxINT4 Flashinfer TRT-LLM MoE GEMM: #16892
  • AITER bias MoE support for GPT-OSS MxFP4: #17735
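The TRT-LLM NSA backends listed above are selected at launch time via the two flags named in the highlight. A hedged sketch (the model path is a placeholder, and any additional flags your deployment needs are not shown):

```shell
# Launch SGLang with TRT-LLM NSA kernels for both prefill and decode
# (Blackwell GPUs; minor accuracy drop noted in the release highlight).
# The model path is a placeholder for the DeepSeek V3.2 checkpoint you serve.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2 \
  --nsa-prefill-backend trtllm \
  --nsa-decode-backend trtllm
```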

Prefill-Decode Disaggregation

  • Support KV transfer with MORI-IO: #14626
  • Mooncake intra-node NVLink KV transfer: #17866
  • Improve KV offset calculation for MHA models with different TP sizes: #18163
  • Document SGLANG_MOONCAKE_CUSTOM_MEM_POOL: #18259
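The environment variable documented in #18259 is read at process start; a minimal sketch, assuming a boolean-style toggle (check the linked documentation for the accepted values):

```shell
# Assumption: a true/false toggle enabling Mooncake's custom memory pool
# for KV transfer; set it before launching the server process.
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=true
```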

Diffusion LLM (dLLM)

  • Remove cuda graph batch size limitation: #17458
  • JointThreshold algorithm for joint M2T and T2T decoding: #18171
  • Basic dLLM scheduling strategy and implementation: #17484

Speculative Decoding

  • Fix out-of-index bug caused by torch garbage collection in Spec V2: #18958
  • Move forward timeout before verify to fix Eagle v1 filter mismatch: #18760

Dependencies

  • Flashinfer updated to 0.6.3: #17700
  • AITER updated to 0.1.10.post3: #18741
  • Mooncake transfer engine updated to 0.3.9: #18316

AMD Hardware

  • AITER updated to v0.1.10.post3 with FP8 prefill, FP8 decode, and FP8 KV cache support
  • ROCm 7 standardization and ROCm 6.3 deprecation: #17785
  • Kimi K2.5 Day 0 ROCm support: #17863
  • FP8 prefill attention kernel integration: #18528
  • Two-batch overlapping for MORI EP: #17953
  • DeepSeek V3.2 and Kimi-K2 nightly CI tests: #17523

NPU/Ascend

  • Support for MiniCPM3-4B: #16866
  • Qwen 3.5 support on Ascend: #18544
  • Accuracy improvements for StableLM-2: #17470
  • Bug fixes for DeepSeek V3.2 and DeepSeek-VL2: #17007

CPU Backend

  • Optimize Qwen3-Next model on CPU: #12525
  • Optimize flash_attn_varlen_func: #15708
  • Add INT4 kernels for CPU: #8226

Kernel Slimming

  • Migrate GPTQ-Marlin repack kernel to JIT: #18543
  • Migrate AWQ Marlin repack kernel to JIT: #18949

Documentation

  • Add RL documentation: #17663
  • Update torch compile description: #17819
  • Refine spec decode docs for SpecV2/STANDALONE/NGRAM: #18321
  • Consolidate diffusion documentation: #18095

Full Changelog: v0.5.8...v0.5.9
