## What's Changed
- [PD] Use batch transfer for rdma transport and add notes for mnnvl usage by @ShangmingCai in #8595
- [bugfix] Qwen-1M context support [2/3]: use the current CUDA stream in the DCA kernel by @sighingnow in #8611
- Fix hf3fs_fuse import error by @ispobock in #8623
- Update step3v default config by @ispobock in #8626
- [ci] fix genai-bench execution cmd by @slin1237 in #8629
- [router] update router pypi version by @slin1237 in #8628
- [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x by @b8zhong in #8577
- Fix typos in py_test/test_launch_server.py by @windsonsea in #6227
- misc: Remove debug print to logger.info by @CatherineSue in #8633
- SGLang HiCache NIXL Connector by @vvenkates27 in #8488
- [bug] remove pdlb from minilb since it's no longer available by @slin1237 in #8634
- [bugfix] Fix flashinfer cutlass EP moe after MoE refactor by @trevor-m in #8630
- Conditionally import HiCacheHF3FS by @pansicheng in #8598
- TRTLLM Gen MLA Decode Kernel Integration (same as #7938) by @farazkh80 in #8632
- Fix nan value generated after custom all reduce by @kkHuang-amd in #8532
- Revert "Fix nan value generated after custom all reduce (#8532)" by @zhyncs in #8642
- Feature/modelscope model download by @yrk111222 in #8083
- chore: speed up NPU CI with caching by @pkking in #8270
- [Bugfix] fix w8a8_int8 load issue by @iforgetmyname in #8308
- [bugfix] fix router python parser for pd urls by @slin1237 in #8644
- [router] add basic usage doc by @slin1237 in #8640
- [router] upgrade router version to 0.1.8 by @slin1237 in #8645
- [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE by @kaixih in #8450
- HiCache, fixing hash value indexing by @xiezhq-hermann in #8636
- Interface change for kvcache io to support page first layout by @xiezhq-hermann in #8318
- Update batch size limitation of dsv3_router_gemm kernel to 16 by @Fridge003 in #8051
- chore: bump v0.4.10.post1 by @ispobock in #8652
- Add hf3fs_utils.cpp to package-data by @pansicheng in #8653
- Fix chat template handling for OpenAI serving by @JustinTong0323 in #8635
- Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… by @byjiang1996 in #8511
- [5/N] MoE Refactor: Update MoE parallelism arguments by @ch-wan in #8658
- Increase tolerance to address CI failures by @lifuhuang in #8643
- [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 by @panpan0000 in #8013
- [Doc] fix: Update README for cu126 sgl-kernel compile problem by @Hongbosherlock in #8665
- fix per token cuda kernel for hidden dim not divisible by 16 by @hebiao064 in #8543
- fix arg typo for --disaggregation-transfer-backend by @ZacWang in #8664
- [fix] fix pd disagg error of vlms by @ccw1996 in #8094
- Disable tp for shared experts under expert parallelism for GLM4.5 model by @zminglei in #8647
- [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla by @trevor-m in #8685
- [bug] limit bootstrap room to [0, 2^63 - 1] by @slin1237 in #8684
- Update CODEOWNERS by @merrymercy in #8686
- Fix deepgemm masked grouped gemm jit compile by @ispobock in #8679
- Fix FP8 block quantization when N or K is not multiples of 128 by @yanbing-j in #8648
- bugfix(hicache): Fix 'MooncakeStore' not defined error. by @hzh0425 in #8668
- upgrade xgrammar 0.1.22 by @Swipe4057 in #8522
- [bugfix] Add 'disaggregation_mode' parameter to warmup function when compile deep_gemm manually by @lbh2001 in #8618
- Add support for NCCL symmetric memory for TP allreduces by @nvcastet in #8238
- [1/2] sgl-kernel: Fuse routed scaling factor into select_experts by @trevor-m in #8364
- chore(gb200): update dockerfile to handle fp4 disaggregation by @ishandhanani in #8694
- [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 by @trevor-m in #8688
- Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled by @GaoYusong in #7434
- model: adapt mllama4 to VisionAttention by @wenchen76 in #8512
- Add tensor.detach() back to update weight util by @hebiao064 in #8691
- [Doc] Polish sgl-kernel readme for cu126 build error by @FlamingoPg in #8704
- Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" by @hnyls2002 in #8706
- [router] minor code cleanup and refactoring by @slin1237 in #8711
- [Bug] fix green context's incompatibility with `cuda < 12.4` by @hnyls2002 in #8701
- chore: bump sgl-kernel v0.2.9 by @zhyncs in #8713
- Remove assertions about per group quant fp8 by @fzyzcjy in #8717
- [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 by @merrymercy in #8693
- Fix triton moe error caused by TopK refactor by @fzyzcjy in #8705
- [router] Implement HTTP Dependency Injection Pattern for Router System by @slin1237 in #8714
- [Feature] Radix Tree in C++ by @DarkSharpness in #7369
- [Perf] Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm by @HydraQYH in #8722
- Fix fused MoE when `routed_scaling_factor is None` by @hnyls2002 in #8709
- Tiny fix CI pytest error by @fzyzcjy in #8524
- [hotfix] fix mixtral with tensor-level compressed-tensor quantization by @ch-wan in #8721
- Support limiting max loaded loras in CPU. by @lifuhuang in #8650
- Reduce memory accumulation in long-running server by @Edenzzzz in #8306
- HiCache storage, style change and bug fix by @xiezhq-hermann in #8719
- [feat] support minimum token load balance in dp attention by @WANG-GH in #7379
- Do layernorm before allgather for DP attention by @trevor-m in #8631
- [fix] Fix divide by zero error for llama4. by @shenoyvvarun in #8683
- feat: Add new moe triton for NVIDIA RTX 6000 Ada by @17Reset in #8547
- [Improvements] Merge health check route by @whybeyoung in #8444
- chore: bump sgl-kernel 0.3.0 with torch 2.8.0 by @zhyncs in #8718
- Save cuda graph memory for fa3 by @ch-wan in #8567
- [CUDA Graph] save cuda graph memory by using next_token_logits_buffer by @ch-wan in #8579
- [DP] fix the compatibility issue between DP attention and `--attention-backend triton` by @ch-wan in #8723
- chore: bump v0.4.10.post2 by @zhyncs in #8727
- feat: Support DP Attention for step3_vl by @yhyang201 in #8699
- [RL] fix update weight for FusedMoE with EP by @zhuzilin in #8676
- use fp32 for e_score_correction_bias in GLM-4.5 by @zRzRzRzRzRzRzR in #8729
- Fix triton kernels topk with keyword arguments by @ispobock in #8732
- feat: support cutlass_moe_fp8 kernel for fusedmoe in sm90 by @TianQiLin666666 in #8678
- Fix the missing 'lof' choice of --schedule-policy server args by @acelyc111 in #7114
- fix args typo in memory_pool_host by @huangtingwei9988 in #8662
- [CI] Do not trigger pd-disaggregation CI in draft PR by @hnyls2002 in #8737
- [MoE] Enable `renormalize=False` in Triton kernels by @ch-wan in #8735
- Replace torch.jit.script with torch.compile in get_masked_input_and_mask to fix benchmark underreporting by @YyWangCS in #8733
- Fix bug of refactoring TopKOutput in w4afp8 by @yuan-luo in #8745
- Rename lora_path to lora_id in batches by @Fridge003 in #8437
- [sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count by @BBuf in #8738
- [CI] Ascend NPU CI enhancement by @iforgetmyname in #8294
- [bugfix] fix import path in HiCacheController by @lbh2001 in #8749
- [NVIDIA] Add Low Latency NVFP4 decode kernels from Flashinfer by @azhurkevich in #8552
- [router] introduce dp worker abstraction by @slin1237 in #8639
- [bugfix] Fix typo in modelopt quant: 'FusedMoE' object has no attribute 'local_num_experts' by @trevor-m in #8768
- Integrate triton_kernels in sgl-kernel by @Qiaolin-Yu in #8762
- chore: bump sgl-kernel v0.3.1 by @zhyncs in #8771
- [NVIDIA] Fix breakage of using trtllm-gen fp8 moe by @kaixih in #8773
- [Fix] Fix several issues preventing gemma3n LoRA support. by @lifuhuang in #8776
- Support OCP MXFP4 quantization on AMD GPUs by @kkHuang-amd in #8255
- [CPU][sgl-kernel] biased_grouped_topk: fix correction_bias dtype to float32 by @chunyuan-w in #8212
- [PD] Refactor parallel sizes and add pp support for mooncake by @ShangmingCai in #8571
- [pd-router] Add Configurable Retry Logic for reduce backend pressure by @slin1237 in #8744
- chore: upgrade flashinfer v0.2.9 by @zhyncs in #8780
- [NVIDIA] Fix local_num_experts for EP by @wenscarl in #8779
- [feat] Add detail in image_data by @yuhyao in #8596
- Revert "[NVIDIA]Fix local_num_experts for EP (#8779)" by @zhyncs in #8797
- feat: support sgl-kernel cu129 by @zhyncs in #8800
- chore: bump sgl-kernel v0.3.2 by @zhyncs in #8802
- feat: add trtllm-gen mha from direct call by @yyihuang in #8782
- GLM-4.5 and GLM-4.5-Air both support by @zRzRzRzRzRzRzR in #8804
- fix: update cmake by @zhyncs in #8817
- chore: upgrade transformers 4.55.0 by @zhyncs in #8823
- chore: upgrade flashinfer 0.2.10 by @zhyncs in #8827
- Fix potential memory fault issue and ncclSystemError in CI test by @kkHuang-amd in #8681
- feat: use py312 by @zhyncs in #8832
- fix: remove unused import by @zhyncs in #8809
- Add initial support for gpt-oss by @Ying1123 in #8824
- chore: upgrade torch 2.8.0 by @zhyncs in #8836
- [router] complete router oai spec by @slin1237 in #8828
- Turn off hybrid cache by default by @ispobock in #8839
- Support bailing moe by @ppraneth in #8680
- [Feature] improve TBO: two chunk overlap by @House-West in #8144
- [router] PD Router Simplification and Reorganization by @slin1237 in #8838
- [1/3] Optimize Slime Update Weights: Remove QWen3MOE Load Weight Overhead by @hebiao064 in #8751
- [2/3] Optimize Slime Update Weights: Avoid GPU-to-CPU Device Sync when update expert weights by @hebiao064 in #8753
- Support mxfp4 for GPT-OSS by @Ying1123 in #8843
- Add unit test for triton swa kernel by @ispobock in #8853
- fix: resolve ci issue by @zhyncs in #8859
- fix benchmark fp8 blockwise group gemm by @yuan-luo in #8815
- Refine naming by @ispobock in #8868
- Optimize triton swa kernel by skipping computation by @ispobock in #8860
- Support B200 in CI by @fzyzcjy in #8861
- chore: update Dockerfile by @mickqian in #8872
- [NVIDIA] Fix num_experts in modelopt_quant by @wenscarl in #8811
- [CI] fix pip upgrade by @ch-wan in #8881
- chore: use torch 2.8 stable by @zhyncs in #8880
- Support v1/responses and use harmony in serving_chat by @CatherineSue in #8837
- Use reduce scatter for DP by @trevor-m in #8539
- add flashinfer mxfp4 by @BBuf in #8847
- fix glm4 moe by @ch-wan in #8883
- feat: openai oss attention sink support with trtllm-gen backend #8825 by @yyihuang in #8834
- Support GPU pinning for LoRA by @lifuhuang in #8697
- Enables force reasoning based on chat template for Qwen3-Thinking by @JustinTong0323 in #8369
- [AMD] Pull latest SGLang version for AMD CI by @michael-amd in #8787
- [Feature][Multimodal] Implement LRU cache for multimodal embeddings by @ZhengWG in #8292
- [router] fix req handling order, improve serialization, remove retry by @slin1237 in #8888
- [Feat] Qwen-1M context support [2/2]: Update block sparse attention backend by @FlamingoPg in #5949
- [CPU] Fix fallback allgather issue by @blzheng in #8041
- Disable gemma3 for SWA due to CUDA illegal memory access error by @JustinTong0323 in #8895
- [Perf] Auto enable best flashinfer mxfp4 kernel in b200 by @BBuf in #8898
- Fix sgl-kernel arch and missing package in CI by @fzyzcjy in #8869
- refactor(sgl-router): Replace `once_cell` with `LazyLock` in worker.rs and remove once_cell dependency from Cargo.toml by @htiennv in #8698
- [router] re-enable pd router benchmark CI by @slin1237 in #8912
- [router] update pd router ci summary step with new threshold by @slin1237 in #8916
- [router] upgrade router version to 0.1.9 by @slin1237 in #8844
- Fix hopper launch gpt-oss model illegal memory by @BBuf in #8908
- fix: use openai 1.99.1 by @zhyncs in #8927
- codeowner updates for modelopt related files by @Edwardf0t1 in #8925
- chore: support blackwell cu129 image by @zhyncs in #8928
- docs: update README by @zhyncs in #8929
- remove vllm fp8quant from fp8.py by @hebiao064 in #8937
- fix: reasoning parser when request have enable_thinking flag by @JustinTong0323 in #8933
- correct the tp_plan logic by @hebiao064 in #8850
- [router] dedicated prefill HTTP client and request-path optimizations by @slin1237 in #8923
- Enhancements for bench_one_batch by @ZailiWang in #8703
- refactor: Move scalar_types.py to sgl-kernel to avoid circular import by @Hongbosherlock in #8720
- Fix enable flashinfer mxfp4 moe bf16 check by @BBuf in #8950
- Reduce scheduler recv requests overhead by @fzyzcjy in #8947
- Better optimization log for gpt-oss model by @BBuf in #8953
- minor: global workspace buffer for trtllm-gen mha from flashinfer by @yyihuang in #8952
- bench: add attention sink op benchmark, triton and trtllm-gen [B200] by @yyihuang in #8932
- Fix typos and unify size(s)/stride(s) API calls by @triple-Mu in #8799
- Expert Parallelism for GPT-OSS by @ch-wan in #8944
- Add ernie4.py for ERNIE-4.5 by @solrex in #7657
- [NVIDIA] Fix missing `get_col_major_tma_aligned_tensor` for Blackwell deepgemm in EpMoE by @kaixih in #8955
- chore: bump sgl-kernel v0.3.3 by @zhyncs in #8957
- add zai-org/GLM-4.5-Air-FP8 model into nightly CI by @zminglei in #8894
- Support Multi Process Tokenizer Manager by @whybeyoung in #6555
- Simple prefetch policy by @pansicheng in #8692
- chore: update flashinfer by @zhyncs in #8958
- Revert "Support Multi Process Tokenizer Manager" by @merrymercy in #8960
- [RL] fix skip_server_warmup and rl health_generate logic by @zhuzilin in #8757
- chore: bump v0.5.0rc0 by @zhyncs in #8959
- [router] router circuit breaker core by @slin1237 in #8941
- refine aiter_backend for mtp by @valarLip in #7279
- [router] harden retries + metrics; fix streaming load; header filtering by @slin1237 in #8972
- Fix kimi k2 function call format by @merrymercy in #8968
- [router] add metrics for worker and policy by @tonyluj in #8971
- chore(gb200): update to CUDA 12.9 and improve build process by @ishandhanani in #8772
- chore(ci): update Python version from 3.9 to 3.10 in sgl-kernel workflow by @ishandhanani in #8981
- [router] reduce radix tree contention, fix radix tree double-count race by @slin1237 in #8978
- [router] fix radix tree integration issues in PD router by @slin1237 in #8982
- Update qwen3_coder_detector.py for streaming by @maocheng23 in #8371
- [bug fix] Ensure local token and global token buffers are pointing to different storage by @elfiegg in #8785
- Create cancel-all-pr-test-runs by @merrymercy in #8986
- [Fix] Add a workflow to cancel all pending CI runs by @merrymercy in #8988
- Minor Optimizations in Schedule Batch by @merrymercy in #8724
- [1/2][resubmit] sgl-kernel: Fuse routed scaling factor into moe_fused_gate (select_experts) by @trevor-m in #8770
- Add unit test for flashinfer fp4 moe by @trevor-m in #8330
- [AMD] Update SGLang image fallback logic for AMD CI by @michael-amd in #8980
- Clean up server_args.py to have a dedicated function for model specific adjustments by @merrymercy in #8983
- Molly/ci gnr server by @DiweiSun in #8667
- [Fix] Fix wrong backend chosen in hybrid backend by @DarkSharpness in #8989
- Revert "[bug fix] Ensure local token and global token buffers are pointing to different storage " by @ch-wan in #8993
- [hotfix] use the original implementation in 8785 by @ch-wan in #8994
- Fix incorrect default get_hidden_dim logic by @lifuhuang in #8987
- optimize: reduce shulffle and quantization overhead in cutlass_moe sm90 by @TianQiLin666666 in #8962
- chore(deps): update minimum python to 3.10 by @ishandhanani in #8984
- Add CI for gpt-oss model on hopper by @fzyzcjy in #8851
- Fix redundant kernel in sink dtype conversion by @fzyzcjy in #8966
- Fix qwen2 audio not working bug by @byjiang1996 in #8600
- feat: update flashinfer ar oneshot params by @yyihuang in #8687
- Support glm4.1v and glm4.5v by @byjiang1996 in #8798
- feature(hicache): Support hf3fs-hicache reusing kvcache across different instances by @hzh0425 in #8673
- Tiny Llama4 type error in constructor by @b8zhong in #6752
- HiCache Storage tp fix by @xiezhq-hermann in #8878
- chore: upgrade sgl-kernel 0.3.3 by @zhyncs in #8998
- [DP] fix: engine crash when decode batch is padded by @ch-wan in #8995
- [bugfix] Fix missing args in bench one batch by @trevor-m in #8877
- [Feature] Optimize DeepSeek's DeepEP on Ascend NPU by @iforgetmyname in #8355
- Enable TBO on ROCm by @lcskrishna in #8329
- fix nvshmem cu126 by @zhyncs in #9001
- [perf] add kimi-k2 b200 fused moe config by @Alcanderian in #9010
- fix: fix obsolete qwen-audio processor arg by @mickqian in #9003
- Fix CI by @merrymercy in #9012
- fix flashinfer allreduce fusion import bug by @BBuf in #9007
- Fix CI by @merrymercy in #9013
- [hicache] Optimization for DMA copy by @cctry in #8245
- fix page first per layer pf2lf kernel by @huangtingwei9988 in #8915
- [Fix] Fix hicache backend by @DarkSharpness in #8991
- [Fix] Fix flashinfer cpu <-> gpu synchronization by @DarkSharpness in #8340
- [router] upgrade to latest sgl kernel for router ci by @slin1237 in #9019
- [router] upgrade rand to latest version by @slin1237 in #9017
- [router] upgrade kube version to latest by @slin1237 in #9018
- Optimize: Cache CUDA device to reduce redundant calls during tensor l… by @GeLee-Q in #8996
- Improve LoRA Perf by Deprecating FlashInfer and Eliminating Redundant Tensor Ops by @lifuhuang in #8940
- [router] update pyo3 version to 0.25.1 by @slin1237 in #9022
- [RL] Add test for /abort_request by @hebiao064 in #7626
- Simplify frontend language by @merrymercy in #9029
- Reorganize CI and test files by @merrymercy in #9027
- Reduce CI duration of test_lora_update. by @lifuhuang in #9024
- [Optimization] Update estimated_num_new_pages logic in TokenToKVPoolAllocator by @YiXR in #8794
- Support Flatten Tensor Update Weights to speed up MOE Update Weights by 20% by @hebiao064 in #8079
- Simplify memory pool by @merrymercy in #9033
- Revert "[1/2][resubmit] sgl-kernel: Fuse routed scaling factor into m… by @zhyncs in #9035
- Simplify health check by @merrymercy in #9034
- chore: upgrade flashinfer 0.2.11 by @zhyncs in #9036
- Update release-docs.yml by @merrymercy in #9037
- Refactor the docs by @merrymercy in #9031
- Improve docs and developer guide by @merrymercy in #9044
- Update REVIEWERS.md by @merrymercy in #9046
- [router] regular router circuit breaker by @slin1237 in #8997
- REVIEWERS.md typo fix by @xiezhq-hermann in #9048
- Revert "feat: update flashinfer ar oneshot params (#8687)" by @zhyncs in #9054
- [CI] Fix CI tests by @ch-wan in #9050
- Revert "chore: upgrade flashinfer 0.2.11 (#9036)" by @zhyncs in #9057
- bugfix: Fix output_ids extraction in detokenizer_manager by @CatherineSue in #9047
- [pd-router] add retry and circuit breaker for pd router by @slin1237 in #9051
- Support radix cache for Lora feature by @Fridge003 in #7216
- update deepep commit to support qwen3-coder by @yizhang2077 in #9066
- chore(gb200): remove ToT flashinfer installation by @ishandhanani in #9079
- Update REVIEWERS by @HaiShaw in #9063
- Fix chunked prefill size validation for disabled state by @chi2liu in #8973
- Fix broken Kimi models HuggingFace link by @Hangzhi in #9080
- [PD] decode: add CLIP_MAX_NEW_TOKEN for pop_preallocated by @jinmingyi1998 in #8866
- Fix docs for clip max new tokens by @hnyls2002 in #9082
- refactor(pd-router): extract common patterns to reduce code duplication by @slin1237 in #9081
- fix: w4afp8 accuracy problem and rebase by @yangsijia-serena in #8752
- Update hyperparameter_tuning.md by @merrymercy in #9083
- fuse allreduce and residual_rmsnorm by @BBuf in #8731
- TRTLLM-MLA FP8 path by @farazkh80 in #8638
- HiCache Storage: generate hash when inserting new nodes by @xiezhq-hermann in #9053
- [fix] Set Radix tree root node hash to None - Nvidia Dynamo Integration by @faradawn in #9030
- HiCache, add bench long context plus minor fixes by @xiezhq-hermann in #9086
- (gpt-oss, oai, chat): Remove Harmony Integration and Implement Native GPT-OSS Tool Call Support by @CatherineSue in #9043
- [router] Add Rust Binary Entrypoint for SGLang Router by @slin1237 in #9089
- [CI]Test BM.A10.4 runner by @key4ng in #8992
- Fix race condition in async lora unload by @lifuhuang in #9084
- Fix broken CI TestRequestLengthValidation by @lifuhuang in #9095
- Optimization for AscendPagedTokenToKVPoolAllocator by @Makcum888e in #8293
- feat: add fused moe config for Qwen3-30B-A3B on B200 by @zixuanzhang226 in #9087
- Fix mismatch between padded_scales shape and reshape dimensions in modelopt quantization by @ovowei in #8766
- [Fix] Fix dual chunk model default behavior by @DarkSharpness in #9032
- bugfix: Fix the commentary msg extraction in GptOssDetector by @CatherineSue in #9097
- docs: fix broken links in README.md by @xxrjun in #9075
- Fuse two kernels of hidden states padding into quantization kernel by @fzyzcjy in #9005
- update support new models doc by @yichaolemon in #9096
- Fuse writing KV buffer into rope kernel (part 1: sgl-kernel) by @fzyzcjy in #9077
- chore: bump sgl-kernel v0.3.4 by @zhyncs in #9103
- Runtime check CUDA driver version to avoid unresolved green context symbols by @hnyls2002 in #9021
- [Bugfix] Fix accuracy-test-1-gpu failure caused by `builtin_tools` by @CatherineSue in #9114
- Fix typo in REVIEWERS by @ShangmingCai in #9113
- [5/n] DP Enhancement: Correct `num_token_non_padded` by @ch-wan in #9107
- router: Fix user guide link README.md by @CatherineSue in #9122
- fix(docker): update sgl_kernel version to 0.3.4 in Dockerfile.gb200 by @ishandhanani in #9118
- Fuse writing KV buffer into rope kernel (part 2: srt) by @JeremieMelo in #9014
- [router] update router documentation by @slin1237 in #9121
- fix: update Dockerfile by @zhyncs in #9125
- [feat] add ascend readme and docker release by @pkking in #8700
- [feat] Enable Ascend profiling on SGLang by @ping1jing2 in #8610
- [Quantization] Supported w8a8 int8 quantized Gemma3 and Qwen-VL models by @ichernob in #8619
- Fix typos in supported models documentation by @Hangzhi in #9119
- [AMD] Support Wave attention backend with AMD GPU optimizations by @yichiche in #8660
- fix: update Dockerfile by @zhyncs in #9129
- chore: use cp310 by @zhyncs in #9130
- Support page first layout zero copy for mooncake store by @huangtingwei9988 in #8651
- [Feature] Support custom set kv buffer kernel by @DarkSharpness in #8884
- fix: wrong docker hub org name by @pkking in #9137
- Use FlashInfer's TRTLLM FP8 Blockscale GEMM by @elfiegg in #8588
- [1/2][resubmit again] sgl-kernel: Fuse routed scaling factor into moe_fused_gate by @trevor-m in #9088
- Support Triton FP8 Gemm handling hidden_dim not divisible by 16 by @hebiao064 in #9093
- Fix gpt-oss ~2x memory consumption issue by @fzyzcjy in #9146
- Update docker file for MI35x base image update to support gpt-oss mxfp4 model by @kkHuang-amd in #9111
- Double vision prefill throughput by defaulting to optimal vision attention backend by @AlienKevin in #8484
- Update fa3 interface and add unit test by @ispobock in #9150
- feat: update fa3 by @zhyncs in #9126
- [router] optimize Rust compilation and development workflow by @slin1237 in #9133
- [PD] optimize kv cache transfer directly using batch transfer by @ssssnow in #9149
- [PD] feat: mooncake use batch reg/dereg by @stmatengss in #8910
- Support FA3 backend for gpt-oss by @ispobock in #9028
- chore: bump v0.5.0rc1 by @zhyncs in #9069
- [Model] Support Qwen3ForSequenceClassification for Qwen3-Embed Model by @nysa-liu in #7957
- Swap xeon ci to gnr server by @DiweiSun in #9042
- Clean up allocators by @merrymercy in #9134
- [Generative Score API] Optimization to Remove Decode. by @sundar24295s in #8840
- Fix broken trtllm_mha attn backend with gpt-oss by @nvcastet in #9161
- Replace `sglang.srt.layers.quantization.scalar_types` with `sgl_kernel.scalar_type` by @Hongbosherlock in #8951
- [AMD] Update fallback images for AMD CI by @michael-amd in #9159
- [Bugfix] Avoid unnecessary reduce-scatter call in prepare_mlp by @changhuaixin in #9169
- Fix docker container DeepEP error on Blackwell by @fzyzcjy in #9171
- [DP Attention] Refactor: adding some utility functions by @ch-wan in #9136
- Faster weight processing (trtllm-gen moe nvfp4) by @aleozlx in #9162
- Feature: support qwen and llama4 reducescatter for dp attention padding by @Misaka9468 in #9101
- fix io group by @pansicheng in #9154
- [Perf] Tunings for SM100 FP8 CUTLASS kernel by @hhzguo in #8818
- Add A800 fused MoE kernel tuning configs for GLM4.5 and GLM4.5-Air by @lambert0312 in #8808
- Add H200 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @forestlee95 in #8852
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #7004
- Add H20 fused MoE kernel configs for Dpsk & Qwen3 by @M0gician in #7631
- Add H200 fused MoE kernel configs for DeepSeek-V3 in triton 3.3.1 by @junliu-mde in #7687
- add w8a8-fp8-block-wise H20-3e triton config by @sleepcoo in #8018
- fix: zero_init buffer by @yyihuang in #9065
- [2/n] decouple quantization implementation from vLLM dependency by @AniZpZ in #8112
- chore: bump sgl-kernel v0.3.5 by @zhyncs in #9185
- [sgl-kernel] 1/N Refactor sglang cutlass 3x - gemm fp8 blockwise sm90 by @yuan-luo in #8913
- [sgl-kernel] Support FlashInfer top_k_top_p_sampling_from_logits by @yuan-luo in #9060
- refine mxfp4 shuffling log by @BBuf in #9194
- [4/n] decouple quantization implementation from vLLM dependency by @Hongbosherlock in #9191
- feat: Add model version tracking with API endpoints and response metadata by @yitianlian in #8795
- [VLM] Improving multimodal tensor hash kernel by @adarshxs in #9008
- chore: upgrade transformers 4.55.2 by @zhyncs in #9197
- feat: update model config by @zhyncs in #9202
- chore: bump v0.5.0rc2 by @zhyncs in #9203
- feat: add fused moe config for Qwen3-235B-A22B-FP8 on B200 by @zixuanzhang226 in #9204
- [typo fix] Fix a typo in communicator.py by @LPhgh in #9183
- Minor fix docker container DeepEP on multi platforms by @fzyzcjy in #9205
- fix: fix unsupported palette mode of images in bench_serving for mmmu by @mickqian in #9206
- [6/N] MoE Refactor: Cleanup MoE-related configs by @ch-wan in #8849
- use fast math for per_token_group_quant_8bit. by @strgrb in #9177
- feat: remove sm75 by @zhyncs in #9207
- feat(hicache-3fs): 3FS-SGLang Hierarchical Cache Deployment Guide by @hzh0425 in #9213
- fix: the store_dtype typo for ascend mla by @shilinlee in #9208
- Fix the deprecation warning for enable_flashinfer_mxfp4_moe by @ch-wan in #9214
- Tiny update tmux history limit on dev container by @fzyzcjy in #9218
- [Eagle Warning fix] replace the deprecated 'and' with & by @XucSh in #9215
- [Misc] feat: Deepgemm update for sgl-kernel by @FlamingoPg in #8790
- Fp4 MOE quant kernel optimization by @jy-song-hub in #8777
- [CI] Fix sgl-router disaggregation test by @ShangmingCai in #9222
- Cleanup MoE Refactor by @ch-wan in #9223
- chore: bump sgl-kernel v0.3.6 by @zhyncs in #9220
- Optional extension for green context by @hnyls2002 in #9231
- [router] allow more health check configuration by @slin1237 in #9198
- [router] clean up lint warnings with clippy execution by @jeffdn in #9201
- [router] preserve original worker response header in router by @slin1237 in #9236
- chore(docker): update sgl_kernel version to 0.3.6 in Dockerfile.gb200 by @ishandhanani in #9243
- [AMD] Expand test coverage for AMD CI and enable apply_token_bitmask_inplace_cuda in sgl-kernel by @hubertlu-tw in #8268
- Fix nan value generated after custom all reduce by @kkHuang-amd in #8663
- Revert "chore(docker): update sgl_kernel version to 0.3.6 in Dockerfi… by @zhyncs in #9246
- Revert "chore: bump sgl-kernel v0.3.6 (#9220)" by @zhyncs in #9247
- Add fp4 quantize before all-gather for Flashinfer cutlass MoE DP (max throughput) by @trevor-m in #7667
- Fix DP load for embedding by @b8zhong in #9165
- [CI] add deepseek w4a8 test on h20 ci by @HanHan009527 in #7758
- Fix Custom All Reduce CI job. by @saienduri in #9258
- [feature] Ascend NPU graph support by @VDV1985 in #8027
- fix unexpected answer in EAGLE mode by @zyksir in #9252
- [PD] Support PD disaggregation with Prefill PP by @ShangmingCai in #8846
- Combine fp4.py and mxfp4.py into one file and support dynamic mxfp4 quantization in mxfp4.py by @kkHuang-amd in #9049
- Bug fix: use correct mm_items in embed_mm_inputs by @byjiang1996 in #8893
- ci: simplify multi-modality tests by using mixins by @mickqian in #9006
- [Bugfix] Change vLLM install order & Add A2 support by @iforgetmyname in #9232
- [router] fix pd prefill http request compliance issue by @slin1237 in #9237
- Quick Fix GLM by @hebiao064 in #9264
- model: support nvidia/Llama-3_3-Nemotron-Super-49B-v1 by @netanel-haber in #9067
- `from python.sglang.srt` -> `from sglang.srt` by @netanel-haber in #9268
- Revert "[Misc] feat: Deepgemm update for sgl-kernel (#8790)" to fix kernel CI by @hnyls2002 in #9260
- [router] add cargo clippy in CI and fix-up linting errors by @jeffdn in #9242
- [chore] Clean up redundant lora_weight_names concept to simplify code by @lifuhuang in #9131
- Fix swa eagle verify accuracy for Triton backend by @ispobock in #9279
- Fix memory pool leak error by @fzyzcjy in #9271
- [fix]: fix cutlass moe ut and optimize H20 cutlass groupGemm performance by @kousakawang in #9272
- Tiny make fp4 moe method parameters more static by @fzyzcjy in #8520
- [router] introduce prefill response draining for http compliance by @slin1237 in #9281
- [CPU] Fix TP padding issue on Phi-4 by @blzheng in #8289
- chore: bump sgl-kernel v0.3.6.post1 by @zhyncs in #9286
- [router] introducing tokenizer trait by @slin1237 in #9287
- Set the default attention backend for GLM-4.5v to fa3 by @zifeitong in #9245
- [Fix] Add undefined `update_tensor_inplace` function by @b8zhong in #6307
- [router] tokenizer factory, hf tokenizer, and stop sequence detector by @slin1237 in #9293
- Fix triton_fused_moe unit test and benchmark by @yuan-luo in #9276
- Further fix memory pool leak error by @fzyzcjy in #9298
- [router] add tokenizer metrics by @slin1237 in #9307
- [router] add reasoning parser base structure by @slin1237 in #9310
- Minor style fixes for sgl-kernel by @merrymercy in #9289
- [fix] fix enable_pdl for blackwell by @Alcanderian in #9011
- Modelopt quant config adaptation by @Edwardf0t1 in #8829
- should return invalid request for empty prompt by @gongwei-130 in #9315
- [MISC] use dynamic choices for tool-call-parser argument by @key4ng in #9316
- [Docs] Correct and clarify notes in Engine docstring by @JiangJiaWei1103 in #9313
- upgrade xgrammar 0.1.23 and openai-harmony 0.0.4 by @Swipe4057 in #9284
- [PD] Propagate internal server errors from aborted requests to clients instead of blindly returning 200's by @datdo-msft in #8936
- [GLM4.1V and GLM4.5V] Add vision transformer num_dummy_head support: max tp=4 -> max tp=8 by @byjiang1996 in #9059
- [AMD] Reorganize hip-related header files in sgl-kernel by @hubertlu-tw in #9320
- Tiny fix CI by @fzyzcjy in #9306
- [router] Add spec for sglang scheduler by @CatherineSue in #9322
- support for interns1-mini by @CUHKSZzxy in #9299
- [Bug] Fix input arguments of flashinfer_trtllm_moe by @JeremieMelo in #9317
- [router] restructure protocol modules for better organization by @key4ng in #9321
- Add `CMakeLists.txt` binary_dir by @EduardDurech in #7019
- enable marlin fp8 blockwise by @qeternity in #8990
- docs: fix spec by @zhyncs in #9326
- [Minor] Fix the style of sgl-kernel by @merrymercy in #9332
- [Bugfix] fix kv buffer register & dp attention & deepepmoe by @chenxu140 in #9327
- Revert "[feature] Ascend NPU graph support (#8027)" by @iforgetmyname in #9348
- [router] add dsr1, kimi, and qwen reasoning parser by @slin1237 in #9353
- fix: enable multi-GPU Triton fused MoE tuning by @mpashkovskiy in #6295
- [router] add tiktokenizer and sequence in router by @slin1237 in #9354
- [CI] Fix lint issues by @CatherineSue in #9361
- [Router] Add validation module for API parameters by @key4ng in #9335
- [router] adds reasoning parser pooling and thread-safe by @slin1237 in #9360
- [router] Implement gRPC SGLangSchedulerClient by @CatherineSue in #9364
- [router] add tokenizer chat template support by @slin1237 in #9370
- [router] Implement OpenAI Responses API specification by @key4ng in #9367
- Fix mini lb timeout issue by @fzyzcjy in #9369
- Fix triton backend eagle illegal memory access by @ispobock in #9344
- Fix gpt-oss response api streaming issue by @key4ng in #9368
- [feature] Rework Ascend NPU graph support by @iforgetmyname in #9350
- [minor] Sync style changes by @merrymercy in #9376
- [readme] Add SGLang x AMD SF meetup information by @wisclmy0611 in #9380
- [CI] Fix disaggregation failure tolerance CI by @ShangmingCai in #9378
- [Docs] Update contribution guide by @merrymercy in #9383
- Revert "[feature] Rework Ascend NPU graph support" by @iforgetmyname in #9385
- Reduce overhead for fa by not calling heavy CUDA property check by @oraluben in #7375
- Add PDL support for quant kernel and rope kernel by @fzyzcjy in #9106
- Fix the `--allow-auto-truncate` argument in tokenizer manager by @hnyls2002 in #9391
- Refactor allreduce add rmsnorm pattern by @BBuf in #9278
- [2/2] Fuse routed scaling factor into select_experts by @trevor-m in #8690
- Fix FlashInfer GPU <-> CPU sync by @thecodingwizard in #9409
- Support pinning adapter via server args. by @lifuhuang in #9249
- Fix incorrect logic in chat template handling. by @lifuhuang in #9336
- Support DP attention with GPT-OSS by @nvcastet in #9359
- Fixed the issue where eagle3 TPOT was not as good as without eagle3. by @jiapingW in #9404
- fix: InternS1 don't recognize image, updates image token for InternVL processor by @JustinTong0323 in #9381
- misc: parse bench_serving result as markdown table by @mickqian in #9377
- Add support for Qwen3-seq-cls by @nathanrchn in #9357
- Support trtllm_allreduce_fusion in flashinfer for cuda<12.8 by @strgrb in #9339
- [router] Add IGW (Inference Gateway) Feature Flag by @key4ng in #9371
- [router] add tokenizer integration test with real mini tokenizer by @CatherineSue in #9413
- [router] add glm and step3 reasoning parser by @CatherineSue in #9415
- Fix max_seq_len_k in trtllm_mha attention backend by @Qiaolin-Yu in #9416
- Fix biased_grouped_topk_cpu by @CaoE in #9420
- [PD] Fix nvlink transport accuracy through transferring metadata with tcp by @ShangmingCai in #9261
- [bug] fix errors related to context length in SD by @hnyls2002 in #9388
- feat: Add Triton fallback option and SM120 MoE configs for FP8 models by @voipmonitor in #9251
- [feature] Ascend NPU graph support by @VDV1985 in #9399
- Fix FP4 inference corruption issue in glm4.5-air model by @Azure-Tang in #9346
- Fix tiny misalign with previous truncation setting in tokenizer_manager by @hnyls2002 in #9430
- [NVIDIA] Fix trtllm fp4 moe backend when used in MTP by @kaixih in #9384
- Enables speculative decoding for the trtllm_mla attention backend by @pranavm-nvidia in #9238
- ci: enhance xeon ci by @DiweiSun in #9395
- [Bug] Fix w4afp8 moe kernel by @yuhyao in #9392
- Refactor weight offloading logic by @fzyzcjy in #8521
- Fix quant kernel test errors and benchmark wrong output speeds by @fzyzcjy in #7604
- [fix] Fix mxfp4 weight loading bug with TP sharding in GPT-OSS by @hlu1 in #9433
- [router] add tokenizer benchmark by @slin1237 in #9427
- [5/n] decouple quantization implementation from vLLM dependency by @Hongbosherlock in https://github.com/sgl-project/sglang/pull/9454
- accommodate reasoning_effort set in chat_template_kwargs by @gongwei-130 in https://github.com/sgl-project/sglang/pull/9458
- fix: should return an invalid request response when schema is missing by @gongwei-130 in https://github.com/sgl-project/sglang/pull/9461
- fix: support fb fp8 by @zhyncs in https://github.com/sgl-project/sglang/pull/9462
- Add deepseek v3.1 thinking parser support and update docs by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/9464
- feat: add fused moe config for GLM-4.5-Air-FP8 on B200 by @zixuanzhang226 in https://github.com/sgl-project/sglang/pull/9463
- [FA3] Init Spec Page Table only when Spec is enabled to save ~40MB by @hebiao064 in https://github.com/sgl-project/sglang/pull/9455
- fix: tmp revert gpt oss tp sharding on hopper by @zhyncs in https://github.com/sgl-project/sglang/pull/9469
- feat: update auto_choose_speculative_params by @zhyncs in https://github.com/sgl-project/sglang/pull/9470
- Revert "bugfix: Fix output_ids extraction in detokenizer_manager" by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/9467
- Update reasoning parser doc by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/9468
- Add Support for Page Size greater than 1 for Flashinfer MLA Backend by @pavanimajety in https://github.com/sgl-project/sglang/pull/8593
- [AMD] Remove the deprecated C10_WARP_SIZE by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/9356
- Support MHA with chunked prefix cache for flashinfer/flashmla backend, support page size > 1 for MHA chunked prefix by @xu-yfei in https://github.com/sgl-project/sglang/pull/8616
- [router] remove all tokenizer metrics for performance by @CatherineSue in https://github.com/sgl-project/sglang/pull/9474
- [code clean] add H20 cutlass groupGemm default config by @kousakawang in https://github.com/sgl-project/sglang/pull/9333
- [docs]: fix reasoning context in docs by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/9483
- [Docs] Update reasoning parser doc & fix outdated link by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/9492
- [router] add tool parser base structure and partial json parser by @CatherineSue in https://github.com/sgl-project/sglang/pull/9482
- [router] fix router load guard tracking for streaming by @slin1237 in https://github.com/sgl-project/sglang/pull/9491
- torch.compile() mrope by @timmy-feng in https://github.com/sgl-project/sglang/pull/9487
- Add trtllm_mla and cutlass_mla for ragged fmha for chunked prefill by @elfiegg in https://github.com/sgl-project/sglang/pull/9480
- chore: bump sgl-kernel v0.3.6.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/9475
- Update docker file for supporting PD-Disaggregation on MI300x by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/9494
- [Docs] Add doc and quick demo for gpt-oss responses api & buildin tools by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/9497
- Support speculative decoding in the trtllm_mha attention backend by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/9331
- minor: determine mm attn backend based on platforms by @mickqian in https://github.com/sgl-project/sglang/pull/9303
- Disable torch.compile for get_last_loc_large_page_size_large_top_k by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/9507
- [bugfix] Make --enable-hierarchical-cache and --disable-radix-cache mutually exclusive by @XucSh in https://github.com/sgl-project/sglang/pull/9452
- 3fs zerocopy by @pansicheng in https://github.com/sgl-project/sglang/pull/9109
- [HiCacheStorage] backup optimization for MLA model by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/8865
- Use Tensor Core Decode when gqa group size >= 4 by @Edenzzzz in https://github.com/sgl-project/sglang/pull/8624
- [router] tokenizer arch doc by @slin1237 in https://github.com/sgl-project/sglang/pull/9513
- [MTP] Force greedy sampling on AMD by @datdo-msft in https://github.com/sgl-project/sglang/pull/9127
- [router] add json tool parser by @slin1237 in https://github.com/sgl-project/sglang/pull/9516
- [NVIDA] [1/N] Nvfp4 Masked Gemm: Add quant op for the flashinfer grouped gemm by @kaixih in https://github.com/sgl-project/sglang/pull/9200
- [AMD] Fix Llama 4 FP8 accuracy issues on MI300X by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/7699
- Add Qwen3-30B-A3B-Thinking-2507 support on AMD GPUs. by @sogalin in https://github.com/sgl-project/sglang/pull/9456
- [router] Move all protocols to spec.rs file by @key4ng in https://github.com/sgl-project/sglang/pull/9519
- [router] ignore client error when record failure in pd_router by @Bruce-x-1997 in https://github.com/sgl-project/sglang/pull/9503
- Add support for extensions of interface and pre-registrations to NIXL HiCache by @mkhazraee in https://github.com/sgl-project/sglang/pull/9211
- Support GC Freezing to improve latency & throughput by @chanh in https://github.com/sgl-project/sglang/pull/9241
- Add enable_flashinfer_mxfp4_bf16_moe for higher precision and slower moe backend by @fzyzcjy in https://github.com/sgl-project/sglang/pull/9004
- [benchmark] Add benchmark scripts for ceval and boolq by @yuxingcyx in https://github.com/sgl-project/sglang/pull/8946
- fix: blackwell dsv3 fp8 issue temporary solution by @zhyncs in https://github.com/sgl-project/sglang/pull/9530
- tool-call(dsv3): Improve deepseek-v3 chat template and `tool_choice = required` by @CatherineSue in https://github.com/sgl-project/sglang/pull/9525
- [fix] Fix mxfp4 triton MoE tp bug by @hlu1 in https://github.com/sgl-project/sglang/pull/9473
- Overlapped weight offload by @fzyzcjy in https://github.com/sgl-project/sglang/pull/8034
- Tiny make device_loading_context more static by @fzyzcjy in https://github.com/sgl-project/sglang/pull/9478
- Partially unify triton per token group quant kernels by @fzyzcjy in https://github.com/sgl-project/sglang/pull/9485
- feat(hicache): Supports 3fs-hicache compatibility with dp-attention by @hzh0425 in https://github.com/sgl-project/sglang/pull/9372
- Update grok.py and tiktoken tokenizer by @merrymercy in https://github.com/sgl-project/sglang/pull/9532
- Release 0.5.1 by @merrymercy in https://github.com/sgl-project/sglang/pull/9533
## New Contributors
- @sighingnow made their first contribution in #8611
- @vvenkates27 made their first contribution in #8488
- @farazkh80 made their first contribution in #8632
- @yrk111222 made their first contribution in #8083
- @pkking made their first contribution in #8270
- @ZacWang made their first contribution in #8664
- @lbh2001 made their first contribution in #8618
- @wenchen76 made their first contribution in #8512
- @WANG-GH made their first contribution in #7379
- @shenoyvvarun made their first contribution in #8683
- @17Reset made their first contribution in #8547
- @TianQiLin666666 made their first contribution in #8678
- @YyWangCS made their first contribution in #8733
- @azhurkevich made their first contribution in #8552
- @yuhyao made their first contribution in #8596
- @House-West made their first contribution in #8144
- @ZhengWG made their first contribution in #8292
- @htiennv made their first contribution in #8698
- @triple-Mu made their first contribution in #8799
- @tonyluj made their first contribution in #8971
- @maocheng23 made their first contribution in #8371
- @cctry made their first contribution in #8245
- @chi2liu made their first contribution in #8973
- @Hangzhi made their first contribution in #9080
- @jinmingyi1998 made their first contribution in #8866
- @Makcum888e made their first contribution in #8293
- @ovowei made their first contribution in #8766
- @xxrjun made their first contribution in #9075
- @yichaolemon made their first contribution in #9096
- @JeremieMelo made their first contribution in #9014
- @ichernob made their first contribution in #8619
- @nysa-liu made their first contribution in #7957
- @changhuaixin made their first contribution in #9169
- @aleozlx made their first contribution in #9162
- @Misaka9468 made their first contribution in #9101
- @hhzguo made their first contribution in #8818
- @forestlee95 made their first contribution in #8852
- @LPhgh made their first contribution in #9183
- @shilinlee made their first contribution in #9208
- @jy-song-hub made their first contribution in #8777
- @jeffdn made their first contribution in #9201
- @VDV1985 made their first contribution in #8027
- @netanel-haber made their first contribution in #9067
- @kousakawang made their first contribution in #9272
- @gongwei-130 made their first contribution in #9315
- @datdo-msft made their first contribution in #8936
- @CUHKSZzxy made their first contribution in #9299
- @EduardDurech made their first contribution in #7019
- @chenxu140 made their first contribution in #9327
- @mpashkovskiy made their first contribution in #6295
- @oraluben made their first contribution in #7375
- @thecodingwizard made their first contribution in #9409
- @jiapingW made their first contribution in #9404
- @nathanrchn made their first contribution in #9357
- @CaoE made their first contribution in #9420
- @voipmonitor made their first contribution in #9251
- @Azure-Tang made their first contribution in #9346
- @pranavm-nvidia made their first contribution in #9238
- @hlu1 made their first contribution in #9433
- @timmy-feng made their first contribution in https://github.com/sgl-project/sglang/pull/9487
- @Bruce-x-1997 made their first contribution in https://github.com/sgl-project/sglang/pull/9503
- @mkhazraee made their first contribution in https://github.com/sgl-project/sglang/pull/9211
- @yuxingcyx made their first contribution in https://github.com/sgl-project/sglang/pull/8946
**Full Changelog**: v0.4.10...v0.5.1