sgl-project/sglang
Release v0.5.2

What's Changed

  • feat: allow using a local branch to build the image by @gongwei-130 in #9546
  • [readme] Include additional resources for the SGLang x AMD SF Meetup event by @wisclmy0611 in #9547
  • [doc] deepseekv31 support by @XiaotongJiang in #9544
  • fix(grok): remove duplicate replicate_lm_head configuration by @vincentzed in #9549
  • chore: update configurer by @zhyncs in #9557
  • chore: bump v0.5.1.post1 by @zhyncs in #9558
  • [router] add the correct rustls dependency in sgl-router Cargo.toml by @Bruce-x-1997 in #9498
  • fix: use sgl-kernel 0.3.5 by @zhyncs in #9565
  • Add target module validation for init adapters by @Beichen-Ma in #9429
  • fix: Update OpenAI client base URL in documentation by @JustinTong0323 in #9576 (see the first sketch after this list)
  • [PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats by @SCDESPERTATE in #7317
  • remove redundant rank0_log function. by @miter6 in #9560
  • Update CUTLASS 4.2 & Enable K-Major Scale Factor for SM90 FP8 Blockwise Group GEMM by @HydraQYH in #9559
  • Reintroduce memory usage fix by @fzyzcjy in #9535
  • Offload tensors by sharding on GPU by @fzyzcjy in #9536
  • bugfix for undefined logging functions in HarmonyBrowserTool & HarmonyPythonTool by @CiaranZhou in #9229
  • chore: upgrade flashinfer 0.2.14.post1 by @zhyncs in #9578
  • fix: revert #8593 by @zhyncs in #9581
  • fix: resolve tuning fused moe issue by @zhyncs in #9587
  • Tiny fix wrong comments by @fzyzcjy in #9589
  • chore: update config by @zhyncs in #9591
  • chore: bump v0.5.1.post2 by @zhyncs in #9592
  • [Doc] add LWS(LeaderWorkerSet) use case in sgl-router README by @Bruce-x-1997 in #9568
  • [Performance] Batch Send from Tokenizer Manager. by @sundar24295s in #9436
  • Fix GLM45 tool call multi-turn bug by @byjiang1996 in #9500
  • Fix GLM45v launch server cuda torch compile bug by @byjiang1996 in #9554
  • Fix Harmony reasoning parser and auto-separation for gpt-oss models by @jonaslsaa in #9190
  • [docs] Refactor, remove compiled results and add gpt-oss by @zhaochenyang20 in #9613
  • [Fix] HiCache Bugfix & Mooncake Error Handling Enhance by @ykwd in #8901
  • Improve bench_one_batch_server script by @hnyls2002 in #9608
  • [router] add mistral tool parser by @slin1237 in #9622
  • [router] add qwen tool parser by @slin1237 in #9623
  • [router] add pythonic parser by @slin1237 in #9628
  • [router] add llama tool parser by @slin1237 in #9629
  • [router] add ut for mistral, llama, pythonic, and streaming tool parser by @slin1237 in #9632
  • [new feat] Ascend backend support for fia fusion kernel by @ZhengdQin in #8328
  • model: Support nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 by @netanel-haber in #9301
  • Fix lint for router by @hebiao064 in #9636
  • [docs] Update README with additional highlights and resources for SGLang x AMD SF Meetup by @wisclmy0611 in #9640
  • Add reasoning_effort param in TiktokenTokenizer.apply_chat_template by @lshmouse in #9630
  • fix: allow user to specify function as role by @GavinZhu-GMI in #9635
  • Fix kimi k2 function calling format by @XiaotongJiang in #9606
  • [router] address worker load tracking consistency by @slin1237 in #9523
  • [router] add token bucket rate limiter by @CatherineSue in #9656 (see the token-bucket sketch after this list)
  • [doc] add kimik2 --tool-call-parser by @XiaotongJiang in #9647
  • Install py-spy by default for containers for easier debugging by @fzyzcjy in #9649
  • BugFix(hicache): Fix host indices out of bound error by @hzh0425 in #9637
  • HiCache Storage fix host memory leak by @xiezhq-hermann in #9648
  • add response_format support for the completion API by @cicirori in #9665 (see the response_format sketch after this list)
  • Fix FA3 swa spec verify topk>1 by @ispobock in #9658
  • [RL] fix register the same ops multiple times by @hebiao064 in #9564
  • chore: enhance bench_serving for vlms with a new dataset of configurable image count and resolution by @mickqian in #9583
  • refactor(hicache): Introduce generic HiCacheStorageConfig for improved configuration management by @hzh0425 in #9555
  • feat: (chat-template matching) enhance multimodal model detection with config.json by @KEVINTUAN12 in #9597
  • [docs] Instructions for bench_serving.py by @yhyang201 in #9071
  • Support DeepSeek-V3.1 tool call by @Xu-Wenqing in #9446
  • Add A100 fused MoE kernel configs for DeepSeek by @ehuaa in #9677
  • support cuda 13.0 and trtllm kernel by @rainj-me in #9495
  • fix: HiRadixCache: fix prefetch completion race by @pabloiyu in #9397
  • fix mooncake store mla zero copy meta by @huangtingwei9988 in #9678
  • move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py by @merrymercy in #9679
  • [router] restructure tool parser module folder by @slin1237 in #9693
  • [router] add deepseek tool parser by @slin1237 in #9694
  • Quick fix for loading processor for supporting internvl3_5 series by @yilian49 in #9676
  • Fix get_ip when no external network by @whybeyoung in #9700
  • Sets default model name in request classes by @JustinTong0323 in #9683
  • [router] add step3 tool parser by @slin1237 in #9695
  • [router] add kimi-k2 tool parser by @slin1237 in #9702
  • [router] add gpt-oss and glm4 tool parser by @slin1237 in #9703
  • [sgl-kernel] misc: update deepgemm version for sgl-kernel by @FlamingoPg in #9340
  • chore: upgrade sgl-kernel 0.3.7 by @zhyncs in #9708
  • chore: bump v0.5.1.post3 by @zhyncs in #9716
  • [router] upgrade kernel version in pd ci by @CatherineSue in #9720
  • [Sync] Update mxfp4.py (20250827) by @merrymercy in #9724
  • [router] fix error response in pd_router by @Bruce-x-1997 in #9505
  • [router] Add MCP Tool Handler by @key4ng in #9615
  • gpt-oss blog reproduction document by @hnyls2002 in #9728
  • [router] additional pythonic parser unit test by @slin1237 in #9730
  • [router] additional llama32 parser unit test and multi json support by @slin1237 in #9732
  • support mooncake store dp attention by @huangtingwei9988 in #9684
  • add support for nvidia/gpt-oss-120b-Eagle3 by @zyksir in #9739
  • Move git clone command up from README by @JustinTong0323 in #9740
  • [feat] Reduce GPU memory overhead by using weakref by @yhyang201 in #9673
  • Support speculative decoding in hybrid attention backend by @Qiaolin-Yu in #9573
  • [router] add llama3.2 multi json streaming parser by @slin1237 in #9735
  • Support compile sgl-kernel on cuda 13.0 by @rainj-me in #9721
  • [Sync] Update server_args.py (20250828) by @merrymercy in #9745
  • [router] grpc router bootstraps by @slin1237 in #9759
  • [AMD] Support Hierarchical Caching on AMD GPUs by @hubertlu-tw in #8236
  • feat: add tuned fused moe config for GLM-4.5-Air-FP8 tp = 4 on B200 by @zixuanzhang226 in #9770
  • [Feature] Support NPUGraph for DeepSeek on Ascend NPU by @chenxu140 in #9355
  • feat(draft_model): support draft_model for RemoteModelLoader by @DellCurry in #6407
  • fix: fix MLA for ShardedModelLoader/RemoteModelLoader by @DellCurry in #6287
  • Optimize prefill performance on cpu backend by @mingfeima in #8750
  • [HiCache] change the default policy to write through by @xiezhq-hermann in #9772
  • bugfix(hicache): Move exists check before key suffixing by @hzh0425 in #9749
  • Skip some tests on Blackwell by @hlu1 in #9777
  • Raise error when topk>1 and page>1 for paged attention backends. by @hnyls2002 in #9784
  • ROCm 7.0 update by @sogalin in #9757
  • add bench_mix.py by @pansicheng in #9788
  • Make sm100 fp8 kernels available on sm103 by @hlu1 in #9789
  • Accommodate JSON schema in the "schema" field, not the "json_schema" field, of response_format by @gongwei-130 in #9786 (see the response_format sketch after this list)
  • [PD] Support get_model_info interface for mini_lb by @XucSh in #9792
  • [HiCache] resolve conflict between chunked-prefill and hicache hit count by @xiezhq-hermann in #9776
  • feat(hicache-3fs): 3FS-Store Backup Optimizations For MLA Model. by @hzh0425 in #9692
  • support enable in the reasoning field to enable thinking for thinkin… by @gongwei-130 in #9715
  • feat: Add flexible validation for partial weight updates by @GeLee-Q in #9663
  • feat: add original logprobs to response by @narutolhy in #8375
  • [feat] Support EAGLE3 for Qwen2 by @KerwinKai in #9216
  • chore: upgrade flashinfer 0.3.0rc1 by @zhyncs in #9793
  • [ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization by @pavanimajety in #9712
  • Fix TRTLLM MLA Cuda KV Blocks Causing accuracy drop by @farazkh80 in #9675
  • [NVIDIA] [2/N] Optimize silu_and_mul_scaled_fp4_grouped_quant perf by @kaixih in #9556
  • Adds initialize_moe_config to bench_one_batch so MOE backend is respected by @pranavm-nvidia in #9670
  • Small bug fix in transformers model implementation by @yilian49 in #9809
  • feature(eplb): add min-rebalancing-utilization-threshold for eplb by @hzh0425 in #8345
  • Make fp4_quantize kernels work on sm103 by @hlu1 in #9807
  • fix: dsv3 lite q_lora_rank none by @zhyncs in #9815
  • Fix memory leak when aborting decode request in PD-Disagg by @hnyls2002 in #9817
  • chore: fix cuda driver api issue and bump sgl-kernel 0.3.7.post1 by @zhyncs in #9746
  • chore: update Dockerfile by @zhyncs in #9820
  • Fix typo in warning message about DeepGEMM JIT by @mmangkad in #9802
  • chore: upgrade sgl-kernel 0.3.7.post1 with deepgemm fix by @zhyncs in #9822
  • [sgl-kernel] fix: fix missing FetchContent_Populate for fmt by @FlamingoPg in #9826
  • chore: upgrade transformers 4.56.0 by @zhyncs in #9827
  • [Auto Sync] Update parallel_state.py (20250830) by @merrymercy in #9828
  • [CI] Fix the trigger condition for PR test workflows by @merrymercy in #9761
  • [CI] Code sync tools by @merrymercy in #9830
  • Update guidelines for syncing code between repos by @merrymercy in #9831
  • hot fix for mooncake batch set api by @xiezhq-hermann in #9836
  • [router] add reasoning parser readme by @slin1237 in #9837
  • Tool parser benchmark by @CatherineSue in #9835
  • [Model] Support Meituan LongCat-Flash && LongCat-Flash-MTP by @Orchard-DT in #9824
  • [router] global tool parser registry by @CatherineSue in #9840
  • [feat] Ascend NPU Gemma-3-12b and Gemma-3-27b support by @VDV1985 in #8909
  • [Performance] Improve Qwen RMSNorm by replacing with native RMSNorm op by @vincentzed in #9709
  • [HiCache] Clear kvcache in storage backend with fastAPI by @stmatengss in #9750
  • Fix input logprob index for a batch that includes both requests with input logprob and requests without input logprob. by @merrymercy in #9841
  • Fuse gate_proj and up_proj in Qwen 2.5 VL's vision MLP by @AlienKevin in #9661
  • [HiCache] Storage Refactoring by @xiezhq-hermann in #9797
  • fix set_internal_state API by @hnyls2002 in #9850
  • fix inconsistent arguments for generated shared prefix bench by @pbkowalski in #9073
  • fix(hicache-long-bench): adjust context workload generator to use full query set by @hzh0425 in #9847
  • Disable radix cache in test_lora_update.py for better stability by @Fridge003 in #9852
  • Tiny allow DeepGEMM on cu12.9 by @fzyzcjy in #9858
  • Update docker build workflows for gfx942 ROCm 7.0. by @saienduri in #9794
  • Support Multi Process Tokenizer Manager (#6555) by @whybeyoung in #8964
  • chore: upgrade flashinfer 0.3.0 by @zhyncs in #9864
  • chore: bump v0.5.2rc0 by @zhyncs in #9862
  • Mooncake store get zero copy meta optimization by @huangtingwei9988 in #9857
  • [router] add tokenizer download support from hf hub by @CatherineSue in #9882
  • support fp8 kvcache for hybrid attn backend on GPT-OSS by @rainj-me in #9783
  • [HiCacheStorage] fix abort request host memory leaks by @huangtingwei9988 in #9874
  • [HiCacheStorage]: Improve 3fs kvstore's performance and resolve mla issues by @hzh0425 in #9876
  • [router] Fix short timeout for the prefill client by @LukasBluebaum in #9803
  • [code style] restructure fused_moe to avoid a very long single file by @BBuf in #9878
  • [router] add grpc pd and regular router init by @CatherineSue in #9893
  • [router] fix FunctionCallResponse proto; support null arguments by @Bruce-x-1997 in #9875
  • [feat] Support tp mode for DeepSeek-R1-W4AFP8 by @chenxijun1029 in #8118
  • Move multi-tokenizer event loop to better place by @ShangmingCai in #9902
  • [chore] fix dead links in doc by @lifuhuang in #9913
  • Change tensor alignment method to mn major by @mmangkad in #9844
  • chore: bump v0.3.8 sgl-kernel by @zhyncs in #9907
  • [Fix] fix the issue encountered when running inference with LongCat-Flash/MTP EP MoE on B200 by @Orchard-DT in #9916
  • fix parallel_state.py current_platform bug by @BBuf in #9919
  • [feat] apply deep_gemm compile_mode to skip launch by @Alcanderian in #9879
  • fix: update router deps by @zhyncs in #9921
  • chore: bump v0.5.2rc1 by @zhyncs in #9920
  • [Hicache] Generic page get bugfix by @ykwd in #9909
  • Support the internvl3.5 family models in sglang by @yilian49 in #9705
  • [router] include rust benchmarks by @slin1237 in #9932
  • Fix the key passing issue in page first layout. by @hzh0425 in #9929
  • [router] fix grpc client url normalization and health check by @CatherineSue in #9939
  • [model] support MiniCPM-V 4.0 by @tc-mb in #8747
  • [HiCache] Minor fix on file storage backend by @xiezhq-hermann in #9869
  • Move parsers under a single folder by @merrymercy in #9912
  • [Fix] DeepSeek EP accuracy issue on B200 GPUs by @alhridoy in #9946
  • fix(cache): move ongoing_prefetch pop after validation to prevent leak by @xiaguan in #9927
  • Remove annoying warnings in sgl kernel build by @merrymercy in #9905
  • Update tool_chat_template_deepseekv31.jinja by @WangJianQ-0118 in #9895
  • Qwen FP8/NVFP4 ModelOPT Quantization support by @jingyu-ml in #7912
  • Optimized deepseek-v3/r1 model performance on mxfp4 run by @kkHuang-amd in #9671
  • add proctitle for tokenizers by @hnyls2002 in #9952
  • [feat] Add P/D attention select for draft model by @Ximingwang-09 in #9755
  • Revert "[Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)" by @zhyncs in #9955
  • Revert "Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)" by @zhyncs in #9959
  • [benchmark] add flashinfer_allreduce_fusion benchmark by @BBuf in #9937
  • [1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel) by @yuhyao in #9953
  • [router] Add Rerank API Specification by @fangjian601 in #9906
  • [router] add chat_template_kwargs in ChatCompletionRequest by @tonyluj in #9958
  • Remove mrope position sync by @timmy-feng in #9460
  • fix swa clear(): rename is_in_free_group to is_not_in_free_group by @JustinTong0323 in #9914
  • Triton 3.4.0 MoE config for Deepseek TP16 H100 by @SzymonOzog in #9978
  • nsys profile output kernel classifier by @gracehonv in #9314
  • Minor update regarding issue #9704 by @elfiegg in #9733
  • [Auto Sync] Update parallel_state.py, few_shot_gsm8k.py (20250903) by @merrymercy in #9986
  • feat: add gpt oss b200 ci by @zhyncs in #9988
  • [router] move tokenizer, reasoning, tool initialization to server by @slin1237 in #9996
  • [router] clean up dependency injector to use ctx by @slin1237 in #10000
  • [router] fix grpc connection mode detection by @slin1237 in #9999
  • [Fix] gpt-oss mxfp4 model run failed on ROCm platform by @kkHuang-amd in #9994
  • Fix Llama 4 with MXFP4 dynamic quant on MI35x by @hubertlu-tw in #9993
  • [Bugfix] fix pd chat completion protocol for batching support by @tonyluj in #10016
  • fix: health_generate endpoint in mini_lb by @wxsms in #9997
  • [1/N] DP-refactor: move dp balance code into scheduler's mixin class by @hnyls2002 in #10004
  • Ensure chunked request extension length respects both rem_chunk_tokens and rem_total_tokens limits by @pansicheng in #10003 (see the clamping sketch after this list)
  • feat(hicache): Add generic hicache ci e2e test and benchmark test by @hzh0425 in #9846
  • Optimize Qwen3-moe model by using flashinfer fused allreduce by @yuan-luo in #9973
  • [Doc] Fix SGLang tool parser doc by @PopSoda2002 in #9886
  • metrics: support custom buckets for prompt/generation_tokens_histogram by @acelyc111 in #9634
  • fix 3fs zerocopy by @pansicheng in #9938
  • Save memory for expert model parallel by @ch-wan in #9957
  • [Hicache] Mooncake API Fix & Test, and Improved Readme by @ykwd in #9951
  • Optimized deepseek-v3/r1 model performance on mxfp4 run by @kkHuang-amd in #10008
  • Fix accuracy drop of dsv3 run in dp enablement by @kkHuang-amd in #8677
  • chore: bump v0.5.2rc2 by @zhyncs in #10050
  • fix: update gb200 dep by @zhyncs in #10052
  • Simplify Router arguments passing and build it in docker image by @hnyls2002 in #9964
  • [router] fix release workflow to include protobuf by @CatherineSue in #10055
  • fix MultiTokenizerWrapper name by @LLLL114 in #10049
  • Integrate trtllm ragged attention for prefill self-attention by @elfiegg in #9801
  • [Vulnerability] feat(conn): set bootstrap server host by @jinmingyi1998 in #9931
  • Fix typo in scheduler by @JamesLim-sy in #9934
  • [1/2] Optimizations and refactors about quant kernel by @fzyzcjy in #9534
  • Tiny support setting numa nodes for different ranks by @fzyzcjy in #10006
  • [Fix] Add speculative_draft_model_revision to server_args by @DevashishLal-CB in #5255
  • Forbid DeepEP race condition when there are too many tokens by @fzyzcjy in #9567
  • Support simple evals in text comparator by @fzyzcjy in #8867
  • Fix and enhance dumper by @fzyzcjy in #8725
  • Tiny let DeepGEMM scale checks cover more cases by @fzyzcjy in #7182
  • Support copying tensor from cpu to gpu without using copy engines by @fzyzcjy in #10007
  • [router] add py binding unit tests to coverage 80% by @key4ng in #10043
  • [router] add rust cache for rust unit test by @key4ng in #10079
  • [router] add rust cache by @slin1237 in #10080
  • enable aiter gemm_a8w8_bpreshuffle for ptpc gemm by @Yuechguo in #8555
  • [bugfix]: use correct cache location for cross attention in torch native backend by @MahmoudAshraf97 in #8622
  • Update flashinfer to 0.3.1 for B300 support by @hlu1 in #10087
  • [Bug Fix] Fix Glm4vVisionBlock norm by @sdpkjc in #9884
  • Update wave-lang to 3.7.0 and unify Wave kernel buffer options by @yichiche in #10069
  • Add storage read/write bandwidth logs to monitor kvcache performance by @pansicheng in #9965
  • [Minor] Refactors KV memory pool by @JustinTong0323 in #9842
  • support Llama4 with non-uniform intermediate size across layers for… by @gongwei-130 in #10047
  • [router] move to mcp sdk instead by @slin1237 in #10057
  • [router] Introduce router integration tests by @key4ng in #10086
  • Add lora_path argument to bench_multiturn.py by @Fridge003 in #10092
  • [HiStorage] Remove delete and clear as necessary methods by @xiezhq-hermann in #10039
  • Modify ci workflow for auto-partitioning in 2-GPU backend tests by @hzh0425 in #10029
  • Revert "[1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel) (#9953)" by @zhyncs in #10097
  • Fix RMSNorm API call mismatch issue. by @sogalin in #10032
  • fix double sparsity initialization by @shadowpa0327 in #6905
  • [Fix] illegal sync based on undefined behaviour by @DevashishLal-CB in #9620
  • [7/N] MoE Refactor: the implementation of new framework by @ch-wan in #9269
  • [NVIDIA] Remove unused get_fused_moe_impl_class function by @kaixih in #9764
  • [NVIDIA] disable chunked prefix cache when dp is used by @kaixih in #9861
  • perf: Avoid unnecessary data type conversions for DeepSeek-V3 on Blackwell by @jinyangyuan-nvidia in #9834
  • [Fix] Compatibility between DP attention and pipeline parallelism by @ch-wan in #10100
  • Fix circular import by @ch-wan in #10107
  • Disable kernel cutlass_mla_decode on SM103 by @hlu1 in #10058
  • Remove non-accelerated targets (100 and up) from cmake by @hlu1 in #10041
  • [chore] Remove unused ep_moe cuda kernels by @hlu1 in #9956
  • [CI] Refactor disaggregation tests by @ShangmingCai in #10068
  • increase the rust e2e timeout by @key4ng in #10116
  • [router] Improve the e2e tests by @key4ng in #10102
  • [Auto Sync] Update server_args.py (20250906) by @merrymercy in #10117
  • Optimize moe_sum_reduce_kernel by @yuan-luo in #9477
  • [Feature] LMCache Connector Integration by @Oasis-Git in #9741
  • CUTLASS fp8 blockwise gemm support of sm120 by @jianyingzhu in #9969
  • Optimize nvfp4 block scaled gemm kernel when M is small. by @HydraQYH in #10101
  • Fix cuda graph mode in flashinfer attn backend by @benbarsdell in #10056
  • [HiCache] fix: check clear() method for storage backend by @stmatengss in #10096
  • add dataset_path for bench_one_batch_server.py by @miter6 in #10113
  • [Auto Sync] Update parallel_state.py (20250907) by @merrymercy in #10126
  • [Minor] fix lint in main by @DarkSharpness in #10128
  • [1/2] Refactor multi-tokenizer manager by @hnyls2002 in #10074
  • Fix flashinfer version in sgl-kernel by @merrymercy in #10135
  • [DOC]: some minor updates by @yyihuang in #10134
  • [BUG FIX] add a failure check for when get fails while waiting for a complete block by @mss1213 in #9971
  • [MoE] fix: incorrect weight initialization for cutlass_fused_experts_fp8 by @ch-wan in #10144
  • Enable GLM4.1V server testing and fix video processing by @JustinTong0323 in #10095
  • Fix slow fused add RMSNorm by @fzyzcjy in #10141
  • fix the bug where fp8 topk_config.correction_bias is None by @rainj-me in #10040
  • Qwen2.5-VL eagle3 infer by @Lzhang-hub in #8801
  • Fix run time error in dsv3-fp8 model on mi35x by @kkHuang-amd in #10104
  • Standalone speculative decoding by @Qiaolin-Yu in #10090
  • Add graph runner support with torch compile on CPU by @CaoE in #7843
  • move compile threads to an option to avoid OOM on low memory host by @rainj-me in #10123
  • [1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel, fixed) by @yuhyao in #10108
  • [Bugfix] Retract not releasing enough memory when page size > 1 by @xiezhq-hermann in #9989
  • Add speculator attention backend switch by @cicirori in #9981
  • Fix: (glm4v) Add missing field by @JustinTong0323 in #10147
  • [Bugfix] Qwen3MoE aclrtMemcpy failed with NPUGraph by @iforgetmyname in #10013
  • enable auto-round quantization model by @WeiweiZhang1 in #6226
  • Revert "enable auto-round quantization model (#6226)" by @zhyncs in #10148
  • enable llama3.1-8B on xpu by @huaiyuzh in #9434
  • [Bug fix] Fix Gemma 2 and fix Gemma 3 multimodal with bs > 1 on NPU by @ssshinigami in #9871
  • update xgrammar 0.1.24 and transformers 4.56.1 by @Swipe4057 in #10155
  • [2/N] DP-Refactor: move communicators into tokenizer_communicator_mixin by @hnyls2002 in #10028
  • [Hicache]: Add E2E CI For 3FS-KVStore by @hzh0425 in #10131
  • Monkey patch uvicorn multi worker is_alive timeout by @hnyls2002 in #10159
  • [CI] fix ambiguous argument in testing hybrid attentions. by @hnyls2002 in #10161
  • [1/2] Speed up prefill mla attention by @fzyzcjy in #10156
  • [Bug fix] Fix ascend mla in aclgraph by @alanhe151220037 in #9925
  • perf: Add H20 fp8 fused MoE kernel configs for Qwen3 by @Zhiy-Zhang in #10166
  • [fix] Relax white space rules in EBNFComposer by @LukasBluebaum in #9595
  • Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" by @zhyncs in #10176
  • [Bench] feat: mooncake trace integration by @stmatengss in #9839
  • fix: resolve lint issue by @zhyncs in #10181
  • fix the cutlass moe tests by @rainj-me in #10182
  • gb200: update dockerfile to latest kernel by @ishandhanani in #9522
  • Cleaning codes for speculative attention mode by @Fridge003 in #10149
  • Revert "feat: add fused moe config for Qwen3-30B-A3B on B200" by @rainj-me in #10185
  • [Fix] Orphan process in data parallel by @Capronir in #7995
  • Update link for EAGLE speculative decoding by @gerayking in #10191
  • [CPU] Fix phi4-mm prompt issue in bench_serving by @blzheng in #9900
  • Updated Nvidia Jetson docs by @shahizat in #4422
  • [3/N]DP refactor: Improve dp rank scheduling in PD disaggregation mode. by @hnyls2002 in #10169
  • Support opt model by @wenhuipeng in #10165
  • feat: use sgl-kernel cu129 as default by @zhyncs in #10188
  • [Refactor] Remove Hicache Load & Write threads by @DarkSharpness in #10127
  • Explicitly export CMAKE_BUILD_PARALLEL_LEVEL by @key4ng in #10193
  • [CPU] Add gelu_and_mul kernel in sgl-kernel and add ut by @blzheng in #9300
  • feat: support fa cute in sgl-kernel by @zhyncs in #10205
  • Refactor fused_add_rmsnorm import logic by @ShangmingCai in #10207
  • tool-call(dsv3): Fixed a parse problem when there are multiple function definitions in tool_calls by @Missmiaom in #10209
  • [Auto Sync] Update sampling_batch_info.py (20250909) by @merrymercy in #10212
  • chore: bump v0.3.9 sgl-kernel by @zhyncs in #10208
  • add variable TP Decode > Prefill size support by @shaharmor98 in #9960
  • [Fix] KV-cache eviction mismatch across PP ranks in DeepSeek V3/R1 by @qhsc in #10214
  • chore: upgrade v0.3.9 sgl-kernel by @zhyncs in #10220
  • Revert the changes on NCCL symmetric memory by @merrymercy in #10210
  • Revert "Revert the changes on NCCL symmetric memory" by @merrymercy in #10238
  • [HiCache] feat: add mooncake backend extra config by @stmatengss in #10213
  • Add mamba kernel by @yizhang2077 in #10234
  • [Auto Sync] Update io_struct.py (20250909) by @merrymercy in #10236
  • [Auto Sync] Update collector.py, startup_func_log_and_timer... (20250910) by @merrymercy in #10242
  • Revert "chore: upgrade v0.3.9 sgl-kernel" by @merrymercy in #10245
  • refactor(InternVL): Use gpu to preprocess the input image by @KEVINTUAN12 in #9795
  • make --speculative-draft-model an alias of --speculative-draft-model-path by @merrymercy in #10246
  • [UT for RL] Add UT to cover release/resume memory case for moe model by @ryang-max in #8803
  • [Benchmark] Prefill-only benchmark scripts by @sundar24295s in #10240
  • [doc] add walkthrough for implementing and hosting a simple llama wrapper m… by @glenliu21 in #10093
  • Fix: the default choice is wrong for flashinfer mxfp4 moe precision by @LauYeeYu in #10253
  • Page first direct IO kernel by @huangtingwei9988 in #10060
  • support vlm model spec bench by @Lzhang-hub in #10173
  • Fix assertion typo in tp_worker.py by @sgncho in #9954
  • [Auto Sync] Update io_struct.py (20250910) by @merrymercy in #10262
  • Fix potential flakiness in test_lora_qwen3 by @lifuhuang in #10250
  • [router][ci] Add PD router mmlu test by @key4ng in #10256
  • [1/2] Refactor LoRA to support backend-specific batch preprocessing. by @lifuhuang in #10251
  • [Bugfix] Fix Weightloading for the original nvidia/Deepseek-R1-FP4 checkpoint by @pavanimajety in #9940
  • add dual stream for qwen2_moe by @yizhang2077 in #10252
  • Add tests to AMD CI for MI35x by @hubertlu-tw in #9662
  • pass a_scale from the fp8 quant result instead of hard-coding it to 1.0f by @rainj-me in #10241
  • Feat: support disable tool parser by @JustinTong0323 in #10184
  • [Auto Sync] Update serving_base.py, serving_chat.py, servin... (20250910) by @merrymercy in #10282
  • Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" by @zhyncs in #10292
  • chore: bump sgl-kernel 0.3.9.post1 by @zhyncs in #10294
  • [Feature] Support DeepEP normal & Redundant Experts on NPU by @iforgetmyname in #9881
  • add flash linear attention triton kernel by @yizhang2077 in #10239
  • [chore] Add sgl-router to npu images by @BourneSun0527 in #10229
  • [CPU] fix OOM when mem-fraction is not set by @ZailiWang in #9090
  • [fix CI] Fix logical condition in fused MoE layer for compressed tensor quantization by @BBuf in #10299
  • Revert "Fix flashinfer version in sgl-kernel (#10135)" by @zhyncs in #10310
  • chore: bump sgl-kernel 0.3.9.post2 by @zhyncs in #10311
  • [CI] add pyproject.toml to deepseek w4a8 ci by @HanHan009527 in #10314
  • chore: upgrade v0.3.9.post2 sgl-kernel by @zhyncs in #10297
  • Qwen3-Next support by @yizhang2077 in #10233
  • [Auto Sync] Update parallel_state.py (20250911) by @merrymercy in #10326
  • [Minor] Improve the style of server args by @merrymercy in #10328
  • [bugfix] fix norm type error in qwen3_next model by @cao1zhg in #10322
  • [Qwen3-Next] switch to triton and cache conv states to accelerate MTP from 300 tok/s to 341 tok/s by @hebiao064 in #10335
  • [router] add benchmark for regular router and pd router by @key4ng in #10280
  • add h20 qwen3 next config by @yizhang2077 in #10264
  • [router] Add OpenAI backend support - core function by @key4ng in #10254
  • [router][ci] add gpu process check and free port before start server by @key4ng in #10338
  • add qwen3-next doc by @yizhang2077 in #10327
  • fix: trtllm-gen attention take zero-init workspace by @yyihuang in #10330
  • Fix errors of hicache kernels in sgl-kernel for ROCm by @hubertlu-tw in #10339
  • update GLM nightly test threshold by @zminglei in #10331
  • [LongCat] Optimize zero_experts_compute_triton by changing mask by @zk-lover in #10303
  • add try catch for quant config hf download by @gongwei-130 in #10340
  • chore: bump v0.5.2 by @zhyncs in #10221
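
A few of the changes above read more clearly with a small example; the sketches below are illustrative only, not the shipped code. First, for #9576: pointing the OpenAI Python client at a locally launched SGLang server. Port 30000 and the "EMPTY" API key are conventional local defaults assumed here, not guarantees for every deployment.

```python
# Hedged sketch for #9576: the OpenAI client talking to an SGLang server
# through its OpenAI-compatible endpoint. Port 30000 is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```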
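
For #9656, the router's rate limiter lives in sgl-router's Rust code; the Python sketch below only illustrates the token-bucket technique itself (refill proportional to elapsed time, capped at a burst capacity), not the router's actual implementation.

```python
# Generic token-bucket rate limiter (illustrative; not sgl-router's Rust code).
import threading
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self, n: float = 1.0) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill in proportion to elapsed time, never above capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return True
            return False

limiter = TokenBucket(rate=100.0, capacity=200.0)  # ~100 req/s, bursts up to 200
if not limiter.try_acquire():
    print("429: rate limited")
```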
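
For #9665 and #9786 together, a hedged sketch of a completions request whose response_format carries the JSON schema under the "schema" key; the exact payload shape a given server build accepts is an assumption here.

```python
# Hedged sketch for #9665/#9786: response_format on the completions endpoint,
# with the schema supplied under "schema". Endpoint URL is an assumption.
import json
import requests

payload = {
    "model": "default",
    "prompt": "Describe a cat as JSON.",
    "max_tokens": 128,
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "cat",
            # Per #9786, the schema sits under "schema", not only "json_schema".
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
}
resp = requests.post("http://localhost:30000/v1/completions", json=payload, timeout=60)
print(json.dumps(resp.json(), indent=2))
```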
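
Finally, for #10003, the fix amounts to clamping a chunked request's extension length by the per-chunk and total token budgets simultaneously; the one-function sketch below states that invariant with illustrative names, not SGLang's internals.

```python
# Illustrative clamp for #10003 (names are hypothetical, not SGLang's).
def extend_len(req_remaining: int, rem_chunk_tokens: int, rem_total_tokens: int) -> int:
    # A chunked request may extend only as far as BOTH budgets allow.
    return max(0, min(req_remaining, rem_chunk_tokens, rem_total_tokens))

assert extend_len(500, 256, 128) == 128  # the total budget is the binding limit
```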

New Contributors

  • @Beichen-Ma made their first contribution in #9429
  • @SCDESPERTATE made their first contribution in #7317
  • @CiaranZhou made their first contribution in #9229
  • @jonaslsaa made their first contribution in #9190
  • @ykwd made their first contribution in #8901
  • @ZhengdQin made their first contribution in #8328
  • @lshmouse made their first contribution in #9630
  • @GavinZhu-GMI made their first contribution in #9635
  • @cicirori made their first contribution in #9665
  • @KEVINTUAN12 made their first contribution in #9597
  • @rainj-me made their first contribution in #9495
  • @pabloiyu made their first contribution in #9397
  • @KerwinKai made their first contribution in #9216
  • @mmangkad made their first contribution in #9802
  • @Orchard-DT made their first contribution in #9824
  • @pbkowalski made their first contribution in #9073
  • @LukasBluebaum made their first contribution in #9803
  • @chenxijun1029 made their first contribution in #8118
  • @tc-mb made their first contribution in #8747
  • @alhridoy made their first contribution in #9946
  • @xiaguan made their first contribution in #9927
  • @WangJianQ-0118 made their first contribution in #9895
  • @jingyu-ml made their first contribution in #7912
  • @fangjian601 made their first contribution in #9906
  • @SzymonOzog made their first contribution in #9978
  • @gracehonv made their first contribution in #9314
  • @JamesLim-sy made their first contribution in #9934
  • @DevashishLal-CB made their first contribution in #5255
  • @MahmoudAshraf97 made their first contribution in #8622
  • @sdpkjc made their first contribution in #9884
  • @shadowpa0327 made their first contribution in #6905
  • @jinyangyuan-nvidia made their first contribution in #9834
  • @Oasis-Git made their first contribution in #9741
  • @jianyingzhu made their first contribution in #9969
  • @benbarsdell made their first contribution in #10056
  • @mss1213 made their first contribution in #9971
  • @WeiweiZhang1 made their first contribution in #6226
  • @huaiyuzh made their first contribution in #9434
  • @ssshinigami made their first contribution in #9871
  • @alanhe151220037 made their first contribution in #9925
  • @Zhiy-Zhang made their first contribution in #10166
  • @gerayking made their first contribution in #10191
  • @wenhuipeng made their first contribution in #10165
  • @Missmiaom made their first contribution in #10209
  • @shaharmor98 made their first contribution in #9960
  • @qhsc made their first contribution in #10214
  • @glenliu21 made their first contribution in #10093
  • @LauYeeYu made their first contribution in #10253
  • @sgncho made their first contribution in #9954
  • @BourneSun0527 made their first contribution in #10229
  • @zk-lover made their first contribution in #10303

Full Changelog: v0.5.1...v0.5.2
