sgl-project/sglang
Release v0.5.2

What's Changed

  • feat: allow using a local branch to build the image by @gongwei-130 in #9546
  • [readme] Include additional resources for the SGLang x AMD SF Meetup event by @wisclmy0611 in #9547
  • [doc] deepseekv31 support by @XiaotongJiang in #9544
  • fix(grok): remove duplicate replicate_lm_head configuration by @vincentzed in #9549
  • chore: update configurer by @zhyncs in #9557
  • chore: bump v0.5.1.post1 by @zhyncs in #9558
  • [router] add the correct rustls dependency in sgl-router Cargo.toml by @Bruce-x-1997 in #9498
  • fix: use sgl-kernel 0.3.5 by @zhyncs in #9565
  • Add target module validation for init adapters by @Beichen-Ma in #9429
  • fix: Update OpenAI client base URL in documentation by @JustinTong0323 in #9576 (see the first sketch after this list)
  • [PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats by @SCDESPERTATE in #7317
  • remove redundant rank0_log function. by @miter6 in #9560
  • Update CUTLASS 4.2 & Enable K-Major Scale Factor for SM90 FP8 Blockwise Group GEMM by @HydraQYH in #9559
  • Reintroduce memory usage fix by @fzyzcjy in #9535
  • Offload tensors by sharding on GPU by @fzyzcjy in #9536
  • bugfix for undefined logging functions in HarmonyBrowserTool & HarmonyPythonTool by @CiaranZhou in #9229
  • chore: upgrade flashinfer 0.2.14.post1 by @zhyncs in #9578
  • fix: revert #8593 by @zhyncs in #9581
  • fix: resolve tuning fused moe issue by @zhyncs in #9587
  • Tiny fix wrong comments by @fzyzcjy in #9589
  • chore: update config by @zhyncs in #9591
  • chore: bump v0.5.1.post2 by @zhyncs in #9592
  • [Doc] add LWS(LeaderWorkerSet) use case in sgl-router README by @Bruce-x-1997 in #9568
  • [Performance] Batch Send from Tokenizer Manager. by @sundar24295s in #9436
  • Fix GLM45 tool call multi-turn bug by @byjiang1996 in #9500
  • Fix GLM45v launch server cuda torch compile bug by @byjiang1996 in #9554
  • Fix Harmony reasoning parser and auto-separation for gpt-oss models by @jonaslsaa in #9190
  • [docs] Refactor, remove compiled results and add gpt-oss by @zhaochenyang20 in #9613
  • [Fix] HiCache Bugfix & Mooncake Error Handling Enhance by @ykwd in #8901
  • Improve bench_one_batch_server script by @hnyls2002 in #9608
  • [router] add mistral tool parser by @slin1237 in #9622
  • [router] add qwen tool parser by @slin1237 in #9623
  • [router] add pythonic parser by @slin1237 in #9628
  • [router] add llama tool parser by @slin1237 in #9629
  • [router] add ut for mistral, llama, pythonic, and streaming tool parser by @slin1237 in #9632
  • [new feat] Ascend backend support for fia fusion kernel by @ZhengdQin in #8328
  • model: Support nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 by @netanel-haber in #9301
  • Fix lint for router by @hebiao064 in #9636
  • [docs] Update README with additional highlights and resources for SGLang x AMD SF Meetup by @wisclmy0611 in #9640
  • Add reasoning_effort param in TiktokenTokenizer.apply_chat_template by @lshmouse in #9630
  • fix: allow user to specify function as role by @GavinZhu-GMI in #9635
  • Fix kimi k2 function calling format by @XiaotongJiang in #9606
  • [router] address worker load tracking consistency by @slin1237 in #9523
  • [router] add token bucket rate limiter by @CatherineSue in #9656 (see the token-bucket sketch after this list)
  • [doc] add kimik2 --tool-call-parser by @XiaotongJiang in #9647
  • Install py-spy by default for containers for easier debugging by @fzyzcjy in #9649
  • BugFix(hicache): Fix host indices out of bound error by @hzh0425 in #9637
  • HiCache Storage fix host memory leak by @xiezhq-hermann in #9648
  • add response_format support for the completion API by @cicirori in #9665 (see the response_format sketch after this list)
  • Fix FA3 swa spec verify topk>1 by @ispobock in #9658
  • [RL] fix register the same ops multiple times by @hebiao064 in #9564
  • chore: enhance bench_serving for vlms with a new dataset of configurable image count and resolution by @mickqian in #9583
  • refactor(hicache): Introduce generic HiCacheStorageConfig for improved configuration management by @hzh0425 in #9555
  • feat: (chat-template matching) enhance multimodal model detection with config.json by @KEVINTUAN12 in #9597
  • [docs] Instructions for bench_serving.py by @yhyang201 in #9071
  • Support DeepSeek-V3.1 tool call by @Xu-Wenqing in #9446
  • Add A100 fused MoE kernel configs for DeepSeek by @ehuaa in #9677
  • support cuda 13.0 and trtllm kernel by @rainj-me in #9495
  • fix: HiRadixCache: fix prefetch completion race by @pabloiyu in #9397
  • fix mooncake store mla zero copy meta by @huangtingwei9988 in #9678
  • move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py by @merrymercy in #9679
  • [router] restructure tool parser module folder by @slin1237 in #9693
  • [router] add deepseek tool parser by @slin1237 in #9694
  • Quick fix for loading processor for supporting internvl3_5 series by @yilian49 in #9676
  • Fix get_ip when no external network by @whybeyoung in #9700
  • Sets default model name in request classes by @JustinTong0323 in #9683
  • [router] add step3 tool parser by @slin1237 in #9695
  • [router] add kimi-k2 tool parser by @slin1237 in #9702
  • [router] add gpt-oss and glm4 tool parser by @slin1237 in #9703
  • [sgl-kernel] misc: update deepgemm version for sgl-kernel by @FlamingoPg in #9340
  • chore: upgrade sgl-kernel 0.3.7 by @zhyncs in #9708
  • chore: bump v0.5.1.post3 by @zhyncs in #9716
  • [router] upgrade kernel version in pd ci by @CatherineSue in #9720
  • [Sync] Update mxfp4.py (20250827) by @merrymercy in #9724
  • [router] fix error response in pd_router by @Bruce-x-1997 in #9505
  • [router] Add MCP Tool Handler by @key4ng in #9615
  • gpt-oss blog reproduction document by @hnyls2002 in #9728
  • [router] additional pythonic parser unit test by @slin1237 in #9730
  • [router] additional llama32 parser unit test and multi json support by @slin1237 in #9732
  • support mooncake store dp attention by @huangtingwei9988 in #9684
  • add support for nvidia/gpt-oss-120b-Eagle3 by @zyksir in #9739
  • Move git clone command up from README by @JustinTong0323 in #9740
  • [feat] Reduce GPU memory overhead by using weakref by @yhyang201 in #9673
  • Support speculative decoding in hybrid attention backend by @Qiaolin-Yu in #9573
  • [router] add llama3.2 multi json streaming parser by @slin1237 in #9735
  • Support compile sgl-kernel on cuda 13.0 by @rainj-me in #9721
  • [Sync] Update server_args.py (20250828) by @merrymercy in #9745
  • [router] grpc router bootstraps by @slin1237 in #9759
  • [AMD] Support Hierarchical Caching on AMD GPUs by @hubertlu-tw in #8236
  • feat: add tuned fused moe config for GLM-4.5-Air-FP8 tp = 4 on B200 by @zixuanzhang226 in #9770
  • [Feature] Support NPUGraph for DeepSeek on Ascend NPU by @chenxu140 in #9355
  • feat(draft_model): support draft_model for RemoteModelLoader by @DellCurry in #6407
  • fix: fix MLA for ShardedModelLoader/RemoteModelLoader by @DellCurry in #6287
  • Optimize prefill performance on cpu backend by @mingfeima in #8750
  • [HiCache] change the default policy to write through by @xiezhq-hermann in #9772
  • bugfix(hicache): Move exists check before key suffixing by @hzh0425 in #9749
  • Skip some tests on Blackwell by @hlu1 in #9777
  • Raise error when topk>1 and page>1 for paged attention backends. by @hnyls2002 in #9784
  • ROCm 7.0 update by @sogalin in #9757
  • add bench_mix.py by @pansicheng in #9788
  • Make sm100 fp8 kernels available on sm103 by @hlu1 in #9789
  • Accommodate JSON schema in the "schema" field, not the "json_schema" field, of response_format by @gongwei-130 in #9786 (see the response_format sketch after this list)
  • [PD] Support get_model_info interface for mini_lb by @XucSh in #9792
  • [HiCache] resolve conflict between chunked-prefill and hicache hit count by @xiezhq-hermann in #9776
  • feat(hicache-3fs): 3FS-Store Backup Optimizations For MLA Model. by @hzh0425 in #9692
  • support enable in the reasoning field to enable thinking for thinkin… by @gongwei-130 in #9715
  • feat: Add flexible validation for partial weight updates by @GeLee-Q in #9663
  • feat: add original logprobs to response by @narutolhy in #8375
  • [feat] Support EAGLE3 for Qwen2 by @KerwinKai in #9216
  • chore: upgrade flashinfer 0.3.0rc1 by @zhyncs in #9793
  • [ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization by @pavanimajety in #9712
  • Fix TRTLLM MLA Cuda KV Blocks Causing accuracy drop by @farazkh80 in #9675
  • [NVIDIA] [2/N] Optimize silu_and_mul_scaled_fp4_grouped_quant perf by @kaixih in #9556
  • Adds initialize_moe_config to bench_one_batch so MOE backend is respected by @pranavm-nvidia in #9670
  • Small bug fix in transformers model implementation by @yilian49 in #9809
  • feature(eplb): add min-rebalancing-utilization-threshold for eplb by @hzh0425 in #8345
  • Make fp4_quantize kernels work on sm103 by @hlu1 in #9807
  • fix: dsv3 lite q_lora_rank none by @zhyncs in #9815
  • Fix memory leak when aborting decode request in PD-Disagg by @hnyls2002 in #9817
  • chore: fix cuda driver api issue and bump sgl-kernel 0.3.7.post1 by @zhyncs in #9746
  • chore: update Dockerfile by @zhyncs in #9820
  • Fix typo in warning message about DeepGEMM JIT by @mmangkad in #9802
  • chore: upgrade sgl-kernel 0.3.7.post1 with deepgemm fix by @zhyncs in #9822
  • [sgl-kernel] fix: fix missing FetchContent_Populate for fmt by @FlamingoPg in #9826
  • chore: upgrade transformers 4.56.0 by @zhyncs in #9827
  • [Auto Sync] Update parallel_state.py (20250830) by @merrymercy in #9828
  • [CI] Fix the trigger condition for PR test workflows by @merrymercy in #9761
  • [CI] Code sync tools by @merrymercy in #9830
  • Update guidelines for syncing code between repos by @merrymercy in #9831
  • hot fix for mooncake batch set api by @xiezhq-hermann in #9836
  • [router] add reasoning parser readme by @slin1237 in #9837
  • Tool parser benchmark by @CatherineSue in #9835
  • [Model] Support Meituan LongCat-Flash && LongCat-Flash-MTP by @Orchard-DT in #9824
  • [router] global tool parser registry by @CatherineSue in #9840
  • [feat] Ascend NPU Gemma-3-12b and Gemma-3-27b support by @VDV1985 in #8909
  • [Performance] Improve Qwen RMSNorm by replacing with native RMSNorm op by @vincentzed in #9709
  • [HiCache] Clear kvcache in storage backend with fastAPI by @stmatengss in #9750
  • Fix input logprob index for a batch that includes both requests with input logprob and requests without input logprob. by @merrymercy in #9841
  • Fuse gate_proj and up_proj in Qwen 2.5 VL's vision MLP by @AlienKevin in #9661
  • [HiCache] Storage Refactoring by @xiezhq-hermann in #9797
  • fix set_internal_state API by @hnyls2002 in #9850
  • fix inconsistent arguments for generated shared prefix bench by @pbkowalski in #9073
  • fix(hicache-long-bench): adjust context workload generator to use full query set by @hzh0425 in #9847
  • Disable radix cache in test_lora_update.py for better stability by @Fridge003 in #9852
  • Tiny allow DeepGEMM on cu12.9 by @fzyzcjy in #9858
  • Update docker build workflows for gfx942 ROCm 7.0. by @saienduri in #9794
  • Support Multi Process Tokenizer Manager (#6555) by @whybeyoung in #8964
  • chore: upgrade flashinfer 0.3.0 by @zhyncs in #9864
  • chore: bump v0.5.2rc0 by @zhyncs in #9862
  • Mooncake store get zero copy meta optimization by @huangtingwei9988 in #9857
  • [router] add tokenizer download support from hf hub by @CatherineSue in #9882
  • support fp8 kvcache for hybrid attn backend on GPT-OSS by @rainj-me in #9783
  • [HiCacheStorage] fix abort request host memory leaks by @huangtingwei9988 in #9874
  • [HiCacheStorage]: Improve 3fs kvstore's performance and resolve mla issues by @hzh0425 in #9876
  • [router] Fix short timeout for the prefill client by @LukasBluebaum in #9803
  • [code style] restructure fused_moe to avoid a very long single file by @BBuf in #9878
  • [router] add grpc pd and regular router init by @CatherineSue in #9893
  • [router] fix FunctionCallResponse proto; support null arguments by @Bruce-x-1997 in #9875
  • [feat] Support tp mode for DeepSeek-R1-W4AFP8 by @chenxijun1029 in #8118
  • Move multi-tokenizer event loop to better place by @ShangmingCai in #9902
  • [chore] fix dead links in doc by @lifuhuang in #9913
  • Change tensor alignment method to mn major by @mmangkad in #9844
  • chore: bump v0.3.8 sgl-kernel by @zhyncs in #9907
  • [Fix] fix the issue encountered when running inference with LongCat-Flash/MTP EP MoE on B200 by @Orchard-DT in #9916
  • fix parallel_state.py current_platform bug by @BBuf in #9919
  • [feat] apply deep_gemm compile_mode to skip launch by @Alcanderian in #9879
  • fix: update router deps by @zhyncs in #9921
  • chore: bump v0.5.2rc1 by @zhyncs in #9920
  • [Hicache] Generic page get bugfix by @ykwd in #9909
  • Support the internvl3.5 family models in sglang by @yilian49 in #9705
  • [router] include rust benchmarks by @slin1237 in #9932
  • Fix the key passing issue in page first layout. by @hzh0425 in #9929
  • [router] fix grpc client url normalization and health check by @CatherineSue in #9939
  • [model] support MiniCPM-V 4.0 by @tc-mb in #8747
  • [HiCache] Minor fix on file storage backend by @xiezhq-hermann in #9869
  • Move parsers under a single folder by @merrymercy in #9912
  • [Fix] DeepSeek EP accuracy issue on B200 GPUs by @alhridoy in #9946
  • fix(cache): move ongoing_prefetch pop after validation to prevent leak by @xiaguan in #9927
  • Remove annoying warnings in sgl kernel build by @merrymercy in #9905
  • Update tool_chat_template_deepseekv31.jinja by @WangJianQ-0118 in #9895
  • Qwen FP8/NVFP4 ModelOPT Quantization support by @jingyu-ml in #7912
  • Optimized deepseek-v3/r1 model performance on mxfp4 run by @kkHuang-amd in #9671
  • add proctitle for tokenizers by @hnyls2002 in #9952
  • [feat] Add P/D attention select for draft model by @Ximingwang-09 in #9755
  • Revert "[Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)" by @zhyncs in #9955
  • Revert "Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)" by @zhyncs in #9959
  • [benchmark] add flashinfer_allreduce_fusion benchmark by @BBuf in #9937
  • [1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel) by @yuhyao in #9953
  • [router] Add Rerank API Specification by @fangjian601 in #9906
  • [router] add chat_template_kwargs in ChatCompletionRequest by @tonyluj in #9958
  • Remove mrope position sync by @timmy-feng in #9460
  • fix swa clear(): rename is_in_free_group to is_not_in_free_group by @JustinTong0323 in #9914
  • Triton 3.4.0 MoE config for Deepseek TP16 H100 by @SzymonOzog in #9978
  • nsys profile output kernel classifier by @gracehonv in #9314
  • Minor update regarding issue #9704 by @elfiegg in #9733
  • [Auto Sync] Update parallel_state.py, few_shot_gsm8k.py (20250903) by @merrymercy in #9986
  • feat: add gpt oss b200 ci by @zhyncs in #9988
  • [router] move tokenizer, reasoning, tool initialization to server by @slin1237 in #9996
  • [router] clean up dependency injector to use ctx by @slin1237 in #10000
  • [router] fix grpc connection mode detection by @slin1237 in #9999
  • [Fix] gpt-oss mxfp4 model run failed on ROCm platform by @kkHuang-amd in #9994
  • Fix Llama 4 with MXFP4 dynamic quant on MI35x by @hubertlu-tw in #9993
  • [Bugfix] fix pd chat completion protocol for batching support by @tonyluj in #10016
  • fix: health_generate endpoint in mini_lb by @wxsms in #9997
  • [1/N] DP-refactor: move dp balance code into scheduler's mixin class by @hnyls2002 in #10004
  • Ensure chunked request extension length respects both rem_chunk_tokens and rem_total_tokens limits by @pansicheng in #10003 (see the clamping sketch after this list)
  • feat(hicache): Add generic hicache ci e2e test and benchmark test by @hzh0425 in #9846
  • Optimize Qwen3-moe model by using flashinfer fused allreduce by @yuan-luo in #9973
  • [Doc] Fix SGLang tool parser doc by @PopSoda2002 in #9886
  • metrics: support custom buckets for prompt/generation_tokens_histogram by @acelyc111 in #9634
  • fix 3fs zerocopy by @pansicheng in #9938
  • Save memory for expert model parallel by @ch-wan in #9957
  • [Hicache] Mooncake API Fix & Test, and Improved Readme by @ykwd in #9951
  • Optimized deepseek-v3/r1 model performance on mxfp4 run by @kkHuang-amd in #10008
  • Fix accuracy drop of dsv3 run in dp enablement by @kkHuang-amd in #8677
  • chore: bump v0.5.2rc2 by @zhyncs in #10050
  • fix: update gb200 dep by @zhyncs in #10052
  • Simplify Router arguments passing and build it in docker image by @hnyls2002 in #9964
  • [router] fix release workflow to include protobuf by @CatherineSue in #10055
  • fix MultiTokenizerWrapper name by @LLLL114 in #10049
  • Integrate trtllm ragged attention for prefill self-attention by @elfiegg in #9801
  • [Vulnerability] feat(conn): set bootstrap server host by @jinmingyi1998 in #9931
  • Fix typo in scheduler by @JamesLim-sy in #9934
  • [1/2] Optimizations and refactors about quant kernel by @fzyzcjy in #9534
  • Tiny support setting numa nodes for different ranks by @fzyzcjy in #10006
  • [Fix] Add speculative_draft_model_revision to server_args by @DevashishLal-CB in #5255
  • Forbid DeepEP race condition when there are too many tokens by @fzyzcjy in #9567
  • Support simple evals in text comparator by @fzyzcjy in #8867
  • Fix and enhance dumper by @fzyzcjy in #8725
  • Tiny let DeepGEMM scale checks cover more cases by @fzyzcjy in #7182
  • Support copying tensor from cpu to gpu without using copy engines by @fzyzcjy in #10007
  • [router] add py binding unit tests to coverage 80% by @key4ng in #10043
  • [router] add rust cache for rust unit test by @key4ng in #10079
  • [router] add rust cache by @slin1237 in #10080
  • enable aiter gemm_a8w8_bpreshuffle for ptpc gemm by @Yuechguo in #8555
  • [bugfix]: use correct cache location for cross attention in torch native backend by @MahmoudAshraf97 in #8622
  • Update flashinfer to 0.3.1 for B300 support by @hlu1 in #10087
  • [Bug Fix] Fix Glm4vVisionBlock norm by @sdpkjc in #9884
  • Update wave-lang to 3.7.0 and unify Wave kernel buffer options by @yichiche in #10069
  • Add storage read/write bandwidth logs to monitor kvcache performance by @pansicheng in #9965
  • [Minor] Refactors KV memory pool by @JustinTong0323 in #9842
  • support Llama4 with non-uniform intermediate size across layers for… by @gongwei-130 in #10047
  • [router] move to mcp sdk instead by @slin1237 in #10057
  • [router] Introduce router integration tests by @key4ng in #10086
  • Add lora_path argument to bench_multiturn.py by @Fridge003 in #10092
  • [HiStorage] Remove delete and clear as necessary methods by @xiezhq-hermann in #10039
  • Modify ci workflow for auto-partitioning in 2-GPU backend tests by @hzh0425 in #10029
  • Revert "[1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel) (#9953)" by @zhyncs in #10097
  • Fix RMSNorm API call mismatch issue. by @sogalin in #10032
  • fix double sparsity initialization by @shadowpa0327 in #6905
  • [Fix] illegal sync based on undefined behaviour by @DevashishLal-CB in #9620
  • [7/N] MoE Refactor: the implementation of new framework by @ch-wan in #9269
  • [NVIDIA] Remove unused get_fused_moe_impl_class function by @kaixih in #9764
  • [NVIDIA] disable chunked prefix cache when dp is used by @kaixih in #9861
  • perf: Avoid unnecessary data type conversions for DeepSeek-V3 on Blackwell by @jinyangyuan-nvidia in #9834
  • [Fix] Compatibility between DP attention and pipeline parallelism by @ch-wan in #10100
  • Fix circular import by @ch-wan in #10107
  • Disable kernel cutlass_mla_decode on SM103 by @hlu1 in #10058
  • Remove non-accelerated targets (100 and up) from cmake by @hlu1 in #10041
  • [chore] Remove unused ep_moe cuda kernels by @hlu1 in #9956
  • [CI] Refactor disaggregation tests by @ShangmingCai in #10068
  • increase the rust e2e timeout by @key4ng in #10116
  • [router] Improve the e2e tests by @key4ng in #10102
  • [Auto Sync] Update server_args.py (20250906) by @merrymercy in #10117
  • Optimize moe_sum_reduce_kernel by @yuan-luo in #9477
  • [Feature] LMCache Connector Integration by @Oasis-Git in #9741
  • CUTLASS fp8 blockwise gemm support of sm120 by @jianyingzhu in #9969
  • Optimize nvfp4 block scaled gemm kernel when M is small. by @HydraQYH in #10101
  • Fix cuda graph mode in flashinfer attn backend by @benbarsdell in #10056
  • [HiCache] fix: check clear() method for storage backend by @stmatengss in #10096
  • add dataset_path for bench_one_batch_server.py by @miter6 in #10113
  • [Auto Sync] Update parallel_state.py (20250907) by @merrymercy in #10126
  • [Minor] fix lint in main by @DarkSharpness in #10128
  • [1/2] Refactor multi-tokenizer manager by @hnyls2002 in #10074
  • Fix flashinfer version in sgl-kernel by @merrymercy in #10135
  • [DOC]: some minor updates by @yyihuang in #10134
  • [BUG FIX] add a failure check for when get fails while waiting for a complete block by @mss1213 in #9971
  • [MoE] fix: incorrect weight initialization for cutlass_fused_experts_fp8 by @ch-wan in #10144
  • Enable GLM4.1V server testing and fix video processing by @JustinTong0323 in #10095
  • Fix slow fused add RMSNorm by @fzyzcjy in #10141
  • fix the bug where fp8 topk_config.correction_bias is None by @rainj-me in #10040
  • Qwen2.5-VL eagle3 infer by @Lzhang-hub in #8801
  • Fix run time error in dsv3-fp8 model on mi35x by @kkHuang-amd in #10104
  • Standalone speculative decoding by @Qiaolin-Yu in #10090
  • Add graph runner support with torch compile on CPU by @CaoE in #7843
  • move compile threads to an option to avoid OOM on low memory host by @rainj-me in #10123
  • [1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel, fixed) by @yuhyao in #10108
  • [Bugfix] Retract not releasing enough memory when page size > 1 by @xiezhq-hermann in #9989
  • Add speculator attention backend switch by @cicirori in #9981
  • Fix: (glm4v) Add missing field by @JustinTong0323 in #10147
  • [Bugfix] Qwen3MoE aclrtMemcpy failed with NPUGraph by @iforgetmyname in #10013
  • enable auto-round quantization model by @WeiweiZhang1 in #6226
  • Revert "enable auto-round quantization model (#6226)" by @zhyncs in #10148
  • enable llama3.1-8B on xpu by @huaiyuzh in #9434
  • [Bug fix] Fix Gemma 2 and fix Gemma 3 multimodal with bs > 1 on NPU by @ssshinigami in #9871
  • update xgrammar 0.1.24 and transformers 4.56.1 by @Swipe4057 in #10155
  • [2/N] DP-Refactor: move communicators into tokenizer_communicator_mixin by @hnyls2002 in #10028
  • [Hicache]: Add E2E CI For 3FS-KVStore by @hzh0425 in #10131
  • Monkey patch uvicorn multi worker is_alive timeout by @hnyls2002 in #10159
  • [CI] fix ambiguous argument in testing hybrid attentions. by @hnyls2002 in #10161
  • [1/2] Speed up prefill mla attention by @fzyzcjy in #10156
  • [Bug fix] Fix ascend mla in aclgraph by @alanhe151220037 in #9925
  • perf: Add H20 fp8 fused MoE kernel configs for Qwen3 by @Zhiy-Zhang in #10166
  • [fix] Relax white space rules in EBNFComposer by @LukasBluebaum in #9595
  • Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" by @zhyncs in #10176
  • [Bench] feat: mooncake trace integration by @stmatengss in #9839
  • fix: resolve lint issue by @zhyncs in #10181
  • fix the cutlass moe tests by @rainj-me in #10182
  • gb200: update dockerfile to latest kernel by @ishandhanani in #9522
  • Cleaning codes for speculative attention mode by @Fridge003 in #10149
  • Revert "feat: add fused moe config for Qwen3-30B-A3B on B200" by @rainj-me in #10185
  • [Fix] Orphan process in data parallel by @Capronir in #7995
  • Update link for EAGLE speculative decoding by @gerayking in #10191
  • [CPU] Fix phi4-mm prompt issue in bench_serving by @blzheng in #9900
  • Updated Nvidia Jetson docs by @shahizat in #4422
  • [3/N]DP refactor: Improve dp rank scheduling in PD disaggregation mode. by @hnyls2002 in #10169
  • Support opt model by @wenhuipeng in #10165
  • feat: use sgl-kernel cu129 as default by @zhyncs in #10188
  • [Refactor] Remove Hicache Load & Write threads by @DarkSharpness in #10127
  • Explicitly export CMAKE_BUILD_PARALLEL_LEVEL by @key4ng in #10193
  • [CPU] Add gelu_and_mul kernel in sgl-kernel and add ut by @blzheng in #9300
  • feat: support fa cute in sgl-kernel by @zhyncs in #10205
  • Refactor fused_add_rmsnorm import logic by @ShangmingCai in #10207
  • tool-call(dsv3): Fixed a parse problem when there are multiple function definitions in tool_calls by @Missmiaom in #10209
  • [Auto Sync] Update sampling_batch_info.py (20250909) by @merrymercy in #10212
  • chore: bump v0.3.9 sgl-kernel by @zhyncs in #10208
  • add variable TP Decode > Prefill size support by @shaharmor98 in #9960
  • [Fix] KV-cache eviction mismatch across PP ranks in DeepSeek V3/R1 by @qhsc in #10214
  • chore: upgrade v0.3.9 sgl-kernel by @zhyncs in #10220
  • Revert the changes on NCCL symmetric memory by @merrymercy in #10210
  • Revert "Revert the changes on NCCL symmetric memory" by @merrymercy in #10238
  • [HiCache] feat: add mooncake backend extra config by @stmatengss in #10213
  • Add mamba kernel by @yizhang2077 in #10234
  • [Auto Sync] Update io_struct.py (20250909) by @merrymercy in #10236
  • [Auto Sync] Update collector.py, startup_func_log_and_timer... (20250910) by @merrymercy in #10242
  • Revert "chore: upgrade v0.3.9 sgl-kernel" by @merrymercy in #10245
  • refactor(InternVL): Use gpu to preprocess the input image by @KEVINTUAN12 in #9795
  • make --speculative-draft-model an alias of --speculative-draft-model-path by @merrymercy in #10246
  • [UT for RL] Add UT to cover release/resume memory case for moe model by @ryang-max in #8803
  • [Benchmark] Prefill-only benchmark scripts by @sundar24295s in #10240
  • [doc] add walkthrough for implementing and hosting a simple llama wrapper m… by @glenliu21 in #10093
  • Fix: the default choice is wrong for flashinfer mxfp4 moe precision by @LauYeeYu in #10253
  • Page first direct IO kernel by @huangtingwei9988 in #10060
  • support vlm model spec bench by @Lzhang-hub in #10173
  • Fix assertion typo in tp_worker.py by @sgncho in #9954
  • [Auto Sync] Update io_struct.py (20250910) by @merrymercy in #10262
  • Fix potential flakiness in test_lora_qwen3 by @lifuhuang in #10250
  • [router][ci] Add PD router mmlu test by @key4ng in #10256
  • [1/2] Refactor LoRA to support backend-specific batch preprocessing. by @lifuhuang in #10251
  • [Bugfix] Fix Weightloading for the original nvidia/Deepseek-R1-FP4 checkpoint by @pavanimajety in #9940
  • add dual stream for qwen2_moe by @yizhang2077 in #10252
  • Add tests to AMD CI for MI35x by @hubertlu-tw in #9662
  • pass a_scale from the fp8 quant result instead of hard-coding it to 1.0f by @rainj-me in #10241
  • Feat: support disable tool parser by @JustinTong0323 in #10184
  • [Auto Sync] Update serving_base.py, serving_chat.py, servin... (20250910) by @merrymercy in #10282
  • Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" by @zhyncs in #10292
  • chore: bump sgl-kernel 0.3.9.post1 by @zhyncs in #10294
  • [Feature] Support DeepEP normal & Redundant Experts on NPU by @iforgetmyname in #9881
  • add flash linear attention triton kernel by @yizhang2077 in #10239
  • [chore] Add sgl-router to npu images by @BourneSun0527 in #10229
  • [CPU] fix OOM when mem-fraction is not set by @ZailiWang in #9090
  • [fix CI] Fix logical condition in fused MoE layer for compressed tensor quantization by @BBuf in #10299
  • Revert "Fix flashinfer version in sgl-kernel (#10135)" by @zhyncs in #10310
  • chore: bump sgl-kernel 0.3.9.post2 by @zhyncs in #10311
  • [CI] add pyproject.toml to deepseek w4a8 ci by @HanHan009527 in #10314
  • chore: upgrade v0.3.9.post2 sgl-kernel by @zhyncs in #10297
  • Qwen3-Next support by @yizhang2077 in #10233
  • [Auto Sync] Update parallel_state.py (20250911) by @merrymercy in #10326
  • [Minor] Improve the style of server args by @merrymercy in #10328
  • [bugfix] fix norm type error in qwen3_next model by @cao1zhg in #10322
  • [Qwen3-Next] switch to triton and cache conv states to accelerate MTP from 300 tok/s to 341 tok/s by @hebiao064 in #10335
  • [router] add benchmark for regular router and pd router by @key4ng in #10280
  • add h20 qwen3 next config by @yizhang2077 in #10264
  • [router] Add OpenAI backend support - core function by @key4ng in #10254
  • [router][ci] add gpu process check and free port before start server by @key4ng in #10338
  • add qwen3-next doc by @yizhang2077 in #10327
  • fix: trtllm-gen attention take zero-init workspace by @yyihuang in #10330
  • Fix errors of hicache kernels in sgl-kernel for ROCm by @hubertlu-tw in #10339
  • update GLM nightly test threshold by @zminglei in #10331
  • [LongCat] Optimize zero_experts_compute_triton by changing mask by @zk-lover in #10303
  • add try catch for quant config hf download by @gongwei-130 in #10340
  • chore: bump v0.5.2 by @zhyncs in #10221
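
A few of the changes above read more clearly with a small example; the sketches below are illustrative only, not the shipped code. First, for #9576: pointing the OpenAI Python client at a locally launched SGLang server. Port 30000 and the "EMPTY" API key are conventional local defaults assumed here, not guarantees for every deployment.

```python
# Hedged sketch for #9576: the OpenAI client talking to an SGLang server
# through its OpenAI-compatible endpoint. Port 30000 is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```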
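
For #9656, the router's rate limiter lives in sgl-router's Rust code; the Python sketch below only illustrates the token-bucket technique itself (refill proportional to elapsed time, capped at a burst capacity), not the router's actual implementation.

```python
# Generic token-bucket rate limiter (illustrative; not sgl-router's Rust code).
import threading
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self, n: float = 1.0) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill in proportion to elapsed time, never above capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return True
            return False

limiter = TokenBucket(rate=100.0, capacity=200.0)  # ~100 req/s, bursts up to 200
if not limiter.try_acquire():
    print("429: rate limited")
```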
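
For #9665 and #9786 together, a hedged sketch of a completions request whose response_format carries the JSON schema under the "schema" key; the exact payload shape a given server build accepts is an assumption here.

```python
# Hedged sketch for #9665/#9786: response_format on the completions endpoint,
# with the schema supplied under "schema". Endpoint URL is an assumption.
import json
import requests

payload = {
    "model": "default",
    "prompt": "Describe a cat as JSON.",
    "max_tokens": 128,
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "cat",
            # Per #9786, the schema sits under "schema", not only "json_schema".
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
}
resp = requests.post("http://localhost:30000/v1/completions", json=payload, timeout=60)
print(json.dumps(resp.json(), indent=2))
```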
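
Finally, for #10003, the fix amounts to clamping a chunked request's extension length by the per-chunk and total token budgets simultaneously; the one-function sketch below states that invariant with illustrative names, not SGLang's internals.

```python
# Illustrative clamp for #10003 (names are hypothetical, not SGLang's).
def extend_len(req_remaining: int, rem_chunk_tokens: int, rem_total_tokens: int) -> int:
    # A chunked request may extend only as far as BOTH budgets allow.
    return max(0, min(req_remaining, rem_chunk_tokens, rem_total_tokens))

assert extend_len(500, 256, 128) == 128  # the total budget is the binding limit
```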

New Contributors

  • @Beichen-Ma made their first contribution in #9429
  • @SCDESPERTATE made their first contribution in #7317
  • @CiaranZhou made their first contribution in #9229
  • @jonaslsaa made their first contribution in #9190
  • @ykwd made their first contribution in #8901
  • @ZhengdQin made their first contribution in #8328
  • @lshmouse made their first contribution in #9630
  • @GavinZhu-GMI made their first contribution in #9635
  • @cicirori made their first contribution in #9665
  • @KEVINTUAN12 made their first contribution in #9597
  • @rainj-me made their first contribution in #9495
  • @pabloiyu made their first contribution in #9397
  • @KerwinKai made their first contribution in #9216
  • @mmangkad made their first contribution in #9802
  • @Orchard-DT made their first contribution in #9824
  • @pbkowalski made their first contribution in #9073
  • @LukasBluebaum made their first contribution in #9803
  • @chenxijun1029 made their first contribution in #8118
  • @tc-mb made their first contribution in #8747
  • @alhridoy made their first contribution in #9946
  • @xiaguan made their first contribution in #9927
  • @WangJianQ-0118 made their first contribution in #9895
  • @jingyu-ml made their first contribution in #7912
  • @fangjian601 made their first contribution in #9906
  • @SzymonOzog made their first contribution in #9978
  • @gracehonv made their first contribution in #9314
  • @JamesLim-sy made their first contribution in #9934
  • @DevashishLal-CB made their first contribution in #5255
  • @MahmoudAshraf97 made their first contribution in #8622
  • @sdpkjc made their first contribution in #9884
  • @shadowpa0327 made their first contribution in #6905
  • @jinyangyuan-nvidia made their first contribution in #9834
  • @Oasis-Git made their first contribution in #9741
  • @jianyingzhu made their first contribution in #9969
  • @benbarsdell made their first contribution in #10056
  • @mss1213 made their first contribution in #9971
  • @WeiweiZhang1 made their first contribution in #6226
  • @huaiyuzh made their first contribution in #9434
  • @ssshinigami made their first contribution in #9871
  • @alanhe151220037 made their first contribution in #9925
  • @Zhiy-Zhang made their first contribution in #10166
  • @gerayking made their first contribution in #10191
  • @wenhuipeng made their first contribution in #10165
  • @Missmiaom made their first contribution in #10209
  • @shaharmor98 made their first contribution in #9960
  • @qhsc made their first contribution in #10214
  • @glenliu21 made their first contribution in #10093
  • @LauYeeYu made their first contribution in #10253
  • @sgncho made their first contribution in #9954
  • @BourneSun0527 made their first contribution in #10229
  • @zk-lover made their first contribution in #10303

Full Changelog: v0.5.1...v0.5.2
