## What's Changed
- feat: allow using a local branch to build the image by @gongwei-130 in #9546
- [readme] Include additional resources for the SGLang x AMD SF Meetup event by @wisclmy0611 in #9547
- [doc] deepseekv31 support by @XiaotongJiang in #9544
- fix(grok): remove duplicate replicate_lm_head configuration by @vincentzed in #9549
- chore: update configurer by @zhyncs in #9557
- chore: bump v0.5.1.post1 by @zhyncs in #9558
- [router] add right rustls dependency in sgl-router cargo.toml by @Bruce-x-1997 in #9498
- fix: use sgl-kernel 0.3.5 by @zhyncs in #9565
- Add target module validation for init adapters by @Beichen-Ma in #9429
- fix: Update OpenAI client base URL in documentation by @JustinTong0323 in #9576
- [PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats by @SCDESPERTATE in #7317
- remove redundant rank0_log function. by @miter6 in #9560
- Update CUTLASS 4.2 & Enable K-Major Scale Factor for SM90 FP8 Blockwise Group GEMM by @HydraQYH in #9559
- Reintroduce memory usage fix by @fzyzcjy in #9535
- Offload tensors by sharding on GPU by @fzyzcjy in #9536
- bugfix for undefined logging functions in HarmonyBrowserTool & HarmonyPythonTool by @CiaranZhou in #9229
- chore: upgrade flashinfer 0.2.14.post1 by @zhyncs in #9578
- fix: revert #8593 by @zhyncs in #9581
- fix: resolve tuning fused moe issue by @zhyncs in #9587
- Tiny fix wrong comments by @fzyzcjy in #9589
- chore: update config by @zhyncs in #9591
- chore: bump v0.5.1.post2 by @zhyncs in #9592
- [Doc] add LWS(LeaderWorkerSet) use case in sgl-router README by @Bruce-x-1997 in #9568
- [Performance] Batch Send from Tokenizer Manager. by @sundar24295s in #9436
- Fix GLM45 tool call multi-turn bug by @byjiang1996 in #9500
- Fix GLM45v launch server cuda torch compile bug by @byjiang1996 in #9554
- Fix Harmony reasoning parser and auto-separation for gpt-oss models by @jonaslsaa in #9190
- [docs] Refactor, remove compiled results and add gpt-oss by @zhaochenyang20 in #9613
- [Fix] HiCache Bugfix & Mooncake Error Handling Enhance by @ykwd in #8901
- Improve bench_one_batch_server script by @hnyls2002 in #9608
- [router] add mistral tool parser by @slin1237 in #9622
- [router] add qwen tool parser by @slin1237 in #9623
- [router] add pythonic parser by @slin1237 in #9628
- [router] add llama tool parser by @slin1237 in #9629
- [router] add ut for mistral, llama, pythonic, and streaming tool parser by @slin1237 in #9632
- [new feat] ascend backend support fia fusion kernel by @ZhengdQin in #8328
- model: Support nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 by @netanel-haber in #9301
- Fix lint for router by @hebiao064 in #9636
- [docs] Update README with additional highlights and resources for SGLang x AMD SF Meetup by @wisclmy0611 in #9640
- Add reasoning_effort param in TiktokenTokenizer.apply_chat_template by @lshmouse in #9630
- fix: allow user to specify function as role by @GavinZhu-GMI in #9635
- Fix kimi k2 function calling format by @XiaotongJiang in #9606
- [router] address worker load tracking consistency by @slin1237 in #9523
- [router] add token bucket rate limiter by @CatherineSue in #9656 (a concept sketch follows this list)
- [doc] add kimik2 --tool-call-parser by @XiaotongJiang in #9647
- Install py-spy by default for containers for easier debugging by @fzyzcjy in #9649
- BugFix(hicache): Fix host indices out of bound error by @hzh0425 in #9637
- HiCache Storage fix host memory leak by @xiezhq-hermann in #9648
- add `response_format` support for `completion` API by @cicirori in #9665 (a usage sketch follows this list)
- Fix FA3 swa spec verify topk>1 by @ispobock in #9658
- [RL] fix register the same ops multiple times by @hebiao064 in #9564
- chore: enhance bench_serving for vlms with a new dataset of configurable image count and resolution by @mickqian in #9583
- refactor(hicache): Introduce generic HiCacheStorageConfig for improved configuration management by @hzh0425 in #9555
- feat: (chat-template matching) enhance multimodal model detection with config.json by @KEVINTUAN12 in #9597
- [docs] Instructions for bench_serving.py by @yhyang201 in #9071
- Support DeepSeek-V3.1 tool call by @Xu-Wenqing in #9446 (a request sketch follows this list)
- Add A100 fused MoE kernel configs for Dpsk by @ehuaa in #9677
- support cuda 13.0 and trtllm kernel by @rainj-me in #9495
- fix: HiRadixCache: fix prefetch completion race by @pabloiyu in #9397
- fix mooncake store mla zero copy meta by @huangtingwei9988 in #9678
- move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py by @merrymercy in #9679
- [router] restructure tool parser module folder by @slin1237 in #9693
- [router] add deepseek tool parser by @slin1237 in #9694
- Quick fix for loading processor for supporting internvl3_5 series by @yilian49 in #9676
- Fix get_ip when no external network by @whybeyoung in #9700
- Sets default model name in request classes by @JustinTong0323 in #9683
- [router] add step3 tool parser by @slin1237 in #9695
- [router] add kimi-k2 tool parser by @slin1237 in #9702
- [router] add gpt-oss and glm4 tool parser by @slin1237 in #9703
- [sgl-kernel] misc: update deepgemm version for sgl-kernel by @FlamingoPg in #9340
- chore: upgrade sgl-kernel 0.3.7 by @zhyncs in #9708
- chore: bump v0.5.1.post3 by @zhyncs in #9716
- [router] upgrade kernel version in pd ci by @CatherineSue in #9720
- [Sync] Update mxfp4.py (20250827) by @merrymercy in #9724
- [router] fix error response in pd_router by @Bruce-x-1997 in #9505
- [router] Add MCP Tool Handler by @key4ng in #9615
- gpt-oss blog reproduction document by @hnyls2002 in #9728
- [router] additional pythonic parser unit test by @slin1237 in #9730
- [router] additional llama32 parser unit test and multi json support by @slin1237 in #9732
- support mooncake store dp attention by @huangtingwei9988 in #9684
- add support for nvidia/gpt-oss-120b-Eagle3 by @zyksir in #9739
- Move git clone command up from README by @JustinTong0323 in #9740
- [feat] Reduce GPU memory overhead by using weakref by @yhyang201 in #9673
- Support speculative decoding in hybrid attention backend by @Qiaolin-Yu in #9573
- [router] add llama3.2 multi json streaming parser by @slin1237 in #9735
- Support compile sgl-kernel on cuda 13.0 by @rainj-me in #9721
- [Sync] Update server_args.py (20250828) by @merrymercy in #9745
- [router] grpc router bootstraps by @slin1237 in #9759
- [AMD] Support Hierarchical Caching on AMD GPUs by @hubertlu-tw in #8236
- feat: add tuned fused moe config for GLM-4.5-Air-FP8 tp = 4 on B200 by @zixuanzhang226 in #9770
- [Feature] Support NPUGraph for DeepSeek on Ascend NPU by @chenxu140 in #9355
- feat(draft_model): support draft_model for RemoteModelLoader by @DellCurry in #6407
- fix: fix MLA for ShardedModelLoader/RemoteModelLoader by @DellCurry in #6287
- Optimize prefill performance on cpu backend by @mingfeima in #8750
- [HiCache] change the default policy to write through by @xiezhq-hermann in #9772
- bugfix(hicache): Move exists check before key suffixing by @hzh0425 in #9749
- Skip some tests on Blackwell by @hlu1 in #9777
- Raise error when `topk>1` and `page>1` for paged attention backends. by @hnyls2002 in #9784
- ROCm 7.0 update by @sogalin in #9757
- add bench_mix.py by @pansicheng in #9788
- Make sm100 fp8 kernels available on sm103 by @hlu1 in #9789
- accommodate JSON schema in the "schema" field, not the "json_schema" field, of response_format by @gongwei-130 in #9786 (a payload sketch follows this list)
- [PD] Support get_model_info interface for mini_lb by @XucSh in #9792
- [HiCache] resolve conflict between chunked-prefill and hicache hit count by @xiezhq-hermann in #9776
- feat(hicache-3fs): 3FS-Store Backup Optimizations For MLA Model. by @hzh0425 in #9692
- support `enable` in the reasoning field to enable thinking for thinkin… by @gongwei-130 in #9715
- feat: Add flexible validation for partial weight updates by @GeLee-Q in #9663
- feat: add original logprobs to response by @narutolhy in #8375
- [feat] Support EAGLE3 for Qwen2 by @KerwinKai in #9216
- chore: upgrade flashinfer 0.3.0rc1 by @zhyncs in #9793
- [ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization by @pavanimajety in #9712
- Fix TRTLLM MLA Cuda KV Blocks Causing accuracy drop by @farazkh80 in #9675
- [NVIDIA] [2/N] Optimize `silu_and_mul_scaled_fp4_grouped_quant` perf by @kaixih in #9556
- Adds initialize_moe_config to bench_one_batch so MOE backend is respected by @pranavm-nvidia in #9670
- Small bug fix in transformers model implementation by @yilian49 in #9809
- feature(eplb): add min-rebalancing-utilization-threshold for eplb by @hzh0425 in #8345
- Make fp4_quantize kernels work on sm103 by @hlu1 in #9807
- fix: dsv3 lite q_lora_rank none by @zhyncs in #9815
- Fix memory leak when aborting decode request in PD-Disagg by @hnyls2002 in #9817
- chore: fix cuda driver api issue and bump sgl-kernel 0.3.7.post1 by @zhyncs in #9746
- chore: update Dockerfile by @zhyncs in #9820
- Fix typo in warning message about DeepGEMM JIT by @mmangkad in #9802
- chore: upgrade sgl-kernel 0.3.7.post1 with deepgemm fix by @zhyncs in #9822
- [sgl-kernel] fix: fix missing FetchContent_Populate for fmt by @FlamingoPg in #9826
- chore: upgrade transformers 4.56.0 by @zhyncs in #9827
- [Auto Sync] Update parallel_state.py (20250830) by @merrymercy in #9828
- [CI] Fix the trigger condition for PR test workflows by @merrymercy in #9761
- [CI] Code sync tools by @merrymercy in #9830
- Update guidelines for syncing code between repos by @merrymercy in #9831
- hot fix for mooncake batch set api by @xiezhq-hermann in #9836
- [router] add reasoning parser readme by @slin1237 in #9837
- Tool parser benchmark by @CatherineSue in #9835
- [Model] Support Meituan LongCat-Flash && LongCat-Flash-MTP by @Orchard-DT in #9824
- [router] global tool parser registry by @CatherineSue in #9840
- [feat] Ascend NPU Gemma-3-12b and Gemma-3-27b support by @VDV1985 in #8909
- [Performance] Improve Qwen RMSNorm by replacing with native RMSNorm op by @vincentzed in #9709
- [HiCache] Clear kvcache in storage backend with fastAPI by @stmatengss in #9750
- Fix input logprob index for a batch that includes both requests with input logprob and requests without input logprob. by @merrymercy in #9841
- Fuse gate_proj and up_proj in Qwen 2.5 VL's vision MLP by @AlienKevin in #9661
- [HiCache] Storage Refactoring by @xiezhq-hermann in #9797
- fix `set_interal_state` API by @hnyls2002 in #9850
- fix inconsistent arguments for generated shared prefix bench by @pbkowalski in #9073
- fix(hicache-long-bench): adjust context workload generator to use full query set by @hzh0425 in #9847
- Disable radix cache in test_lora_update.py for better stability by @Fridge003 in #9852
- Tiny allow DeepGEMM on cu12.9 by @fzyzcjy in #9858
- Update docker build workflows for gfx942 ROCm 7.0. by @saienduri in #9794
- Support Multi Process Tokenizer Manager (#6555) by @whybeyoung in #8964
- chore: upgrade flashinfer 0.3.0 by @zhyncs in #9864
- chore: bump v0.5.2rc0 by @zhyncs in #9862
- Mooncake store get zero copy meta optimization by @huangtingwei9988 in #9857
- [router] add tokenizer download support from hf hub by @CatherineSue in #9882
- support fp8 kvcache for hybrid attn backend on GPT-OSS by @rainj-me in #9783
- [HiCacheStorage] fix abort request host memory leaks by @huangtingwei9988 in #9874
- [HiCacheStorage]: Improve 3fs kvstore's performance and resolve mla issues by @hzh0425 in #9876
- [router] Fix short timeout for the prefill client by @LukasBluebaum in #9803
- [code style] restruct fused_moe to avoid very long single file by @BBuf in #9878
- [router] add grpc pd and regular router init by @CatherineSue in #9893
- [router] fix FunctionCallResponse proto, support arguments is null by @Bruce-x-1997 in #9875
- [feat] Support tp mode for DeepSeek-R1-W4AFP8 by @chenxijun1029 in #8118
- Move multi-tokenizer event loop to better place by @ShangmingCai in #9902
- [chore] fix dead links in doc by @lifuhuang in #9913
- Change tensor alignment method to mn major by @mmangkad in #9844
- chore: bump v0.3.8 sgl-kernel by @zhyncs in #9907
- [Fix] fix the issue encountered when inference LongCat-Flash/MTP EP MoE on b200 by @Orchard-DT in #9916
- fix parallel_state.py `current_platform` bug by @BBuf in #9919
- [feat] apply deep_gemm compile_mode to skip launch by @Alcanderian in #9879
- fix: update router deps by @zhyncs in #9921
- chore: bump v0.5.2rc1 by @zhyncs in #9920
- [Hicache] Generic page get bugfix by @ykwd in #9909
- Support the internvl3.5 family models in sglang by @yilian49 in #9705
- [router] include rust benchmarks by @slin1237 in #9932
- Fix the key passing issue in page first layout. by @hzh0425 in #9929
- [router] fix grpc client url normalization and health check by @CatherineSue in #9939
- [model] support MiniCPM-V 4.0 by @tc-mb in #8747
- [HiCache] Minor fix on file storage backend by @xiezhq-hermann in #9869
- Move parsers under a single folder by @merrymercy in #9912
- [Fix] DeepSeek EP accuracy issue on B200 GPUs by @alhridoy in #9946
- fix(cache): move ongoing_prefetch pop after validation to prevent leak by @xiaguan in #9927
- Remove annoying warnings in sgl kernel build by @merrymercy in #9905
- Update tool_chat_template_deepseekv31.jinja by @WangJianQ-0118 in #9895
- Qwen FP8/NVFP4 ModelOPT Quantization support by @jingyu-ml in #7912
- Optimized deepseek-v3/r1 model performance on mxfp4 run by @kkHuang-amd in #9671
- add proctitle for tokenizers by @hnyls2002 in #9952
- [feat] Add P/D attention select for draft model by @Ximingwang-09 in #9755
- Revert "[Fix] DeepSeek EP accuracy issue on B200 GPUs (#9946)" by @zhyncs in #9955
- Revert "Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)" by @zhyncs in #9959
- [benchmark] add flashinfer_allreduce_fusion benchmark by @BBuf in #9937
- [1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel) by @yuhyao in #9953
- [router] Add Rerank API Specification by @fangjian601 in #9906
- [router] add chat_template_kwargs in ChatCompletionRequest by @tonyluj in #9958
- Remove mrope position sync by @timmy-feng in #9460
- fix swa clear(): rename is_in_free_group to is_not_in_free_group by @JustinTong0323 in #9914
- Triton 3.4.0 MoE config for Deepseek TP16 H100 by @SzymonOzog in #9978
- nsys profile output kernel classifier by @gracehonv in #9314
- Minor update regarding issue #9704 by @elfiegg in #9733
- [Auto Sync] Update parallel_state.py, few_shot_gsm8k.py (20250903) by @merrymercy in #9986
- feat: add gpt oss b200 ci by @zhyncs in #9988
- [router] move tokenizer, reasoning, tool initialization to server by @slin1237 in #9996
- [router] clean up dependency injector to use ctx by @slin1237 in #10000
- [router] fix grpc connection mode detection by @slin1237 in #9999
- [Fix] gpt-oss mxfp4 model run failed on ROCm platform by @kkHuang-amd in #9994
- Fix Llama 4 with MXFP4 dynamic quant on MI35x by @hubertlu-tw in #9993
- [Bugfix] fix pd chat completion protocol for batching support by @tonyluj in #10016
- fix: health_generate endpoint in mini_lb by @wxsms in #9997
- [1/N] DP-refactor: move dp balance code into scheduler's mixin class by @hnyls2002 in #10004
- Ensure chunked request extension length respects both rem_chunk_tokens and rem_total_tokens limits by @pansicheng in #10003
- feat(hicache): Add generic hicache ci e2e test and benchmark test by @hzh0425 in #9846
- Optimize Qwen3-moe model by using flashinfer fused allreduce by @yuan-luo in #9973
- [Doc] Fix SGLang tool parser doc by @PopSoda2002 in #9886
- metrics: support custom buckets for prompt/generation_tokens_histogram by @acelyc111 in #9634
- fix 3fs zerocopy by @pansicheng in #9938
- Save memory for expert model parallel by @ch-wan in #9957
- [Hicache] Mooncake API Fix & Test, and Improved Readme by @ykwd in #9951
- Optimized deepseek-v3/r1 model performance on mxfp4 run by @kkHuang-amd in #10008
- Fix accuracy drop of dsv3 run in dp enablement by @kkHuang-amd in #8677
- chore: bump v0.5.2rc2 by @zhyncs in #10050
- fix: update gb200 dep by @zhyncs in #10052
- Simplify `Router` arguments passing and build it in docker image by @hnyls2002 in #9964
- [router] fix release workflow to include protobuf by @CatherineSue in #10055
- fix MultiTokenizerWrapper name by @LLLL114 in #10049
- Integrate trtllm ragged attention for prefill self-attention by @elfiegg in #9801
- [Vulnerability] feat(conn): set bootstrap server host by @jinmingyi1998 in #9931
- Fix typo in scheduler by @JamesLim-sy in #9934
- [1/2] Optimizations and refactors about quant kernel by @fzyzcjy in #9534
- Tiny support setting numa nodes for different ranks by @fzyzcjy in #10006
- [Fix] Add speculative_draft_model_revision to server_args by @DevashishLal-CB in #5255
- Forbid DeepEP racing condition when too many tokens by @fzyzcjy in #9567
- Support simple evals in text comparator by @fzyzcjy in #8867
- Fix and enhance dumper by @fzyzcjy in #8725
- Tiny let DeepGEMM scale checks cover more cases by @fzyzcjy in #7182
- Support copying tensor from cpu to gpu without using copy engines by @fzyzcjy in #10007
- [router] add py binding unit tests to coverage 80% by @key4ng in #10043
- [router] add rust cache for rust unit test by @key4ng in #10079
- [router] add rust cache by @slin1237 in #10080
- enable aiter gemm_a8w8_bpreshuffle for ptpc gemm by @Yuechguo in #8555
- [bugfix]: use correct cache location for cross attention in torch native backend by @MahmoudAshraf97 in #8622
- Update flashinfer to 0.3.1 for B300 support by @hlu1 in #10087
- [Bug Fix] Fix Glm4vVisionBlock norm by @sdpkjc in #9884
- Update wave-lang to 3.7.0 and unify Wave kernel buffer options by @yichiche in #10069
- Add storage read/write bandwidth logs to monitor kvcache performance by @pansicheng in #9965
- [Minor] Refactors KV memory pool by @JustinTong0323 in #9842
- support Llama4 with non-uniform intermediate size across layers for… by @gongwei-130 in #10047
- [router] move to mcp sdk instead by @slin1237 in #10057
- [router] Introduce router integration tests by @key4ng in #10086
- Add lora_path argument to bench_multiturn.py by @Fridge003 in #10092
- [HiStorage] Remove delete and clear as necessary methods by @xiezhq-hermann in #10039
- Modify ci workflow for auto-partitioning in 2-GPU backend tests by @hzh0425 in #10029
- Revert "[1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel) (#9953)" by @zhyncs in #10097
- Fix RMSNorm API call mismatch issue. by @sogalin in #10032
- fix double sparsity initialization by @shadowpa0327 in #6905
- [Fix] illegal sync based on undefined behaviour by @DevashishLal-CB in #9620
- [7/N] MoE Refactor: the implementation of new framework by @ch-wan in #9269
- [NVIDIA] Remove unused `get_fused_moe_impl_class` function by @kaixih in #9764
- [NVIDIA] disable chunked prefix cache when dp is used by @kaixih in #9861
- perf: Avoid unnecessary data type conversions for DeepSeek-V3 on Blackwell by @jinyangyuan-nvidia in #9834
- [Fix] Compatibility between DP attention and pipeline parallelism by @ch-wan in #10100
- Fix circular import by @ch-wan in #10107
- Disable kernel cutlass_mla_decode on SM103 by @hlu1 in #10058
- Remove non-accelerated targets(100 and up) from cmake by @hlu1 in #10041
- [chore] Remove unused ep_moe cuda kernels by @hlu1 in #9956
- [CI] Refactor disaggregation tests by @ShangmingCai in #10068
- increase the rust e2e timeout by @key4ng in #10116
- [router] Improve the e2e tests by @key4ng in #10102
- [Auto Sync] Update server_args.py (20250906) by @merrymercy in #10117
- Optimize moe_sum_reduce_kernel by @yuan-luo in #9477
- [Feature] LMCache Connector Integration by @Oasis-Git in #9741
- CUTLASS fp8 blockwise gemm support of sm120 by @jianyingzhu in #9969
- Optimize nvfp4 block scaled gemm kernel when M is small. by @HydraQYH in #10101
- Fix cuda graph mode in flashinfer attn backend by @benbarsdell in #10056
- [HiCache] fix: check clear() method for storage backend by @stmatengss in #10096
- add dataset_path for bench_one_batch_server.py by @miter6 in #10113
- [Auto Sync] Update parallel_state.py (20250907) by @merrymercy in #10126
- [Minor] fix lint in main by @DarkSharpness in #10128
- [1/2] Refactor multi-tokenizer manager by @hnyls2002 in #10074
- Fix flashinfer version in sgl-kernel by @merrymercy in #10135
- [DOC]: some minor updates by @yyihuang in #10134
- [BUG FIX] add failure check when `get` fails while waiting for a block to complete by @mss1213 in #9971
- [MoE] fix: incorrect weight initialization for cutlass_fused_experts_fp8 by @ch-wan in #10144
- Enables GLM4.1V server testing & fix video processing by @JustinTong0323 in #10095
- Fix slow fused add RMSNorm by @fzyzcjy in #10141
- fix the fp8 topk_config.correction_bias is none bug by @rainj-me in #10040
- Qwen2.5-VL eagle3 infer by @Lzhang-hub in #8801
- Fix run time error in dsv3-fp8 model on mi35x by @kkHuang-amd in #10104
- Standalone speculative decoding by @Qiaolin-Yu in #10090
- Add graph runner support with torch compile on CPU by @CaoE in #7843
- move compile threads to an option to avoid OOM on low memory host by @rainj-me in #10123
- [1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel, fixed) by @yuhyao in #10108
- [Bugfix] Retract not releasing enough memory when page size > 1 by @xiezhq-hermann in #9989
- Add speculator attention backend switch by @cicirori in #9981
- Fix: (glm4v) Add missing field by @JustinTong0323 in #10147
- [Bugfix] Qwen3MoE aclrtMemcpy failed with NPUGraph by @iforgetmyname in #10013
- enable auto-round quantization model by @WeiweiZhang1 in #6226
- Revert "enable auto-round quantization model (#6226)" by @zhyncs in #10148
- enable llama3.1-8B on xpu by @huaiyuzh in #9434
- [Bug fix] Fix Gemma 2 and fix Gemma 3 multimodal with bs > 1 on NPU by @ssshinigami in #9871
- update xgrammar 0.1.24 and transformers 4.56.1 by @Swipe4057 in #10155
- [2/N] DP-Refactor: move communicators into `tokenizer_communicator_mixin` by @hnyls2002 in #10028
- [Hicache]: Add E2E CI For 3FS-KVStore by @hzh0425 in #10131
- Monkey patch uvicorn multi worker `is_alive` timeout by @hnyls2002 in #10159
- [CI] fix ambiguous argument in testing hybrid attentions. by @hnyls2002 in #10161
- [1/2] Speed up prefill mla attention by @fzyzcjy in #10156
- [Bug fix] Fix ascend mla in aclgraph by @alanhe151220037 in #9925
- perf: Add H20 fp8 fused MoE kernel configs for Qwen3 by @Zhiy-Zhang in #10166
- [fix] Relax white space rules in EBNFComposer by @LukasBluebaum in #9595
- Revert "[ModelOpt] Fix Weight Loading for DSR1-FP4 Quantization (#9712)" by @zhyncs in #10176
- [Bench] feat: mooncake trace integration by @stmatengss in #9839
- fix: resolve lint issue by @zhyncs in #10181
- fix the cutlass moe tests by @rainj-me in #10182
- gb200: update dockerfile to latest kernel by @ishandhanani in #9522
- Clean up code for speculative attention mode by @Fridge003 in #10149
- Revert "feat: add fused moe config for Qwen3-30B-A3B on B200" by @rainj-me in #10185
- [Fix] Orphan process in data parallel by @Capronir in #7995
- Update link for EAGLE speculative decoding by @gerayking in #10191
- [CPU] Fix phi4-mm prompt issue in bench_serving by @blzheng in #9900
- Updated Nvidia Jetson docs by @shahizat in #4422
- [3/N]DP refactor: Improve dp rank scheduling in PD disaggregation mode. by @hnyls2002 in #10169
- Support opt model by @wenhuipeng in #10165
- feat: use sgl-kernel cu129 as default by @zhyncs in #10188
- [Refactor] Remove Hicache Load & Write threads by @DarkSharpness in #10127
- Explicitly export CMAKE_BUILD_PARALLEL_LEVEL by @key4ng in #10193
- [CPU] Add gelu_and_mul kernel in sgl-kernel and add ut by @blzheng in #9300
- feat: support fa cute in sgl-kernel by @zhyncs in #10205
- Refactor fused_add_rmsnorm import logic by @ShangmingCai in #10207
- tool-call(dsv3): Fixed a parse problem when there are multiple function definitions in tool_calls by @Missmiaom in #10209
- [Auto Sync] Update sampling_batch_info.py (20250909) by @merrymercy in #10212
- chore: bump v0.3.9 sgl-kernel by @zhyncs in #10208
- add variable TP Decode > Prefill size support by @shaharmor98 in #9960
- [Fix] KV-cache eviction mismatch across PP ranks in DeepSeek V3/R1 by @qhsc in #10214
- chore: upgrade v0.3.9 sgl-kernel by @zhyncs in #10220
- Revert the changes on NCCL symmetric memory by @merrymercy in #10210
- Revert "Revert the changes on NCCL symmetric memory" by @merrymercy in #10238
- [HiCache] feat: add mooncake backend extra config by @stmatengss in #10213
- Add mamba kernel by @yizhang2077 in #10234
- [Auto Sync] Update io_struct.py (20250909) by @merrymercy in #10236
- [Auto Sync] Update collector.py, startup_func_log_and_timer... (20250910) by @merrymercy in #10242
- Revert "chore: upgrade v0.3.9 sgl-kernel" by @merrymercy in #10245
- refactor(InternVL): Use gpu to preprocess the input image by @KEVINTUAN12 in #9795
- make --speculative-draft-model an alias of --speculative-draft-model-path by @merrymercy in #10246
- [UT for RL] Add UT to cover release/resume memory case for moe model by @ryang-max in #8803
- [Benchmark] Prefil-only benchmark scripts by @sundar24295s in #10240
- [doc] add walkthrough for implementing and hosting a simple llama wrapper m… by @glenliu21 in #10093
- Fix: the default choice is wrong for flashinfer mxfp4 moe precision by @LauYeeYu in #10253
- Page first direct IO kernel by @huangtingwei9988 in #10060
- support vlm model spec bench by @Lzhang-hub in #10173
- Fix assertion typo in tp_worker.py by @sgncho in #9954
- [Auto Sync] Update io_struct.py (20250910) by @merrymercy in #10262
- Fix potential flakiness in test_lora_qwen3 by @lifuhuang in #10250
- [router][ci] Add PD router mmlu test by @key4ng in #10256
- [1/2] Refactor LoRA to support backend-specific batch preprocessing. by @lifuhuang in #10251
- [Bugfix] Fix Weightloading for the original nvidia/Deepseek-R1-FP4 checkpoint by @pavanimajety in #9940
- add dual stream for qwen2_moe by @yizhang2077 in #10252
- Add tests to AMD CI for MI35x by @hubertlu-tw in #9662
- pass a_scale from fp8 quant result instead of hard code to 1.0f by @rainj-me in #10241
- Feat: support disabling the tool parser by @JustinTong0323 in #10184
- [Auto Sync] Update serving_base.py, serving_chat.py, servin... (20250910) by @merrymercy in #10282
- Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" by @zhyncs in #10292
- chore: bump sgl-kernel 0.3.9.post1 by @zhyncs in #10294
- [Feature] Support DeepEP normal & Redundant Experts on NPU by @iforgetmyname in #9881
- add flash linear attention triton kernel by @yizhang2077 in #10239
- [chore]Add sgl-router to npu images by @BourneSun0527 in #10229
- [CPU] fix OOM when mem-fraction is not set by @ZailiWang in #9090
- [fix CI] Fix logical condition in fused MoE layer for compressed tensor quantization by @BBuf in #10299
- Revert "Fix flashinfer version in sgl-kernel (#10135)" by @zhyncs in #10310
- chore: bump sgl-kernel 0.3.9.post2 by @zhyncs in #10311
- [CI] add pyproject.toml to deepseek w4a8 ci by @HanHan009527 in #10314
- chore: upgrade v0.3.9.post2 sgl-kernel by @zhyncs in #10297
- Qwen3-Next support by @yizhang2077 in #10233
- [Auto Sync] Update parallel_state.py (20250911) by @merrymercy in #10326
- [Minor] Improve the style of server args by @merrymercy in #10328
- [bugfix] fix norm type error in qwen3_next model by @cao1zhg in #10322
- [Qwen3-Next] switch to triton and cache conv states to accelerate MTP from 300 tok/s to 341 tok/s by @hebiao064 in #10335
- [router] add benchmark for regular router and pd router by @key4ng in #10280
- add h20 qwen3 next config by @yizhang2077 in #10264
- [router] Add OpenAI backend support - core function by @key4ng in #10254
- [router][ci] add gpu process check and free port before start server by @key4ng in #10338
- add qwen3-next doc by @yizhang2077 in #10327
- fix: trtllm-gen attention take zero-init workspace by @yyihuang in #10330
- Fix errors of hicache kernels in sgl-kernel for ROCm by @hubertlu-tw in #10339
- update GLM nightly test threshold by @zminglei in #10331
- [LongCat] Optimize zero_experts_compute_triton by changing mask by @zk-lover in #10303
- add try catch for quant config hf download by @gongwei-130 in #10340
- chore: bump v0.5.2 by @zhyncs in #10221
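
#9656 adds a token bucket rate limiter to sgl-router. The router's implementation is in Rust; the sketch below is only a minimal Python illustration of the token-bucket idea (continuous refill, bounded burst), and every name in it is invented for illustration, not taken from the router's code:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # steady-state requests per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller should reject or queue the request

# Example: allow ~100 req/s on average with bursts of up to 200.
limiter = TokenBucket(rate=100.0, capacity=200.0)
if not limiter.try_acquire():
    print("429 Too Many Requests")
```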
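For the `response_format` support on the completion API (#9665), a minimal usage sketch, assuming an SGLang server with the OpenAI-compatible API at localhost:30000; the URL, port, and model name are placeholders:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.completions.create(
    model="default",
    prompt="Describe a cat as JSON:",
    max_tokens=128,
    # response_format is not a parameter of the OpenAI completions client,
    # so it is forwarded to the server via extra_body.
    extra_body={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "cat",
                "schema": {
                    "type": "object",
                    "properties": {"name": {"type": "string"}},
                    "required": ["name"],
                },
            },
        }
    },
)
print(response.choices[0].text)
```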
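For the DeepSeek-V3.1 tool-call support (#9446), a hedged request sketch using the standard OpenAI `tools` field; it assumes a server launched with the matching `--tool-call-parser`, and the URL, model name, and tool definition are placeholders:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not part of the PR
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# When the parser recognizes a tool call, it is surfaced here instead of text.
print(resp.choices[0].message.tool_calls)
```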
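And for #9786, a sketch of the two response_format payload shapes the title refers to, as understood from the PR title alone: the OpenAI-spec layout nests the schema under `json_schema`, while some clients send it in a top-level `schema` field, which the server now also accepts:

```python
# OpenAI-spec shape: the JSON schema nests under "json_schema".
spec_shape = {
    "type": "json_schema",
    "json_schema": {
        "name": "cat",
        "schema": {"type": "object", "properties": {"name": {"type": "string"}}},
    },
}

# Alternate shape accommodated by #9786: the schema sits in a top-level
# "schema" field of response_format.
client_shape = {
    "type": "json_schema",
    "schema": {"type": "object", "properties": {"name": {"type": "string"}}},
}
```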
## New Contributors
- @Beichen-Ma made their first contribution in #9429
- @SCDESPERTATE made their first contribution in #7317
- @CiaranZhou made their first contribution in #9229
- @jonaslsaa made their first contribution in #9190
- @ykwd made their first contribution in #8901
- @ZhengdQin made their first contribution in #8328
- @lshmouse made their first contribution in #9630
- @GavinZhu-GMI made their first contribution in #9635
- @cicirori made their first contribution in #9665
- @KEVINTUAN12 made their first contribution in #9597
- @rainj-me made their first contribution in #9495
- @pabloiyu made their first contribution in #9397
- @KerwinKai made their first contribution in #9216
- @mmangkad made their first contribution in #9802
- @Orchard-DT made their first contribution in #9824
- @pbkowalski made their first contribution in #9073
- @LukasBluebaum made their first contribution in #9803
- @chenxijun1029 made their first contribution in #8118
- @tc-mb made their first contribution in #8747
- @alhridoy made their first contribution in #9946
- @xiaguan made their first contribution in #9927
- @WangJianQ-0118 made their first contribution in #9895
- @jingyu-ml made their first contribution in #7912
- @fangjian601 made their first contribution in #9906
- @SzymonOzog made their first contribution in #9978
- @gracehonv made their first contribution in #9314
- @JamesLim-sy made their first contribution in #9934
- @DevashishLal-CB made their first contribution in #5255
- @MahmoudAshraf97 made their first contribution in #8622
- @sdpkjc made their first contribution in #9884
- @shadowpa0327 made their first contribution in #6905
- @jinyangyuan-nvidia made their first contribution in #9834
- @Oasis-Git made their first contribution in #9741
- @jianyingzhu made their first contribution in #9969
- @benbarsdell made their first contribution in #10056
- @mss1213 made their first contribution in #9971
- @WeiweiZhang1 made their first contribution in #6226
- @huaiyuzh made their first contribution in #9434
- @ssshinigami made their first contribution in #9871
- @alanhe151220037 made their first contribution in #9925
- @Zhiy-Zhang made their first contribution in #10166
- @gerayking made their first contribution in #10191
- @wenhuipeng made their first contribution in #10165
- @Missmiaom made their first contribution in #10209
- @shaharmor98 made their first contribution in #9960
- @qhsc made their first contribution in #10214
- @glenliu21 made their first contribution in #10093
- @LauYeeYu made their first contribution in #10253
- @sgncho made their first contribution in #9954
- @BourneSun0527 made their first contribution in #10229
- @zk-lover made their first contribution in #10303
**Full Changelog**: v0.5.1...v0.5.2