## What's Changed
- [PD] Use batch transfer for rdma transport and add notes for mnnvl usage by @ShangmingCai in #8595
- [bugfix] Qwen-1M context support [2/3]: use the current CUDA stream in the DCA kernel by @sighingnow in #8611
- Fix hf3fs_fuse import error by @ispobock in #8623
- Update step3v default config by @ispobock in #8626
- [ci] fix genai-bench execution cmd by @slin1237 in #8629
- [router] update router pypi version by @slin1237 in #8628
- [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x by @b8zhong in #8577
- Fix typos in py_test/test_launch_server.py by @windsonsea in #6227
- misc: Remove debug print to logger.info by @CatherineSue in #8633
- SGLang HiCache NIXL Connector by @vvenkates27 in #8488
- [bug] remove pdlb from minilb since it's no longer available by @slin1237 in #8634
- [bugfix] Fix flashinfer cutlass EP moe after MoE refactor by @trevor-m in #8630
- Conditionally import HiCacheHF3FS by @pansicheng in #8598
- TRTLLM Gen MLA Decode Kernel Integration (same as #7938) by @farazkh80 in #8632
- Fix nan value generated after custom all reduce by @kkHuang-amd in #8532
- Revert "Fix nan value generated after custom all reduce (#8532)" by @zhyncs in #8642
- Feature/modelscope model download by @yrk111222 in #8083
- chore: speed up NPU CI with caching by @pkking in #8270
- [Bugfix] fix w8a8_int8 load issue by @iforgetmyname in #8308
- [bugfix] fix router python parser for pd urls by @slin1237 in #8644
- [router] add basic usage doc by @slin1237 in #8640
- [router] upgrade router version to 0.1.8 by @slin1237 in #8645
- [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE by @kaixih in #8450
- HiCache, fixing hash value indexing by @xiezhq-hermann in #8636
- Interface change for kvcache io to support page first layout by @xiezhq-hermann in #8318
- Update batch size limitation of dsv3_router_gemm kernel to 16 by @Fridge003 in #8051
- chore: bump v0.4.10.post1 by @ispobock in #8652
- Add hf3fs_utils.cpp to package-data by @pansicheng in #8653
- Fix chat template handling for OpenAI serving by @JustinTong0323 in #8635
- Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… by @byjiang1996 in #8511
- [5/N] MoE Refactor: Update MoE parallelism arguments by @ch-wan in #8658
- Increase tolerance to address CI failures by @lifuhuang in #8643
- [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 by @panpan0000 in #8013
- [Doc] fix: Update README for cu126 sgl-kernel compile problem by @Hongbosherlock in #8665
- fix per token cuda kernel for hidden dim not divisible by 16 by @hebiao064 in #8543
- fix arg typo for --disaggregation-transfer-backend by @ZacWang in #8664
- [fix] fix pd disagg error of vlms by @ccw1996 in #8094
- Disable tp for shared experts under expert parallelism for GLM4.5 model by @zminglei in #8647
- [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla by @trevor-m in #8685
- [bug] limit bootstrap room to [0, 2^63 - 1] by @slin1237 in #8684
- Update CODEOWNERS by @merrymercy in #8686
- Fix deepgemm masked grouped gemm jit compile by @ispobock in #8679
- Fix FP8 block quantization when N or K is not multiples of 128 by @yanbing-j in #8648
- bugfix(hicache): Fix 'MooncakeStore' not defined error. by @hzh0425 in #8668
- upgrade xgrammar 0.1.22 by @Swipe4057 in #8522
- [bugfix] Add 'disaggregation_mode' parameter to warmup function when compile deep_gemm manually by @lbh2001 in #8618
- Add support for NCCL symmetric memory for TP allreduces by @nvcastet in #8238
- [1/2] sgl-kernel: Fuse routed scaling factor into select_experts by @trevor-m in #8364
- chore(gb200): update dockerfile to handle fp4 disaggregation by @ishandhanani in #8694
- [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 by @trevor-m in #8688
- Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled by @GaoYusong in #7434
- model: adapt mllama4 to VisionAttention by @wenchen76 in #8512
- Add tensor.detach() back to update weight util by @hebiao064 in #8691
- [Doc] Polish sgl-kernel readme for cu126 build error by @FlamingoPg in #8704
- Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" by @hnyls2002 in #8706
- [router] minor code cleanup and refactoring by @slin1237 in #8711
- [Bug] fix green context's incompatibility with `cuda < 12.4` by @hnyls2002 in #8701
- chore: bump sgl-kernel v0.2.9 by @zhyncs in #8713
- Remove assertions about per group quant fp8 by @fzyzcjy in #8717
- [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 by @merrymercy in #8693
- Fix triton moe error caused by TopK refactor by @fzyzcjy in #8705
- [router] Implement HTTP Dependency Injection Pattern for Router System by @slin1237 in #8714
- [Feature] Radix Tree in C++ by @DarkSharpness in #7369
- [Perf] Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm by @HydraQYH in #8722
- Fix fused MoE when `routed_scaling_factor is None` by @hnyls2002 in #8709
- Tiny fix CI pytest error by @fzyzcjy in #8524
- [hotfix] fix mixtral with tensor-level compressed-tensor quantization by @ch-wan in #8721
- Support limiting max loaded loras in CPU. by @lifuhuang in #8650
- Reduce memory accumulation in long-running server by @Edenzzzz in #8306
- HiCache storage, style change and bug fix by @xiezhq-hermann in #8719
- [feat] support minimum token load balance in dp attention by @WANG-GH in #7379
- Do layernorm before allgather for DP attention by @trevor-m in #8631
- [fix] Fix divide by zero error for llama4. by @shenoyvvarun in #8683
- feat: Add new moe triton for NVIDIA RTX 6000 Ada by @17Reset in #8547
- [Improvements] Merge health check route by @whybeyoung in #8444
- chore: bump sgl-kernel 0.3.0 with torch 2.8.0 by @zhyncs in #8718
- Save cuda graph memory for fa3 by @ch-wan in #8567
- [CUDA Graph] save cuda graph memory by using next_token_logits_buffer by @ch-wan in #8579
- [DP] fix the compatibility issue between DP attention and `--attention-backend triton` by @ch-wan in #8723
- chore: bump v0.4.10.post2 by @zhyncs in #8727
- feat: Support DP Attention for step3_vl by @yhyang201 in #8699
- [RL] fix update weight for FusedMoE with EP by @zhuzilin in #8676
- use fp32 for e_score_correction_bias in GLM-4.5 by @zRzRzRzRzRzRzR in #8729
- Fix triton kernels topk with keyword arguments by @ispobock in #8732
- feat: support cutlass_moe_fp8 kernel for fusedmoe in sm90 by @TianQiLin666666 in #8678
- Fix the missing 'lof' choice of --schedule-policy server args by @acelyc111 in #7114
- fix args typo in memory_pool_host by @huangtingwei9988 in #8662
- [CI] Do not trigger pd-disaggregation CI in draft PR by @hnyls2002 in #8737
- [MoE] Enable `renormalize=False` in Triton kernels by @ch-wan in #8735
- Replace torch.jit.script with torch.compile in get_masked_input_and_mask to fix benchmark underreporting by @YyWangCS in #8733
- Fix bug of refactoring TopKOutput in w4afp8 by @yuan-luo in #8745
- Rename lora_path to lora_id in batches by @Fridge003 in #8437
- [sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count by @BBuf in #8738
- [CI] Ascend NPU CI enhancement by @iforgetmyname in #8294
- [bugfix] fix import path in HiCacheController by @lbh2001 in #8749
- [NVIDIA] Add Low Latency NVFP4 decode kernels from Flashinfer by @azhurkevich in #8552
- [router] introduce dp worker abstraction by @slin1237 in #8639
- [bugfix] Fix typo in modelopt quant: 'FusedMoE' object has no attribute 'local_num_experts' by @trevor-m in #8768
- Integrate triton_kernels in sgl-kernel by @Qiaolin-Yu in #8762
- chore: bump sgl-kernel v0.3.1 by @zhyncs in #8771
- [NVIDIA] Fix breakage of using trtllm-gen fp8 moe by @kaixih in #8773
- [Fix] Fix several issues preventing gemma3n LoRA support. by @lifuhuang in #8776
- Support OCP MXFP4 quantization on AMD GPUs by @kkHuang-amd in #8255
- [CPU][sgl-kernel] biased_grouped_topk: fix correction_bias dtype to float32 by @chunyuan-w in #8212
- [PD] Refactor parallel sizes and add pp support for mooncake by @ShangmingCai in #8571
- [pd-router] Add Configurable Retry Logic for reduce backend pressure by @slin1237 in #8744
- chore: upgrade flashinfer v0.2.9 by @zhyncs in #8780
- [NVIDIA] Fix local_num_experts for EP by @wenscarl in #8779
- [feat] Add detail in image_data by @yuhyao in #8596
- Revert "[NVIDIA]Fix local_num_experts for EP (#8779)" by @zhyncs in #8797
- feat: support sgl-kernel cu129 by @zhyncs in #8800
- chore: bump sgl-kernel v0.3.2 by @zhyncs in #8802
- feat: add trtllm-gen mha from direct call by @yyihuang in #8782
- GLM-4.5 and GLM-4.5-Air both support by @zRzRzRzRzRzRzR in #8804
- fix: update cmake by @zhyncs in #8817
- chore: upgrade transformers 4.55.0 by @zhyncs in #8823
- chore: upgrade flashinfer 0.2.10 by @zhyncs in #8827
- Fix potential memory fault issue and ncclSystemError in CI test by @kkHuang-amd in #8681
- feat: use py312 by @zhyncs in #8832
- fix: remove unused import by @zhyncs in #8809
- Add initial support for gpt-oss by @Ying1123 in #8824
- chore: upgrade torch 2.8.0 by @zhyncs in #8836
- [router] complete router oai spec by @slin1237 in #8828
- Turn off hybrid cache by default by @ispobock in #8839
- Support bailing moe by @ppraneth in #8680
- [Feature] improve TBO: two chunk overlap by @House-West in #8144
- [router] PD Router Simplification and Reorganization by @slin1237 in #8838
- [1/3] Optimize Slime Update Weights: Remove QWen3MOE Load Weight Overhead by @hebiao064 in #8751
- [2/3] Optimize Slime Update Weights: Avoid GPU-to-CPU Device Sync when update expert weights by @hebiao064 in #8753
- Support mxfp4 for GPT-OSS by @Ying1123 in #8843
- Add unit test for triton swa kernel by @ispobock in #8853
- fix: resolve ci issue by @zhyncs in #8859
- fix benchmark fp8 blockwise group gemm by @yuan-luo in #8815
- Refine naming by @ispobock in #8868
- Optimize triton swa kernel by skipping computation by @ispobock in #8860
- Support B200 in CI by @fzyzcjy in #8861
- chore: update Dockerfile by @mickqian in #8872
- [NVIDIA] Fix num_experts in modelopt_quant by @wenscarl in #8811
- [CI] fix pip upgrade by @ch-wan in #8881
- chore: use torch 2.8 stable by @zhyncs in #8880
- Support v1/responses and use harmony in serving_chat by @CatherineSue in #8837
- Use reduce scatter for DP by @trevor-m in #8539
- add flashinfer mxfp4 by @BBuf in #8847
- fix glm4 moe by @ch-wan in #8883
- feat: openai oss attention sink support with trtllm-gen backend #8825 by @yyihuang in #8834
- Support GPU pinning for LoRA by @lifuhuang in #8697
- Enables force reasoning based on chat template for Qwen3-Thinking by @JustinTong0323 in #8369
- [AMD] Pull latest SGLang version for AMD CI by @michael-amd in #8787
- [Feature][Multimodal] Implement LRU cache for multimodal embeddings by @ZhengWG in #8292
- [router] fix req handling order, improve serialization, remove retry by @slin1237 in #8888
- [Feat] Qwen-1M context support [2/2]: Update block sparse attention backend by @FlamingoPg in #5949
- [CPU] Fix fallback allgather issue by @blzheng in #8041
- Disable gemma3 for SWA due to CUDA illegal memory access error by @JustinTong0323 in #8895
- [Perf] Auto enable best flashinfer mxfp4 kernel in b200 by @BBuf in #8898
- Fix sgl-kernel arch and missing package in CI by @fzyzcjy in #8869
- refactor(sgl-router): Replace `once_cell` with `LazyLock` in worker.rs and remove once_cell dependency from Cargo.toml by @htiennv in #8698
- [router] re-enable pd router benchmark CI by @slin1237 in #8912
- [router] update pd router ci summary step with new threshold by @slin1237 in #8916
- [router] upgrade router version to 0.1.9 by @slin1237 in #8844
- Fix hopper launch gpt-oss model illegal memory by @BBuf in #8908
- fix: use openai 1.99.1 by @zhyncs in #8927
- codeowner updates for modelopt related files by @Edwardf0t1 in #8925
- chore: support blackwell cu129 image by @zhyncs in #8928
- docs: update README by @zhyncs in #8929
- remove vllm fp8quant from fp8.py by @hebiao064 in #8937
- fix: reasoning parser when request have enable_thinking flag by @JustinTong0323 in #8933
- correct the tp_plan logic by @hebiao064 in #8850
- [router] dedicated prefill HTTP client and request-path optimizations by @slin1237 in #8923
- Enhancements for bench_one_batch by @ZailiWang in #8703
- refactor: Move scalar_types.py to sgl-kernel to avoid circular import by @Hongbosherlock in #8720
- Fix enable flashinfer mxfp4 moe bf16 check by @BBuf in #8950
- Reduce scheduler recv requests overhead by @fzyzcjy in #8947
- Better optimization log for gpt-oss model by @BBuf in #8953
- minor: global workspace buffer for trtllm-gen mha from flashinfer by @yyihuang in #8952
- bench: add attention sink op benchmark, triton and trtllm-gen [B200] by @yyihuang in #8932
- Fix typos and unify size(s)/stride(s) API calls by @triple-Mu in #8799
- Expert Parallelism for GPT-OSS by @ch-wan in #8944
- Add ernie4.py for ERNIE-4.5 by @solrex in #7657
- [NVIDIA] Fix missing `get_col_major_tma_aligned_tensor` for Blackwell deepgemm in EpMoE by @kaixih in #8955
- chore: bump sgl-kernel v0.3.3 by @zhyncs in #8957
- add zai-org/GLM-4.5-Air-FP8 model into nightly CI by @zminglei in #8894
- Support Multi Process Tokenizer Manager by @whybeyoung in #6555
- Simple prefetch policy by @pansicheng in #8692
- chore: update flashinfer by @zhyncs in #8958
- Revert "Support Multi Process Tokenizer Manager" by @merrymercy in #8960
- [RL] fix skip_server_warmup and rl health_generate logic by @zhuzilin in #8757
- chore: bump v0.5.0rc0 by @zhyncs in #8959
- [router] router circuit breaker core by @slin1237 in #8941
- refine aiter_backend for mtp by @valarLip in #7279
- [router] harden retries + metrics; fix streaming load; header filtering by @slin1237 in #8972
- Fix kimi k2 function call format by @merrymercy in #8968
- [router] add metrics for worker and policy by @tonyluj in #8971
- chore(gb200): update to CUDA 12.9 and improve build process by @ishandhanani in #8772
- chore(ci): update Python version from 3.9 to 3.10 in sgl-kernel workflow by @ishandhanani in #8981
- [router] reduce radix tree contention, fix radix tree double-count race by @slin1237 in #8978
- [router] fix radix tree integration issues in PD router by @slin1237 in #8982
- Update qwen3_coder_detector.py for streaming by @maocheng23 in #8371
- [bug fix] Ensure local token and global token buffers are pointing to different storage by @elfiegg in #8785
- Create cancel-all-pr-test-runs by @merrymercy in #8986
- [Fix] Add a workflow to cancel all pending CI runs by @merrymercy in #8988
- Minor Optimizations in Schedule Batch by @merrymercy in #8724
- [1/2][resubmit] sgl-kernel: Fuse routed scaling factor into moe_fused_gate (select_experts) by @trevor-m in #8770
- Add unit test for flashinfer fp4 moe by @trevor-m in #8330
- [AMD] Update SGLang image fallback logic for AMD CI by @michael-amd in #8980
- Clean up server_args.py to have a dedicated function for model specific adjustments by @merrymercy in #8983
- Molly/ci gnr server by @DiweiSun in #8667
- [Fix] Fix wrong backend chosen in hybrid backend by @DarkSharpness in #8989
- Revert "[bug fix] Ensure local token and global token buffers are pointing to different storage " by @ch-wan in #8993
- [hotfix] use the original implementation in 8785 by @ch-wan in #8994
- Fix incorrect default get_hidden_dim logic by @lifuhuang in #8987
- optimize: reduce shulffle and quantization overhead in cutlass_moe sm90 by @TianQiLin666666 in #8962
- chore(deps): update minimum python to 3.10 by @ishandhanani in #8984
- Add CI for gpt-oss model on hopper by @fzyzcjy in #8851
- Fix redundant kernel in sink dtype conversion by @fzyzcjy in #8966
- Fix qwen2 audio not working bug by @byjiang1996 in #8600
- feat: update flashinfer ar oneshot params by @yyihuang in #8687
- Support glm4.1v and glm4.5v by @byjiang1996 in #8798
- feature(hicache): Support hf3fs-hicache reusing kvcache across different instances by @hzh0425 in #8673
- Tiny Llama4 type error in constructor by @b8zhong in #6752
- HiCache Storage tp fix by @xiezhq-hermann in #8878
- chore: upgrade sgl-kernel 0.3.3 by @zhyncs in #8998
- [DP] fix: engine crash when decode batch is padded by @ch-wan in #8995
- [bugfix] Fix missing args in bench one batch by @trevor-m in #8877
- [Feature] Optimize DeepSeek's DeepEP on Ascend NPU by @iforgetmyname in #8355
- Enable TBO on ROCm by @lcskrishna in #8329
- fix nvshmem cu126 by @zhyncs in #9001
- [perf] add kimi-k2 b200 fused moe config by @Alcanderian in #9010
- fix: fix obsolete qwen-audio processor arg by @mickqian in #9003
- Fix CI by @merrymercy in #9012
- fix flashinfer allreduce fusion import bug by @BBuf in #9007
- Fix CI by @merrymercy in #9013
- [hicache] Optimization for DMA copy by @cctry in #8245
- fix page first per layer pf2lf kernel by @huangtingwei9988 in #8915
- [Fix] Fix hicache backend by @DarkSharpness in #8991
- [Fix] Fix flashinfer cpu <-> gpu synchronization by @DarkSharpness in #8340
- [router] upgrade to latest sgl kernel for router ci by @slin1237 in #9019
- [router] upgrade rand to latest version by @slin1237 in #9017
- [router] upgrade kube version to latest by @slin1237 in #9018
- Optimize: Cache CUDA device to reduce redundant calls during tensor l… by @GeLee-Q in #8996
- Improve LoRA Perf by Deprecating FlashInfer and Eliminating Redundant Tensor Ops by @lifuhuang in #8940
- [router] update pyo3 version to 0.25.1 by @slin1237 in #9022
- [RL] Add test for /abort_request by @hebiao064 in #7626
- Simplify frontend language by @merrymercy in #9029
- Reorganize CI and test files by @merrymercy in #9027
- Reduce CI duration of test_lora_update. by @lifuhuang in #9024
- [Optimization] Update estimated_num_new_pages logic in TokenToKVPoolAllocator by @YiXR in #8794
- Support Flatten Tensor Update Weights to speed up MOE Update Weights by 20% by @hebiao064 in #8079
- Simplify memory pool by @merrymercy in #9033
- Revert "[1/2][resubmit] sgl-kernel: Fuse routed scaling factor into m… by @zhyncs in #9035
- Simplify health check by @merrymercy in #9034
- chore: upgrade flashinfer 0.2.11 by @zhyncs in #9036
- Update release-docs.yml by @merrymercy in #9037
- Refactor the docs by @merrymercy in #9031
- Improve docs and developer guide by @merrymercy in #9044
- Update REVIEWERS.md by @merrymercy in #9046
- [router] regular router circuit breaker by @slin1237 in #8997
- REVIEWERS.md typo fix by @xiezhq-hermann in #9048
- Revert "feat: update flashinfer ar oneshot params (#8687)" by @zhyncs in #9054
- [CI] Fix CI tests by @ch-wan in #9050
- Revert "chore: upgrade flashinfer 0.2.11 (#9036)" by @zhyncs in #9057
- bugfix: Fix output_ids extraction in detokenizer_manager by @CatherineSue in #9047
- [pd-router] add retry and circuit breaker for pd router by @slin1237 in #9051
- Support radix cache for Lora feature by @Fridge003 in #7216
- update deepep commit to support qwen3-coder by @yizhang2077 in #9066
- chore(gb200): remove ToT flashinfer installation by @ishandhanani in #9079
- Update REVIEWERS by @HaiShaw in #9063
- Fix chunked prefill size validation for disabled state by @chi2liu in #8973
- Fix broken Kimi models HuggingFace link by @Hangzhi in #9080
- [PD] decode: add CLIP_MAX_NEW_TOKEN for pop_preallocated by @jinmingyi1998 in #8866
- Fix docs for clip max new tokens by @hnyls2002 in #9082
- refactor(pd-router): extract common patterns to reduce code duplication by @slin1237 in #9081
- fix: w4afp8 accuracy problem and rebase by @yangsijia-serena in #8752
- Update hyperparameter_tuning.md by @merrymercy in #9083
- fuse allreduce and residual_rmsnorm by @BBuf in #8731
- TRTLLM-MLA FP8 path by @farazkh80 in #8638
- HiCache Storage: generate hash when inserting new nodes by @xiezhq-hermann in #9053
- [fix] Set Radix tree root node hash to None - Nvidia Dynamo Integration by @faradawn in #9030
- HiCache, add bench long context plus minor fixes by @xiezhq-hermann in #9086
- (gpt-oss, oai, chat): Remove Harmony Integration and Implement Native GPT-OSS Tool Call Support by @CatherineSue in #9043
- [router] Add Rust Binary Entrypoint for SGLang Router by @slin1237 in #9089
- [CI]Test BM.A10.4 runner by @key4ng in #8992
- Fix race condition in async lora unload by @lifuhuang in #9084
- Fix broken CI TestRequestLengthValidation by @lifuhuang in #9095
- Optimization for AscendPagedTokenToKVPoolAllocator by @Makcum888e in #8293
- feat: add fused moe config for Qwen3-30B-A3B on B200 by @zixuanzhang226 in #9087
- Fix mismatch between padded_scales shape and reshape dimensions in modelopt quantization by @ovowei in #8766
- [Fix] Fix dual chunk model default behavior by @DarkSharpness in #9032
- bugfix: Fix the commentary msg extraction in GptOssDetector by @CatherineSue in #9097
- docs: fix broken links in README.md by @xxrjun in #9075
- Fuse two kernels of hidden states padding into quantization kernel by @fzyzcjy in #9005
- update support new models doc by @yichaolemon in #9096
- Fuse writing KV buffer into rope kernel (part 1: sgl-kernel) by @fzyzcjy in #9077
- chore: bump sgl-kernel v0.3.4 by @zhyncs in #9103
- Runtime check CUDA driver version to avoid unresolved green context symbols by @hnyls2002 in #9021
- [Bugfix] Fix accuracy-test-1-gpu failure caused by `builtin_tools` by @CatherineSue in #9114
- Fix typo in REVIEWERS by @ShangmingCai in #9113
- [5/n] DP Enhancement: Correct `num_token_non_padded` by @ch-wan in #9107
- router: Fix user guide link README.md by @CatherineSue in #9122
- fix(docker): update sgl_kernel version to 0.3.4 in Dockerfile.gb200 by @ishandhanani in #9118
- Fuse writing KV buffer into rope kernel (part 2: srt) by @JeremieMelo in #9014
- [router] update router documentation by @slin1237 in #9121
- fix: update Dockerfile by @zhyncs in #9125
- [feat] add ascend readme and docker release by @pkking in #8700
- [feat] Enable Ascend profiling on SGLang by @ping1jing2 in #8610
- [Quantization] Supported w8a8 int8 quantized Gemma3 and Qwen-VL models by @ichernob in #8619
- Fix typos in supported models documentation by @Hangzhi in #9119
- [AMD] Support Wave attention backend with AMD GPU optimizations by @yichiche in #8660
- fix: update Dockerfile by @zhyncs in #9129
- chore: use cp310 by @zhyncs in #9130
- Support page first layout zero copy for mooncake store by @huangtingwei9988 in #8651
- [Feature] Support custom set kv buffer kernel by @DarkSharpness in #8884
- fix: wrong docker hub org name by @pkking in #9137
- Use FlashInfer's TRTLLM FP8 Blockscale GEMM by @elfiegg in #8588
- [1/2][resubmit again] sgl-kernel: Fuse routed scaling factor into moe_fused_gate by @trevor-m in #9088
- Support Triton FP8 Gemm handling hidden_dim not divisible by 16 by @hebiao064 in #9093
- Fix gpt-oss ~2x memory consumption issue by @fzyzcjy in #9146
- Update docker file for MI35x base image update to support gpt-oss mxfp4 model by @kkHuang-amd in #9111
- Double vision prefill throughput by defaulting to optimal vision attention backend by @AlienKevin in #8484
- Update fa3 interface and add unit test by @ispobock in #9150
- feat: update fa3 by @zhyncs in #9126
- [router] optimize Rust compilation and development workflow by @slin1237 in #9133
- [PD] optimize kv cache transfer directly using batch transfer by @ssssnow in #9149
- [PD] feat: mooncake use batch reg/dereg by @stmatengss in #8910
- Support FA3 backend for gpt-oss by @ispobock in #9028
- chore: bump v0.5.0rc1 by @zhyncs in #9069
- [Model] Support Qwen3ForSequenceClassification for Qwen3-Embed Model by @nysa-liu in #7957
- Swap xeon ci to gnr server by @DiweiSun in #9042
- Clean up allocators by @merrymercy in #9134
- [Generative Score API] Optimization to Remove Decode. by @sundar24295s in #8840
- Fix broken trtllm_mha attn backend with gpt-oss by @nvcastet in #9161
- Replace `sglang.srt.layers.quantization.scalar_types` with `sgl_kernel.scalar_type` by @Hongbosherlock in #8951
- [AMD] Update fallback images for AMD CI by @michael-amd in #9159
- [Bugfix] Avoid unnecessary reduce-scatter call in prepare_mlp by @changhuaixin in #9169
- Fix docker container DeepEP error on Blackwell by @fzyzcjy in #9171
- [DP Attention] Refactor: adding some utility functions by @ch-wan in #9136
- Faster weight processing (trtllm-gen moe nvfp4) by @aleozlx in #9162
- Feature: support qwen and llama4 reducescatter for dp attention padding by @Misaka9468 in #9101
- fix io group by @pansicheng in #9154
- [Perf] Tunings for SM100 FP8 CUTLASS kernel by @hhzguo in #8818
- Add A800 fused MoE kernel tuning configs for GLM4.5 and GLM4.5-Air by @lambert0312 in #8808
- Add H200 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @forestlee95 in #8852
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #7004
- Add H20 fused MoE kernel configs for Dpsk & Qwen3 by @M0gician in #7631
- Add H200 fused MoE kernel configs for DeepSeek-V3 in triton 3.3.1 by @junliu-mde in #7687
- add w8a8-fp8-block-wise H20-3e triton config by @sleepcoo in #8018
- fix: zero_init buffer by @yyihuang in #9065
- [2/n] decouple quantization implementation from vLLM dependency by @AniZpZ in #8112
- chore: bump sgl-kernel v0.3.5 by @zhyncs in #9185
- [sgl-kernel] 1/N Refactor sglang cutlass 3x - gemm fp8 blockwise sm90 by @yuan-luo in #8913
- [sgl-kernel] Support FlashInfer top_k_top_p_sampling_from_logits by @yuan-luo in #9060
- refine mxfp4 shuffling log by @BBuf in #9194
- [4/n] decouple quantization implementation from vLLM dependency by @Hongbosherlock in #9191
- feat: Add model version tracking with API endpoints and response metadata by @yitianlian in #8795
- [VLM] Improving multimodal tensor hash kernel by @adarshxs in #9008
- chore: upgrade transformers 4.55.2 by @zhyncs in #9197
- feat: update model config by @zhyncs in #9202
- chore: bump v0.5.0rc2 by @zhyncs in #9203
- feat: add fused moe config for Qwen3-235B-A22B-FP8 on B200 by @zixuanzhang226 in #9204
- [typo fix] Fix a typo in communicator.py by @LPhgh in #9183
- Minor fix docker container DeepEP on multi platforms by @fzyzcjy in #9205
- fix: fix unsupported palette mode of images in bench_serving for mmmu by @mickqian in #9206
- [6/N] MoE Refactor: Cleanup MoE-related configs by @ch-wan in #8849
- use fast math for per_token_group_quant_8bit. by @strgrb in #9177
- feat: remove sm75 by @zhyncs in #9207
- feat(hicache-3fs): 3FS-SGLang Hierarchical Cache Deployment Guide by @hzh0425 in #9213
- fix: the store_dtype typo for ascend mla by @shilinlee in #9208
- Fix the deprecation warning for enable_flashinfer_mxfp4_moe by @ch-wan in #9214
- Tiny update tmux history limit on dev container by @fzyzcjy in #9218
- [Eagle Warning fix] replace the deprecated 'and' with & by @XucSh in #9215
- [Misc] feat: Deepgemm update for sgl-kernel by @FlamingoPg in #8790
- Fp4 MOE quant kernel optimization by @jy-song-hub in #8777
- [CI] Fix sgl-router disaggregation test by @ShangmingCai in #9222
- Cleanup MoE Refactor by @ch-wan in #9223
- chore: bump sgl-kernel v0.3.6 by @zhyncs in #9220
- Optional extension for green context by @hnyls2002 in #9231
- [router] allow more health check configuration by @slin1237 in #9198
- [router] clean up lint warnings with clippy execution by @jeffdn in #9201
- [router] preserve original worker response header in router by @slin1237 in #9236
- chore(docker): update sgl_kernel version to 0.3.6 in Dockerfile.gb200 by @ishandhanani in #9243
- [AMD] Expand test coverage for AMD CI and enable apply_token_bitmask_inplace_cuda in sgl-kernel by @hubertlu-tw in #8268
- Fix nan value generated after custom all reduce by @kkHuang-amd in #8663
- Revert "chore(docker): update sgl_kernel version to 0.3.6 in Dockerfi… by @zhyncs in #9246
- Revert "chore: bump sgl-kernel v0.3.6 (#9220)" by @zhyncs in #9247
- Add fp4 quantize before all-gather for Flashinfer cutlass MoE DP (max throughput) by @trevor-m in #7667
- Fix DP load for embedding by @b8zhong in #9165
- [CI] add deepseek w4a8 test on h20 ci by @HanHan009527 in #7758
- Fix Custom All Reduce CI job. by @saienduri in #9258
- [feature] Ascend NPU graph support by @VDV1985 in #8027
- fix unexpected answer in EAGLE mode by @zyksir in #9252
- [PD] Support PD disaggregation with Prefill PP by @ShangmingCai in #8846
- Combine fp4.py and mxfp4.py into one file and support dynamic mxfp4 quantization in mxfp4.py by @kkHuang-amd in #9049
- Bug fix: use correct mm_items in embed_mm_inputs by @byjiang1996 in #8893
- ci: simplify multi-modality tests by using mixins by @mickqian in #9006
- [Bugfix] Change vLLM install order & Add A2 support by @iforgetmyname in #9232
- [router] fix pd prefill http request compliance issue by @slin1237 in #9237
- Quick Fix GLM by @hebiao064 in #9264
- model: support nvidia/Llama-3_3-Nemotron-Super-49B-v1 by @netanel-haber in #9067
- `from python.sglang.srt` -> `from sglang.srt` by @netanel-haber in #9268
- Revert "[Misc] feat: Deepgemm update for sgl-kernel (#8790)" to fix kernel CI by @hnyls2002 in #9260
- [router] add cargo clippy in CI and fix-up linting errors by @jeffdn in #9242
- [chore] Clean up redundant lora_weight_names concept to simplify code by @lifuhuang in #9131
- Fix swa eagle verify accuracy for Triton backend by @ispobock in #9279
- Fix memory pool leak error by @fzyzcjy in #9271
- [fix]: fix cutlass moe ut and optimize H20 cutlass groupGemm performance by @kousakawang in #9272
- Tiny make fp4 moe method parameters more static by @fzyzcjy in #8520
- [router] introduce prefill response draining for http compliance by @slin1237 in #9281
- [CPU] Fix TP padding issue on Phi-4 by @blzheng in #8289
- chore: bump sgl-kernel v0.3.6.post1 by @zhyncs in #9286
- [router] introducing tokenizer trait by @slin1237 in #9287
- Set the default attention backend for GLM-4.5v to fa3 by @zifeitong in #9245
- [Fix] Add undefined `update_tensor_inplace` function by @b8zhong in #6307
- [router] tokenizer factory, hf tokenizer, and stop sequence detector by @slin1237 in #9293
- Fix triton_fused_moe unit test and benchmark by @yuan-luo in #9276
- Further fix memory pool leak error by @fzyzcjy in #9298
- [router] add tokenizer metrics by @slin1237 in #9307
- [router] add reasoning parser base structure by @slin1237 in #9310
- Minor style fixes for sgl-kernel by @merrymercy in #9289
- [fix] fix enable_pdl for blackwell by @Alcanderian in #9011
- Modelopt quant config adaptation by @Edwardf0t1 in #8829
- should return invalid request for empty prompt by @gongwei-130 in #9315
- [MISC] use dynamic choices for tool-call-parser argument by @key4ng in #9316
- [Docs] Correct and clarify notes in Engine docstring by @JiangJiaWei1103 in #9313
- upgrade xgrammar 0.1.23 and openai-harmony 0.0.4 by @Swipe4057 in #9284
- [PD] Propagate internal server errors from aborted requests to clients instead of blindly returning 200's by @datdo-msft in #8936
- [GLM4.1V and GLM4.5V] Add vision transformer num_dummy_head support: max tp=4 -> max tp=8 by @byjiang1996 in #9059
- [AMD] Reorganize hip-related header files in sgl-kernel by @hubertlu-tw in #9320
- Tiny fix CI by @fzyzcjy in #9306
- [router] Add spec for sglang scheduler by @CatherineSue in #9322
- support for interns1-mini by @CUHKSZzxy in #9299
- [Bug] Fix input arguments of flashinfer_trtllm_moe by @JeremieMelo in #9317
- [router] restructure protocol modules for better organization by @key4ng in #9321
- Add `CMakeLists.txt` binary_dir by @EduardDurech in #7019
- enable marlin fp8 blockwise by @qeternity in #8990
- docs: fix spec by @zhyncs in #9326
- [Minor] Fix the style of sgl-kernel by @merrymercy in #9332
- [Bugfix] fix kv buffer register & dp attention & deepepmoe by @chenxu140 in #9327
- Revert "[feature] Ascend NPU graph support (#8027)" by @iforgetmyname in #9348
- [router] add dsr1, kimi, and qwen reasoning parser by @slin1237 in #9353
- fix: enable multi-GPU Triton fused MoE tuning by @mpashkovskiy in #6295
- [router] add tiktokenizer and sequence in router by @slin1237 in #9354
- [CI] Fix lint issues by @CatherineSue in #9361
- [Router] Add validation module for API parameters by @key4ng in #9335
- [router] adds reasoning parser pooling and thread-safe by @slin1237 in #9360
- [router] Implement gRPC SGLangSchedulerClient by @CatherineSue in #9364
- [router] add tokenizer chat template support by @slin1237 in #9370
- [router] Implement OpenAI Responses API specification by @key4ng in #9367
- Fix mini lb timeout issue by @fzyzcjy in #9369
- Fix triton backend eagle illegal memory access by @ispobock in #9344
- Fix gpt-oss response api streaming issue by @key4ng in #9368
- [feature] Rework Ascend NPU graph support by @iforgetmyname in #9350
- [minor] Sync style changes by @merrymercy in #9376
- [readme] Add SGLang x AMD SF meetup information by @wisclmy0611 in #9380
- [CI] Fix disaggregation failure tolerance CI by @ShangmingCai in #9378
- [Docs] Update contribution guide by @merrymercy in #9383
- Revert "[feature] Rework Ascend NPU graph support" by @iforgetmyname in #9385
- Reduce overhead for fa by not calling heavy CUDA property check by @oraluben in #7375
- Add PDL support for quant kernel and rope kernel by @fzyzcjy in #9106
- Fix the `--allow-auto-truncate` argument in tokenizer manager by @hnyls2002 in #9391
- Refactor allreduce add rmsnorm pattern by @BBuf in #9278
- [2/2] Fuse routed scaling factor into select_experts by @trevor-m in #8690
- Fix FlashInfer GPU <-> CPU sync by @thecodingwizard in #9409
- Support pinning adapter via server args. by @lifuhuang in #9249
- Fix incorrect logic in chat template handling. by @lifuhuang in #9336
- Support DP attention with GPT-OSS by @nvcastet in #9359
- Fixed the issue where eagle3 TPOT was not as good as without eagle3. by @jiapingW in #9404
- fix: InternS1 don't recognize image, updates image token for InternVL processor by @JustinTong0323 in #9381
- misc: parse bench_serving result as markdown table by @mickqian in #9377
- Add support for Qwen3-seq-cls by @nathanrchn in #9357
- Support trtllm_allreduce_fusion in flashinfer for cuda<12.8 by @strgrb in #9339
- [router] Add IGW (Inference Gateway) Feature Flag by @key4ng in #9371
- [router] add tokenizer integration test with real mini tokenizer by @CatherineSue in #9413
- [router] add glm and step3 reasoning parser by @CatherineSue in #9415
- Fix max_seq_len_k in trtllm_mha attention backend by @Qiaolin-Yu in #9416
- Fix biased_grouped_topk_cpu by @CaoE in #9420
- [PD] Fix nvlink transport accuracy through transferring metadata with tcp by @ShangmingCai in #9261
- [bug] fix errors related to context length in SD by @hnyls2002 in #9388
- feat: Add Triton fallback option and SM120 MoE configs for FP8 models by @voipmonitor in #9251
- [feature] Ascend NPU graph support by @VDV1985 in #9399
- Fix FP4 inference corruption issue in glm4.5-air model by @Azure-Tang in #9346
- Fix tiny misalign with previous truncation setting in tokenizer_manager by @hnyls2002 in #9430
- [NVIDIA] Fix trtllm fp4 moe backend when used in MTP by @kaixih in #9384
- Enables speculative decoding for the trtllm_mla attention backend by @pranavm-nvidia in #9238
- ci: enhance xeon ci by @DiweiSun in #9395
- [Bug] Fix w4afp8 moe kernel by @yuhyao in #9392
- Refactor weight offloading logic by @fzyzcjy in #8521
- Fix quant kernel test errors and benchmark wrong output speeds by @fzyzcjy in #7604
- [fix] Fix mxfp4 weight loading bug with TP sharding in GPT-OSS by @hlu1 in #9433
- [router] add tokenizer benchmark by @slin1237 in #9427
- [5/n] decouple quantization implementation from vLLM dependency by @Hongbosherlock in https://github.com/sgl-project/sglang/pull/9454
- accommodate reasoning_effort set in chat_template_kwargs by @gongwei-130 in https://github.com/sgl-project/sglang/pull/9458
- fix: should return an invalid request response when schema is missing by @gongwei-130 in https://github.com/sgl-project/sglang/pull/9461
- fix: support fb fp8 by @zhyncs in https://github.com/sgl-project/sglang/pull/9462
- Add deepseek v3.1 thinking parser support and update docs by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/9464
- feat: add fused moe config for GLM-4.5-Air-FP8 on B200 by @zixuanzhang226 in https://github.com/sgl-project/sglang/pull/9463
- [FA3] Init Spec Page Table only when Spec is enabled to save ~40MB by @hebiao064 in https://github.com/sgl-project/sglang/pull/9455
- fix: tmp revert gpt oss tp sharding on hopper by @zhyncs in https://github.com/sgl-project/sglang/pull/9469
- feat: update auto_choose_speculative_params by @zhyncs in https://github.com/sgl-project/sglang/pull/9470
- Revert "bugfix: Fix output_ids extraction in detokenizer_manager" by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/9467
- Update reasoning parser doc by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/9468
- Add Support for Page Size greater than 1 for Flashinfer MLA Backend by @pavanimajety in https://github.com/sgl-project/sglang/pull/8593
- [AMD] Remove the deprecated C10_WARP_SIZE by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/9356
- Support MHA with chunked prefix cache for flashinfer/flashmla backend, support page size > 1 for MHA chunked prefix by @xu-yfei in https://github.com/sgl-project/sglang/pull/8616
- [router] remove all tokenizer metrics for performance by @CatherineSue in https://github.com/sgl-project/sglang/pull/9474
- [code clean] add H20 cutlass groupGemm default config by @kousakawang in https://github.com/sgl-project/sglang/pull/9333
- [docs]: fix reasoning context in docs by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/9483
- [Docs] Update reasoning parser doc & fix outdated link by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/9492
- [router] add tool parser base structure and partial json parser by @CatherineSue in https://github.com/sgl-project/sglang/pull/9482
- [router] fix router load guard tracking for streaming by @slin1237 in https://github.com/sgl-project/sglang/pull/9491
- torch.compile() mrope by @timmy-feng in https://github.com/sgl-project/sglang/pull/9487
- Add trtllm_mla and cutlass_mla for ragged fmha for chunked prefill by @elfiegg in https://github.com/sgl-project/sglang/pull/9480
- chore: bump sgl-kernel v0.3.6.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/9475
- Update docker file for supporting PD-Disaggregation on MI300x by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/9494
- [Docs] Add doc and quick demo for gpt-oss responses api & buildin tools by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/9497
- Support speculative decoding in the trtllm_mha attention backend by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/9331
- minor: determine mm attn backend based on platforms by @mickqian in https://github.com/sgl-project/sglang/pull/9303
- Disable torch.compile for get_last_loc_large_page_size_large_top_k by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/9507
- [bugfix] Make --enable-hierarchical-cache and --disable-radix-cache mutually exclusive by @XucSh in https://github.com/sgl-project/sglang/pull/9452
- 3fs zerocopy by @pansicheng in https://github.com/sgl-project/sglang/pull/9109
- [HiCacheStorage] backup optimization for MLA model by @huangtingwei9988 in https://github.com/sgl-project/sglang/pull/8865
- Use Tensor Core Decode when gqa group size >= 4 by @Edenzzzz in https://github.com/sgl-project/sglang/pull/8624
- [router] tokenizer arch doc by @slin1237 in https://github.com/sgl-project/sglang/pull/9513
- [MTP] Force greedy sampling on AMD by @datdo-msft in https://github.com/sgl-project/sglang/pull/9127
- [router] add json tool parser by @slin1237 in https://github.com/sgl-project/sglang/pull/9516
- [NVIDA] [1/N] Nvfp4 Masked Gemm: Add quant op for the flashinfer grouped gemm by @kaixih in https://github.com/sgl-project/sglang/pull/9200
- [AMD] Fix Llama 4 FP8 accuracy issues on MI300X by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/7699
- Add Qwen3-30B-A3B-Thinking-2507 support on AMD GPUs. by @sogalin in https://github.com/sgl-project/sglang/pull/9456
- [router] Move all protocols to spec.rs file by @key4ng in https://github.com/sgl-project/sglang/pull/9519
- [router] ignore client error when record failure in pd_router by @Bruce-x-1997 in https://github.com/sgl-project/sglang/pull/9503
- Add support for extensions of interface and pre-registrations to NIXL HiCache by @mkhazraee in https://github.com/sgl-project/sglang/pull/9211
- Support GC Freezing to improve latency & throughput by @chanh in https://github.com/sgl-project/sglang/pull/9241
- Add enable_flashinfer_mxfp4_bf16_moe for higher precision and slower moe backend by @fzyzcjy in https://github.com/sgl-project/sglang/pull/9004
- [benchmark] Add benchmark scripts for ceval and boolq by @yuxingcyx in https://github.com/sgl-project/sglang/pull/8946
- fix: blackwell dsv3 fp8 issue temporary solution by @zhyncs in https://github.com/sgl-project/sglang/pull/9530
- tool-call(dsv3): Improve deepseek-v3 chat template and `tool_choice = required` by @CatherineSue in https://github.com/sgl-project/sglang/pull/9525
- [fix] Fix mxfp4 triton MoE tp bug by @hlu1 in https://github.com/sgl-project/sglang/pull/9473
- Overlapped weight offload by @fzyzcjy in https://github.com/sgl-project/sglang/pull/8034
- Tiny make device_loading_context more static by @fzyzcjy in https://github.com/sgl-project/sglang/pull/9478
- Partially unify triton per token group quant kernels by @fzyzcjy in https://github.com/sgl-project/sglang/pull/9485
- feat(hicache): Supports 3fs-hicache compatibility with dp-attention by @hzh0425 in https://github.com/sgl-project/sglang/pull/9372
- Update grok.py and tiktoken tokenizer by @merrymercy in https://github.com/sgl-project/sglang/pull/9532
- Release 0.5.1 by @merrymercy in https://github.com/sgl-project/sglang/pull/9533
## New Contributors
- @sighingnow made their first contribution in #8611
- @vvenkates27 made their first contribution in #8488
- @farazkh80 made their first contribution in #8632
- @yrk111222 made their first contribution in #8083
- @pkking made their first contribution in #8270
- @ZacWang made their first contribution in #8664
- @lbh2001 made their first contribution in #8618
- @wenchen76 made their first contribution in #8512
- @WANG-GH made their first contribution in #7379
- @shenoyvvarun made their first contribution in #8683
- @17Reset made their first contribution in #8547
- @TianQiLin666666 made their first contribution in #8678
- @YyWangCS made their first contribution in #8733
- @azhurkevich made their first contribution in #8552
- @yuhyao made their first contribution in #8596
- @House-West made their first contribution in #8144
- @ZhengWG made their first contribution in #8292
- @htiennv made their first contribution in #8698
- @triple-Mu made their first contribution in #8799
- @tonyluj made their first contribution in #8971
- @maocheng23 made their first contribution in #8371
- @cctry made their first contribution in #8245
- @chi2liu made their first contribution in #8973
- @Hangzhi made their first contribution in #9080
- @jinmingyi1998 made their first contribution in #8866
- @Makcum888e made their first contribution in #8293
- @ovowei made their first contribution in #8766
- @xxrjun made their first contribution in #9075
- @yichaolemon made their first contribution in #9096
- @JeremieMelo made their first contribution in #9014
- @ichernob made their first contribution in #8619
- @nysa-liu made their first contribution in #7957
- @changhuaixin made their first contribution in #9169
- @aleozlx made their first contribution in #9162
- @Misaka9468 made their first contribution in #9101
- @hhzguo made their first contribution in #8818
- @forestlee95 made their first contribution in #8852
- @LPhgh made their first contribution in #9183
- @shilinlee made their first contribution in #9208
- @jy-song-hub made their first contribution in #8777
- @jeffdn made their first contribution in #9201
- @VDV1985 made their first contribution in #8027
- @netanel-haber made their first contribution in #9067
- @kousakawang made their first contribution in #9272
- @gongwei-130 made their first contribution in #9315
- @datdo-msft made their first contribution in #8936
- @CUHKSZzxy made their first contribution in #9299
- @EduardDurech made their first contribution in #7019
- @chenxu140 made their first contribution in #9327
- @mpashkovskiy made their first contribution in #6295
- @oraluben made their first contribution in #7375
- @thecodingwizard made their first contribution in #9409
- @jiapingW made their first contribution in #9404
- @nathanrchn made their first contribution in #9357
- @CaoE made their first contribution in #9420
- @voipmonitor made their first contribution in #9251
- @Azure-Tang made their first contribution in #9346
- @pranavm-nvidia made their first contribution in #9238
- @hlu1 made their first contribution in #9433
- @timmy-feng made their first contribution in https://github.com/sgl-project/sglang/pull/9487
- @Bruce-x-1997 made their first contribution in https://github.com/sgl-project/sglang/pull/9503
- @mkhazraee made their first contribution in https://github.com/sgl-project/sglang/pull/9211
- @yuxingcyx made their first contribution in https://github.com/sgl-project/sglang/pull/8946
**Full Changelog**: v0.4.10...v0.5.1