github sgl-project/sglang v0.5.3
Release v0.5.3

8 hours ago

What's Changed

  • [Auto Sync] Update server_args.py (20250912) by @merrymercy in #10347
  • [CPU][doc] add torch.compile param in example commands by @ZailiWang in #10349
  • [router][ci] Add GPU utilization analysis with NVML by @key4ng in #10345
  • [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked by @wenscarl in #9199
  • fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale by @trevor-m in #10296
  • model: support Apertus by @EduardDurech in #9774
  • fix dual stream bug by @yizhang2077 in #10352
  • [router] Basic OAI Response api by @key4ng in #10346
  • Implement Standalone gRPC Server for SGLang Python Scheduler by @CatherineSue in #10283
  • support memory_pool_host page first direct layout by @huangtingwei9988 in #10031
  • fix the break in FlashInferFusedMoE by @chenqianfzh in #10356
  • fix: resolve transfer_kv_all_layer_direct_lf_pf import error by @zhyncs in #10360
  • Support LingV2 model by @strgrb in #10359
  • Fix Bailing MoE model bugs by @yuan-luo in #10362
  • Revert add mainprocess's proctitle by @whybeyoung in #10351
  • model: support dots.vlm1 model by @yonghenglh6 in #8778
  • Support loading weights from remote instance by @amysaq2023 in #8215
  • add qwen3-next ut by @yizhang2077 in #10355
  • Fix chunked prefix cache for nvfp4 by @wenscarl in #10180
  • Fix FA4 import cause moe_fused_gate output be illegal memory by @fzyzcjy in #10368
  • Fix global input scale incompatible with CuTe DSL moe by @fzyzcjy in #10370
  • [router] Add Rerank Routing Logic in Regular Router by @fangjian601 in #10219
  • [router] enable sccache in ci and local build by @slin1237 in #10099
  • fix: add fast path for function call by @yizhang2077 in #9023
  • [Auto Sync] Update base_grammar_backend.py, llguidance_back... (20250911) by @merrymercy in #10333
  • fix: resolve gb200 image link by @zhyncs in #10343
  • fix: exclude protobuf generated code by @zhyncs in #10388
  • [bug] fix ci syntax by @slin1237 in #10390
  • Fix GPU fault issue when running dsv3 with dp mode and torch-compile enabled by @kkHuang-amd in #10361
  • feat: add deepseek v3 fp4 ut by @zhyncs in #10391
  • Add sentencepiece to project dependencies by @mmangkad in #10386
  • [router] allow one router to support different model families and serving mode by @slin1237 in #10244
  • [router] Add get and cancel method for response api by @key4ng in #10387
  • Benchmark: Support API_KEY without 'bearer' by @Muqi1029 in #10380
  • Support Qwen3-Next on Ascend NPU by @iforgetmyname in #10379
  • [HiCache] fix mooncake config in different tp size by @stmatengss in #10377
  • [HiCache] doc: update deployment in readme by @stmatengss in #10332
  • [router] add not implemented functions for multi model trait by @slin1237 in #10394
  • [Auto Sync] Update xgrammar_backend.py (20250913) by @merrymercy in #10395
  • fix naming of probs computed without temperature scaling by @narutolhy in #9984
  • Fix the style of sgl kernel by @merrymercy in #10398
  • fix: tool parse in large streaming chunk beginning with normal content by @JustinTong0323 in #10397
  • [Fix] Init mamba related memory pools with torch.zeros by @byjiang1996 in #10400
  • support qwen3_next blackwell by @yizhang2077 in #10403
  • [Fix] Support qwen3-next MTP+DP by @byjiang1996 in #10392
  • Update ROCm docker image to add sgl-router support by @kkHuang-amd in #10406
  • [Performance] Dynamic Batch Tokenizer by @sundar24295s in #9382
  • [Generative Score API] Scoring(Prefill-only) optimizations. by @sundar24295s in #9748
  • Remove repeated list additions in init_incremental_detokenization by @hnyls2002 in #10412
  • [Hack] Add pd-disaggregation decode polling interval by @hnyls2002 in #10411
  • fix duplicated logger in eager_utils by @lj970926 in #10410
  • Fix cutlass moe accuracy drop caused by attention UB from DP padding mode by @fzyzcjy in #10414
  • Add self.capture_aux_hidden_states For GLM-4.5V by @zRzRzRzRzRzRzR in #10228
  • Add h200 fused moe config for Qwen3-Next by @Ximingwang-09 in #10404
  • Auto determine sgl kernel version in blackwell CI by @fzyzcjy in #10318
  • Fix the global scale fix does not support EPLB and improve enabling condition by @fzyzcjy in #10369
  • Let sgl-kernel changes be tested on srt by @fzyzcjy in #10313
  • [2/2] Speed up prefill mla attention concat by @fzyzcjy in #10157
  • Support offloading in fp8 by @fzyzcjy in #9948
  • Support global scale in addition to per expert scale for cutedsl moe by @fzyzcjy in #10270
  • Support profile args in Engine API by @fzyzcjy in #6539
  • Fix sgl-kernel + srt CI by @fzyzcjy in #10419
  • [PD metrics] Fix some uncompleted PD related metrics by @acelyc111 in #8627
  • Fix typo in --enable-custom-logit-processor to agree with the CLI arg by @thalahors in #10076
  • [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 by @sufeng-buaa in #9962
  • fix: use latest flashinfer by @zhyncs in #10428
  • fix: enable cu124 and cu128 build on main push by @zhyncs in #10431
  • [Fix] MoE: fix w8a8_fp8 MoE and add tests to cover this code path by @ch-wan in #10429
  • Add split tile size for Triton attention by @ispobock in #10425
  • Fix correction bias undefined behavior for nvfp4 models by @fzyzcjy in #10426
  • feat: add dsv3 fp4 cutlass moe etp ut by @zhyncs in #10433
  • router: Add Embedding routing logic by @tao12345666333 in #10129
  • Revert "Fix FA4 import cause moe_fused_gate output be illegal memory" by @fzyzcjy in #10432
  • [4/N]DP refactor: support watching mode get_load and shortest queue strategy by @hnyls2002 in #10201
  • automatically label pr for ci by @merrymercy in #10435
  • Refactor TopK to ensure readability and extensibility by @ch-wan in #9338
  • Tiny fix wrong naming by @fzyzcjy in #10437
  • Fix label pr for ci by @merrymercy in #10441
  • metrics: support customer labels specified in request header by @acelyc111 in #10143
  • [docs / oneliner] update mmmu docs instruction by @vincentzed in #9768
  • Add reasoning examples for GPT-OSS in Markdown examples by @vincentzed in #9626
  • Fix label PR by @merrymercy in #10445
  • Update permissions in label-pr.yml by @merrymercy in #10450
  • [CI] Fix token key in label-pr.yml workflow by @merrymercy in #10452
  • fix: fix max_new_tokens uninitialized error by @mickqian in #9343
  • [router] fix service discovery and mcp ut by @slin1237 in #10449
  • fix(server_args): Skip chunked_prefill_size validation when disaggregation mode is decode by @jinmingyi1998 in #10358
  • [router] add dependency for router by @ooapex in #10401
  • [router] fix logger ordering git ctx by @CatherineSue in #10457
  • Update GITHUB_TOKEN secret for documentation push by @merrymercy in #10458
  • [HotFix] Fix import path in 3fs_bench_client.py by @hzh0425 in #10463
  • Add rtx5880 moe triton by @Jimmy-L99 in #10439
  • Run tests based on labels by @merrymercy in #10456
  • Fix trtllm_moe wrong correction bias by @fzyzcjy in #10440
  • feat: support pip install sglang by @zhyncs in #10465
  • chore: bump v0.5.3rc0 by @zhyncs in #10468
  • [router] minor code clean up in server startup by @CatherineSue in #10470
  • [bugfix] fix typo by @1195343015 in #10471
  • [PD metrics] Add latency Histogram metrics of each stage for generate requests by @acelyc111 in #8710
  • [CI] Fix runner for sgl-kernel by @csahithi in #9887
  • fix(internvl): fix accuracy issue of normalization by @KEVINTUAN12 in #10375
  • fix: gpt-oss streaming dropping normal content when tools are provided but not used by @jonaslsaa in #9657
  • model: support solar by @ppraneth in #8189
  • fix: resolve sgl-kernel ut by @zhyncs in #10476
  • [1/2] Speed up trtllm_mla attention backend (>10% e2e) by @fzyzcjy in #10473
  • Fix --dataset-path in bench_one_batch_server by @hnyls2002 in #10475
  • [Env] minimal version for organizing envs by @hnyls2002 in #10479
  • chore: bump v0.3.10 sgl-kernel by @zhyncs in #10478
  • [router] multi model registration fix by @CatherineSue in #10481
  • [2/4] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance by @lifuhuang in #10286
  • [Auto Sync] Update registry.py (20250915) by @merrymercy in #10484
  • [router] fix worker registration in multi model mode by @CatherineSue in #10486
  • fix crash of DeepSeek-V3 update_weights_from_disk by @scut-cbq in #8863
  • Temporary work-around for the data-parallel issue on rocm 7.0.0 alpha by @kkHuang-amd in #10434
  • [Hicache] Evaluate Per-Round Metrics in Multiturn Bench by @ykwd in #10203
  • [ModelOpt] Respect kv_cache_quant_algo in ModelOpt checkpoints by @brayden-hai in #10336
  • Add Logprobs unit test with a loose threshold by @PrinsYin in #10230
  • [router] add router db connector for responses api by @slin1237 in #10487
  • Remove wrong imports from sglang.python by @hnyls2002 in #10493
  • [router] fix router manager and router init in server by @CatherineSue in #10499
  • Cache the result of is_blackwell platform check by @b8zhong in #10498
  • feat: update support for qwen3next model by @cao1zhg in #10466
  • Minor fix lint introduced by #10466 by @ShangmingCai in #10507
  • chore: upgrade sgl-kernel 0.3.10 by @zhyncs in #10500
  • Update CUTLASS. Refine KernelSchedule for fp8 (grouped) gemm. by @HydraQYH in #10491
  • Fix CI when sgl-kernel is changed but srt is not changed by @fzyzcjy in #10515
  • Support sgl-router parallel_batch in bench_one_batch_server by @fzyzcjy in #10506
  • [CPU] fix CPU backend selection issue for Llama4 by @ZailiWang in #10511
  • adjust import setuptools_rust by @whybeyoung in #10524
  • Fix formatting in long code blocks by @philipkiely-baseten in #10528
  • skip vision_model for lora by @gongwei-130 in #10530
  • [2/2] Speed up trtllm_mla attention backend by @fzyzcjy in #10474
  • support using fa4 on deepseek on blackwell by @cicirori in #9928
  • [Auto Sync] Update scheduler_profiler_mixin.py, rpd_utils.p... (20250916) by @merrymercy in #10494
  • [Auto Sync] Update activation.py, chunk_cache.py, utils.py (20250917) by @merrymercy in #10538
  • feat: add priority based scheduling with priority based request acceptance and preemption by @harrisonlimh in #8746
  • Fix decord dependency for aarch64 docker build by @kyleliang-nv in #10529
  • enable prefix cache with dp by @wenscarl in #10459
  • [bugfix] fix hicache bench_long_context.py run failure by @zhannngchen in #10523
  • Remove duplicated code by @oraluben in #10545
  • CUDA Arch Independent by @EduardDurech in #8813
  • [bench] Fix random seed in bench_one_batch_server by @hnyls2002 in #10548
  • [HiCache] Add tests for hicache storage mooncake backend by @stmatengss in #10171
  • [BugFix] Fix incorrect hidden_states_tensor in pd disaggregation + eagle by @ZeldaHuang in #9976
  • fix: update dsv3 fp4 ut by @zhyncs in #10584
  • vlm: remove redundant d2h movement of mm feature tensors by @AlienKevin in #9987
  • Enable trtllm mla prefix extend by @wenscarl in #10526
  • [ROCm] Fix fp8 quantization accuracy issue. by @sogalin in #10558
  • [HICache] introduce evict policy by @XucSh in #10190
  • aiter v0.1.5.post2 by @HaiShaw in #10563
  • [PD] Improve disaggregation common backend and refactor mooncake backend by @ShangmingCai in #10273
  • chore: upgrade mooncake 0.3.6 by @ShangmingCai in #10596
  • [improvement] add average input/output token length for hicache benchmark stats output by @zhannngchen in #10525
  • Scale kkt after reduction by @yizhang2077 in #10604
  • fix deepep assert when PD disaggregation == null by @alpha-baby in #8274
  • [RL] Add destroy process group api by @penguin-wwy in #9979
  • Feat/add heartbeat mechanism for nixl conn by @shaharmor98 in #10222
  • update deepep version for qwen3-next deepep moe by @yizhang2077 in #10624
  • support qwen3-next-fp8 deepep by @yizhang2077 in #10622
  • Fix sgl_kernel import failure on devices other than CUDA by @ZailiWang in #10610
  • [Performance] qwen3-next improve causal conv1d in prefill phase by @liz-badada in #10595
  • Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py by @yhyang201 in #10579
  • feat: Add FlexAttention Backend for Efficient Sparse Attention by @yukiy927 in #9947
  • Garbage collector regression in the online server by @brayden-hai in #10621
  • [router] refactor worker to builder pattern 1/n by @slin1237 in #10628
  • refactor: use registry for _get_attention_backend_from_str by @zhyncs in #10629
  • [Feature] Speculative decoding support lookahead by @a4zhangfei in #9873
  • [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… by @byjiang1996 in #10553
  • [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster by @byjiang1996 in #10586
  • model support: Sarashina2VisionForCausalLM by @CatherineSue in #10632
  • feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 by @zixuanzhang226 in #10631
  • chore: bump sgl-kernel 0.3.11 by @zhyncs in #10630
  • Hicache L3 backend mooncake optimization configuration reading method by @leihuang-sketch in #10319
  • [router] refactor worker to builder pattern 2/n by @slin1237 in #10633
  • [Feature] feat(get_ip): unify get_ip_xxx by @jinmingyi1998 in #10081
  • [router] refactor worker to builder pattern 3/n by @slin1237 in #10647
  • [1/2][sgl-kernel] Support moe_sum_reduce cuda kernel by @yuan-luo in #10321
  • [router] refactor worker to builder pattern 4/n by @slin1237 in #10650
  • Fix fast decode plan for flashinfer v0.4.0rc1 and upgrade sgl-kernel 0.3.11 by @Fridge003 in #10634
  • [router] refactor worker to builder pattern 5/n by @slin1237 in #10653
  • [HiCacheStorage]support page_first_direct layout for generic set&get by @huangtingwei9988 in #10522
  • [router] preserve order of json params using preserve_order feature by @fgebhart in #10661
  • [router] refactor router and worker management 1/n by @slin1237 in #10664
  • fix: resolve sync issue by @zhyncs in #10668
  • [Auto Sync] Update .clang-format (20250919) by @zhyncs in #10670
  • [router] refactor router and worker management 2/n by @slin1237 in #10666
  • router-spec: Reorder ChatCompletionRequest and fix validation logic by @CatherineSue in #10675
  • chore: cleanup docker image by @zhyncs in #10671
  • limit sgl-kernel causal conv1d to cuda only by @liz-badada in #10648
  • [Auto Sync] Update model_runner.py (20250920) by @zhyncs in #10679
  • [router] refactor router and worker management 2.5/n by @slin1237 in #10677
  • [1/2] Support deterministic inference with flashinfer attention backend by @Fridge003 in #10645
  • [Auto Sync] Update deepseek_v2.py (20250920) by @zhyncs in #10683
  • chore: upgrade mooncake 0.3.6.post1 to fix gb200 dockerfile by @ShangmingCai in #10681
  • [Performance] Qwen3-Next: optimize causal_conv1d_fn triton kernel - up to 9% faster by @byjiang1996 in #10680
  • Replace os.environ in layernorm.py by @Fridge003 in #10684
  • fix(disagg): fix sending KV cache in case of MLA for NIXL backend by @dmitrygx in #10673
  • fix: update run_suite by @zhyncs in #10685
  • fix: remove awq_dequantize deps by @zhyncs in #10686
  • [Auto Sync] Update modelopt_quant.py (20250920) by @zhyncs in #10688
  • [Feature] Support deterministic inference with FA3 backend by @hebiao064 in #10651
  • feat: update server args by @zhyncs in #10696
  • Super tiny fix extra logs by @fzyzcjy in #10697
  • [3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization by @lifuhuang in #10592
  • Update release-docs.yml by @sglang-bot in #10706
  • Refactors radix cache for extra key support by @JustinTong0323 in #10317
  • [Router]fix: fix get_load missing api_key by @jinmingyi1998 in #10385
  • fix: disable gpt-oss b200 ut by @zhyncs in #10716
  • Optimize cutlass int8 gemm kernel for large M on SM89 Ada GPU by @HydraQYH in #10714
  • [Auto Sync] Update deepseek_v2.py (20250922) by @zhyncs in #10717
  • Support deterministic inference with triton backend (Hardware test: NV and AMD GPUs) by @yushengsu-thu in #10694
  • [deterministic inference] Move batch invariant pkg to sglang by @hebiao064 in #10695
  • [2/2] Support deterministic inference for temperature > 0 by @Qiaolin-Yu in #10678
  • [Ascend] codeowner updates for ascend related files by @ping1jing2 in #10699
  • [4/4] Introduce CachedKernel to reduce CSGMV kernel launch overheads by 60% by @lifuhuang in #10709
  • Convert FLASHINFER_WORKSPACE_SIZE to integer by @reyoung in #10731
  • EPLB: prefer to use physical experts in the same node by @acelyc111 in #9849
  • fix capture_bs when speculative decoding enabled by @feng397 in #10730
  • Fix flaky logprobs test by @ShangmingCai in #10728
  • Fix CI TestChunkedSGMV by @lifuhuang in #10737
  • [Docs, minor] Fix LLM doc matrix by @adarshxs in #10753
  • Add warnings and remove dependency for deterministic inference by @Fridge003 in #10724
  • bugfix: Fix get_worker_urls_for_model in http/router.rs by @CatherineSue in #10754
  • [router] refactor router and worker management 3/n by @slin1237 in #10727
  • [router] update ci so only execute benchmarks when labels are added by @slin1237 in #10757
  • Fix MTP MoE weight loading with NVFP4 target model. by @LorrinWWW in #10758
  • chore: bump sgl-kernel v0.3.12 by @zhyncs in #10732
  • [Generative Score API] Added test_scores_api.py to github CICD to run per commit by @vedantjh2 in #10755
  • refactor zero copy by @pansicheng in #10300
  • Fix multimodal registry and code sync scripts by @merrymercy in #10759
  • Enables TRT-LLM backend to be used for target_verify by @pranavm-nvidia in #10281
  • fix: kv events with tp > 1 by @ishandhanani in #10541
  • [Auto Sync] Update flashattention_backend.py (20250922) by @zhyncs in #10762
  • [Feature] Add MLAProcess for DeepSeek MLA on NPU by @iforgetmyname in #10130
  • [Ascend] optimize Qwen-vl on Ascend by @ping1jing2 in #10556
  • [Ascend]optimize Qwen3 on Ascend by @ping1jing2 in #10574
  • [Auto Sync] Update configurer.py (20250923) by @merrymercy in #10765
  • [router] refactor router and worker management 4/n by @slin1237 in #10756
  • [router] remove pd router draining channel by @slin1237 in #10767
  • [router] fix logger type mismatch by @CatherineSue in #10774
  • Use simulate acc len from sglang.environ by @hnyls2002 in #10771
  • Fix trtllm_mla slow concat kernel in MTP by @fzyzcjy in #10777
  • Move cached kernel to srt.utils by @lifuhuang in #10776
  • feat: unify dockerfiles by @ishandhanani in #10705
  • Introduce FutureMap by @hnyls2002 in #10715
  • chore: upgrade sgl-kernel 0.3.12 by @zhyncs in #10782
  • followup: clean up dockerfiles and release yamls by @ishandhanani in #10783
  • Clean up server args by @merrymercy in #10770
  • move environ into sglang.srt to avoid break SRT auto sync. by @hnyls2002 in #10791
  • Fix hicache mooncake backend CI by @ShangmingCai in #10792
  • [router] fix cache aware lock by @slin1237 in #10773
  • [router] responses api POST and GET with local storage by @slin1237 in #10581
  • model: support qwen3-vl series by @zju-stu-lizheng in #10323
  • [fix][pd-disag]no need set next batch sampling info done in prefill by @jinmingyi1998 in #10259
  • [ROCm] Update aiter to v0.1.5.post3 by @sogalin in #10812
  • [router] use dashmap for radix tree instead of hash for multi model by @slin1237 in #10814
  • router(grpc): Implement route for chat_cmpl endpoint by @CatherineSue in #10761
  • fix ceval by @ZhengHSI in #10504
  • Remove duplicate code in qwen2 model by @Lzhang-hub in #10540
  • [router] fix axum default body limit by @CatherineSue in #10818
  • Fix latest main ci by @ShangmingCai in #10799
  • add tuning files for QWEN-3-NEXT by @yiakwy-xpu-ml-framework-team in #10794
  • [Auto Sync] Update protocol.py (20250923) by @zhyncs in #10820
  • fix: draft model IMA by overriding max_positional_embeddings by @JustinTong0323 in #10787
  • [Auto Sync] Update elementwise.py (20250923) by @merrymercy in #10823
  • [Auto Sync] Update simple_eval_common.py (20250923) by @merrymercy in #10824
  • [router] Support streaming for Openai Router Response api by @key4ng in #10822
  • [router] add auth middleware for api key auth by @CatherineSue in #10826
  • [Auto Sync] Update load_config.py, model_config.py, configu... (20250923) by @merrymercy in #10825
  • Revert "[fix][pd-disag]no need set next batch sampling info done in prefill" by @merrymercy in #10828
  • Add CI timeout guidelines by @merrymercy in #10829
  • feat: add cache_salt support to request by @JustinTong0323 in #10718
  • fix bailing_moe with enable_dp_attention by @GuoweiWangU in #10860
  • ci: free space on workers for build by @ishandhanani in #10786
  • router-grpc: Support jinja chat template content format detection by @CatherineSue in #10832
  • [router] select first healthy worker on proxied get requests by @lun-4 in #10827
  • chore: Initial support for input config files by @kushanam in #10534
  • router-grpc: Add tools processing and other parameters for apply_chat_template by @CatherineSue in #10877
  • [router] consolidate health endpoints and flush cache by @slin1237 in #10876
  • Restructure sgl-kernel benchmark by @BBuf in #10861
  • [Bug] Fix Issue#10215 by @yuhyao in #10572
  • [router] consolidate worker get loads by @slin1237 in #10880
  • [router] Support Oracle DB(ATP) Data Connector by @key4ng in #10845
  • [router] simplify tokenizer dev doc by @slin1237 in #10895
  • [Auto Sync] Update model_config.py (20250925) by @merrymercy in #10885
  • [ci feature] add ci monitor by @BBuf in #10872
  • [HiCache] Cleaning the deprecated host memory state by @xiezhq-hermann in #10778
  • integrate AIBrix KVcache by @yapple in #10376
  • Add fuse_moe per-channel tune by @sleepcoo in #10915
  • [router] consolidate worker load monitoring by @slin1237 in #10894
  • router: Fix constraint proto and build_constraint in grpc router by @CatherineSue in #10881
  • Refactor kv_cache_scheme handling for quantization by @mmangkad in #10132
  • refactor: Move grpc/client.rs to grpc_client/sglang_scheduler.rs by @CatherineSue in #10924
  • fix env flashinfer by @Swipe4057 in #10910
  • [minor] Remove deprecated function get_ip by @merrymercy in #10883
  • Rename customer label -> custom label by @merrymercy in #10899
  • [router] change log level to warning by @slin1237 in #10926
  • [router][refactor] Clean up protobuf fields by @CatherineSue in #10923
  • Replace the Kimi-K2 generated tool call idx with history tool call count by @eraser00 in #10612
  • [ci] add ci-monitor workflow by @BBuf in #10898
  • Remove pull_request trigger from CI monitor workflow by @merrymercy in #10932
  • router: Support parallel sampling num > 1 in grpc_server and non-stream handling by @CatherineSue in #10929
  • Revert "Refactor kv_cache_scheme handling for quantization (#10132)" by @zhyncs in #10935
  • Update CODEOWNERS to include JustinTong0323 in FC by @JustinTong0323 in #10939
  • [PD-HiCache]: Support Async Offloading KVCache In Decode Side by @hzh0425 in #10192
  • CI: Fix docker manifest build by @csahithi in #10936
  • [router] update owners for router components by @slin1237 in #10927
  • Fuse write kv buffer into rope for qwen3 moe & bailing moe by @yuan-luo in #10749
  • [router] add grpc client get and set by @slin1237 in #10955
  • [router]fix code owner syntax error by @slin1237 in #10956
  • [router] move grpc client from router to worker and builder by @slin1237 in #10958
  • [router] move grpc worker management from router to worker manager by @slin1237 in #10960
  • [router] grpc router regular mode import cleanup by @slin1237 in #10963
  • [router] remove old/outdated/useless comments by @slin1237 in #10967
  • [router] remove old/outdated/useless comments across code base by @slin1237 in #10968
  • ci: fix rate-limit of huggingface with hf auth login by @mickqian in #10947
  • Update label field comment to indicate deprecation by @merrymercy in #10970
  • Restructure gpu_memory_settings into a unified function and relax max_cuda_graph_bs by @BBuf in #10372
  • ci: refactor nightly test by @mickqian in #10495
  • refactor loading weights from remote instance coding format by @amysaq2023 in #10941
  • [router][grpc] Add helper functions for decoder in router.rs and fix specs by @CatherineSue in #10971
  • Add simple docker file for B300 by @hlu1 in #10944
  • Ci monitor support performance by @BBuf in #10965
  • [HiCache]: Support dynamic loading backends for hicache by @hzh0425 in #10551
  • [Bugfix][Minor][Benchmark] Fix some bugs due to PR #10495 by @Muqi1029 in #10982
  • [router][grpc] Support E2E non-stream chat completions by @CatherineSue in #10980
  • fix: fp8 quantization failure of qwen 2.5 VL 7B model by @PanJason in #10112
  • [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 in using aiter-mxfp4-moe by @kkHuang-amd in #10981
  • fix: make inference deterministic for large TP by @JustinTong0323 in #10930
  • Add auth to get server info by @Muqi1029 in #10751
  • Add support for topk metadata transferring for PD by @ShangmingCai in #10616
  • [PD] Extract the PP transfer layer calculate logic from Mooncake to Common backend by @ShangmingCai in #10565
  • Use jsonschema to constrain required or specific tool choice by @TJ5 in #10550
  • Fix profiler by @merrymercy in #10997
  • [router][tool parser] Modify tool parser to return both normal text and tool calls (non-stream) by @CatherineSue in #10995
  • [router] basic mcp support for openai router response api by @key4ng in #10978
  • [router] fix chat template loading and tokenizer path by @slin1237 in #10999
  • Fix CI failure of TypeError: RotaryEmbedding.forward_cpu() got an unexpected keyword argument 'fused_set_kv_buffer_arg' by @yanbing-j in #11009
  • [bugfix]Add empty_context import to two_batch_overlap.py by @wejoncy in #10964
  • prepare for sglang+verl by @lbk-sys in #10555
  • [sgl-kernel] Optimize concat_mla_k kernel by @yuan-luo in #10543
  • [HiCache] bug: fix mooncake store batch set v1 by @stmatengss in #11013
  • Fix FusedSetKVBufferArg in RotaryEmbedding by @merrymercy in #11003
  • Update GLM-4.5 Model Doc by @zRzRzRzRzRzRzR in #11017
  • [router] migrate to rust python module for pythonic parser by @slin1237 in #11033
  • fix: show failed models in nightly ci by @mickqian in #10986
  • [router][tool call] Support normal content extraction before tool call (streaming) by @CatherineSue in #11038
  • [router] add harmony tool parser base structure and interface by @slin1237 in #11036
  • Unify SGL Kernel Releases by @Kangyan-Zhou in #10701
  • [1/2] Support FA4 for MHA Prefill in sgl-kernel by @lifuhuang in #10940
  • fix: check if weights are already local before downloading by @mickqian in #11015
  • [HiCacheStorage] mooncake store support page_first_direct layout by @huangtingwei9988 in #10591
  • [speculative decoding] rename lookahead to ngram by @a4zhangfei in #11010
  • Fix gemma 3 launch with transformers (AttributeError: 'TransformersForCausalLM' object has no attribute 'tp_size') by @vincentzed in #9614
  • Fix sgl-kernel benchmark dead code by @BBuf in #11022
  • [router][tool call] Improve normal content extraction and error handling (non-stream) by @CatherineSue in #11050
  • chore: upgrade cutedsl 4.2.1 by @zhyncs in #11054
  • [Ci Monitor] Auto uploaded performance data to sglang_ci_data repo by @BBuf in #10976
  • chore: upgrade sgl-kernel 0.3.13 by @zhyncs in #11056
  • [router] add n to generate sampling params by @slin1237 in #11069
  • Use more general heuristics to set the default value of --mem-fraction-static by @merrymercy in #10975
  • [router][tool call] Separate JsonParser and LlamaParser by @CatherineSue in #11073
  • Fix mem fraction static for nightly tests by @merrymercy in #11076
  • fix: fp8 mllama4 without vision modules being quantized by @mickqian in #10611
  • Use get_pooled in process_single_choice by @CatherineSue in #11079
  • [router][grpc] Add logprobs support to router by @CatherineSue in #11082
  • feat(reasoning): improve enable thinking from request by @jinmingyi1998 in #10875
  • [Profile] dump memory trace when cuda graph profile is enabled by @ch-wan in #11083
  • Remove hybrid_linear_attn attention backend and refactor attention registry by @samuellees in #10816
  • [model] added support for w8a8int8 used by neuralmagic/Qwen2-0.5B-Ins… by @DevashishLal-CB in #9642
  • Enable optional FP32 compute for LM Head by @narutolhy in #10729
  • Update CODEOWNERS for attention/ascend_backend.py by @merrymercy in #11092
  • [router] grpc router generate endpoint support by @slin1237 in #11070
  • [router][tool call] Full support for ToolChoice by @CatherineSue in #11085
  • Fix spec filter batch when target extend by @ispobock in #10991
  • [Fix] Resolve performance drop in speculative decoding aiter backend by @yichiche in #11087
  • [Auto Sync] Update fused_moe_triton_config.py (20250930) by @merrymercy in #11099
  • chore: bump sgl-kernel v0.3.14 by @FlamingoPg in #11067
  • [router][grpc-server] Fix gRPC server shutdown by @slin1237 in #11094
  • Fix eagle radix cache by @ispobock in #10846
  • [Eval] Add --repeat in run_eval by @hnyls2002 in #11101
  • [CPU] Adding Memory Capacity Acquisition Functionality by @ZailiWang in #11102
  • Fix DSR1 accuracy for flashinfer_trtllm MoE with FP8 quantization by @trevor-m in #11081
  • Support Dots.ocr model by @albaNnaksqr in #11071
  • [router][bugfix] Fix input_logprobs handling with None value and logprob_start_len = -1 by @CatherineSue in #11113
  • Feature: make PEFT adapter module format compatible by @ConnorLi96 in #11080
  • fix: KimiK2Detector Improve tool call ID parsing with regex by @JustinTong0323 in #10972
  • [router] add mcp list and mcp call in output array by @key4ng in #11112
  • Organize spec-related data structures by @hnyls2002 in #10735
  • [AMD] Add Tilelang and Fast Hadamard Transform builds to Dockerfile.rocm by @hubertlu-tw in #11114
  • [Auto Sync] Update base_grammar_backend.py, xgrammar_backen... (20250930) by @merrymercy in #11115
  • [Doc] Update multimodal language models documentation by @JustinTong0323 in #11111
  • Quick Fix: fix Qwen3-VL launch failure caused by MRotaryEmbedding arg by @yhyang201 in #10985
  • docker: x86 dev builds for hopper and blackwell by @ishandhanani in #11075
  • Refactor AMD CI. by @saienduri in #11128
  • feat: add fast_decode_plan from flashinfer, flashinfer to 0.4.0rc3 by @yyihuang in #10760
  • [HiCache]bug fix: fixed blank item in host_mem_release_queue by @zhangzuo21 in #11005
  • [Feature] Add EIC as sglang HiCache Storage backend by @mss1213 in #10271
  • [HiCache] Configurable and Dynamic Prefetch Timeout by @ykwd in #10512
  • [router] add pd service in grpc router for pd by @slin1237 in #11120
  • [router] Add multi-turn tool calling loop support for MCP integration by @key4ng in #11143
  • Fix metrics and request tracing (TimeStats) by @merrymercy in #11123
  • Remove debug print statement from scheduler output by @merrymercy in #11145
  • Introduce CPU tensor as metadata to avoid blocking GPU kernel launch by @AHEADer in #10720
  • Fix ngram spec with page size > 1 by @hnyls2002 in #11135
  • [ROCm] Reduce compilation time when using torch compile. by @sogalin in #10559
  • Fix DeepSeek chunked prefill memory issue by @fzyzcjy in #11149
  • Clean up parallel_state.py by @merrymercy in #11148
  • Tiny improve dumper by @fzyzcjy in #11132
  • Tiny fix missing alt stream in nextn layer by @fzyzcjy in #10768
  • Fuse quantize and rope in trtllm_mla MTP by @fzyzcjy in #10779
  • Tiny detect slow ranks by @fzyzcjy in #10508
  • Remove unused pack .item() in paged allocator. by @hnyls2002 in #11156
  • Support dispatch low latency by @fzyzcjy in #10263
  • Support single batch overlap by @fzyzcjy in #10422
  • [router][grpc] Support tool call parser in streaming by @CatherineSue in #11160
  • [model] Add mamba2 and Falcon-H1 support. by @ilyasch2 in #10988
  • Clean up ascend allocator by @hnyls2002 in #11152
  • fix cpp JIT compilation issue of ngram speculative decoding by @b8zhong in #10837
  • Tiny cleanup deepseek_v2.py by @fzyzcjy in #11163
  • Tiny fix ep_gather behavior different in CI by @fzyzcjy in #11130
  • Tiny remove duplicated code by @fzyzcjy in #11164
  • [proto] Add script to compile python protos by @CatherineSue in #11171
  • Unify forward output data structure by @hnyls2002 in #11124
  • [grpc] style fix for grpc compilation. by @hnyls2002 in #11175
  • Remove DP balance metadata and minimal token balance. by @hnyls2002 in #11170
  • Minor fixes for server_args, parallel_state, and test_deterministic.py by @merrymercy in #11159
  • fix: shouldn't include CUDA_ARCH 100 and 120 for cuda12.6.1 by @gongwei-130 in #11176
  • [router][grpc] Support streaming for v1/chat/completions by @CatherineSue in #11179
  • Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell by @DomBrown in #11138
  • Introduce naming convention in io_struct and base sglang io classes. by @hnyls2002 in #10133
  • [Generative Scores API] add performance tests to CICD by @vedantjh2 in #10830
  • [1/n] Enable DCA CUDA graph capture by @b8zhong in #9537
  • [Fix] Update to v0.1.5.post4 and refine HIP attention backend selection by @yichiche in #11161
  • [CI] Tee server logs to both file and stdout/stderr using PIPE by @hnyls2002 in #11185
  • fix: radix cache memory accounting by @skyzh in #10637
  • Tiny add PD disaggregation + DP attention test by @fzyzcjy in #11167
  • [router] Streaming support for MCP Tool Calls in OpenAI Router by @key4ng in #11173
  • [Feature] Option to save model weights to CPU when memory saver mode is enabled by @mattnappo in #10873
  • Add --thinking-mode to run_eval by @hlu1 in #11189
  • [hot-fix] Fix CI break caused by adding thinking_mode in eval by @hnyls2002 in #11192
  • Tiny move files to utils folder by @fzyzcjy in #11166
  • Fix CUDA illegal memory access issues in speculative decoding by @ur4t in #10892
  • [test] Fix SGLANG_TORCH_PROFILER_DIR env handling for pytest. by @singhalshubham03 in #10780
  • Optimize debug log position of PD abort request by @ShangmingCai in #11090
  • fix 3fs indices by @pansicheng in #10855
  • model: support starcoder2 by @ppraneth in #10609
  • [Test] Initialize mem_fraction_static in setUpClass to fix pytest VLM test crashes. by @vshekhawat-hlab in #10859
  • fix xeon ci check by @DiweiSun in #10838
  • fix qwen2 eagle3 runtime error by @jiapingW in #10517
  • [minor] fix the lint by @hnyls2002 in #11198
  • [Fix] Fix base_gpu_id (DP offset) calculation in data_parallel_controller.py by @XSongQ in #10741
  • [fix] missing prefix_lens_cpu init in P/D disaggregation by @HanHan009527 in #11196
  • fix self.enable_kv_cache_events by @narutolhy in #11178
  • [HiCache] Refactor HiCache CI by @hzh0425 in #11011
  • fix sampling_seed handling when deterministic is enabled by @skyzh in #11096
  • [fix] enable flashmla when using draft model P/D attention select by @HanHan009527 in #11012
  • [router] fix get load response parsing by @slin1237 in #11213
  • [router] add grpc router pd mode for chat and generate by @slin1237 in #11140
  • EAGLE cache fix for HiCache by @ispobock in #11215
  • Add --max-new-tokens CLI flag for MMMU evaluation by @yhyang201 in #11217
  • Add DeepSeek-V3.2 Tool Call Template by @Xu-Wenqing in #11063
  • Tiny skip_sample adjust by @hnyls2002 in #11225
  • [Feature] Add a fast-topk to sgl-kernel for DeepSeek v3.2 by @DarkSharpness in #11194
  • Update v1/responses to be more OpenAI-compatible. by @vincentzed in #9624
  • chore: bump sgl-kernel v0.3.14.post1 by @FlamingoPg in #11137
  • Update DeepGEMM repository tag to specific commit by @merrymercy in #11229
  • [Feat] Support Torch Symm Mem AllReduce by @yuan-luo in #10571
  • Refactor and optimize mooncake CI by @ShangmingCai in #11162
  • [Fix AMD CI] VRAM cleanup by @sunxxuns in #11174
  • Update transformers package version to 4.57.0 by @JustinTong0323 in #11222
  • Remove gdrcopy check in ci_install_deepep.sh by @ch-wan in #11237
  • Rename runner labels by @merrymercy in #11228
  • [Auto Sync] Update io_struct.py (20251004) by @merrymercy in #11206
  • Create two new GH workflows to automatically bump SGLang and Kernel version by @Kangyan-Zhou in #10996
  • Fix spec_utils.py by @sglang-bot in #11247
  • ci: make find_local_hf_snapshot_dir more robust by @mickqian in #11248
  • [quantization] Fix scale remapping for mllama4 by @BowenBao in #10042
  • [quantization] Enable aiter mxfp4 fused_moe for Quark by @BowenBao in #10048
  • Use cu128 for torch audio to fix some CI tests by @merrymercy in #11251
  • Bump torch_memory_saver 0.0.9rc2 by @fzyzcjy in #11252
  • update sgl kernel version to 0.3.14.post1 by @merrymercy in #11242
  • Update condition for sgl-kernel-benchmark-test by @merrymercy in #11254
  • feat: add shortcut detection for multimodal templates in Jinja format by @JustinTong0323 in #11209
  • Improve bot release workflow by @Kangyan-Zhou in #11240
  • Add flashmla and fast hadamard transform to Dockerfile by @Fridge003 in #11235
  • Support DeepSeek V3.2 Exp by @fzyzcjy in #11061
  • chore: bump SGLang version to 0.5.3rc2 by @sglang-bot in #11259
  • chore: bump SGLang version to 0.5.3 by @sglang-bot in #11263
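The release closes with version bumps from 0.5.3rc2 (#11259) to 0.5.3 (#11263). When pinning an upgrade against such tags, a minimal sketch for comparing release strings (the `parse_version` helper below is illustrative, not part of SGLang; pre-release suffixes like `rc2` are ignored):

```python
def parse_version(v: str) -> tuple:
    # Split "0.5.3" into (0, 5, 3); a trailing suffix such as "rc2"
    # in "0.5.3rc2" is dropped, so rc builds compare equal to the final tag.
    parts = []
    for segment in v.split("."):
        digits = ""
        for ch in segment:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

print(parse_version("0.5.3"))                            # (0, 5, 3)
print(parse_version("0.5.3") > parse_version("0.5.2"))   # True
```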

New Contributors

  • @chenqianfzh made their first contribution in #10356
  • @yonghenglh6 made their first contribution in #8778
  • @amysaq2023 made their first contribution in #8215
  • @lj970926 made their first contribution in #10410
  • @thalahors made their first contribution in #10076
  • @sufeng-buaa made their first contribution in #9962
  • @tao12345666333 made their first contribution in #10129
  • @ooapex made their first contribution in #10401
  • @Jimmy-L99 made their first contribution in #10439
  • @1195343015 made their first contribution in #10471
  • @csahithi made their first contribution in #9887
  • @scut-cbq made their first contribution in #8863
  • @brayden-hai made their first contribution in #10336
  • @PrinsYin made their first contribution in #10230
  • @philipkiely-baseten made their first contribution in #10528
  • @zhannngchen made their first contribution in #10523
  • @alpha-baby made their first contribution in #8274
  • @yukiy927 made their first contribution in #9947
  • @a4zhangfei made their first contribution in #9873
  • @leihuang-sketch made their first contribution in #10319
  • @fgebhart made their first contribution in #10661
  • @dmitrygx made their first contribution in #10673
  • @sglang-bot made their first contribution in #10706
  • @yushengsu-thu made their first contribution in #10694
  • @reyoung made their first contribution in #10731
  • @feng397 made their first contribution in #10730
  • @LorrinWWW made their first contribution in #10758
  • @vedantjh2 made their first contribution in #10755
  • @zju-stu-lizheng made their first contribution in #10323
  • @ZhengHSI made their first contribution in #10504
  • @GuoweiWangU made their first contribution in #10860
  • @lun-4 made their first contribution in #10827
  • @eraser00 made their first contribution in #10612
  • @TJ5 made their first contribution in #10550
  • @wejoncy made their first contribution in #10964
  • @lbk-sys made their first contribution in #10555
  • @Kangyan-Zhou made their first contribution in #10701
  • @samuellees made their first contribution in #10816
  • @albaNnaksqr made their first contribution in #11071
  • @ConnorLi96 made their first contribution in #11080
  • @zhangzuo21 made their first contribution in #11005
  • @AHEADer made their first contribution in #10720
  • @ilyasch2 made their first contribution in #10988
  • @DomBrown made their first contribution in #11138
  • @skyzh made their first contribution in #10637
  • @mattnappo made their first contribution in #10873
  • @ur4t made their first contribution in #10892
  • @singhalshubham03 made their first contribution in #10780
  • @XSongQ made their first contribution in #10741
  • @sunxxuns made their first contribution in #11174
  • @BowenBao made their first contribution in #10042

Full Changelog: v0.5.2...v0.5.3
