## Highlights
- Day 0 Support for DeepSeek-V3.2 with Sparse Attention: https://lmsys.org/blog/2025-09-29-deepseek-V32/
- Deterministic inference on multiple attention backends: https://lmsys.org/blog/2025-09-22-sglang-deterministic/
- Integration of FlashAttention 4 prefill kernels
- Enhanced support for Qwen3-Next with MTP, DP, optimized kernels, and multiple hardware platforms
- Support for new models including the Qwen3-VL series, dots.vlm1, Ling-V2, Apertus, and SOLAR
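The highlights above center on serving new models through SGLang's OpenAI-compatible endpoint. As a minimal sketch of exercising such a deployment (the model name, port, and launch flags below are illustrative assumptions, not prescribed by these notes), one might launch a server and build a `/v1/chat/completions` request body like this:

```python
import json

# Launching the server (shell command shown as a comment; flags are assumptions):
#   python -m sglang.launch_server \
#       --model-path deepseek-ai/DeepSeek-V3.2-Exp --port 30000

def chat_completion_body(model: str, prompt: str, temperature: float = 0.0) -> str:
    """Build a JSON body for the OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # temperature 0 (greedy) pairs naturally with the deterministic-inference work
        "temperature": temperature,
    }
    return json.dumps(payload)

body = chat_completion_body("deepseek-ai/DeepSeek-V3.2-Exp", "Hello")
```

The resulting `body` can be POSTed to the running server with any HTTP client.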
## What's Changed
- [Auto Sync] Update server_args.py (20250912) by @merrymercy in #10347
- [CPU][doc] add torch.compile param in example commands by @ZailiWang in #10349
- [router][ci] Add GPU utilization analysis with NVML by @key4ng in #10345
- [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked by @wenscarl in #9199
- fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale by @trevor-m in #10296
- model: support Apertus by @EduardDurech in #9774
- fix dual stream bug by @yizhang2077 in #10352
- [router] Basic OAI Response api by @key4ng in #10346
- Implement Standalone gRPC Server for SGLang Python Scheduler by @CatherineSue in #10283
- support memory_pool_host page first direct layout by @huangtingwei9988 in #10031
- fix the break in FlashInferFusedMoE by @chenqianfzh in #10356
- fix: resolve transfer_kv_all_layer_direct_lf_pf import error by @zhyncs in #10360
- Support LingV2 model by @strgrb in #10359
- Fix Bailing MoE model bugs by @yuan-luo in #10362
- Revert add mainprocess's proctitle by @whybeyoung in #10351
- model: support dots.vlm1 model by @yonghenglh6 in #8778
- Support loading weights from remote instance by @amysaq2023 in #8215
- add qwen3-next ut by @yizhang2077 in #10355
- Fix chunked prefix cache for nvfp4 by @wenscarl in #10180
- Fix FA4 import cause moe_fused_gate output be illegal memory by @fzyzcjy in #10368
- Fix global input scale incompatible with CuTe DSL moe by @fzyzcjy in #10370
- [router] Add Rerank Routing Logic in Regular Router by @fangjian601 in #10219
- [router] enable sccache in ci and local build by @slin1237 in #10099
- fix: add fast path for function call by @yizhang2077 in #9023
- [Auto Sync] Update base_grammar_backend.py, llguidance_back... (20250911) by @merrymercy in #10333
- fix: resolve gb200 image link by @zhyncs in #10343
- fix: exclude protobuf generated code by @zhyncs in #10388
- [bug] fix ci syntax by @slin1237 in #10390
- Fix GPU fault issue when running dsv3 with dp mode and torch-compile enabled by @kkHuang-amd in #10361
- feat: add deepseek v3 fp4 ut by @zhyncs in #10391
- Add sentencepiece to project dependencies by @mmangkad in #10386
- [router] allow one router to support different model families and serving mode by @slin1237 in #10244
- [router] Add get and cancel method for response api by @key4ng in #10387
- Benchmark: Support API_KEY without 'bearer' by @Muqi1029 in #10380
- Support Qwen3-Next on Ascend NPU by @iforgetmyname in #10379
- [HiCache] fix mooncake config in different tp size by @stmatengss in #10377
- [HiCache] doc: update deployment in readme by @stmatengss in #10332
- [router] add not implemented functions for multi model trait by @slin1237 in #10394
- [Auto Sync] Update xgrammar_backend.py (20250913) by @merrymercy in #10395
- fix naming of probs without temperature scaling by @narutolhy in #9984
- Fix the style of sgl kernel by @merrymercy in #10398
- fix: tool parse in large streaming chunk beginning with normal content by @JustinTong0323 in #10397
- [Fix] Init mamba related memory pools with torch.zeros by @byjiang1996 in #10400
- support qwen3_next blackwell by @yizhang2077 in #10403
- [Fix] Support qwen3-next MTP+DP by @byjiang1996 in #10392
- Update ROCm docker image to add sgl-router support by @kkHuang-amd in #10406
- [Performance] Dynamic Batch Tokenizer by @sundar24295s in #9382
- [Generative Score API] Scoring(Prefill-only) optimizations. by @sundar24295s in #9748
- Remove repeated list additions in `init_incremental_detokenization` by @hnyls2002 in #10412
- [Hack] Add pd-disaggregation decode polling interval by @hnyls2002 in #10411
- fix duplicated logger in eager_utils by @lj970926 in #10410
- Fix cutlass moe accuracy drop caused by attention UB from DP padding mode by @fzyzcjy in #10414
- Add self.capture_aux_hidden_states For GLM-4.5V by @zRzRzRzRzRzRzR in #10228
- Add h200 fused moe config for Qwen3-Next by @Ximingwang-09 in #10404
- Auto determine sgl kernel version in blackwell CI by @fzyzcjy in #10318
- Fix the global scale fix does not support EPLB and improve enabling condition by @fzyzcjy in #10369
- Let sgl-kernel changes be tested on srt by @fzyzcjy in #10313
- [2/2] Speed up prefill mla attention concat by @fzyzcjy in #10157
- Support offloading in fp8 by @fzyzcjy in #9948
- Support global scale in addition to per expert scale for cutedsl moe by @fzyzcjy in #10270
- Support profile args in Engine API by @fzyzcjy in #6539
- Fix sgl-kernel + srt CI by @fzyzcjy in #10419
- [PD metrics] Fix some uncompleted PD related metrics by @acelyc111 in #8627
- Typo: in `--enable-custom-logit-processor`: agree with cli arg by @thalahors in #10076
- [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 by @sufeng-buaa in #9962
- fix: use latest flashinfer by @zhyncs in #10428
- fix: enable cu124 and cu128 build on main push by @zhyncs in #10431
- [Fix] MoE: fix w8a8_fp8 MoE and add tests to cover this code path by @ch-wan in #10429
- Add split tile size for Triton attention by @ispobock in #10425
- Fix correction bias undefined behavior for nvfp4 models by @fzyzcjy in #10426
- feat: add dsv3 fp4 cutlass moe etp ut by @zhyncs in #10433
- router: Add Embedding routing logic by @tao12345666333 in #10129
- Revert "Fix FA4 import cause moe_fused_gate output be illegal memory" by @fzyzcjy in #10432
- [4/N] DP refactor: support watching mode `get_load` and shortest queue strategy by @hnyls2002 in #10201
- automatically label pr for ci by @merrymercy in #10435
- Refactor TopK to ensure readability and extensibility by @ch-wan in #9338
- Tiny fix wrong naming by @fzyzcjy in #10437
- Fix label pr for ci by @merrymercy in #10441
- metrics: support customer labels specified in request header by @acelyc111 in #10143
- [docs / oneliner] update mmmu docs instruction by @vincentzed in #9768
- Add reasoning examples for GPT-OSS in Markdown examples by @vincentzed in #9626
- Fix label PR by @merrymercy in #10445
- Update permissions in label-pr.yml by @merrymercy in #10450
- [CI] Fix token key in label-pr.yml workflow by @merrymercy in #10452
- fix: fix max_new_tokens uninitialized error by @mickqian in #9343
- [router] fix service discovery and mcp ut by @slin1237 in #10449
- fix(server_args): Skip chunked_prefill_size validation when disaggregation mode is decode by @jinmingyi1998 in #10358
- [router] add dependency for router by @ooapex in #10401
- [router] fix logger ordering git ctx by @CatherineSue in #10457
- Update GITHUB_TOKEN secret for documentation push by @merrymercy in #10458
- [HotFix]: Hot fix import path in 3fs_bench_client.py by @hzh0425 in #10463
- Add rtx5880 moe triton by @Jimmy-L99 in #10439
- Run tests based on labels by @merrymercy in #10456
- Fix trtllm_moe wrong correction bias by @fzyzcjy in #10440
- feat: support pip install sglang by @zhyncs in #10465
- chore: bump v0.5.3rc0 by @zhyncs in #10468
- [router] minor code clean up in server startup by @CatherineSue in #10470
- [bugfix] fix typo by @1195343015 in #10471
- [PD metrics] Add latency Histogram metrics of each stage for generate requests by @acelyc111 in #8710
- [CI] Fix runner for sgl-kernel by @csahithi in #9887
- fix(internvl): fix accuracy issue of normalization by @KEVINTUAN12 in #10375
- fix: gpt-oss streaming dropping normal content when tools are provided but not used by @jonaslsaa in #9657
- model: support solar by @ppraneth in #8189
- fix: resolve sgl-kernel ut by @zhyncs in #10476
- [1/2] Speed up trtllm_mla attention backend (>10% e2e) by @fzyzcjy in #10473
- Fix `--dataset-path` in `bench_one_batch_server` by @hnyls2002 in #10475
- [Env] minimal version for organizing envs by @hnyls2002 in #10479
- chore: bump v0.3.10 sgl-kernel by @zhyncs in #10478
- [router] multi model registration fix by @CatherineSue in #10481
- [2/4] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance by @lifuhuang in #10286
- [Auto Sync] Update registry.py (20250915) by @merrymercy in #10484
- [router] fix worker registration in multi model mode by @CatherineSue in #10486
- fix crash of DeepSeek-V3 update_weights_from_disk by @scut-cbq in #8863
- Temporary work-around for the data-parallel enablement issue on rocm 7.0.0 alpha by @kkHuang-amd in #10434
- [Hicache] Evaluate Per-Round Metrics in Multiturn Bench by @ykwd in #10203
- [ModelOpt] Respect `kv_cache_quant_algo` in ModelOpt checkpoints by @brayden-hai in #10336
- Add Logprobs unit test with a loose threshold by @PrinsYin in #10230
- [router] add router db connector for responses api by @slin1237 in #10487
- Remove wrong imports `from sglang.python` by @hnyls2002 in #10493
- [router] fix router manager and router init in server by @CatherineSue in #10499
- Cache the result of `is_blackwell` platform check by @b8zhong in #10498
- feat: update support for qwen3next model by @cao1zhg in #10466
- Minor fix lint introduced by #10466 by @ShangmingCai in #10507
- chore: upgrade sgl-kernel 0.3.10 by @zhyncs in #10500
- Update CUTLASS. Refine KernelSchedule for fp8 (grouped) gemm. by @HydraQYH in #10491
- Fix CI when sgl-kernel is changed but srt is not changed by @fzyzcjy in #10515
- Support sgl-router parallel_batch in bench_one_batch_server by @fzyzcjy in #10506
- [CPU] fix CPU backend sel. issue for Llama4 by @ZailiWang in #10511
- adjust import setuptools_rust by @whybeyoung in #10524
- Fix formatting in long code blocks by @philipkiely-baseten in #10528
- skip vision_model for lora by @gongwei-130 in #10530
- [2/2] Speed up trtllm_mla attention backend by @fzyzcjy in #10474
- support using fa4 on deepseek on blackwell by @cicirori in #9928
- [Auto Sync] Update scheduler_profiler_mixin.py, rpd_utils.p... (20250916) by @merrymercy in #10494
- [Auto Sync] Update activation.py, chunk_cache.py, utils.py (20250917) by @merrymercy in #10538
- feat: add priority based scheduling with priority based request acceptance and preemption by @harrisonlimh in #8746
- Fix decord dependency for aarch64 docker build by @kyleliang-nv in #10529
- enable prefix cache with dp by @wenscarl in #10459
- [bugfix] fix hicache bench_long_context.py run failure by @zhannngchen in #10523
- Remove duplicated code by @oraluben in #10545
- CUDA Arch Independent by @EduardDurech in #8813
- [bench] Fix random seed in `bench_one_batch_server` by @hnyls2002 in #10548
- [HiCache] Add tests for hicache storage mooncake backend by @stmatengss in #10171
- [BugFix] Fix incorrect hidden_states_tensor in pd disaggregation + eagle by @ZeldaHuang in #9976
- fix: update dsv3 fp4 ut by @zhyncs in #10584
- vlm: remove redundant d2h movement of mm feature tensors by @AlienKevin in #9987
- Enable trtllm mla prefix extend by @wenscarl in #10526
- [ROCm] Fix fp8 quantization accuracy issue. by @sogalin in #10558
- [HICache] introduce evict policy by @XucSh in #10190
- aiter v0.1.5.post2 by @HaiShaw in #10563
- [PD] Improve disaggregation common backend and refactor mooncake backend by @ShangmingCai in #10273
- chore: upgrade mooncake 0.3.6 by @ShangmingCai in #10596
- [improvement] add average input/output token length for hicache benchmark stats output by @zhannngchen in #10525
- Scale kkt after reduction by @yizhang2077 in #10604
- fix deepep assert when PD disaggregation == null by @alpha-baby in #8274
- [RL] Add destroy process group api by @penguin-wwy in #9979
- Feat/add heartbeat mechanism for nixl conn by @shaharmor98 in #10222
- update deepep version for qwen3-next deepep moe by @yizhang2077 in #10624
- support qwen3-next-fp8 deepep by @yizhang2077 in #10622
- Fix sgl_kernel import failure on devices other than CUDA by @ZailiWang in #10610
- [Performance] qwen3-next improve causal conv1d in prefill phase by @liz-badada in #10595
- Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py by @yhyang201 in #10579
- feat: Add FlexAttention Backend for Efficient Sparse Attention by @yukiy927 in #9947
- Garbage collector regression in the online server by @brayden-hai in #10621
- [router] refactor worker to builder pattern 1/n by @slin1237 in #10628
- refactor: use registry for _get_attention_backend_from_str by @zhyncs in #10629
- [Feature] Speculative decoding support lookahead by @a4zhangfei in #9873
- [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… by @byjiang1996 in #10553
- [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster by @byjiang1996 in #10586
- model support: Sarashina2VisionForCausalLM by @CatherineSue in #10632
- feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 by @zixuanzhang226 in #10631
- chore: bump sgl-kernel 0.3.11 by @zhyncs in #10630
- Hicache L3 backend mooncake optimization configuration reading method by @leihuang-sketch in #10319
- [router] refactor worker to builder pattern 2/n by @slin1237 in #10633
- [Feature]feat(get_ip): unify get_ip_xxx by @jinmingyi1998 in #10081
- [router] refactor worker to builder pattern 3/n by @slin1237 in #10647
- [1/2][sgl-kernel] Support moe_sum_reduce cuda kernel by @yuan-luo in #10321
- [router] refactor worker to builder pattern 4/n by @slin1237 in #10650
- Fix fast decode plan for flashinfer v0.4.0rc1 and upgrade sgl-kernel 0.3.11 by @Fridge003 in #10634
- [router] refactor worker to builder pattern 5/n by @slin1237 in #10653
- [HiCacheStorage]support page_first_direct layout for generic set&get by @huangtingwei9988 in #10522
- [router] preserve order of json params using preserve_order feature by @fgebhart in #10661
- [router] refactor router and worker management 1/n by @slin1237 in #10664
- fix: resolve sync issue by @zhyncs in #10668
- [Auto Sync] Update .clang-format (20250919) by @zhyncs in #10670
- [router] refactor router and worker management 2/n by @slin1237 in #10666
- router-spec: Reorder `ChatCompletionRequest` and fix validation logic by @CatherineSue in #10675
- chore: cleanup docker image by @zhyncs in #10671
- limit sgl-kernel causal conv1d to cuda only by @liz-badada in #10648
- [Auto Sync] Update model_runner.py (20250920) by @zhyncs in #10679
- [router] refactor router and worker management 2.5/n by @slin1237 in #10677
- [1/2] Support deterministic inference with flashinfer attention backend by @Fridge003 in #10645
- [Auto Sync] Update deepseek_v2.py (20250920) by @zhyncs in #10683
- chore: upgrade mooncake 0.3.6.post1 to fix gb200 dockerfile by @ShangmingCai in #10681
- [Performance] Qwen3-Next: optimize causal_conv1d_fn triton kernel - up to 9% faster by @byjiang1996 in #10680
- Replace os.environ in layernorm.py by @Fridge003 in #10684
- fix(disagg): fix sending KV cache in case of MLA for NIXL backend by @dmitrygx in #10673
- fix: update run_suite by @zhyncs in #10685
- fix: remove awq_dequantize deps by @zhyncs in #10686
- [Auto Sync] Update modelopt_quant.py (20250920) by @zhyncs in #10688
- [Feature] Support deterministic inference with FA3 backend by @hebiao064 in #10651
- feat: update server args by @zhyncs in #10696
- Super tiny fix extra logs by @fzyzcjy in #10697
- [3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization by @lifuhuang in #10592
- Update release-docs.yml by @sglang-bot in #10706
- Refactors radix cache for extra key support by @JustinTong0323 in #10317
- [Router]fix: fix get_load missing api_key by @jinmingyi1998 in #10385
- fix: disable gpt-oss b200 ut by @zhyncs in #10716
- Optimize cutlass int8 gemm kernel for large M on SM89 Ada GPU by @HydraQYH in #10714
- [Auto Sync] Update deepseek_v2.py (20250922) by @zhyncs in #10717
- Support deterministic inference with triton backend (Hardware test: NV and AMD GPUs) by @yushengsu-thu in #10694
- [deterministic inference] Move batch invariant pkg to sglang by @hebiao064 in #10695
- [2/2] Support deterministic inference for temperature > 0 by @Qiaolin-Yu in #10678
- [Ascend] codeowner updates for ascend related files by @ping1jing2 in #10699
- [4/4] Introduce CachedKernel to reduce CSGMV kernel launch overheads by 60% by @lifuhuang in #10709
- Convert FLASHINFER_WORKSPACE_SIZE to integer by @reyoung in #10731
- EPLB: prefer to use physical experts in the same node by @acelyc111 in #9849
- fix capture_bs when speculative decoding enabled by @feng397 in #10730
- Fix flaky logprobs test by @ShangmingCai in #10728
- Fix CI TestChunkedSGMV by @lifuhuang in #10737
- [Docs, minor] Fix LLM doc matrix by @adarshxs in #10753
- Add warnings and remove dependency for deterministic inference by @Fridge003 in #10724
- bugfix: Fix `get_worker_urls_for_model` in http/router.rs by @CatherineSue in #10754
- [router] refactor router and worker management 3/n by @slin1237 in #10727
- [router] update ci so only execute benchmarks when labels are added by @slin1237 in #10757
- Fix MTP MoE weight loading with NVFP4 target model. by @LorrinWWW in #10758
- chore: bump sgl-kernel v0.3.12 by @zhyncs in #10732
- [Generative Score API] Added test_scores_api.py to github CICD to run per commit by @vedantjh2 in #10755
- refactor zero copy by @pansicheng in #10300
- Fix multimodal registry and code sync scripts by @merrymercy in #10759
- Enables TRT-LLM backend to be used for target_verify by @pranavm-nvidia in #10281
- fix: kv events with tp > 1 by @ishandhanani in #10541
- [Auto Sync] Update flashattention_backend.py (20250922) by @zhyncs in #10762
- [Feature] Add MLAProcess for DeepSeek MLA on NPU by @iforgetmyname in #10130
- [Ascend] optimize Qwen-vl on Ascend by @ping1jing2 in #10556
- [Ascend]optimize Qwen3 on Ascend by @ping1jing2 in #10574
- [Auto Sync] Update configurer.py (20250923) by @merrymercy in #10765
- [router] refactor router and worker management 4/n by @slin1237 in #10756
- [router] remove pd router draining channel by @slin1237 in #10767
- [router] fix logger type mismatch by @CatherineSue in #10774
- Use simulate acc len from `sglang.environ` by @hnyls2002 in #10771
- Fix trtllm_mla slow concat kernel in MTP by @fzyzcjy in #10777
- Move cached kernel to srt.utils by @lifuhuang in #10776
- feat: unify dockerfiles by @ishandhanani in #10705
- Introduce `FutureMap` by @hnyls2002 in #10715
- chore: upgrade sgl-kernel 0.3.12 by @zhyncs in #10782
- followup: clean up dockerfiles and release yamls by @ishandhanani in #10783
- Clean up server args by @merrymercy in #10770
- Move `environ` into `sglang.srt` to avoid breaking SRT auto sync by @hnyls2002 in #10791
- Fix hicache mooncake backend CI by @ShangmingCai in #10792
- [router] fix cache aware lock by @slin1237 in #10773
- [router] responses api POST and GET with local storage by @slin1237 in #10581
- model: support qwen3-vl series by @zju-stu-lizheng in #10323
- [fix][pd-disag]no need set next batch sampling info done in prefill by @jinmingyi1998 in #10259
- [ROCm] Update aiter to v0.1.5.post3 by @sogalin in #10812
- [router] use dashmap for radix tree instead of hash for multi model by @slin1237 in #10814
- router(grpc): Implement route for chat_cmpl endpoint by @CatherineSue in #10761
- fix ceval by @ZhengHSI in #10504
- Remove duplicate code in qwen2 model by @Lzhang-hub in #10540
- [router] fix axum default body limit by @CatherineSue in #10818
- Fix latest main ci by @ShangmingCai in #10799
- add tuning files for QWEN-3-NEXT by @yiakwy-xpu-ml-framework-team in #10794
- [Auto Sync] Update protocol.py (20250923) by @zhyncs in #10820
- fix: draft model IMA by overriding `max_positional_embeddings` by @JustinTong0323 in #10787
- [Auto Sync] Update elementwise.py (20250923) by @merrymercy in #10823
- [Auto Sync] Update simple_eval_common.py (20250923) by @merrymercy in #10824
- [router] Support streaming for Openai Router Response api by @key4ng in #10822
- [router] add auth middleware for api key auth by @CatherineSue in #10826
- [Auto Sync] Update load_config.py, model_config.py, configu... (20250923) by @merrymercy in #10825
- Revert "[fix][pd-disag]no need set next batch sampling info done in prefill" by @merrymercy in #10828
- Add CI timeout guidelines by @merrymercy in #10829
- feat: add cache_salt support to request by @JustinTong0323 in #10718
- fix bailing_moe with enable_dp_attention by @GuoweiWangU in #10860
- ci: free space on workers for build by @ishandhanani in #10786
- router-grpc: Support jinja chat template content format detection by @CatherineSue in #10832
- [router] select first healthy worker on proxied get requests by @lun-4 in #10827
- chore: Initial support for input config files by @kushanam in #10534
- router-grpc: Add tools processing and other parameters for apply_chat_template by @CatherineSue in #10877
- [router] consolidate health endpoints and flush cache by @slin1237 in #10876
- Restructure sgl-kernel benchmark by @BBuf in #10861
- [Bug] Fix Issue#10215 by @yuhyao in #10572
- [router] consolidate worker get loads by @slin1237 in #10880
- [router] Support Oracle DB(ATP) Data Connector by @key4ng in #10845
- [router] simplify tokenizer dev doc by @slin1237 in #10895
- [Auto Sync] Update model_config.py (20250925) by @merrymercy in #10885
- [ci feature] add ci monitor by @BBuf in #10872
- [HiCache] Cleaning the deprecated host memory state by @xiezhq-hermann in #10778
- integrate AIBrix KVcache by @yapple in #10376
- Add fuse_moe per-channel tune by @sleepcoo in #10915
- [router] consolidate worker load monitoring by @slin1237 in #10894
- router: Fix constraint proto and `build_constraint` in grpc router by @CatherineSue in #10881
- Refactor kv_cache_scheme handling for quantization by @mmangkad in #10132
- refactor: Move `grpc/client.rs` to `grpc_client/sglang_scheduler.rs` by @CatherineSue in #10924
- fix env flashinfer by @Swipe4057 in #10910
- [minor] Remove deprecated function `get_ip` by @merrymercy in #10883
- Rename customer label -> custom label by @merrymercy in #10899
- [router] change log level to warning by @slin1237 in #10926
- [router][refactor] Clean up protobuf fields by @CatherineSue in #10923
- Replace the Kimi-K2 generated tool call idx with history tool call count by @eraser00 in #10612
- [ci] add ci-monitor workflow by @BBuf in #10898
- Remove pull_request trigger from CI monitor workflow by @merrymercy in #10932
- router: Support parallel sampling num > 1 in grpc_server and non-stream handling by @CatherineSue in #10929
- Revert "Refactor kv_cache_scheme handling for quantization (#10132)" by @zhyncs in #10935
- Update CODEOWNERS to include JustinTong0323 in FC by @JustinTong0323 in #10939
- [PD-HiCache]: Support Async Offloading KVCache In Decode Side by @hzh0425 in #10192
- CI: Fix docker manifest build by @csahithi in #10936
- [router] update owners for router components by @slin1237 in #10927
- Fuse write kv buffer into rope for qwen3 moe & bailing moe by @yuan-luo in #10749
- [router] add grpc client get and set by @slin1237 in #10955
- [router]fix code owner syntax error by @slin1237 in #10956
- [router] move grpc client from router to worker and builder by @slin1237 in #10958
- [router] add move grpc worker management from router to worker manager by @slin1237 in #10960
- [router] grpc router regular mode import cleanup by @slin1237 in #10963
- [router] remove old/outdated/useless comments by @slin1237 in #10967
- [router] remove old/outdated/useless comments across code base by @slin1237 in #10968
- ci: fix rate-limit of huggingface with hf auth login by @mickqian in #10947
- Update label field comment to indicate deprecation by @merrymercy in #10970
- Restructure gpu_memory_settings in a unified function and relax max_cuda_graph_bs by @BBuf in #10372
- ci: refactor nightly test by @mickqian in #10495
- refactor loading weights from remote instance coding format by @amysaq2023 in #10941
- [router][grpc] Add helper functions for decoder in router.rs and fix specs by @CatherineSue in #10971
- Add simple docker file for B300 by @hlu1 in #10944
- Ci monitor support performance by @BBuf in #10965
- [HiCache]: Support dynamic loading backends for hicache by @hzh0425 in #10551
- [Bugfix][Minor][Benchmark] Fix some bugs due to PR #10495 by @Muqi1029 in #10982
- [router][grpc] Support E2E non-stream chat completions by @CatherineSue in #10980
- fix: fp8 quantization failure of qwen 2.5 VL 7B model by @PanJason in #10112
- [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 in using aiter-mxfp4-moe by @kkHuang-amd in #10981
- fix: make inference deterministic for large TP by @JustinTong0323 in #10930
- Add auth to get server info by @Muqi1029 in #10751
- Add support for topk metadata transferring for PD by @ShangmingCai in #10616
- [PD] Extract the PP transfer layer calculate logic from Mooncake to Common backend by @ShangmingCai in #10565
- Use jsonschema to constrain required or specific tool choice by @TJ5 in #10550
- Fix profiler by @merrymercy in #10997
- [router][tool parser] Modify tool parser to return both normal text and tool calls (non-stream) by @CatherineSue in #10995
- [router] basic mcp support for openai router response api by @key4ng in #10978
- [router] fix chat template loading and tokenizer path by @slin1237 in #10999
- Fix CI failure of TypeError: RotaryEmbedding.forward_cpu() got an unexpected keyword argument 'fused_set_kv_buffer_arg' by @yanbing-j in #11009
- [bugfix]Add empty_context import to two_batch_overlap.py by @wejoncy in #10964
- prepare for sglang+verl by @lbk-sys in #10555
- [sgl-kernel] Optimize concat_mla_k kernel by @yuan-luo in #10543
- [HiCache] bug: fix mooncake store batch set v1 by @stmatengss in #11013
- Fix FusedSetKVBufferArg in RotaryEmbedding by @merrymercy in #11003
- Update GLM-4.5 Model Doc by @zRzRzRzRzRzRzR in #11017
- [router] migrate to rust python module for pythonic parser by @slin1237 in #11033
- fix: show failed models in nightly ci by @mickqian in #10986
- [router][tool call] Support normal content extraction before tool call (streaming) by @CatherineSue in #11038
- [router] add harmony tool parser base structure and interface by @slin1237 in #11036
- Unify SGL Kernel Releases by @Kangyan-Zhou in #10701
- [1/2] Support FA4 for MHA Prefill in sgl-kernel by @lifuhuang in #10940
- fix: check if weights are already local before downloading by @mickqian in #11015
- [HiCacheStorage] mooncake store support page_first_direct layout by @huangtingwei9988 in #10591
- [speculative decoding] rename lookahead to ngram by @a4zhangfei in #11010
- Fix gemma 3 launch with `transformers`: the error `AttributeError: 'TransformersForCausalLM' object has no attribute 'tp_size'` by @vincentzed in #9614
- Fix sgl-kernel benchmark dead code by @BBuf in #11022
- [router][tool call] Improve normal content extraction and error handling (non-stream) by @CatherineSue in #11050
- chore: upgrade cutedsl 4.2.1 by @zhyncs in #11054
- [Ci Monitor] Auto uploaded performance data to sglang_ci_data repo by @BBuf in #10976
- chore: upgrade sgl-kernel 0.3.13 by @zhyncs in #11056
- [router] add n to generate sampling params by @slin1237 in #11069
- Use more general heuristics to set the default value of --mem-fraction-static by @merrymercy in #10975
- [router][tool call] Separate `JsonParser` and `LlamaParser` by @CatherineSue in #11073
- Fix mem fraction static for nightly tests by @merrymercy in #11076
- fix: fp8 mllama4 without vision modules being quantized by @mickqian in #10611
- Use `get_pooled` in `process_single_choice` by @CatherineSue in #11079
- [router][grpc] Add logprobs support to router by @CatherineSue in #11082
- feat(reasoning): improve enable thinking from request by @jinmingyi1998 in #10875
- [Profile] dump memory trace when cuda graph profile is enabled by @ch-wan in #11083
- Remove hybrid_linear_attn attention backend and refactor attention registry by @samuellees in #10816
- [model] added support for w8a8int8 used by neuralmagic/Qwen2-0.5B-Ins… by @DevashishLal-CB in #9642
- Enable optional FP32 compute for LM Head by @narutolhy in #10729
- Update CODEOWNERS for attention/ascend_backend.py by @merrymercy in #11092
- [router] grpc router generate endpoint support by @slin1237 in #11070
- [router][tool call] Full support for ToolChoice by @CatherineSue in #11085
- Fix spec filter batch when target extend by @ispobock in #10991
- [Fix] Resolve performance drop in speculative decoding aiter backend by @yichiche in #11087
- [Auto Sync] Update fused_moe_triton_config.py (20250930) by @merrymercy in #11099
- chore: bump sgl-kernel v0.3.14 by @FlamingoPg in #11067
- [router][grpc-server] Fix gRPC server shutdown by @slin1237 in #11094
- Fix eagle radix cache by @ispobock in #10846
- [Eval] Add `--repeat` in `run_eval` by @hnyls2002 in #11101
- [CPU] Adding Memory Capacity Acquisition Functionality by @ZailiWang in #11102
- Fix DSR1 accuracy for flashinfer_trtllm MoE with FP8 quantization by @trevor-m in #11081
- Support Dots.ocr model by @albaNnaksqr in #11071
- [router][bugfix] Fix input_logprobs handling with None value and `logprob_start_len = -1` by @CatherineSue in #11113
- Feature/make PEFT adapter module format compatible by @ConnorLi96 in #11080
- fix: KimiK2Detector Improve tool call ID parsing with regex by @JustinTong0323 in #10972
- [router] add mcp list and mcp call in output array by @key4ng in #11112
- Organize spec-related data structures by @hnyls2002 in #10735
- [AMD] Add Tilelang and Fast Hadamard Transform builds to Dockerfile.rocm by @hubertlu-tw in #11114
- [Auto Sync] Update base_grammar_backend.py, xgrammar_backen... (20250930) by @merrymercy in #11115
- [Doc] Update multimodal language models documentation by @JustinTong0323 in #11111
- Quick Fix: fix Qwen3-VL launch failure caused by MRotaryEmbedding arg by @yhyang201 in #10985
- docker: x86 dev builds for hopper and blackwell by @ishandhanani in #11075
- Refactor AMD CI. by @saienduri in #11128
- feat: add fast_decode_plan from flashinfer, flashinfer to 0.4.0rc3 by @yyihuang in #10760
- [HiCache]bug fix: fixed blank item in host_mem_release_queue by @zhangzuo21 in #11005
- [Feature] Add EIC as sglang HiCache Storage backend by @mss1213 in #10271
- [HiCache] Configurable and Dynamic Prefetch Timeout by @ykwd in #10512
- [router] add pd service in grpc router for pd by @slin1237 in #11120
- [router] Add multi-turn tool calling loop support for MCP integration by @key4ng in #11143
- Fix metrics and request tracing (TimeStats) by @merrymercy in #11123
- Remove debug print statement from scheduler output by @merrymercy in #11145
- Introduce cpu tensor as metadata to avoid blocking gpu kernel launch by @AHEADer in #10720
- Fix ngram spec with page size > 1 by @hnyls2002 in #11135
- [ROCm] To reduce the compiling time when using torch compile. by @sogalin in #10559
- Fix DeepSeek chunked prefill memory issue by @fzyzcjy in #11149
- Clean up parallel_state.py by @merrymercy in #11148
- Tiny improve dumper by @fzyzcjy in #11132
- Tiny fix missing alt stream in nextn layer by @fzyzcjy in #10768
- Fuse quantize and rope in trtllm_mla MTP by @fzyzcjy in #10779
- Tiny detect slow ranks by @fzyzcjy in #10508
- Remove unused pack `.item()` in paged allocator by @hnyls2002 in #11156
- Support dispatch low latency by @fzyzcjy in #10263
- Support single batch overlap by @fzyzcjy in #10422
- [router][grpc] Support tool call parser in streaming by @CatherineSue in #11160
- [model] Add mamba2 and Falcon-H1 support. by @ilyasch2 in #10988
- Clean up ascend allocator by @hnyls2002 in #11152
- fix cpp JIT compilation issue of ngram speculative decoding by @b8zhong in #10837
- Tiny cleanup deepseek_v2.py by @fzyzcjy in #11163
- Tiny fix ep_gather behavior different in CI by @fzyzcjy in #11130
- Tiny remove duplicated code by @fzyzcjy in #11164
- [proto] Add script to compile python protos by @CatherineSue in #11171
- Unify forward output datastructure by @hnyls2002 in #11124
- [grpc] style fix for grpc compilation. by @hnyls2002 in #11175
- Remove dp balance metadata and minimal token balance. by @hnyls2002 in #11170
- Minor fixes for server_args, parallel_state, and test_deterministic.py by @merrymercy in #11159
- fix: shouldn't include CUDA_ARCH 100 and 120 for cuda12.6.1 by @gongwei-130 in #11176
- [router][grpc] Support streaming for v1/chat/completions by @CatherineSue in #11179
- Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell by @DomBrown in #11138
- Introduce naming convention in `io_struct` and base sglang io classes. by @hnyls2002 in #10133
- [Generative Scores API] add performance tests to CI/CD by @vedantjh2 in #10830
- [1/n] Enable DCA CUDA graph capture by @b8zhong in #9537
- [Fix] Update to v0.1.5.post4 and refine HIP attention backend selection by @yichiche in #11161
- [CI] Tee server logs to both file and stdout/stderr using PIPE by @hnyls2002 in #11185
- fix: radix cache memory accounting by @skyzh in #10637
- Tiny add PD disaggregation + DP attention test by @fzyzcjy in #11167
- [router] Streaming support for MCP Tool Calls in OpenAI Router by @key4ng in #11173
- [Feature] Option to save model weights to CPU when memory saver mode is enabled by @mattnappo in #10873
- Add --thinking-mode to run_eval by @hlu1 in #11189
- [hot-fix] Fix CI break caused by adding `thinking_mode` in eval by @hnyls2002 in #11192
- Tiny move files to utils folder by @fzyzcjy in #11166
- Fix CUDA illegal memory access issues in speculative decoding by @ur4t in #10892
- Fix [test]: Env:SGLANG_TORCH_PROFILER_DIR for pytest. by @singhalshubham03 in #10780
- Optimize debug log position of PD abort request by @ShangmingCai in #11090
- fix 3fs indices by @pansicheng in #10855
- model: support starcoder2 by @ppraneth in #10609
- [Test] Initialize mem_fraction_static in setUpClass to fix pytest VLM test crashes. by @vshekhawat-hlab in #10859
- fix xeon ci check by @DiweiSun in #10838
- fix qwen2 eagle3 runtime error by @jiapingW in #10517
- [minor] fix the lint by @hnyls2002 in #11198
- [Fix] Fix the calculation of base_gpu_id (dp offset) in data_parallel_controller.py by @XSongQ in #10741
- [fix] Missing prefix_lens_cpu init when p/d disaggregation by @HanHan009527 in #11196
- fix self.enable_kv_cache_events by @narutolhy in #11178
- [HiCache] Refactor HiCache CI by @hzh0425 in #11011
- fix sampling_seed handling when deterministic is enabled by @skyzh in #11096
- [fix] Enable flashmla when using draft model P/D attention select by @HanHan009527 in #11012
- [router] fix get load response parsing by @slin1237 in #11213
- [router] add grpc router pd mode for chat and generate by @slin1237 in #11140
- EAGLE cache fix for HiCache by @ispobock in #11215
- Add --max-new-tokens CLI flag for MMMU evaluation by @yhyang201 in #11217
- Add DeepSeek-V3.2 Tool Call Template by @Xu-Wenqing in #11063
- Tiny `skip_sample` adjust by @hnyls2002 in #11225
- [Feature] Add a fast-topk to sgl-kernel for DeepSeek v3.2 by @DarkSharpness in #11194
- Update `v1/responses` to be more OpenAI-compatible. by @vincentzed in #9624
- chore: bump sgl-kernel v0.3.14.post1 by @FlamingoPg in #11137
- Update DeepGEMM repository tag to specific commit by @merrymercy in #11229
- [Feat] Support Torch Symm Mem AllReduce by @yuan-luo in #10571
- Refactor and optimize mooncake CI by @ShangmingCai in #11162
- [Fix AMD CI] VRAM cleanup by @sunxxuns in #11174
- Update transformers package version to 4.57.0 by @JustinTong0323 in #11222
- Remove gdrcopy check in ci_install_deepep.sh by @ch-wan in #11237
- Rename runner labels by @merrymercy in #11228
- [Auto Sync] Update io_struct.py (20251004) by @merrymercy in #11206
- Create two new GH workflows to automatically bump SGLang and Kernel version by @Kangyan-Zhou in #10996
- Fix spec_utils.py by @sglang-bot in #11247
- ci: make find_local_hf_snapshot_dir more robust by @mickqian in #11248
- [quantization] Fix scale remapping for mllama4 by @BowenBao in #10042
- [quantization] Enable aiter mxfp4 fused_moe for Quark by @BowenBao in #10048
- Use cu128 for torch audio to fix some CI tests by @merrymercy in #11251
- Bump torch_memory_saver 0.0.9rc2 by @fzyzcjy in #11252
- update sgl kernel version to 0.3.14.post1 by @merrymercy in #11242
- Update condition for sgl-kernel-benchmark-test by @merrymercy in #11254
- feat: add shortcut detection for multimodal templates in Jinja format by @JustinTong0323 in #11209
- Improve bot release workflow by @Kangyan-Zhou in #11240
- Add flashmla and fast hadamard transform to Dockerfile by @Fridge003 in #11235
- Support DeepSeek V3.2 Exp by @fzyzcjy in #11061
- chore: bump SGLang version to 0.5.3rc2 by @sglang-bot in #11259
- chore: bump SGLang version to 0.5.3 by @sglang-bot in #11263
New Contributors
- @chenqianfzh made their first contribution in #10356
- @yonghenglh6 made their first contribution in #8778
- @amysaq2023 made their first contribution in #8215
- @lj970926 made their first contribution in #10410
- @thalahors made their first contribution in #10076
- @sufeng-buaa made their first contribution in #9962
- @tao12345666333 made their first contribution in #10129
- @ooapex made their first contribution in #10401
- @Jimmy-L99 made their first contribution in #10439
- @1195343015 made their first contribution in #10471
- @csahithi made their first contribution in #9887
- @scut-cbq made their first contribution in #8863
- @brayden-hai made their first contribution in #10336
- @PrinsYin made their first contribution in #10230
- @philipkiely-baseten made their first contribution in #10528
- @zhannngchen made their first contribution in #10523
- @alpha-baby made their first contribution in #8274
- @yukiy927 made their first contribution in #9947
- @a4zhangfei made their first contribution in #9873
- @leihuang-sketch made their first contribution in #10319
- @fgebhart made their first contribution in #10661
- @dmitrygx made their first contribution in #10673
- @sglang-bot made their first contribution in #10706
- @yushengsu-thu made their first contribution in #10694
- @reyoung made their first contribution in #10731
- @feng397 made their first contribution in #10730
- @LorrinWWW made their first contribution in #10758
- @vedantjh2 made their first contribution in #10755
- @zju-stu-lizheng made their first contribution in #10323
- @ZhengHSI made their first contribution in #10504
- @GuoweiWangU made their first contribution in #10860
- @lun-4 made their first contribution in #10827
- @eraser00 made their first contribution in #10612
- @TJ5 made their first contribution in #10550
- @wejoncy made their first contribution in #10964
- @lbk-sys made their first contribution in #10555
- @Kangyan-Zhou made their first contribution in #10701
- @samuellees made their first contribution in #10816
- @albaNnaksqr made their first contribution in #11071
- @ConnorLi96 made their first contribution in #11080
- @zhangzuo21 made their first contribution in #11005
- @AHEADer made their first contribution in #10720
- @ilyasch2 made their first contribution in #10988
- @DomBrown made their first contribution in #11138
- @skyzh made their first contribution in #10637
- @mattnappo made their first contribution in #10873
- @ur4t made their first contribution in #10892
- @singhalshubham03 made their first contribution in #10780
- @XSongQ made their first contribution in #10741
- @sunxxuns made their first contribution in #11174
- @BowenBao made their first contribution in #10042
Full Changelog: v0.5.2...v0.5.3