github sgl-project/sglang v0.5.3
Release v0.5.3

8 hours ago

What's Changed

  • [Auto Sync] Update server_args.py (20250912) by @merrymercy in #10347
  • [CPU][doc] add torch.compile param in example commands by @ZailiWang in #10349
  • [router][ci] Add GPU utilization analysis with NVML by @key4ng in #10345
  • [NVIDIA] [3/N] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked by @wenscarl in #9199
  • fix: flashinfer_cutlass_moe: Use max of global expert scales instead of local for input scale by @trevor-m in #10296
  • model: support Apertus by @EduardDurech in #9774
  • fix dual stream bug by @yizhang2077 in #10352
  • [router] Basic OAI Response api by @key4ng in #10346
  • Implement Standalone gRPC Server for SGLang Python Scheduler by @CatherineSue in #10283
  • support memory_pool_host page first direct layout by @huangtingwei9988 in #10031
  • fix the break in FlashInferFusedMoE by @chenqianfzh in #10356
  • fix: resolve transfer_kv_all_layer_direct_lf_pf import error by @zhyncs in #10360
  • Support LingV2 model by @strgrb in #10359
  • Fix Bailing MoE model bugs by @yuan-luo in #10362
  • Revert add mainprocess's proctitle by @whybeyoung in #10351
  • model: support dots.vlm1 model by @yonghenglh6 in #8778
  • Support loading weights from remote instance by @amysaq2023 in #8215
  • add qwen3-next ut by @yizhang2077 in #10355
  • Fix chunked prefix cache for nvfp4 by @wenscarl in #10180
  • Fix FA4 import cause moe_fused_gate output be illegal memory by @fzyzcjy in #10368
  • Fix global input scale incompatible with CuTe DSL moe by @fzyzcjy in #10370
  • [router] Add Rerank Routing Logic in Regular Router by @fangjian601 in #10219
  • [router] enable sccache in ci and local build by @slin1237 in #10099
  • fix: add fast path for function call by @yizhang2077 in #9023
  • [Auto Sync] Update base_grammar_backend.py, llguidance_back... (20250911) by @merrymercy in #10333
  • fix: resolve gb200 image link by @zhyncs in #10343
  • fix: exclude protobuf generated code by @zhyncs in #10388
  • [bug] fix ci syntax by @slin1237 in #10390
  • Fix GPU fault issue when running dsv3 with dp mode and torch-compile enabled by @kkHuang-amd in #10361
  • feat: add deepseek v3 fp4 ut by @zhyncs in #10391
  • Add sentencepiece to project dependencies by @mmangkad in #10386
  • [router] allow one router to support different model families and serving mode by @slin1237 in #10244
  • [router] Add get and cancel method for response api by @key4ng in #10387
  • Benchmark: Support API_KEY without 'bearer' by @Muqi1029 in #10380
  • Support Qwen3-Next on Ascend NPU by @iforgetmyname in #10379
  • [HiCache] fix mooncake config in different tp size by @stmatengss in #10377
  • [HiCache] doc: update deployment in readme by @stmatengss in #10332
  • [router] add not implemented functions for multi model trait by @slin1237 in #10394
  • [Auto Sync] Update xgrammar_backend.py (20250913) by @merrymercy in #10395
  • fix naming of probs computed without temperature scaling by @narutolhy in #9984
  • Fix the style of sgl kernel by @merrymercy in #10398
  • fix: tool parse in large streaming chunk beginning with normal content by @JustinTong0323 in #10397
  • [Fix] Init mamba related memory pools with torch.zeros by @byjiang1996 in #10400
  • support qwen3_next blackwell by @yizhang2077 in #10403
  • [Fix] Support qwen3-next MTP+DP by @byjiang1996 in #10392
  • Update ROCm docker image to add sgl-router support by @kkHuang-amd in #10406
  • [Performance] Dynamic Batch Tokenizer by @sundar24295s in #9382
  • [Generative Score API] Scoring(Prefill-only) optimizations. by @sundar24295s in #9748
  • Remove repeated list additions in init_incremental_detokenization by @hnyls2002 in #10412
  • [Hack] Add pd-disaggregation decode polling interval by @hnyls2002 in #10411
  • fix duplicated logger in eager_utils by @lj970926 in #10410
  • Fix cutlass moe accuracy drop caused by attention UB from DP padding mode by @fzyzcjy in #10414
  • Add self.capture_aux_hidden_states For GLM-4.5V by @zRzRzRzRzRzRzR in #10228
  • Add h200 fused moe config for Qwen3-Next by @Ximingwang-09 in #10404
  • Auto determine sgl kernel version in blackwell CI by @fzyzcjy in #10318
  • Fix the global scale fix does not support EPLB and improve enabling condition by @fzyzcjy in #10369
  • Let sgl-kernel changes be tested on srt by @fzyzcjy in #10313
  • [2/2] Speed up prefill mla attention concat by @fzyzcjy in #10157
  • Support offloading in fp8 by @fzyzcjy in #9948
  • Support global scale in addition to per expert scale for cutedsl moe by @fzyzcjy in #10270
  • Support profile args in Engine API by @fzyzcjy in #6539
  • Fix sgl-kernel + srt CI by @fzyzcjy in #10419
  • [PD metrics] Fix some uncompleted PD related metrics by @acelyc111 in #8627
  • Fix typo in --enable-custom-logit-processor to agree with the CLI arg by @thalahors in #10076
  • [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 1 by @sufeng-buaa in #9962
  • fix: use latest flashinfer by @zhyncs in #10428
  • fix: enable cu124 and cu128 build on main push by @zhyncs in #10431
  • [Fix] MoE: fix w8a8_fp8 MoE and add tests to cover this code path by @ch-wan in #10429
  • Add split tile size for Triton attention by @ispobock in #10425
  • Fix correction bias undefined behavior for nvfp4 models by @fzyzcjy in #10426
  • feat: add dsv3 fp4 cutlass moe etp ut by @zhyncs in #10433
  • router: Add Embedding routing logic by @tao12345666333 in #10129
  • Revert "Fix FA4 import cause moe_fused_gate output be illegal memory" by @fzyzcjy in #10432
  • [4/N]DP refactor: support watching mode get_load and shortest queue strategy by @hnyls2002 in #10201
  • automatically label pr for ci by @merrymercy in #10435
  • Refactor TopK to ensure readability and extensibility by @ch-wan in #9338
  • Tiny fix wrong naming by @fzyzcjy in #10437
  • Fix label pr for ci by @merrymercy in #10441
  • metrics: support customer labels specified in request header by @acelyc111 in #10143
  • [docs / oneliner] update mmmu docs instruction by @vincentzed in #9768
  • Add reasoning examples for GPT-OSS in Markdown examples by @vincentzed in #9626
  • Fix label PR by @merrymercy in #10445
  • Update permissions in label-pr.yml by @merrymercy in #10450
  • [CI] Fix token key in label-pr.yml workflow by @merrymercy in #10452
  • fix: fix max_new_tokens uninitialized error by @mickqian in #9343
  • [router] fix service discovery and mcp ut by @slin1237 in #10449
  • fix(server_args): Skip chunked_prefill_size validation when disaggregation mode is decode by @jinmingyi1998 in #10358
  • [router] add dependency for router by @ooapex in #10401
  • [router] fix logger ordering git ctx by @CatherineSue in #10457
  • Update GITHUB_TOKEN secret for documentation push by @merrymercy in #10458
  • [HotFix] Fix import path in 3fs_bench_client.py by @hzh0425 in #10463
  • Add rtx5880 moe triton by @Jimmy-L99 in #10439
  • Run tests based on labels by @merrymercy in #10456
  • Fix trtllm_moe wrong correction bias by @fzyzcjy in #10440
  • feat: support pip install sglang by @zhyncs in #10465
  • chore: bump v0.5.3rc0 by @zhyncs in #10468
  • [router] minor code clean up in server startup by @CatherineSue in #10470
  • [bugfix] fix typo by @1195343015 in #10471
  • [PD metrics] Add latency Histogram metrics of each stage for generate requests by @acelyc111 in #8710
  • [CI] Fix runner for sgl-kernel by @csahithi in #9887
  • fix(internvl): fix accuracy issue of normalization by @KEVINTUAN12 in #10375
  • fix: gpt-oss streaming dropping normal content when tools are provided but not used by @jonaslsaa in #9657
  • model: support solar by @ppraneth in #8189
  • fix: resolve sgl-kernel ut by @zhyncs in #10476
  • [1/2] Speed up trtllm_mla attention backend (>10% e2e) by @fzyzcjy in #10473
  • Fix --dataset-path in bench_one_batch_server by @hnyls2002 in #10475
  • [Env] minimal version for organizing envs by @hnyls2002 in #10479
  • chore: bump v0.3.10 sgl-kernel by @zhyncs in #10478
  • [router] multi model registration fix by @CatherineSue in #10481
  • [2/4] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance by @lifuhuang in #10286
  • [Auto Sync] Update registry.py (20250915) by @merrymercy in #10484
  • [router] fix worker registration in multi model mode by @CatherineSue in #10486
  • fix crash of DeepSeek-V3 update_weights_from_disk by @scut-cbq in #8863
  • Temporary work-around for the data-parallel issue on rocm 7.0.0 alpha by @kkHuang-amd in #10434
  • [Hicache] Evaluate Per-Round Metrics in Multiturn Bench by @ykwd in #10203
  • [ModelOpt] Respect kv_cache_quant_algo in ModelOpt checkpoints by @brayden-hai in #10336
  • Add Logprobs unit test with a loose threshold by @PrinsYin in #10230
  • [router] add router db connector for responses api by @slin1237 in #10487
  • Remove wrong imports from sglang.python by @hnyls2002 in #10493
  • [router] fix router manager and router init in server by @CatherineSue in #10499
  • Cache the result of is_blackwell platform check by @b8zhong in #10498
  • feat: update support for qwen3next model by @cao1zhg in #10466
  • Minor fix lint introduced by #10466 by @ShangmingCai in #10507
  • chore: upgrade sgl-kernel 0.3.10 by @zhyncs in #10500
  • Update CUTLASS. Refine KernelSchedule for fp8 (grouped) gemm. by @HydraQYH in #10491
  • Fix CI when sgl-kernel is changed but srt is not changed by @fzyzcjy in #10515
  • Support sgl-router parallel_batch in bench_one_batch_server by @fzyzcjy in #10506
  • [CPU] fix CPU backend selection issue for Llama4 by @ZailiWang in #10511
  • adjust import setuptools_rust by @whybeyoung in #10524
  • Fix formatting in long code blocks by @philipkiely-baseten in #10528
  • skip vision_model for lora by @gongwei-130 in #10530
  • [2/2] Speed up trtllm_mla attention backend by @fzyzcjy in #10474
  • support using fa4 on deepseek on blackwell by @cicirori in #9928
  • [Auto Sync] Update scheduler_profiler_mixin.py, rpd_utils.p... (20250916) by @merrymercy in #10494
  • [Auto Sync] Update activation.py, chunk_cache.py, utils.py (20250917) by @merrymercy in #10538
  • feat: add priority based scheduling with priority based request acceptance and preemption by @harrisonlimh in #8746
  • Fix decord dependency for aarch64 docker build by @kyleliang-nv in #10529
  • enable prefix cache with dp by @wenscarl in #10459
  • [bugfix] fix hicache bench_long_context.py run failure by @zhannngchen in #10523
  • Remove duplicated code by @oraluben in #10545
  • CUDA Arch Independent by @EduardDurech in #8813
  • [bench] Fix random seed in bench_one_batch_server by @hnyls2002 in #10548
  • [HiCache] Add tests for hicache storage mooncake backend by @stmatengss in #10171
  • [BugFix] Fix incorrect hidden_states_tensor in pd disaggregation + eagle by @ZeldaHuang in #9976
  • fix: update dsv3 fp4 ut by @zhyncs in #10584
  • vlm: remove redundant d2h movement of mm feature tensors by @AlienKevin in #9987
  • Enable trtllm mla prefix extend by @wenscarl in #10526
  • [ROCm] Fix fp8 quantization accuracy issue. by @sogalin in #10558
  • [HICache] introduce evict policy by @XucSh in #10190
  • aiter v0.1.5.post2 by @HaiShaw in #10563
  • [PD] Improve disaggregation common backend and refactor mooncake backend by @ShangmingCai in #10273
  • chore: upgrade mooncake 0.3.6 by @ShangmingCai in #10596
  • [improvement] add average input/output token length for hicache benchmark stats output by @zhannngchen in #10525
  • Scale kkt after reduction by @yizhang2077 in #10604
  • fix deepep assert when PD disaggregation == null by @alpha-baby in #8274
  • [RL] Add destroy process group api by @penguin-wwy in #9979
  • Feat/add heartbeat mechanism for nixl conn by @shaharmor98 in #10222
  • update deepep version for qwen3-next deepep moe by @yizhang2077 in #10624
  • support qwen3-next-fp8 deepep by @yizhang2077 in #10622
  • Fix sgl_kernel import failure on devices other than CUDA by @ZailiWang in #10610
  • [Performance] qwen3-next improve causal conv1d in prefill phase by @liz-badada in #10595
  • Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py by @yhyang201 in #10579
  • feat: Add FlexAttention Backend for Efficient Sparse Attention by @yukiy927 in #9947
  • Garbage collector regression in the online server by @brayden-hai in #10621
  • [router] refactor worker to builder pattern 1/n by @slin1237 in #10628
  • refactor: use registry for _get_attention_backend_from_str by @zhyncs in #10629
  • [Feature] Speculative decoding support lookahead by @a4zhangfei in #9873
  • [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… by @byjiang1996 in #10553
  • [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster by @byjiang1996 in #10586
  • model support: Sarashina2VisionForCausalLM by @CatherineSue in #10632
  • feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 by @zixuanzhang226 in #10631
  • chore: bump sgl-kernel 0.3.11 by @zhyncs in #10630
  • Hicache L3 backend mooncake optimization configuration reading method by @leihuang-sketch in #10319
  • [router] refactor worker to builder pattern 2/n by @slin1237 in #10633
  • [Feature] feat(get_ip): unify get_ip_xxx by @jinmingyi1998 in #10081
  • [router] refactor worker to builder pattern 3/n by @slin1237 in #10647
  • [1/2][sgl-kernel] Support moe_sum_reduce cuda kernel by @yuan-luo in #10321
  • [router] refactor worker to builder pattern 4/n by @slin1237 in #10650
  • Fix fast decode plan for flashinfer v0.4.0rc1 and upgrade sgl-kernel 0.3.11 by @Fridge003 in #10634
  • [router] refactor worker to builder pattern 5/n by @slin1237 in #10653
  • [HiCacheStorage]support page_first_direct layout for generic set&get by @huangtingwei9988 in #10522
  • [router] preserve order of json params using preserve_order feature by @fgebhart in #10661
  • [router] refactor router and worker management 1/n by @slin1237 in #10664
  • fix: resolve sync issue by @zhyncs in #10668
  • [Auto Sync] Update .clang-format (20250919) by @zhyncs in #10670
  • [router] refactor router and worker management 2/n by @slin1237 in #10666
  • router-spec: Reorder ChatCompletionRequest and fix validation logic by @CatherineSue in #10675
  • chore: cleanup docker image by @zhyncs in #10671
  • limit sgl-kernel causal conv1d to cuda only by @liz-badada in #10648
  • [Auto Sync] Update model_runner.py (20250920) by @zhyncs in #10679
  • [router] refactor router and worker management 2.5/n by @slin1237 in #10677
  • [1/2] Support deterministic inference with flashinfer attention backend by @Fridge003 in #10645
  • [Auto Sync] Update deepseek_v2.py (20250920) by @zhyncs in #10683
  • chore: upgrade mooncake 0.3.6.post1 to fix gb200 dockerfile by @ShangmingCai in #10681
  • [Performance] Qwen3-Next: optimize causal_conv1d_fn triton kernel - up to 9% faster by @byjiang1996 in #10680
  • Replace os.environ in layernorm.py by @Fridge003 in #10684
  • fix(disagg): fix sending KV cache in case of MLA for NIXL backend by @dmitrygx in #10673
  • fix: update run_suite by @zhyncs in #10685
  • fix: remove awq_dequantize deps by @zhyncs in #10686
  • [Auto Sync] Update modelopt_quant.py (20250920) by @zhyncs in #10688
  • [Feature] Support deterministic inference with FA3 backend by @hebiao064 in #10651
  • feat: update server args by @zhyncs in #10696
  • Super tiny fix extra logs by @fzyzcjy in #10697
  • [3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization by @lifuhuang in #10592
  • Update release-docs.yml by @sglang-bot in #10706
  • Refactors radix cache for extra key support by @JustinTong0323 in #10317
  • [Router]fix: fix get_load missing api_key by @jinmingyi1998 in #10385
  • fix: disable gpt-oss b200 ut by @zhyncs in #10716
  • Optimize cutlass int8 gemm kernel for large M on SM89 Ada GPU by @HydraQYH in #10714
  • [Auto Sync] Update deepseek_v2.py (20250922) by @zhyncs in #10717
  • Support deterministic inference with triton backend (Hardware test: NV and AMD GPUs) by @yushengsu-thu in #10694
  • [deterministic inference] Move batch invariant pkg to sglang by @hebiao064 in #10695
  • [2/2] Support deterministic inference for temperature > 0 by @Qiaolin-Yu in #10678
  • [Ascend] codeowner updates for ascend related files by @ping1jing2 in #10699
  • [4/4] Introduce CachedKernel to reduce CSGMV kernel launch overheads by 60% by @lifuhuang in #10709
  • Convert FLASHINFER_WORKSPACE_SIZE to integer by @reyoung in #10731
  • EPLB: prefer to use physical experts in the same node by @acelyc111 in #9849
  • fix capture_bs when speculative decoding enabled by @feng397 in #10730
  • Fix flaky logprobs test by @ShangmingCai in #10728
  • Fix CI TestChunkedSGMV by @lifuhuang in #10737
  • [Docs, minor] Fix LLM doc matrix by @adarshxs in #10753
  • Add warnings and remove dependency for deterministic inference by @Fridge003 in #10724
  • bugfix: Fix get_worker_urls_for_model in http/router.rs by @CatherineSue in #10754
  • [router] refactor router and worker management 3/n by @slin1237 in #10727
  • [router] update ci so only execute benchmarks when labels are added by @slin1237 in #10757
  • Fix MTP MoE weight loading with NVFP4 target model. by @LorrinWWW in #10758
  • chore: bump sgl-kernel v0.3.12 by @zhyncs in #10732
  • [Generative Score API] Added test_scores_api.py to github CICD to run per commit by @vedantjh2 in #10755
  • refactor zero copy by @pansicheng in #10300
  • Fix multimodal registry and code sync scripts by @merrymercy in #10759
  • Enables TRT-LLM backend to be used for target_verify by @pranavm-nvidia in #10281
  • fix: kv events with tp > 1 by @ishandhanani in #10541
  • [Auto Sync] Update flashattention_backend.py (20250922) by @zhyncs in #10762
  • [Feature] Add MLAProcess for DeepSeek MLA on NPU by @iforgetmyname in #10130
  • [Ascend] optimize Qwen-vl on Ascend by @ping1jing2 in #10556
  • [Ascend]optimize Qwen3 on Ascend by @ping1jing2 in #10574
  • [Auto Sync] Update configurer.py (20250923) by @merrymercy in #10765
  • [router] refactor router and worker management 4/n by @slin1237 in #10756
  • [router] remove pd router draining channel by @slin1237 in #10767
  • [router] fix logger type mismatch by @CatherineSue in #10774
  • Use simulate acc len from sglang.environ by @hnyls2002 in #10771
  • Fix trtllm_mla slow concat kernel in MTP by @fzyzcjy in #10777
  • Move cached kernel to srt.utils by @lifuhuang in #10776
  • feat: unify dockerfiles by @ishandhanani in #10705
  • Introduce FutureMap by @hnyls2002 in #10715
  • chore: upgrade sgl-kernel 0.3.12 by @zhyncs in #10782
  • followup: clean up dockerfiles and release yamls by @ishandhanani in #10783
  • Clean up server args by @merrymercy in #10770
  • move environ into sglang.srt to avoid break SRT auto sync. by @hnyls2002 in #10791
  • Fix hicache mooncake backend CI by @ShangmingCai in #10792
  • [router] fix cache aware lock by @slin1237 in #10773
  • [router] responses api POST and GET with local storage by @slin1237 in #10581
  • model: support qwen3-vl series by @zju-stu-lizheng in #10323
  • [fix][pd-disag]no need set next batch sampling info done in prefill by @jinmingyi1998 in #10259
  • [ROCm] Update aiter to v0.1.5.post3 by @sogalin in #10812
  • [router] use dashmap for radix tree instead of hash for multi model by @slin1237 in #10814
  • router(grpc): Implement route for chat_cmpl endpoint by @CatherineSue in #10761
  • fix ceval by @ZhengHSI in #10504
  • Remove duplicate code in qwen2 model by @Lzhang-hub in #10540
  • [router] fix axum default body limit by @CatherineSue in #10818
  • Fix latest main ci by @ShangmingCai in #10799
  • add tuning files for QWEN-3-NEXT by @yiakwy-xpu-ml-framework-team in #10794
  • [Auto Sync] Update protocol.py (20250923) by @zhyncs in #10820
  • fix: draft model IMA by overriding max_positional_embeddings by @JustinTong0323 in #10787
  • [Auto Sync] Update elementwise.py (20250923) by @merrymercy in #10823
  • [Auto Sync] Update simple_eval_common.py (20250923) by @merrymercy in #10824
  • [router] Support streaming for Openai Router Response api by @key4ng in #10822
  • [router] add auth middleware for api key auth by @CatherineSue in #10826
  • [Auto Sync] Update load_config.py, model_config.py, configu... (20250923) by @merrymercy in #10825
  • Revert "[fix][pd-disag]no need set next batch sampling info done in prefill" by @merrymercy in #10828
  • Add CI timeout guidelines by @merrymercy in #10829
  • feat: add cache_salt support to request by @JustinTong0323 in #10718
  • fix bailing_moe with enable_dp_attention by @GuoweiWangU in #10860
  • ci: free space on workers for build by @ishandhanani in #10786
  • router-grpc: Support jinja chat template content format detection by @CatherineSue in #10832
  • [router] select first healthy worker on proxied get requests by @lun-4 in #10827
  • chore: Initial support for input config files by @kushanam in #10534
  • router-grpc: Add tools processing and other parameters for apply_chat_template by @CatherineSue in #10877
  • [router] consolidate health endpoints and flush cache by @slin1237 in #10876
  • Restructure sgl-kernel benchmark by @BBuf in #10861
  • [Bug] Fix Issue#10215 by @yuhyao in #10572
  • [router] consolidate worker get loads by @slin1237 in #10880
  • [router] Support Oracle DB(ATP) Data Connector by @key4ng in #10845
  • [router] simplify tokenizer dev doc by @slin1237 in #10895
  • [Auto Sync] Update model_config.py (20250925) by @merrymercy in #10885
  • [ci feature] add ci monitor by @BBuf in #10872
  • [HiCache] Cleaning the deprecated host memory state by @xiezhq-hermann in #10778
  • integrate AIBrix KVcache by @yapple in #10376
  • Add fuse_moe per-channel tune by @sleepcoo in #10915
  • [router] consolidate worker load monitoring by @slin1237 in #10894
  • router: Fix constraint proto and build_constraint in grpc router by @CatherineSue in #10881
  • Refactor kv_cache_scheme handling for quantization by @mmangkad in #10132
  • refactor: Move grpc/client.rs to grpc_client/sglang_scheduler.rs by @CatherineSue in #10924
  • fix env flashinfer by @Swipe4057 in #10910
  • [minor] Remove deprecated function get_ip by @merrymercy in #10883
  • Rename customer label -> custom label by @merrymercy in #10899
  • [router] change log level to warning by @slin1237 in #10926
  • [router][refactor] Clean up protobuf fields by @CatherineSue in #10923
  • Replace the Kimi-K2 generated tool call idx with history tool call count by @eraser00 in #10612
  • [ci] add ci-monitor workflow by @BBuf in #10898
  • Remove pull_request trigger from CI monitor workflow by @merrymercy in #10932
  • router: Support parallel sampling num > 1 in grpc_server and non-stream handling by @CatherineSue in #10929
  • Revert "Refactor kv_cache_scheme handling for quantization (#10132)" by @zhyncs in #10935
  • Update CODEOWNERS to include JustinTong0323 in FC by @JustinTong0323 in #10939
  • [PD-HiCache]: Support Async Offloading KVCache In Decode Side by @hzh0425 in #10192
  • CI: Fix docker manifest build by @csahithi in #10936
  • [router] update owners for router components by @slin1237 in #10927
  • Fuse write kv buffer into rope for qwen3 moe & bailing moe by @yuan-luo in #10749
  • [router] add grpc client get and set by @slin1237 in #10955
  • [router]fix code owner syntax error by @slin1237 in #10956
  • [router] move grpc client from router to worker and builder by @slin1237 in #10958
  • [router] move grpc worker management from router to worker manager by @slin1237 in #10960
  • [router] grpc router regular mode import cleanup by @slin1237 in #10963
  • [router] remove old/outdated/useless comments by @slin1237 in #10967
  • [router] remove old/outdated/useless comments across code base by @slin1237 in #10968
  • ci: fix rate-limit of huggingface with hf auth login by @mickqian in #10947
  • Update label field comment to indicate deprecation by @merrymercy in #10970
  • Restructure gpu_memory_settings into a unified function and relax max_cuda_graph_bs by @BBuf in #10372
  • ci: refactor nightly test by @mickqian in #10495
  • refactor loading weights from remote instance coding format by @amysaq2023 in #10941
  • [router][grpc] Add helper functions for decoder in router.rs and fix specs by @CatherineSue in #10971
  • Add simple docker file for B300 by @hlu1 in #10944
  • Ci monitor support performance by @BBuf in #10965
  • [HiCache]: Support dynamic loading backends for hicache by @hzh0425 in #10551
  • [Bugfix][Minor][Benchmark] Fix some bugs due to PR #10495 by @Muqi1029 in #10982
  • [router][grpc] Support E2E non-stream chat completions by @CatherineSue in #10980
  • fix: fp8 quantization failure of qwen 2.5 VL 7B model by @PanJason in #10112
  • [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 in using aiter-mxfp4-moe by @kkHuang-amd in #10981
  • fix: make inference deterministic for large TP by @JustinTong0323 in #10930
  • Add auth to get server info by @Muqi1029 in #10751
  • Add support for topk metadata transferring for PD by @ShangmingCai in #10616
  • [PD] Extract the PP transfer layer calculate logic from Mooncake to Common backend by @ShangmingCai in #10565
  • Use jsonschema to constrain required or specific tool choice by @TJ5 in #10550
  • Fix profiler by @merrymercy in #10997
  • [router][tool parser] Modify tool parser to return both normal text and tool calls (non-stream) by @CatherineSue in #10995
  • [router] basic mcp support for openai router response api by @key4ng in #10978
  • [router] fix chat template loading and tokenizer path by @slin1237 in #10999
  • Fix CI failure of TypeError: RotaryEmbedding.forward_cpu() got an unexpected keyword argument 'fused_set_kv_buffer_arg' by @yanbing-j in #11009
  • [bugfix]Add empty_context import to two_batch_overlap.py by @wejoncy in #10964
  • prepare for sglang+verl by @lbk-sys in #10555
  • [sgl-kernel] Optimize concat_mla_k kernel by @yuan-luo in #10543
  • [HiCache] bug: fix mooncake store batch set v1 by @stmatengss in #11013
  • Fix FusedSetKVBufferArg in RotaryEmbedding by @merrymercy in #11003
  • Update GLM-4.5 Model Doc by @zRzRzRzRzRzRzR in #11017
  • [router] migrate to rust python module for pythonic parser by @slin1237 in #11033
  • fix: show failed models in nightly ci by @mickqian in #10986
  • [router][tool call] Support normal content extraction before tool call (streaming) by @CatherineSue in #11038
  • [router] add harmony tool parser base structure and interface by @slin1237 in #11036
  • Unify SGL Kernel Releases by @Kangyan-Zhou in #10701
  • [1/2] Support FA4 for MHA Prefill in sgl-kernel by @lifuhuang in #10940
  • fix: check if weights are already local before downloading by @mickqian in #11015
  • [HiCacheStorage] mooncake store support page_first_direct layout by @huangtingwei9988 in #10591
  • [speculative decoding] rename lookahead to ngram by @a4zhangfei in #11010
  • Fix gemma 3 launch with transformers (AttributeError: 'TransformersForCausalLM' object has no attribute 'tp_size') by @vincentzed in #9614
  • Fix sgl-kernel benchmark dead code by @BBuf in #11022
  • [router][tool call] Improve normal content extraction and error handling (non-stream) by @CatherineSue in #11050
  • chore: upgrade cutedsl 4.2.1 by @zhyncs in #11054
  • [Ci Monitor] Auto uploaded performance data to sglang_ci_data repo by @BBuf in #10976
  • chore: upgrade sgl-kernel 0.3.13 by @zhyncs in #11056
  • [router] add n to generate sampling params by @slin1237 in #11069
  • Use more general heuristics to set the default value of --mem-fraction-static by @merrymercy in #10975
  • [router][tool call] Separate JsonParser and LlamaParser by @CatherineSue in #11073
  • Fix mem fraction static for nightly tests by @merrymercy in #11076
  • fix: fp8 mllama4 without vision modules being quantized by @mickqian in #10611
  • Use get_pooled in process_single_choice by @CatherineSue in #11079
  • [router][grpc] Add logprobs support to router by @CatherineSue in #11082
  • feat(reasoning): improve enable thinking from request by @jinmingyi1998 in #10875
  • [Profile] dump memory trace when cuda graph profile is enabled by @ch-wan in #11083
  • Remove hybrid_linear_attn attention backend and refactor attention registry by @samuellees in #10816
  • [model] added support for w8a8int8 used by neuralmagic/Qwen2-0.5B-Ins… by @DevashishLal-CB in #9642
  • Enable optional FP32 compute for LM Head by @narutolhy in #10729
  • Update CODEOWNERS for attention/ascend_backend.py by @merrymercy in #11092
  • [router] grpc router generate endpoint support by @slin1237 in #11070
  • [router][tool call] Full support for ToolChoice by @CatherineSue in #11085
  • Fix spec filter batch when target extend by @ispobock in #10991
  • [Fix] Resolve performance drop in speculative decoding aiter backend by @yichiche in #11087
  • [Auto Sync] Update fused_moe_triton_config.py (20250930) by @merrymercy in #11099
  • chore: bump sgl-kernel v0.3.14 by @FlamingoPg in #11067
  • [router][grpc-server] Fix gRPC server shutdown by @slin1237 in #11094
  • Fix eagle radix cache by @ispobock in #10846
  • [Eval] Add --repeat in run_eval by @hnyls2002 in #11101
  • [CPU] Adding Memory Capacity Acquisition Functionality by @ZailiWang in #11102
  • Fix DSR1 accuracy for flashinfer_trtllm MoE with FP8 quantization by @trevor-m in #11081
  • Support Dots.ocr model by @albaNnaksqr in #11071
  • [router][bugfix] Fix input_logprobs handling with None value and logprob_start_len = -1 by @CatherineSue in #11113
  • Feature: make PEFT adapter module format compatible by @ConnorLi96 in #11080
  • fix: KimiK2Detector Improve tool call ID parsing with regex by @JustinTong0323 in #10972
  • [router] add mcp list and mcp call in output array by @key4ng in #11112
  • Organize spec-related data structures by @hnyls2002 in #10735
  • [AMD] Add Tilelang and Fast Hadamard Transform builds to Dockerfile.rocm by @hubertlu-tw in #11114
  • [Auto Sync] Update base_grammar_backend.py, xgrammar_backen... (20250930) by @merrymercy in #11115
  • [Doc] Update multimodal language models documentation by @JustinTong0323 in #11111
  • Quick Fix: fix Qwen3-VL launch failure caused by MRotaryEmbedding arg by @yhyang201 in #10985
  • docker: x86 dev builds for hopper and blackwell by @ishandhanani in #11075
  • Refactor AMD CI. by @saienduri in #11128
  • feat: add fast_decode_plan from flashinfer, flashinfer to 0.4.0rc3 by @yyihuang in #10760
  • [HiCache]bug fix: fixed blank item in host_mem_release_queue by @zhangzuo21 in #11005
  • [Feature] Add EIC as sglang HiCache Storage backend by @mss1213 in #10271
  • [HiCache] Configurable and Dynamic Prefetch Timeout by @ykwd in #10512
  • [router] add pd service in grpc router for pd by @slin1237 in #11120
  • [router] Add multi-turn tool calling loop support for MCP integration by @key4ng in #11143
  • Fix metrics and request tracing (TimeStats) by @merrymercy in #11123
  • Remove debug print statement from scheduler output by @merrymercy in #11145
  • Introduce CPU tensor as metadata to avoid blocking GPU kernel launch by @AHEADer in #10720
  • Fix ngram spec with page size > 1 by @hnyls2002 in #11135
  • [ROCm] Reduce compilation time when using torch compile. by @sogalin in #10559
  • Fix DeepSeek chunked prefill memory issue by @fzyzcjy in #11149
  • Clean up parallel_state.py by @merrymercy in #11148
  • Tiny improve dumper by @fzyzcjy in #11132
  • Tiny fix missing alt stream in nextn layer by @fzyzcjy in #10768
  • Fuse quantize and rope in trtllm_mla MTP by @fzyzcjy in #10779
  • Tiny detect slow ranks by @fzyzcjy in #10508
  • Remove unused pack .item() in paged allocator. by @hnyls2002 in #11156
  • Support dispatch low latency by @fzyzcjy in #10263
  • Support single batch overlap by @fzyzcjy in #10422
  • [router][grpc] Support tool call parser in streaming by @CatherineSue in #11160
  • [model] Add mamba2 and Falcon-H1 support. by @ilyasch2 in #10988
  • Clean up ascend allocator by @hnyls2002 in #11152
  • fix cpp JIT compilation issue of ngram speculative decoding by @b8zhong in #10837
  • Tiny cleanup deepseek_v2.py by @fzyzcjy in #11163
  • Tiny fix ep_gather behavior different in CI by @fzyzcjy in #11130
  • Tiny remove duplicated code by @fzyzcjy in #11164
  • [proto] Add script to compile python protos by @CatherineSue in #11171
  • Unify forward output data structure by @hnyls2002 in #11124
  • [grpc] style fix for grpc compilation. by @hnyls2002 in #11175
  • Remove DP balance metadata and minimal token balance. by @hnyls2002 in #11170
  • Minor fixes for server_args, parallel_state, and test_deterministic.py by @merrymercy in #11159
  • fix: shouldn't include CUDA_ARCH 100 and 120 for cuda12.6.1 by @gongwei-130 in #11176
  • [router][grpc] Support streaming for v1/chat/completions by @CatherineSue in #11179
  • Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell by @DomBrown in #11138
  • Introduce naming convention in io_struct and base sglang io classes. by @hnyls2002 in #10133
  • [Generative Scores API] add performance tests to CICD by @vedantjh2 in #10830
  • [1/n] Enable DCA CUDA graph capture by @b8zhong in #9537
  • [Fix] Update to v0.1.5.post4 and refine HIP attention backend selection by @yichiche in #11161
  • [CI] Tee server logs to both file and stdout/stderr using PIPE by @hnyls2002 in #11185
  • fix: radix cache memory accounting by @skyzh in #10637
  • Tiny add PD disaggregation + DP attention test by @fzyzcjy in #11167
  • [router] Streaming support for MCP Tool Calls in OpenAI Router by @key4ng in #11173
  • [Feature] Option to save model weights to CPU when memory saver mode is enabled by @mattnappo in #10873
  • Add --thinking-mode to run_eval by @hlu1 in #11189
  • [hot-fix] Fix CI break caused by adding thinking_mode in eval by @hnyls2002 in #11192
  • Tiny move files to utils folder by @fzyzcjy in #11166
  • Fix CUDA illegal memory access issues in speculative decoding by @ur4t in #10892
  • [test] Fix SGLANG_TORCH_PROFILER_DIR env handling for pytest. by @singhalshubham03 in #10780
  • Optimize debug log position of PD abort request by @ShangmingCai in #11090
  • fix 3fs indices by @pansicheng in #10855
  • model: support starcoder2 by @ppraneth in #10609
  • [Test] Initialize mem_fraction_static in setUpClass to fix pytest VLM test crashes. by @vshekhawat-hlab in #10859
  • fix xeon ci check by @DiweiSun in #10838
  • fix qwen2 eagle3 runtime error by @jiapingW in #10517
  • [minor] fix the lint by @hnyls2002 in #11198
  • [Fix] Fix base_gpu_id (DP offset) calculation in data_parallel_controller.py by @XSongQ in #10741
  • [fix] missing prefix_lens_cpu init in P/D disaggregation by @HanHan009527 in #11196
  • fix self.enable_kv_cache_events by @narutolhy in #11178
  • [HiCache] Refactor HiCache CI by @hzh0425 in #11011
  • fix sampling_seed handling when deterministic is enabled by @skyzh in #11096
  • [fix] enable flashmla when using draft model P/D attention select by @HanHan009527 in #11012
  • [router] fix get load response parsing by @slin1237 in #11213
  • [router] add grpc router pd mode for chat and generate by @slin1237 in #11140
  • EAGLE cache fix for HiCache by @ispobock in #11215
  • Add --max-new-tokens CLI flag for MMMU evaluation by @yhyang201 in #11217
  • Add DeepSeek-V3.2 Tool Call Template by @Xu-Wenqing in #11063
  • Tiny skip_sample adjust by @hnyls2002 in #11225
  • [Feature] Add a fast-topk to sgl-kernel for DeepSeek v3.2 by @DarkSharpness in #11194
  • Update v1/responses to be more OpenAI-compatible. by @vincentzed in #9624
  • chore: bump sgl-kernel v0.3.14.post1 by @FlamingoPg in #11137
  • Update DeepGEMM repository tag to specific commit by @merrymercy in #11229
  • [Feat] Support Torch Symm Mem AllReduce by @yuan-luo in #10571
  • Refactor and optimize mooncake CI by @ShangmingCai in #11162
  • [Fix AMD CI] VRAM cleanup by @sunxxuns in #11174
  • Update transformers package version to 4.57.0 by @JustinTong0323 in #11222
  • Remove gdrcopy check in ci_install_deepep.sh by @ch-wan in #11237
  • Rename runner labels by @merrymercy in #11228
  • [Auto Sync] Update io_struct.py (20251004) by @merrymercy in #11206
  • Create two new GH workflows to automatically bump SGLang and Kernel version by @Kangyan-Zhou in #10996
  • Fix spec_utils.py by @sglang-bot in #11247
  • ci: make find_local_hf_snapshot_dir more robust by @mickqian in #11248
  • [quantization] Fix scale remapping for mllama4 by @BowenBao in #10042
  • [quantization] Enable aiter mxfp4 fused_moe for Quark by @BowenBao in #10048
  • Use cu128 for torch audio to fix some CI tests by @merrymercy in #11251
  • Bump torch_memory_saver 0.0.9rc2 by @fzyzcjy in #11252
  • update sgl kernel version to 0.3.14.post1 by @merrymercy in #11242
  • Update condition for sgl-kernel-benchmark-test by @merrymercy in #11254
  • feat: add shortcut detection for multimodal templates in Jinja format by @JustinTong0323 in #11209
  • Improve bot release workflow by @Kangyan-Zhou in #11240
  • Add flashmla and fast hadamard transform to Dockerfile by @Fridge003 in #11235
  • Support DeepSeek V3.2 Exp by @fzyzcjy in #11061
  • chore: bump SGLang version to 0.5.3rc2 by @sglang-bot in #11259
  • chore: bump SGLang version to 0.5.3 by @sglang-bot in #11263
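The release closes with version bumps from 0.5.3rc2 (#11259) to 0.5.3 (#11263). When pinning an upgrade against such tags, a minimal sketch for comparing release strings (the `parse_version` helper below is illustrative, not part of SGLang; pre-release suffixes like `rc2` are ignored):

```python
def parse_version(v: str) -> tuple:
    # Split "0.5.3" into (0, 5, 3); a trailing suffix such as "rc2"
    # in "0.5.3rc2" is dropped, so rc builds compare equal to the final tag.
    parts = []
    for segment in v.split("."):
        digits = ""
        for ch in segment:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

print(parse_version("0.5.3"))                            # (0, 5, 3)
print(parse_version("0.5.3") > parse_version("0.5.2"))   # True
```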

New Contributors

  • @chenqianfzh made their first contribution in #10356
  • @yonghenglh6 made their first contribution in #8778
  • @amysaq2023 made their first contribution in #8215
  • @lj970926 made their first contribution in #10410
  • @thalahors made their first contribution in #10076
  • @sufeng-buaa made their first contribution in #9962
  • @tao12345666333 made their first contribution in #10129
  • @ooapex made their first contribution in #10401
  • @Jimmy-L99 made their first contribution in #10439
  • @1195343015 made their first contribution in #10471
  • @csahithi made their first contribution in #9887
  • @scut-cbq made their first contribution in #8863
  • @brayden-hai made their first contribution in #10336
  • @PrinsYin made their first contribution in #10230
  • @philipkiely-baseten made their first contribution in #10528
  • @zhannngchen made their first contribution in #10523
  • @alpha-baby made their first contribution in #8274
  • @yukiy927 made their first contribution in #9947
  • @a4zhangfei made their first contribution in #9873
  • @leihuang-sketch made their first contribution in #10319
  • @fgebhart made their first contribution in #10661
  • @dmitrygx made their first contribution in #10673
  • @sglang-bot made their first contribution in #10706
  • @yushengsu-thu made their first contribution in #10694
  • @reyoung made their first contribution in #10731
  • @feng397 made their first contribution in #10730
  • @LorrinWWW made their first contribution in #10758
  • @vedantjh2 made their first contribution in #10755
  • @zju-stu-lizheng made their first contribution in #10323
  • @ZhengHSI made their first contribution in #10504
  • @GuoweiWangU made their first contribution in #10860
  • @lun-4 made their first contribution in #10827
  • @eraser00 made their first contribution in #10612
  • @TJ5 made their first contribution in #10550
  • @wejoncy made their first contribution in #10964
  • @lbk-sys made their first contribution in #10555
  • @Kangyan-Zhou made their first contribution in #10701
  • @samuellees made their first contribution in #10816
  • @albaNnaksqr made their first contribution in #11071
  • @ConnorLi96 made their first contribution in #11080
  • @zhangzuo21 made their first contribution in #11005
  • @AHEADer made their first contribution in #10720
  • @ilyasch2 made their first contribution in #10988
  • @DomBrown made their first contribution in #11138
  • @skyzh made their first contribution in #10637
  • @mattnappo made their first contribution in #10873
  • @ur4t made their first contribution in #10892
  • @singhalshubham03 made their first contribution in #10780
  • @XSongQ made their first contribution in #10741
  • @sunxxuns made their first contribution in #11174
  • @BowenBao made their first contribution in #10042

Full Changelog: v0.5.2...v0.5.3
