sgl-project/sglang v0.4.10

Highlights

This is a regular release with many new optimizations, features, and fixes. Please check out the accompanying roadmaps and blog posts.

What's Changed

  • [AMD] add aiter fused moe in DeepEP path by @alexsun07 in #7268
  • enable aiter_biased_grouped_topk kernel by @valarLip in #7423
  • [PD Disaggregation] replace transfer with batch transfer for better performance by @ssssnow in #7236
  • Remove cumsum_buffer initialization by @ispobock in #7439
  • [benchmark] fbgemm benchmark support bandwidth report and support fbgemm_cutlass_gmm by @BBuf in #7422
  • Support multi-thread model weight loading by @xianzhiT in #7277
  • [PD] NIXL: Register kv args in advance and cleanup finished requests by @trevor-m in #6717
  • fix: Add --model as an alias for --model-path in server_args by @CatherineSue in #7505
  • misc: Improvement to serving_chat.py and add more ut by @CatherineSue in #7489
  • Fuse sorted_token_ids padding to moe_align_block_size kernel by @ispobock in #7437
  • [OAI] patch origin request_id logic by @whybeyoung in #7508
  • [PD][Spec] Fix hidden state transfer for spec decode by @ShangmingCai in #7516
  • EPLB support for MTP by @yilian49 in #7510
  • clean duplicate code by @habaohaba in #7512
  • [ci] add router benchmark script and CI by @slin1237 in #7498
  • fix: force synchronization between TP workers when update_weights by @dangkai4u in #6626
  • [CPU] [BF16] Call fused_experts_cpu, weight_packed_linear and bmm_cpu kernel in DeepSeek model by @chunyuan-w in #6641
  • [CI] Upgrade mooncake to v0.3.4.post2 to fix potential slice failed bug by @ShangmingCai in #7522
  • npu fused op by @ll819214 in #7386
  • feat: send kvmetrics from sglang scheduler by @zixuanzhang226 in #6721
  • [PD] Add different TP sizes support for no-MLA models by @Hongbosherlock in #6793
  • enable aiter fp8 blockscale quant by @valarLip in #7520
  • take aiter get_rope back by @valarLip in #7521
  • Fix typo of flash_cache by @hebiao064 in #7513
  • feat: add return hidden_states at async generation by @yyihuang in #7507
  • minor: 'role' must be system/assistant/tool, but case insensitive for now by @minleminzui in #7499
  • Fix FP8 KV Cache Support in FA3 Backend by @guoyuhong in #7148
  • Fix gathered_buffer issues in tbo by @Qiaolin-Yu in #7531
  • [PD] Raise error for incompatible mooncake version and some minor fixes by @ShangmingCai in #7527
  • [CMake] Fix sgl-kernel CMakeLists for Blackwell by @MasterJH5574 in #7543
  • Add Tencent HunYuanMoEV1 model support by @mpjlu in #7549
  • Update seed in CPU UTs to avoid flaky failure with single test by @yanbing-j in #7544
  • chore: improve ci bug reporting by @mickqian in #7542
  • chore: remove vlm unnecessary import by @JustinTong0323 in #7541
  • chore: bump v0.4.8.post1 by @zhyncs in #7559
  • [PD][NIXL] Set is_sorted=False to fix NIXL_ERR_NOT_FOUND by @trevor-m in #7330
  • [Fix] incorrect assert in EPLB by @ch-wan in #7575
  • Updates Gemma3n MLP layer to adapt to the latest transformers version by @JustinTong0323 in #7573
  • Fix MTP error when enabling two-batch overlap by @fzyzcjy in #7569
  • Add e2e test for multi-instance multi-stage memory release/resume occupation by @MrAta in #7208
  • [CI] Add CI Testing for Prefill-Decode Disaggregation with Router by @key4ng in #7540
  • Updates transformers and timm dependencies by @JustinTong0323 in #7577
  • feat: support compatibility between MTP and two-batch-overlap by @Qiaolin-Yu in #7225
  • Move multimodal processors into a separate folder by @merrymercy in #7581
  • Fix broken CI TestVILAServer by @lifuhuang in #7610
  • [router] add centralized configuration module for sgl-router by @slin1237 in #7588
  • Fix: Minicpm by @JustinTong0323 in #7612
  • Hybrid kv cache for LLaMA4 by @tarinkk in #6563
  • [CPU] add optimizations for INT8 and FP8 DeepSeek by @chunyuan-w in #6769
  • Tiny add logs for expert location updater by @fzyzcjy in #7308
  • Fix flakiness in LoRA batch test. by @lifuhuang in #7552
  • [BUG] fix local_rank in initialize_dp_attention by @TomQuartz in #7584
  • Support dynamic LoRA loading / unloading in engine/server API by @lifuhuang in #7446 (see the example sketch after this list)
  • [PD] Respect sampling_params.max_new_tokens when PD disaggregation is activated by @ShangmingCai in #7598
  • fix unit tests by @zhyncs in #7618
  • Let ep_scatter support arbitrary strides / ue8m0 format by @fzyzcjy in #7309
  • Let EP prefill support new DeepGEMM by @fzyzcjy in #7310
  • docs: add gb200 nvl72 and a16z grant by @zhyncs in #7620
  • Adds support for OpenAI chat completions API in bench_serving by @JustinTong0323 in #7036
  • [bugfix] Remove PR comment posting from Rust benchmark workflow by @slin1237 in #7625
  • [Minor] clean up multimodal processor and tokenizer manager by @merrymercy in #7624
  • Add dsv3 fused a gemm to sgl-kernel by @ispobock in #7630
  • Add @mickqian as the CODEOWNERS of multimodal by @merrymercy in #7636
  • Fix stream reasoning parser and add Kimi reasoning parser by @JustinTong0323 in #7432
  • Fix sgl-router startup crash by @finetunej in #7619
  • [bugfix] fix runtime dropping panic in editable by @slin1237 in #7628
  • Move files related to EPLB by @fzyzcjy in #7580
  • [misc] reduce weird rope_scaling_factor warning by @Alcanderian in #7176
  • [AMD] Add unit-test-sgl-kernel-amd to AMD CI by @hubertlu-tw in #7539
  • Update CODEOWNERS by @merrymercy in #7640
  • [EAGLE] remove a wrong adjustment for page_size > 1 & topk > 1 in server_args.py by @merrymercy in #7643
  • [CPU] add c++ kernel to bind CPU cores and memory node by @chunyuan-w in #7524
  • Improve streaming, log_level, memory report, weight loading, and benchmark script by @merrymercy in #7632
  • Add dsv3 router gemm kernel by @Fridge003 in #7627
  • chore: upgrade flashinfer v0.2.7 jit by @zhyncs in #7663
  • [doc] update lws doc for pd by @whybeyoung in #7318
  • Fix: sync prepare_fp8_layer_for_marlin with latest vllm changes by @narutolhy in #7648
  • Add small requirements for benchmark/parse_result tools by @BBuf in #7671
  • [CPU] remove process_group from inputs of shm_allreduce and shm_allgather by @chunyuan-w in #7486
  • chore: bump sgl-kernel v0.2.1 by @zhyncs in #7675
  • support llama4 eagle3 by @sleepcoo in #6985
  • Refactor mm processors and Enable mixed modality processing by @JustinTong0323 in #7629
  • upgrade sgl kernel to 0.2.1 for main by @xiezhq-hermann in #7676
  • add description for llama4 eagle3 by @yizhang2077 in #7688
  • fix(model loader): use safe_open to prevent file handle leaks. by @SimonCqk in #7684
  • chore: upgrade flashinfer v0.2.7.post1 by @zhyncs in #7698
  • Improve error handling for requests with unloaded LoRA path(s) by @lifuhuang in #7642
  • Apply dsv3_fused_a_gemm kernel by @ispobock in #7635
  • Fix GPTQMarlinMoE by @lkm2835 in #7697
  • [1/n] apply wna16marlin kernel in moe weight only quantization by @AniZpZ in #7683
  • Apply dsv3 router gemm kernel for deepseek-r1 fp4 by @Fridge003 in #7677
  • [AMD] Temporarily disable test_no_overlap_scheduler and test_vision_chunked_prefill by @hubertlu-tw in #7717
  • [RL] add --skip-warmup by @zhuzilin in #7416
  • [RL] support update_weights_from_distributed with different group and multiple weights by @zhuzilin in #7292
  • [router] add --log-level to sgl-router by @zhuzilin in #6512
  • [b200] support trt-llm allreduce fuse rms_norm_add kernel by @BBuf in #7621
  • [CPU] Bind threads and numa node for each TP rank by @chunyuan-w in #6549
  • Support non-contiguous query input for extend/decode attention by @yanbing-j in #7462
  • Support updating weights at once by stopping all requests by @tianyuzhou95 in #6698
  • Fix num_tokens_pre_allocated in disaggregation log by @ZeldaHuang in #7714
  • [CPU] [sgl-kernel] set dispatch key of initialize to CatchAll by @chunyuan-w in #7734
  • [CPU] fix all_reduce and all_gather by @chunyuan-w in #6770
  • fix awq and dsv3 fused gemm compatible by @AniZpZ in #7735
  • [CI][Router] Fix bench_one_batch_server for pd router test by @ShangmingCai in #7731
  • Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture by @ayrnb in #7278
  • fix dsv3 fused proj check by @AniZpZ in #7738
  • Ascend attention backend(PA&MLA) by @ping1jing2 in #7722
  • [fix] fix dsv3_router_gemm filter by @Alcanderian in #7750
  • [CPU] refine CPU integration code by @chunyuan-w in #7647
  • [CPU] support the case where num_attention_heads or intermediate_size is not divisible by the TP size by @chunyuan-w in #6771
  • support qwen3 dense model dp attention by @yizhang2077 in #7681
  • [optimize] add two stream norm for qwen3 by @yizhang2077 in #7740
  • feat: use D2D instead of H2H in pp by @TianyuZhang1214 in #7673
  • [Bug] add flashinfer bool check for fusedmoe in Qwen moe models by @yilian49 in #7723
  • [fix] put cpu in the first priority in get_device() by @Alcanderian in #7752
  • [optimize] fuse renormalize into moe_topk_softmax by @yizhang2077 in #7744
  • chore: bump sgl-kernel 0.2.2 by @zhyncs in #7755
  • fix CI: update native api ipynb by @JustinTong0323 in #7754
  • fuse renormalize into moe topk softmax kernel python code by @yizhang2077 in #7751
  • Remove type conversion and fix id map in topk by @ispobock in #7759
  • Add V2-lite model test by @yanbing-j in #7390
  • refactor llama4 dp attention logic by @yizhang2077 in #7729
  • fix(docs): fix the broken link in docs/references/production_metrics.md by @rudeigerc in #7741
  • [fix] update bench_speculative.py for compatibility by @yankay in #7764
  • Move mem_fraction_static adjustment for multimodal models to server_args.py & Fix session control & Other cleanups by @merrymercy in #7748
  • [RL] Add --nccl-port to prevent port conflict by @zhuzilin in #7418
  • [RL] add pause and continue generation for async rl training by @zhuzilin in #7419
  • [Fix] Alloc return type error by @Capronir in #7778
  • [feat] Support EAGLE3 for Qwen by @Ximingwang-09 in #7745
  • saving hidden_states.clone() by @ch-wan in #7705
  • [1/n]: add cutlass W4A8 moe kernel for hopper architecture by @yangsijia-serena in #7772
  • add model: qwen2-audio by @leng-yue in #7596
  • Optimize Hopper CUTLASS FP8 Blockwise Grouped GEMM Kernel in Small K Scenario by @HydraQYH in #7782
  • Embedding parallel by attn_tp by @MoonBall in #7623
  • fix: fix apply_shuffle_mul_sum by @mickqian in #7444
  • chore: bump sgl-kernel v0.2.3 by @zhyncs in #7784
  • fix: use nvidia-nccl-cu12 2.27.5 by @zhyncs in #7787
  • DP Attention with Auto DeepEP Dispatch by @ch-wan in #7222
  • chore: upgrade sgl-kernel v0.2.3 by @zhyncs in #7786
  • Fix incorrect spec_num_draft_tokens in draft_extend by @ch-wan in #7757
  • [fix] fix misusing of is_cuda by @Alcanderian in #7790
  • Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 by @merrymercy in #7756
  • chore: bump sgl-kernel v0.2.4 by @zhyncs in #7800
  • ci: fix port args by @mickqian in #7792
  • Fix CI test OOM issue. by @lifuhuang in #7799
  • chore: upgrade sgl-kernel v0.2.4 by @zhyncs in #7801
  • chore: bump v0.4.9 by @zhyncs in #7802
  • [misc] remove pdlb rust by @slin1237 in #7796
  • fix: free disk space by @zhyncs in #7803
  • fix: disable dsv3_router_gemm in dsv3_nextn by @Alcanderian in #7793
  • Support logprobs in two-batch overlap by @fzyzcjy in #7709
  • Fix division-by-zero bug in LoRA triton kernels. by @lifuhuang in #7785
  • [AMD] Add test_fused_moe.py and test_rope_rocm.py to AMD CI by @hubertlu-tw in #5246
  • [RL] Fix illegal memory for _import_static_state by @hebiao064 in #7733
  • Fix _import_static_state issue by @nanjiangwill in #7812
  • Optimize moe align block size kernel by @ispobock in #7794
  • Log the timestamps of each prefill/decode iteration by @yuhsuan-t in #6094
  • [bugfix] Fix sgl-router get_server_info endpoint compatibility issue by @slin1237 in #7813
  • Integrate triton moe kernel by @yuan-luo in #7689
  • Kernels for efficient KV cache IO by @xiezhq-hermann in #7313
  • [docs] update router readme by @slin1237 in #7797
  • [misc] release new router version by @slin1237 in #7798
  • fix duplicate args in schedule_batch by @ZeldaHuang in #7816
  • [AMD] Fail gracefully when AITER is unavailable on gfx90a GPUs by @haohui in #7187
  • docs: update README by @zhyncs in #7821
  • feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode by @yangsijia-serena in #7762
  • Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang by @Edwardf0t1 in #7129
  • [Minor] Fix sporadic CI timeout caused by underestimated tests. by @lifuhuang in #7850
  • [Bugfix] Fix two batch overlap with auto DeepEP Dispatch by @ShangmingCai in #7853
  • Fix cache modules of triton import error by @kkHuang-amd in #7832
  • [router] forward stream_options in request by @ZhangShuaiyi in #7860
  • Fix illegal memory in trtllm allreduce fusion by @BBuf in #7864
  • Fix llama4 vision by @JustinTong0323 in #7840
  • Support Mimo-VL by @JustinTong0323 in #7579
  • fix: Handles input_embeds in GenerateReqInput when n>1 by @JustinTong0323 in #7830
  • [Multimodal][Perf] Use pybase64 instead of base64 by @b8zhong in #7724
  • Bump xgrammar's version to 0.1.20 by @whybeyoung in #7866
  • [CPU]convert topk_weights to fp32 for INT8 and FP8 paths (for llama4) and fix LmHead weight pack by @chunyuan-w in #7818
  • [PD] Add guidance for prefill bootstrap timeout by @ShangmingCai in #7846
  • Update native_api doc to match the change in the get_model_info endpoint by @Arist12 in #7660
  • Revert "Embedding parallel by attn_tp (#7623)" by @zhyncs in #7880
  • chore: bump v0.4.9.post1 by @zhyncs in #7882
  • Fixes typo in assertion message by @JustinTong0323 in #7895
  • [CI] Add deepep tests to CI by @ch-wan in #7872
  • [CPU] [FP8] set SGLANG_CPU_FP8_CVT_FTZ in CMakeLists.txt by @chunyuan-w in #7885
  • [CPU][Qwen3 MoE] Enable fused_topk CPU fusion and enhance FP8 TP padding by @jianan-gu in #7838
  • Remove unused imports by @almaslof in #7898
  • [router] Update metrics when request completes by @ZhangShuaiyi in #7899
  • [feature] Add start step profile argument in /start_profile by @kyleliang-nv in #7608
  • [bugfix] add pd router policy validation by @slin1237 in #7904
  • vlm: support video as an input modality by @mickqian in #5888
  • Feat: Support Phi-3.5-MoE in SGLang by @byjiang1996 in #7907
  • add sentencepiece as dependency explicitly by @ZailiWang in #7922
  • Fix bug of deepseek-v3 under DP+EP mode with large batchsize/seqlen by @likesen-alibaba in #6449
  • [feature]Ascend quantization support by @ping1jing2 in #7791
  • [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module by @BBuf in #7775
  • Support Kimi K2 by @Atream in #7940
  • [feature] kv transfer support of ascend npu by @ping1jing2 in #7795
  • fix: minor fix for modelopt weight load compatibility by @AniZpZ in #7953
  • temporarily disable deepep-8-gpu and activate two small tests by @ch-wan in #7961
  • [fix] Update unit test for fp8_blockwise_scaled_grouped_mm kernel by @HydraQYH in #7932
  • chore: bump sgl-kernel v0.2.5 by @zhyncs in #7964
  • Revert "[PD Disaggregation] replace transfer with batch transfer for better performance (#7236)" by @fzyzcjy in #7968
  • chore: upgrade xgrammar 0.1.21 by @zhyncs in #7962
  • delete useless code caused by the fuse allreduce+add_rmsnorm PR by @BBuf in #7970
  • Fix wrong gemm branch cause 250us slower by @fzyzcjy in #7969
  • [router] add worker abstraction by @slin1237 in #7960
  • chore: upgrade sgl-kernel 0.2.5 by @zhyncs in #7971
  • chore: bump v0.4.9.post2 by @zhyncs in #7963
  • [minor fix] llama4 hybrid memory by @Ying1123 in #7950
  • [minor fix] SWA missing methods by @Ying1123 in #7972
  • [script] update loogle test by @Ying1123 in #7975
  • docs: update README by @zhyncs in #7985
  • Overlap the gating function with shared experts in DeepSeek by @ch-wan in #7978
  • [BugFix] fix pre_reorder_triton_kernel default int32 issue by @Yuechguo in #7814
  • [minor] Add server_args check for Llama4 with hybrid by @Ying1123 in #7988
  • Tiny fix mooncake log warning wrong output by @fzyzcjy in #7952
  • [BugFix] add verify logit_bias to avoid crash because of IndexError by @ehuaa in #7749
  • SWA Prefix Cache by @hanming-lu in #7367
  • chore: remove unnecessary limits on quantization methods in test script by @AniZpZ in #7997
  • Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes by @lifuhuang in #7844
  • Support for Phi-1.5 & Phi-2 models by @ppraneth in #7862
  • [Dockerfile] Multi-arch support for ROCm by @mqhc2020 in #7902
  • [CPU] fix no attribute 'can_fuse_mlp_allreduce' error by @chunyuan-w in #8010
  • perf: add kimi k2 fused_moe tuning config for h30_3e by @GaoYusong in #8021
  • [ci] CI supports using cached models by @HanHan009527 in #7874
  • [Minor] Remove redundant print by @merrymercy in #8005
  • [Feature]TP Group Switching for PD-Multiplexing by @ykcombat in #7653
  • [Feature] CUDA Green Context Support by @ykcombat in #7649
  • Fix flaky CI: test_vlm_models by @lifuhuang in #8006
  • Fix Bug 'get_cpu_copy not Implemented' in pd offloading mode by @hzh0425 in #7982
  • prevent server crash from potential invalid grammar by @ehuaa in #7897
  • Setup workflow for releasing mi300x and mi350x dockers. by @saienduri in #8035
  • fix: modality length mismatch with image_data by @Yangruipis in #7887
  • Update CODEOWNERS by @CatherineSue in #8044
  • [feat]Support fusion kernel for constructing quant input and scale factor for fp8_blockwise_scaled_grouped_mm by @HydraQYH in #8023
  • feat: update multimodal data handling in engine entrypoint by @JustinTong0323 in #8002
  • fix: remove redundant rotary embedding cache recomputation in MiniCPM by @JustinTong0323 in #8022
  • Fix the input tools format and history tool_calls in OpenAI API by @chen700564 in #6556
  • fix: resolve arm build issue by @zhyncs in #8052
  • concurrently load weights of DeepseekV2ForCausalLM by @tianyuzhou95 in #7943
  • H20 tune config for Kimi by @artetaout in #8047
  • Update amd docker image. by @saienduri in #8045
  • feat: replace Decord with video_reader-rs by @kozoy in #5163
  • remove kv_a.contiguous in DeepseekV2AttentionMLA by @strgrb in #8058
  • update transformers to 4.53.2 by @JustinTong0323 in #8029
  • Fix different device type adjustment in PP by @Qiaolin-Yu in #7760
  • Use device_group for all_gather when disabling overlap scheduling by @Qiaolin-Yu in #8001
  • Revert "feat: replace Decord with video_reader-rs" by @mickqian in #8077
  • Fix CI xeon test with triton 3.3.1 by @yanbing-j in #8086
  • fix greenctx stream compatibility by @AniZpZ in #8090
  • [misc] update nvshmem and pin deepEP commit hash by @slin1237 in #8098
  • [Feature] Layer-wise Prefill by @jason-fxz in #7634
  • [1/n] chore: decouple quantization implementation from vLLM dependency by @AniZpZ in #7992
  • refactor: unify names of the feature field of MultimodalDataItem by @mickqian in #8075
  • feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics by @acelyc111 in #7597
  • [ci] limit cmake build nproc by @slin1237 in #8100
  • [ci] disable memory imbalance check for draft worker by @ch-wan in #8108
  • [Fix] ensure DeepGEMM is only enabled for FP8_W8A8 models by @hzh0425 in #8110
  • [ci] recover 8-gpu deepep test by @ch-wan in #8105
  • Refactor: move all quantization-related code to srt/layer/quantization by @ch-wan in #7989
  • [kernel] opt moe align block kernel by block/warp scan algorithm by @yuan-luo in #7884
  • Super tiny fix typo by @fzyzcjy in #8046
  • fix: update HostKVCache init to report correct msg when available memory is not enough by @ziqifan617 in #8102
  • [Hunyuan]: Fix Dense Model Support by @kzjeef in #8117
  • feat: add production metric for retracted requests due to insufficient kvcache by @aftersnow in #7030
  • refactor: simplify MultimodalTokens logic by @mickqian in #7924
  • [Fix][Ready]Fix register spilling in cutlass nvfp4 gemm kernel on Blackwell by @HydraQYH in #8127
  • Feat: Support Granite 3.0 MoE in SGLang by @zminglei in #7959
  • load draft model fix by @yilian49 in #7506
  • [CPU][Llama4] Fix Llama4 MoE inputs with "apply_router_weight_on_input" by @jianan-gu in #7889
  • [Quantization][w8a8_int8] Fix weight loading issue for w8a8_int8 path with "ignore" layer list in quantization config by @jianan-gu in #7820
  • Hicache Storage Layer Prototype by @xiezhq-hermann in #7704
  • Revert "Fix different device type adjustment in PP" by @saienduri in #8141
  • feat: enhance green context stream creation robustness with backward compatibility by @AniZpZ in #8136
  • fix compressed tensors WNA16 imports by @qeternity in #8142
  • [Bugfix] Fix w8a8_int8 import error on NPU by @iforgetmyname in #8147
  • [3/n] chore: decouple AWQ implementation from vLLM dependency by @Hongbosherlock in #8113
  • [router] Refactor router and policy traits with dependency injection by @slin1237 in #7987
  • [AMD] Add triton awq_dequantize kernel to support AWQ on ROCm by @hubertlu-tw in #7661
  • [Doc] Steps to add a new attention backend by @merrymercy in #8155
  • chore: tune mem fraction static for vlm by @mickqian in #6881
  • Support NVFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs by @haohui in #7302
  • Feat: Support audio in Phi4-mm model by @byjiang1996 in #8048
  • [PD] Support non-MLA models PD different TP with DP attention by @ShangmingCai in #7931
  • [health_generate] fix: fix the /health_generate always success bug by @acelyc111 in #8028
  • [router] router metrics cleanup by @slin1237 in #8158
  • [router] allow router to have empty workers by @slin1237 in #8160
  • Add GB200 wide-EP docker by @kyleliang-nv in #8157
  • [1/N] MoE Refactor: refactor select_experts by @ch-wan in #7966
  • chore: bump sgl-kernel v0.2.6 by @zhyncs in #8165
  • chore: upgrade sgl-kernel 0.2.6 by @zhyncs in #8166
  • Fix suffix mismatch for the metrics. by @Charles-L-Chen in #8168
  • Update README.md by @merrymercy in #8171
  • Clean up server args by @merrymercy in #8161
  • Fix LoRA buffer contamination during adapter eviction by @lifuhuang in #8103
  • Fix Dockerfile.gb200 by @kyleliang-nv in #8169
  • [router] add ut for worker and errors by @slin1237 in #8170
  • bugfix: fix sglang crash in NVIDIA MIG container by @Garrybest in #8167
  • Support start up LoRA server without initial adapters by @lifuhuang in #8019
  • Clean warning logs for gate_proj loading in Lora by @Fridge003 in #8172
  • Fix tuning_fused_moe_triton.py by @ch-wan in #8175
  • [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability by @whybeyoung in #8115
  • Add bf16 output option for dsv3_router_gemm kernel by @Fridge003 in #7999
  • Enable FlashInfer support encoder models and add head_dim padding workaround by @ccs96307 in #6230
  • Add get_hidden_dim to qwen3.py for correct lora by @logachevpa in #7312
  • feat: add h200 tp 16 kimi k2 moe config by @zhyncs in #8176
  • feat: add b200 tp 16 kimi k2 moe config by @zhyncs in #8178
  • fix moe gate dtype, fix tbo, fix fake dispatch by @Atream in #7825
  • Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" by @merrymercy in #8181
  • feat: update nccl 2.27.6 by @zhyncs in #8182
  • Feat: Support for Persimmon Model by @ppraneth in #7983
  • feat: add h200 tp 16 kimi k2 moe config by @Qiaolin-Yu in #8183
  • Fix eagle3 cuda graph by @Ja1Zhou in #8163
  • fix: fix the bug of loading Internvl3 by @coco-alen in #8067
  • Fix dtype error in CI by @ispobock in #8197
  • [router] add ut for pd request, metrics and config by @slin1237 in #8184
  • [feature] enable NPU CI by @ping1jing2 in #7935
  • [fix] fix modelopt fp4 on b200 by @Alcanderian in #8195
  • chore: bump sgl-kernel v0.2.6.post1 by @zhyncs in #8200
  • Apply fused sorted token ids padding by @ispobock in #8193
  • [Refactor] simplify multimodal data processing by @JustinTong0323 in #8107
  • [router] add ut for pd router by @slin1237 in #8208
  • [router] upgrade router version to 0.1.6 by @slin1237 in #8209
  • Remove router gemm output dtype conversion by @ispobock in #8204
  • chore: upgrade sgl-kernel 0.2.6.post1 by @zhyncs in #8202
  • [Feature] Add a test for Layer-wise Prefill by @jason-fxz in #8231
  • docs: update 2025 h2 roadmap by @zhyncs in #8237
  • fix: retrieve mm token by modality, raise error if none by @JustinTong0323 in #8221
  • [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 by @hubertlu-tw in #7484
  • fix: sgl-router remove dead code by @oldsharp in #8257
  • [fix] benchmark: routed_scaling_factor is None by @panpan0000 in #8059
  • [Benchmark] add disable-auto-run param for hicache/bench_multiturn by @rzwei in #7822
  • Preliminary Support for Qwen3XMLDetector by @yhyang201 in #8260
  • chore: bump v0.4.9.post3 by @zhyncs in #8265
  • Skip llama4 vision module loading when multimodal disabled by @ispobock in #8272
  • Fix sgl-kernel ci test by @ispobock in #8284
  • Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching by @lifuhuang in #8261
  • Hicache IO kernel refactoring by @xiezhq-hermann in #8264
  • bug fix and tag by @xiezhq-hermann in #8282
  • HiCache Fix by @xiezhq-hermann in #8288
  • [sgl-kernel] Opt per_token_quant_fp8 with warp reduce by @yuan-luo in #8130
  • [router] add common ut infra to mock worker and app by @slin1237 in #8295
  • fix: workaround for deepgemm warmup issue by @zhyncs in #8302
  • [Performance][PD Disaggregation] optimize TokenToKVPoolAllocator by sorting free pages by @YiXR in #8133
  • Fix the issue of incorrect finish reason in final stream response chunk returned during tool call by @xianzhiT in #7708
  • fix: match chat-template for internvl3 by @JustinTong0323 in #8262
  • Fix gemma3n with hybrid swa by @JustinTong0323 in #8240
  • chore: upgrade sgl-kernel 0.2.7 by @zhyncs in #8304
  • fix: prevent crashes due to logit bias dimension mismatch by @0xymoro in #7685
  • feat(function call): complete utility method for KimiK2Detector and enhance documentation by @CatherineSue in #8043
  • Fix incomplete tool call capture issue in streaming response of DeepSeek-V3 when enable MTP by @xianzhiT in #7562
  • [AMD] Pull latest image for AMD CI by @michael-amd in #8070
  • Pin the version of petit kernel to fix the APIs by @haohui in #8235
  • [bug] fix pd completion protocol for batching support by @slin1237 in #8317
  • [router] fix pd model completion request by @slin1237 in #8303
  • fix bug when eos_ids==0 by @bzantium in #8315
  • [router] add endpoint unit test by @slin1237 in #8298
  • [code style] Clean dead triton kernel code in fused_moe and useless vllm_ops import by @BBuf in #8310
  • chore: upgrade flashinfer v0.2.9rc1 by @Swipe4057 in #8301
  • [router] add streaming unit test by @slin1237 in #8299
  • [router] add request format unit test by @slin1237 in #8300
  • HiCache Storage TP Refinement by @xiezhq-hermann in #8307
  • breakdown kernel update by @xiezhq-hermann in #8334
  • support idle batch for TBO by @sherry-1001 in #8233
  • [Feature] Integrate quick allreduce and select the best allreduce implementation by @lihaoyang-amd in #6619
  • DP Enhancement by @ch-wan in #8280
  • fix: Fix failed functional tests https://github.com/meta-llama/llama-stack-evals by @ynwang007 in #8266
  • [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs by @hubertlu-tw in #7135
  • [CPU] Add tutorial docs for SGL on CPU by @ZailiWang in #8000
  • chore: upgrade mooncake 0.3.5 by @ShangmingCai in #8341
  • [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly triggering torch.compile in forward pass by @BBuf in #8353
  • [P/D] Support ipv6 in P/D scenario by @thefacetakt in #7858
  • Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @Xu-Wenqing in #8344
  • [Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs in Qwen3 detector by @CatherineSue in #8357
  • Clean up server_args, triton cache manager by @merrymercy in #8332
  • fix: upgrade nccl version by @zhyncs in #8359
  • [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 by @CatherineSue in #8363
  • fix: kimi k2 xgrammar crash by @zhyncs in #8367
  • Fix FP4 MoE accuracy from missing routed_scaling_factor by @trevor-m in #8333
  • [CI] Fix flaky threshold by @merrymercy in #8370
  • chore: bump v0.4.9.post4 by @zhyncs in #8305
  • Fix test_moe_fused_gate_combined sgl-kernel ci test by @ispobock in #8374
  • Update Dockerfile.gb200 to latest sglang by @kyleliang-nv in #8356
  • chore: improve mmmu benchmark by @mickqian in #7000
  • Save peak memory in logits processor by @ch-wan in #8343
  • Extract update_weights from RL Engine to SGLang to keep simplicity and fix torch reduce by @hebiao064 in #8267
  • chore: improvements on mm_utils by @mickqian in #7737
  • vlm: optimize tensor transport by @mickqian in #6003
  • Tiny assert EPLB is used together with expert parallel by @fzyzcjy in #8381
  • model: support intern-s1 by @RunningLeon in #8350
  • Add perf tests for LoRA by @lifuhuang in #8314
  • Remove slot usage in code to be backward-compatible with python 3.9 by @lifuhuang in #8396
  • Add docker release flow for gb200 by @kyleliang-nv in #8394
  • HiCache, check before terminate prefetching by @xiezhq-hermann in #8372
  • Add nvfp4 scaled mm benchmark. by @HydraQYH in #8401
  • Urgent Fix: intern-s1 chat-template matching by @JustinTong0323 in #8403
  • Tool to dump and compare internal activation tensors by @fzyzcjy in #7976
  • Minor tool for comparison of benchmark results by @fzyzcjy in #7974
  • Fix bench script making input data on L2 cache by @fzyzcjy in #7739
  • [NVIDIA] Add Flashinfer MoE blockscale fp8 backend by @kaixih in #8036
  • Update Cutlass in sgl-kernel to v4.1 by @Fridge003 in #8392
  • fix: minor fix TransportProxyTensor under tp by @mickqian in #8382
  • [router] add different policies for p node and d node by @slin1237 in #8395
  • Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @lambert0312 in #8351
  • fix: fix the missing metrics on non-rank0 nodes by @acelyc111 in #7720
  • [2/N] MoE Refactor: Unify weight loader and quant methods by @ch-wan in #8397
  • Use FlashInfer FP4 gemm. by @elfiegg in #8241
  • Support precomputed_embeddings for Llama 4 by @AlienKevin in #8156
  • [hotfix] fix merge conflicts in FlashInferEPMoE by @ch-wan in #8405
  • chore: update CODEOWNERS by @zhyncs in #8407
  • chore: upgrade flashinfer v0.2.9rc2 by @zhyncs in #8406
  • Support triton kernels v3.4.0 for fused_moe by @yuan-luo in #8258
  • [Bugfix] Prevent PD server crash from invalid grammar by @ShangmingCai in #8062
  • Change to use native arm runner by @kyleliang-nv in #8414
  • Support overlapped lora updates by @lifuhuang in #8213
  • Support ue8m0 for triton quant kernel by @fzyzcjy in #7603
  • Fix: Improve test_openai_function_calling unit test and fix reasoning_parser.py think_start_token logic by @byjiang1996 in #8316
  • bugfix: Fix multiple finish_reason chunks and tool_calls finish reason check by @CatherineSue in #8417
  • Fix test_openai_server by @CatherineSue in #8419
  • Fix docker buildx push error by @kyleliang-nv in #8425
  • bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation by @CatherineSue in #8422
  • [router] improve router logs and request id header by @slin1237 in #8415
  • [feat] Support different attention backends for prefill and decode by @Qiaolin-Yu in #6338
  • chore: bump transformer to 4.54.0 by @hebiao064 in #8416
  • [PD] Fix abort_request for PD disaggregation by @ShangmingCai in #8352
  • GLM-4.5 Model Support by @zRzRzRzRzRzRzR in #8224
  • Remove zstd compression for building Dockerfile.gb200 by @kyleliang-nv in #8442
  • doc: add bench_one_batch_server in the benchmark doc by @Qiaolin-Yu in #8441
  • GLM-4.5 Model Support Follow-up by @byjiang1996 in #8445
  • fix GLM4_MOE launch with compressed_tensor quant model by @zminglei in #8456
  • Fix per_token_group_quant_8bit when hidden_dim // group_size is not divided by 4. by @strgrb in #8449
  • Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" by @BBuf in #8457
  • chore: bump v0.4.9.post5 by @zhyncs in #8458
  • fix: reorder topk experts to ensure shared expert replaces minimal score by @erictanjn in #8125
  • Update PR template by @ispobock in #8465
  • feat: throttle requests at scheduler based on --max_queued_requests by @harrisonlimh in #7565
  • fix: update dep by @zhyncs in #8467
  • [NVIDIA] Change to use num_local_experts by @kaixih in #8453
  • Fix parsing ChatCompletionMessage by @Onyad in #7273
  • [3/N] MoE Refactor: Simplify DeepEP Output by @ch-wan in #8421
  • feat: support glm4 tuning by @zhyncs in #8473
  • Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 by @hebiao064 in #8469
  • Update codeowner by @merrymercy in #8476
  • chore: add glm4 fp8 tp8 config by @zhyncs in #8478
  • chore: add glm 4.5 fp8 tp4 config by @zhyncs in #8480
  • [CI] Add genai-bench Performance Validation for PD Router by @key4ng in #8477
  • Update CODEOWNERS by @merrymercy in #8485
  • Rename the last step in pr-test.yml as pr-test-finish by @merrymercy in #8486
  • Reduce memory usage for fp4 moe by @fzyzcjy in #8413
  • Tiny add warnings for DeepEP when it is suboptimal by @fzyzcjy in #8426
  • Support colocating requests by @fzyzcjy in #7973
  • Fix incorrect KV cache allocation for MTP models. by @lifuhuang in #8482
  • Add PVC and update resource limits in k8s config by @haitwang-cloud in #8489
  • chore: bump v0.4.9.post6 by @zhyncs in #8517
  • Always trigger pr-test by @merrymercy in #8527
  • Update README.md by @merrymercy in #8528
  • [sgl-kernel performance] fix fp8 quant kernels dispatch __nv_fp8_e4m3 bug to improve performance by 10%-20% by @BBuf in #8499
  • Update cutlass_moe.py by @elfiegg in #8535
  • Fix moe align kernel test by @ispobock in #8531
  • Split the scheduler into multiple mixin classes to reduce the file size by @merrymercy in #8483
  • bring back kimi vl ci by @hebiao064 in #8537
  • fix: temporarily disable cuda-ipc for mm data tensor by @mickqian in #8431
  • Support EPLB in FusedMoE by @ch-wan in #8448
  • feat(hicache): support file backend reading directory config from env. by @hzh0425 in #8498
  • feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. by @hzh0425 in #8516
  • [router] allow longer time out for router e2e by @slin1237 in #8560
  • Update cutlass_moe.py by @elfiegg in #8545
  • Update CODEOWNERS by @ShangmingCai in #8562
  • [feature] [sgl-router] Add a dp-aware routing strategy by @oldsharp in #6869
  • [Hot-Fix] moe_aligned_block_size CI failure on AMD by @yuan-luo in #8461
  • [Model] Add support for Arcee Foundational Model by @adarshxs in #8154
  • Revert "Fix the input tools format and history tool_calls in OpenAI API (#6556)" by @CatherineSue in #8584
  • Add hf3fs support for hicache storage (based on #7704) by @pansicheng in #7280
  • [router] migrate router from actix to axum by @slin1237 in #8479
  • [Fix]Fix index oob in get_group_gemm_starts kernel. by @HydraQYH in #8564
  • Bump transformers to 4.54.1 to fix Gemma cache issue. by @lifuhuang in #8541
  • Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PATH. by @pyc96 in #8544
  • Bug: Fix google gemma3n-mm audio input not working bug by @byjiang1996 in #8365
  • update sgl-kernel for EP: kernel part by @ch-wan in #8514
  • chore: bump sgl-kernel v0.2.8 by @zhyncs in #8599
  • [bugfix] Fix 2 minor bugs in the hicache storage layer by @yapple in #8404
  • fix incorrect increase of hit count by @huangtingwei9988 in #8533
  • Support l3 cache (mooncake store) for hiradix cache by @huangtingwei9988 in #7211
  • update sgl-kernel for EP: python part by @ch-wan in #8550
  • add SVG logo by @hnyls2002 in #8603
  • [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE by @ch-wan in #8515
  • fix: fork should not run pypi router by @yihong0618 in #8604
  • model: support Step3V by @CatherineSue in #8583
  • [Feature] Hybrid EP and TP by @ch-wan in #8590
  • chore: bump v0.4.10 by @zhyncs in #8608
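
For illustration, here is a minimal sketch of exercising the dynamic LoRA loading / unloading API added in #7446 (referenced above) against a running SGLang server. The endpoint names (`/load_lora_adapter`, `/unload_lora_adapter`), the payload fields, and the server address are assumptions inferred from the PR title, not confirmed by this changelog; adjust them to match the actual server API.

```python
# Hedged sketch: dynamically load and unload a LoRA adapter on a running
# SGLang server (feature from #7446). Endpoint names, payload fields, and
# the address below are illustrative assumptions, not confirmed API.
import requests

BASE_URL = "http://127.0.0.1:30000"  # assumed default server address


def load_lora_adapter(name: str, path: str) -> None:
    # Assumed endpoint and payload shape for registering a new adapter.
    resp = requests.post(
        f"{BASE_URL}/load_lora_adapter",
        json={"lora_name": name, "lora_path": path},
    )
    resp.raise_for_status()


def unload_lora_adapter(name: str) -> None:
    # Assumed endpoint for removing a previously loaded adapter.
    resp = requests.post(
        f"{BASE_URL}/unload_lora_adapter",
        json={"lora_name": name},
    )
    resp.raise_for_status()


if __name__ == "__main__":
    load_lora_adapter("my-adapter", "/path/to/lora_weights")
    # ... issue /generate requests that reference "my-adapter" here ...
    unload_lora_adapter("my-adapter")
```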

New Contributors

  • @valarLip made their first contribution in #7423
  • @ssssnow made their first contribution in #7236
  • @xianzhiT made their first contribution in #7277
  • @yilian49 made their first contribution in #7510
  • @ll819214 made their first contribution in #7386
  • @MasterJH5574 made their first contribution in #7543
  • @TomQuartz made their first contribution in #7584
  • @finetunej made their first contribution in #7619
  • @narutolhy made their first contribution in #7648
  • @SimonCqk made their first contribution in #7684
  • @ZeldaHuang made their first contribution in #7714
  • @ayrnb made their first contribution in #7278
  • @ping1jing2 made their first contribution in #7722
  • @TianyuZhang1214 made their first contribution in #7673
  • @rudeigerc made their first contribution in #7741
  • @Capronir made their first contribution in #7778
  • @yangsijia-serena made their first contribution in #7772
  • @leng-yue made their first contribution in #7596
  • @HydraQYH made their first contribution in #7782
  • @MoonBall made their first contribution in #7623
  • @nanjiangwill made their first contribution in #7812
  • @haohui made their first contribution in #7187
  • @ZhangShuaiyi made their first contribution in #7860
  • @almaslof made their first contribution in #7898
  • @kyleliang-nv made their first contribution in #7608
  • @likesen-alibaba made their first contribution in #6449
  • @Yuechguo made their first contribution in #7814
  • @hanming-lu made their first contribution in #7367
  • @ppraneth made their first contribution in #7862
  • @mqhc2020 made their first contribution in #7902
  • @ykcombat made their first contribution in #7653
  • @hzh0425 made their first contribution in #7982
  • @Yangruipis made their first contribution in #7887
  • @chen700564 made their first contribution in #6556
  • @artetaout made their first contribution in #8047
  • @kozoy made their first contribution in #5163
  • @jason-fxz made their first contribution in #7634
  • @acelyc111 made their first contribution in #7597
  • @ziqifan617 made their first contribution in #8102
  • @kzjeef made their first contribution in #8117
  • @aftersnow made their first contribution in #7030
  • @Charles-L-Chen made their first contribution in #8168
  • @Garrybest made their first contribution in #8167
  • @ccs96307 made their first contribution in #6230
  • @logachevpa made their first contribution in #7312
  • @coco-alen made their first contribution in #8067
  • @oldsharp made their first contribution in #8257
  • @rzwei made their first contribution in #7822
  • @YiXR made their first contribution in #8133
  • @0xymoro made their first contribution in #7685
  • @bzantium made their first contribution in #8315
  • @sherry-1001 made their first contribution in #8233
  • @lihaoyang-amd made their first contribution in #6619
  • @ynwang007 made their first contribution in #8266
  • @thefacetakt made their first contribution in #7858
  • @RunningLeon made their first contribution in #8350
  • @AlienKevin made their first contribution in #8156
  • @zRzRzRzRzRzRzR made their first contribution in #8224
  • @erictanjn made their first contribution in #8125
  • @harrisonlimh made their first contribution in #7565
  • @Onyad made their first contribution in #7273
  • @haitwang-cloud made their first contribution in #8489
  • @yapple made their first contribution in #8404
  • @yihong0618 made their first contribution in #8604

Full Changelog: v0.4.8...v0.4.10
