## Highlights
- Support for DeepSeek V3.2/V3.2 Speciale #14249
- Blockwise diffusion language model support #12588
- Support for new diffusion models (Flux2 #14000, Z-image #14067)
- Introduce JIT Kernels #13453
- Upgrade to Torch 2.9 #12969
- Kimi-K2-Thinking model enhancement #12882
- Memory management/Overlap spec compatibility #12224 #12839
- More performance optimizations: DeepSeek-V3-FP4 / GLM-4.6 / Kimi-K2 / DeepSeek-V3.2 / ...
- CI/CD enhancements
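
For readers trying the release out, the sketch below shows one way to query a locally running SGLang server through its OpenAI-compatible endpoint; the port, model name, and prompt are placeholders, and the server is assumed to have been launched separately (for example via `python -m sglang.launch_server`).

```python
# A minimal sketch (not taken from the linked PRs): query a locally running
# SGLang server through its OpenAI-compatible endpoint. It assumes the server
# was launched separately, e.g.:
#   python -m sglang.launch_server --model-path <model> --port 30000
# The base_url, model name, and prompt below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; the server reports its served model via /v1/models
    messages=[{"role": "user", "content": "Summarize mixture-of-experts in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```
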
## What's Changed
- [router][grpc] Add more mcp test cases to responses api by @CatherineSue in #12749
- [Intel]Add 'intel_xpu' attention backend for llama4 by @gaopengff in #11051
- [Intel XPU]Update pytorch xpu to 2.9 by @gaopengff in #12363
- [Docs] fix dead links in multiple documentation pages by @mattheliu in #12764
- [mem pool] bugfix: wrong position for self.device in Mamba by @stmatengss in #12684
- [Fix]HTTP Stream raise exception by @jimmy-evo in #11904
- [CPU] Fix TP padding case with weight block size by @jianan-gu in #8243
- [docs] Remove redundant --disable-radix-cache option from by @rchalamala in #12717
- Pin uvloop to 0.21.0 by @yeahdongcn in #12279
- [fix] Only enable flashinfer all reduce fusion by default for single-node servers by @leejnau in #12724
- chore: update CODEOWNERS by @zhyncs in #12795
- Fix hang in deepgemm compilation with symmetric memory enabled by @nvcastet in #12715
- Add bot-bump-kernel-version-to-sglang workflow by @alisonshao in #12794
- ignore the deepgemm check when the model weight with nvfp4 and moe ba… by @rainj-me in #12782
- [AMD] Update wave-lang to 3.8.2 by @xintin in #12576
- [DeepSeek-V3.2][NSA] Enable MHA Pathway for Short Sequence Prefill on B200 (SM100) by @YAMY1234 in #12788
- [hotfix]: Resolve ModuleNotFoundError in PD deployment for is_in_ci() by @hzh0425 in #12772
- [HotFix]: Add missing SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL env var by @hzh0425 in #12776
- Add PP support for dots_vlm by @gty111 in #12763
- fixes hardcoded "cuda" device references in unit tests to use a dynamic device selection by @kalyank007 in #12761
- fix multimodal gen issues by @yhyang201 in #12765
- [Test] Add DeepSeekV3.2 NSA Indexer Test Suite by @Johnsonms in #12520
- [Bugfix] Fix illegal memory access by @elvischenv in #12758
- [MoE] Add Comprehensive MoE Integration Tests by @Jonahcb in #12090
- [Deepseek V3.2] Only skip Indexer logits computation when is_extend_without_speculative by @hlu1 in #12816
- Fix missing dp_max_padding argument in set_dp_buffer_len by @Chen-0210 in #12812
- optm(checkpoint-engine): disable multi-thread loading when update weights by @BraveY in #12374
- Fix piecewise cuda graph ci test by @ispobock in #12836
- update multimodal_gen readme by @mickqian in #12825
- [router] Support structured model output for openai and grpc router by @key4ng in #12431
- Fix data parallel controller launch for num nodes > 2 by @merrymercy in #12822
- remove the fa4 page_size hardcode to 128 restriction on mla model arch by @rainj-me in #12801
- sglang diffusion announcement by @wisclmy0611 in #12856
- add back flashinfer jit cache to dev docker by @b8zhong in #12851
- [router][grpc] Refactor: Add builders for chat and responses by @CatherineSue in #12852
- [router][grpc] Move all error logs to their call sites by @CatherineSue in #12859
- [router] Switch MCP tests from DeepWiki to self-hosted Brave search server by @key4ng in #12849
- Add nightly performance test for GPT-OSS 4GPU models by @alisonshao in #12805
- [sgl-kernel][Deepseek V3.2] Add row_starts to topk kernel by @hlu1 in #12582
- [CI] Fix huggingface access for test_flash_attention_4.py by @Fridge003 in #12846
- [Auto Sync] Update activation.py, logits_processor.py, rota... (20251107) by @merrymercy in #12853
- [Docs][DeepseekV3.2] Update deepseekv3.2 docs for mha short seq prefill by @YAMY1234 in #12868
- Support capturing aux_hidden_states for minimax m2. by @pyc96 in #12798
- [CI] Tiny adjust CI estimation time by @hnyls2002 in #12886
- [DP-Attn] Clarify MLP sync / idle batch preparation logic by @hnyls2002 in #12843
- Fix sending all requests to the first rank in DP attention by @fzyzcjy in #12832
- Apply moe_reduce_sum kernel for fused_marlin_moe by @ispobock in #12888
- use fast stream instead of torch.cuda.current_stream in llama 4 shared experts overlap by @b8zhong in #12811
- [Fix] Fix trtllm-mla backend when chunked prefix cache is disabled by @Fridge003 in #12361
- Refs/heads/add nightly test multi gpu configs by @alisonshao in #12870
- chore: bump sgl-kernel version to 0.3.16.post6 by @sglang-bot in #12889
- Update CODEOWNERS by @ispobock in #12897
- Tiny simplify `can_run_dp_cuda_graph` gather logic by @hnyls2002 in #12891
- Fix spec decoding acc length for dpsk-r1-fp4 tp8 by @Qiaolin-Yu in #12896
- Revert "Fix spec decoding acc length for dpsk-r1-fp4 tp8" by @Qiaolin-Yu in #12900
- Add Deepseek models into nightly tests by @Kangyan-Zhou in #12865
- Fix empty server args in marlin moe test by @ispobock in #12904
- Fix duplicate nightly test name by @Kangyan-Zhou in #12905
- Add HF cleanup logic in ci_install_dependency.sh by @Kangyan-Zhou in #12895
- fallback to triton mm_persistent kernel when deepGemm fail by @zminglei in #12911
- Add kimi k2 thinking to ci by @ispobock in #12907
- Fix Deepseek nightly tests by @Kangyan-Zhou in #12906
- Add Jet-Nemotron by @futrime in #12448
- [CI] increase ut buckets & adjust estimation time. by @hnyls2002 in #12919
- [PD] feat: refactor custom mem pool and add barex pd support by @stmatengss in #12332
- [CI] Fix `matrix.part` in pr-test. by @hnyls2002 in #12920
- Adjust server launch time in ci by @ispobock in #12917
- feat: basic support for server-level multimodal cache by @mickqian in #10775
- Refactor / Unify event loop across PD-Disagg, Overlap, DP-Attn cases by @hnyls2002 in #12839
- [lint] tiny fix unimported packages. by @hnyls2002 in #12927
- ci: try to fix gpg error during kernel build by @ishandhanani in #12928
- Support piecewise cuda graph for MLA by @ispobock in #11812
- diffusion: skip full CI suite for multimodal_gen changes by @mickqian in #12940
- Minor code cleanup / improvement for `PREBUILT_EXTEND` mode by @hnyls2002 in #12948
- Bugfix: LMCache Connector with Sglang by @MMuzzammil1 in #12946
- [Docs] Add docs for Qwen3-VL image and video support by @adarshxs in #12554
- [Refactor] rename set_index_k_and_scale_buffer to set_index_k_scale_b… by @edwingao28 in #12956
- Refactor KTransformers heterogeneous compute with unified GPU-quantization backend by @Atream in #12834
- diffusion: fix detected file changes rule in CI by @mickqian in #12943
- clean redundant code in previous PR by @Atream in #12957
- Fix the run-time error when calling fused_rms_mxfp4_quant that change return output number by @kkHuang-amd in #12803
- diffusion: fix wan-2.2-TI2V and support sp by @mickqian in #12926
- [Refactor / Style] Unify all event loops (except for PP) by @hnyls2002 in #12959
- chore: bump sgl-kernel version to 0.3.17 by @sglang-bot in #12931
- chore: include a minimum image for vlms when warming-up by @mickqian in #9528
- [PP] put pp assert in model runner by @XucSh in #12934
- Fix errors of page head kernels in sgl-kernel for ROCm by @huangtingwei9988 in #12604
- [Fix] Add validation for served model name to reserve `:` for LoRA adapter syntax by @neelabhsinha in #12912
- Support hidden_dim % 4 == 0 in per_token_quant_fp8 by @BBuf in #12883
- [RadixTree] Reduce Syscalls, Optimize Collection Filtering and Align with cpp by @CLFutureX in #12239
- [Auto Sync] Update batch_invariant_ops.py (20251109) by @merrymercy in #12916
- [router] bucket policy by @syy-hw in #11719
- fix missing output_token_logprobs when using ngram speculative decoding by @a4zhangfei in #10702
- feat(metrics): add scheduler and hiradix cache metrics (#10218) by @ShawnKung in #10225
- diffusion: reduce effort of supporting new model by @mickqian in #12982
- vlm: fix tiny multimodal cache bug by @yhyang201 in #12984
- chore: bump sgl-kernel version to 0.3.17 by @sglang-bot in #12966
- [1 / 2] register weak_ref_tensor in sgl-kernel by @BBuf in #12999
- Support piecewise cuda graph for deepseek v3 by @ispobock in #12996
- minor: fix notebook bug with new model_info fields added for warmup by @mickqian in #13005
- Super tiny fix typo by @fzyzcjy in #13001
- Add `process_prefill_chunk` back to fix PP event loop by @hnyls2002 in #13009
- [misc][ci] Add run-ci after auto-labeler by @CatherineSue in #13013
- Unify memory management across `(overlap, non-overlap) x (page>=1) x (spec, non-spec, spec v2) x (retract, finished)` by @hnyls2002 in #12224
- Enhance retract test (page cases, long output cases) by @hnyls2002 in #12781
- [AMD CI] Remove SRT docker build. by @saienduri in #11850
- [CI] Limit the CI trigger frequency of low-privilege actors by @hnyls2002 in #13010
- Resolve HF download issue and download models before CI run starts by @Kangyan-Zhou in #12952
- Add pre-shuffle weight for new aiter MoE support. by @sogalin in #12908
- chore: bump SGLang version to 0.5.5.post1 by @sglang-bot in #13000
- [router][ci] Fix maturin build by @key4ng in #13012
- Simplify the BatchMultimodalOutput in io_struct.py by @merrymercy in #12993
- [router][ci] Quick Improvement to make CI more stable by @key4ng in #12869
- [9/n] decouple quantization impl from vllm dependency - adjust ci by @AniZpZ in #12753
- [router] add postgres databases data connector by @lengrongfu in #12218
- [AMD CI] Update docker release workflows docker file name. by @saienduri in #13028
- fix tuning_fused_moe_triton_sep tool per_channel_quant bug by @BBuf in #13027
- fix(ci): workflow id in permission rate limit by @cicirori in #13035
- [PieceWise CUDA Graph] Support awq/gptq model in piecewise cudagraph by @BBuf in #12518
- Re-enable Flashinfer TRTLLM GEN MHA and Add Unit Test by @samuellees in #12885
- [AMD CI] Update CI Version Logic. by @saienduri in #13029
- [diffusion] doc: add support_new_models by @mickqian in #13043
- Sglang Tracing: optimize trace_event_batch() by @sufeng-buaa in #13036
- [CI] Auto format code by @BBuf in #13053
- disable overlap schedule if mamba radix cache open by @yizhang2077 in #13057
- [AMD] Add PD test for AMD CI by @michael-amd in #11938
- Remove duplicate import by @LHXuuu in #12980
- [Bug] TypeError: maybe_executor_submit() by @Johnsonms in #13050
- [Fix] Add TPOT back to bench_serving by @elvischenv in #12976
- [bug][rocm]fix qr when variable inp by @haoyangli-amd in #11609
- [ROCM] Optimized deepseek-r1 model with rmsnorm + fp8 quant fusion by @yctseng0211 in #12689
- Update rope dtype config by @ispobock in #13037
- Tiny simplify eviction metrics collector by @hnyls2002 in #12983
- [RadixTree] Reduce Stack Push/Pop Overhead for Leaf Nodes, Improve radix_tree Leaf Collection Performance by @CLFutureX in #12199
- fix: display served_model_name in /v1/models by @Sunhaihua1 in #13063
- refine stdout logging codes by @cicirori in #13015
- Support `file://` scheme in `load_video` by @netanel-haber in #13076
- [BugFix] Fix prefill memory leak in PD + GDN by @ZeldaHuang in #12994
- [Bug] Login shell error: bash: /root/.cargo/env: No such file or directory by @Johnsonms in #12941
- Fix cached tokens usage bug by @FrankMinions in #12814
- Revert "[AMD] Add PD test for AMD CI (#11938)" by @hnyls2002 in #13088
- [Test] Handle streaming chunks with null content in case of stream end. by @vshekhawat-hlab in #10862
- [Fix] Update text_chunks in bench_serving chat completions by @ZeldaHuang in #13041
- Fix CPP Radix Cache and add test to CI by @cctry in #11645
- [AMD] Apply AITER_MXFP4_MOE_SF=1 only to gfx950 in aiter build by @hubertlu-tw in #13092
- Export runner labels via env var by @Kangyan-Zhou in #13018
- Fix spec decoding acc length for dpsk-r1-fp4 tp8 (2nd attempt) by @Qiaolin-Yu in #12915
- [Fix] Fix nan error for large scale ep by @Fridge003 in #12866
- [AMD CI] Update nightly docker build CI config. by @saienduri in #13090
- overlap shared + routed expert computation in kimi linear by @b8zhong in #12660
- Revert "fix: display served_model_name in /v1/models" by @CatherineSue in #13093
- [Deepseek V3.2] Fix accuracy bug in the Indexer by @hlu1 in #12583
- [Router] use call_id instead of id for matching function calls in Responses API for Harmony by @zhaowenzi in #13056
- Upgrade to ROCm 7.0 image by @yctseng0211 in #13105
- Improve overlap scheduling for better TTFT by @vipwangerxiao in #11856
- At least tell the user that ngram verify is greedy! by @MayDomine in #13039
- diffusion: remove unused workflows folder by @mickqian in #13114
- Don't fuse wk+weight_proj for nextn by @trevor-m in #12863
- [diffusion] log: improve logging while multiprocessing by @mickqian in #12997
- Fix gpt oss 4gpu b200 trace links by @alisonshao in #12872
- diffusion: refactor task type of models by @mickqian in #13118
- [Feature] Trace: Support http/protobuf span exporter protocol by @zhanghaotong in #12396
- [sgl-kernel][5/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #12666
- feat(engine): add rid parameter to methods in Engine class by @ishandhanani in #13095
- [PD] Add custom gpu id to device topo support by @stmatengss in #12817
- [CI] Update job dependency and move dpsk v3.2 tests to 8-gpu suite by @Fridge003 in #12942
- Fix run suite sanity check by @ispobock in #13133
- [router] move radix tree to policy crate and addreses some code styles by @slin1237 in #13131
- fix: Remove duplicated kv_events initialization in scheduler by @wxsms in #13132
- Fix strict level setting for Kimi K2 tool calls when not explicitly set by @JustinTong0323 in #13077
- [misc] Remove performance and router-benchmark label matching by @CatherineSue in #13135
- Fix re-trigger actor of CI rate limit by @hnyls2002 in #13136
- [router] Support complex assistant and tool messages in /chat/completions by @hellodanylo in #12860
- [router] add minmax m2 reasoning parser by @slin1237 in #13137
- fix(tcp-port): replace bind_server_socket to get_zmq_socket(Port conflict) by @jimmy-evo in #11961
- fix: duplicate resize images logic of qwen-vl series models by @yangsijia-serena in #12458
- [router][grpc] Support vllm backend for grpc router by @CatherineSue in #13120
- Dump `total_throughput` to output-file in `bench_serving.py` by @Rohan138 in #9790
- Fix the Wrong Return Type of `Scheduler.recv_requests` by @Arist12 in #7886
- Update aiter to v0.1.7.post1 by @sogalin in #13149
- [RPC] Fix handle_rpc_request with `**recv_req.parameters` by @CharlieFRuan in #7906
- [Ascend]adapt enable-profile-cuda-graph for NPU by @ping1jing2 in #12617
- [Feature] Propagate Trace Headers into Root Span for OpenTelemetry Cross-Service Context by @zhanghaotong in #10808
- chore: bump SGLang version to 0.5.5.post2 by @sglang-bot in #13129
- [Ascend][feature] support L1+ L2 radixcache on ascend by @khalil2ji3mp6 in #12214
- [ngram] use SGLANG_NGRAM_FORCE_GREEDY_VERIFY to control verify method by @a4zhangfei in #13153
- [Ascend] torch_npu.npu_mrope for MRotaryEmbedding by @Makcum888e in #10907
- [VLM] Support PP for Qwen2.5-VL by @yuan-luo in #13075
- Fuse routed_scaling_factor to fused_marlin_moe by @ispobock in #12998
- [Auto Sync] Update test_deterministic.py (20251112) by @merrymercy in #13128
- bugfix: multi-model routing for /generate api by @SYChen123 in #12979
- [router] Add comprehensive validation to Responses API by @key4ng in #13127
- [router] Fix Flaky test_circuit_breaker_opens_and_recovers by @XinyueZhang369 in #13164
- Add job and runner failure monitor workflow for CI by @dougyster in #13104
- [DeepseekV32]: use `_concat_mla_absorb_q_general` to replace `torch.cat` by @bingps in #12215
- Fix nan in global scaling factor for large scale nvfp4 EP by @wenscarl in #13162
- [Ascend] LoRA: adding Ascend LoRA backend with using kernels from sgl_kernel_npu by @vlserov in #12288
- Add `RequestMetricsExporter` utility to export request-level metrics by @scottjlee in #10973
- Opt kimi_k2_thinking biased topk module by @BBuf in #13150
- [router] remove worker url requirement by @slin1237 in #13172
- Remove EBNF Composer by @TJ5 in #13163
- fix build error in Dockerfile.diffusion by @Sunhaihua1 in #12975
- [Ascend] add npu synchronize by @hustmf in #13154
- [AMD] Add AITER Custom All-Reduce by @hubertlu-tw in #13102
- Revert "fallback to triton mm_persistent kernel when deepGemm fail" by @fzyzcjy in #13178
- Remove enable_dp_attention in deepseek nightly tests by @Kangyan-Zhou in #13190
- [router] minmax-m2 xml tool parser by @slin1237 in #13148
- fix: display served_model_name in /v1/models by @Sunhaihua1 in #13155
- Replace [silu_and_mul_]scaled_fp4_group_quant by Flashinfer equivalent by @wenscarl in #12376
- [FEAT][ROCM] enable fused shared expert for Rocm by @ZLkanyo009 in #12201
- Enable Flashinfer TRTLLM-GEN-MoE FP8 blockwise kernel for Qwen3-Next on Blackwell by @samuellees in #12543
- [Quantization] Support Quark Dense + MoE FP8 & FP8 PTPC by @BowenBao in #10485
- Fix accept rate in speculative decoding metrics by @SiqiLi-Fighting in #13212
- Set max parallel for 1-gpu runner by @hnyls2002 in #13215
- Add model validation for all GPU runners to prevent cache corruption by @alisonshao in #13171
- Bump actions/download-artifact from v4 to v6 for B200 workers by @Kangyan-Zhou in #13220
- docs: update fused MoE config path by @edwardzjl in #13211
- [PD / HiCache]fix decode kvcache offload manager memory leak by @huangtingwei9988 in #12774
- Fix broken Markdown formatting in DeepEP documentation by @Taishi-N324 in #13210
- [Feature] Enable CUDA graph for PD-Multiplexing. by @ykcombat in #11595
- Fix wrong running_bs in priority scheduling by @dtcccc in #13142
- [sgl-kernel] support custom fp8 flashmla kernel by @FlamingoPg in #13087
- [router][grpc] Refine docs in minimax_m2 to match other parsers by @CatherineSue in #13218
- Update GDN causal conv1d cuda kernel - prepare for new changes by @byjiang1996 in #13188
- Use 32x32 black image for VLM server warmup and bring glm4.1v back to UT by @byjiang1996 in #13222
- [sgl-kernel] clean up fa fetch in CMakeLists.txt by @FlamingoPg in #12392
- [Auto Sync] Update pynccl_wrapper.py, environ.py, registry.... (20251111) by @merrymercy in #13097
- [AMD] Fix AITER_MXFP4_MOE_SF setting for gfx950 by @hubertlu-tw in #13239
- remove deprecated `tile_tokens_dim` by @b8zhong in #13186
- [feat] make warmup timeout configurable through SGLANG_WARMUP_TIMEOUT by @billishyahao in #13243
- fix: fix serve command without diffusion dependency by @mickqian in #13246
- docker: Fix apt-add-repository by @kshitij12345 in #13213
- [Auto Sync] Update backend.py, forward_batch_info.py, piece... (20251113) by @merrymercy in #13221
- Fix nightly tests to fail properly when any job fails by @alisonshao in #13096
- Super tiny fix outdated doc by @fzyzcjy in #13255
- Remove glm41v from CI to speed up CI by @byjiang1996 in #13257
- Update model weight validation logic to handle special weight file naming by @Kangyan-Zhou in #13256
- Add 3 models to 2 gpu runner in model downloading from nightly tests by @Kangyan-Zhou in #13261
- [Tool Call] Streamline function arguments when tool_choice="auto" for deepseekv31_detector by @Muqi1029 in #11589
- [model-gateway] change mg labeler from router to model-gateway by @slin1237 in #13265
- ci: speed up b200 ci by @b8zhong in #13237
- Extend lint test to test/ directory by @Kangyan-Zhou in #13247
- [BugFix] weight load bug when checkpoint expert.gate and expert.up_proj are not fused by @Yuechguo in #13113
- Tiny fix update version logic location by @fzyzcjy in #12620
- Tiny enhance dumper with ctx and enable flags by @fzyzcjy in #12622
- Enhance dumper comparator with tensor unifier and location finder by @fzyzcjy in #12623
- Tiny add utility to parse server logs by @fzyzcjy in #12605
- [CPU] Use covt_e4m3_bf16 to optim BF16 to FP8 convert by @wangyxbh in #12191
- chore: bump flashinfer v0.5.2 by @zhyncs in #13242
- Super tiny fix CI by @fzyzcjy in #13283
- Add script to create a model with fewer layers for debugging by @fzyzcjy in #13284
- [BugFix] fix bench_serving error when multimodal image is testing by @ZLkanyo009 in #13254
- Optimized prefill cache allocation for NPU by @terfendail in #13288
- refactor: remove duplicate function _get_bootstrap_info_from_server by @acelyc111 in #13277
- LLama4 Attention: Update assertion msg by @kshitij12345 in #12777
- [Deepseek V3.2] Clean up MTP by @hlu1 in #13236
- Support orion by @ppraneth in #10665
- model: support teleflm by @ppraneth in #10573
- [Doc] Add item for repetition punishment by @SenmiaoORZ in #13260
- Support FP8 Per Token Quant Piecewise by @hebiao064 in #13272
- [minor] remove debug code in python/sglang/srt/compilation/weak_ref_tensor_jit.py by @merrymercy in #13235
- [NVIDIA] Fix use case of SGLANG_ENABLE_FLASHINFER_GEMM by @kaixih in #13274
- Fix NSA indexer nightly test failed issues by @Johnsonms in #13298
- Implement nightly test workflow naming conventions by @alisonshao in #13170
- Remove nightly b200 tests and revert a change for test file by @Kangyan-Zhou in #13305
- [router]Replace requests lib with openai in e2e_response_api by @XinyueZhang369 in #13293
- Piecewise Cuda Graph Support for gpt-oss model by @Oasis-Git in #13045
- Add missing model in model validate list by @Kangyan-Zhou in #13310
- Add missing model for 2-gpu-runner in nightly tests by @Kangyan-Zhou in #13311
- [Deterministic] Support Qwen3-Next model deterministic inference by @zminglei in #13100
- CI: server performance test for SGLang Diffusion by @adarshxs in #13091
- [model-gateway] smg release 0.2.3 by @slin1237 in #13312
- Fix syntax errors in cpp_radix_tree by @Missmiaom in #13315
- [Misc]Add date to cu13 dev image tag by @Fridge003 in #13316
- [Diffusion] switch to local `calculate_dimensions` by @adarshxs in #13294
- Add more statistics for spec decoding by @zhuzilin in #13317
- Consolidate similar tests to reduce duplication by @alisonshao in #12871
- feat: Add FP4 (E2M1) KV Cache Support for MHA by @JackChuang in #12612
- Fix: test_vlm_offline_throughput output throughput by @dougyster in #13279
- Add feature flag to broadcast mm inputs processing by @yuan-luo in #13278
- re-submit 12911 but relax the requirement for deepgemm by @zminglei in #13226
- Revert moe sum reduce for marlin moe by @ispobock in #13314
- [RL] support only do cpu backup on draft model by @zhuzilin in #13318
- [model-gateway] move python to binding folder by @slin1237 in #13295
- [RL] Allow bypassing /health check by @zhuzilin in #13320
- Support fast gemm when in batch invariant DeepGEMM fallback by @fzyzcjy in #13259
- Support inverse transform ue8m0 scale by @fzyzcjy in #13285
- Tiny refactor condition to requant scale ue8m0 by @fzyzcjy in #13286
- [RL] support update_weights_from_tensor for mtp by @zhuzilin in #7415
- Update marlin moe kernel interface by @ispobock in #13322
- [opt kimi k2 1 / n] Add kimi k2 moe fused gate by @BBuf in #13287
- Super tiny expose transform_scale_ue8m0 API for RL frameworks by @fzyzcjy in #13323
- Update README by @zhyncs in #13326
- Fix: add missing get_embed_and_head in MiniMax M2 for Eagle3 by @pyc96 in #13297
- [router] Fix flaky router e2e tests by @XinyueZhang369 in #13306
- Opt tp: tp attn support tp reduce scattered input by @xu-yfei in #10568
- [Diffusion] add health endpoints to diffusion server by @adarshxs in #13329
- [model-gateway] remove grpc feature flag and mark as default by @slin1237 in #13330
- Temporarily disable test_vision_openai_server_a CI by @ispobock in #13331
- [Feature] Spec-Overlap supporting DP-ATTN; PD-Disaggregation; npugraph mode by @iforgetmyname in #12443
- [Ascend][Feat] Add Ascend sampling backend by @Alexhaoge in #12692
- perf: optimize TypeBasedDispatcher using dict for O(1) lookup by @zhengxle in #12001
- tiny fix lint by @hnyls2002 in #13337
- [optimize] Provide Usrbio compilation and installation commands by @leihuang-sketch in #12329
- chore: bump sgl-kernel version to 0.3.17.post1 by @sglang-bot in #13325
- Clean up deprecated tile_tokens_dim for next flashinfer by @vincentzed in #13341
- Add FP32 dtype support for RoPE - Part1 by @jinyouzhi in #13181
- Add missing models by @Kangyan-Zhou in #13351
- [CI] check unit-test-backend-8-gpu-h20 in workflow by @ch-wan in #13355
- [Fix] Register custom ops only if they exist by @merrymercy in #13321
- [Piecewise CUDA Graph] Support ModelOpt FP4 by @b8zhong in #13101
- chore: bump sgl-kernel version to 0.3.17.post1 by @sglang-bot in #13358
- [feature] Add layerwise NVTX support by @kyleliang-nv in #11870
- [Ascend]support xgrammar backend for ascend npu by @ash-sigh in #12310
- [Piecewise CUDA Graph] Support W4A8 by @b8zhong in #13179
- [Performance] Move the contiguous to torch compile region by @DarkSharpness in #13199
- Fix dpsk-r1-fp4 tp8 by reverting two commits (#13162 and #13341) by @Qiaolin-Yu in #13348
- Add missing models by @Kangyan-Zhou in #13369
- [RL] enable offloading hybrid linear attn model by @zhuzilin in #13336
- [model-gateway] fix model gateway pypi release workflow path by @slin1237 in #13372
- [model-gateway] fix SDist step readme path by @slin1237 in #13373
- [diffusion] refactor and added tests for Flux, T2V, TI2V, I2V by @adarshxs in #13344
- diffusion: correct check-changes for multimodal_gen by @mickqian in #13375
- diffusion: enable fa4 for blackwell by @yhyang201 in #13263
- [opt kimi k2 2/n] apply kimi k2 thinking moe_fused_gate by @BBuf in #13332
- [2 / 2] apply sgl-kernel weak_ref_tensor by @BBuf in #12978
- Add SGLANG_ENABLE_REQ_POOL_LEAK_STRICT_CHECK to bypass mem leak check by @zhuzilin in #13339
- Add default enable_memory_saver to HybridLinearKVPool by @zhuzilin in #13371
- Cleanup vision attention related codes by @JustinTong0323 in #13228
- Remove unused code / testcases in `lang` by @hnyls2002 in #13335
- Tiny deprecate the `--range-begin` in `run_suite.py` by @hnyls2002 in #13381
- [router] bindings for go by @whybeyoung in #13384
- fix generative_models.md table - remove newlines by @netanel-haber in #13385
- fix nightly docker build by @b8zhong in #13386
- fix import qwenvl error in RL engine by @dangkai4u in #12874
- [CI] use cached deepep installation in gb200 CI by @ch-wan in #13388
- Support spec decoding when LoRA is applied to target model by @lifuhuang in #12903
- [CI] Fix B200 CI by @Fridge003 in #13387
- [Tiny]Fix 1-gpu nightly test bugs by @Fridge003 in #13389
- chore: bump SGLang version to 0.5.5.post3 by @sglang-bot in #13366
- refactor: replace worker pool with semaphore-based concurrency in jobqueue by @RiversJin in #13383
- Update docs by @merrymercy in #13391
- (1/n)support context parallel with deepseekv3.2-DSA by @lixiaolx in #12065
- [RL] re-abort_request when model_update_lock is still locked by @zhuzilin in #13338
- [1/N] CI refactor: introduce CI register. by @hnyls2002 in #13345
- [Doc] Update CI oncall list by @merrymercy in #13396
- Update .github/MAINTAINER.md by @Ying1123 in #13398
- [HiCache] support memory_pool_host page head layout by @huangtingwei9988 in #11644
- [HiCache] add GPU id to IB dev topo for mooncake storage backend by @stmatengss in #13112
- Support weight update for blackwell DeepGEMM by @fzyzcjy in #13324
- refactor linear memory pool by @yizhang2077 in #13004
- Remove deprecated scripts by @hnyls2002 in #13399
- [model-gateway] update workflow names for gateway and exclude npu by @slin1237 in #13415
- [Tiny fix] Fix bench_speculative.py run bug by @BBuf in #13416
- [model-gateway] Add Gateway Release Tooling by @slin1237 in #13420
- fix uneven PP layer indices by @alpha-baby in #13282
- diffusion: fix wan2.2 ti2v num_frames adjust logic by @mickqian in #13379
- [PD][bug fix] fix memleak when last_batch is none by @XucSh in #13144
- Fix cache_tokens calculate issue when retracted by @QiuMike in #11900
- [feature] Custom base path on FastAPI server by @kebyn in #5879
- Adding user defined hooks support by @Carlomus in #13217
- Fix log time stats by @qhsc in #13418
- [Ci tiny fix] Lower score threshold in evaluation test by @BBuf in #13443
- diffusion: fix loading with local model_path by @mickqian in #13445
- [2/N] CI refactor: separate some backend-independent CPU tasks. by @hnyls2002 in #13447
- Temporarily disable model hooks CI by @hnyls2002 in #13450
- [Deepseek V3.2] Use torch.compile to speed up torch.cat in nsa by @hlu1 in #13022
- Remove verbs from GET endpoint paths to follow REST standards by @slin1237 in #13273
- Add missing models by @Kangyan-Zhou in #13456
- extend sagemaker.Dockerfile serve script to allow all sglang serve flags by @sirutBuasai in #13173
- Fix 8-gpu B200 nightly tests by @Kangyan-Zhou in #13457
- Fixes validation errors for Wan-AI models which store model weights in subdirectories by @Kangyan-Zhou in #13461
- [Embeddings Performance Testing] Add performance test for embedding models by @vedantjh2 in #12359
- [NVIDIA] Fix broken fp8 MoE of deepseek v3 by @kaixih in #13264
- Temporarily comment out multimodal gen test to recover runners by @Kangyan-Zhou in #13463
- Add interface_v1 option for dynamic HiCache backend by @pansicheng in #13140
- Add bfloat16 tuned fused moe config for Dpsk-MTP layer on B200 by @Fridge003 in #13455
- fix MambaPool clear method after refactoring by @zminglei in #13449
- [AMD CI] Update sgl-router python path in dockerfile. by @saienduri in #13458
- [CI] re-enable test_vision_openai_server_a ci by @yhyang201 in #13444
- Adding CI Monitor Improvements by @dougyster in #13462
- [GLM4.6v] Required changes for bumping up to transformer 5.x by @byjiang1996 in #13229
- [GLM4.6v] Relax the constraint of non-user role chat completion message schema for new GLM-v release by @byjiang1996 in #13258
- [model-gateway] use worker startup time out for worker registration by @slin1237 in #13473
- Support JetVLM by @futrime in #13289
- Add a unified server arg for multimodal inputs preprocessing config. by @WingEdge777 in #12149
- [PD] Clarify init method docstrings for kvsender and kvreceiver by @ShangmingCai in #13476
- Fix lora test by @hnyls2002 in #13479
- [Piecewise CUDA Graph] Support ModelOpt FP8 by @b8zhong in #13094
- CI: fix NFS EBUSY error in PR test workflow by @alisonshao in #13460
- [CI] fix triggered by a non-run-ci label by @hnyls2002 in #13393
- [CI] remove auto-labeling `run-ci` label. by @hnyls2002 in #13486
- fix: change performance log directory to cache path by @ch-wan in #13482
- [CI] Add input for pr-gate by @hnyls2002 in #13491
- [opt kimi k2 3/n] opt kimi_k2 moe_fused_gate kernel by @BBuf in #13374
- [CI] fix lint yml (syntax error) by @hnyls2002 in #13496
- [VLM][feat] Support encoder DP for Qwen2.5-VL by @liusy58 in #13126
- [HiCache] Critical fix to host memory double free by @xiezhq-hermann in #13501
- [BugFix] Accuracy and function Issue when run ptpc quant model by @Yuechguo in #13157
- fix: create git tags directly instead of temporary branches by @alisonshao in #13168
- Add .github/CI_PERMISSIONS.json to define the CI permissions by @merrymercy in #13509
- README.md -> FOLDER_README.md by @merrymercy in #13510
- Use slash command to trigger CI by @merrymercy in #13512
- Add docs on trigger ci by @merrymercy in #13513
- [Feature] Re:Enable hybrid mem saver by @ocss884 in #12962
- Trigger CI retry with edit by @merrymercy in #13516
- Update docs by @merrymercy in #13519
- Add /tag-and-rerun-ci by @sglang-bot in #13521
- [CI] update pr-gate to be compatible with new slash triggering manager. by @hnyls2002 in #13522
- [CI] fix skipping pr-gate on main by @hnyls2002 in #13525
- Small cleanups related to LoRA weight loading by @glenliu21 in #13474
- [CI] fix CI skipped on main by @hnyls2002 in #13527
- [model-gateway] fix gateway docker build due to recent py code change by @CatherineSue in #13532
- [model-gateway] limit opened files in docker build to fix edge case by @CatherineSue in #13536
- [docker] fix dockerfile naming for diffusion by @slin1237 in #13534
- fix lora test by @gongwei-130 in #13537
- Remove jet-ai/Jet-Nemotron-2B in nightly text tests as this is constantly failing by @Kangyan-Zhou in #13540
- [Bug] Fixes accuracy issues caused by incorrect use of rope by @Baidu-AIAK in #13495
- Flashinfer TRTLLM-GEN-MoE + Qwen3 by @b8zhong in #13489
- [chore] Disable ccache for sgl-kernel release by @Fridge003 in #13541
- Add Qwen/Qwen1.5-MoE-A2.7B to model list by @Kangyan-Zhou in #13543
- [Fix] Fix DeepSeek V3 MTP on B200 by @Fridge003 in #13548
- [router][grpc] Support num_reasoning_tokens in harmony models by @CatherineSue in #13047
- [feat][Ascend][Mindspore]: support model-impl of mindspore by @chz34 in #9234
- [AMD CI] Local cache fallback. by @saienduri in #13452
- [CI] fix amd 1 gpu basic test by @hnyls2002 in #13551
- [Doc] Update HiCache and Mooncake docs & Mooncake Setup Error Checking by @ykwd in #12740
- purge unnecessary env variable set in deterministic test by @zminglei in #13481
- chore: bump sgl-kernel version to 0.3.17.post2 by @sglang-bot in #13542
- Add `lmsys/gpt-oss-20b-bf16` to model validation check by @hnyls2002 in #13557
- CI Failure Monitor Improvements by @dougyster in #13558
- [RL] Allow passing tensors of different dtypes for FlattenedTensorBucket by @zhuzilin in #13413
- [CI] Fix CUDA workflow's dependency. by @hnyls2002 in #13568
- [NPU] Adapt pr-gate for pr-test workflow & workflows refresh by @iforgetmyname in #13567
- Tiny enhance test suites sanity check by @hnyls2002 in #13589
- [3/N] CI refactor: move some manually triggered tests. by @hnyls2002 in #13448
- Support moe topk sigmoid kernel by @rogeryoungh in #13049
- Extend compatibility check for all quantized MoE models by @JustinTong0323 in #13465
- add https://github.com/netanel-haber to CI_PERMISSIONS.json by @netanel-haber in #13577
- chore: bump sgl-kernel version to 0.3.17.post2 by @sglang-bot in #13570
- [Auto Sync] Update base_grammar_backend.py, collector.py (20251116) by @merrymercy in #13357
- [GDN] Remove unnecessary contiguous() by @byjiang1996 in #13604
- [GDN] Remove unnecessary conv state clone by @byjiang1996 in #13603
- [VLM] Support Piecewise CUDA Graph for Qwen2.5-VL by @yuan-luo in #13055
- CI: improve diffusion CI by @mickqian in #13562
- Support external custom models by @zhooooong in #13429
- [CI fix] Fix image download failures in VLM CI tests by @BBuf in #13613
- [NVIDIA] Add fp8 gemm benchmark on blackwell by @kaixih in #13528
- [UT] Destroy process group after broadcast to resolve port occupation issues in multi-server tests by @galeselee in #12379
- diffusion: remove PreprocessorConfig by @mickqian in #13248
- diffusion: refactor pipeline folders by @mickqian in #13253
- Add FP32 dtype support for RoPE - Part2 by @jinyouzhi in #13328
- [Fix] Remove multimodal_gen redundant get_bool_env_var func by @shauntajoesph-ops in #13583
- Add support for new aiter version (AR accuracy, is_shuffled PR) by @1am9trash in #13554
- diffusion: improve baseline performance monitor by @mickqian in #13614
- [Feature] Introduce JIT Kernel in sglang (with hicache JIT kernel) by @DarkSharpness in #13453
- [CI] Align metric units for CI rate limit by @hnyls2002 in #13633
- [ROCM] Optimized deepseek-r1 fp8 model with + triton_gemm_a8w8 + batch_gemm_a8w8 + fused set_mla_kv_buffer kernel by @yctseng0211 in #13617
- fix bench_speculative bug by @Lzhang-hub in #13197
- Revert "[Feature] Introduce JIT Kernel in sglang (with hicache JIT kernel)" by @merrymercy in #13644
- [CI] optimize CI workflow info by @hnyls2002 in #13634
- Kill zombie diffusion processes in CI & minor code style fix on rotary embedding fallback by @merrymercy in #13637
- [CI] apply pr-gate for XPU by @hnyls2002 in #13663
- Add fused_rmsnorm_gated_cpu kernel for CPU to support Qwen3-Next by @yanbing-j in #11577
- [10/n] decouple quantization impl from vllm dependency - fix import by @FlamingoPg in #13524
- Adding nightly tests as release guard for bot bump workflows by @dougyster in #13655
- [DeepseekV3.2] Deepseek fp8 support for MHA path by @YAMY1234 in #12964
- Fix launch of `Olmo3` by @vincentzed in #13666
- [Deepseek V3.2] Change indexer weights_proj to fp32 by @hlu1 in #13459
- enable csgmv automatically on cuda by @b8zhong in #13600
- Add nightly test CI monitor workflow by @alisonshao in #13038
- allow loras to be implicitly evicted and loaded based on max_loaded_loras by @glenliu21 in #11526
- Test reorganization: Move tests to manual/ by @alisonshao in #13610
- [Piecewise CUDA Graph] Fix recompile issue for Mixtral and Grok2 by @hebiao064 in #13667
- Super tiny remove unused MiniMaxM2MLP class by @fzyzcjy in #13659
- Update quantization.md with new model resources by @zhaochenyang20 in #13677
- [model-gateway] add both python and rust cli alias by @slin1237 in #13678
- [diffusion] CI: improve validation method by @mickqian in #13627
- [model-gateway] fix gateway cli arg parser to not use = by @CatherineSue in #13685
- [CI] Move nightly tests to test/nightly/ by @alisonshao in #13683
- [NVIDIA] Add cutedsl e2e test to GB200 CI by @kaixih in #12672
- Add sgl-kernel CI test for Blackwell (B200) by @alisonshao in #13301
- remove unnecessary starvation check by @glenliu21 in #13619
- Fix target MLA with eagle3 support for PD disaggregation by @QiuMike in #13555
- [kimi k2 thinking] Avoid useless torch.zeros_ by @BBuf in #13596
- [opt kimi k2 4 / n] Delete useless pad kernel in sgl_moe_align_block_size by @BBuf in #13587
- [VLM] Support Piecewise CUDA Graph for InternVL by @yuan-luo in #13640
- [Piecewise Cuda Graph] rename, refactor and add more logging by @hebiao064 in https://github.com/sgl-project/sglang/pull/13675
- diffusion: speed up multimodal_gen ci by @yhyang201 in https://github.com/sgl-project/sglang/pull/13665
- [diffusion] doc: minor update docs by @mickqian in https://github.com/sgl-project/sglang/pull/13177
- Fix ZMQ bind error on non-zero rank nodes when using SGLANG_BLOCK_NONZERO_RANK_CHILDREN=0 by @ishandhanani in https://github.com/sgl-project/sglang/pull/13686
- [diffusion] server: use meta to avoid Linear init for TextEncoder by @zyksir in https://github.com/sgl-project/sglang/pull/13564
- [Auto Sync] Update http_server.py, io_struct.py, scheduler_... (20251120) by @merrymercy in https://github.com/sgl-project/sglang/pull/13679
- [Bugfix] Fix hidden state size in EAGLE PD disaggregation buffers by @michelemarzollo in https://github.com/sgl-project/sglang/pull/13590
- [HiCache] fix unit test with changed new APIs by @stmatengss in https://github.com/sgl-project/sglang/pull/13498
- [Fix] Qwen3Next lmhead dtype by @ZeldaHuang in https://github.com/sgl-project/sglang/pull/13708
- [NPU] chore: bump to CANN 8.3.RC1 and Pytorch 2.8.0 by @iforgetmyname in https://github.com/sgl-project/sglang/pull/13647
- [11/N] MoE Refactor: Simplifying SBO Implementation with Dispatcher Hooks by @ch-wan in https://github.com/sgl-project/sglang/pull/13327
- [Clean code] Compressed_tensors_moe code clean by @BBuf in https://github.com/sgl-project/sglang/pull/13719
- [diffusion] profile: support performance metric dumping and comparison by @mickqian in https://github.com/sgl-project/sglang/pull/13630
- [AMD] Enable fused shared expert append and flatten quant for fp8 deepseekR1 model by @yichiche in https://github.com/sgl-project/sglang/pull/13705
- [diffusion] doc: add contributing.md by @mickqian in https://github.com/sgl-project/sglang/pull/13649
- fix 3fs down, lock schedule main thread by @weibingo in https://github.com/sgl-project/sglang/pull/13407
- Fix url: use https://roadmap.sglang.io for roadmap by @merrymercy in https://github.com/sgl-project/sglang/pull/13733
- Super tiny delete unused files by @fzyzcjy in https://github.com/sgl-project/sglang/pull/13734
- [diffusion] log: minor improve logging by @mickqian in https://github.com/sgl-project/sglang/pull/13735
- [CI] minor hot fix of model validation list by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13737
- Add to ci permission by @guapisolo in https://github.com/sgl-project/sglang/pull/13739
- [Piecewise CUDA Graph] Support Kimi-K2 (non-Thinking) by @b8zhong in https://github.com/sgl-project/sglang/pull/13466
- Fix: CI monitor should not exit with error on regressions by @alisonshao in https://github.com/sgl-project/sglang/pull/13694
- Revert "enable csgmv automatically on cuda" by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/13707
- Support torch 2.9 + DeepEP by removing custom nvshmem by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12949
- add some more labels by @b8zhong in https://github.com/sgl-project/sglang/pull/13701
- Feat/nemotron nano v3 support by @roikoren755 in https://github.com/sgl-project/sglang/pull/12690
- Fix global scaling factor loading hang by @wenscarl in https://github.com/sgl-project/sglang/pull/13484
- Fix B200 Nightly tests and move one manual test back to unit test to prevent the same issue by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13746
- fix test_lora_update.py starvation message check by @glenliu21 in https://github.com/sgl-project/sglang/pull/13702
- Fix model weights validation with automatic cache cleanup by @alisonshao in https://github.com/sgl-project/sglang/pull/13729
- [Auto Sync] Update evict_policy.py, radix_cache.py (20251120) by @merrymercy in https://github.com/sgl-project/sglang/pull/13669
- [Tiny] Renaming environ for NVFP4 dispatch by @Fridge003 in https://github.com/sgl-project/sglang/pull/13756
- modularize gsm8k and mmmu test classes by @netanel-haber in https://github.com/sgl-project/sglang/pull/13506
- Use dual stream for DS MoE whenever cuda graph is used (instead of with token threshold) by @trevor-m in https://github.com/sgl-project/sglang/pull/9405
- [Ascend] support Kimi-K2-Thinking by @zhuyijie88 in https://github.com/sgl-project/sglang/pull/12759
- Refactor eagle bigram key matching by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13714
- fix hunyuanvideo and add 2gpu ci testing by @yhyang201 in https://github.com/sgl-project/sglang/pull/13720
- Update mem checker during busy by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13704
- Tiny support different prompts in `send_one.py` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13768
- [diffusion] refactor: refactor sampling params by @mickqian in https://github.com/sgl-project/sglang/pull/13706
- [VLM] Replace torch.repeat_interleave with faster np.repeat for Qwen-VL series by @yuan-luo in https://github.com/sgl-project/sglang/pull/13736
- [Spec v2] Remove `allocate_lens` and enable over-allocation by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13478
- tinyfix: diffusion ci by @yhyang201 in https://github.com/sgl-project/sglang/pull/13769
- align code style eagle draft&draft_extend cuda graph runner by @cicirori in https://github.com/sgl-project/sglang/pull/13533
- Refactor MHA & MLA KV caches to support FP4 by @JackChuang in https://github.com/sgl-project/sglang/pull/13547
- Move unnecessary input_addr capture under debug mode flag for speed-up by @byjiang1996 in https://github.com/sgl-project/sglang/pull/13690
- Gather static input buffers for cuda graph by @cctry in https://github.com/sgl-project/sglang/pull/13676
- Revert "Fix RMSNorm API CALL mismatch issue. (#10032)" by @ErsongWang in https://github.com/sgl-project/sglang/pull/13727
- [model-gateway] update smg code owner by @slin1237 in https://github.com/sgl-project/sglang/pull/13777
- [model-gateway] clean up router manager function order by @slin1237 in https://github.com/sgl-project/sglang/pull/13776
- Fix typo in docs by @yinpeiqi in https://github.com/sgl-project/sglang/pull/13709
- [Feature] HiCache JIT kernel (once again) by @DarkSharpness in https://github.com/sgl-project/sglang/pull/13764
- [DeepEP] Add SGLANG_DEEPEP_BF16_DISPATCH env var in Normal mode by @BBuf in https://github.com/sgl-project/sglang/pull/13787
- Upgrade flashmla kernel for NSA tp support by @YAMY1234 in https://github.com/sgl-project/sglang/pull/13718
- [diffusion] feat: support sp for image models by @mickqian in https://github.com/sgl-project/sglang/pull/13180
- [diffusion] CI: add run_suite to multimodal_gen CI by @mickqian in https://github.com/sgl-project/sglang/pull/13791
- Fix pagination bug in CI monitor preventing performance-test-2-gpu data collection by @alisonshao in https://github.com/sgl-project/sglang/pull/13781
- [Scheduler] Tiny organize code style by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13806
- [Deepseek] Refactor deepseek server_args _handle_model_specific_adjustments by @hlu1 in https://github.com/sgl-project/sglang/pull/13687
- [CI] Tiny refactoring sgl-kernel tests by @Fridge003 in https://github.com/sgl-project/sglang/pull/13813
- Tune fp8_w8a8 fused triton moe for GLM-4.6-FP8 by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/13815
- make trtllm attn backend's init_forward_metadata non blocking by @cicirori in https://github.com/sgl-project/sglang/pull/13802
- remove package json which is not used by @slin1237 in https://github.com/sgl-project/sglang/pull/13810
- [1/2] Refactor DeepGEMM requant for FP8 Linear on Blackwell by @Fridge003 in https://github.com/sgl-project/sglang/pull/13601
- chore: bump sgl-kernel version to 0.3.18 by @sglang-bot in https://github.com/sgl-project/sglang/pull/13816
- xgrammar up version to 0.1.27 by @Swipe4057 in https://github.com/sgl-project/sglang/pull/13650
- Fix bug: Incorrect variable used in rem_total_token_offset calculatio… by @liuhuijiayou in https://github.com/sgl-project/sglang/pull/13201
- [Doc] Refine fused_moe_triton configs doc by @BBuf in https://github.com/sgl-project/sglang/pull/13820
- Update MindSpore documentation by @wangtiance in https://github.com/sgl-project/sglang/pull/13656
- Refactor cache init logic by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13800
- [Bugfix] Add jit kernel files in packaging by @yuan-luo in https://github.com/sgl-project/sglang/pull/13829
- [diffusion] doc: minor update contributing.md with test section by @mickqian in https://github.com/sgl-project/sglang/pull/13792
- [misc] Rename minilb install env & remove files & fix lint by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13831
- [diffusion] CI: send nightly-test outputs of diffusion to slack for correctness monitoring by @yhyang201 in https://github.com/sgl-project/sglang/pull/13833
- [chore]Upgrade flashinfer to 0.5.3 by @Fridge003 in https://github.com/sgl-project/sglang/pull/13751
- [Intel XPU]support xgrammar backend for intel xpu by @gaopengff in https://github.com/sgl-project/sglang/pull/13245
- [sgl-kernel Code Clean] Remove useless lightning_attention kernel by @BBuf in https://github.com/sgl-project/sglang/pull/13819
- [VLM] Revise InternVL Piecewise CUDA Graph Supporting by @yuan-luo in https://github.com/sgl-project/sglang/pull/13846
- Fix TorchAO quant in VLM by @zhooooong in https://github.com/sgl-project/sglang/pull/13508
- [Fix]: Adjust FutureMap's token_id_bufs Size to Prevent ChunkedPrefill's next_token_ids from Overwriting Previous Prefill Requests' next_token_id by @ant-yy in https://github.com/sgl-project/sglang/pull/13713
- Fix: Safe RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads by @YAMY1234 in https://github.com/sgl-project/sglang/pull/11871
- [Fix] Fix uvloop get_event_loop() is not suitable for 0.22.x by @tom-jerr in https://github.com/sgl-project/sglang/pull/13612
- Tiny unpin uvloop for other backends by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13858
- [model-gateway] Refactor router e2e responses tests by @XinyueZhang369 in https://github.com/sgl-project/sglang/pull/13745
- [Perf] Optimize DeepSeek-R1 w4afp8 glue kernels by @yuhyao in https://github.com/sgl-project/sglang/pull/10027
- Fix quantized moe checker fail for Qwen3 dense fp8 model by @fzyzcjy in https://github.com/sgl-project/sglang/pull/13853
- [model-gateway] add grpc server code owner by @slin1237 in https://github.com/sgl-project/sglang/pull/13865
- [BugFix] fix outplace_fused_experts missing is_gated by @zminglei in https://github.com/sgl-project/sglang/pull/13864
- fix xgrammar_backend crash with malformed inputs by @gongwei-130 in https://github.com/sgl-project/sglang/pull/13752
- [Auto Sync] Update schedule_batch.py, schedule_policy.py, b... (20251122) by @merrymercy in https://github.com/sgl-project/sglang/pull/13763
- [Doc] Add an Introduction to Expert Parallelism by @ch-wan in https://github.com/sgl-project/sglang/pull/13783
- add LoRA warning if loading a preexisting LoRA adapter with a different name by @glenliu21 in https://github.com/sgl-project/sglang/pull/13822
- [NPU] Fix NPU CI by @iforgetmyname in https://github.com/sgl-project/sglang/pull/13834
- Overlap glm moe gemms in two cuda streams by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/13786
- [Performance] Replace preprocess_video logic from GLM multimodal processor with transformer impl for speed up (up to 27% faster) and addressing OOM (up to 50x improvements) by @byjiang1996 in https://github.com/sgl-project/sglang/pull/13487
- Add support for bf16 x bf16 cutlass fused MoE by @nvcastet in https://github.com/sgl-project/sglang/pull/10275
- [Router bugfix] Fix router_manager selecting the wrong router when enable-igw. by @SYChen123 in https://github.com/sgl-project/sglang/pull/13572
- Fix nightly test job to fail when any test fails by @alisonshao in https://github.com/sgl-project/sglang/pull/13871
- [diffusion] refactor: remove training-related code by @mickqian in https://github.com/sgl-project/sglang/pull/13860
- [CI] fix multimodel-gen-test job by @cyb70289 in https://github.com/sgl-project/sglang/pull/13874
- Add validation and cleanup for corrupted safetensors in multimodal loader by @alisonshao in https://github.com/sgl-project/sglang/pull/13870
- [CI] fix lint error by @cyb70289 in https://github.com/sgl-project/sglang/pull/13891
- fix: draft model revision misuse model revision by @gongwei-130 in https://github.com/sgl-project/sglang/pull/11893
- Fix trace publish paths in nightly-test-nvidia workflow by @alisonshao in https://github.com/sgl-project/sglang/pull/13888
- Adding nightly tests for Kimi-K2-thinking, Qwen3, minimax-m2, GLM4.6 by @dougyster in https://github.com/sgl-project/sglang/pull/13890
- [Fix] JIT kernel dependencies in other platforms by @DarkSharpness in https://github.com/sgl-project/sglang/pull/13889
- remove RoPE CPU fp32 tests by @ZailiWang in https://github.com/sgl-project/sglang/pull/13827
- Move test_dummy_grok_models.py from manual to srt (temporary) by @alisonshao in https://github.com/sgl-project/sglang/pull/13901
- [CI tiny fix] Enhance robustness of vision chunked prefill test with ROUGE-L metric by @BBuf in https://github.com/sgl-project/sglang/pull/13793
- update flashinfer_cubin==0.5.3 by @Lzhang-hub in https://github.com/sgl-project/sglang/pull/13848
- Add test_dummy_grok_models.py to not_in_ci section by @alisonshao in https://github.com/sgl-project/sglang/pull/13908
- fix diffusion profile bugs by @yizhang2077 in https://github.com/sgl-project/sglang/pull/13642
- [Fix]: Further fix the buffer len of future map by @ant-yy in https://github.com/sgl-project/sglang/pull/13916
- [diffusion] CI: minor refactor CI for less code duplication by @mickqian in https://github.com/sgl-project/sglang/pull/13905
- Update release-whl-kernel.yml by @ispobock in https://github.com/sgl-project/sglang/pull/13921
- [Ascend] qwen optimization by @Liwansi in https://github.com/sgl-project/sglang/pull/12078
- Support piecewise cuda graph for Qwen3-next by @Chen-0210 in https://github.com/sgl-project/sglang/pull/13081
- fix nixl prefill crash make decode health check failed by @llc-kc in https://github.com/sgl-project/sglang/pull/13657
- [CI] CI registry update by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13927
- [CI] rename: `per_commit` -> `registered` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13928
- [diffusion] doc: add doc for LoRA usage by @mickqian in https://github.com/sgl-project/sglang/pull/13931
- [diffusion] feat: support LoRA by @mickqian in https://github.com/sgl-project/sglang/pull/13859
- [CPU] Apply PR gating rule in CI workflow by @ZailiWang in https://github.com/sgl-project/sglang/pull/13933
- [misc] add llama3.1 chat template by @slin1237 in https://github.com/sgl-project/sglang/pull/13935
- [Minor] Fix lint by @Fridge003 in https://github.com/sgl-project/sglang/pull/13938
- [DeepSeekV3.2] Centralize NSA dispatch logic in NativeSparseAttnBackend by @YAMY1234 in https://github.com/sgl-project/sglang/pull/13544
- Fix docstrings for v1 HiCacheStorage methods by @ptovam in https://github.com/sgl-project/sglang/pull/13851
- Add Llama4 attention backend auto-selection by @janbernloehr in https://github.com/sgl-project/sglang/pull/13421
- Improve nightly tests by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13903
- [Auto Sync] Improve profilers and simplify bench_one_batch_server.py by @merrymercy in https://github.com/sgl-project/sglang/pull/13866
- Fix update weight error for blackwell DeepGEMM by @fzyzcjy in https://github.com/sgl-project/sglang/pull/13910
- [chore] update torch version to 2.9 by @FlamingoPg in https://github.com/sgl-project/sglang/pull/12969
- chore: bump sgl-kernel version to 0.3.18.post1 by @sglang-bot in https://github.com/sgl-project/sglang/pull/13942
- [Tiny]Upgrade README for sgl-kernel by @Fridge003 in https://github.com/sgl-project/sglang/pull/13945
- Fix nightly-test-nvidia.yml to have the correct trigger by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13950
- chore: bump sgl-kernel version to 0.3.18.post1 by @sglang-bot in https://github.com/sgl-project/sglang/pull/13951
- Fix Deepseek v3.1 loading issue by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13954
- fix: spec overlap `predict` shape does not match verify output shapes by @timmy-feng in https://github.com/sgl-project/sglang/pull/12786
- [VLM] Support InternVL Vision Encoder Data Parallelism by @yuan-luo in https://github.com/sgl-project/sglang/pull/13925
- Support FlashAttention3 page_size > 1 and topk > 1 case with paged attn and spec decode by @yubofredwang in https://github.com/sgl-project/sglang/pull/7725
- Fix nightly test failures: NSA indexer dtype and CPP radix cache init by @alisonshao in https://github.com/sgl-project/sglang/pull/13958
- update CI permission list by @ZailiWang in https://github.com/sgl-project/sglang/pull/13962
- Fix SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION not working by @fzyzcjy in https://github.com/sgl-project/sglang/pull/13961
- [feat] support in-flight weight update by @ShawnY112358 in https://github.com/sgl-project/sglang/pull/10071
- Turn off PREBUILD aiter in MI355 by @1am9trash in https://github.com/sgl-project/sglang/pull/13963
- Support piecewise CUDA graph for embedding models by @zhooooong in https://github.com/sgl-project/sglang/pull/13852
- diffusion: fix the issue where the qwen-edit&wan model produces incorrect output during sequence parallelism by @yhyang201 in https://github.com/sgl-project/sglang/pull/13922
- [Feature] Initial block diffusion language model support by @ClawSeven in https://github.com/sgl-project/sglang/pull/12588
- Use dynamically maintained num_waiting_tokens in get_load() by @vipwangerxiao in https://github.com/sgl-project/sglang/pull/13203
- Optimize uneven PP layer distribution logic to improve PP performance by @ShangmingCai in https://github.com/sgl-project/sglang/pull/13977
- Fix `get_load` API by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13991
- Rename: `--hooks` to `--forward-hooks` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13994
- [Ascend] Support enable-mixed-chunk in non-MLA scenarios by @MichelleWu351 in https://github.com/sgl-project/sglang/pull/12491
- fix spec dec request level metrics by @vedantjh2 in https://github.com/sgl-project/sglang/pull/13754
- [model-gateway] Add PostgreSQL support to binding by @xuwenyihust in https://github.com/sgl-project/sglang/pull/13766
- Put `pr-gate` after `check-changes` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14009
- [diffusion] model: support black-forest-labs/FLUX.2-dev by @mickqian in https://github.com/sgl-project/sglang/pull/14000
- fix: correct usage of minimax-m2 deepep moe forward by @yuukidach in https://github.com/sgl-project/sglang/pull/13892
- Support internvl on Blackwell (which doesn't support fa3): add `SingletonCache` support to Vision{Sdpa|Triton|Ascend}Attention by @netanel-haber in https://github.com/sgl-project/sglang/pull/13151
- [model-gateway] fix xpu ci by @slin1237 in https://github.com/sgl-project/sglang/pull/14012
- [ci] mark skip as success instead of failure by @slin1237 in https://github.com/sgl-project/sglang/pull/14014
- Revert "Fix nightly test failures: NSA indexer dtype and CPP radix cache init" by @Fridge003 in https://github.com/sgl-project/sglang/pull/14015
- [model gateway][grpc] Add tojson filter to override minijinja's tojson by @CatherineSue in https://github.com/sgl-project/sglang/pull/14013
- [ci] allow manual label to trigger ci in rust, change ci order by @slin1237 in https://github.com/sgl-project/sglang/pull/14016
- [model-gateway][doc] Update transport terminology to protocol in README.md by @xuwenyihust in https://github.com/sgl-project/sglang/pull/13872
- Fix Nvidia nightly test trigger params when it is triggered by parent workflow by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13966
- Update CODEOWNERS for layer and executor files by @hebiao064 in https://github.com/sgl-project/sglang/pull/14020
- [Code sync] Fix registration of some ops in grok & Fix oss sync scripts by @merrymercy in https://github.com/sgl-project/sglang/pull/13990
- Add stress test workflow by @dougyster in https://github.com/sgl-project/sglang/pull/13937
- Temporarily disable test_update_weights_from_disk.py in CI by @alisonshao in https://github.com/sgl-project/sglang/pull/14021
- [model-gateway] Fix flaky test_circuit_breaker_half_open_failure_reopens by @XinyueZhang369 in https://github.com/sgl-project/sglang/pull/14019
- Add adapter_model.safetensors to corruption validation for LoRA by @alisonshao in https://github.com/sgl-project/sglang/pull/14022
- fix: cuda graph issue while running longcat_flash by @tianhaoz95 in https://github.com/sgl-project/sglang/pull/14007
- Fix nightly test failure: CPP radix cache init by @alisonshao in https://github.com/sgl-project/sglang/pull/14018
- Support KTransformers for Qwen3-VL moe by @mrhaoxx in https://github.com/sgl-project/sglang/pull/13983
- Add nightly test support to unified run_suite.py by @alisonshao in https://github.com/sgl-project/sglang/pull/13941
- Support nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 (and nvidia/C-RADIOv2-H) by @netanel-haber in https://github.com/sgl-project/sglang/pull/12277
- [feat] update bucketed weights from distributed by @ShawnY112358 in https://github.com/sgl-project/sglang/pull/13824
- Nightly test job filter by @alisonshao in https://github.com/sgl-project/sglang/pull/14025
- Cleanup server args by @merrymercy in https://github.com/sgl-project/sglang/pull/14027
- [Feat][NVFP4] Enable NVFP4 MoE for Qwen series models (eg. Qwen3-Next) #13761 by @samuellees in https://github.com/sgl-project/sglang/pull/13761
- Fix flashinfer cutlass MoE output shape for non-FP4-packed inputs by @alisonshao in https://github.com/sgl-project/sglang/pull/14028
- [model-gateway] allow refill rate to be zero by @slin1237 in https://github.com/sgl-project/sglang/pull/14030
- Fix installation for nvidia-nvshmem-cu12 by @ch-wan in https://github.com/sgl-project/sglang/pull/14033
- Fix nightly test failure: NSA indexer dtype by @alisonshao in https://github.com/sgl-project/sglang/pull/14017
- Add CODEOWNERS entry for batch_invariant_ops by @hebiao064 in https://github.com/sgl-project/sglang/pull/14026
- [Piecewise] support disable decode cuda graph when enable piecewise cuda graph by @hebiao064 in https://github.com/sgl-project/sglang/pull/13965
- fix: Fix AMD CI failures with HIP layernorm and PyPI connectivity by @sunxxuns in https://github.com/sgl-project/sglang/pull/13814
- Use trtllm mha decode kernel for target_verify in speculative decoding by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/13976
- [Intel XPU]Add xpu support for get_device_memory_capacity by @gaopengff in https://github.com/sgl-project/sglang/pull/13895
- [diffusion] perf: improve black-forest-labs/FLUX.2-dev by @mickqian in https://github.com/sgl-project/sglang/pull/14040
- [Feat]Add scheduler recv skipper weights to environment configuration by @jimmy-evo in https://github.com/sgl-project/sglang/pull/13855
- Tiny support 3D tensors in inverse_transform_scale_ue8m0 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14002
- Support sanity checking weight consistency especially for RL by @fzyzcjy in https://github.com/sgl-project/sglang/pull/13854
- feat: Naive support Spec V2 + Constrained Decoding by @Ubospica in https://github.com/sgl-project/sglang/pull/13425
- Adjust max-parallel for CUDA CI by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14057
- [2/2] Refactor DeepGemm requant for FP8 FusedMoE on Blackwell by @Fridge003 in https://github.com/sgl-project/sglang/pull/13960
- Super tiny add comments to SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14048
- Remove disused B300 Dockerfile by @mmangkad in https://github.com/sgl-project/sglang/pull/13946
- Temporarily disabled test by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14069
- [chore] Arrange NV packages in Dockerfile by @Fridge003 in https://github.com/sgl-project/sglang/pull/13749
- Fix utils import issue for nightly tests by @alisonshao in https://github.com/sgl-project/sglang/pull/13944
- [sgl-kernel][1/2] Fused qk_norm_rope for Qwen3-MoE by @yuan-luo in https://github.com/sgl-project/sglang/pull/14036
- [diffusion] refactor: refactor condition image resize logic by @mickqian in https://github.com/sgl-project/sglang/pull/14079
- [diffusion] refactor: refactor ComponentLoader and support loading native models from diffusers and transformers by @mickqian in https://github.com/sgl-project/sglang/pull/13205
- fix: small changes to enable test_mrope.py by @raayandhar in https://github.com/sgl-project/sglang/pull/14082
- Fix structural_tag tool call with null schema by @AzazKamaz in https://github.com/sgl-project/sglang/pull/14006
- [Bugfix] input prompt was not logged by @alphabetc1 in https://github.com/sgl-project/sglang/pull/13936
- Support configuring the request limit per receiving poll by @vipwangerxiao in https://github.com/sgl-project/sglang/pull/14076
- [Bugfix] qwen2.5-vl spec decode accept_len low by @Lzhang-hub in https://github.com/sgl-project/sglang/pull/13904
- support qwen3_vl vision model dp by @Lzhang-hub in https://github.com/sgl-project/sglang/pull/13724
- [diffusion] refactor: clean useless config files by @mickqian in https://github.com/sgl-project/sglang/pull/14094
- Fix overlap scheduler not taking effect when outputting logprobs by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14096
- diffusion: support zimage by @yhyang201 in https://github.com/sgl-project/sglang/pull/14067
- [CPU] Apply uv as package manager by @ZailiWang in https://github.com/sgl-project/sglang/pull/14106
- Fix NIXL OBJ descriptors by @tshmilnvidia in https://github.com/sgl-project/sglang/pull/10712
- [model-gateway] Add version command support to SMG by @tonyluj in https://github.com/sgl-project/sglang/pull/12558
- Disable Deepep 2 GPU tests by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14111
- fix: malformed KV events for NVIDIA Dynamo by @PeaBrane in https://github.com/sgl-project/sglang/pull/13488
- enable piecewise cuda graph for prefill server by @fjybiocs in https://github.com/sgl-project/sglang/pull/13377
- [diffusion] log: unify generation performance logging by @mickqian in https://github.com/sgl-project/sglang/pull/14117
- Remove incorrect deep_gemm assertions from server_args.py by @ch-wan in https://github.com/sgl-project/sglang/pull/14113
- Add auto-tune workflow by @merrymercy in https://github.com/sgl-project/sglang/pull/14124
- feat: support flashinfer kernel autotune by @elvischenv in https://github.com/sgl-project/sglang/pull/12306
- Move piecewise cuda graph test to manual dir to fix CI by @ShangmingCai in https://github.com/sgl-project/sglang/pull/14121
- [diffusion] chore: add resolution shortcuts by @mickqian in https://github.com/sgl-project/sglang/pull/14129
- Super tiny fix typo by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14131
- add runtime check for PyTorch 2.9.1 + CuDNN < 9.15 to prevent Conv3d performance issues by @yhyang201 in https://github.com/sgl-project/sglang/pull/14119
- Trigger PR test on main every 3 hours instead of push event by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14130
- fix RuntimeError: RMSNorm failed with error code an illegal memory access was encountered by @gongwei-130 in https://github.com/sgl-project/sglang/pull/14135
- Always run all stages in cron based PR tests by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14151
- Fix condition for streaming output_ids in tokenizer manager by @merrymercy in https://github.com/sgl-project/sglang/pull/13759
- Fix Minimax M2 loading issue by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13956
- Tiny fix DeepGEMM precompile rank check by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14136
- Super tiny add more info in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14145
- Fix spec v2 does not support RL update weights from tensor by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14146
- Support checking fp8 params in weight_checker by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14147
- Show errors when misusing env variables by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14154
- Always run model evaluation even if the trace upload step fails by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14157
- diffusion: Fix LoRA weight merging for torch.nn.Linear layers in diffusers modules by @niehen6174 in https://github.com/sgl-project/sglang/pull/14150
- add cpp files for cpp_radix_tree to pyproject.toml. by @strgrb in https://github.com/sgl-project/sglang/pull/14052
- feat: longcat flash add aux layers capture for eagle3 by @tianhaoz95 in https://github.com/sgl-project/sglang/pull/14161
- Implement profiler v2 and fix stage mixture bug by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14148
- Support numactl bind for CPU and memory before process starts by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14156
- Support grammar + spec + reasoning by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14163
- bugfix[schedule]: Excessive preemption occurs when preempting running requests to schedule new prefill requests. by @CLFutureX in https://github.com/sgl-project/sglang/pull/12494
- Fix LMCache unit test and init bug by @DongDongJu in https://github.com/sgl-project/sglang/pull/14005
- [ci]fix deepep import error on H20 action by @HanHan009527 in https://github.com/sgl-project/sglang/pull/14166
- [Minor]Raise Error when deepep num dispatch token per rank is smaller than cuda graph bs by @Fridge003 in https://github.com/sgl-project/sglang/pull/14065
- [sgl-kernel] fix b200 kernel ci by @FlamingoPg in https://github.com/sgl-project/sglang/pull/13907
- Revert "[Minor]Raise Error when deepep num dispatch token per rank is smaller than cuda graph bs" by @Fridge003 in https://github.com/sgl-project/sglang/pull/14171
- Fix: fix flashmla fp8 kv cache acc error by @FlamingoPg in https://github.com/sgl-project/sglang/pull/13841
- [DeepSeekV3.2] Enable pure TP & Partial DP Attention by @YAMY1234 in https://github.com/sgl-project/sglang/pull/13646
- [model-gateway] support VL models in router by @ooapex in https://github.com/sgl-project/sglang/pull/14140
- [PD] Support json file configuration for Transfer Engine by @stmatengss in https://github.com/sgl-project/sglang/pull/14059
- Feat: GLM-4.6 supports shared experts fusion by @UranusSeven in https://github.com/sgl-project/sglang/pull/13873
- diffusion: improve z-image by @yhyang201 in https://github.com/sgl-project/sglang/pull/14104
- [piecewise] Refactor VLM to support input embed buffer and remove external embedder hack by @ByronHsu in https://github.com/sgl-project/sglang/pull/14155
- [Feature] Enable PTPC FP8 for compressed tensors moe (aiter kernel) by @qichu-yun in https://github.com/sgl-project/sglang/pull/12181
- Pull Request Instructions: RL and Training Framework Integrations by @Richardczl98 in https://github.com/sgl-project/sglang/pull/14187
- [Auto Sync] Update backend.py (20251130) by @merrymercy in https://github.com/sgl-project/sglang/pull/14153
- [piecewise] move piecewise_cuda_graph_runner init to model_runner initialize by @zminglei in https://github.com/sgl-project/sglang/pull/14034
- Tiny fix transform_scale_ue8m0 wrong output in some scenarios by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14003
- Tiny add several args to bench serving by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14181
- Super tiny allow millisecond precision in logging by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14183
- Support profiling only prefill or decode without the other by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14182
- [Piecewise] Use same global graph memory pool as the main cuda graph … by @byjiang1996 in https://github.com/sgl-project/sglang/pull/14044
- Try to remove wrong logic about max total token in spec decoding by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14167
- Fix speculative decoding error when retracting by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14180
- [diffusion] refactor: remove hard-code of instanceof on PipelineConfig by @mickqian in https://github.com/sgl-project/sglang/pull/14186
- [VLM] Boost Memory Pool based CUDA IPC by @yuan-luo in https://github.com/sgl-project/sglang/pull/14123
- Add peak output tokens per second in bench_serving by @BBuf in https://github.com/sgl-project/sglang/pull/14165
- fix: Increase FlashInfer workspace size for Qwen3VL models by @BBuf in https://github.com/sgl-project/sglang/pull/14173
- Tiny call cudaProfilerStart only on first rank in node by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14211
- [Minor] update docs by @merrymercy in https://github.com/sgl-project/sglang/pull/14212
- Add cuda event based on waiting value by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14214
- Change PR test schedule to run every 6 hours by @merrymercy in https://github.com/sgl-project/sglang/pull/14218
- Super tiny fix typo by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14219
- [model-gateway] Avoid logging MCP connection token by @xuwenyihust in https://github.com/sgl-project/sglang/pull/13887
- [spec-overlap] bugfix for pd disaggregation and npu by @liupeng374 in https://github.com/sgl-project/sglang/pull/14088
- Add new moe wna16 marlin gemm by @BBuf in https://github.com/sgl-project/sglang/pull/14122
- [model-gateway] refactor oai router 1/n by @slin1237 in https://github.com/sgl-project/sglang/pull/14228
- [model-gateway] fix v1/models response format to be oai compatible by @CatherineSue in https://github.com/sgl-project/sglang/pull/13693
- [model-gateway] add ModelType bitflags and Endpoint enum for worker by @slin1237 in https://github.com/sgl-project/sglang/pull/14230
- chore: bump sgl-kernel version to 0.3.18.post2 by @sglang-bot in https://github.com/sgl-project/sglang/pull/14229
- Modify git tag for DeepGemm in sgl-kernel. by @Sulfur6 in https://github.com/sgl-project/sglang/pull/14179
- Disable Deepep 8 GPU tests by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14152
- [model-gateway] add ModelCard and ProviderType for model configuration by @slin1237 in https://github.com/sgl-project/sglang/pull/14237
- [MM][style] rename inputs_embeds to input_embeds for consistency by @ByronHsu in https://github.com/sgl-project/sglang/pull/14240
- Revert "Skip weight loading in deepgemm compilation" by @ishandhanani in https://github.com/sgl-project/sglang/pull/14241
- [model-gateway] add ModelCard support to WorkerMetadata by @slin1237 in https://github.com/sgl-project/sglang/pull/14243
- Fix NSA Bug in Centralize NSA Dispatch Logic by @YAMY1234 in https://github.com/sgl-project/sglang/pull/14245
- [model-gateway] Migrate Worker trait to model-aware methods by @slin1237 in https://github.com/sgl-project/sglang/pull/14250
- Fix a distributed initialization error by @Edwardf0t1 in https://github.com/sgl-project/sglang/pull/13843
- [CI] Fix test_deepep_large.py by @Fridge003 in https://github.com/sgl-project/sglang/pull/14247
- Support fp4 fp8 non gated moe by @TomerBN-Nvidia in https://github.com/sgl-project/sglang/pull/13794
- [model-gateway] Add e2e tests of streaming events and tool choice for response api by @XinyueZhang369 in https://github.com/sgl-project/sglang/pull/13880
- [Auto Sync] optionally disable fake register in Update fp8_kernel.py (20251202) by @merrymercy in https://github.com/sgl-project/sglang/pull/14255
- [Auto Sync] Add max_total_num_tokens metric: Update scheduler_metrics_mixin.py, collector.py (20251202) by @merrymercy in https://github.com/sgl-project/sglang/pull/14256
- [Minor] Upgrade cutedsl version in Dockerfile by @Fridge003 in https://github.com/sgl-project/sglang/pull/13968
- [diffusion] fix: fix Flux.2 condition image resize by @mickqian in https://github.com/sgl-project/sglang/pull/14232
- [VLM] Support Piecewise CUDA Graph for Qwen3-Omni-MOE by @yuan-luo in https://github.com/sgl-project/sglang/pull/14222
- [Docs] Update CI docs by @merrymercy in https://github.com/sgl-project/sglang/pull/14260
- Revert "Try to remove wrong logic about max total token in spec decoding" by @hebiao064 in https://github.com/sgl-project/sglang/pull/14259
- [model-gateway] add audio and moderation in model card by @slin1237 in https://github.com/sgl-project/sglang/pull/14263
- Fix NIXL exception message by @kartikx in https://github.com/sgl-project/sglang/pull/14172
- [diffusion] CI: add testcase-wise retry mechanism by @mickqian in https://github.com/sgl-project/sglang/pull/14261
- Remove cargo config also in `.zshenv` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14267
- Fix mrope_positions size when req is retracted by @llfl in https://github.com/sgl-project/sglang/pull/13700
- fix: Support PP for Mistral Small 3.1 by @bluecoffee8 in https://github.com/sgl-project/sglang/pull/14254
- sync attention doc and ep doc to doctree by @b8zhong in https://github.com/sgl-project/sglang/pull/14257
- [model-gateway] include smg version command in py binding by @slin1237 in https://github.com/sgl-project/sglang/pull/14274
- Optimize topk sigmoid in minimax_m2 by @rogeryoungh in https://github.com/sgl-project/sglang/pull/14047
- fix trtllm mla spec by @b8zhong in https://github.com/sgl-project/sglang/pull/13738
- [model-gateway] fix version output by @slin1237 in https://github.com/sgl-project/sglang/pull/14276
- [VLM][Doc] Document for VLM DP Encoder by @yuan-luo in https://github.com/sgl-project/sglang/pull/14279
- chore: bump sgl-kernel version to 0.3.18.post2 by @sglang-bot in https://github.com/sgl-project/sglang/pull/14244
- [Auto Sync] Rename is_hybrid to is_hybrid_swa by @merrymercy in https://github.com/sgl-project/sglang/pull/14252
- [model-gateway] change rust package name to sgl-model-gateway instead by @slin1237 in https://github.com/sgl-project/sglang/pull/14283
- Update CODEOWNERS for multimodal_gen by @mickqian in https://github.com/sgl-project/sglang/pull/14286
- [CI] Fix 4-GPU test timeout by using 3 partitions by @alisonshao in https://github.com/sgl-project/sglang/pull/14287
- [Fix] improve model info registration and searching strategy by @liz-badada in https://github.com/sgl-project/sglang/pull/14281
- [diffusion] refactor: simplify DmdDenoisingStage by @mickqian in https://github.com/sgl-project/sglang/pull/14269
- Opt moe align block size kernel by @BBuf in https://github.com/sgl-project/sglang/pull/14133
- [sgl-kernel] fix runtime error while preloading CUDA runtime by @anvdn in https://github.com/sgl-project/sglang/pull/13089
- Fix duplicate download log messages in multi-process environment by @alisonshao in https://github.com/sgl-project/sglang/pull/14299
- Revert PR #14044: Restore separate memory pool for piecewise CUDA graph by @alisonshao in https://github.com/sgl-project/sglang/pull/14278
- Init TBO with dp_padded batch by @liquanfeng in https://github.com/sgl-project/sglang/pull/11423
- feat: DeepSeek new v3.2 encoding by @Eva20150932-atlascloud in https://github.com/sgl-project/sglang/pull/14249
- [Minor] update docs on CI by @merrymercy in https://github.com/sgl-project/sglang/pull/14315
- Add /rerun-stage slash command to rerun specific PR test stages by @alisonshao in https://github.com/sgl-project/sglang/pull/14262
- Fix nonetype error for ci failure monitor by @dougyster in https://github.com/sgl-project/sglang/pull/14319
- Adding section for scheduled PR test runs on main by @dougyster in https://github.com/sgl-project/sglang/pull/14309
- Clean up imports and move files by @merrymercy in https://github.com/sgl-project/sglang/pull/14317
- [model-gateway] add workflow for external model providers by @slin1237 in https://github.com/sgl-project/sglang/pull/14323
- ci: Add zyzshishui to CI permissions by @sunxxuns in https://github.com/sgl-project/sglang/pull/14324
- chore: bump SGLang version to 0.5.6 by @sglang-bot in https://github.com/sgl-project/sglang/pull/14316
New Contributors
- @gaopengff made their first contribution in #11051
- @mattheliu made their first contribution in #12764
- @rchalamala made their first contribution in #12717
- @leejnau made their first contribution in #12724
- @kalyank007 made their first contribution in #12761
- @BraveY made their first contribution in #12374
- @MMuzzammil1 made their first contribution in #12946
- @edwingao28 made their first contribution in #12956
- @CLFutureX made their first contribution in #12239
- @syy-hw made their first contribution in #11719
- @ShawnKung made their first contribution in #10225
- @LHXuuu made their first contribution in #12980
- @haoyangli-amd made their first contribution in #11609
- @yctseng0211 made their first contribution in #12689
- @Sunhaihua1 made their first contribution in #13063
- @FrankMinions made their first contribution in #12814
- @zhaowenzi made their first contribution in #13056
- @MayDomine made their first contribution in #13039
- @zhanghaotong made their first contribution in #12396
- @hellodanylo made their first contribution in #12860
- @Rohan138 made their first contribution in #9790
- @CharlieFRuan made their first contribution in #7906
- @khalil2ji3mp6 made their first contribution in #12214
- @SYChen123 made their first contribution in #12979
- @XinyueZhang369 made their first contribution in #13164
- @dougyster made their first contribution in #13104
- @vlserov made their first contribution in #12288
- @hustmf made their first contribution in #13154
- @ZLkanyo009 made their first contribution in #12201
- @edwardzjl made their first contribution in #13211
- @Taishi-N324 made their first contribution in #13210
- @dtcccc made their first contribution in #13142
- @billishyahao made their first contribution in #13243
- @kshitij12345 made their first contribution in #13213
- @wangyxbh made their first contribution in #12191
- @terfendail made their first contribution in #13288
- @SenmiaoORZ made their first contribution in #13260
- @zhengxle made their first contribution in #12001
- @RiversJin made their first contribution in #13383
- @lixiaolx made their first contribution in #12065
- @kebyn made their first contribution in #5879
- @Carlomus made their first contribution in #13217
- @sirutBuasai made their first contribution in #13173
- @WingEdge777 made their first contribution in #12149
- @liusy58 made their first contribution in #13126
- @Baidu-AIAK made their first contribution in #13495
- @chz34 made their first contribution in #9234
- @galeselee made their first contribution in #12379
- @shauntajoesph-ops made their first contribution in #13583
- @1am9trash made their first contribution in #13554
- @michelemarzollo made their first contribution in #13590
- @weibingo made their first contribution in #13407
- @roikoren755 made their first contribution in #12690
- @ErsongWang made their first contribution in #13727
- @yinpeiqi made their first contribution in #13709
- @liuhuijiayou made their first contribution in #13201
- @wangtiance made their first contribution in #13656
- @ant-yy made their first contribution in #13713
- @tom-jerr made their first contribution in #13612
- @cyb70289 made their first contribution in #13874
- @Liwansi made their first contribution in #12078
- @llc-kc made their first contribution in #13657
- @ptovam made their first contribution in #13851
- @janbernloehr made their first contribution in #13421
- @ShawnY112358 made their first contribution in #10071
- @ClawSeven made their first contribution in #12588
- @MichelleWu351 made their first contribution in #12491
- @yuukidach made their first contribution in #13892
- @tianhaoz95 made their first contribution in #14007
- @mrhaoxx made their first contribution in #13983
- @raayandhar made their first contribution in #14082
- @AzazKamaz made their first contribution in #14006
- @alphabetc1 made their first contribution in #13936
- @tshmilnvidia made their first contribution in #10712
- @PeaBrane made their first contribution in #13488
- @fjybiocs made their first contribution in #13377
- @niehen6174 made their first contribution in #14150
- @DongDongJu made their first contribution in #14005
- @UranusSeven made their first contribution in #13873
- @qichu-yun made their first contribution in #12181
- @Richardczl98 made their first contribution in #14187
- @liupeng374 made their first contribution in #14088
- @Sulfur6 made their first contribution in #14179
- @TomerBN-Nvidia made their first contribution in #13794
- @kartikx made their first contribution in #14172
- @llfl made their first contribution in #13700
- @bluecoffee8 made their first contribution in #14254
- @Eva20150932-atlascloud made their first contribution in #14249
Full Changelog: v0.5.5...v0.5.6