## Highlights
- Support for DeepSeek V3.2/V3.2 Speciale #14249
- Blockwise diffusion language model support #12588
- Support for new diffusion models (Flux2 #14000, Z-image #14067)
- Introduce JIT Kernels #13453
- Upgrade to Torch 2.9 #12969
- Kimi-K2-Thinking model enhancement #12882
- Memory management/Overlap spec compatibility #12224 #12839
- More performance optimizations: DeepSeek-V3-FP4 / GLM-4.6 / Kimi-K2 / DeepSeek-V3.2 / ...
- CI/CD enhancements
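
For readers trying the release out, the sketch below shows one way to query a locally running SGLang server through its OpenAI-compatible endpoint; the port, model name, and prompt are placeholders, and the server is assumed to have been launched separately (for example via `python -m sglang.launch_server`).

```python
# A minimal sketch (not taken from the linked PRs): query a locally running
# SGLang server through its OpenAI-compatible endpoint. It assumes the server
# was launched separately, e.g.:
#   python -m sglang.launch_server --model-path <model> --port 30000
# The base_url, model name, and prompt below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; the server reports its served model via /v1/models
    messages=[{"role": "user", "content": "Summarize mixture-of-experts in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```
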
## What's Changed
- [router][grpc] Add more mcp test cases to responses api by @CatherineSue in #12749
- [Intel]Add 'intel_xpu' attention backend for llama4 by @gaopengff in #11051
- [Intel XPU]Update pytorch xpu to 2.9 by @gaopengff in #12363
- [Docs] fix dead links in multiple documentation pages by @mattheliu in #12764
- [mem pool] bugfix: wrong position for self.device in Mamba by @stmatengss in #12684
- [Fix]HTTP Stream raise exception by @jimmy-evo in #11904
- [CPU] Fix TP padding case with weight block size by @jianan-gu in #8243
- [docs] Remove redundant --disable-radix-cache option from by @rchalamala in #12717
- Pin uvloop to 0.21.0 by @yeahdongcn in #12279
- [fix] Only enable flashinfer all reduce fusion by default for single-node servers by @leejnau in #12724
- chore: update CODEOWNERS by @zhyncs in #12795
- Fix hang in deepgemm compilation with symmetric memory enabled by @nvcastet in #12715
- Add bot-bump-kernel-version-to-sglang workflow by @alisonshao in #12794
- ignore the deepgemm check when the model weight with nvfp4 and moe ba… by @rainj-me in #12782
- [AMD] Update wave-lang to 3.8.2 by @xintin in #12576
- [DeepSeek-V3.2][NSA] Enable MHA Pathway for Short Sequence Prefill on B200 (SM100) by @YAMY1234 in #12788
- [hotfix]: Resolve ModuleNotFoundError in PD deployment for is_in_ci() by @hzh0425 in #12772
- [HotFix]: Add missing SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL env var by @hzh0425 in #12776
- Add PP support for dots_vlm by @gty111 in #12763
- fixes hardcoded "cuda" device references in unit tests to use a dynamic device selection by @kalyank007 in #12761
- fix multimodal gen issues by @yhyang201 in #12765
- [Test] Add DeepSeekV3.2 NSA Indexer Test Suite by @Johnsonms in #12520
- [Bugfix] Fix illegal memory access by @elvischenv in #12758
- [MoE] Add Comprehensive MoE Integration Tests by @Jonahcb in #12090
- [Deepseek V3.2] Only skip Indexer logits computation when is_extend_without_speculative by @hlu1 in #12816
- Fix missing dp_max_padding argument in set_dp_buffer_len by @Chen-0210 in #12812
- optm(checkpoint-engine): disable multi-thread loading when update weights by @BraveY in #12374
- Fix piecewise cuda graph ci test by @ispobock in #12836
- update multimodal_gen readme by @mickqian in #12825
- [router] Support structured model output for openai and grpc router by @key4ng in #12431
- Fix data parallel controller launch for num nodes > 2 by @merrymercy in #12822
- remove the fa4 page_size hardcode to 128 restriction on mla model arch by @rainj-me in #12801
- sglang diffusion announcement by @wisclmy0611 in #12856
- add back flashinfer jit cache to dev docker by @b8zhong in #12851
- [router][grpc] Refactor: Add builders for chat and responses by @CatherineSue in #12852
- [router][grpc] Move all error logs to their call sites by @CatherineSue in #12859
- [router] Switch MCP tests from DeepWiki to self-hosted Brave search server by @key4ng in #12849
- Add nightly performance test for GPT-OSS 4GPU models by @alisonshao in #12805
- [sgl-kernel][Deepseek V3.2] Add row_starts to topk kernel by @hlu1 in #12582
- [CI] Fix huggingface access for test_flash_attention_4.py by @Fridge003 in #12846
- [Auto Sync] Update activation.py, logits_processor.py, rota... (20251107) by @merrymercy in #12853
- [Docs][DeepseekV3.2] Update deepseekv3.2 docs for mha short seq prefill by @YAMY1234 in #12868
- Support capturing aux_hidden_states for minimax m2. by @pyc96 in #12798
- [CI] Tiny adjust CI estimation time by @hnyls2002 in #12886
- [DP-Attn] Clarify MLP sync / idle batch preparation logic by @hnyls2002 in #12843
- Fix sending all requests to the first rank in DP attention by @fzyzcjy in #12832
- Apply moe_reduce_sum kernel for fused_marlin_moe by @ispobock in #12888
- use fast stream instead of torch.cuda.current_stream in llama 4 shared experts overlap by @b8zhong in #12811
- [Fix] Fix trtllm-mla backend when chunked prefix cache is disabled by @Fridge003 in #12361
- Refs/heads/add nightly test multi gpu configs by @alisonshao in #12870
- chore: bump sgl-kernel version to 0.3.16.post6 by @sglang-bot in #12889
- Update CODEOWNERS by @ispobock in #12897
- Tiny simplify `can_run_dp_cuda_graph` gather logic by @hnyls2002 in #12891
- Fix spec decoding acc length for dpsk-r1-fp4 tp8 by @Qiaolin-Yu in #12896
- Revert "Fix spec decoding acc length for dpsk-r1-fp4 tp8" by @Qiaolin-Yu in #12900
- Add Deepseek models into nightly tests by @Kangyan-Zhou in #12865
- Fix empty server args in marlin moe test by @ispobock in #12904
- Fix duplicate nightly test name by @Kangyan-Zhou in #12905
- Add HF cleanup logic in ci_install_dependency.sh by @Kangyan-Zhou in #12895
- fallback to triton mm_persistent kernel when deepGemm fail by @zminglei in #12911
- Add kimi k2 thinking to ci by @ispobock in #12907
- Fix Deepseek nightly tests by @Kangyan-Zhou in #12906
- Add Jet-Nemotron by @futrime in #12448
- [CI] increase ut buckets & adjust estimation time. by @hnyls2002 in #12919
- [PD] feat: refactor custom mem pool and add barex pd support by @stmatengss in #12332
- [CI] Fix `matrix.part` in pr-test. by @hnyls2002 in #12920
- Adjust server launch time in ci by @ispobock in #12917
- feat: basic support for server-level multimodal cache by @mickqian in #10775
- Refactor / Unify event loop across PD-Disagg, Overlap, DP-Attn cases by @hnyls2002 in #12839
- [lint] tiny fix unimported packages. by @hnyls2002 in #12927
- ci: try to fix gpg error during kernel build by @ishandhanani in #12928
- Support piecewise cuda graph for MLA by @ispobock in #11812
- diffusion: skip full CI suite for multimodal_gen changes by @mickqian in #12940
- Minor code cleanup / improvement for `PREBUILT_EXTEND` mode by @hnyls2002 in #12948
- Bugfix: LMCache Connector with Sglang by @MMuzzammil1 in #12946
- [Docs] Add docs for Qwen3-VL image and video support by @adarshxs in #12554
- [Refactor] rename set_index_k_and_scale_buffer to set_index_k_scale_b… by @edwingao28 in #12956
- Refactor KTransformers heterogeneous compute with unified GPU-quantization backend by @Atream in #12834
- diffusion: fix detected file changes rule in CI by @mickqian in #12943
- clean redundant code in previous PR by @Atream in #12957
- Fix the run-time error when calling fused_rms_mxfp4_quant that change return output number by @kkHuang-amd in #12803
- diffusion: fix wan-2.2-TI2V and support sp by @mickqian in #12926
- [Refactor / Style] Unify all event loops (except for PP) by @hnyls2002 in #12959
- chore: bump sgl-kernel version to 0.3.17 by @sglang-bot in #12931
- chore: include a minimum image for vlms when warming-up by @mickqian in #9528
- [PP] put pp assert in model runner by @XucSh in #12934
- Fix errors of page head kernels in sgl-kernel for ROCm by @huangtingwei9988 in #12604
- [Fix] Add validation for served model name to reserve `:` for LoRA adapter syntax by @neelabhsinha in #12912
- Support hidden_dim % 4 == 0 in per_token_quant_fp8 by @BBuf in #12883
- [RadixTree] Reduce Syscalls, Optimize Collection Filtering and Align with cpp by @CLFutureX in #12239
- [Auto Sync] Update batch_invariant_ops.py (20251109) by @merrymercy in #12916
- [router] bucket policy by @syy-hw in #11719
- fix missing output_token_logprobs when using ngram speculative decoding by @a4zhangfei in #10702
- feat(metrics): add scheduler and hiradix cache metrics (#10218) by @ShawnKung in #10225
- diffusion: reduce effort of supporting new model by @mickqian in #12982
- vlm: fix tiny multimodal cache bug by @yhyang201 in #12984
- chore: bump sgl-kernel version to 0.3.17 by @sglang-bot in #12966
- [1 / 2] register weak_ref_tensor in sgl-kernel by @BBuf in #12999
- Support piecewise cuda graph for deepseek v3 by @ispobock in #12996
- minor: fix notebook bug with new model_info fields added for warmup by @mickqian in #13005
- Super tiny fix typo by @fzyzcjy in #13001
- Add `process_prefill_chunk` back to fix PP event loop by @hnyls2002 in #13009
- [misc][ci] Add run-ci after auto-labeler by @CatherineSue in #13013
- Unify memory management across `(overlap, non-overlap) x (page>=1) x (spec, non-spec, spec v2) x (retract, finished)` by @hnyls2002 in #12224
- Enhance retract test (page cases, long output cases) by @hnyls2002 in #12781
- [AMD CI] Remove SRT docker build. by @saienduri in #11850
- [CI] Limit the CI trigger frequency of low-privilege actors by @hnyls2002 in #13010
- Resolve HF download issue and download models before CI run starts by @Kangyan-Zhou in #12952
- Add pre-shuffle weight for new aiter MoE support. by @sogalin in #12908
- chore: bump SGLang version to 0.5.5.post1 by @sglang-bot in #13000
- [router][ci] Fix maturin build by @key4ng in #13012
- Simplify the BatchMultimodalOutput in io_struct.py by @merrymercy in #12993
- [router][ci] Quick Improvement to make CI more stable by @key4ng in #12869
- [9/n] decouple quantization impl from vllm dependency - adjust ci by @AniZpZ in #12753
- [router] add postgres databases data connector by @lengrongfu in #12218
- [AMD CI] Update docker release workflows docker file name. by @saienduri in #13028
- fix tuning_fused_moe_triton_sep tool per_channel_quant bug by @BBuf in #13027
- fix(ci): workflow id in permission rate limit by @cicirori in #13035
- [PieceWise CUDA Graph] Support awq/gptq model in piecewise cudagraph by @BBuf in #12518
- Re-enable Flashinfer TRTLLM GEN MHA and Add Unit Test by @samuellees in #12885
- [AMD CI] Update CI Version Logic. by @saienduri in #13029
- [diffusion] doc: add support_new_models by @mickqian in #13043
- Sglang Tracing: optimize trace_event_batch() by @sufeng-buaa in #13036
- [CI] Auto format code by @BBuf in #13053
- disable overlap schedule if mamba radix cache open by @yizhang2077 in #13057
- [AMD] Add PD test for AMD CI by @michael-amd in #11938
- Remove duplicate import by @LHXuuu in #12980
- [Bug] TypeError: maybe_executor_submit() by @Johnsonms in #13050
- [Fix] Add TPOT back to bench_serving by @elvischenv in #12976
- [bug][rocm]fix qr when variable inp by @haoyangli-amd in #11609
- [ROCM] Optimized deepseek-r1 model with rmsnorm + fp8 quant fusion by @yctseng0211 in #12689
- Update rope dtype config by @ispobock in #13037
- Tiny simplify eviction metrics collector by @hnyls2002 in #12983
- [RadixTree] Reduce Stack Push/Pop Overhead for Leaf Nodes, Improve radix_tree Leaf Collection Performance by @CLFutureX in #12199
- fix: display served_model_name in /v1/models by @Sunhaihua1 in #13063
- refine stdout logging codes by @cicirori in #13015
- Support `file://` scheme in `load_video` by @netanel-haber in #13076
- [BugFix] Fix prefill memory leak in PD + GDN by @ZeldaHuang in #12994
- [Bug] Login shell error: bash: /root/.cargo/env: No such file or directory by @Johnsonms in #12941
- Fix cached tokens usage bug by @FrankMinions in #12814
- Revert "[AMD] Add PD test for AMD CI (#11938)" by @hnyls2002 in #13088
- [Test] Handle streaming chunks with null content in case of stream end. by @vshekhawat-hlab in #10862
- [Fix] Update text_chunks in bench_serving chat completions by @ZeldaHuang in #13041
- Fix CPP Radix Cache and add test to CI by @cctry in #11645
- [AMD] Apply AITER_MXFP4_MOE_SF=1 only to gfx950 in aiter build by @hubertlu-tw in #13092
- Export runner labels via env var by @Kangyan-Zhou in #13018
- Fix spec decoding acc length for dpsk-r1-fp4 tp8 (2nd attempt) by @Qiaolin-Yu in #12915
- [Fix] Fix nan error for large scale ep by @Fridge003 in #12866
- [AMD CI] Update nightly docker build CI config. by @saienduri in #13090
- overlap shared + routed expert computation in kimi linear by @b8zhong in #12660
- Revert "fix: display served_model_name in /v1/models" by @CatherineSue in #13093
- [Deepseek V3.2] Fix accuracy bug in the Indexer by @hlu1 in #12583
- [Router] use call_id instead of id for matching function calls in Responses API for Harmony by @zhaowenzi in #13056
- Upgrade to ROCm 7.0 image by @yctseng0211 in #13105
- Improve overlap scheduling for better TTFT by @vipwangerxiao in #11856
- At least tell the user that ngram verify is greedy! by @MayDomine in #13039
- diffusion: remove unused workflows folder by @mickqian in #13114
- Don't fuse wk+weight_proj for nextn by @trevor-m in #12863
- [diffusion] log: improve logging while multiprocessing by @mickqian in #12997
- Fix gpt oss 4gpu b200 trace links by @alisonshao in #12872
- diffusion: refactor task type of models by @mickqian in #13118
- [Feature] Trace: Support http/protobuf span exporter protocol by @zhanghaotong in #12396
- [sgl-kernel][5/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #12666
- feat(engine): add rid parameter to methods in Engine class by @ishandhanani in #13095
- [PD] Add custom gpu id to device topo support by @stmatengss in #12817
- [CI] Update job dependency and move dpsk v3.2 tests to 8-gpu suite by @Fridge003 in #12942
- Fix run suite sanity check by @ispobock in #13133
- [router] move radix tree to policy crate and addreses some code styles by @slin1237 in #13131
- fix: Remove duplicated kv_events initialization in scheduler by @wxsms in #13132
- Fix strict level setting for Kimi K2 tool calls when not explicitly set by @JustinTong0323 in #13077
- [misc] Remove performance and router-benchmark label matching by @CatherineSue in #13135
- Fix re-trigger actor of CI rate limit by @hnyls2002 in #13136
- [router] Support complex assistant and tool messages in /chat/completions by @hellodanylo in #12860
- [router] add minmax m2 reasoning parser by @slin1237 in #13137
- fix(tcp-port): replace bind_server_socket to get_zmq_socket(Port conflict) by @jimmy-evo in #11961
- fix: duplicate resize images logic of qwen-vl series models by @yangsijia-serena in #12458
- [router][grpc] Support vllm backend for grpc router by @CatherineSue in #13120
- Dump `total_throughput` to output-file in `bench_serving.py` by @Rohan138 in #9790
- Fix the Wrong Return Type of `Scheduler.recv_requests` by @Arist12 in #7886
- Update aiter to v0.1.7.post1 by @sogalin in #13149
- [RPC] Fix handle_rpc_request with `**recv_req.parameters` by @CharlieFRuan in #7906
- [Ascend]adapt enable-profile-cuda-graph for NPU by @ping1jing2 in #12617
- [Feature] Propagate Trace Headers into Root Span for OpenTelemetry Cross-Service Context by @zhanghaotong in #10808
- chore: bump SGLang version to 0.5.5.post2 by @sglang-bot in #13129
- [Ascend][feature] support L1+ L2 radixcache on ascend by @khalil2ji3mp6 in #12214
- [ngram] use SGLANG_NGRAM_FORCE_GREEDY_VERIFY to control verify method by @a4zhangfei in #13153
- [Ascend] torch_npu.npu_mrope for MRotaryEmbedding by @Makcum888e in #10907
- [VLM] Support PP for Qwen2.5-VL by @yuan-luo in #13075
- Fuse routed_scaling_factor to fused_marlin_moe by @ispobock in #12998
- [Auto Sync] Update test_deterministic.py (20251112) by @merrymercy in #13128
- bugfix: multi-model routing for /generate api by @SYChen123 in #12979
- [router] Add comprehensive validation to Responses API by @key4ng in #13127
- [router] Fix Flaky test_circuit_breaker_opens_and_recovers by @XinyueZhang369 in #13164
- Add job and runner failure monitor workflow for CI by @dougyster in #13104
- [DeepseekV32]: use `_concat_mla_absorb_q_general` to replace `torch.cat` by @bingps in #12215
- Fix nan in global scaling factor for large scale nvfp4 EP by @wenscarl in #13162
- [Ascend] LoRA: adding Ascend LoRA backend with using kernels from sgl_kernel_npu by @vlserov in #12288
- Add `RequestMetricsExporter` utility to export request-level metrics by @scottjlee in #10973
- Opt kimi_k2_thinking biased topk module by @BBuf in #13150
- [router] remove worker url requirement by @slin1237 in #13172
- Remove EBNF Composer by @TJ5 in #13163
- fix build error in Dockerfile.diffusion by @Sunhaihua1 in #12975
- [Ascend] add npu synchronize by @hustmf in #13154
- [AMD] Add AITER Custom All-Reduce by @hubertlu-tw in #13102
- Revert "fallback to triton mm_persistent kernel when deepGemm fail" by @fzyzcjy in #13178
- Remove enable_dp_attention in deepseek nightly tests by @Kangyan-Zhou in #13190
- [router] minmax-m2 xml tool parser by @slin1237 in #13148
- fix: display served_model_name in /v1/models by @Sunhaihua1 in #13155
- Replace [silu_and_mul_]scaled_fp4_group_quant by Flashinfer equivalent by @wenscarl in #12376
- [FEAT][ROCM] enable fused shared expert for Rocm by @ZLkanyo009 in #12201
- Enable Flashinfer TRTLLM-GEN-MoE FP8 blockwise kernel for Qwen3-Next on Blackwell by @samuellees in #12543
- [Quantization] Support Quark Dense + MoE FP8 & FP8 PTPC by @BowenBao in #10485
- Fix accept rate in speculative decoding metrics by @SiqiLi-Fighting in #13212
- Set max parallel for 1-gpu runner by @hnyls2002 in #13215
- Add model validation for all GPU runners to prevent cache corruption by @alisonshao in #13171
- Bump actions/download-artifact from v4 to v6 for B200 workers by @Kangyan-Zhou in #13220
- docs: update fused MoE config path by @edwardzjl in #13211
- [PD / HiCache]fix decode kvcache offload manager memory leak by @huangtingwei9988 in #12774
- Fix broken Markdown formatting in DeepEP documentation by @Taishi-N324 in #13210
- [Feature] Enable CUDA graph for PD-Multiplexing. by @ykcombat in #11595
- Fix wrong running_bs in priority scheduling by @dtcccc in #13142
- [sgl-kernel] support custom fp8 flashmla kernel by @FlamingoPg in #13087
- [router][grpc] Refine docs in minimax_m2 to match other parsers by @CatherineSue in #13218
- Update GDN causal conv1d cuda kernel - prepare for new changes by @byjiang1996 in #13188
- Use 32x32 black image for VLM server warmup and bring glm4.1v back to UT by @byjiang1996 in #13222
- [sgl-kernel] clean up fa fetch in CMakeLists.txt by @FlamingoPg in #12392
- [Auto Sync] Update pynccl_wrapper.py, environ.py, registry.... (20251111) by @merrymercy in #13097
- [AMD] Fix AITER_MXFP4_MOE_SF setting for gfx950 by @hubertlu-tw in #13239
- remove deprecated `tile_tokens_dim` by @b8zhong in #13186
- [feat] make warmup timeout configurable through SGLANG_WARMUP_TIMEOUT by @billishyahao in #13243
- fix: fix serve command without diffusion dependency by @mickqian in #13246
- docker: Fix apt-add-repository by @kshitij12345 in #13213
- [Auto Sync] Update backend.py, forward_batch_info.py, piece... (20251113) by @merrymercy in #13221
- Fix nightly tests to fail properly when any job fails by @alisonshao in #13096
- Super tiny fix outdated doc by @fzyzcjy in #13255
- Remove glm41v from CI to speed up CI by @byjiang1996 in #13257
- Update model weight validation logic to handle special weight file naming by @Kangyan-Zhou in #13256
- Add 3 models to 2 gpu runner in model downloading from nightly tests by @Kangyan-Zhou in #13261
- [Tool Call] Streamline function arguments when tool_choice="auto" for deepseekv31_detector by @Muqi1029 in #11589
- [model-gateway] change mg labeler from router to model-gateway by @slin1237 in #13265
- ci: speed up b200 ci by @b8zhong in #13237
- Extend lint test to test/ directory by @Kangyan-Zhou in #13247
- [BugFix] weight load bug when checkpoint expert.gate and expert.up_proj are not fused by @Yuechguo in #13113
- Tiny fix update version logic location by @fzyzcjy in #12620
- Tiny enhance dumper with ctx and enable flags by @fzyzcjy in #12622
- Enhance dumper comparator with tensor unifier and location finder by @fzyzcjy in #12623
- Tiny add utility to parse server logs by @fzyzcjy in #12605
- [CPU] Use covt_e4m3_bf16 to optim BF16 to FP8 convert by @wangyxbh in #12191
- chore: bump flashinfer v0.5.2 by @zhyncs in #13242
- Super tiny fix CI by @fzyzcjy in #13283
- Add script to create a model with fewer layers for debugging by @fzyzcjy in #13284
- [BugFix] fix bench_serving error when multimodal image is testing by @ZLkanyo009 in #13254
- Optimized prefill cache allocation for NPU by @terfendail in #13288
- refactor: remove duplicate function _get_bootstrap_info_from_server by @acelyc111 in #13277
- LLama4 Attention: Update assertion msg by @kshitij12345 in #12777
- [Deepseek V3.2] Clean up MTP by @hlu1 in #13236
- Support orion by @ppraneth in #10665
- model: support teleflm by @ppraneth in #10573
- [Doc] Add item for repetition punishment by @SenmiaoORZ in #13260
- Support FP8 Per Token Quant Piecewise by @hebiao064 in #13272
- [minor] remove debug code in python/sglang/srt/compilation/weak_ref_tensor_jit.py by @merrymercy in #13235
- [NVIDIA] Fix use case of SGLANG_ENABLE_FLASHINFER_GEMM by @kaixih in #13274
- Fix NSA indexer nightly test failed issues by @Johnsonms in #13298
- Implement nightly test workflow naming conventions by @alisonshao in #13170
- Remove nightly b200 tests and revert a change for test file by @Kangyan-Zhou in #13305
- [router]Replace requests lib with openai in e2e_response_api by @XinyueZhang369 in #13293
- Piecewise Cuda Graph Support for gpt-oss model by @Oasis-Git in #13045
- Add missing model in model validate list by @Kangyan-Zhou in #13310
- Add missing model for 2-gpu-runner in nightly tests by @Kangyan-Zhou in #13311
- [Deterministic] Support Qwen3-Next model deterministic inference by @zminglei in #13100
- CI: server performance test for SGLang Diffusion by @adarshxs in #13091
- [model-gateway] smg release 0.2.3 by @slin1237 in #13312
- Fix syntax errors in cpp_radix_tree by @Missmiaom in #13315
- [Misc]Add date to cu13 dev image tag by @Fridge003 in #13316
- [Diffusion] switch to local `calculate_dimensions` by @adarshxs in #13294
- Add more statistics for spec decoding by @zhuzilin in #13317
- Consolidate similar tests to reduce duplication by @alisonshao in #12871
- feat: Add FP4 (E2M1) KV Cache Support for MHA by @JackChuang in #12612
- Fix: test_vlm_offline_throughput output throughput by @dougyster in #13279
- Add feature flag to broadcast mm inputs processing by @yuan-luo in #13278
- re-submit 12911 but relax the requirement for deepgemm by @zminglei in #13226
- Revert moe sum reduce for marlin moe by @ispobock in #13314
- [RL] support only do cpu backup on draft model by @zhuzilin in #13318
- [model-gateway] move python to binding folder by @slin1237 in #13295
- [RL] Allow bypassing /health check by @zhuzilin in #13320
- Support fast gemm when in batch invariant DeepGEMM fallback by @fzyzcjy in #13259
- Support inverse transform ue8m0 scale by @fzyzcjy in #13285
- Tiny refactor condition to requant scale ue8m0 by @fzyzcjy in #13286
- [RL] support update_weights_from_tensor for mtp by @zhuzilin in #7415
- Update marlin moe kernel interface by @ispobock in #13322
- [opt kimi k2 1 / n] Add kimi k2 moe fused gate by @BBuf in #13287
- Super tiny expose transform_scale_ue8m0 API for RL frameworks by @fzyzcjy in #13323
- Update README by @zhyncs in #13326
- Fix: add missing get_embed_and_head in MiniMax M2 for Eagle3 by @pyc96 in #13297
- [router] Fix flaky router e2e tests by @XinyueZhang369 in #13306
- Opt tp: tp attn support tp reduce scattered input by @xu-yfei in #10568
- [Diffusion] add health endpoints to diffusion server by @adarshxs in #13329
- [model-gateway] remove grpc feature flag and mark as default by @slin1237 in #13330
- Temporarily disable test_vision_openai_server_a CI by @ispobock in #13331
- [Feature] Spec-Overlap supporting DP-ATTN; PD-Disaggregation; npugraph mode by @iforgetmyname in #12443
- [Ascend][Feat] Add Ascend sampling backend by @Alexhaoge in #12692
- perf: optimize TypeBasedDispatcher using dict for O(1) lookup by @zhengxle in #12001
- tiny fix lint by @hnyls2002 in #13337
- [optimize] Provide Usrbio compilation and installation commands by @leihuang-sketch in #12329
- chore: bump sgl-kernel version to 0.3.17.post1 by @sglang-bot in #13325
- Clean up deprecated tile_tokens_dim for next flashinfer by @vincentzed in #13341
- Add FP32 dtype support for RoPE - Part1 by @jinyouzhi in #13181
- Add missing models by @Kangyan-Zhou in #13351
- [CI] check unit-test-backend-8-gpu-h20 in workflow by @ch-wan in #13355
- [Fix] Register custom ops only if they exist by @merrymercy in #13321
- [Piecewise CUDA Graph] Support ModelOpt FP4 by @b8zhong in #13101
- chore: bump sgl-kernel version to 0.3.17.post1 by @sglang-bot in #13358
- [feature] Add layerwise NVTX support by @kyleliang-nv in #11870
- [Ascend]support xgrammar backend for ascend npu by @ash-sigh in #12310
- [Piecewise CUDA Graph] Support W4A8 by @b8zhong in #13179
- [Performance] Move the contiguous to torch compile region by @DarkSharpness in #13199
- Fix dpsk-r1-fp4 tp8 by reverting two commits (#13162 and #13341) by @Qiaolin-Yu in #13348
- Add missing models by @Kangyan-Zhou in #13369
- [RL] enable offloading hybrid linear attn model by @zhuzilin in #13336
- [model-gateway] fix model gateway pypi release workflow path by @slin1237 in #13372
- [model-gateway] fix SDist step readme path by @slin1237 in #13373
- [diffusion] refactor and added tests for Flux, T2V, TI2V, I2V by @adarshxs in #13344
- diffusion: correct check-changes for multimodal_gen by @mickqian in #13375
- diffusion: enable fa4 for blackwell by @yhyang201 in #13263
- [opt kimi k2 2/n] apply kimi k2 thinking moe_fused_gate by @BBuf in #13332
- [2 / 2] apply sgl-kernel weak_ref_tensor by @BBuf in #12978
- Add SGLANG_ENABLE_REQ_POOL_LEAK_STRICT_CHECK to bypass mem leak check by @zhuzilin in #13339
- Add default enable_memory_saver to HybridLinearKVPool by @zhuzilin in #13371
- Cleanup vision attention related codes by @JustinTong0323 in #13228
- Remove unused code / testcases in `lang` by @hnyls2002 in #13335
- Tiny deprecate the `--range-begin` in `run_suite.py` by @hnyls2002 in #13381
- [router] bindings for go by @whybeyoung in #13384
- fix generative_models.md table - remove newlines by @netanel-haber in #13385
- fix nightly docker build by @b8zhong in #13386
- fix import qwenvl error in RL engine by @dangkai4u in #12874
- [CI] use cached deepep installation in gb200 CI by @ch-wan in #13388
- Support spec decoding when LoRA is applied to target model by @lifuhuang in #12903
- [CI] Fix B200 CI by @Fridge003 in #13387
- [Tiny]Fix 1-gpu nightly test bugs by @Fridge003 in #13389
- chore: bump SGLang version to 0.5.5.post3 by @sglang-bot in #13366
- refactor: replace worker pool with semaphore-based concurrency in jobqueue by @RiversJin in #13383
- Update docs by @merrymercy in #13391
- (1/n)support context parallel with deepseekv3.2-DSA by @lixiaolx in #12065
- [RL] re-abort_request when model_update_lock is still locked by @zhuzilin in #13338
- [1/N] CI refactor: introduce CI register. by @hnyls2002 in #13345
- [Doc] Update CI oncall list by @merrymercy in #13396
- Update .github/MAINTAINER.md by @Ying1123 in #13398
- [HiCache] support memory_pool_host page head layout by @huangtingwei9988 in #11644
- [HiCache] add GPU id to IB dev topo for mooncake storage backend by @stmatengss in #13112
- Support weight update for blackwell DeepGEMM by @fzyzcjy in #13324
- refactor linear memory pool by @yizhang2077 in #13004
- Remove deprecated scripts by @hnyls2002 in #13399
- [model-gateway] update workflow names for gateway and exclude npu by @slin1237 in #13415
- [Tiny fix] Fix bench_speculative.py run bug by @BBuf in #13416
- [model-gateway] Add Gateway Release Tooling by @slin1237 in #13420
- fix uneven PP layer indices by @alpha-baby in #13282
- diffusion: fix wan2.2 ti2v num_frames adjust logic by @mickqian in #13379
- [PD][bug fix] fix memleak when last_batch is none by @XucSh in #13144
- Fix cache_tokens calculate issue when retracted by @QiuMike in #11900
- [feature] Custom base path on FastAPI server by @kebyn in #5879
- Adding user defined hooks support by @Carlomus in #13217
- Fix log time stats by @qhsc in #13418
- [Ci tiny fix] Lower score threshold in evaluation test by @BBuf in #13443
- diffusion: fix loading with local model_path by @mickqian in #13445
- [2/N] CI refactor: separate some backend-independent CPU tasks. by @hnyls2002 in #13447
- Temporarily disable model hooks CI by @hnyls2002 in #13450
- [Deepseek V3.2] Use torch.compile to speed up torch.cat in nsa by @hlu1 in #13022
- Remove verbs from GET endpoint paths to follow REST standards by @slin1237 in #13273
- Add missing models by @Kangyan-Zhou in #13456
- extend sagemaker.Dockerfile serve script to allow all sglang serve flags by @sirutBuasai in #13173
- Fix 8-gpu B200 nightly tests by @Kangyan-Zhou in #13457
- Fixes validation errors for Wan-AI models which store model weights in subdirectories by @Kangyan-Zhou in #13461
- [Embeddings Performance Testing] Add performance test for embedding models by @vedantjh2 in #12359
- [NVIDIA] Fix broken fp8 MoE of deepseek v3 by @kaixih in #13264
- Temporarily comment out multimodal gen test to recover runners by @Kangyan-Zhou in #13463
- Add interface_v1 option for dynamic HiCache backend by @pansicheng in #13140
- Add bfloat16 tuned fused moe config for Dpsk-MTP layer on B200 by @Fridge003 in #13455
- fix MambaPool clear method after refactoring by @zminglei in #13449
- [AMD CI] Update sgl-router python path in dockerfile. by @saienduri in #13458
- [CI] re-enable test_vision_openai_server_a ci by @yhyang201 in #13444
- Adding CI Monitor Improvements by @dougyster in #13462
- [GLM4.6v] Required changes for bumping up to transformer 5.x by @byjiang1996 in #13229
- [GLM4.6v] Relax the constraint of non-user role chat completion message schema for new GLM-v release by @byjiang1996 in #13258
- [model-gateway] use worker startup time out for worker registration by @slin1237 in #13473
- Support JetVLM by @futrime in #13289
- Add a unified server arg for multimodal inputs preprocessing config. by @WingEdge777 in #12149
- [PD] Clarify init method docstrings for kvsender and kvreceiver by @ShangmingCai in #13476
- Fix lora test by @hnyls2002 in #13479
- [Piecewise CUDA Graph] Support ModelOpt FP8 by @b8zhong in #13094
- CI: fix NFS EBUSY error in PR test workflow by @alisonshao in #13460
- [CI] fix triggered by a non-run-ci label by @hnyls2002 in #13393
- [CI] remove auto-labeling `run-ci` label. by @hnyls2002 in #13486
- fix: change performance log directory to cache path by @ch-wan in #13482
- [CI] Add input for pr-gate by @hnyls2002 in #13491
- [opt kimi k2 3/n] opt kimi_k2 moe_fused_gate kernel by @BBuf in #13374
- [CI] fix lint yml (syntax error) by @hnyls2002 in #13496
- [VLM][feat] Support encoder DP for Qwen2.5-VL by @liusy58 in #13126
- [HiCache] Critical fix to host memory double free by @xiezhq-hermann in #13501
- [BugFix] Accuracy and function Issue when run ptpc quant model by @Yuechguo in #13157
- fix: create git tags directly instead of temporary branches by @alisonshao in #13168
- Add .github/CI_PERMISSIONS.json to define the CI permissions by @merrymercy in #13509
- README.md -> FOLDER_README.md by @merrymercy in #13510
- Use slash command to trigger CI by @merrymercy in #13512
- Add docs on trigger ci by @merrymercy in #13513
- [Feature] Re:Enable hybrid mem saver by @ocss884 in #12962
- Trigger CI retry with edit by @merrymercy in #13516
- Update docs by @merrymercy in #13519
- Add /tag-and-rerun-ci by @sglang-bot in #13521
- [CI] update pr-gate to be compatible with new slash triggering manager. by @hnyls2002 in #13522
- [CI] fix skipping pr-gate on main by @hnyls2002 in #13525
- Small cleanups related to LoRA weight loading by @glenliu21 in #13474
- [CI] fix CI skipped on main by @hnyls2002 in #13527
- [model-gateway] fix gateway docker build due to recent py code change by @CatherineSue in #13532
- [model-gateway] limit opened files in docker build to fix edge case by @CatherineSue in #13536
- [docker] fix dockerfile naming for diffusion by @slin1237 in #13534
- fix lora test by @gongwei-130 in #13537
- Remove jet-ai/Jet-Nemotron-2B in nightly text tests as this is constantly failing by @Kangyan-Zhou in #13540
- [Bug] Fixes accuracy issues caused by incorrect use of rope by @Baidu-AIAK in #13495
- Flashinfer TRTLLM-GEN-MoE + Qwen3 by @b8zhong in #13489
- [chore] Disable ccache for sgl-kernel release by @Fridge003 in #13541
- Add Qwen/Qwen1.5-MoE-A2.7B to model list by @Kangyan-Zhou in #13543
- [Fix] Fix DeepSeek V3 MTP on B200 by @Fridge003 in #13548
- [router][grpc] Support num_reasoning_tokens in harmony models by @CatherineSue in #13047
- [feat][Ascend][Mindspore]: support model-impl of mindspore by @chz34 in #9234
- [AMD CI] Local cache fallback. by @saienduri in #13452
- [CI] fix amd 1 gpu basic test by @hnyls2002 in #13551
- [Doc] Update HiCache and Mooncake docs & Mooncake Setup Error Checking by @ykwd in #12740
- purge unnecessary env variable set in deterministic test by @zminglei in #13481
- chore: bump sgl-kernel version to 0.3.17.post2 by @sglang-bot in #13542
- Add `lmsys/gpt-oss-20b-bf16` to model validation check by @hnyls2002 in #13557
- CI Failure Monitor Improvements by @dougyster in #13558
- [RL] Allow passing tensors of different dtypes for FlattenedTensorBucket by @zhuzilin in #13413
- [CI] Fix CUDA workflow's dependency. by @hnyls2002 in #13568
- [NPU] Adapt pr-gate for pr-test workflow & workflows refresh by @iforgetmyname in #13567
- Tiny enhance test suites sanity check by @hnyls2002 in #13589
- [3/N] CI refactor: move some manually triggered tests. by @hnyls2002 in #13448
- Support moe topk sigmoid kernel by @rogeryoungh in #13049
- Extend compatibility check for all quantized MoE models by @JustinTong0323 in #13465
- add https://github.com/netanel-haber to CI_PERMISSIONS.json by @netanel-haber in #13577
- chore: bump sgl-kernel version to 0.3.17.post2 by @sglang-bot in #13570
- [Auto Sync] Update base_grammar_backend.py, collector.py (20251116) by @merrymercy in #13357
- [GDN] Remove unnecessary contiguous() by @byjiang1996 in #13604
- [GDN] Remove unnecessary conv state clone by @byjiang1996 in #13603
- [VLM] Support Piecewise CUDA Graph for Qwen2.5-VL by @yuan-luo in #13055
- CI: improve diffusion CI by @mickqian in #13562
- Support external custom models by @zhooooong in #13429
- [CI fix] Fix image download failures in VLM CI tests by @BBuf in #13613
- [NVIDIA] Add fp8 gemm benchmark on blackwell by @kaixih in #13528
- [UT] Destroy process group after broadcast to resolve port occupation issues in multi-server tests by @galeselee in #12379
- diffusion: remove PreprocessorConfig by @mickqian in #13248
- diffusion: refactor pipeline folders by @mickqian in #13253
- Add FP32 dtype support for RoPE - Part2 by @jinyouzhi in #13328
- [Fix] Remove multimodal_gen redundant get_bool_env_var func by @shauntajoesph-ops in #13583
- Add support for new aiter version (AR accuracy, is_shuffled PR) by @1am9trash in #13554
- diffusion: improve baseline performance monitor by @mickqian in #13614
- [Feature] Introduce JIT Kernel in sglang (with hicache JIT kernel) by @DarkSharpness in #13453
- [CI] Align metric units for CI rate limit by @hnyls2002 in #13633
- [ROCM] Optimized deepseek-r1 fp8 model with + triton_gemm_a8w8 + batch_gemm_a8w8 + fused set_mla_kv_buffer kernel by @yctseng0211 in #13617
- fix bench_speculative bug by @Lzhang-hub in #13197
- Revert "[Feature] Introduce JIT Kernel in sglang (with hicache JIT kernel)" by @merrymercy in #13644
- [CI] optimize CI workflow info by @hnyls2002 in #13634
- Kill zombie diffusion processes in CI & minor code style fix on rotary embedding fallback by @merrymercy in #13637
- [CI] apply pr-gate for XPU by @hnyls2002 in #13663
- Add fused_rmsnorm_gated_cpu kernel for CPU to support Qwen3-Next by @yanbing-j in #11577
- [10/n] decouple quantization impl from vllm dependency - fix import by @FlamingoPg in #13524
- Adding nightly tests as release guard for bot bump workflows by @dougyster in #13655
- [DeepseekV3.2] Deepseek fp8 support for MHA path by @YAMY1234 in #12964
- Fix launch of `Olmo3` by @vincentzed in #13666
- [Deepseek V3.2] Change indexer weights_proj to fp32 by @hlu1 in #13459
- enable csgmv automatically on cuda by @b8zhong in #13600
- Add nightly test CI monitor workflow by @alisonshao in #13038
- allow loras to be implicitly evicted and loaded based on max_loaded_loras by @glenliu21 in #11526
- Test reorganization: Move tests to manual/ by @alisonshao in #13610
- [Piecewise CUDA Graph] Fix recompile issue for Mixtral and Grok2 by @hebiao064 in #13667
- Super tiny remove unused MiniMaxM2MLP class by @fzyzcjy in #13659
- Update quantization.md with new model resources by @zhaochenyang20 in #13677
- [model-gateway] add both python and rust cli alias by @slin1237 in #13678
- [diffusion] CI: improve validation method by @mickqian in #13627
- [model-gateway] fix gateway cli arg parser to not use = by @CatherineSue in #13685
- [CI] Move nightly tests to test/nightly/ by @alisonshao in #13683
- [NVIDIA] Add cutedsl e2e test to GB200 CI by @kaixih in #12672
- Add sgl-kernel CI test for Blackwell (B200) by @alisonshao in #13301
- remove unnecessary starvation check by @glenliu21 in #13619
- Fix target MLA with eagle3 support for PD disaggregation by @QiuMike in #13555
- [kimi k2 thinking] Avoid useless torch.zeros_ by @BBuf in #13596
- [opt kimi k2 4 / n] Delete useless pad kernel in sgl_moe_align_block_size by @BBuf in #13587
- [VLM] Support Piecewise CUDA Graph for InternVL by @yuan-luo in #13640
- [Piecewise Cuda Graph] rename, refactor and add more logging by @hebiao064 in https://github.com/sgl-project/sglang/pull/13675
- diffusion: speed up multimodal_gen ci by @yhyang201 in https://github.com/sgl-project/sglang/pull/13665
- [diffusion] doc: minor update docs by @mickqian in https://github.com/sgl-project/sglang/pull/13177
- Fix ZMQ bind error on non-zero rank nodes when using SGLANG_BLOCK_NONZERO_RANK_CHILDREN=0 by @ishandhanani in https://github.com/sgl-project/sglang/pull/13686
- [diffusion] server: use meta to avoid Linear init for TextEncoder by @zyksir in https://github.com/sgl-project/sglang/pull/13564
- [Auto Sync] Update http_server.py, io_struct.py, scheduler_... (20251120) by @merrymercy in https://github.com/sgl-project/sglang/pull/13679
- [Bugfix] Fix hidden state size in EAGLE PD disaggregation buffers by @michelemarzollo in https://github.com/sgl-project/sglang/pull/13590
- [HiCache] fix unit test with changed new APIs by @stmatengss in https://github.com/sgl-project/sglang/pull/13498
- [Fix] Qwen3Next lmhead dtype by @ZeldaHuang in https://github.com/sgl-project/sglang/pull/13708
- [NPU] chore: bump to CANN 8.3.RC1 and Pytorch 2.8.0 by @iforgetmyname in https://github.com/sgl-project/sglang/pull/13647
- [11/N] MoE Refactor: Simplifying SBO Implementation with Dispatcher Hooks by @ch-wan in https://github.com/sgl-project/sglang/pull/13327
- [Clean code] Compressed_tensors_moe code clean by @BBuf in https://github.com/sgl-project/sglang/pull/13719
- [diffusion] profile: support performance metric dumping and comparison by @mickqian in https://github.com/sgl-project/sglang/pull/13630
- [AMD] Enable fused shared expert append and flatten quant for fp8 deepseekR1 model by @yichiche in https://github.com/sgl-project/sglang/pull/13705
- [diffusion] doc: add contributing.md by @mickqian in https://github.com/sgl-project/sglang/pull/13649
- fix 3fs down, lock schedule main thread by @weibingo in https://github.com/sgl-project/sglang/pull/13407
- Fix url: use https://roadmap.sglang.io for roadmap by @merrymercy in https://github.com/sgl-project/sglang/pull/13733
- Super tiny delete unused files by @fzyzcjy in https://github.com/sgl-project/sglang/pull/13734
- [diffusion] log: minor improve logging by @mickqian in https://github.com/sgl-project/sglang/pull/13735
- [CI] minor hot fix of model validation list by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13737
- Add to ci permission by @guapisolo in https://github.com/sgl-project/sglang/pull/13739
- [Piecewise CUDA Graph] Support Kimi-K2 (non-Thinking) by @b8zhong in https://github.com/sgl-project/sglang/pull/13466
- Fix: CI monitor should not exit with error on regressions by @alisonshao in https://github.com/sgl-project/sglang/pull/13694
- Revert "enable csgmv automatically on cuda" by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/13707
- Support torch 2.9 + DeepEP by removing custom nvshmem by @fzyzcjy in https://github.com/sgl-project/sglang/pull/12949
- add some more labels by @b8zhong in https://github.com/sgl-project/sglang/pull/13701
- Feat/nemotron nano v3 support by @roikoren755 in https://github.com/sgl-project/sglang/pull/12690
- Fix global scaling factor loading hang by @wenscarl in https://github.com/sgl-project/sglang/pull/13484
- Fix B200 Nightly tests and move one manual test back to unit test to prevent the same issue by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13746
- fix test_lora_update.py starvation message check by @glenliu21 in https://github.com/sgl-project/sglang/pull/13702
- Fix model weights validation with automatic cache cleanup by @alisonshao in https://github.com/sgl-project/sglang/pull/13729
- [Auto Sync] Update evict_policy.py, radix_cache.py (20251120) by @merrymercy in https://github.com/sgl-project/sglang/pull/13669
- [Tiny] Renaming environ for NVFP4 dispatch by @Fridge003 in https://github.com/sgl-project/sglang/pull/13756
- modularize gsm8k and mmmu test classes by @netanel-haber in https://github.com/sgl-project/sglang/pull/13506
- Use dual stream for DS MoE whenever cuda graph is used (instead of with token threshold) by @trevor-m in https://github.com/sgl-project/sglang/pull/9405
- [Ascend] support Kimi-K2-Thinking by @zhuyijie88 in https://github.com/sgl-project/sglang/pull/12759
- Refactor eagle bigram key matching by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13714
- fix hunyuanvideo and add 2gpu ci testing by @yhyang201 in https://github.com/sgl-project/sglang/pull/13720
- Update mem checker during busy by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13704
- Tiny support different prompts in `send_one.py` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13768
- [diffusion] refactor: refactor sampling params by @mickqian in https://github.com/sgl-project/sglang/pull/13706
- [VLM] Replace torch.repeat_interleave with faster np.repeat for Qwen-VL series by @yuan-luo in https://github.com/sgl-project/sglang/pull/13736
- [Spec v2] Remove `allocate_lens` and enable over-allocation by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13478
- tinyfix: diffusion ci by @yhyang201 in https://github.com/sgl-project/sglang/pull/13769
- align code style eagle draft&draft_extend cuda graph runner by @cicirori in https://github.com/sgl-project/sglang/pull/13533
- Refactor MHA & MLA KV caches to support FP4 by @JackChuang in https://github.com/sgl-project/sglang/pull/13547
- Move unnecessary input_addr capture under debug mode flag for speed-up by @byjiang1996 in https://github.com/sgl-project/sglang/pull/13690
- Gather static input buffers for cuda graph by @cctry in https://github.com/sgl-project/sglang/pull/13676
- Revert "Fix RMSNorm API CALL mismatch issue. (#10032)" by @ErsongWang in https://github.com/sgl-project/sglang/pull/13727
- [model-gateway] update smg code owner by @slin1237 in https://github.com/sgl-project/sglang/pull/13777
- [model-gateway] clean up router manager function order by @slin1237 in https://github.com/sgl-project/sglang/pull/13776
- Fix typo in docs by @yinpeiqi in https://github.com/sgl-project/sglang/pull/13709
- [Feature] HiCache JIT kernel (once again) by @DarkSharpness in https://github.com/sgl-project/sglang/pull/13764
- [DeepEP] Add SGLANG_DEEPEP_BF16_DISPATCH env var in Normal mode by @BBuf in https://github.com/sgl-project/sglang/pull/13787
- Upgrade flashmla kernel for NSA tp support by @YAMY1234 in https://github.com/sgl-project/sglang/pull/13718
- [diffusion] feat: support sp for image models by @mickqian in https://github.com/sgl-project/sglang/pull/13180
- [diffusion] CI: add run_suite to multimodal_gen CI by @mickqian in https://github.com/sgl-project/sglang/pull/13791
- Fix pagination bug in CI monitor preventing performance-test-2-gpu data collection by @alisonshao in https://github.com/sgl-project/sglang/pull/13781
- [Scheduler] Tiny organize code style by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13806
- [Deepseek] Refactor deepseek server_args _handle_model_specific_adjustments by @hlu1 in https://github.com/sgl-project/sglang/pull/13687
- [CI] Tiny refactoring sgl-kernel tests by @Fridge003 in https://github.com/sgl-project/sglang/pull/13813
- Tune fp8_w8a8 fused triton moe for GLM-4.6-FP8 by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/13815
- make trtllm attn backend's init_forward_metadata non blocking by @cicirori in https://github.com/sgl-project/sglang/pull/13802
- remove package json which is not used by @slin1237 in https://github.com/sgl-project/sglang/pull/13810
- [1/2] Refactor DeepGEMM requant for FP8 Linear on Blackwell by @Fridge003 in https://github.com/sgl-project/sglang/pull/13601
- chore: bump sgl-kernel version to 0.3.18 by @sglang-bot in https://github.com/sgl-project/sglang/pull/13816
- xgrammar up version to 0.1.27 by @Swipe4057 in https://github.com/sgl-project/sglang/pull/13650
- Fix bug: Incorrect variable used in rem_total_token_offset calculatio… by @liuhuijiayou in https://github.com/sgl-project/sglang/pull/13201
- [Doc] Refine fused_moe_triton configs doc by @BBuf in https://github.com/sgl-project/sglang/pull/13820
- Update MindSpore documentation by @wangtiance in https://github.com/sgl-project/sglang/pull/13656
- Refactor cache init logic by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13800
- [Bugfix] Add jit kernel files in packaging by @yuan-luo in https://github.com/sgl-project/sglang/pull/13829
- [diffusion] doc: minor update contributing.md with test section by @mickqian in https://github.com/sgl-project/sglang/pull/13792
- [misc] Rename minilb install env & remove files & fix lint by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13831
- [diffusion] CI: send nightly-test outputs of diffusion to slack for correctness monitoring by @yhyang201 in https://github.com/sgl-project/sglang/pull/13833
- [chore]Upgrade flashinfer to 0.5.3 by @Fridge003 in https://github.com/sgl-project/sglang/pull/13751
- [Intel XPU]support xgrammar backend for intel xpu by @gaopengff in https://github.com/sgl-project/sglang/pull/13245
- [sgl-kernel Code Clean] Remove useless lightning_attention kernel by @BBuf in https://github.com/sgl-project/sglang/pull/13819
- [VLM] Revise InternVL Piecewise CUDA Graph Supporting by @yuan-luo in https://github.com/sgl-project/sglang/pull/13846
- Fix TorchAO quant in VLM by @zhooooong in https://github.com/sgl-project/sglang/pull/13508
- [Fix]: Adjust FutureMap's token_id_bufs Size to Prevent ChunkedPrefill's next_token_ids from Overwriting Previous Prefill Requests' next_token_id by @ant-yy in https://github.com/sgl-project/sglang/pull/13713
- Fix: Safe RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads by @YAMY1234 in https://github.com/sgl-project/sglang/pull/11871
- [Fix] Fix uvloop get_event_loop() is not suitable for 0.22.x by @tom-jerr in https://github.com/sgl-project/sglang/pull/13612
- Tiny unpin uvloop for other backends by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13858
- [model-gateway] Refactor router e2e responses tests by @XinyueZhang369 in https://github.com/sgl-project/sglang/pull/13745
- [Perf] Optimize DeepSeek-R1 w4afp8 glue kernels by @yuhyao in https://github.com/sgl-project/sglang/pull/10027
- Fix quantized moe checker fail for Qwen3 dense fp8 model by @fzyzcjy in https://github.com/sgl-project/sglang/pull/13853
- [model-gateway] add grpc server code owner by @slin1237 in https://github.com/sgl-project/sglang/pull/13865
- [BugFix] fix outplace_fused_experts missing is_gated by @zminglei in https://github.com/sgl-project/sglang/pull/13864
- fix xgrammar_backend crash with malformed inputs by @gongwei-130 in https://github.com/sgl-project/sglang/pull/13752
- [Auto Sync] Update schedule_batch.py, schedule_policy.py, b... (20251122) by @merrymercy in https://github.com/sgl-project/sglang/pull/13763
- [Doc] Add an Introduction to Expert Parallelism by @ch-wan in https://github.com/sgl-project/sglang/pull/13783
- add LoRA warning if loading a preexisting LoRA adapter with a different name by @glenliu21 in https://github.com/sgl-project/sglang/pull/13822
- [NPU] Fix NPU CI by @iforgetmyname in https://github.com/sgl-project/sglang/pull/13834
- Overlap glm moe gemms in two cuda streams by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/13786
- [Performance] Replace preprocess_video logic from GLM multimodal processor with transformer impl for speed up (up to 27% faster) and addressing OOM (up to 50x improvements) by @byjiang1996 in https://github.com/sgl-project/sglang/pull/13487
- Add support for bf16 x bf16 cutlass fused MoE by @nvcastet in https://github.com/sgl-project/sglang/pull/10275
- [Router bugfix] Fix router_manager selecting the wrong router when enable-igw. by @SYChen123 in https://github.com/sgl-project/sglang/pull/13572
- Fix nightly test job to fail when any test fails by @alisonshao in https://github.com/sgl-project/sglang/pull/13871
- [diffusion] refactor: remove training-related code by @mickqian in https://github.com/sgl-project/sglang/pull/13860
- [CI] fix multimodel-gen-test job by @cyb70289 in https://github.com/sgl-project/sglang/pull/13874
- Add validation and cleanup for corrupted safetensors in multimodal loader by @alisonshao in https://github.com/sgl-project/sglang/pull/13870
- [CI] fix lint error by @cyb70289 in https://github.com/sgl-project/sglang/pull/13891
- fix: draft model revision misuse model revision by @gongwei-130 in https://github.com/sgl-project/sglang/pull/11893
- Fix trace publish paths in nightly-test-nvidia workflow by @alisonshao in https://github.com/sgl-project/sglang/pull/13888
- Adding nightly tests for Kimi-K2-thinking, Qwen3, minimax-m2, GLM4.6 by @dougyster in https://github.com/sgl-project/sglang/pull/13890
- [Fix] JIT kernel dependencies in other platforms by @DarkSharpness in https://github.com/sgl-project/sglang/pull/13889
- remove RoPE CPU fp32 tests by @ZailiWang in https://github.com/sgl-project/sglang/pull/13827
- Move test_dummy_grok_models.py from manual to srt (temporary) by @alisonshao in https://github.com/sgl-project/sglang/pull/13901
- [CI tiny fix] Enhance robustness of vision chunked prefill test with ROUGE-L metric by @BBuf in https://github.com/sgl-project/sglang/pull/13793
- update flashinfer_cubin==0.5.3 by @Lzhang-hub in https://github.com/sgl-project/sglang/pull/13848
- Add test_dummy_grok_models.py to not_in_ci section by @alisonshao in https://github.com/sgl-project/sglang/pull/13908
- fix diffusion profile bugs by @yizhang2077 in https://github.com/sgl-project/sglang/pull/13642
- [Fix]: Further fix the buffer len of future map by @ant-yy in https://github.com/sgl-project/sglang/pull/13916
- [diffusion] CI: minor refactor CI for less code duplication by @mickqian in https://github.com/sgl-project/sglang/pull/13905
- Update release-whl-kernel.yml by @ispobock in https://github.com/sgl-project/sglang/pull/13921
- [Ascend] qwen optimization by @Liwansi in https://github.com/sgl-project/sglang/pull/12078
- Support piecewise cuda graph for Qwen3-next by @Chen-0210 in https://github.com/sgl-project/sglang/pull/13081
- fix nixl prefill crash make decode health check failed by @llc-kc in https://github.com/sgl-project/sglang/pull/13657
- [CI] CI registry update by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13927
- [CI] rename: `per_commit` -> `registered` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13928
- [diffusion] doc: add doc for LoRA usage by @mickqian in https://github.com/sgl-project/sglang/pull/13931
- [diffusion] feat: support LoRA by @mickqian in https://github.com/sgl-project/sglang/pull/13859
- [CPU] Apply PR gating rule in CI workflow by @ZailiWang in https://github.com/sgl-project/sglang/pull/13933
- [misc] add llama3.1 chat template by @slin1237 in https://github.com/sgl-project/sglang/pull/13935
- [Minor] Fix lint by @Fridge003 in https://github.com/sgl-project/sglang/pull/13938
- [DeepSeekV3.2] Centralize NSA dispatch logic in NativeSparseAttnBackend by @YAMY1234 in https://github.com/sgl-project/sglang/pull/13544
- Fix docstrings for v1 HiCacheStorage methods by @ptovam in https://github.com/sgl-project/sglang/pull/13851
- Add Llama4 attention backend auto-selection by @janbernloehr in https://github.com/sgl-project/sglang/pull/13421
- Improve nightly tests by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13903
- [Auto Sync] Improve profilers and simplify bench_one_batch_server.py by @merrymercy in https://github.com/sgl-project/sglang/pull/13866
- Fix update weight error for blackwell DeepGEMM by @fzyzcjy in https://github.com/sgl-project/sglang/pull/13910
- [chore] update torch version to 2.9 by @FlamingoPg in https://github.com/sgl-project/sglang/pull/12969
- chore: bump sgl-kernel version to 0.3.18.post1 by @sglang-bot in https://github.com/sgl-project/sglang/pull/13942
- [Tiny]Upgrade README for sgl-kernel by @Fridge003 in https://github.com/sgl-project/sglang/pull/13945
- Fix nightly-test-nvidia.yml to have the correct trigger by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13950
- chore: bump sgl-kernel version to 0.3.18.post1 by @sglang-bot in https://github.com/sgl-project/sglang/pull/13951
- Fix Deepseek v3.1 loading issue by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13954
- fix: spec overlap `predict` shape does not match verify output shapes by @timmy-feng in https://github.com/sgl-project/sglang/pull/12786
- [VLM] Support InternVL Vision Encoder Data Parallelism by @yuan-luo in https://github.com/sgl-project/sglang/pull/13925
- Support FlashAttention3 page_size > 1 and topk > 1 case with paged attn and spec decode by @yubofredwang in https://github.com/sgl-project/sglang/pull/7725
- Fix nightly test failures: NSA indexer dtype and CPP radix cache init by @alisonshao in https://github.com/sgl-project/sglang/pull/13958
- update CI permission list by @ZailiWang in https://github.com/sgl-project/sglang/pull/13962
- Fix SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION not working by @fzyzcjy in https://github.com/sgl-project/sglang/pull/13961
- [feat] support in-flight weight update by @ShawnY112358 in https://github.com/sgl-project/sglang/pull/10071
- Turn off PREBUILD aiter in MI355 by @1am9trash in https://github.com/sgl-project/sglang/pull/13963
- Support piecewise CUDA graph for embedding models by @zhooooong in https://github.com/sgl-project/sglang/pull/13852
- diffusion: fix the issue where the qwen-edit&wan model produces incorrect output during sequence parallelism by @yhyang201 in https://github.com/sgl-project/sglang/pull/13922
- [Feature] Initial block diffusion language model support by @ClawSeven in https://github.com/sgl-project/sglang/pull/12588
- Use dynamically maintained num_waiting_tokens in get_load() by @vipwangerxiao in https://github.com/sgl-project/sglang/pull/13203
- Optimize uneven PP layer distribution logic to improve PP performance by @ShangmingCai in https://github.com/sgl-project/sglang/pull/13977
- Fix `get_load` API by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13991
- Rename: `--hooks` to `--forward-hooks` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/13994
- [Ascend] Support enable-mixed-chunk in non-MLA scenarios by @MichelleWu351 in https://github.com/sgl-project/sglang/pull/12491
- fix spec dec request level metrics by @vedantjh2 in https://github.com/sgl-project/sglang/pull/13754
- [model-gateway] Add PostgreSQL support to binding by @xuwenyihust in https://github.com/sgl-project/sglang/pull/13766
- Put `pr-gate` after `check-changes` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14009
- [diffusion] model: support black-forest-labs/FLUX.2-dev by @mickqian in https://github.com/sgl-project/sglang/pull/14000
- fix: correct usage of minimax-m2 deepep moe forward by @yuukidach in https://github.com/sgl-project/sglang/pull/13892
- Support internvl on Blackwell (which doesn't support fa3): add `SingletonCache` support to Vision{Sdpa|Triton|Ascend}Attention by @netanel-haber in https://github.com/sgl-project/sglang/pull/13151
- [model-gateway] fix xpu ci by @slin1237 in https://github.com/sgl-project/sglang/pull/14012
- [ci] mark skip as success instead of failure by @slin1237 in https://github.com/sgl-project/sglang/pull/14014
- Revert "Fix nightly test failures: NSA indexer dtype and CPP radix cache init" by @Fridge003 in https://github.com/sgl-project/sglang/pull/14015
- [model gateway][grpc] Add tojson filter to override minijinja's tojson by @CatherineSue in https://github.com/sgl-project/sglang/pull/14013
- [ci] allow manual label to trigger ci in rust, change ci order by @slin1237 in https://github.com/sgl-project/sglang/pull/14016
- [model-gateway][doc] Update transport terminology to protocol in README.md by @xuwenyihust in https://github.com/sgl-project/sglang/pull/13872
- Fix Nvidia nightly test trigger params when it is triggered by parent workflow by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13966
- Update CODEOWNERS for layer and executor files by @hebiao064 in https://github.com/sgl-project/sglang/pull/14020
- [Code sync] Fix registration of some ops in grok & Fix oss sync scripts by @merrymercy in https://github.com/sgl-project/sglang/pull/13990
- Add stress test workflow by @dougyster in https://github.com/sgl-project/sglang/pull/13937
- Temporarily disable test_update_weights_from_disk.py in CI by @alisonshao in https://github.com/sgl-project/sglang/pull/14021
- [model-gateway] Fix flaky test_circuit_breaker_half_open_failure_reopens by @XinyueZhang369 in https://github.com/sgl-project/sglang/pull/14019
- Add adapter_model.safetensors to corruption validation for LoRA by @alisonshao in https://github.com/sgl-project/sglang/pull/14022
- fix: cuda graph issue while running longcat_flash by @tianhaoz95 in https://github.com/sgl-project/sglang/pull/14007
- Fix nightly test failure: CPP radix cache init by @alisonshao in https://github.com/sgl-project/sglang/pull/14018
- Support KTransformers for Qwen3-VL moe by @mrhaoxx in https://github.com/sgl-project/sglang/pull/13983
- Add nightly test support to unified run_suite.py by @alisonshao in https://github.com/sgl-project/sglang/pull/13941
- Support nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 (and nvidia/C-RADIOv2-H) by @netanel-haber in https://github.com/sgl-project/sglang/pull/12277
- [feat] update bucketed weights from distributed by @ShawnY112358 in https://github.com/sgl-project/sglang/pull/13824
- Nightly test job filter by @alisonshao in https://github.com/sgl-project/sglang/pull/14025
- Cleanup server args by @merrymercy in https://github.com/sgl-project/sglang/pull/14027
- [Feat][NVFP4] Enable NVFP4 MoE for Qwen series models (eg. Qwen3-Next) #13761 by @samuellees in https://github.com/sgl-project/sglang/pull/13761
- Fix flashinfer cutlass MoE output shape for non-FP4-packed inputs by @alisonshao in https://github.com/sgl-project/sglang/pull/14028
- [model-gateway] allow refill rate to be zero by @slin1237 in https://github.com/sgl-project/sglang/pull/14030
- Fix installation for nvidia-nvshmem-cu12 by @ch-wan in https://github.com/sgl-project/sglang/pull/14033
- Fix nightly test failure: NSA indexer dtype by @alisonshao in https://github.com/sgl-project/sglang/pull/14017
- Add CODEOWNERS entry for batch_invariant_ops by @hebiao064 in https://github.com/sgl-project/sglang/pull/14026
- [Piecewise] support disable decode cuda graph when enable piecewise cuda graph by @hebiao064 in https://github.com/sgl-project/sglang/pull/13965
- fix: Fix AMD CI failures with HIP layernorm and PyPI connectivity by @sunxxuns in https://github.com/sgl-project/sglang/pull/13814
- Use trtllm mha decode kernel for target_verify in speculative decoding by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/13976
- [Intel XPU]Add xpu support for get_device_memory_capacity by @gaopengff in https://github.com/sgl-project/sglang/pull/13895
- [diffusion] perf: improve black-forest-labs/FLUX.2-dev by @mickqian in https://github.com/sgl-project/sglang/pull/14040
- [Feat]Add scheduler recv skipper weights to environment configuration by @jimmy-evo in https://github.com/sgl-project/sglang/pull/13855
- Tiny support 3D tensors in inverse_transform_scale_ue8m0 by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14002
- Support sanity checking weight consistency especially for RL by @fzyzcjy in https://github.com/sgl-project/sglang/pull/13854
- feat: Naive support Spec V2 + Constrained Decoding by @Ubospica in https://github.com/sgl-project/sglang/pull/13425
- Adjust max-parallel for CUDA CI by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14057
- [2/2] Refactor DeepGemm requant for FP8 FusedMoE on Blackwell by @Fridge003 in https://github.com/sgl-project/sglang/pull/13960
- Super tiny add comments to SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14048
- Remove disused B300 Dockerfile by @mmangkad in https://github.com/sgl-project/sglang/pull/13946
- Temporarily disabled test by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14069
- [chore] Arrange NV packages in Dockerfile by @Fridge003 in https://github.com/sgl-project/sglang/pull/13749
- Fix utils import issue for nightly tests by @alisonshao in https://github.com/sgl-project/sglang/pull/13944
- [sgl-kernel][1/2] Fused qk_norm_rope for Qwen3-MoE by @yuan-luo in https://github.com/sgl-project/sglang/pull/14036
- [diffusion] refactor: refactor condition image resize logic by @mickqian in https://github.com/sgl-project/sglang/pull/14079
- [diffusion] refactor: refactor ComponentLoader and support loading native models from diffusers and transformers by @mickqian in https://github.com/sgl-project/sglang/pull/13205
- fix: small changes to enable test_mrope.py by @raayandhar in https://github.com/sgl-project/sglang/pull/14082
- Fix structural_tag tool call with null schema by @AzazKamaz in https://github.com/sgl-project/sglang/pull/14006
- [Bugfix] input prompt was not logged by @alphabetc1 in https://github.com/sgl-project/sglang/pull/13936
- Support configuring the request limit per receiving poll by @vipwangerxiao in https://github.com/sgl-project/sglang/pull/14076
- [Bugfix] qwen2.5-vl spec decode accept_len low by @Lzhang-hub in https://github.com/sgl-project/sglang/pull/13904
- support qwen3_vl vision model dp by @Lzhang-hub in https://github.com/sgl-project/sglang/pull/13724
- [diffusion] refactor: clean useless config files by @mickqian in https://github.com/sgl-project/sglang/pull/14094
- Fix overlap scheduler not taking effect when outputting logprobs by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14096
- diffusion: support zimage by @yhyang201 in https://github.com/sgl-project/sglang/pull/14067
- [CPU] Apply uv as package manager by @ZailiWang in https://github.com/sgl-project/sglang/pull/14106
- Fix NIXL OBJ descriptors by @tshmilnvidia in https://github.com/sgl-project/sglang/pull/10712
- [model-gateway] Add version command support to SMG by @tonyluj in https://github.com/sgl-project/sglang/pull/12558
- Disable Deepep 2 GPU tests by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14111
- fix: malformed KV events for NVIDIA Dynamo by @PeaBrane in https://github.com/sgl-project/sglang/pull/13488
- enable piecewise cuda graph for prefill server by @fjybiocs in https://github.com/sgl-project/sglang/pull/13377
- [diffusion] log: unify generation performance logging by @mickqian in https://github.com/sgl-project/sglang/pull/14117
- Remove incorrect deep_gemm assertions from server_args.py by @ch-wan in https://github.com/sgl-project/sglang/pull/14113
- Add auto-tune workflow by @merrymercy in https://github.com/sgl-project/sglang/pull/14124
- feat: support flashinfer kernel autotune by @elvischenv in https://github.com/sgl-project/sglang/pull/12306
- Move piecewise cuda graph test to manual dir to fix CI by @ShangmingCai in https://github.com/sgl-project/sglang/pull/14121
- [diffusion] chore: add resolution shortcuts by @mickqian in https://github.com/sgl-project/sglang/pull/14129
- Super tiny fix typo by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14131
- add runtime check for PyTorch 2.9.1 + CuDNN < 9.15 to prevent Conv3d performance issues by @yhyang201 in https://github.com/sgl-project/sglang/pull/14119
- Trigger PR test on main every 3 hours instead of push event by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14130
- fix RuntimeError: RMSNorm failed with error code an illegal memory access was encountered by @gongwei-130 in https://github.com/sgl-project/sglang/pull/14135
- Always run all stages in cron based PR tests by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14151
- Fix condition for streaming output_ids in tokenizer manager by @merrymercy in https://github.com/sgl-project/sglang/pull/13759
- Fix Minimax M2 loading issue by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/13956
- Tiny fix DeepGEMM precompile rank check by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14136
- Super tiny add more info in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14145
- Fix spec v2 does not support RL update weights from tensor by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14146
- Support checking fp8 params in weight_checker by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14147
- Show errors when misusing env variables by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14154
- Always run model evaluation even if the trace upload step fails by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14157
- diffusion: Fix LoRA weight merging for torch.nn.Linear layers in diffusers modules by @niehen6174 in https://github.com/sgl-project/sglang/pull/14150
- add cpp files for cpp_radix_tree to pyproject.toml. by @strgrb in https://github.com/sgl-project/sglang/pull/14052
- feat: longcat flash add aux layers capture for eagle3 by @tianhaoz95 in https://github.com/sgl-project/sglang/pull/14161
- Implement profiler v2 and fix stage mixture bug by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14148
- Support numactl bind for CPU and memory before process starts by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14156
- Support grammar + spec + reasoning by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14163
- bugfix[schedule]: Excessive preemption occurs when preempting running requests to schedule new prefill requests. by @CLFutureX in https://github.com/sgl-project/sglang/pull/12494
- Fix LMCache unit test and init bug by @DongDongJu in https://github.com/sgl-project/sglang/pull/14005
- [ci]fix deepep import error on H20 action by @HanHan009527 in https://github.com/sgl-project/sglang/pull/14166
- [Minor]Raise Error when deepep num dispatch token per rank is smaller than cuda graph bs by @Fridge003 in https://github.com/sgl-project/sglang/pull/14065
- [sgl-kernel] fix b200 kernel ci by @FlamingoPg in https://github.com/sgl-project/sglang/pull/13907
- Revert "[Minor]Raise Error when deepep num dispatch token per rank is smaller than cuda graph bs" by @Fridge003 in https://github.com/sgl-project/sglang/pull/14171
- Fix: fix flashmla fp8 kv cache acc error by @FlamingoPg in https://github.com/sgl-project/sglang/pull/13841
- [DeepSeekV3.2] Enable pure TP & Partial DP Attention by @YAMY1234 in https://github.com/sgl-project/sglang/pull/13646
- [model-gateway] support VL models in router by @ooapex in https://github.com/sgl-project/sglang/pull/14140
- [PD] Support json file configuration for Transfer Engine by @stmatengss in https://github.com/sgl-project/sglang/pull/14059
- Feat: GLM-4.6 supports shared experts fusion by @UranusSeven in https://github.com/sgl-project/sglang/pull/13873
- diffusion: improve z-image by @yhyang201 in https://github.com/sgl-project/sglang/pull/14104
- [piecewise] Refactor VLM to support input embed buffer and remove external embedder hack by @ByronHsu in https://github.com/sgl-project/sglang/pull/14155
- [Feature] Enable PTPC FP8 for compressed tensors moe (aiter kernel) by @qichu-yun in https://github.com/sgl-project/sglang/pull/12181
- Pull Request Instructions: RL and Training Framework Integrations by @Richardczl98 in https://github.com/sgl-project/sglang/pull/14187
- [Auto Sync] Update backend.py (20251130) by @merrymercy in https://github.com/sgl-project/sglang/pull/14153
- [piecewise] move piecewise_cuda_graph_runner init to model_runner initialize by @zminglei in https://github.com/sgl-project/sglang/pull/14034
- Tiny fix transform_scale_ue8m0 wrong output in some scenarios by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14003
- Tiny add several args to bench serving by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14181
- Super tiny allow millisecond precision in logging by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14183
- Support profiling only prefill or decode without the other by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14182
- [Piecewise] Use same global graph memory pool as the main cuda graph … by @byjiang1996 in https://github.com/sgl-project/sglang/pull/14044
- Try to remove wrong logic about max total token in spec decoding by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14167
- Fix speculative decoding error when retracting by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14180
- [diffusion] refactor: remove hard-code of instanceof on PipelineConfig by @mickqian in https://github.com/sgl-project/sglang/pull/14186
- [VLM] Boost Memory Pool based CUDA IPC by @yuan-luo in https://github.com/sgl-project/sglang/pull/14123
- Add peak output tokens per second in bench_serving by @BBuf in https://github.com/sgl-project/sglang/pull/14165
- fix: Increase FlashInfer workspace size for Qwen3VL models by @BBuf in https://github.com/sgl-project/sglang/pull/14173
- Tiny call cudaProfilerStart only on first rank in node by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14211
- [Minor] update docs by @merrymercy in https://github.com/sgl-project/sglang/pull/14212
- Add cuda event based on waiting value by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14214
- Change PR test schedule to run every 6 hours by @merrymercy in https://github.com/sgl-project/sglang/pull/14218
- Super tiny fix typo by @fzyzcjy in https://github.com/sgl-project/sglang/pull/14219
- [model-gateway] Avoid logging MCP connection token by @xuwenyihust in https://github.com/sgl-project/sglang/pull/13887
- [spec-overlap] bugfix for pd disaggregation and npu by @liupeng374 in https://github.com/sgl-project/sglang/pull/14088
- Add new moe wna16 marlin gemm by @BBuf in https://github.com/sgl-project/sglang/pull/14122
- [model-gateway] refactor oai router 1/n by @slin1237 in https://github.com/sgl-project/sglang/pull/14228
- [model-gateway] fix v1/models response format to be oai compatible by @CatherineSue in https://github.com/sgl-project/sglang/pull/13693
- [model-gateway] add ModelType bitflags and Endpoint enum for worker by @slin1237 in https://github.com/sgl-project/sglang/pull/14230
- chore: bump sgl-kernel version to 0.3.18.post2 by @sglang-bot in https://github.com/sgl-project/sglang/pull/14229
- Modify git tag for DeepGemm in sgl-kernel. by @Sulfur6 in https://github.com/sgl-project/sglang/pull/14179
- Disable Deepep 8 GPU tests by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/14152
- [model-gateway] add ModelCard and ProviderType for model configuration by @slin1237 in https://github.com/sgl-project/sglang/pull/14237
- [MM][style] rename inputs_embeds to input_embeds for consistency by @ByronHsu in https://github.com/sgl-project/sglang/pull/14240
- Revert "Skip weight loading in deepgemm compilation" by @ishandhanani in https://github.com/sgl-project/sglang/pull/14241
- [model-gateway] add ModelCard support to WorkerMetadata by @slin1237 in https://github.com/sgl-project/sglang/pull/14243
- Fix NSA Bug in Centralize NSA Dispatch Logic by @YAMY1234 in https://github.com/sgl-project/sglang/pull/14245
- [model-gateway] Migrate Worker trait to model-aware methods by @slin1237 in https://github.com/sgl-project/sglang/pull/14250
- Fix a distributed initialization error by @Edwardf0t1 in https://github.com/sgl-project/sglang/pull/13843
- [CI] Fix test_deepep_large.py by @Fridge003 in https://github.com/sgl-project/sglang/pull/14247
- Support fp4 fp8 non gated moe by @TomerBN-Nvidia in https://github.com/sgl-project/sglang/pull/13794
- [model-gateway] Add e2e tests of streaming events and tool choice for response api by @XinyueZhang369 in https://github.com/sgl-project/sglang/pull/13880
- [Auto Sync] optionally disable fake register in Update fp8_kernel.py (20251202) by @merrymercy in https://github.com/sgl-project/sglang/pull/14255
- [Auto Sync] Add max_total_num_tokens metric: Update scheduler_metrics_mixin.py, collector.py (20251202) by @merrymercy in https://github.com/sgl-project/sglang/pull/14256
- [Minor] Upgrade cutedsl version in Dockerfile by @Fridge003 in https://github.com/sgl-project/sglang/pull/13968
- [diffusion] fix: fix Flux.2 condition image resize by @mickqian in https://github.com/sgl-project/sglang/pull/14232
- [VLM] Support Piecewise CUDA Graph for Qwen3-Omni-MOE by @yuan-luo in https://github.com/sgl-project/sglang/pull/14222
- [Docs] Update CI docs by @merrymercy in https://github.com/sgl-project/sglang/pull/14260
- Revert "Try to remove wrong logic about max total token in spec decoding" by @hebiao064 in https://github.com/sgl-project/sglang/pull/14259
- [model-gateway] add audio and moderation in model card by @slin1237 in https://github.com/sgl-project/sglang/pull/14263
- Fix NIXL exception message by @kartikx in https://github.com/sgl-project/sglang/pull/14172
- [diffusion] CI: add testcase-wise retry mechanism by @mickqian in https://github.com/sgl-project/sglang/pull/14261
- Remove cargo config also in `.zshenv` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14267
- Fix mrope_positions size when req is retracted by @llfl in https://github.com/sgl-project/sglang/pull/13700
- fix: Support PP for Mistral Small 3.1 by @bluecoffee8 in https://github.com/sgl-project/sglang/pull/14254
- sync attention doc and ep doc to doctree by @b8zhong in https://github.com/sgl-project/sglang/pull/14257
- [model-gateway] include smg version command in py binding by @slin1237 in https://github.com/sgl-project/sglang/pull/14274
- Optimize topk sigmoid in minimax_m2 by @rogeryoungh in https://github.com/sgl-project/sglang/pull/14047
- fix trtllm mla spec by @b8zhong in https://github.com/sgl-project/sglang/pull/13738
- [model-gateway] fix version output by @slin1237 in https://github.com/sgl-project/sglang/pull/14276
- [VLM][Doc] Document for VLM DP Encoder by @yuan-luo in https://github.com/sgl-project/sglang/pull/14279
- chore: bump sgl-kernel version to 0.3.18.post2 by @sglang-bot in https://github.com/sgl-project/sglang/pull/14244
- [Auto Sync] Rename is_hybrid to is_hybrid_swa by @merrymercy in https://github.com/sgl-project/sglang/pull/14252
- [model-gateway] change rust package name to sgl-model-gateway instead by @slin1237 in https://github.com/sgl-project/sglang/pull/14283
- Update CODEOWNERS for multimodal_gen by @mickqian in https://github.com/sgl-project/sglang/pull/14286
- [CI] Fix 4-GPU test timeout by using 3 partitions by @alisonshao in https://github.com/sgl-project/sglang/pull/14287
- [Fix] improve model info registration and searching strategy by @liz-badada in https://github.com/sgl-project/sglang/pull/14281
- [diffusion] refactor: simplify DmdDenoisingStage by @mickqian in https://github.com/sgl-project/sglang/pull/14269
- Opt moe align block size kernel by @BBuf in https://github.com/sgl-project/sglang/pull/14133
- [sgl-kernel] fix runtime error while preloading CUDA runtime by @anvdn in https://github.com/sgl-project/sglang/pull/13089
- Fix duplicate download log messages in multi-process environment by @alisonshao in https://github.com/sgl-project/sglang/pull/14299
- Revert PR #14044: Restore separate memory pool for piecewise CUDA graph by @alisonshao in https://github.com/sgl-project/sglang/pull/14278
- Init TBO with dp_padded batch by @liquanfeng in https://github.com/sgl-project/sglang/pull/11423
- feat: DeepSeek new v3.2 encoding by @Eva20150932-atlascloud in https://github.com/sgl-project/sglang/pull/14249
- [Minor] update docs on CI by @merrymercy in https://github.com/sgl-project/sglang/pull/14315
- Add /rerun-stage slash command to rerun specific PR test stages by @alisonshao in https://github.com/sgl-project/sglang/pull/14262
- Fix nonetype error for ci failure monitor by @dougyster in https://github.com/sgl-project/sglang/pull/14319
- Adding section for scheduled PR test runs on main by @dougyster in https://github.com/sgl-project/sglang/pull/14309
- Clean up imports and move files by @merrymercy in https://github.com/sgl-project/sglang/pull/14317
- [model-gateway] add workflow for external model providers by @slin1237 in https://github.com/sgl-project/sglang/pull/14323
- ci: Add zyzshishui to CI permissions by @sunxxuns in https://github.com/sgl-project/sglang/pull/14324
- chore: bump SGLang version to 0.5.6 by @sglang-bot in https://github.com/sgl-project/sglang/pull/14316
New Contributors
- @gaopengff made their first contribution in #11051
- @mattheliu made their first contribution in #12764
- @rchalamala made their first contribution in #12717
- @leejnau made their first contribution in #12724
- @kalyank007 made their first contribution in #12761
- @BraveY made their first contribution in #12374
- @MMuzzammil1 made their first contribution in #12946
- @edwingao28 made their first contribution in #12956
- @CLFutureX made their first contribution in #12239
- @syy-hw made their first contribution in #11719
- @ShawnKung made their first contribution in #10225
- @LHXuuu made their first contribution in #12980
- @haoyangli-amd made their first contribution in #11609
- @yctseng0211 made their first contribution in #12689
- @Sunhaihua1 made their first contribution in #13063
- @FrankMinions made their first contribution in #12814
- @zhaowenzi made their first contribution in #13056
- @MayDomine made their first contribution in #13039
- @zhanghaotong made their first contribution in #12396
- @hellodanylo made their first contribution in #12860
- @Rohan138 made their first contribution in #9790
- @CharlieFRuan made their first contribution in #7906
- @khalil2ji3mp6 made their first contribution in #12214
- @SYChen123 made their first contribution in #12979
- @XinyueZhang369 made their first contribution in #13164
- @dougyster made their first contribution in #13104
- @vlserov made their first contribution in #12288
- @hustmf made their first contribution in #13154
- @ZLkanyo009 made their first contribution in #12201
- @edwardzjl made their first contribution in #13211
- @Taishi-N324 made their first contribution in #13210
- @dtcccc made their first contribution in #13142
- @billishyahao made their first contribution in #13243
- @kshitij12345 made their first contribution in #13213
- @wangyxbh made their first contribution in #12191
- @terfendail made their first contribution in #13288
- @SenmiaoORZ made their first contribution in #13260
- @zhengxle made their first contribution in #12001
- @RiversJin made their first contribution in #13383
- @lixiaolx made their first contribution in #12065
- @kebyn made their first contribution in #5879
- @Carlomus made their first contribution in #13217
- @sirutBuasai made their first contribution in #13173
- @WingEdge777 made their first contribution in #12149
- @liusy58 made their first contribution in #13126
- @Baidu-AIAK made their first contribution in #13495
- @chz34 made their first contribution in #9234
- @galeselee made their first contribution in #12379
- @shauntajoesph-ops made their first contribution in #13583
- @1am9trash made their first contribution in #13554
- @michelemarzollo made their first contribution in #13590
- @weibingo made their first contribution in #13407
- @roikoren755 made their first contribution in #12690
- @ErsongWang made their first contribution in #13727
- @yinpeiqi made their first contribution in #13709
- @liuhuijiayou made their first contribution in #13201
- @wangtiance made their first contribution in #13656
- @ant-yy made their first contribution in #13713
- @tom-jerr made their first contribution in #13612
- @cyb70289 made their first contribution in #13874
- @Liwansi made their first contribution in #12078
- @llc-kc made their first contribution in #13657
- @ptovam made their first contribution in #13851
- @janbernloehr made their first contribution in #13421
- @ShawnY112358 made their first contribution in #10071
- @ClawSeven made their first contribution in #12588
- @MichelleWu351 made their first contribution in #12491
- @yuukidach made their first contribution in #13892
- @tianhaoz95 made their first contribution in #14007
- @mrhaoxx made their first contribution in #13983
- @raayandhar made their first contribution in #14082
- @AzazKamaz made their first contribution in #14006
- @alphabetc1 made their first contribution in #13936
- @tshmilnvidia made their first contribution in #10712
- @PeaBrane made their first contribution in #13488
- @fjybiocs made their first contribution in #13377
- @niehen6174 made their first contribution in #14150
- @DongDongJu made their first contribution in #14005
- @UranusSeven made their first contribution in #13873
- @qichu-yun made their first contribution in #12181
- @Richardczl98 made their first contribution in #14187
- @liupeng374 made their first contribution in #14088
- @Sulfur6 made their first contribution in #14179
- @TomerBN-Nvidia made their first contribution in #13794
- @kartikx made their first contribution in #14172
- @llfl made their first contribution in #13700
- @bluecoffee8 made their first contribution in #14254
- @Eva20150932-atlascloud made their first contribution in #14249
Full Changelog: v0.5.5...v0.5.6