sgl-project/sglang — Release v0.5.5


What's Changed

  • [8/n] decouple quantization impl from vllm dependency - gguf srt by @FlamingoPg in #11964
  • lang: support direct video inference by @mickqian in #9936
  • Enable Llama 4 + TRTLLM MHA by @b8zhong in #12003
  • Refactor Triton-kernel MoE runner integration by @Jonahcb in #11795
  • use flashinfer_trtllm moe runner backend to gain around 10% perf on b200 fp8 dpsk by @b8zhong in #11816
  • Fix(security): block unsafe pickle deserialization to mitigate CVE-2025-10164 by @thelongestusernameofall in #11909
  • Revert "lang: support direct video inference" by @merrymercy in #12038
  • support more models in piecewise cuda graph by @narutolhy in #11745
  • [Fix] Fix lint to pass CI by @Fridge003 in #12037
  • Revert "[Fix] Fix lint to pass CI" by @Fridge003 in #12042
  • fix: fix MMMU loading issue by @ZailiWang in #11759
  • Opt MHA chunked prefix: merge prefix and extend kv cache to run mha once by @xu-yfei in #10953
  • Add gguf dependency for cpu/xpu by @ZailiWang in #12041
  • fix: the hardcode hf repo name comparison for deepseek-ocr by @rainj-me in #12031
  • Install numactl in Dockerfile for GH200/GB200/GB300 by @fzyzcjy in #11853
  • [router] Add mTLS Support for Router-to-Worker Communication by @slin1237 in #12019
  • Tiny cleanup send_single by @fzyzcjy in #12056
  • Refactoring GLM-4.5 and GLM-4.5V related implementations by @zRzRzRzRzRzRzR in #11800
  • [Fix] fix missing ipc_name of __getitem__ in some IO structs by @whybeyoung in #12053
  • fix: bench_serving ITL calculation when using spec-decoding by @JustinTong0323 in #12064
  • Fix dpsk-r1-fp4 launching crash by @Qiaolin-Yu in #12063
  • Revise POINTSV15Chat model by @yuan-luo in #12049
  • Add 'gguf' to project dependencies by @Muqi1029 in #12046
  • [Profiler] expand '~' by @Muqi1029 in #11999
  • [b200] fix piecewise cuda graph launch bug by @BBuf in #12067
  • Fix multi processing serializer bug by @fzyzcjy in #11958
  • [Fix]: HiCache hasher failed when EAGLE mode enabled by @leavelet in #12025
  • adjust dynamic vs static outputs comparison in test_lora_update.py by @glenliu21 in #11884
  • [router] implement response api get input item function and refactor input/output store by @key4ng in #11924
  • fix(compile_utils, ep_moe): update environment variable and dtype check by @ishandhanani in #12034
  • [router] fix ut router config init to use build pattern by @slin1237 in #12084
  • docs(server-arguments): add allowed options for each argument by @Jonahcb in #11560
  • [router] migrate app context to builder pattern 1/n by @slin1237 in #12086
  • [router] migrate app context to builder pattern 2/n by @slin1237 in #12089
  • [router][grpc] Remove gpt_oss parsers and remove _parser suffix in tool parser files by @CatherineSue in #12091
  • [1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU by @zminglei in #12000
  • Fix: Update blog link by @LucaLow in #12071
  • perf: trtllm_mla attention backend spec decoding speedup w/ cuda graph by @cicirori in #12093
  • [2/N]Support DeepSeek-R1 w4a8 low latency deepep by @ayrnb in #8464
  • Enhance tests in deterministic kernels by @fzyzcjy in #12070
  • [Doc] Add documentation for DeepSeek V3.2 by @Fridge003 in #11877
  • [10/N] MoE Refactor: reorganize deepgemm runner in DeepEPMoE by @ch-wan in #12054
  • Support true on-policy by @fzyzcjy in #12058
  • [Docs] update sgl-kernel readme by @FlamingoPg in #11379
  • Fix 'KeyError' for per_token expert distribution recorder by @vipwangerxiao in #9501
  • Fix kernel version bump file by @Kangyan-Zhou in #12087
  • [Fix] Set global args in cpu test by @Fridge003 in #12105
  • chore: bump sgl-kernel version to 0.3.16.post4 by @sglang-bot in #12103
  • [Auto Sync] Update test_deterministic.py, test_deterministi... (20251024) by @merrymercy in #12083
  • [router] Refactor data connector architecture with unified storage modules by @key4ng in #12096
  • fix: release workflow should work on both archs by @ishandhanani in #12110
  • [bugs] docker file name should be .Dockerfile so it can properly render by @slin1237 in #11869
  • Clean up server args & Add CI scripts by @merrymercy in #12124
  • [Misc] Improve the error message of failed import by @DarkSharpness in #12119
  • [CI] Add ci monitor balance workflow by @BBuf in #11962
  • Skip TestLlama4LoRA in CI by @lifuhuang in #12098
  • clean up github tokens by @merrymercy in #12126
  • Fix Illegal Instruction/IMA errors when using DP attention -- num_tokens_for_logprob calculation by @YAMY1234 in #12115
  • Fix token for CI monitor by @merrymercy in #12127
  • Reenable b200 tests by @Kangyan-Zhou in #11814
  • Update document index for DeepSeek-v32 docs by @Fridge003 in #12101
  • Update sgl-kernel version to 0.3.16.post4 by @Fridge003 in #12125
  • [Doc] Fix format for deepseek v3.2 document by @Fridge003 in #12130
  • Accelerate deepseek fp4 b200 ci by @Qiaolin-Yu in #11993
  • Clean up server launch code and multi tokenizer by @merrymercy in #12132
  • [Test] Add dsv3.2 nsa backend testing by @Johnsonms in #11936
  • [docs] upd docker files names everywhere by @vincentzed in #12133
  • Make bmm batch invariant injection optional by @fzyzcjy in #12118
  • [Doc] Small update of DeepSeek v3.2 document by @Fridge003 in #12138
  • docs: update README by @zhyncs in #12139
  • [router] MCP Manager - Support Connection Pooling, Tool Inventory and Proxy by @slin1237 in #12097
  • [NVIDIA] Change default quant method for model_opt by @kaixih in #11991
  • [router] update smg code owners for each component by @slin1237 in #12141
  • [router] cleaned up all the redundant comments in the config module by @CatherineSue in #12147
  • Clean up attention backend selection code & Other minor rename by @merrymercy in #12136
  • [log] Make forward iter count optional by @hnyls2002 in #12116
  • [misc] dependencies & environment flag by @hnyls2002 in #12113
  • [quantization] AWQ Marlin doesn't work when dtype is bfloat16 by @kevin85421 in #11494
  • [HiCache]Page head layout IO kernel by @huangtingwei9988 in #11615
  • Do not use MagicMock to mock server_args in tests by @hnyls2002 in #12154
  • [router][grpc] Fix tool call id in parse_json_schema_response by @CatherineSue in #12152
  • [router] centralize mcp tool args handling by @slin1237 in #12155
  • Fix ITL metrics when using openai endpoint with spec by @hnyls2002 in #12156
  • [Fix] fix allreduce bug in Piecewise Graph by @zyksir in #12106
  • Support DeepGEMM for deterministic inference by @fzyzcjy in #12142
  • model: support NVILA and NVILA Lite by @futrime in #10399
  • Avoid using flashinfer_allreduce_fusion when dp attention is enabled. by @elfiegg in #11632
  • transfer mrope_position_delta to device when first running by @ash-sigh in #11047
  • add gitignore for claude code and serena mcp by @slin1237 in #12166
  • Support MiniMax M2 model by @zhaochenyang20 in #12129
  • [misc][grpc] Remove duplicate log by @CatherineSue in #12168
  • [router][grpc] Add ResponsesContext and fix error propagation in responses api by @CatherineSue in #12164
  • [router] Remove SharedXxxStorage type aliases to make Arc explicit by @CatherineSue in #12171
  • Remove deprecated --enable-beta-spec argument and fix b200 test by @Kangyan-Zhou in #12167
  • fix broken deepep/flashmla install in container by adding --no-build-isolation by @ishandhanani in #12170
  • Remove description for --enable-beta-spec argument by @JustinTong0323 in #12177
  • chore: bump SGLang version to 0.5.4.post1 by @sglang-bot in #12169
  • [doc] add example of using w4fp8 for Deepseek by @Kevin-XiongC in #12057
  • [sgl-route] Optimize the use of constant slices and retain to simplif… by @lengrongfu in #12159
  • [Fix] Fix cu130 sgl-kernel wheel renaming by @Fridge003 in #12173
  • docs: update contact by @zhyncs in #12192
  • [sgl-kernel] feat: Support sm120 cutlass fp8 gemm kernel by @kaln27 in #9403
  • [sgl-kernel][4/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #12080
  • GLM-4-0414 and GLM-4.1V Code Refactor by @zRzRzRzRzRzRzR in #12117
  • Add support for AutoRound quantized models by @WeiweiZhang1 in #10153
  • Optimize triton_mrope with torch compile by @yuan-luo in #12112
  • Fix crash after flush cache by @cctry in #12107
  • [Detokenizer Manager] Cleanup state when reqs are finished by @Muqi1029 in #12205
  • fix(metrics): double times add_latency for DECODE_BOOTSTRAP by @jinmingyi1998 in #12209
  • improve minimax-m2 rmsnorm precision by @haichao592 in #12186
  • check_offload_progress more frequently by @pansicheng in #11656
  • [Feature] PD-Multiplexing Context and Scheduler. by @ykcombat in #11592
  • rope xpu: fix missing argument 'fused_set_kv_buffer_arg' and replace native with sgl_kernel_xpu impl by @chunyuan-w in #12006
  • Add support for Matryoshka embeddings (#126) by @satyamk7054 in #11142
  • fix: AttributeError: 'NixlKVManager' object has no attribute 'prefill_tp_size_table' by @gongwei-130 in #12234
  • Compiling rope while preserving true on policy by @fzyzcjy in #12161
  • [Auto Sync] Update scheduler.py, spec_info.py, run_suite.py... (20251027) by @zhyncs in #12235
  • Support running FP4 Deepseek on SM120. by @weireweire in #11708
  • Add env var to control custom Triton kernel cache and set CSGMV as default backend. by @lifuhuang in #12176
  • Use explicit uint64 dtype for Tensor data_ptr() to avoid overflow by @jianan-gu in #11994
  • Update openai package version to 2.6.1 by @JustinTong0323 in #12222
  • [2/2] Use moe_sum_reduce cuda kernel by @yuan-luo in #10654
  • docker: add CUDA13 support in dockerfile and update GDRCopy/NVSHMEM for blackwell support by @ishandhanani in #11517
  • [router] remove code duplication by @slin1237 in #12245
  • [DeepseekV32] Enable flashmla_prefill kernel with fp8 kvcache by @hlu1 in #11655
  • Add per-request retraction count by @scottjlee in #11177
  • Opt fused triton moe: add tma for down proj kernel by @xu-yfei in #10567
  • Support releasing CUDA graph memory when paused by @fzyzcjy in #7873
  • [router] use mcp struct from sdk and clean up code across codebase by @slin1237 in #12249
  • [router] configure workflow retries and timeout based on routerConfig by @slin1237 in #12252
  • Feature/Add GET endpoint to query loaded LoRA adapters by @ConnorLi96 in #12229
  • [hotfix] Incorrect CombineOverlapArgs in SBO by @ch-wan in #12230
  • [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 by @sufeng-buaa in #10804
  • [Bug fix] [PP] fix wrong dtype for quantized model by @XucSh in #12247
  • Fix potential eos bug on decode instance when PD is enabled by @ShangmingCai in #12206
  • Revert "[Feature] PD-Multiplexing Context and Scheduler." by @zhyncs in #12267
  • chore: cleanup quant deps by @zhyncs in #12268
  • [router] Fix type unmatch during validation by @key4ng in #12257
  • Modify rocm.Dockerfile by @sogalin in #12274
  • [router] upgrade grpc dependency and py 3.13 3.14 support by @slin1237 in #12284
  • Fix 'BypassedTopKOutput' object has no attribute 'topk_weights' for DeepEP by @trevor-m in #12231
  • Tiny fix sgl-kernel related CI installing the wrong binary by @fzyzcjy in #12283
  • doc for logit_bias by @whybeyoung in #12188
  • Use Flashinfer TRT-LLM as Llama 4 compatible MoE backend by @b8zhong in #11928
  • [rust][ci] Add end-to-end tests for Oracle history backend by @key4ng in #12233
  • [router] support arm, windows, mac, linux, reduce wheel size and number by @slin1237 in #12285
  • fix seqlen bug for trtllm_mla's draft_extend by @bmac3 in #12295
  • Update deepseek_v32.md by @hlu1 in #12296
  • Super tiny fix expert distribution dump error by @fzyzcjy in #12271
  • [router][grpc] Fix inconsistent behavior of conversation_id not found by @CatherineSue in #12299
  • fix: Llama 4 BF16 load on Blackwell by @b8zhong in #12308
  • Add continuous_usage_stats support for streaming responses by @BBuf in #12241
  • [hotfix] missing w13_weight_fp8 and w2_weight_fp8 in UE8M0 requantization by @ch-wan in #12259
  • [hotfix] Fix pytest not found in CI by @Fridge003 in #12311
  • a tiny fix to support deepseek bf16 weights by @Gao016 in #12313
  • [metrics][EPLB]: Support selected count of physical experts on each GPU by @acelyc111 in #9825
  • doc: improve modelopt error description by @lianakoleva in #12269
  • EPLB: prefer to use physical experts in the same gpu or node by @acelyc111 in #10874
  • Add Batch-Invariant RMSNorm by @zyzshishui in #12144
  • followup fix for llama 4 trtllm flashinfer backend by @b8zhong in #12314
  • [Deepseek V3.2] Enable flashmla_auto with MTP by @hlu1 in #12294
  • feat: preview filename from tuning_fused_moe_triton.py by @lianakoleva in #12276
  • [ci] Try fixing broken CIs by @Fridge003 in #12317
  • Refactor abortion in event loop by @hnyls2002 in #12312
  • [Test] Fix session control test by @hnyls2002 in #12336
  • Eagle3 DP attention for Qwen3 MoE by @qhsc in #12002
  • feat: return partial generation results when aborting requests in waiting queue by @guoyuhong in #11673
  • [Bug fix] trace: fix import error in mini_lb if sgl-router image does not install sglang by @sufeng-buaa in #12338
  • [router] fix router release workflow and add build test in PR by @CatherineSue in #12315
  • Triton fused_moe_kernel support ep moe tuning by @BBuf in #12343
  • [Fix] fix type issue of env flag value MODELOPT_MAX_TOKENS_PER_EXPERT by @zejunchen-zejun in #11709
  • [bug] fix router pypi license file by @slin1237 in #12345
  • fix: llama 4 + trtllm gen + fp8 kv cache incompatibility by @b8zhong in #12347
  • [2/2] Deepseek deterministic: support deepseek v3 deterministic inference on 8 x H200 by @zminglei in #12095
  • Fix Flashinfer Backend for SM120 Usage by @weireweire in #12325
  • [router] refactor mcp to use LRU and fix pooling bug by @CatherineSue in #12346
  • support cutlass fp4 kernel in sm120 by @AichenF in #11737
  • [bug] fix router installation to include additional dependency by @slin1237 in #12348
  • [router] update router docker to use maturin and build from local by @CatherineSue in #12350
  • Fix Duplicate Classmethod in spec_info.py by @hebiao064 in #12354
  • [CI] Add Llama 3.1 8B FP4 to B200 CI by @b8zhong in #12182
  • Fuse wk and weight_proj in Indexer for DeepSeekV3.2-FP4 by @trevor-m in #12094
  • [router] Harmony Pipeline: Chat Completion & Responses API with MCP Support by @slin1237 in #12153
  • [bugfix] fix deepseekvl2 and deepseek_ocr model type conflict by @leihuang-sketch in #12050
  • [Ckpt Engine] feat: new sglang entrypoint support for update by @stmatengss in #12216
  • [Perf] Optimize multimodal mm_inputs process in scheduler by @yuan-luo in #11910
  • [NPU] fix pp_size>1 by @Makcum888e in #12195
  • Super tiny add tag for benchmark scripts by @fzyzcjy in #12340
  • Allow benchmarking tool to handle empty response by @Kangyan-Zhou in #12174
  • Super tiny fix AMD ci by @fzyzcjy in #12378
  • Import flash_mla from sgl-kernel by @Fridge003 in #12135
  • [Bug fix][PP] fix deadlock with tie_word_embeddings by @XucSh in #12362
  • [fix] added image token as prefix for deepseek-ocr by @Tushar-ml in #12358
  • Fix DeepSeek chat templates to handle tool call arguments type checking (#11700) by @Kangyan-Zhou in #12123
  • [Feature] Initial eagle3 support for Deepseek-like models by @JensenFire in #12319
  • Enable fast silu-and-mul-and-quant fused kernel by @fzyzcjy in #11806
  • [Test] Enhance radix cache test for spec cases by @hnyls2002 in #12394
  • [NPU] bugfix for Qwen3-Next and performance update by @iforgetmyname in #11969
  • [Feature] Support DeepSeek MTP on NPU by @iforgetmyname in #11897
  • Revert "Triton fused_moe_kernel support ep moe tuning" by @BBuf in #12377
  • [sgl-kernel] upd deepgemm hash to rebased commit by @FlamingoPg in #11960
  • [router] harmony responses api streaming support by @slin1237 in #12395
  • [docker] clean up main dockerfile for router and dev configurations by @CatherineSue in #12364
  • feat: add EP support in tuning by @Chen-0210 in #12012
  • [router] use safety_identifier replace user on chat history storage by @lengrongfu in #12185
  • [CI Monitor] Fix ci_monitor perf analyzer bug by @BBuf in #12281
  • [router] Fix safety_identifier missing by @key4ng in #12404
  • [ci] Fix ci_install_deepep by @Fridge003 in #12375
  • Update news section in README.md by @merrymercy in #12409
  • [router] Function call support for openai router Responses API by @key4ng in #12386
  • minor code sync by @merrymercy in #12403
  • [Bug fix][PD Disaggregation] fix prefill hanging issue with PP and DP Attention by @popsiclexu in #12368
  • [NVIDIA] Add CI workloads for GB200 by @kaixih in #12242
  • [router] web_search_preview tool basic implementation by @key4ng in #12290
  • [router] 0.2.2 release by @slin1237 in #12399
  • enable cudaProfilerApi for one batch benchmarking by @lpc0220 in #11116
  • [Refactor] tuning_fused_moe for MLLM and small refactor by @JustinTong0323 in #11224
  • [DeepSeekV32] Bug fix to ensure page_table and result in same type by @Johnsonms in #12300
  • [CI] fix tests' time estimation by @hnyls2002 in #12401
  • Reserved abortion API when retracting by @hnyls2002 in #12425
  • Fix the shared expert & routed expert overlap in Llama 4 by @b8zhong in #12405
  • feat: Add Non-intrusive Tensor Dumping for Model Inference by @guoyuhong in #10566
  • feat: support trtllm_mha FP8 query attention kernel by @elvischenv in #12307
  • [Bugfix]: distinguish processors for deepseek_vl2 and deepseek_ocr to p… by @bppps in #12384
  • [ci] install released version router by @key4ng in #12410
  • Revert "fix llama4 kv cache layout" by @b8zhong in #12437
  • Add trait for BasePrefixCache by @hnyls2002 in #12436
  • [CI] Add more bins for 1-gpu CI test by @Fridge003 in #12422
  • [bugfix] set is_prefill_only=false when mixed_chunk by @Bruce-x-1997 in #10889
  • Clean up sgl kernel by @merrymercy in #12413
  • [CI] fix possible port conflicts. by @hnyls2002 in #12452
  • Fix ci install to allow prerelease by @merrymercy in #12449
  • fix: Add default value for backend in sample_mmmu_requests by @ZailiWang in #12256
  • Enable bailing_moe to support TP=16 by @guoyuhong in #12369
  • fix: watchdog thread exception by @Kindyaa in #12328
  • Simplify watchdog by @hnyls2002 in #12463
  • [Bug fix] Fix severe memory waste issue with torch.empty pin_memory by @sjtushenhai in #12266
  • Feat: deepseek-ocr logits processor by @JustinTong0323 in #12415
  • Fix lint in deepseek-ocr by @ispobock in #12470
  • [Test] Add Functional Tests for Penalty Parameters by @neelabhsinha in #11931
  • [Bug] OOM (Out-of-Memory) errors for extreme testing scenarios (min_tokens=2) by @LuYanFCP in #11757
  • [Feature] PD-Multiplexing Context and Scheduler, lazy import spatial. by @ykcombat in #12275
  • [VLM] Optimize async mm data process mechanism by @yuan-luo in #12066
  • fix default env var for mooncake store by @huangtingwei9988 in #12429
  • add served model name in bench serving by @carolove in #12428
  • Tiny assert no running requests when releasing memory to avoid IMA by @fzyzcjy in #12341
  • fix: dummy health check server not accessible on non-zero rank nodes by @ishandhanani in #12297
  • Fix run benchmark by @ispobock in #12473
  • Add env var to disable FA4 warmup by @cicirori in #12430
  • Try to allow NCCL cumem for multi node nvlink case by @fzyzcjy in #11987
  • Support Kimi Linear by @ispobock in #12469
  • [CI] Fix kernel installation on aarch runners by @Fridge003 in #12475
  • fa3 & trtllm_mha spec overlap by @JustinTong0323 in #11874
  • chore: bump SGLang version to 0.5.4.post2 by @sglang-bot in #12439
  • Tiny fix eos handling for PD disaggregation by @ShangmingCai in #12334
  • Forward unknown tool calls instead of dropping by @Surya-Gunukula in #12226
  • Use sgl fp4 quant kernel by default by @Qiaolin-Yu in #12482
  • [hot fix] Remove from python.sglang.xxx by @hnyls2002 in #12483
  • perf: trtllm mla performance minor improvements by @cicirori in #12435
  • Filter tokenizer warning for kimi models by @ispobock in #12485
  • [CI] Build aarch64 kernels for sgl-kernel test by @Fridge003 in #12480
  • [Hotfix] Remove extra comment in sgl-kernel README by @Fridge003 in #12500
  • [feat] Add SGLANG_TOOL_STRICT_LEVEL for tool-call behavior control by @JustinTong0323 in #12423
  • Reduce docker image size. mount cache when use pip/cargo build by @whybeyoung in #12238
  • [HICache / PD]: Support offloading incremental KV cache in decode side. by @hzh0425 in #11966
  • [Deterministic] add deepseek v3 deterministic inference CI test by @zminglei in #12412
  • [Bug] test_flashattn_mla_backend errors in Hopper #12487 by @Johnsonms in #12488
  • Update Mooncake EP's a2a interface by @UNIDY2002 in #12391
  • [CI][NPU] remove pypi mirror site that hangs ci dependency installation by @iforgetmyname in #12499
  • [Ascend] Add Ascend NPU support for sglang.check_env & rework proposal by @Alexhaoge in #11052
  • [Feature] Qwen3-Next & FLA: Support MTP topk>1; Up to 6% faster by @byjiang1996 in #11133
  • [CI] Move some Lora/Deterministic CI tests to nightly by @Fridge003 in #12507
  • Migrate weak_ref_tensor to sgl-kernel by @BBuf in #12505
  • feat: Add FP4 (E2M1) KV Cache Support with Quantization Utilities for MLA by @JackChuang in #10078
  • chore: bump sgl-kernel version to 0.3.16.post5 by @sglang-bot in #12511
  • [FEAT] Shared mem pool based cuda ipc for multi-modal data transport by @kousakawang in #11917
  • Add prefix for torch symm mem by @yuan-luo in #12506
  • [ServerArgs] allow --mamba-ssm-dtype extend by @hanming-lu in #12481
  • [Fix] concat_mla_absorb_q_kernel fails for long inputs by @bingps in #12453
  • Super tiny fix naming in bench serving scripts by @fzyzcjy in #12515
  • move all get_stream in sgl_kernel to c++ to reduce the launch overhead by @merrymercy in #12521
  • [Refact] Remove hardcoded KV cache dimension in MLATokenToKVPool by @Johnsonms in #12502
  • [Bug] Fix Intern-S1 model accuracy and support /generate interface with input_ids by @hhaAndroid in #12367
  • chore: upgrade flashinfer 0.5.0 by @zhyncs in #12523
  • [hotfix] Remove flashinfer-jit-cache from pyproject by @Fridge003 in #12530
  • fix: move dummy format loader check before quantization checks by @cicirori in #12532
  • chore: upgrade mooncake 0.3.7.post1 by @ShangmingCai in #12541
  • fix: Fix KTransformers hybrid inference with int8 quantization and format by @Atream in #12536
  • Conditionally recapture cuda graph after model weight update from disk by @harrisonlimh in #12060
  • [spec v2] Fix output repetition by speculative sampling error by @hnyls2002 in #12561
  • [hot-fix] Fix broken CI by @hnyls2002 in #12564
  • fix: fix the bug which leads qwen2_5_vl to crash with mixed_chunk by @PanJason in #11330
  • Fix error when calling quantization by @fzyzcjy in #12548
  • [Test] Add parameters to SRTRunner by @Jonahcb in #12227
  • [ROCm] Update Mooncake to v0.3.7.post1 and add -DUSE_HIP=ON to rocm.Dockerfile by @yeahdongcn in #12560
  • Reduce the overhead of nccl symmetric memory by @merrymercy in #12524
  • tiny optimize for bench serving by @yizhang2077 in #12553
  • Super tiny allow profile activities in bench_serving by @fzyzcjy in #12549
  • Super tiny dump server info such as args in bench for post analysis by @fzyzcjy in #12550
  • update usage of trtllm_fp8_per_tensor_scale_moe by @b8zhong in #12569
  • [router][grpc] Consolidate error messages build in error.rs by @CatherineSue in #12301
  • Remove the dependency of nccl.h in symmetric memory by @merrymercy in #12571
  • [chore] Fix update_kernel_whl_index script for multiple cuda version by @Fridge003 in #12519
  • Enable mixed type LayerNorm kernel for NSA indexer by @akhilg-nv in #12044
  • Super tiny add UT for copy_to_gpu_no_ce by @fzyzcjy in #12270
  • [Doc] fix miss index for production request trace by @stmatengss in #12547
  • [GDN/SWA] mamba and swa radix cache edge case fix by @hanming-lu in #12111
  • [Qwen3 VL] Add LoRA support for Qwen 3 VL by @Jonahcb in #12165
  • test: support return logprobs in bench_offline_throughput test by @aftersnow in #12462
  • Tiny fix ExpertDistributionReq error by @fzyzcjy in #11760
  • fix: respect --ignore-eos in PD case for benchmarking by @ishandhanani in #12597
  • Improve the metrics for PD by @merrymercy in #12580
  • Enable memory saver for hybrid model by @ocss884 in #11974
  • Restore torch defaults between sgl-kernel tests by @benbarsdell in #11131
  • feat: limit peak memory usage when computing logprobs by @aftersnow in #6318
  • [router][grpc] Restructure modules and code clean up by @CatherineSue in #12598
  • Add --speculative-moe-runner-backend server arg by @trevor-m in #10183
  • [Deterministic] Optimize bmm_batch_invariant op by @zminglei in #12522
  • chore: bump mooncake version to 0.3.7.post2 by @ShangmingCai in #12599
  • [spec-v2] Fix incompatibility with constrained decoding by @hnyls2002 in #12615
  • Support aggregating engine metrics in sgl-router by @fzyzcjy in #11456
  • Ensure GPU work is finished when release memory occupation call is finished by @fzyzcjy in #12592
  • Add sanity checks when a test file is not added to CI (reland) by @fzyzcjy in #12594
  • [router][grpc] Fix model validation, tool call check, streaming logic and misc in responses by @CatherineSue in #12616
  • [HotFix] Disable torch dynamo for mrope_triton kernel by @yuan-luo in #12593
  • Fix skip layer in get_quant_method by @ispobock in #12632
  • [Test] Merge all constrained decoding tests. by @hnyls2002 in #12633
  • Add io struct naming check back by @hnyls2002 in #12634
  • Fix output_ids inconsistency by @hnyls2002 in #12628
  • fix: Lazy import mooncake-ep to fix extra gpu contexts being created by @trevor-m in #12641
  • [hotfix] Fix deepep w4a8 bug by @Fridge003 in #12642
  • [Auto Sync] Update scheduler_metrics_mixin.py, collector.py (20251104) by @merrymercy in #12647
  • [Bug] Fix NSA Backend KV-Buffer Shape Mismatch in DeepSeek-V3.2 by @Johnsonms in #12645
  • [NVIDIA] Fix wrong symmetric sizes for fp4 cases by @kaixih in #12640
  • [router][grpc] Fix index issues in reasoning content and missing streaming events by @CatherineSue in #12650
  • Revert "Enable memory saver for hybrid model" by @Fridge003 in #12648
  • Add multi-GPU configurations to nightly-test.yml by @alisonshao in #12585
  • [fix] Handle escaped characters in GLM tool call parser to prevent double serialization by @soaringk in #12456
  • [router][grpc] Emit OutputItemDone event and store output item array by @CatherineSue in #12656
  • Register allgather/reducescatter buffers with symm memory by @nvcastet in #12572
  • chore: bump SGLang version to 0.5.4.post3 by @sglang-bot in #12639
  • [NVIDIA] Fix cutedsl backend of MoE by @kaixih in #12353
  • [PD-Disagg] Check finish after pop transferred by @hnyls2002 in #12638
  • fix typo of args description in sglang.profiler by @ai-easy-cpu in #12486
  • [Dockerfile] Speed up docker image building by @acelyc111 in #8784
  • Fix VLLM dependency test by @Kangyan-Zhou in #12670
  • [Feature] add --lora-request-distribution arg to bench_serving.py and support skewed and distinct workloads by @glenliu21 in #12175
  • [router][grpc] Implement tool_choice support for Responses API by @CatherineSue in #12668
  • Expand and update test coverage for AMD CI by @hubertlu-tw in #10044
  • fix: add seed bench_serving to cache key, remove redundant function definition. by @cicirori in #12680
  • [Profiler] Add SGLANG_PROFILE_RECORD_SHAPES for recording shapes when profiling by @zejunchen-zejun in #11641
  • fix trtllm_mla attention backend when disabling cuda graph. by @cicirori in #12687
  • Refactor --debug-tensor-dump-layers to list by @guoyuhong in #12691
  • [Grammar Fix] GLM-4-MOE self.first_k_dense_replace is undefined. by @zRzRzRzRzRzRzR in #12455
  • add Kimi k2 reasoning parser by @MoyanZitto in #12702
  • Commented out b200 tests due to runner shortage by @Kangyan-Zhou in #12609
  • [CI] Fix qwen3-vl lora nightly ci by @Fridge003 in #12708
  • Fix server args for gpt oss so users can override the moe runner backend by @merrymercy in #12696
  • [router][grpc] Support streaming parsing with Tool Choice in chat completions API by @CatherineSue in #12677
  • feat: initial multimodal-gen support by @mickqian in #12484
  • Enable Aiter Attention for VL model by @Yuechguo in #12699
  • [router] fix: validate HTTP status codes in health check by @wyx-0203 in #12631
  • Support Expert Deferral Mechanism in KTransformers by @Atream in #12586
  • Add mm_fp4 trtllm backend by @wenscarl in #12406
  • [NVIDIA] Fix unit test of MoE and add it to nightly ci by @kaixih in #12709
  • [misc] Add labeler for automatic labeling by @CatherineSue in #12710
  • [router][ci] speed up python binding to 1.5 min by @key4ng in #12673
  • Fix CI and style by @merrymercy in #12658
  • Revert "Commented out b200 tests due to runner shortage (#12609)" by @Kangyan-Zhou in #12712
  • [misc] Change sync-labels to false by @CatherineSue in #12714
  • [router][grpc] Make harmony parser checks recipient first before channel by @CatherineSue in #12713
  • [router][quick fix] Add minimal option for reasoning effort in spec by @key4ng in #12711
  • [router] add basic ci tests for gpt-oss model support by @key4ng in #12651
  • fix labeler by @key4ng in #12718
  • [ci] fix permission by @key4ng in #12729
  • [chore]Remove dockerfile from target file of bump kernel version by @Fridge003 in #12728
  • [CPU] Upgrade default PT version to 2.9 by @ZailiWang in #12611
  • Revert "[ci] fix permission" by @key4ng in #12732
  • Revert "[router] web_search_preview tool basic implementation" by @key4ng in #12716
  • fix sgl-kernel version by @gongwei-130 in #12723
  • [chore] SGLang tag management in Dockerfile by @Fridge003 in #12734
  • Add nightly test multi gpu configs by @alisonshao in #12721
  • DeepSeek-V3.2: Add Adaptive MHA Attention Pathway for Short-Sequence Prefill by @YAMY1234 in #11892
  • Temporarily fix missing routed_scaling_factor for CompressedTensorsWNA16MoEMethod by @Atream in #12738
  • [chore] Fix triton installation for cu13 image by @Fridge003 in #12742
  • keep attention backend document up to date by @b8zhong in #12741
  • [Fix]Tiny fix in Dockerfile by @Fridge003 in #12748
  • [router][grpc] Support mixin tool calls in Responses API by @CatherineSue in #12736
  • fix: tiny fix cli by @mickqian in #12744
  • [router][ci] Disable cache by @key4ng in #12752
  • fix mamba prefix cache leak caused by abort by @yizhang2077 in #12693
  • [BUGFIX] fix output_ids in abort by @yizhang2077 in #12737
  • [GDN] Fuse b.sigmoid(), fused_gdn_gating and unsqueeze into one kernel: up to 0.85% e2e speedup by @byjiang1996 in #12508
  • [VLM] Optimize qwen_vl preprocess_video by @yuan-luo in #12240
  • Add timing metrics for requests by @cicirori in #12646
  • fix qwen3-omni audio length < 30s by @jiapingW in #12674
  • docs: document video-capable multimodal models by @WazupSteve in #12565
  • fix ci by @key4ng in #12760
  • [Refactor] Refactor fused_moe_triton tuning tools: extract shared utils, add EP/MLLM support, reduce overhead by @BBuf in #12440
  • Update dsv3 quantization auto setting for sm100 by @ispobock in #12778
  • chore: bump SGLang version to 0.5.5 by @sglang-bot in #12739
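A recurring theme in this release is deterministic inference (deterministic DeepSeek inference, batch-invariant RMSNorm and bmm ops, DeepGEMM determinism). As a minimal illustration of why kernels must pin their reduction order — this is generic background, not SGLang's actual implementation — floating-point addition is not associative, so a kernel that reduces in a different order depending on batch composition can return slightly different results for the same request:

```python
# Illustration only: floating-point addition is not associative.
# A batch-invariant kernel must fix its reduction order so the same
# request produces bit-identical results regardless of batching.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

Tiny per-element differences like this compound across layers and can flip sampled tokens, which is why determinism requires order-invariant kernels rather than just fixed seeds.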

New Contributors

  • @thelongestusernameofall made their first contribution in #11909
  • @LucaLow made their first contribution in #12071
  • @vipwangerxiao made their first contribution in #9501
  • @Johnsonms made their first contribution in #11936
  • @ash-sigh made their first contribution in #11047
  • @Kevin-XiongC made their first contribution in #12057
  • @kaln27 made their first contribution in #9403
  • @haichao592 made their first contribution in #12186
  • @satyamk7054 made their first contribution in #11142
  • @weireweire made their first contribution in #11708
  • @bmac3 made their first contribution in #12295
  • @Gao016 made their first contribution in #12313
  • @lianakoleva made their first contribution in #12269
  • @zyzshishui made their first contribution in #12144
  • @zejunchen-zejun made their first contribution in #11709
  • @AichenF made their first contribution in #11737
  • @JensenFire made their first contribution in #12319
  • @Chen-0210 made their first contribution in #12012
  • @popsiclexu made their first contribution in #12368
  • @lpc0220 made their first contribution in #11116
  • @elvischenv made their first contribution in #12307
  • @sjtushenhai made their first contribution in #12266
  • @LuYanFCP made their first contribution in #11757
  • @carolove made their first contribution in #12428
  • @Surya-Gunukula made their first contribution in #12226
  • @Alexhaoge made their first contribution in #11052
  • @JackChuang made their first contribution in #10078
  • @bingps made their first contribution in #12453
  • @hhaAndroid made their first contribution in #12367
  • @yeahdongcn made their first contribution in #12560
  • @akhilg-nv made their first contribution in #12044
  • @alisonshao made their first contribution in #12585
  • @soaringk made their first contribution in #12456
  • @ai-easy-cpu made their first contribution in #12486
  • @MoyanZitto made their first contribution in #12702
  • @wyx-0203 made their first contribution in #12631
  • @WazupSteve made their first contribution in #12565

Full Changelog: v0.5.4...v0.5.5
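One highlight above is the security fix blocking unsafe pickle deserialization (CVE-2025-10164, #11909). As a generic sketch of the standard mitigation pattern from the Python documentation — not the actual SGLang patch — a restricted unpickler overrides `find_class` so that payloads referencing importable callables (e.g. `os.system`) are rejected before anything executes:

```python
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to resolve any global reference, so a crafted payload
    that tries to import a callable such as os.system fails with an
    UnpicklingError instead of running attacker-controlled code."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is forbidden"
        )

def restricted_loads(data: bytes):
    """Deserialize plain-data pickles while rejecting all globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain containers and scalars carry no global references, so they
# still round-trip normally...
assert restricted_loads(pickle.dumps([1, 2, 3])) == [1, 2, 3]

# ...but a hand-crafted payload invoking os.system is blocked.
malicious = b"cos\nsystem\n(S'echo pwned'\ntR."
try:
    restricted_loads(malicious)
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

Real deployments typically allow-list a small set of known-safe classes in `find_class` rather than rejecting everything; the fix in this release takes the stricter route of blocking unsafe deserialization paths outright.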
