## Highlights
- Day 0 support for Kimi-K2-Thinking https://huggingface.co/moonshotai/Kimi-K2-Thinking
- Day 0 support for Minimax-M2 https://huggingface.co/MiniMaxAI/MiniMax-M2
- Video and image generation support https://lmsys.org/blog/2025-11-07-sglang-diffusion/
- Q4 Roadmap: #12780
- Blackwell kernel optimizations and MoE runner backend refactor
- Overlapped speculative decoding and prefill CUDA graph support extended to more models
## What's Changed
- [8/n] decouple quantization impl from vllm dependency - gguf srt by @FlamingoPg in #11964
- lang: support direct video inference by @mickqian in #9936
- Enable Llama 4 + TRTLLM MHA by @b8zhong in #12003
- Refactor Triton-kernel MoE runner integration by @Jonahcb in #11795
- use flashinfer_trtllm moe runner backend to gain around 10% perf on b200 fp8 dpsk by @b8zhong in #11816
- Fix(security): block unsafe pickle deserialization to mitigate CVE-2025-10164 by @thelongestusernameofall in #11909
- Revert "lang: support direct video inference" by @merrymercy in #12038
- support more models in piecewise cuda graph by @narutolhy in #11745
- [Fix] Fix lint to pass CI by @Fridge003 in #12037
- Revert "[Fix] Fix lint to pass CI" by @Fridge003 in #12042
- fix: fix MMMU loading issue by @ZailiWang in #11759
- Opt MHA chunked prefix: merge prefix and extend kv cache to run mha once by @xu-yfei in #10953
- Add gguf dependency for cpu/xpu by @ZailiWang in #12041
- fix: the hardcode hf repo name comparison for deepseek-ocr by @rainj-me in #12031
- Install numactl in Dockerfile for GH200/GB200/GB300 by @fzyzcjy in #11853
- [router] Add mTLS Support for Router-to-Worker Communication by @slin1237 in #12019
- Tiny cleanup send_single by @fzyzcjy in #12056
- Refactoring GLM-4.5 and GLM-4.5V related implementations by @zRzRzRzRzRzRzR in #11800
- [Fix] fix missing `ipc_name` of `__getitem__` in some IO structs by @whybeyoung in #12053
- fix: bench_serving ITL calculation when using spec-decoding by @JustinTong0323 in #12064
- Fix dpsk-r1-fp4 launching crash by @Qiaolin-Yu in #12063
- Revise POINTSV15Chat model by @yuan-luo in #12049
- Add 'gguf' to project dependencies by @Muqi1029 in #12046
- [Profiler] expand '~' by @Muqi1029 in #11999
- [b200] fix piecewise cuda graph launch bug by @BBuf in #12067
- Fix multi processing serializer bug by @fzyzcjy in #11958
- [Fix]: HiCache hasher failed when EAGLE mode enabled by @leavelet in #12025
- adjust dynamic vs static outputs comparison in test_lora_update.py by @glenliu21 in #11884
- [router] implement response api get input item function and refactor input/output store by @key4ng in #11924
- fix(compile_utils, ep_moe): update environment variable and dtype check by @ishandhanani in #12034
- [router] fix ut router config init to use build pattern by @slin1237 in #12084
- docs(server-arguments): add allowed options for each argument by @Jonahcb in #11560
- [router] migrate app context to builder pattern 1/n by @slin1237 in #12086
- [router] migrate app context to builder pattern 2/n by @slin1237 in #12089
- [router][grpc] Remove gpt_oss parsers and remove _parser suffix in tool parser files by @CatherineSue in #12091
- [1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU by @zminglei in #12000
- Fix: Update blog link by @LucaLow in #12071
- perf: trtllm_mla attention backend spec decoding speedup w/ cuda graph by @cicirori in #12093
- [2/N]Support DeepSeek-R1 w4a8 low latency deepep by @ayrnb in #8464
- Enhance tests in deterministic kernels by @fzyzcjy in #12070
- [Doc] Add documentation for DeepSeek V3.2 by @Fridge003 in #11877
- [10/N] MoE Refactor: reorganize deepgemm runner in DeepEPMoE by @ch-wan in #12054
- Support true on-policy by @fzyzcjy in #12058
- [Docs] update sgl-kernel readme by @FlamingoPg in #11379
- Fix 'KeyError' for per_token expert distribution recorder by @vipwangerxiao in #9501
- Fix kernel version bump file by @Kangyan-Zhou in #12087
- [Fix] Set global args in cpu test by @Fridge003 in #12105
- chore: bump sgl-kernel version to 0.3.16.post4 by @sglang-bot in #12103
- [Auto Sync] Update test_deterministic.py, test_deterministi... (20251024) by @merrymercy in #12083
- [router] Refactor data connector architecture with unified storage modules by @key4ng in #12096
- fix: release workflow should work on both archs by @ishandhanani in #12110
- [bugs] docker file name should be .Dockerfile so it can properly render by @slin1237 in #11869
- Clean up server args & Add CI scripts by @merrymercy in #12124
- [Misc] Improve the error message of failed import by @DarkSharpness in #12119
- [CI] Add ci monitor balance workflow by @BBuf in #11962
- Skip TestLlama4LoRA in CI by @lifuhuang in #12098
- clean up github tokens by @merrymercy in #12126
- Fix Illegal Instruction/IMA errors when using DP attention -- num_tokens_for_logprob calculation by @YAMY1234 in #12115
- Fix token for CI monitor by @merrymercy in #12127
- Reenable b200 tests by @Kangyan-Zhou in #11814
- Update document index for DeepSeek-v32 docs by @Fridge003 in #12101
- Update sgl-kernel version to 0.3.16.post4 by @Fridge003 in #12125
- [Doc] Fix format for deepseek v3.2 document by @Fridge003 in #12130
- Accelerate deepseek fp4 b200 ci by @Qiaolin-Yu in #11993
- Clean up server launch code and multi tokenizer by @merrymercy in #12132
- [Test] Add dsv3.2 nsa backend testing by @Johnsonms in #11936
- [docs] upd docker files names everywhere by @vincentzed in #12133
- Make bmm batch invariant injection optional by @fzyzcjy in #12118
- [Doc] Small update of DeepSeek v3.2 document by @Fridge003 in #12138
- docs: update README by @zhyncs in #12139
- [router] MCP Manager - Support Connection Pooling, Tool Inventory and Proxy by @slin1237 in #12097
- [NVIDIA] Change default quant method for model_opt by @kaixih in #11991
- [router] update smg code owners for each component by @slin1237 in #12141
- [router] cleaned up all the redundant comments in the config module by @CatherineSue in #12147
- Clean up attention backend selection code & Other minor rename by @merrymercy in #12136
- [log] Make forward iter count optional by @hnyls2002 in #12116
- [misc] dependencies & environment flag by @hnyls2002 in #12113
- [quantization] AWQ Marlin doesn't work when dtype is bfloat16 by @kevin85421 in #11494
- [HiCache]Page head layout IO kernel by @huangtingwei9988 in #11615
- Do not use `MagicMock` to mock `server_args` in tests by @hnyls2002 in #12154
- [router][grpc] Fix tool call id in `parse_json_schema_response` by @CatherineSue in #12152
- [router] centralize mcp tool args handling by @slin1237 in #12155
- Fix ITL metrics when using openai endpoint with spec by @hnyls2002 in #12156
- [Fix] fix allreduce bug in Piecewise Graph by @zyksir in #12106
- Support DeepGEMM for deterministic inference by @fzyzcjy in #12142
- model: support NVILA and NVILA Lite by @futrime in #10399
- Avoid using flashinfer_allreduce_fusion when dp attention is enabled. by @elfiegg in #11632
- transfer mrope_position_delta to device when first running by @ash-sigh in #11047
- add gitignore for claude code and serena mcp by @slin1237 in #12166
- Support MiniMax M2 model by @zhaochenyang20 in #12129
- [misc][grpc] Remove duplicate log by @CatherineSue in #12168
- [router][grpc] Add `ResponsesContext` and fix error propagation in responses api by @CatherineSue in #12164
- [router] Remove SharedXxxStorage type aliases to make Arc explicit by @CatherineSue in #12171
- Remove deprecated --enable-beta-spec argument and fix b200 test by @Kangyan-Zhou in #12167
- fix broken deepep/flashmla install in container by adding `--no-build-isolation` by @ishandhanani in #12170
- Remove description for `--enable-beta-spec` argument by @JustinTong0323 in #12177
- chore: bump SGLang version to 0.5.4.post1 by @sglang-bot in #12169
- [doc] add example of using w4fp8 for Deepseek by @Kevin-XiongC in #12057
- [sgl-route] Optimize the use of constant slices and retain to simplif… by @lengrongfu in #12159
- [Fix] Fix cu130 sgl-kernel wheel renaming by @Fridge003 in #12173
- docs: update contact by @zhyncs in #12192
- [sgl-kernel] feat: Support sm120 cutlass fp8 gemm kernel by @kaln27 in #9403
- [sgl-kernel][4/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #12080
- GLM-4-0414 and GLM-4.1V Code Refactor by @zRzRzRzRzRzRzR in #12117
- Add support for AutoRound quantized models by @WeiweiZhang1 in #10153
- Optimize triton_mrope with torch compile by @yuan-luo in #12112
- Fix crash after flush cache by @cctry in #12107
- [Detokenizer Manager] Cleanup state when reqs are finished by @Muqi1029 in #12205
- fix(metrics): double times add_latency for DECODE_BOOTSTRAP by @jinmingyi1998 in #12209
- improve minimax-m2 rmsnorm precision by @haichao592 in #12186
- check_offload_progress more frequently by @pansicheng in #11656
- [Feature] PD-Multiplexing Context and Scheduler. by @ykcombat in #11592
- rope xpu: fix missing argument 'fused_set_kv_buffer_arg' and replace native with sgl_kernel_xpu impl by @chunyuan-w in #12006
- Add support for Matryoshka embeddings (#126) by @satyamk7054 in #11142
- fix: AttributeError: 'NixlKVManager' object has no attribute 'prefill_tp_size_table' by @gongwei-130 in #12234
- Compiling rope while preserving true on policy by @fzyzcjy in #12161
- [Auto Sync] Update scheduler.py, spec_info.py, run_suite.py... (20251027) by @zhyncs in #12235
- Support running FP4 Deepseek on SM120. by @weireweire in #11708
- Add env var to control custom Triton kernel cache and set CSGMV as default backend. by @lifuhuang in #12176
- Use explicit uint64 dtype for Tensor data_ptr() to avoid overflow by @jianan-gu in #11994
- Update openai package version to 2.6.1 by @JustinTong0323 in #12222
- [2/2] Use moe_sum_reduce cuda kernel by @yuan-luo in #10654
- docker: add CUDA13 support in dockerfile and update GDRCopy/NVSHMEM for blackwell support by @ishandhanani in #11517
- [router] remove code duplication by @slin1237 in #12245
- [DeepseekV32] Enable flashmla_prefill kernel with fp8 kvcache by @hlu1 in #11655
- Add per-request retraction count by @scottjlee in #11177
- Opt fused triton moe: add tma for down proj kernel by @xu-yfei in #10567
- Support releasing CUDA graph memory when paused by @fzyzcjy in #7873
- [router] use mcp struct from sdk and clean up code across codebase by @slin1237 in #12249
- [router] configure workflow retries and timeout based on routerConfig by @slin1237 in #12252
- Feature/Add GET endpoint to query loaded LoRA adapters by @ConnorLi96 in #12229
- [hotfix] Incorrect CombineOverlapArgs in SBO by @ch-wan in #12230
- [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 by @sufeng-buaa in #10804
- [Bug fix] [PP] fix wrong dtype for quantized model by @XucSh in #12247
- Fix potential eos bug on decode instance when PD is enabled by @ShangmingCai in #12206
- Revert "[Feature] PD-Multiplexing Context and Scheduler." by @zhyncs in #12267
- chore: cleanup quant deps by @zhyncs in #12268
- [router] Fix type unmatch during validation by @key4ng in #12257
- Modify rocm.Dockerfile by @sogalin in #12274
- [router] upgrade grpc dependency and py 3.13 3.14 support by @slin1237 in #12284
- Fix 'BypassedTopKOutput' object has no attribute 'topk_weights' for DeepEP by @trevor-m in #12231
- Tiny fix sgl-kernel related CI installing the wrong binary by @fzyzcjy in #12283
- doc for logit_bias by @whybeyoung in #12188
- Use Flashinfer TRT-LLM as Llama 4 compatible MoE backend by @b8zhong in #11928
- [rust][ci] Add end-to-end tests for Oracle history backend by @key4ng in #12233
- [router] support arm, windows, mac, linux, reduce wheel size and number by @slin1237 in #12285
- fix seqlen bug for trtllm_mla's draft_extend by @bmac3 in #12295
- Update deepseek_v32.md by @hlu1 in #12296
- Super tiny fix expert distribution dump error by @fzyzcjy in #12271
- [router][grpc] Fix inconsistent behavior of conversation_id not found by @CatherineSue in #12299
- fix: Llama 4 BF16 load on Blackwell by @b8zhong in #12308
- Add continuous_usage_stats support for streaming responses by @BBuf in #12241
- [hotfix] missing `w13_weight_fp8` and `w2_weight_fp8` in UE8M0 requantization by @ch-wan in #12259
- [hotfix] Fix pytest not found in CI by @Fridge003 in #12311
- a tiny fix to support deepseek bf16 weights by @Gao016 in #12313
- [metrics][EPLB]: Support selected count of physical experts on each GPU by @acelyc111 in #9825
- doc: improve modelopt error description by @lianakoleva in #12269
- EPLB: prefer to use physical experts in the same gpu or node by @acelyc111 in #10874
- Add Batch-Invariant RMSNorm by @zyzshishui in #12144
- followup fix for llama 4 trtllm flashinfer backend by @b8zhong in #12314
- [Deepseek V3.2] Enable flashmla_auto with MTP by @hlu1 in #12294
- feat: preview filename from tuning_fused_moe_triton.py by @lianakoleva in #12276
- [ci] Try fixing broken CIs by @Fridge003 in #12317
- Refactor abortion in event loop by @hnyls2002 in #12312
- [Test] Fix session control test by @hnyls2002 in #12336
- Eagle3 DP attention for Qwen3 MoE by @qhsc in #12002
- feat: return partial generation results when aborting requests in waiting queue by @guoyuhong in #11673
- [Bug fix] trace: fix import error in mini_lb if sgl-router image does not install sglang by @sufeng-buaa in #12338
- [router] fix router release workflow and add build test in PR by @CatherineSue in #12315
- Triton fused_moe_kernel support ep moe tuning by @BBuf in #12343
- [Fix] fix type issue of env flag value MODELOPT_MAX_TOKENS_PER_EXPERT by @zejunchen-zejun in #11709
- [bug] fix router pypi license file by @slin1237 in #12345
- fix: llama 4 + trtllm gen + fp8 kv cache incompatibility by @b8zhong in #12347
- [2/2] Deepseek deterministic: support deepseek v3 deterministic inference on 8 x H200 by @zminglei in #12095
- Fix Flashinfer Backend for SM120 Usage by @weireweire in #12325
- [router] refactor mcp to use LRU and fix pooling bug by @CatherineSue in #12346
- support cutlass fp4 kernel in sm120 by @AichenF in #11737
- [bug] fix router installation to include additional dependency by @slin1237 in #12348
- [router] update router docker to use maturin and build from local by @CatherineSue in #12350
- Fix Duplicate Classmethod in spec_info.py by @hebiao064 in #12354
- [CI] Add Llama 3.1 8B FP4 to B200 CI by @b8zhong in #12182
- Fuse wk and weight_proj in Indexer for DeepSeekV3.2-FP4 by @trevor-m in #12094
- [router] Harmony Pipeline: Chat Completion & Responses API with MCP Support by @slin1237 in #12153
- [bugfix] fix deepseekvl2 and deepseek_ocr model type conflict by @leihuang-sketch in #12050
- [Ckpt Engine] feat: new sglang entrypoint support for update by @stmatengss in #12216
- [Perf] Optimize multimodal mm_inputs process in scheduler by @yuan-luo in #11910
- [NPU] fix pp_size>1 by @Makcum888e in #12195
- Super tiny add tag for benchmark scripts by @fzyzcjy in #12340
- Allow benchmarking tool to handle empty response by @Kangyan-Zhou in #12174
- Super tiny fix AMD ci by @fzyzcjy in #12378
- Import flash_mla from sgl-kernel by @Fridge003 in #12135
- [Bug fix][PP] fix deadlock with tie_word_embeddings by @XucSh in #12362
- [fix] added image token as prefix for deepseek-ocr by @Tushar-ml in #12358
- Fix DeepSeek chat templates to handle tool call arguments type checking (#11700) by @Kangyan-Zhou in #12123
- [Feature] Initial eagle3 support for Deepseek-like models by @JensenFire in #12319
- Enable fast silu-and-mul-and-quant fused kernel by @fzyzcjy in #11806
- [Test] Enhance radix cache test for spec cases by @hnyls2002 in #12394
- [NPU] bugfix for Qwen3-Next and performance update by @iforgetmyname in #11969
- [Feature] Support DeepSeek MTP on NPU by @iforgetmyname in #11897
- Revert "Triton fused_moe_kernel support ep moe tuning" by @BBuf in #12377
- [sgl-kernel] upd deepgemm hash to rebased commit by @FlamingoPg in #11960
- [router] harmony responses api streaming support by @slin1237 in #12395
- [docker] clean up main dockerfile for router and dev configurations by @CatherineSue in #12364
- feat: add EP support in tuning by @Chen-0210 in #12012
- [router] use safety_identifier replace user on chat history storage by @lengrongfu in #12185
- [CI Monitor] Fix ci_monitor perf analyzer bug by @BBuf in #12281
- [router] Fix safety_identifier missing by @key4ng in #12404
- [ci] Fix ci_install_deepep by @Fridge003 in #12375
- Update news section in README.md by @merrymercy in #12409
- [router] Function call support for openai router Responses API by @key4ng in #12386
- minor code sync by @merrymercy in #12403
- [Bug fix][PD Disaggregation] fix prefill hanging issue with PP and DP Attention by @popsiclexu in #12368
- [NVIDIA] Add CI workloads for GB200 by @kaixih in #12242
- [router] web_search_preview tool basic implementation by @key4ng in #12290
- [router] 0.2.2 release by @slin1237 in #12399
- enable cudaProfilerApi for one batch benchmarking by @lpc0220 in #11116
- [Refactor] tuning_fused_moe for MLLM and small refactor by @JustinTong0323 in #11224
- [DeepSeekV32] Bug fix to ensure `page_table` and `result` in same type by @Johnsonms in #12300
- [CI] fix tests' time estimation by @hnyls2002 in #12401
- Reserved abortion API when retracting by @hnyls2002 in #12425
- Fix the shared expert & routed expert overlap in Llama 4 by @b8zhong in #12405
- feat: Add Non-intrusive Tensor Dumping for Model Inference by @guoyuhong in #10566
- feat: support trtllm_mha FP8 query attention kernel by @elvischenv in #12307
- [Bugfix]: distinguish processors for deepseek_vl2 and deepseek_ocr to p… by @bppps in #12384
- [ci] install released version router by @key4ng in #12410
- Revert "fix llama4 kv cache layout" by @b8zhong in #12437
- Add trait for `BasePrefixCache` by @hnyls2002 in #12436
- [CI] Add more bins for 1-gpu CI test by @Fridge003 in #12422
- [bugfix] set is_prefill_only=false when mixed_chunk by @Bruce-x-1997 in #10889
- Clean up sgl kernel by @merrymercy in #12413
- [CI] fix possible port conflicts. by @hnyls2002 in #12452
- Fix ci install to allow prerelease by @merrymercy in #12449
- fix: Add default value for backend in sample_mmmu_requests by @ZailiWang in #12256
- Enable bailing_moe to support TP=16 by @guoyuhong in #12369
- fix: watchdog thread exception by @Kindyaa in #12328
- Simplify watchdog by @hnyls2002 in #12463
- [Bug fix] Fix severe memory waste issue with torch.empty pin_memory by @sjtushenhai in #12266
- Feat: deepseek-ocr logits processor by @JustinTong0323 in #12415
- Fix lint in deepseek-ocr by @ispobock in #12470
- [Test] Add Functional Tests for Penalty Parameters by @neelabhsinha in #11931
- [Bug] OOM (Out-of-Memory) errors for extreme testing scenarios (min_tokens=2) by @LuYanFCP in #11757
- [Feature] PD-Multiplexing Context and Scheduler, lazy import spatial. by @ykcombat in #12275
- [VLM] Optimize async mm data process mechanism by @yuan-luo in #12066
- fix default env var for mooncake store by @huangtingwei9988 in #12429
- add served model name in bench serving by @carolove in #12428
- Tiny assert no running requests when releasing memory to avoid IMA by @fzyzcjy in #12341
- fix: dummy health check server not accessible on non-zero rank nodes by @ishandhanani in #12297
- Fix run benchmark by @ispobock in #12473
- Add env var to disable FA4 warmup by @cicirori in #12430
- Try to allow NCCL cumem for multi node nvlink case by @fzyzcjy in #11987
- Support Kimi Linear by @ispobock in #12469
- [CI] Fix kernel installation on aarch runners by @Fridge003 in #12475
- fa3 & trtllm_mha spec overlap by @JustinTong0323 in #11874
- chore: bump SGLang version to 0.5.4.post2 by @sglang-bot in #12439
- Tiny fix eos handling for PD disaggregation by @ShangmingCai in #12334
- Forward unknown tool calls instead of dropping by @Surya-Gunukula in #12226
- Use sgl fp4 quant kernel by default by @Qiaolin-Yu in #12482
- [hot fix] Remove `from python.sglang.xxx` by @hnyls2002 in #12483
- perf: trtllm mla performance minor improvements by @cicirori in #12435
- Filter tokenizer warning for kimi models by @ispobock in #12485
- [CI] Build aarch64 kernels for sgl-kernel test by @Fridge003 in #12480
- [Hotfix] Remove extra comment in sgl-kernel README by @Fridge003 in #12500
- [feat] Add SGLANG_TOOL_STRICT_LEVEL for tool-call behavior control by @JustinTong0323 in #12423
- Reduce docker image size; mount cache when using pip/cargo builds by @whybeyoung in #12238
- [HICache / PD]: Support offloading incremental KV cache in decode side. by @hzh0425 in #11966
- [Deterministic] add deepseek v3 deterministic inference CI test by @zminglei in #12412
- [Bug] test_flashattn_mla_backend errors in Hopper #12487 by @Johnsonms in #12488
- Update Mooncake EP's a2a interface by @UNIDY2002 in #12391
- [CI][NPU] remove pypi mirror site that hangs ci dependency installation by @iforgetmyname in #12499
- [Ascend] Add Ascend NPU support for sglang.check_env & rework proposal by @Alexhaoge in #11052
- [Feature] Qwen3-Next & FLA: Support MTP topk>1; Up to 6% faster by @byjiang1996 in #11133
- [CI] Move some Lora/Deterministic CI tests to nightly by @Fridge003 in #12507
- Migrate weak_ref_tensor to sgl-kernel by @BBuf in #12505
- feat: Add FP4 (E2M1) KV Cache Support with Quantization Utilities for MLA by @JackChuang in #10078
- chore: bump sgl-kernel version to 0.3.16.post5 by @sglang-bot in #12511
- [FEAT] Shared mem pool based cuda ipc for multi-modal data transport by @kousakawang in #11917
- Add prefix for torch symm mem by @yuan-luo in #12506
- [ServerArgs] allow --mamba-ssm-dtype extend by @hanming-lu in #12481
- [Fix] `concat_mla_absorb_q_kernel` fails for long inputs by @bingps in #12453
- Super tiny fix naming in bench serving scripts by @fzyzcjy in #12515
- move all get_stream in sgl_kernel to c++ to reduce the launch overhead by @merrymercy in #12521
- [Refact] Remove hardcoded KV cache dimension in MLATokenToKVPool by @Johnsonms in #12502
- [Bug] Fix Intern-S1 model accuracy and support /generate interface with input_ids by @hhaAndroid in #12367
- chore: upgrade flashinfer 0.5.0 by @zhyncs in #12523
- [hotfix] Remove flashinfer-jit-cache from pyproject by @Fridge003 in #12530
- fix: move dummy format loader check before quantization checks by @cicirori in #12532
- chore: upgrade mooncake 0.3.7.post1 by @ShangmingCai in #12541
- fix: Fix KTransformers hybrid inference with int8 quantization and format by @Atream in #12536
- Conditionally recapture cuda graph after model weight update from disk by @harrisonlimh in #12060
- [spec v2] Fix output repetition by speculative sampling error by @hnyls2002 in #12561
- [hot-fix] Fix broken CI by @hnyls2002 in #12564
- fix: fix the bug which leads qwen2_5_vl to crash with mixed_chunk by @PanJason in #11330
- Fix error when calling quantization by @fzyzcjy in #12548
- [Test] Add parameters to SRTRunner by @Jonahcb in #12227
- [ROCm] Update Mooncake to v0.3.7.post1 and add -DUSE_HIP=ON to rocm.Dockerfile by @yeahdongcn in #12560
- Reduce the overhead of nccl symmetric memory by @merrymercy in #12524
- tiny optimize for bench serving by @yizhang2077 in #12553
- Super tiny allow profile activities in bench_serving by @fzyzcjy in #12549
- Super tiny dump server info such as args in bench for post analysis by @fzyzcjy in #12550
- update usage of `trtllm_fp8_per_tensor_scale_moe` by @b8zhong in #12569
- [router][grpc] Consolidate error messages build in error.rs by @CatherineSue in #12301
- Remove the dependency of nccl.h in symmetric memory by @merrymercy in #12571
- [chore] Fix update_kernel_whl_index script for multiple cuda version by @Fridge003 in #12519
- Enable mixed type LayerNorm kernel for NSA indexer by @akhilg-nv in #12044
- Super tiny add UT for copy_to_gpu_no_ce by @fzyzcjy in #12270
- [Doc] fix miss index for production request trace by @stmatengss in #12547
- [GDN/SWA] mamba and swa radix cache edge case fix by @hanming-lu in #12111
- [Qwen3 VL] Add LoRA support for Qwen 3 VL by @Jonahcb in #12165
- test: support return logprobs in bench_offline_throughput test by @aftersnow in #12462
- Tiny fix ExpertDistributionReq error by @fzyzcjy in #11760
- fix: respect `--ignore-eos` in PD case for benchmarking by @ishandhanani in #12597
- Improve the metrics for PD by @merrymercy in #12580
- Enable memory saver for hybrid model by @ocss884 in #11974
- Restore torch defaults between sgl-kernel tests by @benbarsdell in #11131
- feat: limit peak memory usage when computing logprobs by @aftersnow in #6318
- [router][grpc] Restructure modules and code clean up by @CatherineSue in #12598
- Add --speculative-moe-runner-backend server arg by @trevor-m in #10183
- [Deterministic] Optimize bmm_batch_invariant op by @zminglei in #12522
- chore: bump mooncake version to 0.3.7.post2 by @ShangmingCai in #12599
- [spec-v2] Fix incompatibility with constrained decoding by @hnyls2002 in #12615
- Support aggregating engine metrics in sgl-router by @fzyzcjy in #11456
- Ensure GPU work is finished when release memory occupation call is finished by @fzyzcjy in #12592
- Add sanity checks when a test file is not added to CI (reland) by @fzyzcjy in #12594
- [router][grpc] Fix model validation, tool call check, streaming logic and misc in responses by @CatherineSue in #12616
- [HotFix] Disable torch dynamo for mrope_triton kernel by @yuan-luo in #12593
- Fix skip layer in get_quant_method by @ispobock in #12632
- [Test] Merge all constrained decoding tests. by @hnyls2002 in #12633
- Add io struct naming check back by @hnyls2002 in #12634
- Fix `output_ids` inconsistency by @hnyls2002 in #12628
- fix: Lazy import mooncake-ep to fix extra gpu contexts being created by @trevor-m in #12641
- [hotfix] Fix deepep w4a8 bug by @Fridge003 in #12642
- [Auto Sync] Update scheduler_metrics_mixin.py, collector.py (20251104) by @merrymercy in #12647
- [Bug] Fix NSA Backend KV-Buffer Shape Mismatch in DeepSeek-V3.2 by @Johnsonms in #12645
- [NVIDIA] Fix wrong symmetric sizes for fp4 cases by @kaixih in #12640
- [router][grpc] Fix index issues in reasoning content and missing streaming events by @CatherineSue in #12650
- Revert "Enable memory saver for hybrid model" by @Fridge003 in #12648
- Add multi-GPU configurations to nightly-test.yml by @alisonshao in #12585
- [fix] Handle escaped characters in GLM tool call parser to prevent double serialization by @soaringk in #12456
- [router][grpc] Emit OutputItemDone event and store output item array by @CatherineSue in #12656
- Register allgather/reducescatter buffers with symm memory by @nvcastet in #12572
- chore: bump SGLang version to 0.5.4.post3 by @sglang-bot in #12639
- [NVIDIA] Fix cutedsl backend of MoE by @kaixih in #12353
- [PD-Disagg] Check finish after pop transferred by @hnyls2002 in #12638
- fix typo of args description in sglang.profiler by @ai-easy-cpu in #12486
- [Dockerfile] Speed up docker image building by @acelyc111 in #8784
- Fix VLLM dependency test by @Kangyan-Zhou in #12670
- [Feature] add --lora-request-distribution arg to bench_serving.py and support skewed and distinct workloads by @glenliu21 in #12175
- [router][grpc] Implement tool_choice support for Responses API by @CatherineSue in #12668
- Expand and update test coverage for AMD CI by @hubertlu-tw in #10044
- fix: add `seed` bench_serving to cache key, remove redundant function definition. by @cicirori in #12680
- [Profiler] Add SGLANG_PROFILE_RECORD_SHAPES for recording shapes when profiling by @zejunchen-zejun in #11641
- fix trtllm_mla attention backend when disabling cuda graph. by @cicirori in #12687
- Refactor `--debug-tensor-dump-layers` to list by @guoyuhong in #12691
- [Grammar Fix] GLM-4-MOE self.first_k_dense_replace is undefined. by @zRzRzRzRzRzRzR in #12455
- add Kimi k2 reasoning parser by @MoyanZitto in #12702
- Commented out b200 tests due to runner shortage by @Kangyan-Zhou in #12609
- [CI] Fix qwen3-vl lora nightly ci by @Fridge003 in #12708
- Fix server args for gpt oss so users can override the moe runner backend by @merrymercy in #12696
- [router][grpc] Support streaming parsing with Tool Choice in chat completions API by @CatherineSue in #12677
- feat: initial multimodal-gen support by @mickqian in #12484
- Enable Aiter Attention for VL model by @Yuechguo in #12699
- [router] fix: validate HTTP status codes in health check by @wyx-0203 in #12631
- Support Expert Deferral Mechanism in KTransformers by @Atream in #12586
- Add mm_fp4 trtllm backend by @wenscarl in #12406
- [NVIDIA] Fix unit test of MoE and add it to nightly ci by @kaixih in #12709
- [misc] Add labeler for automatic labeling by @CatherineSue in #12710
- [router][ci] speed up python binding to 1.5 min by @key4ng in #12673
- Fix CI and style by @merrymercy in #12658
- Revert "Commented out b200 tests due to runner shortage (#12609)" by @Kangyan-Zhou in #12712
- [misc] Change sync-labels to false by @CatherineSue in #12714
- [router][grpc] Make harmony parser checks recipient first before channel by @CatherineSue in #12713
- [router][quick fix] Add minimal option for reasoning effort in spec by @key4ng in #12711
- [router] add basic ci tests for gpt-oss model support by @key4ng in #12651
- fix labeler by @key4ng in #12718
- [ci] fix permission by @key4ng in #12729
- [chore]Remove dockerfile from target file of bump kernel version by @Fridge003 in #12728
- [CPU] Upgrade default PT version to 2.9 by @ZailiWang in #12611
- Revert "[ci] fix permission" by @key4ng in #12732
- Revert "[router] web_search_preview tool basic implementation" by @key4ng in #12716
- fix sgl-kernel version by @gongwei-130 in #12723
- [chore] SGLang tag management in Dockerfile by @Fridge003 in #12734
- Add nightly test multi gpu configs by @alisonshao in #12721
- DeepSeek-V3.2: Add Adaptive MHA Attention Pathway for Short-Sequence Prefill by @YAMY1234 in #11892
- Temporarily fix missing routed_scaling_factor for CompressedTensorsWNA16MoEMethod by @Atream in #12738
- [chore] Fix triton installation for cu13 image by @Fridge003 in #12742
- keep attention backend document up to date by @b8zhong in #12741
- [Fix]Tiny fix in Dockerfile by @Fridge003 in #12748
- [router][grpc] Support mixin tool calls in Responses API by @CatherineSue in #12736
- fix: tiny fix cli by @mickqian in #12744
- [router][ci] Disable cache by @key4ng in #12752
- fix mamba prefix cache leak caused by abort by @yizhang2077 in #12693
- [BUGFIX] fix output_ids in abort by @yizhang2077 in #12737
- [GDN] Fuse b.sigmoid(), fused_gdn_gating and unsqueeze into one kernel: up to 0.85% e2e speedup by @byjiang1996 in #12508
- [VLM] Optimize qwen_vl preprocess_video by @yuan-luo in #12240
- Add timing metrics for requests by @cicirori in #12646
- fix qwen3-omni audio length < 30s by @jiapingW in #12674
- docs: document video-capable multimodal models by @WazupSteve in #12565
- fix ci by @key4ng in #12760
- [Refactor] Refactor fused_moe_triton tuning tools: extract shared utils, add EP/MLLM support, reduce overhead by @BBuf in #12440
- Update dsv3 quantization auto setting for sm100 by @ispobock in #12778
- chore: bump SGLang version to 0.5.5 by @sglang-bot in #12739
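Among the changes above, #11142 adds support for Matryoshka embeddings, where a full-size embedding can be shortened to a smaller dimension while remaining usable for similarity search. The core idea can be sketched in a few lines of plain Python (an illustrative sketch of the technique only, not SGLang's actual implementation): keep the first `dim` components of the vector, then L2-renormalize so the truncated embedding is still unit-length.

```python
import math

def truncate_matryoshka(embedding: list[float], dim: int) -> list[float]:
    # Illustrative sketch (hypothetical helper, not SGLang's API):
    # keep the leading `dim` components of a Matryoshka-trained embedding,
    # then divide by the L2 norm so the shorter vector is unit-length again.
    v = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

full = [0.6, 0.8, 0.0, 0.0]   # toy unit-length embedding
short = truncate_matryoshka(full, 2)  # 2-dimensional, still unit-length
```

This only works well for models trained with the Matryoshka objective, where the leading dimensions carry most of the semantic signal; truncating an ordinary embedding this way loses information arbitrarily.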
## New Contributors
- @thelongestusernameofall made their first contribution in #11909
- @LucaLow made their first contribution in #12071
- @vipwangerxiao made their first contribution in #9501
- @Johnsonms made their first contribution in #11936
- @ash-sigh made their first contribution in #11047
- @Kevin-XiongC made their first contribution in #12057
- @kaln27 made their first contribution in #9403
- @haichao592 made their first contribution in #12186
- @satyamk7054 made their first contribution in #11142
- @weireweire made their first contribution in #11708
- @bmac3 made their first contribution in #12295
- @Gao016 made their first contribution in #12313
- @lianakoleva made their first contribution in #12269
- @zyzshishui made their first contribution in #12144
- @zejunchen-zejun made their first contribution in #11709
- @AichenF made their first contribution in #11737
- @JensenFire made their first contribution in #12319
- @Chen-0210 made their first contribution in #12012
- @popsiclexu made their first contribution in #12368
- @lpc0220 made their first contribution in #11116
- @elvischenv made their first contribution in #12307
- @sjtushenhai made their first contribution in #12266
- @LuYanFCP made their first contribution in #11757
- @carolove made their first contribution in #12428
- @Surya-Gunukula made their first contribution in #12226
- @Alexhaoge made their first contribution in #11052
- @JackChuang made their first contribution in #10078
- @bingps made their first contribution in #12453
- @hhaAndroid made their first contribution in #12367
- @yeahdongcn made their first contribution in #12560
- @akhilg-nv made their first contribution in #12044
- @alisonshao made their first contribution in #12585
- @soaringk made their first contribution in #12456
- @ai-easy-cpu made their first contribution in #12486
- @MoyanZitto made their first contribution in #12702
- @wyx-0203 made their first contribution in #12631
- @WazupSteve made their first contribution in #12565
Full Changelog: v0.5.4...v0.5.5