sgl-project/sglang — Release v0.5.5


What's Changed

  • [8/n] decouple quantization impl from vllm dependency - gguf srt by @FlamingoPg in #11964
  • lang: support direct video inference by @mickqian in #9936
  • Enable Llama 4 + TRTLLM MHA by @b8zhong in #12003
  • Refactor Triton-kernel MoE runner integration by @Jonahcb in #11795
  • use flashinfer_trtllm moe runner backend to gain around 10% perf on b200 fp8 dpsk by @b8zhong in #11816
  • Fix(security): block unsafe pickle deserialization to mitigate CVE-2025-10164 by @thelongestusernameofall in #11909
  • Revert "lang: support direct video inference" by @merrymercy in #12038
  • support more models in piecewise cuda graph by @narutolhy in #11745
  • [Fix] Fix lint to pass CI by @Fridge003 in #12037
  • Revert "[Fix] Fix lint to pass CI" by @Fridge003 in #12042
  • fix: fix MMMU loading issue by @ZailiWang in #11759
  • Opt MHA chunked prefix: merge prefix and extend kv cache to run mha once by @xu-yfei in #10953
  • Add gguf dependency for cpu/xpu by @ZailiWang in #12041
  • fix: the hardcode hf repo name comparison for deepseek-ocr by @rainj-me in #12031
  • Install numactl in Dockerfile for GH200/GB200/GB300 by @fzyzcjy in #11853
  • [router] Add mTLS Support for Router-to-Worker Communication by @slin1237 in #12019
  • Tiny cleanup send_single by @fzyzcjy in #12056
  • Refactoring GLM-4.5 and GLM-4.5V related implementations by @zRzRzRzRzRzRzR in #11800
  • [Fix] fix missing ipc_name of __getitem__ in some IO structs by @whybeyoung in #12053
  • fix: bench_serving ITL calculation when using spec-decoding by @JustinTong0323 in #12064
  • Fix dpsk-r1-fp4 launching crash by @Qiaolin-Yu in #12063
  • Revise POINTSV15Chat model by @yuan-luo in #12049
  • Add 'gguf' to project dependencies by @Muqi1029 in #12046
  • [Profiler] expand '~' by @Muqi1029 in #11999
  • [b200] fix piecewise cuda graph launch bug by @BBuf in #12067
  • Fix multi processing serializer bug by @fzyzcjy in #11958
  • [Fix]: HiCache hasher failed when EAGLE mode enabled by @leavelet in #12025
  • adjust dynamic vs static outputs comparison in test_lora_update.py by @glenliu21 in #11884
  • [router] implement response api get input item function and refactor input/output store by @key4ng in #11924
  • fix(compile_utils, ep_moe): update environment variable and dtype check by @ishandhanani in #12034
  • [router] fix ut router config init to use build pattern by @slin1237 in #12084
  • docs(server-arguments): add allowed options for each argument by @Jonahcb in #11560
  • [router] migrate app context to builder pattern 1/n by @slin1237 in #12086
  • [router] migrate app context to builder pattern 2/n by @slin1237 in #12089
  • [router][grpc] Remove gpt_oss parsers and remove _parser suffix in tool parser files by @CatherineSue in #12091
  • [1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU by @zminglei in #12000
  • Fix: Update blog link by @LucaLow in #12071
  • perf: trtllm_mla attention backend spec decoding speedup w/ cuda graph by @cicirori in #12093
  • [2/N]Support DeepSeek-R1 w4a8 low latency deepep by @ayrnb in #8464
  • Enhance tests in deterministic kernels by @fzyzcjy in #12070
  • [Doc] Add documentation for DeepSeek V3.2 by @Fridge003 in #11877
  • [10/N] MoE Refactor: reorganize deepgemm runner in DeepEPMoE by @ch-wan in #12054
  • Support true on-policy by @fzyzcjy in #12058
  • [Docs] update sgl-kernel readme by @FlamingoPg in #11379
  • Fix 'KeyError' for per_token expert distribution recorder by @vipwangerxiao in #9501
  • Fix kernel version bump file by @Kangyan-Zhou in #12087
  • [Fix] Set global args in cpu test by @Fridge003 in #12105
  • chore: bump sgl-kernel version to 0.3.16.post4 by @sglang-bot in #12103
  • [Auto Sync] Update test_deterministic.py, test_deterministi... (20251024) by @merrymercy in #12083
  • [router] Refactor data connector architecture with unified storage modules by @key4ng in #12096
  • fix: release workflow should work on both archs by @ishandhanani in #12110
  • [bugs] docker file name should be .Dockerfile so it can properly render by @slin1237 in #11869
  • Clean up server args & Add CI scripts by @merrymercy in #12124
  • [Misc] Improve the error message of failed import by @DarkSharpness in #12119
  • [CI] Add ci monitor balance workflow by @BBuf in #11962
  • Skip TestLlama4LoRA in CI by @lifuhuang in #12098
  • clean up github tokens by @merrymercy in #12126
  • Fix Illegal Instruction/IMA errors when using DP attention -- num_tokens_for_logprob calculation by @YAMY1234 in #12115
  • Fix token for CI monitor by @merrymercy in #12127
  • Reenable b200 tests by @Kangyan-Zhou in #11814
  • Update document index for DeepSeek-v32 docs by @Fridge003 in #12101
  • Update sgl-kernel version to 0.3.16.post4 by @Fridge003 in #12125
  • [Doc] Fix format for deepseek v3.2 document by @Fridge003 in #12130
  • Accelerate deepseek fp4 b200 ci by @Qiaolin-Yu in #11993
  • Clean up server launch code and multi tokenizer by @merrymercy in #12132
  • [Test] Add dsv3.2 nsa backend testing by @Johnsonms in #11936
  • [docs] upd docker files names everywhere by @vincentzed in #12133
  • Make bmm batch invariant injection optional by @fzyzcjy in #12118
  • [Doc] Small update of DeepSeek v3.2 document by @Fridge003 in #12138
  • docs: update README by @zhyncs in #12139
  • [router] MCP Manager - Support Connection Pooling, Tool Inventory and Proxy by @slin1237 in #12097
  • [NVIDIA] Change default quant method for model_opt by @kaixih in #11991
  • [router] update smg code owners for each component by @slin1237 in #12141
  • [router] cleaned up all the redundant comments in the config module by @CatherineSue in #12147
  • Clean up attention backend selection code & Other minor rename by @merrymercy in #12136
  • [log] Make forward iter count optional by @hnyls2002 in #12116
  • [misc] dependencies & environment flag by @hnyls2002 in #12113
  • [quantization] AWQ Marlin doesn't work when dtype is bfloat16 by @kevin85421 in #11494
  • [HiCache]Page head layout IO kernel by @huangtingwei9988 in #11615
  • Do not use MagicMock to mock server_args in tests by @hnyls2002 in #12154
  • [router][grpc] Fix tool call id in parse_json_schema_response by @CatherineSue in #12152
  • [router] centralize mcp tool args handling by @slin1237 in #12155
  • Fix ITL metrics when using openai endpoint with spec by @hnyls2002 in #12156
  • [Fix] fix allreduce bug in Piecewise Graph by @zyksir in #12106
  • Support DeepGEMM for deterministic inference by @fzyzcjy in #12142
  • model: support NVILA and NVILA Lite by @futrime in #10399
  • Avoid using flashinfer_allreduce_fusion when dp attention is enabled. by @elfiegg in #11632
  • transfer mrope_position_delta to device when first running by @ash-sigh in #11047
  • add gitignore for claude code and serena mcp by @slin1237 in #12166
  • Support MiniMax M2 model by @zhaochenyang20 in #12129
  • [misc][grpc] Remove duplicate log by @CatherineSue in #12168
  • [router][grpc] Add ResponsesContext and fix error propagation in responses api by @CatherineSue in #12164
  • [router] Remove SharedXxxStorage type aliases to make Arc explicit by @CatherineSue in #12171
  • Remove deprecated --enable-beta-spec argument and fix b200 test by @Kangyan-Zhou in #12167
  • fix broken deepep/flashmla install in container by adding --no-build-isolation by @ishandhanani in #12170
  • Remove description for --enable-beta-spec argument by @JustinTong0323 in #12177
  • chore: bump SGLang version to 0.5.4.post1 by @sglang-bot in #12169
  • [doc] add example of using w4fp8 for Deepseek by @Kevin-XiongC in #12057
  • [sgl-route] Optimize the use of constant slices and retain to simplif… by @lengrongfu in #12159
  • [Fix] Fix cu130 sgl-kernel wheel renaming by @Fridge003 in #12173
  • docs: update contact by @zhyncs in #12192
  • [sgl-kernel] feat: Support sm120 cutlass fp8 gemm kernel by @kaln27 in #9403
  • [sgl-kernel][4/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #12080
  • GLM-4-0414 and GLM-4.1V Code Refactor by @zRzRzRzRzRzRzR in #12117
  • Add support for AutoRound quantized models by @WeiweiZhang1 in #10153
  • Optimize triton_mrope with torch compile by @yuan-luo in #12112
  • Fix crash after flush cache by @cctry in #12107
  • [Detokenizer Manager] Cleanup state when reqs are finished by @Muqi1029 in #12205
  • fix(metrics): double times add_latency for DECODE_BOOTSTRAP by @jinmingyi1998 in #12209
  • improve minimax-m2 rmsnorm precision by @haichao592 in #12186
  • check_offload_progress more frequently by @pansicheng in #11656
  • [Feature] PD-Multiplexing Context and Scheduler. by @ykcombat in #11592
  • rope xpu: fix missing argument 'fused_set_kv_buffer_arg' and replace native with sgl_kernel_xpu impl by @chunyuan-w in #12006
  • Add support for Matryoshka embeddings (#126) by @satyamk7054 in #11142
  • fix: AttributeError: 'NixlKVManager' object has no attribute 'prefill_tp_size_table' by @gongwei-130 in #12234
  • Compiling rope while preserving true on policy by @fzyzcjy in #12161
  • [Auto Sync] Update scheduler.py, spec_info.py, run_suite.py... (20251027) by @zhyncs in #12235
  • Support running FP4 Deepseek on SM120. by @weireweire in #11708
  • Add env var to control custom Triton kernel cache and set CSGMV as default backend. by @lifuhuang in #12176
  • Use explicit uint64 dtype for Tensor data_ptr() to avoid overflow by @jianan-gu in #11994
  • Update openai package version to 2.6.1 by @JustinTong0323 in #12222
  • [2/2] Use moe_sum_reduce cuda kernel by @yuan-luo in #10654
  • docker: add CUDA13 support in dockerfile and update GDRCopy/NVSHMEM for blackwell support by @ishandhanani in #11517
  • [router] remove code duplication by @slin1237 in #12245
  • [DeepseekV32] Enable flashmla_prefill kernel with fp8 kvcache by @hlu1 in #11655
  • Add per-request retraction count by @scottjlee in #11177
  • Opt fused triton moe: add tma for down proj kernel by @xu-yfei in #10567
  • Support releasing CUDA graph memory when paused by @fzyzcjy in #7873
  • [router] use mcp struct from sdk and clean up code across codebase by @slin1237 in #12249
  • [router] configure workflow retries and timeout based on routerConfig by @slin1237 in #12252
  • Feature/Add GET endpoint to query loaded LoRA adapters by @ConnorLi96 in #12229
  • [hotfix] Incorrect CombineOverlapArgs in SBO by @ch-wan in #12230
  • [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 by @sufeng-buaa in #10804
  • [Bug fix] [PP] fix wrong dtype for quantized model by @XucSh in #12247
  • Fix potential eos bug on decode instance when PD is enabled by @ShangmingCai in #12206
  • Revert "[Feature] PD-Multiplexing Context and Scheduler." by @zhyncs in #12267
  • chore: cleanup quant deps by @zhyncs in #12268
  • [router] Fix type unmatch during validation by @key4ng in #12257
  • Modify rocm.Dockerfile by @sogalin in #12274
  • [router] upgrade grpc dependency and py 3.13 3.14 support by @slin1237 in #12284
  • Fix 'BypassedTopKOutput' object has no attribute 'topk_weights' for DeepEP by @trevor-m in #12231
  • Tiny fix sgl-kernel related CI installing the wrong binary by @fzyzcjy in #12283
  • doc for logit_bias by @whybeyoung in #12188
  • Use Flashinfer TRT-LLM as Llama 4 compatible MoE backend by @b8zhong in #11928
  • [rust][ci] Add end-to-end tests for Oracle history backend by @key4ng in #12233
  • [router] support arm, windows, mac, linux, reduce wheel size and number by @slin1237 in #12285
  • fix seqlen bug for trtllm_mla's draft_extend by @bmac3 in #12295
  • Update deepseek_v32.md by @hlu1 in #12296
  • Super tiny fix expert distribution dump error by @fzyzcjy in #12271
  • [router][grpc] Fix inconsistent behavior of conversation_id not found by @CatherineSue in #12299
  • fix: Llama 4 BF16 load on Blackwell by @b8zhong in #12308
  • Add continuous_usage_stats support for streaming responses by @BBuf in #12241
  • [hotfix] missing w13_weight_fp8 and w2_weight_fp8 in UE8M0 requantization by @ch-wan in #12259
  • [hotfix] Fix pytest not found in CI by @Fridge003 in #12311
  • a tiny fix to support deepseek bf16 weights by @Gao016 in #12313
  • [metrics][EPLB]: Support selected count of physical experts on each GPU by @acelyc111 in #9825
  • doc: improve modelopt error description by @lianakoleva in #12269
  • EPLB: prefer to use physical experts in the same gpu or node by @acelyc111 in #10874
  • Add Batch-Invariant RMSNorm by @zyzshishui in #12144
  • followup fix for llama 4 trtllm flashinfer backend by @b8zhong in #12314
  • [Deepseek V3.2] Enable flashmla_auto with MTP by @hlu1 in #12294
  • feat: preview filename from tuning_fused_moe_triton.py by @lianakoleva in #12276
  • [ci] Try fixing broken CIs by @Fridge003 in #12317
  • Refactor abortion in event loop by @hnyls2002 in #12312
  • [Test] Fix session control test by @hnyls2002 in #12336
  • Eagle3 DP attention for Qwen3 MoE by @qhsc in #12002
  • feat: return partial generation results when aborting requests in waiting queue by @guoyuhong in #11673
  • [Bug fix] trace: fix import error in mini_lb if sgl-router image does not install sglang by @sufeng-buaa in #12338
  • [router] fix router release workflow and add build test in PR by @CatherineSue in #12315
  • Triton fused_moe_kernel support ep moe tuning by @BBuf in #12343
  • [Fix] fix type issue of env flag value MODELOPT_MAX_TOKENS_PER_EXPERT by @zejunchen-zejun in #11709
  • [bug] fix router pypi license file by @slin1237 in #12345
  • fix: llama 4 + trtllm gen + fp8 kv cache incompatibility by @b8zhong in #12347
  • [2/2] Deepseek deterministic: support deepseek v3 deterministic inference on 8 x H200 by @zminglei in #12095
  • Fix Flashinfer Backend for SM120 Usage by @weireweire in #12325
  • [router] refactor mcp to use LRU and fix pooling bug by @CatherineSue in #12346
  • support cutlass fp4 kernel in sm120 by @AichenF in #11737
  • [bug] fix router installation to include additional dependency by @slin1237 in #12348
  • [router] update router docker to use maturin and build from local by @CatherineSue in #12350
  • Fix Duplicate Classmethod in spec_info.py by @hebiao064 in #12354
  • [CI] Add Llama 3.1 8B FP4 to B200 CI by @b8zhong in #12182
  • Fuse wk and weight_proj in Indexer for DeepSeekV3.2-FP4 by @trevor-m in #12094
  • [router] Harmony Pipeline: Chat Completion & Responses API with MCP Support by @slin1237 in #12153
  • [bugfix] fix deepseekvl2 and deepseek_ocr model type conflict by @leihuang-sketch in #12050
  • [Ckpt Engine] feat: new sglang entrypoint support for update by @stmatengss in #12216
  • [Perf] Optimize multimodal mm_inputs process in scheduler by @yuan-luo in #11910
  • [NPU] fix pp_size>1 by @Makcum888e in #12195
  • Super tiny add tag for benchmark scripts by @fzyzcjy in #12340
  • Allow benchmarking tool to handle empty response by @Kangyan-Zhou in #12174
  • Super tiny fix AMD ci by @fzyzcjy in #12378
  • Import flash_mla from sgl-kernel by @Fridge003 in #12135
  • [Bug fix][PP] fix deadlock with tie_word_embeddings by @XucSh in #12362
  • [fix] added image token as prefix for deepseek-ocr by @Tushar-ml in #12358
  • Fix DeepSeek chat templates to handle tool call arguments type checking (#11700) by @Kangyan-Zhou in #12123
  • [Feature] Initial eagle3 support for Deepseek-like models by @JensenFire in #12319
  • Enable fast silu-and-mul-and-quant fused kernel by @fzyzcjy in #11806
  • [Test] Enhance radix cache test for spec cases by @hnyls2002 in #12394
  • [NPU] bugfix for Qwen3-Next and performance update by @iforgetmyname in #11969
  • [Feature] Support DeepSeek MTP on NPU by @iforgetmyname in #11897
  • Revert "Triton fused_moe_kernel support ep moe tuning" by @BBuf in #12377
  • [sgl-kernel] upd deepgemm hash to rebased commit by @FlamingoPg in #11960
  • [router] harmony responses api streaming support by @slin1237 in #12395
  • [docker] clean up main dockerfile for router and dev configurations by @CatherineSue in #12364
  • feat: add EP support in tuning by @Chen-0210 in #12012
  • [router] use safety_identifier replace user on chat history storage by @lengrongfu in #12185
  • [CI Monitor] Fix ci_monitor perf analyzer bug by @BBuf in #12281
  • [router] Fix safety_identifier missing by @key4ng in #12404
  • [ci] Fix ci_install_deepep by @Fridge003 in #12375
  • Update news section in README.md by @merrymercy in #12409
  • [router] Function call support for openai router Responses API by @key4ng in #12386
  • minor code sync by @merrymercy in #12403
  • [Bug fix][PD Disaggregation] fix prefill hanging issue with PP and DP Attention by @popsiclexu in #12368
  • [NVIDIA] Add CI workloads for GB200 by @kaixih in #12242
  • [router] web_search_preview tool basic implementation by @key4ng in #12290
  • [router] 0.2.2 release by @slin1237 in #12399
  • enable cudaProfilerApi for one batch benchmarking by @lpc0220 in #11116
  • [Refactor] tuning_fused_moe for MLLM and small refactor by @JustinTong0323 in #11224
  • [DeepSeekV32] Bug fix to ensure page_table and result in same type by @Johnsonms in #12300
  • [CI] fix tests' time estimation by @hnyls2002 in #12401
  • Reserved abortion API when retracting by @hnyls2002 in #12425
  • Fix the shared expert & routed expert overlap in Llama 4 by @b8zhong in #12405
  • feat: Add Non-intrusive Tensor Dumping for Model Inference by @guoyuhong in #10566
  • feat: support trtllm_mha FP8 query attention kernel by @elvischenv in #12307
  • [Bugfix]: distinguish processors for deepseek_vl2 and deepseek_ocr to p… by @bppps in #12384
  • [ci] install released version router by @key4ng in #12410
  • Revert "fix llama4 kv cache layout" by @b8zhong in #12437
  • Add trait for BasePrefixCache by @hnyls2002 in #12436
  • [CI] Add more bins for 1-gpu CI test by @Fridge003 in #12422
  • [bugfix] set is_prefill_only=false when mixed_chunk by @Bruce-x-1997 in #10889
  • Clean up sgl kernel by @merrymercy in #12413
  • [CI] fix possible port conflicts. by @hnyls2002 in #12452
  • Fix ci install to allow prerelease by @merrymercy in #12449
  • fix: Add default value for backend in sample_mmmu_requests by @ZailiWang in #12256
  • Enable bailing_moe to support TP=16 by @guoyuhong in #12369
  • fix: watchdog thread exception by @Kindyaa in #12328
  • Simplify watchdog by @hnyls2002 in #12463
  • [Bug fix] Fix severe memory waste issue with torch.empty pin_memory by @sjtushenhai in #12266
  • Feat: deepseek-ocr logits processor by @JustinTong0323 in #12415
  • Fix lint in deepseek-ocr by @ispobock in #12470
  • [Test] Add Functional Tests for Penalty Parameters by @neelabhsinha in #11931
  • [Bug] OOM (Out-of-Memory) errors for extreme testing scenarios (min_tokens=2) by @LuYanFCP in #11757
  • [Feature] PD-Multiplexing Context and Scheduler, lazy import spatial. by @ykcombat in #12275
  • [VLM] Optimize async mm data process mechanism by @yuan-luo in #12066
  • fix default env var for mooncake store by @huangtingwei9988 in #12429
  • add served model name in bench serving by @carolove in #12428
  • Tiny assert no running requests when releasing memory to avoid IMA by @fzyzcjy in #12341
  • fix: dummy health check server not accessible on non-zero rank nodes by @ishandhanani in #12297
  • Fix run benchmark by @ispobock in #12473
  • Add env var to disable FA4 warmup by @cicirori in #12430
  • Try to allow NCCL cumem for multi node nvlink case by @fzyzcjy in #11987
  • Support Kimi Linear by @ispobock in #12469
  • [CI] Fix kernel installation on aarch runners by @Fridge003 in #12475
  • fa3 & trtllm_mha spec overlap by @JustinTong0323 in #11874
  • chore: bump SGLang version to 0.5.4.post2 by @sglang-bot in #12439
  • Tiny fix eos handling for PD disaggregation by @ShangmingCai in #12334
  • Forward unknown tool calls instead of dropping by @Surya-Gunukula in #12226
  • Use sgl fp4 quant kernel by default by @Qiaolin-Yu in #12482
  • [hot fix] Remove from python.sglang.xxx by @hnyls2002 in #12483
  • perf: trtllm mla performance minor improvements by @cicirori in #12435
  • Filter tokenizer warning for kimi models by @ispobock in #12485
  • [CI] Build aarch64 kernels for sgl-kernel test by @Fridge003 in #12480
  • [Hotfix] Remove extra comment in sgl-kernel README by @Fridge003 in #12500
  • [feat] Add SGLANG_TOOL_STRICT_LEVEL for tool-call behavior control by @JustinTong0323 in #12423
  • Reduce docker image size. mount cache when use pip/cargo build by @whybeyoung in #12238
  • [HICache / PD]: Support offloading incremental KV cache in decode side. by @hzh0425 in #11966
  • [Deterministic] add deepseek v3 deterministic inference CI test by @zminglei in #12412
  • [Bug] test_flashattn_mla_backend errors in Hopper #12487 by @Johnsonms in #12488
  • Update Mooncake EP's a2a interface by @UNIDY2002 in #12391
  • [CI][NPU] remove pypi mirror site that hangs ci dependency installation by @iforgetmyname in #12499
  • [Ascend] Add Ascend NPU support for sglang.check_env & rework proposal by @Alexhaoge in #11052
  • [Feature] Qwen3-Next & FLA: Support MTP topk>1; Up to 6% faster by @byjiang1996 in #11133
  • [CI] Move some Lora/Deterministic CI tests to nightly by @Fridge003 in #12507
  • Migrate weak_ref_tensor to sgl-kernel by @BBuf in #12505
  • feat: Add FP4 (E2M1) KV Cache Support with Quantization Utilities for MLA by @JackChuang in #10078
  • chore: bump sgl-kernel version to 0.3.16.post5 by @sglang-bot in #12511
  • [FEAT] Shared mem pool based cuda ipc for multi-modal data transport by @kousakawang in #11917
  • Add prefix for torch symm mem by @yuan-luo in #12506
  • [ServerArgs] allow --mamba-ssm-dtype extend by @hanming-lu in #12481
  • [Fix] concat_mla_absorb_q_kernel fails for long inputs by @bingps in #12453
  • Super tiny fix naming in bench serving scripts by @fzyzcjy in #12515
  • move all get_stream in sgl_kernel to c++ to reduce the launch overhead by @merrymercy in #12521
  • [Refact] Remove hardcoded KV cache dimension in MLATokenToKVPool by @Johnsonms in #12502
  • [Bug] Fix Intern-S1 model accuracy and support /generate interface with input_ids by @hhaAndroid in #12367
  • chore: upgrade flashinfer 0.5.0 by @zhyncs in #12523
  • [hotfix] Remove flashinfer-jit-cache from pyproject by @Fridge003 in #12530
  • fix: move dummy format loader check before quantization checks by @cicirori in #12532
  • chore: upgrade mooncake 0.3.7.post1 by @ShangmingCai in #12541
  • fix: Fix KTransformers hybrid inference with int8 quantization and format by @Atream in #12536
  • Conditionally recapture cuda graph after model weight update from disk by @harrisonlimh in #12060
  • [spec v2] Fix output repetition by speculative sampling error by @hnyls2002 in #12561
  • [hot-fix] Fix broken CI by @hnyls2002 in #12564
  • fix: fix the bug which leads qwen2_5_vl to crash with mixed_chunk by @PanJason in #11330
  • Fix error when calling quantization by @fzyzcjy in #12548
  • [Test] Add parameters to SRTRunner by @Jonahcb in #12227
  • [ROCm] Update Mooncake to v0.3.7.post1 and add -DUSE_HIP=ON to rocm.Dockerfile by @yeahdongcn in #12560
  • Reduce the overhead of nccl symmetric memory by @merrymercy in #12524
  • tiny optimize for bench serving by @yizhang2077 in #12553
  • Super tiny allow profile activities in bench_serving by @fzyzcjy in #12549
  • Super tiny dump server info such as args in bench for post analysis by @fzyzcjy in #12550
  • update usage of trtllm_fp8_per_tensor_scale_moe by @b8zhong in #12569
  • [router][grpc] Consolidate error messages build in error.rs by @CatherineSue in #12301
  • Remove the dependency of nccl.h in symmetric memory by @merrymercy in #12571
  • [chore] Fix update_kernel_whl_index script for multiple cuda version by @Fridge003 in #12519
  • Enable mixed type LayerNorm kernel for NSA indexer by @akhilg-nv in #12044
  • Super tiny add UT for copy_to_gpu_no_ce by @fzyzcjy in #12270
  • [Doc] fix miss index for production request trace by @stmatengss in #12547
  • [GDN/SWA] mamba and swa radix cache edge case fix by @hanming-lu in #12111
  • [Qwen3 VL] Add LoRA support for Qwen 3 VL by @Jonahcb in #12165
  • test: support return logprobs in bench_offline_throughput test by @aftersnow in #12462
  • Tiny fix ExpertDistributionReq error by @fzyzcjy in #11760
  • fix: respect --ignore-eos in PD case for benchmarking by @ishandhanani in #12597
  • Improve the metrics for PD by @merrymercy in #12580
  • Enable memory saver for hybrid model by @ocss884 in #11974
  • Restore torch defaults between sgl-kernel tests by @benbarsdell in #11131
  • feat: limit peak memory usage when computing logprobs by @aftersnow in #6318
  • [router][grpc] Restructure modules and code clean up by @CatherineSue in #12598
  • Add --speculative-moe-runner-backend server arg by @trevor-m in #10183
  • [Deterministic] Optimize bmm_batch_invariant op by @zminglei in #12522
  • chore: bump mooncake version to 0.3.7.post2 by @ShangmingCai in #12599
  • [spec-v2] Fix incompatibility with constrained decoding by @hnyls2002 in #12615
  • Support aggregating engine metrics in sgl-router by @fzyzcjy in #11456
  • Ensure GPU work is finished when release memory occupation call is finished by @fzyzcjy in #12592
  • Add sanity checks when a test file is not added to CI (reland) by @fzyzcjy in #12594
  • [router][grpc] Fix model validation, tool call check, streaming logic and misc in responses by @CatherineSue in #12616
  • [HotFix] Disable torch dynamo for mrope_triton kernel by @yuan-luo in #12593
  • Fix skip layer in get_quant_method by @ispobock in #12632
  • [Test] Merge all constrained decoding tests. by @hnyls2002 in #12633
  • Add io struct naming check back by @hnyls2002 in #12634
  • Fix output_ids inconsistency by @hnyls2002 in #12628
  • fix: Lazy import mooncake-ep to fix extra gpu contexts being created by @trevor-m in #12641
  • [hotfix] Fix deepep w4a8 bug by @Fridge003 in #12642
  • [Auto Sync] Update scheduler_metrics_mixin.py, collector.py (20251104) by @merrymercy in #12647
  • [Bug] Fix NSA Backend KV-Buffer Shape Mismatch in DeepSeek-V3.2 by @Johnsonms in #12645
  • [NVIDIA] Fix wrong symmetric sizes for fp4 cases by @kaixih in #12640
  • [router][grpc] Fix index issues in reasoning content and missing streaming events by @CatherineSue in #12650
  • Revert "Enable memory saver for hybrid model" by @Fridge003 in #12648
  • Add multi-GPU configurations to nightly-test.yml by @alisonshao in #12585
  • [fix] Handle escaped characters in GLM tool call parser to prevent double serialization by @soaringk in #12456
  • [router][grpc] Emit OutputItemDone event and store output item array by @CatherineSue in #12656
  • Register allgather/reducescatter buffers with symm memory by @nvcastet in #12572
  • chore: bump SGLang version to 0.5.4.post3 by @sglang-bot in #12639
  • [NVIDIA] Fix cutedsl backend of MoE by @kaixih in #12353
  • [PD-Disagg] Check finish after pop transferred by @hnyls2002 in #12638
  • fix typo of args description in sglang.profiler by @ai-easy-cpu in #12486
  • [Dockerfile] Speed up docker image building by @acelyc111 in #8784
  • Fix VLLM dependency test by @Kangyan-Zhou in #12670
  • [Feature] add --lora-request-distribution arg to bench_serving.py and support skewed and distinct workloads by @glenliu21 in #12175
  • [router][grpc] Implement tool_choice support for Responses API by @CatherineSue in #12668
  • Expand and update test coverage for AMD CI by @hubertlu-tw in #10044
  • fix: add seed bench_serving to cache key, remove redundant function definition. by @cicirori in #12680
  • [Profiler] Add SGLANG_PROFILE_RECORD_SHAPES for recording shapes when profiling by @zejunchen-zejun in #11641
  • fix trtllm_mla attention backend when disabling cuda graph. by @cicirori in #12687
  • Refactor --debug-tensor-dump-layers to list by @guoyuhong in #12691
  • [Grammar Fix] GLM-4-MOE self.first_k_dense_replace is undefined. by @zRzRzRzRzRzRzR in #12455
  • add Kimi k2 reasoning parser by @MoyanZitto in #12702
  • Commented out b200 tests due to runner shortage by @Kangyan-Zhou in #12609
  • [CI] Fix qwen3-vl lora nightly ci by @Fridge003 in #12708
  • Fix server args for gpt oss so users can override the moe runner backend by @merrymercy in #12696
  • [router][grpc] Support streaming parsing with Tool Choice in chat completions API by @CatherineSue in #12677
  • feat: initial multimodal-gen support by @mickqian in #12484
  • Enable Aiter Attention for VL model by @Yuechguo in #12699
  • [router] fix: validate HTTP status codes in health check by @wyx-0203 in #12631
  • Support Expert Deferral Mechanism in KTransformers by @Atream in #12586
  • Add mm_fp4 trtllm backend by @wenscarl in #12406
  • [NVIDIA] Fix unit test of MoE and add it to nightly ci by @kaixih in #12709
  • [misc] Add labeler for automatic labeling by @CatherineSue in #12710
  • [router][ci] speed up python binding to 1.5 min by @key4ng in #12673
  • Fix CI and style by @merrymercy in #12658
  • Revert "Commented out b200 tests due to runner shortage (#12609)" by @Kangyan-Zhou in #12712
  • [misc] Change sync-labels to false by @CatherineSue in #12714
  • [router][grpc] Make harmony parser checks recipient first before channel by @CatherineSue in #12713
  • [router][quick fix] Add minimal option for reasoning effort in spec by @key4ng in #12711
  • [router] add basic ci tests for gpt-oss model support by @key4ng in #12651
  • fix labeler by @key4ng in #12718
  • [ci] fix permission by @key4ng in #12729
  • [chore]Remove dockerfile from target file of bump kernel version by @Fridge003 in #12728
  • [CPU] Upgrade default PT version to 2.9 by @ZailiWang in #12611
  • Revert "[ci] fix permission" by @key4ng in #12732
  • Revert "[router] web_search_preview tool basic implementation" by @key4ng in #12716
  • fix sgl-kernel version by @gongwei-130 in #12723
  • [chore] SGLang tag management in Dockerfile by @Fridge003 in #12734
  • Add nightly test multi gpu configs by @alisonshao in #12721
  • DeepSeek-V3.2: Add Adaptive MHA Attention Pathway for Short-Sequence Prefill by @YAMY1234 in #11892
  • Temporarily fix missing routed_scaling_factor for CompressedTensorsWNA16MoEMethod by @Atream in #12738
  • [chore] Fix triton installation for cu13 image by @Fridge003 in #12742
  • keep attention backend document up to date by @b8zhong in #12741
  • [Fix]Tiny fix in Dockerfile by @Fridge003 in #12748
  • [router][grpc] Support mixin tool calls in Responses API by @CatherineSue in #12736
  • fix: tiny fix cli by @mickqian in #12744
  • [router][ci] Disable cache by @key4ng in #12752
  • fix mamba prefix cache leak caused by abort by @yizhang2077 in #12693
  • [BUGFIX] fix output_ids in abort by @yizhang2077 in #12737
  • [GDN] Fuse b.sigmoid(), fused_gdn_gating and unsqueeze into one kernel: up to 0.85% e2e speedup by @byjiang1996 in #12508
  • [VLM] Optimize qwen_vl preprocess_video by @yuan-luo in #12240
  • Add timing metrics for requests by @cicirori in #12646
  • fix qwen3-omni audio length < 30s by @jiapingW in #12674
  • docs: document video-capable multimodal models by @WazupSteve in #12565
  • fix ci by @key4ng in #12760
  • [Refactor] Refactor fused_moe_triton tuning tools: extract shared utils, add EP/MLLM support, reduce overhead by @BBuf in #12440
  • Update dsv3 quantization auto setting for sm100 by @ispobock in #12778
  • chore: bump SGLang version to 0.5.5 by @sglang-bot in #12739
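A recurring theme in this release is deterministic inference (deterministic DeepSeek inference, batch-invariant RMSNorm and bmm ops, DeepGEMM determinism). As a minimal illustration of why kernels must pin their reduction order — this is generic background, not SGLang's actual implementation — floating-point addition is not associative, so a kernel that reduces in a different order depending on batch composition can return slightly different results for the same request:

```python
# Illustration only: floating-point addition is not associative.
# A batch-invariant kernel must fix its reduction order so the same
# request produces bit-identical results regardless of batching.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

Tiny per-element differences like this compound across layers and can flip sampled tokens, which is why determinism requires order-invariant kernels rather than just fixed seeds.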

New Contributors

  • @thelongestusernameofall made their first contribution in #11909
  • @LucaLow made their first contribution in #12071
  • @vipwangerxiao made their first contribution in #9501
  • @Johnsonms made their first contribution in #11936
  • @ash-sigh made their first contribution in #11047
  • @Kevin-XiongC made their first contribution in #12057
  • @kaln27 made their first contribution in #9403
  • @haichao592 made their first contribution in #12186
  • @satyamk7054 made their first contribution in #11142
  • @weireweire made their first contribution in #11708
  • @bmac3 made their first contribution in #12295
  • @Gao016 made their first contribution in #12313
  • @lianakoleva made their first contribution in #12269
  • @zyzshishui made their first contribution in #12144
  • @zejunchen-zejun made their first contribution in #11709
  • @AichenF made their first contribution in #11737
  • @JensenFire made their first contribution in #12319
  • @Chen-0210 made their first contribution in #12012
  • @popsiclexu made their first contribution in #12368
  • @lpc0220 made their first contribution in #11116
  • @elvischenv made their first contribution in #12307
  • @sjtushenhai made their first contribution in #12266
  • @LuYanFCP made their first contribution in #11757
  • @carolove made their first contribution in #12428
  • @Surya-Gunukula made their first contribution in #12226
  • @Alexhaoge made their first contribution in #11052
  • @JackChuang made their first contribution in #10078
  • @bingps made their first contribution in #12453
  • @hhaAndroid made their first contribution in #12367
  • @yeahdongcn made their first contribution in #12560
  • @akhilg-nv made their first contribution in #12044
  • @alisonshao made their first contribution in #12585
  • @soaringk made their first contribution in #12456
  • @ai-easy-cpu made their first contribution in #12486
  • @MoyanZitto made their first contribution in #12702
  • @wyx-0203 made their first contribution in #12631
  • @WazupSteve made their first contribution in #12565

Full Changelog: v0.5.4...v0.5.5
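One highlight above is the security fix blocking unsafe pickle deserialization (CVE-2025-10164, #11909). As a generic sketch of the standard mitigation pattern from the Python documentation — not the actual SGLang patch — a restricted unpickler overrides `find_class` so that payloads referencing importable callables (e.g. `os.system`) are rejected before anything executes:

```python
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to resolve any global reference, so a crafted payload
    that tries to import a callable such as os.system fails with an
    UnpicklingError instead of running attacker-controlled code."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is forbidden"
        )

def restricted_loads(data: bytes):
    """Deserialize plain-data pickles while rejecting all globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain containers and scalars carry no global references, so they
# still round-trip normally...
assert restricted_loads(pickle.dumps([1, 2, 3])) == [1, 2, 3]

# ...but a hand-crafted payload invoking os.system is blocked.
malicious = b"cos\nsystem\n(S'echo pwned'\ntR."
try:
    restricted_loads(malicious)
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

Real deployments typically allow-list a small set of known-safe classes in `find_class` rather than rejecting everything; the fix in this release takes the stricter route of blocking unsafe deserialization paths outright.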
