Highlights
- LoRA Weight Loading Overlap with Computation: Overlap LoRA weight loading with computation during inference, reducing TTFT by ~78% and TPOT by ~34.88% on large adapters: #15512
- TRT-LLM NSA Kernel Integration for DeepSeek V3.2: Integrate TRT-LLM NSA kernels for Native Sparse Attention, boosting DeepSeek V3.2 performance by 3x-5x on Blackwell platforms with trtllm for both --nsa-prefill-backend and --nsa-decode-backend (with minor accuracy drop): #16758, #17662, #18389
- Flashinfer All-to-All MoE Dispatcher: Add the Flashinfer all-to-all MoE dispatcher for efficient expert parallelism communication, enabling optimized routing in MoE models: #14668
- FA4 (FP4 Attention) Support for Multimodal Encoder: Introduce an FP4 attention backend and a variable-length attention function for multimodal encoders, enabling lower-precision inference for vision-language models: #13539
- Anthropic-Compatible API Endpoint: Add native Anthropic API compatibility to SGLang, allowing direct integration with tools and clients built for the Anthropic API format: #18630
- SGLang-Diffusion Advanced Optimizations: Production-ready improvements including token-level sequence sharding, parallel VAE decoding, fused kernels, Nunchaku and FP8 support, and multiple new models in the ComfyUI plugin: blog
- Spec V2 Critical Bug Fix: Fix an out-of-index bug caused by torch garbage collection in speculative decoding v2, improving the reliability of speculative verification: #18958
- Deploying DeepSeek on GB300 NVL72: Optimization work for long-context inference using prefill-decode disaggregation and other SGLang features on NVIDIA's latest GB300 platform: blog
- Bump AITER version to 0.1.10.post3: Support FP8 Prefill/Decode/KV Cache
- Commit-to-Version Lookup in docs.sglang.io: Easily find the earliest official version that includes a given PR or commit, streamlining release tracking for users and developers: #18450
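The new Anthropic-compatible endpoint accepts requests in the Anthropic Messages format. A minimal sketch of such a request follows; the port, endpoint path, and model name are assumptions for illustration, while the payload shape follows the public Anthropic Messages API that the endpoint mirrors:

```python
import json

# Anthropic Messages-format payload. SGLang's Anthropic-compatible endpoint
# is expected to accept this request shape.
payload = {
    "model": "my-local-model",  # placeholder: whatever model the server serves
    "max_tokens": 128,
    "messages": [
        {"role": "user", "content": "Say hello in one word."},
    ],
}

# POST this body to e.g. http://localhost:30000/v1/messages on a running
# SGLang server (for example via requests.post(url, json=payload)).
body = json.dumps(payload)
print(body)
```

Existing clients built against the Anthropic API should only need their base URL pointed at the SGLang server.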
New Model Support
- Kimi-K2.5: #17789, cookbook
- GLM-5: cookbook (still requires a custom Docker image for the transformers upgrade; an RC release will follow, since the transformers upgrade is risky)
- Qwen 3.5: #18489, #18926, #18937, cookbook
- MiniMax 2.5: cookbook
- Ernie4.5-VL: #15679
- Step3-VL: #17513
- Step-3.5-Flash: #18084, cookbook
- LLaDA 2.1: cookbook
- Ring 2.5 1T / Ling 2.5 1T: #18598, cookbook, cookbook
- MOVA (Diffusion): #17704
- GLM-OCR: #17582, cookbook
- DeepSeek-OCR-2: #17897
SGLang-Diffusion
- Support multiple new models in ComfyUI Plugin
- Parallel Folding and Parallel VAE Decoding for faster image/video generation
- Nunchaku and FP8 support for diffusion models
- Sequence Sharding (token-level) replacing Frame Sharding for improved efficiency
- LTX-2 support: #17495, #17496
- MOVA model support: #17704
- Cache-DiT optimizations and fused kernel improvements
- Numerous bug fixes and refactors across the diffusion pipeline
Performance
- Integrate TRT-LLM NSA kernels with up to 3-5x speedup on Blackwell: #16758, #17662, #18389
- LoRA weight loading overlap reducing TTFT by ~78%: #15512
- Flashinfer all-to-all MoE dispatcher: #14668
- FA4 for multimodal encoder: #13539
- Optimize GDN decode for Qwen3 Next: #17094
- Tune fused MoE kernels for Llama-4-Scout, MiniMax M2: #17891, #18851, #18833
- Symmetric memory pre-allocation to avoid fragmentation: #17089
- Optimize fused_moe triton kernel TMA: #18782
- Fused triton kernel for Ernie4.5-VL rotary embedding: #18856
- Support MxINT4 Flashinfer TRT-LLM MoE GEMM: #16892
- AITER bias MoE support for GPT-OSS MxFP4: #17735
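For the TRT-LLM NSA integration above, a launch command might look like the sketch below. The two `--nsa-*` flags and the `trtllm` value come from these notes; the module path and model identifier are assumptions:

```python
# Sketch of a server launch command enabling the TRT-LLM NSA backends for
# both prefill and decode. Model path is an assumed identifier.
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "deepseek-ai/DeepSeek-V3.2",
    "--nsa-prefill-backend", "trtllm",
    "--nsa-decode-backend", "trtllm",
]
print(" ".join(cmd))
```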
Prefill-Decode Disaggregation
- Support KV transfer with MORI-IO: #14626
- Mooncake intra-node NVLink KV transfer: #17866
- Improve KV offset calculation for MHA model with different TP size: #18163
- Document SGLANG_MOONCAKE_CUSTOM_MEM_POOL: #18259
Diffusion LLM (dLLM)
- Remove cuda graph batch size limitation: #17458
- JointThreshold algorithm for joint M2T and T2T decoding: #18171
- Basic dLLM scheduling strategy and implementation: #17484
Speculative Decoding
- Fix out-of-index bug caused by torch garbage collection in Spec V2: #18958
- Move forward timeout before verify to fix Eagle v1 filter mismatch: #18760
Dependencies
- Flashinfer updated to 0.6.3: #17700
- AITER updated to 0.1.10.post3: #18741
- Mooncake transfer engine updated to 0.3.9: #18316
AMD Hardware
- AITER updated to v0.1.10.post3 with FP8 Prefill, FP8 Decode, FP8 KV Cache support
- ROCm 7 standardization and ROCm 6.3 deprecation: #17785
- Kimi K2.5 Day 0 ROCm support: #17863
- FP8 prefill attention kernel integration: #18528
- Two-batch overlapping for MORI EP: #17953
- DeepSeek V3.2 and Kimi-K2 nightly CI tests: #17523
NPU/Ascend
- Support for MiniCPM3-4B: #16866
- Qwen 3.5 support on Ascend: #18544
- Accuracy improvements for StableLM-2: #17470
- Bug fixes for DeepSeek V3.2 and DeepSeek-VL2: #17007
CPU Backend
- Optimize Qwen3-Next model on CPU: #12525
- Optimize flash_attn_varlen_func: #15708
- Add INT4 kernels for CPU: #8226
Kernel Slimming
Documentation
- Add RL documentation: #17663
- Update torch compile description: #17819
- Refine spec decode docs for SpecV2/STANDALONE/NGRAM: #18321
- Consolidate diffusion documentation: #18095
What's Changed
- Update test README with CI registry documentation and 5090/H100 guidance by @alisonshao in #17368
- update dependence docs of npu by @amote-i in #17573
- [AMD] CI - migrate perf test and fix stage-b-test-1-gpu-amd by @yctseng0211 in #17340
- Skip mm feature pool init to avoid EPD OOM by @liusy58 in #16388
- Update mamba env setting by @ispobock in #17566
- [NPU]bugfix: fix for dsv3.2 and dsvl2 by @JiaruiChang5268 in #17007
- [AMD CI] Add 2-GPU sgl-kernel Tests by @bingxche in #17555
- Lazy import torchao by @merrymercy in #17626
- Re-enable unit-test-deepep-8-gpu and unit-test-backend-4-gpu-gb200 by @alisonshao in #17438
- fix gpt-oss launch failure with piecewise cuda graph by @zminglei in #17532
- [NPU] [CI] temporarily disable mtp test by @iforgetmyname in #17614
- [NPU] update doc for Ascend NPU by @Hexq0210 in #17621
- turn off dit_layerwise_offload for wan on rocm by @zyzshishui in #17569
- set cooldown_interval_minutes to 0 for liusy58 by @liusy58 in #17637
- Support symmetric memory pre-allocation to avoid fragmentation by @nvcastet in #17089
- [DeepSeek V3.2] Enable trtllm NSA with bf16 kvcache by @akhilg-nv in #16758
- add the fa4 mm backend and varlen func by @vincentzed in #13539
- [Refactor] Algebraic data type for nextn config + some basic refactors by @xyjixyjixyji in #17347
- [DLLM] Remove cuda graph batch size limitation by @btw616 in #17458
- Add return routed experts to the completions and chat/completions endpoints by @mansoor-s in #17434
- [MUSA][1/N] sglang.check_env by @yeahdongcn in #16959
- [MUSA][2/N] sgl-kernel build by @yeahdongcn in #17053
- fix post_residual_addition more generally by @nanjiangwill in #17286
- feature: adding openai compatible API request to bench_serving by @dougyster in #17219
- [NPU]support model MiniCPM3-4B for npu by @McZyWu in #16866
- [NPU] solve accuracy problem for stablelm-2-1-6b for npu by @McZyWu in #17470
- [Docker] Install cudnn==9.16 for cuda 13 image to avoid check error by @Fridge003 in #17668
- Refactor: Extract DeepSeek common utilities into shared module by @DotSlash-A in #16969
- [Diffusion] LTX-2 Support PR1 by @gmixiaojin in #17495
- [Diffusion] LTX-2 Support PR2 by @gmixiaojin in #17496
- fix: nightly wheel naming for non-post versions by @dougyster in #17538
- [JIT Kernel]Add Some CUDA Runtime API Wrapper for JIT Kernel Header by @HydraQYH in #17588
- Fix: mistake sigmoid in kda by @strgrb in #17508
- Use attn tp group in embedding for more models by @ispobock in #17570
- [Diffusion] Add diffusion time embedding to jit kernel by @BBuf in #17658
- Move fa4 from sgl-kernel to jit kernel by @BBuf in #17353
- add documentation example for LoRA overlap loading and cleanup unused function by @glenliu21 in #17464
- [Bugfix] fix TypeError when log-requests-level >=2 in prefill node warmup by @yunkchen in #17129
- [Kimi-Linear] Refactor Kimi-Linear to support RadixLinearAttention by @yuan-luo in #17506
- [NPU] torch_npu profiler tensorboard path type fix by @mengchengTang in #17545
- [NVIDIA] Add flashinfer all-to-all MOE dispatcher by @trevor-m in #14668
- Fix test timeout issue in pr-test by @Kangyan-Zhou in #17681
- Fix NSA indexer test and move it to pre commit test by @Kangyan-Zhou in #17682
- Temporarily disable lora overlap loading test due to flakiness by @Kangyan-Zhou in #17683
- fix: Refactor register_image_processor to use kwarg instead of positional arg by @JustinTong0323 in #17685
- [diffusion]: Fix ZImage SP sharding for caption and latent by @dutsc in #17301
- Fix slash command handler trigger condition by trimming the comments by @Kangyan-Zhou in #17691
- Add PyTorch .bin file validation to CI weight validation by @alisonshao in #17533
- [DeepSeek-V3.2] Fix TRT-LLM NSA in target_verify/draft_extend by @mmangkad in #17662
- Fix swa memory pool size with spec by @ispobock in #17630
- [Refactore] [CI] Remove redundant CI test runs step 2 by @Makcum888e in #17584
- revert row from #17584 by @Makcum888e in #17701
- [Refactor] Use is_in_ci() utility in JIT kernel benchmarks by @luke396 in #17118
- use published reasoning parser crate by @slin1237 in #17709
- update to use official openai protocol crate by @slin1237 in #17710
- remove self managed protocols as it has been replaced with official oai spec by @slin1237 in #17711
- [diffusion] refactor: remove useless lazy-import cache-dit codes by @mickqian in #17659
- Support mxint4 flashinfer_trtllm moe gemm by @HandH1998 in #16892
- A few updates to the night tests by @Kangyan-Zhou in #17694
- Add an all type in pyproject.tml to include diffusion support by @Kangyan-Zhou in #17697
- Extend b200 kernel tests timeout for CPU differences by @Kangyan-Zhou in #17718
- [misc] remove tool parser and tree benchmark as they are not meaningful atm by @slin1237 in #17719
- [misc] replace existing tool call code with new crate package by @slin1237 in #17720
- Upload nightly test metrics to GH artifacts by @Kangyan-Zhou in #17696
- Fix flaky streaming logprobs test by handling detokenizer text buffering by @Kangyan-Zhou in #17687
- [Bugfix]Repeated add modelslim quant_config and bugfix with "enable-piecewise-cuda-graph" on NPU by @chenxu214 in #17511
- Fix sgl-kernel install: fail instead of PyPI fallback when artifacts missing by @alisonshao in #17728
- Add EP=2 to qwen235b nightly tests by @Kangyan-Zhou in #17738
- Update nightly-test-nvidia.yml to remove push trigger by @Kangyan-Zhou in #17625
- remove self managed mcp as it has been replaced with official rmcp crate by @slin1237 in #17740
- [Kimi-Linear] Remove duplicated code in kimi-linear by @yuan-luo in #17731
- [NIXL] Add custom NIXL backend selection for KVManager by @zackyoray in #17146
- Merge performance/accuracy test suites into regular stage-b suites by @alisonshao in #17609
- remove self managed wasm as it has been replaced with official smg wa… by @slin1237 in #17746
- Exclude some diffusion package for ARM in docker release by @Kangyan-Zhou in #17745
- update wasm endpoint by @slin1237 in #17748
- [Fix] Pass missing backend argument in pipelines_core initialization by @Prozac614 in #17343
- remove multimodal as this is completely dead code by @slin1237 in #17750
- accuracy enhancement for baichuan2-13B for npu by @McZyWu in #16868
- Bump FI version by @shaharmor98 in #17700
- refactor mamba radix cache logic in server_args by @yizhang2077 in #17645
- [AMD CI] Add moonshotai/Kimi-K2-Instruct-0905 testcases by @sogalin in #17656
- [NPU]DeepSeek-V3.2 support npu mlaprolog by @lawtherWu in #15381
- Add test_gpt_oss_4gpu.py to B200 test suite by @alisonshao in #17743
- fix: move nightly whl to cuda version folder by @dougyster in #17762
- [NPU] Split pyproject npu from pyproject other by @Makcum888e in #17641
- Special logic for healthcheck by @whybeyoung in #17734
- [Docs] Add RL documentation by @zijiexia in #17663
- fix(processor): support InternS1 text_config in InternVL processor by @Mahdi-CV in #17040
- [bugfix] Internal processing of hf3fs crash # 16614 by @leihuang-sketch in #16938
- [diffusion] Support Qwen-Image, Multi-GPU Z-Image, and Enhanced ComfyUI Integration by @niehen6174 in #17678
- Support Kimi-K2.5 model by @yhyang201 in #17789
- [HiCache][HA 1/N] Support HiCache storage runtime attach/detach by @alphabetc1 in #15892
- fix: preserve disconnect events in api key middleware by @alphabetc1 in #17253
- [AMD] Update dsv3.2 AMD GPU docs and unify ROCm TileLang build by @hubertlu-tw in #17783
- [Bug Fix] Fix reasoning parser when continue_final_message=true by @laixinn in #17065
- [GLM-OCR] Support GLM-OCR Model by @zRzRzRzRzRzRzR in #17582
- fix(quantization): add sgl_kernel fallback for FP4 quantize on Blackwell GPUs by @MikkoParkkola in #17816
- [Doc] Update description on torch compile by @Fridge003 in #17819
- [NPU] Adapt cann 8.5: use sfa and lightning indexer op from cann and CI update by @monkeyLoveding in #17615
- [DeepSeek] Update tests and document for DeepSeek V3.2 NVFP4 checkpoint by @Fridge003 in #17657
- [Diffusion] dit-precision refactor by @fsygd in #17751
- Make flashMLA work on: Cu13, B300 by @vincentzed in #17600
- [hybrid-model] clean up and consolidate redundant fields in RadixLinearAttention by @zminglei in #17660
- Pass GPU ids to kill specified devices in script. by @hnyls2002 in #17840
- [AMD] Deprecate ROCm 6.3 artifacts and standardize gfx942 on ROCm 7 by @hubertlu-tw in #17785
- [Diffusion] glm-image apply flashinfer rope by @BBuf in #17689
- [diffusion] fix: fix suppressing error log on non-main ranks by @mickqian in #17712
- [diffusion] feat: add an arg for controlling the number of prefetched layers in Layerwise-offload by @mickqian in #17693
- [diffusion] Fix vertex generate by @yashikagandhi-google in #17611
- fix: add bias when enable mm fallback variant by @gongyisheng in #17690
- [AMD] CI - enable deepseekv3.2 on MI325-8gpu and merge perf/accuracy test suites into stage-b suites by @yctseng0211 in #17633
- [DSv32] Overlap indexer qk projection and activation quant by @zianglih in #17688
- [Diffusion] Delete sgl-kernel outdated time_embedding kernel by @BBuf in #17278
- Add a performance dashboard server and frontend for nightly CUDA tests by @Kangyan-Zhou in #17725
- [diffusion] doc: fix wrong docker run command by @mickqian in #17856
- [JIT kernel] Update jit_kernel cache and develop doc by @BBuf in #17842
- [AMD] Add Kimi-K2, DeepSeek-V3.2 tests to nightly CI by @michaelzhang-ai in #17523
- [diffusion] comfyui: fix import typo by @triple-mu in #17834
- [AMD][Kimi K2.5 Day 0] ROCm: route W4A16 MoE to Triton and fix packed-weight loading by @jhinpan in #17863
- [MUSA][7/N] Enhance CUDA / PyNccl wrapper to support MTLink connectivity detection by @gingerXue in #17499
- [Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel by @zhendonghua in #17891
- Make the functions in logits_processor.py and sampler.py more modular by @merrymercy in #17885
- [Diffusion] Support MOVA model by @CloudRipple in #17704
- [JIT Kernel]Support fused_add_rmsnorm in JIT Kernel by @HydraQYH in #17677
- [Fix][trtllm-mha] Canonicalize the strides when num_head = 1 by @xyjixyjixyji in #17732
- Integration mori backend for EP a2a data communication by @kkHuang-amd in #17012
- feat: add custom request header logging by @joearedmond in #17786
- update ascend docs by @amote-i in #17741
- [FIX] kimi_k2 reasoning parser by @JustinTong0323 in #17901
- Fix flaky tool calls in the Kimi K2.5 model by @JustinTong0323 in #17914
- [MUSA][4/N] Add common device utilities, distributed backend, and custom op wiring by @yeahdongcn in #17246
- [PD] Support KV transfer with MORI-IO by @maning00 in #14626
- [Diffusion][MOVA] fix: resolve library mismatch in scheduler and update dit offload method name by @CloudRipple in #17916
- [diffusion] model: move tp_rmsnorm check to WanTransformerBlock by @triple-mu in #17792
- Add aiter bias moe support in gpt-oss mxfp4 model by @kkHuang-amd in #17735
- [diffusion]: align sglang diffusion AMD pyproject_other.toml diffusion dependency with pyproject.toml by @ZiguanWang in #16225
- [wip] sync with upstream zImage by @yhyang201 in #17822
- Add mxfp8 support for online quantization, Triton dense linear, and CUTLASS MoE by @zianglih in #17449
- Support LightOnOCR-2-1B by @shvmjndl in #17806
- [diffusion]: add dummy device attribute to fix AttributeError by @Ratish1 in #17949
- Add tool call tests for DeepSeek V3.2 in nightly CI by @harvenstar in #17951
- [MUSA] Add labeler config by @yeahdongcn in #17923
- Fix torch.__version__ for PEP440 by @EduardDurech in #15682
- Fix capture_sizes range for pcg by @ch-wan in #17956
- Fix logprob_start_len handling for prefill-only requests by @ch-wan in #17395
- feat: add forward timeout by @zhooooong in #17831
- [AMD] fix pip sglang version by @yctseng0211 in #17950
- Add concurrency tracking to runner utilization report by @Kangyan-Zhou in #17963
- Support DeepSeek-OCR-2 in SGLang (OCR2 vision pipeline, tokenization alignment, and weight loading fixes) #17833 by @baonudesifeizhai in #17897
- add weightless qk norm to RMSNorm interface for Llama 4 by @b8zhong in #12813
- GPTJForCausalLM Support by @wenchen76 in #7839
- [Fix] Remove unused Type import in gpt_j.py by @Kangyan-Zhou in #17975
- Fix the scenario where eh_proj is quantized in the bailing moe nextn weights by @LHXuuu in #17808
- [Intel GPU] fix device in DeepseekScalingRotaryEmbedding to run DeepSeek-V2-Lite BF16 on XPU by @polisettyvarma in #10021
- Fix prefill latency performance drop of bench serving by @gaopengff in #14592
- [Intel GPU] fix import error to run DeepSeek-V2-Lite model with BF16 on XPU by @polisettyvarma in #10858
- [CPU] Optimize Qwen3-next model on CPU by @jianan-gu in #12525
- [CPU] optimize flash_attn_varlen_func by @mingfeima in #15708
- [CPU][INT4] Add INT4 kernels for CPU by @jianan-gu in #8226
- fix(benchmark): add missing args for speculative decoding benchmark by @cswuyg in #17974
- [NPU] enhance accuracy for model kimi-vl-a3b-instruct by @McZyWu in #17480
- adapt MODELSCOPE download by @Hide-on-bushsh in #17922
- Increase install dependency timeout for gb200 by @Kangyan-Zhou in #17977
- SGLang Tracing: Improve root span attributes by @zhanghaotong in #17008
- Add cuda graph status to prefill log by @ispobock in #17836
- Fix SHM pointer re-serialization in DP attention. by @FlamingoPg in #17930
- update npu docs by @amote-i in #17987
- [Model] Add K-EXAONE model support by @xvyaward in #16294
- [BUGFIX] Fix dp size > 1 for qwen3 vl model by @zju-stu-lizheng in #17624
- [Diffusion] Fix lora default lora_scale bug by @BBuf in #17982
- Optimize GDN decode for Qwen3 Next by @samuellees in #17094
- [BugFix] Fix server crashes when req.grammar and ngram spec are enabled by @SYChen123 in #17585
- [NPU] support llama-3.2-11B-vision-instruct mode for NPU by @JiaruiChang5268 in #17492
- [sglang] fix mm token padded value overlap with text token id by @bixue2010 in #17781
- doc update for CANN version by @wangtiance in #18014
- [NPU] fix sgl-kernel-npu package url error in npu.Dockerfile by @22dimensions in #18017
- Add ROCm + Mori docker build instructions in rocm.Dockerfile by @kkHuang-amd in #18018
- [Diffusion] Fix FLUX.1-schnell time embedding argument mismatch by @BBuf in #17988
- Fix cuBLAS >=12.9 detection for cu12/cu13 package naming by @mmangkad in #17766
- Fix .gitignore may ignore files like core_attention.py by @yeahdongcn in #18021
- 【docs】【NPU】Update Expert Parallelism docs for Ascend NPU by @husf1130 in #17940
- add reasoning_tokens usage test for tool call by @harvenstar in #18022
- Reduce topk kernel shared memory from 128KB to 32KB for better occupancy by @hammersam in #17747
- Fix OOM in DeepSeek weight loading by deferring dict(weights) materialization by @hsuchifeng in #17744
- [EPD][Perf] parallelize ZMQ send for encode server by @ZhengWG in #16487
- [Fix] Triton TP MoE Dpsk V3/Qwen3 Coder with SwapAB by @b8zhong in #17965
- Add launch_command assignment in crash dump by @merrymercy in #17967
- [diffusion] refactor: split component_loader into component-wise files by @mickqian in #17820
- [Fix] Revert back to using CUTLASS mm_fp4 backend by @b8zhong in #17369
- [MUSA] Update 3rd party dir to build/_deps by @yeahdongcn in #18035
- [CPU] toml file update by @ZailiWang in #17861
- Update python/sglang/README.md by @haojin2 in #18045
- [Performance] Optimize Mllama LayerNorm -> Upd by @vincentzed in #9725
- Fix: Remove duplicate assignment for use_w4afp8 by @tianchongchong in #17858
- [Perf] Add Flashinfer DeepGEMM SM90 for SwapAB Optimization by @b8zhong in #15514
- feat: validate ib devices in server args by @acelyc111 in #17598
- Improve error output in nightly tests by @Kangyan-Zhou in #18053
- Skipped warning on sm100 by @mattteochen in #18000
- Fix rerun stage command with merged commit history by @Kangyan-Zhou in #17960
- [BugFix] Fix draft model specified config file by @khalil2ji3mp6 in #17815
- Set torch url index in pyproject.toml by @Fridge003 in #16802
- [metric] Optional extra metric labels by @yinghai in #18049
- [BugFix] fix gpt-oss accuracy issue when enabling piecewise cuda graph by @zminglei in #18013
- Fix swa kv cache memory allocation by @ispobock in #18039
- Disable test_mla_int8_deepseek_v3.py temporarily by @alisonshao in #18057
- Migrate 4-GPU/8-GPU workflow jobs to stage-c and add CI registry decorators by @alisonshao in #17299
- Fix installation script for H200 runners by @Kangyan-Zhou in #18050
- [Bugfix] fix the display error (inconsistent context) by @lingebeng in #17699
- [model] Support MiniCPM-V 4.5 by @tc-mb in #9610
- Fix Diffusion Request Validation to allow missing input artifacts if the input only contains text by @Kangyan-Zhou in #16610
- Optimizing all_reduce in RMSNormTP in minimax_m2 by @rogeryoungh in #16483
- Fix CUDA 12 dependency when importing Mooncake in official CUDA 13.x image by @ZhenshengWu in #17540
- [Feature] Support file:// URL format for multimodal inputs by @ppraneth in #14490
- support qwen3-next eagle3 by @sleepcoo in #14607
- feat: Add Ling Flash v2.0 support for Eagle3 by @yefei12 in #15119
- Move deleted 8-GPU tests to test/manual/ by @alisonshao in #18060
- Reset evict swa status when retract by @ispobock in #18059
- [NPU] disaggregation_decode_enable_fake_auto parameter adaptation by @Estrella-xx in #17811
- [NPU] support the Enable return routed experts by @jiashaokun-1 in #17025
- [VLM] Optimize get_rope_index for GLM4v by @yuan-luo in #17420
- Optimize custom-all-reduce by @yuan-luo in #17674
- [BUGFIX]: using language-only should not reserve space for the vision encoder by @koush in #18011
- [TestFix] rewrite LoRA overlap loading tests by @glenliu21 in #18047
- [Fix] Remove no use code in MiMo-V2-Flash by @yuan-luo in #18051
- Fix: Avoid Double Reduce in VLM DP Attention by @yhyang201 in #17991
- [diffusion] cli: introduce generic attention backend configuration in ServerArgs by @mickqian in #18036
- [diffusion] fix: fix missing component names for VAELoader by @mickqian in #18069
- Add bootstrap_room validation to detect metadata corruption in PD disaggregation by @Simon-Li in #17430
- [AMD] Fix aiter version in rocm image by @yctseng0211 in #18076
- Refine logprob logic for request handling by @ch-wan in #17986
- [EPD][refactor]: introduce BaseMMReceiver for gRPC transport integration by @liusy58 in #17921
- Improve Per Commit Test job filtering for sglang-kernel by @Kangyan-Zhou in #18054
- fix: zmq_to_tokenizer encoder transfer when host listens to 0.0.0.0 by @RangerCD in #17929
- fix: correct weight loading prefix mapping for Qwen3-VL by @Lollipop in #18024
- [diffusion] CI: deprecate WarmupRunner in CI by @yingluosanqian in #18038
- [AMD] enable MoRI to release and nightly builds by @HaiShaw in #18101
- [Diffusion] Fix Ring Parallel bug with FA4 by @BBuf in #18062
- support smem in per_token_quant_fp8 kernel by @zhangxin81 in #16725
- [Diffusion] remove accelerate dependency for device mapping by @RubiaCx in #18026
- [NPU]mindspore model support moe by @zzzzzzzxh in #15363
- docs: move deepseek_ocr to popular model usage and add cookbook reference by @sglang-bot in #18120
- [NPU] update nightly tests by @Sugar920 in #17952
- add Step-3.5-Flash model support by @yhyang201 in #18084
- [DeepSeek V3.2] [Bugfix] slice indexer and padding fa3 when can not run cuda graph by @xu-yfei in #17076
- [NPU] support dsv32 radixcache on ascend by @khalil2ji3mp6 in #17964
- [MiMoV2Flash] [feat]: support two batch overlap by @TZHelloWorld in #17634
- [Fix] data race in req_to_token pool by @cctry in #17850
- Re-enable test_mla_int8_deepseek_v3.py after HF token fix by @alisonshao in #18123
- feature: adding gpt-oss 120b nightly test by @dougyster in #18134
- [Performance] Optimize radix cache eviction performance by @YiXR in #14339
- [Diffsuion & JIT_kernel] QKNorm cross heads kernel by @BBuf in #18073
- [HiCache]: Support DeepSeek v32 cpu offloading by @hzh0425 in #17415
- [diffusion] UX: improve logging by @mickqian in #18122
- [Move sgl-kernel Kernel to JIT] Add JIT concat MLA kernels by @celve in #17889
- Add triton_fused_moe config for GLM-4.7-FP8 tp8 H20 H20-3e by @HanHan009527 in #18091
- [Diffusion] fix serving image_edit get input image bug by @BBuf in #18109
- MoE Refactor: Refactor modelopt_quant.py -> flashinfer_trllm.py by @b8zhong in #16685
- Support Markdown/Notebook-Friendly Documentation Export for Downstream Integration by @klhhhhh in #18131
- [TestFix] use unit tests for LoRA overlap loading tests by @glenliu21 in #18140
- [NVIDIA] Add --top-k argument to run_eval.py by @kaixih in #18025
- Gigachat 3 tool parser and tests by @ajpqs in #14765
- [HiCache] fix: apply extra_backend_tag in Mooncake batch_exists by @00fish0 in #17265
- [Perf] Use safetensors load_file in multithread loader by @mmangkad in #18124
- [Docker] Remove hardcoded America/Los_Angeles timezone, default to UTC by @mmangkad in #18121
- AMD PD/D PR ci by @Lzy17 in #17183
- Warmup before profiling prefill latency for dynamic chunk sizing by @xiaoweiw-nv in #17198
- [PD] feat: support mooncake intra-node nvlink kv transfer by @TTThanos in #17866
- [Bugfix] Fix Mistral Large 3 NVFP4 TRTLLM MoE by @elvischenv in #18065
- fix: add cu13 dev container to our release by @ishandhanani in #18192
- Revert broken sgl_kernel exclusion patterns in paths-filter by @Kangyan-Zhou in #18193
- enable ut test for xpu devices by @DiweiSun in #11712
- [HiCache] feat: Add detailed cache hit breakdown for HiCache in sglext and Prometheus metrics by @vladnosiv in #17648
- [Diffusion] Only import sgl_kernel in custom op cuda path (SiluAndMul and RMSNorm) by @yeahdongcn in #15592
- [diffusion] hardware: support diffusion models on MTGPU (multi-GPU, 5/N) by @yeahdongcn in #17318
- [diffusion] hardware: support diffusion models on MTGPU (doc, 6/N) by @yeahdongcn in #17346
- add streaming parallel tool call test case by @harvenstar in #18097
- Update weight rename check for Qwen3 Embeddings by @satyamk7054 in #17535
- fix: ensuring nightly whls are tagged with latest commit by @dougyster in #18204
- [diffusion] fix server cache-dit bug under continuous dynamic requests by @nono-Sang in #17140
- [Docs] fix readme typo by @kuafou in #18207
- Fix Session for multimodal and expose it through Engine by @aurickq in #18152
- fuse qkvbfg linear into one gemm and f_b g_b into batched gemm. by @strgrb in #17801
- fix: bumping nightly whl version by @dougyster in #18212
- Support Markdown/Notebook-Friendly Documentation Export for Downstream Integration (copy all markdown and rst files) by @klhhhhh in #18223
- [diffusion] kernel fusion: gated residual layernorm scale shift and layernorm scale shift kernel fusion for Qwen-Image, WAN and HunyuanVideo by @jianyingzhu in #14717
- [DeepGemm] Add a flag for fast warmup by @Fridge003 in #18111
- [RadixTree][5/N Refactor]: Introduce pre- and post-processing methods for key matching by @hzh0425 in #18147
- Moving _alloc_extend_naive out of npu allocator by @ch-wan in #18200
- [Diffusion] update code owner by @BBuf in #18247
- [Diffusion] Downgrade prompt log from info to debug. by @Evrard-Nil in #17813
- Make sure we always disable symm memory without dp padding by @nvcastet in #18129
- optimize get_topk_ragged by fusing get k and k_scale triton kernel by @BJWang-ant in #16043
- [diffusion][mova] clean codes by @CloudRipple in #18107
- [diffusion] fix the bug of redundant memory usage on GPU-0 by @nono-Sang in #18221
- Support passing spaces_between_special_tokens per request by @RunningLeon in #17939
- support interns1-pro by @RunningLeon in #18145
- [diffusion] refactor: move model_stages into stages folder by @mickqian in #18248
- [AMD] Add kimi mi35x nightly test, folder organization and several stability fixes by @michaelzhang-ai in #17895
- fix: fix MockModelRunner in attention tests by @zack041 in #18240
- Add MoE fused config for Qwen3-Coder-Next-FP8 on H100 TP=2 by @mmangkad in #18195
- [Bugfix] fix a obvious logic error by @lingebeng in #18254
- fix: add SGLANG_IS_IN_CI env var to release-docs workflow by @zwang86 in #18225
- fix kimi k2.5's moe gemm config init by @cicirori in #18064
- [diffusion] chore: forbid Chinese characters by @mickqian in #18249
- [PD] improve kv offset calculation for MHA model with different tp size by @Ch3ngY1 in #18163
- [PD] doc: Document SGLANG_MOONCAKE_CUSTOM_MEM_POOL and supported values by @stmatengss in #18259
- [docs] fix misspellings & typos by @app/ in #18276
- Support Markdown/Notebook-Friendly Documentation Export for Downstream Integration(convert rat files to md files and save) by @klhhhhh in #18278
- Fix test_return_routed_experts to use response-level sglext by @alisonshao in #18274
- [Diffusion] Support layerwise offload for mova by @BBuf in #18272
- [XPU] Integrate MoE and minor improvements in XPU attention backend by @airMeng in #13561
- [FIX] Always support TP > 4 for FP4 Gemm by @danielafrimi in #17300
- [piecewise graph]: support MiniMax-M2 by @hzh0425 in #18217
- [PD] Minor code cleanup for mooncake backend by @ShangmingCai in #18279
- docker: add patch to increase GPU deepep timeout by @ishandhanani in #18298
- [diffusion][hot fix] fix accuracy bug caused by PR 14717 by @yingluosanqian in #18296
- [Kernel] Add JIT apply_rope_with_cos_sin_cache_inplace by @pansicheng in #18155
- throw error if got adapter with added_tokens by @glenliu21 in #18046
- [diffusion] feat: allow T5's TP Group to reuse the transformer's SP Group by @nono-Sang in #17818
- NixlKVManager optimizations by @ovidiusm in #17654
- Fix flaky test_frequency_penalty_reduces_word_repetition by using deterministic seeds by @alisonshao in #18285
- Refactor(qwen3-vl) optimize position encoding interpolation by @aaaandychen in #16781
- [Doc] refine spec decode docs for SpecV2/STANDALONE/NGRAM by @alphabetc1 in #18321
- [Doc] add a summary section for spec decode document by @alphabetc1 in #18323
- [Kernel] Migrate GPTQ-Marlin GEMM kernel to JIT by @celve in #18067
- fix npu best practice by @amote-i in #18330
- [diffusion][hot fix] fix torch.compile graph break caused by torch._dynamo.disable by @yingluosanqian in #18336
- add hicache jit test by @XucSh in #17847
- [diffusion] fix: offload text encoder model in image encoding stage by @xiaoyewww in #18317
- Add Nemotron 3 Nano tests by @shaharmor98 in #18119
- Add CI permission for Shunkangz, dongjiyingdjy, samuellees by @Fridge003 in #18377
- [Docs] Add Falcon H1, Hunyuan-Large, Qwen3-Omni support and update Diffusion usage by @pokymono in #17888
- add hybrid model PD to NIXL connector by @nealvaidya in #16229
- Merge stage-c-test-large-4-gpu suites into partitioned suites by @alisonshao in #18325
- Revert "[Build] Enable full kernel in aarch64 wheel" by @Fridge003 in #18385
- [Qwen3Next] Optimize fused_sigmoid_gating_delta_rule_update_kernel by @hlu1 in #18271
- Support execute_shell_command for env var support by @zhaochenyang20 in #18390
- [NPU] update npu doc by @Hexq0210 in #18344
- [Diffusion] Apply fused_norm_scale_shift to MOVA by @BBuf in #18257
- [Doc] Update CUDA 13 install guide to install torch first by @mmangkad in #18404
- Remove unnecessary norm_type argument from GLM-Image dits by @haojin2 in #18382
- [Doc] Fix outdated `--fp4-gemm-backend` documentation by @mmangkad in #18350
- [diffusion] fix: respect dist_timeout option by @mickqian in #18386
- [diffusion] feat: support saving videos directly on the server to avoid the overhead of tensor transfer by @nono-Sang in #18253
- [Kimi-K2.5] Fix NVFP4 Kimi-K2.5 weight mapping and exclude list by @mmangkad in #18370
- [NPU][diffusion] model: support WAN/FLUX/Qwen-Image/Qwen-Image-edit on Ascend by @Makcum888e in #13662
- [Fix] Fix backend selection after flashinfer version update by @DarkSharpness in #18364
- fix: sync server_args.kv_cache_dtype when detecting FP8 KV cache by @zack041 in #18394
- [diffusion] feat: support efficient sequence shard by @nono-Sang in #18161
- [ModelOpt] Fix broken Qwen3-235B-A22B-Instruct-2507-NVFP4 launch by @vincentzed in #18189
- [diffusion] refactor: group component loaders under the component_loaders/ directory by @mickqian in #18438
- Fix TRT-LLM MLA backend applying k_scale to BF16 KV cache in BMM1 by @debo3 in #18396
- [diffusion] chore: revise process title by @mickqian in #18446
- Add tensor parallelism support to LFM2 ShortConv layers by @tugot17 in #17777
- [Kimi-K2.5] Fix missing `quant_config` in `KimiK25` by @mmangkad in #18440
- Update author information in pyproject.toml by @merrymercy in #18453
- [ModelOPT] Support Qwen 3 Next Coder NVFP4 by @vincentzed in #18224
- Refactoring Mooncake TE as a shared distributed component by @ShangmingCai in #17810
- [BugFix][PD]Fix metadata_buffer_index leak when aborted in PD by @ZhengWG in #17483
- fix: fix the wrong return value type of draft model runner by @acelyc111 in #18105
- fix: use --no-build-isolation for human-eval install by @harvenstar in #18455
- [AMD] Update aiter to v0.1.10.post2 by @bingxche in #18423
- [AMD] CI - Fix AMD daily image release and install dependency by @yctseng0211 in #18452
- [DLLM] Add JointThreshold algorithm for joint M2T and T2T decoding by @edwardzjl in #18171
- Make compressed-tensors MoEs support ignored layers by @LHXuuu in #17828
- Revert "optimize get_topk_ragged by fusing get k and k_scale triton kernel" by @Fridge003 in #18471
- feat: Add ModelScope support for multimodal_gen models by @yrk111222 in #17924
- [diffusion] chore: fix unclean shutdown and resource leaks by @mickqian in #18477
- [Feature] Support bidirectional attention for Gemma-3 by @zzhbrr in #10707
- Pass `quantize_config` to `_initialize_model` by @klshuster in #18273
- Fix MMLU benchmark to auto-download data and resolve path issue by @JustinTong0323 in #18486
- [MODEL] Adding Support for Qwen3.5 Models by @zju-stu-lizheng in #18489
- [AMD] add amd ci monitor by @bingxche in #17476
- feat(kv-events): Add medium field to KV event types for storage tier tracking by @ishandhanani in #18205
- docs: expand and update modelopt documentation by @zack041 in #18479
- Add cache_config_info metric. by @kartikx in #17273
- [HiCache][PP] add test case for compatibility by @stmatengss in #16395
- Fix idle batch predict dtype in spec v2 by @Qiaolin-Yu in #18379
- Make bench_one_batch_server compatible for more backends by @maocheng23 in #18512
- [EPD] Add notification mechanism to fix server hang and add timeout env var by @liusy58 in #18229
- Deepseekv32 compatibility with transformers v5 by @JustinTong0323 in #18297
- Support GlmMoeDsaForCausalLM by @JustinTong0323 in #18521
- [AMD] Turn on aiter-prebuild by @yctseng0211 in #18425
- [HiCache] fix: StorageMetricsCollector was initialized twice by @alphabetc1 in #18354
- [DLLM] Basic dLLM scheduling strategy and implementation by @ClawSeven in #17484
- [diffusion] feat: support parallel wan-vae decode by @nono-Sang in #18179
- [NPU] [CI] Enable run multimodal NPU CI when changes only in multimodal_gen by @Makcum888e in #18523
- [diffusion] fix: fix fsdp by @mickqian in #18187
- [sgl-kernel] upgrade deepgemm by @BBuf in #18362
- [NPU][docs] improve docs for Best Practice on Ascend NPU by @husf1130 in #18360
- [NPU] update npu doc by @Hexq0210 in #18474
- fix(config): Support setting Mamba state dtype via config file by @zju-stu-lizheng in #18532
- [NPU][docs]fix bug about hyperlink for best practice for ascend npu by @husf1130 in #18561
- Revert "[sgl-kernel] upgrade deepgemm" by @Fridge003 in #18562
- Tilelang sparse decode fwd for dsv32 mi355 by @1am9trash in #18488
- Fix radix cache key to include generated tokens in multi-turn (regression) by @ycchen-tw in #16521
- Fix wrong prefill log. by @hnyls2002 in #18570
- [Doc] Comprehensive Guide: Navigating DP, DPA, and SMG Best Practices by @zhaohaidao in #18096
- Enhance SMG guide with RL rollout systems benefits by @zhaochenyang20 in #18588
- Add cache hit rate UT by @hnyls2002 in #18566
- [AMD] Fix Janus-Pro crash and add Kimi-K2.5 nightly test by @michaelzhang-ai in #18269
- Fix Bug on dsv3.2 by @BourneSun0527 in #18553
- Fp8 prefill attn kernel integration by @1am9trash in #18528
- Register cp-atten-allgather buffers with symm memory by @wangfakang in #17756
- [NPU] support model skywork-reward-gemma2-2-27B-v0.2 by @McZyWu in #16947
- [V3.2] Change default CP token split method to `--round-robin-split` by @Fridge003 in #18613
- add support to enable lora with embedding models by @vedantjh2 in #17780
- Fix prefill stats for dllm by @ispobock in #18632
- Add LMF2 MoE model architecture by @tugot17 in #17997
- Clean up noisy startup log messages and refactor loader.py by @merrymercy in #18531
- [diffusion] docs: consolidate diffusion documentation into docs by @qianyue76 in #18095
- [PCG] GPT OSS Triton Kernel Support by @Oasis-Git in #18405
- [Bugfix] fix config bug caused by PR #18273 by @1195343015 in #18535
- Avoid kimi linear stream sync by @vincentzed in #16186
- Add CI permission for Chen-0210 by @Chen-0210 in #18494
- glm5 md by @liupeng374 in #18655
- [diffusion] fix: webui cannot correctly display generated video using wan2.2 by @yeahdongcn in #18473
- List more CI runs for `pr-test` by @hnyls2002 in #18650
- update glm5 readme on npu by @xiaobaicxy in #18657
- fix the max-parallel for `/rerun-stage` by @hnyls2002 in #18658
- Update modelopt quantization config parsing by @Edwardf0t1 in #13919
- [Mamba] Add float16 support for SSM cache dtype by @danielafrimi in #18444
- Try fix the max-parallel for manually triggered test again. by @hnyls2002 in #18686
- Update ci permission by @ispobock in #18693
- [AMD] rocm 7.2 image release, PR test, Nightly Test by @yctseng0211 in #17799
- [Flashinfer Autotune] Fix FlashInfer FP4 MoE autotuning crash by removing incorrect flatten on hidden_states_scale by @YAMY1234 in #18500
- [Qwen3_5] Refactor `Qwen3_5ForCausalLMMTP` class implementation by @zju-stu-lizheng in #18538
- Update README commands to include model-path option by @wplf in #18557
- fix: /metrics endpoint always reports engine_type="unified" in PD disaggregation mode by @2JooYeon in https://github.com/sgl-project/sglang/pull/18552
- [Z-Image] Replace TextEncoderConfig with Qwen3TextConfig by @rootonchair in https://github.com/sgl-project/sglang/pull/18560
- [AMD] Enable release image build for ROCm 7.2.0 by @akao-amd in https://github.com/sgl-project/sglang/pull/18698
- [Ascend] Support qwen3.5 by @chenxu214 in https://github.com/sgl-project/sglang/pull/18544
- [AMD] reset AMD image release time and reduce CI queue time by @yctseng0211 in https://github.com/sgl-project/sglang/pull/18707
- [AMD] Fix accuracy issue when running TP4 dsv3 model with mtp by @1am9trash in https://github.com/sgl-project/sglang/pull/18607
- add tool_choice=auto nightly test case by @harvenstar in https://github.com/sgl-project/sglang/pull/18302
- Make PR based docker and pypi workflow work for forked PR by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/18720
- Fix flaky penalty tests by using higher temperature for effect comparison by @alisonshao in https://github.com/sgl-project/sglang/pull/18380
- Add `spec_accept_histogram` request statistic by @scottjlee in https://github.com/sgl-project/sglang/pull/18332
- refactor: replace local proto compilation with smg-grpc-proto package by @slin1237 in https://github.com/sgl-project/sglang/pull/18682
- [BUGFIX] fix bug in handle mamba radix cache in server_args by @yizhang2077 in https://github.com/sgl-project/sglang/pull/18723
- Fix B200 installation issue by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/18725
- refactor: consolidate gRPC client into shared crate dependency by @slin1237 in https://github.com/sgl-project/sglang/pull/18730
- speed up sgl-kernel build by @BBuf in https://github.com/sgl-project/sglang/pull/18586
- fix: image version in pypi pr workflow by @dougyster in https://github.com/sgl-project/sglang/pull/18735
- refactor: remove crate re-export aliases from lib.rs by @slin1237 in https://github.com/sgl-project/sglang/pull/18737
- Reuse initialized transfer engine in mooncake store by @ShangmingCai in https://github.com/sgl-project/sglang/pull/18460
- Build ROCm7.2 Image with latest AITER v0.1.10.post3 by @HaiShaw in https://github.com/sgl-project/sglang/pull/18741
- [Diffusion] [BUG] Fix missing initialization of GLM-Image text encoder config by @haojin2 in https://github.com/sgl-project/sglang/pull/18704
- Fix invalid import paths in glm_image.py by @alisonshao in https://github.com/sgl-project/sglang/pull/18757
- Revert changes to weight_utils.py by @merrymercy in https://github.com/sgl-project/sglang/pull/18759
- feat: support release lookup by @alphabetc1 in https://github.com/sgl-project/sglang/pull/18450
- Modify glm5 readme on npu by @BourneSun0527 in https://github.com/sgl-project/sglang/pull/18768
- [AMD] Fix Multimodal Test 1 GPU by @bingxche in https://github.com/sgl-project/sglang/pull/18716
- [diffusion] Allows quality adjustment of generated images/videos through requests. by @IPostYellow in https://github.com/sgl-project/sglang/pull/17937
- [BUG] fixed local model loading issue in multimodal generation test by @blazingbhavneek in https://github.com/sgl-project/sglang/pull/18687
- [Kernel] Add JIT rotary_embedding_kernel by @pansicheng in https://github.com/sgl-project/sglang/pull/17934
- [Spec] Move forward timeout before verify to fix Eagle v1 filter mismatch by @hnyls2002 in https://github.com/sgl-project/sglang/pull/18760
- [diffusion] feat: support tp for qwen-image-edit-2511 by @xiaoyewww in https://github.com/sgl-project/sglang/pull/18619
- Rename request timeout env vars for waiting/running stages by @hnyls2002 in https://github.com/sgl-project/sglang/pull/18766
- [Bugfix] Add warnings when NSA indexer cache indice mismatch in PD module by @ShangmingCai in https://github.com/sgl-project/sglang/pull/18727
- Support LingV2_5 model by @ant-yy in https://github.com/sgl-project/sglang/pull/18598
- [diffusion] feat: support SparseVideoGen2 attention backend by @tie-pilot-qxw in https://github.com/sgl-project/sglang/pull/17507
- [schedule] Fix streaming return of customized_info by @yinghai in https://github.com/sgl-project/sglang/pull/18654
- Cleanup unused rerun stages by @ispobock in https://github.com/sgl-project/sglang/pull/18788
- Adjust mamba cache allocation by @ispobock in https://github.com/sgl-project/sglang/pull/18786
- Enhance gsm8k test by @ispobock in https://github.com/sgl-project/sglang/pull/18791
- Cleanup debug log for Ring model by @ispobock in https://github.com/sgl-project/sglang/pull/18793
- Added cuda availability guard by @mattteochen in https://github.com/sgl-project/sglang/pull/18480
- [diffusion] refactor: merge redundant default_dtype and param_dtype parameters in FSDP loader by @mickqian in https://github.com/sgl-project/sglang/pull/18789
- [diffusion] fix: webui task_type check by @yeahdongcn in https://github.com/sgl-project/sglang/pull/18462
- [diffusion] fix typo by @triple-mu in https://github.com/sgl-project/sglang/pull/18790
- [diffusion] chore: use batched P2P ops in VAE parallel decoding by @mickqian in https://github.com/sgl-project/sglang/pull/18728
- [Kernel Slimming] Migrate GPTQ-Marlin repack kernel to JIT by @celve in https://github.com/sgl-project/sglang/pull/18543
- refactor context parallel state by @dongjiyingdjy in https://github.com/sgl-project/sglang/pull/17213
- [bugfix] fix mamba slot leak when scheduling fails with radix cache (#15840) by @kuafou in https://github.com/sgl-project/sglang/pull/16067
- fix double-free kv cache for requests that have already finished and been freed during preemption by @JD-ETH in https://github.com/sgl-project/sglang/pull/18694
- Update notified user in post_ci_failures_to_slack.py by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/18817
- [FlashInfer] Bump FlashInfer version from 0.6.2 to 0.6.3 by @mmangkad in https://github.com/sgl-project/sglang/pull/18448
- Update performance dashboard for nightly tests by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/18824
- [Perf] refactor piecewise cuda graph support of Qwen3-Next by @zminglei in https://github.com/sgl-project/sglang/pull/17613
- feat: add SGLANG_DISTRIBUTED_INIT_METHOD_OVERRIDE env var by @YazhiGao in https://github.com/sgl-project/sglang/pull/18743
- [PD-Disagg] Fix double free when prebuilt batch is aborted. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/18822
- Add timeout abort kits for normal / eagle. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/18815
- [Env] centralize hicache vars in environ.py by @alphabetc1 in https://github.com/sgl-project/sglang/pull/17204
- Handle abort for retracted requests in disagg decode prealloc queue by @qmzznbxhl in https://github.com/sgl-project/sglang/pull/18705
- [diffusion][MUSA] fix: MUSA platform breakage caused by PR #13662 by @yeahdongcn in https://github.com/sgl-project/sglang/pull/18456
- Fix/partial gen from waiting queue miss metadata by @JD-ETH in https://github.com/sgl-project/sglang/pull/17610
- [VLM][LLM] Optimize fused_moe triton kernel tma by @yuan-luo in https://github.com/sgl-project/sglang/pull/18782
- [AMD] Fix sgl-model-gateway Build Errors in ROCm Docker Release by @bingxche in https://github.com/sgl-project/sglang/pull/18836
- Kernel: optimize decoding metadata in NSA multi-spec backend with fused kernels by @Johnsonms in https://github.com/sgl-project/sglang/pull/17554
- Fix dsv32 encode_messages by @whybeyoung in https://github.com/sgl-project/sglang/pull/18126
- Add ci test for ring model by @ispobock in https://github.com/sgl-project/sglang/pull/18829
- feat: Support `mrope_section` with `rope_type: "yarn"` by @raayandhar in https://github.com/sgl-project/sglang/pull/13313
- Enable SGLANG_ENABLE_SPEC_V2 for nightly speculative decoding tests by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/18719
- [kernel slimming] Move fast_hadamard_transform to jit_kernel by @BBuf in https://github.com/sgl-project/sglang/pull/18475
- [Diffusion] opt vae decode with `channels_last_3d` by @BBuf in https://github.com/sgl-project/sglang/pull/18540
- Add CI permissions by @mmangkad in https://github.com/sgl-project/sglang/pull/18847
- Fix model loading for DeepSeek-V3.2-AWQ by @bingps in https://github.com/sgl-project/sglang/pull/16907
- [Doc] Convert the speculative decoding notebook to markdown by @alphabetc1 in https://github.com/sgl-project/sglang/pull/18395
- Model: Support IBM Granite (Dense/Mamba + MoE) by @blazingbhavneek in https://github.com/sgl-project/sglang/pull/18040
- [FIX] Correct JIT kernel compilation on newer GPUs with outdated driver metadata. by @muse-coder in https://github.com/sgl-project/sglang/pull/18496
- [AMD] Fix/qwen3 5 amd rope cutedsl fallback by @andyluo7 in https://github.com/sgl-project/sglang/pull/18753
- [Perf] Tune MiniMax M2 fused moe kernel on H100 GPU by @zhendonghua in https://github.com/sgl-project/sglang/pull/18851
- perf: add minimax-2.5 fused_moe tuning config for h20 by @zhangxiaolei123456 in https://github.com/sgl-project/sglang/pull/18833
- [diffusion]: Enable torch.compile for UlyssesAttention by @Ratish1 in https://github.com/sgl-project/sglang/pull/18840
- fix bug on kimi2.5 when dp2 tp4 by @haowen-han in https://github.com/sgl-project/sglang/pull/18604
- Extract dumper and prefill delayer tests common utils by @fzyzcjy in https://github.com/sgl-project/sglang/pull/18857
- Add missing dumper tests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/18859
- [AMD] Fix nightly 1-GPU test failures and bench_serving regression by @michaelzhang-ai in https://github.com/sgl-project/sglang/pull/18761
- [diffusion] quant: add support for svdquant and nunchaku by @mickqian in https://github.com/sgl-project/sglang/pull/18549
- change npu.dockerfile by @chenxu214 in https://github.com/sgl-project/sglang/pull/18835
- [diffusion]: Improve layerwise offload buffer reuse and shared-storage handling by @Ratish1 in https://github.com/sgl-project/sglang/pull/18611
- feature: adding build commit to sgl kernel workflow by @dougyster in https://github.com/sgl-project/sglang/pull/18853
- Enable DeepGemm fast warmup in CI to prevent cold-cache timeouts by @alisonshao in https://github.com/sgl-project/sglang/pull/18823
- update pre-commit config by @SoluMilken in https://github.com/sgl-project/sglang/pull/18860
- fix: update Blackwell log/error messages to include SM12x by @blake-snc in https://github.com/sgl-project/sglang/pull/18751
- fix: add SM110 (Jetson AGX Thor) to Blackwell capability check by @WiwilZ in https://github.com/sgl-project/sglang/pull/18787
- test: add test for Modelopt FP8 on SM90 by @zack041 in https://github.com/sgl-project/sglang/pull/18463
- fix_get_quant_method_in_fused_moe_condition by @tom-zju in https://github.com/sgl-project/sglang/pull/18459
- Use ephemeral nccl port via get_free_port() by @chanh in https://github.com/sgl-project/sglang/pull/18009
- feat: expose consistent_hashing policy in Python router CLI args by @bledden in https://github.com/sgl-project/sglang/pull/17972
- Improve profiler options for bench_serving by @akhilg-nv in https://github.com/sgl-project/sglang/pull/16991
- Fix libnuma.so does not exist by @QiuMike in https://github.com/sgl-project/sglang/pull/15355
- fix(sgl-kernel): support CUDA 13 runtime preloading for DGX Spark by @blake-snc in https://github.com/sgl-project/sglang/pull/18747
- fix(sgl-kernel): use >= 120 for SM12x CUDA kernel dispatch by @blake-snc in https://github.com/sgl-project/sglang/pull/18750
- Create ascend_npu_qwen3_5_examples.md by @chenxu214 in https://github.com/sgl-project/sglang/pull/18864
- Update ascend_npu_support.rst by @chenxu214 in https://github.com/sgl-project/sglang/pull/18868
- Add claude skills for sgl-kernel and jit-kernel by @BBuf in https://github.com/sgl-project/sglang/pull/18855
- Nsa trtllm mla sparse fp8 support with Deepseek v3.2 NVFP4 by @rainj-me in https://github.com/sgl-project/sglang/pull/18389
- fix: nightly whl dev date suffix by @dougyster in https://github.com/sgl-project/sglang/pull/18873
- [VLM] Optimize Ernie4.5-VL rotary embedding with fused triton kernel by @yuan-luo in https://github.com/sgl-project/sglang/pull/18856
- [diffusion] fix: avoid saving output for warmup requests by @mickqian in https://github.com/sgl-project/sglang/pull/18867
- [diffusion] refactor: refactor server_args adjust and validate logics by @mickqian in https://github.com/sgl-project/sglang/pull/18863
- [Diff]: support SGLANG_TORCH_PROFILER_DIR environment variable for profiler log directory by @Johnsonms in https://github.com/sgl-project/sglang/pull/18454
- [AMD] MORI-EP inter kernel type switch by @Duyi-Wang in https://github.com/sgl-project/sglang/pull/18437
- Flip dumper to disable by default and refactor environment handling by @fzyzcjy in https://github.com/sgl-project/sglang/pull/18878
- Change dump output format to dict with value and metadata by @fzyzcjy in https://github.com/sgl-project/sglang/pull/18879
- Collect upper level metadata to dump output by @fzyzcjy in https://github.com/sgl-project/sglang/pull/18880
- Support dumping gradients, parameters, lazy values by @fzyzcjy in https://github.com/sgl-project/sglang/pull/18881
- fix: unifying docker image build pipeline by @dougyster in https://github.com/sgl-project/sglang/pull/18814
- fix: adding performance logging for nightly diffusion by @dougyster in https://github.com/sgl-project/sglang/pull/18023
- Fix test_lora_qwen3 nightly failure: replace adapter with added_tokens by @alisonshao in https://github.com/sgl-project/sglang/pull/18884
- Update ascend_npu_qwen3_5_examples.md by @realray808 in https://github.com/sgl-project/sglang/pull/18888
- [Diffusion] Fix LoRA weight snapshot aliasing in unmerge by @ChangyiYang in https://github.com/sgl-project/sglang/pull/18883
- Fix GLM-4V processor registration when glm_ocr is unavailable by @alisonshao in https://github.com/sgl-project/sglang/pull/18885
- [JIT kernel] hd=512,1024 in JIT QK norm (cta based) by @vincentzed in https://github.com/sgl-project/sglang/pull/17515
- [diffusion] logging: improve peak vram logging by @mickqian in https://github.com/sgl-project/sglang/pull/18865
- Revert "[diffusion]: Improve layerwise offload buffer reuse and shared-storage handling" by @mickqian in https://github.com/sgl-project/sglang/pull/18866
- [Model] Add Qwen3ForRewardModel and fix Qwen3ForSequenceClassification by @shvmjndl in https://github.com/sgl-project/sglang/pull/17992
- [Perf] ~9.5x faster Blackwell MXFP4 MoE weight loading by @mmangkad in https://github.com/sgl-project/sglang/pull/18858
- [diffusion][Wan]: fix sparse attention backends being applied to cross-attention by @Ratish1 in https://github.com/sgl-project/sglang/pull/17596
- refactor FAKE transfer backend and remove --disaggregation-decode-enable-fake-auto parameter by @Estrella-xx in https://github.com/sgl-project/sglang/pull/18345
- [2/N] Quantization Refactor: Compressed tensors MoE schemes by @TamirBaydasov in https://github.com/sgl-project/sglang/pull/17503
- Fix modelopt FP8 create weights by @danielafrimi in https://github.com/sgl-project/sglang/pull/18447
- Fix GLM-5 fused shared expert by @FrankMinions in https://github.com/sgl-project/sglang/pull/18804
- [diffusion]: fix scheduler crash on ZMQ messages with unexpected frame counts by @Ratish1 in https://github.com/sgl-project/sglang/pull/17890
- Adapt the Qwen2Model._update_causal_mask for transformers==4.57.1 by @pansicheng in https://github.com/sgl-project/sglang/pull/18774
- [diffusion] operator: unify rotary embedding impl by @triple-mu in https://github.com/sgl-project/sglang/pull/18164
- [misc] adding metadata field in UpdateWeightFromDiskReqInput by @happierpig in https://github.com/sgl-project/sglang/pull/18821
- Skip flaky test_tool_choice_required_non_streaming for Mistral by @alisonshao in https://github.com/sgl-project/sglang/pull/18889
- [AMD] Fix RotaryEmbedding crash on AMD/ROCm (regression from #17934) by @michaelzhang-ai in https://github.com/sgl-project/sglang/pull/18903
- [TBO] fix cuda graph intermittently becomes disabled bug by @billishyahao in https://github.com/sgl-project/sglang/pull/18320
- [Diffusion] [NPU] [Doc] Add NPU documentation for sglang-diffusion by @Makcum888e in https://github.com/sgl-project/sglang/pull/18894
- Revert "[AMD] Fix RotaryEmbedding crash on AMD/ROCm (regression from #17934)" by @HaiShaw in https://github.com/sgl-project/sglang/pull/18922
- [diffusion]: fix sparse video gen 2 backend being applied to cross-attention by @Ratish1 in https://github.com/sgl-project/sglang/pull/18900
- [Diffusion] Fix get model name when model local path end with "/" by @Makcum888e in https://github.com/sgl-project/sglang/pull/18918
- ROCm use rotary_embedding from sgl-kernel by @HaiShaw in https://github.com/sgl-project/sglang/pull/18920
- [Diffusion] [NPU] Fix CI run by @Makcum888e in https://github.com/sgl-project/sglang/pull/18921
- Revert "[diffusion] operator: unify rotary embedding impl" by @mickqian in https://github.com/sgl-project/sglang/pull/18929
- [PCG] support piecewise cuda graph for kimi-linear model by @zminglei in https://github.com/sgl-project/sglang/pull/18849
- [diffusion]: MOVA torch.compile opt by @triple-mu in https://github.com/sgl-project/sglang/pull/18914
- [gRPC] Fix scheduler startup broken by context parallel refactor by @slin1237 in https://github.com/sgl-project/sglang/pull/18933
- [diffusion] update code owner by @ping1jing2 in https://github.com/sgl-project/sglang/pull/18495
- [3/N] Quantization Refactor: ModelSlim MoE schemes by @TamirBaydasov in https://github.com/sgl-project/sglang/pull/17993
- fix(glm-image): single-GPU T5 config + SP support for 4D latents (#18… by @Nickcp39 in https://github.com/sgl-project/sglang/pull/18739
- Fix generated-shared-prefix bench_serving by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/18769
- Fix benchmark_sglang_fused_moe_triton.py by @satyamk7054 in https://github.com/sgl-project/sglang/pull/18940
- cleanup prefill metrics logging to fix dp-attn metrics by @Ratish1 in https://github.com/sgl-project/sglang/pull/18778
- feat: add cuda core dump CI wrapper by @hnyls2002 in https://github.com/sgl-project/sglang/pull/18909
- Refactor sampler: Use a better hash function for deterministic sampling and clear dispatch for probs/logprobs/logits sampling paths by @merrymercy in https://github.com/sgl-project/sglang/pull/18915
- Fix eval tests not capturing server launch failures by @alisonshao in https://github.com/sgl-project/sglang/pull/18886
- Expose priority parameter in Engine.generate() and Engine.async_generate() by @PeaBrane in https://github.com/sgl-project/sglang/pull/18944
- feat: [Qwen3.5] Support block-wise FP8 quantization and model adaptation by @zju-stu-lizheng in https://github.com/sgl-project/sglang/pull/18926
- Revert "Fix generated-shared-prefix bench_serving" by @hnyls2002 in https://github.com/sgl-project/sglang/pull/18956
- feat: add nsa and swa disagg support with nixl by @nealvaidya in https://github.com/sgl-project/sglang/pull/18939
- [feat] Add return_routed_experts param to async_generate for parity with generate by @Aphoh in https://github.com/sgl-project/sglang/pull/18508
- [Refactor] Fix test and clean up hicache code by @DarkSharpness in https://github.com/sgl-project/sglang/pull/18555
- [diffusion] refactor: unify SamplingParams construction and improve DiffGenerator return types by @mickqian in https://github.com/sgl-project/sglang/pull/18928
- Reasoning models fix docs by @HaiShaw in https://github.com/sgl-project/sglang/pull/18963
- Remove unused fast-hadamard-transform PyTorch extension sources by @BBuf in https://github.com/sgl-project/sglang/pull/18927
- [Tiny fix] Super tiny fix mul_add naive forward bug by @BBuf in https://github.com/sgl-project/sglang/pull/18964
- Enable fa3 PDL by compiling it with corresponding flags by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/18756
- [AMD] ROCm7.2: Add /sgl-workspace/aiter to PYTHONPATH by @HaiShaw in https://github.com/sgl-project/sglang/pull/18972
- [BUG] Refactor task resolution logic in benchmark function for multimodal generation by @zijiexia in https://github.com/sgl-project/sglang/pull/18948
- Add DP ViT support for Kimi K2.5 by @yhyang201 in https://github.com/sgl-project/sglang/pull/18689
- [Fix] Add lora tied lm head support (for Qwen2.5, Gemma, etc model need) by @yushengsu-thu in https://github.com/sgl-project/sglang/pull/18634
- [4/N] Quantization Refactor: Quark MoE schemes by @TamirBaydasov in https://github.com/sgl-project/sglang/pull/18252
- [Feature] Implement update_weights_from_disk for SGLang-D (Diffusion … by @dreamyang-liu in https://github.com/sgl-project/sglang/pull/18306
- [Doc] Add `flashinfer_deepgemm` to `--fp8-gemm-backend` by @mmangkad in https://github.com/sgl-project/sglang/pull/18982
- Fix flaky Qwen3-Next KL divergence tests by reverting mamba slot release by @alisonshao in https://github.com/sgl-project/sglang/pull/18910
- [AMD] Fix mi35x dsv32 mtp nightly by @bingxche in https://github.com/sgl-project/sglang/pull/18978
- Add batched zero copy to NIXL backend by @hxieustc in https://github.com/sgl-project/sglang/pull/18850
- [Qwen3.5] Enable nvfp4 checkpoint by @hlu1 in https://github.com/sgl-project/sglang/pull/18937
- Fix PCG MoE Error by @Oasis-Git in https://github.com/sgl-project/sglang/pull/17739
- Feat/add fi selective state update kernel call by @shaharmor98 in https://github.com/sgl-project/sglang/pull/18070
- [RadixTree][4/N Refactor]: Move available_and_evictable_str to individual radix cache classes by @pansicheng in https://github.com/sgl-project/sglang/pull/17852
- [Diffusion] Refactor diffusion triton kernels by @BBuf in https://github.com/sgl-project/sglang/pull/18966
- [Fix] Fix rank used in parallel executor when enable_cfg_parallel is false by @Prozac614 in https://github.com/sgl-project/sglang/pull/18975
- [Diffusion] [NPU] Enable profiler on NPU by @Makcum888e in https://github.com/sgl-project/sglang/pull/17807
- Move lora request validation to tokenizer_manager from server by @satyamk7054 in https://github.com/sgl-project/sglang/pull/18962
- [diffusion] chore: improve memory usage on consumer-level GPU by @mickqian in https://github.com/sgl-project/sglang/pull/18997
- [diffusion] CI: enable warmup as default by @mickqian in https://github.com/sgl-project/sglang/pull/19010
- Add SDAR model support by @chengshuang18 in https://github.com/sgl-project/sglang/pull/18318
- [spec v2]Fix torch gc of future indices by @hnyls2002 in https://github.com/sgl-project/sglang/pull/18958
- Revert "Add SDAR model support" by @ch-wan in https://github.com/sgl-project/sglang/pull/19032
- Register tensors with symmetric memory for qwen by @nvcastet in https://github.com/sgl-project/sglang/pull/18643
- Fix long prompt KV allocation by falling back to torch native APIs when exceeding Triton tensor limit by @ch-wan in https://github.com/sgl-project/sglang/pull/18250
- Fix flashinfer autotune to only wrap run_once() by @ch-wan in https://github.com/sgl-project/sglang/pull/19004
- Support cleanup previous dumps in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19013
- Hint users when wrongly execute it with partial ranks in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19014
- Support captured dump output and console output control in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19017
- Support filtering labels in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19018
- Enhance configure and env parsing in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19034
- Support resetting and enhance HTTP endpoints for dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19046
- Support using SGLang port in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19038
- Feature/sdar support by @chengshuang18 in https://github.com/sgl-project/sglang/pull/19044
- [Fix][Qwen3.5] Pass max_mamba_cache_size to mamba pool in disaggregation decode path by @YAMY1234 in https://github.com/sgl-project/sglang/pull/19002
- [AMD] Replace msgpack with msgspec in MORI-IO by @Duyi-Wang in https://github.com/sgl-project/sglang/pull/19007
- fix lint on main by @ch-wan in https://github.com/sgl-project/sglang/pull/19052
- feature: docker patch workflow by @dougyster in https://github.com/sgl-project/sglang/pull/19025
- Fix lint on main by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19054
- [diffusion] logging: log available mem when each stage starts in debug level by @mickqian in https://github.com/sgl-project/sglang/pull/18998
- [jit kernel] Support per_token_group_quant_8bit jit kernel by @yuan-luo in https://github.com/sgl-project/sglang/pull/18905
- [diffusion] feat: support nunchaku for Z-Image-Turbo and flux.1 (int4) by @mickqian in https://github.com/sgl-project/sglang/pull/18959
- Fix NSA FP8 KV cache path for both-trtllm MHA one-shot by @mmangkad in https://github.com/sgl-project/sglang/pull/18931
- [Fix] DO NOT skip save_kv_cache for dllm by @DarkSharpness in https://github.com/sgl-project/sglang/pull/19020
- [Fix] Run FlashInfer autotune on non-default stream for NCCL 2.29+ compatibility by @nvcastet in https://github.com/sgl-project/sglang/pull/18987
- Fix adjust_num_token_non_padded_for_attn_tp returning CPU tensor by @ch-wan in https://github.com/sgl-project/sglang/pull/19051
- [AMD] support two batch overlapping for mori ep by @billishyahao in https://github.com/sgl-project/sglang/pull/17953
- [feat] feat: support swa in trtllm_mha by @LuYanFCP in https://github.com/sgl-project/sglang/pull/18970
- Add generated-shared-prefix dataset in bench_one_batch by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/18986
- [GPT-OSS] support fp8 online quantization for gpt-oss bf16 by @zminglei in https://github.com/sgl-project/sglang/pull/18988
- Refactor graph input buffers by @ch-wan in https://github.com/sgl-project/sglang/pull/18991
- [DSv32] Fix MTP and CP compatibility by @vladnosiv in https://github.com/sgl-project/sglang/pull/19062
- Fix bug in symm mem pre-allocation default by @nvcastet in https://github.com/sgl-project/sglang/pull/19082
- Remove erroneous dllm and diffusion docs in basic_usage by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/19105
- [Quantization] Support config.json quantization_config format, fix exclude_modules matching, and fix KV cache scale loading for Nemotron by @danielafrimi in https://github.com/sgl-project/sglang/pull/18546
- [diffusion] refactor: reduce redundancy and improve stage api by @mickqian in https://github.com/sgl-project/sglang/pull/19060
- [FEAT] Add Anthropic compatible API endpoint by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/18630
- [diffusion] feat: support passing component path via server args by @mickqian in https://github.com/sgl-project/sglang/pull/19108
- [Feature] rewrite rope kernel; remove flashinfer dependencies by @DarkSharpness in https://github.com/sgl-project/sglang/pull/18844
- [Diffusion] Restructure and clean Diffusion rotary embedding by @BBuf in https://github.com/sgl-project/sglang/pull/19064
- fix tool handling in OpenAIServingChat by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/18996
- fix KimiK2Detector regex patterns with re.DOTALL by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/19120
- [sgl] Fix view holding memory too long and introducing large memory usage by @bixue2010 in https://github.com/sgl-project/sglang/pull/19109
- [FlashInfer] Switch FlashInfer allreduce fusion to unified API by @mmangkad in https://github.com/sgl-project/sglang/pull/18341
- [Refactor] Benchmark Phase 1: extract utils and datasets from bench_serving by @Ratish1 in https://github.com/sgl-project/sglang/pull/19077
- [Benchmark] Remove re-exports from bench_serving.py by @hnyls2002 in https://github.com/sgl-project/sglang/pull/19130
- Revert "[jit kernel] Support per_token_group_quant_8bit jit kernel" by @hnyls2002 in https://github.com/sgl-project/sglang/pull/19131
- Fix dev Docker build OOM on ARM64 cu13 by adding docker system prune by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/18947
- [Fix] Quick fix for int32 overflow in Mooncakes' send_kvcache_slice by @YAMY1234 in https://github.com/sgl-project/sglang/pull/19076
- [diffusion] Adapt FP8 linear to sgld feature (Rebase) by @fy1214 in https://github.com/sgl-project/sglang/pull/17023
- [BUG] [DLLM] Missing max_running_requests value by @blazingbhavneek in https://github.com/sgl-project/sglang/pull/18740
- Fix spec v2+dp attention in nsa backend by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/19134
- [Qwen3-Next] Enable fused_qkvzba_split_reshape_cat also for prefill by @YAMY1234 in https://github.com/sgl-project/sglang/pull/18917
- [PD] Change bootstrap_room metadata dtype from int64 to uint64 by @ShangmingCai in https://github.com/sgl-project/sglang/pull/19141
- Refactor dumper and change on_forward_pass_start API by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19065
- Support non-intrusive dumping in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19068
- Support enabling partial non intrusive dump in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19069
- Auto annotate context in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19071
- Extract framework plugins in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19072
- Enhance hook mechanism in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19073
- Configure and call dumper in main SGLang logic by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19093
- Support multi colocated dumper, named exp cleanup, argparse config by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19094
- Enhance reset, states, http in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19095
- Fix wrongly large dumped files and handle non-intrusive hook reset in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19124
- [DSv32] [GLM5] Improve Model Quality by Avoiding FP32 Precision Loss in weights_proj by @zianglih in https://github.com/sgl-project/sglang/pull/19041
- Support kwargs and megatron core tensor parsing in dumper by @fzyzcjy in https://github.com/sgl-project/sglang/pull/19138
- [diffusion] chore: minor cleanups by @mickqian in https://github.com/sgl-project/sglang/pull/19123
- [diffusion] CI: relax perf check threshold by @mickqian in https://github.com/sgl-project/sglang/pull/19154
- Fix corrupted JSONL metrics file due to concurrent writes by @talorabr in https://github.com/sgl-project/sglang/pull/19011
- [diffusion] refactor: rename quantized model path server arg by @mickqian in https://github.com/sgl-project/sglang/pull/19142
- Revert "[AMD] support two batch overlapping for mori ep #17953" by @Fridge003 in https://github.com/sgl-project/sglang/pull/19161
- fix(diffusion): enforce strict input_reference validation for T2V by @Ratish1 in https://github.com/sgl-project/sglang/pull/14825
- Revert "Refactor graph input buffers (#18991)" by @Fridge003 in https://github.com/sgl-project/sglang/pull/19173
- Update rocm7.2 Dockerfile to install amdsmi for QuickReduce Initialization by @clintg6 in https://github.com/sgl-project/sglang/pull/19091
- Fix bench_one_batch_server by moving the print statements by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/19175
- [AMD] ENV Flags tuning and cleanup by @HaiShaw in https://github.com/sgl-project/sglang/pull/19176
- [Diffusion] Detect Flux2 custom VAE path from component_paths by @ChangyiYang in https://github.com/sgl-project/sglang/pull/19170
- [ROCm] Use unreg path for custom all-reduce during CUDA graph capture by @zyzshishui in https://github.com/sgl-project/sglang/pull/19162
- Reorganize topk logic to clean up code and expose logical experts by @ocss884 in https://github.com/sgl-project/sglang/pull/16945
- Use single mma warp group for short q_len in FA to optimize decoding performance by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/18985
- [NPU] bump sgl-kernel-npu to 2026.02.01.post2 by @iforgetmyname in https://github.com/sgl-project/sglang/pull/19178
- [Refactor] Split rotary_embedding.py into a modular package by @BBuf in https://github.com/sgl-project/sglang/pull/19144
- [Diffusion] Match rotary_embedding module name style by @BBuf in https://github.com/sgl-project/sglang/pull/19179
- [Kernel Slimming] Migrate AWQ marlin repack kernel to JIT by @celve in https://github.com/sgl-project/sglang/pull/18949
- [PD-Disagg] Support query dp rank from bootstrap server. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/19168
- add new ci user by @narutolhy in https://github.com/sgl-project/sglang/pull/19133
New Contributors
- @00fish0 made their first contribution in #17265
- @1195343015 made their first contribution in #18535
- @1am9trash made their first contribution in #18488
- @22dimensions made their first contribution in #18017
- @2JooYeon made their first contribution in #18552
- @Aphoh made their first contribution in #18508
- @BJWang-ant made their first contribution in #16043
- @BourneSun0527 made their first contribution in #18553
- @Ch3ngY1 made their first contribution in #18163
- @ChangyiYang made their first contribution in #18883
- @CloudRipple made their first contribution in #17704
- @DiweiSun made their first contribution in #11712
- @DotSlash-A made their first contribution in #16969
- @Duyi-Wang made their first contribution in #18437
- @EduardDurech made their first contribution in #15682
- @Estrella-xx made their first contribution in #17811
- @Evrard-Nil made their first contribution in #17813
- @FrankMinions made their first contribution in #18804
- @HaiShaw made their first contribution in #18101
- @HanHan009527 made their first contribution in #18091
- @HandH1998 made their first contribution in #16892
- @Hide-on-bushsh made their first contribution in #17922
- @JD-ETH made their first contribution in #18694
- @JiaruiChang5268 made their first contribution in #17007
- @Lollipop made their first contribution in #18024
- @LuYanFCP made their first contribution in #18970
- @Lzy17 made their first contribution in #17183
- @Mahdi-CV made their first contribution in #17040
- @Makcum888e made their first contribution in #17584
- @McZyWu made their first contribution in #16866
- @MikkoParkkola made their first contribution in #17816
- @Nickcp39 made their first contribution in #18739
- @PeaBrane made their first contribution in #18944
- @RangerCD made their first contribution in #17929
- @RubiaCx made their first contribution in #18026
- @RunningLeon made their first contribution in #17939
- @Simon-Li made their first contribution in #17430
- @SoluMilken made their first contribution in #18860
- @Sugar920 made their first contribution in #17952
- @TTThanos made their first contribution in #17866
- @TamirBaydasov made their first contribution in #17503
- @WiwilZ made their first contribution in #18787
- @YazhiGao made their first contribution in #18743
- @ZhenshengWu made their first contribution in #17540
- @ZiguanWang made their first contribution in #16225
- @aaaandychen made their first contribution in #16781
- @airMeng made their first contribution in #13561
- @ajpqs made their first contribution in #14765
- @akao-amd made their first contribution in #18698
- @akhilg-nv made their first contribution in #16758
- @amote-i made their first contribution in #17573
- @andyluo7 made their first contribution in #18753
- @ant-yy made their first contribution in #18598
- @aurickq made their first contribution in #18152
- @billishyahao made their first contribution in #18320
- @bingps made their first contribution in #16907
- @bixue2010 made their first contribution in #17781
- @blake-snc made their first contribution in #18751
- @blazingbhavneek made their first contribution in #18687
- @bledden made their first contribution in #17972
- @cctry made their first contribution in #17850
- @celve made their first contribution in #17889
- @chanh made their first contribution in #18009
- @chengshuang18 made their first contribution in #18318
- @chenxu214 made their first contribution in #17511
- @cicirori made their first contribution in #18064
- @clintg6 made their first contribution in #19091
- @cswuyg made their first contribution in #17974
- @debo3 made their first contribution in #18396
- @dongjiyingdjy made their first contribution in #17213
- @dreamyang-liu made their first contribution in #18306
- @dutsc made their first contribution in #17301
- @edwardzjl made their first contribution in #18171
- @fsygd made their first contribution in #17751
- @fy1214 made their first contribution in #17023
- @gaopengff made their first contribution in #14592
- @gingerXue made their first contribution in #17499
- @glenliu21 made their first contribution in #17464
- @gongyisheng made their first contribution in #17690
- @hammersam made their first contribution in #17747
- @haojin2 made their first contribution in #18045
- @haowen-han made their first contribution in #18604
- @happierpig made their first contribution in #18821
- @hsuchifeng made their first contribution in #17744
- @hxieustc made their first contribution in #18850
- @jhinpan made their first contribution in #17863
- @jianyingzhu made their first contribution in #14717
- @jiashaokun-1 made their first contribution in #17025
- @joearedmond made their first contribution in #17786
- @kaixih made their first contribution in #18025
- @kartikx made their first contribution in #17273
- @klhhhhh made their first contribution in #18131
- @klshuster made their first contribution in #18273
- @koush made their first contribution in #18011
- @kuafou made their first contribution in #18207
- @laixinn made their first contribution in #17065
- @lawtherWu made their first contribution in #15381
- @lingebeng made their first contribution in #17699
- @luke396 made their first contribution in #17118
- @maning00 made their first contribution in #14626
- @mansoor-s made their first contribution in #17434
- @maocheng23 made their first contribution in #18512
- @mattteochen made their first contribution in #18000
- @mengchengTang made their first contribution in #17545
- @michaelzhang-ai made their first contribution in #17523
- @mmangkad made their first contribution in #17662
- @muse-coder made their first contribution in #18496
- @nanjiangwill made their first contribution in #17286
- @nono-Sang made their first contribution in #17140
- @nvcastet made their first contribution in #17089
- @ovidiusm made their first contribution in #17654
- @pansicheng made their first contribution in #18155
- @ping1jing2 made their first contribution in #18495
- @pokymono made their first contribution in #17888
- @polisettyvarma made their first contribution in #10021
- @qianyue76 made their first contribution in #18095
- @qmzznbxhl made their first contribution in #18705
- @raayandhar made their first contribution in #13313
- @realray808 made their first contribution in #18888
- @rootonchair made their first contribution in #18560
- @shaharmor98 made their first contribution in #17700
- @shvmjndl made their first contribution in #17806
- @sleepcoo made their first contribution in #14607
- @sogalin made their first contribution in #17656
- @strgrb made their first contribution in #17508
- @talorabr made their first contribution in #19011
- @tc-mb made their first contribution in #9610
- @tianchongchong made their first contribution in #17858
- @tie-pilot-qxw made their first contribution in #17507
- @tom-zju made their first contribution in #18459
- @triple-mu made their first contribution in #17834
- @tugot17 made their first contribution in #17777
- @vedantjh2 made their first contribution in #17780
- @wangfakang made their first contribution in #17756
- @wenchen76 made their first contribution in #7839
- @xiaobaicxy made their first contribution in #18657
- @xiaoweiw-nv made their first contribution in #17198
- @xiaoyewww made their first contribution in #18317
- @xu-yfei made their first contribution in #17076
- @xvyaward made their first contribution in #16294
- @xyjixyjixyji made their first contribution in #17347
- @ycchen-tw made their first contribution in #16521
- @yefei12 made their first contribution in #15119
- @yingluosanqian made their first contribution in #18038
- @yunkchen made their first contribution in #17129
- @zack041 made their first contribution in #18240
- @zackyoray made their first contribution in #17146
- @zhangxiaolei123456 made their first contribution in #18833
- @zhangxin81 made their first contribution in #16725
- @zhaochenyang20 made their first contribution in #18390
- @zhaohaidao made their first contribution in #18096
- @zhendonghua made their first contribution in #17891
- @zianglih made their first contribution in #17688
- @zijiexia made their first contribution in #17663
- @zju-stu-lizheng made their first contribution in #17624
- @zwang86 made their first contribution in #18225
- @zzhbrr made their first contribution in #10707
- @zzzzzzzxh made their first contribution in #15363
Full Changelog: v0.5.8...v0.5.9