sgl-project/sglang v0.5.9

Highlights

  • LoRA Weight Loading Overlap with Computation: Overlap LoRA weight loading with computation during inference, reducing TTFT by ~78% and TPOT by ~34.88% with large adapters: #15512

  • TRT-LLM NSA Kernel Integration for DeepSeek V3.2: Integrate TRT-LLM DSA kernels for Native Sparse Attention, boosting DeepSeek V3.2 performance by 3x-5x on Blackwell platforms when trtllm is used for both --nsa-prefill-backend and --nsa-decode-backend (with a minor accuracy drop): #16758, #17662, #18389

  • Flashinfer All-to-All MoE Dispatcher: Add the Flashinfer all-to-all MoE dispatcher for efficient expert parallelism communication, enabling optimized routing in MoE models: #14668

  • FA4 (FP4 Attention) Support for Multimodal Encoder: Introduce FP4 attention backend and variable-length attention function for multimodal encoders, enabling lower-precision inference for vision-language models: #13539

  • Anthropic Compatible API Endpoint: Add native Anthropic API compatibility to SGLang, allowing direct integration with tools and clients built for the Anthropic API format: #18630

  • SGLang-Diffusion Advanced Optimizations: Production-ready improvements including token-level sequence sharding, parallel VAE decoding, fused kernels, Nunchaku and FP8 support, and multiple new models in the ComfyUI plugin: blog

  • Spec V2 Critical Bug Fix: Fix an out-of-index bug caused by torch garbage collection in speculative decoding v2, improving the reliability of speculative verification: #18958

  • Deploying DeepSeek on GB300 NVL72: Optimization work for long-context inference using prefill-decode disaggregation and other SGLang features on NVIDIA's latest GB300 platform: blog

  • Bump AITER Version to 0.1.10.post3: Adds support for FP8 prefill, FP8 decode, and FP8 KV cache

  • Commit-to-Version Lookup in docs.sglang.io: Easily find the earliest official version that includes a given PR or commit, streamlining release tracking for users and developers: #18450
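As a sketch of what a client request to the new Anthropic-compatible endpoint might look like (the endpoint path, headers, and served model name below are assumptions for illustration, not confirmed by #18630), a body in Anthropic's Messages format can be built like this:

```python
import json


def build_messages_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a request body in the Anthropic Messages API format.

    The model name is whatever the SGLang server was launched with;
    "my-model" below is a placeholder.
    """
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


body = build_messages_request("my-model", "Hello!")
print(json.dumps(body, indent=2))
```

With a local server running, this body would be POSTed to the messages endpoint (conventionally /v1/messages in Anthropic's API); whether SGLang also expects Anthropic-style headers such as anthropic-version is worth checking against #18630.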

New Model Support

SGLang-Diffusion

  • Support multiple new models in ComfyUI Plugin
  • Parallel Folding and Parallel VAE Decoding for faster image/video generation
  • Nunchaku and FP8 support for diffusion models
  • Sequence Sharding (token-level) replacing Frame Sharding for improved efficiency
  • LTX-2 support: #17495, #17496
  • MOVA model support: #17704
  • Cache-DiT optimizations and fused kernel improvements
  • Numerous bug fixes and refactors across the diffusion pipeline

Performance

  • Integrate TRT-LLM NSA kernels with up to 3-5x speedup on Blackwell: #16758, #17662, #18389
  • LoRA weight loading overlap reducing TTFT by ~78%: #15512
  • Flashinfer all-to-all MoE dispatcher: #14668
  • FA4 for multimodal encoder: #13539
  • Optimize GDN decode for Qwen3 Next: #17094
  • Tune fused MoE kernels for Llama-4-Scout, MiniMax M2: #17891, #18851, #18833
  • Symmetric memory pre-allocation to avoid fragmentation: #17089
  • Optimize fused_moe Triton kernel TMA: #18782
  • Fused Triton kernel for Ernie4.5-VL rotary embedding: #18856
  • Support MxINT4 Flashinfer TRT-LLM MoE GEMM: #16892
  • AITER bias MoE support for GPT-OSS MxFP4: #17735
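The TRT-LLM NSA backends listed above are selected at launch time via the two flags named in the highlight. A hedged sketch (the model path is a placeholder, and any additional flags your deployment needs are not shown):

```shell
# Launch SGLang with TRT-LLM NSA kernels for both prefill and decode
# (Blackwell GPUs; minor accuracy drop noted in the release highlight).
# The model path is a placeholder for the DeepSeek V3.2 checkpoint you serve.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2 \
  --nsa-prefill-backend trtllm \
  --nsa-decode-backend trtllm
```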

Prefill-Decode Disaggregation

  • Support KV transfer with MORI-IO: #14626
  • Mooncake intra-node NVLink KV transfer: #17866
  • Improve KV offset calculation for MHA models with different TP sizes: #18163
  • Document SGLANG_MOONCAKE_CUSTOM_MEM_POOL: #18259
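The environment variable documented in #18259 is read at process start; a minimal sketch, assuming a boolean-style toggle (check the linked documentation for the accepted values):

```shell
# Assumption: a true/false toggle enabling Mooncake's custom memory pool
# for KV transfer; set it before launching the server process.
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=true
```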

Diffusion LLM (dLLM)

  • Remove cuda graph batch size limitation: #17458
  • JointThreshold algorithm for joint M2T and T2T decoding: #18171
  • Basic dLLM scheduling strategy and implementation: #17484

Speculative Decoding

  • Fix out-of-index bug caused by torch garbage collection in Spec V2: #18958
  • Move forward timeout before verify to fix Eagle v1 filter mismatch: #18760

Dependencies

  • Flashinfer updated to 0.6.3: #17700
  • AITER updated to 0.1.10.post3: #18741
  • Mooncake transfer engine updated to 0.3.9: #18316

AMD Hardware

  • AITER updated to v0.1.10.post3 with FP8 prefill, FP8 decode, and FP8 KV cache support
  • ROCm 7 standardization and ROCm 6.3 deprecation: #17785
  • Kimi K2.5 Day 0 ROCm support: #17863
  • FP8 prefill attention kernel integration: #18528
  • Two-batch overlapping for MORI EP: #17953
  • DeepSeek V3.2 and Kimi-K2 nightly CI tests: #17523

NPU/Ascend

  • Support for MiniCPM3-4B: #16866
  • Qwen 3.5 support on Ascend: #18544
  • Accuracy improvements for StableLM-2: #17470
  • Bug fixes for DeepSeek V3.2 and DeepSeek-VL2: #17007

CPU Backend

  • Optimize Qwen3-Next model on CPU: #12525
  • Optimize flash_attn_varlen_func: #15708
  • Add INT4 kernels for CPU: #8226

Kernel Slimming

  • Migrate GPTQ-Marlin repack kernel to JIT: #18543
  • Migrate AWQ Marlin repack kernel to JIT: #18949

Documentation

  • Add RL documentation: #17663
  • Update torch compile description: #17819
  • Refine spec decode docs for SpecV2/STANDALONE/NGRAM: #18321
  • Consolidate diffusion documentation: #18095

Full Changelog: v0.5.8...v0.5.9
