Highlights

https://lmsys.org/blog/2026-01-16-sglang-diffusion/
https://lmsys.org/blog/2026-01-15-chunked-pipeline/
https://lmsys.org/blog/2026-01-21-novita-glm4/
https://lmsys.org/blog/2026-01-12-epd/

New Model Support

Day 0 Support for GLM 4.7 Flash: #17247
LFM2 model support: #16890
Qwen3-VL-Embedding & Qwen3-VL-Reranker model support: #16635, #16403
DeepSeek V3.2 NVFP4: https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4
[Diffusion] black-forest-labs/FLUX.2-klein-9B

DeepSeek V3.2 Optimization

Context Parallelism Optimization with support for fused MoE, multi-batch, and FP8 KV cache: #13959

Flash Attention 4

Support for Flash Attention 4 decoding kernels: #16034

SGLang-Diffusion

Run sglang-diffusion with diffusers backend
Features: Multi-LoRA inference, SLA attention backends, warmup switch in CLI, ComfyUI Plugin
Performance improvements for all models

Dependencies

sgl-kernel updated to 0.3.21: #17075
Cutedsl updated to 4.3.4: #17075
Added dependencies for tvm-ffi and quack-kernels: #17075
Flashinfer updated to 0.6.1: #15551
Mooncake transfer engine updated to 0.3.8.post1: #16792

Security

Fixed urllib and gpgv vulnerabilities: #17439

What's Changed

Refactor custom allreduce logics by @iforgetmyname in #13710
[Doc] Update DeepSeek-V3.2 document by @Fridge003 in #14321
Feature/support distilled vae generic by @baonudesifeizhai in #14195
[Performance] Optimize NSA Indexer K/S Buffer Access with Fused Triton Kernels by @Johnsonms in #13812
Update CODEOWNERS for multimodal by @mickqian in #14329
[bug fix] use npu phy id in container env by @jinke446 in #14266
[model-gateway] multimodality initialization by @slin1237 in #13350
[Doc] Fix DeepSeek V32 Doc by @Fridge003 in #14336
sync attention, deepseek doc by @b8zhong in #14335
[PD] Support decode pp for PD disaggregation by @ShangmingCai in #14265
[model-gateway] add image processor and transformer structure by @slin1237 in #14344
[CPU] Support chunk_gated_delta_rule kernel for Qwen3-Next by @Valentine233 in #12441
[bugfix] Fix prefill tbo disabled when --deepep-mode=auto by @yuhyao in #14333
[CI] update estimated elapsed time of some unittests by @ch-wan in #14347
[NPU] bug fix: w_vc need contiguous for NPU batch_matmul_transpose ops by @ZhengdQin in #13980
[bugfix] NpuFuseEPMoE miss initialization parameters by @chenxu140 in #14295
[Ascend] fix AscendAttnMaskBuilder bug to support float16 models by @MichelleWu351 in #14271
Tiny adjust CI testcases by @hnyls2002 in #14362
[NPU][Doc] updated installation guide for Ascend NPU by @VDV1985 in #13585
Feature/add vae path to cli doc#14004 by @baonudesifeizhai in #14355
[CPU] add fused_qkvzba_split_reshape_cat kernel for Qwen3-next by @blzheng in #12330
Single Batch Overlap for MoE Models by @Sulfur6 in #9660
Move custom_ops under layers; move _custom_ops.py → custom_all_reduce_ops.py by @merrymercy in #14326
[model-gateway] add llava model image processor and tests by @slin1237 in #14371
ci: Migrate AMD workflows to new MI325 runners; temporarily disabled failed CI's to be added back by @sunxxuns in #14226
[Tiny]Small fixes in deepseek v32 doc by @Fridge003 in #14372
Fix validation to detect missing model files before loading by @alisonshao in #14253
[model-gateway] add qwen2_vl model image processor and tests by @slin1237 in #14374
[model-gateway] add qwen2.5_vl model image processor by @slin1237 in #14375
Revert "Revert "enable csgmv automatically on cuda"" by @b8zhong in #14277
[model-gateway] use worker crate in openai router by @slin1237 in #14330
[model-gateway] add qwen3_vl model image processor by @slin1237 in #14377
Fix sgl-router silently parse selector wrongly causing OME fail to discover pods by @fzyzcjy in #14359
[sgl-kernel][Feat][B200][1/N]Support MXFP8 Grouped GEMM in Blackwell by @HydraQYH in #13731
[CPU] document updates by @ZailiWang in #14272
Support PP x PD decode with nixl backend by @bluecoffee8 in #14392
[VLM] Introduce Cache for positional embedding ids for Qwen-VL family by @yuan-luo in #14292
use faster conversion from float8_e4m3fn to bfloat16 by @mingfeima in #12316
[model-gateway][doc] Add STDIO Explicitly to Example in README by @xuwenyihust in #14393
[CPU] add support for mamba causal conv1d for qwen3-next by @mingfeima in #12309
[model-gateway] add phi3 vision image processor by @slin1237 in #14381
[model-gateway] introduce provider in openai router by @slin1237 in #14394
[AMD] fix the regression issue for DeepseekV3 on MI300 by @yctseng0211 in #14383
[NPU][1/N] NPU basic functions refactor and new modelslim quant type by @iforgetmyname in #13359
[CPU] Optimize small oc GEMM for Qwen3-next on CPU by @jianan-gu in #12446
Try to fix B200 DeepEP error by @fzyzcjy in #14399
[1/2] Add rope kernel in sgl-kernel by @Qiaolin-Yu in #14334
[bug fix] fix ima with get_mla_kv_buffer_kernel overflow by @XucSh in #14224
Add Mistral Large 3 support. by @dcampora in #14213
[diffusion] fix gen video doc by @yeahdongcn in #14409
Add 'NPU' to the runtime exception message in get_device by @rauletorresc in #14225
Add mooncake transfer_engine_bench into manual test by @hnyls2002 in #14429
[model-gateway] add phi4 vision image processor by @slin1237 in #14430
diffusion: Add Configurable Generator Device and Seed Support via API by @niehen6174 in #14366
[model-gateway] introduce request ctx for oai router by @slin1237 in #14434
[NPU]add nightly-test-npu by @cherryblo in #14143
[model-gateway] add llama4 vision image processor by @slin1237 in #14438
[model-gateway] extract conversation out of oai router by @slin1237 in #14440
[DeepseekV3.2][NSA][Indexer] Fix PAGED top-k transform for NSA indexer chunked execution on H200 by @YAMY1234 in #14325
[model-gateway] move oai header util to router header util by @slin1237 in #14441
[FIX] trtllm-moe-fp4-renorm for Qwen series models by @samuellees in #14350
add doc for quantized kv cache by @b8zhong in #14348
fix: Correct environment variable syntax in docker-compose configuration by @yankay in #8287
[model-gateway] move all responses api event from oai to proto by @slin1237 in #14446
[model-gateway] add mistral 3 image processor by @slin1237 in #14445
[model-gateway] grpc to leverage event type by @slin1237 in #14450
ministral3 by @JustinTong0323 in #14251
[Bug] fix not desired disable fused share experts caused by rocm logic by @ocss884 in #14432
Rename secrets.WHL_TOKEN -> secrets.GH_PAT_FOR_WHL_RELEASE by @sglang-bot in #14421
further optimize model load by @zyksir in #13836
Add CI permissions for user 'yushengsu-thu' by @alisonshao in #14468
[ez] Fix typing by @yinghai in #14473
Add AMD stage support to /rerun-stage command and fix related bugs by @alisonshao in #14463
Add YAMY1234 to CI Permission by @Fridge003 in #14475
clean up gemlite usage by @zminglei in #14444
[diffusion] chore: further improve model searching logic by @mickqian in #14484
fix bug about pin memory by @zyksir in #14472
[diffusion] cli: add argument --adjust-frames, --override-protected-fields by @gmixiaojin in #13996
dockerfile: add lightweight runtime stage and refactors by @ishandhanani in #13861
diffusion: Fix CLIP text encoder attention mask and causal masking bug by @niehen6174 in #14364
Enable RadixCache for Mamba2 models by @roikoren755 in #13584
[Diffusion] Fix profiler trace missing Python stack in diffusion pipeline by @BBuf in #14499
support GLM-V vision model dp by @zRzRzRzRzRzRzR in #14097
[misc] add model arch and type to server info and use it for harmony by @slin1237 in #14456
Add Mistral Large 3 Eagle Support by @elvischenv in #14466
Add Mistral Large 3 to nightly CI tests by @alisonshao in #14459
[diffusion] chore: set allowing overriding protected fields of sampling params as default behavior by @mickqian in #14471
[model-gateway] move conversation to first class routing by @slin1237 in #14506
[Spec] Mamba2 support in target models by @roikoren755 in #13434
[diffusion] Support cache-dit by @Brain97 in #14234
Add fused FP8 KV cache write kernel for TRTLLM MHA backend by @harvenstar in #14093
[model-gateway] Add WASM support for middleware by @tonyluj in #12471
[model-gateway] reorganized conversation handler by @slin1237 in #14507
tiny remove deprecated endpoint call by @b8zhong in #13607
[model-gateway] fix server info comment by @slin1237 in #14508
Add Mistral Large 3 basic test to PR CI by @alisonshao in #14460
Fix removing worker will make it healthy forever in prometheus metrics by @fzyzcjy in #14420
[model-gateway] Make Tokenizer Builder Aware of Env Vars Like HF_ENDPOINT by @xuwenyihust in #14405
[model-gateway] change sgl-router to sgl-model-gateway by @slin1237 in #14312
[model-gateway] fix left over sgl-router names to sgl-model-gateway by @slin1237 in #14512
[model-gateway] fix logs in smg workflow by @slin1237 in #14513
[model-gateway] fix left over sgl-router names in wasm by @slin1237 in #14514
[model-gateway] fix code owner for wasm by @slin1237 in #14516
chore: bump sgl-kernel version to 0.3.18.post3 by @sglang-bot in #14427
Tiny use trtllm_mha as default when possible by @fzyzcjy in #14291
[Docs] Add /rerun-stage command to contribution guide by @alisonshao in #14521
Fix safetensors validation to catch corruption after download by @alisonshao in #14465
[CODEOWNER] update codeowner for qwen3-next related by @hanming-lu in #14522
fix rmsnorm -> layernorm in qwen3 omni by @vincentzed in #11791
[diffusion] chore: temporarily upgrade diffusers to make Z-image compatible with Cache-DiT by @mickqian in #14530
[bug] fix notebook to include new keys from model_info by @slin1237 in #14528
Revise DP Multi-Modal Encoder Document by @yhyang201 in #14290
[CPU] add mamba fla kernels for Qwen3-next by @blzheng in #12324
Revert "tiny remove deprecated endpoint call" by @Fridge003 in #14533
support mtp with deepseek r1 nvfp4 model by @rainj-me in #13115
[diffusion] refactor: simplify SamplingParams override logic by @mickqian in #14539
[Diffusion] Add QKV fusion optimization for Flux models by @BBuf in #14505
[model-gateway][tracing]: implement request tracing using OpenTelemetry with trace context propagation (HTTP) by @sufeng-buaa in #13897
diffusion: fix LoRA dtype handling and weight attribute access for z-image model by @niehen6174 in #14543
fix "GrammarMatcher has terminated after accepting the stop token, but is trying to find the next token mask" when both reasoning and spec are enabled by @gongwei-130 in #14464
[1/n] Fix hanging during DeepGemm Warmup by @Fridge003 in #14493
[Bug fix] Add /model_info endpoint to mini_lb by @alisonshao in #14535
[Qwen3-next] remove heuristics and add radix cache kl test by @hanming-lu in #14520
[Misc]Register and refactor some environs for dpsk-fp4 and DeepEp by @Fridge003 in #14538
chore: bump sgl-kernel version to 0.3.18.post3 by @sglang-bot in #14518
Update CI_PERMISSIONS.json by @harrisonlimh in #14552
Update DeepSeek V3 docs to use B200 by @leejnau in #14447
[Doc] Add short explanation on page size by @b8zhong in #14557
[docs] Add missing word in argument description by @almaslof in #14205
support piecewise cuda graph for Olmo models by @zminglei in #14476
Enhance prefill PP node robustness by @qhsc in #14494
DOC update nemo-skills in docs by @gwarmstrong in #14555
remove unnecessary dual stream token threshold from the rest of models (qwen moe, kimi linear, etc.) by @b8zhong in #14337
feat(ci): add framework target to release-docker workflows by @ishandhanani in #14559
Fix attention backend logic for Qwen3-Next on SM100 by @Chen-0210 in #14560
[FLA] Add explicit kernel arguments to kda.py for Kimi Linear support by @alisonshao in #14561
Add CUDA kernel size analysis tool for sgl-kernel optimization by @BBuf in #14544
[DLLM] Add threshold based parallel decoding support by @btw616 in #14412
Add unit-test-backend-8-gpu-b200 to rerun-stage command by @alisonshao in #14569
[apply][2/2] Fused qk_norm_rope for Qwen3-MoE by @yuan-luo in #13998
Add Expert Parallelism (EP) support for kimi-k2-thinking by @BBuf in #13725
Tiny remove wrong import from python.sglang by @hnyls2002 in #14577
Add small model test for spec v2 + dp + trtllm_mla by @hnyls2002 in #14576
[diffusion] cli: profiling utilities support by @AichenF in #14185
[NPU]LoRA: Adding Torch Native backend by @vlserov in #14132
[BugFix] fix prefixcache performance and accuracy on ascend by @khalil2ji3mp6 in #13573
Fix FP8 KV Triton type issue and add regression test by @harvenstar in #14553
Rename TensorRT Model Optimizer to Model Optimizer by @Edwardf0t1 in #14455
[CI] Tiny speed up VLM CI by @b8zhong in #14517
[Minor] Temporarily skipping deepep large mtp test by @Fridge003 in #14586
[model-gateway] extra accumulator and tool handler in oai router by @slin1237 in #14587
[model-gateway] Fixed WASM Security Vulnerability - Execution Timeout by @slin1237 in #14588
[model-gateway] reorganize metrics, logging, and otel to its own module by @slin1237 in #14590
Refactor tuning block wise kernel and opt Qwen/Qwen3-VL-32B-Instruct-FP8 by @BBuf in #14141
[CI]Unblock and split spec v2+dp test by @Fridge003 in #14551
[Tool Call] Fix DeepSeekV32Detector skipping functions with no params in streaming mode by @momaek in #14573
[feat] use cachebuffer to store mm feature to speedup hash by @liusy58 in #14386
[CI] Fix unit-test-backend-8-gpu-b200 running on every /rerun-stage by @alisonshao in #14591
[model-gateway] fix WASM memory limit per module by @slin1237 in #14600
Tiny fix missing policy decision recording by @fzyzcjy in #14605
Super tiny remove unneeded policy flag by @fzyzcjy in #14608
[model-gateway] refactor otel to be more efficient by @slin1237 in #14604
Super tiny remove unused select_worker_pair by @fzyzcjy in #14609
[model-gateway] fix WASM unbounded request/response body read vuln by @slin1237 in #14612
[2/2] Add rope kernel in sgl-kernel by @Qiaolin-Yu in #14452
[DLLM] Add initial cuda graph support by @btw616 in #14203
Super tiny fix unused code in router by @fzyzcjy in #14618
[Glm46v] Bug fix for accuracy drop and unable to launch server by @byjiang1996 in #14585
Fix amd rope definition by @Qiaolin-Yu in #14556
modify the sgl-kernel to be compatible with transformers 5.x. by @yhyang201 in #14625
[Reasoning + Structured Output] make reasoning compatible with structured output by @Muqi1029 in #12551
[Feat] add support for LoRA layers in transformer_2 within LoRAPipeline by @Prozac614 in #14606
chore: bump sgl-kernel version to 0.3.19 by @sglang-bot in #14632
[cpu] Implement all gather/reduce for arm64 cpu by @cyb70289 in #12527
[diffusion] chore: further refine output resolution adjustment logic by @mickqian in #14558
Fix dp-aware incompatible with service-discovery by @fzyzcjy in #14629
update transformers package version to 5.0.0rc0 by @yhyang201 in #14356
chore: bump sgl-kernel version to 0.3.19 by @sglang-bot in #14649
chore: bump SGLang version to 0.5.6.post1 by @sglang-bot in #14651
[AMD] change fused rms quant interface for aiter upgrade by @yctseng0211 in #14497
[model-gateway] reducing cpu overhead in various places by @slin1237 in #14658
[model-gateway] reduce cpu overhead in grpc router by @slin1237 in #14663
[model-gateway] fix WASM arbitrary file read security vuln by @slin1237 in #14664
vlm: Use fa3 as the default backend for qwen3 vl by @mickqian in #14634
[model-gateway] Optimize memory usage in HTTP router by @slin1237 in #14667
fix: use .get() when accessing strict mem-check env variable by @yhyang201 in #14657
improve default glm mtp setting by @b8zhong in #14457
Fix cache-aware router should pick min load instead of min tenant size by @fzyzcjy in #14650
Bump up diffusers to latest official release version by @byjiang1996 in #14670
[model-gateway] add OTEL integration to grpc router by @slin1237 in #14671
[CI] Increase max-parallel to 15 for high priority PRs by @alisonshao in #14675
[HiCache] fix condition check when use decode offload by @ssssnow in #14489
[RadixTree] Optimize the Time Complexity of Node Retrieval Operation from O(n*m) to O(n) by @CLFutureX in #13334
Tiny support printing requests in bench_serving for observability by @fzyzcjy in #14652
Aiter fp8 kv cache by @kkHuang-amd in #13147
[SMG]feat: implement TokenGuardBody for managing token return by @jimmy-evo in #14653
[NPU] chore: bump basic software version to 8.3.rc2 by @iforgetmyname in #14614
[CI] Unblock gb200 cutedsl test by @Fridge003 in #14469
Add ffmpeg into sglang docker - required by transformers multimodal V… by @byjiang1996 in #14679
[Bugfix] Fix KeyError for Mistral-Large-3 rope_scaling config by @alisonshao in #14627
Tiny support sgl-router http response status code metrics by @fzyzcjy in #14689
[CI] Migrate Eagle 1-GPU tests to test/registered/ by @alisonshao in #14529
Revert "[Bug] fix not desired disable fused share experts caused by r… by @zhyncs in #14676
Add per-request decode tp size by @merrymercy in #14678
[ci][smg] fix docker release ci and add it to pr test by @slin1237 in #14683
Tiny extract select_worker_min_load by @fzyzcjy in #14648
Fix dp-aware incompatible with completions and chat completions APIs by @fzyzcjy in #14647
[CI] Fix Llama 3.1 8B FP4 CI by @b8zhong in #14699
fix: make override DeepseekV2Model work by @zhyncs in #14707
chore: add code owners for deepseek_v2.py by @zhyncs in #14714
[CI] Move mistral large 3 basic to nightly by @alisonshao in #14622
fix the deepep 8 gpu unit test by @rainj-me in #14601
Add fuse_marlin_moe test to ci and add new ep test by @BBuf in #14686
[Bugfix] Fix environ error in scheduler_runtime_checker_mixin.py by @llfl in #14461
[Feat] Add received_time in serving_base by @zhanghaotong in #13432
fix: prevent HuggingFace access when SGLANG_USE_MODELSCOPE is enabled by @yrk111222 in #12039
[Test] Skip STANDALONE speculative decoding tests for different hidden sizes by @alisonshao in #14733
[diffusion] support batch compare by @Brain97 in #14738
Revert "[Feat] Add received_time in serving_base" by @merrymercy in #14743
[Model] Add PaddleOCR-VL Model Support by @yudian0504 in #12953
fix rope parameter initialization error caused by transformers v5.0 update by @yhyang201 in #14745
[model-gateway] optimize core modules by @slin1237 in #14751
[SMG] perf: optimize tokenizer for reduced CPU and memory overhead by @slin1237 in #14752
Add FP8 Blockwise GEMM Backend Flag --fp8-gemm-backend by @b8zhong in #14379
fix: checking if tokenizer is in cache before downloading from HF by @dougyster in #14698
fix: making rate limit a warning instead of error by @dougyster in #14753
move multi-item scoring functions in tokenizer manager into a separate file by @merrymercy in #14740
Improve CI by trying a warmup before unit tests by @merrymercy in #14669
[Perf] Optimize radix tree for cache-aware load balancing by @slin1237 in #14758
[Feature] Add LoRA support for embedding layers by @yushengsu-thu in #14177
[model-gateway] release gateway 0.2.4 by @slin1237 in #14763
[ci]: Enable the new hf API by @MingxuZh in #14687
Re-add the API serving timing metrics. by @hnyls2002 in #14744
fix: adding rate limit warning at verify token permission stage by @dougyster in #14756
Disable 8-gpu-b200 runner in PR tests by @alisonshao in #14768
[fix] Fix issues for in-flight weight updates by @ShawnY112358 in #14064
[Auto Sync] Update data_parallel_controller.py, detokenizer... (20251209) by @merrymercy in #14759
fix: race condition between validation and download locks by @alisonshao in #14761
Fix VLM accuracy thresholds for nightly tests by @alisonshao in #14777
fix server args bug by @TomerBN-Nvidia in #14725
handling incomplete rope_scaling config in ci after transformers upgrade by @yhyang201 in #14784
fix b200 ci by @b8zhong in #14786
[RL] support weight reload for low-bit rollout by @AniZpZ in #9650
fix: add missing logic for SGLANG_USE_MODELSCOPE variable by @yrk111222 in #14794
fix b200 fa4 ci by @b8zhong in #14788
[diffusion] profile: early exit when enough steps are captured to red… by @mickqian in #14803
[GLM-4.6V] Support Pipeline Parallelism for GLM-4.6V & GLM-4.1V by @yuan-luo in #14720
[CI] Add LoRA support to diffusion server configuration and test cases by @Prozac614 in #14697
Revert "fix: checking if tokenizer is in cache before downloading from HF" by @yhyang201 in #14808
[Diffusion] Refactor diffusion fuse qkv and support qwen-image by @BBuf in #14793
[Router-GO] implement a Go SGLang Router - OpenAI Compatible API Server by @whybeyoung in #14770
[model-gateway] Dynamically Populate Tool Call Parser Choices by @xuwenyihust in #14807
Support HTTP response status code prometheus metrics by @fzyzcjy in #14710
Fix router keep nonzero metrics after worker is deleted by @fzyzcjy in #14819
Tiny fix incorrect worker removal command by @fzyzcjy in #14822
[NPU] bug fix for mtp and w4a8 by @liupeng374 in #14806
[CI] fix UT success check in test_eagle_infer_beta_dp_attention.py by @hnyls2002 in #14831
Fix CI registry scan to only check test/registered directory by @alisonshao in #14812
[model-gateway] add anthropic message api spec by @slin1237 in #14834
Fix tiny typo in multimodal_gen/README.md by @wplf in #14830
Tiny support customizing Prometheus duration buckets by @fzyzcjy in #14716
Tiny support engine response http status statistics in router by @fzyzcjy in #14712
[CI] Reduce stage-b auto-partition from 4 to 2 by @alisonshao in #14769
Apply back moe_sum_reduce for fused_marlin_moe by @ispobock in #14829
[diffusion] parallel: pad tokens for video models under sp by @mickqian in #14833
[diffusion] CI: use unified sampling_params for CI by @mickqian in #14045
[Auto Sync] Update tool_chat_template_deepseekv31.jinja (20251210) by @zhyncs in #14837
Revert transformers to 4.57.1 by @yhyang201 in #14801
[model-gateway] Fix incompatible metric comparison in PowerOfTwo policy by @ppraneth in #14823
[bugfix] qwen25-VL support lora by @SYChen123 in #14638
fix lora target all + csgmv backend by @b8zhong in #14796
[model-gateway] adds default implementations to RouterTrait in mod.rs by @slin1237 in #14841
[AMD] Add model to AMD nightly test by @michael-amd in #14442
Treat unittest SkipTest exception as pass instead of as failure by @byjiang1996 in #14847
[model-gateway] code clean up on oai router by @slin1237 in #14850
[model-gateway] fix import order in oai conversation by @slin1237 in #14851
fix fp8 gemm nightly CI by @b8zhong in #14844
fix: restrict cache validation behaviors to CI only by @alisonshao in #14849
Fix CUDA version handling in ci_install_deepep.sh by @merrymercy in #14854
Fix TestGLM41VPPAccuracy test flakiness by @byjiang1996 in #14848
Minor code style fix for dllm by @hnyls2002 in #14836
Enable TP for Mamba-based models by @roikoren755 in #14811
[CI] Temp disable gb200 test by @Fridge003 in #14865
Refactor Marlin MoeRunner by @trangdough in #14554
[6/n] Fix num_token_non_padded computation in prefill by @yuchengz816-bot in #14313
Remove myself to test CI gate issue by @Kangyan-Zhou in #14871
fix: creating blobs only once for publish trace retries by @dougyster in #14845
Move and update MindSpore docs, make it appear on the online documentation by @wangtiance in #14861
fix nightly vlm ci : restore original eval for requests without regex by @yhyang201 in #14875
Only count limitations for previous runs that reach the test stages by @Kangyan-Zhou in #14856
[CI][BUG] fix ib setup for disaggregation hicache test by @luketong777 in #14877
[Fix] Remove unused import from test_disaggregation_hicache.py by @ShangmingCai in #14880
fix: adding temporary bypass for nightly tests by @dougyster in #14876
Avoid deleting entire cache for missing shards (#14754 follow-up) by @alisonshao in #14853
Tiny add more error info for bench_serving by @fzyzcjy in #14827
Tiny support range ratio in GSP in bench serving by @fzyzcjy in #14828
[diffusion] feat: enable torch compile to eliminate GPU bubble by @AichenF in #13641
[NPU]dsv3.2 cp for npu by @liupeng374 in #14541
[diffusion] feat: support sageattn & sageattn3 backend by @mickqian in #14878
[Ascend]Support of piecewise graph compilation for prefill on NPU by @Vladimir221 in #12287
Introduce server_fixtures in sglang.test by @hnyls2002 in #14899
[diffusion] UX: suppress excessive loggers by @mickqian in #14900
Tiny refactor cleanup WorkflowContext.get_or_err by @fzyzcjy in #14890
Tiny clean router load report logic by @fzyzcjy in #14889
[model-gateway] code clean up on oai router in responses by @slin1237 in #14852
[model-gateway] fix annotation error and code formatting by @slin1237 in #14910
[model-gateway] fix imports and delete unused code by @slin1237 in #14911
[docs] Fix kernel name by @almaslof in #14887
[SMG][DS32][fix] support dsv32, add role developer by @jimmy-evo in #14307
Update CI_PERMISSIONS.json by @Kangyan-Zhou in #14917
[FIX][DS32]openai protocol: support openai message role: developer by @jimmy-evo in #14304
[loader] enable private loader by @yinghai in #14620
chore: bump SGLang version to 0.5.6.post2 by @sglang-bot in #14858
extend timeout for b200 test by @b8zhong in #14925
ci: adding more nightly tests to bot bump workflows by @dougyster in #14928
update mistral detector by @JustinTong0323 in #14921
support non-disturbing remote-instance-weight-loader by @amysaq2023 in #13125
[refactor] Update reasoning parameter to require_reasoning by @JustinTong0323 in #14922
[CPU] layernorm & fused add-layernorm kernels by @ZailiWang in #14074
Add retry logic for scheduled CI tests by @alisonshao in #14771
[CI] Add Mistral Large 3 Eagle nightly performance test by @alisonshao in #14525
fix: handle Jinja2 template errors as client errors in OpenAIServingChat by @JustinTong0323 in #14748
Fix black formatting in ci_utils.py by @alisonshao in #14932
[bugfix] fix TBO crashes when attn_tp_size > 1 by @yuhyao in #13730
fix: making the publish trace error check broader by @dougyster in #14931
[CI]add nightly CI for glm4v_moe arch model by @zminglei in #14927
Check KV4 compatibility with attention backends and add KV4 support to the attention_backend doc by @JackChuang in #14467
Re-org eagle unit tests by @hnyls2002 in #14909
Super tiny remove sgl_router_active_workers by @fzyzcjy in #14891
remove dpsk3.2 sys prompt by @JustinTong0323 in #14923
[DLLM] Add documentation for diffusion LLMs by @ClawSeven in #14358
[RL] refactor flash rl weight reload in sglang by @AniZpZ in #14870
[PP] Refactor PP to async mode by @XucSh in #11852
[Fix] Enable applying different LoRA adapters to different transformers in multi-transformer pipelines by @Prozac614 in #14839
[model-gateway] optimize worker selection by @ppraneth in #14894
Fix negative duration panic in token bucket wait time calculation by @xiaguan in #14941
Tiny add router e2e duration histogram by @fzyzcjy in #14892
Tiny add e2e http request arrival metric by @fzyzcjy in #14893
Super tiny remove non-updated sgl_router_worker_load by @fzyzcjy in #14888
Super tiny move error.rs by @fzyzcjy in #14944
direct register custom op for mm_fp4 by @b8zhong in #13699
fix: trtllm mha attention auto-selection on sm120 by @b8zhong in #14842
Super tiny refactor error.rs logic by @fzyzcjy in #14949
[NPU] optimization for dsv3.2 by @ZhengdQin in #14572
[NVIDIA] Enable TRTLLM BF16 MoE on Blackwell GPUs by @samuellees in #13798
[Fix] suppress remote weight loading engine w/o mooncake installed by @ZailiWang in #14937
enable flashinfer-jit-cache in image build and ci install to speed up model launch by @gongwei-130 in #14959
[diffusion] chore: minor code cleanups and improve logging by @mickqian in #14916
[Diffusion] upgrade cache-dit for better compatibility by @DefTruth in #14534
[1/N] Update doc of Pipeline Parallelism by @ShangmingCai in #14985
[PD] Add decode PP event loop for PD disaggregation by @bluecoffee8 in #14945
[Diffusion] Tiny fix Docker Hub link in installation documentation by @BBuf in #14987
Update CODEOWNERS for multimodal_gen by @mickqian in #14995
[model-gateway] refactor: extract workflow engine to src/workflow module by @slin1237 in #14996
[model-gateway] feat: add DAG parallel execution support and workflow optimization by @slin1237 in #14999
[model-gateway] fix: handle workflow deadlock and optimize cycle detection by @slin1237 in #15000
[model-gateway] refactor: workflow engine cleanup and minor optimization by @slin1237 in #15001
Fix CI by reverting incorrect metric check logic by @Kangyan-Zhou in #15004
Super tiny extract route_typed_request_once by @fzyzcjy in #14951
Revert several PRs by @zhyncs in #14958
Add KV4-capable backend flashmla and update server args by @JackChuang in #14989
Refactor of http and engine entrypoints to allow custom override by @merrymercy in #14869
Update ci permission by @merrymercy in #15014
[model-gateway] refactor: unify worker management into modular workflow structure by @slin1237 in #15010
Tune triton fused moe for the case of glm-4.6-fp8 b200 tp4 by @Qiaolin-Yu in #15020
[Feature] Multi lora optimization - resolve scheduler blocking issue and save Non-Lora inference performance by @ConnorLi96 in #14795
[registry] Add a strict mode to model registration by @yinghai in #14933
Super tiny remove unused argument by @fzyzcjy in #14966
Super tiny fix confusing slash_command_handler hint by @fzyzcjy in #14976
Super tiny add gsp-fast-prepare by @fzyzcjy in #14992
Tiny extract SchedulerWatchdog by @fzyzcjy in #15021
Add soft watchdogs to debug soft hangs by @fzyzcjy in #15023
Clean up server args and engine startup processes by @merrymercy in #15015
tiny update: use rope kernel in sgl-kernel for amd by @Qiaolin-Yu in #14955
Tiny remove the duplicate function in spec v2 by @hnyls2002 in #14957
Fix regression caused by fa3 block_table by @wenscarl in #15009
Add a special label for b200 CI runner that can run kernel tests by @Kangyan-Zhou in #15033
[CI]Add gb200 runner back by @Fridge003 in #15024
Fix decode OOM caused by retraction by @hnyls2002 in #14939
Super tiny remove unused log_request by @fzyzcjy in #15035
Add code field and unify error responses for router by @fzyzcjy in #15028
Tiny unify grpc existing error responses into new format by @fzyzcjy in #15030
Tiny change http router response format to unify by @fzyzcjy in #15031
Provide more fine grained error reason for reqwest error by @fzyzcjy in #15032
Add error code in prometheus metrics and add X-SMG-Error-Code header by @fzyzcjy in #15036
Add sgl_router_attempt_http_responses_total for single attempt information by @fzyzcjy in #15037
call check_quantized_moe_compatibility after initialize by @chunyuan-w in #13876
Mistral Large 3 NVFP4 support by @dcampora in #14485
[diffusion] fix: use NDRotaryEmbedding in flux_2 by @mickqian in #15034
[Fix] Disable trtllm moe backend for draft model for a quick fix by @samuellees in #15002
feat: Improve LoRA compatibility by adding unified format detection and diffusers-based normalization by @MikukuOvO in #14659
feature: adding nightly wheel workflow and indexer by @dougyster in #14924
Fix GLM-4.6 tool calls don't support streaming output for arguments i… by @cynial in #13989
[PP Prefill][NIXL] Fix PP mode transfer completion tracking to wait for all ranks by @YAMY1234 in #15027
[NPU] perf update with kvcache nz & w4a8 quant by @liupeng374 in #14423
Clean up GDN Init by @hebiao064 in #14855
[VLM] Support VLM ViT Piecewise CUDA Graph by @yuan-luo in #14422
Fix load metric not updated when using guard by @fzyzcjy in #15059
Fix double decrease load by @fzyzcjy in #15060
[Diffusion] Add multimodal gen profiling doc by @BBuf in #15069
Fix IMA with flashinfer + spec + topk & Add radix attention test cases for eagle by @hnyls2002 in #13740
[diffusion] doc: update profiling.md with output location details by @mickqian in #15072
Tiny adjust CI run suite by @hnyls2002 in #15074
Fix spec info's filter when reqs are finished right after prefill by @hnyls2002 in #14742
[model-gateway] Simplify error response creation by @slin1237 in #15079
[bug] fix grpc scheduler launcher breaking change by @slin1237 in #15080
feature: ci failure monitor improvements by @dougyster in #15055
fix: adding schedule for nightly wheel by @dougyster in #15054
fix flaky image access in ci by switching to raw content url by @yhyang201 in #14940
[scheduler] enhance scheduler in dp_attention mixed case with spec by @liupeng374 in #14201
add transformers version validation for glm-4.6v moe models by @yhyang201 in #14998
[model-gateway] Avoid MCP Server Initialization Issue by @xuwenyihust in #15065
Add nightly accuracy test for DeepSeek V3.2 by @Fridge003 in #14935
fix: dpskv32 chat history processing, default drop_thinking to true by @JustinTong0323 in #15064
[model-gateway] Refactor worker steps and add update workflow by @slin1237 in #15085
Add sglang:decode_sum_seq_lens metric by @fzyzcjy in #15066
[Doc][TPU]add sglang-jax tpu docs by @JamesBrianD in #15056
[Fix] avoid stream sync in _compute_mrope_positions by @narutolhy in #14956
Support prefill max requests limitation by @fzyzcjy in #14993
[model-gateway] Remove unused TokenizerMetrics to reduce CPU overhead by @slin1237 in #15087
[Fix] Environment variable SGL_* is deprecated by @miter6 in #14943
[model-gateway] Fix metric emission gaps and name mismatch by @slin1237 in #15093
[model-gateway] Add circuit breaker and discovery watcher metrics by @slin1237 in #15094
[model-gateway] optimize metric labels to avoid unnecessary allocations by @slin1237 in #15095
Fix issue not reported when load decrement is incorrect by @fzyzcjy in #15061
Avoid confusing zero value metric when worker is removed by @fzyzcjy in #15096
[NPU][CI] change the trigger of release image workflow by @monkeyLoveding in #14969
[ci] Move dpsk-r1-fp4 b200 test to stage b by @Qiaolin-Yu in #15084
[scheduler] remove scheduler allgather for best throughput by @liupeng374 in #14294
[NPU] bug fix for multi stream by @liupeng374 in #15048
Introduce native kv cache move by @hnyls2002 in #15108
[diffusion] feat support resolution check for video model by @Brain97 in #14881
[Diffusion] Tiny fix _templated_ring_attention bug by @BBuf in #15053
[Diffusion] feat: Add support for additional sampling parameters in video generation API by @BBuf in #15062
diffusion: support webui by @wplf in #14961
[kernel][moe] add moe topk fast by @thenumberouscode in #13969
feat: support EPD disaggregation by @gty111 in #12263
[CI] Add disaggregation decode PP test by @ShangmingCai in #15114
[Diffusion] Refactor fuse qkv with QKVParallelLinear linear by @BBuf in #15090
[model-gateway] Add new SMG metrics architecture with 6 layers by @slin1237 in #15106
Fix tensor mismatch error in spec + topk > 1 + page_size > 1 by @ZeldaHuang in #14874
[model-gateway] Implement Layer 1 HTTP metrics instrumentation by @slin1237 in #15121
feat(metrics): implement Layer 2 router metrics (smg_router_*) by @slin1237 in #15124
Fix Mamba2-based models' default attention backend by @roikoren755 in #15117
Add NanoV3 reasoning parser support by @danielafrimi in #15113
[hotfix]: Add missing args for 3FS bench_client.py by @hzh0425 in #14791
feature: ci failure monitor slack bot by @dougyster in #15110
[model-gateway] add streaming metrics (TTFT, TPOT, tokens, duration) for gRPC router by @slin1237 in #15125
[refactor] Move trtllm_fp8_kv_kernel to triton_ops directory by @harvenstar in #15044
feat(gateway): Add server-side TLS support by @Ratish1 in #15052
[model-gateway] Parallelize metrics requests by @ppraneth in #14953
Super tiny cleanup circuit breaker code by @fzyzcjy in #15098
Fix circuit breaker wrong metrics by @fzyzcjy in #15099
[NSA] Fix NSA backend assertion error when running DeepSeek-V3.2 PP with radix-cache by @YAMY1234 in #15086
[Diffusion] Fix default resolution 720p width from 1080 to 1280 by @BBuf in #15058
fix: adjusting frequency for ci failure monitor by @dougyster in #15134
Add cyb70289 to CI permissions by @cyb70289 in #14938
Fix cache aware wrong routing caused by incorrect load tracking by @fzyzcjy in #15101
Fix H200 CI by commenting out Warmup Weights and JIT Compilation by @Kangyan-Zhou in #15139
docs: update usage by @zhyncs in #15142
[Qwen3-next] support mamba radix cache for overlap scheduler by @hanming-lu in #14792
Enable TRT Allreduce Fusion by default for compatible models by @b8zhong in #14764
[VLM] Support chunked vit attention by @yuan-luo in #14907
[model-gateway] Add Layer 3 worker metrics (smg_worker_*) by @slin1237 in #15130
[model-gateway] upgrade axum and axum server by @slin1237 in #15146
[model-gateway] Add streaming metrics for harmony gRPC router by @slin1237 in #15147
ci: adding errors to Github summary by @dougyster in #14778
Fix import warnings by @merrymercy in #15144
fix: move ci-bot by @dougyster in #15154
[model-gateway] add mcp and discovery metrics by @slin1237 in #15156
Tiny improve summary text in bench_one_batch_server.py by @hnyls2002 in #15158
fix CompressedTensorsW8A8Int8 min_capability by @mmdbhs in #13914
[diffusion]: support multi image input and qwen-image-edit-2509 by @yhyang201 in #15005
feature: PR wheel by @dougyster in #15170
[CPU] Add Gemma3RMSNorm kernel in sgl-kernel and add ut by @blzheng in #9324
fix: adding date and fixing release name issue by @dougyster in #15174
Fused two elementwise kernels for k_nope and k_pe concat by @kkHuang-amd in #14862
Fix num running requests (load) wrong cleared for ongoing requests by @fzyzcjy in #15116
[Feature] Fuse mrope all in 1 kernel by @DarkSharpness in #14906
chore: change npu pr-test a2 runner by @Goalina in #15152
[Diffusion] Cache dit support parallel by @BBuf in #15163
[diffusion] fix: Fixed pytorch non-writable array warning by @RuixiangMa in #15017
[diffusion] fix: fix video model sp when resolution is not specified by @mickqian in #15047
[model-gateway] Remove legacy RouterMetrics and Rename SmgMetrics to Metrics and smg_labels to metrics_labels by @slin1237 in #15160
Add missing assertion in NemotronH path by @roikoren755 in #15193
[Diffusion] Fix AttributeError in _build_parallelism_config when acce… by @BBuf in #15196
[diffusion] chore: minor code cleanups by @mickqian in https://github.com/sgl-project/sglang/pull/15190
[Diffusion] Zimage support pack qkv by @BBuf in https://github.com/sgl-project/sglang/pull/15191
fix(attention): Prevent trtllm_mha auto-selection with eagle3 speculative decoding by @Ratish1 in https://github.com/sgl-project/sglang/pull/15127
[Diffusion] Z-Image FFN pack gate and up proj by @BBuf in https://github.com/sgl-project/sglang/pull/15201
[NPU][eagle3] support qwen eagle3 on NPU by @Liwansi in https://github.com/sgl-project/sglang/pull/14820
Add cache for flashinfer installation by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/15153
feature: create docker image from pr branch by @dougyster in https://github.com/sgl-project/sglang/pull/15185
chore: update CI_PERMISSIONS by @zhyncs in https://github.com/sgl-project/sglang/pull/15212
[Feature] npu support enable_torch_compile for torchair backend by @XDaoHong in https://github.com/sgl-project/sglang/pull/13410
Add EPD disaggregation doc by @gty111 in https://github.com/sgl-project/sglang/pull/15224
[Bugfix][Tool Call] Add null system prompt to support tool system prompt by @Muqi1029 in https://github.com/sgl-project/sglang/pull/15092
[AMD CI] Temporarily disable 2 gpu accuracy test. by @saienduri in https://github.com/sgl-project/sglang/pull/15204
[BugFix] Fix CPU inference failure by @cyb70289 in https://github.com/sgl-project/sglang/pull/15231
[AMD CI] Fix typo. by @saienduri in https://github.com/sgl-project/sglang/pull/15229
[Feature] Add AIME25 dataset support for SGLang simple_eval by @yurekami in https://github.com/sgl-project/sglang/pull/14990
Remove duplicate bs=1 in nightly benchmark by @Fridge003 in https://github.com/sgl-project/sglang/pull/15162
[Diffusion] fix pack qkv opt break tensor parallel by @BBuf in https://github.com/sgl-project/sglang/pull/15225
[Qwen3-next] Add PD disaggregation support for mamba with extra_buffer by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15180
[bugfix][quark] Fixed an issue where per_token could not be properly recognized when the token count was 1. by @haoyangli0109 in https://github.com/sgl-project/sglang/pull/14415
Adding tool calling and reasoning parser support for Intern-S1 by @KennyYao2001 in https://github.com/sgl-project/sglang/pull/14866
fix: removing latest-sglang=1 by @dougyster in https://github.com/sgl-project/sglang/pull/15220
Increase timeout for TestDeepseekV3MTP for potential DeepGEMM cold start by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/15239
[CPU] Add 4D input support for ROPE in sgl-kernel by @blzheng in https://github.com/sgl-project/sglang/pull/9337
Support piecewise cuda graph for fused marlin moe by @ispobock in https://github.com/sgl-project/sglang/pull/15100
Enhance runtime memory check in CI by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15192
[Diffusion] Use current_platform.device_type to replace hard-coded cuda device by @yeahdongcn in https://github.com/sgl-project/sglang/pull/15232
[diffusion] doc: update profiling.md by @mickqian in https://github.com/sgl-project/sglang/pull/15270
[CI] Improve flaky 4 GPU test success rate by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15234
Fix accuracy issue when using a16w16 mla_decode_fwd by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/14936
[AMD] Support fused_rms_mxfp4_quant in the prefill stage for DeepSeek-R1-MXFP4 by @yichiche in https://github.com/sgl-project/sglang/pull/14975
[misc] Upgrade cutedsl to 4.3.1 by @Fridge003 in https://github.com/sgl-project/sglang/pull/14857
Fix lint by @Fridge003 in https://github.com/sgl-project/sglang/pull/15281
Remove incorrect BlockRemoved event emission during node splits by @nealvaidya in https://github.com/sgl-project/sglang/pull/14934
support non disturbing remote instance weight loader v2 by @amysaq2023 in https://github.com/sgl-project/sglang/pull/14997
[sgl-kernel] Update flashmla to include fp8 sparse_mla optimizations by @hlu1 in https://github.com/sgl-project/sglang/pull/15242
Fix lora doc by @Fridge003 in https://github.com/sgl-project/sglang/pull/15282
Fix test_pp_single_node.py estimated time from 800s to 500s by @alisonshao in https://github.com/sgl-project/sglang/pull/15291
fix: skipping TestEPDDisaggregationOneEncoder test by @dougyster in https://github.com/sgl-project/sglang/pull/15292
Revert "[misc] Upgrade cutedsl to 4.3.1 (#14857)" by @zhyncs in https://github.com/sgl-project/sglang/pull/15293
[NVIDIA] Fixes for NVFP4 all-gather with spec decoding by @trevor-m in https://github.com/sgl-project/sglang/pull/15280
[NPU] fix for NPU memory settings logic by @iforgetmyname in https://github.com/sgl-project/sglang/pull/15258
fix: moving decorator to header by @dougyster in https://github.com/sgl-project/sglang/pull/15297
Minor style fixes to the scheduler.py by @merrymercy in https://github.com/sgl-project/sglang/pull/15218
[Test] Update LoRA eviction policy tests to match current behavior by @alisonshao in https://github.com/sgl-project/sglang/pull/15283
[BugFix] fix gptq_marlin_gemm has no parameter called b_bias by @ehuaa in https://github.com/sgl-project/sglang/pull/13571
fix(function_call): fallback to decode when batch decode options differ by @luqitao in https://github.com/sgl-project/sglang/pull/15155
Add Ollama-compatible API endpoints + Smart Router by @alisonshao in https://github.com/sgl-project/sglang/pull/14376
[DeepSeekV3.2] Add pure TP+MTP test by @ashtonchew in https://github.com/sgl-project/sglang/pull/15088
[Perf] Enable Flashinfer autotune by default by @elvischenv in https://github.com/sgl-project/sglang/pull/14357
Update FP4 GEMM Benchmark by @b8zhong in https://github.com/sgl-project/sglang/pull/14449
Revert "direct register custom op for mm_fp4 (#13699)" by @b8zhong in https://github.com/sgl-project/sglang/pull/15284
diffusion: Add sampling parameters and model info endpoint to OpenAI API by @niehen6174 in https://github.com/sgl-project/sglang/pull/15071
[PP] Add pp support for Qwen3-VL by @XucSh in https://github.com/sgl-project/sglang/pull/12333
[Hotfix] Fix required enable_mamba_track argument for Flashinfer autotune path by @elvischenv in https://github.com/sgl-project/sglang/pull/15314
Fix gpu-fault when running mtp in eager mode by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/15233
[bug fix][pp] fix qwen3 model load by @XucSh in https://github.com/sgl-project/sglang/pull/15223
Fix the accuracy issue when running mxfp4 dsv3 model and enable ep by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/15304
fix qwenvl compressed tensors quantization weight loader by @LHXuuu in https://github.com/sgl-project/sglang/pull/11914
[Piecewise CUDA Graph] Support INT8 by @b8zhong in https://github.com/sgl-project/sglang/pull/14918
[bug fix][pp] fix weight load for qwen2.5-vl by @XucSh in https://github.com/sgl-project/sglang/pull/15138
[Diffusion] Add flux2 tp2 test in ci to avoid break diffusion tensor parallel by @BBuf in https://github.com/sgl-project/sglang/pull/15237
[Diffusion] Enhance trace export with gzip and integrity check by @BBuf in https://github.com/sgl-project/sglang/pull/15326
Add cuda_graph_forward_passes_total and num_retracted_reqs_total by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15189
Add realtime token counter metrics by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15198
Tiny dump native stacktraces in watchdog by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15222
Super tiny rename failure_count for consistency by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15186
[PP] Minor code cleanup for Pipeline Parallelism by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15329
tiny unify environ usage by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15335
[model-gateway] reduce cpu overhead by @slin1237 in https://github.com/sgl-project/sglang/pull/15316
[model-gateway] optimize worker registry and reduce lock contention in grpc client fetch by @slin1237 in https://github.com/sgl-project/sglang/pull/15336
[DeepSeek-V32]Update nightly performance benchmark by @Fridge003 in https://github.com/sgl-project/sglang/pull/15308
Fix dp run error with fp8-kv enable in high concurrency test by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/15241
fix: prevent points regex from matching checkpoints/endpoints by @xvlincaigou in https://github.com/sgl-project/sglang/pull/15120
Fix condition check for require_gathered_buffer by @ch-wan in https://github.com/sgl-project/sglang/pull/15328
Reserve more memory for DeepSeekOCR model and adjust server start timeout for DeepGEMM to reduce flakiness by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/15277
[CI] Migrate LoRA tests to test/registered/lora/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15176
Add request-level timestamp for when prefill finishes by @scottjlee in https://github.com/sgl-project/sglang/pull/14860
[Deepseek V3.2] Support Overlap Spec + NSA by @b8zhong in https://github.com/sgl-project/sglang/pull/15307
[VLM] Support cos sin cache for Qwen3-VL & GLM-4.1V by @yuan-luo in https://github.com/sgl-project/sglang/pull/15205
Feature/trtllm mha workspace size configurable #15089 by @baonudesifeizhai in https://github.com/sgl-project/sglang/pull/15131
feat: DeepSeek-V3.2 Streaming tool call output by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/15278
Add doc for qwen3 next by @yizhang2077 in https://github.com/sgl-project/sglang/pull/15337
fix: adjust time for test_epd_disaggregation.py by @dougyster in https://github.com/sgl-project/sglang/pull/15354
Mistral Large 3 NVFP4 TRTLLM MoE support by @elvischenv in https://github.com/sgl-project/sglang/pull/15049
unified management of environment variables for vlm cuda ipc transport by @yhyang201 in https://github.com/sgl-project/sglang/pull/14501
Split test_piecewise_cuda_graph.py to optimize CI resource usage by @alisonshao in https://github.com/sgl-project/sglang/pull/15290
Fix issue: ENABLE_BELOW_SM90 cannot be enabled on aarch64 CPU by @MarcoDWei in https://github.com/sgl-project/sglang/pull/12967
[PP] Fix dynamic chunking strategy for PP by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15372
[model-gateway] Replace PolicyRegistry RwLock with DashMap for lock-free policy lookups by @slin1237 in https://github.com/sgl-project/sglang/pull/15361
Monkey patch deepseek-ocr's v_head_dim by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15384
Fix gpt-oss yarn with truncate argument by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14270
Fix warp illegal instruction in kimi k2 thinking PCG by @BBuf in https://github.com/sgl-project/sglang/pull/15306
[bug fix][pp] fix inconsistent latency between tp by @XucSh in https://github.com/sgl-project/sglang/pull/15379
[sgl-kernel][1/2] Fused qk_norm_rope for GLM4.6 by @Kevin-XiongC in https://github.com/sgl-project/sglang/pull/15141
Clean up init function of the scheduler and event loop for PD by @merrymercy in https://github.com/sgl-project/sglang/pull/15298
[perf]optimize w4afp8 kernel on deepseek-v3-0324 by @Bruce-x-1997 in https://github.com/sgl-project/sglang/pull/12921
[Diffusion] Fix sglang generate --perf-dump-path to include per-denoising-step timings by @BBuf in https://github.com/sgl-project/sglang/pull/15397
Tiny fix unknown route in prometheus metrics by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15404
Support GPU execution time breakdown by forward mode metrics by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15396
Tiny extract ModelRunnerOutput by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15400
Super tiny add moe_ep_rank to prometheus labels by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15407
Support EPLB balancedness prometheus metric without GPU->CPU synchronize by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15401
[PP] Add dynamic chunking PP test by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15395
[Tiny]Add warning for deepgemm on Blackwell by @Fridge003 in https://github.com/sgl-project/sglang/pull/15352
Update benchmarks to use HF token from environment. by @FrankD412 in https://github.com/sgl-project/sglang/pull/15421
[Performance] optimize NSA backend metadata computation for multi-step speculative decoding by @Johnsonms in https://github.com/sgl-project/sglang/pull/14781
[AMD] Clear pre-built AITER kernels and warmup to prevent segfaults and test timeouts by @sunxxuns in https://github.com/sgl-project/sglang/pull/15318
multimodal: precompute hash for MultimodalDataItem by @sufeng-buaa in https://github.com/sgl-project/sglang/pull/14354
tiny fix lint on main by @b8zhong in https://github.com/sgl-project/sglang/pull/15424
feat(dsv32): better error handling for DeepSeek-v3.2 encoder by @jimmy-evo in https://github.com/sgl-project/sglang/pull/14353
Support using different attention backend for draft decoding. by @pyc96 in https://github.com/sgl-project/sglang/pull/14843
[DLLM] Add CI for diffusion LLMs by @ClawSeven in https://github.com/sgl-project/sglang/pull/14723
chore: update CI_PERMISSIONS by @zhyncs in https://github.com/sgl-project/sglang/pull/15431
[Deepseek V3.2] Fix Deepseek MTP in V1 mode by @b8zhong in https://github.com/sgl-project/sglang/pull/15429
[DLLM] Fix dLLM regression by @ClawSeven in https://github.com/sgl-project/sglang/pull/15371
[diffusion] profiling: add bench_serving.py and VBench by @mickqian in https://github.com/sgl-project/sglang/pull/15410
[Feature] Xiaomi MiMo-V2-Flash day0 support by @acelyc111 in https://github.com/sgl-project/sglang/pull/15207
fix mindspore import warning by @b8zhong in https://github.com/sgl-project/sglang/pull/15287
Update readme by @merrymercy in https://github.com/sgl-project/sglang/pull/15425
Add customized sampler registration by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15423
[Fix]: Refactor _build_req_from_sampling to use shallow_asdict by @cocoshe in https://github.com/sgl-project/sglang/pull/13782
[amd] Add deterministic all-reduce kernel for AMD (ROCm) by @sunxxuns in https://github.com/sgl-project/sglang/pull/15340
[AMD] Enable all diffusion models and fix encoder loading on MI325 by @zyzshishui in https://github.com/sgl-project/sglang/pull/13760
[sgl-kernel] chore: update deepgemm version by @FlamingoPg in https://github.com/sgl-project/sglang/pull/13402
fix: unreachable error check in retraction by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15433
[AMD] Fix and add accuracy-test-2-gpu-amd back by @yctseng0211 in https://github.com/sgl-project/sglang/pull/15415
[AMD] add unit-test-backend-8-gpu-amd back by @yctseng0211 in https://github.com/sgl-project/sglang/pull/15253
Support FP8 MLA prefill and 128k context. by @weireweire in https://github.com/sgl-project/sglang/pull/14395
[Auto Sync] Update scheduler_runtime_checker_mixin.py (20251219) by @merrymercy in https://github.com/sgl-project/sglang/pull/15437
[diffusion]Support url image input by @IPostYellow in https://github.com/sgl-project/sglang/pull/15262
diffusion: support qwen-image-edit-2511 by @yhyang201 in https://github.com/sgl-project/sglang/pull/15458
Fix: Support multiple input images to SGLang Diffusion when using generate mode by @suyedu in https://github.com/sgl-project/sglang/pull/15394
[Diffusion] kernel: timestep embedding kernel implementation by @66RING in https://github.com/sgl-project/sglang/pull/12995
[diffusion] Add Sage Attention 3 Support for sm 120 (RTX5090) by @ryang-max in https://github.com/sgl-project/sglang/pull/15382
[diffusion] fix: fix wrong validation on 2k resolution by @mickqian in https://github.com/sgl-project/sglang/pull/15478
Add MiDasheng Model Support by @Jacki1223 in https://github.com/sgl-project/sglang/pull/15219
fix: update model name after weights update by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15416
[Diffusion] Add diffusion attention backends doc by @BBuf in https://github.com/sgl-project/sglang/pull/15408
[NPU]Fix for ipc handle with npu by @hustmf in https://github.com/sgl-project/sglang/pull/14138
[NPU] bugfix for chunkedprefill by @Hexq0210 in https://github.com/sgl-project/sglang/pull/15166
Tiny fix mimo model conflicts with main by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15483
Enhance protection rules of code owners by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15406
Vertex generate pathway in server by @yashikagandhi-google in https://github.com/sgl-project/sglang/pull/15348
EP Support for Piecewise Cuda Graph by @Oasis-Git in https://github.com/sgl-project/sglang/pull/14164
fixed trtllm nvfp4 backend for moe by @khushgx in https://github.com/sgl-project/sglang/pull/15022
[model-gateway] fix graceful shutdown for TLS/Non-TLS server by @slin1237 in https://github.com/sgl-project/sglang/pull/15491
[model-gateway] refactor: extract common graceful shutdown code before TLS branch by @slin1237 in https://github.com/sgl-project/sglang/pull/15494
[model-gateway] Improve logging in data_connector module by @slin1237 in https://github.com/sgl-project/sglang/pull/15495
[model-gateway] Improve logging in policies module by @slin1237 in https://github.com/sgl-project/sglang/pull/15496
[AMD] Add TP=8 models to nightly test and make TP=2 test stable by @michael-amd in https://github.com/sgl-project/sglang/pull/15296
[DSv32] Move deep_gemm.get_paged_mqa_logits_metadata to init time as metadata by @qianlihuang in https://github.com/sgl-project/sglang/pull/15040
[model-gateway] Improve logging across core modules by @slin1237 in https://github.com/sgl-project/sglang/pull/15497
[model-gateway] Optimize workflow engine with pre-computed dependency graph by @slin1237 in https://github.com/sgl-project/sglang/pull/15503
[model-gateway] Run workflow event subscribers concurrently by @slin1237 in https://github.com/sgl-project/sglang/pull/15504
[model-gateway] simplify workflow engine backoff and reduce duplicate reads by @slin1237 in https://github.com/sgl-project/sglang/pull/15505
[router] bugfix: cache_aware in grpc imbalance forward by @llfl in https://github.com/sgl-project/sglang/pull/15473
Clean hidden_states_before_norm by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15485
[ci] remove rust benchmark in unit test ci by @slin1237 in https://github.com/sgl-project/sglang/pull/15510
[model-gateway] Implement RAII load guard with response body attachment by @slin1237 in https://github.com/sgl-project/sglang/pull/15507
[diffusion] refactor: deprecate WorkloadType by @mickqian in https://github.com/sgl-project/sglang/pull/15267
[GLM-4.7] GLM-4.7 Tool Parser and Doc Update by @zRzRzRzRzRzRzR in https://github.com/sgl-project/sglang/pull/15333
tiny fix sampling seed for completion api by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15498
[AMD] remove the redundant projection by @yctseng0211 in https://github.com/sgl-project/sglang/pull/15178
[AMD] Support fast_topk kernels in sgl-kernel by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/15172
[NPU] [BUGFIX] Fix NPU inference (torch_npu._npu_reshape_and_cache() crash) by @OrangeRedeng in https://github.com/sgl-project/sglang/pull/15484
Optimize MiMo-V2-Flash by flashinfer fused allreduce by @yuan-luo in https://github.com/sgl-project/sglang/pull/15464
vlm: Refactor engine vlm params and support processor output as input by @minleminzui in https://github.com/sgl-project/sglang/pull/14091
[VLM] Support ViT Piecewise CUDA Graph for Qwen3-VL by @yuan-luo in https://github.com/sgl-project/sglang/pull/15320
fix MiMo-V2-Flash typo by @acelyc111 in https://github.com/sgl-project/sglang/pull/15536
[Diffusion] Wan video model support zero-cost weight offload and overlap with compute by @BBuf in https://github.com/sgl-project/sglang/pull/15511
[diffusion] chore: allow all attention backends if not specified by @mickqian in https://github.com/sgl-project/sglang/pull/15530
[diffusion] log: fix wrong use of suppress_other_loggers by @mickqian in https://github.com/sgl-project/sglang/pull/15534
[Diffusion] Profiler doc add --perf-dump-path Desc by @BBuf in https://github.com/sgl-project/sglang/pull/15533
[diffusion] refactor: support scheduling logic for reqs inside scheduler by @mickqian in https://github.com/sgl-project/sglang/pull/15479
feat: Add limit-mm-data-per-request argument to server arguments by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/15418
Fix docker gateway image name and add latest tag by @slin1237 in https://github.com/sgl-project/sglang/pull/15542
[model-gateway] add model gateway multi-arch docker build, test and document docker image by @slin1237 in https://github.com/sgl-project/sglang/pull/15544
[model-gateway] Optimize WASM Runtime with Instance Pooling and Component Caching by @ppraneth in https://github.com/sgl-project/sglang/pull/15515
[model-gateway] bugfix: backward compatibility for GET endpoints by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15413
fix: update tool name handling and argument extraction in R1 chat tem… by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/15547
Optimize Bailing-MoE with FlashInfer Fused All-Reduce by @yuan-luo in https://github.com/sgl-project/sglang/pull/15526
[sgl-kernel] Streamline kernel size report (Top 20 only) and clean up by @BBuf in https://github.com/sgl-project/sglang/pull/15552
Apply new moe align block size kernel by @BBuf in https://github.com/sgl-project/sglang/pull/14134
[CI] Migrate CUDA Graph tests to test/registered/cuda_graph/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15436
feature: unified nightly metric layer by @dougyster in https://github.com/sgl-project/sglang/pull/15324
[CI] Fix /rerun-stage command by using requests for workflow dispatch by @alisonshao in https://github.com/sgl-project/sglang/pull/15447
[model-gateway]: Tool parser for glm47 by @UbeCc in https://github.com/sgl-project/sglang/pull/15520
[Diffusion] Simplify --perf-dump-path JSON output (remove duplicate denoise steps) by @BBuf in https://github.com/sgl-project/sglang/pull/15537
[diffusion] chore: minor improvements and typo-fixing by @mickqian in https://github.com/sgl-project/sglang/pull/15556
[diffusion] bench: improve bench_serving by adding more controlling args by @mickqian in https://github.com/sgl-project/sglang/pull/15554
[FusedMoE] Fix fused w13 tp sharded weight loading by @yinghai in https://github.com/sgl-project/sglang/pull/15432
[EAGLE] Fix slow Triton compilation in EAGLE KV cache copy by chunking large num_locs_upper by @YAMY1234 in https://github.com/sgl-project/sglang/pull/15111
Support piecewise cuda graph for dsv3 fp4 by @ispobock in https://github.com/sgl-project/sglang/pull/15531
[Feature] Enable return routed experts by @ocss884 in https://github.com/sgl-project/sglang/pull/12162
[CI] Fix AMD CI to exclude multimodal_gen from main_package filter by @sunxxuns in https://github.com/sgl-project/sglang/pull/15558
[model-gateway] /parse/reasoning and parse/function_call for sgl-model-gateway by @UbeCc in https://github.com/sgl-project/sglang/pull/15568
[model-gateway] Use UUIDs for router-managed worker resources by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15540
[1 / N] Clean up logprob utils by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15509
Revert "[FusedMoE] Fix fused w13 tp sharded weight loading" by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15579
[model-gateway] minor code clean up by @slin1237 in https://github.com/sgl-project/sglang/pull/15578
chore: bump sgl-kernel version to 0.3.20 by @sglang-bot in https://github.com/sgl-project/sglang/pull/15564
fix ds3.2 nsa backend prefill TBO by @Chen-0210 in https://github.com/sgl-project/sglang/pull/14901
Add triton_fused_moe config for GLM-4.6-FP8 tp8 blackwell by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15569
[model-gateway] add WorkerService abstraction for worker business logic by @slin1237 in https://github.com/sgl-project/sglang/pull/15580
[model-gateway] refactor WorkerManager with fan_out helper and thin handlers by @slin1237 in https://github.com/sgl-project/sglang/pull/15583
Split dpsk fp4 4 gpu tests and move the mtp part to real stage b by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15553
Fix type mismatch in LoRA batch validation causing assertion failures by @ConnorLi96 in https://github.com/sgl-project/sglang/pull/15427
[feature] support hicache-3fs usrbio lib build for ubuntu24.04 by @leihuang-sketch in https://github.com/sgl-project/sglang/pull/15230
[model-gateway] add retry and circuit breaker support to gRPC routers by @slin1237 in https://github.com/sgl-project/sglang/pull/15585
Optimize Rust CI builds with proper sccache configuration by @slin1237 in https://github.com/sgl-project/sglang/pull/15581
[Tiny]Move deepseek fp4 cutlass moe test to per-commit test by @Fridge003 in https://github.com/sgl-project/sglang/pull/15565
Tiny fix bench serving GSP mode cache file strategy by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15587
Support gsp send routing id in bench serving by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15588
[model-gateway] add retry support to OpenAI router chat endpoint by @slin1237 in https://github.com/sgl-project/sglang/pull/15589
Adapt fixture-kit to gsm8k mixin by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15599
Add glm-4.6-fp8 with/without mtp in nightly ci by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15566
add decode round robin policy by @Hexq0210 in https://github.com/sgl-project/sglang/pull/15164
Tiny avoid EnvField misuse by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15612
Support soft watchdog for tokenizer/detokenizer/dp-controller processes by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15607
Tiny add stuck simulation by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15613
Tiny enable soft watchdog in CI for stuck without logs by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15616
[diffusion] Remove Default post dit offload in local mode by @ryang-max in https://github.com/sgl-project/sglang/pull/15573
[VLM] Tiny: Unify VLM environment variables by @yuan-luo in https://github.com/sgl-project/sglang/pull/15572
[Diffusion] Support peak memory record in offline generate and serving by @BBuf in https://github.com/sgl-project/sglang/pull/15610
[model-gateway] return 503 when all workers are circuit-broken by @slin1237 in https://github.com/sgl-project/sglang/pull/15611
Fix router gRPC mode launch error caused by async loading by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15368
Tiny add back missing router per attempt response metric by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15621
Adjust wrong mtp meaning introduced by mimo by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15632
bugfix[schedule]: Refactor sort method and add related UT by @SeanWeiSean in https://github.com/sgl-project/sglang/pull/13576
chore: bump sgl-kernel version to 0.3.20 by @sglang-bot in https://github.com/sgl-project/sglang/pull/15590
Improve engine customization interface by @merrymercy in https://github.com/sgl-project/sglang/pull/15635
[GLM-ASR] GLM-ASR Support by @zRzRzRzRzRzRzR in https://github.com/sgl-project/sglang/pull/15570
MoE: Skip SiLU/GELU activation for masked experts by @yuchengz816-bot in https://github.com/sgl-project/sglang/pull/15539
[PD] Support fake decode for PD disaggregation without prefill node by @Baidu-AIAK in https://github.com/sgl-project/sglang/pull/14628
[CI] Migrate nightly tests to test/registered/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15582
[CI] Migrate Attention Backend tests to test/registered/attention/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15563
[CI] Enable retry logic for flaky CI tests by @alisonshao in https://github.com/sgl-project/sglang/pull/14983
[AMD] CI - Detect the aiter version and rebuild if needed by @yctseng0211 in https://github.com/sgl-project/sglang/pull/15460
[AMD] CI - Improve image discovery with remote registry fallback by @bingxche in https://github.com/sgl-project/sglang/pull/15463
fix: increasing H200 test timeout by @dougyster in https://github.com/sgl-project/sglang/pull/15600
Support PP for zmq_to_scheduler by @gty111 in https://github.com/sgl-project/sglang/pull/15312
[2/N] Update doc of Pipeline Parallelism with case study by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15684
Fix pipeline parallelism doc typos by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15688
[diffusion] Generalize layerwise offloader to flux1 by @ryang-max in https://github.com/sgl-project/sglang/pull/15633
[CI] fix UT assert error in test_tokenizer_manager.py by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15646
[Feature] support fastsafetensors by @stmatengss in https://github.com/sgl-project/sglang/pull/15091
[Minor] Enhance JIT kernel and add dev docs by @DarkSharpness in https://github.com/sgl-project/sglang/pull/14570
Super tiny add test_soft_watchdog to nightly by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15692
fix: potential crash for missing stream attribute by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15644
[model-gateway] Replace tokenizer with tokenizer registry for dynamic tokenizer loading in gRPC router by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/12968
Fix Illegal Memory Access when fa3 + spec + topk + page_size > 1 by @yubofredwang in https://github.com/sgl-project/sglang/pull/15469
Tiny add more information in retract logging. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15694
[model-gateway] Optimize router selection with lock-free snapshots by @ppraneth in https://github.com/sgl-project/sglang/pull/15672
[model-gateway]: add gRPC router embeddings endpoint implementation by @Ratish1 in https://github.com/sgl-project/sglang/pull/15273
Tiny apply gsm8k mixin to ngram test by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15606
Tiny fix CI by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15696
[model-gateway] Fix tokenizer caching and improve error handling by @slin1237 in https://github.com/sgl-project/sglang/pull/15695
Add kv_transfer_total_mb to metrics by @merrymercy in https://github.com/sgl-project/sglang/pull/15667
Update MiniMax-M2 ToolCall and add MiniMax-M2.1 in Docs by @rogeryoungh in https://github.com/sgl-project/sglang/pull/15538
[model-gateway] Add tokenize/detokenize HTTP endpoints and tokenizer management by @slin1237 in https://github.com/sgl-project/sglang/pull/15702
[bug fix] fix hicache jit kernel by @XucSh in https://github.com/sgl-project/sglang/pull/15177
Raise the accept length bar in dpsk-r1-fp4 spec decoding tests by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15705
Tiny add back fixes of incorrect metrics after worker removal by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15624
Tiny add back router worker health metric and fix init state by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15622
[AMD] Add AMD Nightly Performance & VLMs Accuracy Tests by @michael-amd in https://github.com/sgl-project/sglang/pull/15500
[Feature][MM] split the images of one request into multiparts by @XucSh in https://github.com/sgl-project/sglang/pull/11828
Tiny add flush in the suite partition status print. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15719
Tiny fix test eagle infer b. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15716
Move some quant args to its own section in environ variables doc by @vincentzed in https://github.com/sgl-project/sglang/pull/15722
[docs] major SGL Model Gateway documentation update by @slin1237 in https://github.com/sgl-project/sglang/pull/15715
[diffusion] http-server: fix openai endpoint image download strict content_type limit by @mickqian in https://github.com/sgl-project/sglang/pull/15717
[CI] Remove pcg-omni-ci by @Oasis-Git in https://github.com/sgl-project/sglang/pull/15656
[Feat] lora strength param by @Prozac614 in https://github.com/sgl-project/sglang/pull/15691
Simplify server args by @merrymercy in https://github.com/sgl-project/sglang/pull/15704
[2/N] clean duplicate code of logprob processing in spec. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15593
update benchmark README to use --fp8-gemm-backend instead of env var by @leejnau in https://github.com/sgl-project/sglang/pull/15689
[model-gateway]Enable IGW mode with gRPC router and auto enable IGW when service discovery is turned on by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/15459
Tiny env cleanup in deepgemm by @vincentzed in https://github.com/sgl-project/sglang/pull/15706
Fix smg_http_requests_total semantics by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15655
Tiny refactor request logger by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15740
Support JSON format request logging for easier parsing by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15743
Retry removing wrong logic about max total token in spec decoding by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15748
Tiny unify realtime_tokens_total metric by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15747
Add metrics for having prefill and decode in different ranks by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15752
Super tiny code cleanup by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15652
Tiny add num retracted tokens metric by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15653
Add request counter in addition to existing response counter by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15768
Tiny add flush for CI crash locating by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15769
[diffusion] refactor: unify the profiling api for all executors by @mickqian in https://github.com/sgl-project/sglang/pull/15718
[NPU]qwen3 pp bugfix by @Liwansi in https://github.com/sgl-project/sglang/pull/15390
[NPU] Bug fix in device detect by @hustmf in https://github.com/sgl-project/sglang/pull/14137
[model-gateway] Fix IGW routing and optimize RouterManager by @slin1237 in https://github.com/sgl-project/sglang/pull/15741
[bug] fix code formatting which blocks ci by @slin1237 in https://github.com/sgl-project/sglang/pull/15780
[model-gateway] Implement Zero-Copy Vision Tensor Access by @ppraneth in https://github.com/sgl-project/sglang/pull/15750
fix: nightly fix b200 gpqa by @dougyster in https://github.com/sgl-project/sglang/pull/15745
fix(monitoring): update Grafana dashboard metrics prefix from sglang: to sglang_ by @yurekami in https://github.com/sgl-project/sglang/pull/15758
[model-gateway] Fix logging module name, parse endpoint context, and tokenizer factory by @slin1237 in https://github.com/sgl-project/sglang/pull/15782
Move limit-mm-data-per-request to make code clean by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/15775
Add LoRA metrics for potential auto scaling by @ConnorLi96 in https://github.com/sgl-project/sglang/pull/15149
[model-gateway] release smg 0.3.0 by @slin1237 in https://github.com/sgl-project/sglang/pull/15781
[Auto Sync] Update server_args.py (20251223) by @merrymercy in https://github.com/sgl-project/sglang/pull/15700
Fix code sync scripts by @merrymercy in https://github.com/sgl-project/sglang/pull/15787
Add overlap scheduling for embeddings code path by @satyamk7054 in https://github.com/sgl-project/sglang/pull/14032
Tiny refactor select_workers API for future passing more information by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15596
Add manual routing policy for router by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15586
DP: support piggyback server load report by @changhuaixin in https://github.com/sgl-project/sglang/pull/11469
Clarify None handling in sglang's environ by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15770
[Diffusion] Refactor attention backend checking to use backend enum by @yeahdongcn in https://github.com/sgl-project/sglang/pull/15555
[Fix] Remove unused LoRA application logic from RowParallelLinearWithLoRA class in linear.py by @Prozac614 in https://github.com/sgl-project/sglang/pull/15801
[CI] Add tests to validate the size, extension, and format of output images/videos. by @Prozac614 in https://github.com/sgl-project/sglang/pull/15736
[VLM] Support apply qk norm in multi cuda streams by @yuan-luo in https://github.com/sgl-project/sglang/pull/15720
Tiny fix missing record_router_upstream_response by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15811
ci: migrate MLA tests to test/registered/mla/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15798
fuse ssm state store into chunk_gated_delta_rule_fwd_h by @yizhang2077 in https://github.com/sgl-project/sglang/pull/15409
Adjust server args for Mimo-v2-flash model by @ispobock in https://github.com/sgl-project/sglang/pull/15803
[1/N][Sparse With Hicache]: Add Sparse Interface by @hzh0425 in https://github.com/sgl-project/sglang/pull/14741
[JIT sgl-kernel] Jit support per tensor quant by @BBuf in https://github.com/sgl-project/sglang/pull/15709
[Diffusion] Flux.1.dev support Tensor Parallel by @BBuf in https://github.com/sgl-project/sglang/pull/15666
Cleanup ModelRunner by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15802
[diffusion] log: avoid logging in hot path if unnecessary by @mickqian in https://github.com/sgl-project/sglang/pull/15818
[Nemotron 3 Nano] Add triton MoE configs by @roikoren755 in https://github.com/sgl-project/sglang/pull/15815
[MiMoV2Flash] fix: respect --swa-full-tokens-ratio arg by @acelyc111 in https://github.com/sgl-project/sglang/pull/15488
Tiny change bench-serving to use routing key header by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15827
feat: log request when e2e latency exceeds the specified value by @zhooooong in https://github.com/sgl-project/sglang/pull/15759
Custom All Reduce for Piecewise Cuda Graph by @Oasis-Git in https://github.com/sgl-project/sglang/pull/15356
Change GLM-ASR class name by @zRzRzRzRzRzRzR in https://github.com/sgl-project/sglang/pull/15772
[diffusion] improve: improve post-processing by moving compute-intensive tasks to GPU by @mickqian in https://github.com/sgl-project/sglang/pull/15822
Use X-SMG-Routing-Key header instead of json body and add tests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15826
Clean up the __init__ of TokenizerManager and DetokenizerManager by @merrymercy in https://github.com/sgl-project/sglang/pull/15796
Optimize FP8 MLA KV cache writes with Triton kernel by @harvenstar in https://github.com/sgl-project/sglang/pull/15522
[model-gateway] update ManualPolicy with header-based routing by @slin1237 in https://github.com/sgl-project/sglang/pull/15847
fix: improving format and design by @dougyster in https://github.com/sgl-project/sglang/pull/15791
ci: add continue-on-error for scheduled PR tests by @alisonshao in https://github.com/sgl-project/sglang/pull/15701
Fix chunk_kda_fwd missing argument by @ispobock in https://github.com/sgl-project/sglang/pull/15851
Separate swa and local attention chunk cache eviction by @ispobock in https://github.com/sgl-project/sglang/pull/15820
Super tiny move last_prefill_tokens to metrics mixin by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15857
Fix prefill num tokens metrics by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15858
Use allow auto truncate in the OpenAI API endpoint by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/15369
Introduce ModelRunnerKVCacheMixin to simplify the code. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15821
[NPU] update Mixed chunk op to FIA by @Hexq0210 in https://github.com/sgl-project/sglang/pull/15518
[VLM] Refactor load_mm_data to improve performance by @yuan-luo in https://github.com/sgl-project/sglang/pull/14644
Fix swa available memory check by @ispobock in https://github.com/sgl-project/sglang/pull/15867
[Bug] fix piggyback load report return None bug by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15870
diffusion: support Qwen-Image-Layered by @chhnb in https://github.com/sgl-project/sglang/pull/15817
[diffusion] ZImage support Tensor Parallel by @zhaziqwe in https://github.com/sgl-project/sglang/pull/15849
[BUGFIX] fix edge case for qwen3-next by @yizhang2077 in https://github.com/sgl-project/sglang/pull/14209
fix: warn once per env var key by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15846
Tiny log warn users when tracing is automatically disabled by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15889
[CI] Fix CI test case skip problem by @Prozac614 in https://github.com/sgl-project/sglang/pull/15874
[Fix] assert error in log_prefill_stats by @changhuaixin in https://github.com/sgl-project/sglang/pull/15881
[diffusion] refactor: centralize hardware platform detection and streamline environment variable management by @mickqian in https://github.com/sgl-project/sglang/pull/15842
[Diffusion] Improve qwen image edit performance to align with LightX2V by @BBuf in https://github.com/sgl-project/sglang/pull/15812
[diffusion] ci: support returning request id from endpoint by @mickqian in https://github.com/sgl-project/sglang/pull/15844
[feat] Init support for webui-I2I by @wplf in https://github.com/sgl-project/sglang/pull/15778
Revert "[feat] Init support for webui-I2I" by @merrymercy in https://github.com/sgl-project/sglang/pull/15906
[Tool Call][DSV32] Streamline function call parameters by @Muqi1029 in https://github.com/sgl-project/sglang/pull/14750
[model-gateway]: fix crash in embedding worker health check by @Ratish1 in https://github.com/sgl-project/sglang/pull/15910
Revert "[VLM] Refactor load_mm_data to improve performance" by @merrymercy in https://github.com/sgl-project/sglang/pull/15911
refactor: add type hints to scheduler mixins by @ch-wan in https://github.com/sgl-project/sglang/pull/15913
hotfix: add type hints to scheduler mixins by @ch-wan in https://github.com/sgl-project/sglang/pull/15916
Revert embedding integration tests from 5f3a47d by @slin1237 in https://github.com/sgl-project/sglang/pull/15914
[BugFix][VLM] Correct weight loading with tie_word_embeddings == False by @ZhengWG in https://github.com/sgl-project/sglang/pull/15398
fix: adding deepseek base tests to b200 by @dougyster in https://github.com/sgl-project/sglang/pull/15915
[model-gateway] add JWT/OIDC authentication for control plane APIs by @slin1237 in https://github.com/sgl-project/sglang/pull/15850
Add a test case for crash dump by @merrymercy in https://github.com/sgl-project/sglang/pull/15905
[diffusion] chore: remove stepvideo code by @yhyang201 in https://github.com/sgl-project/sglang/pull/15918
Tiny cleanup the models' name in test_utils by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15920
Add Mimo-v2-flash model to ci test by @ispobock in https://github.com/sgl-project/sglang/pull/15887
[model-gateway] Add consistent hashing for ManualPolicy routing by @slin1237 in https://github.com/sgl-project/sglang/pull/15907
[diffusion] refactor: unify model loading and offloading behavior by @mickqian in https://github.com/sgl-project/sglang/pull/15923
[NPU] Support w4a8 with activation clip by @jiaming1130 in https://github.com/sgl-project/sglang/pull/14736
[model-gateway] optimize radix tree memory and reduce allocations by @slin1237 in https://github.com/sgl-project/sglang/pull/15933
[model-gateway]: fix grpc embedding test by @Ratish1 in https://github.com/sgl-project/sglang/pull/15934
[model-gateway] Add PrefixHash load balancing policy for KV cache-aware routing by @slin1237 in https://github.com/sgl-project/sglang/pull/15935
Fix temp_prefill_info assertion error in PP disaggregation mode by @harvenstar in https://github.com/sgl-project/sglang/pull/15943
chore: bump mooncake version to 0.3.8 by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15886
[diffusion] logging: log avail gpu mem while loading and generating by @mickqian in https://github.com/sgl-project/sglang/pull/15936
[diffusion] chore: remove useless params by @yhyang201 in https://github.com/sgl-project/sglang/pull/15925
[model-gateway]: remove unnecessary comment by @Ratish1 in https://github.com/sgl-project/sglang/pull/15947
Clean up logging by @merrymercy in https://github.com/sgl-project/sglang/pull/15919
Support kv8 (FP8) with torch_native attention backend by @JackChuang in https://github.com/sgl-project/sglang/pull/12596
Tiny fix cannot launch nvfp4 checkpoint with bf16 kv cache by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15986
[diffusion] chore: clean ComposedPipelineBase by @mickqian in https://github.com/sgl-project/sglang/pull/15937
Add correctness validation for decode_attention test by @harvenstar in https://github.com/sgl-project/sglang/pull/15806
[Feature] JIT Fused QK norm + qk norm clean up by @DarkSharpness in https://github.com/sgl-project/sglang/pull/15835
Refactor fp8 nextn layer for DeepSeek nvfp4 checkpoint by @Fridge003 in https://github.com/sgl-project/sglang/pull/15353
SGLang Tracing: fix attribute errors (header extraction & bootstrap span closing) by @vladnosiv in https://github.com/sgl-project/sglang/pull/15693
Refactor: separate CI-specific weight validation into dedicated module by @alisonshao in https://github.com/sgl-project/sglang/pull/15216
[fix]deepgemm precompile when warmup by @TZHelloWorld in https://github.com/sgl-project/sglang/pull/15891
Tiny add smg_manual_policy_cache_entries metric by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15987
Tiny extract PeriodicTask in router by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15988
Unify spec v2's naming manner. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15990
Add EAGLE3 test with MMLU dataset. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15945
Update test parameters for deepep_large test by @ch-wan in https://github.com/sgl-project/sglang/pull/16001
Add micro benchmarks for manual policy by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15991
Tiny fix WASM test errors on machines with many cores by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15992
Apply fixture-kit mode to MMMUVLMMixin by @majiayu000 in https://github.com/sgl-project/sglang/pull/15615
Tiny cleanup duplicate code for multi-layer eagle worker. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/16004
[Doc]Update MTP moe backends for EP document by @Fridge003 in https://github.com/sgl-project/sglang/pull/16013
[diffusion] CI: relax threshold by supporting different profiles by @mickqian in https://github.com/sgl-project/sglang/pull/16002
Temporarily disable temp_prefill_info assertion to unblock CI by @fzyzcjy in https://github.com/sgl-project/sglang/pull/16008
[diffusion] chore: fix default offload setting for image generation model by @mickqian in https://github.com/sgl-project/sglang/pull/15928
Fix metrics by @merrymercy in https://github.com/sgl-project/sglang/pull/15998
[JIT kernel] Jit kernel tests support ci by @BBuf in https://github.com/sgl-project/sglang/pull/15939
[diffusion] fix: fix stages not logged when perf_dump_path is provided by @mickqian in https://github.com/sgl-project/sglang/pull/16016
[scheduler] fix: correcting extend_logprob_start_len calculation by @ch-wan in https://github.com/sgl-project/sglang/pull/15922
Add host tensor allocator for memory_pool_host and support Mooncake standalone storage by @YiXR in https://github.com/sgl-project/sglang/pull/14873
[diffusion] modify sgld webui for reference to content task and better visualization capabilities by @wplf in https://github.com/sgl-project/sglang/pull/16017
[Docs] Improve documentation index page by @merrymercy in https://github.com/sgl-project/sglang/pull/16028
feat PD: add eagle3 support for DeepSeek V3 in EP mode by @QiuMike in https://github.com/sgl-project/sglang/pull/14280
Tiny print launch command with shlex by @hnyls2002 in https://github.com/sgl-project/sglang/pull/16010
[model-gateway] Organize Rust CLI arguments into logical groups for better --help output by @slin1237 in https://github.com/sgl-project/sglang/pull/16036
[model-gateway] Organize CLI arguments into logical groups for better --help output by @slin1237 in https://github.com/sgl-project/sglang/pull/16035
[model-gateway][CI] Display benchmark results in GitHub Actions summary by @slin1237 in https://github.com/sgl-project/sglang/pull/16037
[model-gateway] perf: optimize observability logging for minimal CPU/memory overhead by @slin1237 in https://github.com/sgl-project/sglang/pull/16039
[model-gateway]: optimize metrics for minimal CPU and memory overhead by @slin1237 in https://github.com/sgl-project/sglang/pull/16041
[Diffusion] Disable packed QKV for FLUX & Z-Image by @BBuf in https://github.com/sgl-project/sglang/pull/16038
[ci] update genai bench to 0.0.3 for pd testing by @slin1237 in https://github.com/sgl-project/sglang/pull/16051
[model-gateway] update WorkerRegistryStats with connection mode and circuit breaker info by @slin1237 in https://github.com/sgl-project/sglang/pull/16046
Update model and feature support for Ascend NPU by @Hexq0210 in https://github.com/sgl-project/sglang/pull/16003
[docs] Fix non-clickable ToC links in model gateway documentation by @slin1237 in https://github.com/sgl-project/sglang/pull/16054
[HiCache] Fix deadlock when creating new group by @XucSh in https://github.com/sgl-project/sglang/pull/15805
[Diffusion] Refactor qwen_image's rope in a single helper func by @BBuf in https://github.com/sgl-project/sglang/pull/16047
Clamp logprob tokens with model vocab size by @cklxx in https://github.com/sgl-project/sglang/pull/14414
[Diffusion] Qwen image edit support qknorm optimization by @BBuf in https://github.com/sgl-project/sglang/pull/16062
[JIT kernel] Jit kernel add codeowners by @BBuf in https://github.com/sgl-project/sglang/pull/16085
[diffusion] chore: minor refactor by streamlining the VAE class hierarchy by @mickqian in https://github.com/sgl-project/sglang/pull/16069
[model-gateway] fix tokenizer to match transformers special token handling by @slin1237 in https://github.com/sgl-project/sglang/pull/16087
[diffusion] fix: fix serving with dit-layerwise-offload enabled by @mickqian in https://github.com/sgl-project/sglang/pull/16066
[model-gateway] Add classification model support infrastructure by @slin1237 in https://github.com/sgl-project/sglang/pull/16061
[model-gateway] Improve tree benchmark with realistic multi-tenant scenarios by @slin1237 in https://github.com/sgl-project/sglang/pull/14838
[Feature] support bench jsonl files with sharegpt format by @jiapingW in https://github.com/sgl-project/sglang/pull/15057
[model-gateway] Optimize radix tree timestamp updates for multi-tenant scaling by @slin1237 in https://github.com/sgl-project/sglang/pull/16093
[CI] fix test_mla_deepseek_v3.py by @alphabetc1 in https://github.com/sgl-project/sglang/pull/16096
[model-gateway] Add classify pipeline stages and protocol types by @slin1237 in https://github.com/sgl-project/sglang/pull/16094
[model-gateway] Optimize INSERT with leaf-only timestamp updates by @slin1237 in https://github.com/sgl-project/sglang/pull/16097
[model-gateway] Wire classify pipeline to gRPC router by @slin1237 in https://github.com/sgl-project/sglang/pull/16098
[model-gateway] Generate UUID-based request IDs for embedding/classify by @slin1237 in https://github.com/sgl-project/sglang/pull/16100
[model-gateway] Fix duplicate classify prefix in response ID by @slin1237 in https://github.com/sgl-project/sglang/pull/16101
Enable testing slash command handler changes on non-fork PRs by @alisonshao in https://github.com/sgl-project/sglang/pull/15921
[model-gateway]: optimize prefix_match with zero-copy tenant and deferred char count by @slin1237 in https://github.com/sgl-project/sglang/pull/16099
Fix extend_input_len calculation in decode.py by @ch-wan in https://github.com/sgl-project/sglang/pull/16103
Add a new branch cut GH workflow, and adopt setuptools-scm for version control by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/15985
ci: migrate remaining spec/eagle tests to test/registered/spec/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15800
Clean up swa handling in fa3 backend by @ispobock in https://github.com/sgl-project/sglang/pull/15877
[diffusion] improve: tiny speedup qwen-image-edit-2511 by avoiding unnecessary calculation by @mickqian in https://github.com/sgl-project/sglang/pull/15896
[diffusion] improve: tiny improve layerwise offload manager by consolidating weights per layer by @mickqian in https://github.com/sgl-project/sglang/pull/16081
[CI] Fix LoRA downloading issues and respect offline flag by @Prozac614 in https://github.com/sgl-project/sglang/pull/15813
Reduce CI failure monitor to run once every 12 hours by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/16123
[LoRA] Torch native backend: rework implementation and updated tests by @vlserov in https://github.com/sgl-project/sglang/pull/15187
Refactor: Moving extend_logprob_start_len calculation out of prepare_for_extend by @ch-wan in https://github.com/sgl-project/sglang/pull/16105
Enhance comments in set_extend_input_len method by @ch-wan in https://github.com/sgl-project/sglang/pull/16130
Fix Qwen Next GDN w/ Radix Cache by @hebiao064 in https://github.com/sgl-project/sglang/pull/16053
Add PR review process into template. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/16133
Add attack204 into CI_PERMISSION by @hnyls2002 in https://github.com/sgl-project/sglang/pull/16131
[AMD CI] Organize AMD nightly perf test files by @bingxche in https://github.com/sgl-project/sglang/pull/16114
[diffusion] model: support TurboWan2.1-T2V-1.3B/14B SLA by @IPostYellow in https://github.com/sgl-project/sglang/pull/15888
Reworked fast_pos_embed_interpolate() using torch by @terfendail in https://github.com/sgl-project/sglang/pull/10959
Fix wrong assigning extend_input_len_per_req with eagle. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/16129
Tiny rename test_deepseek_v3_fp4_mtp_stage_b.py by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/16141
Fix race condition in /tag-and-rerun-ci command by @alisonshao in https://github.com/sgl-project/sglang/pull/16142
[PP] Add a minimum chunk value for PP dynamic chunking by @ShangmingCai in https://github.com/sgl-project/sglang/pull/16140
[VLM] Support Video for InternVL3_5 by @yuan-luo in https://github.com/sgl-project/sglang/pull/15942
[CI] Fixing release with cut branch workflow by @Fridge003 in https://github.com/sgl-project/sglang/pull/16153
[CI] set max-parallel to 4 by default. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/16154
feat(SpecEagleV2): add standalone_worker_v2 by @attack204 in https://github.com/sgl-project/sglang/pull/12625
Fix ZMQ binding and model loading for FastWan compatibility by @yh0903 in https://github.com/sgl-project/sglang/pull/13978
Add profiling capture support to the encoder server by @Jumiar in https://github.com/sgl-project/sglang/pull/15730
[docs][NPU]Update model and feature docs support for Ascend NPU by @husf1130 in https://github.com/sgl-project/sglang/pull/16124
[Fix] Distinguish between video generation and image generation in the bench serving of the diffusion model. by @jiapingW in https://github.com/sgl-project/sglang/pull

sgl-project/sglang v0.5.8 on GitHub
