Highlights
- New Model Support:
- Day 0 Support for Mimo-V2-Flash: #15207, https://lmsys.org/blog/2025-12-16-mimo-v2-flash/
- Day 0 Support for Nemotron-Nano-v3: https://lmsys.org/blog/2025-12-15-run-nvidia-nemotron-3-nano/
- Day 0 Support for LLaDA 2.0: https://lmsys.org/blog/2025-12-19-diffusion-llm/
- [SGLang-Diffusion] Day 0 Support for Qwen-Image-Edit-2509, Qwen-Image-Edit-2511, Qwen-Image-2512 and Qwen-Image-Layered
- Model Gateway v0.3.0 Release: https://docs.sglang.io/advanced_features/sgl_model_gateway.html
- Scalable pipeline parallelism with dynamic chunking support for ultra-long contexts (PP Refactor Roadmap #11857)
- Encoder Disaggregation for Multi-modal models (Roadmap #15118)
- SGLang-Diffusion:
- Set `--dit-layerwise-offload true` to reduce peak VRAM usage by up to 30GB and improve performance by up to 58% for all models
- Significantly reduce the latency of `Qwen-Image-Edit`, making it one of the fastest among all open-source solutions. More improvements are on the way
- Add support for AMD/4090/5090, along with additional attention choices (sage-attn, sage-attn3), more parallelism options (TP) and enhancements to HTTP API (Google vertex supported)
- Cache-dit integration to improve performance by up to 165%
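As a rough intuition for why the `--dit-layerwise-offload` highlight can lower peak VRAM, the toy model below compares keeping every layer's weights resident with keeping only a small window of layers resident. This is a sketch under stated assumptions, not SGLang's implementation: the function name `peak_weight_memory_gb`, the `prefetch_layers` parameter, and the 40 x 1 GB layer sizes are all hypothetical, and activations, transfer cost, and overlap are ignored.

```python
from typing import Sequence


def peak_weight_memory_gb(
    layer_sizes_gb: Sequence[float],
    layerwise_offload: bool,
    prefetch_layers: int = 1,
) -> float:
    """Peak accelerator memory held by layer weights in a toy residency model.

    Without offload, every layer stays resident for the whole forward pass.
    With offload, only the executing layer plus `prefetch_layers` upcoming
    layers are resident; the rest are assumed to wait in host memory.
    """
    if prefetch_layers < 0:
        raise ValueError("prefetch_layers must be non-negative")
    if not layer_sizes_gb:
        return 0.0
    if not layerwise_offload:
        return float(sum(layer_sizes_gb))
    # Sliding window: the current layer and the layers being prefetched.
    window = prefetch_layers + 1
    return float(max(
        sum(layer_sizes_gb[i:i + window])
        for i in range(len(layer_sizes_gb))
    ))


# Hypothetical model: 40 transformer blocks of 1 GB each (illustrative only).
layers = [1.0] * 40
print(peak_weight_memory_gb(layers, layerwise_offload=False))  # 40.0
print(peak_weight_memory_gb(layers, layerwise_offload=True))   # 2.0
```

The numbers here are illustrative and do not reproduce the "up to 30GB" figure quoted above; actual savings depend on the model and on how weight transfers are scheduled.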
What's Changed
- Refactor custom allreduce logics by @iforgetmyname in #13710
- [Doc] Update DeepSeek-V3.2 document by @Fridge003 in #14321
- Feature/support distilled vae generic by @baonudesifeizhai in #14195
- [Performance] Optimize NSA Indexer K/S Buffer Access with Fused Triton Kernels by @Johnsonms in #13812
- Update CODEOWNERS for multimodal by @mickqian in #14329
- [bug fix] use npu phy id in container env by @jinke446 in #14266
- [model-gateway] multimodality initialization by @slin1237 in #13350
- [Doc] Fix DeepSeek V32 Doc by @Fridge003 in #14336
- sync attention, deepseek doc by @b8zhong in #14335
- [PD] Support decode pp for PD disaggregation by @ShangmingCai in #14265
- [model-gateway] add image processor and transformer structure by @slin1237 in #14344
- [CPU] Support chunk_gated_delta_rule kernel for Qwen3-Next by @Valentine233 in #12441
- [bugfix] Fix prefill tbo disabled when --deepep-mode=auto by @yuhyao in #14333
- [CI] update estimated elapsed time of some unittests by @ch-wan in #14347
- [NPU] bug fix: w_vc need contiguous for NPU batch_matmul_transpose ops by @ZhengdQin in #13980
- [bugfix] NpuFuseEPMoE miss initialization parameters by @chenxu140 in #14295
- [Ascend] fix AscendAttnMaskBuilder bug to support float16 models by @MichelleWu351 in #14271
- Tiny adjust CI testcases by @hnyls2002 in #14362
- [NPU][Doc] updated installation guide for Ascend NPU by @VDV1985 in #13585
- Feature/add vae path to cli doc#14004 by @baonudesifeizhai in #14355
- [CPU] add fused_qkvzba_split_reshape_cat kernel for Qwen3-next by @blzheng in #12330
- Single Batch Overlap for MoE Models by @Sulfur6 in #9660
- Move custom_ops under layers; move _custom_ops.py → custom_all_reduce_ops.py by @merrymercy in #14326
- [model-gateway] add llava model image processor and tests by @slin1237 in #14371
- ci: Migrate AMD workflows to new MI325 runners; temporarily disabled failed CI's to be added back by @sunxxuns in #14226
- [Tiny]Small fixes in deepseek v32 doc by @Fridge003 in #14372
- Fix validation to detect missing model files before loading by @alisonshao in #14253
- [model-gateway] add qwen2_vl model image processor and tests by @slin1237 in #14374
- [model-gateway] add qwen2.5_vl model image processor by @slin1237 in #14375
- Revert "Revert "enable csgmv automatically on cuda"" by @b8zhong in #14277
- [model-gateway] use worker crate in openai router by @slin1237 in #14330
- [model-gateway] add qwen3_vl model image processor by @slin1237 in #14377
- Fix sgl-router silently parse selector wrongly causing OME fail to discover pods by @fzyzcjy in #14359
- [sgl-kernel][Feat][B200][1/N]Support MXFP8 Grouped GEMM in Blackwell by @HydraQYH in #13731
- [CPU] document updates by @ZailiWang in #14272
- Support PP x PD decode with nixl backend by @bluecoffee8 in #14392
- [VLM] Introduce Cache for positional embedding ids for Qwen-VL family by @yuan-luo in #14292
- use faster conversion from float8_e4m3fn to bfloat16 by @mingfeima in #12316
- [model-gateway][doc] Add STDIO Explicitly to Example in README by @xuwenyihust in #14393
- [CPU] add support for mamba causal conv1d for qwen3-next by @mingfeima in #12309
- [model-gateway] add phi3 vision image processor by @slin1237 in #14381
- [model-gateway] introduce provider in openai router by @slin1237 in #14394
- [AMD] fix the regression issue for DeepseekV3 on MI300 by @yctseng0211 in #14383
- [NPU][1/N] NPU basic functions refactor and new modelslim quant type by @iforgetmyname in #13359
- [CPU] Optimize small oc GEMM for Qwen3-next on CPU by @jianan-gu in #12446
- Try to fix B200 DeepEP error by @fzyzcjy in #14399
- [1/2] Add rope kernel in sgl-kernel by @Qiaolin-Yu in #14334
- [bug fix] fix ima with get_mla_kv_buffer_kernel overflow by @XucSh in #14224
- Add Mistral Large 3 support. by @dcampora in #14213
- [diffusion] fix gen video doc by @yeahdongcn in #14409
- Add 'NPU' to the runtime exception message in `get_device` by @rauletorresc in #14225
- Add mooncake `transfer_engine_bench` into manual test by @hnyls2002 in #14429
- [model-gateway] add phi4 vision image processor by @slin1237 in #14430
- diffusion: Add Configurable Generator Device and Seed Support via API by @niehen6174 in #14366
- [model-gateway] introduce request ctx for oai router by @slin1237 in #14434
- [NPU]add nightly-test-npu by @cherryblo in #14143
- [model-gateway] add llama4 vision image processor by @slin1237 in #14438
- [model-gateway] extract conversation out of oai router by @slin1237 in #14440
- [DeepseekV3.2][NSA][Indexer] Fix PAGED top-k transform for NSA indexer chunked execution on H200 by @YAMY1234 in #14325
- [model-gateway] move oai header util to router header util by @slin1237 in #14441
- [FIX] trtllm-moe-fp4-renorm for Qwen series models by @samuellees in #14350
- add doc for quantized kv cache by @b8zhong in #14348
- fix: Correct environment variable syntax in docker-compose configuration by @yankay in #8287
- [model-gateway] move all responses api event from oai to proto by @slin1237 in #14446
- [model-gateway] add mistral 3 image processor by @slin1237 in #14445
- [model-gateway] grpc to leverage event type by @slin1237 in #14450
- ministral3 by @JustinTong0323 in #14251
- [Bug] fix not desired disable fused share experts caused by rocm logic by @ocss884 in #14432
- Rename secrets.WHL_TOKEN -> secrets.GH_PAT_FOR_WHL_RELEASE by @sglang-bot in #14421
- further optimize model load by @zyksir in #13836
- Add CI permissions for user 'yushengsu-thu' by @alisonshao in #14468
- [ez] Fix typing by @yinghai in #14473
- Add AMD stage support to /rerun-stage command and fix related bugs by @alisonshao in #14463
- Add YAMY1234 to CI Permission by @Fridge003 in #14475
- clean up gemlite usage by @zminglei in #14444
- [diffusion] chore: further improve model searching logic by @mickqian in #14484
- fix bug about pin memory by @zyksir in #14472
- [diffusion] cli: add argument --adjust-frames, --override-protected-fields by @gmixiaojin in #13996
- dockerfile: add lightweight runtime stage and refactors by @ishandhanani in #13861
- diffusion: Fix CLIP text encoder attention mask and causal masking bug by @niehen6174 in #14364
- Enable RadixCache for Mamba2 models by @roikoren755 in #13584
- [Diffusion] Fix profiler trace missing Python stack in diffusion pipeline by @BBuf in #14499
- support GLM-V vision model dp by @zRzRzRzRzRzRzR in #14097
- [misc] add model arch and type to server info and use it for harmony by @slin1237 in #14456
- Add Mistral Large 3 Eagle Support by @elvischenv in #14466
- Add Mistral Large 3 to nightly CI tests by @alisonshao in #14459
- [diffusion] chore: set allowing overriding protected fields of sampling params as default behavior by @mickqian in #14471
- [model-gateway] move conversation to first class routing by @slin1237 in #14506
- [Spec] Mamba2 support in target models by @roikoren755 in #13434
- [diffusion] Support cache-dit by @Brain97 in #14234
- Add fused FP8 KV cache write kernel for TRTLLM MHA backend by @harvenstar in #14093
- [model-gateway] Add WASM support for middleware by @tonyluj in #12471
- [model-gateway] reorganized conversation handler by @slin1237 in #14507
- tiny remove deprecated endpoint call by @b8zhong in #13607
- [model-gateway] fix server info comment by @slin1237 in #14508
- Add Mistral Large 3 basic test to PR CI by @alisonshao in #14460
- Fix removing worker will make it healthy forever in prometheus metrics by @fzyzcjy in #14420
- [model-gateway] Make Tokenizer Builder Aware of Env Vars Like HF_ENDPOINT by @xuwenyihust in #14405
- [model-gateway] change sgl-router to sgl-model-gateway by @slin1237 in #14312
- [model-gateway] fix left over sgl-router names to sgl-model-gateway by @slin1237 in #14512
- [model-gateway] fix logs in smg workflow by @slin1237 in #14513
- [model-gateway] fix left over sgl-router names in wasm by @slin1237 in #14514
- [model-gateway] fix code owner for wasm by @slin1237 in #14516
- chore: bump sgl-kernel version to 0.3.18.post3 by @sglang-bot in #14427
- Tiny use trtllm_mha as default when possible by @fzyzcjy in #14291
- [Docs] Add /rerun-stage command to contribution guide by @alisonshao in #14521
- Fix safetensors validation to catch corruption after download by @alisonshao in #14465
- [CODEOWNER] update codeowner for qwen3-next related by @hanming-lu in #14522
- fix rmsnorm -> layernorm in qwen3 omni by @vincentzed in #11791
- [diffusion] chore: temporarily upgrade diffusers to make Z-image compatible with Cache-DiT by @mickqian in #14530
- [bug] fix notebook to include new keys from model_info by @slin1237 in #14528
- Revise DP Multi-Modal Encoder Document by @yhyang201 in #14290
- [CPU] add mamba fla kernels for Qwen3-next by @blzheng in #12324
- Revert "tiny remove deprecated endpoint call" by @Fridge003 in #14533
- support mtp with deepseek r1 nvfp4 model by @rainj-me in #13115
- [diffusion] refactor: simplify SamplingParams override logic by @mickqian in #14539
- [Diffusion] Add QKV fusion optimization for Flux models by @BBuf in #14505
- [model-gateway][tracing]: implement request tracing using OpenTelemetry with trace context propagation (HTTP) by @sufeng-buaa in #13897
- diffusion: fix LoRA dtype handling and weight attribute access for z-image model by @niehen6174 in #14543
- fix "GrammarMatcher has terminated after accepting the stop token, but is trying to find the next token mask" when both reasoning and spec are enabled by @gongwei-130 in #14464
- [1/n] Fix hanging during DeepGemm Warmup by @Fridge003 in #14493
- [Bug fix] Add /model_info endpoint to mini_lb by @alisonshao in #14535
- [Qwen3-next] remove heuristics and add radix cache kl test by @hanming-lu in #14520
- [Misc]Register and refactor some environs for dpsk-fp4 and DeepEp by @Fridge003 in #14538
- chore: bump sgl-kernel version to 0.3.18.post3 by @sglang-bot in #14518
- Update CI_PERMISSIONS.json by @harrisonlimh in #14552
- Update DeepSeek V3 docs to use B200 by @leejnau in #14447
- [Doc] Add short explanation on page size by @b8zhong in #14557
- [docs] Add missing word in argument description by @almaslof in #14205
- support piecewise cuda graph for Olmo models by @zminglei in #14476
- Enhance prefill PP node robustness by @qhsc in #14494
- DOC update nemo-skills in docs by @gwarmstrong in #14555
- remove unnecessary dual stream token threshold from the rest of models (qwen moe, kimi linear, etc.) by @b8zhong in #14337
- feat(ci): add framework target to release-docker workflows by @ishandhanani in #14559
- Fix attention backend logic for Qwen3-Next on SM100 by @Chen-0210 in #14560
- [FLA] Add explicit kernel arguments to kda.py for Kimi Linear support by @alisonshao in #14561
- Add CUDA kernel size analysis tool for sgl-kernel optimization by @BBuf in #14544
- [DLLM] Add threshold based parallel decoding support by @btw616 in #14412
- Add unit-test-backend-8-gpu-b200 to rerun-stage command by @alisonshao in #14569
- [apply][2/2] Fused qk_norm_rope for Qwen3-MoE by @yuan-luo in #13998
- Add Expert Parallelism (EP) support for kimi-k2-thinking by @BBuf in #13725
- Tiny remove wrong import from `python.sglang` by @hnyls2002 in #14577
- Add small model test for spec v2 + dp + trtllm_mla by @hnyls2002 in #14576
- [diffusion] cli: profiling utilities support by @AichenF in #14185
- [NPU]LoRA: Adding Torch Native backend by @vlserov in #14132
- [BugFix] fix prefixcache performance and accuracy on ascend by @khalil2ji3mp6 in #13573
- Fix FP8 KV Triton type issue and add regression test by @harvenstar in #14553
- Rename TensorRT Model Optimizer to Model Optimizer by @Edwardf0t1 in #14455
- [CI] Tiny speed up VLM CI by @b8zhong in #14517
- [Minor] Temporarily skipping deepep large mtp test by @Fridge003 in #14586
- [model-gateway] extra accumulator and tool handler in oai router by @slin1237 in #14587
- [model-gateway] Fixed WASM Security Vulnerability - Execution Timeout by @slin1237 in #14588
- [model-gateway] reorganize metrics, logging, and otel to its own module by @slin1237 in #14590
- Refactor tuning block wise kernel and opt Qwen/Qwen3-VL-32B-Instruct-FP8 by @BBuf in #14141
- [CI]Unblock and split spec v2+dp test by @Fridge003 in #14551
- [Tool Call] Fix DeepSeekV32Detector skipping functions with no params in streaming mode by @momaek in #14573
- [feat] use cachebuffer to store mm feature to speedup hash by @liusy58 in #14386
- [CI] Fix unit-test-backend-8-gpu-b200 running on every /rerun-stage by @alisonshao in #14591
- [model-gateway] fix WASM memory limit per module by @slin1237 in #14600
- Tiny fix missing policy decision recording by @fzyzcjy in #14605
- Super tiny remove unneeded policy flag by @fzyzcjy in #14608
- [model-gateway] refactor otel to be more efficient by @slin1237 in #14604
- Super tiny remove unused select_worker_pair by @fzyzcjy in #14609
- [model-gateway] fix WASM unbounded request/response body read vuln by @slin1237 in #14612
- [2/2] Add rope kernel in sgl-kernel by @Qiaolin-Yu in #14452
- [DLLM] Add initial cuda graph support by @btw616 in #14203
- Super tiny fix unused code in router by @fzyzcjy in #14618
- [Glm46v] Bug fix for accuracy drop and unable to launch server by @byjiang1996 in #14585
- Fix amd rope definition by @Qiaolin-Yu in #14556
- modify the sgl-kernel to be compatible with transformers 5.x. by @yhyang201 in #14625
- [Reasoning + Structured Output] make reasoning compatible with structured output by @Muqi1029 in #12551
- [Feat] add support for LoRA layers in transformer_2 within LoRAPipeline by @Prozac614 in #14606
- chore: bump sgl-kernel version to 0.3.19 by @sglang-bot in #14632
- [cpu] Implement all gather/reduce for arm64 cpu by @cyb70289 in #12527
- [diffusion] chore: further refine output resolution adjustment logic by @mickqian in #14558
- Fix dp-aware incompatible with service-discovery by @fzyzcjy in #14629
- update transformers package version to 5.0.0rc0 by @yhyang201 in #14356
- chore: bump sgl-kernel version to 0.3.19 by @sglang-bot in #14649
- chore: bump SGLang version to 0.5.6.post1 by @sglang-bot in #14651
- [AMD] change fused rms quant interface for aiter upgrade by @yctseng0211 in #14497
- [model-gateway] reducing cpu overhead in various of places by @slin1237 in #14658
- [model-gateway] reduce cpu overhead in grpc router by @slin1237 in #14663
- [model-gateway] fix WASM arbitrary file read security vuln by @slin1237 in #14664
- vlm: Use fa3 as the default backend for qwen3 vl by @mickqian in #14634
- [model-gateway] Optimize memory usage in HTTP router by @slin1237 in #14667
- fix: use .get() when accessing strict mem-check env variable by @yhyang201 in #14657
- improve default glm mtp setting by @b8zhong in #14457
- Fix cache-aware router should pick min load instead of min tenant size by @fzyzcjy in #14650
- Bump up diffusers to latest official release version by @byjiang1996 in #14670
- [model-gateway] add OTEL integration to grpc router by @slin1237 in #14671
- [CI] Increase max-parallel to 15 for high priority PRs by @alisonshao in #14675
- [HiCache] fix condition check when use decode offload by @ssssnow in #14489
- [RadixTree] Optimize the Time Complexity of Node Retrieval Operation from O(n*m) to O(n) by @CLFutureX in #13334
- Tiny support printing requests in bench_serving for observability by @fzyzcjy in #14652
- Aiter fp8 kv cache by @kkHuang-amd in #13147
- [SMG]feat: implement TokenGuardBody for managing token return by @jimmy-evo in #14653
- [NPU] chore: bump basic software version to 8.3.rc2 by @iforgetmyname in #14614
- [CI] Unblock gb200 cutedsl test by @Fridge003 in #14469
- Add ffmpeg into sglang docker - required by transformers multimodal V… by @byjiang1996 in #14679
- [Bugfix] Fix KeyError for Mistral-Large-3 rope_scaling config by @alisonshao in #14627
- Tiny support sgl-router http response status code metrics by @fzyzcjy in #14689
- [CI] Migrate Eagle 1-GPU tests to test/registered/ by @alisonshao in #14529
- Revert "[Bug] fix not desired disable fused share experts caused by r… by @zhyncs in #14676
- Add per-request decode tp size by @merrymercy in #14678
- [ci][smg] fix docker release ci and add it to pr test by @slin1237 in #14683
- Tiny extract select_worker_min_load by @fzyzcjy in #14648
- Fix dp-aware incompatible with completions and chat completions APIs by @fzyzcjy in #14647
- [CI] Fix Llama 3.1 8B FP4 CI by @b8zhong in #14699
- fix: make override DeepseekV2Model work by @zhyncs in #14707
- chore: add code owners for deepseek_v2.py by @zhyncs in #14714
- [CI] Move mistral large 3 basic to nightly by @alisonshao in #14622
- fix the deepep 8 gpu unit test by @rainj-me in #14601
- Add fuse_marlin_moe test to ci and add new ep test by @BBuf in #14686
- [Bugfix] Fix environ error in scheduler_runtime_checker_mixin.py by @llfl in #14461
- [Feat] Add received_time in serving_base by @zhanghaotong in #13432
- fix: prevent HuggingFace access when SGLANG_USE_MODELSCOPE is enabled by @yrk111222 in #12039
- [Test] Skip STANDALONE speculative decoding tests for different hidden sizes by @alisonshao in #14733
- [diffusion] support batch compare by @Brain97 in #14738
- Revert "[Feat] Add received_time in serving_base" by @merrymercy in #14743
- [Model] Add PaddleOCR-VL Model Support by @yudian0504 in #12953
- fix rope parameter initialization error caused by transformers v5.0 update by @yhyang201 in #14745
- [model-gateway] optimize core modules by @slin1237 in #14751
- [SMG] perf: optimize tokenizer for reduced CPU and memory overhead by @slin1237 in #14752
- Add FP8 Blockwise GEMM Backend Flag `--fp8-gemm-backend` by @b8zhong in #14379
- fix: checking if tokenizer is in cache before downloading from HF by @dougyster in #14698
- fix: making rate limit a warning instead of error by @dougyster in #14753
- move multi-item scoring functions in tokenizer manager into a separate file by @merrymercy in #14740
- Improve CI by trying a warmup before unit tests by @merrymercy in #14669
- [Perf] Optimize radix tree for cache-aware load balancing by @slin1237 in #14758
- [Feature] Add LoRA support for embedding layers by @yushengsu-thu in #14177
- [model-gateway] release gateway 0.2.4 by @slin1237 in #14763
- [ci]: Enable the new hf API by @MingxuZh in #14687
- Re-add the API serving timing metrics. by @hnyls2002 in #14744
- fix: adding rate limit warning at verify token permission stage by @dougyster in #14756
- Disable 8-gpu-b200 runner in PR tests by @alisonshao in #14768
- [fix] Fix issues for in-flight weight updates by @ShawnY112358 in #14064
- [Auto Sync] Update data_parallel_controller.py, detokenizer... (20251209) by @merrymercy in #14759
- fix: race condition between validation and download locks by @alisonshao in #14761
- Fix VLM accuracy thresholds for nightly tests by @alisonshao in #14777
- fix server args bug by @TomerBN-Nvidia in #14725
- handling incomplete rope_scaling config in CI after transformers upgrade by @yhyang201 in #14784
- fix b200 ci by @b8zhong in #14786
- [RL] support weight reload for low-bit rollout by @AniZpZ in #9650
- fix: add missing logic for SGLANG_USE_MODELSCOPE variable by @yrk111222 in #14794
- fix b200 fa4 ci by @b8zhong in #14788
- [diffusion] profile: early exit when enough steps are captured to red… by @mickqian in #14803
- [GLM-4.6V] Support Pipeline Parallelism for GLM-4.6V & GLM-4.1V by @yuan-luo in #14720
- [CI] Add LoRA support to diffusion server configuration and test cases by @Prozac614 in #14697
- Revert "fix: checking if tokenizer is in cache before downloading from HF" by @yhyang201 in #14808
- [Diffusion] Refactor diffusion fuse qkv and support qwen-image by @BBuf in #14793
- [Router-GO] implement a Go SGLang Router - OpenAI Compatible API Server by @whybeyoung in #14770
- [model-gateway] Dynamically Populate Tool Call Parser Choices by @xuwenyihust in #14807
- Support HTTP response status code prometheus metrics by @fzyzcjy in #14710
- Fix router keep nonzero metrics after worker is deleted by @fzyzcjy in #14819
- Tiny fix incorrect worker removal command by @fzyzcjy in #14822
- [NPU] bug fix for mtp and w4a8 by @liupeng374 in #14806
- [CI] fix UT success check in `test_eagle_infer_beta_dp_attention.py` by @hnyls2002 in #14831
- Fix CI registry scan to only check test/registered directory by @alisonshao in #14812
- [model-gateway] add anthropic message api spec by @slin1237 in #14834
- Fix tiny typo in multimodal_gen/README.md by @wplf in #14830
- Tiny support customizing Prometheus duration buckets by @fzyzcjy in #14716
- Tiny support engine response http status statistics in router by @fzyzcjy in #14712
- [CI] Reduce stage-b auto-partition from 4 to 2 by @alisonshao in #14769
- Apply back moe_sum_reduce for fused_marlin_moe by @ispobock in #14829
- [diffusion] parallel: pad tokens for video models under sp by @mickqian in #14833
- [diffusion] CI: use unified sampling_params for CI by @mickqian in #14045
- [Auto Sync] Update tool_chat_template_deepseekv31.jinja (20251210) by @zhyncs in #14837
- Revert transformers to 4.57.1 by @yhyang201 in #14801
- [model-gateway] Fix incompatible metric comparison in `PowerOfTwo` policy by @ppraneth in #14823
- [bugfix] qwen25-VL support lora by @SYChen123 in #14638
- fix lora target all + csgmv backend by @b8zhong in #14796
- [model-gateway] adds default implementations to RouterTrait in mod.rs by @slin1237 in #14841
- [AMD] Add model to AMD nightly test by @michael-amd in #14442
- Treat unittest SkipTest exception as pass instead of as failure by @byjiang1996 in #14847
- [model-gateway] code clean up on oai router by @slin1237 in #14850
- [model-gateway] fix import order in oai conversation by @slin1237 in #14851
- fix fp8 gemm nightly CI by @b8zhong in #14844
- fix: restrict cache validation behaviors to CI only by @alisonshao in #14849
- Fix CUDA version handling in ci_install_deepep.sh by @merrymercy in #14854
- Fix TestGLM41VPPAccuracy test flakiness by @byjiang1996 in #14848
- Minor code style fix for dllm by @hnyls2002 in #14836
- Enable TP for Mamba-based models by @roikoren755 in #14811
- [CI] Temp disable gb200 test by @Fridge003 in #14865
- Refactor Marlin MoeRunner by @trangdough in #14554
- [6/n] Fix `num_token_non_padded` computation in prefill by @yuchengz816-bot in #14313
- Remove myself to test CI gate issue by @Kangyan-Zhou in #14871
- fix: creating blobs only once for publish trace retries by @dougyster in #14845
- Move and update MindSpore docs, make it appear on the online documentation by @wangtiance in #14861
- fix nightly vlm ci : restore original eval for requests without regex by @yhyang201 in #14875
- Only count limitations for previous runs that reaches the test stages by @Kangyan-Zhou in #14856
- [CI][BUG] fix ib setup for disaggregation hicache test by @luketong777 in #14877
- [Fix] Remove unused import from test_disaggregation_hicache.py by @ShangmingCai in #14880
- fix: adding temporary bypass for nightly tests by @dougyster in #14876
- Avoid deleting entire cache for missing shards (#14754 follow-up) by @alisonshao in #14853
- Tiny add more error info for bench_serving by @fzyzcjy in #14827
- Tiny support range ratio in GSP in bench serving by @fzyzcjy in #14828
- [diffusion] feat: enable torch compile to eliminate GPU bubble by @AichenF in #13641
- [NPU]dsv3.2 cp for npu by @liupeng374 in #14541
- [diffusion] feat: support sageattn & sageattn3 backend by @mickqian in #14878
- [Ascend]Support of piecewise graph compilation for prefill on NPU by @Vladimir221 in #12287
- Introduce `server_fixtures` in `sglang.test` by @hnyls2002 in #14899
- [diffusion] UX: suppress excessive loggers by @mickqian in #14900
- Tiny refactor cleanup WorkflowContext.get_or_err by @fzyzcjy in #14890
- Tiny clean router load report logic by @fzyzcjy in #14889
- [model-gateway] code clean up on oai router in responses by @slin1237 in #14852
- [model-gateway] fix annotation error and code formatting by @slin1237 in #14910
- [model-gateway] fix imports and delete unused code by @slin1237 in #14911
- [docs] Fix kernel name by @almaslof in #14887
- [SMG][DS32][fix] support dsv32, add role developer by @jimmy-evo in #14307
- Update CI_PERMISSIONS.json by @Kangyan-Zhou in #14917
- [FIX][DS32]openai protocol: support openai message role: developer by @jimmy-evo in #14304
- [loader] enable private loader by @yinghai in #14620
- chore: bump SGLang version to 0.5.6.post2 by @sglang-bot in #14858
- extend timeout for b200 test by @b8zhong in #14925
- ci: adding more nightly tests to bot bump workflows by @dougyster in #14928
- update mistral detector by @JustinTong0323 in #14921
- support non-disturbing remote-instance-weight-loader by @amysaq2023 in #13125
- [refactor] Update reasoning parameter to `require_reasoning` by @JustinTong0323 in #14922
- [CPU] layernorm & fused add-layernorm kernels by @ZailiWang in #14074
- Add retry logic for scheduled CI tests by @alisonshao in #14771
- [CI] Add Mistral Large 3 Eagle nightly performance test by @alisonshao in #14525
- fix: handle Jinja2 template errors as client errors in OpenAIServingChat by @JustinTong0323 in #14748
- Fix black formatting in ci_utils.py by @alisonshao in #14932
- [bugfix] fix TBO crashes when attn_tp_size > 1 by @yuhyao in #13730
- fix: making the publish trace error check broader by @dougyster in #14931
- [CI]add nightly CI for glm4v_moe arch model by @zminglei in #14927
- Check KV4 compatibility with attention backends and add KV4 support to the attention_backend doc by @JackChuang in #14467
- Re-org eagle unit tests by @hnyls2002 in #14909
- Super tiny remove sgl_router_active_workers by @fzyzcjy in #14891
- remove dpsk3.2 sys prompt by @JustinTong0323 in #14923
- [DLLM] Add documentation for diffusion LLMs by @ClawSeven in #14358
- [RL] refactor flash rl weight reload in sglang by @AniZpZ in #14870
- [PP] Refactor PP to async mode by @XucSh in #11852
- [Fix] Enable applying different LoRA adapters to different transformers in multi-transformer pipelines by @Prozac614 in #14839
- [model-gateway] optimize worker selection by @ppraneth in #14894
- Fix negative duration panic in token bucket wait time calculation by @xiaguan in #14941
- Tiny add router e2e duration histogram by @fzyzcjy in #14892
- Tiny add e2e http request arrival metric by @fzyzcjy in #14893
- Super tiny remove non-updated sgl_router_worker_load by @fzyzcjy in #14888
- Super tiny move error.rs by @fzyzcjy in #14944
- direct register custom op for mm_fp4 by @b8zhong in #13699
- fix: trtllm mha attention auto-selection on sm120 by @b8zhong in #14842
- Super tiny refactor error.rs logic by @fzyzcjy in #14949
- [NPU] optimization for dsv3.2 by @ZhengdQin in #14572
- [NVIDIA] Enable TRTLLM BF16 MoE on Blackwell GPUs by @samuellees in #13798
- [Fix] suppress remote weight loading engine w/o mooncake installed by @ZailiWang in #14937
- enable flashinfer-jit-cache in image build and ci install to speed up model launch by @gongwei-130 in #14959
- [diffusion] chore: minor code cleanups and improve logging by @mickqian in #14916
- [Diffusion] upgrade cache-dit for better compatibility by @DefTruth in #14534
- [1/N] Update doc of Pipeline Parallelism by @ShangmingCai in #14985
- [PD] Add decode PP event loop for PD disaggregation by @bluecoffee8 in #14945
- [Diffusion] Tiny fix Docker Hub link in installation documentation by @BBuf in #14987
- Update CODEOWNERS for multimodal_gen by @mickqian in #14995
- [model-gateway] refactor: extract workflow engine to src/workflow module by @slin1237 in #14996
- [model-gateway] feat: add DAG parallel execution support and workflow optimization by @slin1237 in #14999
- [model-gateway] fix: handle workflow deadlock and optimize cycle detection by @slin1237 in #15000
- [model-gateway] refactor: workflow engine cleanup and minor optimization by @slin1237 in #15001
- Fix CI by reverting incorrect metric check logic by @Kangyan-Zhou in #15004
- Super tiny extract route_typed_request_once by @fzyzcjy in #14951
- Revert several PRs by @zhyncs in #14958
- Add KV4-capable backend flashmla and update server args by @JackChuang in #14989
- Refactor of http and engine entrypoints to allow custom override by @merrymercy in #14869
- Update ci permission by @merrymercy in #15014
- [model-gateway] refactor: unify worker management into modular workflow structure by @slin1237 in #15010
- Tune triton fused moe for the case of glm-4.6-fp8 b200 tp4 by @Qiaolin-Yu in #15020
- Feature/Fix multi lora scheduler blocking issue and evict LoRA None lastly by @ConnorLi96 in #14795
- [registry] Add a strict mode to model registration by @yinghai in #14933
- Super tiny remove unused argument by @fzyzcjy in #14966
- Super tiny fix confusing slash_command_handler hint by @fzyzcjy in #14976
- Super tiny add gsp-fast-prepare by @fzyzcjy in #14992
- Tiny extract SchedulerWatchdog by @fzyzcjy in #15021
- Add soft watchdogs to debug soft hangs by @fzyzcjy in #15023
- Clean up server args and engine startup processes by @merrymercy in #15015
- tiny update: use rope kernel in sgl-kernel for amd by @Qiaolin-Yu in #14955
- Tiny remove the duplicate function in spec v2 by @hnyls2002 in #14957
- Fix regression caused by fa3 block_table by @wenscarl in #15009
- Add a special label for b200 CI runner that can run kernel tests by @Kangyan-Zhou in #15033
- [CI]Add gb200 runner back by @Fridge003 in #15024
- Fix decode OOM caused by retraction by @hnyls2002 in #14939
- Super tiny remove unused log_request by @fzyzcjy in #15035
- Add `code` field and unify error responses for router by @fzyzcjy in #15028
- Tiny unify grpc existing error responses into new format by @fzyzcjy in #15030
- Tiny change http router response format to unify by @fzyzcjy in #15031
- Provide more fine grained error reason for reqwest error by @fzyzcjy in #15032
- Add error code in prometheus metrics and add X-SMG-Error-Code header by @fzyzcjy in #15036
- Add sgl_router_attempt_http_responses_total for single attempt information by @fzyzcjy in #15037
- call check_quantized_moe_compatibility after initialize by @chunyuan-w in #13876
- Mistral Large 3 NVFP4 support by @dcampora in #14485
- [diffusion] fix: use NDRotaryEmbedding in flux_2 by @mickqian in #15034
- [Fix] Disable trtllm moe backend for draft model for a quick fix by @samuellees in #15002
- feat: Improve LoRA compatibility by adding unified format detection and diffusers-based normalization by @MikukuOvO in #14659
- feature: adding nightly wheel workflow and indexer by @dougyster in #14924
- Fix GLM-4.6 tool calls don't support streaming output for arguments i… by @cynial in #13989
- [PP Prefill][NIXL] Fix PP mode transfer completion tracking to wait for all ranks by @YAMY1234 in #15027
- [NPU] perf update with kvcache nz & w4a8 quant by @liupeng374 in #14423
- Clean up GDN Init by @hebiao064 in #14855
- [VLM] Support VLM ViT Piecewise CUDA Graph by @yuan-luo in #14422
- Fix load metric not updated when using guard by @fzyzcjy in #15059
- Fix double decrease load by @fzyzcjy in #15060
- [Diffusion] Add multimodal gen profiling doc by @BBuf in #15069
- Fix IMA with flashinfer + spec + topk & Add radix attention test cases for eagle by @hnyls2002 in #13740
- [diffusion] doc: update profiling.md with output location details by @mickqian in #15072
- Tiny adjust CI run suite by @hnyls2002 in #15074
- Fix spec info's filter when reqs are finished right after prefill by @hnyls2002 in #14742
- [model-gateway] Simplify error response creation by @slin1237 in #15079
- [bug] fix grpc scheduler launcher breaking change by @slin1237 in #15080
- feature: ci failure monitor improvements by @dougyster in #15055
- fix: adding schedule for nightly wheel by @dougyster in #15054
- fix flaky image access in ci by switching to raw content url by @yhyang201 in #14940
- [scheduler] enhance scheduler in dp_attention mixed case with spec by @liupeng374 in #14201
- add transformers version validation for glm-4.6v moe models by @yhyang201 in #14998
- [model-gateway] Avoid MCP Server Initialization Issue by @xuwenyihust in #15065
- Add nightly accuracy test for DeepSeek V3.2 by @Fridge003 in #14935
- fix: dpskv32 chat history processing, default drop_thinking to true by @JustinTong0323 in #15064
- [model-gateway] Refactor worker steps and add update workflow by @slin1237 in #15085
- Add sglang:decode_sum_seq_lens metric by @fzyzcjy in #15066
- [Doc][TPU]add sglang-jax tpu docs by @JamesBrianD in #15056
- [Fix] avoid stream sync in _compute_mrope_positions by @narutolhy in #14956
- Support prefill max requests limitation by @fzyzcjy in #14993
- [model-gateway] Remove unused TokenizerMetrics to reduce CPU overhead by @slin1237 in #15087
- [Fix] Environment variable SGL_* is deprecated by @miter6 in #14943
- [model-gateway] Fix metric emission gaps and name mismatch by @slin1237 in #15093
- [model-gateway] Add circuit breaker and discovery watcher metrics by @slin1237 in #15094
- [model-gateway] optimize metric labels to avoid unnecessary allocations by @slin1237 in #15095
- Fix issue not reported when load decrement is incorrect by @fzyzcjy in #15061
- Avoid confusing zero value metric when worker is removed by @fzyzcjy in #15096
- [NPU][CI] change the trigger of release image workflow by @monkeyLoveding in #14969
- [ci] Move dpsk-r1-fp4 b200 test to stage b by @Qiaolin-Yu in #15084
- [scheduler] remove scheduler allgather for best throughput by @liupeng374 in #14294
- [NPU] bug fix for multi stream by @liupeng374 in #15048
- Introduce native kv cache move by @hnyls2002 in #15108
- [diffusion] feat support resolution check for video model by @Brain97 in #14881
- [Diffusion] Tiny fix _templated_ring_attention bug by @BBuf in #15053
- [Diffusion] feat: Add support for additional sampling parameters in video generation API by @BBuf in #15062
- diffusion: support webui by @wplf in #14961
- [kernel][moe] add moe topk fast by @thenumberouscode in #13969
- feat: support EPD disaggregation by @gty111 in #12263
- [CI] Add disaggregation decode PP test by @ShangmingCai in #15114
- [Diffusion] Refactor fuse qkv with QKVParallelLinear linear by @BBuf in #15090
- [model-gateway] Add new SMG metrics architecture with 6 layers by @slin1237 in #15106
- Fix tensor mismatch error in spec + topk > 1 + page_size > 1 by @ZeldaHuang in #14874
- [model-gateway] Implement Layer 1 HTTP metrics instrumentation by @slin1237 in #15121
- feat(metrics): implement Layer 2 router metrics (smg_router_*) by @slin1237 in #15124
- Fix Mamba2-based models' default attention backend by @roikoren755 in #15117
- Add NanoV3 reasoning parser support by @danielafrimi in #15113
- [hotfix]: Add missing args for 3FS bench_client.py by @hzh0425 in #14791
- feature: ci failure monitor slack bot by @dougyster in #15110
- [model-gateway] add streaming metrics (TTFT, TPOT, tokens, duration) for gRPC router by @slin1237 in #15125
- [refactor] Move trtllm_fp8_kv_kernel to triton_ops directory by @harvenstar in #15044
- feat(gateway): Add server-side TLS support by @Ratish1 in #15052
- [model-gateway] Parallelize metrics requests by @ppraneth in #14953
- Super tiny cleanup circuit breaker code by @fzyzcjy in #15098
- Fix circuit breaker wrong metrics by @fzyzcjy in #15099
- [NSA] Fix NSA backend assertion error when running DeepSeek-V3.2 PP with radix-cache by @YAMY1234 in #15086
- [Diffusion] Fix default resolution 720p width from 1080 to 1280 by @BBuf in #15058
- fix: adjusting frequency for ci failure monitor by @dougyster in #15134
- Add cyb70289 to CI permissions by @cyb70289 in #14938
- Fix cache aware wrong routing caused by incorrect load tracking by @fzyzcjy in #15101
- Fix H200 CI by commenting out Warmup Weights and JIT Compilation by @Kangyan-Zhou in #15139
- docs: update usage by @zhyncs in #15142
- [Qwen3-next] support mamba radix cache for overlap scheduler by @hanming-lu in #14792
- Enable TRT Allreduce Fusion by default for compatible models by @b8zhong in #14764
- [VLM] Support chunked vit attention by @yuan-luo in #14907
- [model-gateway] Add Layer 3 worker metrics (smg_worker_*) by @slin1237 in #15130
- [model-gateway] upgrade axum and axum server by @slin1237 in #15146
- [model-gateway] Add streaming metrics for harmony gRPC router by @slin1237 in #15147
- ci: adding errors to Github summary by @dougyster in #14778
- Fix import warnings by @merrymercy in #15144
- fix: move ci-bot by @dougyster in #15154
- [model-gateway] add mcp and discovery metrics by @slin1237 in #15156
- Tiny improve summary text in `bench_one_batch_server.py` by @hnyls2002 in #15158
- fix CompressedTensorsW8A8Int8 min_capability by @mmdbhs in #13914
- [diffusion]: support multi image input and qwen-image-edit-2509 by @yhyang201 in #15005
- feature: PR wheel by @dougyster in #15170
- [CPU] Add Gemma3RMSNorm kernel in sgl-kernel and add ut by @blzheng in #9324
- fix: adding date and fixing release name issue by @dougyster in #15174
- Fused two elementwise kernels for k_nope and k_pe concat by @kkHuang-amd in #14862
- Fix num running requests (load) wrong cleared for ongoing requests by @fzyzcjy in #15116
- [Feature] Fuse mrope all in 1 kernel by @DarkSharpness in #14906
- chore: change npu pr-test a2 runner by @Goalina in #15152
- [Diffusion] Cache dit support parallel by @BBuf in #15163
- [diffusion] fix: Fixed pytorch non-writable array warning by @RuixiangMa in #15017
- [diffusion] fix: fix video model sp when resolution is not specified by @mickqian in #15047
- [model-gateway] Remove legacy RouterMetrics and Rename SmgMetrics to Metrics and smg_labels to metrics_labels by @slin1237 in #15160
- Add missing assertion in NemotronH path by @roikoren755 in #15193
- [Diffusion] Fix AttributeError in _build_parallelism_config when acce… by @BBuf in #15196
- [diffusion] chore: minor code cleanups by @mickqian in #15190
- [Diffusion] Zimage support pack qkv by @BBuf in #15191
- fix(attention): Prevent trtllm_mha auto-selection with eagle3 speculative decoding by @Ratish1 in #15127
- [Diffusion] Z-Image FFN pack gate and up proj by @BBuf in #15201
- [NPU][eagle3] support qwen eagle3 on NPU by @Liwansi in #14820
- Add cache for flashinfer installation by @Kangyan-Zhou in #15153
- feature: create docker image from pr branch by @dougyster in #15185
- chore: update CI_PERMISSIONS by @zhyncs in #15212
- [Feature] npu support enable_torch_compile for torchair backend by @XDaoHong in #13410
- Add EPD disaggregation doc by @gty111 in https://github.com/sgl-project/sglang/pull/15224
- [Bugfix][Tool Call] Add null system prompt to support tool system prompt by @Muqi1029 in https://github.com/sgl-project/sglang/pull/15092
- [AMD CI] Temporarily disable 2 gpu accuracy test. by @saienduri in https://github.com/sgl-project/sglang/pull/15204
- [BugFix] Fix CPU inference failure by @cyb70289 in https://github.com/sgl-project/sglang/pull/15231
- [AMD CI] Fix typo. by @saienduri in https://github.com/sgl-project/sglang/pull/15229
- [Feature] Add AIME25 dataset support for SGLang simple_eval by @yurekami in https://github.com/sgl-project/sglang/pull/14990
- Remove duplicate bs=1 in nightly benchmark by @Fridge003 in https://github.com/sgl-project/sglang/pull/15162
- [Diffusion] fix pack qkv opt break tensor parallel by @BBuf in https://github.com/sgl-project/sglang/pull/15225
- [Qwen3-next] Add PD disaggregation support for mamba with extra_buffer by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15180
- [bugfix][quark] Fixed an issue where `per_token` could not be properly recognized when the token count was 1. by @haoyangli-amd in https://github.com/sgl-project/sglang/pull/14415
- Adding tool calling and reasoning parser support for Intern-S1 by @KennyYao2001 in https://github.com/sgl-project/sglang/pull/14866
- fix: removing latest-sglang=1 by @dougyster in https://github.com/sgl-project/sglang/pull/15220
- Increase timeout for TestDeepseekV3MTP for potential DeepGEMM cold start by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/15239
- [CPU] Add 4D input support for ROPE in sgl-kernel by @blzheng in https://github.com/sgl-project/sglang/pull/9337
- Support piecewise cuda graph for fused marlin moe by @ispobock in https://github.com/sgl-project/sglang/pull/15100
- Enhance runtime memory check in CI by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15192
- [Diffusion] Use current_platform.device_type to replace hard-coded cuda device by @yeahdongcn in https://github.com/sgl-project/sglang/pull/15232
- [diffusion] doc: update profiling.md by @mickqian in https://github.com/sgl-project/sglang/pull/15270
- [CI] Improve flaky 4 GPU test success rate by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15234
- Fix accuracy issue when using a16w16 mla_decode_fwd by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/14936
- [AMD] Support fused_rms_mxfp4_quant in the prefill stage for DeepSeek-R1-MXFP4 by @yichiche in https://github.com/sgl-project/sglang/pull/14975
- [misc] Upgrade cutedsl to 4.3.1 by @Fridge003 in https://github.com/sgl-project/sglang/pull/14857
- Fix lint by @Fridge003 in https://github.com/sgl-project/sglang/pull/15281
- Remove incorrect BlockRemoved event emission during node splits by @nealvaidya in https://github.com/sgl-project/sglang/pull/14934
- support non disturbing remote instance weight loader v2 by @amysaq2023 in https://github.com/sgl-project/sglang/pull/14997
- [sgl-kernel] Update flashmla to include fp8 sparse_mla optimizations by @hlu1 in https://github.com/sgl-project/sglang/pull/15242
- Fix lora doc by @Fridge003 in https://github.com/sgl-project/sglang/pull/15282
- Fix test_pp_single_node.py estimated time from 800s to 500s by @alisonshao in https://github.com/sgl-project/sglang/pull/15291
- fix: skipping TestEPDDisaggregationOneEncoder test by @dougyster in https://github.com/sgl-project/sglang/pull/15292
- Revert "[misc] Upgrade cutedsl to 4.3.1 (#14857)" by @zhyncs in https://github.com/sgl-project/sglang/pull/15293
- [NVIDIA] Fixes for NVFP4 all-gather with spec decoding by @trevor-m in https://github.com/sgl-project/sglang/pull/15280
- [NPU] fix for NPU memory settings logic by @iforgetmyname in https://github.com/sgl-project/sglang/pull/15258
- fix: moving decorator to header by @dougyster in https://github.com/sgl-project/sglang/pull/15297
- Minor style fixes to the scheduler.py by @merrymercy in https://github.com/sgl-project/sglang/pull/15218
- [Test] Update LoRA eviction policy tests to match current behavior by @alisonshao in https://github.com/sgl-project/sglang/pull/15283
- [BugFix] fix gptq_marlin_gemm has no parameter called b_bias by @ehuaa in https://github.com/sgl-project/sglang/pull/13571
- fix(function_call): fallback to decode when batch decode options differ by @luqitao in https://github.com/sgl-project/sglang/pull/15155
- Add Ollama-compatible API endpoints + Smart Router by @alisonshao in https://github.com/sgl-project/sglang/pull/14376
- [DeepSeekV3.2] Add pure TP+MTP test by @ashtonchew in https://github.com/sgl-project/sglang/pull/15088
- [Perf] Enable Flashinfer autotune by default by @elvischenv in https://github.com/sgl-project/sglang/pull/14357
- Update FP4 GEMM Benchmark by @b8zhong in https://github.com/sgl-project/sglang/pull/14449
- Revert "direct register custom op for mm_fp4 (#13699)" by @b8zhong in https://github.com/sgl-project/sglang/pull/15284
- diffusion: Add sampling parameters and model info endpoint to OpenAI API by @niehen6174 in https://github.com/sgl-project/sglang/pull/15071
- [PP] Add pp support for Qwen3-VL by @XucSh in https://github.com/sgl-project/sglang/pull/12333
- [Hotfix] Fix required enable_mamba_track argument for Flashinfer autotune path by @elvischenv in https://github.com/sgl-project/sglang/pull/15314
- Fix gpu-fault when running mtp in eager mode by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/15233
- [bug fix][pp] fix qwen3 model load by @XucSh in https://github.com/sgl-project/sglang/pull/15223
- Fix the accuracy issue when running mxfp4 dsv3 model and enable ep by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/15304
- fix qwenvl compressed tensors quantization weight loader by @LHXuuu in https://github.com/sgl-project/sglang/pull/11914
- [Piecewise CUDA Graph] Support INT8 by @b8zhong in https://github.com/sgl-project/sglang/pull/14918
- [bug fix][pp] fix weight load for qwen2.5-vl by @XucSh in https://github.com/sgl-project/sglang/pull/15138
- [Diffusion] Add flux2 tp2 test in ci to avoid break diffusion tensor parallel by @BBuf in https://github.com/sgl-project/sglang/pull/15237
- [Diffusion] Enhance trace export with gzip and integrity check by @BBuf in https://github.com/sgl-project/sglang/pull/15326
- Add cuda_graph_forward_passes_total and num_retracted_reqs_total by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15189
- Add realtime token counter metrics by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15198
- Tiny dump native stacktraces in watchdog by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15222
- Super tiny rename failure_count for consistency by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15186
- [PP] Minor code cleanup for Pipeline Parallelism by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15329
- tiny unify environ usage by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15335
- [model-gateway] reduce cpu overhead by @slin1237 in https://github.com/sgl-project/sglang/pull/15316
- [model-gateway] optimize worker registry and reduce lock contention in grpc client fetch by @slin1237 in https://github.com/sgl-project/sglang/pull/15336
- [DeepSeek-V32]Update nightly performance benchmark by @Fridge003 in https://github.com/sgl-project/sglang/pull/15308
- Fix dp run error with fp8-kv enable in high concurrency test by @kkHuang-amd in https://github.com/sgl-project/sglang/pull/15241
- fix: prevent points regex from matching checkpoints/endpoints by @xvlincaigou in https://github.com/sgl-project/sglang/pull/15120
- Fix condition check for require_gathered_buffer by @ch-wan in https://github.com/sgl-project/sglang/pull/15328
- Reserve more memory for DeepSeekOCR model and adjust server start timeout for DeepGEMM to reduce flakiness by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/15277
- [CI] Migrate LoRA tests to test/registered/lora/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15176
- Add request-level timestamp for when prefill finishes by @scottjlee in https://github.com/sgl-project/sglang/pull/14860
- [Deepseek V3.2] Support Overlap Spec + NSA by @b8zhong in https://github.com/sgl-project/sglang/pull/15307
- [VLM] Support cos sin cache for Qwen3-VL & GLM-4.1V by @yuan-luo in https://github.com/sgl-project/sglang/pull/15205
- Feature/trtllm mha workspace size configurable #15089 by @baonudesifeizhai in https://github.com/sgl-project/sglang/pull/15131
- feat: DeepSeek-V3.2 Streaming tool call output by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/15278
- Add doc for qwen3 next by @yizhang2077 in https://github.com/sgl-project/sglang/pull/15337
- fix: adjust time for test_epd_disaggregation.py by @dougyster in https://github.com/sgl-project/sglang/pull/15354
- Mistral Large 3 NVFP4 TRTLLM MoE support by @elvischenv in https://github.com/sgl-project/sglang/pull/15049
- unified management of environment variables for vlm cuda ipc transport by @yhyang201 in https://github.com/sgl-project/sglang/pull/14501
- Split test_piecewise_cuda_graph.py to optimize CI resource usage by @alisonshao in https://github.com/sgl-project/sglang/pull/15290
- Fix issue: ENABLE_BELOW_SM90 cannot be enabled on aarch64 CPU by @MarcoDWei in https://github.com/sgl-project/sglang/pull/12967
- [PP] Fix dynamic chunking strategy for PP by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15372
- [model-gateway] Replace PolicyRegistry RwLock with DashMap for lock-free policy lookups by @slin1237 in https://github.com/sgl-project/sglang/pull/15361
- Monkey patch deepseek-ocr's `v_head_dim` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15384
- Fix gpt-oss yarn with `truncate` argument by @hnyls2002 in https://github.com/sgl-project/sglang/pull/14270
- Fix warp illegal instruction in kimi k2 thinking PCG by @BBuf in https://github.com/sgl-project/sglang/pull/15306
- [bug fix][pp] fix inconsistent latency between tp by @XucSh in https://github.com/sgl-project/sglang/pull/15379
- [sgl-kernel][1/2] Fused qk_norm_rope for GLM4.6 by @Kevin-XiongC in https://github.com/sgl-project/sglang/pull/15141
- Clean up init function of the scheduler and event loop for PD by @merrymercy in https://github.com/sgl-project/sglang/pull/15298
- [perf]optimize w4afp8 kernel on deepseek-v3-0324 by @Bruce-x-1997 in https://github.com/sgl-project/sglang/pull/12921
- [Diffusion] Fix `sglang generate --perf-dump-path` to include per-denoising-step timings by @BBuf in https://github.com/sgl-project/sglang/pull/15397
- Tiny fix unknown route in prometheus metrics by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15404
- Support GPU execution time breakdown by forward mode metrics by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15396
- Tiny extract ModelRunnerOutput by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15400
- Super tiny add moe_ep_rank to prometheus labels by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15407
- Support EPLB balancedness prometheus metric without GPU->CPU synchronize by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15401
- [PP] Add dynamic chunking PP test by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15395
- [Tiny]Add warning for deepgemm on Blackwell by @Fridge003 in https://github.com/sgl-project/sglang/pull/15352
- Update benchmarks to use HF token from environment. by @FrankD412 in https://github.com/sgl-project/sglang/pull/15421
- [Performance] optimize NSA backend metadata computation for multi-step speculative decoding by @Johnsonms in https://github.com/sgl-project/sglang/pull/14781
- [AMD] Clear pre-built AITER kernels and warmup to prevent segfaults and test timeouts by @sunxxuns in https://github.com/sgl-project/sglang/pull/15318
- multimodal: precompute hash for MultimodalDataItem by @sufeng-buaa in https://github.com/sgl-project/sglang/pull/14354
- tiny fix lint on main by @b8zhong in https://github.com/sgl-project/sglang/pull/15424
- feat(dsv32): better error handling for DeepSeek-v3.2 encoder by @jimmy-evo in https://github.com/sgl-project/sglang/pull/14353
- Support using different attention backend for draft decoding. by @pyc96 in https://github.com/sgl-project/sglang/pull/14843
- [DLLM] Add CI for diffusion LLMs by @ClawSeven in https://github.com/sgl-project/sglang/pull/14723
- chore: update CI_PERMISSIONS by @zhyncs in https://github.com/sgl-project/sglang/pull/15431
- [Deepseek V3.2] Fix Deepseek MTP in V1 mode by @b8zhong in https://github.com/sgl-project/sglang/pull/15429
- [DLLM] Fix dLLM regression by @ClawSeven in https://github.com/sgl-project/sglang/pull/15371
- [diffusion] profiling: add bench_serving.py and VBench by @mickqian in https://github.com/sgl-project/sglang/pull/15410
- [Feature] Xiaomi `MiMo-V2-Flash` day0 support by @acelyc111 in https://github.com/sgl-project/sglang/pull/15207
- fix mindspore import warning by @b8zhong in https://github.com/sgl-project/sglang/pull/15287
- Update readme by @merrymercy in https://github.com/sgl-project/sglang/pull/15425
- Add customized sampler registration by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15423
- [Fix]: Refactor _build_req_from_sampling to use shallow_asdict by @cocoshe in https://github.com/sgl-project/sglang/pull/13782
- [amd] Add deterministic all-reduce kernel for AMD (ROCm) by @sunxxuns in https://github.com/sgl-project/sglang/pull/15340
- [AMD] Enable all diffusion models and fix encoder loading on MI325 by @zyzshishui in https://github.com/sgl-project/sglang/pull/13760
- [sgl-kernel] chore: update deepgemm version by @FlamingoPg in https://github.com/sgl-project/sglang/pull/13402
- fix: unreachable error check in retraction by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15433
- [AMD] Fix and add accuracy-test-2-gpu-amd back by @yctseng0211 in https://github.com/sgl-project/sglang/pull/15415
- [AMD] add unit-test-backend-8-gpu-amd back by @yctseng0211 in https://github.com/sgl-project/sglang/pull/15253
- Support FP8 MLA prefill and 128k context. by @weireweire in https://github.com/sgl-project/sglang/pull/14395
- [Auto Sync] Update scheduler_runtime_checker_mixin.py (20251219) by @merrymercy in https://github.com/sgl-project/sglang/pull/15437
- [diffusion]Support url image input by @IPostYellow in https://github.com/sgl-project/sglang/pull/15262
- diffusion: support qwen-image-edit-2511 by @yhyang201 in https://github.com/sgl-project/sglang/pull/15458
- Fix: Support multiple input images to SGLang Diffusion when using generate mode by @suyedu in https://github.com/sgl-project/sglang/pull/15394
- [Diffusion] kernel: timestep embedding kernel implementation by @66RING in https://github.com/sgl-project/sglang/pull/12995
- [diffusion] Add Sage Attention 3 Support for sm 120 (RTX5090) by @ryang-max in https://github.com/sgl-project/sglang/pull/15382
- [diffusion] fix: fix wrong validation on 2k resolution by @mickqian in https://github.com/sgl-project/sglang/pull/15478
- Add MiDasheng Model Support by @Jacki1223 in https://github.com/sgl-project/sglang/pull/15219
- fix: update model name after weights update by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15416
- [Diffusion] Add diffusion attention backends doc by @BBuf in https://github.com/sgl-project/sglang/pull/15408
- [NPU]Fix for ipc handle with npu by @hustmf in https://github.com/sgl-project/sglang/pull/14138
- [NPU] bugfix for chunkedprefill by @Hexq0210 in https://github.com/sgl-project/sglang/pull/15166
- Tiny fix mimo model conflicts with main by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15483
- Enhance protection rules of code owners by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15406
- Vertex generate pathway in server by @yashikagandhi-google in https://github.com/sgl-project/sglang/pull/15348
- EP Support for Piecewise Cuda Graph by @Oasis-Git in https://github.com/sgl-project/sglang/pull/14164
- fixed trtllm nvfp4 backend for moe by @khushgx in https://github.com/sgl-project/sglang/pull/15022
- [model-gateway] fix graceful shutdown for TLS/Non-TLS server by @slin1237 in https://github.com/sgl-project/sglang/pull/15491
- [model-gateway] refactor: extract common graceful shutdown code before TLS branch by @slin1237 in https://github.com/sgl-project/sglang/pull/15494
- [model-gateway] Improve logging in data_connector module by @slin1237 in https://github.com/sgl-project/sglang/pull/15495
- [model-gateway] Improve logging in policies module by @slin1237 in https://github.com/sgl-project/sglang/pull/15496
- [AMD] Add TP=8 models to nightly test and make TP=2 test stable by @michael-amd in https://github.com/sgl-project/sglang/pull/15296
- [DSv32] Move deep_gemm.get_paged_mqa_logits_metadata to init time as metadata by @qianlihuang in https://github.com/sgl-project/sglang/pull/15040
- [model-gateway] Improve logging across core modules by @slin1237 in https://github.com/sgl-project/sglang/pull/15497
- [model-gateway] Optimize workflow engine with pre-computed dependency graph by @slin1237 in https://github.com/sgl-project/sglang/pull/15503
- [model-gateway] Run workflow event subscribers concurrently by @slin1237 in https://github.com/sgl-project/sglang/pull/15504
- [model-gateway] simplify workflow engine backoff and reduce duplicate reads by @slin1237 in https://github.com/sgl-project/sglang/pull/15505
- [router] bugfix: cache_aware in grpc imbalance forward by @llfl in https://github.com/sgl-project/sglang/pull/15473
- Clean `hidden_states_before_norm` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15485
- [ci] remove rust benchmark in unit test ci by @slin1237 in https://github.com/sgl-project/sglang/pull/15510
- [model-gateway] Implement RAII load guard with response body attachment by @slin1237 in https://github.com/sgl-project/sglang/pull/15507
- [diffusion] refactor: deprecate WorkloadType by @mickqian in https://github.com/sgl-project/sglang/pull/15267
- [GLM-4.7] GLM-4.7 Tool Parser and Doc Update by @zRzRzRzRzRzRzR in https://github.com/sgl-project/sglang/pull/15333
- tiny fix sampling seed for completion api by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15498
- [AMD] remove the redundant projection by @yctseng0211 in https://github.com/sgl-project/sglang/pull/15178
- [AMD] Support fast_topk kernels in sgl-kernel by @hubertlu-tw in https://github.com/sgl-project/sglang/pull/15172
- [NPU] [BUGFIX] Fix NPU inference (torch_npu._npu_reshape_and_cache() crash) by @OrangeRedeng in https://github.com/sgl-project/sglang/pull/15484
- Optimize MiMo-V2-Flash by flashinfer fused allreduce by @yuan-luo in https://github.com/sgl-project/sglang/pull/15464
- vlm: Refactor engine vlm params and support processor output as input by @minleminzui in https://github.com/sgl-project/sglang/pull/14091
- [VLM] Support ViT Piecewise CUDA Graph for Qwen3-VL by @yuan-luo in https://github.com/sgl-project/sglang/pull/15320
- fix MiMo-V2-Flash typo by @acelyc111 in https://github.com/sgl-project/sglang/pull/15536
- [Diffusion] Wan video model support zero-cost weight offload and overlap with compute by @BBuf in https://github.com/sgl-project/sglang/pull/15511
- [diffusion] chore: allow all attention backends if not specified by @mickqian in https://github.com/sgl-project/sglang/pull/15530
- [diffusion] log: fix wrong use of suppress_other_loggers by @mickqian in https://github.com/sgl-project/sglang/pull/15534
- [Diffusion] Profiler doc add `--perf-dump-path` Desc by @BBuf in https://github.com/sgl-project/sglang/pull/15533
- [diffusion] refactor: support scheduling logic for reqs inside scheduler by @mickqian in https://github.com/sgl-project/sglang/pull/15479
- feat: Add limit-mm-data-per-request argument to server arguments by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/15418
- Fix docker gateway image name and add latest tag by @slin1237 in https://github.com/sgl-project/sglang/pull/15542
- [model-gateway] add model gateway multi-arch docker build, test and document docker image by @slin1237 in https://github.com/sgl-project/sglang/pull/15544
- [model-gateway] Optimize WASM Runtime with Instance Pooling and Component Caching by @ppraneth in https://github.com/sgl-project/sglang/pull/15515
- [model-gateway] bugfix: backward compatibility for GET endpoints by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15413
- fix: update tool name handling and argument extraction in R1 chat tem… by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/15547
- Optimize Bailing-MoE with FlashInfer Fused All-Reduce by @yuan-luo in https://github.com/sgl-project/sglang/pull/15526
- [sgl-kernel] Streamline kernel size report (Top 20 only) and clean up by @BBuf in https://github.com/sgl-project/sglang/pull/15552
- Apply new moe align block size kernel by @BBuf in https://github.com/sgl-project/sglang/pull/14134
- [CI] Migrate CUDA Graph tests to test/registered/cuda_graph/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15436
- feature: unified nightly metric layer by @dougyster in https://github.com/sgl-project/sglang/pull/15324
- [CI] Fix /rerun-stage command by using requests for workflow dispatch by @alisonshao in https://github.com/sgl-project/sglang/pull/15447
- [model-gateway]: Tool parser for glm47 by @UbeCc in https://github.com/sgl-project/sglang/pull/15520
- [Diffusion] Simplify `--perf-dump-path` JSON output (remove duplicate denoise steps) by @BBuf in https://github.com/sgl-project/sglang/pull/15537
- [diffusion] chore: minor improvements and typo-fixing by @mickqian in https://github.com/sgl-project/sglang/pull/15556
- [diffusion] bench: improve bench_serving by adding more controlling args by @mickqian in https://github.com/sgl-project/sglang/pull/15554
- [FusedMoE] Fix fused w13 tp sharded weight loading by @yinghai in https://github.com/sgl-project/sglang/pull/15432
- [EAGLE] Fix slow Triton compilation in EAGLE KV cache copy by chunking large num_locs_upper by @YAMY1234 in https://github.com/sgl-project/sglang/pull/15111
- Support piecewise cuda graph for dsv3 fp4 by @ispobock in https://github.com/sgl-project/sglang/pull/15531
- [Feature] Enable return routed experts by @ocss884 in https://github.com/sgl-project/sglang/pull/12162
- [CI] Fix AMD CI to exclude multimodal_gen from main_package filter by @sunxxuns in https://github.com/sgl-project/sglang/pull/15558
- [model-gateway] /parse/reasoning and /parse/function_call for sgl-model-gateway by @UbeCc in https://github.com/sgl-project/sglang/pull/15568
- [model-gateway] Use UUIDs for router-managed worker resources by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15540
- [1 / N] Clean up logprob utils by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15509
- Revert "[FusedMoE] Fix fused w13 tp sharded weight loading" by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15579
- [model-gateway] minor code clean up by @slin1237 in https://github.com/sgl-project/sglang/pull/15578
- chore: bump sgl-kernel version to 0.3.20 by @sglang-bot in https://github.com/sgl-project/sglang/pull/15564
- fix ds3.2 nsa backend prefill TBO by @Chen-0210 in https://github.com/sgl-project/sglang/pull/14901
- Add triton_fused_moe config for GLM-4.6-FP8 tp8 blackwell by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15569
- [model-gateway] add WorkerService abstraction for worker business logic by @slin1237 in https://github.com/sgl-project/sglang/pull/15580
- [model-gateway] refactor WorkerManager with fan_out helper and thin handlers by @slin1237 in https://github.com/sgl-project/sglang/pull/15583
- Split dpsk fp4 4 gpu tests and move the mtp part to real stage b by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15553
- Fix type mismatch in LoRA batch validation causing assertion failures by @ConnorLi96 in https://github.com/sgl-project/sglang/pull/15427
- [feature] support hicache-3fs usrbio lib build for ubuntu24.04 by @leihuang-sketch in https://github.com/sgl-project/sglang/pull/15230
- [model-gateway] add retry and circuit breaker support to gRPC routers by @slin1237 in https://github.com/sgl-project/sglang/pull/15585
- Optimize Rust CI builds with proper sccache configuration by @slin1237 in https://github.com/sgl-project/sglang/pull/15581
- [Tiny]Move deepseek fp4 cutlass moe test to per-commit test by @Fridge003 in https://github.com/sgl-project/sglang/pull/15565
- Tiny fix bench serving GSP mode cache file strategy by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15587
- Support gsp send routing id in bench serving by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15588
- [model-gateway] add retry support to OpenAI router chat endpoint by @slin1237 in https://github.com/sgl-project/sglang/pull/15589
- Adapt fixture-kit to gsm8k mixin by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15599
- Add glm-4.6-fp8 with/without mtp in nightly ci by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15566
- add decode round robin policy by @Hexq0210 in https://github.com/sgl-project/sglang/pull/15164
- Tiny avoid EnvField misuse by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15612
- Support soft watchdog for tokenizer/detokenizer/dp-controller processes by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15607
- Tiny add stuck simulation by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15613
- Tiny enable soft watchdog in CI for stuck without logs by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15616
- [diffusion] Remove Default post dit offload in local mode by @ryang-max in https://github.com/sgl-project/sglang/pull/15573
- [VLM] Tiny: Unify VLM environment variables by @yuan-luo in https://github.com/sgl-project/sglang/pull/15572
- [Diffusion] Support peak memory record in offline generate and serving by @BBuf in https://github.com/sgl-project/sglang/pull/15610
- [model-gateway] return 503 when all workers are circuit-broken by @slin1237 in https://github.com/sgl-project/sglang/pull/15611
- Fix router gRPC mode launch error caused by async loading by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15368
- Tiny add back missing router per attempt response metric by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15621
- Adjust wrong `mtp` meaning introduce by mimo by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15632
- bugfix[schedule]: Refactor sort method and add related UT by @SeanWeiSean in https://github.com/sgl-project/sglang/pull/13576
- chore: bump sgl-kernel version to 0.3.20 by @sglang-bot in https://github.com/sgl-project/sglang/pull/15590
- Improve engine customization interface by @merrymercy in https://github.com/sgl-project/sglang/pull/15635
- [GLM-ASR] GLM-ASR Support by @zRzRzRzRzRzRzR in https://github.com/sgl-project/sglang/pull/15570
- MoE: Skip SiLU/GELU activation for masked experts by @yuchengz816-bot in https://github.com/sgl-project/sglang/pull/15539
- [PD] Support fake decode for PD disaggregation without prefill node by @Baidu-AIAK in https://github.com/sgl-project/sglang/pull/14628
- [CI] Migrate nightly tests to test/registered/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15582
- [CI] Migrate Attention Backend tests to test/registered/attention/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15563
- [CI] Enable retry logic for flaky CI tests by @alisonshao in https://github.com/sgl-project/sglang/pull/14983
- [AMD] CI - Detect the aiter version and rebuild if needed by @yctseng0211 in https://github.com/sgl-project/sglang/pull/15460
- [AMD] CI - Improve image discovery with remote registry fallback by @bingxche in https://github.com/sgl-project/sglang/pull/15463
- fix: increasing H200 test timeout by @dougyster in https://github.com/sgl-project/sglang/pull/15600
- Support PP for zmq_to_scheduler by @gty111 in https://github.com/sgl-project/sglang/pull/15312
- [2/N] Update doc of Pipeline Parallelism with case study by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15684
- Fix pipeline parallelism doc typos by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15688
- [diffusion] Generalize layerwise offloader to flux1 by @ryang-max in https://github.com/sgl-project/sglang/pull/15633
- [CI] fix UT assert error in test_tokenizer_manager.py by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15646
- [Feature] support fastsafetensors by @stmatengss in https://github.com/sgl-project/sglang/pull/15091
- [Minor] Enhance JIT kernel and add dev docs by @DarkSharpness in https://github.com/sgl-project/sglang/pull/14570
- Super tiny add test_soft_watchdog to nightly by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15692
- fix: potential crash for missing stream attribute by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15644
- [model-gateway] Replace tokenizer with tokenizer registry for dynamic tokenizer loading in gRPC router by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/12968
- Fix Illegal Memory Access when fa3 + spec + topk + page_size > 1 by @yubofredwang in https://github.com/sgl-project/sglang/pull/15469
- Tiny add more information in retract logging. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15694
- [model-gateway] Optimize router selection with lock-free snapshots by @ppraneth in https://github.com/sgl-project/sglang/pull/15672
- [model-gateway]: add gRPC router embeddings endpoint implementation by @Ratish1 in https://github.com/sgl-project/sglang/pull/15273
- Tiny apply gsm8k mixin to ngram test by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15606
- Tiny fix CI by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15696
- [model-gateway] Fix tokenizer caching and improve error handling by @slin1237 in https://github.com/sgl-project/sglang/pull/15695
- Add kv_transfer_total_mb to metrics by @merrymercy in https://github.com/sgl-project/sglang/pull/15667
- Update MiniMax-M2 ToolCall and add MiniMax-M2.1 in Docs by @rogeryoungh in https://github.com/sgl-project/sglang/pull/15538
- [model-gateway] Add tokenize/detokenize HTTP endpoints and tokenizer management by @slin1237 in https://github.com/sgl-project/sglang/pull/15702
- [bug fix] fix hicache jit kernel by @XucSh in https://github.com/sgl-project/sglang/pull/15177
- Raise the accept length bar in dpsk-r1-fp4 spec decoding tests by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/15705
- Tiny add back fixes of incorrect metrics after worker removal by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15624
- Tiny add back router worker health metric and fix init state by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15622
- [AMD] Add AMD Nightly Performance & VLMs Accuracy Tests by @michael-amd in https://github.com/sgl-project/sglang/pull/15500
- [Feature][MM] split the images of one request into multiparts by @XucSh in https://github.com/sgl-project/sglang/pull/11828
- Tiny add flush in the suite partition status print. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15719
- Tiny fix test eagle infer b. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15716
- Move some quant args to its own section in environ variables doc by @vincentzed in https://github.com/sgl-project/sglang/pull/15722
- [docs] major SGL Model Gateway documentation update by @slin1237 in https://github.com/sgl-project/sglang/pull/15715
- [diffusion] http-server: fix openai endpoint image download strict content_type limit by @mickqian in https://github.com/sgl-project/sglang/pull/15717
- [CI] Remove pcg-omni-ci by @Oasis-Git in https://github.com/sgl-project/sglang/pull/15656
- [Feat] lora strength param by @Prozac614 in https://github.com/sgl-project/sglang/pull/15691
- Simplify server args by @merrymercy in https://github.com/sgl-project/sglang/pull/15704
- [2/N] clean duplicate code of logprob processing in spec. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15593
- update benchmark README to use --fp8-gemm-backend instead of env var by @leejnau in https://github.com/sgl-project/sglang/pull/15689
- [model-gateway]Enable IGW mode with gRPC router and auto enable IGW when service discovery is turned on by @YouNeedCryDear in https://github.com/sgl-project/sglang/pull/15459
- Tiny env cleanup in deepgemm by @vincentzed in https://github.com/sgl-project/sglang/pull/15706
- Fix smg_http_requests_total semantics by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15655
- Tiny refactor request logger by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15740
- Support JSON format request logging for easier parsing by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15743
- Retry removing wrong logic about max total token in spec decoding by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15748
- Tiny unify realtime_tokens_total metric by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15747
- Add metrics for having prefill and decode in different ranks by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15752
- Super tiny code cleanup by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15652
- Tiny add num retracted tokens metric by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15653
- Add request counter in addition to existing response counter by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15768
- Tiny add flush for CI crash locating by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15769
- [diffusion] refactor: unify the profiling api for all executors by @mickqian in https://github.com/sgl-project/sglang/pull/15718
- [NPU]qwen3 pp bugfix by @Liwansi in https://github.com/sgl-project/sglang/pull/15390
- [NPU] Bug fix in device detect by @hustmf in https://github.com/sgl-project/sglang/pull/14137
- [model-gateway] Fix IGW routing and optimize RouterManager by @slin1237 in https://github.com/sgl-project/sglang/pull/15741
- [bug] fix code formatting which blocks ci by @slin1237 in https://github.com/sgl-project/sglang/pull/15780
- [model-gateway] Implement Zero-Copy Vision Tensor Access by @ppraneth in https://github.com/sgl-project/sglang/pull/15750
- fix: nightly fix b200 gpqa by @dougyster in https://github.com/sgl-project/sglang/pull/15745
- fix(monitoring): update Grafana dashboard metrics prefix from sglang: to sglang_ by @yurekami in https://github.com/sgl-project/sglang/pull/15758
- [model-gateway] Fix logging module name, parse endpoint context, and tokenizer factory by @slin1237 in https://github.com/sgl-project/sglang/pull/15782
- Move limit-mm-data-per-request to make code clean by @JustinTong0323 in https://github.com/sgl-project/sglang/pull/15775
- Add LoRA metrics for potential auto scaling by @ConnorLi96 in https://github.com/sgl-project/sglang/pull/15149
- [model-gateway] release smg 0.3.0 by @slin1237 in https://github.com/sgl-project/sglang/pull/15781
- [Auto Sync] Update server_args.py (20251223) by @merrymercy in https://github.com/sgl-project/sglang/pull/15700
- Fix code sync scripts by @merrymercy in https://github.com/sgl-project/sglang/pull/15787
- Add overlap scheduling for embeddings code path by @satyamk7054 in https://github.com/sgl-project/sglang/pull/14032
- Tiny refactor select_workers API for future passing more information by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15596
- Add manual routing policy for router by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15586
- DP: support piggyback server load report by @changhuaixin in https://github.com/sgl-project/sglang/pull/11469
- Clarify `None` handling in sglang's environ by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15770
- [Diffusion] Refactor attention backend checking to use backend enum by @yeahdongcn in https://github.com/sgl-project/sglang/pull/15555
- [Fix] Remove unused LoRA application logic from RowParallelLinearWithLoRA class in linear.py by @Prozac614 in https://github.com/sgl-project/sglang/pull/15801
- [CI] Add tests to validate the size, extension, and format of output images/videos. by @Prozac614 in https://github.com/sgl-project/sglang/pull/15736
- [VLM] Support apply qk norm in multi cuda streams by @yuan-luo in https://github.com/sgl-project/sglang/pull/15720
- Tiny fix missing record_router_upstream_response by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15811
- ci: migrate MLA tests to test/registered/mla/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15798
- fuse ssm state store into chunk_gated_delta_rule_fwd_h by @yizhang2077 in https://github.com/sgl-project/sglang/pull/15409
- Adjust server args for Mimo-v2-flash model by @ispobock in https://github.com/sgl-project/sglang/pull/15803
- [1/N][Sparse With Hicache]: Add Sparse Interface by @hzh0425 in https://github.com/sgl-project/sglang/pull/14741
- [JIT sgl-kernel] Jit support per tensor quant by @BBuf in https://github.com/sgl-project/sglang/pull/15709
- [Diffusion] Flux.1.dev support Tensor Parallel by @BBuf in https://github.com/sgl-project/sglang/pull/15666
- Cleanup `ModelRunner` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15802
- [diffusion] log: avoid logging in hot path if unnecessary by @mickqian in https://github.com/sgl-project/sglang/pull/15818
- [Nemotron 3 Nano] Add triton MoE configs by @roikoren755 in https://github.com/sgl-project/sglang/pull/15815
- [MiMoV2Flash] fix: respect --swa-full-tokens-ratio arg by @acelyc111 in https://github.com/sgl-project/sglang/pull/15488
- Tiny change bench-serving to use routing key header by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15827
- feat: log request when e2e latency exceeds the specified value by @zhooooong in https://github.com/sgl-project/sglang/pull/15759
- Custom All Reduce for Piecewise Cuda Graph by @Oasis-Git in https://github.com/sgl-project/sglang/pull/15356
- Change GLM-ASR class name by @zRzRzRzRzRzRzR in https://github.com/sgl-project/sglang/pull/15772
- [diffusion] improve: improve post-processing by moving compute-intensive tasks to GPU by @mickqian in https://github.com/sgl-project/sglang/pull/15822
- Use X-SMG-Routing-Key header instead of json body and add tests by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15826
- Clean up the `__init__` of TokenizerManager and DetokenizerManager by @merrymercy in https://github.com/sgl-project/sglang/pull/15796
- Optimize FP8 MLA KV cache writes with Triton kernel by @harvenstar in https://github.com/sgl-project/sglang/pull/15522
- [model-gateway] update ManualPolicy with header-based routing by @slin1237 in https://github.com/sgl-project/sglang/pull/15847
- fix: improving format and design by @dougyster in https://github.com/sgl-project/sglang/pull/15791
- ci: add continue-on-error for scheduled PR tests by @alisonshao in https://github.com/sgl-project/sglang/pull/15701
- Fix chunk_kda_fwd missing argument by @ispobock in https://github.com/sgl-project/sglang/pull/15851
- Separate swa and local attention chunk cache eviction by @ispobock in https://github.com/sgl-project/sglang/pull/15820
- Super tiny move last_prefill_tokens to metrics mixin by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15857
- Fix prefill num tokens metrics by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15858
- Use allow auto truncate in the OpenAI API endpoint by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/15369
- Introduce `ModelRunnerKVCacheMixin` to simplify the code. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15821
- [NPU] update Mixed chunk op to FIA by @Hexq0210 in https://github.com/sgl-project/sglang/pull/15518
- [VLM] Refactor load_mm_data to improve performance by @yuan-luo in https://github.com/sgl-project/sglang/pull/14644
- Fix swa available memory check by @ispobock in https://github.com/sgl-project/sglang/pull/15867
- [Bug] fix piggyback load report return `None` bug by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15870
- diffusion: support Qwen-Image-Layered by @chhnb in https://github.com/sgl-project/sglang/pull/15817
- [diffusion] ZImage support Tensor Parallel by @zhaziqwe in https://github.com/sgl-project/sglang/pull/15849
- [BUGFIX] fix edge case for qwen3-next by @yizhang2077 in https://github.com/sgl-project/sglang/pull/14209
- fix: warn once per env var key by @alphabetc1 in https://github.com/sgl-project/sglang/pull/15846
- Tiny log warn users when tracing is automatically disabled by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15889
- [CI] Fix CI test case skip problem by @Prozac614 in https://github.com/sgl-project/sglang/pull/15874
- [Fix] assert error in log_prefill_stats by @changhuaixin in https://github.com/sgl-project/sglang/pull/15881
- [diffusion] refactor: centralize hardware platform detection and streamline environment variable management by @mickqian in https://github.com/sgl-project/sglang/pull/15842
- [Diffusion] Improve qwen image edit performance to align with LightX2V by @BBuf in https://github.com/sgl-project/sglang/pull/15812
- [diffusion] ci: support returning request id from endpoint by @mickqian in https://github.com/sgl-project/sglang/pull/15844
- [feat] Init support for webui-I2I by @wplf in https://github.com/sgl-project/sglang/pull/15778
- Revert "[feat] Init support for webui-I2I" by @merrymercy in https://github.com/sgl-project/sglang/pull/15906
- [Tool Call][DSV32] Streamline function call parameters by @Muqi1029 in https://github.com/sgl-project/sglang/pull/14750
- [model-gateway]: fix crash in embedding worker health check by @Ratish1 in https://github.com/sgl-project/sglang/pull/15910
- Revert "[VLM] Refactor load_mm_data to improve performance" by @merrymercy in https://github.com/sgl-project/sglang/pull/15911
- refactor: add type hints to scheduler mixins by @ch-wan in https://github.com/sgl-project/sglang/pull/15913
- hotfix: add type hints to scheduler mixins by @ch-wan in https://github.com/sgl-project/sglang/pull/15916
- Revert embedding integration tests from 5f3a47d by @slin1237 in https://github.com/sgl-project/sglang/pull/15914
- [BugFix][VLM] Correct weight loading with tie_word_embeddings == False by @ZhengWG in https://github.com/sgl-project/sglang/pull/15398
- fix: adding deepseek base tests to b200 by @dougyster in https://github.com/sgl-project/sglang/pull/15915
- [model-gateway] add JWT/OIDC authentication for control plane APIs by @slin1237 in https://github.com/sgl-project/sglang/pull/15850
- Add a test case for crash dump by @merrymercy in https://github.com/sgl-project/sglang/pull/15905
- [diffusion] chore: remove stepvideo code by @yhyang201 in https://github.com/sgl-project/sglang/pull/15918
- Tiny cleanup the models' name in `test_utils` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15920
- Add Mimo-v2-flash model to ci test by @ispobock in https://github.com/sgl-project/sglang/pull/15887
- [model-gateway] Add consistent hashing for ManualPolicy routing by @slin1237 in https://github.com/sgl-project/sglang/pull/15907
- [diffusion] refactor: unify model loading and offloading behavior by @mickqian in https://github.com/sgl-project/sglang/pull/15923
- [NPU] Support w4a8 with activation clip by @jiaming1130 in https://github.com/sgl-project/sglang/pull/14736
- [model-gateway] optimize radix tree memory and reduce allocations by @slin1237 in https://github.com/sgl-project/sglang/pull/15933
- [model-gateway]: fix grpc embedding test by @Ratish1 in https://github.com/sgl-project/sglang/pull/15934
- [model-gateway] Add PrefixHash load balancing policy for KV cache-aware routing by @slin1237 in https://github.com/sgl-project/sglang/pull/15935
- Fix temp_prefill_info assertion error in PP disaggregation mode by @harvenstar in https://github.com/sgl-project/sglang/pull/15943
- chore: bump mooncake version to 0.3.8 by @ShangmingCai in https://github.com/sgl-project/sglang/pull/15886
- [diffusion] logging: log avail gpu mem while loading and generating by @mickqian in https://github.com/sgl-project/sglang/pull/15936
- [diffusion] chore: remove useless params by @yhyang201 in https://github.com/sgl-project/sglang/pull/15925
- [model-gateway]: remove unnecessary comment by @Ratish1 in https://github.com/sgl-project/sglang/pull/15947
- Clean up logging by @merrymercy in https://github.com/sgl-project/sglang/pull/15919
- Support kv8 (FP8) with torch_native attention backend by @JackChuang in https://github.com/sgl-project/sglang/pull/12596
- Tiny fix cannot launch nvfp4 checkpoint with bf16 kv cache by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15986
- [diffusion] chore: clean ComposedPipelineBase by @mickqian in https://github.com/sgl-project/sglang/pull/15937
- Add correctness validation for decode_attention test by @harvenstar in https://github.com/sgl-project/sglang/pull/15806
- [Feature] JIT Fused QK norm + qk norm clean up by @DarkSharpness in https://github.com/sgl-project/sglang/pull/15835
- Refactor fp8 nextn layer for DeepSeek nvfp4 checkpoint by @Fridge003 in https://github.com/sgl-project/sglang/pull/15353
- SGLang Tracing: fix attribute errors (header extraction & bootstrap span closing) by @vladnosiv in https://github.com/sgl-project/sglang/pull/15693
- Refactor: separate CI-specific weight validation into dedicated module by @alisonshao in https://github.com/sgl-project/sglang/pull/15216
- [fix]deepgemm precompile when warmup by @TZHelloWorld in https://github.com/sgl-project/sglang/pull/15891
- Tiny add smg_manual_policy_cache_entries metric by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15987
- Tiny extract PeriodicTask in router by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15988
- Unify spec v2's naming manner. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15990
- Add EAGLE3 test with MMLU dataset. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/15945
- Update test parameters for deepep_large test by @ch-wan in https://github.com/sgl-project/sglang/pull/16001
- Add micro benchmarks for manual policy by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15991
- Tiny fix WASM test errors on machines with many cores by @fzyzcjy in https://github.com/sgl-project/sglang/pull/15992
- Apply fixture-kit mode to MMMUVLMMixin by @majiayu000 in https://github.com/sgl-project/sglang/pull/15615
- Tiny cleanup duplicate code for multi-layer eagle worker. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/16004
- [Doc]Update MTP moe backends for EP document by @Fridge003 in https://github.com/sgl-project/sglang/pull/16013
- [diffusion] CI: relax threshold by supporting different profiles by @mickqian in https://github.com/sgl-project/sglang/pull/16002
- Temporarily disable temp_prefill_info assertion to unblock CI by @fzyzcjy in https://github.com/sgl-project/sglang/pull/16008
- [diffusion] chore: fix default offload setting for image generation model by @mickqian in https://github.com/sgl-project/sglang/pull/15928
- Fix metrics by @merrymercy in https://github.com/sgl-project/sglang/pull/15998
- [JIT kernel] Jit kernel tests support ci by @BBuf in https://github.com/sgl-project/sglang/pull/15939
- [diffusion] fix: fix stages not logged when perf_dump_path is provided by @mickqian in https://github.com/sgl-project/sglang/pull/16016
- [scheduler] fix: correcting `extend_logprob_start_len` calculation by @ch-wan in https://github.com/sgl-project/sglang/pull/15922
- Add host tensor allocator for memory_pool_host and support Mooncake standalone storage by @YiXR in https://github.com/sgl-project/sglang/pull/14873
- [diffusion] modify sgld webui for reference to content task and better visualization capabilities by @wplf in https://github.com/sgl-project/sglang/pull/16017
- [Docs] Improve documentation index page by @merrymercy in https://github.com/sgl-project/sglang/pull/16028
- feat PD: add eagle3 support for DeepSeek V3 in EP mode by @QiuMike in https://github.com/sgl-project/sglang/pull/14280
- Tiny print launch command with `shlex` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/16010
- [model-gateway] Organize Rust CLI arguments into logical groups for better --help output by @slin1237 in https://github.com/sgl-project/sglang/pull/16036
- [model-gateway] Organize CLI arguments into logical groups for better --help output by @slin1237 in https://github.com/sgl-project/sglang/pull/16035
- [model-gateway][CI] Display benchmark results in GitHub Actions summary by @slin1237 in https://github.com/sgl-project/sglang/pull/16037
- [model-gateway] perf: optimize observability logging for minimal CPU/memory overhead by @slin1237 in https://github.com/sgl-project/sglang/pull/16039
- [model-gateway]: optimize metrics for minimal CPU and memory overhead by @slin1237 in https://github.com/sgl-project/sglang/pull/16041
- [Diffusion] Disable packed QKV for FLUX & Z-Image by @BBuf in https://github.com/sgl-project/sglang/pull/16038
- [ci] update genai bench to 0.0.3 for pd testing by @slin1237 in https://github.com/sgl-project/sglang/pull/16051
- [model-gateway] update WorkerRegistryStats with connection mode and circuit breaker info by @slin1237 in https://github.com/sgl-project/sglang/pull/16046
- Update model and feature support for Ascend NPU by @Hexq0210 in https://github.com/sgl-project/sglang/pull/16003
- [docs] Fix non-clickable ToC links in model gateway documentation by @slin1237 in https://github.com/sgl-project/sglang/pull/16054
- [HiCache] Fix deadlock when creating new group by @XucSh in https://github.com/sgl-project/sglang/pull/15805
- [Diffusion] Refactor qwen_image's rope in a single helper func by @BBuf in https://github.com/sgl-project/sglang/pull/16047
- Clamp logprob tokens with model vocab size by @cklxx in https://github.com/sgl-project/sglang/pull/14414
- [Diffusion] Qwen image edit support qknorm optimization by @BBuf in https://github.com/sgl-project/sglang/pull/16062
- [JIT kernel] Jit kernel add codeowners by @BBuf in https://github.com/sgl-project/sglang/pull/16085
- [diffusion] chore: minor refactor by streamlining the VAE class hierarchy by @mickqian in https://github.com/sgl-project/sglang/pull/16069
- [model-gateway] fix tokenizer to match transformers special token handling by @slin1237 in https://github.com/sgl-project/sglang/pull/16087
- [diffusion] fix: fix serving with dit-layerwise-offload enabled by @mickqian in https://github.com/sgl-project/sglang/pull/16066
- [model-gateway] Add classification model support infrastructure by @slin1237 in https://github.com/sgl-project/sglang/pull/16061
- [model-gateway] Improve tree benchmark with realistic multi-tenant scenarios by @slin1237 in https://github.com/sgl-project/sglang/pull/14838
- [Feature] support bench jsonl files with sharegpt format by @jiapingW in https://github.com/sgl-project/sglang/pull/15057
- [model-gateway] Optimize radix tree timestamp updates for multi-tenant scaling by @slin1237 in https://github.com/sgl-project/sglang/pull/16093
- [CI] fix test_mla_deepseek_v3.py by @alphabetc1 in https://github.com/sgl-project/sglang/pull/16096
- [model-gateway] Add classify pipeline stages and protocol types by @slin1237 in https://github.com/sgl-project/sglang/pull/16094
- [model-gateway] Optimize INSERT with leaf-only timestamp updates by @slin1237 in https://github.com/sgl-project/sglang/pull/16097
- [model-gateway] Wire classify pipeline to gRPC router by @slin1237 in https://github.com/sgl-project/sglang/pull/16098
- [model-gateway] Generate UUID-based request IDs for embedding/classify by @slin1237 in https://github.com/sgl-project/sglang/pull/16100
- [model-gateway] Fix duplicate classify prefix in response ID by @slin1237 in https://github.com/sgl-project/sglang/pull/16101
- Enable testing slash command handler changes on non-fork PRs by @alisonshao in https://github.com/sgl-project/sglang/pull/15921
- [model-gateway]: optimize prefix_match with zero-copy tenant and deferred char count by @slin1237 in https://github.com/sgl-project/sglang/pull/16099
- Fix extend_input_len calculation in decode.py by @ch-wan in https://github.com/sgl-project/sglang/pull/16103
- Add a new branch cut GH workflow, and adopt setuptools-scm for version control by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/15985
- ci: migrate remaining spec/eagle tests to test/registered/spec/ by @alisonshao in https://github.com/sgl-project/sglang/pull/15800
- Clean up swa handling in fa3 backend by @ispobock in https://github.com/sgl-project/sglang/pull/15877
- [diffusion] improve: tiny speedup qwen-image-edit-2511 by avoiding unnecessary calculation by @mickqian in https://github.com/sgl-project/sglang/pull/15896
- [diffusion] improve: tiny improve layerwise offload manager by consolidating weights per layer by @mickqian in https://github.com/sgl-project/sglang/pull/16081
- [CI] Fix LoRA downloading issues and respect offline flag by @Prozac614 in https://github.com/sgl-project/sglang/pull/15813
- Reduce CI failure monitor to run once every 12 hours by @Kangyan-Zhou in https://github.com/sgl-project/sglang/pull/16123
- [LoRA] Torch native backend: rework implementation and updated tests by @vlserov in https://github.com/sgl-project/sglang/pull/15187
- Refactor: Moving `extend_logprob_start_len` calculation out of `prepare_for_extend` by @ch-wan in https://github.com/sgl-project/sglang/pull/16105
- Enhance comments in set_extend_input_len method by @ch-wan in https://github.com/sgl-project/sglang/pull/16130
- Fix Qwen Next GDN w/ Radix Cache by @hebiao064 in https://github.com/sgl-project/sglang/pull/16053
- Add PR review process into template. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/16133
- Add `attack204` into CI_PERMISSION by @hnyls2002 in https://github.com/sgl-project/sglang/pull/16131
- [AMD CI] Organize AMD nightly perf test files by @bingxche in https://github.com/sgl-project/sglang/pull/16114
- [diffusion] model: support TurboWan2.1-T2V-1.3B/14B SLA by @IPostYellow in https://github.com/sgl-project/sglang/pull/15888
- Reworked fast_pos_embed_interpolate() using torch by @terfendail in https://github.com/sgl-project/sglang/pull/10959
- Fix wrong assigning `extend_input_len_per_req` with eagle. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/16129
- Tiny rename test_deepseek_v3_fp4_mtp_stage_b.py by @Qiaolin-Yu in https://github.com/sgl-project/sglang/pull/16141
- Fix race condition in /tag-and-rerun-ci command by @alisonshao in https://github.com/sgl-project/sglang/pull/16142
- [PP] Add a minimum chunk value for PP dynamic chunking by @ShangmingCai in https://github.com/sgl-project/sglang/pull/16140
- [VLM] Support Video for InternVL3_5 by @yuan-luo in https://github.com/sgl-project/sglang/pull/15942
- [CI] Fixing release with cut branch workflow by @Fridge003 in https://github.com/sgl-project/sglang/pull/16153
- cherry-pick: SGLang Tracing: Supports propagating trace headers through sgl.Engine by @ishandhanani in https://github.com/sgl-project/sglang/pull/16244
New Contributors
- @baonudesifeizhai made their first contribution in #14195
- @jinke446 made their first contribution in #14266
- @Valentine233 made their first contribution in #12441
- @dcampora made their first contribution in #14213
- @rauletorresc made their first contribution in #14225
- @cherryblo made their first contribution in #14143
- @gmixiaojin made their first contribution in #13996
- @Brain97 made their first contribution in #14234
- @gwarmstrong made their first contribution in #14555
- @btw616 made their first contribution in #14412
- @momaek made their first contribution in #14573
- @Prozac614 made their first contribution in #14606
- @MingxuZh made their first contribution in #14687
- @wplf made their first contribution in #14830
- @trangdough made their first contribution in #14554
- @yuchengz816-bot made their first contribution in #14313
- @luketong777 made their first contribution in #14877
- @Vladimir221 made their first contribution in #12287
- @MikukuOvO made their first contribution in #14659
- @cynial made their first contribution in #13989
- @JamesBrianD made their first contribution in #15056
- @thenumberouscode made their first contribution in #13969
- @danielafrimi made their first contribution in #15113
- @Ratish1 made their first contribution in #15052
- @mmdbhs made their first contribution in #13914
- @Goalina made their first contribution in #15152
- @RuixiangMa made their first contribution in #15017
- @XDaoHong made their first contribution in #13410
- @yurekami made their first contribution in https://github.com/sgl-project/sglang/pull/14990
- @KennyYao2001 made their first contribution in https://github.com/sgl-project/sglang/pull/14866
- @nealvaidya made their first contribution in https://github.com/sgl-project/sglang/pull/14934
- @luqitao made their first contribution in https://github.com/sgl-project/sglang/pull/15155
- @ashtonchew made their first contribution in https://github.com/sgl-project/sglang/pull/15088
- @xvlincaigou made their first contribution in https://github.com/sgl-project/sglang/pull/15120
- @MarcoDWei made their first contribution in https://github.com/sgl-project/sglang/pull/12967
- @FrankD412 made their first contribution in https://github.com/sgl-project/sglang/pull/15421
- @cocoshe made their first contribution in https://github.com/sgl-project/sglang/pull/13782
- @IPostYellow made their first contribution in https://github.com/sgl-project/sglang/pull/15262
- @suyedu made their first contribution in https://github.com/sgl-project/sglang/pull/15394
- @66RING made their first contribution in https://github.com/sgl-project/sglang/pull/12995
- @Jacki1223 made their first contribution in https://github.com/sgl-project/sglang/pull/15219
- @yashikagandhi-google made their first contribution in https://github.com/sgl-project/sglang/pull/15348
- @khushgx made their first contribution in https://github.com/sgl-project/sglang/pull/15022
- @qianlihuang made their first contribution in https://github.com/sgl-project/sglang/pull/15040
- @OrangeRedeng made their first contribution in https://github.com/sgl-project/sglang/pull/15484
- @UbeCc made their first contribution in https://github.com/sgl-project/sglang/pull/15520
- @SeanWeiSean made their first contribution in https://github.com/sgl-project/sglang/pull/13576
- @zhaziqwe made their first contribution in https://github.com/sgl-project/sglang/pull/15849
- @jiaming1130 made their first contribution in https://github.com/sgl-project/sglang/pull/14736
- @vladnosiv made their first contribution in https://github.com/sgl-project/sglang/pull/15693
- @TZHelloWorld made their first contribution in https://github.com/sgl-project/sglang/pull/15891
- @majiayu000 made their first contribution in https://github.com/sgl-project/sglang/pull/15615
- @cklxx made their first contribution in https://github.com/sgl-project/sglang/pull/14414
Full Changelog: v0.5.6...v0.5.7