Highlights
- AMD AI Dev Day 2025 SGLang (slide), PyTorch Conference 2025 SGLang (slide)
- Model gateway v0.2 release: https://docs.sglang.ai/advanced_features/router.html
- [beta] Overlap scheduler for speculative decoding: #11762
- [beta] Piecewise CUDA graph for prefill: #11490
- Prefix cache for Qwen3-Next and GDN/Mamba models: #11214
- Full set of optimizations for DeepSeek-V3.2 (MTP, PD-Disagg, Function Calling) (https://docs.sglang.ai/basic_usage/deepseek_v32.html, #11989)
- Various Blackwell kernel optimizations
- DGX Spark Support: https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/
- KTransformers integration: https://lmsys.org/blog/2025-10-22-KTransformers/
- New model support: Nemotron, DeepSeek OCR, Qwen3-Omni, Olmo 3
- Native ModelOpt quantization support
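The model gateway highlighted above fronts the usual OpenAI-compatible endpoints. As a minimal sketch of what a client sends through it (the address, port, and model name below are placeholder assumptions for a local deployment, not fixed values):

```python
import json

# Hypothetical local gateway address -- adjust for your deployment.
GATEWAY_URL = "http://127.0.0.1:30000/v1/chat/completions"

# OpenAI-style chat completion request body; "my-served-model" is a placeholder id.
payload = {
    "model": "my-served-model",
    "messages": [
        {"role": "user", "content": "Hello!"},
    ],
    "max_tokens": 64,
    "stream": False,
}

# The gateway expects a JSON body, e.g. POSTed with requests/httpx:
#   requests.post(GATEWAY_URL, json=payload)
body = json.dumps(payload)
print(body)
```

See the router docs linked above for the gateway's actual flags and supported routes.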
What's Changed
- [router] add ipv6 support across all components by @slin1237 in #11219
- Remove env var warnings for release by @merrymercy in #11262
- Enable native ModelOpt quantization support (1/3) by @Edwardf0t1 in #7149
- [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` by @CatherineSue in #11270
- disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 by @gongwei-130 in #11274
- docker: add manifest to versioned docker releases by @ishandhanani in #11268
- [Bug] Fix incorrect assertion in FA4 and add UT. by @lifuhuang in #11182
- [router][grpc] Refine streaming processes by @CatherineSue in #11277
- Fix code sync scripts by @merrymercy in #11276
- [Auto Sync] Update test_utils.py (20251006) by @merrymercy in #11280
- Rename max_micro_batch_size -> pp_max_micro_batch_size by @merrymercy in #11279
- Reverse the AMD CI test back to 1200s and split the 8-gpu deepseek job into two. by @sunxxuns in #11238
- Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components by @ConnorLi96 in #11261
- fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration by @JustinTong0323 in #11282
- docs: update sgl-kernel README by @zhyncs in #11286
- chore: bump sgl-kernel version to 0.3.15 by @sglang-bot in #11281
- [router][grpc] Fix proto3 default value mismatches and cleanup unused fields by @CatherineSue in #11283
- convert test_deterministic into unit tests by @skyzh in #11095
- Feature/longbench v2 evaluation utils by @alhridoy in #10949
- [ci] fix pp test by @hnyls2002 in #11294
- EAGLE cache fix for SWARadixCache by @ispobock in #11231
- Remove overlap thread by @hnyls2002 in #11210
- [router] add reasoning and tool parser argument in router by @slin1237 in #11290
- Remove sampling info events and overlap thread file by @hnyls2002 in #11300
- Introduce future indices by @hnyls2002 in #11301
- [sgl-kernel] Support float64 moe_sum_reduce cuda kernel by @yuan-luo in #11068
- [Docs] [Router] Update Observability and Common Issues Section by @xuwenyihust in #11302
- [router] add get server info and get model info in grpc server by @slin1237 in #11303
- [router][grpc] Refactor chat template content format detection by @CatherineSue in #11288
- [Doc] HiCache Design Documents by @ykwd in #11027
- [Doc]: Best Practice for HICache by @hzh0425 in #11001
- [router] fix grpc connection conversion and add optimization by @slin1237 in #11305
- [router][grpc] Fix sampling_params.stop_strs is None by @CatherineSue in #11306
- Update tool parser and related documentation by @JustinTong0323 in #11223
- [router][grpc] Fix error message format in grpc chat handler by @CatherineSue in #11307
- [quantization] Properly ignore quantization for layers excluded in quant_config by @BowenBao in #11205
- [router] support Openai router conversation API CRUD by @key4ng in #11297
- [router][grpc] Fix request_id extraction when n > 1 by @CatherineSue in #11311
- [router] cleanup worker health check to return early by @slin1237 in #11310
- [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults by @CatherineSue in #11304
- Clean match_prefix and prepare_for_extend for mem cache V2 by @cctry in #11200
- ci: unify the model launch method of nightly ci by @mickqian in #11230
- [Chore] Update xgrammar 0.1.24 -> 0.1.25 by @DarkSharpness in #10710
- update sampling_params documentation with defaults by @JustinTong0323 in #11315
- Optimize copy_kv_cache for spec decoding by @YAMY1234 in #11126
- Rename `ngram_utils` -> `ngram_info` by @hnyls2002 in #11316
- [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator by @CatherineSue in #11314
- [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints by @adarshxs in #9545
- [8/N] MoE Refactor: deprecate `EPMoE` by @ch-wan in #11211
- Skip weight loading in deepgemm compilation by @ch-wan in #11312
- [2/2] Support MHA prefill with FlashAttention 4. by @lifuhuang in #10937
- [Doc] Update mooncake nvlink transport doc for PD disaggregation by @ShangmingCai in #11321
- fix(decode): adjust ServerArgs import to explicit module path by @xiaguan in #11007
- Support LoRA in bench_serving oai interface by @lifuhuang in #11318
- benchmark: enhance configurable multimodal benchmarking in bench_serving by @AlienKevin in #9812
- [CI] improve disaggregation CI. by @hnyls2002 in #11264
- model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) by @netanel-haber in #10909
- [router] refactor generate to use new pipeline arch by @slin1237 in #11323
- [router] improve reasoning parser lock and reduce req cloning by @slin1237 in #11336
- [router][grpc] Cleanup debug logs in grpc_server and grpc_router by @CatherineSue in #11340
- [router] Fix all unused_qualifications by @CatherineSue in #11341
- [router] Support history management using conversation by @key4ng in #11339
- [router][grpc] Add dependencies in Cargo.toml to support chat template rendering by @CatherineSue in #11342
- fix: fix revision for sgl-flash-attn in sgl-kernel by @mickqian in #11327
- [Auto Sync] Update scheduler.py (20251009) by @zhyncs in #11350
- [Generative Score API] Multi-Item scoring with custom attention mask. by @sundar24295s in #10979
- [router][grpc] disable health check generation and increase timeout by @slin1237 in #11353
- [router] Refactor OpenAI router: split monolithic file and move location by @key4ng in #11359
- [router][lint] Add unused_qualifications to cargo lint warnings by @CatherineSue in #11366
- [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size by @trevor-m in #11309
- [router][grpc] Fix tool call streaming bugs: empty tool names, state pollution, and panics by @CatherineSue in #11373
- add code pp support for nixl by @shaharmor98 in #11375
- fix bench_serving mishandling of internal states by @shaharmor98 in #11376
- [router][grpc] Replace fake health check with correct ones by @CatherineSue in #11387
- [router] change grpc client from mutable to clone by @slin1237 in #11394
- chore: upgrade flashinfer 0.4.0 by @zhyncs in #11364
- [router] conversation item API: create, retrieve and delete by @key4ng in #11369
- chore: bump SGLang version to 0.5.3.post1 by @sglang-bot in #11324
- move more files under srt/utils by @merrymercy in #11285
- [grammar] Avoid server crash when grammar backend is None by @JustinTong0323 in #11401
- fix: fix gpu-proc affinity set incorrectly when pp_size > 1 by @acelyc111 in #11389
- [Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded by @glenliu21 in #11365
- [CI] Refactor PD disaggregation test suite by @ShangmingCai in #11363
- Replace pad with cat for better performance by @yuan-luo in #11388
- fix: reinstall torch in deps install by @zhyncs in #11414
- feat(hicache): Support passing prefix keys for l3 store. by @hzh0425 in #9045
- fix file and object naming scheme in HiCacheNixl to avoid data corruption by @ziruiliu in #10969
- Dedicated toml files for CPU/XPU by @ZailiWang in #10734
- Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in #11144
- chore: update pyproject by @zhyncs in #11420
- fix: fix video input for qwen3-vl by @mickqian in #11361
- perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in #11381
- [HiCache] feat: add multi tenant with prefix tag by @stmatengss in #9256
- [CI] Merge build-dev into workflow matrix by @csahithi in #11345
- Revert "perf: optimize qwen-vl with symm mem allreduce" by @ch-wan in #11436
- Revert "fix: fix video input for qwen3-vl" by @merrymercy in #11437
- Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" by @scottjlee in #11433
- [router] Fix ci nvcc not found error by @key4ng in #11411
- feat(mooncake): support GB suffix for global_segment_size by @xiaguan in #10745
- Separate allocation logic from scheduler by @cctry in #11313
- [router] disable rate limiter by default by @slin1237 in #11435
- [router] leverage RAII to actively cancel request during client disconnect by @slin1237 in #11399
- [router][grpc] Consolidate parser checks for chat completions by @CatherineSue in #11439
- Reorder PD disagg CI tests by @merrymercy in #11438
- fix: Change dsv32 hack temporary path to use system temp directory by @wxsms in #11445
- Fix batch invariant ops by @hebiao064 in #11368
- [BugFix] test_mla_fp8.py fails on Cublas 12.9 by @Liu-congo in #11360
- [DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton by @byjiang1996 in #11450
- Remove tilelang dependency in Dockerfile by @Fridge003 in #11455
- Enable native ModelOpt quantization support (2/3) by @Edwardf0t1 in #9991
- Reland [1/2] Optimizations and refactors about quant kernel by @fzyzcjy in #10312
- Super tiny delete unused openai router in sgl-router by @fzyzcjy in #11448
- Adjust logits metada init for target verify by @hnyls2002 in #11467
- [Documentation][Configuration] Server args and documentation of PD-Multiplexing. by @ykcombat in #11427
- Fix enable_v2 in int8 quant by @fzyzcjy in #11470
- [Fix] Fix split prefill with fa3. by @ykcombat in #11428
- fix stop when stream by @whybeyoung in #11462
- Add option to disable `any_whitespace` for `xgrammar` and `llguidance` backends. by @lulor in #8919
- [7/n] decouple quantization impl from vllm dependency - gguf kernel by @FlamingoPg in #11019
- fix Xeon CI by @ZailiWang in #11454
- [CI] Add nightly builds to dockerhub by @csahithi in #9804
- [Feature] support regex strings as a stopping condition by @glenliu21 in #10635
- Beta spec-overlap for EAGLE by @hnyls2002 in #11398
- Piecewise CUDA Graph Support & Torch Compile Backend by @Oasis-Git in #10062
- [Router]: Small Typo in a comment within tree.rs by @xuwenyihust in #11489
- chore: bump sgl-kernel version to 0.3.16 by @sglang-bot in #11476
- [smol] [perf] Qwen3-VL in place op. by @vincentzed in #11481
- [chore][1/N] Avoid using default mutable parameters by @kevin85421 in #11478
- [bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends by @MahmoudAshraf97 in #10172
- [ perf ] Replace json-> orjson in hot path by @vincentzed in #11221
- [chore][2/N] Avoid using default mutable parameters by @kevin85421 in #11479
- Fix the GPT function calling regex to allow dash in the name by @antoine-roux in #10577
- bailingMoE: Fix Key error of deepep_mode by @QiuMike in #11465
- Fix CI break by express-laned PRs. by @hnyls2002 in #11499
- Move args from `global_config` to `environ` by @hnyls2002 in #11332
- move fla env check position by @yizhang2077 in #11500
- Temporarily remove b200 tests by @merrymercy in #11501
- Fix port conflicts in CI by @merrymercy in #11497
- temporarily remove b200 tests by @merrymercy in #11502
- Fix unit tests by @merrymercy in #11503
- Bugfix: Fix Type consistency for KV indices in SWARadixCache by @hzh0425 in #11452
- doc: add doc for adding new models into nightly-ci by @mickqian in #11443
- [CI] fix lint by @hnyls2002 in #11509
- Deprecate `global_server_args_dict` by @hnyls2002 in #11331
- chore: remove flashinfer cleanup cache by @zhyncs in #11514
- fix: revert temporarily remove b200 tests by @zhyncs in #11515
- [Fix] Improve longbench prompt and other logics by @byjiang1996 in #11474
- Sync changes on io_struct.py and deterministic ops by @merrymercy in #11498
- [lint] Fix the lint issue by @ch-wan in #11516
- Revert "Deprecate `global_server_args_dict`" by @ch-wan in #11520
- Improve dp attention port assignment scheme by @jokerwyt in #5889
- [router] openai router: support grok model by @key4ng in #11511
- docs(router): add token-bucket rate limiting to the docs by @Jonahcb in #11485
- [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11432
- Update DeepSeek-R1-FP4 default config on blackwell by @Qiaolin-Yu in #11512
- [Fix]: add missing device attribute to ChunkCache by @leavelet in #11493
- [Feature] Support mamba radix cache v0 by @yizhang2077 in #11214
- ci: improve nightly-ci by @mickqian in #11385
- [CI monitor] Improve CI analyzer: fix job failure tracking and add CUDA-focused filtering by @BBuf in #11505
- [HICache]: Support 3FS-Store with page_first_direct layout by @hzh0425 in #11460
- Tiny fix test run estimated time by @ShangmingCai in #11544
- [Reland] perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in #11457
- Deprecate `global_server_args_dict` by @hnyls2002 in #11528
- [Fix] Add per_channel_quant parameter to MoE config functions by @mmangkad in #11201
- [router][ci] Add Nightly Release Workflow for SGLang Router by @slin1237 in #11527
- [router] add tokenizer path to be dir by @slin1237 in #11530
- Remove `tp_worker.worker` by @hnyls2002 in #11548
- fix: fix video input for qwen3-vl by @mickqian in #11442
- [NVIDIA] BUMP FA3 by @johnnynunez in #11444
- [Fix] Include grpc reflection runtime dependency by @ai-jz in #11419
- Adjust overlap event loop by @hnyls2002 in #11507
- Move deep gemm related arguments to `sglang.srt.environ` by @hnyls2002 in #11547
- [router][grpc] Further delegate non-stream processing to `processing.rs` by @CatherineSue in #11553
- [router] allow user to specify chat template path by @slin1237 in #11549
- Minor: improve sampler & remove unused fields from model_config.py by @merrymercy in #11531
- [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter by @Jonahcb in #11483
- Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in #11441
- Fix DeepSeek-v3.2 default config (ValueError: not enough values to unpack (expected 4, got 3)) by @trevor-m in #11557
- [CI] Add Basic Test for DeepSeek V3.2 by @Fridge003 in #11308
- [router][grpc] Add error handling to `generate_tool_constraints` by @CatherineSue in #11562
- [NVIDIA] update pyproject.toml to support cu130 option by @johnnynunez in #11521
- [CI Monitor] Ci monitor only deal with main branch in default by @BBuf in #11538
- Tiny cleanup fp4 gemm calls by @fzyzcjy in #11537
- [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck by @CatherineSue in #11564
- [router] Add BRANCH_TYPE=local support to Dockerfile.router for local builds by @YouNeedCryDear in #11571
- [sgl-kernel][2/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11534
- chore: bump sgl-kernel version to 0.3.16.post1 by @sglang-bot in #11573
- Fix accept rate in speculative decoding metrics by @Qiaolin-Yu in #11572
- Compilation Folder Reset by @Oasis-Git in #11539
- [FEATURE] Add Profile Trace Merger for Distributed Traces by @neelabhsinha in #11413
- [DSv32] Use torch.compile for _get_logits_head_gate by @trevor-m in #11565
- Make DeepEP combine recv do not overlap by @fzyzcjy in #11535
- bench_serving support PD Disaggregation by @BBuf in #11542
- Implement LRU eviction policy for LoRA adapters by @ConnorLi96 in #11041
- Revert "[NVIDIA] BUMP FA3 (#11444)" by @zhyncs in #11582
- chore: bump sgl-kernel version to 0.3.16.post2 by @sglang-bot in #11583
- [Auto Sync] Update model_config.py (20251014) by @merrymercy in #11580
- Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json by @Qiaolin-Yu in #11587
- [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint by @CatherineSue in #11588
- [router] update generate spec to align with sgl io struct by @slin1237 in #11591
- [router] change worker api to async instead of sync by @slin1237 in #11566
- Update news section in README.md by @merrymercy in #11598
- [router] delete useless table content comment in spec by @slin1237 in #11597
- [router] allow router launch server to use grpc mode by @slin1237 in #11600
- [Docs] [Router]: Update sg-router doc on circuit breaker by @xuwenyihust in #11449
- [router] when given both local tokenizer and chat template, log all by @slin1237 in #11601
- [AMD CI] Add image and weights caching. by @saienduri in #11593
- Update release-docker-dev.yml by @sglang-bot in #11603
- Optimize Triton Draft Backend by @hnyls2002 in #11556
- Refactor spec decoding metrics calculation into separate `TokenizerManager` utility function by @scottjlee in #11586
- make radix cache deterministic by @skyzh in #10721
- move eagle draft post process to cuda graph by @cicirori in #11434
- Reduce one step decode for draft model. by @hnyls2002 in #11561
- [router] add py binding and readme for openai router and history backend by @key4ng in #11453
- [router] cleanup app context and move to startup by @slin1237 in #11617
- [router] add chang and keyang to sgl router author by @slin1237 in #11620
- use non_blocking h2d in ForwardBatch.prepare_mlp_sync_batch. by @strgrb in #11605
- [router] update router readme to latest features by @slin1237 in #11619
- Fix log for chunked prefix cache by @Fridge003 in #11624
- [Auto Sync] Update scheduler.py, server_args.py (20251014) by @merrymercy in #11623
- [Auto Sync] Update collector.py (20251014) by @merrymercy in #11625
- [Minor] Update xgrammar dependency by @DarkSharpness in #11622
- Update install.md by @merrymercy in #11631
- fix: Update SGL_KERNEL_VERSION to 0.3.15 by @zhyncs in #11633
- [router][grpc] add warm up to grpc server by @slin1237 in #11627
- Refactor kv cache free by @cctry in #11351
- [router] update router doc to latest features by @slin1237 in #11639
- fix: upgrade transformers to 4.57.1 by @csahithi in #11628
- [router] add worker self discovery for metadata by @slin1237 in #11638
- [router] upgrade to 0.2.0 by @slin1237 in #11642
- [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP by @UNIDY2002 in #10423
- [1/N]Support DeepSeek-R1 w4a8 normal deepep by @ayrnb in #8247
- [Fix] Fix accuracy bug in CSGMV kernel caching key. by @lifuhuang in #11579
- feat: add add_chunked_prefix_cache_attention_backend by @zhyncs in #11636
- Super tiny improve FA3 import error message by @fzyzcjy in #11590
- [BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl by @ZhengWG in #11458
- [Doc] Update support matrix for attn and hybrid attn by @b8zhong in #11293
- Clean up some Qwen3-Next and deterministic code by @hebiao064 in #11585
- docs: update sglang installation guide by @zhyncs in #11659
- Tiny cleanup some eagle unused codes by @hnyls2002 in #11660
- Fix 1-step draft model forward by @ShangmingCai in #11653
- [tool call] Fix prev_tool_call_arr management in base_format_detector.py by @CatherineSue in #11367
- [router] Fix response api related spec by @key4ng in #11621
- Fix missing json imports in serving_responses.py by @CatherineSue in #11681
- [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11674
- [sgl-kernel] Optimize gguf test by @FlamingoPg in #11667
- [router][grpc] Simplify model_id determination by @CatherineSue in #11684
- [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding by @slin1237 in #11676
- chore: bump SGLang version to 0.5.3.post2 by @sglang-bot in #11680
- [CI][XPU]enable sglang CI on Intel XPU by @DiweiSun in #9493
- enable rmsnorm on XPU by @huaiyuzh in #10248
- Sync code and test CI; rename some env vars by @merrymercy in #11686
- docs: Add Contributor Covenant Code of Conduct by @zhyncs in #11689
- [Mamba] Increase default mamba_full_memory_ratio to 0.9 by @hanming-lu in #11679
- [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) by @ShangmingCai in #10912
- [sgl-kernel] support hadamard by @FlamingoPg in #11663
- Fix missing a2a backend init of GLM4.5 MoE Block by @ShangmingCai in #11692
- Split test_intel_amx_attention_backend.py to pass CI of timeout by @yanbing-j in #11370
- Set csgmv as default lora backend. by @lifuhuang in #11488
- [Bugfix] Fix Qwen3/DSV3/DSV3.2 model support by @iforgetmyname in #11510
- [CI] Add GLM4MoE model test by @ShangmingCai in #11706
- [router] fix get_models endpoint for openai router by @key4ng in #11687
- [ci]use H20 to run disaggregation test by @HanHan009527 in #11543
- chore: bump SGLang version to 0.5.3.post3 by @sglang-bot in #11693
- model: qwen3-omni (thinker-only) by @mickqian in #10911
- [Router] Refactor protocol definitions: split spec.rs into modular files by @key4ng in #11677
- [router] fix p and d worker filtering and bootstrap port handling by @slin1237 in #11729
- [router][grpc] add dissag info to warm up in grpc server by @slin1237 in #11727
- [router] Fix tool_choice normalization in ChatCompletionRequest and fix ut by @CatherineSue in #11731
- Revert "make radix cache deterministic" by @Fridge003 in #11728
- Reduce the image processing latency in VLM by @zhooooong in #11541
- [router] add spec.rs to enables tests under spec folder by @key4ng in #11734
- [router] Add rustfmt and set group imports by default by @CatherineSue in #11732
- Revert "[router] fix get_models endpoint for openai router (#11687)" by @key4ng in #11740
- [router][CI] Clean up deprecated fields in `pr-test-pd-router.yml` by @CatherineSue in #11739
- [CI] Fix broken event loop creation by @hnyls2002 in #11746
- [overlap-spec] Make plan stream an option by @hnyls2002 in #11724
- ci: reduce and refactor vlm ut and combine test files by @mickqian in #11062
- Abstraction for spec worker and code cleanup by @hnyls2002 in #11643
- add tuned fuse moe kernel for qwen3 235b fp8 on h200 by @pdasgup in #11730
- Revert "Set csgmv as default lora backend. (#11488)" by @zhyncs in #11735
- [router] Fix UTF-8 Boundary Panic in Stop Sequence Decoder by @slin1237 in #11766
- [router] fix grpc client time out to 1h by @slin1237 in #11768
- [doc] update router document by @key4ng in #11767
- [Feature] Reuse flashinfer workspace for PD-Multiplexing. by @ykcombat in #11540
- Turn on shm_allreduce and shm_allgather for fp16 by @chunyuan-w in #10725
- [Auto Sync] Update scheduler.py (20251017) by @zhyncs in #11738
- [router][grpc] Remove timeout for connections and remove `max_tokens` deprecation warning log by @CatherineSue in #11775
- Cleaning indexer for DeepSeek V3.2 by @Fridge003 in #11682
- [minor] sync code on python/sglang/test/test_deterministic.py and improve ci tests by @merrymercy in #11777
- [Auto Sync] Update common.py (20251017) by @merrymercy in #11782
- [Fix] Skip visual layers when applying LoRA to Qwen2VL modules by @anvdn in #11519
- [Lint] Add `python/sglang` to ruff F401 checks and remove unused imports in files by @CatherineSue in #11685
- Super tiny fix missing input throughput by @fzyzcjy in #11607
- Support shared experts overlap in cutlass moe by @fzyzcjy in #11611
- Support casting bf16 NextN moe to fp8 by @fzyzcjy in #11613
- Manually flip deepep_mode for cuda_graph by @zhuzilin in #11666
- Set CUDA_VISIBLE_DEVICES to achieve one GPU per process by @merrymercy in #9170
- Super tiny fix CI by @fzyzcjy in #11788
- Make single-batch overlap compatible with offloading by @fzyzcjy in #11614
- completely remove mixed mode deterministic test as prefix mode could cover it by @zminglei in #11783
- [Refactor] move `deep_gemm_wrapper` out of `quantization` by @ch-wan in #11784
- Enable lint on main by @fzyzcjy in #11794
- [router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client by @CatherineSue in #11798
- Try add back no-commit-to-branch by @fzyzcjy in #11799
- fix(glm45): disable reduce scatter by @jinmingyi1998 in #11665
- fix command line usage of profiling by @Qiaolin-Yu in #11793
- [RL] support weight update with DP attention by @zhuzilin in #11669
- [RL] use cpu group to prepare_mlp_sync_batch_raw when the server is offloaded by @zhuzilin in #10152
- set default attention backend for deterministic inference by @zminglei in #11801
- Eager Compiler for Torch Compile by @Oasis-Git in #11803
- Fix install instructions and pyproject.tomls by @merrymercy in #11781
- Bump torch_memory_saver to avoid installing pre-release versions by @fzyzcjy in #11797
- [HiCache] feat: add more eviction policy by @stmatengss in #11506
- [overlap-spec] support page size > 1 by @hnyls2002 in #11772
- support server arg override KV cache to bf16 to avoid slow cases by @b8zhong in #11749
- feat(example/fastapi): support --startup-timeout using Qwen3-Next-80B-A3B-Instruct as example by @Kindyaa in #11710
- ci: update `lmms-eval` to speed up multimodal CI by @b8zhong in #11000
- Use cutlass fp4 gemm by default by @Qiaolin-Yu in #11813
- Fix Dockerfile not installing correct version of DeepEP for arm build by @kyleliang-nv in #11773
- [router] Add Configurable L0 and L1 Tokenizer Caching by @slin1237 in #11688
- [2/2] [feature] support openai like classification api in router by @whybeyoung in #11670
- [1/2][feature] support openai like classification api by @whybeyoung in #11618
- make sure logit bias is applied during eagle spec decoding verification by @petricevich in #11555
- fix: do not wrap invalid grammar objects during constrained generation by @tazjin in #11328
- Improve `send_one` script by @hnyls2002 in #11817
- Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads by @YAMY1234 in #10788
- Update CODEOWNERS for layer quantization path by @merrymercy in #11818
- support tokenized batch request by @narutolhy in #11091
- Tiny add hints when users send requests to wrong place by @fzyzcjy in #11808
- Make single-batch overlap compatible with NextN by @fzyzcjy in #11804
- Support not officially supported high sgl-kernel version with low srt version by @fzyzcjy in #11786
- Avoid generation gets hanging when user specifies multiple event loops by @fzyzcjy in #5162
- Change bf16 to fp8 for some gemms in attention for DeepSeek ckpt v2 by @fzyzcjy in #11805
- Revert "Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads" by @hnyls2002 in #11827
- [overlap-spec] fix stop condition and trimming by @hnyls2002 in #11819
- [Spec Decoding] Support MTP for dsv3.2 by @Paiiiiiiiiiiiiii in #11652
- [CI] always print back trace in `retry()` by @hnyls2002 in #11834
- [Test] Add basic matched stop for beta eagle by @hnyls2002 in #11833
- Deterministic Mode: Add 1-stage triton kernel for prefill by @hebiao064 in #11147
- [logprobs] Enable local deterministic logprobs testing with strict threshold by @PrinsYin in #10994
- [CI] Add CI test for DeepSeek V3.2 MTP by @Fridge003 in #11835
- [NVIDIA] FA3/FA4 Fix by @johnnynunez in #11606
- [DeepseekV32] Add fast_topk_transform_ragged_fused kernel by @hlu1 in #11815
- Fix triton_kernels import error on some hardwares by @fzyzcjy in #11831
- Tiny bump DeepEP version in ARM blackwell by @fzyzcjy in #11810
- [BugFix] replace the input_to_float8 used in dsv2 by @Liu-congo in #11612
- [Doc] Update documents for FA4 by @Fridge003 in #11778
- fix(ci): Fix CI Monitor limit parameter and add CI Analysis to summary by @BBuf in #11832
- Fix version bump script to handle TOML files with outdated versions by @Kangyan-Zhou in #11787
- Improve Kernel Build Time by @Kangyan-Zhou in #11508
- check master server for mooncake store by @huangtingwei9988 in #10510
- chore: bump sgl-kernel version to 0.3.16.post3 by @sglang-bot in #11733
- Recapture cuda graph after model weight update to resolve IMA error by @harrisonlimh in #11780
- [Feature] Use current greenctx stream to communicate in PD-Multiplexing. by @ykcombat in #11594
- Support mrope triton kernel and add unit test by @yuan-luo in #11722
- [PD] Improve eagle acceptance rate by transferring draft model hidden states by @ZeldaHuang in #10801
- Tiny clean up for PD module and doc by @ShangmingCai in #11747
- Revert "[CI Monitor] Ci monitor only deal with main branch in default" by @BBuf in #11846
- [Model] Add Olmo 3 model support by @2015aroras in #11396
- Update amd gpu install docs. by @saienduri in #11849
- [AMD CI] Populate image cache in nightly docker release. by @saienduri in #11822
- fix(server_args): handle tokenizer init conflicts by @ishandhanani in #11776
- [Feature] New structural tag support by @DarkSharpness in #10691
- Tiny fix main lint by @hnyls2002 in #11862
- [9/N] MoE Refactor: cleanup dispatcher interfaces by @ch-wan in #11847
- Fix acc len and gen throughput metrics when enabling overlap-spec by @Qiaolin-Yu in #11823
- Replace function call with set literal by @penguin-wwy in #11867
- Support mixing cutedsl and deepgemm backend by @fzyzcjy in #11807
- [router] Worker Management Workflow Engine by @slin1237 in #11868
- [router] remove encoding header for oai router by @slin1237 in #11881
- [Auto Sync] Update scheduler.py, server_args.py (20251020) by @merrymercy in #11875
- [router][grpc] Remove `continue_final_message` in `ChatTemplateParams` and add `minijinja-contrib` by @CatherineSue in #11882
- fix(sgl-router): fix conflict port in test by @htiennv in #11826
- [router] clean up workflow logs to debug for implementation details logs by @slin1237 in #11886
- [code move] move pp into a separate mixin by @merrymercy in #11838
- [router][grpc] Fix warm-up random token ids for small models by @CatherineSue in #11887
- Revise MRotaryEmbedding's forward by @yuan-luo in #11859
- piecewise cuda graph support qwen3-moe by @BBuf in #11845
- Fix RotaryEmbedding for fp32 input by @zhangdonghao-zdh in #11843
- Init attention backend for Intel XPU by @airMeng in #10656
- Use trtllm_mla decode kernel for draft extend in speculative decoding by @Qiaolin-Yu in #11664
- [router] release router 0.2.1 by @slin1237 in #11885
- [AMD] Update wave-lang to 3.8.0 by @xintin in #11878
- init support for KTransformers Heterogeneous Computing by @Atream in #11487
- [FEATURE] Add OpenAI-Compatible LoRA Adapter Selection by @neelabhsinha in #11570
- [fix] fix ci uv install dependency by @HanHan009527 in #11895
- Support Thinking Budget (via custom_logit_processor for OpenAI API) [Fix #6572] by @whybeyoung in #11416
- Simplify multi-tokenizer by @zhengkezhou1 in #11295
- [CI] disable glm4.1v and fix the flashinfer installation by @ShangmingCai in #11902
- vlm: enforce pybase64 for image and str encode/decode by @b8zhong in #10700
- [smol] [perf] Inverse perm improvement by @vincentzed in #11482
- [quantization][MoE] fix the check for `tp_size`/`moe_ep_size`/`moe_intermediate_size`/`weight_block_size_n` by @kevin85421 in #11702
- [CI] Fix b200 flashinfer installation by @ShangmingCai in #11915
- Fix flush cache API for spec v2 by @hnyls2002 in #11918
- [NVIDIA] Add new SMs support for Spark & Thor by @Kh4L in #11287
- Update sgl-kernel and remove fast hadamard dependency by @Fridge003 in #11844
- Rename flashmla kernel options of nsa backend for better readability by @Fridge003 in #11876
- chore: upgrade flashinfer 0.4.1 by @zhyncs in #11933
- [BugFix][Qwen3-VL]: add metadata for video in qwen3-vl by @ZhengWG in #11377
- [Auto Sync] Update forward_batch_info.py (20251021) by @zhyncs in #11934
- Fix openai input_text type compatibility by @key4ng in #11935
- fix: resolve flashinfer 0.4.1 import by @zhyncs in #11940
- [router][grpc] Support `v1/responses` API by @CatherineSue in #11926
- [router] Add gRPC E2E test suite by @key4ng in #11790
- [router][grpc] Fix background tasks stored with wrong id by @CatherineSue in #11945
- [lint] improve ruff check by @hnyls2002 in #11922
- [sgl-kernel] support flashmla libtorch by @FlamingoPg in #11717
- [NVIDIA] upstream FA4 and fix cccl path by @johnnynunez in #11929
- Enable native ModelOpt quantization support (3/3) by @Edwardf0t1 in #10154
- Fix mooncake dispatcher by @UNIDY2002 in #11908
- [2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank by @HanHan009527 in #10606
- [model] Support POINTSV15Chat model by @josephydu in #9651
- Fix flaky hicache test with mooncake backend by @ShangmingCai in #11953
- [Fix] Remove unused import from triton_kernels_moe.py by @FlamingoPg in #11967
- [router] Support multiple worker URLs for OpenAI router by @key4ng in #11723
- [Documentation] add doc for deterministic inference by @zminglei in #11956
- [6/n]decouple quantization implementation from vLLM dependency by @Hongbosherlock in #10750
- [BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… by @yuho8818 in #11977
- Revert "Recapture cuda graph after model weight update to resolve IMA error " by @merrymercy in #11980
- [NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel by @jiahanc in #11563
- [router] create worker removal step and clean up worker manager by @slin1237 in #11921
- Implement BGE-M3 Sparse Embeddings in SGLang by @approximated-intelligence in #10869
- [Doc] Update deterministic inference flag in server_arguments.md by @Fridge003 in #11978
- [grpc] Support gRPC standard health check by @CatherineSue in #11955
- [AMD] Support a new flag to disable quant on parallelLinear layer if required by @yichiche in #11811
- [ROCm] Remove vLLM rope dependency & use AITER impl by @b8zhong in #11322
- [NVIDIA] Build CUDA 13 by @johnnynunez in #11299
- Bump grace blackwell DeepEP version by @fzyzcjy in #11990
- [CPU] misc updates by @ZailiWang in #11906
- fix(deepep): resolve benchmark failure on 4×IB-card setup by aligning tuning config with DeepEP commit bdd119f8 by @zheng1 in #11965
- [CPU] Optimize FP16 decode_attention_cpu by @blzheng in #10652
- Allow to disable batch decoding. by @LorrinWWW in #11944
- Fix incorrect KV indices creation when page_size=32 in TRTLLM MLA backend by @cicirori in #11985
- aiter update to v0.1.6.post1 by @HaiShaw in #12004
- Support overlap-spec-v2 with trtllm_mla attention backend by @Qiaolin-Yu in #11821
- Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4 by @netanel-haber in #11866
- [router] Add comprehensive E2E tests for Response API by @key4ng in #11988
- [Router] Consolidate ConnectionMode enum to core module by @YouNeedCryDear in #11937
- Move memory runtime checker to mixin class by @hnyls2002 in #12014
- Revert "Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4" by @hnyls2002 in #12015
- [Fix] memory leak by overlap + retract by @cctry in #11981
- [Feature] Support loading weights from ckpt engine worker by @stmatengss in #11755
- [router] change ci names and update log level in ci by @slin1237 in #12021
- Feature/nano v2 offline modelopt fp8 and nvfp4 by @netanel-haber in #12018
- [Auto Sync] Update test_deterministic_utils.py (20251023) by @merrymercy in #12022
- ci: fix night-ci with push retry mechanism by @mickqian in #11765
- [router][CI] Clean up imports and print statements in sgl-router/py_test by @CatherineSue in #12024
- Add AWQ quantization support for NPU. by @ErvinXie in #10158
- model: support deepseek-ocr by @mickqian in #11891
- Log iteration # for prefill and decode by @nvcastet in #9366
- Revert "[ROCm] Remove vLLM rope dependency & use AITER impl" by @b8zhong in #12028
- Fix mamba radix cache eviction logic in `alloc_req_slots` by @rogeryoungh in #11616
- Update GitHub action title for kernel build by @Kangyan-Zhou in #12029
- [router] Add builder pattern for RouterConfig with zero duplication by @slin1237 in #12030
- Fixed aarch64 flash-mla by @nvjullin in #12009
- chore: bump SGLang version to 0.5.4 by @sglang-bot in #12027
New Contributors
- @xuwenyihust made their first contribution in #11302
- @ziruiliu made their first contribution in #10969
- @scottjlee made their first contribution in #11144
- @Liu-congo made their first contribution in #11360
- @lulor made their first contribution in #8919
- @antoine-roux made their first contribution in #10577
- @QiuMike made their first contribution in #11465
- @ai-jz made their first contribution in #11419
- @neelabhsinha made their first contribution in #11413
- @UNIDY2002 made their first contribution in #10423
- @zhooooong made their first contribution in #11541
- @pdasgup made their first contribution in #11730
- @anvdn made their first contribution in #11519
- @Kindyaa made their first contribution in #11710
- @petricevich made their first contribution in #11555
- @tazjin made their first contribution in #11328
- @Paiiiiiiiiiiiiii made their first contribution in #11652
- @2015aroras made their first contribution in #11396
- @zhangdonghao-zdh made their first contribution in #11843
- @xintin made their first contribution in #11878
- @zhengkezhou1 made their first contribution in #11295
- @Kh4L made their first contribution in #11287
- @yuho8818 made their first contribution in #11977
- @jiahanc made their first contribution in #11563
- @approximated-intelligence made their first contribution in #10869
- @zheng1 made their first contribution in #11965
- @ErvinXie made their first contribution in #10158
- @rogeryoungh made their first contribution in #11616
- @nvjullin made their first contribution in #12009
Full Changelog: v0.5.3...v0.5.4