Release v0.5.4

What's Changed

  • [router] add ipv6 support across all components by @slin1237 in #11219
  • Remove env var warnings for release by @merrymercy in #11262
  • Enable native ModelOpt quantization support (1/3) by @Edwardf0t1 in #7149
  • [router][tool call] Clean up redundant detect_format and has_tool_markers by @CatherineSue in #11270
  • disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 by @gongwei-130 in #11274
  • docker: add manifest to versioned docker releases by @ishandhanani in #11268
  • [Bug] Fix incorrect assertion in FA4 and add UT. by @lifuhuang in #11182
  • [router][grpc] Refine streaming processes by @CatherineSue in #11277
  • Fix code sync scripts by @merrymercy in #11276
  • [Auto Sync] Update test_utils.py (20251006) by @merrymercy in #11280
  • Rename max_micro_batch_size -> pp_max_micro_batch_size by @merrymercy in #11279
  • Reverse the AMD CI test back to 1200s and split the 8-gpu deepseek job into two. by @sunxxuns in #11238
  • Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components by @ConnorLi96 in #11261
  • fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration by @JustinTong0323 in #11282
  • docs: update sgl-kernel README by @zhyncs in #11286
  • chore: bump sgl-kernel version to 0.3.15 by @sglang-bot in #11281
  • [router][grpc] Fix proto3 default value mismatches and cleanup unused fields by @CatherineSue in #11283
  • convert test_deterministic into unit tests by @skyzh in #11095
  • Feature/longbench v2 evaluation utils by @alhridoy in #10949
  • [ci] fix pp test by @hnyls2002 in #11294
  • EAGLE cache fix for SWARadixCache by @ispobock in #11231
  • Remove overlap thread by @hnyls2002 in #11210
  • [router] add reasoning and tool parser argument in router by @slin1237 in #11290
  • Remove sampling info events and overlap thread file by @hnyls2002 in #11300
  • Introduce future indices by @hnyls2002 in #11301
  • [sgl-kernel] Support float64 moe_sum_reduce cuda kernel by @yuan-luo in #11068
  • [Docs] [Router] Update Observability and Common Issues Section by @xuwenyihust in #11302
  • [router] add get server info and get model info in grpc server by @slin1237 in #11303
  • [router][grpc] Refactor chat template content format detection by @CatherineSue in #11288
  • [Doc] HiCache Design Documents by @ykwd in #11027
  • [Doc]: Best Practice for HICache by @hzh0425 in #11001
  • [router] fix grpc connection conversion and add optimization by @slin1237 in #11305
  • [router][grpc] Fix sampling_params.stop_strs is None by @CatherineSue in #11306
  • Update tool parser and related documentation by @JustinTong0323 in #11223
  • [router][grpc] Fix error message format in grpc chat handler by @CatherineSue in #11307
  • [quantization] Properly ignore quantization for layers excluded in quant_config by @BowenBao in #11205
  • [router] support Openai router conversation API CRUD by @key4ng in #11297
  • [router][grpc] Fix request_id extraction when n > 1 by @CatherineSue in #11311
  • [router] cleanup worker health check to return early by @slin1237 in #11310
  • [oai serving chat] Add argument --sampling-defaults and fix ChatCompletionRequest defaults by @CatherineSue in #11304
  • Clean match_prefix and prepare_for_extend for mem cache V2 by @cctry in #11200
  • ci: unify the model launch method of nightly ci by @mickqian in #11230
  • [Chore] Update xgrammar 0.1.24 -> 0.1.25 by @DarkSharpness in #10710
  • update sampling_params documentation with defaults by @JustinTong0323 in #11315
  • Optimize copy_kv_cache for spec decoding by @YAMY1234 in #11126
  • Rename ngram_utils -> ngram_info by @hnyls2002 in #11316
  • [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator by @CatherineSue in #11314
  • [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints by @adarshxs in #9545
  • [8/N] MoE Refactor: deprecate EPMoE by @ch-wan in #11211
  • Skip weight loading in deepgemm compilation by @ch-wan in #11312
  • [2/2] Support MHA prefill with FlashAttention 4. by @lifuhuang in #10937
  • [Doc] Update mooncake nvlink transport doc for PD disaggregation by @ShangmingCai in #11321
  • fix(decode): adjust ServerArgs import to explicit module path by @xiaguan in #11007
  • Support LoRA in bench_serving oai interface by @lifuhuang in #11318
  • benchmark: enhance configurable multimodal benchmarking in bench_serving by @AlienKevin in #9812
  • [CI] improve disaggregation CI. by @hnyls2002 in #11264
  • model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) by @netanel-haber in #10909
  • [router] refactor generate to use new pipeline arch by @slin1237 in #11323
  • [router] improve reasoning parser lock and reduce req cloning by @slin1237 in #11336
  • [router][grpc] Cleanup debug logs in grpc_server and grpc_router by @CatherineSue in #11340
  • [router] Fix all unused_qualifications by @CatherineSue in #11341
  • [router] Support history management using conversation by @key4ng in #11339
  • [router][grpc] Add dependencies in Cargo.toml to support chat template rendering by @CatherineSue in #11342
  • fix: fix revision for sgl-flash-attn in sgl-kernel by @mickqian in #11327
  • [Auto Sync] Update scheduler.py (20251009) by @zhyncs in #11350
  • [Generative Score API] Multi-Item scoring with custom attention mask. by @sundar24295s in #10979
  • [router][grpc] disable health check generation and increase timeout by @slin1237 in #11353
  • [router] Refactor OpenAI router: split monolithic file and move location by @key4ng in #11359
  • [router][lint] Add unused_qualifications to cargo lint warnings by @CatherineSue in #11366
  • [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size by @trevor-m in #11309
  • [router][grpc] Fix tool call streaming bugs: empty tool names, state pollution, and panics by @CatherineSue in #11373
  • add code pp support for nixl by @shaharmor98 in #11375
  • fix bench_serving mishandling of internal states by @shaharmor98 in #11376
  • [router][grpc] Replace fake health check with correct ones by @CatherineSue in #11387
  • [router] change grpc client from mutable to clone by @slin1237 in #11394
  • chore: upgrade flashinfer 0.4.0 by @zhyncs in #11364
  • [router] conversation item API: create, retrieve and delete by @key4ng in #11369
  • chore: bump SGLang version to 0.5.3.post1 by @sglang-bot in #11324
  • move more files under srt/utils by @merrymercy in #11285
  • [grammar] Avoid server crash when grammar backend is None by @JustinTong0323 in #11401
  • fix: fix gpu-proc affinity set incorrectly when pp_size > 1 by @acelyc111 in #11389
  • [Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded by @glenliu21 in #11365
  • [CI] Refactor PD disaggregation test suite by @ShangmingCai in #11363
  • Replace pad with cat for better performance by @yuan-luo in #11388
  • fix: reinstall torch in deps install by @zhyncs in #11414
  • feat(hicache): Support passing prefix keys for l3 store. by @hzh0425 in #9045
  • fix file and object naming scheme in HiCacheNixl to avoid data corruption by @ziruiliu in #10969
  • Dedicated toml files for CPU/XPU by @ZailiWang in #10734
  • Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in #11144
  • chore: update pyproject by @zhyncs in #11420
  • fix: fix video input for qwen3-vl by @mickqian in #11361
  • perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in #11381
  • [HiCache] feat: add multi tenant with prefix tag by @stmatengss in #9256
  • [CI] Merge build-dev into workflow matrix by @csahithi in #11345
  • Revert "perf: optimize qwen-vl with symm mem allreduce" by @ch-wan in #11436
  • Revert "fix: fix video input for qwen3-vl" by @merrymercy in #11437
  • Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" by @scottjlee in #11433
  • [router] Fix ci nvcc not found error by @key4ng in #11411
  • feat(mooncake): support GB suffix for global_segment_size by @xiaguan in #10745
  • Separate allocation logic from scheduler by @cctry in #11313
  • [router] disable rate limiter by default by @slin1237 in #11435
  • [router] leverage RAII to actively cancel request during client disconnect by @slin1237 in #11399
  • [router][grpc] Consolidate parser checks for chat completions by @CatherineSue in #11439
  • Reorder PD disagg CI tests by @merrymercy in #11438
  • fix: Change dsv32 hack temporary path to use system temp directory by @wxsms in #11445
  • Fix batch invariant ops by @hebiao064 in #11368
  • [BugFix] test_mla_fp8.py fails on Cublas 12.9 by @Liu-congo in #11360
  • [DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton by @byjiang1996 in #11450
  • Remove tilelang dependency in Dockerfile by @Fridge003 in #11455
  • Enable native ModelOpt quantization support (2/3) by @Edwardf0t1 in #9991
  • Reland [1/2] Optimizations and refactors about quant kernel by @fzyzcjy in #10312
  • Super tiny delete unused openai router in sgl-router by @fzyzcjy in #11448
  • Adjust logits metadata init for target verify by @hnyls2002 in #11467
  • [Documentation][Configuration] Server args and documentation of PD-Multiplexing. by @ykcombat in #11427
  • Fix enable_v2 in int8 quant by @fzyzcjy in #11470
  • [Fix] Fix split prefill with fa3. by @ykcombat in #11428
  • fix stop when stream by @whybeyoung in #11462
  • Add option to disable any_whitespace for xgrammar and llguidance backends. by @lulor in #8919
  • [7/n] decouple quantization impl from vllm dependency - gguf kernel by @FlamingoPg in #11019
  • fix Xeon CI by @ZailiWang in #11454
  • [CI] Add nightly builds to dockerhub by @csahithi in #9804
  • [Feature] support regex strings as a stopping condition by @glenliu21 in #10635
  • Beta spec-overlap for EAGLE by @hnyls2002 in #11398
  • Piecewise CUDA Graph Support & Torch Compile Backend by @Oasis-Git in #10062
  • [Router]: Small Typo in a comment within tree.rs by @xuwenyihust in #11489
  • chore: bump sgl-kernel version to 0.3.16 by @sglang-bot in #11476
  • [smol] [perf] Qwen3-VL in place op. by @vincentzed in #11481
  • [chore][1/N] Avoid using default mutable parameters by @kevin85421 in #11478
  • [bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends by @MahmoudAshraf97 in #10172
  • [ perf ] Replace json-> orjson in hot path by @vincentzed in #11221
  • [chore][2/N] Avoid using default mutable parameters by @kevin85421 in #11479
  • Fix the GPT function calling regex to allow dash in the name by @antoine-roux in #10577
  • bailingMoE: Fix Key error of deepep_mode by @QiuMike in #11465
  • Fix CI break by express-laned PRs. by @hnyls2002 in #11499
  • Move args from global_config to environ by @hnyls2002 in #11332
  • move fla env check position by @yizhang2077 in #11500
  • Temporarily remove b200 tests by @merrymercy in #11501
  • Fix port conflicts in CI by @merrymercy in #11497
  • temporarily remove b200 tests by @merrymercy in #11502
  • Fix unit tests by @merrymercy in #11503
  • Bugfix: Fix Type consistency for KV indices in SWARadixCache by @hzh0425 in #11452
  • doc: add doc for adding new models into nightly-ci by @mickqian in #11443
  • [CI] fix lint by @hnyls2002 in #11509
  • Deprecate global_server_args_dict by @hnyls2002 in #11331
  • chore: remove flashinfer cleanup cache by @zhyncs in #11514
  • fix: revert temporarily remove b200 tests by @zhyncs in #11515
  • [Fix] Improve longbench prompt and other logics by @byjiang1996 in #11474
  • Sync changes on io_struct.py and deterministic ops by @merrymercy in #11498
  • [lint] Fix the lint issue by @ch-wan in #11516
  • Revert "Deprecate global_server_args_dict" by @ch-wan in #11520
  • Improve dp attention port assignment scheme by @jokerwyt in #5889
  • [router] openai router: support grok model by @key4ng in #11511
  • docs(router): add token-bucket rate limiting to the docs by @Jonahcb in #11485
  • [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11432
  • Update DeepSeek-R1-FP4 default config on blackwell by @Qiaolin-Yu in #11512
  • [Fix]: add missing device attribute to ChunkCache by @leavelet in #11493
  • [Feature] Support mamba radix cache v0 by @yizhang2077 in #11214
  • ci: improve nightly-ci by @mickqian in #11385
  • [CI monitor] Improve CI analyzer: fix job failure tracking and add CUDA-focused filtering by @BBuf in #11505
  • [HICache]: Support 3FS-Store with page_first_direct layout by @hzh0425 in #11460
  • Tiny fix test run estimated time by @ShangmingCai in #11544
  • [Reland] perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in #11457
  • Deprecate global_server_args_dict by @hnyls2002 in #11528
  • [Fix] Add per_channel_quant parameter to MoE config functions by @mmangkad in #11201
  • [router][ci] Add Nightly Release Workflow for SGLang Router by @slin1237 in #11527
  • [router] add tokenizer path to be dir by @slin1237 in #11530
  • Remove tp_worker.worker by @hnyls2002 in #11548
  • fix: fix video input for qwen3-vl by @mickqian in #11442
  • [NVIDIA] BUMP FA3 by @johnnynunez in #11444
  • [Fix] Include grpc reflection runtime dependency by @ai-jz in #11419
  • Adjust overlap event loop by @hnyls2002 in #11507
  • Move deep gemm related arguments to sglang.srt.environ by @hnyls2002 in #11547
  • [router][grpc] Further delegate non-stream processing to processing.rs by @CatherineSue in #11553
  • [router] allow user to specify chat template path by @slin1237 in #11549
  • Minor: improve sampler & remove unused fields from model_config.py by @merrymercy in #11531
  • [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter by @Jonahcb in #11483
  • Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in #11441
  • Fix DeepSeek-v3.2 default config (ValueError: not enough values to unpack (expected 4, got 3)) by @trevor-m in #11557
  • [CI] Add Basic Test for DeepSeek V3.2 by @Fridge003 in #11308
  • [router][grpc] Add error handling to generate_tool_constraints by @CatherineSue in #11562
  • [NVIDIA] update pyproject.toml to support cu130 option by @johnnynunez in #11521
  • [CI Monitor] Ci monitor only deal with main branch in default by @BBuf in #11538
  • Tiny cleanup fp4 gemm calls by @fzyzcjy in #11537
  • [router][grpc] Add serve_grpc to launch_server and log id for HealthCheck by @CatherineSue in #11564
  • [router] Add BRANCH_TYPE=local support to Dockerfile.router for local builds by @YouNeedCryDear in #11571
  • [sgl-kernel][2/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11534
  • chore: bump sgl-kernel version to 0.3.16.post1 by @sglang-bot in #11573
  • Fix accept rate in speculative decoding metrics by @Qiaolin-Yu in #11572
  • Compilation Folder Reset by @Oasis-Git in #11539
  • [FEATURE] Add Profile Trace Merger for Distributed Traces by @neelabhsinha in #11413
  • [DSv32] Use torch.compile for _get_logits_head_gate by @trevor-m in #11565
  • Make DeepEP combine recv do not overlap by @fzyzcjy in #11535
  • bench_serving support PD Disaggregation by @BBuf in #11542
  • Implement LRU eviction policy for LoRA adapters by @ConnorLi96 in #11041
  • Revert "[NVIDIA] BUMP FA3 (#11444)" by @zhyncs in #11582
  • chore: bump sgl-kernel version to 0.3.16.post2 by @sglang-bot in #11583
  • [Auto Sync] Update model_config.py (20251014) by @merrymercy in #11580
  • Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json by @Qiaolin-Yu in #11587
  • [router][protocols] Add Axum validate extractor and use it for /v1/chat/completions endpoint by @CatherineSue in #11588
  • [router] update generate spec to align with sgl io struct by @slin1237 in #11591
  • [router] change worker api to async instead of sync by @slin1237 in #11566
  • Update news section in README.md by @merrymercy in #11598
  • [router] delete useless table content comment in spec by @slin1237 in #11597
  • [router] allow router launch server to use grpc mode by @slin1237 in #11600
  • [Docs] [Router]: Update sgl-router doc on circuit breaker by @xuwenyihust in #11449
  • [router] when given both local tokenizer and chat template, log all by @slin1237 in #11601
  • [AMD CI] Add image and weights caching. by @saienduri in #11593
  • Update release-docker-dev.yml by @sglang-bot in #11603
  • Optimize Triton Draft Backend by @hnyls2002 in #11556
  • Refactor spec decoding metrics calculation into separate TokenizerManager utility function by @scottjlee in #11586
  • make radix cache deterministic by @skyzh in #10721
  • move eagle draft post process to cuda graph by @cicirori in #11434
  • Reduce one step decode for draft model. by @hnyls2002 in #11561
  • [router] add py binding and readme for openai router and history backend by @key4ng in #11453
  • [router] cleanup app context and move to startup by @slin1237 in #11617
  • [router] add chang and keyang to sgl router author by @slin1237 in #11620
  • use non_blocking h2d in ForwardBatch.prepare_mlp_sync_batch. by @strgrb in #11605
  • [router] update router readme to latest features by @slin1237 in #11619
  • Fix log for chunked prefix cache by @Fridge003 in #11624
  • [Auto Sync] Update scheduler.py, server_args.py (20251014) by @merrymercy in #11623
  • [Auto Sync] Update collector.py (20251014) by @merrymercy in #11625
  • [Minor] Update xgrammar dependency by @DarkSharpness in #11622
  • Update install.md by @merrymercy in #11631
  • fix: Update SGL_KERNEL_VERSION to 0.3.15 by @zhyncs in #11633
  • [router][grpc] add warm up to grpc server by @slin1237 in #11627
  • Refactor kv cache free by @cctry in #11351
  • [router] update router doc to latest features by @slin1237 in #11639
  • fix: upgrade transformers to 4.57.1 by @csahithi in #11628
  • [router] add worker self discovery for metadata by @slin1237 in #11638
  • [router] upgrade to 0.2.0 by @slin1237 in #11642
  • [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP by @UNIDY2002 in #10423
  • [1/N]Support DeepSeek-R1 w4a8 normal deepep by @ayrnb in #8247
  • [Fix] Fix accuracy bug in CSGMV kernel caching key. by @lifuhuang in #11579
  • feat: add add_chunked_prefix_cache_attention_backend by @zhyncs in #11636
  • Super tiny improve FA3 import error message by @fzyzcjy in #11590
  • [BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl by @ZhengWG in #11458
  • [Doc] Update support matrix for attn and hybrid attn by @b8zhong in #11293
  • Clean up some Qwen3-Next and deterministic code by @hebiao064 in #11585
  • docs: update sglang installation guide by @zhyncs in #11659
  • Tiny cleanup some eagle unused codes by @hnyls2002 in #11660
  • Fix 1-step draft model forward by @ShangmingCai in #11653
  • [tool call] Fix prev_tool_call_arr management in base_format_detector.py by @CatherineSue in #11367
  • [router] Fix response api related spec by @key4ng in #11621
  • Fix missing json imports in serving_responses.py by @CatherineSue in #11681
  • [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11674
  • [sgl-kernel] Optimize gguf test by @FlamingoPg in #11667
  • [router][grpc] Simplify model_id determination by @CatherineSue in #11684
  • [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding by @slin1237 in #11676
  • chore: bump SGLang version to 0.5.3.post2 by @sglang-bot in #11680
  • [CI][XPU]enable sglang CI on Intel XPU by @DiweiSun in #9493
  • enable rmsnorm on XPU by @huaiyuzh in #10248
  • Sync code and test CI; rename some env vars by @merrymercy in #11686
  • docs: Add Contributor Covenant Code of Conduct by @zhyncs in #11689
  • [Mamba] Increase default mamba_full_memory_ratio to 0.9 by @hanming-lu in #11679
  • [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) by @ShangmingCai in #10912
  • [sgl-kernel] support hadamard by @FlamingoPg in #11663
  • Fix missing a2a backend init of GLM4.5 MoE Block by @ShangmingCai in #11692
  • Split test_intel_amx_attention_backend.py to pass CI of timeout by @yanbing-j in #11370
  • Set csgmv as default lora backend. by @lifuhuang in #11488
  • [Bugfix] Fix Qwen3/DSV3/DSV3.2 model support by @iforgetmyname in #11510
  • [CI] Add GLM4MoE model test by @ShangmingCai in #11706
  • [router] fix get_models endpoint for openai router by @key4ng in #11687
  • [ci]use H20 to run disaggregation test by @HanHan009527 in #11543
  • chore: bump SGLang version to 0.5.3.post3 by @sglang-bot in #11693
  • model: qwen3-omni (thinker-only) by @mickqian in #10911
  • [Router] Refactor protocol definitions: split spec.rs into modular files by @key4ng in #11677
  • [router] fix p and d worker filtering and bootstrap port handling by @slin1237 in #11729
  • [router][grpc] add dissag info to warm up in grpc server by @slin1237 in #11727
  • [router] Fix tool_choice normalization in ChatCompletionRequest and fix ut by @CatherineSue in #11731
  • Revert "make radix cache deterministic" by @Fridge003 in #11728
  • Reduce the image processing latency in VLM by @zhooooong in #11541
  • [router] add spec.rs to enables tests under spec folder by @key4ng in #11734
  • [router] Add rustfmt and set group imports by default by @CatherineSue in #11732
  • Revert "[router] fix get_models endpoint for openai router (#11687)" by @key4ng in #11740
  • [router][CI] Clean up deprecated fields in pr-test-pd-router.yml by @CatherineSue in #11739
  • [CI] Fix broken event loop creation by @hnyls2002 in #11746
  • [overlap-spec] Make plan stream an option by @hnyls2002 in #11724
  • ci: reduce and refactor vlm ut and combine test files by @mickqian in #11062
  • Abstraction for spec worker and code cleanup by @hnyls2002 in #11643
  • add tuned fuse moe kernel for qwen3 235b fp8 on h200 by @pdasgup in #11730
  • Revert "Set csgmv as default lora backend. (#11488)" by @zhyncs in #11735
  • [router] Fix UTF-8 Boundary Panic in Stop Sequence Decoder by @slin1237 in #11766
  • [router] fix grpc client time out to 1h by @slin1237 in #11768
  • [doc] update router document by @key4ng in #11767
  • [Feature] Reuse flashinfer workspace for PD-Multiplexing. by @ykcombat in #11540
  • Turn on shm_allreduce and shm_allgather for fp16 by @chunyuan-w in #10725
  • [Auto Sync] Update scheduler.py (20251017) by @zhyncs in #11738
  • [router][grpc] Remove timeout for connections and remove max_tokens deprecation warning log by @CatherineSue in #11775
  • Cleaning indexer for DeepSeek V3.2 by @Fridge003 in #11682
  • [minor] sync code on python/sglang/test/test_deterministic.py and improve ci tests by @merrymercy in #11777
  • [Auto Sync] Update common.py (20251017) by @merrymercy in #11782
  • [Fix] Skip visual layers when applying LoRA to Qwen2VL modules by @anvdn in #11519
  • [Lint] Add python/sglang to ruff F401 checks and remove unused imports in files by @CatherineSue in #11685
  • Super tiny fix missing input throughput by @fzyzcjy in #11607
  • Support shared experts overlap in cutlass moe by @fzyzcjy in #11611
  • Support casting bf16 NextN moe to fp8 by @fzyzcjy in #11613
  • Manually flip deepep_mode for cuda_graph by @zhuzilin in #11666
  • Set CUDA_VISIBLE_DEVICES to achieve one GPU per process by @merrymercy in #9170
  • Super tiny fix CI by @fzyzcjy in #11788
  • Make single-batch overlap compatible with offloading by @fzyzcjy in #11614
  • completely remove mixed mode deterministic test as prefix mode could cover it by @zminglei in #11783
  • [Refactor] move deep_gemm_wrapper out of quantization by @ch-wan in #11784
  • Enable lint on main by @fzyzcjy in #11794
  • [router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client by @CatherineSue in #11798
  • Try add back no-commit-to-branch by @fzyzcjy in #11799
  • fix(glm45): disable reduce scatter by @jinmingyi1998 in #11665
  • fix command line usage of profiling by @Qiaolin-Yu in #11793
  • [RL] support weight update with DP attention by @zhuzilin in #11669
  • [RL] use cpu group to prepare_mlp_sync_batch_raw when the server is offloaded by @zhuzilin in #10152
  • set default attention backend for deterministic inference by @zminglei in #11801
  • Eager Compiler for Torch Compile by @Oasis-Git in #11803
  • Fix install instructions and pyproject.tomls by @merrymercy in #11781
  • Bump torch_memory_saver to avoid installing pre-release versions by @fzyzcjy in #11797
  • [HiCache] feat: add more eviction policy by @stmatengss in #11506
  • [overlap-spec] support page size > 1 by @hnyls2002 in #11772
  • support server arg override KV cache to bf16 to avoid slow cases by @b8zhong in #11749
  • feat(example/fastapi): support --startup-timeout using Qwen3-Next-80B-A3B-Instruct as example by @Kindyaa in #11710
  • ci: update lmms-eval to speed up multimodal CI by @b8zhong in #11000
  • Use cutlass fp4 gemm by default by @Qiaolin-Yu in #11813
  • Fix Dockerfile not installing correct version of DeepEP for arm build by @kyleliang-nv in #11773
  • [router] Add Configurable L0 and L1 Tokenizer Caching by @slin1237 in #11688
  • [2/2] [feature] support openai like classification api in router by @whybeyoung in #11670
  • [1/2][feature] support openai like classification api by @whybeyoung in #11618
  • make sure logit bias is applied during eagle spec decoding verification by @petricevich in #11555
  • fix: do not wrap invalid grammar objects during constrained generation by @tazjin in #11328
  • Improve send_one script by @hnyls2002 in #11817
  • Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads by @YAMY1234 in #10788
  • Update CODEOWNERS for layer quantization path by @merrymercy in #11818
  • support tokenized batch request by @narutolhy in #11091
  • Tiny add hints when users send requests to wrong place by @fzyzcjy in #11808
  • Make single-batch overlap compatible with NextN by @fzyzcjy in #11804
  • Support not officially supported high sgl-kernel version with low srt version by @fzyzcjy in #11786
  • Avoid generation gets hanging when user specifies multiple event loops by @fzyzcjy in #5162
  • Change bf16 to fp8 for some gemms in attention for DeepSeek ckpt v2 by @fzyzcjy in #11805
  • Revert "Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads" by @hnyls2002 in #11827
  • [overlap-spec] fix stop condition and trimming by @hnyls2002 in #11819
  • [Spec Decoding] Support MTP for dsv3.2 by @Paiiiiiiiiiiiiii in #11652
  • [CI] always print back trace in retry() by @hnyls2002 in #11834
  • [Test] Add basic matched stop for beta eagle by @hnyls2002 in #11833
  • Deterministic Mode: Add 1-stage triton kernel for prefill by @hebiao064 in #11147
  • [logprobs] Enable local deterministic logprobs testing with strict threshold by @PrinsYin in #10994
  • [CI] Add CI test for DeepSeek V3.2 MTP by @Fridge003 in #11835
  • [NVIDIA] FA3/FA4 Fix by @johnnynunez in #11606
  • [DeepseekV32] Add fast_topk_transform_ragged_fused kernel by @hlu1 in #11815
  • Fix triton_kernels import error on some hardwares by @fzyzcjy in #11831
  • Tiny bump DeepEP version in ARM blackwell by @fzyzcjy in #11810
  • [BugFix] replace the input_to_float8 used in dsv2 by @Liu-congo in #11612
  • [Doc] Update documents for FA4 by @Fridge003 in #11778
  • fix(ci): Fix CI Monitor limit parameter and add CI Analysis to summary by @BBuf in #11832
  • Fix version bump script to handle TOML files with outdated versions by @Kangyan-Zhou in #11787
  • Improve Kernel Build Time by @Kangyan-Zhou in #11508
  • check master server for mooncake store by @huangtingwei9988 in #10510
  • chore: bump sgl-kernel version to 0.3.16.post3 by @sglang-bot in #11733
  • Recapture cuda graph after model weight update to resolve IMA error by @harrisonlimh in #11780
  • [Feature] Use current greenctx stream to communicate in PD-Multiplexing. by @ykcombat in #11594
  • Support mrope triton kernel and add unit test by @yuan-luo in #11722
  • [PD] Improve eagle acceptance rate by transferring draft model hidden states by @ZeldaHuang in #10801
  • Tiny clean up for PD module and doc by @ShangmingCai in #11747
  • Revert "[CI Monitor] Ci monitor only deal with main branch in default" by @BBuf in #11846
  • [Model] Add Olmo 3 model support by @2015aroras in #11396
  • Update amd gpu install docs. by @saienduri in #11849
  • [AMD CI] Populate image cache in nightly docker release. by @saienduri in #11822
  • fix(server_args): handle tokenizer init conflicts by @ishandhanani in #11776
  • [Feature] New structural tag support by @DarkSharpness in #10691
  • Tiny fix main lint by @hnyls2002 in #11862
  • [9/N] MoE Refactor: cleanup dispatcher interfaces by @ch-wan in #11847
  • Fix acc len and gen throughput metrics when enabling overlap-spec by @Qiaolin-Yu in #11823
  • Replace function call with set literal by @penguin-wwy in #11867
  • Support mixing cutedsl and deepgemm backend by @fzyzcjy in #11807
  • [router] Worker Management Workflow Engine by @slin1237 in #11868
  • [router] remove encoding header for oai router by @slin1237 in #11881
  • [Auto Sync] Update scheduler.py, server_args.py (20251020) by @merrymercy in #11875
  • [router][grpc] Remove continue_final_message in ChatTemplateParams and add minijinja-contrib by @CatherineSue in #11882
  • fix(sgl-router): fix conflicting port in test by @htiennv in #11826
  • [router] clean up workflow logs to debug for implementation details logs by @slin1237 in #11886
  • [code move] move pp into a separate mixin by @merrymercy in #11838
  • [router][grpc] Fix warm-up random token ids for small models by @CatherineSue in #11887
  • Revise MRotaryEmbedding's forward by @yuan-luo in #11859
  • piecewise cuda graph support qwen3-moe by @BBuf in #11845
  • Fix RotaryEmbedding for fp32 input by @zhangdonghao-zdh in #11843
  • Init attention backend for Intel XPU by @airMeng in #10656
  • Use trtllm_mla decode kernel for draft extend in speculative decoding by @Qiaolin-Yu in #11664
  • [router] release router 0.2.1 by @slin1237 in #11885
  • [AMD] Update wave-lang to 3.8.0 by @xintin in #11878
  • init support for KTransformers Heterogeneous Computing by @Atream in #11487
  • [FEATURE] Add OpenAI-Compatible LoRA Adapter Selection by @neelabhsinha in #11570
  • [fix] fix ci uv install dependency by @HanHan009527 in #11895
  • Support Thinking Budget (via custom_logit_processor for OpenAI API) [Fix #6572] by @whybeyoung in #11416
  • Simplify multi-tokenizer by @zhengkezhou1 in #11295
  • [CI] disable glm4.1v and fix the flashinfer installation by @ShangmingCai in #11902
  • vlm: enforce pybase64 for image and str encode/decode by @b8zhong in #10700
  • [smol] [perf] Inverse perm improvement by @vincentzed in #11482
  • [quantization][MoE] fix the check for tp_size / moe_ep_size / moe_intermediate_size / weight_block_size_n by @kevin85421 in #11702
  • [CI] Fix b200 flashinfer installation by @ShangmingCai in #11915
  • Fix flush cache API for spec v2 by @hnyls2002 in #11918
  • [NVIDIA] Add new SMs support for Spark & Thor by @Kh4L in #11287
  • Update sgl-kernel and remove fast hadamard dependency by @Fridge003 in #11844
  • Rename flashmla kernel options of nsa backend for better readability by @Fridge003 in #11876
  • chore: upgrade flashinfer 0.4.1 by @zhyncs in #11933
  • [BugFix][Qwen3-VL]: add metadata for video in qwen3-vl by @ZhengWG in #11377
  • [Auto Sync] Update forward_batch_info.py (20251021) by @zhyncs in #11934
  • Fix openai input_text type compatibility by @key4ng in #11935
  • fix: resolve flashinfer 0.4.1 import by @zhyncs in #11940
  • [router][grpc] Support v1/responses API by @CatherineSue in #11926
  • [router] Add gRPC E2E test suite by @key4ng in #11790
  • [router][grpc] Fix background tasks stored with wrong id by @CatherineSue in #11945
  • [lint] improve ruff check by @hnyls2002 in #11922
  • [sgl-kernel] support flashmla libtorch by @FlamingoPg in #11717
  • [NVIDIA] upstream FA4 and fix cccl path by @johnnynunez in #11929
  • Enable native ModelOpt quantization support (3/3) by @Edwardf0t1 in #10154
  • Fix mooncake dispatcher by @UNIDY2002 in #11908
  • [2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank by @HanHan009527 in #10606
  • [model] Support POINTSV15Chat model by @josephydu in #9651
  • Fix flaky hicache test with mooncake backend by @ShangmingCai in #11953
  • [Fix] Remove unused import from triton_kernels_moe.py by @FlamingoPg in #11967
  • [router] Support multiple worker URLs for OpenAI router by @key4ng in #11723
  • [Documentation] add doc for deterministic inference by @zminglei in #11956
  • [6/n]decouple quantization implementation from vLLM dependency by @Hongbosherlock in #10750
  • [BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… by @yuho8818 in #11977
  • Revert "Recapture cuda graph after model weight update to resolve IMA error " by @merrymercy in #11980
  • [NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel by @jiahanc in #11563
  • [router] create worker removal step and clean up worker manager by @slin1237 in #11921
  • Implement BGE-M3 Sparse Embeddings in SGLang by @approximated-intelligence in #10869
  • [Doc] Update deterministic inference flag in server_arguments.md by @Fridge003 in #11978
  • [grpc] Support gRPC standard health check by @CatherineSue in #11955
  • [AMD] Support a new flag to disable quant on parallelLinear layer if required by @yichiche in #11811
  • [ROCm] Remove vLLM rope dependency & use AITER impl by @b8zhong in #11322
  • [NVIDIA] Build CUDA 13 by @johnnynunez in #11299
  • Bump grace blackwell DeepEP version by @fzyzcjy in #11990
  • [CPU] misc updates by @ZailiWang in #11906
  • fix(deepep): resolve benchmark failure on 4×IB-card setup by aligning tuning config with DeepEP commit bdd119f8 by @zheng1 in #11965
  • [CPU] Optimize FP16 decode_attention_cpu by @blzheng in #10652
  • Allow to disable batch decoding. by @LorrinWWW in #11944
  • Fix incorrect KV indices creation when page_size=32 in TRTLLM MLA backend by @cicirori in #11985
  • aiter update to v0.1.6.post1 by @HaiShaw in #12004
  • Support overlap-spec-v2 with trtllm_mla attention backend by @Qiaolin-Yu in #11821
  • Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4 by @netanel-haber in #11866
  • [router] Add comprehensive E2E tests for Response API by @key4ng in #11988
  • [Router] Consolidate ConnectionMode enum to core module by @YouNeedCryDear in #11937
  • Move memory runtime checker to mixin class by @hnyls2002 in #12014
  • Revert "Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4" by @hnyls2002 in #12015
  • [Fix] memory leak by overlap + retract by @cctry in #11981
  • [Feature] Support loading weights from ckpt engine worker by @stmatengss in #11755
  • [router] change ci names and update log level in ci by @slin1237 in #12021
  • Feature/nano v2 offline modelopt fp8 and nvfp4 by @netanel-haber in #12018
  • [Auto Sync] Update test_deterministic_utils.py (20251023) by @merrymercy in #12022
  • ci: fix night-ci with push retry mechanism by @mickqian in #11765
  • [router][CI] Clean up imports and print statements in sgl-router/py_test by @CatherineSue in #12024
  • Add AWQ quantization support for NPU. by @ErvinXie in #10158
  • model: support deepseek-ocr by @mickqian in #11891
  • Log iteration # for prefill and decode by @nvcastet in #9366
  • Revert "[ROCm] Remove vLLM rope dependency & use AITER impl" by @b8zhong in #12028
  • Fix mamba radix cache eviction logic in alloc_req_slots by @rogeryoungh in #11616
  • Update Github action title for kernel build by @Kangyan-Zhou in #12029
  • [router] Add builder pattern for RouterConfig with zero duplication by @slin1237 in #12030
  • Fixed aarch64 flash-mla by @nvjullin in #12009
  • chore: bump SGLang version to 0.5.4 by @sglang-bot in #12027

New Contributors

  • @xuwenyihust made their first contribution in #11302
  • @ziruiliu made their first contribution in #10969
  • @scottjlee made their first contribution in #11144
  • @Liu-congo made their first contribution in #11360
  • @lulor made their first contribution in #8919
  • @antoine-roux made their first contribution in #10577
  • @QiuMike made their first contribution in #11465
  • @ai-jz made their first contribution in #11419
  • @neelabhsinha made their first contribution in #11413
  • @UNIDY2002 made their first contribution in #10423
  • @zhooooong made their first contribution in #11541
  • @pdasgup made their first contribution in #11730
  • @anvdn made their first contribution in #11519
  • @Kindyaa made their first contribution in #11710
  • @petricevich made their first contribution in #11555
  • @tazjin made their first contribution in #11328
  • @Paiiiiiiiiiiiiii made their first contribution in #11652
  • @2015aroras made their first contribution in #11396
  • @zhangdonghao-zdh made their first contribution in #11843
  • @xintin made their first contribution in #11878
  • @zhengkezhou1 made their first contribution in #11295
  • @Kh4L made their first contribution in #11287
  • @yuho8818 made their first contribution in #11977
  • @jiahanc made their first contribution in #11563
  • @approximated-intelligence made their first contribution in #10869
  • @zheng1 made their first contribution in #11965
  • @ErvinXie made their first contribution in #10158
  • @rogeryoungh made their first contribution in #11616
  • @nvjullin made their first contribution in #12009

Full Changelog: v0.5.3...v0.5.4
