Highlights
OpenAI-Compatible Server Refactor
Restructured the OpenAI-compatible server to support production and enterprise environments. Key improvements include:
- Consistent metrics and logging for better observability and debugging.
- Unified error handling, request validation, and processing logic for improved reliability and maintainability.
- Improved request tracking across sessions and components.
- Fixed bugs in embedding requests and reasoning parsers.
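To see the refactored entrypoints in action, here is a minimal sketch that queries the OpenAI-compatible chat endpoint with the official `openai` Python client. It assumes a server is already running on SGLang's default local port (30000); the model name is a placeholder for whatever you are serving.

```python
# Minimal sketch: query the OpenAI-compatible server with the official
# `openai` client. Assumes a local server on SGLang's default port 30000;
# the model name below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # placeholder: use your served model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```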
This work was a collaborative effort involving engineers from academic and industry institutions. Special thanks to the Oracle Cloud team and the SGLang team and community — including @slin1237, @CatherineSue, @key4ng, @JustinTong0323, @jhinpan, @yhyang201, @woodx9 and @whybeyoung — for their invaluable contributions.
DeepSeek R1 FP4 on Blackwell GPU
Added support for DeepSeek R1 with FP4 and MTP on NVIDIA Blackwell GPU.
- Integrated FlashInfer NVFP4 MoE, supporting TP, EP, and DP.
- Supported 2-stream shared expert execution.
- Achieved up to 90 TPS per user at isl/osl/bs = 1k/1k/16 (1k-token input and output sequences, batch size 16) on B200; a rough measurement sketch follows this list.
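To put the per-user TPS figure in context, the sketch below estimates single-user decode throughput by streaming a completion and timing the generated chunks. It assumes the same local OpenAI-compatible endpoint as above and treats one streamed content chunk as roughly one token, so the result is only an approximation.

```python
# Rough single-user TPS estimate: stream a completion and treat each
# content chunk as ~one token. Assumes an OpenAI-compatible server on
# localhost:30000; the model name is a placeholder.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
num_chunks = 0
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # placeholder
    messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        num_chunks += 1
elapsed = time.perf_counter() - start
print(f"~{num_chunks / elapsed:.1f} tokens/s (approximate)")
```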
Further optimization is in progress. Special thanks to the FlashInfer, NVIDIA Enterprise Products, Novita AI, DataCrunch, Google Cloud, and SGLang teams — especially @Alcanderian and @pyc96 — for their critical contributions.
Breaking Change: OpenAI-Compatible API Module Moved
The `sglang/srt/openai_api` directory has been removed and replaced with `sglang/srt/entrypoints/openai`.
Update your imports to the new module path. For example:
```diff
- from sglang.srt.openai_api.protocol import Tool
+ from sglang.srt.entrypoints.openai.protocol import Tool
```
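If your code must run against both pre- and post-refactor versions, a small compatibility shim is one option. The sketch below is an illustration, not an official SGLang API: it simply prefers the new module path and falls back to the old one.

```python
# Hypothetical compatibility shim for code that must support both the
# old and new module layouts. Not an official SGLang API.
try:
    from sglang.srt.entrypoints.openai.protocol import Tool  # v0.4.8+
except ImportError:
    from sglang.srt.openai_api.protocol import Tool  # pre-v0.4.8
```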
What's Changed
- Update README.md by @merrymercy in #7040
- [Docker] Upgrading base image from 24.04 to 24.12 by @Swipe4057 in #7043
- fix 24.12 docker by @zhyncs in #7045
- Minor cleanup of fa3 backend by @merrymercy in #6999
- Fix eagle on AMD by @merrymercy in #7051
- Clean up server_args.py by @merrymercy in #7037
- Minor style fix in cuda_graph_runner.py by @merrymercy in #7053
- [WA] fix output data is nan in CI test "test_moe_eval_accuracy_large.py" by @kkHuang-amd in #7021
- [fix] libmlx5.so already in base image by @HanHan009527 in #7060
- Fix test_lora.py CI by @Fridge003 in #7061
- Tiny fix cutlass_mla_get_workspace_size stub incorrect signature by @fzyzcjy in #7057
- Add sanity checks when a test file is not added to CI by @fzyzcjy in #6947
- Revert "Add sanity checks when a test file is not added to CI (#6947)" by @zhyncs in #7063
- Fix missing tool call id if tool call index >0 in streaming tool call output. by @Xu-Wenqing in #7049
- chore: update dev docker by @zhyncs in #7064
- Open AI API hidden states by @kyle-pena-kuzco in #6716
- fix arm sgl-kernel link issue by @zhyncs in #7066
- [Feature] Add Logit Bias by @b8zhong in #6579
- Improve perf tuning docs by @merrymercy in #7071
- Frontend language separate reasoning support by @binarycrayon in #6031
- Do not run frontend_reasoning.ipynb to reduce the CI load by @merrymercy in #7073
- Simplify the heuristics for setting --mem-fraction-static by @merrymercy in #7054
- update doc by @Ximingwang-09 in #7046
- Clean up docs for server args and sampling parameters (generated by grok) by @merrymercy in #7076
- Fix GGuf and add back test_gguf.py by @Fridge003 in #7067
- vlm: adapt internvl to VisionAttention by @mickqian in #6870
- Fix circular import in test_prefix_chunk_info.py by @Fridge003 in #7097
- Fix misusing the "_is_cuda". by @sogalin in #7091
- Support VILA models by @futrime in #6106
- [FIX]remove redundant code in logits_processor.py by @pc-neo in #7079
- [feat]: Emit fixed-size KV blocks events by @faradawn in #6824
- [Perf] Refactor LoRAManager to eliminate stream syncs and redundant computations by @lifuhuang in #6994
- Fix positional argument by @liquanfeng in #7093
- [sgl-kernel] Add cuda kernel for moe_ep_silu_and_mul by @yuan-luo in #6919
- Improve log status by @hnyls2002 in #7115
- feat: update blackwell setup by @zhyncs in #7119
- Update CODEOWNERS by @merrymercy in #7126
- Add gfx950 support for sgl-kernel. by @sogalin in #7092
- [Fix] Reduce busy polling when scheduler is idle by @p12tic in #6026
- Minor add utility to read expert distribution recorder output by @fzyzcjy in #7134
- Remove unnecessary metadata_expand.max_seq_len_k operations in fa3 to… by @byjiang1996 in #7140
- Minor speedup topk postprocessing by @fzyzcjy in #7058
- filter by num_hidden_layers by @pansicheng in #7056
- Remove 200us slow concat kernel (part 1: kernel) by @fzyzcjy in #7145
- Support new DeepGEMM format in per token group quant by @fzyzcjy in #7146
- chore: bump v0.1.8.post1 by @zhyncs in #7152
- Support new DeepGEMM format in per token group quant (part 2: srt) by @fzyzcjy in #7155
- Fix DeepEP error in some environments by @fzyzcjy in #7154
- Minor speed up block_quant_dequant by @fzyzcjy in #6814
- Tiny add sanity checks for DeepGEMM inputs by @fzyzcjy in #7157
- Remove 200us slow concat kernel (part 2: srt) by @fzyzcjy in #7020
- Re-quantize DeepSeek model weights to support DeepGEMM new input format by @fzyzcjy in #7156
- Minor style change of triton backend by @merrymercy in #7165
- Split the eagle test into two files by @merrymercy in #7170
- Support new DeepGEMM input format in silu_and_mul_masked_post_quant_fwd by @fzyzcjy in #7153
- Refactor DeepGEMM integration by @fzyzcjy in #7150
- Add test for refactored openai server by @jhinpan in #7161
- Improve test cases for eagle infer by @merrymercy in #7173
- Support new DeepGEMM by @fzyzcjy in #7172
- Increase timeout in test/srt/test_disaggregation.py by @merrymercy in #7175
- Add Phi-4-mm to supported VLM supported model list. by @lifuhuang in #7178
- Fix shared experts fusion + weight requant by @fzyzcjy in #7177
- [fix] fix dsv3 weight loader tqdm and simplify shared experts fusion by @Alcanderian in #7181
- [fix] fix cutlass_mla_backend with cuda_graph and add sm_scale for sgl-kernel cutlass_mla by @Alcanderian in #7184
- [PD] Update prefill.py by @ByronHsu in #7190
- Fix a minor bug related to DeepGEMM upgrade by @zhijian-liu in #7191
- chore: bump v0.1.8.post2 by @zhyncs in #7189
- [fix] fix determine_num_fused_shared_experts by @Alcanderian in #7180
- chore: upgrade sgl-kernel v0.1.8.post2 by @Alcanderian in #7186
- Fix NCCL 2.27.3 not in docker image by @fzyzcjy in #7195
- Fix error when disabling new DeepGEMM by @fzyzcjy in #7198
- [PD] Support decode retract and update decode.py by @ByronHsu in #7196
- Move host memory pools into a separate file by @merrymercy in #7200
- Lianmin/simplify memory pool by @merrymercy in #7202
- Fix grammar abort & Minor style fixes by @merrymercy in #7204
- feat: use zstd for docker by @zhyncs in #7205
- [EAGLE] Refactor code for page size > 1 & more simplifications by @merrymercy in #7163
- Revert "[EAGLE] Refactor code for page size > 1 & more simplifications" by @merrymercy in #7210
- [PD] use int32 for kv indices & get num_reserved_decode_tokens from server_args by @ByronHsu in #7214
- Minor PD style fix by @ByronHsu in #7215
- Fix ChunkCache object has no attribute 'disable' by @Fridge003 in #7217
- Implement gather before attn by @ch-wan in #6378
- Support LoRA in MMMU benchmark script. by @lifuhuang in #7218
- refine fused_moe benchmark by @BBuf in #7221
- Minor style and doc fix by @merrymercy in #7228
- [EAGLE] Refactor code for page size > 1 & more simplifications by @merrymercy in #7213
- Fix sampling for speculative decoding & simplify kernels by @merrymercy in #7207
- Release sgl-kernel 0.1.9 by @merrymercy in #7232
- [EAGLE] Fix draft kv cache layout for fa3 and topk > 1 by @merrymercy in #7239
- [Eagle] Fix kernel call after updating speculative sampling kernels by @merrymercy in #7231
- minor fix by @hnyls2002 in #7245
- Tiny remove comments about DeepEP on H20 by @fzyzcjy in #7234
- Feat/support rerank by @woodx9 in #6058
- [fix] fix DeepGEMM blackwell input quant & ut & fix style and log by @Alcanderian in #7247
- Update CI flakes. by @saienduri in #7244
- chore: bump v0.4.7.post1 by @zhyncs in #7248
- fix amd EP MoE FP8 issue by @alexsun07 in #7125
- Use seq_len_fill_value in the cuda graph runners by @merrymercy in #7233
- support custom weight loader for model runner by @yukavio in #7122
- Fix AMD speculative decoding by @merrymercy in #7252
- [Refactor] OAI Server components by @JustinTong0323 in #7167
- OAI Server Skeleton & Core Utility Endpoints by @yhyang201 in #7179
- [amd] Opt dsv3 moe by @kkHuang-amd in #7160
- update ci node for xeon by @DiweiSun in #7265
- feat: mtp support dp-attention by @u4lr451 in #6081
- support qwen2 running on ascend npu device by @zhuyijie88 in #7022
- Fix Deepseek R1 0528 FP4 tensor name mismatch issue during weights loading. by @pyc96 in #7164
- bugfix(tool call ebnf): Fix EBNF generation for optional function parameters by @CatherineSue in #7283
- Fix AWQ Dequant and Weight Loading of deepseek v2 by @AniZpZ in #6842
- fix: resolve b200 dsv3 mtp issue by @zhyncs in #7286
- ci: Fix test_ebnf_generate_all_optional_function_params by @CatherineSue in #7288
- fix: only enable flash_attn test on sm80 sm90 by @zhyncs in #7289
- [PD] Support get local ip from NIC for PD disaggregation by @ShangmingCai in #7237
- [PD] Add custom memory pool option to support Mooncake PD with NVLink by @ShangmingCai in #7264
- Upstreaming hicache bug fixes by @xiezhq-hermann in #7267
- Update python API of activation, topk, norm and rope and remove vllm dependency by @yanbing-j in #6614
- Fix hicache benchmark script bug - some sampled input_request is [] by @byjiang1996 in #7300
- chore: change logs from `INFO` to `DEBUG` for dp and add force quit for tokenizer manager by @ishandhanani in #7251
- update invalid link in doc by @habaohaba in #7297
- Fix mini_lb for PD with long output: limit chunk size of decode response by @ch-tiger1 in #7301
- Fix profiler error when there are idle passes by @fzyzcjy in #7003
- [pd] optimize dockerfile for pd disaggregation by @whybeyoung in #7319
- Merge PDLB (Prefill-Decode Load Balancer) into SGLang Router by @slin1237 in #7096
- Add more refactored openai test & in CI by @jhinpan in #7284
- fix: resolve blackwell deepep image issue by @zhyncs in #7331
- add seed in CPU UTs to avoid flaky failure by @chunyuan-w in #7333
- Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately by @hebiao064 in #7099
- Reintroduce tiny fix sampler error when prob is not contiguous by @fzyzcjy in #7354
- [Refactor] Clean up radix cache related API by @DarkSharpness in #7303
- Put `_normalize_rid` before other normalization in `io_struct` by @CatherineSue in #7363
- [PD] Transfer hidden states for mtp when disaggregation by @Atream in #7242
- [Bugfix][PD] Set conclude state before clear when failure happens by @ShangmingCai in #7362
- docs: update installation by @zhyncs in #7366
- [Docker] optimize dockerfile remove deepep and blackwell merge it to… by @whybeyoung in #7343
- Clean unused import for mimo mtp model by @lambert0312 in #7370
- [Bugfix]Fix hang bug using dp attention with HiRadixCache by @LLLL114 in #7159
- [Doc] add embedding rerank doc by @woodx9 in #7364
- Fix judgment condition for enabling Deepseek V3/R1 shared expert fusion optimization by @lambert0312 in #7371
- Feat/refactor embedding server by @woodx9 in #7322
- Purge VerlEngine by @MrAta in #7326
- support return logprobs for pipeline by @strgrb in #7356
- [PD] Optimize custom mem pool usage and bump mooncake version by @ShangmingCai in #7393
- Support THUDM/GLM-4-0414 (GLM-Z1) Glm4ForCausalLM architecture. by @solrex in #5485
- Refine OpenAI serving entrypoint to remove batch requests by @JustinTong0323 in #7372
- [Feature] Comprehensive Hybrid Parallelism Support by @ch-wan in #6389
- [DeepSeekNextN] fix: residual of head norm can be None by @ch-wan in #7398
- [OAI refactor] Add rerank and score serving by @woodx9 in #7399
- [OAI Server Refactor] [ChatCompletions & Completions] Implement UsageInfo Processor by @yhyang201 in #7360
- Fix All-Gather under world size one by @ch-wan in #7219
- Optimize DP attn scheduling for speculative decoding by @ch-wan in #7285
- Update usage_processor.py by @ch-wan in #7402
- Fix 7285 Merge Conflicts by @ch-wan in #7403
- chore: upgrade mooncake-transfer-engine 0.3.4 by @zhyncs in #7401
- [OAI Server Refactor] [ChatCompletions & Completions] Support Return Hidden State by @key4ng in #7329
- Remove batches api in docs & example by @jhinpan in #7400
- [BugFix]: fix EmbeddingReqInput single input error by @woodx9 in #7396
- [BugFix]fix qwen25 invoke function call streaming responses with curly braces as the starting indicator by @ehuaa in #7394
- fix overlap pagecount by @pansicheng in #6984
- fix: Fix CI test_function_call_parser.py by @CatherineSue in #7425
- Fix CPU offloading for MLA memory pool by @hnyls2002 in #7409
- [fix] PD disaggregation when enable mtp and tp!=dp by @Atream in #7420
- feat(oai refactor): Replace `openai_api` with `entrypoints/openai` by @CatherineSue in #7351
- Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support by @lifuhuang in #7412
- refactor(test): reorganize OpenAI test file structure by @CatherineSue in #7408
- [minor] simplify the `TokenToKVPoolAllocator` by @hnyls2002 in #7414
- Tiny add logging for GC by @fzyzcjy in #7406
- FlashInfer NVFP4 MoE with EP & 2-stream shared expert by @trevor-m in #7327
- Remove copy after bmm by @ispobock in #7441
- Fix torch compile run by @kkHuang-amd in #7391
- [misc] Add PD service discovery support in router by @slin1237 in #7361
- add fused moe config for qwen3 in triton3.3.1 by @yizhang2077 in #7445
- Fix CUDA Graph Check under Deepep with DP FFN by @ch-wan in #7451
- Update hyperparameter_tuning.md by @merrymercy in #7454
- feat: integrate deepgemm into EPMoE by @xutizhou in #6821
- Solve docker build failed in the virtual machine by @kkHuang-amd in #7290
- Fix a bug in BatchTokenIDOut & Misc style and dependency updates by @merrymercy in #7457
- [CI] Upgrade mooncake to 0.3.4.post1 to fix 8 gpu tests by @ShangmingCai in #7472
- Fix prefill OOM due to wrong token calculation when page > 1 by @hnyls2002 in #7397
- feat(func_call): Add more check in `BaseFormatDetector.parse_streaming_increment` by @CatherineSue in #7479
- Fix dtype for idle input in spec decoding by @ch-wan in #7456
- update mooncake in dockerfile by @hnyls2002 in #7480
- kvcache io kernels and test case by @xiezhq-hermann in #7382
- [perf] slightly improve DeepSeek-R1-FP4 TP8 by @Alcanderian in #7481
- Quick fix for DeepGemm requant to also cover MTP. by @pyc96 in #7378
- Support weight loading without mmap by @guoyuhong in #7469
- ci: Revert openai_server related tests in AMD suites by @CatherineSue in #7449
- Performance: Enable cuda graph for dp idle batch by @u4lr451 in #7269
- bugfix: Prevent global mutation of conv.stop_str across requests by @huangtingwei9988 in #7347
- Fix RequestValidationError response format by @CatherineSue in #7487
- Fix MTP with Deepseek R1 Fp4 by @pyc96 in #7376
- chore: bump sgl-kernel v0.2.0 by @zhyncs in #7490
- chore: bump v0.4.8 by @zhyncs in #7493
New Contributors
- @futrime made their first contribution in #6106
- @faradawn made their first contribution in #6824
- @liquanfeng made their first contribution in #7093
- @p12tic made their first contribution in #6026
- @byjiang1996 made their first contribution in #7140
- @zhijian-liu made their first contribution in #7191
- @DiweiSun made their first contribution in #7265
- @zhuyijie88 made their first contribution in #7022
- @pyc96 made their first contribution in #7164
- @ch-tiger1 made their first contribution in #7301
- @Atream made their first contribution in #7242
- @LLLL114 made their first contribution in #7159
- @key4ng made their first contribution in #7329
- @ehuaa made their first contribution in #7394
Full Changelog: v0.4.7...v0.4.8