Highlights
OpenAI-Compatible Server Refactor
Restructured the OpenAI-compatible server to support production and enterprise environments. Key improvements include:
- Consistent metrics and logging for better observability and debugging.
- Unified error handling, request validation, and processing logic for improved reliability and maintainability.
- Improved request tracking across sessions and components.
- Fixed bugs in embedding requests and reasoning parsers.
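To see the refactored entrypoints in action, here is a minimal sketch that queries the OpenAI-compatible chat endpoint with the official `openai` Python client. It assumes a server is already running on SGLang's default local port (30000); the model name is a placeholder for whatever you are serving.

```python
# Minimal sketch: query the OpenAI-compatible server with the official
# `openai` client. Assumes a local server on SGLang's default port 30000;
# the model name below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # placeholder: use your served model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```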
This work was a collaborative effort involving engineers from academic and industry institutions. Special thanks to the Oracle Cloud team and the SGLang team and community — including @slin1237, @CatherineSue, @key4ng, @JustinTong0323, @jhinpan, @yhyang201, @woodx9 and @whybeyoung — for their invaluable contributions.
DeepSeek R1 FP4 on Blackwell GPU
Added support for DeepSeek R1 with FP4 and MTP on NVIDIA Blackwell GPU.
- Integrated FlashInfer NVFP4 MoE, supporting TP, EP, and DP.
- Supported 2-stream shared expert execution.
- Achieved up to 90 TPS per user at isl/osl/bs = 1k/1k/16 (1k-token input and output sequences, batch size 16) on B200; a rough measurement sketch follows this list.
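To put the per-user TPS figure in context, the sketch below estimates single-user decode throughput by streaming a completion and timing the generated chunks. It assumes the same local OpenAI-compatible endpoint as above and treats one streamed content chunk as roughly one token, so the result is only an approximation.

```python
# Rough single-user TPS estimate: stream a completion and treat each
# content chunk as ~one token. Assumes an OpenAI-compatible server on
# localhost:30000; the model name is a placeholder.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
num_chunks = 0
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # placeholder
    messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        num_chunks += 1
elapsed = time.perf_counter() - start
print(f"~{num_chunks / elapsed:.1f} tokens/s (approximate)")
```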
Further optimization is in progress. Special thanks to the FlashInfer, NVIDIA Enterprise Products, Novita AI, DataCrunch, Google Cloud, and SGLang teams — especially @Alcanderian and @pyc96 — for their critical contributions.
Breaking Change: OpenAI-Compatible API Module Moved
The `sglang/srt/openai_api` directory has been removed and replaced with `sglang/srt/entrypoints/openai`.
Update your imports to the new module path. For example:
```diff
- from sglang.srt.openai_api.protocol import Tool
+ from sglang.srt.entrypoints.openai.protocol import Tool
```
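If your code must run against both pre- and post-refactor versions, a small compatibility shim is one option. The sketch below is an illustration, not an official SGLang API: it simply prefers the new module path and falls back to the old one.

```python
# Hypothetical compatibility shim for code that must support both the
# old and new module layouts. Not an official SGLang API.
try:
    from sglang.srt.entrypoints.openai.protocol import Tool  # v0.4.8+
except ImportError:
    from sglang.srt.openai_api.protocol import Tool  # pre-v0.4.8
```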
What's Changed
- Update README.md by @merrymercy in #7040
- [Docker] Upgrading base image from 24.04 to 24.12 by @Swipe4057 in #7043
- fix 24.12 docker by @zhyncs in #7045
- Minor cleanup of fa3 backend by @merrymercy in #6999
- Fix eagle on AMD by @merrymercy in #7051
- Clean up server_args.py by @merrymercy in #7037
- Minor style fix in cuda_graph_runner.py by @merrymercy in #7053
- [WA] fix output data is nan in CI test "test_moe_eval_accuracy_large.py" by @kkHuang-amd in #7021
- [fix] libmlx5.so already in base image by @HanHan009527 in #7060
- Fix test_lora.py CI by @Fridge003 in #7061
- Tiny fix cutlass_mla_get_workspace_size stub incorrect signature by @fzyzcjy in #7057
- Add sanity checks when a test file is not added to CI by @fzyzcjy in #6947
- Revert "Add sanity checks when a test file is not added to CI (#6947)" by @zhyncs in #7063
- Fix missing tool call id if tool call index >0 in streaming tool call output. by @Xu-Wenqing in #7049
- chore: update dev docker by @zhyncs in #7064
- Open AI API hidden states by @kyle-pena-kuzco in #6716
- fix arm sgl-kernel link issue by @zhyncs in #7066
- [Feature] Add Logit Bias by @b8zhong in #6579
- Improve perf tuning docs by @merrymercy in #7071
- Frontend language separate reasoning support by @binarycrayon in #6031
- Do not run frontend_reasoning.ipynb to reduce the CI load by @merrymercy in #7073
- Simplify the heuristics for setting --mem-fraction-static by @merrymercy in #7054
- update doc by @Ximingwang-09 in #7046
- Clean up docs for server args and sampling parameters (generated by grok) by @merrymercy in #7076
- Fix GGuf and add back test_gguf.py by @Fridge003 in #7067
- vlm: adapt internvl to VisionAttention by @mickqian in #6870
- Fix circular import in test_prefix_chunk_info.py by @Fridge003 in #7097
- Fix misusing the "_is_cuda". by @sogalin in #7091
- Support VILA models by @futrime in #6106
- [FIX]remove redundant code in logits_processor.py by @pc-neo in #7079
- [feat]: Emit fixed-size KV blocks events by @faradawn in #6824
- [Perf] Refactor LoRAManager to eliminate stream syncs and redundant computations by @lifuhuang in #6994
- Fix positional argument by @liquanfeng in #7093
- [sgl-kernel] Add cuda kernel for moe_ep_silu_and_mul by @yuan-luo in #6919
- Improve log status by @hnyls2002 in #7115
- feat: update blackwell setup by @zhyncs in #7119
- Update CODEOWNERS by @merrymercy in #7126
- Add gfx950 support for sgl-kernel. by @sogalin in #7092
- [Fix] Reduce busy polling when scheduler is idle by @p12tic in #6026
- Minor add utility to read expert distribution recorder output by @fzyzcjy in #7134
- Remove unnecessary metadata_expand.max_seq_len_k operations in fa3 to… by @byjiang1996 in #7140
- Minor speedup topk postprocessing by @fzyzcjy in #7058
- filter by num_hidden_layers by @pansicheng in #7056
- Remove 200us slow concat kernel (part 1: kernel) by @fzyzcjy in #7145
- Support new DeepGEMM format in per token group quant by @fzyzcjy in #7146
- chore: bump v0.1.8.post1 by @zhyncs in #7152
- Support new DeepGEMM format in per token group quant (part 2: srt) by @fzyzcjy in #7155
- Fix DeepEP error in some environments by @fzyzcjy in #7154
- Minor speed up block_quant_dequant by @fzyzcjy in #6814
- Tiny add sanity checks for DeepGEMM inputs by @fzyzcjy in #7157
- Remove 200us slow concat kernel (part 2: srt) by @fzyzcjy in #7020
- Re-quantize DeepSeek model weights to support DeepGEMM new input format by @fzyzcjy in #7156
- Minor style change of triton backend by @merrymercy in #7165
- Split the eagle test into two files by @merrymercy in #7170
- Support new DeepGEMM input format in silu_and_mul_masked_post_quant_fwd by @fzyzcjy in #7153
- Refactor DeepGEMM integration by @fzyzcjy in #7150
- Add test for refactored openai server by @jhinpan in #7161
- Improve test cases for eagle infer by @merrymercy in #7173
- Support new DeepGEMM by @fzyzcjy in #7172
- Increase timeout in test/srt/test_disaggregation.py by @merrymercy in #7175
- Add Phi-4-mm to supported VLM supported model list. by @lifuhuang in #7178
- Fix shared experts fusion + weight requant by @fzyzcjy in #7177
- [fix] fix dsv3 weight loader tqdm and simplify shared experts fusion by @Alcanderian in #7181
- [fix] fix cutlass_mla_backend with cuda_graph and add sm_scale for sgl-kernel cutlass_mla by @Alcanderian in #7184
- [PD] Update prefill.py by @ByronHsu in #7190
- Fix a minor bug related to DeepGEMM upgrade by @zhijian-liu in #7191
- chore: bump v0.1.8.post2 by @zhyncs in #7189
- [fix] fix determine_num_fused_shared_experts by @Alcanderian in #7180
- chore: upgrade sgl-kernel v0.1.8.post2 by @Alcanderian in #7186
- Fix NCCL 2.27.3 not in docker image by @fzyzcjy in #7195
- Fix error when disabling new DeepGEMM by @fzyzcjy in #7198
- [PD] Support decode retract and update decode.py by @ByronHsu in #7196
- Move host memory pools into a separate file by @merrymercy in #7200
- Lianmin/simplify memory pool by @merrymercy in #7202
- Fix grammar abort & Minor style fixes by @merrymercy in #7204
- feat: use zstd for docker by @zhyncs in #7205
- [EAGLE] Refactor code for page size > 1 & more simplifications by @merrymercy in #7163
- Revert "[EAGLE] Refactor code for page size > 1 & more simplifications" by @merrymercy in #7210
- [PD] use int32 for kv indices & get num_reserved_decode_tokens from server_args by @ByronHsu in #7214
- Minor PD style fix by @ByronHsu in #7215
- Fix ChunkCache object has no attribute 'disable' by @Fridge003 in #7217
- Implement gather before attn by @ch-wan in #6378
- Support LoRA in MMMU benchmark script. by @lifuhuang in #7218
- refine fused_moe benchmark by @BBuf in #7221
- Minor style and doc fix by @merrymercy in #7228
- [EAGLE] Refactor code for page size > 1 & more simplifications by @merrymercy in #7213
- Fix sampling for speculative decoding & simplify kernels by @merrymercy in #7207
- Release sgl-kernel 0.1.9 by @merrymercy in #7232
- [EAGLE] Fix draft kv cache layout for fa3 and topk > 1 by @merrymercy in #7239
- [Eagle] Fix kernel call after updating speculative sampling kernels by @merrymercy in #7231
- minor fix by @hnyls2002 in #7245
- Tiny remove comments about DeepEP on H20 by @fzyzcjy in #7234
- Feat/support rerank by @woodx9 in #6058
- [fix] fix DeepGEMM blackwell input quant & ut & fix style and log by @Alcanderian in #7247
- Update CI flakes. by @saienduri in #7244
- chore: bump v0.4.7.post1 by @zhyncs in #7248
- fix amd EP MoE FP8 issue by @alexsun07 in #7125
- Use seq_len_fill_value in the cuda graph runners by @merrymercy in #7233
- support custom weight loader for model runner by @yukavio in #7122
- Fix AMD speculative decoding by @merrymercy in #7252
- [Refactor] OAI Server components by @JustinTong0323 in #7167
- OAI Server Skeleton & Core Utility Endpoints by @yhyang201 in #7179
- [amd] Opt dsv3 moe by @kkHuang-amd in #7160
- update ci node for xeon by @DiweiSun in #7265
- feat: mtp support dp-attention by @u4lr451 in #6081
- support qwen2 running on ascend npu device by @zhuyijie88 in #7022
- Fix Deepseek R1 0528 FP4 tensor name mismatch issue during weights loading. by @pyc96 in #7164
- bugfix(tool call ebnf): Fix EBNF generation for optional function parameters by @CatherineSue in #7283
- Fix AWQ Dequant and Weight Loading of deepseek v2 by @AniZpZ in #6842
- fix: resolve b200 dsv3 mtp issue by @zhyncs in #7286
- ci: Fix test_ebnf_generate_all_optional_function_params by @CatherineSue in #7288
- fix: only enable flash_attn test on sm80 sm90 by @zhyncs in #7289
- [PD] Support get local ip from NIC for PD disaggregation by @ShangmingCai in #7237
- [PD] Add custom memory pool option to support Mooncake PD with NVLink by @ShangmingCai in #7264
- Upstreaming hicache bug fixes by @xiezhq-hermann in #7267
- Update python API of activation, topk, norm and rope and remove vllm dependency by @yanbing-j in #6614
- Fix hicache benchmark script bug - some sampled input_request is [] by @byjiang1996 in #7300
- chore: change logs from `INFO` to `DEBUG` for dp and add force quit for tokenizer manager by @ishandhanani in #7251
- update invalid link in doc by @habaohaba in #7297
- Fix mini_lb for PD with long output: limit chunk size of decode response by @ch-tiger1 in #7301
- Fix profiler error when there are idle passes by @fzyzcjy in #7003
- [pd] optimize dockerfile for pd disaggregation by @whybeyoung in #7319
- Merge PDLB (Prefill-Decode Load Balancer) into SGLang Router by @slin1237 in #7096
- Add more refactored openai test & in CI by @jhinpan in #7284
- fix: resolve blackwell deepep image issue by @zhyncs in #7331
- add seed in CPU UTs to avoid flaky failure by @chunyuan-w in #7333
- Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately by @hebiao064 in #7099
- Reintroduce tiny fix sampler error when prob is not contiguous by @fzyzcjy in #7354
- [Refactor] Clean up radix cache related API by @DarkSharpness in #7303
- Put `_normalize_rid` before other normalization in `io_struct` by @CatherineSue in #7363
- [PD] Transfer hidden states for mtp when disaggregation by @Atream in #7242
- [Bugfix][PD] Set conclude state before clear when failure happens by @ShangmingCai in #7362
- docs: update installation by @zhyncs in #7366
- [Docker] optimize dockerfile remove deepep and blackwell merge it to… by @whybeyoung in #7343
- Clean unused import for mimo mtp model by @lambert0312 in #7370
- [Bugfix]Fix hang bug using dp attention with HiRadixCache by @LLLL114 in #7159
- [Doc] add embedding rerank doc by @woodx9 in #7364
- Fix judgment condition for enabling Deepseek V3/R1 shared expert fusion optimization by @lambert0312 in #7371
- Feat/refactor embedding server by @woodx9 in #7322
- Purge VerlEngine by @MrAta in #7326
- support return logprobs for pipeline by @strgrb in #7356
- [PD] Optimize custom mem pool usage and bump mooncake version by @ShangmingCai in #7393
- Support THUDM/GLM-4-0414 (GLM-Z1) Glm4ForCausalLM architecture. by @solrex in #5485
- Refine OpenAI serving entrypoint to remove batch requests by @JustinTong0323 in #7372
- [Feature] Comprehensive Hybrid Parallelism Support by @ch-wan in #6389
- [DeepSeekNextN] fix: residual of head norm can be None by @ch-wan in #7398
- [OAI refactor] Add rerank and score serving by @woodx9 in #7399
- [OAI Server Refactor] [ChatCompletions & Completions] Implement UsageInfo Processor by @yhyang201 in #7360
- Fix All-Gather under world size one by @ch-wan in #7219
- Optimize DP attn scheduling for speculative decoding by @ch-wan in #7285
- Update usage_processor.py by @ch-wan in #7402
- Fix 7285 Merge Conflicts by @ch-wan in #7403
- chore: upgrade mooncake-transfer-engine 0.3.4 by @zhyncs in #7401
- [OAI Server Refactor] [ChatCompletions & Completions] Support Return Hidden State by @key4ng in #7329
- Remove batches api in docs & example by @jhinpan in #7400
- [BugFix]: fix EmbeddingReqInput single input error by @woodx9 in #7396
- [BugFix]fix qwen25 invoke function call streaming responses with curly braces as the starting indicator by @ehuaa in #7394
- fix overlap pagecount by @pansicheng in #6984
- fix: Fix CI test_function_call_parser.py by @CatherineSue in #7425
- Fix CPU offloading for MLA memory pool by @hnyls2002 in #7409
- [fix] PD disaggregation when enable mtp and tp!=dp by @Atream in #7420
- feat(oai refactor): Replace `openai_api` with `entrypoints/openai` by @CatherineSue in #7351
- Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support by @lifuhuang in #7412
- refactor(test): reorganize OpenAI test file structure by @CatherineSue in #7408
- [minor] simplify the `TokenToKVPoolAllocator` by @hnyls2002 in #7414
- Tiny add logging for GC by @fzyzcjy in #7406
- FlashInfer NVFP4 MoE with EP & 2-stream shared expert by @trevor-m in #7327
- Remove copy after bmm by @ispobock in #7441
- Fix torch compile run by @kkHuang-amd in #7391
- [misc] Add PD service discovery support in router by @slin1237 in #7361
- add fused moe config for qwen3 in triton3.3.1 by @yizhang2077 in #7445
- Fix CUDA Graph Check under Deepep with DP FFN by @ch-wan in #7451
- Update hyperparameter_tuning.md by @merrymercy in #7454
- feat: integrate deepgemm into EPMoE by @xutizhou in #6821
- Solve docker build failed in the virtual machine by @kkHuang-amd in #7290
- Fix a bug in BatchTokenIDOut & Misc style and dependency updates by @merrymercy in #7457
- [CI] Upgrade mooncake to 0.3.4.post1 to fix 8 gpu tests by @ShangmingCai in #7472
- Fix prefill OOM due to wrong token calculation when page > 1 by @hnyls2002 in #7397
- feat(func_call): Add more check in `BaseFormatDetector.parse_streaming_increment` by @CatherineSue in #7479
- Fix dtype for idle input in spec decoding by @ch-wan in #7456
- update mooncake in dockerfile by @hnyls2002 in #7480
- kvcache io kernels and test case by @xiezhq-hermann in #7382
- [perf] slightly improve DeepSeek-R1-FP4 TP8 by @Alcanderian in #7481
- Quick fix for DeepGemm requant to also cover MTP. by @pyc96 in #7378
- Support weight loading without mmap by @guoyuhong in #7469
- ci: Revert openai_server related tests in AMD suites by @CatherineSue in #7449
- Performance: Enable cuda graph for dp idle batch by @u4lr451 in #7269
- bugfix: Prevent global mutation of conv.stop_str across requests by @huangtingwei9988 in #7347
- Fix RequestValidationError response format by @CatherineSue in #7487
- Fix MTP with Deepseek R1 Fp4 by @pyc96 in #7376
- chore: bump sgl-kernel v0.2.0 by @zhyncs in #7490
- chore: bump v0.4.8 by @zhyncs in #7493
New Contributors
- @futrime made their first contribution in #6106
- @faradawn made their first contribution in #6824
- @liquanfeng made their first contribution in #7093
- @p12tic made their first contribution in #6026
- @byjiang1996 made their first contribution in #7140
- @zhijian-liu made their first contribution in #7191
- @DiweiSun made their first contribution in #7265
- @zhuyijie88 made their first contribution in #7022
- @pyc96 made their first contribution in #7164
- @ch-tiger1 made their first contribution in #7301
- @Atream made their first contribution in #7242
- @LLLL114 made their first contribution in #7159
- @key4ng made their first contribution in #7329
- @ehuaa made their first contribution in #7394
Full Changelog: v0.4.7...v0.4.8