Release v0.4.4


Highlights

The SGLang team is excited to announce the release of v0.4.4, and we will keep improving DeepSeek V3/R1 performance. With the combined FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, SGLang achieves nearly 100 tokens/s for DeepSeek V3/R1, which is currently the fastest open-source implementation. Look out for more optimizations coming soon!
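
For reference, here is a minimal launch sketch that combines these optimizations on a single H200 node. The --enable-flashinfer-mla flag and the SGL_ENABLE_JIT_DEEPGEMM variable are described in the Optimizations section below; the model path, TP size, and speculative-decoding values are illustrative assumptions and should be tuned (e.g., with the bench_speculative script) for your hardware rather than copied verbatim.

    # Sketch: single-node H200 launch enabling FlashInfer MLA, MTP (NextN),
    # DeepGEMM JIT kernels, and Torch Compile. Values are illustrative, not tuned.
    export SGL_ENABLE_JIT_DEEPGEMM=1
    python3 -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V3 \
      --tp 8 \
      --trust-remote-code \
      --enable-flashinfer-mla \
      --enable-torch-compile \
      --speculative-algorithm NEXTN \
      --speculative-num-steps 2 \
      --speculative-eagle-topk 4 \
      --speculative-num-draft-tokens 4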

Many thanks to the xAI, NVIDIA, AMD, LinkedIn, Baseten, Oracle, and Meituan teams, and to the open-source community users, for their contributions!

Beyond the adopters mentioned in the announcement, teams such as Tencent and Ant Group are also using SGLang to accelerate DeepSeek R1 inference. We are very happy to have received recognition and usage from these teams!

There will surely be bugs to discover and fixes to ship in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel at https://slack.sglang.ai/. Cheers!

Optimizations

  • AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog

  • Enhanced FlashInfer MLA Support: Now fully compatible with radix cache, chunked prefill, and MTP optimizations - enable with
    --enable-flashinfer-mla

  • Advanced MTP Capabilities: Both Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the bench_speculative script

  • DeepGEMM Integration: Full integration of DeepGEMM for NVIDIA Hopper architectures - enable with
    export SGL_ENABLE_JIT_DEEPGEMM=1

  • Pioneering INT8 Quantization: the first industry implementation of INT8 support for DeepSeek R1 models, covering both block-wise (#3730) and channel-wise (#3888) quantization

  • Other Optimizations:

    • Blackwell architecture Block Scale FP8 GEMM support

    • Support page size greater than 1 #4356

    • Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89

    • Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 8; see the sketch after this list) #4390
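
To make the distributed-parallelism bullet concrete, here is a hedged two-node sketch (two DP replicas of TP 8, 16 GPUs total). The --dist-init-addr, --nnodes, and --node-rank flags are standard SGLang multi-node arguments; the model path and address are placeholder assumptions, and #4390 covers the configuration exercised in this release.

    # Sketch: two-node deployment with DP 2, TP 8; run one command per node.
    # Node 0 (replace 10.0.0.1:5000 with the address of your first node):
    python3 -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-R1 \
      --trust-remote-code \
      --dp 2 --tp 8 \
      --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0
    # Node 1: run the identical command with --node-rank 1.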

Coming soon

  • Integrate Flash Attention #4385

  • Integrate FlashMLA #4384

  • EAGLE 2 optimization #4383

  • EAGLE 3 day one support #4247

  • Integrate DeepEP #4232

  • Prefill and Decoding Disaggregation

What's Changed

  • update flashinfer-python by @zhyncs in #3557
  • fix doc by @zhyncs in #3558
  • Add support for OpenAI API o1 model by @ChuyueSun in #3363
  • fix sgl-kernel codestyle by @BBuf in #3563
  • docs: update install by @zhyncs in #3581
  • Copy config files for MI300X to support in virtualized environments by @yosoyjay in #3505
  • ROCm docker: triton update by @HaiShaw in #3584
  • [fix] added support for vlm in offline inference by @FrankLeeeee in #3548
  • Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 by @ispobock in #3582
  • [CI] Improve Docs CI Efficiency by @shuaills in #3587
  • doc: emphasize and notify the usage of chat_template by @mickqian in #3589
  • fix eagle unit test by @zhyncs in #3591
  • fix high qps crash when enable mtp by @zhyncs in #3592
  • fix apply_token_bitmask_inplace_cuda by @zhyncs in #3594
  • [docs] added favicon to sphinx html by @FrankLeeeee in #3564
  • fix lockfile and port_registry file permission error by @Jiadalee in #3598
  • feat: Support Qwen 2.5 vl by @mickqian in #3258
  • [ROCm] Use tl.range() in block GEMM kernels with num_stages set by host. by @whchung in #3535
  • Update to latest amd image. by @saienduri in #3597
  • Benchmark for reasoning models by @simveit in #3532
  • Draft of updated doc for sampling params. by @simveit in #3260
  • [docs] Update sampling_params.md by @shuaills in #3617
  • [docker] added rdma support by @FrankLeeeee in #3619
  • Revert "[ROCm] Use tl.range() in block GEMM kernels with `num_stage… by @zhyncs in #3632
  • add mtp unit test by @zhyncs in #3634
  • update unit test by @zhyncs in #3636
  • chore: bump v0.4.3.post1 by @zhyncs in #3638
  • h800 deepseek r1 config and support multi-gpu block-gemm tuning by @BBuf in #3639
  • feat: support flashinfer mla with prefix cache by @zhyncs in #3643
  • chore: update flashinfer v0.2.1.post2 by @zhyncs in #3644
  • chore: bump v0.4.3.post2 by @zhyncs in #3645
  • use transformers 4.48.3 by @zhyncs in #3650
  • [ROCm] Add additional block quant GEMM tuning configs for AMD GPUs. by @whchung in #3616
  • [ROCm] Optimal MOE Tuning for AMD Radeon Graphics by @BruceXcluding in #3567
  • Deploy multi-node inference (LWS method) using sglang in a K8s cluster by @whybeyoung in #3624
  • Update amd docker image. by @saienduri in #3654
  • [Feature] Apply Cublas Grouped Gemm kernel by @Fridge003 in #3629
  • update pr-test by @zhyncs in #3663
  • Fix draft decode max batch size by @ispobock in #3676
  • fix: remove dependency on latest transformers impl by @mickqian in #3635
  • AMD Prefill optimize by @fsx950223 in #3665
  • fix: apply cache size limit of attention mask for VisionAttention by @mickqian in #3657
  • set NCCL_IB_GID_INDEX=3 for multi node NVIDIA InfiniBand if needed by @zhyncs in #3698
  • use warp shuffle style reduce and flashinfer vectorize by @BBuf in #3628
  • [Docs] Add SkyPilot DeepSeek example by @Michaelvll in #3706
  • [k8s] remove unnecessary hostIPC for security concern by @panpan0000 in #3700
  • [moe] optim: reduce memory consumption in fused_moe by @ch-wan in #3692
  • [Improve] Fix Multi-User Port Allocation Conflicts by @shuaills in #3601
  • Variance measure for reasoning benchmark by @simveit in #3677
  • Docs: Fix layout with sub-section by @zhaochenyang20 in #3710
  • add control for cutlass fp8 blockwise gemm by @yizhang2077 in #3727
  • revert BLOCK and num_warps on HIP by @HaiShaw in #3722
  • Optimize triton attention custom mask by @ispobock in #3731
  • [Bugfix] Fix scores mask for moe topk by @Chen-XiaoBing in #3705
  • [Docs] Modify ep related server args and remove cublas part of deepseek by @Fridge003 in #3732
  • [Fix] Fix bugs and refactor codes in lora for better scalability. by @aoshen524 in #3652
  • docs: fix 404 link by @trayvonpan in #3588
  • [docs] added torch.compile cache to dpsk manual by @FrankLeeeee in #3737
  • AMD/ROCm: update AITER repo to ROCm/aiter by @HaiShaw in #3747
  • feat: update grouped_topk to support softmax and sigmoid by @zixuanzhang226 in #3680
  • feat: Add SageMaker support by @andjsmi in #3740
  • Change description of nvidia jetson docs by @shahizat in #3761
  • [Fix] fix OpenAI API adapter tokenizer encoding by @shuaills in #3432
  • [bug] fixed batch api by @FrankLeeeee in #3754
  • Adjustments to docs by @simveit in #3733
  • docs: Add offline engine launch example and documentation by @shuaills in #3771
  • Update offline_engine_api.ipynb by @zhaochenyang20 in #3773
  • Support Qwen RM model. by @simveit in #3772
  • Add support for nvidia modelopt fp8 kv cache by @Edwardf0t1 in #3223
  • Tiny fix Olmo2 by @fzyzcjy in #3348
  • fix lm head weights in Qwen models by @zhaochenyang20 in #3777
  • Fix weight loader error when LM head weights are tied by @fzyzcjy in #3766
  • Let DetokenizerManager use TypeBasedDispatcher by @fzyzcjy in #3117
  • bench: Add a benchmark for VLM: MMMU by @mickqian in #3562
  • Extract generation_manager from tokenizer_manager by @fzyzcjy in #3115
  • Rename TokenizerManager to StdOrchestrator by @fzyzcjy in #3116
  • [Docs]Add instruction for manually stopping nsys profiler by @Fridge003 in #3795
  • Hierarchical Caching for SGLang by @xiezhq-hermann in #2693
  • Update readme by @merrymercy in #3809
  • Fix dependency by @merrymercy in #3813
  • Refactor flashinfer logic for deepseek v3 and fix accuracy bug by @Fridge003 in #3785
  • Feature DeepSeek V3/R1 INT8 Quantization (block-wise) by @laixinn in #3730
  • Fix pandas dependency in CI by @merrymercy in #3818
  • Revert "Rename TokenizerManager to StdOrchestrator" by @merrymercy in #3828
  • Revert "Extract generation_manager from tokenizer_manager" by @merrymercy in #3829
  • Fix CI and install docs by @merrymercy in #3821
  • typos by @WrRan in #3801
  • doc: fix dead link in router.md by @He1pa in #3799
  • Fix doc site copyright to current year by @wilsonwu in #3741
  • [Doc] Fix typo in server-argument description by @yuanheng-zhao in #3641
  • [ROCm] Enable Fused MLA Triton kernel for DeepSeekV3 by @lcskrishna in #3237
  • [BugFix]: Add missing clamp to llavavid by @PanJason in #3787
  • [dist] made timeout configurable by @FrankLeeeee in #3803
  • Fix allgather ops inside cuda graphs by @nvcastet in #3709
  • fix capture_bs by @fsx950223 in #3857
  • [BugFix] Fix crash when receive a req with structed output in DP attention mode. by @hcyz33 in #3841
  • Fix maximum recursion depth triggered on exception exit by @kebe7jun in #3519
  • [doc] added quantization doc for dpsk by @FrankLeeeee in #3843
  • [doc] fixed dpsk quant faq by @FrankLeeeee in #3865
  • Expert Parallelism (EP) Support for DeepSeek V3/R1 by @sleepcoo in #3602
  • Revert recent changes by @simveit in #3845
  • Feature/improve docs by @simveit in #3860
  • [Feature] Support llguidance for constrained decoding by @JC1DA in #3298
  • Move dpsk docs forward a step by @zhaochenyang20 in #3894
  • Docs: Reorganize dpsk links by @zhaochenyang20 in #3900
  • Implemented frontend docs by @simveit in #3791
  • [doc] update sponsorship by @whybeyoung in #3903
  • [Rocm] Fix to the rocm_mla_decode_rope.py returning random result by @Chi-Chu319 in #3898
  • [doc] Update document for flashinfer mla by @Fridge003 in #3907
  • Add return hidden state in the native API by @Qiaolin-Yu in #3897
  • [Docs] Disable notebook CI when merge to main by @xqoasis in #3905
  • [Docs] Improve DPSK docs in dark mode by @hebiao064 in #3914
  • [Doc] Add experimental tag for flashinfer mla by @Fridge003 in #3925
  • Tuning Script for Feature DeepSeek V3/R1 INT8 Quantization (block-wise) by @laixinn in #3922
  • xgrammar 0.1.14 by @qeternity in #3593
  • revert "Docs: Reorganize dpsk links #3900" by @zhyncs in #3933
  • upgrade flashinfer v0.2.2.post1 by @zhyncs in #3934
  • Fix the doc link for sampling params by @Qiaolin-Yu in #3861
  • [feat] Add Vertex AI compatible prediction route for /generate by @KCFindstr in #3866
  • [MOE] enable efficient moe_alignment multi-blocks execution (3x~6x) by @yiakwy-xpu-ml-framework-team in #3613
  • Fix bench_serving not recognizing OPENAI_API_KEY by @kebe7jun in #3870
  • set a strict sgl-kernel version by @zhaochenyang20 in #3950
  • [Bugfix] Fix tokenizer_manager not getting 400 when req is too long by @CatherineSue in #3678
  • [Feature] integrate Structural Tag in xgrammar backend for function calling by @minleminzui in #3566
  • SGLang + Verl by @fzyzcjy in #3852
  • Remove unused imports from rocm mla kernel. by @lcskrishna in #3963
  • Update cutlass dependency by @elfiegg in #3966
  • [Feature]Support ragged prefill in flashinfer mla backend by @Fridge003 in #3967
  • Docs: add type hint to sampling parameters by @zhaochenyang20 in #3975
  • Add redline to highlight main process by @zhaochenyang20 in #3977
  • rename FunctionCallReqInput to ParseFunctionCallReq by @zhaochenyang20 in #3976
  • Docs: add special warning to engine docs by @zhaochenyang20 in #3979
  • Revert "[MOE] enable efficient moe_alignment multi-blocks execution (3x~6x)" by @zhaochenyang20 in #3982
  • Move return_hidden_states to the generate input by @Qiaolin-Yu in #3985
  • Update CODEOWNERS by @merrymercy in #3989
  • add deepgemm and sglang fp8 block-wise gemm benchmark by @BBuf in #3893
  • fix typo by @BBuf in #3991
  • Fix all gather torch compile by @ispobock in #3992
  • Add accuracy test for TP torch compile by @ispobock in #3994
  • Enable custom AR for AMD GPUs and maintain it in sgl-kernel by @hubertlu-tw in #3406
  • Add Benchmark for DeepGEMM Group GEMM by @hebiao064 in #3993
  • [feat] add small vocab table for eagle's draft model[1]. by @Zhou-sx in #3822
  • Add fast decode plan for flashinfer mla by @Fridge003 in #3987
  • Revert "Add fast decode plan for flashinfer mla" by @merrymercy in #4008
  • Add examples to token-in-token-out for LLM by @zhaochenyang20 in #4010
  • Fix nightly-test CI by @yinfan98 in #3826
  • Optimize Triton Kernel of Group GEMM in DeepGEMM Benchmark by @hebiao064 in #4014
  • Improve code styles by @merrymercy in #4021
  • Clean up custom allreduce by @merrymercy in #4029
  • remove cache configs in model definitions by @merrymercy in #4031
  • Update metrics documentation by @binarycrayon in #3264
  • Reorganize c++ source files in sgl-kernel with multiple folders by @merrymercy in #4025
  • Reorganize python source files in sgl-kernel with multiple files by @merrymercy in #4027
  • Misc clean up; Remove the support of jump forward by @merrymercy in #4032
  • Docs: Fix sampling parameter by @zhaochenyang20 in #4034
  • Remove outdated test utils and fix links for the doc of sampling params by @Qiaolin-Yu in #3999
  • Add examples in sampling parameters by @zhaochenyang20 in #4039
  • Share target model embed and head weights for nextn by @ispobock in #4033
  • Add a link to the roadmap in README.md by @merrymercy in #4043
  • docs: update README by @zhyncs in #4044
  • Fix assert options.num_stages != 0 error in the latest ROCm build image by @kkHuang-amd in #4049
  • Reasoning parser by @xihuai18 in #4000
  • HotFix for #3988 using blockwise_int8 by @xihuai18 in #4023
  • Fix breakage problem when using custom_ar by @kkHuang-amd in #4052
  • ROCm: update aiter and its usage to fused moe (bloat16, fp8, fp8 block-quant) by @HaiShaw in #4053
  • Fix debug_tensor_dump_output_folder optional key missing by @Qubitium in #4046
  • Remove grafana dashboard's datasource uid by @kebe7jun in #4051
  • [Fix & Style] Refactor the grammar backend to reduce human errors and improve readability by @DarkSharpness in #4030
  • [XCCL] Use xccl for xpu backend since xccl is ready in latest PyTorch. by @cboss6 in #3954
  • sgl-router - issues on routing and project build. (#3870) by @michaelfeil in #3948
  • fix: support gelu_new activation function in gpt2 by @Xiuyu-Li in #3712
  • remove unused max_jobs by @sgjzfzzf in #3607
  • [Feature] Add test for speculative_token_map by @Achazwl in #4016
  • Revert "Fix nightly-test CI" by @merrymercy in #4065
  • Update nextn ci test by @ispobock in #4071
  • Simplify eagle tests and TP sync in grammar backend by @merrymercy in #4066
  • Add examples for returning hidden states when using the server by @Qiaolin-Yu in #4074
  • [Minor] more code cleanup by @merrymercy in #4077
  • test: add vlm to token in & out example by @mickqian in #3941
  • [QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization by @Qubitium in #3790
  • bench: add dataset param for bench_multiturn by @zeroorhero in #3990
  • ROCM: AITER BLOCK GEMM by @BruceXcluding in #4075
  • [Eagle] Refactor eagle speculative decoding by @Ying1123 in #3986
  • Fix the moe padding conditional logic by @HaiShaw in #4081
  • [Revision] Add fast decode plan for flashinfer mla by @Fridge003 in #4012
  • Fix triton kernel illegal memory issue for eagle by @ispobock in #4100
  • Add update_weights_from_disk endpoint to Engine by @jhinpan in #4102
  • Add DeepSeek optimization ablations documentation by @M0gician in #4107
  • reorganize dpsk docs by @zhaochenyang20 in #4108
  • Add examples for server token-in-token-out by @Qiaolin-Yu in #4103
  • revert deepseek docs by @zhyncs in #4109
  • Create release-docker-amd-nightly.yml by @saienduri in #4105
  • remove testing on PR workflow change by @saienduri in #4110
  • Debug radixcache: refactor recursive helper methods by @luzengxiangcn in #3029
  • Online serving benchmarks of real datasets for hierarchical KV caching by @PanJason in #3211
  • fix cross-reference error and spelling mistakes by @samzong in #4101
  • fix Non-consecutive header level increase in docs/router/router.md by @samzong in #4099
  • chore: bump v0.4.3.post3 by @zhyncs in #4114
  • [Hotfix] Fix incomplete token_to_kv_pool refactor by @Edenzzzz in #4121
  • Remove prefill-only-one-req by @merrymercy in #4117
  • Add a pointer to the real KV cache pool by @xiezhq-hermann in #4113
  • feat: support docs auto live-reload with sphinx-autobuild by @samzong in #4111
  • EAGLE docs by @simveit in #4038
  • Add codeowners for eagle implementations by @Ying1123 in #4131
  • Add tag suffix to nightly docker builds. by @saienduri in #4129
  • remove unused max_jobs in setup_rocm.py by @sgjzfzzf in #4126
  • Split the init of scheduler as smaller functions. Improve the eagle tests by @merrymercy in #4128
  • [Minor] make the __init__ function of model_runner.py shorter by @merrymercy in #4132
  • AMD/ROCm: update base image string by @kkHuang-amd in #4137
  • Update CODEOWNER by @merrymercy in #4138
  • fix bench serving bug by @Lzhang-hub in #4135
  • Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle by @merrymercy in #4134
  • Fix nightly ci Gsm8k & Fix flashinfer backend kvcache quant by @yinfan98 in #4147
  • Fix constrained generation errors by adding datasets dependency by @olliestanley in #4142
  • Release v0.4.3.post4 by @merrymercy in #4140
  • [docs] fix HF reference script command by @adarshxs in #4148
  • Docs: add torch compile cache by @zhaochenyang20 in #4151
  • Hot fix small vocab eagle in docs by @zhaochenyang20 in #4154
  • ROCm: enable trillion-parameter MoE models with INT4-FP8 single node by @HaiShaw in #4152
  • Add Support for Qwen2-VL Multi-modal Embedding Models by @Titan-p in #3694
  • [quant kernel] sgl-kernel support per_tensor_quant fp8 by @BBuf in #3786
  • Add sgl_per_token_quant_fp8 by @hebiao064 in #4089
  • [Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) by @HandH1998 in #3888
  • [Refactor] Reducing code duplication across FP8 CUDA quantization kernels by @hebiao064 in #4163
  • [Docs] Fix links and grammar issues by @windsonsea in #4162
  • Remove non-existent AMD header include by @hebiao064 in #4166
  • Put utils in ifndef USE_ROCM to fix CI (#4167) by @zhyncs in #4168
  • Memory pool fix for upstream change about eagle by @xiezhq-hermann in #4170
  • chore: bump v0.0.3.post7 for sgl-kernel by @zhyncs in #4176
  • Add an example of using deepseekv3 int8 sglang. by @sleepcoo in #4177
  • fix int8 doc link by @zhyncs in #4179
  • [Docs] Improve bullets appearance and grammar by @windsonsea in #4174
  • ROCm: Flex Attention Enablement with custom backends by @HaiShaw in #4178
  • Revert "ROCm: Flex Attention Enablement with custom backends (#4178)" by @zhyncs in #4186
  • use same version for ci and pyproject by @zhyncs in #4187
  • Fix eagle hang issue for max_new_tokens=1 by @ispobock in #4185
  • Update amd ci docker image to v0.4.3.post4-rocm630. by @saienduri in #4189
  • New clang format for sgl kernel by @merrymercy in #4194
  • Remove the vllm dependency from the moe_align function by @sleepcoo in #4164
  • Minor improvement to per_tensor_quant_fp8 by @zcnrex in #4197
  • Revert "Minor improvement to per_tensor_quant_fp8 (#4197)" by @zhyncs in #4198
  • lazy import attn backends by @merrymercy in #4200
  • Fix bench_serving flush cache not recognizing OPENAI_API_KEY by @brighill in #4181
  • Use clang format 18 in pr-test-sgl-kernel.yml by @merrymercy in #4203
  • Refactor Dockerfile: unify CUDA logic and reduce image size by ~2.6 GB by @kebe7jun in #3749
  • Test no vllm custom allreduce by @merrymercy in #4210
  • refine quant kernel code style by @BBuf in #4211
  • Split test_mla.py into two files (deepseek v2 and deepseek v3) by @merrymercy in #4216
  • docs(reasoning content): 📝 deepseek-r1 parser support qwq by @xihuai18 in #4124
  • revert pr 3628 to pass test_mla ci by @BBuf in #4219
  • use latest sgl-kernel for mla test by @zhyncs in #4222
  • Rename files in sgl kernel to avoid nested folder structure by @merrymercy in #4213
  • chore: bump v0.0.4 for sgl-kernel by @zhyncs in #4223
  • Lazily import lora backends by @merrymercy in #4225
  • [docker] Distributed Serving with k8s Statefulset ( good example for DeepSeek-R1) by @panpan0000 in #3631
  • [docs] Unhide production metrics page by @hebiao064 in #4193
  • use sgl-kernel 0.0.4 by @zhyncs in #4224
  • Support nextn for flashinfer mla attention backend by @Fridge003 in #4218
  • Apply sgl w8a8 fp8 kernel by @HandH1998 in #3148
  • Check eagle server args by @Ying1123 in #4217
  • update sgl-kernel 3rdparty by @zhyncs in #4228
  • Update bench speculative script by @ispobock in #4235
  • Fix test of flashinfer mla with nextn by @Fridge003 in #4237
  • Move rope and bmm into sgl-kernel by @merrymercy in #4241
  • Revert "Check eagle server args" by @merrymercy in #4242
  • Minor style fix for sgl-kernel by @merrymercy in #4243
  • Auto balance CI tests by @merrymercy in #4238
  • Clean up fp8 support by @merrymercy in #4230
  • Move activation.cu to sgl-kernel/elementwise by @merrymercy in #4250
  • DeepGemm integrate to sgl-kernel by @laixinn in #4165
  • [Bug fixed] fixed the crash when enable the dp-attention on the single card by @DavidChan0519 in #3958
  • Added example for multimodal embedding by @simveit in #4206
  • Simplify tests & Fix trtllm custom allreduce registration by @merrymercy in #4252
  • fix the input_ids is None error by @Young1993 in #4144
  • fix per_token_group_quant_fp8 illegal memory when num_groups % 16 != 0 by @BBuf in #4231
  • Release sgl-kernel v0.0.4.post1 by @merrymercy in #4255
  • Fix quantization and nightly tests by @merrymercy in #4258
  • increase the timeout of nightly-test.yml by @merrymercy in #4262
  • Optimize rope in sgl kernel by @merrymercy in #4267
  • Test no vllm custom allreduce by @merrymercy in #4256
  • Amd test fp8 by @HandH1998 in #4261
  • add THIRDPARTYNOTICES for DeepGEMM by @zhyncs in #4272
  • upgrade xgrammar 0.1.15 by @zhyncs in #4275
  • Fix nightly eval for neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 by @merrymercy in #4279
  • Update cutlass dependency for its bug fix. by @elfiegg in #4277
  • update deepgemm by @zhyncs in #4284
  • bump sgl-kernel 0.0.4.post2 by @zhyncs in #4288
  • Add A800 tuning configs support DeepSeek V3/R1 BF16 and INT8(block-wise) by @lambert0312 in #4136
  • update sgl-kernel 0.0.4.post2 by @zhyncs in #4291
  • linear support deepgemm by @sleepcoo in #4199
  • Update MTP doc by @ispobock in #4290
  • Add A100 tuning configs for DeepSeek R1/V3 channel-wise INT8 by @yych0745 in #4287
  • update doc by @zhyncs in #4299
  • [AMD] Fix rocm sgl-kernel missing modules error by @BruceXcluding in #4311
  • Add H20 tuning configs support DeepSeek V3/R1 INT8(block-wise) by @Ximingwang-09 in #4220
  • refactor: move image processors to separate files by @mickqian in #4229
  • upgrade flashinfer 0.2.3 by @zhyncs in #4317
  • unify is_cuda and is_hip by @zhyncs in #4321
  • Add A800 tuning configs for DeepSeek R1/V3 channel-wise INT8 by @lambert0312 in #4323
  • [Docs] Clean up benchmark_and_profiling.md by @windsonsea in #4297
  • refine sgl_moe_align_block_size_benchmark by @BBuf in #4327
  • Remove vllm ops scaled fp8 quant and accelerate per token quant by 20-28% by @hebiao064 in #4215
  • Add awq dequantize kernel to sgl with 1x to 3x speedup by @zcnrex in #4104
  • fix awq_dequantize by @zhyncs in #4333
  • release 0.0.4.post3 sgl-kernel by @zhyncs in #4331
  • upgrade sgl-kernel 0.0.4.post3 by @zhyncs in #4334
  • Add INT8 support MTP NextN function by @lambert0312 in #3911
  • [Fix] fix _yarn_linear_ramp_mask with device parameter by @Alcanderian in #4337
  • remove the unused readline dependency from the Qwen2 model implementa… by @yych0745 in #4340
  • model: Support Janus-pro by @mickqian in #3203
  • Hierarchical Caching Refactoring and Fixing TP issue by @xiezhq-hermann in #4082
  • Support Blackwell Block Scale FP8 Gemm by @elfiegg in #4278
  • typo: Update http_server.py by @WrRan in #4350
  • Update nightly tests by @merrymercy in #4352
  • [Fix Doc.] Enable internal forwarding when starting the router by @shizhediao in #4355
  • Move output processing logic from scheduler.py into a separate file by @merrymercy in #4354
  • Fix scheduler proctitle suffix is None by @cnwenf in #4326
  • feat: support ep size < 32 for sgl kernel by @shuaills in #4348
  • Fix per token fp8 quant precision by @qingquansong in #4362
  • Remove the choices in --speculative-eagle-topk argument by @Achazwl in #4329
  • docs: add parameter --log-requests-level by @panpan0000 in #4335
  • simple bugfix by @WrRan in #4342
  • Fix the doc of FR-Spec by @Achazwl in #4295
  • [Fix] Check the device backend before calling empty_cache function by @cboss6 in #4212
  • [FIX] fix incorrect output when enable both deepgemm and torch compile by @AniZpZ in #4359
  • add INT8 example into dsv3 README by @laixinn in #4079
  • Avoid duplicated request ids in batch APIs by @tanconghui in #4026
  • example: add async offline inference demo by @kuizhiqing in #3961
  • Add device detection and count functions to utils. by @vshekhawat-hlab in #3962
  • Move aiohttp into public dependencies by @stevapple in #3980
  • [tools] add fp8 max/min constant in utils by @yiakwy-xpu-ml-framework-team in #3959
  • HotFix: json serialization error when using OAI v1/batches endpoint with logprobs by @dcfidalgo in #3896
  • [docs] Update outdated description about torch.compile by @junliu-mde in #3844
  • [Doc] Fix typo in backend/sampling_params by @yang-ybb in #3835
  • Ensure Usage Data in Streaming Responses Aligns with vLLM’s Implementation by @HermitSun in #3814
  • [moe] fix: correct the cache size in the last chunk by @ch-wan in #3679
  • Support page size > 1 by @merrymercy in #4356
  • [XPU][CPU] Enable the native path of DeepSeek by @airMeng in #4086
  • Revert "[XPU][CPU] Enable the native path of DeepSeek" by @merrymercy in #4367
  • Update grafana.json by @dblate in #4374
  • fix accuracy issue by @zhyncs in #4376
  • bump 0.0.5 sgl-kernel by @zhyncs in #4377
  • upgrade sgl-kernel 0.0.5 by @zhyncs in #4381
  • chore: bump v0.4.4 by @zhyncs in #4041

New Contributors

  • @yosoyjay made their first contribution in #3505
  • @FrankLeeeee made their first contribution in #3548
  • @Jiadalee made their first contribution in #3598
  • @whybeyoung made their first contribution in #3624
  • @fsx950223 made their first contribution in #3665
  • @panpan0000 made their first contribution in #3700
  • @ch-wan made their first contribution in #3692
  • @Chen-XiaoBing made their first contribution in #3705
  • @aoshen524 made their first contribution in #3652
  • @trayvonpan made their first contribution in #3588
  • @zixuanzhang226 made their first contribution in #3680
  • @andjsmi made their first contribution in #3740
  • @shahizat made their first contribution in #3761
  • @laixinn made their first contribution in #3730
  • @He1pa made their first contribution in #3799
  • @wilsonwu made their first contribution in #3741
  • @yuanheng-zhao made their first contribution in #3641
  • @nvcastet made their first contribution in #3709
  • @hcyz33 made their first contribution in #3841
  • @kebe7jun made their first contribution in #3519
  • @JC1DA made their first contribution in #3298
  • @Chi-Chu319 made their first contribution in #3898
  • @Qiaolin-Yu made their first contribution in #3897
  • @xqoasis made their first contribution in #3905
  • @KCFindstr made their first contribution in #3866
  • @elfiegg made their first contribution in #3966
  • @Zhou-sx made their first contribution in #3822
  • @xihuai18 made their first contribution in #4000
  • @cboss6 made their first contribution in #3954
  • @Xiuyu-Li made their first contribution in #3712
  • @sgjzfzzf made their first contribution in #3607
  • @zeroorhero made their first contribution in #3990
  • @samzong made their first contribution in #4101
  • @olliestanley made their first contribution in #4142
  • @windsonsea made their first contribution in #4162
  • @zcnrex made their first contribution in #4197
  • @brighill made their first contribution in #4181
  • @DavidChan0519 made their first contribution in #3958
  • @Young1993 made their first contribution in #4144
  • @lambert0312 made their first contribution in #4136
  • @yych0745 made their first contribution in #4287
  • @Ximingwang-09 made their first contribution in #4220
  • @Alcanderian made their first contribution in #4337
  • @shizhediao made their first contribution in #4355
  • @cnwenf made their first contribution in #4326
  • @qingquansong made their first contribution in #4362
  • @AniZpZ made their first contribution in #4359
  • @tanconghui made their first contribution in #4026
  • @kuizhiqing made their first contribution in #3961
  • @vshekhawat-hlab made their first contribution in #3962
  • @stevapple made their first contribution in #3980
  • @dcfidalgo made their first contribution in #3896
  • @junliu-mde made their first contribution in #3844
  • @yang-ybb made their first contribution in #3835
  • @airMeng made their first contribution in #4086
  • @dblate made their first contribution in #4374

Full Changelog: v0.4.3...v0.4.4
