Release v0.4.4


Highlights

The SGLang team is excited to announce the release of v0.4.4, and we will keep improving DeepSeek V3/R1 performance. With the combined FlashInfer, MTP, DeepGEMM, and Torch Compile optimizations on H200, SGLang achieves nearly 100 tokens/s for DeepSeek V3/R1, which is currently the fastest open-source implementation. Look out for more optimizations coming soon!
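
For reference, here is a minimal launch sketch that combines these optimizations on a single H200 node. The --enable-flashinfer-mla flag and the SGL_ENABLE_JIT_DEEPGEMM variable are described in the Optimizations section below; the model path, TP size, and speculative-decoding values are illustrative assumptions and should be tuned (e.g., with the bench_speculative script) for your hardware rather than copied verbatim.

    # Sketch: single-node H200 launch enabling FlashInfer MLA, MTP (NextN),
    # DeepGEMM JIT kernels, and Torch Compile. Values are illustrative, not tuned.
    export SGL_ENABLE_JIT_DEEPGEMM=1
    python3 -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-V3 \
      --tp 8 \
      --trust-remote-code \
      --enable-flashinfer-mla \
      --enable-torch-compile \
      --speculative-algorithm NEXTN \
      --speculative-num-steps 2 \
      --speculative-eagle-topk 4 \
      --speculative-num-draft-tokens 4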

Many thanks to the xAI, NVIDIA, AMD, LinkedIn, Baseten, Oracle, and Meituan teams, and to the open-source community users, for their contributions!

Beyond the adopters mentioned in the announcement, teams such as Tencent and Ant Group are also using SGLang to accelerate DeepSeek R1 inference. We are very happy to have received recognition and usage from these teams!

There will surely be bugs to discover and fixes to ship in the coming days, including today :) Let's build and ship. Please feel free to join our Slack channel at https://slack.sglang.ai/. Cheers!

Optimizations

  • AMD Performance Leadership: SGLang is now the fastest LLM engine for DeepSeek V3/R1 inference on AMD hardware, as confirmed by AMD's technical blog

  • Enhanced FlashInfer MLA Support: Now fully compatible with radix cache, chunked prefill, and MTP optimizations - enable with
    --enable-flashinfer-mla

  • Advanced MTP Capabilities: Both Triton and FlashInfer backends now offer comprehensive Multi-Token Prediction support, easily tunable via the bench_speculative script

  • DeepGEMM Integration: Full integration of DeepGEMM for NVIDIA Hopper architectures - enable with
    export SGL_ENABLE_JIT_DEEPGEMM=1

  • Pioneering INT8 Quantization: the first industry implementation of INT8 support for DeepSeek R1 models, covering both block-wise (#3730) and channel-wise (#3888) quantization

  • Other Optimizations:

    • Blackwell architecture Block Scale FP8 GEMM support

    • Support page size greater than 1 #4356

    • Optimized W8A8 FP8 implementation with performance gains across all architectures (sm80, sm89, sm90), featuring 15%+ improvement specifically on sm89

    • Enhanced distributed parallelism capabilities (e.g., two-node configurations with DP 2, TP 8; see the sketch after this list) #4390
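
To make the distributed-parallelism bullet concrete, here is a hedged two-node sketch (two DP replicas of TP 8, 16 GPUs total). The --dist-init-addr, --nnodes, and --node-rank flags are standard SGLang multi-node arguments; the model path and address are placeholder assumptions, and #4390 covers the configuration exercised in this release.

    # Sketch: two-node deployment with DP 2, TP 8; run one command per node.
    # Node 0 (replace 10.0.0.1:5000 with the address of your first node):
    python3 -m sglang.launch_server \
      --model-path deepseek-ai/DeepSeek-R1 \
      --trust-remote-code \
      --dp 2 --tp 8 \
      --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0
    # Node 1: run the identical command with --node-rank 1.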

Coming soon

  • Integrate Flash Attention #4385

  • Integrate FlashMLA #4384

  • EAGLE 2 optimization #4383

  • EAGLE 3 day one support #4247

  • Integrate DeepEP #4232

  • Prefill and Decoding Disaggregation

What's Changed

  • update flashinfer-python by @zhyncs in #3557
  • fix doc by @zhyncs in #3558
  • Add support for OpenAI API o1 model by @ChuyueSun in #3363
  • fix sgl-kernel codestyle by @BBuf in #3563
  • docs: update install by @zhyncs in #3581
  • Copy config files for MI300X to support in virtualized environments by @yosoyjay in #3505
  • ROCm docker: triton update by @HaiShaw in #3584
  • [fix] added support for vlm in offline inference by @FrankLeeeee in #3548
  • Support NextN (MTP) speculative decoding for DeepSeek-V3/R1 by @ispobock in #3582
  • [CI] Improve Docs CI Efficiency by @shuaills in #3587
  • doc: emphasize and notify the usage of chat_template by @mickqian in #3589
  • fix eagle unit test by @zhyncs in #3591
  • fix high qps crash when enable mtp by @zhyncs in #3592
  • fix apply_token_bitmask_inplace_cuda by @zhyncs in #3594
  • [docs] added favicon to sphinx html by @FrankLeeeee in #3564
  • fix lockfile and port_registry file permission error by @Jiadalee in #3598
  • feat: Support Qwen 2.5 vl by @mickqian in #3258
  • [ROCm] Use tl.range() in block GEMM kernels with num_stages set by host. by @whchung in #3535
  • Update to latest amd image. by @saienduri in #3597
  • Benchmark for reasoning models by @simveit in #3532
  • Draft of updated doc for sampling params. by @simveit in #3260
  • [docs] Update sampling_params.md by @shuaills in #3617
  • [docker] added rdma support by @FrankLeeeee in #3619
  • Revert "[ROCm] Use tl.range() in block GEMM kernels with `num_stage… by @zhyncs in #3632
  • add mtp unit test by @zhyncs in #3634
  • update unit test by @zhyncs in #3636
  • chore: bump v0.4.3.post1 by @zhyncs in #3638
  • h800 deepseek r1 config and support multi-gpu block-gemm tuning by @BBuf in #3639
  • feat: support flashinfer mla with prefix cache by @zhyncs in #3643
  • chore: update flashinfer v0.2.1.post2 by @zhyncs in #3644
  • chore: bump v0.4.3.post2 by @zhyncs in #3645
  • use transformers 4.48.3 by @zhyncs in #3650
  • [ROCm] Add additional block quant GEMM tuning configs for AMD GPUs. by @whchung in #3616
  • [ROCm] Optimal MOE Tuning for AMD Radeon Graphics by @BruceXcluding in #3567
  • Deploy multi-node inference (LWS method) using sglang in a K8s cluster by @whybeyoung in #3624
  • Update amd docker image. by @saienduri in #3654
  • [Feature] Apply Cublas Grouped Gemm kernel by @Fridge003 in #3629
  • update pr-test by @zhyncs in #3663
  • Fix draft decode max batch size by @ispobock in #3676
  • fix: remove dependency on latest transformers impl by @mickqian in #3635
  • AMD Prefill optimize by @fsx950223 in #3665
  • fix: apply cache size limit of attention mask for VisionAttention by @mickqian in #3657
  • set NCCL_IB_GID_INDEX=3 for multi node NVIDIA InfiniBand if needed by @zhyncs in #3698
  • use warp shuffle style reduce and flashinfer vectorize by @BBuf in #3628
  • [Docs] Add SkyPilot DeepSeek example by @Michaelvll in #3706
  • [k8s] remove unnecessary hostIPC for security concern by @panpan0000 in #3700
  • [moe] optim: reduce memory consumption in fused_moe by @ch-wan in #3692
  • [Improve] Fix Multi-User Port Allocation Conflicts by @shuaills in #3601
  • Variance measure for reasoning benchmark by @simveit in #3677
  • Docs: Fix layout with sub-section by @zhaochenyang20 in #3710
  • add control for cutlass fp8 blockwise gemm by @yizhang2077 in #3727
  • revert BLOCK and num_warps on HIP by @HaiShaw in #3722
  • Optimize triton attention custom mask by @ispobock in #3731
  • [Bugfix] Fix scores mask for moe topk by @Chen-XiaoBing in #3705
  • [Docs] Modify ep related server args and remove cublas part of deepseek by @Fridge003 in #3732
  • [Fix] Fix bugs and refactor codes in lora for better scalability. by @aoshen524 in #3652
  • docs: fix 404 link by @trayvonpan in #3588
  • [docs] added torch.compile cache to dpsk manual by @FrankLeeeee in #3737
  • AMD/ROCm: update AITER repo to ROCm/aiter by @HaiShaw in #3747
  • feat: update grouped_topk to support softmax and sigmoid by @zixuanzhang226 in #3680
  • feat: Add SageMaker support by @andjsmi in #3740
  • Change description of nvidia jetson docs by @shahizat in #3761
  • [Fix] fix OpenAI API adapter tokenizer encoding by @shuaills in #3432
  • [bug] fixed batch api by @FrankLeeeee in #3754
  • Adjustments to docs by @simveit in #3733
  • docs: Add offline engine launch example and documentation by @shuaills in #3771
  • Update offline_engine_api.ipynb by @zhaochenyang20 in #3773
  • Support Qwen RM model. by @simveit in #3772
  • Add support for nvidia modelopt fp8 kv cache by @Edwardf0t1 in #3223
  • Tiny fix Olmo2 by @fzyzcjy in #3348
  • fix lm head weights in Qwen models by @zhaochenyang20 in #3777
  • Fix weight loader error when LM head weights are tied by @fzyzcjy in #3766
  • Let DetokenizerManager use TypeBasedDispatcher by @fzyzcjy in #3117
  • bench: Add a benchmark for VLM: MMMU by @mickqian in #3562
  • Extract generation_manager from tokenizer_manager by @fzyzcjy in #3115
  • Rename TokenizerManager to StdOrchestrator by @fzyzcjy in #3116
  • [Docs]Add instruction for manually stopping nsys profiler by @Fridge003 in #3795
  • Hierarchical Caching for SGLang by @xiezhq-hermann in #2693
  • Update readme by @merrymercy in #3809
  • Fix dependency by @merrymercy in #3813
  • Refactor flashinfer logic for deepseek v3 and fix accuracy bug by @Fridge003 in #3785
  • Feature DeepSeek V3/R1 INT8 Quantization (block-wise) by @laixinn in #3730
  • Fix pandas dependency in CI by @merrymercy in #3818
  • Revert "Rename TokenizerManager to StdOrchestrator" by @merrymercy in #3828
  • Revert "Extract generation_manager from tokenizer_manager" by @merrymercy in #3829
  • Fix CI and install docs by @merrymercy in #3821
  • typos by @WrRan in #3801
  • doc: fix dead link in router.md by @He1pa in #3799
  • Fix doc site copyright to current year by @wilsonwu in #3741
  • [Doc] Fix typo in server-argument description by @yuanheng-zhao in #3641
  • [ROCm] Enable Fused MLA Triton kernel for DeepSeekV3 by @lcskrishna in #3237
  • [BugFix]: Add missing clamp to llavavid by @PanJason in #3787
  • [dist] made timeout configurable by @FrankLeeeee in #3803
  • Fix allgather ops inside cuda graphs by @nvcastet in #3709
  • fix capture_bs by @fsx950223 in #3857
  • [BugFix] Fix crash when receive a req with structed output in DP attention mode. by @hcyz33 in #3841
  • Fix maximum recursion depth triggered on exception exit by @kebe7jun in #3519
  • [doc] added quantization doc for dpsk by @FrankLeeeee in #3843
  • [doc] fixed dpsk quant faq by @FrankLeeeee in #3865
  • Expert Parallelism (EP) Support for DeepSeek V3/R1 by @sleepcoo in #3602
  • Revert recent changes by @simveit in #3845
  • Feature/improve docs by @simveit in #3860
  • [Feature] Support llguidance for constrained decoding by @JC1DA in #3298
  • Move dpsk docs forward a step by @zhaochenyang20 in #3894
  • Docs: Reorganize dpsk links by @zhaochenyang20 in #3900
  • Implemented frontend docs by @simveit in #3791
  • [doc] update sponsorship by @whybeyoung in #3903
  • [Rocm] Fix to the rocm_mla_decode_rope.py returning random result by @Chi-Chu319 in #3898
  • [doc] Update document for flashinfer mla by @Fridge003 in #3907
  • Add return hidden state in the native API by @Qiaolin-Yu in #3897
  • [Docs] Disable notebook CI when merge to main by @xqoasis in #3905
  • [Docs] Improve DPSK docs in dark mode by @hebiao064 in #3914
  • [Doc] Add experimental tag for flashinfer mla by @Fridge003 in #3925
  • Tuning Script for Feature DeepSeek V3/R1 INT8 Quantization (block-wise) by @laixinn in #3922
  • xgrammar 0.1.14 by @qeternity in #3593
  • revert "Docs: Reorganize dpsk links #3900" by @zhyncs in #3933
  • upgrade flashinfer v0.2.2.post1 by @zhyncs in #3934
  • Fix the doc link for sampling params by @Qiaolin-Yu in #3861
  • [feat] Add Vertex AI compatible prediction route for /generate by @KCFindstr in #3866
  • [MOE] enable efficient moe_alignment multi-blocks execution (3x~6x) by @yiakwy-xpu-ml-framework-team in #3613
  • Fix bench_serving not recognizing OPENAI_API_KEY by @kebe7jun in #3870
  • set a strict sgl-kernel version by @zhaochenyang20 in #3950
  • [Bugfix] Fix tokenizer_manager not getting 400 when req is too long by @CatherineSue in #3678
  • [Feature] integrate Structural Tag in xgrammar backend for function calling by @minleminzui in #3566
  • SGLang + Verl by @fzyzcjy in #3852
  • Remove unused imports from rocm mla kernel. by @lcskrishna in #3963
  • Update cutlass dependency by @elfiegg in #3966
  • [Feature]Support ragged prefill in flashinfer mla backend by @Fridge003 in #3967
  • Docs: add type hint to sampling parameters by @zhaochenyang20 in #3975
  • Add redline to highlight main process by @zhaochenyang20 in #3977
  • rename FunctionCallReqInput to ParseFunctionCallReq by @zhaochenyang20 in #3976
  • Docs: add special warning to engine docs by @zhaochenyang20 in #3979
  • Revert "[MOE] enable efficient moe_alignment multi-blocks execution (3x~6x)" by @zhaochenyang20 in #3982
  • Move return_hidden_states to the generate input by @Qiaolin-Yu in #3985
  • Update CODEOWNERS by @merrymercy in #3989
  • add deepgemm and sglang fp8 block-wise gemm benchmark by @BBuf in #3893
  • fix typo by @BBuf in #3991
  • Fix all gather torch compile by @ispobock in #3992
  • Add accuracy test for TP torch compile by @ispobock in #3994
  • Enable custom AR for AMD GPUs and maintain it in sgl-kernel by @hubertlu-tw in #3406
  • Add Benchmark for DeepGEMM Group GEMM by @hebiao064 in #3993
  • [feat] add small vocab table for eagle's draft model[1]. by @Zhou-sx in #3822
  • Add fast decode plan for flashinfer mla by @Fridge003 in #3987
  • Revert "Add fast decode plan for flashinfer mla" by @merrymercy in #4008
  • Add examples to token-in-token-out for LLM by @zhaochenyang20 in #4010
  • Fix nightly-test CI by @yinfan98 in #3826
  • Optimize Triton Kernel of Group GEMM in DeepGEMM Benchmark by @hebiao064 in #4014
  • Improve code styles by @merrymercy in #4021
  • Clean up custom allreduce by @merrymercy in #4029
  • remove cache configs in model definitions by @merrymercy in #4031
  • Update metrics documentation by @binarycrayon in #3264
  • Reorganize c++ source files in sgl-kernel with multiple folders by @merrymercy in #4025
  • Reorganize python source files in sgl-kernel with multiple files by @merrymercy in #4027
  • Misc clean up; Remove the support of jump forward by @merrymercy in #4032
  • Docs: Fix sampling parameter by @zhaochenyang20 in #4034
  • Remove outdated test utils and fix links for the doc of sampling params by @Qiaolin-Yu in #3999
  • Add examples in sampling parameters by @zhaochenyang20 in #4039
  • Share target model embed and head weights for nextn by @ispobock in #4033
  • Add a link to the roadmap in README.md by @merrymercy in #4043
  • docs: update README by @zhyncs in #4044
  • Fix assert options.num_stages != 0 error in the latest ROCm build image by @kkHuang-amd in #4049
  • Reasoning parser by @xihuai18 in #4000
  • HotFix for #3988 using blockwise_int8 by @xihuai18 in #4023
  • Fix breakage problem when using custom_ar by @kkHuang-amd in #4052
  • ROCm: update aiter and its usage to fused moe (bloat16, fp8, fp8 block-quant) by @HaiShaw in #4053
  • Fix debug_tensor_dump_output_folder optional key missing by @Qubitium in #4046
  • Remove grafana dashboard's datasource uid by @kebe7jun in #4051
  • [Fix & Style] Refactor the grammar backend to reduce human errors and improve readability by @DarkSharpness in #4030
  • [XCCL] Use xccl for xpu backend since xccl is ready in latest PyTorch. by @cboss6 in #3954
  • sgl-router - issues on routing and project build. (#3870) by @michaelfeil in #3948
  • fix: support gelu_new activation function in gpt2 by @Xiuyu-Li in #3712
  • remove unused max_jobs by @sgjzfzzf in #3607
  • [Feature] Add test for speculative_token_map by @Achazwl in #4016
  • Revert "Fix nightly-test CI" by @merrymercy in #4065
  • Update nextn ci test by @ispobock in #4071
  • Simplify eagle tests and TP sync in grammar backend by @merrymercy in #4066
  • Add examples for returning hidden states when using the server by @Qiaolin-Yu in #4074
  • [Minor] more code cleanup by @merrymercy in #4077
  • test: add vlm to token in & out example by @mickqian in #3941
  • [QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization by @Qubitium in #3790
  • bench: add dataset param for bench_multiturn by @zeroorhero in #3990
  • ROCM: AITER BLOCK GEMM by @BruceXcluding in #4075
  • [Eagle] Refactor eagle speculative decoding by @Ying1123 in #3986
  • Fix the moe padding conditional logic by @HaiShaw in #4081
  • [Revision] Add fast decode plan for flashinfer mla by @Fridge003 in #4012
  • Fix triton kernel illegal memory issue for eagle by @ispobock in #4100
  • Add update_weights_from_disk endpoint to Engine by @jhinpan in #4102
  • Add DeepSeek optimization ablations documentation by @M0gician in #4107
  • reorganize dpsk docs by @zhaochenyang20 in #4108
  • Add examples for server token-in-token-out by @Qiaolin-Yu in #4103
  • revert deepseek docs by @zhyncs in #4109
  • Create release-docker-amd-nightly.yml by @saienduri in #4105
  • remove testing on PR workflow change by @saienduri in #4110
  • Debug radixcache: refactor recursive helper methods by @luzengxiangcn in #3029
  • Online serving benchmarks of real datasets for hierarchical KV caching by @PanJason in #3211
  • fix cross-reference error and spelling mistakes by @samzong in #4101
  • fix Non-consecutive header level increase in docs/router/router.md by @samzong in #4099
  • chore: bump v0.4.3.post3 by @zhyncs in #4114
  • [Hotfix] Fix incomplete token_to_kv_pool refactor by @Edenzzzz in #4121
  • Remove prefill-only-one-req by @merrymercy in #4117
  • Add a pointer to the real KV cache pool by @xiezhq-hermann in #4113
  • feat: support docs auto live-reload with sphinx-autobuild by @samzong in #4111
  • EAGLE docs by @simveit in #4038
  • Add codeowners for eagle implementations by @Ying1123 in #4131
  • Add tag suffix to nightly docker builds. by @saienduri in #4129
  • remove unused max_jobs in setup_rocm.py by @sgjzfzzf in #4126
  • Split the init of scheduler as smaller functions. Improve the eagle tests by @merrymercy in #4128
  • [Minor] make the __init__ function of model_runner.py shorter by @merrymercy in #4132
  • AMD/ROCm: update base image string by @kkHuang-amd in #4137
  • Update CODEOWNER by @merrymercy in #4138
  • fix bench serving bug by @Lzhang-hub in #4135
  • Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle by @merrymercy in #4134
  • Fix nightly ci Gsm8k & Fix flashinfer backend kvcache quant by @yinfan98 in #4147
  • Fix constrained generation errors by adding datasets dependency by @olliestanley in #4142
  • Release v0.4.3.post4 by @merrymercy in #4140
  • [docs] fix HF reference script command by @adarshxs in #4148
  • Docs: add torch compile cache by @zhaochenyang20 in #4151
  • Hot fix small vocab eagle in docs by @zhaochenyang20 in #4154
  • ROCm: enable trillion-parameter MoE models with INT4-FP8 single node by @HaiShaw in #4152
  • Add Support for Qwen2-VL Multi-modal Embedding Models by @Titan-p in #3694
  • [quant kernel] sgl-kernel support per_tensor_quant fp8 by @BBuf in #3786
  • Add sgl_per_token_quant_fp8 by @hebiao064 in #4089
  • [Feature] DeepSeek V3/R1 INT8 Quantization (channel-wise) by @HandH1998 in #3888
  • [Refactor] Reducing code duplication across FP8 CUDA quantization kernels by @hebiao064 in #4163
  • [Docs] Fix links and grammar issues by @windsonsea in #4162
  • Remove non-existent AMD header include by @hebiao064 in #4166
  • Put utils in ifndef USE_ROCM to fix CI (#4167) by @zhyncs in #4168
  • Memory pool fix for upstream change about eagle by @xiezhq-hermann in #4170
  • chore: bump v0.0.3.post7 for sgl-kernel by @zhyncs in #4176
  • Add an example of using deepseekv3 int8 sglang. by @sleepcoo in #4177
  • fix int8 doc link by @zhyncs in #4179
  • [Docs] Improve bullets appearance and grammar by @windsonsea in #4174
  • ROCm: Flex Attention Enablement with custom backends by @HaiShaw in #4178
  • Revert "ROCm: Flex Attention Enablement with custom backends (#4178)" by @zhyncs in #4186
  • use same version for ci and pyproject by @zhyncs in #4187
  • Fix eagle hang issue for max_new_tokens=1 by @ispobock in #4185
  • Update amd ci docker image to v0.4.3.post4-rocm630. by @saienduri in #4189
  • New clang format for sgl kernel by @merrymercy in #4194
  • Remove the vllm dependency from the moe_align function by @sleepcoo in #4164
  • Minor improvement to per_tensor_quant_fp8 by @zcnrex in #4197
  • Revert "Minor improvement to per_tensor_quant_fp8 (#4197)" by @zhyncs in #4198
  • lazy import attn backends by @merrymercy in #4200
  • Fix bench_serving flush cache not recognizing OPENAI_API_KEY by @brighill in #4181
  • Use clang format 18 in pr-test-sgl-kernel.yml by @merrymercy in #4203
  • Refactor Dockerfile: unify CUDA logic and reduce image size by ~2.6 GB by @kebe7jun in #3749
  • Test no vllm custom allreduce by @merrymercy in #4210
  • refine quant kernel code style by @BBuf in #4211
  • Split test_mla.py into two files (deepseek v2 and deepseek v3) by @merrymercy in #4216
  • docs(reasoning content): 📝 deepseek-r1 parser support qwq by @xihuai18 in #4124
  • revert pr 3628 to pass test_mla ci by @BBuf in #4219
  • use latest sgl-kernel for mla test by @zhyncs in #4222
  • Rename files in sgl kernel to avoid nested folder structure by @merrymercy in #4213
  • chore: bump v0.0.4 for sgl-kernel by @zhyncs in #4223
  • Lazily import lora backends by @merrymercy in #4225
  • [docker] Distributed Serving with k8s Statefulset ( good example for DeepSeek-R1) by @panpan0000 in #3631
  • [docs] Unhide production metrics page by @hebiao064 in #4193
  • use sgl-kernel 0.0.4 by @zhyncs in #4224
  • Support nextn for flashinfer mla attention backend by @Fridge003 in #4218
  • Apply sgl w8a8 fp8 kernel by @HandH1998 in #3148
  • Check eagle server args by @Ying1123 in #4217
  • update sgl-kernel 3rdparty by @zhyncs in #4228
  • Update bench speculative script by @ispobock in #4235
  • Fix test of flashinfer mla with nextn by @Fridge003 in #4237
  • Move rope and bmm into sgl-kernel by @merrymercy in #4241
  • Revert "Check eagle server args" by @merrymercy in #4242
  • Minor style fix for sgl-kernel by @merrymercy in #4243
  • Auto balance CI tests by @merrymercy in #4238
  • Clean up fp8 support by @merrymercy in #4230
  • Move activation.cu to sgl-kernel/elementwise by @merrymercy in #4250
  • DeepGemm integrate to sgl-kernel by @laixinn in #4165
  • [Bug fixed] fixed the crash when enable the dp-attention on the single card by @DavidChan0519 in #3958
  • Added example for multimodal embedding by @simveit in #4206
  • Simplify tests & Fix trtllm custom allreduce registration by @merrymercy in #4252
  • fix the input_ids is None error by @Young1993 in #4144
  • fix per_token_group_quant_fp8 illegal memory when num_groups % 16 != 0 by @BBuf in #4231
  • Release sgl-kernel v0.0.4.post1 by @merrymercy in #4255
  • Fix quantization and nightly tests by @merrymercy in #4258
  • increase the timeout of nightly-test.yml by @merrymercy in #4262
  • Optimize rope in sgl kernel by @merrymercy in #4267
  • Test no vllm custom allreduce by @merrymercy in #4256
  • Amd test fp8 by @HandH1998 in #4261
  • add THIRDPARTYNOTICES for DeepGEMM by @zhyncs in #4272
  • upgrade xgrammar 0.1.15 by @zhyncs in #4275
  • Fix nightly eval for neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 by @merrymercy in #4279
  • Update cutlass dependency for its bug fix. by @elfiegg in #4277
  • update deepgemm by @zhyncs in #4284
  • bump sgl-kernel 0.0.4.post2 by @zhyncs in #4288
  • Add A800 tuning configs support DeepSeek V3/R1 BF16 and INT8(block-wise) by @lambert0312 in #4136
  • update sgl-kernel 0.0.4.post2 by @zhyncs in #4291
  • linear support deepgemm by @sleepcoo in #4199
  • Update MTP doc by @ispobock in #4290
  • Add A100 tuning configs for DeepSeek R1/V3 channel-wise INT8 by @yych0745 in #4287
  • update doc by @zhyncs in #4299
  • [AMD] Fix rocm sgl-kernel missing modules error by @BruceXcluding in #4311
  • Add H20 tuning configs support DeepSeek V3/R1 INT8(block-wise) by @Ximingwang-09 in #4220
  • refactor: move image processors to separate files by @mickqian in #4229
  • upgrade flashinfer 0.2.3 by @zhyncs in #4317
  • unify is_cuda and is_hip by @zhyncs in #4321
  • Add A800 tuning configs for DeepSeek R1/V3 channel-wise INT8 by @lambert0312 in #4323
  • [Docs] Clean up benchmark_and_profiling.md by @windsonsea in #4297
  • refine sgl_moe_align_block_size_benchmark by @BBuf in #4327
  • Remove vllm ops scaled fp8 quant and accelerate per token quant by 20-28% by @hebiao064 in #4215
  • Add awq dequantize kernel to sgl with 1x to 3x speedup by @zcnrex in #4104
  • fix awq_dequantize by @zhyncs in #4333
  • release 0.0.4.post3 sgl-kernel by @zhyncs in #4331
  • upgrade sgl-kernel 0.0.4.post3 by @zhyncs in #4334
  • Add INT8 support MTP NextN function by @lambert0312 in #3911
  • [Fix] fix _yarn_linear_ramp_mask with device parameter by @Alcanderian in #4337
  • remove the unused readline dependency from the Qwen2 model implementa… by @yych0745 in #4340
  • model: Support Janus-pro by @mickqian in #3203
  • Hierarchical Caching Refactoring and Fixing TP issue by @xiezhq-hermann in #4082
  • Support Blackwell Block Scale FP8 Gemm by @elfiegg in #4278
  • typo: Update http_server.py by @WrRan in #4350
  • Update nightly tests by @merrymercy in #4352
  • [Fix Doc.] Enable internal forwarding when starting the router by @shizhediao in #4355
  • Move output processing logic from scheduler.py into a separate file by @merrymercy in #4354
  • Fix scheduler proctitle suffix is None by @cnwenf in #4326
  • feat: support ep size < 32 for sgl kernel by @shuaills in #4348
  • Fix per token fp8 quant precision by @qingquansong in #4362
  • Remove the choices in --speculative-eagle-topk argument by @Achazwl in #4329
  • docs: add parameter --log-requests-level by @panpan0000 in #4335
  • simple bugfix by @WrRan in #4342
  • Fix the doc of FR-Spec by @Achazwl in #4295
  • [Fix] Check the device backend before calling empty_cache function by @cboss6 in #4212
  • [FIX] fix incorrect output when enable both deepgemm and torch compile by @AniZpZ in #4359
  • add INT8 example into dsv3 README by @laixinn in #4079
  • Avoid duplicated request ids in batch APIs by @tanconghui in #4026
  • example: add async offline inference demo by @kuizhiqing in #3961
  • Add device detection and count functions to utils. by @vshekhawat-hlab in #3962
  • Move aiohttp into public dependencies by @stevapple in #3980
  • [tools] add fp8 max/min constant in utils by @yiakwy-xpu-ml-framework-team in #3959
  • HotFix: json serialization error when using OAI v1/batches endpoint with logprobs by @dcfidalgo in #3896
  • [docs] Update outdated description about torch.compile by @junliu-mde in #3844
  • [Doc] Fix typo in backend/sampling_params by @yang-ybb in #3835
  • Ensure Usage Data in Streaming Responses Aligns with vLLM’s Implementation by @HermitSun in #3814
  • [moe] fix: correct the cache size in the last chunk by @ch-wan in #3679
  • Support page size > 1 by @merrymercy in #4356
  • [XPU][CPU] Enable the native path of DeepSeek by @airMeng in #4086
  • Revert "[XPU][CPU] Enable the native path of DeepSeek" by @merrymercy in #4367
  • Update grafana.json by @dblate in #4374
  • fix accuracy issue by @zhyncs in #4376
  • bump 0.0.5 sgl-kernel by @zhyncs in #4377
  • upgrade sgl-kernel 0.0.5 by @zhyncs in #4381
  • chore: bump v0.4.4 by @zhyncs in #4041

New Contributors

  • @yosoyjay made their first contribution in #3505
  • @FrankLeeeee made their first contribution in #3548
  • @Jiadalee made their first contribution in #3598
  • @whybeyoung made their first contribution in #3624
  • @fsx950223 made their first contribution in #3665
  • @panpan0000 made their first contribution in #3700
  • @ch-wan made their first contribution in #3692
  • @Chen-XiaoBing made their first contribution in #3705
  • @aoshen524 made their first contribution in #3652
  • @trayvonpan made their first contribution in #3588
  • @zixuanzhang226 made their first contribution in #3680
  • @andjsmi made their first contribution in #3740
  • @shahizat made their first contribution in #3761
  • @laixinn made their first contribution in #3730
  • @He1pa made their first contribution in #3799
  • @wilsonwu made their first contribution in #3741
  • @yuanheng-zhao made their first contribution in #3641
  • @nvcastet made their first contribution in #3709
  • @hcyz33 made their first contribution in #3841
  • @kebe7jun made their first contribution in #3519
  • @JC1DA made their first contribution in #3298
  • @Chi-Chu319 made their first contribution in #3898
  • @Qiaolin-Yu made their first contribution in #3897
  • @xqoasis made their first contribution in #3905
  • @KCFindstr made their first contribution in #3866
  • @elfiegg made their first contribution in #3966
  • @Zhou-sx made their first contribution in #3822
  • @xihuai18 made their first contribution in #4000
  • @cboss6 made their first contribution in #3954
  • @Xiuyu-Li made their first contribution in #3712
  • @sgjzfzzf made their first contribution in #3607
  • @zeroorhero made their first contribution in #3990
  • @samzong made their first contribution in #4101
  • @olliestanley made their first contribution in #4142
  • @windsonsea made their first contribution in #4162
  • @zcnrex made their first contribution in #4197
  • @brighill made their first contribution in #4181
  • @DavidChan0519 made their first contribution in #3958
  • @Young1993 made their first contribution in #4144
  • @lambert0312 made their first contribution in #4136
  • @yych0745 made their first contribution in #4287
  • @Ximingwang-09 made their first contribution in #4220
  • @Alcanderian made their first contribution in #4337
  • @shizhediao made their first contribution in #4355
  • @cnwenf made their first contribution in #4326
  • @qingquansong made their first contribution in #4362
  • @AniZpZ made their first contribution in #4359
  • @tanconghui made their first contribution in #4026
  • @kuizhiqing made their first contribution in #3961
  • @vshekhawat-hlab made their first contribution in #3962
  • @stevapple made their first contribution in #3980
  • @dcfidalgo made their first contribution in #3896
  • @junliu-mde made their first contribution in #3844
  • @yang-ybb made their first contribution in #3835
  • @airMeng made their first contribution in #4086
  • @dblate made their first contribution in #4374

Full Changelog: v0.4.3...v0.4.4
