Release v0.4.5


Highlights

The SGLang team is excited to announce the release of v0.4.5! This version introduces several significant features, including Llama 4 support, a FlashAttention 3 backend, EAGLE3 speculative decoding, DeepEP integration, and disaggregated prefill and decoding.

New Features

  • Llama 4 Support: We added support for Llama 4 models with accuracy matching the official benchmark numbers, achieving zero-shot MMLU Pro scores of 75.2 for Llama-4-Scout-17B-16E-Instruct and 80.7 for Llama-4-Maverick-17B-128E-Instruct. #5092

  • FlashAttention 3 Backend: Our implementation of the FlashAttention 3 backend delivers significant acceleration for long-context tasks. #4709

  • EAGLE3 Speculative Decoding: We’re proud to be the first to support EAGLE3 speculative decoding, offering substantial gains in decoding throughput. Learn more in our documentation and the EAGLE3 paper. #4247

  • DeepEP Integration: By incorporating DeepEP, we enhanced performance for MoE (Mixture-of-Experts) inference. #4232

  • Disaggregated Prefill and Decoding: We introduced a prototype for disaggregated prefill and decoding, with plans for further optimizations. #4654
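
For those who want to try these features, the sketches below use SGLang's offline Engine API in Python. They are minimal sketches rather than official recipes: the model names and the `fa3` backend name come from this release, while everything else (tensor-parallel sizes, draft-model paths, tuning values, addresses) is an illustrative assumption.

```python
# Minimal sketch: running Llama-4-Scout via the offline Engine API.
# tp_size=8 is an assumption; choose a value that fits your GPUs.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tp_size=8,
)
print(llm.generate("The capital of France is", {"temperature": 0, "max_new_tokens": 16}))
llm.shutdown()
```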
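
The FlashAttention 3 backend is selected with `--attention-backend fa3` (see #4680 in the changelog below); through the Engine API the same server argument looks like this, with an illustrative model path:

```python
# Minimal sketch: enabling the FlashAttention 3 attention backend.
# The backend name "fa3" comes from PR #4680; the model path is illustrative.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    attention_backend="fa3",
)
print(llm.generate("Long-context inference benefits most:", {"max_new_tokens": 32}))
llm.shutdown()
```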
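
EAGLE3 is driven by the speculative-decoding server arguments. The draft-model path and the step/top-k/draft-token values below are assumptions for illustration; consult the documentation for recommended settings.

```python
# Minimal sketch: EAGLE3 speculative decoding (PR #4247).
# The draft model path and the numeric settings are illustrative
# assumptions, not tuned recommendations.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="jamesliu1/sglang-EAGLE3-LLaMA3.1-Instruct-8B",
    speculative_num_steps=5,
    speculative_eagle_topk=8,
    speculative_num_draft_tokens=32,
)
print(llm.generate("Speculative decoding works by", {"max_new_tokens": 32}))
llm.shutdown()
```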
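
DeepEP targets expert-parallel MoE deployments, typically spanning several nodes. The sketch below assumes the `enable_deepep_moe` server argument from the DeepEP PRs; the parallel sizes, init address, and node ranks are placeholders for a hypothetical two-node run.

```python
# Minimal sketch: MoE inference with the DeepEP integration (PR #4232).
# enable_deepep_moe and ep_size are assumptions based on the linked PRs;
# the address and ranks below are placeholders for a 2-node deployment.
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V3",
    tp_size=16,
    ep_size=16,                      # expert parallelism (assumption)
    enable_deepep_moe=True,
    dist_init_addr="10.0.0.1:5000",  # placeholder address
    nnodes=2,
    node_rank=0,                     # use node_rank=1 on the second node
    trust_remote_code=True,
)
```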
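
The disaggregation prototype splits prefill and decode across separate instances that hand off the KV cache. The `disaggregation_mode` argument reflects the initial PD code in #4654, but since this is a prototype the interface may change; treat the whole sketch as an assumption.

```python
# Minimal sketch: the prefill/decode disaggregation prototype (PR #4654).
# In practice the prototype runs as two HTTP servers plus a load balancer;
# disaggregation_mode is taken from the initial PD code and may change.
import sglang as sgl

prefill_engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    disaggregation_mode="prefill",   # this instance only computes prefill
)
# A second instance launched with disaggregation_mode="decode" would
# receive the transferred KV cache and generate the output tokens.
```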

Thanks very much to the NVIDIA team, LinkedIn team, EAGLE team, Oracle team, Meituan team, and our incredible open-source community for their invaluable contributions!

Coming Soon

  • Disaggregated Prefill and Decoding: #4655

  • Llama 4 Optimization: #5118

  • EP Enhancement: #4734

  • FA3 Enhancement: #4709

We’re thrilled about these advancements and eager to hear your feedback! Join us on our Slack channel at slack.sglang.ai to connect and share your thoughts. Cheers!

What's Changed

  • Fix a regression introduced by overlapping KV cache writing by @merrymercy in #4375
  • Update ci_install_dependency.sh to use accelerate 1.4.0 by @merrymercy in #4392
  • Improve DP attention by @merrymercy in #4390
  • Fix auto merge & add back get_flat_data_by_layer by @merrymercy in #4393
  • Add some fused elementwise kernels for grok-1 by @merrymercy in #4398
  • Fix Llama3.3 tool call support by @CatherineSue in #4320
  • Fix the output of hidden states after HTTP requests by @Qiaolin-Yu in #4269
  • Add a dummy grok test case by @merrymercy in #4399
  • Hot fix for hicache with new page aligned radixtree by @xiezhq-hermann in #4397
  • bump v0.4.4.post1 by @zhyncs in #4402
  • Update CODEOWNERS by @merrymercy in #4403
  • Hierarchical Caching supports MLA by @zeroorhero in #4009
  • cleanup deps 1/n by @zhyncs in #4400
  • feat(remote_model): support variable remote backend for model loader by @DellCurry in #3964
  • [bug] fix duplicate variable MAX_PIXELS in qwen_vl.py by @qibaoyuan in #4419
  • [Doc] fix wrong flag in deepseek documentation by @lausannel in #4427
  • Add moe topk softmax templated from vllm by @qingquansong in #4302
  • bump v0.0.5.post1 by @zhyncs in #4437
  • Fix maximum recursion depth triggered on exception exit by @merrymercy in #4438
  • use topk_softmax with sgl-kernel by @zhyncs in #4439
  • docs: hot fix torch compile cache by @zhaochenyang20 in #4442
  • ci: update transformers==4.48.3 by @mickqian in #4451
  • Fix test_create_kvindices unit test by @sleepcoo in #4452
  • [Fix] Fix errors when using devices other than cuda. by @cboss6 in #4455
  • docs: Add Llama 3.3 to supported models by @JiangJiaWei1103 in #4453
  • Update bench_serving.py by @xu-song in #4454
  • bugfix: Update sampling_params.py by @WrRan in #4413
  • typos: Update sampling_params.md by @WrRan in #4391
  • Auto-detect device if not specified in server arguments. by @vshekhawat-hlab in #4423
  • Add support for upcoming QwenMoe by @michaelfeil in #4447
  • perf: update fused moe config by @mickqian in #4459
  • typos by @WrRan in #4368
  • Fix minor style by @merrymercy in #4460
  • cleanup deps 2/n by @zhyncs in #4464
  • feat: Add FlashMLA submodule by @shuaills in #4449
  • [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. by @Alcanderian in #4466
  • Fix finish step for pr tests and notebook tests by @merrymercy in #4467
  • Remove filter for pr-tests by @merrymercy in #4468
  • Add greedy verification kernel by @Ying1123 in #4383
  • Release sgl-kernel v0.0.5.post2 by @merrymercy in #4469
  • Revert "feat: Add FlashMLA submodule (#4449)" by @zhyncs in #4470
  • [Eagle] Remove the greedy branch and some redundant code by @Ying1123 in #4363
  • Support FlashMLA backend by @sleepcoo in #4472
  • fix custom allreduce performance/accuracy problem by @yizhang2077 in #4477
  • 400 on empty input_ids by @yinghai in #4481
  • Update CODEOWNERS by @merrymercy in #4484
  • Statistical Analysis of the Output Stability of the Deepseek Model by @tanzelin430 in #4202
  • model: support gemma-3-it by @mickqian in #4424
  • Initialize image processor for skip-tokenizer-init codepath by @yinghai in #4479
  • Fix: modelscope env comment by @huiwq1990 in #4474
  • Fix: Complete int32 to int64 conversion by @xiezhq-hermann in #4465
  • [ROCm] enable moe topk softmax in amd by @yiakwy-xpu-ml-framework-team in #4448
  • Feat/support code completion by @woodx9 in #3612
  • Add endpoint for file support, purely to speed up processing of input_embeds. by @RinRin-32 in #2797
  • Set xgrammar as the default grammar backend by @minleminzui in #4386
  • Fix router test by @ByronHsu in #4483
  • [Fix] use torch.inference_mode() instead of torch.no_grad() by @Alcanderian in #4372
  • [Feature] Support Deepseek-VL2 by @ccw1996 in #2798
  • config: Update fused moe config by @mickqian in #4493
  • Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. by @solrex in #4418
  • Support Online Quantization for W8A8 by @hebiao064 in #4485
  • Tool call with text by @xihuai18 in #4067
  • Nicer standalone engine interface by @yinghai in #4480
  • [Fix] Resolve GPU Memory Leak in update_weights_from_tensor by @U-rara in #4446
  • [Doc] add doc for quantization w8a8_fp8 or w8a8_int8 by @HandH1998 in #4495
  • Fix data parallel + tensor parallel by @merrymercy in #4499
  • [ROCm] fix dtype by @yiakwy-xpu-ml-framework-team in #4510
  • Remove redundant type conversion by @merrymercy in #4513
  • Update readme by @merrymercy in #4517
  • [sgl-router] improvement to avoid hang by @yinghai in #4482
  • Revert "feat: update grouped_topk to support softmax and sigmoid" by @ispobock in #4505
  • bump v0.0.5.post3 by @zhyncs in #4520
  • upgrade sgl-kernel 0.0.5.post3 by @zhyncs in #4522
  • sglang quant module remove vllm dependency by @BBuf in #4507
  • Unit test for Hierarchical Caching by @xiezhq-hermann in #4486
  • refactor: rewrite bench-mmmu-sglang by @mickqian in #4458
  • fix: second_per_grid_ts should be used to get mrope position by @mickqian in #3682
  • [Hotfix] solve fp8 w8a8 ci test fail by @BBuf in #4531
  • remove useless backend forward in rotary_embedding by @BBuf in #4500
  • Fix the incorrect args in benchmark_and_profiling.md by @tianyuzhou95 in #4542
  • cleanup deps 3/n by @zhyncs in #4541
  • Add deepseek v2 torch compile pr test by @ispobock in #4538
  • use sgl custom all reduce by @zhyncs in #4441
  • [Fix] Type annotation correction for UpdateWeightsFromTensorReqInput by @U-rara in #4532
  • [Feature] Support EAGLE 3 by @chromecast56 in #4247
  • Reduce computation and communication in DP attention by @ch-wan in #4521
  • [Feature] Support Tensor Parallelism and Weight Slicing for Lora by @aoshen524 in #4274
  • Optimize Triton decoding kernel for dynamic workload by @Alcanderian in #4553
  • [Fix] Fix raw_bs bug when using flashinfer mla and eagle by @Fridge003 in #4557
  • Create col-major and tma-aligned x_scale for deep_gemm.gemm_fp8_fp8_bf16_nt by @strgrb in #4515
  • [Feature] Integrate DeepEP into SGLang by @liz-badada in #4232
  • Support FlashMLA backend cuda graph by @sleepcoo in #4514
  • Add clang-format to pre-commit config by @Hongbosherlock in #4583
  • [fix] fix initialization of _ENABLE_TORCH_INFERENCE_MODE by @Alcanderian in #4549
  • avoid cudaStreamSynchronize in DeepSeekV2AttentionMLA by @strgrb in #4577
  • Support n in OpenAI API completions by @ChuyueSun in #3446
  • [fix] fix illegal mem access and clean up triton attention backend by @Alcanderian in #4571
  • Enable setting sglang logger from Env Variable SGLANG_LOGGING_CONFIG_PATH by @guoyuhong in #4592
  • Update doc for MTP and DP attention by @ispobock in #4622
  • Support fp8 gemm for blackwell by @wenscarl in #4558
  • fix SUPPORT_CUTLASS_BLOCK_FP8 flag by @ch-wan in #4640
  • Enable deepgemm by default on the Hopper architecture by @sleepcoo in #4613
  • [docs] Add links and fix grammars in deploy_on_k8s.md by @windsonsea in #4641
  • Align completion and chat_completion response to OpenAI API by @guoyuhong in #4637
  • [PD] Release initial code by @ByronHsu in #4654
  • fix: fix ipython running error for Engine due to outlines nest_asyncio by @minleminzui in #4582
  • update news for README by @zhyncs in #4664
  • Speed up per token and per tensor quant by 15% by @zcnrex in #4639
  • [quantization] fix channelwise conversion with scalar weight scale by @yundai424 in #4596
  • Correcting default configuration when benchmarking fused_moe by @penguin-wwy in #4665
  • [1/3] fix dsv3 awq issue by @AniZpZ in #4556
  • [Docs] Update docs for gemma3 and VLM chat templates by @adarshxs in #4674
  • [CI fix] test skipping modelopt on AMD by @adarshxs in #4677
  • fix flaky ut by @zhyncs in #4670
  • Add EAGLE mtbench benchmark script by @ispobock in #4676
  • Bug fix for metrics counter by @xiezhq-hermann in #4660
  • [Bug Fix] Add partial rotary factor support for Phi-4 and upgrade to transformers v4.50.0 by @adarshxs in #3984
  • Optimize Permute Kernel in DeepEP by @xutizhou in #4643
  • fix typo SGLang supports three grammar backends by @BroadbentJim in #4679
  • close gemma2 in test_verl_engine.py temporarily by @yizhang2077 in #4685
  • Multiple tiny code cleanups by @fzyzcjy in #4608
  • Support async in DeepEP by @fzyzcjy in #4610
  • refactor: bug fixes and refactor for vlm by @mickqian in #4661
  • Move mem_state update into debug mode by @xiezhq-hermann in #4525
  • Fix RotaryEmbedding when using Triton backend for EXAONE-3.5-2.4B by @lkm2835 in #4064
  • Unify variable naming: replace is_in_free_group with is_not_in_free_group by @c1lovez1 in #4698
  • [ROCm] Enable MTP (NextN) on AMD GPU by @alexsun07 in #4631
  • Support FA3 as Attention backend by using --attention-backend fa3 by @hebiao064 in #4680
  • rename benchmark_deepgemm_fp8_group_gemm.py by @tbzhang in #4605
  • [Quant Kernel] refactored per token group quant fp8 to support int8, up to 2x faster by @zcnrex in #4396
  • Support dynamic version name in sglang's pyproject.toml by @guoyuhong in #4720
  • update pyproject by @zhyncs in #4731
  • [PD] Remove invalid parameter by @XucSh in #4721
  • Fix EAGLE3 for llama3.3 70b by @ispobock in #4716
  • Fix circular imports in gptq.py and unblock test explorer by @hebiao064 in #4736
  • [Model] Support Qwen2ForSequenceClassification by @Ximingwang-09 in #4609
  • Support FP4 gemm (1/2) by @trevor-m in #3899
  • Add DeepEP tests into CI by @fzyzcjy in #4737
  • model: Minicpmo by @mickqian in #3023
  • support cu128 sgl-kernel by @zhyncs in #4744
  • [Benchmark] tilelang vs deepgemm vs w8a8_block_fp8_matmul by @zcnrex in #4735
  • Super tiny fix typo by @fzyzcjy in #4738
  • fix FlashMLA cudagraph config by @sleepcoo in #4691
  • Speedup warmup when DP > 1 by @fzyzcjy in #4695
  • Add endpoints to dump selected expert ids by @yuhsuan-t in #4435
  • add dsv3 int8 test by @HandH1998 in #4705
  • [Feature] Support "strict" in function calling by @DarkSharpness in #4310
  • Revert "Add DeepEP tests into CI (#4737)" by @fzyzcjy in #4751
  • Fix test_expert_distribution failure by @fzyzcjy in #4752
  • Fix warmup error when dp=1 by @fzyzcjy in #4753
  • Add retry for flaky tests in CI by @fzyzcjy in #4755
  • [Fix] Fix unexpected idx bug of Phi-3-small by @Fridge003 in #4728
  • Warn users when release_memory_occupation is called without memory saver enabled by @fzyzcjy in #4566
  • fix(typo): fix reply to replay in base_attn_backend.py by @Thysrael in #4784
  • Support recording experts workload in QWen2-MoE by @ch-wan in #4775
  • Fix popen_launch_server wait for 20 minutes when child process exits by @fzyzcjy in #4777
  • Use metadata to detect version of package by @kebe7jun in #4782
  • Fix shared memory OOM on sm86 GPUs. by @Conless in #4797
  • Support compressed tensors fp8w8a8 by @BBuf in #4743
  • bump v0.4.4.post2 by @zhyncs in #4669
  • [3/3] fix dsv3 awq issue by @laixinn in #4719
  • Update supported_models.md: adding open-r1 Olympic Code 32B by HuggingFace by @didier-durand in #4628
  • Align finish reason and stream mode in openai api by @xihuai18 in #4388
  • support clip embedding model by @Titan-p in #4506
  • update xgrammar 0.1.17 by @zhyncs in #4804
  • Patch PyTorch's bug that cross-process tensor transfer will lead to wrong device by @fzyzcjy in #4565
  • [FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 by @hebiao064 in #4745
  • support cmake for sgl-kernel by @zhyncs in #4706
  • Use apply_rope_with_cos_sin_cache_inplace for DeepSeek by @strgrb in #4764
  • Fix ut mla-test-1-gpu-amd by @strgrb in #4813
  • Remove Unintended Capture Batch Sizes in AMD HIP Graph Runner by @gmlwns2000 in #4638
  • [k8s] Clarified the usage of shared memory. by @jsuchome in #4341
  • gemma3: impl get_attention_sliding_window_size for attn init by @vhain in #4823
  • add partial_json_parser and einops by @zhyncs in #4827
  • fix the release doc dependency issue by @zhyncs in #4828
  • Update doc for DeepSeek-V3-0324 by @ispobock in #4825
  • deps: lazy import optional dependencies gguf and torchvision by @vhain in #4826
  • Update MMMU Benchmark instructions by @ravi03071991 in #4694
  • Fix the nightly eval by lowering the threshold of neuralmagic/gemma-2-2b-it-FP8 by @merrymercy in #4830
  • Basic Cleanup by @danielholanda in #4833
  • Support (1 <= dp < tp) in the dp attention in DeepEP by @tarinkk in #4770
  • [Fix] Add compressed_tensors as deps by @ocss884 in #4819
  • Fix error due to CustomAllreduce setup failure by @kebe7jun in #4815
  • use default for torch.ops by @zhyncs in #4835
  • [CI] Remove unused imports with Ruff to pre-commit config, only to benchmarks/docs/examples folder by @b8zhong in #3969
  • [Misc] Fix issues reported by torchfix by @b8zhong in #4837
  • Include context length in /v1/models response. by @jondurbin in #4809
  • [Fix] self.worker assignment in TpModelWorker and refactor references by @JustinTong0323 in #4788
  • Fix the lora adapter when lora path is none by @Qiaolin-Yu in #4799
  • fix: fix typo of comments in w8a8_fp8.py by @ZhuJiaqi9905 in #4843
  • Remove retry in nightly tests by @fzyzcjy in #4846
  • Fix CI of test_patch_torch by @fzyzcjy in #4844
  • IPv6 support by @vincent-4 in #3949
  • ci: add condition for daily docker build by @warjiang in #4487
  • [Fix] fix output_top_logprobs is not exist by @lambert0312 in #4597
  • fix: when use SGLANG_PORT this env,port is str by @lengrongfu in #4528
  • Support Page Size > 1 for FA3 by @hebiao064 in #4832
  • Fix Engine error when enabling DP attention by @fzyzcjy in #4648
  • fix: Inappropriate lack of Optional type on OpenAI ChatCompletionRequest by @BroadbentJim in #4681
  • Support controlling nsys start and end range programmatically by @fzyzcjy in #4688
  • Remove empty tool function name by @kebe7jun in #4704
  • Fix missing arguments in SchedulePolicy and RadixCache initialization in tests. by @vshekhawat-hlab in #4712
  • get the python version from env by @DavidChan0519 in #4729
  • Fix torch.cuda.MemPool() internal assertion failure by @fzyzcjy in #4687
  • Super tiny remove unused code by @fzyzcjy in #4750
  • Support with_stack and record_shapes in profiler by @fzyzcjy in #4740
  • test: reduce mem_fraction_static for gemma3 vision test by @vhain in #4840
  • Fix CI tests by @merrymercy in #4853
  • Fix fa3 cuda graph page_size > 1 precision and page_size=1 speed by @qingquansong in #4855
  • Revert "get the python version from env (#4729)" by @zhyncs in #4863
  • [Feature] add multi-rank support for Lora by @jcbjcbjc in #4492
  • Clean up import vllm in quantization/__init__.py by @merrymercy in #4834
  • Fix wrong variable name when stopping memory profile by @Fr4nk1inCs in #4772
  • [Feat] support deepgemm for cmake by @yinfan98 in #4864
  • Make torch compile configurable for biased_grouped_topk by @qingquansong in #4749
  • update sgl-kernel test ci by @zhyncs in #4866
  • fix sampling issue by @zhyncs in #4871
  • bump sgl-kernel 0.0.5.post4 by @zhyncs in #4768
  • fix sgl-kernel cu118 build by @zhyncs in #4872
  • [Feature] Support FA3 backend for MLA by @Fridge003 in #4831
  • upgrade sgl-kernel 0.0.5.post4 by @zhyncs in #4873
  • update torch compile doc by @ispobock in #4874
  • bump v0.4.4.post3 by @zhyncs in #4878
  • Fix BadRequestError wrong arguments and remove openai dependency by @fzyzcjy in #4882
  • Improve stack trace of retry errors by @fzyzcjy in #4845
  • Tiny fix doc error by @fzyzcjy in #4795
  • [Docs] Update DeepGemm at README.md by @yinfan98 in #4886
  • Update CODEOWNERS by @zhyncs in #4889
  • Delete test_deep_gemm.py by @yinfan98 in #4891
  • Add deepseek style fused moe group gate selection kernel by @qingquansong in #4530
  • quick fix: add default for new kernel by @yinfan98 in #4898
  • remove setup for sgl-kernel by @zhyncs in #4899
  • [Misc] Clean m.def and add Development Tips by @yinfan98 in #4890
  • fix allreduce test by @yizhang2077 in #4909
  • Support page size > 1 + eagle by @merrymercy in #4908
  • Fix retract for page size > 1 by @merrymercy in #4914
  • [Feature] use pytest for sgl-kernel by @adarshxs in #4896
  • fix bmm fp8 by @zhyncs in #4926
  • Fix the timeout for unit-test-2-gpu in pr-test.yml by @merrymercy in #4927
  • Fix 2-gpu CI test and suppress some warnings by @merrymercy in #4930
  • [feat] add fa3 in sgl-kernel by @yinfan98 in #4902
  • Fix sglang frontend's incorrect dependency on torch by @seplos in #4931
  • [Fix] avoid stream sync and torch compile in prefill for fa3 backend by @Fridge003 in #4932
  • cleanup sgl-kernel by @zhyncs in #4933
  • [Fix] Improve Lora tests and reduce CI runtime by @Fridge003 in #4925
  • Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP by @fzyzcjy in #4883
  • [Fix] Add torch compile for torch.clamp back by @Fridge003 in #4936
  • Fix oom error for large page size by @xiezhq-hermann in #4913
  • [feat] interface for platforms abstraction by @Alcanderian in #4928
  • [Fix] revert clean m.def for cudagraph by @yinfan98 in #4944
  • refactor: multimodal data by @mickqian in #4754
  • bump sgl-kernel v0.0.6 by @zhyncs in #4950
  • [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu by @guoyuhong in #4953
  • use fa3 in sgl-kernel by @zhyncs in #4954
  • Revert PR 4764 & 4813 related to R1 RoPE by @guoyuhong in #4959
  • [Feature] Support DeepEP Low Latency by @liz-badada in #4767
  • update bench_serving by @zhyncs in #4958
  • Prevent memory leak of retract_decode when page_size > 1 by @xiezhq-hermann in #4977
  • [VLM RLHF] Take Image input for verl vlm rollout by @JustinTong0323 in #4915
  • Large page size aligned hierarchical caching by @xiezhq-hermann in #4581
  • bug fix for hicache host eviction by @xiezhq-hermann in #4989
  • sgl scaled_fp8_quant support output padding by @BBuf in #4861
  • Add Eagle Speculative Decoding to FA3 Backend by @qingquansong in #4951
  • Update tokenizer_manager.py by @yangky11 in #5008
  • [sgl-kernel] per token group quant support COLUMN MAJOR by @BBuf in #4817
  • update cutlass tag by @xiezhq-hermann in #5011
  • Feature/revise docs ci by @renxinx in #5009
  • fix: fix illegal cuda memory access at fused_moe_kernel by @saltyfish66 in #4727
  • [Build] Support build sgl-kernel with ccache by @guoyuhong in #5020
  • fix deepgemm as well by @xiezhq-hermann in #5030
  • try to fix ci oserror by @BBuf in #5024
  • Replace enable_flashinfer_mla argument with attention_backend by @Fridge003 in #5005
  • Small refactor DeepEPMode to clean up code a bit by @fzyzcjy in #4992
  • [Fix] fix fa3 build at cu118 by @yinfan98 in #5036
  • Revert "Replace enable_flashinfer_mla argument with attention_backend" by @merrymercy in #5048
  • bump sgl-kernel v0.0.7 by @zhyncs in #5046
  • update eagle-3 docs by @simveit in #4796
  • Add LlavaLlamaForCausaLM in MultiModal Processors by @ravi03071991 in #5039
  • Update the retry count by @zhyncs in #5051
  • upgrade sgl-kernel v0.0.7 by @zhyncs in #5049
  • [2/3] fix dsv3 awq issue by @AniZpZ in #4625
  • Feature/revise docs ci by @renxinx in #5056
  • Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 by @M0gician in #5057
  • [fix] remove cuda_device_count_stateless by @Alcanderian in #5060
  • Small refactor DeepEPDispatcher into subclasses by @fzyzcjy in #4994
  • Support async DeepEP by splitting into two stages by @fzyzcjy in #4995
  • Cleanup unused resources after DeepEP operation by @fzyzcjy in #4996
  • Add DeepSeek V3/R1 shared experts fusion by @BBuf in #4918
  • [deepep] fix: shared experts are not initialized when shared experts fusion is disabled by @ch-wan in #5072
  • fix dummy-load deepseekv2 by @inkcherry in #4535
  • support sgl-kernel on blackwell by @zhyncs in #5074
  • FA3 Spec Decoding to support top k = 1 and add cuda graph support by @hebiao064 in #5050
  • [Revision] Replace enable_flashinfer_mla argument with attention_backend by @Fridge003 in #5052
  • upgrade transformers 4.51.0 by @zhyncs in #5088
  • sgl-kernel transfer custom allreduce from trt kernel to vllm kernel by @yizhang2077 in #5079
  • bump sgl-kernel 0.0.8 by @zhyncs in #5089
  • python transfer custom allreduce from trt kernel to vllm kernel by @yizhang2077 in #5080
  • bump v0.4.4.post4 by @zhyncs in #5091
  • Fix: Reduce the number of document ci attempts to avoid long ci running by @minleminzui in #5097
  • Add Llama4 support by @CatherineSue in #5092
  • Fix refactor error - fp8.py by @HaiShaw in #5106
  • bump v0.4.5 by @zhyncs in #5117

New Contributors

  • @DellCurry made their first contribution in #3964
  • @lausannel made their first contribution in #4427
  • @JiangJiaWei1103 made their first contribution in #4453
  • @xu-song made their first contribution in #4454
  • @yinghai made their first contribution in #4481
  • @tanzelin430 made their first contribution in #4202
  • @huiwq1990 made their first contribution in #4474
  • @woodx9 made their first contribution in #3612
  • @ccw1996 made their first contribution in #2798
  • @solrex made their first contribution in #4418
  • @U-rara made their first contribution in #4446
  • @tianyuzhou95 made their first contribution in #4542
  • @chromecast56 made their first contribution in #4247
  • @strgrb made their first contribution in #4515
  • @liz-badada made their first contribution in #4232
  • @Hongbosherlock made their first contribution in #4583
  • @guoyuhong made their first contribution in #4592
  • @wenscarl made their first contribution in #4558
  • @penguin-wwy made their first contribution in #4665
  • @xutizhou made their first contribution in #4643
  • @BroadbentJim made their first contribution in #4679
  • @lkm2835 made their first contribution in #4064
  • @c1lovez1 made their first contribution in #4698
  • @alexsun07 made their first contribution in #4631
  • @tbzhang made their first contribution in #4605
  • @XucSh made their first contribution in #4721
  • @yuhsuan-t made their first contribution in #4435
  • @Thysrael made their first contribution in #4784
  • @Conless made their first contribution in #4797
  • @gmlwns2000 made their first contribution in #4638
  • @jsuchome made their first contribution in #4341
  • @danielholanda made their first contribution in #4833
  • @tarinkk made their first contribution in #4770
  • @ocss884 made their first contribution in #4819
  • @b8zhong made their first contribution in #3969
  • @jondurbin made their first contribution in #4809
  • @JustinTong0323 made their first contribution in #4788
  • @ZhuJiaqi9905 made their first contribution in #4843
  • @vincent-4 made their first contribution in #3949
  • @warjiang made their first contribution in #4487
  • @lengrongfu made their first contribution in #4528
  • @jcbjcbjc made their first contribution in #4492
  • @Fr4nk1inCs made their first contribution in #4772
  • @seplos made their first contribution in #4931
  • @yangky11 made their first contribution in #5008
  • @renxinx made their first contribution in #5009
  • @saltyfish66 made their first contribution in #4727
  • @inkcherry made their first contribution in #4535

Full Changelog: v0.4.4...v0.4.5
