sgl-project/sglang v0.3.4.post1 on GitHub

Highlights

Hosted the first LMSYS online meetup: Efficient LLM Deployment and Serving.
- Covered CPU overhead hiding, faster constrained decoding, and DeepSeek MLA. Slides are available.
Added an Engine API for offline inference with reduced overhead (#1567, #1614).
Added an overlap scheduler to reduce CPU overhead (#1738).
New models: Llama 3.2 (#1551), Qwen2-VL (#1721), OLMo (#1676), GLM 4 (#1736).
Added support for reward models #1525.
Added support for Intel XPU #1480.
Improved stability for greedy decoding #1589.
Accelerated multi-LoRA serving #1587.
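
The Engine API highlighted above runs generation in-process rather than through an HTTP server. The following is a minimal sketch of offline batch inference with it, assuming `sglang` is installed and a supported GPU is available. The prompts, sampling values, and the helper names `build_requests` and `run_offline` are illustrative assumptions and do not come from these release notes.

```python
def build_requests():
    """Return example prompts and sampling parameters (illustrative values only)."""
    prompts = [
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 32}
    return prompts, sampling_params


def run_offline(model_path):
    """Generate completions in-process, without launching an HTTP server."""
    # Imported lazily so build_requests() can be used without sglang installed.
    import sglang as sgl

    prompts, sampling_params = build_requests()
    llm = sgl.Engine(model_path=model_path)
    try:
        # Each output is expected to be a dict that includes the generated "text".
        outputs = llm.generate(prompts, sampling_params)
        return [out["text"] for out in outputs]
    finally:
        # Release the engine's worker processes even if generation fails.
        llm.shutdown()
```

Because the engine starts worker processes, it is safest to call `run_offline("<model path>")` from under an `if __name__ == "__main__":` guard. The model path is left as a placeholder here; any model supported by this release should work in principle.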

What's Changed

[Fix] Ignore model import error by @merrymercy in #1513
minor: fix config by @hnyls2002 in #1524
[Event] Update meeting link by @Ying1123 in #1529
[Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B by @Ying1123 in #1525
Add float8 dynamic quant to torchao_utils by @jerryzh168 in #1528
[FIX] Catch syntax error of Regex Guide to avoid crash by @du00cs in #1521
[bugfix]Add modelscope package to avoid docker image without modelscope by @KylinMountain in #1520
Fix RuntimeEndpoint.select method by @jeffrey-fong in #1495
Multiple minor fixes by @merrymercy in #1530
Make detokenizer_manager.py not asyncio by @merrymercy in #1532
Organize image inputs by @hnyls2002 in #1531
Improve process creation by @merrymercy in #1534
fix ipv6 url when warm up model by @cauyxy in #1537
Move scheduler code from tp_worker.py to scheduler.py by @merrymercy in #1538
Process image in parallel by @hnyls2002 in #1539
Let ModelRunner take InputMetadata as input, instead of ScheduleBatch by @merrymercy in #1541
Rename InputMetadata -> ForwardBatch by @merrymercy in #1543
Clean up batch data structures: Introducing ModelWorkerBatch by @merrymercy in #1544
[Fix, LoRA] fix LoRA with updates in main by @Ying1123 in #1545
Organize Attention Backends by @hnyls2002 in #1547
Fix bugs of logprobs_nums by @hnyls2002 in #1548
Dispatch flashinfer wrappers by @hnyls2002 in #1550
Simplify flashinfer dispatch by @hnyls2002 in #1552
[Refactor] Simplify io_struct and tokenizer_manager by @Ying1123 in #1549
[Performance, Hardware] MoE tuning on AMD MI300x GPUs by @kkHuang-amd in #1554
[Fix] Fix all the Huggingface paths by @tbarton16 in #1553
[Fix] do not maintain regex_fsm in SamplingBatchInfo by @merrymercy in #1555
[Fix] Move ScheduleBatch out of SamplingInfo by @merrymercy in #1556
Move status check in the memory pool to CPU by @merrymercy in #1557
[Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' by @mssongit in #1536
[FP8 KV Cache] Avoid KeyError at loading pre-quantized FP8 model with kv_scale by @HaiShaw in #1559
Organize sampling batch info better by @merrymercy in #1562
Use ipc instead of tcp in zmq by @merrymercy in #1566
Make input_ids a torch.Tensor by @merrymercy in #1568
[Minifix] Remove extra space in cot example by @FredericOdermatt in #1569
[Fix] Fix major performance bug in certain cases by @Ying1123 in #1563
Refine the add request reasons to avoid corner cases. by @hnyls2002 in #1574
chore: update README.md by @eltociear in #1580
[Easy] use .text() instead of .text by @ByronHsu in #1577
[Event] Update README.md by @Ying1123 in #1572
Add llama implementation with no tensor parallel linears by @jerryzh168 in #1561
Backend method not found when SRT Runtime is used by @ByronHsu in #1576
default sampling param should be deepcopied by @ByronHsu in #1581
Fix styling by @ByronHsu in #1583
Fix runtime.generate when sampling param is not passed by @ByronHsu in #1582
Support min_tokens in sgl.gen by @ByronHsu in #1573
[Minor] Improve the style and fix flaky tests by @merrymercy in #1584
[Bug] Fix decode stats error on output_len 1 by @HaiShaw in #1585
Clean up event loop by @merrymercy in #1586
[LoRA, Performance] Speedup multi-LoRA serving - Step 1 by @Ying1123 in #1587
[Minor, Performance] Use torch.argmax for greedy sampling by @Ying1123 in #1589
Test consistency for single and batch separately by @ByronHsu in #1590
Update README.md by @merrymercy in #1591
Fix modality for image inputs by @merrymercy in #1592
Provide an offline engine API by @ByronHsu in #1567
[Fix] Fix the case where prompt_len = 0 by @merrymercy in #1593
Use atexit hook to implicitly shutdown Runtime by @ByronHsu in #1595
Use is_flashinfer_available to replace is_hip for flashinfer check by @merrymercy in #1596
Fix chunked prefill condition by @ispobock in #1594
Fix the port_args in bench_latency by @merrymercy in #1597
Remove references to squeezellm by @janimo in #1603
[Profile] Add pytorch profiler by @Ying1123 in #1604
[Engine] Fix generate hanging issue after the first call by @ByronHsu in #1606
Release v0.3.3 by @merrymercy in #1605
[Minor] Fix logging typo by @amosyou in #1615
Fix test_vision_openai_server on CI by @ByronHsu in #1620
[Performance, hardware] MoE tuning update to AMD MI300x GPUs by @HaiShaw in #1619
Update README.md by @kushal34712 in #1625
Update README.md by @merrymercy in #1629
Add device support by @liangan1 in #1607
Nit about the decorator of PortArgs.init_new by @glen-amd in #1611
[Bug] Fix the Image Input of Batch Generation by @OBJECT907 in #1579
Add the ability to enable and disable the Profiler via HTTP API. by @Abatom in #1626
Fix the correctness test in bench_latency.py when tp > 1 and test_generation_models.py by @merrymercy in #1631
Add image_token in conversation.py by @merrymercy in #1632
Added a "Back To Top" Button by @JanumalaAkhilendra in #1633
Fix constrained decoding by @merrymercy in #1634
Add back data parallelism by @merrymercy in #1635
Release v0.3.3.post1 by @merrymercy in #1636
[engine] support async and streaming by @ByronHsu in #1614
[Fix] Fix the style of test_large_max_new_tokens.py by @merrymercy in #1638
fix missing ignore_eos in v1/chat/completions by @learninmou in #1642
Fix ignore_eos in the OpenAI ChatCompletions API by @merrymercy in #1645
[Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch by @liangan1 in #1480
Fix unit tests and type annotations by @merrymercy in #1648
Add an option to disable penalizer by @merrymercy in #1651
Add get_tokenizer function for Engine class by @pjyi2147 in #1653
Fix the batch_is_full check for jump-forward decoding by @merrymercy in #1654
Simplify the event loop and expose --num-continuous-decode-steps as an argument by @merrymercy in #1652
[doc] Add engine section in backend.md by @ByronHsu in #1656
[Fix] fix eos trim inconsistency by @Ying1123 in #1650
Add output_ids into ScheduleBatch by @merrymercy in #1659
[Minor] Rename no_eos_trim to no_stop_trim by @Ying1123 in #1661
Add a test case to test retract by @merrymercy in #1662
Move filter_batch out of stream_output by @merrymercy in #1663
Support double sparsity by @andy-yang-1 in #1459
Fix unit test order to balance the tasks in CI by @merrymercy in #1665
[Minor] Improve style by @merrymercy in #1666
Simplify chunked prefill by @merrymercy in #1667
[1/N] Remove CacheConfig import in all model files by @ByronHsu in #1658
[doc] improve engine doc and add to readme by @ByronHsu in #1670
[Minor] Add some utility functions by @merrymercy in #1671
Improve benchmark scripts by @merrymercy in #1672
Fix memory leak during abort by @merrymercy in #1674
Fix filter_batch function call by @hnyls2002 in #1681
Add OLMo model by @janimo in #1676
Add a new event loop by @merrymercy in #1677
Fix srt dependency by @ispobock in #1685
[Event] Add online meetup meeting link by @Ying1123 in #1686
Launch a thread to overlap CPU and GPU by @merrymercy in #1687
Returning a per request metric for number of cached_tokens read by @havetc in #1599
add orjson for jsonresponse by @michaelfeil in #1688
Update README.md by @merrymercy in #1689
Add date to logging messages (#1623) by @zeng-zc in #1679
Update the transformers version in CI by @merrymercy in #1690
Use SGLang imports for linear layer by @janimo in #1696
feat: radix tree code optimize by @wxsms in #1697
ORJson. Faster Json serialization by @michaelfeil in #1694
Fix the failed unit tests by @merrymercy in #1699
Fix failed ci tests on long prompts; Better error messages for embedding models by @merrymercy in #1700
Fix engine unit test by @merrymercy in #1701
Fix mixed batch for multi modal models by @merrymercy in #1702
Add matched_stop token or str to distinguish between eos or stop str finish_reason generation by @g-drozdov in #1684
Fix regex and logprob conflicts when chunked prefilling by @hnyls2002 in #1703
Simplify flashinfer utilities by @merrymercy in #1704
Add dtype for more operations by @merrymercy in #1705
Add grouped free operations by @merrymercy in #1706
Skip unnecessary penalizer by @merrymercy in #1707
Simplify the nan detection and greedy check in sampler by @merrymercy in #1709
Fix is_all_ready for overlap copy by @merrymercy in #1710
Fix the race condition in overlap mode by @merrymercy in #1712
Update README.md by @merrymercy in #1713
Release v0.3.4 by @merrymercy in #1714
Simplify the interface of tp_worker by @merrymercy in #1718
Update vllm to 0.6.3 (#1711) by @zhyncs in #1720
Support qwen2 vl model by @zhyncs in #1721
Update README.md by @Ying1123 in #1722
Unify the memory pool api and tp worker API by @merrymercy in #1724
Temporarily skip the test_mixed_batch for QWen2VL by @merrymercy in #1725
Split the overlapped version of TpModelWorkerClient into a separate file by @merrymercy in #1726
[Bugfix] qwen2vl forward_extend by @yizhang2077 in #1727
Simplify the usage of device by @merrymercy in #1734
Simplify batch result resolution by @merrymercy in #1735
Add GLM-4 TextGeneration Model support for SGLang by @sixsixcoder in #1736
Make token mapping non-blocking in the overlapped mode by @merrymercy in #1740
Maintain seq_lens_sum to make more FlashInfer operations non-blocking by @merrymercy in #1741
Fix prefill oom by @hnyls2002 in #1743
Faster overlap mode scheduler by @merrymercy in #1738
misc: add CODEOWNERS by @zhyncs in #1737
Fix sliding window attention and gemma-2 unit tests in CI by @merrymercy in #1746
Llama3.2 vision model support by @hnyls2002 in #1551
Update max_req_len and max_req_input_len by @hnyls2002 in #1748
Release v0.3.4.post1 by @merrymercy in #1749
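
Several entries in this list (#1687, #1738, #1740) concern overlapping CPU-side scheduling with GPU execution. The toy sketch below illustrates the general idea only and is not SGLang's implementation: a background thread prepares the next batch while the main thread executes the current one. The function `run_overlapped` and its `prepare` and `execute` callbacks are hypothetical names introduced for this example.

```python
import queue
import threading


def run_overlapped(batches, prepare, execute):
    """Apply execute(prepare(b)) to each batch, preparing one batch ahead."""
    ready = queue.Queue(maxsize=1)  # at most one prepared batch waits here
    done = object()                 # sentinel marking the end of the stream

    def producer():
        try:
            for batch in batches:
                # Stand-in for CPU work such as scheduling or tokenization.
                ready.put(prepare(batch))
        finally:
            ready.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        item = ready.get()
        if item is done:
            break
        # Stand-in for the GPU forward pass on the prepared batch.
        results.append(execute(item))
    worker.join()
    return results
```

With a bounded queue of size one, preparation of batch i+1 can proceed while batch i is executing, and results are still returned in the original batch order.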

New Contributors

@du00cs made their first contribution in #1521
@KylinMountain made their first contribution in #1520
@jeffrey-fong made their first contribution in #1495
@cauyxy made their first contribution in #1537
@kkHuang-amd made their first contribution in #1554
@tbarton16 made their first contribution in #1553
@mssongit made their first contribution in #1536
@FredericOdermatt made their first contribution in #1569
@kushal34712 made their first contribution in #1625
@liangan1 made their first contribution in #1607
@glen-amd made their first contribution in #1611
@OBJECT907 made their first contribution in #1579
@Abatom made their first contribution in #1626
@JanumalaAkhilendra made their first contribution in #1633
@learninmou made their first contribution in #1642
@pjyi2147 made their first contribution in #1653
@andy-yang-1 made their first contribution in #1459
@michaelfeil made their first contribution in #1688
@zeng-zc made their first contribution in #1679
@wxsms made their first contribution in #1697
@g-drozdov made their first contribution in #1684
@sixsixcoder made their first contribution in #1736

Full Changelog: v0.3.2...v0.3.4.post1
