Major Changes
- Experimental multi-lora support
- Experimental prefix caching support
- FP8 KV Cache support
- Optimized MoE performance and Deepseek MoE support
- CI-tested PRs
- Support batch completion in server
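The batch completion support means the OpenAI-compatible `/v1/completions` endpoint can take a list of prompts in a single request. A minimal sketch of such a request body — the model name is a placeholder, and the exact response shape should be checked against the server:

```python
import json

# Sketch of a batch completion request body: "prompt" may be a list,
# so one HTTP request yields one completion per prompt.
payload = {
    "model": "facebook/opt-125m",  # placeholder model name
    "prompt": [
        "Hello, my name is",
        "The capital of France is",
    ],
    "max_tokens": 16,
}
body = json.dumps(payload)
```

The server returns the completions for all prompts in a single response, which avoids one round trip per prompt.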
What's Changed
- Minor fix of type hint by @beginlner in #2340
- Build docker image with shared objects from "build" step by @payoto in #2237
- Ensure metrics are logged regardless of requests by @ichernev in #2347
- Changed scheduler to use deques instead of lists by @NadavShmayo in #2290
- Fix eager mode performance by @WoosukKwon in #2377
- [Minor] Remove unused code in attention by @WoosukKwon in #2384
- Add baichuan chat template jinja file by @EvilPsyCHo in #2390
- [Speculative decoding 1/9] Optimized rejection sampler by @cadedaniel in #2336
- Fix ipv4 ipv6 dualstack by @yunfeng-scale in #2408
- [Minor] Rename phi_1_5 to phi by @WoosukKwon in #2385
- [DOC] Add additional comments for LLMEngine and AsyncLLMEngine by @litone01 in #1011
- [Minor] Fix the format in quick start guide related to Model Scope by @zhuohan123 in #2425
- Add gradio chatbot for openai webserver by @arkohut in #2307
- [BUG] RuntimeError: deque mutated during iteration in abort_seq_group by @chenxu2048 in #2371
- Allow setting fastapi root_path argument by @chiragjn in #2341
- Address Phi modeling update 2 by @huiwy in #2428
- Show a more user-friendly error message, with advice for beginners using V100 GPUs (#1901), by @chuanzhubin in #2374
- Update quickstart.rst with small clarifying change (fix typo) by @nautsimon in #2369
- Aligning `top_p` and `top_k` sampling by @chenxu2048 in #1885
- [Minor] Fix err msg by @WoosukKwon in #2431
- [Minor] Optimize cuda graph memory usage by @esmeetu in #2437
- [CI] Add Buildkite by @simon-mo in #2355
- Announce the second vLLM meetup by @WoosukKwon in #2444
- Allow buildkite to retry build on agent lost by @simon-mo in #2446
- Fix weight loading for GQA with TP by @zhangch9 in #2379
- CI: make sure benchmark script exit on error by @simon-mo in #2449
- ci: retry on build failure as well by @simon-mo in #2457
- Add StableLM3B model by @ita9naiwa in #2372
- OpenAI refactoring by @FlorianJoncour in #2360
- [Experimental] Prefix Caching Support by @caoshiyi in #1669
- fix stablelm.py tensor-parallel-size bug by @YingchaoX in #2482
- Minor fix in prefill cache example by @JasonZhu1313 in #2494
- fix: fix some args desc by @zspo in #2487
- [Neuron] Add an option to build with neuron by @liangfu in #2065
- Don't download both safetensor and bin files. by @NikolaBorisov in #2480
- [BugFix] Fix abort_seq_group by @beginlner in #2463
- refactor completion api for readability by @simon-mo in #2499
- Support OpenAI API server in `benchmark_serving.py` by @hmellor in #2172
- Simplify broadcast logic for control messages by @zhuohan123 in #2501
- [Bugfix] fix load local safetensors model by @esmeetu in #2512
- Add benchmark serving to CI by @simon-mo in #2505
- Add `group` as an argument in broadcast ops by @GindaChen in #2522
- [Fix] Keep `scheduler.running` as deque by @njhill in #2523
- Migrate pydantic from v1 to v2 by @joennlae in #2531
- [Speculative decoding 2/9] Multi-step worker for draft model by @cadedaniel in #2424
- Fix "Port could not be cast to integer value as " by @pcmoritz in #2545
- Add qwen2 by @JustinLin610 in #2495
- Fix progress bar and allow HTTPS in `benchmark_serving.py` by @hmellor in #2552
- Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py by @JasonZhu1313 in #2553
- [Feature] Simple API token authentication by @taisazero in #1106
- Add multi-LoRA support by @Yard1 in #1804
- lint: format all python file instead of just source code by @simon-mo in #2567
- [Bugfix] fix crash if max_tokens=None by @NikolaBorisov in #2570
- Added `include_stop_str_in_output` and `length_penalty` parameters to OpenAI API by @galatolofederico in #2562
- [Doc] Fix the syntax error in the doc of supported_models. by @keli-wen in #2584
- Support Batch Completion in Server by @simon-mo in #2529
- fix names and license by @JustinLin610 in #2589
- [Fix] Use a correct device when creating OptionalCUDAGuard by @sh1ng in #2583
- [ROCm] add support to ROCm 6.0 and MI300 by @hongxiayang in #2274
- Support for Stable LM 2 by @dakotamahan-stability in #2598
- Don't build punica kernels by default by @pcmoritz in #2605
- AWQ: Up to 2.66x higher throughput by @casper-hansen in #2566
- Use head_dim in config if exists by @xiangxu-google in #2622
- Custom all reduce kernels by @hanzhi713 in #2192
- [Minor] Fix warning on Ray dependencies by @WoosukKwon in #2630
- Speed up Punica compilation by @WoosukKwon in #2632
- Small async_llm_engine refactor by @andoorve in #2618
- Update Ray version requirements by @simon-mo in #2636
- Support FP8-E5M2 KV Cache by @zhaoyang-star in #2279
- Fix error when tp > 1 by @zhaoyang-star in #2644
- No repeated IPC open by @hanzhi713 in #2642
- ROCm: Allow setting compilation target by @rlrs in #2581
- DeepseekMoE support with Fused MoE kernel by @zwd003 in #2453
- Fused MOE for Mixtral by @pcmoritz in #2542
- Fix 'Actor methods cannot be called directly' when using `--engine-use-ray` by @HermitSun in #2664
- Add swap_blocks unit tests by @sh1ng in #2616
- Fix a small typo (tenosr -> tensor) by @pcmoritz in #2672
- [Minor] Fix false warning when TP=1 by @WoosukKwon in #2674
- Add quantized mixtral support by @WoosukKwon in #2673
- Bump up version to v0.3.0 by @zhuohan123 in #2656
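As a usage note for the FP8-E5M2 KV cache entry above (#2279): it is enabled through an engine argument when launching the server. A hedged launch sketch — the model name is a placeholder, and the flag spelling should be verified against this release's engine arguments:

```shell
# Serve with the FP8-E5M2 KV cache from #2279, which roughly halves
# the memory used per cached token compared to an FP16 KV cache.
python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m \
    --kv-cache-dtype fp8_e5m2
```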
New Contributors
- @payoto made their first contribution in #2237
- @NadavShmayo made their first contribution in #2290
- @EvilPsyCHo made their first contribution in #2390
- @litone01 made their first contribution in #1011
- @arkohut made their first contribution in #2307
- @chiragjn made their first contribution in #2341
- @huiwy made their first contribution in #2428
- @chuanzhubin made their first contribution in #2374
- @nautsimon made their first contribution in #2369
- @zhangch9 made their first contribution in #2379
- @ita9naiwa made their first contribution in #2372
- @caoshiyi made their first contribution in #1669
- @YingchaoX made their first contribution in #2482
- @JasonZhu1313 made their first contribution in #2494
- @zspo made their first contribution in #2487
- @liangfu made their first contribution in #2065
- @NikolaBorisov made their first contribution in #2480
- @GindaChen made their first contribution in #2522
- @njhill made their first contribution in #2523
- @joennlae made their first contribution in #2531
- @pcmoritz made their first contribution in #2545
- @JustinLin610 made their first contribution in #2495
- @taisazero made their first contribution in #1106
- @galatolofederico made their first contribution in #2562
- @keli-wen made their first contribution in #2584
- @sh1ng made their first contribution in #2583
- @hongxiayang made their first contribution in #2274
- @dakotamahan-stability made their first contribution in #2598
- @xiangxu-google made their first contribution in #2622
- @andoorve made their first contribution in #2618
- @rlrs made their first contribution in #2581
- @zwd003 made their first contribution in #2453
Full Changelog: v0.2.7...v0.3.0