Major Changes
- Experimental multi-lora support
- Experimental prefix caching support
- FP8 KV Cache support
- Optimized MoE performance and Deepseek MoE support
- CI-tested PRs
- Support batch completion in server
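The batch completion support means the OpenAI-compatible `/v1/completions` endpoint can take a list of prompts in a single request. A minimal sketch of such a request body — the model name is a placeholder, and the exact response shape should be checked against the server:

```python
import json

# Sketch of a batch completion request body: "prompt" may be a list,
# so one HTTP request yields one completion per prompt.
payload = {
    "model": "facebook/opt-125m",  # placeholder model name
    "prompt": [
        "Hello, my name is",
        "The capital of France is",
    ],
    "max_tokens": 16,
}
body = json.dumps(payload)
```

The server returns the completions for all prompts in a single response, which avoids one round trip per prompt.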
What's Changed
- Minor fix of type hint by @beginlner in #2340
- Build docker image with shared objects from "build" step by @payoto in #2237
- Ensure metrics are logged regardless of requests by @ichernev in #2347
- Changed scheduler to use deques instead of lists by @NadavShmayo in #2290
- Fix eager mode performance by @WoosukKwon in #2377
- [Minor] Remove unused code in attention by @WoosukKwon in #2384
- Add baichuan chat template jinja file by @EvilPsyCHo in #2390
- [Speculative decoding 1/9] Optimized rejection sampler by @cadedaniel in #2336
- Fix ipv4 ipv6 dualstack by @yunfeng-scale in #2408
- [Minor] Rename phi_1_5 to phi by @WoosukKwon in #2385
- [DOC] Add additional comments for LLMEngine and AsyncLLMEngine by @litone01 in #1011
- [Minor] Fix the format in quick start guide related to Model Scope by @zhuohan123 in #2425
- Add gradio chatbot for openai webserver by @arkohut in #2307
- [BUG] RuntimeError: deque mutated during iteration in abort_seq_group by @chenxu2048 in #2371
- Allow setting fastapi root_path argument by @chiragjn in #2341
- Address Phi modeling update 2 by @huiwy in #2428
- Show a more user-friendly error message, with advice for beginners using V100 GPUs (#1901), by @chuanzhubin in #2374
- Update quickstart.rst with small clarifying change (fix typo) by @nautsimon in #2369
- Aligning `top_p` and `top_k` sampling by @chenxu2048 in #1885
- [Minor] Fix err msg by @WoosukKwon in #2431
- [Minor] Optimize cuda graph memory usage by @esmeetu in #2437
- [CI] Add Buildkite by @simon-mo in #2355
- Announce the second vLLM meetup by @WoosukKwon in #2444
- Allow buildkite to retry build on agent lost by @simon-mo in #2446
- Fix weight loading for GQA with TP by @zhangch9 in #2379
- CI: make sure benchmark script exit on error by @simon-mo in #2449
- ci: retry on build failure as well by @simon-mo in #2457
- Add StableLM3B model by @ita9naiwa in #2372
- OpenAI refactoring by @FlorianJoncour in #2360
- [Experimental] Prefix Caching Support by @caoshiyi in #1669
- fix stablelm.py tensor-parallel-size bug by @YingchaoX in #2482
- Minor fix in prefill cache example by @JasonZhu1313 in #2494
- fix: fix some args desc by @zspo in #2487
- [Neuron] Add an option to build with neuron by @liangfu in #2065
- Don't download both safetensor and bin files. by @NikolaBorisov in #2480
- [BugFix] Fix abort_seq_group by @beginlner in #2463
- refactor completion api for readability by @simon-mo in #2499
- Support OpenAI API server in `benchmark_serving.py` by @hmellor in #2172
- Simplify broadcast logic for control messages by @zhuohan123 in #2501
- [Bugfix] fix load local safetensors model by @esmeetu in #2512
- Add benchmark serving to CI by @simon-mo in #2505
- Add `group` as an argument in broadcast ops by @GindaChen in #2522
- [Fix] Keep `scheduler.running` as deque by @njhill in #2523
- Migrate pydantic from v1 to v2 by @joennlae in #2531
- [Speculative decoding 2/9] Multi-step worker for draft model by @cadedaniel in #2424
- Fix "Port could not be cast to integer value as " by @pcmoritz in #2545
- Add qwen2 by @JustinLin610 in #2495
- Fix progress bar and allow HTTPS in `benchmark_serving.py` by @hmellor in #2552
- Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py by @JasonZhu1313 in #2553
- [Feature] Simple API token authentication by @taisazero in #1106
- Add multi-LoRA support by @Yard1 in #1804
- lint: format all python file instead of just source code by @simon-mo in #2567
- [Bugfix] fix crash if max_tokens=None by @NikolaBorisov in #2570
- Added `include_stop_str_in_output` and `length_penalty` parameters to OpenAI API by @galatolofederico in #2562
- [Doc] Fix the syntax error in the doc of supported_models. by @keli-wen in #2584
- Support Batch Completion in Server by @simon-mo in #2529
- fix names and license by @JustinLin610 in #2589
- [Fix] Use a correct device when creating OptionalCUDAGuard by @sh1ng in #2583
- [ROCm] add support to ROCm 6.0 and MI300 by @hongxiayang in #2274
- Support for Stable LM 2 by @dakotamahan-stability in #2598
- Don't build punica kernels by default by @pcmoritz in #2605
- AWQ: Up to 2.66x higher throughput by @casper-hansen in #2566
- Use head_dim in config if exists by @xiangxu-google in #2622
- Custom all reduce kernels by @hanzhi713 in #2192
- [Minor] Fix warning on Ray dependencies by @WoosukKwon in #2630
- Speed up Punica compilation by @WoosukKwon in #2632
- Small async_llm_engine refactor by @andoorve in #2618
- Update Ray version requirements by @simon-mo in #2636
- Support FP8-E5M2 KV Cache by @zhaoyang-star in #2279
- Fix error when tp > 1 by @zhaoyang-star in #2644
- No repeated IPC open by @hanzhi713 in #2642
- ROCm: Allow setting compilation target by @rlrs in #2581
- DeepseekMoE support with Fused MoE kernel by @zwd003 in #2453
- Fused MOE for Mixtral by @pcmoritz in #2542
- Fix 'Actor methods cannot be called directly' when using `--engine-use-ray` by @HermitSun in #2664
- Add swap_blocks unit tests by @sh1ng in #2616
- Fix a small typo (tenosr -> tensor) by @pcmoritz in #2672
- [Minor] Fix false warning when TP=1 by @WoosukKwon in #2674
- Add quantized mixtral support by @WoosukKwon in #2673
- Bump up version to v0.3.0 by @zhuohan123 in #2656
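As a usage note for the FP8-E5M2 KV cache entry above (#2279): it is enabled through an engine argument when launching the server. A hedged launch sketch — the model name is a placeholder, and the flag spelling should be verified against this release's engine arguments:

```shell
# Serve with the FP8-E5M2 KV cache from #2279, which roughly halves
# the memory used per cached token compared to an FP16 KV cache.
python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m \
    --kv-cache-dtype fp8_e5m2
```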
New Contributors
- @payoto made their first contribution in #2237
- @NadavShmayo made their first contribution in #2290
- @EvilPsyCHo made their first contribution in #2390
- @litone01 made their first contribution in #1011
- @arkohut made their first contribution in #2307
- @chiragjn made their first contribution in #2341
- @huiwy made their first contribution in #2428
- @chuanzhubin made their first contribution in #2374
- @nautsimon made their first contribution in #2369
- @zhangch9 made their first contribution in #2379
- @ita9naiwa made their first contribution in #2372
- @caoshiyi made their first contribution in #1669
- @YingchaoX made their first contribution in #2482
- @JasonZhu1313 made their first contribution in #2494
- @zspo made their first contribution in #2487
- @liangfu made their first contribution in #2065
- @NikolaBorisov made their first contribution in #2480
- @GindaChen made their first contribution in #2522
- @njhill made their first contribution in #2523
- @joennlae made their first contribution in #2531
- @pcmoritz made their first contribution in #2545
- @JustinLin610 made their first contribution in #2495
- @taisazero made their first contribution in #1106
- @galatolofederico made their first contribution in #2562
- @keli-wen made their first contribution in #2584
- @sh1ng made their first contribution in #2583
- @hongxiayang made their first contribution in #2274
- @dakotamahan-stability made their first contribution in #2598
- @xiangxu-google made their first contribution in #2622
- @andoorve made their first contribution in #2618
- @rlrs made their first contribution in #2581
- @zwd003 made their first contribution in #2453
Full Changelog: v0.2.7...v0.3.0