vllm-project/vllm v0.2.7 on GitHub

Major Changes

Up to 70% throughput improvement for distributed inference by removing serialization/deserialization overheads
Fix tensor parallelism support for Mixtral + GPTQ/AWQ

What's Changed

Minor fix for gpu-memory-utilization description by @SuhongMoon in #2162
[BugFix] Raise error when max_model_len is larger than KV cache size by @WoosukKwon in #2163
[BugFix] Fix RoPE kernel on long sequences by @WoosukKwon in #2164
Add SSL arguments to API servers by @hmellor in #2109
typo fix by @oushu1zhangxiangxuan1 in #2166
[ROCm] Fixes for GPTQ on ROCm by @kliuae in #2180
Update Help Text for --gpu-memory-utilization Argument by @SuhongMoon in #2183
[Minor] Add warning on CUDA graph memory usage by @WoosukKwon in #2182
Added DeciLM-7b and DeciLM-7b-instruct by @avideci in #2062
[BugFix] Fix weight loading for Mixtral with TP by @WoosukKwon in #2208
Make _prepare_sample non blocking and pin memory of CPU input buffers by @hanzhi713 in #2207
Remove Sampler copy stream by @Yard1 in #2209
Fix a broken link by @ronensc in #2222
Disable Ray usage stats collection by @WoosukKwon in #2206
[BugFix] Fix recovery logic for sequence group by @WoosukKwon in #2186
Update installation instructions to include CUDA 11.8 xFormers by @skt7 in #2246
Add "About" Heading to README.md by @blueceiling in #2260
[BUGFIX] Do not return ignored sentences twice in async llm engine by @zhuohan123 in #2258
[BUGFIX] Fix API server test by @zhuohan123 in #2270
[BUGFIX] Fix the path of test prompts by @zhuohan123 in #2273
[BUGFIX] Fix communication test by @zhuohan123 in #2285
Add support GPT-NeoX Models without attention biases by @dalgarak in #2301
[FIX] Fix kernel bug by @jeejeelee in #1959
fix typo and remove unused code by @esmeetu in #2305
Enable CUDA graph for GPTQ & SqueezeLLM by @WoosukKwon in #2318
Fix Gradio example: remove deprecated parameter concurrency_count by @ronensc in #2315
Use NCCL instead of ray for control-plane communication to remove serialization overhead by @zhuohan123 in #2221
Remove unused const TIMEOUT_TO_PREVENT_DEADLOCK by @ronensc in #2321
[Minor] Revert the changes in test_cache by @WoosukKwon in #2335
Bump up to v0.2.7 by @WoosukKwon in #2337
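
As an illustration of the check described in #2163, here is a minimal sketch. It assumes the KV cache capacity in tokens is the block size multiplied by the number of allocated GPU blocks. The function name `check_max_model_len` and its parameters are hypothetical and are not taken from the vLLM codebase.

```python
def check_max_model_len(max_model_len: int, block_size: int, num_gpu_blocks: int) -> int:
    """Return the KV cache capacity in tokens, or raise if max_model_len exceeds it.

    Hypothetical helper for illustration only; not vLLM's actual implementation.
    """
    # Assumption: each GPU block stores `block_size` tokens, so the total
    # number of tokens the cache can hold is the product of the two.
    kv_cache_tokens = block_size * num_gpu_blocks

    if max_model_len > kv_cache_tokens:
        raise ValueError(
            f"max_model_len ({max_model_len}) is larger than the number of "
            f"tokens that fit in the KV cache ({kv_cache_tokens})."
        )
    return kv_cache_tokens


if __name__ == "__main__":
    # 512 blocks of 16 tokens hold 8192 tokens, so a 4096-token limit passes.
    print(check_max_model_len(max_model_len=4096, block_size=16, num_gpu_blocks=512))

    # A 16384-token limit does not fit in the same cache and is rejected.
    try:
        check_max_model_len(max_model_len=16384, block_size=16, num_gpu_blocks=512)
    except ValueError as err:
        print(f"rejected: {err}")
```

Failing early like this surfaces the mismatch at startup rather than during generation. In practice the usual remedies would be to lower the maximum model length or to give the cache more memory, for example through --gpu-memory-utilization.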

New Contributors

@SuhongMoon made their first contribution in #2162
@hmellor made their first contribution in #2109
@oushu1zhangxiangxuan1 made their first contribution in #2166
@kliuae made their first contribution in #2180
@avideci made their first contribution in #2062
@hanzhi713 made their first contribution in #2207
@ronensc made their first contribution in #2222
@skt7 made their first contribution in #2246
@blueceiling made their first contribution in #2260
@dalgarak made their first contribution in #2301

Full Changelog: v0.2.6...v0.2.7