Major Changes
- Up to 70% throughput improvement for distributed inference by removing serialization/deserialization overheads
- Fix tensor parallelism support for Mixtral + GPTQ/AWQ
What's Changed
- Minor fix for gpu-memory-utilization description by @SuhongMoon in #2162
- [BugFix] Raise error when max_model_len is larger than KV cache size by @WoosukKwon in #2163
- [BugFix] Fix RoPE kernel on long sequences by @WoosukKwon in #2164
- Add SSL arguments to API servers by @hmellor in #2109
- typo fix by @oushu1zhangxiangxuan1 in #2166
- [ROCm] Fixes for GPTQ on ROCm by @kliuae in #2180
- Update Help Text for --gpu-memory-utilization Argument by @SuhongMoon in #2183
- [Minor] Add warning on CUDA graph memory usage by @WoosukKwon in #2182
- Added DeciLM-7b and DeciLM-7b-instruct by @avideci in #2062
- [BugFix] Fix weight loading for Mixtral with TP by @WoosukKwon in #2208
- Make _prepare_sample non blocking and pin memory of CPU input buffers by @hanzhi713 in #2207
- Remove Sampler copy stream by @Yard1 in #2209
- Fix a broken link by @ronensc in #2222
- Disable Ray usage stats collection by @WoosukKwon in #2206
- [BugFix] Fix recovery logic for sequence group by @WoosukKwon in #2186
- Update installation instructions to include CUDA 11.8 xFormers by @skt7 in #2246
- Add "About" Heading to README.md by @blueceiling in #2260
- [BUGFIX] Do not return ignored sentences twice in async llm engine by @zhuohan123 in #2258
- [BUGFIX] Fix API server test by @zhuohan123 in #2270
- [BUGFIX] Fix the path of test prompts by @zhuohan123 in #2273
- [BUGFIX] Fix communication test by @zhuohan123 in #2285
- Add support for GPT-NeoX models without attention biases by @dalgarak in #2301
- [FIX] Fix kernel bug by @jeejeelee in #1959
- fix typo and remove unused code by @esmeetu in #2305
- Enable CUDA graph for GPTQ & SqueezeLLM by @WoosukKwon in #2318
- Fix Gradio example: remove deprecated parameter concurrency_count by @ronensc in #2315
- Use NCCL instead of ray for control-plane communication to remove serialization overhead by @zhuohan123 in #2221
- Remove unused const TIMEOUT_TO_PREVENT_DEADLOCK by @ronensc in #2321
- [Minor] Revert the changes in test_cache by @WoosukKwon in #2335
- Bump up to v0.2.7 by @WoosukKwon in #2337
New Contributors
- @SuhongMoon made their first contribution in #2162
- @hmellor made their first contribution in #2109
- @oushu1zhangxiangxuan1 made their first contribution in #2166
- @kliuae made their first contribution in #2180
- @avideci made their first contribution in #2062
- @hanzhi713 made their first contribution in #2207
- @ronensc made their first contribution in #2222
- @skt7 made their first contribution in #2246
- @blueceiling made their first contribution in #2260
- @dalgarak made their first contribution in #2301
Full Changelog: v0.2.6...v0.2.7