Major changes
- Faster model execution with CUDA/HIP graphs (see the usage sketch after this list)
- W4A16 GPTQ support (thanks to @chu-tianxiang)
- Fix memory profiling with tensor parallelism
- Fix *.bin weight loading for Mixtral models
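A minimal sketch of how these features surface in the Python API, based on the PRs listed below. The model name is a placeholder, and the `quantization` and `enforce_eager` arguments are assumed from #916, #2145, and #1926; treat this as illustrative rather than definitive.

```python
from vllm import LLM, SamplingParams

# CUDA/HIP graph execution is the default in this release; pass
# enforce_eager=True to opt back into eager mode. Note that GPTQ
# models are temporarily forced into eager mode anyway (#2154).
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",  # placeholder W4A16 GPTQ checkpoint
    quantization="gptq",
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```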
What's Changed
- Fix typing in generate function for AsyncLLMEngine & add toml to requirements-dev by @mezuzza in #2100
- Fix Dockerfile.rocm by @tjtanaa in #2101
- avoid multiple redefinition by @MitchellX in #1817
- Add a flag to include stop string in output text by @yunfeng-scale in #1976 (see the sketch after this list)
- Add GPTQ support by @chu-tianxiang in #916
- [Docs] Add quantization support to docs by @WoosukKwon in #2135
- [ROCm] Temporarily remove GPTQ ROCm support by @WoosukKwon in #2138
- simplify loading weights logic by @esmeetu in #2133
- Optimize model execution with CUDA graph by @WoosukKwon in #1926
- [Minor] Delete Llama tokenizer warnings by @WoosukKwon in #2146
- Fix all-reduce memory usage by @WoosukKwon in #2151
- Pin PyTorch & xformers versions by @WoosukKwon in #2155
- Remove dependency on CuPy by @WoosukKwon in #2152
- [Docs] Add CUDA graph support to docs by @WoosukKwon in #2148
- Temporarily enforce eager mode for GPTQ models by @WoosukKwon in #2154
- [Minor] Add more detailed explanation on `quantization` argument by @WoosukKwon in #2145
- [Minor] Fix xformers version by @WoosukKwon in #2158
- [Minor] Add Phi 2 to supported models by @WoosukKwon in #2159
- Make sampler less blocking by @Yard1 in #1889
- [Minor] Fix a typo in .pt weight support by @WoosukKwon in #2160
- Disable CUDA graph for SqueezeLLM by @WoosukKwon in #2161
- Bump up to v0.2.6 by @WoosukKwon in #2157
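A short sketch of the stop-string flag from #1976. The parameter name `include_stop_str_in_output` and the placeholder model are assumptions here, not confirmed by the notes above.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# By default, a matched stop string is stripped from the returned text;
# the flag added in #1976 keeps it in the output (assumed parameter name).
params = SamplingParams(
    stop=["\n"],
    include_stop_str_in_output=True,
)
print(llm.generate(["One two three"], params)[0].outputs[0].text)
```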
New Contributors
- @mezuzza made their first contribution in #2100
- @MitchellX made their first contribution in #1817
Full Changelog: v0.2.5...v0.2.6