Major changes
- Faster model execution with CUDA/HIP graphs (see the usage sketch after this list)
- W4A16 GPTQ support (thanks to @chu-tianxiang)
- Fix memory profiling with tensor parallelism
- Fix *.bin weight loading for Mixtral models
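A minimal sketch of how these features surface in the Python API, based on the PRs listed below. The model name is a placeholder, and the `quantization` and `enforce_eager` arguments are assumed from #916, #2145, and #1926; treat this as illustrative rather than definitive.

```python
from vllm import LLM, SamplingParams

# CUDA/HIP graph execution is the default in this release; pass
# enforce_eager=True to opt back into eager mode. Note that GPTQ
# models are temporarily forced into eager mode anyway (#2154).
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",  # placeholder W4A16 GPTQ checkpoint
    quantization="gptq",
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```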
What's Changed
- Fix typing in generate function for AsyncLLMEngine & add toml to requirements-dev by @mezuzza in #2100
- Fix Dockerfile.rocm by @tjtanaa in #2101
- avoid multiple redefinition by @MitchellX in #1817
- Add a flag to include stop string in output text by @yunfeng-scale in #1976 (see the sketch after this list)
- Add GPTQ support by @chu-tianxiang in #916
- [Docs] Add quantization support to docs by @WoosukKwon in #2135
- [ROCm] Temporarily remove GPTQ ROCm support by @WoosukKwon in #2138
- simplify loading weights logic by @esmeetu in #2133
- Optimize model execution with CUDA graph by @WoosukKwon in #1926
- [Minor] Delete Llama tokenizer warnings by @WoosukKwon in #2146
- Fix all-reduce memory usage by @WoosukKwon in #2151
- Pin PyTorch & xformers versions by @WoosukKwon in #2155
- Remove dependency on CuPy by @WoosukKwon in #2152
- [Docs] Add CUDA graph support to docs by @WoosukKwon in #2148
- Temporarily enforce eager mode for GPTQ models by @WoosukKwon in #2154
- [Minor] Add more detailed explanation on `quantization` argument by @WoosukKwon in #2145
- [Minor] Fix xformers version by @WoosukKwon in #2158
- [Minor] Add Phi 2 to supported models by @WoosukKwon in #2159
- Make sampler less blocking by @Yard1 in #1889
- [Minor] Fix a typo in .pt weight support by @WoosukKwon in #2160
- Disable CUDA graph for SqueezeLLM by @WoosukKwon in #2161
- Bump up to v0.2.6 by @WoosukKwon in #2157
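A short sketch of the stop-string flag from #1976. The parameter name `include_stop_str_in_output` and the placeholder model are assumptions here, not confirmed by the notes above.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# By default, a matched stop string is stripped from the returned text;
# the flag added in #1976 keeps it in the output (assumed parameter name).
params = SamplingParams(
    stop=["\n"],
    include_stop_str_in_output=True,
)
print(llm.generate(["One two three"], params)[0].outputs[0].text)
```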
New Contributors
- @mezuzza made their first contribution in #2100
- @MitchellX made their first contribution in #1817
Full Changelog: v0.2.5...v0.2.6