Major changes
- Optimize Mixtral performance with expert parallelism (thanks to @Yard1)
- [BugFix] Fix input positions for long context with sliding window
What's Changed
- Update Dockerfile to support Mixtral by @simon-mo in #2027
- Remove Python 3.10 requirement by @WoosukKwon in #2040
- [CI/CD] Upgrade PyTorch version to v2.1.1 by @WoosukKwon in #2045
- Upgrade transformers version to 4.36.0 by @WoosukKwon in #2046
- Remove einops from dependencies by @WoosukKwon in #2049
- Add GQA to MPT attention by @megha95 in #1938
- Update Dockerfile to build Megablocks by @simon-mo in #2042
- Fix peak memory profiling by @WoosukKwon in #2031
- Implement lazy model loader by @WoosukKwon in #2044
- [ROCm] Upgrade xformers version dependency for ROCm; update documentation by @tjtanaa in #2079
- Update installation instructions for CUDA 11.8 by @WoosukKwon in #2086
- [Docs] Add notes on ROCm-supported models by @WoosukKwon in #2087
- [BugFix] Fix input positions for long context with sliding window by @WoosukKwon in #2088
- Mixtral expert parallelism by @Yard1 in #2090
- Bump up to v0.2.5 by @WoosukKwon in #2095
Full Changelog: v0.2.4...v0.2.5