What's Changed
- Update version.txt after 0.16.3 release by @loadams in #6965
- Precisely track nvme optimizer offload by @tjruwase in #6963
- Update build_win.bat script to exclude GDS op as it lacks Windows support by @loadams in #6971
- Add CUDA 12.8 support and comment on CUDA 12.7 by @loadams in #6975
- Update cpu torch latest to use torch 2.6 by @loadams in #6977
- Generalize DeepSpeed linear and implement it for non-CUDA systems by @oelayan7 in #6932
- Update recommended Windows whl building versions by @loadams in #6983
- Fix setup_env_ranks to properly set environment variables instead of raising an error by @fabiosanger in #6979
- Specify torchvision in nv-ds-chat workflow (prevents errors with torch 2.6) by @loadams in #6982
- Remove assumption that padding only occurs on last rank by @xylian86 in #6974
- Use ds-specific module id to avoid conflicts by @tjruwase in #6847
- Update A6000 workflows to use newer docker container - 24.09 vs 24.03 by @loadams in #6967
- Allow NVIDIA Blackwell by @fabiendupont in #6991
- Update GH org references by @tjruwase in #6998
- [XPU] max1100 workflow update for docker and software by @Liangliang-Ma in #7003
- AutoTP training (fix DCO) by @inkcherry in #7004
- import triton files when triton is supported and installed by @oelayan7 in #6989
- Update A6000 tests transformers version by @loadams in #7016
- Fix ds-chat CI regression by @tjruwase in #7015
- [Ulysses tutorial] typos by @stas00 in #7024
- fix hostname -I for macOS #6497 by @fitzjalen in #6990
- Update workflows to cuda 12.4 by @loadams in #7000
- [ROCm] Enable fp_quantizer on ROCm by @rraminen in #7027
- Add GDS Chinese blog by @GuanhuaWang in #7034
- Add Chinese blog for DeepSpeed on Windows, and fix format by @hwchen2017 in #7035
- AIO on ROCm by @jomayeri in #7023
- Control trace cache warnings by @tjruwase in #7039
- Update CUDA compute capability to support Blackwell by @hwchen2017 in #7047
- Update setup.py handling of ROCm cupy by @loadams in #7051
- nv-ds-chat breaks with latest transformers by @loadams in #7052
- Rename aio_thread_count to intra_op_parallelism by @tjruwase in #7056
- Add AutoTP training ZeRO-2 tests by @inkcherry in #7049
- Fix bf16 optimizer: remove duplicate loop by @wukong1992 in #7054
New Contributors
- @fabiosanger made their first contribution in #6979
- @fabiendupont made their first contribution in #6991
- @fitzjalen made their first contribution in #6990
- @wukong1992 made their first contribution in #7054
Full Changelog: v0.16.3...v0.16.4