What's Changed
- add auto-generated PR workflow by @mrwyattii in #2822
- Fix typo in auto-sync workflow by @mrwyattii in #2850
- Fix example command for building wheel with dev version specified. by @loadams in #2815
- Create tensor parallelism blog/tutorial by @molly-smith in #2766
- Data efficiency library update by @conglongli in #2866
- Make z3 respect comm dtype by @tjruwase in #2807
- Automatic Tensor Parallelism Blog Links by @molly-smith in #2877
- Check device count before running dist tests by @HeyangQin in #2799
- AutoTP tutorial web formatting and news by @molly-smith in #2883
- Remove deprecated torch._six imports by @yasyf in #2863
- Reduce I/O size by @tjruwase in #2814
- add missing license info to top of all source code by @jeffra in #2889
- Enable tensor fragments for zero 2 & 3 by @tjruwase in #2727
- better eval sampler for val or test dataset by @mayank31398 in #2907
- using container when loading inference checkpoints by @HeyangQin in #2875
- Fix CPUAdam for when vendor_id_raw is not provided by @FarzanT in #2836
- Fix Bloom logits mismatch by @molly-smith in #2851
- Fixes AttributeError in #2853 by @saforem2 in #2854
- Add MPICH Multinode Runner by @inkcherry in #2839
- TP unsupported models and assertions by @molly-smith in #2810
- AutoTP Assert Kernel Injection Support by @molly-smith in #2939
- Check for local CUDA graphs when enable_cuda_graph=True by @lekurile in #2941
- Improve overflow handling by @tjruwase in #2944
- [RFC] add device abstraction to allow other device than CUDA be used by @delock in #2221
- deepspeed.init_distributed() support for TCP protocols by @noabauma in #2905
New Contributors
- @HeyangQin made their first contribution in #2799
- @yasyf made their first contribution in #2863
- @mayank31398 made their first contribution in #2907
- @FarzanT made their first contribution in #2836
- @saforem2 made their first contribution in #2854
- @noabauma made their first contribution in #2905
Full Changelog: v0.8.1...v0.8.2