What's Changed
- [docs] add zero++ paper link by @jeffra in #3974
- Avoid race condition with port selection in unit tests by @mrwyattii in #3975
- Remove duplicated inference unit tests by @mrwyattii in #3951
- Switch to torch.linalg.norm by @loadams in #3984
- Simplify chain comparisons, remove redundant parentheses by @digger-yu in #3912
- [CPU] Support HBM flatmode and fakenuma mode by @delock in #3918
- Fix checkpoint conversion when model layers share weights by @awaelchli in #3825
- fixing flops profiler formatting, units and precision by @clumsy in #3927
- Specify language=python in pre-commit hook by @wangruohui in #3994
- [CPU] Skip CPU support unimplemented error by @Yejing-Lai in #3633
- ZeRO Gradient Accumulation Dtype. by @jomayeri in #2847
- [CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) by @delock in #3919
- Re-enable skipped unit tests by @mrwyattii in #3939
- Make AMD/ROCm apex install to /blob to save test/compile time. by @loadams in #3997
- Option to exclude frozen weights for checkpoint save by @tjruwase in #3953
- Allow user to select name of .deepspeed_env by @loadams in #4006
- Silence backend warning by @mrwyattii in #4009
- Fix user arg parsing in single node deployment by @mrwyattii in #4007
- Specify triton 2.0.0 requirement by @mrwyattii in #4008
- Re-enable elastic training for torch 2+ by @loadams in #4010
- add /dev/shm size to ds_report by @jeffra in #4015
- Make Ascend NPU available by @hipudding in #3831
- RNNprofiler: fix gates size retrieval logic in _rnn_flops by @pinstripe-potoroo in #3921
- fix typo in SECURITY.md by @jstan327 in #4019
- add llama2 autoTP support in replace_module by @dc3671 in #4022
- [zero_to_fp32] 3x less cpu memory requirements by @stas00 in #4025
- [CPU] FusedAdam and CPU training support by @delock in #3991
- remove duplicate check for pp and zero stage by @inkcherry in #4033
- Pass missing positional arguments in DeepSpeedHybridEngine.generate() by @XuehaiPan in #4026
- Remove print of weight parameter in RMS norm by @puneeshkhanna in #4031
- Monitored Loss Calculations by @jomayeri in #4030
- fix(pipe): make pipe module load_state_dir non-strict-mode work by @hughpu in #4020
- polishing timers and log_dist by @clumsy in #3996
- Engine side fix for loading llama checkpoint fine-tuned with zero3 by @minjiaz in #3981
- fix: Remove duplicate word the by @digger-yu in #4051
- [Bug Fix] Fix comm logging for inference by @delock in #4043
- fix opt-350m shard loading issue in AutoTP by @sywangyi in #3600
- enable autoTP for MPT by @sywangyi in #3861
- autoTP for fused qkv weight by @inkcherry in #3844
- [CPU] Faster reduce kernel for SHM allreduce by @delock in #4049
- Multiple zero stage 3 related fixes by @tjruwase in #3886
- Fix deadlock when SHM based allreduce spin too fast by @delock in #4048
- [MiCS] [Bugfix] set self.save_non_zero_checkpoint=True only for first partition group by @zarzen in #3787
- add reproducible compilation environment by @fecet in #3943
- fix: remove unnessary # punct in the second sed command by @hughpu in #4061
- Refactor autoTP inference for HE by @molly-smith in #4040
- Fix transformers unit tests by @mrwyattii in #4079
- Fix Stable Diffusion Injection by @lekurile in #4078
- Spread layers more uniformly when using partition_uniform by @marcobellagente93 in #4053
- fix typo: change polciies to policies by @digger-yu in #4090
- update ut/doc for glm/codegen by @inkcherry in #4057
- zero_to_fp32 script adds support for tag argument by @EeyoreLee in #4089
- add type checker ignore by @EeyoreLee in #4102
- Fix generate config validation error on inference unit tests by @mrwyattii in #4107
- use correct ckpt path when base_dir not available by @polisettyvarma in #4101
- Disable z3 tracing profiler by @tjruwase in #4106
- Pass correct node size for ZeRO++ by @cmikeh2 in #4085
- add deepspeed chat arxiv report by @conglongli in #4110
- enable pipeline checkpoint loading mode by @leiwen83 in #3629
- Fix Issue 4083 by @jomayeri in #4084
- Add full list of DS_BUILD_* by @loadams in #4119
- Update nightly workflows to open an issue if CI fails by @loadams in #3952
- Update torch1.9 tests to 1.10 to match latest accelerate. by @loadams in #4126
- Handle PermissionError in os.chmod Call - Update engine.py by @M-Chris in #4139
- Generalize frozen weights unit test by @tjruwase in #4140
- Respect memory pinning config by @tjruwase in #4131
- Remove incorrect async-io library checking code. by @loadams in #4150
- Return nn.parameter type for weights and biases by @molly-smith in #4146
- Fixes #4151 by @saforem2 in #4152
- Handling for SIGTERM as well by @loadams in #4160
- Fix CI Badges by @mrwyattii in #4162
- Add DS-Chat CI workflow by @lekurile in #4127
- [CPU][Bugfix] Make uid and addr_port part of SHM name in CCL backend by @delock in #4115
- Add DSE branch input to nv-ds-chat by @lekurile in #4173
- Pin transformers by @mrwyattii in #4174
New Contributors
- @awaelchli made their first contribution in #3825
- @wangruohui made their first contribution in #3994
- @jstan327 made their first contribution in #4019
- @XuehaiPan made their first contribution in #4026
- @puneeshkhanna made their first contribution in #4031
- @hughpu made their first contribution in #4020
- @fecet made their first contribution in #3943
- @marcobellagente93 made their first contribution in #4053
- @polisettyvarma made their first contribution in #4101
- @leiwen83 made their first contribution in #3629
- @M-Chris made their first contribution in #4139
Full Changelog: v0.10.0...v0.10.1