deepspeed 0.10.1 (v0.10.1: Patch release) on Python PyPI

What's Changed

[docs] add zero++ paper link by @jeffra in #3974
Avoid race condition with port selection in unit tests by @mrwyattii in #3975
Remove duplicated inference unit tests by @mrwyattii in #3951
Switch to torch.linalg.norm by @loadams in #3984
Simplify chain comparisons, remove redundant parentheses by @digger-yu in #3912
[CPU] Support HBM flatmode and fakenuma mode by @delock in #3918
Fix checkpoint conversion when model layers share weights by @awaelchli in #3825
fixing flops profiler formatting, units and precision by @clumsy in #3927
Specify language=python in pre-commit hook by @wangruohui in #3994
[CPU] Skip CPU support unimplemented error by @Yejing-Lai in #3633
ZeRO Gradient Accumulation Dtype. by @jomayeri in #2847
[CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) by @delock in #3919
Re-enable skipped unit tests by @mrwyattii in #3939
Make AMD/ROCm apex install to /blob to save test/compile time. by @loadams in #3997
Option to exclude frozen weights for checkpoint save by @tjruwase in #3953
Allow user to select name of .deepspeed_env by @loadams in #4006
Silence backend warning by @mrwyattii in #4009
Fix user arg parsing in single node deployment by @mrwyattii in #4007
Specify triton 2.0.0 requirement by @mrwyattii in #4008
Re-enable elastic training for torch 2+ by @loadams in #4010
add /dev/shm size to ds_report by @jeffra in #4015
Make Ascend NPU available by @hipudding in #3831
RNNprofiler: fix gates size retrieval logic in _rnn_flops by @pinstripe-potoroo in #3921
fix typo in SECURITY.md by @jstan327 in #4019
add llama2 autoTP support in replace_module by @dc3671 in #4022
[zero_to_fp32] 3x less cpu memory requirements by @stas00 in #4025
[CPU] FusedAdam and CPU training support by @delock in #3991
remove duplicate check for pp and zero stage by @inkcherry in #4033
Pass missing positional arguments in DeepSpeedHybridEngine.generate() by @XuehaiPan in #4026
Remove print of weight parameter in RMS norm by @puneeshkhanna in #4031
Monitored Loss Calculations by @jomayeri in #4030
fix(pipe): make pipe module load_state_dir non-strict-mode work by @hughpu in #4020
polishing timers and log_dist by @clumsy in #3996
Engine side fix for loading llama checkpoint fine-tuned with zero3 by @minjiaz in #3981
fix: Remove duplicate word the by @digger-yu in #4051
[Bug Fix] Fix comm logging for inference by @delock in #4043
fix opt-350m shard loading issue in AutoTP by @sywangyi in #3600
enable autoTP for MPT by @sywangyi in #3861
autoTP for fused qkv weight by @inkcherry in #3844
[CPU] Faster reduce kernel for SHM allreduce by @delock in #4049
Multiple zero stage 3 related fixes by @tjruwase in #3886
Fix deadlock when SHM-based allreduce spins too fast by @delock in #4048
[MiCS] [Bugfix] set self.save_non_zero_checkpoint=True only for first partition group by @zarzen in #3787
add reproducible compilation environment by @fecet in #3943
fix: remove unnecessary # punct in the second sed command by @hughpu in #4061
Refactor autoTP inference for HE by @molly-smith in #4040
Fix transformers unit tests by @mrwyattii in #4079
Fix Stable Diffusion Injection by @lekurile in #4078
Spread layers more uniformly when using partition_uniform by @marcobellagente93 in #4053
fix typo: change polciies to policies by @digger-yu in #4090
update ut/doc for glm/codegen by @inkcherry in #4057
zero_to_fp32 script adds support for tag argument by @EeyoreLee in #4089
add type checker ignore by @EeyoreLee in #4102
Fix generate config validation error on inference unit tests by @mrwyattii in #4107
use correct ckpt path when base_dir not available by @polisettyvarma in #4101
Disable z3 tracing profiler by @tjruwase in #4106
Pass correct node size for ZeRO++ by @cmikeh2 in #4085
add deepspeed chat arxiv report by @conglongli in #4110
enable pipeline checkpoint loading mode by @leiwen83 in #3629
Fix Issue 4083 by @jomayeri in #4084
Add full list of DS_BUILD_* by @loadams in #4119
Update nightly workflows to open an issue if CI fails by @loadams in #3952
Update torch1.9 tests to 1.10 to match latest accelerate. by @loadams in #4126
Handle PermissionError in os.chmod Call - Update engine.py by @M-Chris in #4139
Generalize frozen weights unit test by @tjruwase in #4140
Respect memory pinning config by @tjruwase in #4131
Remove incorrect async-io library checking code. by @loadams in #4150
Return nn.parameter type for weights and biases by @molly-smith in #4146
Fixes #4151 by @saforem2 in #4152
Handling for SIGTERM as well by @loadams in #4160
Fix CI Badges by @mrwyattii in #4162
Add DS-Chat CI workflow by @lekurile in #4127
[CPU][Bugfix] Make uid and addr_port part of SHM name in CCL backend by @delock in #4115
Add DSE branch input to nv-ds-chat by @lekurile in #4173
Pin transformers by @mrwyattii in #4174

New Contributors

@awaelchli made their first contribution in #3825
@wangruohui made their first contribution in #3994
@jstan327 made their first contribution in #4019
@XuehaiPan made their first contribution in #4026
@puneeshkhanna made their first contribution in #4031
@hughpu made their first contribution in #4020
@fecet made their first contribution in #3943
@marcobellagente93 made their first contribution in #4053
@polisettyvarma made their first contribution in #4101
@leiwen83 made their first contribution in #3629
@M-Chris made their first contribution in #4139

Full Changelog: v0.10.0...v0.10.1