What's Changed
- Update version.txt after 0.13.2 release by @mrwyattii in #5119
- Stop tracking backward chain of broadcast (ZeRO3) by @tohtana in #5113
- [NPU]ZeRO-Infinity feature compatibility by @misstek in #5077
- BF16 optimizer: Improve device utilization by immediate grad update by @deepcharm in #4975
- removed if condition in `if collate_fn is None` by @bm-synth in #5107
- disable compile tests for torch<2.1 by @mrwyattii in #5121
- Update inference test model names by @mrwyattii in #5127
- Fix issue with zero-sized file after merging file on curriculum `map_reduce` by @bm-synth in #5106
- Update return codes in PyTest to properly error out if tests fail by @loadams in #5122
- add missing methods to MPS_Accelerator by @mrwyattii in #5134
- Solve tensor vs numpy dtype conflicts in data efficiency map-reduce. by @bm-synth in #5108
- Fix broadcast deadlock for incomplete batches in data sample for data analysis by @bm-synth in #5117
- Avoid zero-sized microbatches for incomplete minibatches when doing curriculum learning by @bm-synth in #5118
- remove mandatory `index` key from output of `metric_function` in `DataAnalysis` map operation by @bm-synth in #5112
- tensorboard logging: avoid item() outside gas to improve performance by @nelyahu in #5135
- Check overflow on device without host synchronization for each tensor by @BacharL in #5115
- Update nv-inference torch version by @loadams in #5128
- Method `run_map_reduce` to fix errors when running `run_map` followed by `run_reduce` by @bm-synth in #5131
- Added missing `isinstance` check in PR 5112 by @bm-synth in #5142
- Fix UserWarning: The torch.cuda.*DtypeTensor constructors are no long… by @ShukantPal in #5018
- TestEmptyParameterGroup: replace fusedAdam with torch.optim.AdamW by @nelyahu in #5139
- Update deprecated HuggingFace function by @mrwyattii in #5144
- Pin to PyTest 8.0.0 by @loadams in #5163
- get_grad_norm_direct: fix a case of empty norm group by @nelyahu in #5148
- Distributed in-memory map-reduce for data analyzer by @bm-synth in #5129
- DeepSpeedZeroOptimizer_Stage3: remove cuda specific optimizer by @nelyahu in #5138
- MOE: Fix save checkpoint when TP > 1 by @mosheisland in #5157
- Fix gradient clipping by @tohtana in #5150
- Use ninja to speed up build by @jinzhen-lin in #5088
- Update flops profiler to handle attn and matmul by @KimmiShi in #4724
- Fix allreduce for BF16 and ZeRO0 by @tohtana in #5170
- Write multiple items to output file at once, in distributed data analyzer. by @bm-synth in #5169
- Fix typos in blogs/ by @jinyouzhi in #5172
- Inference V2 Human Eval by @lekurile in #4804
- Reduce ds_id name length by @jomayeri in #5176
- Switch cpu-inference workflow from --extra-index-url to --index-url by @loadams in #5182
New Contributors
Full Changelog: v0.13.2...v0.13.3