Added
- TensorFlow: Added new get_local_and_global_gradients to PartialDistributedGradientTape to retrieve local and non-local gradients separately. (#3859)
Changed
- Improved reducescatter performance by allocating output tensors before enqueuing the operation. (#3824)
- TensorFlow: Ensured that tf.logical_and within allreduce tf.cond runs on CPU. (#3885)
- TensorFlow: Added support for Keras 2.11+ optimizers. (#3860)
- CUDA_VISIBLE_DEVICES environment variable is no longer passed to remote nodes. (#3865)
Fixed
- Fixed build with ROCm. (#3839, #3848)
- Fixed build of Docker image horovod-nvtabular. (#3851)
- Fixed linking recent NCCL by defaulting CUDA runtime library linkage to static and ensuring that weak symbols are overridden. (#3867, #3846)
- Fixed compatibility with TensorFlow 2.12 and recent nightly versions. (#3864, #3894, #3906, #3907)
- Fixed missing arguments of Keras allreduce function. (#3905)
- Updated with_device functions in MXNet and PyTorch to skip unnecessary cudaSetDevice calls. (#3912)