### Added
- Keras: Added `PartialDistributedOptimizer` API. (#3738)
- Added `HOROVOD_SPARK_USE_LOCAL_RANK_GPU_INDEX` environment variable to ignore GPU device indices assigned by Spark and always use the local rank GPU device in Spark estimators. (#3737)
- Added support for reducescatter arguments `prescale_factor` and `postscale_factor` and moved averaging into the Horovod backend. (#3815)
- Spark Estimator: Added support for custom data loaders in TorchEstimator. (#3787)
- Spark Estimator: Added NVTabular data loader for TorchEstimator. (#3787)
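The new `prescale_factor` and `postscale_factor` arguments scale the input before the reduction and the result after it; with averaging moved into the backend, an average amounts to a postscale by `1 / world_size`. A minimal pure-Python sketch of these semantics (the `reducescatter` helper below is illustrative only, not Horovod's API):

```python
def reducescatter(rank_tensors, prescale_factor=1.0, postscale_factor=1.0):
    """Illustrative reduce-scatter over a list of per-rank tensors (lists).

    Each rank contributes a full tensor; the elementwise sum is split so
    that rank i receives chunk i, scaled before and after the reduction.
    """
    world_size = len(rank_tensors)
    length = len(rank_tensors[0])
    # Pre-scale each input, then sum elementwise across ranks.
    summed = [
        sum(t[i] * prescale_factor for t in rank_tensors)
        for i in range(length)
    ]
    # Post-scale and scatter: rank i gets the i-th equal chunk.
    chunk = length // world_size
    return [
        [v * postscale_factor for v in summed[r * chunk:(r + 1) * chunk]]
        for r in range(world_size)
    ]

# Averaging in the backend is a postscale by 1/world_size:
ranks = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 5.0, 0.0]]
out = reducescatter(ranks, postscale_factor=1.0 / len(ranks))
# rank 0 receives [2.0, 2.0], rank 1 receives [4.0, 2.0]
```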
### Changed
- Improved NCCL performance for fused allgather operations through padding for better memory alignment. (#3727)
- Improved look-ahead tensor fusion buffer size estimates when allgather and other operations are mixed. (#3727)
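Padding each entry of a fused buffer so the next tensor starts on an alignment boundary is a standard technique for better memory alignment; a short sketch of the idea (the 64-byte boundary is an assumption for illustration, not necessarily the value Horovod uses with NCCL):

```python
def padded_size(num_bytes, alignment=64):
    """Round a buffer size up to the next multiple of `alignment` bytes."""
    return -(-num_bytes // alignment) * alignment  # ceiling division

# Fusing three tensors of 100, 64, and 7 bytes into one buffer:
offsets = []
total = 0
for size in [100, 64, 7]:
    offsets.append(total)          # each entry starts on an aligned boundary
    total += padded_size(size)
# offsets == [0, 128, 192], total buffer size == 256
```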
### Fixed
- ROCm: Fixed GPU MPI operations support in build. (#3746)
- PyTorch: Fixed linking order to avoid using Gloo from PyTorch dynamic libraries. (#3750)
- Fixed memory leak in `MPI_GPUAllgather`. (#3727)
- TensorFlow: Fixed deprecation warnings when building with TensorFlow 2.11. (#3767)
- Keras: Added support for additional arguments to `SyncBatchNormalization._moments()`. (#3775)
- Fixed version number parsing with pypa/packaging 22.0. (#3794)
- TensorFlow: Fixed linking with nightly versions leading up to TensorFlow 2.12. (#3755)
- TensorFlow: Fixed handling of `tf.IndexedSlices` types when scaling local gradients. (#3786)
- Added missing `MEMCPY_IN_FUSION_BUFFER` timeline event for reducescatter. (#3808)
- Fixed build of Docker image `horovod-nvtabular`. (#3817)
- TensorFlow: Several fixes for allreduce and grouped allreduce handling of `tf.IndexedSlices`. (#3813)
- Spark: Restricted PyArrow to versions < 11.0. (#3830)
- TensorFlow: Resolved conflicts between multiple optimizer wrappers reusing the same gradient accumulation counter. (#3783)
- TensorFlow/Keras: Fixed `DistributedOptimizer` with Keras 2.11+. (#3822)
- PyTorch, ROCm: Fixed allreduce average on process sets. (#3815)