1.21.0-rc1 (May 24, 2026)
Features:
UCP
- Added UCX_PROTO_EMULATION_ENABLE option to force zero-copy RMA protocol selection
- Added UCX_MAX_HCA_PER_GPU policy to limit GPU memory registrations to nearest HCAs
- Added device lanes that can access host memory for GPU transfer fallback
- Enabled gdr_copy for memtype endpoint transport
UCT
- Added device channel pool support
- Added support for using CPU memory as the AMO local buffer for device operations
RDMA CORE (IB, ROCE, etc.)
- Added UCX_IB_GDA_RETAIN_INACTIVE_CTX option to control inactive CUDA context retention in GDAKI
Build
- Added --without-gda configure option
- Made cuRAND an optional dependency for perftest CUDA kernels
CI/Testing
- Added dry-run package installation checks to the release package build
Bugfixes:
Build
- Fixed support for -Og by disabling always-inline attributes
UCP
- Fixed progress counter to return the actual operation status
- Fixed multi-protocol minimum size handling for 1-byte operations
- Fixed endpoint finalization when no P2P or connection-manager lane is available
UCT
- Fixed notify callback handling by adding a NULL check
CUDA
- Fixed CUDA IPC accessibility cache separation for local and remote rkeys
- Fixed CUDA IPC cache/LRU invariant for referenced regions
- Fixed DMA-BUF offsets for interior CUDA addresses
ROCM
- Fixed hangs in HIP MPI and OMB tests
RDMA CORE (IB, ROCE, etc.)
- Fixed GDA DMA-BUF offset handling
- Fixed GDA WQE ordering by using CAS-based readiness marking
UCS
- Reverted dynamically loaded external module/plugin support
Packaging
- Fixed Debian maintainer field
- Fixed GDA RPM build
- Fixed GDA RPM/devel package layout for CUDA/GDA subpackages
- Fixed RPM/DEB handling when GDA is disabled