1.15.0 (September 28, 2023)
Features:
UCP
- Added 2-stage pipeline protocol in the new protocol infrastructure
- Added reset and abort functionality of rendezvous protocols in the new infrastructure
- Added zero-copy rendezvous data send protocol in the new infrastructure
- Added support for user memory handle in the new protocol infrastructure
- Added option to force ODP registration for certain memory types
- Enabled lock-free memory region deregistration
- Updated allow/deny transport list feature to control auxiliary transport selection
- Multiple performance improvements of the new protocol infrastructure
- Multiple improvements in error and debug messages
UCT
- Split UCT_MD_MKEY_PACK_FLAG_INVALIDATE into two flags for RMA and AMO
- Added put_zcopy and get_zcopy scheme support for self transport
- Added base implementation of is_reachable_v2 API using intra/inter flag
- Introduced MD capability for non-blocking registration memory types
RDMA CORE (IB, ROCE, etc.)
- Added implementation of is_reachable_v2 routine to IB interface
- Added option to control CQE zipping per CQ RX/TX direction
- Added option to specify how DCI selects port under RoCE LAG
- Added hw_dcs to the list of policies to select DCI by an endpoint
- Removed implicit on-demand paging
- Added option to set RoCE LAG DCT port for response under queue affinity mode
- Improved IB memlock limit logging
UCS
- Added ucs_string_buffer_rbrk() to split token
GPU (CUDA, ROCM)
- Added support for atomic reply_buffer on GPU memory
- Added system device information for AMD GPUs
- Improved performance estimation of gdr_copy transport
- Added a simplistic implementation of performance estimation of cuda_ipc transport
- Improved performance estimation of cuda_ipc on Hopper architecture
- Added rcache parameters for rocm transports
- Introduced dmabuf support for rocm transports
- Implemented asynchronous progress for the zcopy operations in the rocm_copy transport
- Added option to enable using cross-device dmabuf file descriptor for rocm
Java
- Added Java bindings for exported memh feature
Tests
- Added a rocm docker container for testing
- Added option to send client_id in iodemo test
- Added support for multiple connections to the same server in iodemo test
- Added synchronization before exit to hello world examples
Tools
- Added user-side memcpy option for AM benchmarks in ucx_perftest
- Added Wireshark Lua dissectors for some UCX protocols
Build
- Added support for binutils 2.40
- Added versioned dependency to switch between packages with the same names
- Added a separate xpmem deb subpackage
- Added aarch64 support to the binary distribution pipeline
- Removed dependency on libnuma
Bugfixes:
UCP
- Fixed assertion when sending from non-contiguous GPU buffer to managed buffer
- Fixed the race condition on endpoint configurations
- Fixed endpoint reconfiguration issues due to asymmetrical selection
- Fixed endpoint reconfiguration error due to wrong locality detection
- Fixed crash during connection manager cleanup
- Fixed rkey index calculation for rendezvous protocol
- Fixed rcache dump function
- Removed logging from rkey unpack in release mode
- Fixed double free of rkey in rendezvous protocol
- Fixed rendezvous pipeline protocol error flow
- Fixed error handling in rendezvous get zcopy protocol
- Replay pending requests of wireup EP CM during connection establishment to prevent potential ordering issues and wrong configuration
- Pass user-provided memory type to the function that checks whether the buffer can be sent inline or not
- Avoid memory registration during UCP context initialization
- Fixed CPU/device atomics selection in the new protocol infrastructure
- Multiple fixes in the new protocol infrastructure information output
UCT
- Added check for dmabuf kernel support in ROCm memory domain
- Fixed exported memh packing
- Fixed an error in checking return status of multi-threaded memory registration function
RDMA CORE (IB, ROCE, etc.)
- Fixed dma-buf based memory region registration
- Fixed memory handle data corruption when PCIe relaxed ordering is enabled
- Fixed performance degradation when indirect atomic key is not supported by the hardware
- Fixed remote access error to strict-order keys because of wrong offset
- Added check for UAR support to memory domain opening
- Fixed updating port counters for DevX QP
- Fixed ibv_create_cq error message on node without InfiniBand
- Fixed performance degradation due to using 2 paths on NDR400 by default
- Removed unnecessary async lock which otherwise would block UD progress
GPU (CUDA, ROCM)
- Fixed CUDA IPC performance degradation due to libnuma removal
UCS
- Fixed lane selection and added bandwidth estimation for Sapphire Rapids family
- Fixed displaying wrong environment variable suggestions
- Fixed VFS warning output
- Fixed SEGV in ucs_debug_backtrace_next() while handling a previous SEGV in an ENOMEM situation
- Fixed memory corruption when using UCX_MPOOL_FIFO=y
UCM
- Fixed conditional jump patching
- Fixed mremap() override
GPU (CUDA, ROCM)
- Fixed usage of dmabuf when the buffer is not page-aligned
- Removed async_cb from cuda_copy to avoid the issue with UCP worker async lock
Java
- Fixed leakage of jucx_request global references
Documentation
- Updated ucp_worker_release_address description
Tests
- Fixed wrong usage of ep_close in examples
Tools
- Fixed memory access flags in perftest
- Removed support for librte from perf
- Fixed worker flush deadlock when using multiple workers in ucx_perftest
Build
- Changed 'unsupported option' ICC command line warning to error
- Removed unused fault-injection configuration option
- Fixed obsolete macro warnings in new autoconf/libtool
- Fixed building UCX with GCC 13
- Fixed UCX RPM build on machines that have libxpmem-devel rpm from MLNX_OFED installation
- Fixed ucx-rdmacm package requirements
- Fixed compilation errors with armcc-22.1
- Fixed passing port number to goperftest