What’s New
C++
Thrust / CUB
- Inclusive scan now supports initial value #1940
- Inclusive and exclusive scan now support problem sizes exceeding 2^31 elements #2171
- New `cub::DeviceMerge::MergeKeys` and `cub::DeviceMerge::MergePairs` algorithms #1817
- New `thrust::tabulate_output_iterator` fancy iterator #2282
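To illustrate what an inclusive scan with an initial value computes, here is a minimal host-side sketch. It uses `std::inclusive_scan` from the C++17 standard library rather than the Thrust or CUB API, and the helper name `inclusive_scan_with_init` is hypothetical. The device algorithms are assumed to produce the same values for an associative operator.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Host-side reference for "inclusive scan with an initial value".
// This is standard C++17, NOT the Thrust/CUB interface; the function
// name is a hypothetical helper used only for illustration.
std::vector<int> inclusive_scan_with_init(const std::vector<int>& in, int init)
{
  std::vector<int> out(in.size());
  // out[i] = init + in[0] + ... + in[i]
  std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{}, init);
  return out;
}
```

For the input `{1, 2, 3}` and an initial value of `10`, every output element folds the initial value into the running sum, so the first output is already `11` rather than `1`.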
Libcudacxx
- Enable assertions on host and device, depending on the user's choice
- C++26 `inplace_vector` has been implemented and backported to C++14
- Improved support for the extended floating point types `__half` and `__nv_bfloat16`, both for cmath functions and complex
- `cuda::std::tuple` is now trivially copyable if the stored types are trivially copyable
- Reworked our atomics implementation
- Improved `<cuda/std/bit>` conformance
- Implemented `<cuda/std/bitset>` and backported it to C++14
- Implemented and backported C++20 `bit_cast`. It is available in all standard modes and is constexpr with compiler support
- Various backports and constexpr improvements (`bool_constant`, `cuda::std::max`)
- Moved the experimental memory resources from `<cuda/memory_resource>` into `<cuda/experimental/memory_resource.cuh>`
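As a rough illustration of what `bit_cast` does, the sketch below reinterprets the bytes of one trivially copyable object as another type of the same size. It is a `memcpy`-based stand-in that compiles in C++11 and later, not the libcu++ implementation. The name `bit_cast_sketch` is hypothetical, and unlike the real facility this version is not constexpr.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in showing the semantics of bit_cast:
// copy the object representation of `from` into a fresh `To`.
// This is NOT the libcu++ implementation and is not constexpr.
template <class To, class From>
To bit_cast_sketch(const From& from) noexcept
{
  static_assert(sizeof(To) == sizeof(From), "types must have the same size");
  static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
  static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");
  static_assert(std::is_default_constructible<To>::value, "this sketch default-constructs To");

  To to;
  std::memcpy(&to, &from, sizeof(To)); // byte-wise copy, no value conversion
  return to;
}
```

Assuming `float` is a 32-bit IEEE-754 type, `bit_cast_sketch<std::uint32_t>(1.0f)` yields `0x3F800000`, the bit pattern of `1.0f`, rather than the converted integer `1`.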
Python
cuda.cooperative
The CCCL best practices that make CUDA kernels easier to write and faster to execute are now available in Python through the cuda.cooperative module. The module currently supports block- and warp-level algorithms within numba.cuda kernels, offering speed-of-light reductions, prefix sums, radix sort, and merge sort. You can customize cuda.cooperative algorithms with user-defined data types and operators, implemented directly in Python.
- Block- and warp-level cooperative algorithms are now available in Python #1973
- Experimental versions of reduce, scan, merge sort, and radix sort are available in numba.cuda kernels
cuda.parallel
Apart from device-side cooperative algorithms, CCCL 2.7 provides an experimental version of host-side parallel algorithms as part of the cuda.parallel module. This release includes parallel reduction.
What's Changed
- Fix documentation generation for thrust::pair by @bernhardmgruber in #1976
- Correct typo in a launch configuration header name by @pciolkosz in #1972
- Fix thrust::sort for large problem sizes by @gevtushenko in #1952
- Avoid SIGPIPE when truncating verbose output in CI scripts. by @alliepiper in #1971
- Clarify compiler support by @bernhardmgruber in #1970
- Experimental Python cooperative algorithms by @gevtushenko in #1973
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #1928
- Guard against an overflow in sort tests by @bernhardmgruber in #1980
- Remove obsolete Thrust function traits by @bernhardmgruber in #1962
- Python: Add version string & wheel build command by @leofang in #1985
- Add device inclusive scan with init_value by @gonidelis in #1845
- Fix BWUtil report on early exit by @gonidelis in #1994
- Use libcu++ void_t everywhere by @bernhardmgruber in #1977
- Drop zipped_binary_op by @bernhardmgruber in #1988
- Clarify PtxVersion and SmVersion by @bernhardmgruber in #2004
- More simplifications for CUB util_device by @bernhardmgruber in #1948
- fix some typos in `<cuda/stream_ref>` by @ericniebler in #2003
- Add CI slack notifications. by @alliepiper in #1961
- Allow nightly workflow to be manually invoked. by @alliepiper in #2007
- Need to use a different approach to reuse secrets in reusable workflows vs. actions. by @alliepiper in #2008
- Enable RAPIDS builds for manually dispatched workflows. by @alliepiper in #2009
- clean up complex.inl by @ZelboK in #1655
- Add github token to nightly workflow-results action. by @alliepiper in #2012
- Remove obsolete build system glue from the Thrust/CUB submodule structure. by @alliepiper in #2016
- Benchmark thrust::copy with non-trivially relocatable type by @bernhardmgruber in #1989
- Make bool_constant available in C++11 by @bernhardmgruber in #1997
- Spell value initialization where used in thrust vectors by @bernhardmgruber in #1990
- Do not redefine `__ELF__` macro by @miscco in #2018
- Port `thrust::merge[_by_key]` to CUB by @bernhardmgruber in #1817
- Simplify some pointer traits by @bernhardmgruber in #2020
- Simplify test data setup by @bernhardmgruber in #2023
- Add tests to ensure that we properly propagate common_type for complex types by @miscco in #2025
- Update Thrust CMake README to use CCCL repo. by @alliepiper in #2026
- Include container toolkit in manual prereqs by @bryevdv in #2064
- Avoid ADL issues with `thrust::distance` by @miscco in #2053
- Simplify thrust::detail::wrapped_function by @bernhardmgruber in #2019
- Add a test for Thrust scan with non-commutative op by @bernhardmgruber in #2024
- Update memory_resource docs by @miscco in #1883
- Temporarily switch nightly H100 CI to build-only. by @alliepiper in #2060
- Do not rely on conversions between float and extended floating point types by @miscco in #2046
- experimental wrapper types for `cudaEvent_t` that provide a modern C++ interface. by @ericniebler in #2017
- [CUDAX] Add a dummy device struct for now by @pciolkosz in #2066
- Allow (somewhat) different input value types for merge by @bernhardmgruber in #2075
- Avoid `::result_type` for partial sums in TBB reduce_by_key by @bernhardmgruber in #1998
- Fix formatting by @bernhardmgruber in #2090
- Rename and refactor transform_iterator_base by @bernhardmgruber in #1987
- Benchmark analysis: Print all top rows when asked for by @bernhardmgruber in #2089
- Makes user-provided functors in our examples use `__device__` instead of `CUB_RUNTIME_FUNCTION` by @elstehle in #2088
- Separate `cuda/experimental` when sorting includes by @bernhardmgruber in #2094
- add support to `cudax::device` for querying a device's attributes by @ericniebler in #2084
- [CUDAX] Add experimental owning abstraction for cudaStream_t by @pciolkosz in #2093
- Do not query NVRTC for cuda runtime header by @miscco in #2102
- Cleanup CUB block/thread load and exchange by @bernhardmgruber in #1946
- Improve binary function objects and replace thrust implementation by @srinivasyadav18 in #1872
- Replace `_LIBCUDACXX_CPO_ACCESSIBILITY` with `_CCCL_GLOBAL_VARIABLE` by @miscco in #1881
- Add script to update RAPIDS version. by @bdice in #2082
- Update bad links by @bryevdv in #2080
- Fix line break issues that break doxygen code examples by @miscco in #2103
- Add internal wrapper for cuda driver APIs by @pciolkosz in #2070
- Use `common_type` for complex `pow` by @miscco in #1800
- [CUDAX] rename `device` to `device_ref`, add immovable `device` as a place to cache properties by @ericniebler in #2110
- Use the float flavors of the cmath functions in the extended floating point fallbacks by @miscco in #2106
- [PoC]: Implement `cuda::experimental::uninitialized_buffer` by @miscco in #1831
- Ensure that we avoid ABI Version conflicts by @miscco in #2137
- Ensure that `cuda_memory_resource` allocates memory on the proper device by @miscco in #2073
- Clarify compatibility wrt. template specializations by @bernhardmgruber in #2138
- Implement a `cudax::get_stream` CPO by @miscco in #2135
- Make `cuda::std::tuple` trivially copyable by @miscco in #2127
- Fix missing copy of docs artifacts by @miscco in #2162
- Fix g++-14 warning on uninitialized copying by @bernhardmgruber in #2157
- Fix flakey heterogeneous tests by @wmaxey in #2085
- Fix multiple definition of InclusiveScanKernel by @bernhardmgruber in #2169
- [CUDAX] Add a global constexpr `cudax::devices` range for all devices in the system by @ericniebler in #2100
- fix use of `cudaStream_t` as if it were a stream wrapper by @ericniebler in #2190
- Fix uninitialized_buffer self assignment by @miscco in #2170
- Fix trivial_copy_device_to_device execution space by @gevtushenko in #2164
- Clarify libcu++ use by non-CUDA compilers by @bernhardmgruber in #1969
- Warn when using C++14 in CUB and Thrust by @bernhardmgruber in #2166
- Fix the `clang-format` path in the devcontainers by @miscco in #2194
- Mount a temporary build volume for CCCL projects if WSL is detected by @wmaxey in #2035
- 2118 [CUDAX] Change the RAII device swapper to use driver API and add it in places where it was missing by @pciolkosz in #2192
- Fix singular vs plural typo in thread scope documentation. by @brycelelbach in #2198
- [CUDAX] fixing some minor issues with device attribute queries by @ericniebler in #2183
- Integrate Python docs by @bryevdv in #2196
- [FEA] Atomics codegen refactor by @wmaxey in #1993
- [CUDAX] add `__launch_transform` to transform arguments to `cudax::launch` prior to launching the kernel by @ericniebler in #2202
- Cleanup common testing headers and correct asserts in launch testing by @pciolkosz in #2204
- [CUDAX] Add an API to get device_ref from stream and add comparison operator to device_ref by @pciolkosz in #2203
- Update devcontainer docs for WSL by @jrhemstad in #2200
- add `cudax::distribute<threadsPrBlock>(numElements)` by @ericniebler in #2210
- Rework mdspan concept emulation by @miscco in #2213
- Un-doc functions taking debug_synchronous by @bryevdv in #2209
- CUDA `vector_add` sample project by @ericniebler in #2160
- avoid constraint recursion in the `resource` concept by @ericniebler in #2215
- fix `cuda_memory_resource` test for properly aligned memory by @ericniebler in #2227
- Fix including `<complex>` when bad CUDA bfloat/half macros are used. by @wmaxey in #2226
- Add license & fix `long_description` in `setup.py` by @leofang in #2211
- Extract reduction kernels into NVRTC-compilable header by @gevtushenko in #2231
- Implement `<cuda/std/bitset>` by @griwes in #1496
- Refactor Thrust placeholder operators by @bernhardmgruber in #2233
- Add missing annotations for deprecated debug_sync APIs by @bernhardmgruber in #2212
- Test thrust headers for disabled half/bf16 support by @bernhardmgruber in #2219
- Make cuda::std::max constexpr in C++11 by @bernhardmgruber in #2107
- Fix ForEachCopyN for non-contiguous iterators by @bernhardmgruber in #2220
- Configure CUB/Thrust for C++17 by default by @bernhardmgruber in #2217
- Allow installing components when downstream by @stephenswat in #2096
- Rename the memory resources to drop the superfluous prefix `cuda_` by @miscco in #2243
- Fix and simplify by @wmaxey in #2197
- Proclaim pair and tuple trivially relocatable by @bernhardmgruber in #2010
- Make `cuda::std::min` constexpr in C++11 by @miscco in #2249
- Add `CCCL_DISABLE_NVTX` macro by @bernhardmgruber in #2173
- Workaround GCC 13 issue with empty histogram decoder op by @bernhardmgruber in #2252
- Refactor Thrust's logical meta functions by @bernhardmgruber in #2260
- Fix use of doxygen \file command by @bernhardmgruber in #2259
- Add tests for transform_iterator's reference type by @bernhardmgruber in #2221
- Small tuning script output improvements by @bernhardmgruber in #2262
- Fix Thrust::vector ctor selection for int,int by @bernhardmgruber in #2261
- Adds support for large number of items to `DeviceScan` by @elstehle in #2171
- Use and test radix sort for int128, half and bfloat16 in Thrust by @bernhardmgruber in #2168
- Implement C API for device reduction by @gevtushenko in #2256
- Move cooperative module by @gevtushenko in #2269
- Move compiler version macros into libcu++ by @bernhardmgruber in #2250
- Introduce cuda.parallel module by @gevtushenko in #2276
- Adds `thrust::tabulate_output_iterator` by @elstehle in #2282
- Drop macos string that lit cannot parse properly by @miscco in #2283
- Flatten forwarding headers by @miscco in #2284
- 2270 static compute capabilities queries by @pciolkosz in #2271
- Fix read of dangling reference in thrust placeholders by @bernhardmgruber in #2290
- Implement `any_resource`, an owning wrapper around a memory resource by @ericniebler in #2266
- Fixes formatting of `tabulate_output_iterator.inl` by @elstehle in #2298
- use `NV_IF_TARGET` to conditionally compile CUDAX tests by @ericniebler in #2297
- Make for_each compatible with NVRTC by @wmaxey in #2288
- refactor cmake so more cudax samples can be easily added by @ericniebler in #2296
- Use the `in`, `out`, and `inout` parameter decorators from `cudax::launch` by @ericniebler in #2294
- Implement `std::bit_cast` by @miscco in #2258
- Cleanup the `<cuda/std/bit>` header by @miscco in #2299
- change `cudax::uninitialized_buffer` to own its memory resource with `cudax::any_resource` by @ericniebler in #2293
- Documentation typos by @fbusato in #2302
- Add thrust::inclusive_scan with init_value support by @gonidelis in #1940
- Assure placeholder expressions are semi-regular by @bernhardmgruber in #2305
- Add documentation for `any_resource` by @miscco in #2309
- Implement P0843 `inplace_vector` by @miscco in #1936
- Cleanup `__config` and unify most visibility macros by @miscco in #2285
- Add a fast, low memory "limited" mode to CUB testing. by @alliepiper in #2317
- [CUDAX] Add event_ref::is_done() and update event tests by @pciolkosz in #2304
- Minor cleanup to memory resources by @miscco in #2308
- Drop ICC from the cudax support matrix by @miscco in #2330
- Do not hardcode Thrust's host system to cpp. by @alliepiper in #2332
- [CUDAX] Add compute_capability device attribute and handle arch_traits for future architectures by @pciolkosz in #2328
- Disable exec checks on ranges CPOs by @miscco in #2331
- Enable exceptions by default by @miscco in #2329
- Make the thrust dispatch mechanisms configurable by @miscco in #2310
- [CUDAX] give all the cudax headers the `.cuh` extension by @ericniebler in #2340
- Compiler version improvements by @fbusato in #2316
- Fix hardcoding __THRUST_HOST_SYSTEM_NAMESPACE to cpp by @bernhardmgruber in #2341
- Improvements to the Cuda Core C library infrastructure by @miscco in #2336
- Fix bug remaining on thrust::inclusive_scan with init value with CDP by @gonidelis in #2346
- [CUDAX] make `uninitialized_buffer` usable with `launch` by @ericniebler in #2342
- Test and fix failing nightly libcudacxx + CUB jobs by @miscco in #1847
- Update Memory Model docs for HMM by @gonzalobg in #2272
- Harden thrust algorithms against evil iterators that overload `operator,` by @miscco in #2349
- Avoid circular concept definition with memory resources by @miscco in #2351
- add IWYU `export` pragma on config headers by @ericniebler in #2352
- Add cuda_parallel to CI. by @alliepiper in #2338
- [CUDAX] Branch out an experimental version of stream_ref by @pciolkosz in #2343
- Improve visibility macros for libcu++ by @miscco in #2337
- Add missing cuKernelGetFunction call to reduce by @pciolkosz in #2355
- Move `invalid_stream` to the proper file by @miscco in #2360
- fix the cudax `vector_add` sample by @ericniebler in #2372
- Add -Wmissing-field-initializers to cudax by @pciolkosz in #2373
- Update CCCL version to 2.7.0 by @wmaxey in #2364
- Backport several fixes into 2.7.x. by @wmaxey in #2579
- [BACKPORT]: Rework `head_flags` so that we do not rely on the tuple being unevaluated (#2619) by @miscco in #2620
- [Backport] Fix cluster launch error in branch/2.7.x by @wmaxey in #2866
- Disable execution checks for tuple (#2780) by @wmaxey in #2867
- [BACKPORT]: Fix Thrust/CUB tests by adding empty base opt-ins to iterator classes (#3066) by @miscco in #3068
- [Backport] Fix EBO in zip_iterator on MSVC. by @wmaxey in #3107
New Contributors
- @bryevdv made their first contribution in #2064
- @stephenswat made their first contribution in #2096
Full Changelog: v2.6.1...v2.7.0