What's Changed
- Adds benchmarks for `DeviceSelect::Unique` by @elstehle in #2359
- CUB - Enable DPX Reduction by @fbusato in #2286
- [CUDAX] add a small c++17 implementation of `std::execution` (aka P2300) by @ericniebler in #2301
- Add thrust::transform_inclusive_scan with init value by @gonidelis in #2326
- Widen histogram agent constructor to more types by @bernhardmgruber in #2380
- Use a constant for the amount of static SMEM by @bernhardmgruber in #2374
- Add `cub::DeviceTransform` by @bernhardmgruber in #2086
- Update toolkit to CTK 12.6 by @miscco in #2348
- implement `make_integer_sequence` in terms of intrinsics whenever possible by @ericniebler in #2384
- Implement `cuda::mr::cuda_async_memory_resource` by @miscco in #1637
- Drop implementation of `thrust::pair` and `thrust::tuple` by @miscco in #2395
- Pull out `_LIBCUDACXX_UNREACHABLE` into its own file by @miscco in #2399
- Share common compiler flags in new CCCL-level targets. by @alliepiper in #2386
- conditionally include `<crt/host_defines.h>` from `__cccl/execution_space.h` header by @ericniebler in #2406
- add some simple utilities for manipulating lists of types by @ericniebler in #2370
- Drop thrusts diagnostic suppression warnings by @miscco in #2392
- [PoC]: Implement `cuda::experimental::uninitialized_async_buffer` by @miscco in #1854
- Fix thrust package to work with newer FindOpenMP.cmake. by @alliepiper in #2421
- Introduce `cccl_configure_target` cmake function. by @alliepiper in #2388
- Fix sccache errors in RAPIDS builds by @trxcllnt in #2417
- Replace `CUDA C++ Core Libraries` with `CUDA Core Compute Libraries` (only in README.md). by @rwgk in #2424
- Minor cleanup with `cuda/atomic` by @miscco in #2418
- `uninitialized_buffer::get_resource` returns a ref to an `any_resource` that can be copied by @ericniebler in #2431
- Refactor `cuda::ceil_div` to take two different types by @miscco in #2376
- Reduce PR testing matrix. by @alliepiper in #2436
- Implement `cudax::shared_resource` by @miscco in #2398
- Increase the libcu++ timeout by @miscco in #2435
- Move c/include/cccl/.h files to c/include/cccl/c/.h by @rwgk in #2428
- Make `any_resource` emplacable by @miscco in #2425
- Fix issues with `__host__` and `__device__` definitions by @miscco in #2413
- Make `bit_cast` play nice with extended floating point types by @miscco in #2434
- Do not include our own string.h file by @miscco in #2444
- Move nightly time by @bdice in #2437
- Remove a ton of lines in thrust tests by @gonidelis in #2356
- [CUDAX] Add placeholder green context type and logical device that can hold both a green ctx and a device by @pciolkosz in #2446
- Fix typo in CCCLBuildCompilerTargets.cmake by @alliepiper in #2453
- Drop superfluous compile definition from thrust tests by @miscco in #2450
- Consolidate packages and install rules by @alliepiper in #2456
- Prune CUB's ChainedPolicy by CUDA_ARCH_LIST by @bernhardmgruber in #2154
- fixes merge conflict for policy pruning by @elstehle in #2466
- Add CCCL_ENABLE_WERROR flag. by @alliepiper in #2463
- Add CUB tests for segmented sort/radix sort with 64-bit num. items and segments by @fbusato in #2254
- Propagate compiler flags down to libcu++ LIT tests by @Artem-B in #2420
- Drop remaining uses of `_LIBCUDACXX_COMPILER_*` by @miscco in #2467
- Avoid C++17 extension in c++11 tests by @miscco in #2469
- Add span to example and templated block size by @Kh4ster in #2470
- Drop Objective C++ support by @miscco in #2468
- removes superfluous template keyword in call to Dereference by @andrewcorrigan in #2482
- Improve build times in several heavyweight libcudacxx tests. by @wmaxey in #2478
- Drop `__availability` header by @miscco in #2484
- Replace a few more instances of `CUDA C++ Core Libraries` with `CUDA Core Compute Libraries`. by @rwgk in #2447
- Fix `common_type` specialization for extended floating point types by @miscco in #2483
- Implement some CUDA API calls for `async_memory_pool` by @miscco in #2455
- Move cudax example project to CCCL project examples. by @alliepiper in #2462
- Disable system header for narrowing conversion check by @miscco in #2465
- Require resources to always provide at least one execution space property by @miscco in #2489
- Rework builtin handling by @miscco in #2461
- Disable execution checks for `std::equal` by @miscco in #2491
- replace `_CCCL_ALWAYS_INLINE` with `_CCCL_FORCEINLINE` by @ericniebler in #2439
- Drop 2 relative includes that snuck in by @miscco in #2492
- re-express the `cudax::__tupl::__apply` member to make nvc++ happy by @ericniebler in #2493
- Drop badly named `_One_of` concept by @miscco in #2490
- Unify assert handling in cccl by @miscco in #2382
- Reduce scope of Thrust linkage in cudax. by @alliepiper in #2496
- Centralize CPM logic. by @alliepiper in #2495
- Fix typo in presets. by @alliepiper in #2497
- Refactor away per-project TOPLEVEL flags. by @alliepiper in #2498
- [FEA]: Validate cuda.parallel type matching in build and execution by @rwgk in #2429
- avoid gcc optimizer bug by not force inlining part of `thrust::transform` by @ericniebler in #2509
- Cleanup and modularize `<cuda/std/barrier>` by @miscco in #2443
- Consolidate header testing infra. by @alliepiper in #2460
- Add ForEachN from CUB to cccl/c. by @wmaxey in #2378
- Adds support for large number of items in `DeviceSelect` and `DevicePartition` by @elstehle in #2400
- Adds support for large number of items to `DeviceScan::*ByKey` family of algorithms by @elstehle in #2477
- Integrate c/parallel with CCCL build system and CI. by @alliepiper in #2514
- Create a command list utility for nvrtc/jitlink steps. by @wmaxey in #2511
- Fix the example project which the documentation refers to by @caugonnet in #2531
- Enable tests/headertests for c/parallel in all-dev presets. by @alliepiper in #2566
- Rename cudax test targets to match CCCL conventions. by @alliepiper in #2568
- Update project list in issue template by @alliepiper in #2532
- Disable compiler extensions on CCCL targets. by @alliepiper in #2559
- Fixes `cub::DeviceMemcpy::Batched` to be able to copy from `const` source pointers by @elstehle in #2573
- Fix documentation error in ci/build_common.sh for -arch by @caugonnet in #2574
- gcc-14 gained the ability to mangle `noexcept` expressions by @ericniebler in #2565
- Miscellaneous simple fixes by @rwgk in #2575
- Avoid including `yvals.h` when the compiler is not MSVC. by @wmaxey in #2545
- Fix popc.h when architecture is not x86 on MSVC. by @wmaxey in #2524
- test for exceptions support on msvc with the `_CPPUNWIND` macro by @ericniebler in #2576
- fix the forwarding of the receiver in the `just_from` algorithm by @ericniebler in #2569
- Block type pack indexing on NVCC by @wmaxey in #2563
- Cleanup the semaphore headers by @miscco in #2441
- Add `_CCCL_GRID_CONSTANT` macro by @fbusato in #2530
- Add `_CCCL_RESTRICT` macro by @fbusato in #2529
- Try to use the same redefinition of `__assert_fail` as pytorch has by @miscco in #2577
- Fix miscellaneous bugs in cub/iterator documentation. by @rwgk in #2580
- Expose parts of `<cuda/std/memory>` by @fbusato in #2502
- add a config macro for testing support for inline variables by @ericniebler in #2581
- add dialect macros `_CCCL_NO_RTTI` and `_CCCL_NO_TYPEID` by @ericniebler in #2578
- fix misspelling in the `_CCCL_NO_VARIABLE_TEMPLATES` macro by @ericniebler in #2584
- Add `atomic_ref` support for 8 and 16b types. by @wmaxey in #2255
- add `_LIBCUDACXX_REQUIRES_EXPR` to the concepts emulation macros by @ericniebler in #2564
- Ensure CuPy arrays can be used with `cuda.parallel` too by @leofang in #2335
- assert that `cuda::std::declval` is `noexcept` by @ericniebler in #2588
- Revert accidental force push to main. by @wmaxey in #2596
- add `__is_callable_v` variable template when possible by @ericniebler in #2598
- Cleanup threading support by @miscco in #2507
- CCCL_TOPLEVEL_PROJECT always needs to be defined by @robertmaynard in #2597
- Strip prefix paths from cudax documentation by @caugonnet in #2603
- examples/cudax/CMakeLists.txt should not be executable by @caugonnet in #2594
- [CUDAX] Peer access control on async_memory_pool and async_memory_resource by @pciolkosz in #2587
- Introduce `_CCCL_PRAGMA` to CCCL by @davebayer in #2610
- Only enable CUDA language when needed. by @alliepiper in #2612
- Modularize latch by @miscco in #2508
- Unify kernel dispatch paths for device reduce between CUB and c.parallel. by @griwes in #2591
- Integrate CUDASTF -> CudaX by @caugonnet in #2572
- [STF] The cmake example for stf was not updated when moving to main branch by @caugonnet in #2618
- Rework `head_flags` so that we do not rely on the tuple being unevaluated by @miscco in #2619
- [CUDAX] size_bytes in buffer types by @pciolkosz in #2621
- fix portability bug in libcu++'s implementation of `char_traits` by @ericniebler in #2623
- [cccl/c] Unify some build boilerplate by @wmaxey in #2625
- devcontainer: replace `VAULT_HOST` with `AWS_ROLE_ARN` by @jjacobelli in #2604
- Add checks to unique_id by @andralex in #2622
- Add `cuda::get_device_address` by @miscco in #2611
- Do not pass integral constants to ptx by @miscco in #2229
- Add nvhpc devcontainer to CI by @miscco in #1488
- Use a default initialization for CUDA graph mem alloc nodes by @caugonnet in #2632
- [CUDAX] Add get_name to device_ref by @pciolkosz in #2631
- Add 12.5 devcontainer needed for nvhpc by @miscco in #2634
- a substitute for `std::type_info` when the compiler doesn't support RTTI by @ericniebler in #2582
- Check for missing `inline` on functions in public headers. by @alliepiper in #2570
- fix linker errors about multiply defined symbols in STF by @ericniebler in #2641
- Add installation presets and update README with install steps by @alliepiper in #2643
- Fix annotated_ptr test failures. by @wmaxey in #2607
- Issue a deprecation warning when compiling with ICC by @bernhardmgruber in #2076
- Include all python libs in inspect_changes. by @alliepiper in #2648
- Add reusable workflow for updating version in branch with a PR by @wmaxey in #2589
- define `_CCCL_NO_RTTI` in device code; RTTI isn't available there by @ericniebler in #2639
- Migrate C2H library to top-level library by @alliepiper in #2629
- [CUDAX] Add can_peer_access_to API to device_ref and check both ways access in get_peers by @pciolkosz in #2642
- Use `_CCCL_ASSERT` for stf by @miscco in #2645
- un-templatize CUDASTF's `callback_completion_kernel` per @robertmaynard by @ericniebler in #2656
- Implement C++20 `<source_location>` by @miscco in #2628
- Disable `[[no_unique_address]]` for clang and mdspan by @miscco in #2646
- [STF] Adapt timing_with_fences test to be more reliable by @caugonnet in #2658
- Add prefetching kernel as new fallback for `cub::DeviceTransform` by @bernhardmgruber in #2396
- Drop `cub::DeviceTransform` fallback to `cub::DeviceFor` by @bernhardmgruber in #2660
- Ignore more files when detecting CI changes. by @alliepiper in #2654
- Add `thrust::universal_host_pinned_vector` by @bernhardmgruber in #2653
- add new type-list algorithms `copy_if`, `remove_if`, `find_if`, and `unique` by @ericniebler in #2644
- abide by CCCL config macro naming conventions for `_CCCL_PRETTY_FUNCTION` and `_CCCL_NO_BUILTIN_STRLEN` by @ericniebler in #2640
- [STF] Fix how we define multi-dimensional shapes in the documentation by @caugonnet in #2662
- Automate creating a CCCL release from RC tags. by @wmaxey in #2657
- Enable span to work with contiguous std containers in C++17 by @miscco in #2613
- [Version] Update main to v2.8.0 by @github-actions in #2670
- promote the cudax `__async/config.cuh` to be the config for all of cudax by @ericniebler in #2638
- avoid using nvcc's `__type_pack_element` before 12.2 by @ericniebler in #2673
- Update ninja_summary.py to support ninja log v6. by @alliepiper in #2663
- Rename new CUB headers to follow conventions. by @alliepiper in #2675
- consistent use of `_CUDAX` function attributes in the cudax `__async/` directory by @ericniebler in #2676
- [CUDAX] Add forwarding reference to functor accepting launch by @pciolkosz in #2677
- [CUDAX] Add initial bits of copy_bytes and fill_bytes by @pciolkosz in #2608
- suppress msvc warning "qualifier applied to function type" in `is_function` by @ericniebler in #2683
- Disable ublkcp CUB transform kernel for NVHPC by @bernhardmgruber in #2664
- Deprecate `thrust::cuda_cub::identity` by @bernhardmgruber in #2688
- Remove an unused variable by @bernhardmgruber in #2690
- Setup cudax examples. by @alliepiper in #2697
- portability fixes for `_CCCL_BUILTIN_PRETTY_FUNCTION` and `_CCCL_TYPEID` by @ericniebler in #2695
- address portability issues found while using the typelist/typeset utilities by @ericniebler in #2694
- Make tests technically correct by initializing the barrier by @miscco in #2701
- Fix invalid memory reads in test_device_batch_copy. by @alliepiper in #2698
- revert config macros `_CCCL_CUDACC_BELOW_XX_X` to their original semantics by @ericniebler in #2700
- This cleans up our function objects a bit by @miscco in #2702
- Drop handling of 32bit Windows by @bernhardmgruber in #2689
- Guard inclusion of `cuda_runtime_api` by using a cuda compiler by @miscco in #2704
- Fix race condition in block_reduce_raking. by @alliepiper in #2699
- Honor CCCL_ENABLE_WERROR for CUDA targets. by @alliepiper in #2705
- Fix nvbench helper compilation for clang-18 by @bernhardmgruber in #2707
- Default ctor of device_ptr and normal_iterator by @bernhardmgruber in #2708
- Add `cuda::minimum` and `cuda::maximum` by @Jacobfaib in #2681
- Various fixes to `cub::DeviceTransform` by @bernhardmgruber in #2709
- Make `thrust::transform` use `cub::DeviceTransform` by @bernhardmgruber in #2389
- Ensure that we only use the inline variable trait when it is actually available by @miscco in #2712
- [CUDAX] Rename memory resource and memory pool from async to device by @pciolkosz in #2710
- triple_chevron fix by @fbusato in #2720
- Improve `uninitialized_{async_}buffer` API by @miscco in #2713
- Fix merge conflict from renaming of async_memory_resource by @miscco in #2728
- [STF] Improve DOT graph outputs by @caugonnet in #2703
- Implement `_CCCL_SUPPRESS_DEPRECATED_[PUSH|POP]` for ICC and NVHPC by @bernhardmgruber in #2730
- Clean up CUB thread operators by @bernhardmgruber in #2716
- Deprecate/replace more of Thrust functional by @bernhardmgruber in #2105
- Alias `cuda::std::identity` to `__identity` by @bernhardmgruber in #2733
- Do not read uninitialized memory for OOB elements. by @alliepiper in #2739
- Add option to conditionally build CUDASTF by @miscco in #2731
- fix `cuda::std::bit_width()` return type by @fbusato in #2745
- [STF] Option to disable kernel generation in CUDASTF by @caugonnet in #2723
- fix `static_extent()` return type by @fbusato in #2751
- make the empty parens after level constructors optional by @ericniebler in #2750
- cudax: rename ustdex's `__query` member function to `query` by @ericniebler in #2757
- Implement execution policies by @miscco in #2715
- Document some transform iterator corner cases by @bernhardmgruber in #2740
- Shorten the git commit message in the ci scripts by @miscco in #2760
- Separate CUDA and C++ code in C2H by @bernhardmgruber in #2734
- Make `get_stream` work with queries by @miscco in #2761
- Allow `thrust::identity` to forward value category by @bernhardmgruber in #2732
- Proclaim Thrust/CUB/libcu++ functor address stability by @bernhardmgruber in #2719
- give `declval` an implementation that compiles 2x faster by @ericniebler in #2758
- [CUDAX] Add modernized simpleP2P sample by @pciolkosz in #2696
- `s/get_delegatee_scheduler/get_delegation_scheduler/` by @ericniebler in #2766
- remove duplicated `__apply_cv` type trait by @ericniebler in #2754
- merge metaprogramming libs from libcudac++ and µstdex by @ericniebler in #2767
- Doc fix scan by @karthikeyann in #2769
- Remove obsolete ways to set iterator category in CUB by @bernhardmgruber in #2759
- Run `thrust::transform` benchmarks with more elements by @bernhardmgruber in #2764
- Increase libcu++ timeout by @miscco in #2774
- [STF] Rename the redux access mode into relaxed by @caugonnet in #2776
- Enable type trait aliases in all standard modes by @miscco in #2763
- Optimize, Cleanup, and Expose CUB Thread-Level Reduction by @fbusato in #2756
- Disable execution checks for tuple by @miscco in #2780
- Avoid benchmarking first-time setup in Thrust algorithms by @bernhardmgruber in #2782
- Improve listing benchmarks and text by @bernhardmgruber in #2778
- Fix thrust partition docs typo by @gonidelis in #2791
- Drop unused sanitizer hook by @miscco in #2793
- use `_CCCL_HAS_FEATURE` instead of plain `__has_feature` everywhere by @davebayer in #2794
- Avoid `make_zip_iterator(make_tuple(...))` by @bernhardmgruber in #2796
- implement `_CCCL_HAS_INCLUDE` by @davebayer in #2786
- add `__cpp_lib_mdspan` feature-test macro by @fbusato in #2787
- Remove redundant cmake from example. by @alliepiper in #2804
- change `__as_type_list` so it doesn't cause the instantiation of its argument by @ericniebler in #2803
- [CUDAX] Enable passing hierarchy levels directly into make_config by @pciolkosz in #2755
- Fix cudacc/cluster detection macro in launch path of libcudacxx tests by @wmaxey in #2811
- [STF] Replace CUDASTF_CODE_GENERATION by !CUDASTF_DISABLE_CODE_GENERATION by @caugonnet in #2797
- Reduce P0 benchmark variations for merge_sort_pairs by @bernhardmgruber in #2798
- Replace macros by lambdas in cub::DeviceTransform by @bernhardmgruber in #2817
- Add `nvrtc_sm_top_level::add_link_list()` and use in c/parallel/src/reduce.cu by @rwgk in #2781
- give `completion_signatures` a fast lookup cache by @ericniebler in #2812
- implement new compiler checks for NVHPC by @davebayer in #2816
- Unify [CCCL|CUB|THRUST]_ENABLE_BENCHMARKS by @bernhardmgruber in #2827
- Remove traces of metal from CCCL by @bernhardmgruber in #2828
- Move our CUDACC version checks towards the new version check design by @miscco in #2826
- Extend CUB benchmarking documentation by @bernhardmgruber in #2831
- Remove all warm-up runs from Thrust benchmarks by @bernhardmgruber in #2838
- Utility scripts for benchmark database by @gevtushenko in #2847
- [CUDAX] Add missing sm_61 traits by @pciolkosz in #2848
- Move `_CCCL_COMPILER_ICC` to the new macro by @miscco in #2849
- Fix wrong include in Thrust benchmark by @bernhardmgruber in #2854
- Add missing include by @bernhardmgruber in #2855
- Move `_CCCL_COMPILER_GCC` to the new macro by @davebayer in #2850
- Add benchmarking and tuning presets by @bernhardmgruber in #2856
- Fix race condition in block-RLD test harness. by @alliepiper in #2706
- Add MatX build to CCCL CI by @alliepiper in #2682
- Fix DeviceSegmentedSort NVTX range name by @davidwendt in #2857
- Make discovery mechanism for `cuda/_include` directory compatible with `pip install --editable` by @rwgk in #2846
- add missing `DOXYGEN_*` predefined macros when building the cudax docs by @ericniebler in #2858
- correct the names of `shared_resource`'s async allocate/deallocate members by @ericniebler in #2880
- [Docs/PTX] Add device tensor map init example by @ahendriksen in #1983
- Fix rst typos in benchmarking.html by @gonidelis in #2868
- Include use of NVHPC in CUB/Thrust magic namespace by @bernhardmgruber in #2771
- backport `to_underlying` by @davebayer in #2853
- move `_CCCL_COMPILER_CLANG` to the new macro by @davebayer in #2859
- Automate release branch creation by @wmaxey in #2685
- Add `thrust_create_target` `DISPATCH` option. by @alliepiper in #2844
- `for_each_in_extent` by @fbusato in #2518
- Fix old gcc version check by @davebayer in #2904
- Move implementation of `_LIBCUDACXX_TEMPLATE` to CCCL by @miscco in #2832
- Try to work around issue with NVHPC in conjunction with older CTK versions by @miscco in #2889
- Refactor nvbench helper less_t by @bernhardmgruber in #2905
- add "`interface`" to `_CCCL_PUSH_MACROS` by @ericniebler in #2919
- Replace inconsistent Doxygen macros with `_CCCL_DOXYGEN_INVOKED` by @ericniebler in #2921
- implement C++26 `std::span::at` by @davebayer in #2924
- move msvc compiler macros to new version by @davebayer in #2885
- Reorganize PTX tests to match generator by @bernhardmgruber in #2930
- Reorganize PTX docs to match generator by @bernhardmgruber in #2929
- Improve build instructions for libcu++ by @miscco in #2881
- Reorganize PTX headers to match generator by @bernhardmgruber in #2925
- implement C++26 `std::span`'s constructor from `std::initializer_list` by @davebayer in #2923
- Add tuple protocol to `cuda::std::complex` from C++26 by @davebayer in #2882
- Add missing qualifier for cuda namespace by @bernhardmgruber in #2940
- Try to fix a clang warning: by @bernhardmgruber in #2941
- minor consistency improvements in concepts macros by @ericniebler in #2928
- Drop some of the mdspan fold implementation by @miscco in #2949
- [STF] Implement CUDASTF_DOT_TIMING for the ctx.cuda_kernel construct by @caugonnet in #2950
- Avoid potential null dereference in `annotated_ptr` by @miscco in #2951
- make compiler version comparison utility generic by @davebayer in #2952
- Add SM100 descriptor to target by @miscco in #2954
- Regenerate `cuda::ptx` headers/docs and run format by @bernhardmgruber in #2937
- Regenerate `cuda::ptx` test by @bernhardmgruber in #2953
- Do not include extended floating point headers if they are not needed by @miscco in #2956
- [CUDAX] Add copy_bytes and fill_bytes overloads for mdspan by @pciolkosz in #2932
- add a `_CCCL_NO_CONCEPTS` config macro by @ericniebler in #2945
- remove definition of macro (`_LIBCUDACXX_NO_RTTI`) that is no longer used by @ericniebler in #2957
- Avoid symbol clashes with libc++ by @miscco in #2955
- Add more CUB transform benchmarks by @bernhardmgruber in #2906
- Start reworking our math functions by @miscco in #2749
- Drop memory resources in libcu++ by @miscco in #2860
- `std::dims` by @fbusato in #2961
- Fix merge conflict from moving resources up a namespace by @miscco in #2965
- [CUDAX] Add a way to combine thread hierarchies by @pciolkosz in #2746
- Require approval to run CI on draft PRs by @bdice in #2969
- fix thread-reduce performance regression by @fbusato in #2944
- add a `__type_switch` utility and use it in the ptx generator by @ericniebler in #2946
- replace use of old `_CONCEPT_FRAGMENT` macro in cudax by @ericniebler in #2973
- remove vestigial uses of the old `DOXYGEN_SHOULD_SKIP_THIS` macro by @ericniebler in #2978
- Fix proclaim_copyable_arguments for lambdas by @bernhardmgruber in #2833
- Forward declare half types in `cuda::ptx` by @ahendriksen in #2981
- Fix tuning benchmark for `cub::DeviceTransform` by @bernhardmgruber in #2970
- fix old gcc version check by @davebayer in #2989
- Fix a typo in thrust/binary_search.h (#2980) by @hzhangxyz in #2992
- Enable assertions for CCCL users in CMake Debug builds by @bernhardmgruber in #2986
- Fix CMake warning for FindPythonInterp by @bernhardmgruber in #2982
- Further clarify host compiler support by @bernhardmgruber in #2991
- Drop _CCCL_ELSE_IF_CONSTEXPR by @bernhardmgruber in #2966
- implement C++26 `std::ignore` by @davebayer in #2922
- make the upper limit on TMP loop unrolling configurable by @ericniebler in #2971
- Update docs with recent features by @davebayer in #2994
- Restore thrust single config options. by @alliepiper in #2977
- Document tuning DB comparison scripts by @bernhardmgruber in #2968
- Build CUB and Thrust tests with assertions by @bernhardmgruber in #2987
- Issue a deprecation warning when compiling with Visual Studio 2017 by @bernhardmgruber in #2990
- Guard forward declarations of extended FP types by @bernhardmgruber in #2998
- [STF] Create dot sections to possibly collapse nodes when displaying large DOT graphs by @caugonnet in #2988
- Remove redundant pre c++11 checks by @davebayer in #2999
- Avoid checking unsigned values for negativity by @bernhardmgruber in #2997
- Rename thrust example `version.cu` to `print_version.cu` by @j3soon in #3002
- don't bother sync-ing a stream with itself by @ericniebler in #3007
- Backport `is_scoped_enum` by @davebayer in #3003
- Put `monostate` in `<utility>` by @davebayer in #3000
- backport std integer comparison functions to C++11 by @davebayer in #2805
- backport `forward_like` by @davebayer in #2995
- Document how to profile benchmarks by @bernhardmgruber in #3015
- Update Thrust examples ReadMe by @bernhardmgruber in #3004
- Deprecate public CUB/Thrust deprecation macros by @bernhardmgruber in #3010
- Fix libcudacxx example by @j3soon in #3013
- Refactor BlockLoad test by @bernhardmgruber in #3005
- Fix NVBench profile flags in docs by @bernhardmgruber in #3016
- Update RAPIDS to 25.02. by @bdice in #2967
- Tweak tuning database plot and comparison scripts by @bernhardmgruber in #2883
- Allow passing debug flags to NVRTC in libcudacxx tests by @wmaxey in #3020
- Add missing template parameter to BlockRadixRank example. by @esoha-nvidia in #2736
- Fix value range overflows in tests by @Artem-B in #3022
- Avoid relative includes that have slipped in by @miscco in #3042
- Fix word count example in Thrust by @caugonnet in #3014
- revise `<cuda/std/version>` by @davebayer in #3043
- Replace thrust::swap by cuda::std::swap by @bernhardmgruber in #2985
- add a converting constructor to `cudax::stream_ref` from `cuda::stream_ref` by @ericniebler in #3052
- [CUDAX] Remove launch overloads taking dimensions and make everything except make_hierarchy return kernel_config by @pciolkosz in #2979
- move sender support library to `__async/sender/` by @ericniebler in #3056
- [cuda.cooperative] Add block.load and block.store. by @brycelelbach in #2693
- Backport `unreachable` by @davebayer in #3018
- Define the destructor of `kernel_arg` by @miscco in #3060
- Add missing `__syncthreads()` to test by @miscco in #3061
- Add assertions in the mdspan accessors that we are not out of bounds by @miscco in #3055
- Do not use cudaGetErrorString on GPU. by @Artem-B in #3059
- Reduce number of per-PR CI jobs. by @alliepiper in #2931
- Rework CUDA compiler checks by @davebayer in #3057
- implement C++23 `invoke_r` by @davebayer in #3041
- Consider NV_TARGET_SM_INTEGER_LIST for ChainedPolicy pruning by @bernhardmgruber in #2772
- Add environment to encapsulate information needed for `cudax::vector` by @miscco in #2775
- We should not call `cudaGetErrorString` on device by @miscco in #3062
- Introduce cuda.cooperative overloads not requiring temporary storage by @gevtushenko in #2528
- `basic_any`: a utility for defining type-erasing wrappers in terms of an interface description by @ericniebler in #2633
- Fix Thrust/CUB tests by adding empty base opt-ins to iterator classes by @wmaxey in #3066
- Don't use exact comparison for FP values. by @Artem-B in #2742
- Use consistent spelling for aliasing select benchmarks by @bernhardmgruber in #3073
- Improve handling of language level features by @miscco in #3069
- Only tune streaming DeviceSelect versions for 64-bit offsets by @bernhardmgruber in #3072
- Disable nvrtc workaround by @miscco in #1116
- fix assorted problems in cudax memory resource equality fns by @ericniebler in #3079
- Support fancy iterators in cuda.parallel by @rwgk in #2788
- fix feature test for operator<=> by @ericniebler in #3075
- Mark test as potentially passing by @miscco in #3078
- Avoid padding warning with MSVC by @miscco in #3077
- Improve CUB tuning documentation by @bernhardmgruber in #3058
- Optimise tuning compile-time by @bernhardmgruber in #3074
- Use consistent spelling for `CounterT` in histogram benchmarks by @bernhardmgruber in #3089
- [Improvement] Don't require specifying output type when constructing TransformIterator (cuda.parallel) by @shwina in #3083
- simplify the definition of the `basic_any` class template by @ericniebler in #3085
- Use only signed offset types in CUB benchmarks by @bernhardmgruber in #3087
- Improve readability of DispatchSelectIf parameterization by @bernhardmgruber in #3092
- [cudax] Simplify implementation of device attributes by @davebayer in #3084
- suppress `-Werror=empty-body` in `char_traits` implementation by @ericniebler in #3098
- help older clang and gcc to disambiguate `basic_any<__ireference<I>>` and `basic_any<I&>` bases by @ericniebler in #3102
- [PERF] cuda.parallel: Cache intermediate results to improve performance of `cudax.reduce_into` by @shwina in #3001
- [Improvement] cuda.parallel: Don't require value_type when constructing iterators by @shwina in #3105
- Fix zip and permutation iterator EBO on MSVC by @wmaxey in #3106
- Avoid signed unsigned warnings in `annotated_ptr` test by @miscco in #3076
- Changes `DispatchScan[ByKey]` documentation to advise using unsigned offset types by @elstehle in #3111
- [STF] reduce access mode by @caugonnet in #2830
- add support for comparing type-erased wrappers to non-type-erased objects by @ericniebler in #3100
- backport `byte` by @davebayer in #3091
- Add bound checks for each dimension of `mdspan` by @fbusato in #3065
- Move some CUB tunings to dedicated headers by @bernhardmgruber in #3096
- [CUDAX] Add combine API to kernel_config and allow adding default configuration to kernel functors by @pciolkosz in #3082
- Extend tuning guide by @bernhardmgruber in #3117
- Densen sm90 policy by @gonidelis in #3121
- Fix a typo in the documentation of cub::DeviceReduce::Reduce by @caugonnet in #3123
- Cleanup select if tuning by @bernhardmgruber in #3120
- Modularize `<cuda/std/cstddef>` by @davebayer in #3119
- Use programmatic dependent launch in CUB merge sort by @bernhardmgruber in #3114
- Refactor selecting default tuning for select_if by @bernhardmgruber in #3124
- Refactor SM90 radix_sort tuning by @bernhardmgruber in #3125
- [STF] Improved sparse CG example and rename scalar to scalar_view by @caugonnet in #3112
- [CUDAX] Fix the other copy of vector_add after migration to use configs in launch by @pciolkosz in #3129
- Refactor cub histogram tuning by @bernhardmgruber in #3128
- Refactor RLE tuning by @bernhardmgruber in #3127
- Make PDL available with CTK 12.0 by @bernhardmgruber in #3136
- Refactor reduce_by_key tuning by @bernhardmgruber in #3137
- Refactor scan tunings by @bernhardmgruber in #3138
- Fix analyze.py bug by @gonidelis in #3067
- Refactor scan_by_key tuning by @bernhardmgruber in #3139
- Refactor three_way_partition tuning by @bernhardmgruber in #3140
- Clarify passing ValueT to scan_by_key tuning by @bernhardmgruber in #3143
- Move remaining CUB policy hubs to tuning headers by @bernhardmgruber in #3141
- [Internal Cleanup] pre-commit ruff (excluding docs/tools, libcudacxx/test) by @rwgk in #3110
- Add Python codeowners by @jrhemstad in #3150
- make `basic_any` compile for device by stubbing out the virtual tables by @ericniebler in #3109
- Refactoring unique by key by @gonidelis in #3145
- Add missing header in bench scan exclusive base header by @gonidelis in #3157
- Use synchronize_optional for device-to-device copy in thrust::copy() by @davidwendt in #3149
- [Internal Cleanup] pre-commit ruff libcudacxx/tests by @rwgk in #3152
- Clarify unknown tuning axis are ignored by @bernhardmgruber in #3156
- address portability issue in `basic_any` with older nvcc versions by @ericniebler in #3160
- Add limited H100 testing for CUB by @jrhemstad in #3151
- Unify policy hub handling and update documentation by @bernhardmgruber in #3142
- make the `_CCCL_REQUIRES_EXPR` macro more robust by @ericniebler in #3164
- [Refactor] cuda.parallel: Simplify TransformIterator implementation and refactor iterators to derive from a common base by @shwina in #3118
- the streams created by `cudax::stream` should not synchronize with the null stream by @ericniebler in #3167
- [STF] Implement CUDASTF_DOT_TIMING for the host_launch construct by @caugonnet in #3170
- Add support for sm_101 and sm_101a to NV_TARGET by @bernhardmgruber in #3166
- implement C++23 `byteswap` by @davebayer in #3093
- Unifies large problem test helper infrastructure by @elstehle in #3171
- Deprecate C++11 and C++14 for libcu++ by @miscco in #3173
- Implement `abs` and `div` from `cstdlib` by @davebayer in #3153
- Fix missing radix sort policies by @bernhardmgruber in #3174
- Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators by @elstehle in #3148
- Extend tuning documentation by @bernhardmgruber in #3179
- Add codespell pre-commit hook, fix typos in CCCL by @bdice in #3168
- Fix parameter space for TUNE_LOAD in scan benchmark by @bernhardmgruber in #3176
- Fix various old compiler version checks by @davebayer in #3178
- Implement ADL-proof `std::projected` from C++26 by @davebayer in #3175
- Fix pre-commit config for codespell and remaining typos by @shwina in #3182
- Massive cleanup of our config by @miscco in #3155
- Fix UB in atomics with automatic storage by @wmaxey in #2586
- Refactor the source code layout for `cuda.parallel` by @shwina in #3177
- new type-erased memory resources by @ericniebler in #2824
- rename `_LIBCUDACXX_DECLSPEC_EMPTY_BASES` to `_CCCL_DECLSPEC_EMPTY_BASES` by @ericniebler in #3186
- Document address stability of `thrust::transform` by @bernhardmgruber in #3181
- turn off cuda version check for clangd by @ericniebler in #3194
- [STF] jacobi example based on parallel_for by @caugonnet in #3187
- Fixes pre-CTK 11.5 diag suppression issues by @elstehle in #3189
- Prefer c2h::type_name over c2h::demangle by @bernhardmgruber in #3195
- Fix memcpy_async* tests by @ahendriksen in #3197
- Add type annotations and mypy checks for `cuda.parallel` by @shwina in #3180
- Fix rendering of cuda.parallel docs by @shwina in #3192
- Enable PDL for DeviceMergeSortBlockSortKernel by @bernhardmgruber in #3199
- Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` by @elstehle in #2647
- Fixes for Python 3.7 docs environment by @shwina in #3206
- Adds support for large number of items to `DeviceTransform` by @elstehle in #3172
- cp_async_bulk: Fix test by @ahendriksen in #3198
- cudax fixes for msvc 14.41 by @ericniebler in #3200
- avoid instantiating class templates in `is_same` implementation when possible by @ericniebler in #3203
- Fix: make launchers a CUB detail; make kernel source functions hidden. by @griwes in #3209
- help the ranges concepts recognize standard contiguous iterators in c++14/17 by @ericniebler in #3202
- unify macros and cmake options that control the suppression of deprecation warnings by @ericniebler in #3220
- Fix thread-reduce performance regression by @fbusato in #3225
- cuda.parallel: In-memory caching of `cuda.parallel` build objects by @shwina in #3216
- clean up the `cuda::std::span` implementation with minimal c++14 range support by @ericniebler in #3211
- use generalized concepts portability macros to simplify the `range` concept by @ericniebler in #3217
- Use Ruff to sort imports by @shwina in #3230
- Fix scan / sm90 perf regression by @gevtushenko in #3236
- [STF] Logical token by @caugonnet in #3196
- Fix ReduceByKey tuning by @gevtushenko in #3240
- Fix RLE tuning by @gevtushenko in #3239
- cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) by @shwina in #3233
- Backport to 2.8: Make CUB NVRTC commandline arguments come from a cmake template (#3292) by @bernhardmgruber in #3322
- Backport to 2.8: Deprecate GridBarrier and GridBarrierLifetime (#3258) by @bernhardmgruber in #3288
- Backport to 2.8: Deprecate cub::Swap (#3333) by @bernhardmgruber in #3350
- Backport to 2.8: Deprecate Thrust's cpp_compatibility.h macros (#3299) by @bernhardmgruber in #3321
- Backport to 2.8: Deprecate cub::IterateThreadStore (#3337) by @bernhardmgruber in #3351
- Backport to 2.8: Deprecate thrust::null_type (#3367) by @bernhardmgruber in #3373
- Backport to 2.8: Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342) by @bernhardmgruber in #3389
- Backport to 2.8: Deprecate thrust::optional (#3307) by @bernhardmgruber in #3393
- Backport to 2.8: Deprecate thrust::numeric_limits (#3366) by @bernhardmgruber in #3392
- Backport to 2.8: Redefine and deprecate thrust::remove_cvref (#3394) by @bernhardmgruber in #3420
- Backport to 2.8: Replace and deprecate thrust::cuda_cub::terminate (#3421) by @bernhardmgruber in #3425
- [BACKPORT]: Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419) by @miscco in #3447
- Backport to 2.8: Deprecate thrust::async (#3324) by @bernhardmgruber in #3388
- [BACKPORT]: Moves agents to detail::<algorithm_name> namespace by @elstehle in #3454
- Backport to 2.8: Deprecate a few CUB macros (#3456) by @bernhardmgruber in #3463
- [BACKPORT]: Fix assert definition for NVHPC due to constexpr issues (#3418) by @miscco in #3448
- Backport to 2.8: Deprecate cub::DeviceSpmv (#3320) by @bernhardmgruber in #3374
- Backport to 2.8: some FP8 support by @bernhardmgruber in #3479
- Backport to 2.8: Deprecate block/warp algo specializations (#3455) by @bernhardmgruber in #3481
- Backport to 2.8: Refactor `limits` and `climits` (#3221) by @bernhardmgruber in #3488
- Backport to 2.8: Fix typo in limits (#3491) by @bernhardmgruber in #3498
- Backport to 2.8: Update upload-pages-artifact to v3 (#3423) by @bernhardmgruber in #3513
- Backport to 2.8: Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361) by @bernhardmgruber in #3490
- Backport PRs #3201, #3523, #3547, #3580 to the 2.8.x branch. by @rwgk in #3536
- [Backport 2.8] work around msvc bug exposed by `__type_index` in `type_list.h` (#3487) by @wmaxey in #3537
- [Backport] #3572 to the 2.8.x branch. by @miscco in #3605
- Backport to 2.8: Specialize `cuda::std::numeric_limits` for FP8 types (#3478) by @bernhardmgruber in #3492
- Backport to 2.8: Deprecate thrust universal iterator categories (#3461) by @bernhardmgruber in #3471
- Backport to 2.8: Deprecate and replace thrust::cuda_cub iterators (#3422) by @bernhardmgruber in #3510
- Backport to 2.8: Deprecate thrust macros from type_deduction.h (#3501) by @bernhardmgruber in #3511
- Backport to 2.8: Deprecate macros from cuda/detail/core/util.h (#3504) by @bernhardmgruber in #3520
- [BACKPORT]: Try to always include the definition of barrier_native_handle when needed (#3556) by @miscco in #3569
- Backport to 2.8: Deprecates tuning policy hubs by @elstehle in #3531
- [Backport 2.8] Add extended data type macro identification by @fbusato in #3586
- Backport to 2.8: Deprecate thrust logical meta functions (#3538) by @bernhardmgruber in #3567
- Backport to 2.8: Refactor (#3561) by @bernhardmgruber in #3566
- Backport to 2.8: Tune cub::DeviceTransform for Blackwell (#3543) by @bernhardmgruber in #3565
- Backport to 2.8: Deprecate and replace `CUB_IS_INT128_ENABLED` (#3427) by @bernhardmgruber in #3629
- Backport to 2.8: Deprecate CUB iterators existing in Thrust (#3304) by @bernhardmgruber in #3534
- Backport to 2.8: Deprecate thrust event, future and more (#3457) by @bernhardmgruber in #3512
- Backport to 2.8: PTX support for Blackwell by @bernhardmgruber in #3624
- Backport to 2.8: Support FP16 traits on CTK 12.0 (#3535) by @bernhardmgruber in #3625
- [Backport 2.8] Deprecate `AgentSegmentFixupPolicy` by @fbusato in #3638
- Backport to 2.8: PTX: fix cp.async.bulk.tensor and mbarrier.arrive (#3628) by @bernhardmgruber in #3630
- Backport to 2.8: Suppress execution checks for vocabulary types (#3578) by @miscco in #3599
- [BACKPORT]: Try and get rapids green (#3503) by @miscco in #3598
- Backport to 2.8: Internalize triple_chevron (#3648) by @bernhardmgruber in #3650
- [BACKPORT]: Ensure that headers in `<cuda/*>` can be built with a C++ only compiler (#3472) by @miscco in #3651
- Backport to 2.8: `__builtin_isfinite` is only available above nvrtc 12.2 by @leofang in #3653
- [Backport 2.8.x] Backport #3575 deprecating old ABIs in libcudacxx by @wmaxey in #3660
- [Backport 2.8.x] Backport [nv/target] Add sm_120 macros. (#3550) by @wmaxey in #3661
- Backport to 2.8: Add b200 policies for device.select.if,flagged,unique (#3545) by @bernhardmgruber in #3667
- Backport to 2.8: Add b200 tunings for radix_sort.pairs (#3626) by @bernhardmgruber in #3668
- [Backport branch/2.8.x] PTX: mbarrier.{test,try}_wait: Fix return value by @github-actions in #3672
- Backport to 2.8: Add b200 tunings for radix_sort.keys (#3611) by @bernhardmgruber in #3655
- [Backport branch/2.8.x] Fix issues with nvrtc compilation by @github-actions in #3674
- [Backport branch/2.8.x] Add b200 policies for cub.select.unique_by_key by @github-actions in #3673
- [Backport branch/2.8.x] Deprecate cub::FpLimits in favor of cuda::std::numeric_limits by @github-actions in #3658
- Backport to 2.8: Deprecate `cub::AliasTemporaries` (#3679) and `cub::PolicyWrapper` (#3681) by @bernhardmgruber in #3690
- [Backport branch/2.8.x] Internalize cub::KernelConfig by @github-actions in #3688
- Backport to 2.8: Fix transform_iterator (#3652) and Deprecate thrust::identity (#3649) by @bernhardmgruber in #3693
- Backport to 2.8: Add b200 policies for cub.device.run_length_encode.encode,non_trivialruns (#3546) by @bernhardmgruber in #3704
- [BACKPORT] Remove cugraph-ops from RAPIDS 25.04 builds. (#3675) by @miscco in #3696
- Backport to 2.8: Make thrust iterators work with NVRTC (#3676) and replace CUB iterators by Thrust ones (#3480) by @bernhardmgruber in #3697
- [Backport branch/2.8.x] Deprecate `cub::RegBoundScaling` and `cub::MemBoundScaling` by @github-actions in #3706
- [backport 2.8] Deprecate and replace `Int2Type` by @fbusato in #3705
- [Backport branch/2.8.x] Add b200 policies for partition.three_way by @github-actions in #3710
- Backport to 2.8: Deprecate cub::Trait::CATEGORY|PRIMITIVE|NULL_TYPE (#3689) by @bernhardmgruber in #3703
- [Backport branch/2.8.x] Add b200 tunings for scan.exclusive.by_key by @github-actions in #3719
- Backport to 2.8: B200 reduce.by_key tunings by @bernhardmgruber in #3726
- Backport to 2.8: B200 tunings for histogram by @bernhardmgruber in #3728
- Backport to 2.8: B200 reduce tunings by @bernhardmgruber in #3735
- Backport to 2.8: Add b200 policies for cub.device.partition.flagged,if (#3617) by @bernhardmgruber in #3736
- Backport to 2.8: Add b200 tunings for scan.exclusive.sum (#3559) by @bernhardmgruber in #3738
- [Backport branch/2.8.x] fix ::cuda::discard_memory by @github-actions in #3737
- [Backport branch/2.8.x] Fix cub trait deprecations by @github-actions in #3744
- [Backport branch/2.8.x] [Automation] Add release workflow for tagging and testing new RCs by @github-actions in #3754
- Suppress deprecations on logical meta functions by @bernhardmgruber in #3795
- Revert back to cub::Traits::CATEGORY|PRIMITIVE by @bernhardmgruber in #3866
- [2.8.x] Disable `[[no_unique_address]]` for MSVC (#3757) by @miscco in #3869
- [Backport branch/2.8.x] do not try to use clang-19's support for c++26 pack indexing by @github-actions in #3903
New Contributors
- @Artem-B made their first contribution in #2420
- @Kh4ster made their first contribution in #2470
- @andrewcorrigan made their first contribution in #2482
- @jjacobelli made their first contribution in #2604
- @andralex made their first contribution in #2622
- @Jacobfaib made their first contribution in #2681
- @karthikeyann made their first contribution in #2769
- @davidwendt made their first contribution in #2857
- @hzhangxyz made their first contribution in #2992
- @j3soon made their first contribution in #3002
- @esoha-nvidia made their first contribution in #2736
Full Changelog: v2.7.0...v2.8.0