NVIDIA/cccl v2.8.0 on GitHub

What's Changed

Adds benchmarks for DeviceSelect::Unique by @elstehle in #2359
CUB - Enable DPX Reduction by @fbusato in #2286
[CUDAX] add a small c++17 implementation of std::execution (aka P2300) by @ericniebler in #2301
Add thrust::transform_inclusive_scan with init value by @gonidelis in #2326
Widen histogram agent constructor to more types by @bernhardmgruber in #2380
Use a constant for the amount of static SMEM by @bernhardmgruber in #2374
Add cub::DeviceTransform by @bernhardmgruber in #2086
Update toolkit to CTK 12.6 by @miscco in #2348
implement make_integer_sequence in terms of intrinsics whenever possible by @ericniebler in #2384
Implement cuda::mr::cuda_async_memory_resource by @miscco in #1637
Drop implementation of thrust::pair and thrust::tuple by @miscco in #2395
Pull out _LIBCUDACXX_UNREACHABLE into its own file by @miscco in #2399
Share common compiler flags in new CCCL-level targets. by @alliepiper in #2386
conditionally include <crt/host_defines.h> from __cccl/execution_space.h header by @ericniebler in #2406
add some simple utilities for manipulating lists of types by @ericniebler in #2370
Drop Thrust's diagnostic suppression warnings by @miscco in #2392
[PoC]: Implement cuda::experimental::uninitialized_async_buffer by @miscco in #1854
Fix thrust package to work with newer FindOpenMP.cmake. by @alliepiper in #2421
Introduce cccl_configure_target cmake function. by @alliepiper in #2388
Fix sccache errors in RAPIDS builds by @trxcllnt in #2417
Replace CUDA C++ Core Libraries with CUDA Core Compute Libraries (only in README.md). by @rwgk in #2424
Minor cleanup with cuda/atomic by @miscco in #2418
uninitialized_buffer::get_resource returns a ref to an any_resource that can be copied by @ericniebler in #2431
Refactor cuda::ceil_div to take two different types by @miscco in #2376
Reduce PR testing matrix. by @alliepiper in #2436
Implement cudax::shared_resource by @miscco in #2398
Increase the libcu++ timeout by @miscco in #2435
Move c/include/cccl/*.h files to c/include/cccl/c/*.h by @rwgk in #2428
Make any_resource emplacable by @miscco in #2425
Fix issues with __host__ and __device__ definitions by @miscco in #2413
Make bit_cast play nice with extended floating point types by @miscco in #2434
Do not include our own string.h file by @miscco in #2444
Move nightly time by @bdice in #2437
Remove a ton of lines in thrust tests by @gonidelis in #2356
[CUDAX] Add placeholder green context type and logical device that can hold both a green ctx and a device by @pciolkosz in #2446
Fix typo in CCCLBuildCompilerTargets.cmake by @alliepiper in #2453
Drop superfluous compile definition from thrust tests by @miscco in #2450
Consolidate packages and install rules by @alliepiper in #2456
Prune CUB's ChainedPolicy by CUDA_ARCH_LIST by @bernhardmgruber in #2154
fixes merge conflict for policy pruning by @elstehle in #2466
Add CCCL_ENABLE_WERROR flag. by @alliepiper in #2463
Add CUB tests for segmented sort/radix sort with 64-bit num. items and segments by @fbusato in #2254
Propagate compiler flags down to libcu++ LIT tests by @Artem-B in #2420
Drop remaining uses of _LIBCUDACXX_COMPILER_* by @miscco in #2467
Avoid C++17 extension in c++11 tests by @miscco in #2469
Add span to example and templated block size by @Kh4ster in #2470
Drop Objective C++ support by @miscco in #2468
removes superfluous template keyword in call to Dereference by @andrewcorrigan in #2482
Improve build times in several heavyweight libcudacxx tests. by @wmaxey in #2478
Drop __availability header by @miscco in #2484
Replace a few more instances of CUDA C++ Core Libraries with CUDA Core Compute Libraries. by @rwgk in #2447
Fix common_type specialization for extended floating point types by @miscco in #2483
Implement some CUDA API calls for async_memory_pool by @miscco in #2455
Move cudax example project to CCCL project examples. by @alliepiper in #2462
Disable system header for narrowing conversion check by @miscco in #2465
Require resources to always provide at least one execution space property by @miscco in #2489
Rework builtin handling by @miscco in #2461
Disable execution checks for std::equal by @miscco in #2491
replace _CCCL_ALWAYS_INLINE with _CCCL_FORCEINLINE by @ericniebler in #2439
Drop 2 relative includes that snuck in by @miscco in #2492
re-express the cudax::__tupl::__apply member to make nvc++ happy by @ericniebler in #2493
Drop badly named _One_of concept by @miscco in #2490
Unify assert handling in cccl by @miscco in #2382
Reduce scope of Thrust linkage in cudax. by @alliepiper in #2496
Centralize CPM logic. by @alliepiper in #2495
Fix typo in presets. by @alliepiper in #2497
Refactor away per-project TOPLEVEL flags. by @alliepiper in #2498
[FEA]: Validate cuda.parallel type matching in build and execution by @rwgk in #2429
avoid gcc optimizer bug by not force inlining part of thrust::transform by @ericniebler in #2509
Cleanup and modularize <cuda/std/barrier> by @miscco in #2443
Consolidate header testing infra. by @alliepiper in #2460
Add ForEachN from CUB to cccl/c. by @wmaxey in #2378
Adds support for large number of items in DeviceSelect and DevicePartition by @elstehle in #2400
Adds support for large number of items to DeviceScan::*ByKey family of algorithms by @elstehle in #2477
Integrate c/parallel with CCCL build system and CI. by @alliepiper in #2514
Create a command list utility for nvrtc/jitlink steps. by @wmaxey in #2511
Fix the example project which the documentation refers to by @caugonnet in #2531
Enable tests/headertests for c/parallel in all-dev presets. by @alliepiper in #2566
Rename cudax test targets to match CCCL conventions. by @alliepiper in #2568
Update project list in issue template by @alliepiper in #2532
Disable compiler extensions on CCCL targets. by @alliepiper in #2559
Fixes cub::DeviceMemcpy::Batched to be able to copy from const source pointers by @elstehle in #2573
Fix documentation error in ci/build_common.sh for -arch by @caugonnet in #2574
gcc-14 gained the ability to mangle noexcept expressions by @ericniebler in #2565
Miscellaneous simple fixes by @rwgk in #2575
Avoid including yvals.h when the compiler is not MSVC. by @wmaxey in #2545
Fix popc.h when architecture is not x86 on MSVC. by @wmaxey in #2524
test for exceptions support on msvc with the _CPPUNWIND macro by @ericniebler in #2576
fix the forwarding of the receiver in the just_from algorithm by @ericniebler in #2569
Block type pack indexing on NVCC by @wmaxey in #2563
Cleanup the semaphore headers by @miscco in #2441
Add _CCCL_GRID_CONSTANT macro by @fbusato in #2530
Add _CCCL_RESTRICT macro by @fbusato in #2529
Try to use the same redefinition of __assert_fail as pytorch has by @miscco in #2577
Fix miscellaneous bugs in cub/iterator documentation. by @rwgk in #2580
Expose parts of <cuda/std/memory> by @fbusato in #2502
add a config macro for testing support for inline variables by @ericniebler in #2581
add dialect macros _CCCL_NO_RTTI and _CCCL_NO_TYPEID by @ericniebler in #2578
fix misspelling in the _CCCL_NO_VARIABLE_TEMPLATES macro by @ericniebler in #2584
Add atomic_ref support for 8 and 16b types. by @wmaxey in #2255
add _LIBCUDACXX_REQUIRES_EXPR to the concepts emulation macros by @ericniebler in #2564
Ensure CuPy arrays can be used with cuda.parallel too by @leofang in #2335
assert that cuda::std::declval is noexcept by @ericniebler in #2588
Revert accidental force push to main. by @wmaxey in #2596
add __is_callable_v variable template when possible by @ericniebler in #2598
Cleanup threading support by @miscco in #2507
CCCL_TOPLEVEL_PROJECT always needs to be defined by @robertmaynard in #2597
Strip prefix paths from cudax documentation by @caugonnet in #2603
examples/cudax/CMakeLists.txt should not be executable by @caugonnet in #2594
[CUDAX] Peer access control on async_memory_pool and async_memory_resource by @pciolkosz in #2587
Introduce _CCCL_PRAGMA to CCCL by @davebayer in #2610
Only enable CUDA language when needed. by @alliepiper in #2612
Modularize latch by @miscco in #2508
Unify kernel dispatch paths for device reduce between CUB and c.parallel. by @griwes in #2591
Integrate CUDASTF -> CudaX by @caugonnet in #2572
[STF] The cmake example for stf was not updated when moving to main branch by @caugonnet in #2618
Rework head_flags so that we do not rely on the tuple being unevaluated by @miscco in #2619
[CUDAX] size_bytes in buffer types by @pciolkosz in #2621
fix portability bug in libcu++'s implementation of char_traits by @ericniebler in #2623
[cccl/c] Unify some build boilerplate by @wmaxey in #2625
devcontainer: replace VAULT_HOST with AWS_ROLE_ARN by @jjacobelli in #2604
Add checks to unique_id by @andralex in #2622
Add cuda::get_device_address by @miscco in #2611
Do not pass integral constants to ptx by @miscco in #2229
Add nvhpc devcontainer to CI by @miscco in #1488
Use a default initialization for CUDA graph mem alloc nodes by @caugonnet in #2632
[CUDAX] Add get_name to device_ref by @pciolkosz in #2631
Add 12.5 devcontainer needed for nvhpc by @miscco in #2634
a substitute for std::type_info when the compiler doesn't support RTTI by @ericniebler in #2582
Check for missing inline on functions in public headers. by @alliepiper in #2570
fix linker errors about multiply defined symbols in STF by @ericniebler in #2641
Add installation presets and update README with install steps by @alliepiper in #2643
Fix annotated_ptr test failures. by @wmaxey in #2607
Issue a deprecation warning when compiling with ICC by @bernhardmgruber in #2076
Include all python libs in inspect_changes. by @alliepiper in #2648
Add reusable workflow for updating version in branch with a PR by @wmaxey in #2589
define _CCCL_NO_RTTI in device code; RTTI isn't available there by @ericniebler in #2639
Migrate C2H library to top-level library by @alliepiper in #2629
[CUDAX] Add can_peer_access_to API to device_ref and check both ways access in get_peers by @pciolkosz in #2642
Use _CCCL_ASSERT for stf by @miscco in #2645
un-templatize CUDASTF's callback_completion_kernel per @robertmaynard by @ericniebler in #2656
Implement C++20 <source_location> by @miscco in #2628
Disable [[no_unique_address]] for clang and mdspan by @miscco in #2646
[STF] Adapt timing_with_fences test to be more reliable by @caugonnet in #2658
Add prefetching kernel as new fallback for cub::DeviceTransform by @bernhardmgruber in #2396
Drop cub::DeviceTransform fallback to cub::DeviceFor by @bernhardmgruber in #2660
Ignore more files when detecting CI changes. by @alliepiper in #2654
Add thrust::universal_host_pinned_vector by @bernhardmgruber in #2653
add new type-list algorithms copy_if, remove_if, find_if, and unique by @ericniebler in #2644
abide by CCCL config macro naming conventions for _CCCL_PRETTY_FUNCTION and _CCCL_NO_BUILTIN_STRLEN by @ericniebler in #2640
[STF] Fix how we define multi-dimensional shapes in the documentation by @caugonnet in #2662
Automate creating a CCCL release from RC tags. by @wmaxey in #2657
Enable span to work with contiguous std containers in C++17 by @miscco in #2613
[Version] Update main to v2.8.0 by @github-actions in #2670
promote the cudax __async/config.cuh to be the config for all of cudax by @ericniebler in #2638
avoid using nvcc's __type_pack_element before 12.2 by @ericniebler in #2673
Update ninja_summary.py to support ninja log v6. by @alliepiper in #2663
Rename new CUB headers to follow conventions. by @alliepiper in #2675
consistent use of _CUDAX function attributes in the cudax __async/ directory by @ericniebler in #2676
[CUDAX] Add forwarding reference to functor accepting launch by @pciolkosz in #2677
[CUDAX] Add initial bits of copy_bytes and fill_bytes by @pciolkosz in #2608
suppress msvc warning "qualifier applied to function type" in is_function by @ericniebler in #2683
Disable ublkcp CUB transform kernel for NVHPC by @bernhardmgruber in #2664
Deprecate thrust::cuda_cub::identity by @bernhardmgruber in #2688
Remove an unused variable by @bernhardmgruber in #2690
Setup cudax examples. by @alliepiper in #2697
portability fixes for _CCCL_BUILTIN_PRETTY_FUNCTION and _CCCL_TYPEID by @ericniebler in #2695
address portability issues found while using the typelist/typeset utilities by @ericniebler in #2694
Make tests technically correct by initializing the barrier by @miscco in #2701
Fix invalid memory reads in test_device_batch_copy. by @alliepiper in #2698
revert config macros _CCCL_CUDACC_BELOW_XX_X to their original semantics by @ericniebler in #2700
This cleans up our function objects a bit by @miscco in #2702
Drop handling of 32bit Windows by @bernhardmgruber in #2689
Guard inclusion of cuda_runtime_api by using a cuda compiler by @miscco in #2704
Fix race condition in block_reduce_raking. by @alliepiper in #2699
Honor CCCL_ENABLE_WERROR for CUDA targets. by @alliepiper in #2705
Fix nvbench helper compilation for clang-18 by @bernhardmgruber in #2707
Default ctor of device_ptr and normal_iterator by @bernhardmgruber in #2708
Add cuda::minimum and cuda::maximum by @Jacobfaib in #2681
Various fixes to cub::DeviceTransform by @bernhardmgruber in #2709
Make thrust::transform use cub::DeviceTransform by @bernhardmgruber in #2389
Ensure that we only use the inline variable trait when it is actually available by @miscco in #2712
[CUDAX] Rename memory resource and memory pool from async to device by @pciolkosz in #2710
triple_chevron fix by @fbusato in #2720
Improve uninitialized_{async_}buffer API by @miscco in #2713
Fix merge conflict from renaming of async_memory_resource by @miscco in #2728
[STF] Improve DOT graph outputs by @caugonnet in #2703
Implement _CCCL_SUPPRESS_DEPRECATED_[PUSH|POP] for ICC and NVHPC by @bernhardmgruber in #2730
Clean up CUB thread operators by @bernhardmgruber in #2716
Deprecate/replace more of Thrust functional by @bernhardmgruber in #2105
Alias cuda::std::identity to __identity by @bernhardmgruber in #2733
Do not read uninitialized memory for OOB elements. by @alliepiper in #2739
Add option to conditionally build CUDASTF by @miscco in #2731
fix cuda::std::bit_width() return type by @fbusato in #2745
[STF] Option to disable kernel generation in CUDASTF by @caugonnet in #2723
fix static_extent() return type by @fbusato in #2751
make the empty parens after level constructors optional by @ericniebler in #2750
cudax: rename ustdex's __query member function to query by @ericniebler in #2757
Implement execution policies by @miscco in #2715
Document some transform iterator corner cases by @bernhardmgruber in #2740
Shorten the git commit message in the ci scripts by @miscco in #2760
Separate CUDA and C++ code in C2H by @bernhardmgruber in #2734
Make get_stream work with queries by @miscco in #2761
Allow thrust::identity to forward value category by @bernhardmgruber in #2732
Proclaim Thrust/CUB/libcu++ functor address stability by @bernhardmgruber in #2719
give declval an implementation that compiles 2x faster by @ericniebler in #2758
[CUDAX] Add modernized simpleP2P sample by @pciolkosz in #2696
s/get_delegatee_scheduler/get_delegation_scheduler/ by @ericniebler in #2766
remove duplicated __apply_cv type trait by @ericniebler in #2754
merge metaprogramming libs from libcudac++ and µstdex by @ericniebler in #2767
Doc fix scan by @karthikeyann in #2769
Remove obsolete ways to set iterator category in CUB by @bernhardmgruber in #2759
Run thrust::transform benchmarks with more elements by @bernhardmgruber in #2764
Increase libcu++ timeout by @miscco in #2774
[STF] Rename the redux access mode into relaxed by @caugonnet in #2776
Enable type trait aliases in all standard modes by @miscco in #2763
Optimize, Cleanup, and Expose CUB Thread-Level Reduction by @fbusato in #2756
Disable execution checks for tuple by @miscco in #2780
Avoid benchmarking first-time setup in Thrust algorithms by @bernhardmgruber in #2782
Improve listing benchmarks and text by @bernhardmgruber in #2778
Fix thrust partition docs typo by @gonidelis in #2791
Drop unused sanitizer hook by @miscco in #2793
use _CCCL_HAS_FEATURE instead of plain __has_feature everywhere by @davebayer in #2794
Avoid make_zip_iterator(make_tuple(...)) by @bernhardmgruber in #2796
implement _CCCL_HAS_INCLUDE by @davebayer in #2786
add __cpp_lib_mdspan feature-test macro by @fbusato in #2787
Remove redundant cmake from example. by @alliepiper in #2804
change __as_type_list so it doesn't cause the instantiation of its argument by @ericniebler in #2803
[CUDAX] Enable passing hierarchy levels directly into make_config by @pciolkosz in #2755
Fix cudacc/cluster detection macro in launch path of libcudacxx tests by @wmaxey in #2811
[STF] Replace CUDASTF_CODE_GENERATION by !CUDASTF_DISABLE_CODE_GENERATION by @caugonnet in #2797
Reduce P0 benchmark variations for merge_sort_pairs by @bernhardmgruber in #2798
Replace macros by lambdas in cub::DeviceTransform by @bernhardmgruber in #2817
Add nvrtc_sm_top_level::add_link_list() and use in c/parallel/src/reduce.cu by @rwgk in #2781
give completion_signatures a fast lookup cache by @ericniebler in #2812
implement new compiler checks for NVHPC by @davebayer in #2816
Unify [CCCL|CUB|THRUST]_ENABLE_BENCHMARKS by @bernhardmgruber in #2827
Remove traces of metal from CCCL by @bernhardmgruber in #2828
Move our CUDACC version checks towards the new version check design by @miscco in #2826
Extend CUB benchmarking documentation by @bernhardmgruber in #2831
Remove all warm-up runs from Thrust benchmarks by @bernhardmgruber in #2838
Utility scripts for benchmark database by @gevtushenko in #2847
[CUDAX] Add missing sm_61 traits by @pciolkosz in #2848
Move _CCCL_COMPILER_ICC to the new macro by @miscco in #2849
Fix wrong include in Thrust benchmark by @bernhardmgruber in #2854
Add missing include by @bernhardmgruber in #2855
Move _CCCL_COMPILER_GCC to the new macro by @davebayer in #2850
Add benchmarking and tuning presets by @bernhardmgruber in #2856
Fix race condition in block-RLD test harness. by @alliepiper in #2706
Add MatX build to CCCL CI by @alliepiper in #2682
Fix DeviceSegmentedSort NVTX range name by @davidwendt in #2857
Make discovery mechanism for cuda/_include directory compatible with pip install --editable by @rwgk in #2846
add missing DOXYGEN_* predefined macros when building the cudax docs by @ericniebler in #2858
correct the names of shared_resource's async allocate/deallocate members by @ericniebler in #2880
[Docs/PTX] Add device tensor map init example by @ahendriksen in #1983
Fix rst typos in benchmarking.html by @gonidelis in #2868
Include use of NVHPC in CUB/Thrust magic namespace by @bernhardmgruber in #2771
backport to_underlying by @davebayer in #2853
move _CCCL_COMPILER_CLANG to the new macro by @davebayer in #2859
Automate release branch creation by @wmaxey in #2685
Add thrust_create_target DISPATCH option. by @alliepiper in #2844
for_each_in_extent by @fbusato in #2518
Fix old gcc version check by @davebayer in #2904
Move implementation of _LIBCUDACXX_TEMPLATE to CCCL by @miscco in #2832
Try to work around issue with NVHPC in conjunction with older CTK versions by @miscco in #2889
Refactor nvbench helper less_t by @bernhardmgruber in #2905
add "interface" to _CCCL_PUSH_MACROS by @ericniebler in #2919
Replace inconsistent Doxygen macros with _CCCL_DOXYGEN_INVOKED by @ericniebler in #2921
implement C++26 std::span::at by @davebayer in #2924
move msvc compiler macros to new version by @davebayer in #2885
Reorganize PTX tests to match generator by @bernhardmgruber in #2930
Reorganize PTX docs to match generator by @bernhardmgruber in #2929
Improve build instructions for libcu++ by @miscco in #2881
Reorganize PTX headers to match generator by @bernhardmgruber in #2925
implement C++26 std::span's constructor from std::initializer_list by @davebayer in #2923
Add tuple protocol to cuda::std::complex from C++26 by @davebayer in #2882
Add missing qualifier for cuda namespace by @bernhardmgruber in #2940
Try to fix a clang warning by @bernhardmgruber in #2941
minor consistency improvements in concepts macros by @ericniebler in #2928
Drop some of the mdspan fold implementation by @miscco in #2949
[STF] Implement CUDASTF_DOT_TIMING for the ctx.cuda_kernel construct by @caugonnet in #2950
Avoid potential null dereference in annotated_ptr by @miscco in #2951
make compiler version comparison utility generic by @davebayer in #2952
Add SM100 descriptor to target by @miscco in #2954
Regenerate cuda::ptx headers/docs and run format by @bernhardmgruber in #2937
Regenerate cuda::ptx test by @bernhardmgruber in #2953
Do not include extended floating point headers if they are not needed by @miscco in #2956
[CUDAX] Add copy_bytes and fill_bytes overloads for mdspan by @pciolkosz in #2932
add a _CCCL_NO_CONCEPTS config macro by @ericniebler in #2945
remove definition of macro (_LIBCUDACXX_NO_RTTI) that is no longer used by @ericniebler in #2957
Avoid symbol clashes with libc++ by @miscco in #2955
Add more CUB transform benchmarks by @bernhardmgruber in #2906
Start reworking our math functions by @miscco in #2749
Drop memory resources in libcu++ by @miscco in #2860
std::dims by @fbusato in #2961
Fix merge conflict from moving resources up a namespace by @miscco in #2965
[CUDAX] Add a way to combine thread hierarchies by @pciolkosz in #2746
Require approval to run CI on draft PRs by @bdice in #2969
fix thread-reduce performance regression by @fbusato in #2944
add a __type_switch utility and use it the ptx generator by @ericniebler in #2946
replace use of old _CONCEPT_FRAGMENT macro in cudax by @ericniebler in #2973
remove vestigial uses of the old DOXYGEN_SHOULD_SKIP_THIS macro by @ericniebler in #2978
Fix proclaim_copyable_arguments for lambdas by @bernhardmgruber in #2833
Forward declare half types in cuda::ptx by @ahendriksen in #2981
Fix tuning benchmark for cub::DeviceTransform by @bernhardmgruber in #2970
fix old gcc version check by @davebayer in #2989
Fix a typo in thrust/binary_search.h (#2980) by @hzhangxyz in #2992
Enable assertions for CCCL users in CMake Debug builds by @bernhardmgruber in #2986
Fix CMake warning for FindPythonInterp by @bernhardmgruber in #2982
Further clarify host compiler support by @bernhardmgruber in #2991
Drop _CCCL_ELSE_IF_CONSTEXPR by @bernhardmgruber in #2966
implement C++26 std::ignore by @davebayer in #2922
make the upper limit on TMP loop unrolling configurable by @ericniebler in #2971
Update docs with recent features by @davebayer in #2994
Restore thrust single config options. by @alliepiper in #2977
Document tuning DB comparison scripts by @bernhardmgruber in #2968
Build CUB and Thrust tests with assertions by @bernhardmgruber in #2987
Issue a deprecation warning when compiling with Visual Studio 2017 by @bernhardmgruber in #2990
Guard forward declarations of extended FP types by @bernhardmgruber in #2998
[STF] Create dot sections to possibly collapse nodes when displaying large DOT graphs by @caugonnet in #2988
Remove redundant pre c++11 checks by @davebayer in #2999
Avoid checking unsigned values for negativity by @bernhardmgruber in #2997
Rename thrust example version.cu to print_version.cu by @j3soon in #3002
don't bother sync-ing a stream with itself by @ericniebler in #3007
Backport is_scoped_enum by @davebayer in #3003
Put monostate in <utility> by @davebayer in #3000
backport std integer comparison functions to C++11 by @davebayer in #2805
backport forward_like by @davebayer in #2995
Document how to profile benchmarks by @bernhardmgruber in #3015
Update Thrust examples ReadMe by @bernhardmgruber in #3004
Deprecate public CUB/Thrust deprecation macros by @bernhardmgruber in #3010
Fix libcudacxx example by @j3soon in #3013
Refactor BlockLoad test by @bernhardmgruber in #3005
Fix NVBench profile flags in docs by @bernhardmgruber in #3016
Update RAPIDS to 25.02. by @bdice in #2967
Tweak tuning database plot and comparison scripts by @bernhardmgruber in #2883
Allow passing debug flags to NVRTC in libcudacxx tests by @wmaxey in #3020
Add missing template parameter to BlockRadixRank example. by @esoha-nvidia in #2736
Fix value range overflows in tests by @Artem-B in #3022
Avoid relative includes that have slipped in by @miscco in #3042
Fix word count example in Thrust by @caugonnet in #3014
revise <cuda/std/version> by @davebayer in #3043
Replace thrust::swap by cuda::std::swap by @bernhardmgruber in #2985
add a converting constructor to cudax::stream_ref from cuda::stream_ref by @ericniebler in #3052
[CUDAX] Remove launch overloads taking dimensions and make everything except make_hierarchy return kernel_config by @pciolkosz in #2979
move sender support library to __async/sender/ by @ericniebler in #3056
[cuda.cooperative] Add block.load and block.store. by @brycelelbach in #2693
Backport unreachable by @davebayer in #3018
Define the destructor of kernel_arg by @miscco in #3060
Add missing __syncthreads() to test by @miscco in #3061
Add assertions in the mdspan accessors that we are not out of bounds by @miscco in #3055
Do not use cudaGetErrorString on GPU. by @Artem-B in #3059
Reduce number of per-PR CI jobs. by @alliepiper in #2931
Rework CUDA compiler checks by @davebayer in #3057
implement C++23 invoke_r by @davebayer in #3041
Consider NV_TARGET_SM_INTEGER_LIST for ChainedPolicy pruning by @bernhardmgruber in #2772
Add environment to encapsulate information needed for cudax::vector by @miscco in #2775
We should not call cudaGetErrorString on device by @miscco in #3062
Introduce cuda.cooperative overloads not requiring temporary storage by @gevtushenko in #2528
basic_any: a utility for defining type-erasing wrappers in terms of an interface description by @ericniebler in #2633
Fix Thrust/CUB tests by adding empty base opt-ins to iterator classes by @wmaxey in #3066
Don't use exact comparison for FP values. by @Artem-B in #2742
Use consistent spelling for aliasing select benchmarks by @bernhardmgruber in #3073
Improve handling of language level features by @miscco in #3069
Only tune streaming DeviceSelect versions for 64-bit offsets by @bernhardmgruber in #3072
Disable nvrtc workaround by @miscco in #1116
fix assorted problems in cudax memory resource equality fns by @ericniebler in #3079
Support fancy iterators in cuda.parallel by @rwgk in #2788
fix feature test for operator<=> by @ericniebler in #3075
Mark test as potentially passing by @miscco in #3078
Avoid padding warning with MSVC by @miscco in #3077
Improve CUB tuning documentation by @bernhardmgruber in #3058
Optimise tuning compile-time by @bernhardmgruber in #3074
Use consistent spelling for CounterT in histogram benchmarks by @bernhardmgruber in #3089
[Improvement] Don't require specifying output type when constructing TransformIterator (cuda.parallel) by @shwina in #3083
simplify the definition of the basic_any class template by @ericniebler in #3085
Use only signed offset types in CUB benchmarks by @bernhardmgruber in #3087
Improve readability of DispatchSelectIf parameterization by @bernhardmgruber in #3092
[cudax] Simplify implementation of device attributes by @davebayer in #3084
suppress -Werror=empty-body in char_traits implementation by @ericniebler in #3098
help older clang and gcc to disambiguate basic_any<__ireference<I>> and basic_any<I&> bases by @ericniebler in #3102
[PERF] cuda.parallel: Cache intermediate results to improve performance of cudax.reduce_into by @shwina in #3001
[Improvement] cuda.parallel: Don't require value_type when constructing iterators by @shwina in #3105
Fix zip and permutation iterator EBO on MSVC by @wmaxey in #3106
Avoid signed unsigned warnings in annotated_ptr test by @miscco in #3076
Changes DispatchScan[ByKey] documentation to advise using unsigned offset types by @elstehle in #3111
[STF] reduce access mode by @caugonnet in #2830
add support for comparing type-erased wrappers to non-type-erased objects by @ericniebler in #3100
backport byte by @davebayer in #3091
Add bound checks for each dimension of mdspan by @fbusato in #3065
Move some CUB tunings to dedicated headers by @bernhardmgruber in #3096
[CUDAX] Add combine API to kernel_config and allow adding default configuration to kernel functors by @pciolkosz in #3082
Extend tuning guide by @bernhardmgruber in #3117
Densen sm90 policy by @gonidelis in #3121
Fix a typo in the documentation of cub::DeviceReduce::Reduce by @caugonnet in #3123
Cleanup select if tuning by @bernhardmgruber in #3120
Modularize <cuda/std/cstddef> by @davebayer in #3119
Use programmatic dependent launch in CUB merge sort by @bernhardmgruber in #3114
Refactor selecting default tuning for select_if by @bernhardmgruber in #3124
Refactor SM90 radix_sort tuning by @bernhardmgruber in #3125
[STF] Improved sparse CG example and rename scalar to scalar_view by @caugonnet in #3112
[CUDAX] Fix the other copy of vector_add after migration to use configs in launch by @pciolkosz in #3129
Refactor cub histogram tuning by @bernhardmgruber in #3128
Refactor RLE tuning by @bernhardmgruber in #3127
Make PDL available with CTK 12.0 by @bernhardmgruber in #3136
Refactor reduce_by_key tuning by @bernhardmgruber in #3137
Refactor scan tunings by @bernhardmgruber in #3138
Fix analyze.py bug by @gonidelis in #3067
Refactor scan_by_key tuning by @bernhardmgruber in #3139
Refactor three_way_partition tuning by @bernhardmgruber in #3140
Clarify passing ValueT to scan_by_key tuning by @bernhardmgruber in #3143
Move remaining CUB policy hubs to tuning headers by @bernhardmgruber in #3141
[Internal Cleanup] pre-commit ruff (excluding docs/tools, libcudacxx/test) by @rwgk in #3110
Add Python codeowners by @jrhemstad in #3150
make basic_any compile for device by stubbing out the virtual tables by @ericniebler in #3109
Refactoring unique by key by @gonidelis in #3145
Add missing header in bench scan exclusive base header by @gonidelis in #3157
Use synchronize_optional for device-to-device copy in thrust::copy() by @davidwendt in #3149
[Internal Cleanup] pre-commit ruff libcudacxx/tests by @rwgk in #3152
Clarify unknown tuning axis are ignored by @bernhardmgruber in #3156
address portability issue in basic_any with older nvcc versions by @ericniebler in #3160
Add limited H100 testing for CUB by @jrhemstad in #3151
Unify policy hub handling and update documentation by @bernhardmgruber in #3142
make the _CCCL_REQUIRES_EXPR macro more robust by @ericniebler in #3164
[Refactor] cuda.parallel: Simplify TransformIterator implementation and refactor iterators to derive from a common base by @shwina in #3118
the streams created by cudax::stream should not synchronize with the null stream by @ericniebler in #3167
[STF] Implement CUDASTF_DOT_TIMING for the host_launch construct by @caugonnet in #3170
Add support for sm_101 and sm_101a to NV_TARGET by @bernhardmgruber in #3166
implement C++23 byteswap by @davebayer in #3093
Unifies large problem test helper infrastructure by @elstehle in #3171
Deprecate C++11 and C++14 for libcu++ by @miscco in #3173
Implement abs and div from cstdlib by @davebayer in #3153
Fix missing radix sort policies by @bernhardmgruber in #3174
Introduces new DeviceReduce::Arg{Min,Max} interface with two output iterators by @elstehle in #3148
Extend tuning documentation by @bernhardmgruber in #3179
Add codespell pre-commit hook, fix typos in CCCL by @bdice in #3168
Fix parameter space for TUNE_LOAD in scan benchmark by @bernhardmgruber in #3176
Fix various old compiler version checks by @davebayer in #3178
Implement ADL-proof std::projected from C++26 by @davebayer in #3175
Fix pre-commit config for codespell and remaining typos by @shwina in #3182
Massive cleanup of our config by @miscco in #3155
Fix UB in atomics with automatic storage by @wmaxey in #2586
Refactor the source code layout for cuda.parallel by @shwina in #3177
new type-erased memory resources by @ericniebler in #2824
rename _LIBCUDACXX_DECLSPEC_EMPTY_BASES to _CCCL_DECLSPEC_EMPTY_BASES by @ericniebler in #3186
Document address stability of thrust::transform by @bernhardmgruber in #3181
turn off cuda version check for clangd by @ericniebler in #3194
[STF] jacobi example based on parallel_for by @caugonnet in #3187
Fixes pre-CTK 11.5 diag suppression issues by @elstehle in #3189
Prefer c2h::type_name over c2h::demangle by @bernhardmgruber in #3195
Fix memcpy_async* tests by @ahendriksen in #3197
Add type annotations and mypy checks for cuda.parallel by @shwina in #3180
Fix rendering of cuda.parallel docs by @shwina in #3192
Enable PDL for DeviceMergeSortBlockSortKernel by @bernhardmgruber in #3199
Adds support for large num_items to DeviceReduce::{ArgMin,ArgMax} by @elstehle in #2647
Fixes for Python 3.7 docs environment by @shwina in #3206
Adds support for large number of items to DeviceTransform by @elstehle in #3172
cp_async_bulk: Fix test by @ahendriksen in #3198
cudax fixes for msvc 14.41 by @ericniebler in #3200
avoid instantiating class templates in is_same implementation when possible by @ericniebler in #3203
Fix: make launchers a CUB detail; make kernel source functions hidden. by @griwes in #3209
help the ranges concepts recognize standard contiguous iterators in c++14/17 by @ericniebler in #3202
unify macros and cmake options that control the suppression of deprecation warnings by @ericniebler in #3220
Fix thread-reduce performance regression by @fbusato in #3225
cuda.parallel: In-memory caching of cuda.parallel build objects by @shwina in #3216
clean up the cuda::std::span implementation with minimal c++14 range support by @ericniebler in #3211
use generalized concepts portability macros to simplify the range concept by @ericniebler in #3217
Use Ruff to sort imports by @shwina in #3230
Fix scan / sm90 perf regression by @gevtushenko in #3236
[STF] Logical token by @caugonnet in #3196
Fix ReduceByKey tuning by @gevtushenko in #3240
Fix RLE tuning by @gevtushenko in #3239
cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) by @shwina in #3233
Backport to 2.8: Make CUB NVRTC commandline arguments come from a cmake template (#3292) by @bernhardmgruber in #3322
Backport to 2.8: Deprecate GridBarrier and GridBarrierLifetime (#3258) by @bernhardmgruber in #3288
Backport to 2.8: Deprecate cub::Swap (#3333) by @bernhardmgruber in #3350
Backport to 2.8: Deprecate Thrust's cpp_compatibility.h macros (#3299) by @bernhardmgruber in #3321
Backport to 2.8: Deprecate cub::IterateThreadStore (#3337) by @bernhardmgruber in #3351
Backport to 2.8: Deprecate thrust::null_type (#3367) by @bernhardmgruber in #3373
Backport to 2.8: Review/Deprecate CUB util.ptx for CCCL 2.x (#3342) by @bernhardmgruber in #3389
Backport to 2.8: Deprecate thrust::optional (#3307) by @bernhardmgruber in #3393
Backport to 2.8: Deprecate thrust::numeric_limits (#3366) by @bernhardmgruber in #3392
Backport to 2.8: Redefine and deprecate thrust::remove_cvref (#3394) by @bernhardmgruber in #3420
Backport to 2.8: Replace and deprecate thrust::cuda_cub::terminate (#3421) by @bernhardmgruber in #3425
[BACKPORT]: Deprecate cub::{min, max} and replace internal uses with those from libcu++ (#3419) by @miscco in #3447
Backport to 2.8: Deprecate thrust::async (#3324) by @bernhardmgruber in #3388
[BACKPORT]: Moves agents to detail::<algorithm_name> namespace by @elstehle in #3454
Backport to 2.8: Deprecate a few CUB macros (#3456) by @bernhardmgruber in #3463
[BACKPORT]: Fix assert definition for NVHPC due to constexpr issues (#3418) by @miscco in #3448
Backport to 2.8: Deprecate cub::DeviceSpmv (#3320) by @bernhardmgruber in #3374
Backport to 2.8: some FP8 support by @bernhardmgruber in #3479
Backport to 2.8: Deprecate block/warp algo specializations (#3455) by @bernhardmgruber in #3481
Backport to 2.8: Refactor limits and climits (#3221) by @bernhardmgruber in #3488
Backport to 2.8: Fix typo in limits (#3491) by @bernhardmgruber in #3498
Backport to 2.8: Update upload-pages-artifact to v3 (#3423) by @bernhardmgruber in #3513
Backport to 2.8: Implement cuda::std::numeric_limits for __half and __nv_bfloat16 (#3361) by @bernhardmgruber in #3490
Backport PRs #3201, #3523, #3547, #3580 to the 2.8.x branch. by @rwgk in #3536
[Backport 2.8] work around msvc bug exposed by __type_index in type_list.h (#3487) by @wmaxey in #3537
[Backport] #3572 to the 2.8.x branch. by @miscco in #3605
Backport to 2.8: Specialize cuda::std::numeric_limits for FP8 types (#3478) by @bernhardmgruber in #3492
Backport to 2.8: Deprecate thrust universal iterator categories (#3461) by @bernhardmgruber in #3471
Backport to 2.8: Deprecate and replace thrust::cuda_cub iterators (#3422) by @bernhardmgruber in #3510
Backport to 2.8: Deprecate thrust macros from type_deduction.h (#3501) by @bernhardmgruber in #3511
Backport to 2.8: Deprecate macros from cuda/detail/core/util.h (#3504) by @bernhardmgruber in #3520
[BACKPORT]: Try to always include the definition of barrier_native_handle when needed (#3556) by @miscco in #3569
Backport to 2.8: Deprecates tuning policy hubs by @elstehle in #3531
[Backport 2.8] Add extended data type macro identification by @fbusato in #3586
Backport to 2.8: Deprecate thrust logical meta functions (#3538) by @bernhardmgruber in #3567
Backport to 2.8: Refactor (#3561) by @bernhardmgruber in #3566
Backport to 2.8: Tune cub::DeviceTransform for Blackwell (#3543) by @bernhardmgruber in #3565
Backport to 2.8: Deprecate and replace CUB_IS_INT128_ENABLED (#3427) by @bernhardmgruber in #3629
Backport to 2.8: Deprecate CUB iterators existing in Thrust (#3304) by @bernhardmgruber in #3534
Backport to 2.8: Deprecate thrust event, future and more (#3457) by @bernhardmgruber in #3512
Backport to 2.8: PTX support for Blackwell by @bernhardmgruber in #3624
Backport to 2.8: Support FP16 traits on CTK 12.0 (#3535) by @bernhardmgruber in #3625
[Backport 2.8] Deprecate AgentSegmentFixupPolicy by @fbusato in #3638
Backport to 2.8: PTX: fix cp.async.bulk.tensor and mbarrier.arrive (#3628) by @bernhardmgruber in #3630
Backport to 2.8: Suppress execution checks for vocabulary types (#3578) by @miscco in #3599
[BACKPORT]: Try and get rapids green (#3503) by @miscco in #3598
Backport to 2.8: Internalize triple_chevron (#3648) by @bernhardmgruber in #3650
[BACKPORT]: Ensure that headers in <cuda/*> can be built with a C++ only compiler (#3472) by @miscco in #3651
Backport to 2.8: __builtin_isfinite is only available above nvrtc 12.2 by @leofang in #3653
[Backport 2.8.x] Backport #3575 deprecating old ABIs in libcudacxx by @wmaxey in #3660
[Backport 2.8.x] Backport [nv/target] Add sm_120 macros. (#3550) by @wmaxey in #3661
Backport to 2.8: Add b200 policies for device.select.if,flagged,unique (#3545) by @bernhardmgruber in #3667
Backport to 2.8: Add b200 tunings for radix_sort.pairs (#3626) by @bernhardmgruber in #3668
[Backport branch/2.8.x] PTX: mbarrier.{test,try}_wait: Fix return value by @github-actions in #3672
Backport to 2.8: Add b200 tunings for radix_sort.keys (#3611) by @bernhardmgruber in #3655
[Backport branch/2.8.x] Fix issues with nvrtc compilation by @github-actions in #3674
[Backport branch/2.8.x] Add b200 policies for cub.select.unique_by_key by @github-actions in #3673
[Backport branch/2.8.x] Deprecate cub::FpLimits in favor of cuda::std::numeric_limits by @github-actions in #3658
Backport to 2.8: Deprecate cub::AliasTemporaries (#3679) and cub::PolicyWrapper (#3681) by @bernhardmgruber in #3690
[Backport branch/2.8.x] Internalize cub::KernelConfig by @github-actions in #3688
Backport to 2.8: Fix transform_iterator (#3652) and Deprecate thrust::identity (#3649) by @bernhardmgruber in #3693
Backport to 2.8: Add b200 policies for cub.device.run_length_encode.encode,non_trivialruns (#3546) by @bernhardmgruber in #3704
[BACKPORT] Remove cugraph-ops from RAPIDS 25.04 builds. (#3675) by @miscco in #3696
Backport to 2.8: Make thrust iterators work with NVRTC (#3676) and replace CUB iterators by Thrust ones (#3480) by @bernhardmgruber in #3697
[Backport branch/2.8.x] Deprecate cub::RegBoundScaling and cub::MemBoundScaling by @github-actions in #3706
[backport 2.8] Deprecate and replace Int2Type by @fbusato in #3705
[Backport branch/2.8.x] Add b200 policies for partition.three_way by @github-actions in #3710
Backport to 2.8: Deprecate cub::Traits::CATEGORY|PRIMITIVE|NULL_TYPE (#3689) by @bernhardmgruber in #3703
[Backport branch/2.8.x] Add b200 tunings for scan.exclusive.by_key by @github-actions in #3719
Backport to 2.8: B200 reduce.by_key tunings by @bernhardmgruber in #3726
Backport to 2.8: B200 tunings for histogram by @bernhardmgruber in #3728
Backport to 2.8: B200 reduce tunings by @bernhardmgruber in #3735
Backport to 2.8: Add b200 policies for cub.device.partition.flagged,if (#3617) by @bernhardmgruber in #3736
Backport to 2.8: Add b200 tunings for scan.exclusive.sum (#3559) by @bernhardmgruber in #3738
[Backport branch/2.8.x] fix ::cuda::discard_memory by @github-actions in #3737
[Backport branch/2.8.x] Fix cub trait deprecations by @github-actions in #3744
[Backport branch/2.8.x] [Automation] Add release workflow for tagging and testing new RCs by @github-actions in #3754
Suppress deprecations on logical meta functions by @bernhardmgruber in #3795
Revert back to cub::Traits::CATEGORY|PRIMITIVE by @bernhardmgruber in #3866
[2.8.x] Disable [[no_unique_address]] for MSVC (#3757) by @miscco in #3869
[Backport branch/2.8.x] do not try to use clang-19's support for c++26 pack indexing by @github-actions in #3903

New Contributors

@Artem-B made their first contribution in #2420
@Kh4ster made their first contribution in #2470
@andrewcorrigan made their first contribution in #2482
@jjacobelli made their first contribution in #2604
@andralex made their first contribution in #2622
@Jacobfaib made their first contribution in #2681
@karthikeyann made their first contribution in #2769
@davidwendt made their first contribution in #2857
@hzhangxyz made their first contribution in #2992
@j3soon made their first contribution in #3002
@esoha-nvidia made their first contribution in #2736

Full Changelog: v2.7.0...v2.8.0
