CCCL 3.0 Release
The 3.0 release of the CUDA Core Compute Libraries (CCCL) marks our first major version since unifying the Thrust, CUB, and libcudacxx libraries under a single repository. This release reflects over a year of work focused on cleanup, consolidation, and modernizing the codebase to support future growth.
While this release includes a number of breaking changes, many involve the consolidation of APIs, particularly in the `thrust::` and `cub::` namespaces, as well as cleanup of internal details that were never intended for public use. In many cases, redundant functionality from `thrust::` or `cub::` has been replaced with equivalent or improved abstractions from the `cuda::` or `cuda::std::` namespaces. Impact should be minimal for most users. For full details and recommended migration steps, please consult the CCCL 2.x to 3.0 Migration Guide.
Key Changes in CCCL 3.0
Requirements
- C++17 or newer is now required (support for C++11 and C++14 has been dropped #3255)
- CUDA Toolkit 12.0+ is now required (support for CTK 11.0+ has been dropped). For details on version compatibility, see the README.
- Compilers:
- Dropped support for
Header Directory Changes in CUDA Toolkit 13.0
CCCL 3.0 will be included with an upcoming CUDA Toolkit 13.0 release. In this release, the bundled CCCL headers have moved to new top-level directories under ${CTK_ROOT}/include/cccl/.
| Before CUDA 13.0 | After CUDA 13.0 |
|---|---|
| `${CTK_ROOT}/include/cuda/` | `${CTK_ROOT}/include/cccl/cuda/` |
| `${CTK_ROOT}/include/cub/` | `${CTK_ROOT}/include/cccl/cub/` |
| `${CTK_ROOT}/include/thrust/` | `${CTK_ROOT}/include/cccl/thrust/` |
These changes only affect the on-disk location of CCCL headers within the CUDA Toolkit installation.
What you need to know
- ❌ Do NOT write `#include <cccl/...>`; this will break.
- If using CCCL headers only in files compiled with nvcc
  - ✅ No action needed. This is the default for most users.
- If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
  - Using CMake and linking `CCCL::CCCL`
    - ✅ No action needed. (This is the recommended path. See example)
  - Other build systems
    - ⚠️ Add `${CTK_ROOT}/include/cccl` to your compiler's include search path (e.g., with `-I`)
These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.
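To make the first rule concrete, here is a minimal sketch of a host-compiled translation unit. The include spelling is the same before and after CUDA 13.0; only the search path differs. The `__has_include` fallback to the standard library, the `cxx` alias, and the `sum_of` and `check_sum_of` helpers are assumptions added for illustration, so the snippet still builds when no CCCL headers are reachable.

```cpp
#include <cassert>

// Spell CCCL includes exactly as before: <cuda/...>, <cub/...>, <thrust/...>.
// Never write <cccl/cuda/std/array>; the cccl/ directory belongs on the include
// search path (for example via -I${CTK_ROOT}/include/cccl), not in source code.
#if __has_include(<cuda/std/array>)
#  include <cuda/std/array>
namespace cxx = cuda::std;
#else
#  include <array> // illustration-only fallback when CCCL is not on the search path
namespace cxx = std;
#endif

// Hypothetical helper: sums a small fixed-size array on the host.
inline int sum_of(const cxx::array<int, 4>& values)
{
  int total = 0;
  for (int v : values)
  {
    total += v;
  }
  return total;
}

// Hypothetical self-check that exercises the helper.
inline void check_sum_of()
{
  const cxx::array<int, 4> values{1, 2, 3, 4};
  assert(sum_of(values) == 10);
}
```

With a plain host compiler and no CMake, the build line would add the new directory, for example `g++ -std=c++17 -I${CTK_ROOT}/include/cccl main.cpp` (the file name is hypothetical). With nvcc, or with CMake and the `CCCL::CCCL` target, no extra flag should be needed.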
Major API Changes
Hundreds of macros, internal types, and implementation details were removed or relocated to internal namespaces. This significantly reduces surface area and eliminates long-standing technical debt, improving both compile times and maintainability.
Removed Macros
Over 50 legacy macros have been removed in favor of modern C++ alternatives:
- `CUB_{MIN,MAX}`: use `cuda::std::{min,max}` instead #3821
- `THRUST_NODISCARD`: use `[[nodiscard]]` instead #3746
- `THRUST_INLINE_CONSTANT`: use `inline constexpr` instead #3746
- See CCCL 2.x to 3.0 Migration Guide for complete list
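The three replacements listed above can be sketched together as follows. This is an illustrative migration, not code from the library: the "Was" comments show how the old macros are assumed to have been used, `max_items`, `clamp_items`, and `check_clamp_items` are hypothetical names, and the `__has_include` fallback to `std::` is only there so the snippet builds without CCCL on the include path.

```cpp
#include <cassert>

#if __has_include(<cuda/std/algorithm>)
#  include <cuda/std/algorithm>
namespace cxx = cuda::std;
#else
#  include <algorithm> // illustration-only fallback
namespace cxx = std;
#endif

// Was (assumed usage): THRUST_INLINE_CONSTANT int max_items = 128;
inline constexpr int max_items = 128;

// Was (assumed usage): THRUST_NODISCARD int clamp_items(int requested);
// with a body of CUB_MAX(0, CUB_MIN(requested, max_items)).
[[nodiscard]] constexpr int clamp_items(int requested)
{
  return cxx::max(0, cxx::min(requested, max_items));
}

// Hypothetical self-check that exercises the helper.
inline void check_clamp_items()
{
  assert(clamp_items(-5) == 0);
  assert(clamp_items(64) == 64);
  assert(clamp_items(1000) == max_items);
}
```

Because the replacements are ordinary C++17 features and function templates, arguments are evaluated once, unlike with function-style macros.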
Removed Functions and Classes
- `thrust::optional`: use `cuda::std::optional` instead #4172
- `thrust::tuple`: use `cuda::std::tuple` instead #2395
- `thrust::pair`: use `cuda::std::pair` instead #2395
- `thrust::numeric_limits`: use `cuda::std::numeric_limits` instead #3366
- `cub::BFE`: use `cuda::bitfield_insert` and `cuda::bitfield_extract` instead #4031
- `cub::ConstantInputIterator`: use `thrust::constant_iterator` instead #3831
- `cub::CountingInputIterator`: use `thrust::counting_iterator` instead #3831
- `cub::GridBarrier`: use cooperative groups instead #3745
- `cub::DeviceSpmv`: use cuSPARSE instead #3320
- `cub::Mutex`: use `cuda::std::mutex` instead #3251
- See CCCL 2.x to 3.0 Migration Guide for complete list
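For the vocabulary types in this list, migration is mostly a rename. The sketch below shows the idea for `optional`, `pair`, and `tuple` under stated assumptions: the helper functions are hypothetical, the "Was" comments describe assumed prior usage, and the `__has_include` fallback to `std::` exists only so the snippet builds without CCCL on the include path.

```cpp
#include <cassert>

#if __has_include(<cuda/std/optional>)
#  include <cuda/std/optional>
#  include <cuda/std/tuple>
#  include <cuda/std/utility>
namespace cxx = cuda::std;
#else
#  include <optional> // illustration-only fallbacks
#  include <tuple>
#  include <utility>
namespace cxx = std;
#endif

// Was: thrust::optional<int>
inline cxx::optional<int> first_negative_index(const int* data, int n)
{
  for (int i = 0; i < n; ++i)
  {
    if (data[i] < 0)
    {
      return i;
    }
  }
  return cxx::nullopt;
}

// Was: thrust::pair<int, int>. Requires n > 0.
inline cxx::pair<int, int> min_max(const int* data, int n)
{
  cxx::pair<int, int> result{data[0], data[0]};
  for (int i = 1; i < n; ++i)
  {
    if (data[i] < result.first)
    {
      result.first = data[i];
    }
    if (data[i] > result.second)
    {
      result.second = data[i];
    }
  }
  return result;
}

// Was: thrust::tuple<int, int, int> built with thrust::make_tuple.
// Returns (min, max, index of first negative value or -1). Requires n > 0.
inline cxx::tuple<int, int, int> summarize(const int* data, int n)
{
  const auto mm  = min_max(data, n);
  const auto neg = first_negative_index(data, n);
  return cxx::make_tuple(mm.first, mm.second, neg.value_or(-1));
}

// Hypothetical self-check that exercises the helpers.
inline void check_summarize()
{
  const int data[] = {3, -2, 7};
  const auto s     = summarize(data, 3);
  assert(cxx::get<0>(s) == -2);
  assert(cxx::get<1>(s) == 7);
  assert(cxx::get<2>(s) == 1);
}
```

The `cuda::std::` types follow the standard library interfaces, which is why the same code also compiles against `std::` in the fallback branch.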
New Features
C++
cuda::
- `cuda::std::numeric_limits` now supports `__float128` #4059
- `cuda::std::optional<T&>` implementation (P2988) #3631
- `cuda::std::numbers` header for mathematical constants #3355
- `NVFP8/6/4` extended floating-point types support in `<cuda/std/cmath>` #3843
- `cuda::overflow_cast` for safe numeric conversions #4151
- `cuda::ilog2` and `cuda::ilog10` integer logarithms #4100
- `cuda::round_up` and `cuda::round_down` utilities #3234
cub::
- `cub::DeviceSegmentedReduce` now supports a large number of segments #3746
- `cub::DeviceCopy::Batched` now supports a large number of buffers #4129
- `cub::DeviceMemcpy::Batched` now supports a large number of buffers #4065
thrust::
- New `thrust::offset_iterator` iterator #4073
- Temporary storage allocations in parallel algorithms now respect `par_nosync` #4204
Python
CUDA Python Core Libraries are now available on PyPI through the `cuda-cccl` package.
pip install cuda-cccl
cuda.cccl.cooperative
- Block-level sorting now supports multi-dimensional thread blocks #4035, #4028
- Block-level data movement now supports multi-dimensional thread blocks #3161
- New block-level inclusive sum algorithm #3921
cuda.cccl.parallel
- New device-level segmented-reduce algorithm #3906
- New device-level unique-by-key algorithm #3947
- New device-level merge-sort algorithm #3763
What's Changed
🚀 Thrust / CUB
- Drop cub::Mutex by @bernhardmgruber in #3251
- Remove legacy macros from CUB util_arch.cuh by @bernhardmgruber in #3257
- Remove thrust::[unary|binary]_traits by @bernhardmgruber in #3260
- Drop thrust not1 and not2 by @bernhardmgruber in #3264
- Deprecate GridBarrier and GridBarrierLifetime by @bernhardmgruber in #3258
- Drop thrust::[unary|binary]_function by @bernhardmgruber in #3274
- Enable thrust::identity test for non-MSVC by @bernhardmgruber in #3281
- Enable PDL in triple chevron launch by @bernhardmgruber in #3282
- Drop Thrust legacy arch macros by @bernhardmgruber in #3298
- Drop Thrust's compiler_fence.h by @bernhardmgruber in #3300
- Drop CUB's util_compiler.cuh by @bernhardmgruber in #3302
- Drop Thrust's deprecated compiler macros by @bernhardmgruber in #3301
- Drop CUB_RUNTIME_ENABLED and THRUST_HAS_CUDART by @bernhardmgruber in #3305
- Require C++17 for compiling Thrust and CUB by @bernhardmgruber in #3255
- Deprecate Thrust's cpp_compatibility.h macros by @bernhardmgruber in #3299
- Deprecate cub::IterateThreadStore by @bernhardmgruber in #3337
- Drop CUB's BinaryFlip operator by @bernhardmgruber in #3332
- Deprecate cub::Swap by @bernhardmgruber in #3333
- Drop CUB APIs with a debug_synchronous parameter by @bernhardmgruber in #3330
- Drop CUB's util_compiler.cuh for real by @bernhardmgruber in #3340
- Drop cub::ValueCache by @bernhardmgruber in #3346
- Drop CDPv1 by @bernhardmgruber in #3344
- Use cuda::std::addressof in Thrust by @bernhardmgruber in #3363
- Drop deprecated aliases in Thrust functional by @bernhardmgruber in #3272
- Drop cub::DivideAndRoundUp by @bernhardmgruber in #3347
- Use cuda::std::min/max in Thrust by @bernhardmgruber in #3364
- Cleanup CUB util_arch by @bernhardmgruber in #2773
- Deprecate thrust::null_type by @bernhardmgruber in #3367
- Deprecate thrust::async by @bernhardmgruber in #3324
- Review CUB `util.ptx` for CCCL 2.x by @fbusato in #3342
- Deprecate thrust::numeric_limits by @bernhardmgruber in #3366
- Deprecate thrust::optional by @bernhardmgruber in #3307
- Redefine and deprecate thrust::remove_cvref by @bernhardmgruber in #3394
- Replace and deprecate thrust::cuda_cub::terminate by @bernhardmgruber in #3421
- Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ by @miscco in #3419
- Moves agents to `detail::<algorithm_name>` namespace by @elstehle in #3435
- Default transform_iterator's copy ctor by @bernhardmgruber in #3395
- Refactor allocator handling of contiguous_storage by @bernhardmgruber in #3050
- Drop thrust::detail::integer_traits by @bernhardmgruber in #3391
- Deprecate a few CUB macros by @bernhardmgruber in #3456
- Deprecate thrust universal iterator categories by @bernhardmgruber in #3461
- Drop thrust universal iterator categories by @bernhardmgruber in #3474
- Moves CUB kernel entry points to a detail namespace by @elstehle in #3468
- Deprecate block/warp algo specializations by @bernhardmgruber in #3455
- Drop thrust numeric_traits by @bernhardmgruber in #3476
- Deprecate and replace thrust::cuda_cub iterators by @bernhardmgruber in #3422
- Deprecate thrust macros from type_deduction.h by @bernhardmgruber in #3501
- Deprecate thrust event, future and more by @bernhardmgruber in #3457
- Drop thrust::null_type by @bernhardmgruber in #3508
- Deprecates tuning policy hubs by @elstehle in #3514
- Deprecate macros from cuda/detail/core/util.h by @bernhardmgruber in #3504
- Deprecate CUB iterators existing in Thrust by @bernhardmgruber in #3304
- Deprecate thrust logical meta functions by @bernhardmgruber in #3538
- Fixes value type of `thrust::tabulate_output_iterator` by @elstehle in #3573
- Internalize cuda/detail/core/* by @bernhardmgruber in #3505
- Remove CUB `DeviceSpMV` by @fbusato in #3549
- Remove `LEGACY_PTX_ARCH` by @fbusato in #3551
- Removes deprecated `Agent*` alias templates in the public namespace by @elstehle in #3717
- Move `ForceInclusive` parameter of `DispatchScan` before policy by @bernhardmgruber in #3739
- Drop Thrust's cpp_compatibility.h by @bernhardmgruber in #3746
- Drop thrust::identity by @bernhardmgruber in #3747
- Drop deprecated entities from CUB util_type by @bernhardmgruber in #3743
- Drop cub::GridBarrier by @bernhardmgruber in #3745
- Move Dispatcher policy hub parameters to the back by @bernhardmgruber in #3740
- Drop small deprecated entites by @bernhardmgruber in #3748
- Error when users specialize BaseTraits but not numeric_limits by @bernhardmgruber in #3836
- Drop deprecated iterators from Thrust cuda utils by @bernhardmgruber in #3905
- Drop CUB thread operators by @bernhardmgruber in #3918
- Minimize usage of cub::Traits by @bernhardmgruber in #3863
- Drop/internalize some macros by @bernhardmgruber in #3936
- Drop public access to RegBoundScaling/MemBoundScaling by @bernhardmgruber in #3934
- Drop deprecated features from CUB util_ptx.cuh by @bernhardmgruber in #3935
- Fix definition of universal_host_pinned_memory_resource by @bernhardmgruber in #3988
- Assert offset type in `DispatchScan[ByKey]` to be unsigned and at least 4 bytes by @bernhardmgruber in #3992
- Drop deprecated CUB macros by @bernhardmgruber in #3821
- Drop deprecated warp/block algo specializations by @bernhardmgruber in #4007
- Drop remaining 2.8-deprecated entities by @bernhardmgruber in #4009
- Use cuda::std::array in histogram APIs by @bernhardmgruber in #3973
- Test tuple of iterator reference assignment by @bernhardmgruber in #1964
- Rework counting_iterator difference by @bernhardmgruber in #3861
- [thrust, docs] Use the variadic overload of `make_zip_iterator` in the `zip_iterator` docs by @brycelelbach in #4111
📚 Libcudacxx
- ptx: Add add_ptx_instruction.py by @bernhardmgruber in #3190
- Fix assert definition for NVHPC due to constexpr issues by @miscco in #3418
- `ceil_div` return common type and optmize by @fbusato in #3229
- attempt to work around msvc bug exposed by type_list.h by @ericniebler in #3487
- Ensure that pointer_traits work nicely with proxy iterators by @miscco in #3519
- Define is_floating_point_v in terms of is_floating_point by @bernhardmgruber in #3923
- Rework our `mdspan` implementation by @miscco in #3343
- Implement more of cmath by @miscco in #3963
📝 Documentation
🔄 Other Changes
- Expands support for more offset types in segmented benchmark by @elstehle in #3231
- Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects by @miscco in #3253
- [Version] Update main to v2.9.0 by @github-actions in #3247
- Architecture and OS identification macros by @fbusato in #3237
- [Version] Update main to v3.0.0 by @github-actions in #3265
- CCCL Internal macro documentation by @fbusato in #3238
- Require at least gcc7 by @bernhardmgruber in #3268
- Drop ICC from CI by @bernhardmgruber in #3277
- [STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place by @caugonnet in #3270
- Disambiguate line continuations and macro continuations in <nv/target> by @wmaxey in #3244
- Drop VS 2017 from CI by @bernhardmgruber in #3287
- Drop ICC support in code by @bernhardmgruber in #3279
- Make CUB NVRTC commandline arguments come from a cmake template by @wmaxey in #3292
- Add components to the bug report template by @caugonnet in #3295
- Use process isolation instead of default hyper-v for Windows. by @wmaxey in #3294
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #3248
- Drop CTK 11.x from CI by @bernhardmgruber in #3275
- Update repo_man and packman versions by @shwina in #3293
- Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload by @elstehle in #2506
- Refactor scan tunings by @bernhardmgruber in #3262
- Implement `views::empty` by @miscco in #3254
- Refactor `limits` and `climits` by @davebayer in #3221
- cuda.parallel: Add documentation for the current iterators along with examples and tests by @NaderAlAwar in #3311
- Drop clang<14 from CI, update devcontainers. by @alliepiper in #3309
- [STF] Cleanup task dependencies object constructors by @caugonnet in #3291
- Disable test with a gcc-14 regression by @miscco in #3297
- Remove dropped function objects from docs by @bernhardmgruber in #3319
- Document `NV_TARGET` macros by @fbusato in #3313
- [STF] Define ctx.pick_stream() which was missing for the unified context by @caugonnet in #3326
- Clarify CUB transform output can overlap input by @bernhardmgruber in #3323
- Limits the number of different offset types for `DeviceMergeSort` by @elstehle in #3328
- Drop thrust::void_t by @bernhardmgruber in #3362
- Fix all_of documentation for empty ranges by @upsj in #3358
- [STF] Do not keep track of dangling events in a CUDA graph backend by @caugonnet in #3327
- Extract scan kernels into NVRTC-compilable header by @shwina in #3334
- Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` by @davebayer in #3361
- Deprecate cub::DeviceSpmv by @bernhardmgruber in #3320
- Improves `DeviceSegmentedSort` test run time for large number of items and segments by @elstehle in #3246
- Compile basic infra test with C++17 by @bernhardmgruber in #3377
- Adds support for large number of items and large number of segments to `DeviceSegmentedSort` by @elstehle in #3308
- Exit with error when RAPIDS CI fails. by @alliepiper in #3385
- cuda.parallel: Support structured types as algorithm inputs by @shwina in #3218
- Fix broken `_CCCL_BUILTIN_ASSUME` macro by @fbusato in #3314
- Replace `typedef` with `using` in libcu++ by @davebayer in #3368
- Upgrade to Catch2 3.8 by @bernhardmgruber in #3310
- refactor `<cuda/std/cstdint>` by @davebayer in #3325
- Update CODEOWNERS by @jrhemstad in #3331
- Fix sign-compare warning by @bernhardmgruber in #3408
- Implement more cmath functions to be usable on host and device by @miscco in #3382
- Extend CUB reduce benchmarks by @bernhardmgruber in #3401
- Update upload-pages-artifact to v3 by @shwina in #3423
- `std::linalg` accessors and `transposed_layout` by @fbusato in #2962
- Add round up/down to multiple by @fbusato in #3234
- [FEA]: Introduce Python module with CCCL headers by @rwgk in #3201
- cuda.parallel: Add optional stream argument to reduce_into() by @NaderAlAwar in #3348
- Fix Deploy CCCL pages workflow by @rwgk in #3434
- [CUDAX] Fix CI issues in the nightly testing by @pciolkosz in #3443
- Remove deprecated `cub::min` and `thrust::remove_cvref` by @miscco in #3450
- Fix typo in builtin by @miscco in #3451
- Uses unsigned offset types in thrust's scan algorithms by @elstehle in #3436
- Turn C++ dialect warning into error by @bernhardmgruber in #3453
- Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` by @elstehle in #3437
- Add `cuda::is_floating_point` supporting half and bfloat by @bernhardmgruber in #3379
- Drop C++11 and C++14 support for all of cccl by @miscco in #3417
- [CUDAX] Fix block and grid dimension order in <<<>>> in one of the hierarchy tests by @pciolkosz in #3465
- Add `--extended-lambda` to the list of removed clangd flags by @fbusato in #3432
- add `_CCCL_HAS_NVFP8` macro by @fbusato in #3429
- Add `_CCCL_BUILTIN_PREFETCH` by @fbusato in #3433
- Ensure that headers in `<cuda/*>` can be build with a C++ only compiler by @miscco in #3472
- Specialize __is_extended_floating_point for FP8 types by @bernhardmgruber in #3470
- Refactor CUB's util_debug by @bernhardmgruber in #3345
- Specialize `cuda::std::numeric_limits` for FP8 types by @davebayer in #3478
- Fix typo in limits by @miscco in #3491
- Add dynamic CUB dispatch for scan to support c.parallel by @shwina in #3398
- Use a raw string literal for nvrtc source by @bernhardmgruber in #3486
- Add `popcount`, `clz`, `ctz` builtin intrinsics by @fbusato in #3489
- [STF] Fix paths in the STF unittest infrastructure by @caugonnet in #3396
- Increase test coverage now that we dropped half of our configs by @miscco in #3500
- Fix issue with conversion between `mdspan<T>` and `mdspan<const T>` by @miscco in #3469
- Extract merge sort kernels to NVRTC compilable header by @NaderAlAwar in #3438
- [STF] Generate statistics about the DOT output by @caugonnet in #3509
- [CUDAX] Align some naming and add missing docs by @pciolkosz in #3497
- [CUDAX] Rename `hierarchy_dimensions_fragment` to `hierarchy_dimensions` and remove the old alias by @pciolkosz in #3496
- cuda.parallel: invoke pytest directly rather than via `python -m pytest` by @shwina in #3523
- add a `__call_result_t` alias template, implement `__is_callable_v` with it by @ericniebler in #3527
- cudastf (examples): Fix compiler errors when enabling examples for CUDA STF by @janciesko in #3516
- A few improvements for internal macro documentation by @fbusato in #3554
- Replace pipes.quote with shlex.quote in lit config by @wmaxey in #3547
- Tune cub::DeviceTransform for Blackwell by @bernhardmgruber in #3543
- Refactor injecting benchmark policy_hub by @bernhardmgruber in #3561
- Try to always include the definition of barrier_native_handle when needed by @miscco in #3556
- Fix transform iterator for non-copy-constructible types by @bernhardmgruber in #3542
- Sync ptx helpers with libcudaptx by @bernhardmgruber in #3564
- Update ptx_isa.h to include 8.7 by @bernhardmgruber in #3563
- add missing visibility annotations to ustdex types that have data members by @ericniebler in #3571
- [STF] Document dot sections by @caugonnet in #3506
- Remove nvks runners from testing pool. by @alliepiper in #3580
- Try and get rapids green by @miscco in #3503
- Add `__int128` and `__float128` detection macros by @fbusato in #3413
- Remove all code paths and policies for SM37 and below by @fbusato in #3466
- PTX: Update generated files with Blackwell instructions by @bernhardmgruber in #3568
- Update CI matrix to use NVKS nodes. by @alliepiper in #3572
- Deprecate and replace `CUB_IS_INT128_ENABLED` by @fbusato in #3427
- Adds support for large num items to `DeviceMerge` by @elstehle in #3530
- Support FP16 traits on CTK 12.0 by @bernhardmgruber in #3535
- Suppress execution checks for vocabulary types by @miscco in #3578
- [nv/target] Add sm_120 macros. by @wmaxey in #3550
- PTX: Remove internal instructions by @bernhardmgruber in #3583
- Add dynamic CUB dispatch for merge_sort by @NaderAlAwar in #3525
- PTX: Update existing instructions by @bernhardmgruber in #3584
- PTX: Add clusterlaunchcontrol by @bernhardmgruber in #3589
- PTX: Add st.bulk by @bernhardmgruber in #3604
- PTX: Add multimem instructions by @bernhardmgruber in #3603
- PTX: Add cp.async.mbarrier.arrive{.noinc} by @bernhardmgruber in #3602
- PTX: Add tcgen05 instructions by @bernhardmgruber in #3607
- Use a differrent implementation for `tuple_of_iterator_references` to tuple conversion by @miscco in #3609
- work around erroneous "undefined in device code" error in `basic_any` by @ericniebler in #3614
- Deprecate `AgentSegmentFixupPolicy` by @fbusato in #3593
- Fix deadlocks by enabling eager module loading in libcudacxx tests. by @wmaxey in #3585
- Add b200 tunings for histogram by @bernhardmgruber in #3616
- make `uninitialized[_async]_buffer`'s range accessors const-correct by @ericniebler in #3615
- Fix typo in index.rst by @cliffburdick in #3620
- Disable X86-64 detection macro for Arm64 emulation on MSVC by @fbusato in #3540
- Deprecate ABI v2 and v3 in libcudacxx by @wmaxey in #3575
- Add b200 policies for reduce by @bernhardmgruber in #3612
- Add b200 tunings for reduce.by_key by @bernhardmgruber in #3610
- Remove CUDA 11.x support by @fbusato in #3596
- PTX: fix cp.async.bulk.tensor and mbarrier.arrive by @bernhardmgruber in #3628
- Add b200 tunings for radix_sort.keys by @bernhardmgruber in #3611
- Try and make nvrtc on windows pass by @miscco in #3623
- Sync PTX refactorings from libcudaptx by @bernhardmgruber in #3632
- Bump CI to use CTK 12.8, add sm100 build. by @alliepiper in #3544
- PTX: add bfind, exit and trap by @bernhardmgruber in #3627
- Adds benchmarks for `cub::DeviceMerge` by @elstehle in #3529
- remove AgentSegmentFixupPolicy by @fbusato in #3639
- `__builtin_isfinite` is only available above nvrtc 12.2 by @miscco in #3644
- Turn `TEST_[HALF|BF]_T` into function-style macros and fix some tests by @bernhardmgruber in #3608
- [STF] frozen_logical_data::get_access_mode() by @caugonnet in #3646
- Internalize `triple_chevron` by @bernhardmgruber in #3648
- This improves the detection logic for `__cccl_ptx_isa` for clang-cuda by @miscco in #3647
- Try to fix backport workflow by @leofang in #3634
- Revert #3623 by @leofang in #3654
- Deprecate cub::FpLimits in favor of cuda::std::numeric_limits by @bernhardmgruber in #3635
- Fix transform_iterator and drop result_of_adaptable_function by @bernhardmgruber in #3652
- Transition build system of cuda_cccl and cuda_parallel to scikit-build-core by @oleksandr-pavlyk in #3597
- Replaces bool template parameters on `Dispatch*` class templates to use `enum class` by @elstehle in #3643
- Add b200 policies for device.select.if,flagged,unique by @bernhardmgruber in #3545
- Add b200 tunings for radix_sort.pairs by @bernhardmgruber in #3626
- Fix the vectorized loading of BlockLoad by @ChristinaZ in #3517
- PTX: mbarrier.{test,try}_wait: Fix return value by @ahendriksen in #3670
- Add b200 policies for cub.select.unique_by_key by @bernhardmgruber in #3557
- Update RAPIDS CI build to 25.04. by @alliepiper in #3539
- Fix issues with nvrtc compilation by @miscco in #3666
- Function-like macros for FP6/BF16 macros by @fbusato in #3588
- Remove `cub::ArrayWrapper` by @fbusato in #3677
- Internalize cub::PolicyWrapper by @fbusato in #3681
- Modernize MSVC 2005/nvcc workaround by @bernhardmgruber in #3606
- Deprecate `cub::AliasTemporaries` by @fbusato in #3679
- [CUB] Remove pre-c++17 conditions and code by @fbusato in #3684
- Internalize cub::KernelConfig by @fbusato in #3683
- remove MSVC 2017 paths by @fbusato in #3553
- [Thrust] Remove pre-c++17 conditions and code by @fbusato in #3687
- Remove cugraph-ops from RAPIDS 25.04 builds. by @bdice in #3675
- Refactor radix_sort tuning by @bernhardmgruber in #3657
- Make thrust iterators work with NVRTC by @bernhardmgruber in #3676
- Deprecate and replace thrust::identity by @bernhardmgruber in #3649
- Replace CUB iterators by Thrust ones by @bernhardmgruber in #3480
- Drop Thrust's global workaround by @bernhardmgruber in #3692
- replace Int2Type in CUB library by @fbusato in #3641
- Add b200 policies for cub.device.run_length_encode.encode,non_trivialruns by @bernhardmgruber in #3546
- Deprecate cub::Trait::CATEGORY|PRIMITIVE|NULL_TYPE by @bernhardmgruber in #3689
- Fix sccache reporting in CI summaries. by @alliepiper in #3621
- Make THRUST_DEVICE_SYSTEM and THRUST_CPP_DIALECT independent of THRUST_HOST_SYSTEM by @adams381 in #3659
- Deprecate `cub::RegBoundScaling` and `cub::MemBoundScaling` by @fbusato in #3685
- Fix devcontainers' `initializeCommand` by @trxcllnt in #3533
- [cuda.cooperative] Add missing overloads to block.reduce and block.sum by @brycelelbach in #2691
- clean up the cudax `__launch_transform` code and document its purpose and design by @ericniebler in #3526
- Add b200 policies for partition.three_way by @bernhardmgruber in #3708
- Fix multiple CI arches in matrix by @alliepiper in #3702
- Minor cleanups following bool-to-enum template parameter PR by @elstehle in #3716
- Remove V2 and V3 ABI support from libcudacxx. by @wmaxey in #3662
- Add b200 tunings for scan.exclusive.by_key by @bernhardmgruber in #3560
- assorted bug fixes for the std::execution implementation in cudax by @ericniebler in #3721
- Minor fix for a regressing tuning in reduce.by_key by @gonidelis in #3723
- Fix SM100 histogram tunings by @bernhardmgruber in #3691
- Move `zip_iterator` to internally use `cuda::std::tuple` by @miscco in #3725
- Remove reduce tunings with no benefit by @bernhardmgruber in #3724
- fix ::cuda::discard_memory by @fbusato in #3733
- Add b200 policies for cub.device.partition.flagged,if by @bernhardmgruber in #3617
- Add b200 tunings for scan.exclusive.sum by @bernhardmgruber in #3559
- Fix cub trait deprecations by @bernhardmgruber in #3742
- Nightly fixes by @alliepiper in #3720
- Clarify scan benchmarks by @bernhardmgruber in #3709
- Drop thrust::future|event|async::* by @bernhardmgruber in #3730
- Replace raw arm64/x86_64 macros by @fbusato in #3732
- Add Merge Sort implementation for c.parallel by @NaderAlAwar in #3636
- Extracted Segmented Reduce kernels into NVRTC compilable header by @oleksandr-pavlyk in #3727
- Remove unsupported CPU architecture paths (32-bit) by @fbusato in #3752
- [Automation] Add release workflow for tagging and testing new RCs by @wmaxey in #3009
- fix cuda std namespace by @fbusato in #3751
- Remove cuda/init.py in `cuda-parallel` package by @shwina in #3750
- Simplify `cuda::std::{min,max}` by @miscco in #3758
- Add dynamic CUB dispatch for SegmentedReduce by @oleksandr-pavlyk in #3753
- [STF] Implement kernel chains in the graph backend without child graphs by @caugonnet in #3707
- Add Scan implementation for c.parallel by @shwina in #3462
- cuda.parallel: Minor perf improvements by @shwina in #3718
- refactor `<cuda/std/cstdlib>` by @davebayer in #3339
- Fix python editable builds by @oleksandr-pavlyk in #3762
- Reinstate `thrust::optional` by @miscco in #3759
- Drop unsupported dialects for libcu++ by @miscco in #3695
- Disable `[[no_unique_address]]` for MSVC by @miscco in #3757
- cuda.coop: Generalize war_introspection utility for any # of arguments by @shwina in #3769
- Avoid issues with nvcc compilation in c++ mode by @miscco in #3770
- Refactor `cuda/cmath` functions documentation by @fbusato in #3773
- [STF] Factorize large event lists in CUDA graphs by @caugonnet in #3756
- Replace pre-c++17 traits with modern ones in CUB by @fbusato in #3774
- Drop cugraph-gnn from rapids CI by @miscco in #3771
- [STF] Ensure dot_section::guard is actually movable by @caugonnet in #3778
- Guard PDL by availability by @miscco in #3779
- [STF] virtual to_string() method for STF contexts by @caugonnet in #3781
- [STF] Enable freeze on logical tokens by @caugonnet in #3782
- Refactors `DeviceMemcpy`'s `vectorized_copy` tests by @elstehle in #3777
- More h100 usage. by @alliepiper in #3776
- Add Python wrappers for c.parallel scan API by @shwina in #3592
- Replace `_CCCL_IF_CONSTEXPR` by @fbusato in #3775
- Remove `_CCCL_CONSTEXPR_CXX14/17` by @fbusato in #3793
- Bump -std from 14 to 17 in `./ci/(build|test)_cub.sh` examples. by @tpn in #3792
- [CUDAX] Add host launch API allowing stream ordered host execution by @pciolkosz in #3555
- Moves `DeviceMemcpy`'s `BitPackedCounter` tests to Catch2 by @elstehle in #3794
- Refactor `<cuda/std/cstring>` by @davebayer in #3484
- fix NoopExecutor by @fbusato in #3811
- Unifies workload generation for `DeviceMerge` benchmarks by @elstehle in #3645
- Optimize and clean `countl`, `countr`, `popcount`, `has_single_bit` by @fbusato in #3414
- fix `-Werror=unused-result` by @fbusato in #3810
- Enable `cuda::std::ssize` for C++17 by @miscco in #3813
- fix `_LIBCUDACXX_HAS_NO_INT128` with NVRTC by @fbusato in #3802
- Move radix sort kernels to separate NVRTC compilable header by @NaderAlAwar in #3803
- Fix `popc` parentheses warning by @fbusato in #3820
- Add arch_traits for sm100 to cudax. by @alliepiper in #3818
- Remove unused function parameter by @ericniebler in #3828
- CI summary fix by @alliepiper in #3826
- Refactor Thrust allocator example by @bernhardmgruber in #3830
- [STF] Improved cache mechanism for executable CUDA graphs by @caugonnet in #3768
- Drop deprecated CUB iterators by @bernhardmgruber in #3831
- Use libcu++ limits/trait in tests/benchmarks by @bernhardmgruber in #3822
- Move unique_by_key kernels to NVRTC compilable header by @NaderAlAwar in #3815
- Specialize `numeric_limits` for CUDA 12.8 FP types by @davebayer in #3832
- Refactor thrust::zip_iterator by @bernhardmgruber in #3834
- Refactor Thrust iterators 2/4 by @bernhardmgruber in #3840
- Refactor Thrust iterators 3/4 by @bernhardmgruber in #3842
- Refactor Thrust iterators 4/4 by @bernhardmgruber in #3833
- Increase libcudacxx test timeout by @alliepiper in #3850
- Use lower case variable name to avoid macro collosions by @miscco in #3856
- Fix incorrect availability of `variant` in docs by @miscco in #3859
- Add cuda_cccl to the list of Python packages for which test suite is run by @oleksandr-pavlyk in #3846
- Refactor Thrust iterators 1/4 by @bernhardmgruber in #3839
- Rewrites `DeviceMemcpy::Batched` tests to use device-side data generation and Catch2 by @elstehle in #3849
- Refactor CUB transfrom by @bernhardmgruber in #3825
- Add Python wrappers for c.parallel merge_sort API by @NaderAlAwar in #3763
- Add c parallel segmented reduce api by @oleksandr-pavlyk in #3838
- [libcudacxx] Stable abstraction for Blackwell work-stealing (PTX try_cancel) by @gonzalobg in #3671
- Consider specializations of `std::iterator_traits` by @miscco in #3837
- Update supported C++ dialects in README by @davebayer in #3879
- Refactor `assume_aligned` implementation by @fbusato in #3765
- Refactor and make NVRTC compile `<cub/util_device>` by @bernhardmgruber in #3880
- Cache the result of `merge_sort()` by @shwina in #3881
- do not try to use clang-19's support for c++26 pack indexing by @ericniebler in #3888
- Add support for single item per thread calls to block_scan.exclusive_scan by @tpn in #3829
- Document `cuda::maximum`, `cuda::minimum` by @fbusato in #3883
- Refactor Thrust iterator_traits by @bernhardmgruber in #3892
- Update Blackwell PTX instruction availability tables by @bernhardmgruber in #3894
- Fix CCCL C headers to be compileable by C compiler by @oleksandr-pavlyk in #3885
- Move transform kernels to NVRTC compilable header by @shwina in #3875
- PTX `shfl_sync` by @fbusato in #3241
- Add a warning that we cannot tune transform by @bernhardmgruber in #3896
- Extend tuning guide by @bernhardmgruber in #3904
- Drop join_iterator by @bernhardmgruber in #3891
- Revert Thrust find_if_not implementation to please nvc++ by @bernhardmgruber in #3901
- [CUB/docs] Add missing closing braces to `BlockReduce` kernel examples in CUB docs. by @brycelelbach in #3916
- [STF] Executable CUDA graphs caching policies by @caugonnet in #3868
- Refactor Thrust iterator internals by @bernhardmgruber in #3893
- Revert Thrust mismatch implementation by @bernhardmgruber in #3899
- Replace usage of CUB_MIN|MAX in reduce by @bernhardmgruber in #3927
- Move to cuda::std::iterator_traits in CUB by @bernhardmgruber in #3924
- Add C++ test for single-item-per-thread BlockScan Sum routines. by @tpn in #3889
- Rename threads_in_block -> threads_per_block to be consistent with CUB. by @tpn in #3919
- Implement cuda.coopertive.block_scan.inclusive_sum(). by @tpn in #3921
- Replace CUB macros in more places by @bernhardmgruber in #3930
- [PTX] Add shl, shr, bmsk, prmt by @bernhardmgruber in #3939
- Add test_reduce_api.py::test_reduce_struct_type_minmax by @oleksandr-pavlyk in #3938
- Add `cuda::std::aligned_accessor` by @fbusato in #3731
- [STF] Thread safe graph_ctx by @caugonnet in #3925
- Replace CUB macros in tunings and benchmarks by @bernhardmgruber in #3931
- Deprecate and replace some Thrust iterator traits by @bernhardmgruber in #3928
- Optimize `bit_floor`, `bit_ceil`, `bit_width` by @fbusato in #3296
- Allow RAPIDS workflow to run on an arbitrary branch. by @alliepiper in #3945
- Initial CUDA C++ Execution Model documentation by @gonzalobg in #3873
- [STF] Remove unmaintained CUDASTF_DEBUG option by @caugonnet in #3944
- Revert "Initial CUDA C++ Execution Model documentation (#3873)" by @alliepiper in #3950
- Implement `ranges::ref_view` by @miscco in #3316
- Expose CCCL branch controls on Actions UI for RAPIDS workflow. by @alliepiper in #3948
- Drop unused `TEST_COMPILER_CUDACC_BELOW_11_3` macro by @miscco in #3946
- Allow NVRTC to compile more of CUB by @bernhardmgruber in #3951
- Use `_CCCL_REQUIRES_EXPR` in test code by @miscco in #3954
- Improve `<cuda/std/bit>` documentation by @fbusato in #3959
- [STF] Support generation of multiple CUDA graphs from separate threads by @caugonnet in #3943
- Add segmented_reduce python api by @oleksandr-pavlyk in #3906
- Implement `__cccl_is_integer` trait by @davebayer in #3962
- Implement `cudax::async_buffer` by @miscco in #3460
- Add dynamic CUB dispatch for unique_by_key by @NaderAlAwar in #3816
- Fix typo in `_LIBCUDACXX_HAS_NVFP16` macro by @davebayer in #3965
- Drop obsolete thrust tuple algorithms by @bernhardmgruber in #3966
- Extend CUB policy and tuning documentation by @bernhardmgruber in #3933
- Fix thrust::raw_reference_cast for tuple_of_iterator_references and simplify thrust::generate by @bernhardmgruber in #3970
- [PTX] Add `st`, `ld` instructions by @fbusato in #3974
- [cuda.cooperative] Support multidimensional thread blocks in block load/store and improve load/store docs by @brycelelbach in #3161
- Disable automatic header inclusion for clangd by @miscco in #3365
- Deprecate and replace `THRUST_STATIC_ASSERT` by @bernhardmgruber in #3971
- Avoid int overflow during multipl