CCCL 3.0 Release

The 3.0 release of the CUDA Core Compute Libraries (CCCL) marks our first major version since unifying the Thrust, CUB, and libcudacxx libraries under a single repository. This release reflects over a year of work focused on cleanup, consolidation, and modernizing the codebase to support future growth.

While this release includes a number of breaking changes, many involve the consolidation of APIs—particularly in the thrust:: and cub:: namespaces—as well as cleanup of internal details that were never intended for public use. In many cases, redundant functionality from thrust:: or cub:: has been replaced with equivalent or improved abstractions from the cuda:: or cuda::std:: namespaces. Impact should be minimal for most users. For full details and recommended migration steps, please consult the CCCL 2.x to 3.0 Migration Guide.

Key Changes in CCCL 3.0

Requirements

C++17 or newer is now required (support for C++11 and C++14 has been dropped #3255)
CUDA Toolkit 12.0+ is now required (support for CTK 11.x has been dropped). For details on version compatibility, see the README.
Compilers:
- GCC 7+ (support for GCC < 7 has been dropped #3268)
- Clang 14+ (support for Clang < 14 has been dropped #3309)
- MSVC 2019+ (support for MSVC 2017 has been dropped #3287, #3553)
Dropped support for
- ICC #3277, #3279
- CUDA Dynamic Parallelism v1 (CDPv1) #3344
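A project that wants to fail early on an unsupported toolchain could mirror these minimums with compile-time guards. The following is only a sketch: the `PROJECT_CXX_LEVEL` macro and the `project_cxx_level` constant are hypothetical names, and the checks cover just the language level, the nvcc major version, and GCC.

```cpp
// Sketch: compile-time guards mirroring the CCCL 3.0 minimum requirements.
// PROJECT_CXX_LEVEL and project_cxx_level are hypothetical, project-local names.

// MSVC only updates __cplusplus when /Zc:__cplusplus is passed, so prefer
// _MSVC_LANG where it is available.
#if defined(_MSVC_LANG)
#  define PROJECT_CXX_LEVEL _MSVC_LANG
#else
#  define PROJECT_CXX_LEVEL __cplusplus
#endif

static_assert(PROJECT_CXX_LEVEL >= 201703L, "CCCL 3.0 requires C++17 or newer");
constexpr long project_cxx_level = PROJECT_CXX_LEVEL;

// nvcc defines __CUDACC_VER_MAJOR__; host-only translation units skip this check.
#if defined(__CUDACC_VER_MAJOR__)
static_assert(__CUDACC_VER_MAJOR__ >= 12, "CCCL 3.0 requires CUDA Toolkit 12.0 or newer");
#endif

// GCC proper: clang also defines __GNUC__, so it is excluded here.
#if defined(__GNUC__) && !defined(__clang__)
static_assert(__GNUC__ >= 7, "CCCL 3.0 requires GCC 7 or newer");
#endif
```

CCCL itself already diagnoses unsupported dialects (see #3453), so guards like these are mainly useful for producing a project-specific error message.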

Header Directory Changes in CUDA Toolkit 13.0

CCCL 3.0 will be included with an upcoming CUDA Toolkit 13.0 release. In that toolkit release, the bundled CCCL headers move to new top-level directories under ${CTK_ROOT}/include/cccl/.

Before CUDA 13.0	After CUDA 13.0
`${CTK_ROOT}/include/cuda/`	`${CTK_ROOT}/include/cccl/cuda/`
`${CTK_ROOT}/include/cub/`	`${CTK_ROOT}/include/cccl/cub/`
`${CTK_ROOT}/include/thrust/`	`${CTK_ROOT}/include/cccl/thrust/`

These changes only affect the on-disk location of CCCL headers within the CUDA Toolkit installation.

What you need to know

❌ Do NOT write #include <cccl/...> — this will break.
If using CCCL headers only in files compiled with nvcc
- ✅ No action needed. This is the default for most users.
If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
- Using CMake and linking CCCL::CCCL
  - ✅ No action needed. (This is the recommended path. See example)
- Other build systems
  - ⚠️ Add ${CTK_ROOT}/include/cccl to your compiler’s include search path (e.g., with -I)

These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.
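For host-only builds outside CMake, a small probe can confirm that the include path is configured. This is a sketch under two assumptions: that `<cuda/version>` is the header defining `CCCL_MAJOR_VERSION`, and that `CTK_ROOT` points at a CUDA 13.0 toolkit. The `project_` names are hypothetical.

```cpp
// Sketch: probe whether CCCL headers are reachable with the normal include
// spelling. Hypothetical build line for a host-only, non-CMake build:
//   g++ -std=c++17 -I"${CTK_ROOT}/include/cccl" -c probe.cpp

// Correct: no "cccl/" prefix inside the include directive itself.
#if __has_include(<cuda/version>)
#  include <cuda/version>
#endif

// CCCL_MAJOR_VERSION is assumed to come from <cuda/version>.
#if defined(CCCL_MAJOR_VERSION)
constexpr bool project_cccl_found = true;
constexpr int project_cccl_major  = CCCL_MAJOR_VERSION;
#else
constexpr bool project_cccl_found = false;
constexpr int project_cccl_major  = 0;
#endif
```

If `project_cccl_found` is `false` in a host-only build against CUDA 13.0, the likely cause is a missing `-I${CTK_ROOT}/include/cccl`, not a need to change the `#include` lines.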

Major API Changes

Hundreds of macros, internal types, and implementation details were removed or relocated to internal namespaces. This significantly reduces surface area and eliminates long-standing technical debt, improving both compile times and maintainability.

Removed Macros

Over 50 legacy macros have been removed in favor of modern C++ alternatives:

CUB_{MIN,MAX}: use cuda::std::{min,max} instead #3821
THRUST_NODISCARD: use [[nodiscard]] instead #3746
THRUST_INLINE_CONSTANT: use `inline constexpr` instead #3746
See the CCCL 2.x to 3.0 Migration Guide for the complete list.
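As an illustration of the macro replacements, the sketch below shows the C++17 spellings for `THRUST_NODISCARD` and `THRUST_INLINE_CONSTANT`. The function and constant names are hypothetical, and the "before" lines only approximate how the macros were typically used.

```cpp
// Before (CCCL 2.x), kept as comments because the macros no longer exist:
//   THRUST_NODISCARD int clamp_to_zero(int n);
//   THRUST_INLINE_CONSTANT int default_block_size = 256;

// After: plain C++17 language features, no CCCL header involved.
[[nodiscard]] constexpr int clamp_to_zero(int n)
{
  return n < 0 ? 0 : n;
}

// One definition shared across all translation units that include this header.
inline constexpr int default_block_size = 256;
```

`CUB_MIN` and `CUB_MAX` follow the same pattern: replace the macro call with `cuda::std::min` or `cuda::std::max`, which evaluate each argument exactly once.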

Removed Functions and Classes

thrust::optional: use cuda::std::optional instead #4172
thrust::tuple: use cuda::std::tuple instead #2395
thrust::pair: use cuda::std::pair instead #2395
thrust::numeric_limits: use cuda::std::numeric_limits instead #3366
cub::BFE: use cuda::bitfield_insert and cuda::bitfield_extract instead #4031
cub::ConstantInputIterator: use thrust::constant_iterator instead #3831
cub::CountingInputIterator: use thrust::counting_iterator instead #3831
cub::GridBarrier: use cooperative groups instead #3745
cub::DeviceSpmv: use cuSPARSE instead #3320
cub::Mutex: use std::mutex instead #3251
See the CCCL 2.x to 3.0 Migration Guide for the complete list.
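For the vocabulary types, migration is mostly a namespace change. The sketch below uses a hypothetical `project_std` alias that resolves to `cuda::std` when the libcu++ headers are on the include path and falls back to `std` otherwise, purely so the example compiles without CCCL. The function names are illustrative, not part of any library.

```cpp
#if __has_include(<cuda/std/optional>) && __has_include(<cuda/std/utility>) \
  && __has_include(<cuda/std/limits>)
#  include <cuda/std/limits>
#  include <cuda/std/optional>
#  include <cuda/std/utility>
namespace project_std = cuda::std; // the CCCL 3.0 replacements
#else
#  include <limits>
#  include <optional>
#  include <utility>
namespace project_std = std; // fallback so the sketch builds without CCCL
#endif

// 2.x spelling: thrust::optional<int>
constexpr project_std::optional<int> checked_div(int num, int den)
{
  if (den == 0)
  {
    return project_std::nullopt;
  }
  return num / den;
}

// 2.x spelling: thrust::pair<int, int>
constexpr project_std::pair<int, int> ordered(int a, int b)
{
  return a < b ? project_std::pair<int, int>{a, b} : project_std::pair<int, int>{b, a};
}

// 2.x spelling: thrust::numeric_limits<int>::max()
constexpr int int_max = project_std::numeric_limits<int>::max();
```

In real code the alias is unnecessary: include the `<cuda/std/...>` header and spell `cuda::std::optional`, `cuda::std::pair`, and so on directly.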

New Features

C++

`cuda::`

cuda::std::numeric_limits now supports __float128 #4059
cuda::std::optional<T&> implementation (P2988) #3631
cuda::std::numbers header for mathematical constants #3355
NVFP8/6/4 extended floating-point types support in <cuda/std/cmath> #3843
cuda::overflow_cast for safe numeric conversions #4151
cuda::ilog2 and cuda::ilog10 integer logarithms #4100
cuda::round_up and cuda::round_down utilities #3234
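To make the intent of the rounding and logarithm utilities concrete, the sketch below spells out the arithmetic they are assumed to perform for positive integers. These `_sketch` functions are hypothetical stand-ins written for illustration, not the library implementations, and they ignore overflow and the `value == 0` case for the logarithm.

```cpp
// Hypothetical stand-ins illustrating the assumed semantics; not CCCL code.

// Smallest multiple of `multiple` that is >= value (multiple must be > 0).
constexpr unsigned round_up_sketch(unsigned value, unsigned multiple)
{
  return ((value + multiple - 1) / multiple) * multiple;
}

// Largest multiple of `multiple` that is <= value (multiple must be > 0).
constexpr unsigned round_down_sketch(unsigned value, unsigned multiple)
{
  return (value / multiple) * multiple;
}

// Base-2 logarithm rounded down (value must be > 0).
constexpr int ilog2_sketch(unsigned value)
{
  int result = 0;
  while (value >>= 1)
  {
    ++result;
  }
  return result;
}
```

A typical use is padding a byte count to an alignment boundary, for example rounding 10 up to a multiple of 4 to get 12.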

`cub::`

`cub::DeviceSegmentedReduce` now supports a large number of segments #3746
`cub::DeviceCopy::Batched` now supports a large number of buffers #4129
`cub::DeviceMemcpy::Batched` now supports a large number of buffers #4065

`thrust::`

New `thrust::offset_iterator` iterator #4073
Temporary storage allocations in parallel algorithms now respect `par_nosync` #4204

Python

CUDA Python Core Libraries are now available on PyPI through the cuda-cccl package.

pip install cuda-cccl

cuda.cccl.cooperative

Block-level sorting now supports multi-dimensional thread blocks #4035, #4028
Block-level data movement now supports multi-dimensional thread blocks #3161
New block-level inclusive sum algorithm #3921

cuda.cccl.parallel

New device-level segmented-reduce algorithm #3906
New device-level unique-by-key algorithm #3947
New device-level merge-sort algorithm #3763

What's Changed

🚀 Thrust / CUB

Drop cub::Mutex by @bernhardmgruber in #3251
Remove legacy macros from CUB util_arch.cuh by @bernhardmgruber in #3257
Remove thrust::[unary|binary]_traits by @bernhardmgruber in #3260
Drop thrust not1 and not2 by @bernhardmgruber in #3264
Deprecate GridBarrier and GridBarrierLifetime by @bernhardmgruber in #3258
Drop thrust::[unary|binary]_function by @bernhardmgruber in #3274
Enable thrust::identity test for non-MSVC by @bernhardmgruber in #3281
Enable PDL in triple chevron launch by @bernhardmgruber in #3282
Drop Thrust legacy arch macros by @bernhardmgruber in #3298
Drop Thrust's compiler_fence.h by @bernhardmgruber in #3300
Drop CUB's util_compiler.cuh by @bernhardmgruber in #3302
Drop Thrust's deprecated compiler macros by @bernhardmgruber in #3301
Drop CUB_RUNTIME_ENABLED and THRUST_HAS_CUDART by @bernhardmgruber in #3305
Require C++17 for compiling Thrust and CUB by @bernhardmgruber in #3255
Deprecate Thrust's cpp_compatibility.h macros by @bernhardmgruber in #3299
Deprecate cub::IterateThreadStore by @bernhardmgruber in #3337
Drop CUB's BinaryFlip operator by @bernhardmgruber in #3332
Deprecate cub::Swap by @bernhardmgruber in #3333
Drop CUB APIs with a debug_synchronous parameter by @bernhardmgruber in #3330
Drop CUB's util_compiler.cuh for real by @bernhardmgruber in #3340
Drop cub::ValueCache by @bernhardmgruber in #3346
Drop CDPv1 by @bernhardmgruber in #3344
Use cuda::std::addressof in Thrust by @bernhardmgruber in #3363
Drop deprecated aliases in Thrust functional by @bernhardmgruber in #3272
Drop cub::DivideAndRoundUp by @bernhardmgruber in #3347
Use cuda::std::min/max in Thrust by @bernhardmgruber in #3364
Cleanup CUB util_arch by @bernhardmgruber in #2773
Deprecate thrust::null_type by @bernhardmgruber in #3367
Deprecate thrust::async by @bernhardmgruber in #3324
Review CUB util.ptx for CCCL 2.x by @fbusato in #3342
Deprecate thrust::numeric_limits by @bernhardmgruber in #3366
Deprecate thrust::optional by @bernhardmgruber in #3307
Redefine and deprecate thrust::remove_cvref by @bernhardmgruber in #3394
Replace and deprecate thrust::cuda_cub::terminate by @bernhardmgruber in #3421
Deprecate cub::{min, max} and replace internal uses with those from libcu++ by @miscco in #3419
Moves agents to detail::<algorithm_name> namespace by @elstehle in #3435
Default transform_iterator's copy ctor by @bernhardmgruber in #3395
Refactor allocator handling of contiguous_storage by @bernhardmgruber in #3050
Drop thrust::detail::integer_traits by @bernhardmgruber in #3391
Deprecate a few CUB macros by @bernhardmgruber in #3456
Deprecate thrust universal iterator categories by @bernhardmgruber in #3461
Drop thrust universal iterator categories by @bernhardmgruber in #3474
Moves CUB kernel entry points to a detail namespace by @elstehle in #3468
Deprecate block/warp algo specializations by @bernhardmgruber in #3455
Drop thrust numeric_traits by @bernhardmgruber in #3476
Deprecate and replace thrust::cuda_cub iterators by @bernhardmgruber in #3422
Deprecate thrust macros from type_deduction.h by @bernhardmgruber in #3501
Deprecate thrust event, future and more by @bernhardmgruber in #3457
Drop thrust::null_type by @bernhardmgruber in #3508
Deprecates tuning policy hubs by @elstehle in #3514
Deprecate macros from cuda/detail/core/util.h by @bernhardmgruber in #3504
Deprecate CUB iterators existing in Thrust by @bernhardmgruber in #3304
Deprecate thrust logical meta functions by @bernhardmgruber in #3538
Fixes value type of thrust::tabulate_output_iterator by @elstehle in #3573
Internalize cuda/detail/core/* by @bernhardmgruber in #3505
Remove CUB DeviceSpMV by @fbusato in #3549
Remove LEGACY_PTX_ARCH by @fbusato in #3551
Removes deprecated Agent* alias templates in the public namespace by @elstehle in #3717
Move ForceInclusive parameter of DispatchScan before policy by @bernhardmgruber in #3739
Drop Thrust's cpp_compatibility.h by @bernhardmgruber in #3746
Drop thrust::identity by @bernhardmgruber in #3747
Drop deprecated entities from CUB util_type by @bernhardmgruber in #3743
Drop cub::GridBarrier by @bernhardmgruber in #3745
Move Dispatcher policy hub parameters to the back by @bernhardmgruber in #3740
Drop small deprecated entities by @bernhardmgruber in #3748
Error when users specialize BaseTraits but not numeric_limits by @bernhardmgruber in #3836
Drop deprecated iterators from Thrust cuda utils by @bernhardmgruber in #3905
Drop CUB thread operators by @bernhardmgruber in #3918
Minimize usage of cub::Traits by @bernhardmgruber in #3863
Drop/internalize some macros by @bernhardmgruber in #3936
Drop public access to RegBoundScaling/MemBoundScaling by @bernhardmgruber in #3934
Drop deprecated features from CUB util_ptx.cuh by @bernhardmgruber in #3935
Fix definition of universal_host_pinned_memory_resource by @bernhardmgruber in #3988
Assert offset type in DispatchScan[ByKey] to be unsigned and at least 4 bytes by @bernhardmgruber in #3992
Drop deprecated CUB macros by @bernhardmgruber in #3821
Drop deprecated warp/block algo specializations by @bernhardmgruber in #4007
Drop remaining 2.8-deprecated entities by @bernhardmgruber in #4009
Use cuda::std::array in histogram APIs by @bernhardmgruber in #3973
Test tuple of iterator reference assignment by @bernhardmgruber in #1964
Rework counting_iterator difference by @bernhardmgruber in #3861
[thrust, docs] Use the variadic overload of make_zip_iterator in the zip_iterator docs by @brycelelbach in #4111

📚 Libcudacxx

ptx: Add add_ptx_instruction.py by @bernhardmgruber in #3190
Fix assert definition for NVHPC due to constexpr issues by @miscco in #3418
ceil_div return common type and optimize by @fbusato in #3229
attempt to work around msvc bug exposed by type_list.h by @ericniebler in #3487
Ensure that pointer_traits work nicely with proxy iterators by @miscco in #3519
Define is_floating_point_v in terms of is_floating_point by @bernhardmgruber in #3923
Rework our mdspan implementation by @miscco in #3343
Implement more of cmath by @miscco in #3963

📝 Documentation

Improve docs of std headers by @miscco in #3416

🔄 Other Changes

Expands support for more offset types in segmented benchmark by @elstehle in #3231
Add escape hatches to the cmake configuration of the header tests so that we can test deprecated compilers / dialects by @miscco in #3253
[Version] Update main to v2.9.0 by @github-actions in #3247
Architecture and OS identification macros by @fbusato in #3237
[Version] Update main to v3.0.0 by @github-actions in #3265
CCCL Internal macro documentation by @fbusato in #3238
Require at least gcc7 by @bernhardmgruber in #3268
Drop ICC from CI by @bernhardmgruber in #3277
[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place by @caugonnet in #3270
Disambiguate line continuations and macro continuations in <nv/target> by @wmaxey in #3244
Drop VS 2017 from CI by @bernhardmgruber in #3287
Drop ICC support in code by @bernhardmgruber in #3279
Make CUB NVRTC commandline arguments come from a cmake template by @wmaxey in #3292
Add components to the bug report template by @caugonnet in #3295
Use process isolation instead of default hyper-v for Windows. by @wmaxey in #3294
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #3248
Drop CTK 11.x from CI by @bernhardmgruber in #3275
Update repo_man and packman versions by @shwina in #3293
Adds support for large number of items to DevicePartition::If with the ThreeWayPartition overload by @elstehle in #2506
Refactor scan tunings by @bernhardmgruber in #3262
Implement views::empty by @miscco in #3254
Refactor limits and climits by @davebayer in #3221
cuda.parallel: Add documentation for the current iterators along with examples and tests by @NaderAlAwar in #3311
Drop clang<14 from CI, update devcontainers. by @alliepiper in #3309
[STF] Cleanup task dependencies object constructors by @caugonnet in #3291
Disable test with a gcc-14 regression by @miscco in #3297
Remove dropped function objects from docs by @bernhardmgruber in #3319
Document NV_TARGET macros by @fbusato in #3313
[STF] Define ctx.pick_stream() which was missing for the unified context by @caugonnet in #3326
Clarify CUB transform output can overlap input by @bernhardmgruber in #3323
Limits the number of different offset types for DeviceMergeSort by @elstehle in #3328
Drop thrust::void_t by @bernhardmgruber in #3362
Fix all_of documentation for empty ranges by @upsj in #3358
[STF] Do not keep track of dangling events in a CUDA graph backend by @caugonnet in #3327
Extract scan kernels into NVRTC-compilable header by @shwina in #3334
Implement cuda::std::numeric_limits for __half and __nv_bfloat16 by @davebayer in #3361
Deprecate cub::DeviceSpmv by @bernhardmgruber in #3320
Improves DeviceSegmentedSort test run time for large number of items and segments by @elstehle in #3246
Compile basic infra test with C++17 by @bernhardmgruber in #3377
Adds support for large number of items and large number of segments to DeviceSegmentedSort by @elstehle in #3308
Exit with error when RAPIDS CI fails. by @alliepiper in #3385
cuda.parallel: Support structured types as algorithm inputs by @shwina in #3218
Fix broken _CCCL_BUILTIN_ASSUME macro by @fbusato in #3314
Replace typedef with using in libcu++ by @davebayer in #3368
Upgrade to Catch2 3.8 by @bernhardmgruber in #3310
refactor <cuda/std/cstdint> by @davebayer in #3325
Update CODEOWNERS by @jrhemstad in #3331
Fix sign-compare warning by @bernhardmgruber in #3408
Implement more cmath functions to be usable on host and device by @miscco in #3382
Extend CUB reduce benchmarks by @bernhardmgruber in #3401
Update upload-pages-artifact to v3 by @shwina in #3423
std::linalg accessors and transposed_layout by @fbusato in #2962
Add round up/down to multiple by @fbusato in #3234
[FEA]: Introduce Python module with CCCL headers by @rwgk in #3201
cuda.parallel: Add optional stream argument to reduce_into() by @NaderAlAwar in #3348
Fix Deploy CCCL pages workflow by @rwgk in #3434
[CUDAX] Fix CI issues in the nightly testing by @pciolkosz in #3443
Remove deprecated cub::min and thrust::remove_cvref by @miscco in #3450
Fix typo in builtin by @miscco in #3451
Uses unsigned offset types in thrust's scan algorithms by @elstehle in #3436
Turn C++ dialect warning into error by @bernhardmgruber in #3453
Uses unsigned offset types in thrust's sort algorithm calling into DispatchMergeSort by @elstehle in #3437
Add cuda::is_floating_point supporting half and bfloat by @bernhardmgruber in #3379
Drop C++11 and C++14 support for all of cccl by @miscco in #3417
[CUDAX] Fix block and grid dimension order in <<<>>> in one of the hierarchy tests by @pciolkosz in #3465
Add --extended-lambda to the list of removed clangd flags by @fbusato in #3432
add _CCCL_HAS_NVFP8 macro by @fbusato in #3429
Add _CCCL_BUILTIN_PREFETCH by @fbusato in #3433
Ensure that headers in <cuda/*> can be build with a C++ only compiler by @miscco in #3472
Specialize __is_extended_floating_point for FP8 types by @bernhardmgruber in #3470
Refactor CUB's util_debug by @bernhardmgruber in #3345
Specialize cuda::std::numeric_limits for FP8 types by @davebayer in #3478
Fix typo in limits by @miscco in #3491
Add dynamic CUB dispatch for scan to support c.parallel by @shwina in #3398
Use a raw string literal for nvrtc source by @bernhardmgruber in #3486
Add popcount, clz, ctz builtin intrinsics by @fbusato in #3489
[STF] Fix paths in the STF unittest infrastructure by @caugonnet in #3396
Increase test coverage now that we dropped half of our configs by @miscco in #3500
Fix issue with conversion between mdspan<T> and mdspan<const T> by @miscco in #3469
Extract merge sort kernels to NVRTC compilable header by @NaderAlAwar in #3438
[STF] Generate statistics about the DOT output by @caugonnet in #3509
[CUDAX] Align some naming and add missing docs by @pciolkosz in #3497
[CUDAX] Rename hierarchy_dimensions_fragment to hierarchy_dimensions and remove the old alias by @pciolkosz in #3496
cuda.parallel: invoke pytest directly rather than via python -m pytest by @shwina in #3523
add a __call_result_t alias template, implement __is_callable_v with it by @ericniebler in #3527
cudastf (examples): Fix compiler errors when enabling examples for CUDA STF by @janciesko in #3516
A few improvements for internal macro documentation by @fbusato in #3554
Replace pipes.quote with shlex.quote in lit config by @wmaxey in #3547
Tune cub::DeviceTransform for Blackwell by @bernhardmgruber in #3543
Refactor injecting benchmark policy_hub by @bernhardmgruber in #3561
Try to always include the definition of barrier_native_handle when needed by @miscco in #3556
Fix transform iterator for non-copy-constructible types by @bernhardmgruber in #3542
Sync ptx helpers with libcudaptx by @bernhardmgruber in #3564
Update ptx_isa.h to include 8.7 by @bernhardmgruber in #3563
add missing visibility annotations to ustdex types that have data members by @ericniebler in #3571
[STF] Document dot sections by @caugonnet in #3506
Remove nvks runners from testing pool. by @alliepiper in #3580
Try and get rapids green by @miscco in #3503
Add __int128 and __float128 detection macros by @fbusato in #3413
Remove all code paths and policies for SM37 and below by @fbusato in #3466
PTX: Update generated files with Blackwell instructions by @bernhardmgruber in #3568
Update CI matrix to use NVKS nodes. by @alliepiper in #3572
Deprecate and replace CUB_IS_INT128_ENABLED by @fbusato in #3427
Adds support for large num items to DeviceMerge by @elstehle in #3530
Support FP16 traits on CTK 12.0 by @bernhardmgruber in #3535
Suppress execution checks for vocabulary types by @miscco in #3578
[nv/target] Add sm_120 macros. by @wmaxey in #3550
PTX: Remove internal instructions by @bernhardmgruber in #3583
Add dynamic CUB dispatch for merge_sort by @NaderAlAwar in #3525
PTX: Update existing instructions by @bernhardmgruber in #3584
PTX: Add clusterlaunchcontrol by @bernhardmgruber in #3589
PTX: Add st.bulk by @bernhardmgruber in #3604
PTX: Add multimem instructions by @bernhardmgruber in #3603
PTX: Add cp.async.mbarrier.arrive{.noinc} by @bernhardmgruber in #3602
PTX: Add tcgen05 instructions by @bernhardmgruber in #3607
Use a different implementation for tuple_of_iterator_references to tuple conversion by @miscco in #3609
work around erroneous "undefined in device code" error in basic_any by @ericniebler in #3614
Deprecate AgentSegmentFixupPolicy by @fbusato in #3593
Fix deadlocks by enabling eager module loading in libcudacxx tests. by @wmaxey in #3585
Add b200 tunings for histogram by @bernhardmgruber in #3616
make uninitialized[_async]_buffer's range accessors const-correct by @ericniebler in #3615
Fix typo in index.rst by @cliffburdick in #3620
Disable X86-64 detection macro for Arm64 emulation on MSVC by @fbusato in #3540
Deprecate ABI v2 and v3 in libcudacxx by @wmaxey in #3575
Add b200 policies for reduce by @bernhardmgruber in #3612
Add b200 tunings for reduce.by_key by @bernhardmgruber in #3610
Remove CUDA 11.x support by @fbusato in #3596
PTX: fix cp.async.bulk.tensor and mbarrier.arrive by @bernhardmgruber in #3628
Add b200 tunings for radix_sort.keys by @bernhardmgruber in #3611
Try and make nvrtc on windows pass by @miscco in #3623
Sync PTX refactorings from libcudaptx by @bernhardmgruber in #3632
Bump CI to use CTK 12.8, add sm100 build. by @alliepiper in #3544
PTX: add bfind, exit and trap by @bernhardmgruber in #3627
Adds benchmarks for cub::DeviceMerge by @elstehle in #3529
remove AgentSegmentFixupPolicy by @fbusato in #3639
__builtin_isfinite is only available above nvrtc 12.2 by @miscco in #3644
Turn TEST_[HALF|BF]_T into function-style macros and fix some tests by @bernhardmgruber in #3608
[STF] frozen_logical_data::get_access_mode() by @caugonnet in #3646
Internalize triple_chevron by @bernhardmgruber in #3648
This improves the detection logic for __cccl_ptx_isa for clang-cuda by @miscco in #3647
Try to fix backport workflow by @leofang in #3634
Revert #3623 by @leofang in #3654
Deprecate cub::FpLimits in favor of cuda::std::numeric_limits by @bernhardmgruber in #3635
Fix transform_iterator and drop result_of_adaptable_function by @bernhardmgruber in #3652
Transition build system of cuda_cccl and cuda_parallel to scikit-build-core by @oleksandr-pavlyk in #3597
Replaces bool template parameters on Dispatch* class templates to use enum class by @elstehle in #3643
Add b200 policies for device.select.if,flagged,unique by @bernhardmgruber in #3545
Add b200 tunings for radix_sort.pairs by @bernhardmgruber in #3626
Fix the vectorized loading of BlockLoad by @ChristinaZ in #3517
PTX: mbarrier.{test,try}_wait: Fix return value by @ahendriksen in #3670
Add b200 policies for cub.select.unique_by_key by @bernhardmgruber in #3557
Update RAPIDS CI build to 25.04. by @alliepiper in #3539
Fix issues with nvrtc compilation by @miscco in #3666
Function-like macros for FP6/BF16 macros by @fbusato in #3588
Remove cub::ArrayWrapper by @fbusato in #3677
Internalize cub::PolicyWrapper by @fbusato in #3681
Modernize MSVC 2005/nvcc workaround by @bernhardmgruber in #3606
Deprecate cub::AliasTemporaries by @fbusato in #3679
[CUB] Remove pre-c++17 conditions and code by @fbusato in #3684
Internalize cub::KernelConfig by @fbusato in #3683
remove MSVC 2017 paths by @fbusato in #3553
[Thrust] Remove pre-c++17 conditions and code by @fbusato in #3687
Remove cugraph-ops from RAPIDS 25.04 builds. by @bdice in #3675
Refactor radix_sort tuning by @bernhardmgruber in #3657
Make thrust iterators work with NVRTC by @bernhardmgruber in #3676
Deprecate and replace thrust::identity by @bernhardmgruber in #3649
Replace CUB iterators by Thrust ones by @bernhardmgruber in #3480
Drop Thrust's global workaround by @bernhardmgruber in #3692
replace Int2Type in CUB library by @fbusato in #3641
Add b200 policies for cub.device.run_length_encode.encode,non_trivialruns by @bernhardmgruber in #3546
Deprecate cub::Trait::CATEGORY|PRIMITIVE|NULL_TYPE by @bernhardmgruber in #3689
Fix sccache reporting in CI summaries. by @alliepiper in #3621
Make THRUST_DEVICE_SYSTEM and THRUST_CPP_DIALECT independent of THRUST_HOST_SYSTEM by @adams381 in #3659
Deprecate cub::RegBoundScaling and cub::MemBoundScaling by @fbusato in #3685
Fix devcontainers' initializeCommand by @trxcllnt in #3533
[cuda.cooperative] Add missing overloads to block.reduce and block.sum by @brycelelbach in #2691
clean up the cudax __launch_transform code and document its purpose and design by @ericniebler in #3526
Add b200 policies for partition.three_way by @bernhardmgruber in #3708
Fix multiple CI arches in matrix by @alliepiper in #3702
Minor cleanups following bool-to-enum template parameter PR by @elstehle in #3716
Remove V2 and V3 ABI support from libcudacxx. by @wmaxey in #3662
Add b200 tunings for scan.exclusive.by_key by @bernhardmgruber in #3560
assorted bug fixes for the std::execution implementation in cudax by @ericniebler in #3721
Minor fix for a regressing tuning in reduce.by_key by @gonidelis in #3723
Fix SM100 histogram tunings by @bernhardmgruber in #3691
Move zip_iterator to internally use cuda::std::tuple by @miscco in #3725
Remove reduce tunings with no benefit by @bernhardmgruber in #3724
fix ::cuda::discard_memory by @fbusato in #3733
Add b200 policies for cub.device.partition.flagged,if by @bernhardmgruber in #3617
Add b200 tunings for scan.exclusive.sum by @bernhardmgruber in #3559
Fix cub trait deprecations by @bernhardmgruber in #3742
Nightly fixes by @alliepiper in #3720
Clarify scan benchmarks by @bernhardmgruber in #3709
Drop thrust::future|event|async::* by @bernhardmgruber in #3730
Replace raw arm64/x86_64 macros by @fbusato in #3732
Add Merge Sort implementation for c.parallel by @NaderAlAwar in #3636
Extracted Segmented Reduce kernels into NVRTC compilable header by @oleksandr-pavlyk in #3727
Remove unsupported CPU architecture paths (32-bit) by @fbusato in #3752
[Automation] Add release workflow for tagging and testing new RCs by @wmaxey in #3009
fix cuda std namespace by @fbusato in #3751
Remove cuda/init.py in cuda-parallel package by @shwina in #3750
Simplify cuda::std::{min,max} by @miscco in #3758
Add dynamic CUB dispatch for SegmentedReduce by @oleksandr-pavlyk in #3753
[STF] Implement kernel chains in the graph backend without child graphs by @caugonnet in #3707
Add Scan implementation for c.parallel by @shwina in #3462
cuda.parallel: Minor perf improvements by @shwina in #3718
refactor <cuda/std/cstdlib> by @davebayer in #3339
Fix python editable builds by @oleksandr-pavlyk in #3762
Reinstate thrust::optional by @miscco in #3759
Drop unsupported dialects for libcu++ by @miscco in #3695
Disable [[no_unique_address]] for MSVC by @miscco in #3757
cuda.coop: Generalize war_introspection utility for any # of arguments by @shwina in #3769
Avoid issues with nvcc compilation in c++ mode by @miscco in #3770
Refactor cuda/cmath functions documentation by @fbusato in #3773
[STF] Factorize large event lists in CUDA graphs by @caugonnet in #3756
Replace pre-c++17 traits with modern ones in CUB by @fbusato in #3774
Drop cugraph-gnn from rapids CI by @miscco in #3771
[STF] Ensure dot_section::guard is actually movable by @caugonnet in #3778
Guard PDL by availability by @miscco in #3779
[STF] virtual to_string() method for STF contexts by @caugonnet in #3781
[STF] Enable freeze on logical tokens by @caugonnet in #3782
Refactors DeviceMemcpy's vectorized_copy tests by @elstehle in #3777
More h100 usage. by @alliepiper in #3776
Add Python wrappers for c.parallel scan API by @shwina in #3592
Replace _CCCL_IF_CONSTEXPR by @fbusato in #3775
Remove _CCCL_CONSTEXPR_CXX14/17 by @fbusato in #3793
Bump -std from 14 to 17 in `./ci/(build|test)_cub.sh` examples. by @tpn in #3792
[CUDAX] Add host launch API allowing stream ordered host execution by @pciolkosz in #3555
Moves DeviceMemcpy's BitPackedCounter tests to Catch2 by @elstehle in #3794
Refactor <cuda/std/cstring> by @davebayer in #3484
fix NoopExecutor by @fbusato in #3811
Unifies workload generation for DeviceMerge benchmarks by @elstehle in #3645
Optimize and clean countl, countr, popcount, has_single_bit by @fbusato in #3414
fix -Werror=unused-result by @fbusato in #3810
Enable cuda::std::ssize for C++17 by @miscco in #3813
fix _LIBCUDACXX_HAS_NO_INT128 with NVRTC by @fbusato in #3802
Move radix sort kernels to separate NVRTC compilable header by @NaderAlAwar in #3803
Fix popc parentheses warning by @fbusato in #3820
Add arch_traits for sm100 to cudax. by @alliepiper in #3818
Remove unused function parameter by @ericniebler in #3828
CI summary fix by @alliepiper in #3826
Refactor Thrust allocator example by @bernhardmgruber in #3830
[STF] Improved cache mechanism for executable CUDA graphs by @caugonnet in #3768
Drop deprecated CUB iterators by @bernhardmgruber in #3831
Use libcu++ limits/trait in tests/benchmarks by @bernhardmgruber in #3822
Move unique_by_key kernels to NVRTC compilable header by @NaderAlAwar in #3815
Specialize numeric_limits for CUDA 12.8 FP types by @davebayer in #3832
Refactor thrust::zip_iterator by @bernhardmgruber in #3834
Refactor Thrust iterators 2/4 by @bernhardmgruber in #3840
Refactor Thrust iterators 3/4 by @bernhardmgruber in #3842
Refactor Thrust iterators 4/4 by @bernhardmgruber in #3833
Increase libcudacxx test timeout by @alliepiper in #3850
Use lower case variable name to avoid macro collisions by @miscco in #3856
Fix incorrect availability of variant in docs by @miscco in #3859
Add cuda_cccl to the list of Python packages for which test suite is run by @oleksandr-pavlyk in #3846
Refactor Thrust iterators 1/4 by @bernhardmgruber in #3839
Rewrites DeviceMemcpy::Batched tests to use device-side data generation and Catch2 by @elstehle in #3849
Refactor CUB transform by @bernhardmgruber in #3825
Add Python wrappers for c.parallel merge_sort API by @NaderAlAwar in #3763
Add c parallel segmented reduce api by @oleksandr-pavlyk in #3838
[libcudacxx] Stable abstraction for Blackwell work-stealing (PTX try_cancel) by @gonzalobg in #3671
Consider specializations of std::iterator_traits by @miscco in #3837
Update supported C++ dialects in README by @davebayer in #3879
Refactor assume_aligned implementation by @fbusato in #3765
Refactor and make NVRTC compile <cub/util_device> by @bernhardmgruber in #3880
Cache the result of merge_sort() by @shwina in #3881
do not try to use clang-19's support for c++26 pack indexing by @ericniebler in #3888
Add support for single item per thread calls to block_scan.exclusive_scan by @tpn in #3829
Document cuda::maximum, cuda::minimum by @fbusato in #3883
Refactor Thrust iterator_traits by @bernhardmgruber in #3892
Update Blackwell PTX instruction availability tables by @bernhardmgruber in #3894
Fix CCCL C headers to be compileable by C compiler by @oleksandr-pavlyk in #3885
Move transform kernels to NVRTC compilable header by @shwina in #3875
PTX shfl_sync by @fbusato in #3241
Add a warning that we cannot tune transform by @bernhardmgruber in #3896
Extend tuning guide by @bernhardmgruber in #3904
Drop join_iterator by @bernhardmgruber in #3891
Revert Thrust find_if_not implementation to please nvc++ by @bernhardmgruber in #3901
[CUB/docs] Add missing closing braces to BlockReduce kernel examples in CUB docs. by @brycelelbach in #3916
[STF] Executable CUDA graphs caching policies by @caugonnet in #3868
Refactor Thrust iterator internals by @bernhardmgruber in #3893
Revert Thrust mismatch implementation by @bernhardmgruber in #3899
Replace usage of CUB_MIN|MAX in reduce by @bernhardmgruber in #3927
Move to cuda::std::iterator_traits in CUB by @bernhardmgruber in #3924
Add C++ test for single-item-per-thread BlockScan Sum routines. by @tpn in #3889
Rename threads_in_block -> threads_per_block to be consistent with CUB. by @tpn in #3919
Implement cuda.cooperative.block_scan.inclusive_sum(). by @tpn in #3921
Replace CUB macros in more places by @bernhardmgruber in #3930
[PTX] Add shl, shr, bmsk, prmt by @bernhardmgruber in #3939
Add test_reduce_api.py::test_reduce_struct_type_minmax by @oleksandr-pavlyk in #3938
Add cuda::std::aligned_accessor by @fbusato in #3731
[STF] Thread safe graph_ctx by @caugonnet in #3925
Replace CUB macros in tunings and benchmarks by @bernhardmgruber in #3931
Deprecate and replace some Thrust iterator traits by @bernhardmgruber in #3928
Optimize bit_floor, bit_ceil, bit_width by @fbusato in #3296
Allow RAPIDS workflow to run on an arbitrary branch. by @alliepiper in #3945
Initial CUDA C++ Execution Model documentation by @gonzalobg in #3873
[STF] Remove unmaintained CUDASTF_DEBUG option by @caugonnet in #3944
Revert "Initial CUDA C++ Execution Model documentation (#3873)" by @alliepiper in #3950
Implement ranges::ref_view by @miscco in #3316
Expose CCCL branch controls on Actions UI for RAPIDS workflow. by @alliepiper in #3948
Drop unused TEST_COMPILER_CUDACC_BELOW_11_3 macro by @miscco in #3946
Allow NVRTC to compile more of CUB by @bernhardmgruber in #3951
Use _CCCL_REQUIRES_EXPR in test code by @miscco in #3954
Improve <cuda/std/bit> documentation by @fbusato in #3959
[STF] Support generation of multiple CUDA graphs from separate threads by @caugonnet in #3943
Add segmented_reduce python api by @oleksandr-pavlyk in #3906
Implement __cccl_is_integer trait by @davebayer in #3962
Implement cudax::async_buffer by @miscco in #3460
Add dynamic CUB dispatch for unique_by_key by @NaderAlAwar in #3816
Fix typo in _LIBCUDACXX_HAS_NVFP16 macro by @davebayer in #3965
Drop obsolete thrust tuple algorithms by @bernhardmgruber in #3966
Extend CUB policy and tuning documentation by @bernhardmgruber in #3933
Fix thrust::raw_reference_cast for tuple_of_iterator_references and simplify thrust::generate by @bernhardmgruber in #3970
[PTX] Add st, ld instructions by @fbusato in #3974
[cuda.cooperative] Support multidimensional thread blocks in block load/store and improve load/store docs by @brycelelbach in #3161
Disable automatic header inclusion for clangd by @miscco in #3365
Deprecate and replace THRUST_STATIC_ASSERT by @bernhardmgruber in #3971
Avoid int overflow during multipl

NVIDIA/cccl v3.0.0 on GitHub