CCCL 3.0 Release

The 3.0 release of the CUDA Core Compute Libraries (CCCL) marks our first major version since unifying the Thrust, CUB, and libcudacxx libraries under a single repository. This release reflects over a year of work focused on cleanup, consolidation, and modernizing the codebase to support future growth.

While this release includes a number of breaking changes, many involve the consolidation of APIs—particularly in the thrust:: and cub:: namespaces—as well as cleanup of internal details that were never intended for public use. In many cases, redundant functionality from thrust:: or cub:: has been replaced with equivalent or improved abstractions from the cuda:: or cuda::std:: namespaces. Impact should be minimal for most users. For full details and recommended migration steps, please consult the CCCL 2.x to 3.0 Migration Guide.

Key Changes in CCCL 3.0

Requirements

  • C++17 or newer is now required (support for C++11 and C++14 has been dropped #3255)
  • CUDA Toolkit 12.0+ is now required (support for CTK 11.x has been dropped). For details on version compatibility, see the README.
  • Compilers:
    • GCC 7+ (support for GCC < 7 has been dropped #3268)
    • Clang 14+ (support for Clang < 14 has been dropped #3309)
    • MSVC 2019+ (support for MSVC 2017 has been dropped #3287, #3553)
  • Dropped support for

Header Directory Changes in CUDA Toolkit 13.0

CCCL 3.0 will be included in the upcoming CUDA Toolkit 13.0 release. Starting with that release, the bundled CCCL headers move to new top-level directories under ${CTK_ROOT}/include/cccl/.

Before CUDA 13.0                After CUDA 13.0
${CTK_ROOT}/include/cuda/       ${CTK_ROOT}/include/cccl/cuda/
${CTK_ROOT}/include/cub/        ${CTK_ROOT}/include/cccl/cub/
${CTK_ROOT}/include/thrust/     ${CTK_ROOT}/include/cccl/thrust/

These changes only affect the on-disk location of CCCL headers within the CUDA Toolkit installation.

What you need to know

  • ❌ Do NOT write #include <cccl/...> — this will break.
  • If using CCCL headers only in files compiled with nvcc
    • ✅ No action needed. This is the default for most users.
  • If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
    • Using CMake and linking CCCL::CCCL
      • ✅ No action needed. (This is the recommended path. See example)
    • Other build systems
      • ⚠️ Add ${CTK_ROOT}/include/cccl to your compiler’s include search path (e.g., with -I), as sketched below
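
For illustration, here is a minimal host-only translation unit that uses a CCCL header; the file name is hypothetical, and the include path follows the CUDA 13.0 layout described above. Note that the #include directives themselves do not change; only the on-disk search path moves.

```cpp
// host_only.cpp: compiled exclusively by a host compiler (no nvcc).
#include <cuda/std/span>   // CCCL header, usable from host-only code

#include <cstdio>

int main()
{
    int data[4] = {1, 2, 3, 4};
    cuda::std::span<int> view(data);   // host-side use of a cuda::std type
    std::printf("span size: %zu\n", view.size());
    return 0;
}
```

For a non-CMake build this would be compiled with something like `g++ -std=c++17 -I${CTK_ROOT}/include/cccl host_only.cpp`; with CMake, linking the `CCCL::CCCL` target adds the include path for you.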

These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.

Major API Changes

Hundreds of macros, internal types, and implementation details were removed or relocated to internal namespaces. This significantly reduces surface area and eliminates long-standing technical debt, improving both compile times and maintainability.

Removed Macros

Over 50 legacy macros have been removed in favor of modern C++ alternatives; see the CCCL 2.x to 3.0 Migration Guide for the full list of removals and their replacements.

Removed Functions and Classes

  • thrust::optional: use cuda::std::optional instead #4172
  • thrust::tuple: use cuda::std::tuple instead #2395
  • thrust::pair: use cuda::std::pair instead #2395
  • thrust::numeric_limits: use cuda::std::numeric_limits instead #3366
  • cub::BFE: use cuda::bitfield_insert and cuda::bitfield_extract instead #4031
  • cub::ConstantInputIterator: use thrust::constant_iterator instead #3831
  • cub::CountingInputIterator: use thrust::counting_iterator instead #3831
  • cub::GridBarrier: use cooperative groups instead #3745
  • cub::DeviceSpmv: use cuSPARSE instead #3320
  • cub::Mutex: use cuda::std::mutex instead #3251
  • See the CCCL 2.x to 3.0 Migration Guide for the complete list; a small migration sketch follows below
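
As an illustration, code that used thrust::optional, thrust::pair, or thrust::tuple typically only needs new headers and a namespace change. A minimal sketch (compiled with nvcc; the function is purely illustrative):

```cpp
// CCCL 2.x:  #include <thrust/optional.h>, <thrust/pair.h>, <thrust/tuple.h>
// CCCL 3.0:  use the cuda::std:: equivalents instead.
#include <cuda/std/optional>
#include <cuda/std/tuple>
#include <cuda/std/utility>   // cuda::std::pair

// The cuda::std:: types are usable from both host and device code.
__host__ __device__
cuda::std::optional<int> first_positive(cuda::std::pair<int, int> p)
{
    auto [a, b] = p;                 // structured bindings work as with std::pair
    if (a > 0) { return a; }
    if (b > 0) { return b; }
    return cuda::std::nullopt;
}

int main()
{
    auto r = first_positive(cuda::std::make_pair(-1, 7));
    return r.value_or(0) == 7 ? 0 : 1;
}
```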

New Features

C++

cuda::

  • cuda::std::numeric_limits now supports __float128 #4059
  • cuda::std::optional<T&> implementation (P2988) #3631
  • cuda::std::numbers header for mathematical constants #3355
  • Support for the NVFP8/6/4 extended floating-point types in <cuda/std/cmath> #3843
  • cuda::overflow_cast for safe numeric conversions #4151
  • cuda::ilog2 and cuda::ilog10 integer logarithms #4100
  • cuda::round_up and cuda::round_down utilities #3234
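
A small host-side sketch of some of these utilities; the header locations (<cuda/cmath> for the integer math helpers, <cuda/std/numbers> for the constants) are assumptions based on the feature names above.

```cpp
#include <cuda/cmath>          // assumed home of cuda::ilog2 / cuda::round_up
#include <cuda/std/numbers>    // mathematical constants, mirroring std::numbers

#include <cstdio>

int main()
{
    // Integer base-2 logarithm: floor(log2(1000)) == 9.
    auto lg = cuda::ilog2(1000u);

    // Round 1000 up to the next multiple of 256 (e.g., for block-sized buffers).
    auto padded = cuda::round_up(1000u, 256u);   // 1024

    // Constants from cuda::std::numbers, usable in host and device code.
    double tau = 2.0 * cuda::std::numbers::pi;

    std::printf("ilog2 = %u, padded = %u, tau = %f\n",
                static_cast<unsigned>(lg), static_cast<unsigned>(padded), tau);
    return 0;
}
```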

cub::

  • `cub::DeviceSegmentedReduce` now supports large numbers of segments #3746
  • `cub::DeviceCopy::Batched` now supports large numbers of buffers #4129
  • `cub::DeviceMemcpy::Batched` now supports large numbers of buffers #4065
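
These device-wide algorithms keep the usual two-phase CUB calling convention: one call to size the temporary storage, a second call to run. A minimal sketch for a segmented sum, with error checking omitted and the helper function name hypothetical:

```cpp
#include <cub/device/device_segmented_reduce.cuh>

#include <cuda_runtime.h>

// Sum each segment of d_in into d_out. Segment i covers
// [d_offsets[i], d_offsets[i + 1]), the usual CUB offset convention.
void segmented_sum(const float* d_in, float* d_out,
                   const int* d_offsets, int num_segments,
                   cudaStream_t stream = 0)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: query how much temporary storage is needed.
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments,
                                    d_offsets, d_offsets + 1, stream);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: perform the segmented reduction.
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments,
                                    d_offsets, d_offsets + 1, stream);

    cudaFree(d_temp_storage);
}
```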

thrust::

  • New `thrust::offset_iterator` #4073
  • Temporary storage allocations in parallel algorithms now respect `par_nosync` #4204
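
A brief sketch of the `par_nosync` behavior: the algorithm launch (and, with 3.0, any temporary storage it allocates) no longer forces a synchronization, so the caller synchronizes explicitly before reading results.

```cpp
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sort.h>

#include <cuda_runtime.h>

int main()
{
    thrust::device_vector<int> keys(1 << 20, 1);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // par_nosync asks Thrust not to synchronize after launching the algorithm;
    // in CCCL 3.0 its temporary storage allocations also respect this.
    thrust::sort(thrust::cuda::par_nosync.on(stream), keys.begin(), keys.end());

    // The caller is responsible for synchronizing before using the results.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    return 0;
}
```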

Python

CUDA Python Core Libraries are now available on PyPI through the cuda-cccl package.

pip install cuda-cccl

cuda.cccl.cooperative

  • Block-level sorting now supports multi-dimensional thread blocks #4035, #4028
  • Block-level data movement now supports multi-dimensional thread blocks #3161
  • New block-level inclusive sum algorithm #3921

cuda.cccl.parallel

  • New device-level segmented-reduce algorithm #3906
  • New device-level unique-by-key algorithm #3947
  • New device-level merge-sort algorithm #3763

What's Changed

🚀 Thrust / CUB

📚 Libcudacxx

📝 Documentation

🔄 Other Changes
