NVIDIA/cccl v3.2.0

The CCCL team is excited to announce the 3.2 release of the CUDA Core Compute Library (CCCL). Highlights include new modern CUDA C++ runtime APIs and new speed-of-light algorithms, including Top-K.

Modern CUDA C++ Runtime

CCCL 3.2 broadly introduces new, idiomatic C++ interfaces for core CUDA runtime and driver functionality.

If you’ve written CUDA C++ for a while, you’ve likely built (or adopted) some form of convenience wrappers around today’s C-like APIs like cudaMalloc or cudaStreamCreate.

The new APIs added in CCCL 3.2 are meant to provide the productivity and safety benefits of C++ for core CUDA constructs so you can spend less time reinventing wrappers and more time writing kernels and algorithms.

Highlights:

  • New convenient vocabulary types for core CUDA concepts (cuda::stream, cuda::event, cuda::arch_traits)
  • Easier memory management with Memory Resources and cuda::buffer
  • More powerful and convenient kernel launch with cuda::launch

Example (vector add, revisited):

cuda::device_ref device = cuda::devices[0];
cuda::stream stream{device};
auto pool = cuda::device_default_memory_pool(device);

int num_elements = 1000;
auto A = cuda::make_buffer<float>(stream, pool, num_elements, 1.0);
auto B = cuda::make_buffer<float>(stream, pool, num_elements, 2.0);
auto C = cuda::make_buffer<float>(stream, pool, num_elements, cuda::no_init);

constexpr int threads_per_block = 256;
auto config = cuda::distribute<threads_per_block>(num_elements);
auto kernel = [] __device__ (auto config, cuda::std::span<const float> A, 
                                            cuda::std::span<const float> B, 
                                            cuda::std::span<float> C){
    auto tid = cuda::gpu_thread.rank(cuda::grid, config);
    if (tid < A.size())
        C[tid] = A[tid] + B[tid];
};
cuda::launch(stream, config, kernel, config, A, B, C);

(Try this example live on Compiler Explorer!)

A forthcoming blog post will go deeper into the design goals, intended usage patterns, and how these new APIs fit alongside existing CUDA APIs.

New Algorithms

Top-K Selection

CCCL 3.2 introduces cub::DeviceTopK (for example, cub::DeviceTopK::MaxKeys) to select the K largest (or smallest) elements without sorting the entire input. For workloads where K is small, this can deliver up to 5x speedups over a full radix sort, and it can reduce memory consumption when you don’t need sorted results.

Top-K is an active area of ongoing work for CCCL: our roadmap includes planned segmented Top-K as well as block-scope and warp-scope Top-K variants. See what’s planned and tell us what Top-K use cases matter most in CCCL GitHub issue #5673.
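
As a rough sketch of how this fits CUB’s usual two-phase temp-storage pattern (the exact MaxKeys argument order below is an assumption for illustration, not a verified signature):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sketch: select the K largest keys from d_keys_in into d_keys_out.
// NOTE: the MaxKeys argument order is assumed, modeled on other CUB
// device algorithms; consult the cub::DeviceTopK docs for the actual
// signature.
void topk_max_keys(const float* d_keys_in, float* d_keys_out,
                   int num_items, int k)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: query the required temporary storage size.
    cub::DeviceTopK::MaxKeys(d_temp_storage, temp_storage_bytes,
                             d_keys_in, d_keys_out, num_items, k);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: write the K largest keys (unsorted) to d_keys_out.
    cub::DeviceTopK::MaxKeys(d_temp_storage, temp_storage_bytes,
                             d_keys_in, d_keys_out, num_items, k);

    cudaFree(d_temp_storage);
}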

Fixed-size Segmented Reduction

CCCL 3.2 provides a new cub::DeviceSegmentedReduce variant that accepts a uniform segment_size, eliminating offset-iterator overhead in the common case where all segments have the same length. This enables speedups of up to 66x for small segments and up to 14x for large segments.

// New API accepts a fixed segment_size instead of per-segment begin/end offsets
cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, input, output,
                                num_segments, segment_size);
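
Spelled out with CUB’s standard two-phase temp-storage idiom (pointer and variable names are illustrative; the argument order follows the snippet above):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// One sum per segment: d_in holds num_segments * segment_size values,
// d_out receives num_segments sums.
void fixed_size_segmented_sum(const float* d_in, float* d_out,
                              int num_segments, int segment_size)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: query the required temporary storage size.
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments, segment_size);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: reduce each fixed-size segment.
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments, segment_size);

    cudaFree(d_temp_storage);
}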

Additional New Algorithms in CCCL 3.2

Segmented Scan - cub::DeviceSegmentedScan computes a parallel scan independently over each of multiple segments of the input.
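
A sketch of what usage might look like; the InclusiveSum entry point and the per-segment offset parameters below are assumptions modeled on the classic cub::DeviceSegmentedReduce interface, not a verified signature:

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sketch: inclusive prefix sum within each segment.
// NOTE: InclusiveSum and the begin/end offset parameters are assumed
// by analogy with other segmented CUB algorithms.
void segmented_inclusive_sum(const float* d_in, float* d_out,
                             int num_segments,
                             const int* d_begin_offsets,
                             const int* d_end_offsets)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: query the required temporary storage size.
    cub::DeviceSegmentedScan::InclusiveSum(
        d_temp_storage, temp_storage_bytes, d_in, d_out,
        num_segments, d_begin_offsets, d_end_offsets);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: scan each segment independently.
    cub::DeviceSegmentedScan::InclusiveSum(
        d_temp_storage, temp_storage_bytes, d_in, d_out,
        num_segments, d_begin_offsets, d_end_offsets);

    cudaFree(d_temp_storage);
}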

Binary Search - cub::DeviceFind::LowerBound and cub::DeviceFind::UpperBound perform a parallel binary search for multiple values in an ordered sequence.
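
A sketch under assumed parameter names (only the LowerBound/UpperBound entry points come from the release notes; the argument list is illustrative):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sketch: for each value in d_needles, find the first position in the
// sorted d_haystack where it could be inserted (lower bound).
// NOTE: this argument order is assumed for illustration.
void multi_lower_bound(const float* d_haystack, int num_items,
                       const float* d_needles, int num_needles,
                       int* d_indices)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: query the required temporary storage size.
    cub::DeviceFind::LowerBound(d_temp_storage, temp_storage_bytes,
                                d_haystack, num_items,
                                d_needles, num_needles, d_indices);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: one lower-bound index per needle.
    cub::DeviceFind::LowerBound(d_temp_storage, temp_storage_bytes,
                                d_haystack, num_items,
                                d_needles, num_needles, d_indices);

    cudaFree(d_temp_storage);
}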

Search - cub::DeviceFind::FindIf searches the unordered input for the first element that satisfies a given condition. Thanks to its early-exit logic, it can be up to 7x faster than searching the entire sequence.
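
A sketch with an assumed signature (the predicate-functor style is standard CUB; the exact FindIf argument order and the result convention of "index of first match" are assumptions):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Predicate evaluated on the device for each candidate element.
struct IsNegative
{
    __device__ bool operator()(float x) const { return x < 0.0f; }
};

// Sketch: find the index of the first negative value in d_in.
// NOTE: the FindIf argument order and result convention here are
// assumptions for illustration.
void find_first_negative(const float* d_in, int num_items, int* d_result)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: query the required temporary storage size.
    cub::DeviceFind::FindIf(d_temp_storage, temp_storage_bytes,
                            d_in, d_result, IsNegative{}, num_items);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: early-exiting search for the first match.
    cub::DeviceFind::FindIf(d_temp_storage, temp_storage_bytes,
                            d_in, d_result, IsNegative{}, num_items);

    cudaFree(d_temp_storage);
}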

What's Changed

🚀 Thrust / CUB

libcu++

🤝 cuda.coop

  • Implement cuda.coop striped_to_blocked. by @tpn in #4662

🔄 Other Changes

New Contributors

Full Changelog: v3.1.4...v3.2.0
