NVIDIA/cccl v3.2.0

The CCCL team is excited to announce the 3.2 release of the CUDA Core Compute Library (CCCL). Highlights include new modern CUDA C++ runtime APIs and new speed-of-light algorithms, including Top-K.

Modern CUDA C++ Runtime

CCCL 3.2 broadly introduces new, idiomatic C++ interfaces for core CUDA runtime and driver functionality.

If you’ve written CUDA C++ for a while, you’ve likely built (or adopted) some form of convenience wrappers around today’s C-like APIs like cudaMalloc or cudaStreamCreate.

The new APIs added in CCCL 3.2 are meant to provide the productivity and safety benefits of C++ for core CUDA constructs so you can spend less time reinventing wrappers and more time writing kernels and algorithms.

Highlights:

  • New convenient vocabulary types for core CUDA concepts (cuda::stream, cuda::event, cuda::arch_traits)
  • Easier memory management with Memory Resources and cuda::buffer
  • More powerful and convenient kernel launch with cuda::launch

Example (vector add, revisited):

cuda::device_ref device = cuda::devices[0];
cuda::stream stream{device};
auto pool = cuda::device_default_memory_pool(device);

int num_elements = 1000;
auto A = cuda::make_buffer<float>(stream, pool, num_elements, 1.0);
auto B = cuda::make_buffer<float>(stream, pool, num_elements, 2.0);
auto C = cuda::make_buffer<float>(stream, pool, num_elements, cuda::no_init);

constexpr int threads_per_block = 256;
auto config = cuda::distribute<threads_per_block>(num_elements);
auto kernel = [] __device__ (auto config, cuda::std::span<const float> A, 
                                            cuda::std::span<const float> B, 
                                            cuda::std::span<float> C){
    auto tid = cuda::gpu_thread.rank(cuda::grid, config);
    if (tid < A.size())
        C[tid] = A[tid] + B[tid];
};
cuda::launch(stream, config, kernel, config, A, B, C);

(Try this example live on Compiler Explorer!)

A forthcoming blog post will go deeper into the design goals, intended usage patterns, and how these new APIs fit alongside existing CUDA APIs.

New Algorithms

Top-K Selection

CCCL 3.2 introduces cub::DeviceTopK (for example, cub::DeviceTopK::MaxKeys) to select the K largest (or smallest) elements without sorting the entire input. For workloads where K is small, this can deliver up to 5x speedups over a full radix sort, and it can reduce memory consumption when you don’t need sorted results.

Top-K is an active area of ongoing work for CCCL: our roadmap includes planned segmented Top-K as well as block-scope and warp-scope Top-K variants. See what’s planned and tell us what Top-K use cases matter most in CCCL GitHub issue #5673.
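
As a rough sketch of how this fits CUB’s usual two-phase temp-storage pattern (the exact MaxKeys argument order below is an assumption for illustration, not a verified signature):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sketch: select the K largest keys from d_keys_in into d_keys_out.
// NOTE: the MaxKeys argument order is assumed, modeled on other CUB
// device algorithms; consult the cub::DeviceTopK docs for the actual
// signature.
void topk_max_keys(const float* d_keys_in, float* d_keys_out,
                   int num_items, int k)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: query the required temporary storage size.
    cub::DeviceTopK::MaxKeys(d_temp_storage, temp_storage_bytes,
                             d_keys_in, d_keys_out, num_items, k);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: write the K largest keys (unsorted) to d_keys_out.
    cub::DeviceTopK::MaxKeys(d_temp_storage, temp_storage_bytes,
                             d_keys_in, d_keys_out, num_items, k);

    cudaFree(d_temp_storage);
}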

Fixed-size Segmented Reduction

CCCL 3.2 provides a new cub::DeviceSegmentedReduce variant that accepts a uniform segment_size, eliminating offset-iterator overhead in the common case where all segments have the same length. This enables speedups of up to 66x for small segments and up to 14x for large segments.

// New API accepts a fixed segment_size instead of per-segment begin/end offsets
cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, input, output,
                                num_segments, segment_size);
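
Spelled out with CUB’s standard two-phase temp-storage idiom (pointer and variable names are illustrative; the argument order follows the snippet above):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// One sum per segment: d_in holds num_segments * segment_size values,
// d_out receives num_segments sums.
void fixed_size_segmented_sum(const float* d_in, float* d_out,
                              int num_segments, int segment_size)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: query the required temporary storage size.
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments, segment_size);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: reduce each fixed-size segment.
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments, segment_size);

    cudaFree(d_temp_storage);
}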

Additional New Algorithms in CCCL 3.2

Segmented Scan - cub::DeviceSegmentedScan computes a parallel scan independently over each of multiple segments of the input.
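
A sketch of what usage might look like; the InclusiveSum entry point and the per-segment offset parameters below are assumptions modeled on the classic cub::DeviceSegmentedReduce interface, not a verified signature:

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sketch: inclusive prefix sum within each segment.
// NOTE: InclusiveSum and the begin/end offset parameters are assumed
// by analogy with other segmented CUB algorithms.
void segmented_inclusive_sum(const float* d_in, float* d_out,
                             int num_segments,
                             const int* d_begin_offsets,
                             const int* d_end_offsets)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: query the required temporary storage size.
    cub::DeviceSegmentedScan::InclusiveSum(
        d_temp_storage, temp_storage_bytes, d_in, d_out,
        num_segments, d_begin_offsets, d_end_offsets);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: scan each segment independently.
    cub::DeviceSegmentedScan::InclusiveSum(
        d_temp_storage, temp_storage_bytes, d_in, d_out,
        num_segments, d_begin_offsets, d_end_offsets);

    cudaFree(d_temp_storage);
}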

Binary Search - cub::DeviceFind::LowerBound and cub::DeviceFind::UpperBound perform a parallel binary search for multiple values in an ordered sequence.
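
A sketch under assumed parameter names (only the LowerBound/UpperBound entry points come from the release notes; the argument list is illustrative):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sketch: for each value in d_needles, find the first position in the
// sorted d_haystack where it could be inserted (lower bound).
// NOTE: this argument order is assumed for illustration.
void multi_lower_bound(const float* d_haystack, int num_items,
                       const float* d_needles, int num_needles,
                       int* d_indices)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: query the required temporary storage size.
    cub::DeviceFind::LowerBound(d_temp_storage, temp_storage_bytes,
                                d_haystack, num_items,
                                d_needles, num_needles, d_indices);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: one lower-bound index per needle.
    cub::DeviceFind::LowerBound(d_temp_storage, temp_storage_bytes,
                                d_haystack, num_items,
                                d_needles, num_needles, d_indices);

    cudaFree(d_temp_storage);
}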

Search - cub::DeviceFind::FindIf searches the unordered input for the first element that satisfies a given condition. Thanks to its early-exit logic, it can be up to 7x faster than searching the entire sequence.
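
A sketch with an assumed signature (the predicate-functor style is standard CUB; the exact FindIf argument order and the result convention of "index of first match" are assumptions):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Predicate evaluated on the device for each candidate element.
struct IsNegative
{
    __device__ bool operator()(float x) const { return x < 0.0f; }
};

// Sketch: find the index of the first negative value in d_in.
// NOTE: the FindIf argument order and result convention here are
// assumptions for illustration.
void find_first_negative(const float* d_in, int num_items, int* d_result)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: query the required temporary storage size.
    cub::DeviceFind::FindIf(d_temp_storage, temp_storage_bytes,
                            d_in, d_result, IsNegative{}, num_items);

    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call: early-exiting search for the first match.
    cub::DeviceFind::FindIf(d_temp_storage, temp_storage_bytes,
                            d_in, d_result, IsNegative{}, num_items);

    cudaFree(d_temp_storage);
}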

What's Changed

🚀 Thrust / CUB

libcu++

🤝 cuda.coop

  • Implement cuda.coop striped_to_blocked. by @tpn in #4662

🔄 Other Changes

New Contributors

Full Changelog: v3.1.4...v3.2.0
