The CCCL team is excited to announce the 3.2 release of the CUDA Core Compute Library (CCCL), whose highlights include new modern CUDA C++ runtime APIs and new speed-of-light algorithms, including Top-K.
Modern CUDA C++ Runtime
CCCL 3.2 broadly introduces new, idiomatic C++ interfaces for core CUDA runtime and driver functionality.
If you’ve written CUDA C++ for a while, you’ve likely built (or adopted) some form of convenience wrappers around today’s C-style APIs such as cudaMalloc or cudaStreamCreate.
The new APIs added in CCCL 3.2 are meant to provide the productivity and safety benefits of C++ for core CUDA constructs so you can spend less time reinventing wrappers and more time writing kernels and algorithms.
Highlights:
- New convenient vocabulary types for core CUDA concepts (cuda::stream, cuda::event, cuda::arch_traits)
- Easier memory management with Memory Resources and cuda::buffer
- More powerful and convenient kernel launch with cuda::launch
Example (vector add, revisited):
cuda::device_ref device = cuda::devices[0];
cuda::stream stream{device};
auto pool = cuda::device_default_memory_pool(device);
int num_elements = 1000;
auto A = cuda::make_buffer<float>(stream, pool, num_elements, 1.0);
auto B = cuda::make_buffer<float>(stream, pool, num_elements, 2.0);
auto C = cuda::make_buffer<float>(stream, pool, num_elements, cuda::no_init);
constexpr int threads_per_block = 256;
auto config = cuda::distribute<threads_per_block>(num_elements);
auto kernel = [] __device__ (auto config, cuda::std::span<const float> A,
                             cuda::std::span<const float> B,
                             cuda::std::span<float> C){
  auto tid = cuda::gpu_thread.rank(cuda::grid, config);
  if (tid < A.size())
    C[tid] = A[tid] + B[tid];
};
cuda::launch(stream, config, kernel, config, A, B, C);
(Try this example live on Compiler Explorer!)
A forthcoming blog post will go deeper into the details, the design goals, intended usage patterns, and how these new APIs fit alongside existing CUDA APIs.
New Algorithms
Top-K Selection
CCCL 3.2 introduces cub::DeviceTopK (for example, cub::DeviceTopK::MaxKeys) to select the K largest (or smallest) elements without sorting the entire input. For workloads where K is small, this can deliver up to 5x speedups over a full radix sort, and can reduce memory consumption when you don’t need sorted results.
Top‑K is an active area of ongoing work for CCCL: our roadmap includes planned segmented Top‑K as well as block‑scope and warp‑scope Top‑K variants. See what’s planned and tell us what Top‑K use cases matter most in CCCL GitHub issue #5673.
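To make the selection semantics concrete, here is a minimal host-side sketch of "keep the K largest keys without fully sorting". It uses only the C++ standard library and is not the CUB implementation or its API; the function name top_k_max_reference is hypothetical, and returning the results in unspecified order is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side reference for "Top-K max keys".
// Returns the k largest values of `keys` in unspecified order.
// Illustrative only; this is not how cub::DeviceTopK is implemented.
std::vector<int> top_k_max_reference(std::vector<int> keys, std::size_t k)
{
  assert(k <= keys.size());
  // Partition so that the k largest elements occupy [begin, begin + k).
  // This is a selection (average O(n)), not a full O(n log n) sort.
  std::nth_element(keys.begin(),
                   keys.begin() + static_cast<std::ptrdiff_t>(k),
                   keys.end(),
                   std::greater<int>{});
  keys.resize(k);
  return keys;
}
```

Because only a partition is performed, the K results come back unordered; sort them afterwards if an ordered result is needed.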
Fixed-size Segmented Reduction
CCCL 3.2 now provides a new cub::DeviceSegmentedReduce variant that accepts a uniform segment_size, eliminating offset iterator overhead in the common case when segments are fixed-size. This enables optimizations for both small segment sizes (up to 66x) and large segment sizes (up to 14x).
// New API accepts fixed segment_size instead of per-segment begin/end offsets
cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, input, output,
num_segments, segment_size);
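As a rough illustration of what a uniform segment_size means, the host-side sketch below computes the same result on the CPU: segment i simply covers the index range [i * segment_size, (i + 1) * segment_size), so no begin/end offset arrays are required. The function name fixed_size_segmented_sum_reference is hypothetical, and this is a semantic reference only, not the CUB implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side reference for a fixed-size segmented sum.
// Segment i covers input[i * segment_size, (i + 1) * segment_size).
std::vector<int> fixed_size_segmented_sum_reference(
  const std::vector<int>& input, std::size_t num_segments, std::size_t segment_size)
{
  assert(input.size() == num_segments * segment_size);
  std::vector<int> output(num_segments);
  for (std::size_t i = 0; i < num_segments; ++i)
  {
    // The segment bounds are computed from the index; nothing is loaded from an offsets array.
    const auto first = input.begin() + static_cast<std::ptrdiff_t>(i * segment_size);
    const auto last  = first + static_cast<std::ptrdiff_t>(segment_size);
    output[i]        = std::accumulate(first, last, 0);
  }
  return output;
}
```

For example, six inputs split into two segments of three produce two sums, while the same inputs split into three segments of two produce three sums.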
Additional New Algorithms in CCCL 3.2
- Segmented Scan: cub::DeviceSegmentedScan provides a segmented version of a parallel scan that efficiently computes a scan operation over multiple independent segments.
- Binary Search: cub::DeviceFind::[Upper/LowerBound] performs a parallel search for multiple values in an ordered sequence.
- Search: cub::DeviceFind::FindIf searches the unordered input for the first element that satisfies a given condition. Thanks to its early-exit logic, it can be up to 7x faster than searching the entire sequence.
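The semantics of the three algorithms above can be sketched on the host with the C++ standard library. These are reference sketches only, not the CUB APIs: all function names are hypothetical, and the conventions chosen here (an offsets array with num_segments + 1 entries, index-valued search results, and input.size() meaning "not found") are assumptions of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Segmented inclusive sum: the running total restarts at every segment.
// Assumes `offsets` is non-decreasing; segment s is [offsets[s], offsets[s + 1]).
std::vector<int> segmented_inclusive_sum_reference(
  const std::vector<int>& input, const std::vector<std::size_t>& offsets)
{
  assert(!offsets.empty() && offsets.back() <= input.size());
  std::vector<int> output(input.size());
  for (std::size_t s = 0; s + 1 < offsets.size(); ++s)
  {
    const auto begin = static_cast<std::ptrdiff_t>(offsets[s]);
    const auto end   = static_cast<std::ptrdiff_t>(offsets[s + 1]);
    std::partial_sum(input.begin() + begin, input.begin() + end, output.begin() + begin);
  }
  return output;
}

// Lower bound for many values: for each query, the index of the first element
// of the sorted haystack that is not less than the query.
std::vector<std::size_t> lower_bound_many_reference(
  const std::vector<int>& sorted_haystack, const std::vector<int>& queries)
{
  std::vector<std::size_t> positions;
  positions.reserve(queries.size());
  for (int q : queries)
  {
    const auto it = std::lower_bound(sorted_haystack.begin(), sorted_haystack.end(), q);
    positions.push_back(static_cast<std::size_t>(it - sorted_haystack.begin()));
  }
  return positions;
}

// Find-if with early exit: stops at the first match and returns its index,
// or input.size() when no element satisfies the predicate.
template <class Predicate>
std::size_t find_if_reference(const std::vector<int>& input, Predicate pred)
{
  return static_cast<std::size_t>(std::find_if(input.begin(), input.end(), pred) - input.begin());
}
```

The early exit in the last function is what makes a first-match search cheaper than scanning the whole sequence when a match occurs near the front.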
Full Changelog: v3.2.0...v3.2.0
What's Changed
🚀 Thrust / CUB
- Modified test [reduce][nondeterministic] per gh-5443 by @oleksandr-pavlyk in #5451
- Remove unused include of grid/grid_queue from CUB agent/dispatch headers by @oleksandr-pavlyk in #5887
- [CUB] Implement BlockLoadToShared by @pauleonix in #5780
- Fix debug section around line 390 of dispatch_topk by @oleksandr-pavlyk in #6152
- Fix typos in segmented reduce by @oleksandr-pavlyk in #6153
- Device scan doc fixes by @oleksandr-pavlyk in #6294
- Scan tests and benchmarks by @oleksandr-pavlyk in #6355
- [Thrust]: New "sum rows" and "sum columns" examples by @brycelelbach in #4462
- Added new CUB APIs: DeviceTransform::Fill #5526, DeviceTransform::Generate #5890, DeviceTransform::TransformIf #5198, which are used by thrust::fill[_n] #5805, thrust::uninitialized_fill #5813, thrust::generate[_n] #5807, and thrust::transform_if, thrust::scatter_if #5952, and non-trivial thrust::copy #5954. By @bernhardmgruber.
- Made thrust::tabulate #6012 use cub::DeviceTransform as well by @bernhardmgruber in #5198
libcu++
- Added cuda::barrier and cuda::memcpy_async_tx examples using TMA @bernhardmgruber in #6231
- Waiting on a cuda::barrier on SM90+ is now faster and produces less code @bernhardmgruber in #6007
- Improve cuda::memcpy_async codegen @bernhardmgruber in #5996
- Improve TMA codegen on sm120 in cuda::memcpy_async, cuda::device::memcpy_async_tx, cub::DeviceTransform @bernhardmgruber in #6362
🤝 cuda.coop
🔄 Other Changes
- Rework our signbit implementation to be potentially constexpr by @miscco in #5259
- [CUDAX->libcu++] Move ensure_current_device to libcu++ and change the name to ensure_current_context by @pciolkosz in #5285
- [Version] Update main to v3.2.0 by @github-actions[bot] in #5286
- Rework our copysign implementation to be potentially constexpr by @miscco in #5287
- Update NVBench by @bernhardmgruber in #5288
- [CUDAX] Rename async_buffer::change_stream to set_stream and add a test by @pciolkosz in #5273
- Extend and refactor transform overloads in CUDA system by @bernhardmgruber in #5238
- Refactor c2h by @bernhardmgruber in #5205
- Fix inplace_vector out of bounds access for at() by @Jacobfaib in #5295
- Fix cudax test breaking main by @davebayer in #5301
- [STF] Move occupancy calculation utility and support CUfunction by @caugonnet in #5236
- [CUDAX->libcu++] Move stream and event from cudax to libcu++ by @pciolkosz in #5293
- Port thrust::transform_input_output_iterator to cuda by @miscco in #5113
- Implement format.arguments and format.context from standard formatting library by @davebayer in #5217
- Initial migration of cuco hasher to cudax by @srinivasyadav18 in #4898
- CUB - Add internal integer utils and tests (Split WarpReduce PR) by @fbusato in #5314
- Skip zero values in fast_mod_div unit test by @fbusato in #5307
- Fix cuda::static_for noexcept definition by @davebayer in #5303
- Add sm90 tunings for RFA F32 by @srinivasyadav18 in #5269
- Add and use new artifact/workflow functionality for CI scripts. by @alliepiper in #4861
- Add gitlab devcontainers by @wmaxey in #5325
- Remove mentions of CUDA experimental that sneaked into libcu++ by @pciolkosz in #5306
- Add a macro to disable PDL by @bernhardmgruber in #5316
- Move aligned_size_t, get_device_address and discard_memory to cuda/__memory/ by @davebayer in #5239
- Adds tests for large number of items to DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #5251
- [libcu++] Deprecate default stream_ref constructor and fix some few last usages by @pciolkosz in #5310
- Extends benchmarks for DeviceRunLengthEncode::NonTrivialRuns to differentiate between offset and run-length type by @elstehle in #5248
- Complex log accuracy refinement by @s-oboyle in #5185
- Replace use of cupy with cuda-core in cuda.cccl.parallel by @shwina in #5323
- Better motivates cuda::device::is_address_from by @fbusato in #5341
- Fix CUB 'limited' job in nightly CI by @alliepiper in #5347
- Fix nvrtc when there are more than one CTK include directories available by @wmaxey in #5318
- c.parallel: enable UBLKCP in transform by @griwes in #4847
- Merge sort benchmark requires no sync by @bernhardmgruber in #5350
- Forgot to add inline in is_address_from by @fbusato in #5349
- Add sm86 tunings for deterministic DeviceReduce (RFA) by @srinivasyadav18 in #5354
- Adds support for large number of items to DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #5252
- Document that scan_op must be associative by @bernhardmgruber in #5358
- Fix cuco hasher test by @srinivasyadav18 in #5353
- Super tiny tweak for analysis script to work after introducing postgreSQL by @gonidelis in #5331
- c.parallel: support providing well-known operations by @griwes in #4562
- Add simpler, single-phase APIs for all parallel algorithms by @shwina in #5207
- [STF] [EASY] Fix exception guard usage in traits.cuh by @GPMueller in #5369
- [CUB] Add cub::detail::ThreadScan*Partial by @pauleonix in #5300
- Diagnose missing numeric_limits specialization in DeviceReduce Min/Max by @bernhardmgruber in #5359
- Suppress clang warnings on vector types in upcoming CTK by @bernhardmgruber in #5362
- Add is_object_from by @fbusato in #5364
- [CUB] Add cub::detail::ThreadReducePartial by @pauleonix in #5324
- fix noexcept clause on ctor of let_value's opstate by @ericniebler in #5387
- Add some notes about performance of 1 and 2 byte atomic_ref. by @wmaxey in #5390
- Add a section covering include changes in the migration docs by @wmaxey in #5391
- Add missing NV_TARGET macro by @fbusato in #5388
- [libcu++] Add missing pop of deprecation warning suppression by @pciolkosz in #5395
- Makes thrust::unique use cub::DeviceSelect::Unique by @elstehle in #5396
- fix race condition in starts_on execution test by @ericniebler in #5393
- Split c2h sources into more files by @bernhardmgruber in #5384
- Remove CTK <12 version check for PDL by @bernhardmgruber in #5343
- Add nondeterministic reduce that uses atomics by @NaderAlAwar in #4961
- Add scan tunings from leaderboard by @gonidelis in #5283
- [CUDAX->libcu++] Expose fill_bytes and copy_bytes in libcudacxx by @pciolkosz in #5304
- Move ownership of cudax test cmake to cudax owners by @pciolkosz in #5406
- move basic_any from cudax to libcudacxx by @ericniebler in #5298
- fix a data race and use-after-free in execution::run_loop by @ericniebler in #5402
- fix the _CCCL_PP_COMMA_IFF macro by @ericniebler in #5407
- Replaces internal macros with __host__ and __device__ attributes by @elstehle in #5412
- [STF] Allow CUfunction/CUkernel (driver API) in the cuda_kernel(_chain) API by @caugonnet in #5215
- Improve forward declarations. We often need only a forward declaration of vocabulary types and also want to know whether something is an instance of said type. by @miscco in #5305
- Add NVTX ranges to C2H tests by @bernhardmgruber in #5332
- [STF] Low level interface for the cuda_kernel(_chain) construct by @caugonnet in #5319
- Drops global namespace qualification from cuda namespace usage in our tests by @elstehle in #5415
- Add Histogram implementation for c.parallel by @NaderAlAwar in #4689
- Disable NVHPC optimization that leads to error by @gonidelis in #5416
- Combine block_reduce_warp_reduction_nondeterministic.cuh specialization with original deterministic one by @NaderAlAwar in #5408
- Use C2H in radix_sort c.parallel tests by @NaderAlAwar in #5426
- Add common constants for floating point types by @miscco in #5413
- [libcu++] Rename memory resource concepts to indicate asynchronous allocations are the default ones by @pciolkosz in #5313
- Fix gpu_to_gpu determinism fallback conditions to run_to_run determinism by @srinivasyadav18 in #5382
- remove the dependence from sync_wait's receiver on the sender's type by @ericniebler in #5446
- Print character vectors as numbers in tests by @bernhardmgruber in #5154
- Generate negative numbers in Thrust unit tests by @bernhardmgruber in #4923
- cuda.cccl: Update dependencies to enable running on CUDA 13 driver by @shwina in #5442
- Move TMA barrier in DeviceTransform into dynamic SMEM by @bernhardmgruber in #5414
- Fix grid dependency sync in cub::DeviceMergeSort by @bernhardmgruber in #5456
- Add python wrappers for c.parallel histogram API by @NaderAlAwar in #4709
- Integer Add with overflow checking by @fbusato in #5267
- fix NV_TARGET typos by @fbusato in #5418
- CUB - Add internal thread and warp utils (Split WarpReduce PR) by @fbusato in #5317
- Introduce i128 and u128 literals to libcu++ testing by @davebayer in #5372
- Replace address space intrinsics with cuda::device::is_address_from by @davebayer in #4866
- Update cuda::ptx to CTK 13 by @fbusato in #5447
- Implement cuda::std::from_chars for integers by @davebayer in #4938
- Port thrust::zip_iterator to namespace cuda by @miscco in #5429
- Drop all usages of _CCCL_TRAIT by @miscco in #5466
- [STF] Misc. STF doxygen documentation by @caugonnet in #5470
- [STF] Cleanup for_each_batched.cuh by @caugonnet in #5473
- [STF] Move only_convertible_or to reserved namespace by @caugonnet in #5472
- extend execution environments to support queries that take extra arguments by @ericniebler in #5464
- Fix atomic reduce for arches < 600 with dtype double by @NaderAlAwar in #5428
- Rework our fabs implementation to be potentially constexpr by @miscco in #5302
- Fix handling of invalid inputs (<= 0) to GridEvenShare and adjust handling of num_items == 0 on the caller side by @NaderAlAwar in #5452
- Simplify thrust::device_malloc by @bernhardmgruber in #5477
- [STF] Accept shapes which are just integral values in parallel_for by @caugonnet in #5485
- [libcu++] Remove experimental memory resource define check from around the concept, properties and the query. by @pciolkosz in #5437
- Drop unused iterator bases and update standard iterators by @miscco in #5454
- Fix mismatched internal dispatch in cub::ScatterToStripedFlagged by @MengAiDev in #5483
- [STF] Improve how we ignore void interface (tokens) arguments in prototypes by @caugonnet in #5475
- Refactor thrust::pointer by @bernhardmgruber in #5478
- Add test to ensure we can use cuda::std::reverse_iterator with thrust APIs by @miscco in #5486
- Drop thrust::LoadIterator/make_load_iterator by @bernhardmgruber in #5480
- Fix __float128 detection and require compiler support for literals by @davebayer in #4591
- [CUB] Implement *Partial member functions for WarpScan by @pauleonix in #5379
- Add SM_110a for non-supporting compilers by @fbusato in #5489
- [Thrust] Make documentation behind #if 0 visible by @pauleonix in #5455
- Add support for virtual shared memory to DispatchReduceByKey by @elstehle in #5440
- Use thrust::copy in thrust::uninitialized_copy[_n] in CUDA system when possible by @bernhardmgruber in #5181
- Move segmented sort kernels to separate header by @NaderAlAwar in #5499
- Refactor agent_reduce by @bernhardmgruber in #5507
- Enable mdspan public headers test for msvc in C++17 by @davebayer in #5510
- Make NVTX headers declare themselves as system headers by @bernhardmgruber in #5508
- add _CCCL_TYPE_VISIBILITY_HIDDEN config macro by @ericniebler in #5514
- [STF] Avoid warning about unsed variable by @miscco in #5518
- Handle NVTX3 being disabled in C2H by @bernhardmgruber in #5511
- Also test DeviceTransform with unaligned destination by @bernhardmgruber in #5509
- Add nondeterministic reduce sum benchmark by @NaderAlAwar in #5520
- Add grayscale transform benchmark by @NaderAlAwar in #5522
- Document why workstealing is not implemented in DeviceTransform by @bernhardmgruber in #5525
- fix device definition of cudax execution's __nothrow_fooable traits by @ericniebler in #5533
- use auto(expr) for _LIBCUDACXX_AUTO_CAST when it is available by @ericniebler in #5537
- implement a variant of P3206 for getting a sender's completion behavior by @ericniebler in #5517
- Fix naming of our namespace macros and friends by @miscco in #5538
- Fix regression introduced with agent_reduce refactoring by @bernhardmgruber in #5542
- [libcu++] Rename resource_ref to match the new async by default naming by @pciolkosz in #5534
- Add missing full qualification for ::cuda::std in libcu++ by @bernhardmgruber in #5544
- Implement ranges::for_each{_n} by @miscco in #5540
- [CUDAX] Rename type-erased memory resource wrappers by @pciolkosz in #5536
- Fix merge conflict 🙈 by @miscco in #5546
- make clangd use libc++ instead of libstdc++ by @ericniebler in #5548
- permit __query_result_or_t to take extra arguments by @ericniebler in #5551
- Only download wheels artifacts for release by @cryos in #5543
- give pod_tuple.h the _CCCL_EXEC_CHECK_DISABLE treatment by @ericniebler in #5553
- Fix fp constants by @davebayer in #5467
- Replace _CCCL_ASSUME with _CCCL_BUILTIN_ASSUME by @fbusato in #5554
- suppress bogus msvc warning about unreachable code in cuda::std::optional by @ericniebler in #5563
- port then() tests from stdexec and fix bugs in schedule_from and sync_wait by @ericniebler in #5561
- extend the get_completion_scheduler to accept the receiver's env by @ericniebler in #5565
- Add a benchmark for transform_if with stencil by @bernhardmgruber in #5571
- Complex sqrt accuracy/speed improvements by @s-oboyle in #5371
- Remove repo-docs dependency by @gevtushenko in #5568
- Regenerate PTX docs by @bernhardmgruber in #5574
- Replace our qualification macros with plain cuda::std:: by @miscco in #5573
- Reorganize docs pages a bit by @bernhardmgruber in #5584
- Documentation fixes by @bernhardmgruber in #5468
- Use a custom git describe command for setuptools-scm by @shwina in #5586
- Update CCCL to CTK mapping table by @bernhardmgruber in #5587
- Port thrust::shuffle_iterator to cuda by @miscco in #5530
- Add docstrings for all single-phase APIs in CUDA CCCL parallel algorithms by @Copilot in #5582
- Remove remainig namespace macros by @miscco in #5608
- Avoids invoking custom equality operator for out-of-bounds items by @elstehle in #5566
- Rework our fmax and fmin implementation to be potentially constexpr by @miscco in #5539
- Update cuda/ptx instructions to support all new SM architectures in CTK 13 by @fbusato in #5600
- [libcu++] Disable arch traits testing kernel for old arches for which we don't provide traits by @pciolkosz in #5602
- Enable PDL in DeviceTransform by @bernhardmgruber in #5249
- re-express execution::starts_on in terms of execution::continues_on by @ericniebler in #5576
- Refactor cuda.cccl.parallel benchmarks to reduce repetition using pytest parametrization by @Copilot in #5589
- Add ZipIterator to cuda.cccl.parallel by @shwina in #5389
- [skip-ci] Clarify GPU architecture support in README. by @jrhemstad in #5618
- rename _CCCL_TRIVIAL_API to _CCCL_NODEBUG_API by @ericniebler in #5617
- Fix includes table in migration guide by @wmaxey in #5624
- Implement format.formatter.spec by @davebayer in #5368
- Implement execution policies by @miscco in #5577
- Move partition kernels to separate header by @NaderAlAwar in #5630
- Drop internal uses of thrust::reverse_iterator by @miscco in #5616
- [libcu++] Add SM_110 arch traits by @pciolkosz in #5631
- Add device fp128 funcitons include by @davebayer in #5585
- Allow C++ code for operators in c.parallel by @gevtushenko in #5633
- [STF] Fix CUDA graph API calls for CUDA 13 by @caugonnet in #5636
- [STF] Implement token elision in cuda_kernel constructs by @caugonnet in #5640
- [STF] make get_owning_container_of local to a class by @caugonnet in #5643
- Avoid issue with MinimalElementType and MSVC by @miscco in #5641
- [STF] Replace task dep's as_read_mode by a more general as_mode by @caugonnet in #5645
- Drop constraints from fp conversion rank order traits by @davebayer in #5644
- Implement __fp_is_explicit_conversion_v by @davebayer in #5648
- Rename header guards to drop the _LIBCUDACXX prefix by @miscco in #5632
- Minor path_finder → pathfinder fixes by @rwgk in #5637
- [CUDAX] Add legacy prefix to managed_memory_resource and remove async members by @pciolkosz in #4983
- Update docs build to deploy from gh-pages branch to docs/ directory with preserved branch history by @Copilot in #5605
- Fixes thrust::unique for non-const equality_op by @elstehle in #5652
- Fix bug in reduce tuning by @gonidelis in #5654
- Enable parallel Sphinx builds by @jrhemstad in #5655
- [STF] Remove the hook mechanism by @caugonnet in #5660
- cuda.cccl: Build combined CUDA 12+13 wheel by @shwina in #5613
- Add tests/parallel/examples/scan/scan_applications.py by @oleksandr-pavlyk in #5634
- cuda.cccl.parallel: Expose "well-known" operations to Python by @shwina in #5578
- Fix issues with compiling on 12.0 for memcpy_async on Ampere+ by @wmaxey in #5665
- [STF] Add examples which add tasks to user-provided CUDA graphs by @caugonnet in #5410
- NVHPC 25.7 by @alliepiper in #5360
- Add a missing variant header in c/parallel by @caugonnet in #5680
- Simplify enum bindings by @shwina in #5666
- Fix issue revealed by gcc14 stringent checking by @andralex in #5671
- Fix cuda::shuffle_iterator not properly working with thrust algorithms by @miscco in #5686
- Update cudaGraphAddDependencies for 13.0 by @pciolkosz in #5691
- add a query to get a sender's completion domain for each completion disposition by @ericniebler in #5599
- Update PTX ISA version for CUDA 13 by @davebayer in #5676
- Move nvbench_helper out of CUB for easier reuse. by @alliepiper in #5692
- [STF] Add missing low-level API in the unified context and introduce a method to enable graph capture in the low level API by @caugonnet in #5701
- Add cuco hasher's benchmark in cudax by @srinivasyadav18 in #5558
- Fix backslashes in blocked doxygen alias in CUB docs by @oleksandr-pavlyk in #5695
- Fix thrust::malloc for void by @miscco in #5698
- Ensure that we are building with the /Zc:preprocessor flag on windows by @miscco in #5687
- Add support for float16 (__half) in cuda.cccl.parallel by @NaderAlAwar in #5696
- Work around NVRTC bug with virtual default ctors/dtors by @wmaxey in #5704
- Parse and merge devcontainer feature metadata in launch.sh by @trxcllnt in #5074
- [CUDAX] Remove synchronization from set_stream and add a stream argument to destroy in async_buffer by @pciolkosz in #5697
- fix completion signature computation of starts_on, work around gcc9 ICE by @ericniebler in #5724
- [STF] Example to freeze logical data in a graph to use in a child graph by @caugonnet in #5731
- [STF] Remove the for_each_batched experiment entirely by @caugonnet in #5726
- Also use the new preprocessor in the libcu++ header tests by @miscco in #5732
- Disable test for all MSVC and NVCC 12.0 by @miscco in #5734
- Unify the libcudacxx header test infrastructure with the other projects by @miscco in #5735
- Improvements to CI PR comments. by @alliepiper in #5705
- Fix build scripts when sccache is not available. by @alliepiper in #5727
- Make default CMake options configure a minimal installation. by @alliepiper in #5737
- Add git-bisect script/workflow and generic single-target build/test script by @alliepiper in #5728
- Deprecate <cuda/discard_memory> by @davebayer in #5672
- Add CTK 13.0, gcc14 devcontainers and CI by @alliepiper in #5431
- Add workflow to build and cleanup per PR docs previews by @jrhemstad in #5559
- fix: Add missing pages:write permission to PR cleanup workflow by @jrhemstad in #5744
- Document UB in warp_match_all by @gonzalobg in #5658
- Guard against some optional files not being present. by @alliepiper in #5742
- Migrate all cuco hashers by @srinivasyadav18 in #5400
- Add comprehensive GitHub Copilot instructions for CCCL development workflow including Python components by @Copilot in #5620
- Split cub developer guide into separate sections by @miscco in #5739
- [cudax] Add green_context::id() method by @davebayer in #5471
- Fix problematic clang attribute namespace by @davebayer in #5748
- [CUDAX] Implement cudax::kernel_ref by @davebayer in #5041
- Fix local builds by @miscco in #5746
- Obtain temp storage size and alignment directly from LTO IR via PTX conversion. by @tpn in #5355
- Properly guard ptx includes for when we are in cuda mode by @miscco in #5749
- Remove thrust from async_buffer and use cub instead by @pciolkosz in #5659
- Safe cuda::std::memset/memcpy API by @fbusato in #5500
- Cleaned up the AGENTS instructions with GPT5. by @alliepiper in #5745
- Address Sphinx warnings, populate Thrust's group pages by @oleksandr-pavlyk in #5759
- Remove problematic new build symlink by @alliepiper in #5761
- [STF] Remove broken data_from_device_async test by @caugonnet in #5765
- [STF] Remove dead STF example 09-nbody-blocked-graph by @caugonnet in #5763
- [STF] Remove the stopwatch utility header by @caugonnet in #5762
- [STF] Example to import logical data in a sub context with a while condition by @caugonnet in #5738
- [STF] Test write-back on frozen logical data by @caugonnet in #5733
- Make sure that cuda:: iterators are random_access_iterator when possible by @miscco in #5678
- [STF] Rework dot tool to have really nested sections by @caugonnet in #5723
- Fix generate_version.sh script to only consider tags beginning with v by @shwina in #5771
- Add TransformOutputIterator implementation and tests by @shwina in #5743
- [STF] Improve how we retrieve streams from async_resources_handle objects by @caugonnet in #5769
- [CUDAX] Implement cudax::library and cudax::library_ref by @davebayer in #5174
- Improve cudax/cuco hashers by @srinivasyadav18 in #5768
- cuda.cccl.parallel: Reference examples in docstrings and eliminate test_*_api.py files by @Copilot in #5614
- Fix Thrust header tests and remove unused defines by @alliepiper in #5764
- [STF] Add missing type definition in task_dep by @caugonnet in #5783
- add a deleted query member function to std::execution::env<> by @ericniebler in #5778
- Ensure that we do not rely on host library functions that might not be defined by @miscco in #5782
- Fix cudax::launch for kernels with no parameters by @davebayer in #5785
- [STF] Generic per-context resource sets by @caugonnet in #5777
- [URGENT][TRIVIAL] Make sure cudaLibrary_t is used only in versions that define it by @andralex in #5790
- Show missing executables while setting up build by @andralex in #5796
- Sort workflow job times by duration by @alliepiper in #5795
- Bump cuda99 containers to gcc14 by @wmaxey in #5760
- Drop unused header by @bernhardmgruber in #5802
- Fix libcu++ compilation with clang-20 by @davebayer in #5799
- Use nested namespace specifier in Thrust cpp system by @bernhardmgruber in #5801
- Improve documentation of cuda iterators by @miscco in #5662
- Remove "Workflow Started" PR comment. by @alliepiper in #5810
- Add support for large OffsetT types with deterministic DeviceReduce (RFA) by @srinivasyadav18 in #5434
- Drop unused include of CG by @bernhardmgruber in #5814
- Remove stray semicolon by @bernhardmgruber in #5815
- Adds output_ordering requirement as env option by @elstehle in #5781
- Fix uninitialized read in uninitialized_copy_n by @bernhardmgruber in #5811
- Fix __fp_one by @davebayer in #5800
- cccl.parallel: Unify input and output iterators by @shwina in #5770
- Fix some issues that were found by QA by @miscco in #5820
- Implement remaining cmath functions and drop indirection header by @miscco in #5786
- [STF] C bindings library by @caugonnet in #5740
- silence potential warning about ignored nodiscard value by @ericniebler in #5794
- [STF] Misc documentation fixes/clarifications by @caugonnet in #5722
- Modified CUB's device-wide developer guide by @oleksandr-pavlyk in #5829
- Fix Thrust API docs appearing twice in toctree by @bernhardmgruber in #5828
- [cub/grid] fix documentation typo in grid_even_share.cuh by @thewilsonator in #5835
- Enable NVTX for NVHPC by @bernhardmgruber in #5836
- Skip ptx-json tests for clang-cuda by @davebayer in #5841
- Drop unnecessary includes from libcu++ in CUB by @miscco in #5830
- make the concepts portability macros slightly more maintainable by @ericniebler in #5817
- Slim down Thrust CUDA core utils by @bernhardmgruber in #5845
- Improve Thrust iterator documentation by @bernhardmgruber in #5833
- Add CI information to AGENTS.md. by @alliepiper in #5779
- Use a custom iter_swap kernel in Thrust by @bernhardmgruber in #5843
- Use std::atomic in host only code by @bernhardmgruber in #5838
- Use forward declarations of extended floating point types instead of including the headers by @miscco in #5846
- Clang20 CI + devcontainers by @alliepiper in #5797
- Fix PTX ISA detection for clang-cuda by @davebayer in #5869
- fix issue in concepts macros where noexcept(t) became {noexcept(t)} noexcept by @ericniebler in #5867
- Fix grammar in doc comment for TilePrefixCallbackOp by @oleksandr-pavlyk in #5866
- Introduce facilities to extract the exponent of a floating point value. by @miscco in #5136
- [STF] test to get the stream associated to a task in the different backends by @caugonnet in #5865
- Redacted some comments in util_type CUDA header file for clarity by @oleksandr-pavlyk in #5868
- Fixes example of DeviceScan::InclusiveScanInit to use thrust vectors instead of c2h by @elstehle in #5871
- [STF] Factorize add_vertex calls by @caugonnet in #5864
- Avoid ADL issues with GCC-9 in iterator tests by @miscco in #5872
- Document Thrust systems, execution policies and their dispatch by @bernhardmgruber in #5827
- [cudax] Make cudax::host_launch work with move-only types by @davebayer in #5876
- [cudax] Require cudax::kernel_ref argument types to be TriviallyCopyable by @davebayer in #5878
- Avoid symbol clash with older clang by @miscco in #5874
- Use __fp_get_exp to implement ilogb and logb by @miscco in #5873
- Test Thrust iterator system propagation by @bernhardmgruber in #5875
- Add missing template argument in transform_reduce benchmark by @bernhardmgruber in #5803
- Refactor thrust::iterator_facade_category by @bernhardmgruber in #5877
- Allow single-target build/test jobs in CI override for faster turn-around times, reduced runner usage. by @alliepiper in #5784
- Add tests for host system propagation by @bernhardmgruber in #5881
- Add more SMs to cuda-clang CI builds by @alliepiper in #5861
- Drop obsolete is_discard_iterator by @bernhardmgruber in #5884
- Work around submdspan compiler issue on MSVC by @miscco in #5885
- Fix iterator_category_to_system for device iterator tags by @bernhardmgruber in #5880
- Add missing stream synchronization in thrust::cuda_cub::generate by @bernhardmgruber in #5889
- Clarify missing Reference and ValueType by @bernhardmgruber in #5888
- Modernize Thrust examples by @charan-003 in #5670
- Inherit thrust::transform_iterator traversal from base iterator traversal by @bernhardmgruber in #5883
- [CUDAX] Change async_buffer constructor and make_async_buffer to only optionally take an environment by @pciolkosz in #5776
- Allow CI to run on forks with sccache enabled. by @alliepiper in #5882
- Ensure test kernels remain active during allocator testing. by @alliepiper in #5899
- Implement cuda::complex by @davebayer in #5609
- Update RAPIDS devcontainers by @bdice in #5898
- Small improvements to DeviceMergeSort by @bernhardmgruber in #5900
- Drop thrust::counting_iterator in favor of cuda::counting_iterator by @miscco in #5839
- Ensure that logb is constexpr by @miscco in #5901
- Simplify cuda::std::is_trivially_copyable implementation by @davebayer in #5906
- [STF] Fix a typo in the documentation about logical_data::freeze by @caugonnet in #5922
- Revert "Drop thrust::counting_iterator in favor of cuda::counting_iterator (#5839)" by @alliepiper in #5925
- Revert "Simplify cuda::std::is_trivially_copyable implementation" by @davebayer in #5921
- Fix branch protection checks by @alliepiper in #5915
- Allow bisect jobs with custom args to run through matrix.yml. by @alliepiper in #5894
- The test has been randomly segfaulting recently so lets disable until we know whats happening by @miscco in #5930
- Ignore -Wmaybe-uninitialized in dispatch_reduce.cuh. by @bdice in #5933
- Drop Thrust mpl math by @bernhardmgruber in #5897
- remove early customization and redesign transform_sender by @ericniebler in #5793
- Enable CUDA 12.0+ testing for cuda.cccl by @shwina in #5682
- Require type annotations for TransformOutputIterator by @shwina in #5934
- Modernize iterator machinery by @miscco in #5928
- Allow 128-bit int/float in nvrtc tests by @davebayer in #5411
- Fix iterator adaptor sample by @gevtushenko in #5957
- correct the spelling of the _LIBCPP_VERSION macro by @ericniebler in #5958
- Refactor cub::DeviceMerge by @bernhardmgruber in #5937
- Drop unused LoadAlgorithm from merge policy by @bernhardmgruber in #5942
- for better intellisense in cudax, define LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE by @ericniebler in #5959
- [STF] Support larger pos4 and dim4 by @caugonnet in #5893
- Detect QNX for atomics support by @miscco in #5961
- Refactor and condense thrust::copy implementation by @bernhardmgruber in #5491
- Improve thrust::cuda_cub::replace functor handling by @bernhardmgruber in #5949
- Simplify and deprecate cuda::std::is_pod in C++20 by @davebayer in #5914
- Refactor Thrust execution policies by @bernhardmgruber in #5821
- Try to use _CCCL_API in Thrust and CUB by @bernhardmgruber in #5953
- Simplify cuda::std::is_trivially_constructible implementation by @davebayer in #5907
- Improve interoperability of cuda iterators with thrust and std by @miscco in #5929
- Drop thrust detail seq policy global by @bernhardmgruber in #5964
- Use CUDA 13 for RAPIDS CI builds by @vyasr in #5967
- Simplify cuda::std::is_trivially_copy_constructible implementation by @davebayer in #5910
- Deprecate and replace THRUST_[HOST|DEVICE]_FUNCTION by @bernhardmgruber in #5972
- Allow __builtin_addressof for nvrtc 12.3+ by @davebayer in #5980
- Fix reference for cuda::transform_iterator by @miscco in #5983
- Fix dereferencing nullptr in thrust::device_reference by @bernhardmgruber in #4226
- [STF] frozen_logical_data now inherits from frozen_logical_data_untyped by @caugonnet in #5986
- [CUDAX] Make kernel_config parameter a __grid_constant__ in kernel launcher by @davebayer in #5990
- Add env-based overloads for DeviceReduce::(Arg)MinMax by @gonidelis in #5143
- Split up cub.test.iterator to fix nightly NVHPC OOMs, add CI memory monitoring script. by @alliepiper in #5988
- Improve shared memory address range check by @fbusato in #5834
- Bump internal containers to LLVM20. by @wmaxey in #5997
- Simplify cuda::std::is_trivially_move_constructible implementation by @davebayer in #5913
- [libcu++] Switch to use cuGetProcAddress to get driver functions by @pciolkosz in #5976
- [CUDAX] Lower copy_bytes to the batched memcpy starting with CUDA 13 by @pciolkosz in #5818
- Adds device-level Top-K Parallel Algorithm to CUB by @ChristinaZ in #5677
- Fix merge agent construction from non-ptr contiguous iterator by @bernhardmgruber in #5993
- Add env based api for DeviceScan::ExclusiveSum/Scan by @srinivasyadav18 in #5767
- [CUDAX] Implement hierarchy_dimensions::static_extents() by @davebayer in #6010
- Enable __grid_constant__ with clang-cuda-20 and nvrtc by @davebayer in #5991
- Rename the trait checks to __has_meow_traversal by @miscco in #5968
- remove pynvjitlink references from examples by @jayavenkatesh19 in #5826
- Simplify selected type traits implementation by @davebayer in #5979
- Fix libcu++ lit config arch list by @bernhardmgruber in #6014
- Avoid bad_alloc inside Catch2 CHECK() by @bernhardmgruber in #6025
- Try to clean up align utilities by @fbusato in #5950
- Allow small abs error < 1e-10 in Deterministic Device Reduce large num_items test by @srinivasyadav18 in #6027
- Make assertions work on macOS by @miscco in #6028
- Move for_each_canceled_block to cuda::device:: by @davebayer in #6037
- Remove fork-ci feature. by @alliepiper in #6004
- Add windows versions of the CI target/bisect scripts. by @alliepiper in #5931
- Get Windows c.parallel build working. by @tpn in #5924
- Fix cuda13.0-rapids-conda devcontainer symlink by @bdice in #6042
- Temporarily pin CCCL version used to test RAPIDS by @vyasr in #5973
- Split up high-mem compilations in CUB to help out CI runners by @alliepiper in #6044
- c.parallel: enable dynamic policies in scan. by @griwes in #5960
- Change PARALLEL_LEVEL default from nproc to nproc-1 in build_common.sh by @Copilot in #6046
- basic_any gets better support for storing immovable types by @ericniebler in #5935
- Error when including cub umbrella header under NVRTC by @cnaples79 in #6035
- add missing InitT tparam to specialization of DispatchSegmentedReduce by @ericniebler in #6048
- Improve zip_iterator by @miscco in #6036
- Provide escape hatch for CTK compatibility check by @miscco in #6029
- Replace _LIBCUDACXX_DEPRECATED with CCCL_DEPRECATED by @davebayer in #6024
- Fix throwing functions marked as noexcept by @davebayer in #6021
- Simplify cuda::std::is_trivially_default_constructible implementation by @davebayer in #5911
- Simplify cuda::std::is_trivially_destructible implementation by @davebayer in #5905
- Fix addressof shadowing issue with libc++ by @wmaxey in #6032
- Drop unused OutputIterator template parameter in reduce by @bernhardmgruber in #6051
- Fix licenses. by @alliepiper in #6047
- Use __is_same_as builtin for cuda::std::is_same by @davebayer in #5994
- Move CDP API macros to libcu++ by @bernhardmgruber in #6017
- Modularize <cuda/std/chrono> a bit by @miscco in #5945
- Unwrap cuda::zip_iterator/zip_function in thrust::transform by @bernhardmgruber in #6039
- Simplify cuda::std::is_trivially_move_assignable implementation by @davebayer in #5912
- Do not require int128 in for_each_canceled by @davebayer in #5822
- Simplify cuda::std::is_trivially_copy_assignable implementation by @davebayer in #5909
- Fix memcpy ADL/ambiguity by @fbusato in #5969
- Simplify cuda::std::is_trivially_copyable by @davebayer in #5938
- Fix nvc++ 25.9 with format_parse_context tests by @davebayer in #6056
- Use more inline variables when possible by @miscco in #6038
- [DOC]: Add OpKind to parallel API docs by @shwina in #6058
- define cuda::std::declval in terms of new __declfn_t alias by @ericniebler in #6045
- Add dynamic CUB dispatch for three_way_partition by @NaderAlAwar in #5965
- Simplify cuda::std::is_trivially_assignable implementation by @davebayer in #5908
- Build and test Python wheels with arm64 in addition to x86_64 by @shwina in #6060
- Assert we have enough SMEM for DeviceReduce by @bernhardmgruber in #6062
- Improve __float128 support for isnan, fmin, fmax by @fbusato in #5923
- Add more tests for thrust::reduce_by_key by @bernhardmgruber in #6063
- [STF] Fix bug 5891 with large index spaces and overflows in partitioners by @caugonnet in #6015
- add missing visibility attributes and workaround nvcc bug by @ericniebler in #6070
- Add three way partition implementation for c.parallel by @NaderAlAwar in #6068
- Provide cub::DeviceCopy(mdspan) by @fbusato in #5939
- Drop CUB_STATIC_ASSERT from Doxyfiles by @bernhardmgruber in #6072
- Refactor thrust::[try_]unwrap_contiguous_iterator[_t] by @bernhardmgruber in #6065
- Use __has_meow_traversal for cuda iterators by @miscco in #6088
- Refactor ChainedPolicy by @bernhardmgruber in #6075
- Fix MSVC error with OffsetT in c/parallel/src/three_way_partition.cu. by @tpn in #6081
- Update to RAPIDS 25.12 by @bdice in #6082
- Fix clang-cuda 21 warning of uninitialized local passed as const void* by @davebayer in #6091
- Add python wrappers for c.parallel three_way_partition API by @NaderAlAwar in #6080
- Add dynamic CUB dispatch for segmented_sort by @NaderAlAwar in #6069
- Replace CUDA Runtime calls with Driver calls in libcu++ by @davebayer in #6073
- Use constexpr for some chrono traits by @miscco in #6103
- Bump catch2 to 3.8.1. by @alliepiper in #6101
- Fix imports from cudax to libcu++ by @davebayer in #6105
- add a specialization of __make_tuple_types for complex<T> by @ericniebler in #6102
- Remove iterator workarounds for lack of operator+= by @bernhardmgruber in #6094
- [CUB] Replace several direct uses of __clz by @wmaxey in #6099
- Implement cuda::zip_transform_iterator by @miscco in #5982
- Refactor DeviceSegmentedReduce by @bernhardmgruber in #6061
- Skip nightlies on weekends, cleanup old CTK, bump devcontainers. by @alliepiper in #6111
- Cache device name and peers by @davebayer in #6110
- Remove fully qualified ::cuda::std:: from examples by @charan-003 in #6130
- [STF] Fix incorrect level index in 3-depth execution policy by @19970126ljl in #6089
- Use cuda::narrow instead of custom version by @bernhardmgruber in #6133
- [CUDAX] Implement managed_memory_resource and refactor the memory pool implementation by @pciolkosz in #5998
- Rename cuda.cccl.{parallel,cooperative} -> cuda.{compute,coop} by @shwina in #6125
- Concatenate nested namespaces in CUB by @bernhardmgruber in #6139
- Cache modify iterators locally in agent_merge.cuh by @bernhardmgruber in #6142
- Initial batch of changes to setup GPU windows runners. by @alliepiper in #6131
- Remove cuda::physical_device from public API by @davebayer in #6135
- Implement operator<< for cuda::std::string_view by @davebayer in #4736
- c.parallel: enable dynamic policies in unique_by_key. by @griwes in #6087
- c.parallel: enable dynamic policies in merge_sort. by @griwes in #6147
- Use size_t for byte count in device attributes by @davebayer in #6151
- Fix link to examples in cuda.cccl Python documentation by @shwina in #6157
- Provide cuda::ptr_in_range by @fbusato in #6086
- Add a 'pull_request_lite' workflow for unmodified dependees. by @alliepiper in #6164
- Fix title in docs/python/index.rst by @shwina in #6166
- Add version_compare script, minor build_common updates. by @alliepiper in #6168
- Move nvhpc header wrappers to libcudacxx. by @alliepiper in #6167
- Remove NVIDIA Software License from top-level license by @jrhemstad in #6176
- Fix up issues with jobs in the pull_request_lite matrix. by @alliepiper in #6169
- Add noexcept to deallocate in type erased wrappers by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6179
- cuda.compute: Add PermutationIterator by @shwina in https://github.com/NVIDIA/cccl/pull/6182
- [libcu++] Fix blocks per SM in arch traits by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6185
- cuda.compute: Use annotations when available to determine signature of user-defined transform operation by @shwina in https://github.com/NVIDIA/cccl/pull/6183
- Refactor agent_histogram.cuh by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6141
- Use vector width over load size in vectorized transform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6066
- We do not need to use force includes for nvrtc by @miscco in https://github.com/NVIDIA/cccl/pull/6194
- Fix clang-cuda stf build by @davebayer in https://github.com/NVIDIA/cccl/pull/6199
- We should guard the host library include wrappers so that we can unconditionally include the headers with NVRTC by @miscco in https://github.com/NVIDIA/cccl/pull/6195
- Refactor Thrust destroy_range and device_[new|delete|free] by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6134
- Improve fully cached build times. by @alliepiper in https://github.com/NVIDIA/cccl/pull/6127
- Refactor Thrust allocator internals by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6136
- Do not use pair for two element zip iterators by @miscco in https://github.com/NVIDIA/cccl/pull/6209
- Add missing sm121 to nv/target and CUB tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6205
- Bypass allocator in thrust::device_delete by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6198
- Refactor cuda::arch_traits by @davebayer in https://github.com/NVIDIA/cccl/pull/6150
- Ensure that cuda iterators support for difference by @miscco in https://github.com/NVIDIA/cccl/pull/6201
- Fix arch_traits warnings without -fpermissive for older gcc by @davebayer in https://github.com/NVIDIA/cccl/pull/6217
- Improve handling of empty members in cuda iterators by @miscco in https://github.com/NVIDIA/cccl/pull/6006
- Improve string_view interoperability with std:: counterpart and string by @davebayer in https://github.com/NVIDIA/cccl/pull/6184
- Refactor / fixup libcudacxx CMake targets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/6223
- Fix missing attributes in cccl-rt and rename event::flags to event_flags by @davebayer in https://github.com/NVIDIA/cccl/pull/6224
- merge the schedule_from and continues_on algorithms by @ericniebler in https://github.com/NVIDIA/cccl/pull/6162
{host, device, managed}_mdspanby @miscco in https://github.com/NVIDIA/cccl/pull/6093 - Expose
cuda::mul_hiby @fbusato in https://github.com/NVIDIA/cccl/pull/6146 - Assert deallocation is noexcept by @bdice in https://github.com/NVIDIA/cccl/pull/6186
- Provide
cub::DeviceFor::ForEachInLayoutby @fbusato in https://github.com/NVIDIA/cccl/pull/5956 - Replace internal
multiple_higher_bitswithcuda::mul_hiby @fbusato in https://github.com/NVIDIA/cccl/pull/6239 - Update GPU architecture support details in README by @jrhemstad in https://github.com/NVIDIA/cccl/pull/6229
- Test polymorphic types in
thrust::device_deleteby @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6140 - Modernizes top-k examples by @elstehle in https://github.com/NVIDIA/cccl/pull/6241
- use
cuda::mul_hiincuda::std::callocby @davebayer in https://github.com/NVIDIA/cccl/pull/6242 - Refactor iterator usage in thrust/cuda find() by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6019
- Try different formulation for thrust::ccosh by @miscco in https://github.com/NVIDIA/cccl/pull/6200
- Improve invoke machinery by @miscco in https://github.com/NVIDIA/cccl/pull/6227
- Drop all uses of thrust::tabulate_output_iterator in favor of cuda::tabulate_output_iterator by @miscco in https://github.com/NVIDIA/cccl/pull/6001
- Fix __compressed_movable_box by @miscco in https://github.com/NVIDIA/cccl/pull/6247
- Fix __is_primary_std_template for libc++ by @miscco in https://github.com/NVIDIA/cccl/pull/6243
- Add environment overloads for DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6204
- Fix invalid refactoring of #4377 by @miscco in https://github.com/NVIDIA/cccl/pull/6246
- [libcu++] Enable complex literals by @davebayer in https://github.com/NVIDIA/cccl/pull/6252
- Implement cudax::cufile_driver by @davebayer in https://github.com/NVIDIA/cccl/pull/5941
- Fixing cudax::execution CUDA stream scheduler by @ericniebler in https://github.com/NVIDIA/cccl/pull/6175
- [libcu++/cudax] Move all experimental additions to memory resource properties to libcu++ by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6233
- Fix invalid device_accessible namespace by @davebayer in https://github.com/NVIDIA/cccl/pull/6269
- Do not use bit_cast to work around initialization issues with barrier by @miscco in https://github.com/NVIDIA/cccl/pull/6263
- Fix missing qualifications for __construct_at by @miscco in https://github.com/NVIDIA/cccl/pull/6270
- Fix missed constructor with compressed box by @miscco in https://github.com/NVIDIA/cccl/pull/6268
- Fix using char as the index type of tabulate_output_iterator by @miscco in https://github.com/NVIDIA/cccl/pull/6271
- Add host standard library detection by @davebayer in https://github.com/NVIDIA/cccl/pull/6244
- Adds a section on perf checks to contributing.md by @elstehle in https://github.com/NVIDIA/cccl/pull/6267
- Deprecate <cuda/stream_ref> header by @davebayer in https://github.com/NVIDIA/cccl/pull/6266
- Provide cuda::in_range by @fbusato in https://github.com/NVIDIA/cccl/pull/6034
- [CUDAX] Add assignment operator that rebinds resource_ref by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6240
- [CUDAX] Change memory pool type to also be a resource by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6180
- [CUB]: Add missing closing braces to examples in Doxygen. by @brycelelbach in https://github.com/NVIDIA/cccl/pull/6278
- Pass a device array or None as the initial value to cuda.compute scan by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/6262
- Limit deprecation exclusions to targeted headers. by @alliepiper in https://github.com/NVIDIA/cccl/pull/6275
- Disable SASS check in cuda.compute for scan no init value for sm_90 and later by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/6287
- Implement initial Windows CI support for the Python cuda-cccl library. by @tpn in https://github.com/NVIDIA/cccl/pull/6160
- Fix exception handling macros in exceptions.h by @ericniebler in https://github.com/NVIDIA/cccl/pull/6286
- Cleanup and simplify structured bindings support by @miscco in https://github.com/NVIDIA/cccl/pull/6281
- Provide cuda::ptx::enable_smem_spilling() by @davebayer in https://github.com/NVIDIA/cccl/pull/6289
- Add PyTorch build to CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/6276
- Dropped duplicated math function from Thrust by @viralbhadeshiya in https://github.com/NVIDIA/cccl/pull/6188
- Refines the section on perf checks to contributing.md by @elstehle in https://github.com/NVIDIA/cccl/pull/6280
- Drop typedef in cuda::atomic test by @viralbhadeshiya in https://github.com/NVIDIA/cccl/pull/6297
- fix kernel launch failure when sender expressions can throw by @ericniebler in https://github.com/NVIDIA/cccl/pull/6277
- Extract BlockScan code-block examples to literalinclude 1/3 by @gonidelis in https://github.com/NVIDIA/cccl/pull/6288
- [DOC] Fix BlockRadixRank documentation by @Aminsed in https://github.com/NVIDIA/cccl/pull/6207
- Fix string_view construction from std::string_view by @davebayer in https://github.com/NVIDIA/cccl/pull/6291
- Clean up cuda::/std::/cuda::std::__is_meow_v traits by @davebayer in https://github.com/NVIDIA/cccl/pull/6300
- add parallel scan support for TBB and OMP by @charan-003 in https://github.com/NVIDIA/cccl/pull/6178
- Use 'python3 -m pip args' instead of 'pip args' in docs/gen_docs.bash script by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/6293
- Ignore OOM failures for large size unique thrust test. by @alliepiper in https://github.com/NVIDIA/cccl/pull/6304
- Add support for 128b atomics to atomic_ref by @wmaxey in https://github.com/NVIDIA/cccl/pull/3440
- Fix is_sufficiently_aligned with const void* by @fbusato in https://github.com/NVIDIA/cccl/pull/6307
- GCC only recognizes unused-local-typedefs by @alliepiper in https://github.com/NVIDIA/cccl/pull/6303
- Replace __popc with cuda::std::popcount by @viralbhadeshiya in https://github.com/NVIDIA/cccl/pull/6213
- Deprecate experimental TMA exposure in cuda::barrier by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6305
- Provide utility to check pointer ranges overlapping by @fbusato in https://github.com/NVIDIA/cccl/pull/6100
- Always include <new> when we need operator new for clang-cuda by @miscco in https://github.com/NVIDIA/cccl/pull/6310
- Fix thrust system dependent includes by @miscco in https://github.com/NVIDIA/cccl/pull/6311
- Optimize cuda::minimum/maximum for float, double, __half, __nv_bfloat16, __float128 by @fbusato in https://github.com/NVIDIA/cccl/pull/5034
- Disable test for compressed_movable_box by @miscco in https://github.com/NVIDIA/cccl/pull/6320
- c.parallel: enable dynamic policies in radix_sort. by @griwes in https://github.com/NVIDIA/cccl/pull/6264
- Simplify thrust::zip_function by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6321
- Simplify numeric_limits::[min|max]() implementation for integrals by @davebayer in https://github.com/NVIDIA/cccl/pull/6324
- [cudax -> libcudacxx] Move type-erased resource wrappers to libcudacxx by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6299
- Include <math.h> in <cuda/std/cmath> headers unconditionally by @davebayer in https://github.com/NVIDIA/cccl/pull/6333
- Ensure that we can instantiate zip_function with a type that is not non-const invocable by @miscco in https://github.com/NVIDIA/cccl/pull/6323
- Use RAPIDS main branch by @bdice in https://github.com/NVIDIA/cccl/pull/6318
- [CUDAX] Rename memory_resource types to memory_pool_ref by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6334
- Fix transform.cu size_t/int issue. by @tpn in https://github.com/NVIDIA/cccl/pull/6332
- Remove unused _LIBCUDACXX_HAS_MEOW macros by @davebayer in https://github.com/NVIDIA/cccl/pull/6338
- Refactor thrust::mismatch to use CUDA iterators by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6018
- Use _CCCL_CTK_MEOW instead of _CCCL_CUDACC_MEOW by @davebayer in https://github.com/NVIDIA/cccl/pull/6343
- Improve cuda::barrier TMA examples and elect_one in DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6329
- Move the throw_meow_error functions into their own header and drop the stdexcept include by @miscco in https://github.com/NVIDIA/cccl/pull/6335
- Refactor histogram kernel entrypoint by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6342
- Implements a more memory-efficient way to test for large k in DeviceTopK tests by @elstehle in https://github.com/NVIDIA/cccl/pull/6322
- Replace inline PTX by cuda::ptx in cuda::barrier<thread_scope_block> by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6250
- Add a philox PRNG engine by @RAMitchell in https://github.com/NVIDIA/cccl/pull/6109
- Do not mark deduction guides as hidden by @miscco in https://github.com/NVIDIA/cccl/pull/6350
- Move the implementation of tuple into its own file by @miscco in https://github.com/NVIDIA/cccl/pull/6336
- [cudax] Fix managed resource test by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6354
- Retry sccache startup on windows to WAR random auth issues. by @alliepiper in https://github.com/NVIDIA/cccl/pull/6347
- Unpin CCCL version used for RAPIDS testing by @bdice in https://github.com/NVIDIA/cccl/pull/6349
- Refactor generic thrust scan dispatch by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6360
- Add missing header to random engine tests by @RAMitchell in https://github.com/NVIDIA/cccl/pull/6364
- [cudax->libcu++] Move any_resource tests and remove experimental aliases by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6351
- Move CPP, OMP and TBB exec policies to detail by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6361
- Refactor Thrust OMP system headers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6372
- Fix reference to cuda::std::bit_floor/bit_ceil in docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6373
- Fix tuple constraint by @miscco in https://github.com/NVIDIA/cccl/pull/6363
- Improve exception macros by @davebayer in https://github.com/NVIDIA/cccl/pull/6337
- Windows CI: CCCL C Parallel by @alliepiper in https://github.com/NVIDIA/cccl/pull/6254
- Use non deprecated methods for stream_ref in docs by @davebayer in https://github.com/NVIDIA/cccl/pull/6376
- Inline the Thrust ADL layer by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6377
- Move iosfwd to its own internal file by @miscco in https://github.com/NVIDIA/cccl/pull/6390
- Move is_reference_wrapper trait to __fwd/reference_wrapper.h by @davebayer in https://github.com/NVIDIA/cccl/pull/6392
- Fix iter_move constraints for MSVC by @miscco in https://github.com/NVIDIA/cccl/pull/6357
- c.parallel: fixes for well-known operations. by @griwes in https://github.com/NVIDIA/cccl/pull/6386
- Modularize variant by @miscco in https://github.com/NVIDIA/cccl/pull/6393
- Fix label in memcpy_async_tx docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6398
- Drop unused file to detect CUDA archs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6374
- Make some if constexpr by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6382
- Refactor Thrust TBB system headers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6394
- Move cuda/std/__cuda/api_wrapper.h to cuda/__runtime/api_wrapper.h by @davebayer in https://github.com/NVIDIA/cccl/pull/6379
- Rewrite agent template parameters to PascalCase by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6380
- Implement std::seed_seq by @RAMitchell in https://github.com/NVIDIA/cccl/pull/6358
- Fix missing monostate_include by @miscco in https://github.com/NVIDIA/cccl/pull/6403
- c.parallel: single-stage runtime compilation. by @griwes in https://github.com/NVIDIA/cccl/pull/6341
- Apply cuda::barrier and elect_one feedback by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6344
- Fix clang 21 issues by @davebayer in https://github.com/NVIDIA/cccl/pull/6404
- Add a benchmark for DeviceSegmentedReduce::ArgMin by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6401
- Rewrite block algorithm template parameters to PascalCase by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6381
- [cudax -> libcu++] Move memory resources to libcu++ by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6384
- c.parallel: cache runtime transform configs. by @griwes in https://github.com/NVIDIA/cccl/pull/6385
- Fix wrong namespace in TBB Backend by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6395
- Refactor agent_histogram.cuh Part 2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6196
- Fix wrongly rewritten license headers in Thrust OMP/TBB backend by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6406
- bring cudax::execution closer in line with the evolving P3826 by @ericniebler in https://github.com/NVIDIA/cccl/pull/6417
- Prepare cudax::host_launch migration to libcu++ by @davebayer in https://github.com/NVIDIA/cccl/pull/6420
- Drops default constructor of BlockLoadToShared by @elstehle in https://github.com/NVIDIA/cccl/pull/6427
- Inline remaining *.inl files in tbb and seq backends by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6437
- Nested namespace fixes in General modules & cub by @viralbhadeshiya in https://github.com/NVIDIA/cccl/pull/6425
- Fix offset_iterator tests by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6436
- Enhance lane mask validation in __shfl_sync by @fbusato in https://github.com/NVIDIA/cccl/pull/6429
- Use SPDX license identifiers in CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6441
- Add _CCCL_DECLSPEC_EMPTY_BASES to mdspan features by @miscco in https://github.com/NVIDIA/cccl/pull/6444
- [clang-format] WrapNamespaceBodyWithEmptyLines: Never by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6439
- Ensure that detect_wrong_difference is a valid output iterator by @miscco in https://github.com/NVIDIA/cccl/pull/6450
- Fix cub.bench.radix_sort.keys.base regression on H200 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6452
- Fixes non-default-constructible iterators for large number of items types in DeviceRunLengthEncode::Encode by @elstehle in https://github.com/NVIDIA/cccl/pull/6451
- Test mixing iterators in DeviceMerge by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6455
- Use PDL in DeviceHistogram by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6367
- Enabling max pool size for memory pools by @nirandaperera in https://github.com/NVIDIA/cccl/pull/6370
- Add segmented sort implementation for c.parallel by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/6095
- Fix Random CI failures for Deterministic Device Reduce (RFA) with different policies by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/6464
- Implement __is_fully_bounded_array trait by @davebayer in https://github.com/NVIDIA/cccl/pull/6461
- Drop build.log by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6454
- Prefix CUB kernel headers with kernel_ by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6383
- Nested namespace resolve for thrust & libcudacxx by @viralbhadeshiya in https://github.com/NVIDIA/cccl/pull/6465
- Various CMake cleanups. by @alliepiper in https://github.com/NVIDIA/cccl/pull/6346
- Ignore python/cuda_cccl/build.log by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6473
- Replace enum by static constexpr in some agent tunings by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6472
- fall back gpu_to_gpu floating-point min/max reductions to run_to_run by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/6462
- Fix incorrect file name by @miscco in https://github.com/NVIDIA/cccl/pull/6481
- Add python wrappers for c.parallel segmented_sort by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/6471
- Provide cuda::sub_overflow by @fbusato in https://github.com/NVIDIA/cccl/pull/6084
- Cleanup libcu++ CMake by @miscco in https://github.com/NVIDIA/cccl/pull/6478
- Avoid single letter typenames by @miscco in https://github.com/NVIDIA/cccl/pull/6474
- Add WarpReduce Device-Side Benchmarks by @fbusato in https://github.com/NVIDIA/cccl/pull/6431
- Avoid potentially ambiguous overload in warp_exchange_shfl by @miscco in https://github.com/NVIDIA/cccl/pull/6484
- Replace uses of cub::PowerOfTwo and deprecate it by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6490
- Drop shadowing redeclaration of constants by @miscco in https://github.com/NVIDIA/cccl/pull/6479
- Provide cuda::div_overflow by @fbusato in https://github.com/NVIDIA/cccl/pull/6128
- Enable __int128_t as difference type in counting_iterator by @miscco in https://github.com/NVIDIA/cccl/pull/6487
- Make nvrtc concept macros a bit more reliable by @miscco in https://github.com/NVIDIA/cccl/pull/6397
- Use __byte_perm intrinsic rather than inline asm in cuda::std::byteswap by @davebayer in https://github.com/NVIDIA/cccl/pull/6493
- Updates to populate the PyPI landing page. by @shwina in https://github.com/NVIDIA/cccl/pull/6483
- Remove old C++ version checks by @davebayer in https://github.com/NVIDIA/cccl/pull/6494
- [clang-format] KeepEmptyLines only at EOF by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6440
- Expose ptx::mbarrier_inval and use it by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6496
- Move the libcu++ specific config by @miscco in https://github.com/NVIDIA/cccl/pull/6396
- Implement cuda::invalid_stream by @davebayer in https://github.com/NVIDIA/cccl/pull/6488
- Fix invalid reference type of cuda::strided_iterator by @miscco in https://github.com/NVIDIA/cccl/pull/6501
- Allow passing in None as init value for scan when using an iterator as input by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/6499
- Catches NaNs before they make it to static_cast and create UB by @s-oboyle in https://github.com/NVIDIA/cccl/pull/6502
- Extract BlockScan code-block examples to literalinclude 2/3 by @gonidelis in https://github.com/NVIDIA/cccl/pull/6418
- Fix NVTX disabling test by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6516
- Disable CI workflows on forks. by @alliepiper in https://github.com/NVIDIA/cccl/pull/6514
- Adds token to enforce correct call sequence in BlockLoadToShared: Commit() -> Wait() by @elstehle in https://github.com/NVIDIA/cccl/pull/6510
- Expose ptx::setmaxnreg by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6504
- [CUDAX] Uglify the hierarchy files by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6491
- [CUB] Use BlockLoadToShared in DeviceMerge by @pauleonix in https://github.com/NVIDIA/cccl/pull/6077
- Replace custom equal_to functors by _1 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6515
- Add cccl_add_xfail_compile_target_test CMake function by @alliepiper in https://github.com/NVIDIA/cccl/pull/6434
- Add conda installation instructions for cuda.cccl Python package by @Copilot in https://github.com/NVIDIA/cccl/pull/6513
- Fix missing token passing in AgentMerge by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6525
- [cudax->libcu++] Move uninitialized_async_buffer and heterogeneous_iterator to libcu++ by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6489
- [CUDAX] Rename async_buffer to buffer by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6520
- Split DeviceSegmentedReduce in its own file by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6524
- Add launch bounds to block_reduce_kernel by @Artem-B in https://github.com/NVIDIA/cccl/pull/6533
- Fixes braces around scalar initializer warning in BlockLoadToShared by @elstehle in https://github.com/NVIDIA/cccl/pull/6534
- [cudax->libcu++] Move host_launch to libcu++ by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6536
- Improves DeviceTopK docs by @elstehle in https://github.com/NVIDIA/cccl/pull/6531
- Allow __builtin_bitreverse with clang-cuda by @davebayer in https://github.com/NVIDIA/cccl/pull/6545
- [cudax->libcu++] Move shared_resource to libcu++ by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6539
- Fix/Improve <cuda/bit> documentation by @fbusato in https://github.com/NVIDIA/cccl/pull/6543
- cuda::align_up/down workaround for memory space by @fbusato in https://github.com/NVIDIA/cccl/pull/6541
- Cleanup thrust::complex math includes and functions by @miscco in https://github.com/NVIDIA/cccl/pull/6546
- Fix bit_reverse documentation example by @fbusato in https://github.com/NVIDIA/cccl/pull/6551
- Drop old namespace macros by @miscco in https://github.com/NVIDIA/cccl/pull/6548
- Update NVBench type string declarations for FP16 and BF16 by @fbusato in https://github.com/NVIDIA/cccl/pull/6555
- Make uniform_int_distribution constexpr by @RAMitchell in https://github.com/NVIDIA/cccl/pull/6523
- Cleanup includes in thrust by @miscco in https://github.com/NVIDIA/cccl/pull/6547
- Rename some of the namespace macros by @miscco in https://github.com/NVIDIA/cccl/pull/6549
- Fix merge conflicts from dropping headers by @miscco in https://github.com/NVIDIA/cccl/pull/6563
- Use cuda/iterator in cub/test by @oleksandr-pavlyk in https://github.com/NVIDIA/cccl/pull/6405
- Fix compute capability -> PTX version conversion by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6567
- Add gersemi CMake formatter by @alliepiper in https://github.com/NVIDIA/cccl/pull/6557
- [cudax->libcudacxx] Move device_transform to libcu++ by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6469
- Fix some warnings in cub headers that are picked up by the libcu++ tests by @miscco in https://github.com/NVIDIA/cccl/pull/6522
- Fix cuda/cmath and cuda/memory documentation by @fbusato in https://github.com/NVIDIA/cccl/pull/6569
- Implement bernoulli_distribution by @RAMitchell in https://github.com/NVIDIA/cccl/pull/6375
- Fix missing include by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6578
- Test passing a custom policy to DispatchReduce by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6577
- Refactor DispatchMergeSort by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6580
- Refactor cub::detail::for_each::dispatch_t by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6579
- Replace enum by static constexpr in CUB/Thrust by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6480
- [cudax] Add pointer attributes fallback to async buffer initialization by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6352
- [libcu++] Add initial cccl-runtime docs for 3.1 by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6562
- Split segmented radix sort into separate files by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6581
- Add BlockLoadToShared improvements by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6526
- Disable clang-cuda with libc++ tests for now by @miscco in https://github.com/NVIDIA/cccl/pull/6586
- Try and improve our is_nothrow_constructible fallback by @miscco in https://github.com/NVIDIA/cccl/pull/6583
- Refactor DispatchSegmentedSort by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6599
- Refactor DispatchReduce by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6590
- Fix misspelling of contiguous range in documentation by @brycelelbach in https://github.com/NVIDIA/cccl/pull/6603
- Refactor DispatchScan by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6594
- Refactor DispatchScanByKey by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6596
- Refactor rfa dispatcher by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6591
- Fix typo in mbarrier.inval by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6615
- Test and refactor [Mem|Reg]BoundScaling by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6575
- Try to fix windos python runner by @miscco in https://github.com/NVIDIA/cccl/pull/6602
- Allow using ZipIterator as an output in cuda.compute by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/6518
- Fix issue with old GCC by @miscco in https://github.com/NVIDIA/cccl/pull/6614
- Fix some minor issues in the extents implementation by @miscco in https://github.com/NVIDIA/cccl/pull/6604
- Make Thrust/CUB ABI namespace resilient against user-defined macros by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6564
- Improve cudax::dynamic_shared_memory implementation by @davebayer in https://github.com/NVIDIA/cccl/pull/6495
- Refactor DispatchUniqueByKey by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6600
- Replace uses of thrust::pair with cuda::std::pair by @miscco in https://github.com/NVIDIA/cccl/pull/6616
- Refactor deterministic reduce dispatcher by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6593
- [cudax] Add synchronous_resource_adapter and use it in async_buffer by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6432
- Split fixed-size segmented reduce dispatch header by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6597
- Publicly expose <cuda/std/algorithm> by @miscco in https://github.com/NVIDIA/cccl/pull/3741
- Refactor cub::detail::AliasTemporaries by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6617
- Add __version_(at_least|below) utilities to CUDA Driver wrappers by @davebayer in https://github.com/NVIDIA/cccl/pull/6626
- Fix includes in work stealing example by @miscco in https://github.com/NVIDIA/cccl/pull/6631
- streamline the implementation of cuda::std::__tuple by @ericniebler in https://github.com/NVIDIA/cccl/pull/6623
- Optimize cuda::is_address_space by forcing the memory space by @fbusato in https://github.com/NVIDIA/cccl/pull/6553
- Add DiscardIterator to cuda.compute to enable unique_by_key keys only by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/6618
- Provide utilities to check pointer memory space (host/device/managed) by @fbusato in https://github.com/NVIDIA/cccl/pull/6325
- MVP for disabling nvtx ranges for thrust::seq by @gonidelis in https://github.com/NVIDIA/cccl/pull/6415
- Provide make_tma_descriptor, DLPack->CUtensorMap by @fbusato in https://github.com/NVIDIA/cccl/pull/6237
- Cleanup dependencies between internal targets. by @alliepiper in https://github.com/NVIDIA/cccl/pull/6571
- Port thrust complex.cu tests to catch2_test_complex.cu by @dunga1k58bh in https://github.com/NVIDIA/cccl/pull/6625
- Support nested structs in cuda.compute by @shwina in https://github.com/NVIDIA/cccl/pull/6353
- [cudax->libcu++] Move the hierarchy type from cudax to libcu++ by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6611
- [Backport branch/3.2.x] Address pending comments for make_tma_descriptor by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6683
- [Backport branch/3.2.x] Fixes issue with select close to int_max by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6701
- [Backport branch/3.2.x] fix omp scan bug by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6704
- [Backport branch/3.2.x] Fix electing leader from any group in cuda::memcpy_async by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6716
- [Backport branch/3.2.x] Avoid scaling twice in ReduceNondeterministicPolicy by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6719
- [Backport branch/3.2.x] [libcu++] Automatically bump up the release threshold of default mempools by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6735
- [Backport branch/3.2.x] Fix __throw_cuda_error availability with nvrtc by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6769
- [Backport 3.2] Add sm_62 arch traits (#6772) by @davebayer in https://github.com/NVIDIA/cccl/pull/6778
- [Backport branch/3.2.x] Ensure that we properly warn about device lambdas that need to query the return type by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6782
- [Backport branch/3.2.x] Use conventional order of _CCCL_API friend consistently by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6794
- [Backport branch/3.2.x] Temporarily add upper bound to numba-cuda dependency by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6830
- Msvc-error-backport by @alliepiper in https://github.com/NVIDIA/cccl/pull/6827
- [Backport branch/3.2.x] Fix arch related cuda::device:: APIs for nvhpc in CUDA mode by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6832
- [Backport 3.2.x] Test building for all arches. (#6113) by @davebayer in https://github.com/NVIDIA/cccl/pull/6842
- [Backport branch/3.2.x] Remove upper bound on numba-cuda by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6853
- [Backport branch/3.2.x] Use lit for cuda::arch_id and cuda::compute_capability tests by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6840
- [Backport branch/3.2.x] [PTX] Add cp.async.bulk.dst.src.mbarrier::complete_tx::bytes.ignore_oob by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6860
- CMake backports for 3.2 by @alliepiper in https://github.com/NVIDIA/cccl/pull/6850
- [Backport branch/3.2.x] Add missing doc strings to support old CMake. by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6877
- [backport 3.2.x][cudax->libcu++] Move buffer type from cudax to libcu++ (#6627) by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6833
- [Backport branch/3.2.x] Move launch API from cudax to libcu++ by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6891
- [backport 3.2.x][libcu++] Add memory_pool header and correct legacy resources namespace by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6893
- [Backport 3.2.x] [cuda.compute] Add dependency on nvidia-nvvm #6909 by @shwina in https://github.com/NVIDIA/cccl/pull/6949
- [Backport branch/3.2.x] Remove all usage of old experimental MR macro by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6965
- [Backport branch/3.2.x] [libcu++] Leak static CUDA resources and add missing release on memory pool by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6960
- [Backport branch/3.2.x] [libcu++] Add as_ref() to memory pool types by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6959
- [Backport branch/3.2.x] Remove [[nodiscard]] from barrier's .arrive(...) method by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6950
- [Backport branch/3.2.x] Properly specialize cub functions for __nv_bfloat16 by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6940
- [Backport 3.2] Avoid waring about missing braces for subobject (#6929) by @miscco in https://github.com/NVIDIA/cccl/pull/6973
- [Backport branch/3.2.x] Add missing nvrtc nv target archs by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6933
- [Backport branch/3.2.x] Make sure we actually use overflow builtins by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6934
- [Backport branch/3.2.x] [libcu++] Static assert that resource is copyable in buffer constructors by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6936
- [Backport branch/3.2.x] [libcu++] Rename device_transform back to launch_transform by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6937
- [Backport branch/3.2.x] [libcu++] Fix minor version compatibility in 13.X by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6896
- [Backport branch/3.2.x] Don't use __builtin_bswap128 during constant evaluation by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6969
- [Backport branch/3.2.x] [libcu++] Uncomment some tests and fix launch include after launch was moved to libcu++ by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6989
- [Backport branch/3.2.x] [libcu++] Dynamically load CUDA library instead of using the runtime by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6982
- [backport 3.2.x] Use cuda.core Linker instead of numba-cuda and fix import issues with experimental namespace by @NaderAlAwar in https://github.com/NVIDIA/cccl/pull/7025
- [Backport branch/3.2.x] avoid error adding pointer to reference in any_resource by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7018
- [Backport branch/3.2.x] [libcu++] Don't require accessibility property on type erased wrappers by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7030
- [backport 3.2.x][libcu++] Fix test issues on Windows (#6993) by @pciolkosz in https://github.com/NVIDIA/cccl/pull/7017
- [Backport branch/3.2.x] Fix cuda::memcpy async edge cases and add more tests by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7036
- [Backport branch/3.2.x] [libcu++] Fix synchronous resource adapter property passing by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7031
- [backport 3.2] Backport #6844, #6958, #6619 and #6957 by @davebayer in https://github.com/NVIDIA/cccl/pull/7038
- [Backport branch/3.2.x] Disable LDL/STL checks, for failures seen with NVRTC 13.1 by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7065
- [Backport branch/3.2.x] [libcu++] Add explicit alignment specification in buffer (#7005) by @pciolkosz in https://github.com/NVIDIA/cccl/pull/7041
- [Backport branch/3.2.x] [libcu++] Correctly handle extended lambda in cuda::launch by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7069
- [Backport branch/3.2.x] use references for mdspan internal methods #7059 by @fbusato in https://github.com/NVIDIA/cccl/pull/7068
- [Backport 3.2] Disable test that is failing in multiple configurations (#6745) by @miscco in https://github.com/NVIDIA/cccl/pull/7076
- [Backport 3.2] Use resource test fixure members through this (#6717) by @miscco in https://github.com/NVIDIA/cccl/pull/7075
- [Backport 3.2] Avoid invalid compiler warning with VS2026 (#7077) by @miscco in https://github.com/NVIDIA/cccl/pull/7081
- [Backport 3.2] Avoid compiler issue with MSVC _CCCL_UNREACHABLE (#7080) by @miscco in https://github.com/NVIDIA/cccl/pull/7083
- [Backport branch/3.2.x] [libcu++] Make kernel_config member private and allow it in hierarchy queries by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7070
- [Backport branch/3.2.x] [thrust] Ignore CUDA free errors in thrust memory resource by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7033
- [Backport branch/3.2.x] [libcu++] Remove _view from the shared memory getter name by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7032
- [backport 3.2] Backport #6985 by @fbusato in https://github.com/NVIDIA/cccl/pull/7039
- [Backport branch/3.2.x] Explicitly set CCCL_TOPLEVEL_PROJECT to OFF when needed by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7040
- [Backport branch/3.2.x] Simplify cuda::host_launch API by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7089
- [Backport 3.2] Expand warning suppression for braces around subobject (#7087) by @miscco in https://github.com/NVIDIA/cccl/pull/7091
- [Backport to 3.2] Refactor c2h gen to ensure teardown before main (#7067) and Add an option to use CCCL from CTK for C2H (#6848) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/7085
- [Backport branch/3.2.x] [libcu++] Fix driver api test after curand changes by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7105
- [Backport branch/3.2.x] Enhance DLPack compatibility by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7060
- [Backport branch/3.2.x] [libcu++] Check if managed pools are accessible in is_pointer_accessible test by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7116
- [Backport branch/3.2.x] Fix calculation of necessary bits in feistel projection by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7117
- [Backport branch/3.2.x] Fix incorrect if else logic in fmax by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7112
- [Backport branch/3.2.x] Use cudaMemcpyDefault for trivial copies by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7134
- [Backport branch/3.2.x] Fix nvrtcc minimum arch for __float128 support by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7141
- [BACKPORT 3.2] Generator for prologue/epilogue (#7099) by @miscco in https://github.com/NVIDIA/cccl/pull/7136
- [Backport branch/3.2.x] Replace and deprecate compute_capability::major() and compute_capability::minor() by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7143
- [Backport branch/3.2.x] Move DLPack include to separate file by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7142
- [backport 3.2] [libcu++] Allow all public headers to be included with host compilers only (#7012) by @pciolkosz in https://github.com/NVIDIA/cccl/pull/7146
- [Backport branch/3.2.x] Disable reference_wrapper test for VS2026 by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7145
- [Backport branch/3.2.x] Revert nested namespace change to <nv/target> by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7153
- [backport branch/3.2.x] Backport #7023, #7009, #7139, #7144 and #7130 by @davebayer in https://github.com/NVIDIA/cccl/pull/7147
- [Backport branch/3.2.x] Add Android-specific assert handling in __cccl/assert.h by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7158
- [Backport branch/3.2.x] Align local vector storage arrays in vec transform by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7163
- [Backport 3.2] Fix make_tma_descriptor() unit test (#7152) by @miscco in https://github.com/NVIDIA/cccl/pull/7164
- [Backport branch/3.2.x] Fixes for thrust::shuffle by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7189
- [Backport branch/3.2.x] Do not try to run catch2 tests with nvrtc by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7244
- [Backport to 3.2] Fix extracting CUDA stream in cub::DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/7263
- [Backport branch/3.2.x] Change the order of conditions in cuda::barrier by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7273
- [Backport 3.2] Fix __query_or CPO by @miscco in https://github.com/NVIDIA/cccl/pull/7267
- [Backport branch/3.2.x] Fix is_address_from for cluster_shared for pre-sm_90 by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7301
- [Backport branch/3.2.x] Skip checking build prereqs if installing by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/7326
New Contributors
- @GPMueller made their first contribution in #5369
- @MengAiDev made their first contribution in #5483
- @Copilot made their first contribution in #5582
- @thewilsonator made their first contribution in #5835
- @vyasr made their first contribution in #5967
- @jayavenkatesh19 made their first contribution in #5826
- @cnaples79 made their first contribution in #6035
- @19970126ljl made their first contribution in #6089
- @nirandaperera made their first contribution in https://github.com/NVIDIA/cccl/pull/6370
- @dunga1k58bh made their first contribution in https://github.com/NVIDIA/cccl/pull/6625
Full Changelog: v3.1.4...v3.2.0