NVIDIA/cccl v3.1.0 on GitHub

What's Changed

🚀 Thrust / CUB

[Thrust] Perform asynchronous allocations by default for the par_nosync policy by @brycelelbach in #4204
[Thrust] reduce_into by @brycelelbach in #4355
Enable Catch2 tests in Thrust by @bernhardmgruber in #2669
Add memcpy_async transform kernel for Ampere by @bernhardmgruber in #2394
Allow default-initializing and skipping initialization of Thrust vectors by @bernhardmgruber in #4183
Add thrust::strided_iterator and a step for thrust::counting_iterator by @bernhardmgruber in #4014
Add new WarpReduce overloads by @fbusato in #3884
Optimize ThreadReduce by @fbusato in #3441
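
To make the iterator entry above more concrete, here is a minimal host-only C++ sketch of the sequences a stepped counting iterator and a strided iterator conceptually produce. The `sketch` namespace and both function names are hypothetical illustrations; they are not the Thrust API, whose exact signatures are not given in these notes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical helpers for illustration, not Thrust APIs

// Values first, first + step, first + 2*step, ... (n values in total).
inline std::vector<int> counting_with_step(int first, int step, std::size_t n) {
  std::vector<int> out;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    out.push_back(first + static_cast<int>(i) * step);
  }
  return out;
}

// Every stride-th element of `data`, starting at index 0.
inline std::vector<int> strided_view(const std::vector<int>& data, std::size_t stride) {
  assert(stride > 0);  // a zero stride would never advance
  std::vector<int> out;
  for (std::size_t i = 0; i < data.size(); i += stride) {
    out.push_back(data[i]);
  }
  return out;
}

}  // namespace sketch
```

An iterator adaptor would presumably produce these values lazily rather than materializing a vector; the sketch only shows the element order.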

📚 Libcudacxx

Enable device assertions in CUDA debug mode (nvcc -G) by @fbusato in #4444
avoid EDG bug by moving diagnostic push & pop out of templates by @ericniebler in #4416
Add host/device/managed mdspan and accessors by @fbusato in #3686
Add cuda::ptx::elect.sync by @fbusato in #4445
Add pointer utilities cuda::is_aligned, cuda::align_up, cuda::align_down, cuda::ptr_rebind by @fbusato in #5037
Add cuda::ceil_ilog2 by @fbusato in #4485
Add cuda::is_power_of_two, cuda::next_power_of_two, cuda::prev_power_of_two by @fbusato in #4627
Add cuda::device::warp_match_all by @fbusato in #4746
Add cuda::static_for by @fbusato in #4855
Improve/cleanup cuda::annotated_ptr implementation by @fbusato in #4503
Add cuda::fast_mod_div (fast modulo division) by @fbusato in #5210
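
For readers unfamiliar with the power-of-two and alignment helpers listed above, the following plain C++ sketch shows the arithmetic their names suggest. Everything in the `sketch` namespace is a hypothetical stand-in: the function names mirror the release notes, but the signatures are assumptions. In particular, it works on integer addresses for simplicity, whereas the entry above describes the libcu++ versions as pointer utilities. It is not the libcu++ implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {  // hypothetical stand-ins, not the libcu++ API

// True when exactly one bit is set.
constexpr bool is_power_of_two(std::uint64_t x) {
  return x != 0 && (x & (x - 1)) == 0;
}

// Smallest power of two >= x. Assumes 1 <= x <= 2^63.
constexpr std::uint64_t next_power_of_two(std::uint64_t x) {
  std::uint64_t p = 1;
  while (p < x) p <<= 1;
  return p;
}

// Largest power of two <= x. Assumes x >= 1.
constexpr std::uint64_t prev_power_of_two(std::uint64_t x) {
  std::uint64_t p = 1;
  while ((p << 1) != 0 && (p << 1) <= x) p <<= 1;
  return p;
}

// Smallest e such that 2^e >= x. Assumes 1 <= x <= 2^63.
constexpr int ceil_ilog2(std::uint64_t x) {
  std::uint64_t p = 1;
  int e = 0;
  while (p < x) {
    p <<= 1;
    ++e;
  }
  return e;
}

// Round an address down to a multiple of a power-of-two alignment.
inline std::uintptr_t align_down(std::uintptr_t addr, std::size_t alignment) {
  assert(is_power_of_two(alignment));
  return addr & ~(static_cast<std::uintptr_t>(alignment) - 1);
}

// Round an address up to a multiple of a power-of-two alignment.
inline std::uintptr_t align_up(std::uintptr_t addr, std::size_t alignment) {
  assert(is_power_of_two(alignment));
  return align_down(addr + alignment - 1, alignment);
}

// True when the address is a multiple of a power-of-two alignment.
inline bool is_aligned(std::uintptr_t addr, std::size_t alignment) {
  assert(is_power_of_two(alignment));
  return (addr & (alignment - 1)) == 0;
}

}  // namespace sketch
```

The bit-mask forms of `align_down` and `is_aligned` only work for power-of-two alignments, which is why the sketch asserts that precondition.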

📝 Documentation

Making extended API documentation slightly more uniform by @fbusato in #4965
Add memory space note to cuda::memory documentation by @fbusato in #5151
Better specify lane_mask::all_active() behavior by @fbusato in #5183

🔄 Other Changes

[CUDAX] Add universal comparison across memory resources by @pciolkosz in #4168
Implement ranges::range_adaptor by @miscco in #4066
Avoiding looping over problem size in individual tests by @oleksandr-pavlyk in #4140
Replace CUB util_arch.cuh macros with inline constexpr variables by @fbusato in #4165
Improves test times for DeviceSegmentedRadixSort by @elstehle in #4156
Simplify Thrust iterator functions by @bernhardmgruber in #4178
Remove _LIBCUDACXX_UNUSED_VAR by @davebayer in #4174
Remove _CCCL_NO_IF_CONSTEXPR by @davebayer in #4187
Implement __fp_native_type_t by @davebayer in #4173
Adds support for large number of segments and large number of items to DeviceSegmentedRadixSort by @elstehle in #3402
Implement inclusive scan in cuda.parallel by @NaderAlAwar in #4147
Remove _CCCL_NO_NOEXCEPT_FUNCTION_TYPE by @davebayer in #4190
Fix not_fn by @miscco in #4186
Remove _CCCL_NTTP_AUTO by @davebayer in #4191
Avoid instantiating discard_iterator while parsing by @bernhardmgruber in #4180
Host/Device accessors for mdspan by @fbusato in #3686
Remove _CCCL_NO_DEDUCTION_GUIDES by @davebayer in #4188
Set NO_CMAKE_FIND_ROOT_PATH for cudax. by @bdice in #4162
Fix build breaking with setuptools by @miscco in #4212
Replaces remaining uses of thrust::{host,device}_vector in our Catch2 tests by @elstehle in #4205
Add check that CXX + CUDA_HOST compilers match when necessary. by @alliepiper in #4201
Disable test on 12.0 CTK by @miscco in #4214
Implement fp properties by @davebayer in #4213
[CUDAX] Separate non-async pinned memory resource into legacy_pinned_memory_resource by @pciolkosz in #4179
Avoid errors in get_device_address tests by @miscco in #4209
Implement extended fp traits by @davebayer in #4211
Remove _CCCL_INLINE_VAR by @davebayer in #4192
Improve host/device mdspan documentation by @fbusato in #4220
Drop _LIBCUDACXX_BEGIN_NAMESPACE_RANGES_ABI by @miscco in #4210
Fix C++ version used in CONTRIBUTING.md by @bernhardmgruber in #4224
Extend tuning documentation by @bernhardmgruber in #4184
Drop tuning params for benchmarks with custom ops by @bernhardmgruber in #4176
Make compiler version comparisons safer by @davebayer in #4185
Document python packages for sol plot script by @bernhardmgruber in #4228
Remove _CCCL_NO_FOLD_EXPRESSIONS by @davebayer in #4189
Remove python/cuda_cooperative/setup.py by @rwgk in #4221
Allow cuda::par*.on() to take cuda::stream_ref by @bernhardmgruber in #4225
Drop _CCCL_NO_VARIABLE_TEMPLATES by @miscco in #4229
Fix typos in cuda mdspan documentation by @fbusato in #4231
Simplify Thrust assign_value by @bernhardmgruber in #4227
Remove double underscore limit macros by @davebayer in #4194
Document deprecations from #4165 by @bernhardmgruber in #4237
Implement __fp_is_subset trait by @davebayer in #4230
Extend tuning verification docs by @bernhardmgruber in #4236
Use [[maybe_unused]] in whole cccl by @davebayer in #4207
Move implementation of cuda::std::array to libcu++ by @davebayer in #4239
Implement __cccl_fp class by @davebayer in #4238
Add transform c parallel implementation by @shwina in #4048
Drop duplicated system header blocks by @miscco in #4245
Exclude sm101 from RDC testing. by @alliepiper in #4247
Make cuda::stream_ref constructible on device by @miscco in #4243
Fix logic in test_segmented_reduce by @oleksandr-pavlyk in #4198
Fix construction of host init value in test_reduce made incorrect after refactoring by @oleksandr-pavlyk in #4251
Refactor fp masks by @davebayer in #4246
Implement views::all by @miscco in #4244
[cudax] incorporate P3557 (constexpr completion signatures) into µstdex by @ericniebler in #3841
Add fixed size segmented reduce by @srinivasyadav18 in #3969
Drop old Readmes and other unused files by @miscco in #4199
Implement fp constants by @davebayer in #4256
[STF] Enable NVHPC in CUDASTF CI by @caugonnet in #3857
[STF] fix type issues in the multi-GPU CG test by @caugonnet in #4260
Allow rapids to avoid unrolling some loops in sort by @miscco in #4253
Implement __fp_neg by @davebayer in #4257
Restore CUB changelog by @miscco in #4263
Drop _CCCL_NODISCARD by @miscco in #4265
Drop unused _CCCL_ALIAS_ATTRIBUTE macro by @miscco in #4266
Drop _CCCL_NO_INLINE_VARIABLES by @miscco in #4267
Allow cccl/c/parallel/unique_by_key.h to be compiled by a C compiler by @oleksandr-pavlyk in #4259
Drop _CCCL_FALLTHROUGH by @davebayer in #4269
Cleanup libcu++ force_include.h test file by @davebayer in #4262
Remove the few remaining _CCCL_NODISCARD qualifiers by @oleksandr-pavlyk in #4274
Fix ratio plot by @gevtushenko in #4099
Drop _CCCL_NORETURN by @davebayer in #4268
fix clang portability issue in __rcvr_with_env_t and remove dead code by @ericniebler in #4277
change version check in type_list.h so that NO clang-19.X compilers try to use pack indexing by @ericniebler in #4278
Fix internal shfl check by @fbusato in #4282
tweak the cccl compiler version check macros to better agree with intuition by @ericniebler in #4279
Implement ranges::single_view by @miscco in #4255
Implement fp overflow handlers by @davebayer in #4261
Drop _LIBCUDACXX_HAS_NO_UNICODE_CHARS by @davebayer in #4295
[Version] Update main to v3.1.0 by @github-actions[bot] in #4175
Fix _LIBCUDACXX_PREFERRED_ALIGNOF definition by @davebayer in #4297
Drop _LIBCUDACXX_HAS_NO_WIDE_CHARACTERS by @davebayer in #4298
[STF] dispatch content of stf.cuh into internal headers by @caugonnet in #4275
Implement <cuda/std/charconv> classes by @davebayer in #4301
Drop _LIBCUDACXX_DEPRECATED_IN_[11|14|17] by @davebayer in #4271
[CUDAX] Remove all_devices.at() and add bounds checks to operator[] by @pciolkosz in #4311
Drop pre-C++11 support in <nv/target> by @miscco in #4299
[STF] Add task_count and stream_to_event_list to the generic context API by @caugonnet in #4313
Implement cuda::uabs by @davebayer in #4292
Fix vectorized loading and storing for warpLoad, warpStore and blockS… by @ChristinaZ in #4283
Add SM120a to <nv/target> by @miscco in #4289
Remove invalid single # in builtin.h by @miscco in #4319
Add multi-dimensional support to block_reduce routines. by @tpn in #4064
Add multi-dimensional support to block_scan routines. by @tpn in #4309
Use more libcu++ includes in thrust by @miscco in #4316
[STF] green_context affinity test by @caugonnet in #4315
[STF] Print a summary of the logical data that were used in a context by @caugonnet in #4314
Update nvhpc to 25.3 and devcontainers to 25.06. by @alliepiper in #4302
Implement equality operators for charconv result types by @davebayer in #4331
Update PTX ld/st by @fbusato in #4324
Remove __void_t by @davebayer in #4333
Rename WarpShuffleResult to warp_shuffle_result by @davebayer in #4332
Disable extended floating-point types for nvc++ by @fbusato in #4340
Deprecate numeric_limits::has_denorm in C++23 by @davebayer in #4344
WAR unused variable warning on gcc9. by @alliepiper in #4348
Remove undefined variable from cmake. by @alliepiper in #4349
Readability, grammar and explanation improvements on CUB Public Tunin… by @gonidelis in #4343
Add ReverseIterator to cuda.parallel by @NaderAlAwar in #4291
Add clang19 to matrix, use latest gcc for cudax. by @alliepiper in #4351
Refactoring ThreadReduce by @fbusato in #3441
Fix inconsistent usage of vsmem helper in c.parallel merge_sort and unique_by_key algorithms by @NaderAlAwar in #4090
fix host/device annotations of the fallback _CCCL_TYPEID implementation for clang cuda by @ericniebler in #4354
Add explanatory image in results analysis part for tuning by @gonidelis in #4369
Fix issue where calling merge_sort on custom types was failing by @NaderAlAwar in #4367
[STF] Fix an iterator type error with an unordered_multimap by @caugonnet in #4374
Fix cccl integer traits by @davebayer in #4329
Move TEST_HAS_NO_EXCEPTIONS to function like macro by @miscco in #4112
Improve cuda::std::distance and friends by @miscco in #4335
Add pytest-benchmarks for cuda_parallel by @shwina in #4357
Native extension to bind to cccl c parallel library by @oleksandr-pavlyk in #4325
[STF] Add support for codes which do not allow exceptions by @caugonnet in #4373
Make MatX CI runnable from Actions tab on arbitrary CCCL tags. by @alliepiper in #4378
Implement selected string manipulation and examination C functions by @davebayer in #4346
Make thread_reduce work with NVHPC by @miscco in #4377
[STF] Fix task dependencies when using tokens by @caugonnet in #4380
Check for windows platform rather than MSVC for aligned_alloc by @miscco in #4371
Add an option to immediately create a point release PR after finalizing PR. by @wmaxey in #4051
Fix autogenerating release notes. by @wmaxey in #4052
Fix version detection in new MatX build functionality. by @alliepiper in #4385
Streamline algorithm class in Cython by @oleksandr-pavlyk in #4384
Migrate cudax tests to c2h. by @alliepiper in #4390
Improve libcu++ tests customization by @miscco in #4193
Improve cuda::std::gcd and cuda::std::lcm implementations by @davebayer in #4399
Move <ratio> implementation to libcu++ by @davebayer in #4398
Add transform python wrappers by @shwina in #4320
Move <span> implementation to libcudacxx by @davebayer in #4400
cuda.parallel: cmake build script to avoid using find_program by @oleksandr-pavlyk in #4382
Implement P0466R5 from C++20 by @davebayer in #4383
Avoid compiling CPU-only code in benchmarks. by @alliepiper in #4375
[CUDAX] rename wait() to sync() in various types. by @pciolkosz in #4379
Implement std::counted_iterator by @miscco in #4288
[STF] Temporarily disable the algorithm construct by @caugonnet in #4403
fix basic_any on nvhpc when not compiling as CUDA by @ericniebler in #4405
[STF] Honor CCCL_DISABLE_NVTX and NVTX_DISABLE in STF by @caugonnet in #4413
‼️ Fix failing CCCL Infra jobs on main, fix failing nightlies, plug PR coverage gap. by @alliepiper in #4402
Split Python test jobs by @shwina in #4391
add _CCCL_UNREACHABLE after returning uses of NV_DISPATCH_TARGET by @ericniebler in #4417
Fix uninitialized read in local atomic code path. by @wmaxey in #4352
Add some missing jobs to the nightly CI matrix. by @alliepiper in #4414
[STF] Simpler token API by @caugonnet in #4430
Remove extra semicolons in Thrust by @hwabis in #4426
Implement a reverse output iterator in cuda.parallel by @NaderAlAwar in #4342
Add _CCCL_NO_SPECIALIZATIONS attribute by @davebayer in #4432
Update CI overview documentation. by @alliepiper in #4437
Cleanup the definition of max_align_t by @miscco in #4436
restrict use of NV_IF_TARGET in char_traits<char>::length to nvcc by @ericniebler in #4406
Attempt to recover from upstream OOM in disjoint_pool. by @alliepiper in #4420
Missing header in <cuda/bit> by @fbusato in #4439
Update c.parallel testing (C2H, header tests) by @alliepiper in #4404
Enable device assertions in CUDA debug mode by @fbusato in #4444
[STF] Refactor CUDASTF allocators by @andralex in #4306
Move libcudacxx endian macros to cccl by @davebayer in #4429
c/parallel should be built with CUB_DISABLE_CDP by @oleksandr-pavlyk in #4422
Drop cuSpatial from RAPIDS builds. by @bdice in #4453
Add PTX elect.sync by @fbusato in #4445
cuda.parallel: Exclude allocation times from pytest-benchmarks + add struct benchmarks by @shwina in #4418
Add dynamic CUB dispatch for radix_sort by @NaderAlAwar in #4135
[CUDAX] Fix uninitialized context pointers in streamGetCtx_v2 by @pciolkosz in #4454
Improve _CCCL_ASSUME by @fbusato in #4456
Improve some thrust iterators by @miscco in #4461
Refactor cuda::std::popcount by @davebayer in #4434
Modernize cuda::std::complex by @davebayer in #4448
Drop invalid relative includes. by @miscco in #4468
Deprecate more Thrust facilities in favor of libcu++ ones by @miscco in #4334
Fix the local atomic uninitialized read test when built against small archs by @wmaxey in #4440
Use new channel for RAPIDS notifs by @alliepiper in #4476
Actually use new channel for RAPIDS failures. by @alliepiper in #4478
add a visitation interface to the senders of ustdex by @ericniebler in #4466
Make per-PR RAPIDS builds opt-in by @alliepiper in #4477
Fix ceil_div behavior with nvc++ and constexpr in device code by @fbusato in #4467
Add _CCCL_PURE attribute by @fbusato in #4446
cuda::bitmask should have a default type by @fbusato in #4484
Refactor cuda::std::rot* by @davebayer in #4488
Use cudaStream_t for thrust::device.on(...). by @alliepiper in #4451
cuda.parallel: Check compiled code for LDL/STL instructions in tests by @shwina in #4472
Add py.typed marker for cuda.cccl per PEP-0561 by @oleksandr-pavlyk in #4482
Add Radix Sort Implementation for c.parallel by @NaderAlAwar in #4350
Update CUB dispatch layer documentation with new example by @NaderAlAwar in #4281
Fix cudax regression on main by @alliepiper in #4498
Bump nvbench SHA to bring in some fixes on newer libraries. by @alliepiper in #4497
Fix __nv_pure__ compatibility by @fbusato in #4499
__has_unique_object_representations is supported by MSVC by @fbusato in #4494
Add Thrust CMake example with flexible device system, update docs by @alliepiper in #4500
Improve [[gnu::*]] attribute detection by @davebayer in #4502
Remove __tuple_element_t by @davebayer in #4501
Change streaming algorithms to use operator+= instead of operator+ by @oleksandr-pavlyk in #4428
fix spelling of clang's -Wno-unknown-cuda-version switch by @ericniebler in #4504
Add Python wrappers for c.parallel radix_sort API by @NaderAlAwar in #4353
Add missing SASS testing changes to radix_sort by @NaderAlAwar in #4508
c.parallel: device wrappers as code, not format strings by @griwes in #3439
Fixes empty and single-item inputs for DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #4459
Missing include in iterator_facade_category header by @gonidelis in #4512
Clarify p2p native atomic support docs by @jrhemstad in #4510
Add ceil_ilog2 by @fbusato in #4485
[Docs] Clarifies that init_val is not applied to block_aggregate in BlockScan by @elstehle in #4515
Refactor cuda::std::countl_* by @davebayer in #4469
[STF] Support dynamic dependencies in the cuda_kernel construct and document cuda_kernel by @caugonnet in #4490
Set execution status of CUB device functions to error code by @oleksandr-pavlyk in #4511
Disable constexpr test for gcc14 by @miscco in #4517
Replace cub::detail with cub::internal by @fbusato in #4441
move [[nodiscard]] before __device__ to make clang happy by @ericniebler in #4522
Replace cub::internal with cub::detail by @fbusato in #4521
Refactor cuda::std::countr_* by @davebayer in #4487
Implement cuda::isqrt by @davebayer in #4427
Improve _CCCL_UNREACHABLE by @fbusato in #4443
Implement internal constexpr cstring functions by @davebayer in #4450
Implement cuda::std::to_chars for integers by @davebayer in #4330
Add -Wextra-semi to warnings we are building with by @miscco in #4435
Cleanup more macro definitions by @miscco in #4411
Mark libcu++ algorithms with _CCCL_EXEC_CHECK_DISABLE by @miscco in #4471
Fixes rst-style comments in BlockScan by @elstehle in #4520
Fix main by @davebayer in #4528
Add support for large num_segments to DeviceSegmentedReduce with fixed segment size by @srinivasyadav18 in #4366
Add internal __num_bits_v trait by @fbusato in #4293
Avoid deprecated CUDART usage. by @alliepiper in #4505
Adds support for large number of items to DeviceRunLengthEncode::Encode by @elstehle in #4442
Fix invalid license by @davebayer in #4527
Fix stubs for DeviceMergeSortBuildResult, DeviceUniqueByKeyBuildResult by @oleksandr-pavlyk in #4480
Implement views::counted by @miscco in #4408
Update to RAPIDS 25.06 by @bdice in #4455
Fixup trailing whitespace in release-update-rc.yml by @wmaxey in #4550
Exclude cudastf stress tests from CI. by @alliepiper in #4547
Implement cuda::std::char_traits by @davebayer in #4525
Remove <cuda/std/__cuda/chrono.h> by @davebayer in #4557
Rework <cuda/std/ctime> by @davebayer in #4555
Fix race in decoupled lookback test harness. by @alliepiper in #4556
do not use detail _NV_EVAL macro from <nv/target> by @ericniebler in #4560
Cudax cleanups by @ericniebler in #4561
CUB merge algorithms: avoid OOB access and improve compile time. by @alliepiper in #4548
Improve/cleanup annotated_ptr implementation by @fbusato in #4503
make run_loop lock-free and usable from device code by @ericniebler in #4523
Implement ranges::iota_view by @miscco in #4559
[c.parallel]: clean-up in test_utils.h by @oleksandr-pavlyk in #4544
Ensure that we are actually calling the cuda APIs ... by @miscco in #4570
[CUDAX] Remove caching allocator and non-async memory resources from async_buffer tests by @pciolkosz in #4563
Use device functions that accept pointer arguments in ccc.cl and cuda.parallel by @shwina in #4249
fix some portability issues with the cudax async tests by @ericniebler in #4577
Implement cuda::ipow by @davebayer in #4558
[CUDAX] Adjust some of the async_buffer interfaces by @pciolkosz in #4585
Improve CUDA macros by @davebayer in #4553
Implement shuffle_iterator iterator type by @djns99 in #4564
Switch cuCtxCreate to cuDevicePrimaryCtxRetain in cub and libcu++ tests by @pciolkosz in #4594
Add domain support and make all algorithms customizable with domain-based dispatch by @ericniebler in #4578
Replace calls to CUDA runtime occupancy with launcher_factory.MaxSmOccupancy() by @NaderAlAwar in #4602
Exclude gcc-11, gcc-10 from annotated_ptr constexpr test by @fbusato in #4595
use forward and friends from std:: to leverage compiler optimizations by @ericniebler in #4431
Update license of CTK files in libcu++ by @fbusato in #4613
Improve access_property and annotated_ptr documentation by @fbusato in #4580
Move NVTX to libcu++ and add support for Thrust by @gonidelis in #4537
Always bypass automatic atomic storage checks to prevent potential compiler issues by @PointKernel in #4586
Drop host STL includes in CUB if there are libcu++ alternatives by @miscco in #4619
Pass cached_segment by span by @miscco in #4618
Add tests to ensure that we can pass vocabulary types that contain [[no_unique_address]] to a kernel by @miscco in #4620
[CUDAX] Remove assign and execution policy from async_buffer by @pciolkosz in #4604
Guard <nv/target> bits from C contexts by @wmaxey in #4625
Replace use of __CUDACC__ with _CCCL_CUDA_COMPILATION() by @davebayer in #4587
Implement ranges::transform_view by @miscco in #4568
Simplify _CubLog by @davebayer in #4632
Adopt test for NVRTC properly implementing the line builtin. by @miscco in #4634
C2H fixes by @alliepiper in #4536
Bump nvbench SHA. by @alliepiper in #4535
Add _CCCL_LOG_CUDA_API, improve cuda_error reporting by @alliepiper in #4588
Migrate CUB's %PARAM% parsing logic to CCCL to enable reuse by other projects. by @alliepiper in #4576
Make sccache error non-fatal in CI scripts. by @alliepiper in #4638
cuda::std::errc should be an alias to std::errc by @davebayer in #4639
Refactor attributes by @davebayer in #4633
Backport reference_wrapper traits by @davebayer in #4642
move support for environments from cuda::experimental to cuda::std::execution by @ericniebler in #4584
Migrate away from docker-out-of-docker CI pattern. by @alliepiper in #4637
Maintenance/c parallel tests build caching by @oleksandr-pavlyk in #4609
Implement cuda::std::string_view by @davebayer in #4541
enable [[no_unique_address]] for clang on c++20 by @ericniebler in #4646
Turn cuda::std::iter_swap into a CPO to avoid ADL fiasco by @miscco in #4641
Ensure that construct_at optimization uses our special narrowing handling by @miscco in #4534
Add weekly compute-sanitizer CI jobs for CUB by @alliepiper in #4571
Introduce scan_op support to cuda.coop block_scan module. by @tpn in #4628
relocate ustdex within cudax by @ericniebler in #4626
User-friendly pow2 functions derived from std/bit by @fbusato in #4627
rename start_on and continue_on to starts_on and continues_on per WG21 by @ericniebler in #4647
Add CUDA toolkit macros by @davebayer in #4630
Update PTX ISA Version for CUDA 12.9 by @fbusato in #4656
Bump NVBench to bring in entropy fixes. by @alliepiper in #4654
Refactor part of <cuda/std/type_traits> by @davebayer in #4648
Bring in more NVBench stopping criterion fixes. by @alliepiper in #4661
Fix wrong function argument in for_each_in_extents::dynamic_kernel by @miscco in #4653
Make transform iterator utility in c parallel test suite by @oleksandr-pavlyk in #4645
replace the simplistic eager customization mechanism with proper apply/transform_sender by @ericniebler in #4657
Implement ranges::take_while_view by @miscco in #4640
Reduce the use of __CUDA_ARCH__ by @davebayer in #4589
Allow mdspan header tests for msvc in C++17 by @davebayer in #4667
[CUDAX] Add launch transform to async_buffer by @pciolkosz in #4605
[CUDAX] Fix launch priority option type by @pciolkosz in #4669
add the schedule_from algorithm, make continues_on lower to it by @ericniebler in #4658
disable execution space checks for cuda::std::exchange by @ericniebler in #4670
fix the spelling of the _CCCL_PREFERED_NAME macro by @ericniebler in #4672
Implement cuda::neg by @davebayer in #4567
Make cuda::get_device_address work with C++ compilers by @davebayer in #4572
improved diagnostics for cuda::experimental::execution by @ericniebler in #4673
Change definition of _CCCL_NODISCARD_FRIEND by @miscco in #4668
Improve defence against the external macros by @davebayer in #4635
Use ugly attribute names in public headers by @davebayer in #4675
Disable NVTX tests for NVHPC in C++20 by @miscco in #4686
Extend CUB DeviceSegmentedReduce API with fixed segment size to support all operators by @srinivasyadav18 in #4549
Add missing prologue/epilogue includes by @davebayer in #4683
rename cudax::uninit to cudax::no_init for better readability by @ericniebler in #4690
Remove Apple paths from libcu++ by @davebayer in #4693
disable the execution-space checks for the generic environment utilities by @ericniebler in #4692
cuda.parallel: Fix handling of duplicate LTOIRs by @shwina in #4698
[STF] Improvements for the cached fifo allocator and misc improvements by @caugonnet in #4703
Build and test python wheels in CI by @shwina in #4679
Improve compiler checks on CMake 3.31+. by @alliepiper in #4710
Add missing include to move algorithms by @miscco in #4712
Enable chrono literals from C++20 by @davebayer in #4696
Remove __cccl_timespec_t by @davebayer in #4694
[STF] Ensure we generate CUDA graphs which always have the same topology by @caugonnet in #4705
Implement cuda::std::string_view constructors from ranges by @davebayer in #4677
simple wrapper types for cudaGraph_t, cudaGraphNode_t, and cudaGraphExec_t by @ericniebler in #4680
Move histogram kernels to nvrtc compilable header by @NaderAlAwar in #4614
disable execution space warnings for all of µstdex's generic facilities by @ericniebler in #4727
change the adaptors to only forward queries specified as "forwarding" by @ericniebler in #4725
c.parallel: reuse CUB agent policies for reduce by @griwes in #4286
Introduce temp storage alignment awareness to cuda.cooperative. by @tpn in #4729
Fix typo in agent_batch_memcpy.cuh comment. by @brycelelbach in #4730
Use list init for test data in iterator docs by @bernhardmgruber in #4738
Globalize the include of <cuda_runtime_api.h> by @davebayer in #4704
Ensure include order of insert_nested_NVTX_range_guard via clang-format by @bernhardmgruber in #4741
[CUDAX] Add in_place_type argument to pass-through constructor of shared resource by @pciolkosz in #4714
Bump CI to CTK 12.9, regen devcontainers. by @alliepiper in #4624
Cuda parallel test add mark large by @oleksandr-pavlyk in #4723
move forwarding_query to cuda/std/__execution/env.h by @ericniebler in #4743
turn off execution space checks for unique_ptr by @ericniebler in #4732
Make device_reference<T>::operator= const by @bernhardmgruber in #4740
Add variadic ctor and CTAD to zip_iterator by @bernhardmgruber in #4113
Add explicit documentation for cuda::is_floating_point by @bernhardmgruber in #4749
Simplify thrust::cuda_cub::swap_ranges by @bernhardmgruber in #4182
Move get_stream_t to libcu++ by @miscco in #4737
install ca-certificates into devcontainer by @shwina in #4753
Add python jobs to nightly workflow by @shwina in #4720
Host incrementable iterator approach 2 by @oleksandr-pavlyk in #4697
Split Optimize Warp Reduce PR - libcu++ part by @fbusato in #4715
Split Optimize Warp Reduce PR - CUB part by @fbusato in #4716
Fix cuda.coop limitation preventing user-defined types when items_per_thread > 1 in block scan module. by @tpn in #4756
make it possible to get the status code from a cuda_error exception object by @ericniebler in #4731
Do not use open-coded INFINITY for tests that also test extended floating points by @miscco in #4752
Port thrust::discard_iterator by @miscco in #4717
Drop cmake workarounds for nvcc < 12 by @bernhardmgruber in #4754
Add dynamic CUB dispatch for histogram by @NaderAlAwar in #4636
Move get_memory_resource into libcu++ by @miscco in #4742
Port thrust::transform_iterator to cuda by @miscco in #4718
Add thrust::transform_n by @bernhardmgruber in #4750
Add workflow to build and test all Python wheels by @shwina in #4721
Update CI to NVHPC 25.5 by @alliepiper in #4763
Use cuda.bindings.path_finder in cuda.parallel wheel by @rwgk in #4735
Clear CUDA error state after a failure by @davebayer in #4759
Small refactorings in Thrust CUDA by @bernhardmgruber in #4764
Implement ranges::repeat_view by @miscco in #4666
change sync_wait to never call make_exception_ptr from device code by @ericniebler in #4734
test the return value of forwarding_query(Tag{}) in the __forwarding_query concept by @ericniebler in #4766
fix two issues with transform_sender by @ericniebler in #4770
port the let_value tests over from stdexec by @ericniebler in #4771
Install suggested build environment for pyenv by @shwina in #4781
Remove thrust from python dependency list by @shwina in #4788
fix broken cudax build due to an invalid expression in sync_wait error path by @ericniebler in #4787
fix the _CCCL_API macro family for NVHPC by @ericniebler in #4777
factor common code out of schedule_from and continues_on by @ericniebler in #4774
Add missing ForceInclusive tag in exclusive.scan benchmark source by @gonidelis in #4792
Use proper qualification in allocate.h by @miscco in #4796
Add missing #pragma once to headers to prevent multiple inclusions by @PointKernel in #4789
Align bulk copies to 16 bytes on Blackwell by @bernhardmgruber in #4778
Fully qualify calls in cuda:: and cuda::device:: namespaces by @davebayer in #4798
avoid double-wrapping receivers in __rcvr_ref by @ericniebler in #4775
Retry calls to apt update/install to WAR network issues. by @alliepiper in #4800
Segmented reduce to reuse CUB's tuning policy by @oleksandr-pavlyk in #4745
Fix define headers on libcudacxx according to new path names by @gonidelis in #4803
fix the late-bound customization of the continues_on algorithm by @ericniebler in #4779
de-duplication, reuse, naming conventions, and copyrights by @ericniebler in #4795
Improve checking for prologue/epilogue code wrapping by @davebayer in #4802
Reenable __APPLE__ for pthread detection. by @miscco in #4805
Port thrust::counting_iterator to cuda by @miscco in #4780
Add NVTX nests guard back in CUB unit test conditionally based on Thrust entries by @gonidelis in #4583
Replace invalid use of _CCCL_HAS_CUDA_COMPILER() by @davebayer in #4684
Increase bytes in flight for B200 to 64KiB by @bernhardmgruber in #4790
Make sure that cuda iterators play nicely with the thrust system and traversal machinery by @miscco in #4806
Remove -G/-g/-lineinfo from ptx-json tests. by @alliepiper in #4813
fix cudax's vector_add example that was broken by #4795 by @ericniebler in #4814
Check cuda::memcpy_async preconditions by @davebayer in #4700
Replace _CCCL_NO_CONCEPTS with _CCCL_HAS_CONCEPTS() by @davebayer in #4809
Unify BabelStream benchmarks and make nstream consistent by @bernhardmgruber in #4782
Run CCCL infra tests when example projects may have changed. by @alliepiper in #4816
Add cuda::narrow(from) by @bernhardmgruber in #4784
Refactor Thrust select_system by @bernhardmgruber in #4762
Port thrust::strided_iterator to cuda by @miscco in #4808
Refactor Thrust internal_functional by @bernhardmgruber in #4810
cuda.parallel: Skip SASS verification for complex input in scan tests by @shwina in #4838
[CUDAX] Add default properties for resources and add properties deduction to make_async_buffer by @pciolkosz in #4617
Implement cuda::device::lane_mask by @davebayer in #4804
Add a workflow to upload wheels to PyPI by @cryos in #4839
add a __query_or_default function for querying an environment with a fallback value by @ericniebler in #4841
Fix lane_mask documentation by @fbusato in #4854
Create "packaging" CI jobs, distinct from CCCL core. by @alliepiper in #4843
Speedup runtime of c/parallel/test/test_radix_sort.cpp by @oleksandr-pavlyk in #4848
Ensuring CTK minor version compatibility for cccl.c.parallel by @oleksandr-pavlyk in #4851
Add address_space and is_address_from to cuda::device:: by @davebayer in #4797
Refactor thrust::minimum_type|minimum_system by @bernhardmgruber in #4042
Add CUB_ENABLE_LAUNCH_VARIANTS to toggle lid_1/2 variants. by @alliepiper in #4860
Use cuda::ptx::get_sreg_laneid instead of plain asm by @davebayer in #4862
Add {std, ranges}::min and {std, ranges}::min_element to algorithm by @miscco in #4783
Add test to ensure that we are properly copying mdspan around by @miscco in #4760
implement the proposed resolution of P3718 by adding a get_domain_late query. by @ericniebler in #4864
[CUDAX] Add make_async_buffer overload for each constructor by @pciolkosz in #4856
upgrade the completion_signatures machinery and add tests by @ericniebler in #4863
Add simple kernel for deterministic reduction by @SAtacker in #2234
Update pip packages to include colorama by @gonidelis in #4872
avoid the use of [[no_unique_address]] in prop and env on nvcc by @ericniebler in #4871
incidental fixes for ustdex by @ericniebler in #4873
using the "magic_get" trick to infer a type's structured binding size by @ericniebler in #4875
Env-based API for CUB part 1/3 by @gevtushenko in #4874
add a minimally functional execution context for CUDA streams by @ericniebler in #4579
Make cuda::stream_ref an env for itself by @miscco in #4878
Port functional_placeholders_logical Thrust test to Catch2 by @bernhardmgruber in #4882
[CUDAX] Remove circular dependency from the resource concept by @pciolkosz in #4852
Env-based API for CUB part 2/3 by @gevtushenko in #4876
strip -G from clangd command line for all-dev debug build by @ericniebler in #4884
Disable bulk copy transform on sm120 by @bernhardmgruber in #4870
Default kernel launcher factory indirection by @gevtushenko in #4890
Use random data for heterogeneous cub::DeviceTransform test by @bernhardmgruber in #4883
Port thrust::constant_iterator to cuda by @miscco in #4812
Use cuda::std::type_identity instead of identity-like types by @davebayer in #4893
Fix inspect_changes exclusions. by @alliepiper in #4885
Drop cuda::std::__identity by @bernhardmgruber in #4887
Refactor subdir checks to fix CI issue. by @alliepiper in #4895
take stream scheduler tests out of matrix until I figure out what is going wrong by @ericniebler in #4902
work around for stream_context defaulted constructor bug in nvcc-12.0 by @ericniebler in #4903
Fix segfault when compiling env by @gevtushenko in #4891
Improve code and coverage of DeviceFor::ForEachInExtents by @fbusato in #4664
Replace bool_constant by if constexpr in agent_scan by @bernhardmgruber in #4880
Refactor radix sort onesweep dispatch by @bernhardmgruber in #4868
Fix _CountOneBits when building against MSVC older than 14.31. by @wmaxey in #4906
do not constexpr cast to enum a value that is outside the enum's range by @ericniebler in #4905
Avoid warning in cuda::ilog10 by @miscco in #4908
Fix NVTX related comments by @bernhardmgruber in #4909
Generate a version from git/JSON for packages by @cryos in #4889
Reorganize cub::DeviceTransform tests by @bernhardmgruber in #4899
Readd int64 offset tests for DeviceTransform by @bernhardmgruber in #4914
Fix async buffer example by @gevtushenko in #4916
Combine cuda_{parallel,coop,cccl} into a single package by @shwina in #4910
Update Fixed Size Segmented Reduce benchmark by @srinivasyadav18 in #4913
cccl/c: Refactor the NVRTC build list helper by @wmaxey in #4907
Fix RadixEncoder operator() signature for radix sort by @davidwendt in #4921
Fix build-and-test-python-wheels workflow by @shwina in #4926
We only have one wheel to release now by @cryos in #4924
[CUDAX] Remove default device argument from stream and device_memory_resource constructor by @pciolkosz in #4915
Simplify cudax transform test by @bernhardmgruber in #4927
[CUDAX] Add sm_120 arch traits by @pciolkosz in #4931
Improve RFA PR 2234 by @srinivasyadav18 in #4888
Fix unused parameter issue caught by nightlies. by @alliepiper in #4941
Add init value test for RFA by @srinivasyadav18 in #4942
Log MatX SHA in builds. by @alliepiper in #4940
Fix ValueError encountered when running test_device_reduce on machine without CTK installed by @oleksandr-pavlyk in #4932
bring the design of the cudax execution policies in line with C++17 by @ericniebler in #4937
Env-based API for CUB part 3/3 by @gevtushenko in #4877
Fix documentation typo: s/BlockRadixSort/BlockRunLengthDecode/. by @tpn in #4943
fix the CUDA stream scheduler by @ericniebler in #4933
[CUDAX] Introduce driver stack checking macro and apply it to device, event and stream tests by @pciolkosz in https://github.com/NVIDIA/cccl/pull/4934
Remove jinja2 dependency from cuda.cooperative. by @tpn in https://github.com/NVIDIA/cccl/pull/4946
[CUDAX] Fix cub cudax example after default device removal by @pciolkosz in https://github.com/NVIDIA/cccl/pull/4950
Fix RFA dispatch template parameters by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/4951
Improve device_accessor memory space check by @fbusato in https://github.com/NVIDIA/cccl/pull/4840
Fix cuda::warp_match_all test case by @fbusato in https://github.com/NVIDIA/cccl/pull/4963
Infra cleanup, prep for artifacts by @alliepiper in https://github.com/NVIDIA/cccl/pull/4929
Drop dead code in Thrust reduce by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4969
Use env in RFA tests by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/4948
[CUDAX] Rename attr to attribute in device APIs by @pciolkosz in https://github.com/NVIDIA/cccl/pull/4964
replace _LIBCUDACXX_HIDE_FROM_ABI with _CCCL_API inline by @ericniebler in https://github.com/NVIDIA/cccl/pull/4936
Retry configure step when CPM hits network issues in CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/4956
constexpr-ify cuda::experimental::execution by @ericniebler in https://github.com/NVIDIA/cccl/pull/4962
make cudax::stream_ref a scheduler by @ericniebler in https://github.com/NVIDIA/cccl/pull/4952
Enable custom msvc multiarch builds in CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/4978
Use atomicAdd_block in device histogram by @gonidelis in https://github.com/NVIDIA/cccl/pull/4973
is_nothrow_destructible_v should use the builtin when it is available by @ericniebler in https://github.com/NVIDIA/cccl/pull/4979
nvcc-12.0 seems happier with __host__ __device__ lambdas by @ericniebler in https://github.com/NVIDIA/cccl/pull/4980
fix the syntax for Catch2 test tags for cudax::execution by @ericniebler in https://github.com/NVIDIA/cccl/pull/4982
Support more arguments to CCCL_PP_SPLICE_WITH by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4972
Add support for sm110 to nv/target by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4987
Add potential search path for cccl headers in potential layout by @wmaxey in https://github.com/NVIDIA/cccl/pull/4990
Drop _LIBCUDACXX_CONSTRUCT_AT by @miscco in https://github.com/NVIDIA/cccl/pull/4998
Do not use an anonymous union with optional by @miscco in https://github.com/NVIDIA/cccl/pull/4997
Modularize to_chars tests by @davebayer in https://github.com/NVIDIA/cccl/pull/4904
Update to RAPIDS 25.08. by @bdice in https://github.com/NVIDIA/cccl/pull/5008
[CUDAX] Add sm_103 traits by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5006
Port thrust::tabulate_output_iterator to cuda by @miscco in https://github.com/NVIDIA/cccl/pull/4879
Avoid deprecated cudaGetDriverEntryPoint by @miscco in https://github.com/NVIDIA/cccl/pull/5010
Fix incorrect argument name in thrust openMP cmake file by @miscco in https://github.com/NVIDIA/cccl/pull/5004
Refactor thrust::sequential::sort by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4925
Implement ranges::take_view by @miscco in https://github.com/NVIDIA/cccl/pull/4867
Split Optimize WarpReduce PR - Part3 c2h by @fbusato in https://github.com/NVIDIA/cccl/pull/4842
Use ptx::elect_sync in ublkcp transform kernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5014
improved try/catch portability macros by @ericniebler in https://github.com/NVIDIA/cccl/pull/4986
[FEA] expose std::uniform_int_distribution in libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/4410
Fix debug check in cuda::ptx::shfl_sync_* by @fbusato in https://github.com/NVIDIA/cccl/pull/5016
Add load-bearing semicolon for MSVC in openMP sort by @miscco in https://github.com/NVIDIA/cccl/pull/5024
Improve compile-time of c2h generators_vector by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5023
Update docs in device_radix_sort.cuh by @davidwendt in https://github.com/NVIDIA/cccl/pull/5021
Enable H100 for c.parallel and python tests. by @griwes in https://github.com/NVIDIA/cccl/pull/4999
[CUDAX] Remove "get_" prefix from member functions by @pciolkosz in https://github.com/NVIDIA/cccl/pull/4984
Extend nightly SM build coverage by @alliepiper in https://github.com/NVIDIA/cccl/pull/4949
add the bulk, bulk_chunked, and bulk_unchunked sender adaptors by @ericniebler in https://github.com/NVIDIA/cccl/pull/4989
[CUDAX] Prototype implementation of path_builder that can build paths in a graph and implementation of launch accepting it by @pciolkosz in https://github.com/NVIDIA/cccl/pull/4758
change relative include to system include in .../__execution/stream/continues_on.cuh by @ericniebler in https://github.com/NVIDIA/cccl/pull/5042
Replace vector by inplace_vector in tests by @davebayer in https://github.com/NVIDIA/cccl/pull/4944
Implement std::fma by @miscco in https://github.com/NVIDIA/cccl/pull/5029
rename the get_domain_late query to get_domain_override per WG21 by @ericniebler in https://github.com/NVIDIA/cccl/pull/5043
[CUDAX] Add an event constructor taking a device_ref by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5035
Refactor around thrust::vector by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5044
[cudax] Fix cudax compilation with gcc 9 by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5048
Handle upcoming vector type change by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5036
Reduce memory usage of the random distribution tests by @miscco in https://github.com/NVIDIA/cccl/pull/5052
[CUDAX] Fix parentheses in one of the launch overloads by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5058
Minor updates to the cuda::iterators by @miscco in https://github.com/NVIDIA/cccl/pull/5054
Apply remove_cvref in thrust::is_contiguous_iterator and refactor all uses by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5050
Port std::modf and std::fmod by @miscco in https://github.com/NVIDIA/cccl/pull/5047
thrust::cuda::pinned_memory_resource should dispatch to the host system by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5045
try to make it so that we allocate on the heap instead of a stack array by @miscco in https://github.com/NVIDIA/cccl/pull/5060
Fix formatting in CONTRIBUTING.md by @pauleonix in https://github.com/NVIDIA/cccl/pull/5062
Replace cg::memcpy_async in memcpy_async transform kernel by custom implementation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4976
Provide cuda::static_for by @fbusato in #4855
Enable cuda::std::string_view tests in libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/4894
Refactor iterator concepts to use our new concept emulation by @miscco in https://github.com/NVIDIA/cccl/pull/5059
Add vectorized cub::DeviceTransform algorithm by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4815
Replace CG by TMA copy in bulk copy fallback path by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5061
Refactor DeviceTransform implementation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5071
Unconditionally enable async copy transform kernels by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5069
Port thrust::transform_output_iterator to cuda by @miscco in https://github.com/NVIDIA/cccl/pull/5051
Move math builtins into the respective header file by @miscco in https://github.com/NVIDIA/cccl/pull/5075
Implement std::remainder and std::remquo by @miscco in https://github.com/NVIDIA/cccl/pull/5070
Modularize our complex implementation by @miscco in https://github.com/NVIDIA/cccl/pull/5076
Docs nitpick by @gonidelis in https://github.com/NVIDIA/cccl/pull/5079
refactor opstates and receivers to shorten mangled names by @ericniebler in https://github.com/NVIDIA/cccl/pull/5065
Relax constraints for gpu_to_gpu determinism by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/4981
avoid a mysterious codegen issue with llvm18 by simplifying the transform for the stream bulk senders by @ericniebler in https://github.com/NVIDIA/cccl/pull/5087
refactor the starts_on algorithm for shorter symbol length by @ericniebler in https://github.com/NVIDIA/cccl/pull/5088
promote the write_attrs sender adaptor by @ericniebler in https://github.com/NVIDIA/cccl/pull/5089
give cudax::stream_ref the opt-in for satisfying the scheduler concept by @ericniebler in https://github.com/NVIDIA/cccl/pull/5090
[libcudacxx] Add EXEC_CHECK_DISABLE in to_address implementation by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5086
Implement format_parse_context and format_error by @davebayer in https://github.com/NVIDIA/cccl/pull/4939
Modularize optional by @miscco in https://github.com/NVIDIA/cccl/pull/5080
Fix thrust::make_discard_iterator by @miscco in https://github.com/NVIDIA/cccl/pull/5093
make the stream sender adaptor work with non-visitable senders by @ericniebler in https://github.com/NVIDIA/cccl/pull/5091
[CUDAX] Switch access control API to use a span of device_refs by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5084
get the starts_on algorithm working with the stream scheduler by @ericniebler in https://github.com/NVIDIA/cccl/pull/5092
[CUDAX] Migrate copy and fill to use driver API and add driver stack checks in memory_resource and async_buffer tests by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5073
sync_wait should decay copy the value results by @ericniebler in https://github.com/NVIDIA/cccl/pull/5107
add the execution::on sender adapter by @ericniebler in https://github.com/NVIDIA/cccl/pull/5097
fix let_value and friends to work when the function returns a dependent sender by @ericniebler in https://github.com/NVIDIA/cccl/pull/5105
Skip unnecessary fence in DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5102
fix visibility problem in invoke, correct the spelling of "invocable" globally by @ericniebler in https://github.com/NVIDIA/cccl/pull/5106
change the defn of __query_result_or_t to not require _Query to b… by @ericniebler in https://github.com/NVIDIA/cccl/pull/5109
Small fixes and improvements to DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5078
Avoid usage of _CCCL_NO_UNIQUE_ADDRESS for cuda iterators by @miscco in https://github.com/NVIDIA/cccl/pull/5110
Add docs for cuda.cccl.parallel and cuda.cccl.cooperative by @shwina in https://github.com/NVIDIA/cccl/pull/5095
Added f32/fp64 specializations for complex exp function. by @s-oboyle in https://github.com/NVIDIA/cccl/pull/4928
Fix link to Python docs in cccl docs index page by @shwina in https://github.com/NVIDIA/cccl/pull/5115
add _CCCL_DECLSPEC_EMPTY_BASES as an AttributeMacro to .clang-format by @ericniebler in https://github.com/NVIDIA/cccl/pull/5123
[CUB] Tests DeviceScan for invalid values passed to the custom reduction operator by @pauleonix in https://github.com/NVIDIA/cccl/pull/5085
Avoid more upcoming deprecation warnings on vector types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5114
Add framework for C2H tests in libcudacxx. by @alliepiper in https://github.com/NVIDIA/cccl/pull/5101
Simplify the type of write_env's receiver and make write_env pipeable by @ericniebler in https://github.com/NVIDIA/cccl/pull/5108
Fix unqualified call to __unwrap_iter by @miscco in https://github.com/NVIDIA/cccl/pull/5117
cuda:: pointer utilities by @fbusato in #5037
avoid return type deduction in the execution queries by @ericniebler in https://github.com/NVIDIA/cccl/pull/5096
[CUDAX] Add id() getter to stream_ref by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5132
make value_types_of_t and error_types_of_t work with non-variadic templates by @ericniebler in https://github.com/NVIDIA/cccl/pull/5134
make cuda::std::__tuple work with members of reference type by @ericniebler in https://github.com/NVIDIA/cccl/pull/5129
Fix RAPIDS CI jobs by @trxcllnt in https://github.com/NVIDIA/cccl/pull/5072
Try to avoid instantiating timespec_get as that might or might not be available on android CTKs by @miscco in https://github.com/NVIDIA/cccl/pull/5128
Port thrust::permutation_iterator by @miscco in https://github.com/NVIDIA/cccl/pull/4835
Update transform iterator example to use a non-quadratic sequence by @shwina in https://github.com/NVIDIA/cccl/pull/5131
Add devcontainer postAttachCommand for GitHub Codespaces by @trxcllnt in https://github.com/NVIDIA/cccl/pull/5133
Add cub::DeviceReduce::Sum Env-based API by @gonidelis in https://github.com/NVIDIA/cccl/pull/4985
Avoid deprecation warning in the libcu++ extended vector types tests by @miscco in https://github.com/NVIDIA/cccl/pull/5135
Test DeviceTransform with more overaligned types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5139
the stream implementation of continues_on is using a moved-from receiver by @ericniebler in https://github.com/NVIDIA/cccl/pull/5150
RAPIDS CI update to CUDA 12.9 by @jakirkham in https://github.com/NVIDIA/cccl/pull/5104
Make some member functions of inplace_vector static by @miscco in https://github.com/NVIDIA/cccl/pull/5149
Refactor thrust generic sequence by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5156
Try, fail and ignore to guarantee dynamic SMEM alignment on Hopper by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5122
Test more unaligned inputs in DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5111
Enhance our deprecation machinery so that we also suppress the right nvcc warnings by @miscco in https://github.com/NVIDIA/cccl/pull/5138
[CUB] Tests DeviceScan with primitive type for invalid values being passed to the scan operator by @pauleonix in https://github.com/NVIDIA/cccl/pull/5141
fix ODR violation making cudax launch tests flaky by @ericniebler in https://github.com/NVIDIA/cccl/pull/5161
Replace mdspan/extents.h - __count_dynamic with a template variable by @fbusato in https://github.com/NVIDIA/cccl/pull/5168
Update doc errors in set_operations.h by @akifcorduk in https://github.com/NVIDIA/cccl/pull/5177
[pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in https://github.com/NVIDIA/cccl/pull/4365
Refactor out large offset size calculation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5175
unify doxygen predefined macros in repo.toml by @ericniebler in https://github.com/NVIDIA/cccl/pull/5162
Optionally use PostgreSQL for benchmark data by @gevtushenko in https://github.com/NVIDIA/cccl/pull/5184
Refactor thrust cuda replace by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5180
Implement thrust::transform[_if]_n in the generic system by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5182
[CUDAX] Use stream_id instead of unsigned long long as stream_ref::id() return type by @davebayer in https://github.com/NVIDIA/cccl/pull/5146
Adds debug info to large problem test helper by @elstehle in https://github.com/NVIDIA/cccl/pull/5187
[CUB] Fix BlockScan documentation by @pauleonix in https://github.com/NVIDIA/cccl/pull/5189
[CUDAX] Make basic tests work on Windows by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5188
Fix overflow in offset calculation in transform kernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5176
[STF] Rename task_fence to fence, and graph epochs to graph stages by @caugonnet in https://github.com/NVIDIA/cccl/pull/5200
Use shared memory pointer instead of offset in UBLKCP transform kernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5173
finalize the design of the launch transform API by @ericniebler in https://github.com/NVIDIA/cccl/pull/5153
Fix cudax compilation with upcoming CTK by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5202
[STF] Use #ifndef _CCCL_DOXYGEN_INVOKED instead of @cond NEVER_DOCUMENT by @caugonnet in https://github.com/NVIDIA/cccl/pull/5211
Remove CDP (RDC) architecture filtering logic from Thrust/CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5206
Fix exit code capture under set -e by @alliepiper in https://github.com/NVIDIA/cccl/pull/5213
Add options for fixing failures in release generation by @wmaxey in https://github.com/NVIDIA/cccl/pull/5194
[STF] Simplify the visit pattern used in context.cuh using the ->* operator by @caugonnet in https://github.com/NVIDIA/cccl/pull/5212
Replace cuda version checks with _CCCL_CTK_XXX() macro by @davebayer in https://github.com/NVIDIA/cccl/pull/5204
Remove duplicated entry in ptx docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5222
Tuning rules by @gevtushenko in https://github.com/NVIDIA/cccl/pull/5195
Remove _LIBCUDACXX_EXTERN_TEMPLATE and _LIBCUDACXX_BUILDING_LIBRARY macros by @davebayer in https://github.com/NVIDIA/cccl/pull/5230
Don't mention C++ 11 and 14 in more places by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5201
Enable RDC tests on MSVC by default by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5214
Rebalance items per thread in LDGSTS/UBLKCP transform kernels by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5112
Migrate Thrust transform tests to Catch2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5197
[STF] Keep dependency event names in DOT by @caugonnet in https://github.com/NVIDIA/cccl/pull/5235
[STF] Cleanup how we setup allocators in algorithms by @caugonnet in https://github.com/NVIDIA/cccl/pull/5220
Use arch=native in benchmark/tuning presets/docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5216
Re-enable UBLKCP transform kernel on sm120 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5223
use constexpr std::exception if it is available by @ericniebler in https://github.com/NVIDIA/cccl/pull/5221
[CUDAX] Uglify driver API header and remove CUDAX prefix from the driver function getter by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5219
[CUDAX] Rename device type to physical_device by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5208
[STF] Cleanup how we specify edge type in DOT output by @caugonnet in https://github.com/NVIDIA/cccl/pull/5232
Warn when the traditional MSVC preprocessor is used by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5172
Make cuda::ptx available in cuda::device as a namespace alias by @davebayer in https://github.com/NVIDIA/cccl/pull/5241
Add missing _CCCL_HEADER_TEST definitions to public header tests by @davebayer in https://github.com/NVIDIA/cccl/pull/5242
Do not enable __float128 support on device for clang-cuda or NVHPC by @miscco in https://github.com/NVIDIA/cccl/pull/5254
Disable assertions for QNX, they do not provide the machinery with their libc by @miscco in https://github.com/NVIDIA/cccl/pull/5253
Support types with any alignment in UBLKCP transform kernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5178
Refactor benchmark of conditional algorithms by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5237
Proclaim copyable_args in nvbench_helper.cu by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5247
Make sure that nested tuple and pair have the expected size by @miscco in https://github.com/NVIDIA/cccl/pull/5246
Add CONSTEXPR_STEPS: option to lit config by @davebayer in https://github.com/NVIDIA/cccl/pull/5229
Implement thrust::swap_ranges via transform in CUDA system by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5243
Skip init of temp vectors in CUB test launch helpers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5260
Add missing prologue / epilogue includes to <cuda/ptx> by @miscco in https://github.com/NVIDIA/cccl/pull/5261
[CUDAX->libcu++] Move driver_api header and testing header to libcudacxx by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5255
[STF] Separately display freeze and unfreeze operations in DOT by @caugonnet in https://github.com/NVIDIA/cccl/pull/5234
Add missed specializations of the new aligned vector types to cub by @miscco in https://github.com/NVIDIA/cccl/pull/5264
[CUDAX] Refactor arch traits to be more structured and support arch-specific targets by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5064
Improve driver api implementation by @davebayer in https://github.com/NVIDIA/cccl/pull/5272
Implement tuple protocol for nvfp vector types by @davebayer in https://github.com/NVIDIA/cccl/pull/5218
Fix failing warning suppression for nvrtc by @miscco in https://github.com/NVIDIA/cccl/pull/5278
Expose Fast Modulo Division in libcu++ by @fbusato in #5210
[CUDAX->libcu++] Move device APIs to libcu++ by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5279
[Backport branch/3.1.x] [CUDAX->libcu++] Move ensure_current_device to libcu++ and change the name to ensure_current_context by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5309
[Backport branch/3.1.x] [CUDAX->libcu++] Move stream and event from cudax to libcu++ by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5322
[Backport branch/3.1.x] Remove mentions of CUDA experimental that sneaked into libcu++ by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5326
[Backport branch/3.1.x] Add a macro to disable PDL by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5329
[Backport branch/3.1.x] Skip zero values in fast_mod_div unit test by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5337
[Backport branch/3.1.x] [libcu++] Deprecate default stream_ref constructor and fix the last few usages by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5334
[Backport branch/3.1.x] Add gitlab devcontainers by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5342
[Backport branch/3.1.x] Fix nvrtc when more than one CTK include directory is available by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5356
[Backport 3.1.x] is_address_from fixes (#5349) by @fbusato in https://github.com/NVIDIA/cccl/pull/5363
[Backport branch/3.1.x] Diagnose missing numeric_limits specialization in DeviceReduce Min/Max by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5376
[Backport branch/3.1.x] [CUDAX->libcu++] Expose fill_bytes and copy_bytes in libcudacxx by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5403
Backport #5442 to 3.1.x by @shwina in https://github.com/NVIDIA/cccl/pull/5476
[Backport branch/3.1.x] NV_TARGET and cuda::ptx for CTK 13 by @fbusato in https://github.com/NVIDIA/cccl/pull/5474
Backport to 3.1: c.parallel: enable UBLKCP in transform (#4847) and Move TMA barrier in DeviceTransform into dynamic SMEM (#5414) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5457
[BACKPORT 3.1]: Replace address space intrinsics with cuda::device::is_address_from (#4866) by @miscco in https://github.com/NVIDIA/cccl/pull/5465
[Backport branch/3.1.x] Fix grid dependency sync in cub::DeviceMergeSort by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5460
[Backport branch/3.1.x] move basic_any from cudax to libcudacxx by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5459
[backport -> 3.1.x][libcu++] Rename memory resource concepts to indicate asynchronous allocations are the default ones (#5313) by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5492
[Backport branch/3.1.x] [libcu++] Remove experimental memory resource define check from around the concept, properties and the query. by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5501
[Backport branch/3.1.x] Add SM_110a for non-supporting compilers by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5503
Backport to 3.1: NVTX ranges for C2H, NVTX as system headers, and handle NVTX being disabled in C2H by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5519
[Backport branch/3.1.x] [libcu++] Rename resource_ref to match the new async by default naming by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5545
[Backport branch/3.1.x] [CUDAX] Rename type-erased memory resource wrappers by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5547
[Backport branch/3.1.x] PR #5396 and #5566 by @elstehle in https://github.com/NVIDIA/cccl/pull/5611
Backport to 3.1: Update cuda/ptx instructions to support all new SM architectures in CTK 13 (#5600) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5612
[Backport branch/3.1.x] Fixes thrust::unique for non-const equality_op by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5656
[BACKPORT 3.1]: Update PTX ISA version for CUDA 13 (#5676) by @miscco in https://github.com/NVIDIA/cccl/pull/5699
[Backport branch/3.1.x] Fix thrust::malloc for void by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5718
[BACKPORT 3.1]: Fix problematic clang attribute namespace (#5748) by @miscco in https://github.com/NVIDIA/cccl/pull/5756
[Backport 3.1]: Work around submdspan compiler issue on MSVC (#5885) by @miscco in https://github.com/NVIDIA/cccl/pull/5902
[Backport branch/3.1.x] Ignore -Wmaybe-uninitialized in dispatch_reduce.cuh. by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5936
Detect QNX for atomics support by @miscco in https://github.com/NVIDIA/cccl/pull/5962
[BACKPORT 3.1] Use forward declarations of extended floating point types instead of including the headers (#5846) by @miscco in https://github.com/NVIDIA/cccl/pull/5978
[Backport 3.1] Backport iterator fixes by @miscco in https://github.com/NVIDIA/cccl/pull/5977
[Backport branch/3.1.x] [libcu++] Switch to use cuGetProcAddress to get driver functions by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6011
[Backport branch/3.1.x] Enable __grid_constant__ with clang-cuda-20 and nvrtc by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6013
[Backport branch/3.1.x] Fix libcu++ compilation with clang-20 by @davebayer in https://github.com/NVIDIA/cccl/pull/5985
[Backport branch/3.1.x] Fix throwing functions marked as noexcept by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6052
[Backport 3.1]: add missing InitT tparam to specialization of DispatchSegmentedReduce (#6048) by @miscco in https://github.com/NVIDIA/cccl/pull/6054
[Backport 3.1]: Fix addressof shadowing issue with libc++ (#6032) by @miscco in https://github.com/NVIDIA/cccl/pull/6053
Backport to 3.1: Do not require int128 in for_each_canceled (#5822) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6057
[Backport branch/3.1.x] Fix nvc++ 25.9 with format_parse_context tests by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6059
Backport to 3.1: Add SM_110 arch traits by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6071
[Backport 3.1]: Change PARALLEL_LEVEL default from nproc to nproc-1 in build_common.sh (#6046) by @miscco in https://github.com/NVIDIA/cccl/pull/6055
[Backport to 3.1] Fix dereferencing nullptr in thrust::device_reference by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6002
[Backport to 3.1] add a specialization of __make_tuple_types for complex<T> (#6102) by @davebayer in https://github.com/NVIDIA/cccl/pull/6116
[Backport to 3.1] Remove iterator workarounds for lack of operator+= (#6094) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6115
[Backport 3.1] Fix imports from cudax to libcu++ (#6105) by @davebayer in https://github.com/NVIDIA/cccl/pull/6144
[Backport branch/3.1.x] Implement operator<< for cuda::std::string_view by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6148
[Backport branch/3.1.x] [libcu++] Fix blocks per SM in arch traits by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6187
[Backport to 3.1]: Backport bad bad alloc by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6197
[Backport 3.1]: Work around NVRTC bug with virtual default ctors/dtors (#5704) by @miscco in https://github.com/NVIDIA/cccl/pull/6193
[Backport 3.1] Cache device name and peers (#6110) by @davebayer in https://github.com/NVIDIA/cccl/pull/6145
[Backport 3.1] Replace CUDA Runtime calls with Driver calls in libcu++ by @davebayer in https://github.com/NVIDIA/cccl/pull/6211
[Backport 3.1]: [CUB] Replace several direct uses of __clz (#6099) by @miscco in https://github.com/NVIDIA/cccl/pull/6202
[Backport branch/3.1.x] Add missing sm121 to nv/target and CUB tests by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6210
[Backport 3.1] Backport recent <cuda/device> changes by @davebayer in https://github.com/NVIDIA/cccl/pull/6215
[Backport 3.1] Backport PRs #4591, #6176, #6201 and #6006 by @miscco in https://github.com/NVIDIA/cccl/pull/6222
[Backport 3.1] Backport #6184 and #6224 by @davebayer in https://github.com/NVIDIA/cccl/pull/6228
[Backport 3.1] Backport #5305 and #6093 by @davebayer in https://github.com/NVIDIA/cccl/pull/6232

New Contributors

@hwabis made their first contribution in #4426
@SAtacker made their first contribution in #2234
@jakirkham made their first contribution in https://github.com/NVIDIA/cccl/pull/5104
@akifcorduk made their first contribution in https://github.com/NVIDIA/cccl/pull/5177

Full Changelog: v3.0.3...v3.1.0