What's Changed
🚀 Thrust / CUB
- [Thrust] Perform asynchronous allocations by default for the `par_nosync` policy by @brycelelbach in #4204
- [Thrust] `reduce_into` by @brycelelbach in #4355
- Enable Catch2 tests in Thrust by @bernhardmgruber in #2669
- Add memcpy_async transform kernel for Ampere by @bernhardmgruber in #2394
- Allow default-initializing and skipping initialization of Thrust vectors by @bernhardmgruber in #4183
- Add thrust::strided_iterator and a step for thrust::counting_iterator by @bernhardmgruber in #4014
- Add new WarpReduce overloadings by @fbusato in #3884
- Optimize ThreadReduce by @fbusato in #3441
📚 Libcudacxx
- Enable device assertions in CUDA debug mode `nvcc -G` by @fbusato in #4444
- avoid EDG bug by moving diagnostic push & pop out of templates by @ericniebler in #4416
- Add host/device/managed mdspan and accessors by @fbusato in #3686
- Add cuda::ptx::elect.sync by @fbusato in #4445
- Add pointer utilities cuda::is_aligned, cuda::align_up, cuda::align_down, cuda::ptr_rebind by @fbusato in #5037
- Add cuda::ceil_ilog2 by @fbusato in #4485
- Add cuda::is_power_of_two, cuda::next_power_of_two, cuda::prev_power_of_two by @fbusato in #4627
- Add cuda::device::warp_match_all by @fbusato in #4746
- Add cuda::static_for by @fbusato in #4855
- Improve/cleanup cuda::annotated_ptr implementation by @fbusato in #4503
- Add cuda::fast_mod_div Fast Modulo Division by @fbusato in #5210
📝 Documentation
- Making extended API documentation slightly more uniform by @fbusato in #4965
- Add memory space note to `cuda::memory` documentation by @fbusato in #5151
- Better specify `lane_mask::all_active()` behavior by @fbusato in #5183
🔄 Other Changes
- [CUDAX] Add universal comparison across memory resources by @pciolkosz in #4168
- Implement `ranges::range_adaptor` by @miscco in #4066
- Avoiding looping over problem size in individual tests by @oleksandr-pavlyk in #4140
- Replace CUB `util_arch.cuh` macros with `inline constexpr` variables by @fbusato in #4165
- Improves test times for `DeviceSegmentedRadixSort` by @elstehle in #4156
- Simplify Thrust iterator functions by @bernhardmgruber in #4178
- Remove `_LIBCUDACXX_UNUSED_VAR` by @davebayer in #4174
- Remove `_CCCL_NO_IF_CONSTEXPR` by @davebayer in #4187
- Implement `__fp_native_type_t` by @davebayer in #4173
- Adds support for large number of segments and large number of items to `DeviceSegmentedRadixSort` by @elstehle in #3402
- Implement inclusive scan in cuda.parallel by @NaderAlAwar in #4147
- Remove `_CCCL_NO_NOEXCEPT_FUNCTION_TYPE` by @davebayer in #4190
- Fix `not_fn` by @miscco in #4186
- Remove `_CCCL_NTTP_AUTO` by @davebayer in #4191
- Avoid instantiating discard_iterator while parsing by @bernhardmgruber in #4180
- Host/Device accessors for `mdspan` by @fbusato in #3686
- Remove `_CCCL_NO_DEDUCTION_GUIDES` by @davebayer in #4188
- Set NO_CMAKE_FIND_ROOT_PATH for cudax. by @bdice in #4162
- Fix build breaking with setuptools by @miscco in #4212
- Replaces remaining uses of `thrust::{host,device}_vector` in our Catch2 tests by @elstehle in #4205
- Add check that CXX + CUDA_HOST compilers match when necessary. by @alliepiper in #4201
- Disable test on 12.0 CTK by @miscco in #4214
- Implement fp properties by @davebayer in #4213
- [CUDAX] Separate non-async pinned memory resource into legacy_pinned_memory_resource by @pciolkosz in #4179
- Avoid errors in `get_device_address` tests by @miscco in #4209
- Implement extended fp traits by @davebayer in #4211
- Remove `_CCCL_INLINE_VAR` by @davebayer in #4192
- Improve host/device mdspan documentation by @fbusato in #4220
- Drop `_LIBCUDACXX_BEGIN_NAMESPACE_RANGES_ABI` by @miscco in #4210
- Fix C++ version used in CONTRIBUTING.md by @bernhardmgruber in #4224
- Extend tuning documentation by @bernhardmgruber in #4184
- Drop tuning params for benchmarks with custom ops by @bernhardmgruber in #4176
- Make compiler version comparisons safer by @davebayer in #4185
- Document python packages for sol plot script by @bernhardmgruber in #4228
- Remove `_CCCL_NO_FOLD_EXPRESSIONS` by @davebayer in #4189
- Remove python/cuda_cooperative/setup.py by @rwgk in #4221
- Allow cuda::par*.on() to take cuda::stream_ref by @bernhardmgruber in #4225
- Drop `_CCCL_NO_VARIABLE_TEMPLATES` by @miscco in #4229
- Fix typos in cuda mdspan documentation by @fbusato in #4231
- Simplify Thrust assign_value by @bernhardmgruber in #4227
- Remove double underscore limit macros by @davebayer in #4194
- Document deprecations from #4165 by @bernhardmgruber in #4237
- Implement `__fp_is_subset` trait by @davebayer in #4230
- Extend tuning verification docs by @bernhardmgruber in #4236
- Use `[[maybe_unused]]` in whole cccl by @davebayer in #4207
- Move implementation of `cuda::std::array` to libcu++ by @davebayer in #4239
- Implement `__cccl_fp` class by @davebayer in #4238
- Add transform c parallel implementation by @shwina in #4048
- Drop duplicated system header blocks by @miscco in #4245
- Exclude sm101 from RDC testing. by @alliepiper in #4247
- Make `cuda::stream_ref` constructible on device by @miscco in #4243
- Fix logic in test_segmented_reduce by @oleksandr-pavlyk in #4198
- Add new `WarpReduce` overloadings by @fbusato in #3884
- Fix construction of host init value in test_reduce made incorrect after refactoring by @oleksandr-pavlyk in #4251
- Refactor fp masks by @davebayer in #4246
- Implement `views::all` by @miscco in #4244
- [cudax] incorporate P3557 (constexpr completion signatures) into µstdex by @ericniebler in #3841
- Add fixed size segmented reduce by @srinivasyadav18 in #3969
- Drop old Readmes and other unused files by @miscco in #4199
- Implement fp constants by @davebayer in #4256
- [STF] Enable NVHPC in CUDASTF CI by @caugonnet in #3857
- [STF] fix type issues in the multi-GPU CG test by @caugonnet in #4260
- Allow rapids to avoid unrolling some loops in sort by @miscco in #4253
- Implement `__fp_neg` by @davebayer in #4257
- Restore CUB changelog by @miscco in #4263
- Drop `_CCCL_NODISCARD` by @miscco in #4265
- Drop unused `_CCCL_ALIAS_ATTRIBUTE` macro by @miscco in #4266
- Drop `_CCCL_NO_INLINE_VARIABLES` by @miscco in #4267
- Change to allow cccl/c/parallel/unique_by_key.h to compile by C compiler by @oleksandr-pavlyk in #4259
- Drop `_CCCL_FALLTHROUGH` by @davebayer in #4269
- Cleanup libcu++ `force_include.h` test file by @davebayer in #4262
- Remove few remaining qualifiers _CCCL_NODISCARD by @oleksandr-pavlyk in #4274
- Fix ratio plot by @gevtushenko in #4099
- Drop `_CCCL_NORETURN` by @davebayer in #4268
- fix clang portability issue in `__rcvr_with_env_t` and remove dead code by @ericniebler in #4277
- change version check in `type_list.h` so that NO clang-19.X compilers try to use pack indexing by @ericniebler in #4278
- Fix internal `shfl` check by @fbusato in #4282
- tweak the cccl compiler version check macros to better agree with intuition by @ericniebler in #4279
- Implement `ranges::single_view` by @miscco in #4255
- Implement fp overflow handlers by @davebayer in #4261
- Drop `_LIBCUDACXX_HAS_NO_UNICODE_CHARS` by @davebayer in #4295
- [Version] Update main to v3.1.0 by @github-actions[bot] in #4175
- Fix `_LIBCUDACXX_PREFERRED_ALIGNOF` definition by @davebayer in #4297
- Drop `_LIBCUDACXX_HAS_NO_WIDE_CHARACTERS` by @davebayer in #4298
- [STF] dispatch content of stf.cuh into internal headers by @caugonnet in #4275
- Implement `<cuda/std/charconv>` classes by @davebayer in #4301
- Drop `_LIBCUDACXX_DEPRECATED_IN_[11|14|17]` by @davebayer in #4271
- [CUDAX] Remove all_devices.at() and add bounds checks to operator[] by @pciolkosz in #4311
- Drop pre C++11 support in `<nv/target>` by @miscco in #4299
- [STF] Add task_count and stream_to_event_list to the generic context API by @caugonnet in #4313
- Implement `cuda::uabs` by @davebayer in #4292
- Fix vectorized loading and storing for warpLoad, warpStore and blockS… by @ChristinaZ in #4283
- Add SM120a to `<nv/target>` by @miscco in #4289
- Remove invalid single `#` in builtin.h by @miscco in #4319
- Add multi-dimensional support to block_reduce routines. by @tpn in #4064
- Add multi-dimensional support to block_scan routines. by @tpn in #4309
- Use more libcu++ includes in thrust by @miscco in #4316
- [STF] green_context affinity test by @caugonnet in #4315
- [STF] Print a summary of the logical data that were used in a context by @caugonnet in #4314
- Update nvhpc to 25.3 and devcontainers to 25.06. by @alliepiper in #4302
- Implement equality operators for charconv result types by @davebayer in #4331
- Update PTX `ld/st` by @fbusato in #4324
- Remove `__void_t` by @davebayer in #4333
- Rename `WarpShuffleResult` to `warp_shuffle_result` by @davebayer in #4332
- Disable extended floating-point types for nvc++ by @fbusato in #4340
- Deprecate `numeric_limits::has_denorm` in C++23 by @davebayer in #4344
- WAR unused variable warning on gcc9. by @alliepiper in #4348
- Remove undefined variable from cmake. by @alliepiper in #4349
- Readability, grammar and explanation improvements on CUB Public Tunin… by @gonidelis in #4343
- Add ReverseIterator to cuda.parallel by @NaderAlAwar in #4291
- Add clang19 to matrix, use latest gcc for cudax. by @alliepiper in #4351
- Refactoring `ThreadReduce` by @fbusato in #3441
- Fix inconsistent usage of vsmem helper in c.parallel merge_sort and unique_by_key algorithms by @NaderAlAwar in #4090
- fix host/device annotations of the fallback `_CCCL_TYPEID` implementation for clang cuda by @ericniebler in #4354
- Add explanatory image in results analysis part for tuning by @gonidelis in #4369
- Fix issue where calling merge_sort on custom types was failing by @NaderAlAwar in #4367
- [STF] Fix an iterator type error with an unordered_multimap by @caugonnet in #4374
- Fix cccl integer traits by @davebayer in #4329
- Move `TEST_HAS_NO_EXCEPTIONS` to function like macro by @miscco in #4112
- Improve `cuda::std::distance` and friends by @miscco in #4335
- Add pytest-benchmarks for cuda_parallel by @shwina in #4357
- Native extension to bind to cccl c parallel library by @oleksandr-pavlyk in #4325
- [STF] Add support for codes which do not allow exceptions by @caugonnet in #4373
- Make MatX CI runnable from Actions tab on arbitrary CCCL tags. by @alliepiper in #4378
- Implement selected string manipulation and examination C functions by @davebayer in #4346
- Make `thread_reduce` work with NVHPC by @miscco in #4377
- [STF] Fix task dependencies when using tokens by @caugonnet in #4380
- Check for windows platform rather than MSVC for `aligned_alloc` by @miscco in #4371
- Add an option to immediately create a point release PR after finalizing PR. by @wmaxey in #4051
- Fix autogenerating release notes. by @wmaxey in #4052
- Fix version detection in new MatX build functionality. by @alliepiper in #4385
- Streamline algorithm class in Cython by @oleksandr-pavlyk in #4384
- Migrate cudax tests to c2h. by @alliepiper in #4390
- Improve libcu++ tests customization by @miscco in #4193
- Improve `cuda::std::gcd` and `cuda::std::lcm` implementations by @davebayer in #4399
- Move `<ratio>` implementation to libcu++ by @davebayer in #4398
- Add transform python wrappers by @shwina in #4320
- Move `<span>` implementation to libcudacxx by @davebayer in #4400
- cuda.parallel: cmake build script to avoid using find_program by @oleksandr-pavlyk in #4382
- Implement P0466R5 from C++20 by @davebayer in #4383
- Avoid compiling CPU-only code in benchmarks. by @alliepiper in #4375
- [CUDAX] rename wait() to sync() in various types. by @pciolkosz in #4379
- Implement `std::counted_iterator` by @miscco in #4288
- [STF] Temporarily disable the algorithm construct by @caugonnet in #4403
- fix `basic_any` on nvhpc when not compiling as CUDA by @ericniebler in #4405
- [STF] Honor CCCL_DISABLE_NVTX and NVTX_DISABLE in STF by @caugonnet in #4413
- ‼️ Fix failing CCCL Infra jobs on `main`, fix failing nightlies, plug PR coverage gap. by @alliepiper in #4402
- Split Python test jobs by @shwina in #4391
- add `_CCCL_UNREACHABLE` after returning uses of `NV_DISPATCH_TARGET` by @ericniebler in #4417
- Fix uninitialized read in local atomic code path. by @wmaxey in #4352
- Add some missing jobs to the nightly CI matrix. by @alliepiper in #4414
- [STF] Simpler token API by @caugonnet in #4430
- Remove extra semicolons in Thrust by @hwabis in #4426
- Implement a reverse output iterator in cuda.parallel by @NaderAlAwar in #4342
- Add `_CCCL_NO_SPECIALIZATIONS` attribute by @davebayer in #4432
- Update CI overview documentation. by @alliepiper in #4437
- Cleanup the definition of `max_align_t` by @miscco in #4436
- restrict use of `NV_IF_TARGET` in `char_traits<char>::length` to nvcc by @ericniebler in #4406
- Attempt to recover from upstream OOM in disjoint_pool. by @alliepiper in #4420
- Missing header in `<cuda/bit>` by @fbusato in #4439
- Update c.parallel testing (C2H, header tests) by @alliepiper in #4404
- Enable device assertions in CUDA debug mode by @fbusato in #4444
- [STF] Refactor CUDASTF allocators by @andralex in #4306
- Move libcudacxx endian macros to cccl by @davebayer in #4429
- c/parallel should be built with CUB_DISABLE_CDP by @oleksandr-pavlyk in #4422
- Drop cuSpatial from RAPIDS builds. by @bdice in #4453
- Add PTX `elect.sync` by @fbusato in #4445
- cuda.parallel: Exclude allocation times from pytest-benchmarks + add struct benchmarks by @shwina in #4418
- Add dynamic CUB dispatch for radix_sort by @NaderAlAwar in #4135
- [CUDAX] Fix uninitialized context pointers in streamGetCtx_v2 by @pciolkosz in #4454
- Improve `_CCCL_ASSUME` by @fbusato in #4456
- Improve some thrust iterators by @miscco in #4461
- Refactor `cuda::std::popcount` by @davebayer in #4434
- Modernize `cuda::std::complex` by @davebayer in #4448
- Drop invalid relative includes. by @miscco in #4468
- Deprecate more `Thrust` facilities in favor of `libcu++` ones by @miscco in #4334
- Fix the local atomic uninitialized read test when built against small archs by @wmaxey in #4440
- Use new channel for RAPIDS notifs by @alliepiper in #4476
- Actually use new channel for RAPIDS failures. by @alliepiper in #4478
- add a visitation interface to the senders of ustdex by @ericniebler in #4466
- Make per-PR RAPIDS builds opt-in by @alliepiper in #4477
- Fix `ceil_div` behavior with nvc++ and `constexpr` in device code by @fbusato in #4467
- Add `_CCCL_PURE` attribute by @fbusato in #4446
- `cuda::bitmask` should have a default type by @fbusato in #4484
- Refactor `cuda::std::rot*` by @davebayer in #4488
- Use `cudaStream_t` for `thrust::device.on(...)`. by @alliepiper in #4451
- cuda.parallel: Check compiled code for LDL/STL instructions in tests by @shwina in #4472
- Add py.typed marker for cuda.cccl per PEP-0561 by @oleksandr-pavlyk in #4482
- Add Radix Sort Implementation for c.parallel by @NaderAlAwar in #4350
- Update CUB dispatch layer documentation with new example by @NaderAlAwar in #4281
- Fix cudax regression on main by @alliepiper in #4498
- Bump nvbench SHA to bring in some fixes on newer libraries. by @alliepiper in #4497
- Fix `__nv_pure__` compatibility by @fbusato in #4499
- `__has_unique_object_representations` is supported by MSVC by @fbusato in #4494
- Add Thrust CMake example with flexible device system, update docs by @alliepiper in #4500
- Improve `[[gnu::*]]` attribute detection by @davebayer in #4502
- Remove `__tuple_element_t` by @davebayer in #4501
- Change streaming algorithms to use operator+= from using operator+ by @oleksandr-pavlyk in #4428
- fix spelling of clang's `-Wno-unknown-cuda-version` switch by @ericniebler in #4504
- Add Python wrappers for c.parallel radix_sort API by @NaderAlAwar in #4353
- Add missing SASS testing changes to radix_sort by @NaderAlAwar in #4508
- c.parallel: device wrappers as code, not format strings by @griwes in #3439
- Fixes empty and single-item inputs for `DeviceRunLengthEncode::NonTrivialRuns` by @elstehle in #4459
- Missing include in iterator_facade_category header by @gonidelis in #4512
- Clarify p2p native atomic support docs by @jrhemstad in #4510
- Add `ceil_ilog2` by @fbusato in #4485
- [Docs] Clarifies that `init_val` is not applied to `block_aggregate` in `BlockScan` by @elstehle in #4515
- Refactor `cuda::std::countl_*` by @davebayer in #4469
- [STF] Support dynamic dependencies in the cuda_kernel construct and document cuda_kernel by @caugonnet in #4490
- Set execution status of CUB device functions to error code by @oleksandr-pavlyk in #4511
- Disable constexpr test for gcc14 by @miscco in #4517
- Replace `cub::detail` with `cub::internal` by @fbusato in #4441
- move `[[nodiscard]]` before `__device__` to make clang happy by @ericniebler in #4522
- Replace `cub::internal` with `cub::detail` by @fbusato in #4521
- Refactor `cuda::std::countr_*` by @davebayer in #4487
- Implement `cuda::isqrt` by @davebayer in #4427
- Improve `_CCCL_UNREACHABLE` by @fbusato in #4443
- Implement internal constexpr cstring functions by @davebayer in #4450
- Implement `cuda::std::to_chars` for integers by @davebayer in #4330
- Add `-Wextra-semi` to warnings we are building with by @miscco in #4435
- Cleanup more macro definitions by @miscco in #4411
- Mark libcu++ algorithms with `_CCCL_EXEC_CHECK_DISABLE` by @miscco in #4471
- Fixes rst-style comments in `BlockScan` by @elstehle in #4520
- Fix main by @davebayer in #4528
- Add support for large num_segments to `DeviceSegmentedReduce` with fixed segment size by @srinivasyadav18 in #4366
- Add internal `__num_bits_v` trait by @fbusato in #4293
- Avoid deprecated CUDART usage. by @alliepiper in #4505
- Adds support for large number of items to `DeviceRunLengthEncode::Encode` by @elstehle in #4442
- Fix invalid license by @davebayer in #4527
- Fix stubs for DeviceMergeSortBuildResult, DeviceUniqueByKeyBuildResult by @oleksandr-pavlyk in #4480
- Implement `views::counted` by @miscco in #4408
- Update to RAPIDS 25.06 by @bdice in #4455
- Fixup trailing whitespace in release-update-rc.yml by @wmaxey in #4550
- Exclude cudastf stress tests from CI. by @alliepiper in #4547
- Implement `cuda::std::char_traits` by @davebayer in #4525
- Remove `<cuda/std/__cuda/chrono.h>` by @davebayer in #4557
- Rework `<cuda/std/ctime>` by @davebayer in #4555
- Fix race in decoupled lookback test harness. by @alliepiper in #4556
- do not use detail `_NV_EVAL` macro from `<nv/target>` by @ericniebler in #4560
- Cudax cleanups by @ericniebler in #4561
- CUB merge algorithms: avoid OOB access and improve compile time. by @alliepiper in #4548
- Improve/cleanup `annotated_ptr` implementation by @fbusato in #4503
- make `run_loop` lock-free and usable from device code by @ericniebler in #4523
- Implement `ranges::iota_view` by @miscco in #4559
- [c.parallel]: clean-up in test_utils.h by @oleksandr-pavlyk in #4544
- Ensure that we are actually calling the cuda APIs ... by @miscco in #4570
- [CUDAX] Remove caching allocator and non-async memory resources from async_buffer tests by @pciolkosz in #4563
- Use device functions that accept pointer arguments in ccc.cl and cuda.parallel by @shwina in #4249
- fix some portability issues with the cudax async tests by @ericniebler in #4577
- Implement `cuda::ipow` by @davebayer in #4558
- [CUDAX] Adjust some of the async_buffer interfaces by @pciolkosz in #4585
- Improve CUDA macros by @davebayer in #4553
- Implement `shuffle_iterator` iterator type by @djns99 in #4564
- Switch cuCtxCreate to cuDevicePrimaryCtxRetain in cub and libcu++ tests by @pciolkosz in #4594
- Add domain support and make all algorithms customizable with domain-based dispatch by @ericniebler in #4578
- Replace calls to CUDA runtime occupancy with launcher_factory.MaxSmOccupancy() by @NaderAlAwar in #4602
- Exclude gcc-11, gcc-10 from `annotated_ptr` constexpr test by @fbusato in #4595
- use `forward` and friends from `std::` to leverage compiler optimizations by @ericniebler in #4431
- Update license of CTK files in libcu++ by @fbusato in #4613
- Improve `access_property` and `annotated_ptr` documentation by @fbusato in #4580
- Move NVTX to libcu++ and add support for Thrust by @gonidelis in #4537
- Always bypass automatic atomic storage checks to prevent potential compiler issues by @PointKernel in #4586
- Drop host STL includes in CUB if there are libcu++ alternatives by @miscco in #4619
- Pass `cached_segment` by `span` by @miscco in #4618
- Add tests to ensure that we can pass vocabulary types that contain `[[no_unique_address]]` to a kernel by @miscco in #4620
- [CUDAX] Remove assign and execution policy from async_buffer by @pciolkosz in #4604
- Guard <nv/target> bits from C contexts by @wmaxey in #4625
- Replace use of `__CUDACC__` with `_CCCL_CUDA_COMPILATION()` by @davebayer in #4587
- Implement `ranges::transform_view` by @miscco in #4568
- Simplify `_CubLog` by @davebayer in #4632
- Adopt test for NVRTC properly implementing the line builtin. by @miscco in #4634
- C2H fixes by @alliepiper in #4536
- Bump nvbench SHA. by @alliepiper in #4535
- Add _CCCL_LOG_CUDA_API, improve cuda_error reporting by @alliepiper in #4588
- Migrate CUB's %PARAM% parsing logic to CCCL to enable reuse by other projects. by @alliepiper in #4576
- Make sccache error non-fatal in CI scripts. by @alliepiper in #4638
- `cuda::std::errc` should be an alias to `std::errc` by @davebayer in #4639
- Refactor attributes by @davebayer in #4633
- Backport `reference_wrapper` traits by @davebayer in #4642
- move support for environments from `cuda::experimental` to `cuda::std::execution` by @ericniebler in #4584
- Migrate away from docker-out-of-docker CI pattern. by @alliepiper in #4637
- Maintenance/c parallel tests build caching by @oleksandr-pavlyk in #4609
- Implement `cuda::std::string_view` by @davebayer in #4541
- enable `[[no_unique_address]]` for clang on c++20 by @ericniebler in #4646
- Turn `cuda::std::iter_swap` into a CPO to avoid ADL fiasco by @miscco in #4641
- Ensure that `construct_at` optimization uses our special narrowing handling by @miscco in #4534
- Add weekly compute-sanitizer CI jobs for CUB by @alliepiper in #4571
- Introduce scan_op support to cuda.coop block_scan module. by @tpn in #4628
- relocate ustdex within cudax by @ericniebler in #4626
- User-friendly pow2 functions derived from `std/bit` by @fbusato in #4627
- rename `start_on` and `continue_on` to `starts_on` and `continues_on` per WG21 by @ericniebler in #4647
- Add CUDA toolkit macros by @davebayer in #4630
- Update PTX ISA Version for CUDA 12.9 by @fbusato in #4656
- Bump NVBench to bring in entropy fixes. by @alliepiper in #4654
- Refactor part of `<cuda/std/type_traits>` by @davebayer in #4648
- Bring in more NVBench stopping criterion fixes. by @alliepiper in #4661
- Fix wrong function argument in `for_each_in_extents::dynamic_kernel` by @miscco in #4653
- Make transform iterator utility in c parallel test suite by @oleksandr-pavlyk in #4645
- replace the simplistic eager customization mechanism with proper apply/transform_sender by @ericniebler in #4657
- Implement `ranges::take_while_view` by @miscco in #4640
- Reduce the use of `__CUDA_ARCH__` by @davebayer in #4589
- Allow mdspan header tests for msvc in C++17 by @davebayer in #4667
- [CUDAX] Add launch transform to async_buffer by @pciolkosz in #4605
- [CUDAX] Fix launch priority option type by @pciolkosz in #4669
- add the schedule_from algorithm, make continues_on lower to it by @ericniebler in #4658
- disable execution space checks for `cuda::std::exchange` by @ericniebler in #4670
- fix the spelling of the `_CCCL_PREFERED_NAME` macro by @ericniebler in #4672
- Implement `cuda::neg` by @davebayer in #4567
- Make `cuda::get_device_address` work with C++ compilers by @davebayer in #4572
- improved diagnostics for `cuda::experimental::execution` by @ericniebler in #4673
- Change definition of `_CCCL_NODISCARD_FRIEND` by @miscco in #4668
- Improve defence against the external macros by @davebayer in #4635
- Use ugly attribute names in public headers by @davebayer in #4675
- Disable NVTX tests for NVHPC in C++20 by @miscco in #4686
- Extend CUB DeviceSegmentedReduce API with fixed segment size to support all operators by @srinivasyadav18 in #4549
- Add missing prologue/epilogue includes by @davebayer in #4683
- rename `cudax::uninit` to `cudax::no_init` for better readability by @ericniebler in #4690
- Remove Apple paths from libcu++ by @davebayer in #4693
- disable the execution-space checks for the generic environment utilities by @ericniebler in #4692
- cuda.parallel: Fix handling of duplicate LTOIRs by @shwina in #4698
- [STF] Improvements for the cached fifo allocator and misc improvements by @caugonnet in #4703
- Build and test python wheels in CI by @shwina in #4679
- Improve compiler checks on CMake 3.31+. by @alliepiper in #4710
- Add missing include to move algorithms by @miscco in #4712
- Enable chrono literals from C++20 by @davebayer in #4696
- Remove `__cccl_timespec_t` by @davebayer in #4694
- [STF] Ensure we generate CUDA graphs which always have the same topology by @caugonnet in #4705
- Implement `cuda::std::string_view` constructors from ranges by @davebayer in #4677
- simple wrapper types for `cudaGraph_t`, `cudaGraphNode_t`, and `cudaGraphExec_t` by @ericniebler in #4680
- Move histogram kernels to nvrtc compilable header by @NaderAlAwar in #4614
- disable execution space warnings for all of µstdex's generic facilities by @ericniebler in #4727
- change the adaptors to only forward queries specified as "forwarding" by @ericniebler in #4725
- c.parallel: reuse CUB agent policies for reduce by @griwes in #4286
- Introduce temp storage alignment awareness to cuda.cooperative. by @tpn in #4729
- Fix typo in agent_batch_memcpy.cuh comment. by @brycelelbach in #4730
- Use list init for test data in iterator docs by @bernhardmgruber in #4738
- Globalize the include of `<cuda_runtime_api.h>` by @davebayer in #4704
- Ensure include order of `insert_nested_NVTX_range_guard` via clang-format by @bernhardmgruber in #4741
- [CUDAX] Add in_place_type argument to pass-through constructor of shared resource by @pciolkosz in #4714
- Bump CI to CTK 12.9, regen devcontainers. by @alliepiper in #4624
- Cuda parallel test add mark large by @oleksandr-pavlyk in #4723
- move `forwarding_query` to `cuda/std/__execution/env.h` by @ericniebler in #4743
- turn off execution space checks for `unique_ptr` by @ericniebler in #4732
- Make `device_reference<T>::operator=` `const` by @bernhardmgruber in #4740
- Add variadic ctor and CTAD to zip_iterator by @bernhardmgruber in #4113
- Add explicit documentation for cuda::is_floating_point by @bernhardmgruber in #4749
- Simplify thrust::cuda_cub::swap_ranges by @bernhardmgruber in #4182
- Move `get_stream_t` to libcu++ by @miscco in #4737
- install ca-certificates into devcontainer by @shwina in #4753
- Add python jobs to nightly workflow by @shwina in #4720
- Host incrementable iterator approach 2 by @oleksandr-pavlyk in #4697
- Split Optimize Warp Reduce PR - libcu++ part by @fbusato in #4715
- Split Optimize Warp Reduce PR - CUB part by @fbusato in #4716
- Fix cuda.coop limitation preventing user-defined types when items_per_thread > 1 in block scan module. by @tpn in #4756
- make it possible to get the status code from a `cuda_error` exception object by @ericniebler in #4731
- Do not use open-coded `INFINITY` for tests that also test extended floating points by @miscco in #4752
- Port `thrust::discard_iterator` by @miscco in #4717
- Drop cmake workarounds for nvcc < 12 by @bernhardmgruber in #4754
- Add dynamic CUB dispatch for histogram by @NaderAlAwar in #4636
- Move `get_memory_resource` into libcu++ by @miscco in #4742
- Port `thrust::transform_iterator` to cuda by @miscco in #4718
- Add thrust::transform_n by @bernhardmgruber in #4750
- Add workflow to build and test all Python wheels by @shwina in #4721
- Update CI to NVHPC 25.5 by @alliepiper in #4763
- Use `cuda.bindings.path_finder` in `cuda.parallel` wheel by @rwgk in #4735
- Clear CUDA error state after a failure by @davebayer in #4759
- Small refactorings in Thrust CUDA by @bernhardmgruber in #4764
- Implement `ranges::repeat_view` by @miscco in #4666
- change sync_wait to never call make_exception_ptr from device code by @ericniebler in #4734
- test the return value of `forwarding_query(Tag{})` in the `__forwarding_query` concept by @ericniebler in #4766
- fix two issues with `transform_sender` by @ericniebler in #4770
- port the `let_value` tests over from stdexec by @ericniebler in #4771
- Install suggested build environment for pyenv by @shwina in #4781
- Remove thrust from python dependency list by @shwina in #4788
- fix broken cudax build due to an invalid expression in `sync_wait` error path by @ericniebler in #4787
- fix the `_CCCL_API` macro family for NVHPC by @ericniebler in #4777
- factor common code out of `schedule_from` and `continues_on` by @ericniebler in #4774
- Add missing ForceInclusive tag in exclusive.scan benchmark source by @gonidelis in #4792
- Use proper qualification in allocate.h by @miscco in #4796
- Add missing #pragma once to headers to prevent multiple inclusions by @PointKernel in #4789
- Align bulk copies to 16 bytes on Blackwell by @bernhardmgruber in #4778
- Fully qualify calls in `cuda::` and `cuda::device::` namespaces by @davebayer in #4798
- avoid double-wrapping receivers in `__rcvr_ref` by @ericniebler in #4775
- Retry calls to apt update/install to WAR network issues. by @alliepiper in #4800
- Segmented reduce to reuse CUB's tuning policy by @oleksandr-pavlyk in #4745
- Fix define headers on libcucxx according to new path names by @gonidelis in #4803
- fix the late-bound customization of the `continues_on` algorithm by @ericniebler in #4779
- de-duplication, reuse, naming conventions, and copyrights by @ericniebler in #4795
- Improve checking for prologue/epilogue code wrapping by @davebayer in #4802
- Reenable `__APPLE__` for pthread detection. by @miscco in #4805
- Port `thrust::counting_iterator` to cuda by @miscco in #4780
- Add NVTX nests guard back in CUB unit test conditionally based on Thrust entries by @gonidelis in #4583
- Replace invalid use of `_CCCL_HAS_CUDA_COMPILER()` by @davebayer in #4684
- Increase bytes in flight for B200 to 64KiB by @bernhardmgruber in #4790
- Make sure that `cuda` iterators play nicely with the thrust system and traversal machinery by @miscco in #4806
- Remove -G/-g/-lineinfo from ptx-json tests. by @alliepiper in #4813
- fix cudax's vector_add example that was broken by #4795 by @ericniebler in #4814
- Check `cuda::memcpy_async` preconditions by @davebayer in #4700
- Replace `_CCCL_NO_CONCEPTS` with `_CCCL_HAS_CONCEPTS()` by @davebayer in #4809
- Unify BabelStream benchmarks and make nstream consistent by @bernhardmgruber in #4782
- Run CCCL infra tests when example projects may have changed. by @alliepiper in #4816
- Add cuda::narrow(from) by @bernhardmgruber in #4784
- Refactor Thrust select_system by @bernhardmgruber in #4762
- Port `thrust::strided_iterator` to cuda by @miscco in #4808
- Refactor Thrust internal_functional by @bernhardmgruber in #4810
- cuda.parallel: Skip SASS verification for complex input in scan tests by @shwina in #4838
- [CUDAX] Add default properties for resources and add properties deduction to make_async_buffer by @pciolkosz in #4617
- Implement `cuda::device::lane_mask` by @davebayer in #4804
- Add a workflow to upload wheels to PyPi by @cryos in #4839
- add a `__query_or_default` function for querying an environment with a fallback value by @ericniebler in #4841
- Fix `lane_mask` documentation by @fbusato in #4854
- Create "packaging" CI jobs, distinct from CCCL core. by @alliepiper in #4843
- Speedup runtime of c/parallel/test/test_radix_sort.cpp by @oleksandr-pavlyk in #4848
- Ensuring CTK minor version compatibility for cccl.c.parallel by @oleksandr-pavlyk in #4851
- Add `address_space` and `is_address_from` to `cuda::device::` by @davebayer in #4797
- Refactor `thrust::minimum_type|minimum_system` by @bernhardmgruber in #4042
- Add `cuda::device::warp_match_all` by @fbusato in #4746
- Add CUB_ENABLE_LAUNCH_VARIANTS to toggle lid_1/2 variants. by @alliepiper in #4860
- Use `cuda::ptx::get_sreg_laneid` instead of plain asm by @davebayer in #4862
- Add `{std, ranges}::min` and `{std, ranges}::min_element` to algorithm by @miscco in #4783
- Add test to ensure that we are properly copying mdspan around by @miscco in #4760
- implement the proposed resolution of P3718 by adding a `get_domain_late` query. by @ericniebler in #4864
- [CUDAX] Add make_async_buffer overload for each constructor by @pciolkosz in #4856
- upgrade the `completion_signatures` machinery and add tests by @ericniebler in #4863
- Add simple kernel for deterministic reduction by @SAtacker in #2234
- Update pip packages to include colorama by @gonidelis in #4872
- avoid the use of [[no_unique_address]] in prop and env on nvcc by @ericniebler in #4871
- incidental fixes for ustdex by @ericniebler in #4873
- using the "magic_get" trick to infer a type's structured binding size by @ericniebler in #4875
- Env-based API for CUB part 1/3 by @gevtushenko in #4874
- add a minimally functional execution context for CUDA streams by @ericniebler in #4579
- Make cuda::stream_ref an env for itself by @miscco in #4878
- Port functional_placeholders_logical Thrust test to Catch2 by @bernhardmgruber in #4882
- [CUDAX] Remove circular dependency from the resource concept by @pciolkosz in #4852
- Env-based API for CUB part 2/3 by @gevtushenko in #4876
- strip -G from clangd command line for all-dev debug build by @ericniebler in #4884
- Disable bulk copy transform on sm120 by @bernhardmgruber in #4870
- Default kernel launcher factory indirection by @gevtushenko in #4890
- Use random data for heterogeneous cub::DeviceTransform test by @bernhardmgruber in #4883
- Port thrust::constant_iterator to cuda by @miscco in #4812
- Use cuda::std::type_identity instead of identity-like types by @davebayer in #4893
- Fix inspect_changes exclusions. by @alliepiper in #4885
- Drop cuda::std::__identity by @bernhardmgruber in #4887
- Refactor subdir checks to fix CI issue. by @alliepiper in #4895
- take stream scheduler tests out of matrix until i figure out what is going wrong by @ericniebler in #4902
- work around for stream_context defaulted constructor bug in nvcc-12.0 by @ericniebler in #4903
- Fix segfault when compiling env by @gevtushenko in #4891
- Improve code and coverage of DeviceFor::ForEachInExtents by @fbusato in #4664
- Replace bool_constant by if constexpr in agent_scan by @bernhardmgruber in #4880
- Refactor radix sort onesweep dispatch by @bernhardmgruber in #4868
- Fix _CountOneBits when building against MSVC older than 14.31. by @wmaxey in #4906
- do not constexpr cast to enum a value that is outside the enum's range by @ericniebler in #4905
- Avoid warning in cuda::ilog10 by @miscco in #4908
- Fix NVTX related comments by @bernhardmgruber in #4909
- Generate a version from git/JSON for packages by @cryos in #4889
- Reorganize cub::DeviceTransform tests by @bernhardmgruber in #4899
- Readd int64 offset tests for DeviceTransform by @bernhardmgruber in #4914
- Fix async buffer example by @gevtushenko in #4916
- Combine cuda_{parallel,coop,cccl} into a single package by @shwina in #4910
- Update Fixed Size Segmented Reduce benchmark by @srinivasyadav18 in #4913
- cccl/c: Refactor the NVRTC build list helper by @wmaxey in #4907
- Fix RadixEncoder operator() signature for radix sort by @davidwendt in #4921
- Fix build-and-test-python-wheels workflow by @shwina in #4926
- We only have one wheel to release now by @cryos in #4924
- [CUDAX] Remove default device argument from stream and device_memory_resource constructor by @pciolkosz in #4915
- Simplify cudax transform test by @bernhardmgruber in #4927
- [CUDAX] Add sm_120 arch traits by @pciolkosz in #4931
- Improve RFA PR 2234 by @srinivasyadav18 in #4888
- Fix unused parameter issue caught by nightlies. by @alliepiper in #4941
- Add init value test for RFA by @srinivasyadav18 in #4942
- Log MatX SHA in builds. by @alliepiper in #4940
- Fix ValueError encountered when running test_device_reduce on machine without CTK installed by @oleksandr-pavlyk in #4932
- bring the design of the cudax execution policies in line with C++17 by @ericniebler in #4937
- Env-based API for CUB part 3/3 by @gevtushenko in #4877
- Fix documentation typo: s/BlockRadixSort/BlockRunLengthDecode/. by @tpn in #4943
- fix the CUDA stream scheduler by @ericniebler in #4933
- [CUDAX] Introduce driver stack checking macro and apply it to device, event and stream tests by @pciolkosz in https://github.com/NVIDIA/cccl/pull/4934
- Remove jinja2 dependency from cuda.cooperative. by @tpn in https://github.com/NVIDIA/cccl/pull/4946
- [CUDAX] Fix cub cudax example after default device removal by @pciolkosz in https://github.com/NVIDIA/cccl/pull/4950
- Fix RFA dispatch template parameters by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/4951
- Improve device_accessor memory space check by @fbusato in https://github.com/NVIDIA/cccl/pull/4840
- Fix cuda::warp_match_all test case by @fbusato in https://github.com/NVIDIA/cccl/pull/4963
- Infra cleanup, prep for artifacts by @alliepiper in https://github.com/NVIDIA/cccl/pull/4929
- Drop dead code in Thrust reduce by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4969
- Use env in RFA tests by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/4948
- [CUDAX] Rename attr to attribute in device APIs by @pciolkosz in https://github.com/NVIDIA/cccl/pull/4964
- replace _LIBCUDACXX_HIDE_FROM_ABI with _CCCL_API inline by @ericniebler in https://github.com/NVIDIA/cccl/pull/4936
- Retry configure step when CPM hits network issues in CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/4956
- constexpr-ify cuda::experimental::execution by @ericniebler in https://github.com/NVIDIA/cccl/pull/4962
- make cudax::stream_ref a scheduler by @ericniebler in https://github.com/NVIDIA/cccl/pull/4952
- Enable custom msvc multiarch builds in CI. by @alliepiper in https://github.com/NVIDIA/cccl/pull/4978
- Use atomicAdd_block in device histogram by @gonidelis in https://github.com/NVIDIA/cccl/pull/4973
- is_nothrow_destructible_v should use the builtin when it is available by @ericniebler in https://github.com/NVIDIA/cccl/pull/4979
- nvcc-12.0 seems happier with __host__ __device__ lambdas by @ericniebler in https://github.com/NVIDIA/cccl/pull/4980
- fix the syntax for Catch2 test tags for cudax::execution by @ericniebler in https://github.com/NVIDIA/cccl/pull/4982
- Support more arguments to CCCL_PP_SPLICE_WITH by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4972
- Add support for sm110 to nv/target by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4987
- Add potential search path for cccl headers in potential layout by @wmaxey in https://github.com/NVIDIA/cccl/pull/4990
- Drop _LIBCUDACXX_CONSTRUCT_AT by @miscco in https://github.com/NVIDIA/cccl/pull/4998
- Do not use an anonymous union with optional by @miscco in https://github.com/NVIDIA/cccl/pull/4997
- Modularize to_chars tests by @davebayer in https://github.com/NVIDIA/cccl/pull/4904
- Update to RAPIDS 25.08. by @bdice in https://github.com/NVIDIA/cccl/pull/5008
- [CUDAX] Add sm_103 traits by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5006
- Port thrust::tabulate_output_iterator to cuda by @miscco in https://github.com/NVIDIA/cccl/pull/4879
- Avoid deprecated cudaGetDriverEntryPoint by @miscco in https://github.com/NVIDIA/cccl/pull/5010
- Fix incorrect argument name in thrust openMP cmake file by @miscco in https://github.com/NVIDIA/cccl/pull/5004
- Refactor thrust::sequential::sort by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4925
- Implement ranges::take_view by @miscco in https://github.com/NVIDIA/cccl/pull/4867
- Split Optimize WarpReduce PR - Part3 c2h by @fbusato in https://github.com/NVIDIA/cccl/pull/4842
- Use ptx::elect_sync in ublkcp transform kernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5014
- improved try/catch portability macros by @ericniebler in https://github.com/NVIDIA/cccl/pull/4986
- [FEA] expose std::uniform_int_distribution in libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/4410
- Fix debug check in cuda::ptx::shfl_sync_* by @fbusato in https://github.com/NVIDIA/cccl/pull/5016
- Add load-bearing semicolon for MSVC in openMP sort by @miscco in https://github.com/NVIDIA/cccl/pull/5024
- Improve compile-time of c2h generators_vector by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5023
- Update docs in device_radix_sort.cuh by @davidwendt in https://github.com/NVIDIA/cccl/pull/5021
- Enable H100 for c.parallel and python tests. by @griwes in https://github.com/NVIDIA/cccl/pull/4999
- [CUDAX] Remove "get_" prefix from member functions by @pciolkosz in https://github.com/NVIDIA/cccl/pull/4984
- Extend nightly SM build coverage by @alliepiper in https://github.com/NVIDIA/cccl/pull/4949
- add the bulk, bulk_chunked, and bulk_unchunked sender adaptors by @ericniebler in https://github.com/NVIDIA/cccl/pull/4989
- [CUDAX] Prototype implementation of path_builder that can build paths in a graph and implementation of launch accepting it by @pciolkosz in https://github.com/NVIDIA/cccl/pull/4758
- change relative include to system include in .../__execution/stream/continues_on.cuh by @ericniebler in https://github.com/NVIDIA/cccl/pull/5042
- Replace vector by inplace_vector in tests by @davebayer in https://github.com/NVIDIA/cccl/pull/4944
- Implement std::fma by @miscco in https://github.com/NVIDIA/cccl/pull/5029
- rename the get_domain_late query to get_domain_override per WG21 by @ericniebler in https://github.com/NVIDIA/cccl/pull/5043
- [CUDAX] Add an event constructor taking a device_ref by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5035
- Refactor around thrust::vector by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5044
- [cudax] Fix cudax compilation with gcc 9 by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5048
- Handle upcoming vector type change by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5036
- Reduce memory usage of the random distribution tests by @miscco in https://github.com/NVIDIA/cccl/pull/5052
- [CUDAX] Fix parentheses in one of the launch overloads by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5058
- Minor updates to the cuda::iterators by @miscco in https://github.com/NVIDIA/cccl/pull/5054
- Apply remove_cvref in thrust::is_contiguous_iterator and refactor all uses by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5050
- Port std::modf and std::fmod by @miscco in https://github.com/NVIDIA/cccl/pull/5047
- thrust::cuda::pinned_memory_resource should dispatch to the host system by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5045
- try to make it so that we allocate on the heap instead of a stack array by @miscco in https://github.com/NVIDIA/cccl/pull/5060
- Fix formatting in CONTRIBUTING.md by @pauleonix in https://github.com/NVIDIA/cccl/pull/5062
- Replace cg::memcpy_async in memcpy_async transform kernel by custom implementation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4976
- Provide cuda::static_for by @fbusato in #4855
- Enable cuda::std::string_view tests in libcu++ by @miscco in https://github.com/NVIDIA/cccl/pull/4894
- Refactor iterator concepts to use our new concept emulation by @miscco in https://github.com/NVIDIA/cccl/pull/5059
- Add vectorized cub::DeviceTransform algorithm by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/4815
- Replace CG by TMA copy in bulk copy fallback path by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5061
- Refactor DeviceTransform implementation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5071
- Unconditionally enable async copy transform kernels by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5069
- Port thrust::transform_output_iterator to cuda by @miscco in https://github.com/NVIDIA/cccl/pull/5051
- Move math builtins into the respective header file by @miscco in https://github.com/NVIDIA/cccl/pull/5075
- Implement std::remainder and std::remquo by @miscco in https://github.com/NVIDIA/cccl/pull/5070
- Modularize our complex implementation by @miscco in https://github.com/NVIDIA/cccl/pull/5076
- Docs nitpick by @gonidelis in https://github.com/NVIDIA/cccl/pull/5079
- refactor opstates and receivers to shorten mangled names by @ericniebler in https://github.com/NVIDIA/cccl/pull/5065
- Relax constraints for gpu_to_gpu determinism by @srinivasyadav18 in https://github.com/NVIDIA/cccl/pull/4981
- avoid a mysterious codegen issue with llvm18 by simplifying the transform for the stream bulk senders by @ericniebler in https://github.com/NVIDIA/cccl/pull/5087
- refactor the starts_on algorithm for shorter symbol length by @ericniebler in https://github.com/NVIDIA/cccl/pull/5088
- promote the write_attrs sender adaptor by @ericniebler in https://github.com/NVIDIA/cccl/pull/5089
- give cudax::stream_ref the opt-in for satisfying the scheduler concept by @ericniebler in https://github.com/NVIDIA/cccl/pull/5090
- [libcudacxx] Add EXEC_CHECK_DISABLE in to_address implementation by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5086
- Implement format_parse_context and format_error by @davebayer in https://github.com/NVIDIA/cccl/pull/4939
- Modularize optional by @miscco in https://github.com/NVIDIA/cccl/pull/5080
- Fix thrust::make_discard_iterator by @miscco in https://github.com/NVIDIA/cccl/pull/5093
- make the stream sender adaptor work with non-visitable senders by @ericniebler in https://github.com/NVIDIA/cccl/pull/5091
- [CUDAX] Switch access control API to use a span of device_refs by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5084
- get the starts_on algorithm working with the stream scheduler by @ericniebler in https://github.com/NVIDIA/cccl/pull/5092
- [CUDAX] Migrate copy and fill to use driver API and add driver stack checks in memory_resource and async_buffer tests by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5073
- sync_wait should decay copy the value results by @ericniebler in https://github.com/NVIDIA/cccl/pull/5107
- add the execution::on sender adapter by @ericniebler in https://github.com/NVIDIA/cccl/pull/5097
- fix let_value and friends to work when the function returns a dependent sender by @ericniebler in https://github.com/NVIDIA/cccl/pull/5105
- Skip unnecessary fence in DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5102
- fix visibility problem in invoke, correct the spelling of "invocable" globally by @ericniebler in https://github.com/NVIDIA/cccl/pull/5106
- change the defn of __query_result_or_t to not require _Query to b… by @ericniebler in https://github.com/NVIDIA/cccl/pull/5109
- Small fixes and improvements to DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5078
- Avoid usage of _CCCL_NO_UNIQUE_ADDRESS for cuda iterators by @miscco in https://github.com/NVIDIA/cccl/pull/5110
- Add docs for cuda.cccl.parallel and cuda.cccl.cooperative by @shwina in https://github.com/NVIDIA/cccl/pull/5095
- Added f32/fp64 specializations for complex exp function. by @s-oboyle in https://github.com/NVIDIA/cccl/pull/4928
- Fix link to Python docs in cccl docs index page by @shwina in https://github.com/NVIDIA/cccl/pull/5115
- add _CCCL_DECLSPEC_EMPTY_BASES as an AttributeMacro to .clang-format by @ericniebler in https://github.com/NVIDIA/cccl/pull/5123
- [CUB] Tests DeviceScan for invalid values passed to the custom reduction operator by @pauleonix in https://github.com/NVIDIA/cccl/pull/5085
- Avoid more upcoming deprecation warnings on vector types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5114
- Add framework for C2H tests in libcudacxx. by @alliepiper in https://github.com/NVIDIA/cccl/pull/5101
- Simplify the type of write_env's receiver and make write_env pipeable by @ericniebler in https://github.com/NVIDIA/cccl/pull/5108
- Fix unqualified call to __unwrap_iter by @miscco in https://github.com/NVIDIA/cccl/pull/5117
- cuda:: pointer utilities by @fbusato in #5037
- avoid return type deduction in the execution queries by @ericniebler in https://github.com/NVIDIA/cccl/pull/5096
- [CUDAX] Add id() getter to stream_ref by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5132
- make value_types_of_t and error_types_of_t work with non-variadic templates by @ericniebler in https://github.com/NVIDIA/cccl/pull/5134
- make cuda::std::__tuple work with members of reference type by @ericniebler in https://github.com/NVIDIA/cccl/pull/5129
- Fix RAPIDS CI jobs by @trxcllnt in https://github.com/NVIDIA/cccl/pull/5072
- Try avoid instantiating timespec_get as that might or might not be available on android CTKs by @miscco in https://github.com/NVIDIA/cccl/pull/5128
- Port thrust::permutation_iterator by @miscco in https://github.com/NVIDIA/cccl/pull/4835
- Update transform iterator example to use a not quadratic sequence by @shwina in https://github.com/NVIDIA/cccl/pull/5131
- Add devcontainer postAttachCommand for GitHub Codespaces by @trxcllnt in https://github.com/NVIDIA/cccl/pull/5133
- Add cub::DeviceReduce::Sum Env-based API by @gonidelis in https://github.com/NVIDIA/cccl/pull/4985
- Avoid deprecation warning in the libcu++ extended vector types tests by @miscco in https://github.com/NVIDIA/cccl/pull/5135
- Test DeviceTransform with more overaligned types by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5139
- the stream implementation of continues_on is using a moved-from receiver by @ericniebler in https://github.com/NVIDIA/cccl/pull/5150
- RAPIDS CI update to CUDA 12.9 by @jakirkham in https://github.com/NVIDIA/cccl/pull/5104
- Make some member functions of inplace_vector static by @miscco in https://github.com/NVIDIA/cccl/pull/5149
- Refactor thrust generic sequence by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5156
- Try, fail and ignore to guarantee dynamic SMEM alignment on Hopper by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5122
- Test more unaligned inputs in DeviceTransform by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5111
- Enhance our deprecation machinery so that we also suppress the right nvcc warnings by @miscco in https://github.com/NVIDIA/cccl/pull/5138
- [CUB] Tests DeviceScan with primitive type for invalid values being passed to the scan operator by @pauleonix in https://github.com/NVIDIA/cccl/pull/5141
- fix ODR violation making cudax launch tests flaky by @ericniebler in https://github.com/NVIDIA/cccl/pull/5161
- Replace mdspan/extents.h - __count_dynamic with a template variable by @fbusato in https://github.com/NVIDIA/cccl/pull/5168
- Update doc errors in set_operations.h by @akifcorduk in https://github.com/NVIDIA/cccl/pull/5177
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in https://github.com/NVIDIA/cccl/pull/4365
- Refactor out large offset size calculation by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5175
- unify doxygen predefined macros in repo.toml by @ericniebler in https://github.com/NVIDIA/cccl/pull/5162
- Optionally use PostgreSQL for benchmark data by @gevtushenko in https://github.com/NVIDIA/cccl/pull/5184
- Refactor thrust cuda replace by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5180
- Implement thrust::transform[_if]_n in the generic system by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5182
- [CUDAX] Use stream_id instead of unsigned long long as stream_ref::id() return type by @davebayer in https://github.com/NVIDIA/cccl/pull/5146
- Adds debug info to large problem test helper by @elstehle in https://github.com/NVIDIA/cccl/pull/5187
- [CUB] Fix BlockScan documentation by @pauleonix in https://github.com/NVIDIA/cccl/pull/5189
- [CUDAX] Make basic tests work on Windows by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5188
- Fix overflow in offset calculation in transform kernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5176
- [STF] Rename task_fence to fence, and graph epochs to graph stages by @caugonnet in https://github.com/NVIDIA/cccl/pull/5200
- Use shared memory pointer instead of offset in UBLKCP transform kernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5173
- finalize the design of the launch transform API by @ericniebler in https://github.com/NVIDIA/cccl/pull/5153
- Fix cudax compilation with upcoming CTK by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5202
- [STF] Use #ifndef _CCCL_DOXYGEN_INVOKED instead of @cond NEVER_DOCUMENT by @caugonnet in https://github.com/NVIDIA/cccl/pull/5211
- Remove CDP (RDC) architecture filtering logic from Thrust/CUB by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5206
- Fix exit code capture under set -e by @alliepiper in https://github.com/NVIDIA/cccl/pull/5213
- Add options for fixing failures in release generation by @wmaxey in https://github.com/NVIDIA/cccl/pull/5194
- [STF] Simplify the visit pattern used in context.cuh using the ->* operator by @caugonnet in https://github.com/NVIDIA/cccl/pull/5212
- Replace cuda version checks with _CCCL_CTK_XXX() macro by @davebayer in https://github.com/NVIDIA/cccl/pull/5204
- Remove duplicated entry in ptx docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5222
- Tuning rules by @gevtushenko in https://github.com/NVIDIA/cccl/pull/5195
- Remove _LIBCUDACXX_EXTERN_TEMPLATE and _LIBCUDACXX_BUILDING_LIBRARY macros by @davebayer in https://github.com/NVIDIA/cccl/pull/5230
- Don't mention C++ 11 and 14 in more places by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5201
- Enable RDC tests on MSVC by default by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5214
- Rebalance items per thread in LDGSTS/UBLKCP transform kernels by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5112
- Migrate Thrust transform tests to Catch2 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5197
- [STF] Keep dependency event names in DOT by @caugonnet in https://github.com/NVIDIA/cccl/pull/5235
- [STF] Cleanup how we setup allocators in algorithms by @caugonnet in https://github.com/NVIDIA/cccl/pull/5220
- Use arch=native in benchmark/tuning presets/docs by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5216
- Re-enable UBLKCP transform kernel on sm120 by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5223
- use constexpr std::exception if it is available by @ericniebler in https://github.com/NVIDIA/cccl/pull/5221
- [CUDAX] Uglify driver API header and remove CUDAX prefix from the driver function getter by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5219
- [CUDAX] Rename device type to physical_device by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5208
- [STF] Cleanup how we specify edge type in DOT output by @caugonnet in https://github.com/NVIDIA/cccl/pull/5232
- Warn when the traditional MSVC preprocessor is used by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5172
- Make cuda::ptx available in cuda::device as a namespace alias by @davebayer in https://github.com/NVIDIA/cccl/pull/5241
- Add missing _CCCL_HEADER_TEST definitions to public header tests by @davebayer in https://github.com/NVIDIA/cccl/pull/5242
- Do not enable __float128 support on device for clang-cuda or NVHPC by @miscco in https://github.com/NVIDIA/cccl/pull/5254
- Disable assertions for QNX, they do not provide the machinery with their libc by @miscco in https://github.com/NVIDIA/cccl/pull/5253
- Support types with any alignment in UBLKCP transform kernel by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5178
- Refactor benchmark of conditional algorithms by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5237
- Proclaim copyable_args in nvbench_helper.cu by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5247
- Make sure that nested tuple and pair have the expected size by @miscco in https://github.com/NVIDIA/cccl/pull/5246
- Add CONSTEXPR_STEPS: option to lit config by @davebayer in https://github.com/NVIDIA/cccl/pull/5229
- Implement thrust::swap_ranges via transform in CUDA system by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5243
- Skip init of temp vectors in CUB test launch helpers by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5260
- Add missing prologue / epilogue includes to <cuda/ptx> by @miscco in https://github.com/NVIDIA/cccl/pull/5261
- [CUDAX->libcu++] Move driver_api header and testing header to libcudacxx by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5255
- [STF] Separately display freeze and unfreeze operations in DOT by @caugonnet in https://github.com/NVIDIA/cccl/pull/5234
- Add missed specializations of the new aligned vector types to cub by @miscco in https://github.com/NVIDIA/cccl/pull/5264
- [CUDAX] Refactor arch traits to be more structured and support arch-specific targets by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5064
- Improve driver api implementation by @davebayer in https://github.com/NVIDIA/cccl/pull/5272
- Implement tuple protocol for nvfp vector types by @davebayer in https://github.com/NVIDIA/cccl/pull/5218
- Fix failing warning suppression for nvrtc by @miscco in https://github.com/NVIDIA/cccl/pull/5278
- Expose Fast Modulo Division in libcu++ by @fbusato in #5210
- [CUDAX->libcu++] Move device APIs to libcu++ by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5279
- [Backport branch/3.1.x] [CUDAX->libcu++] Move ensure_current_device to libcu++ and change the name to ensure_current_context by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5309
- [Backport branch/3.1.x] [CUDAX->libcu++] Move stream and event from cudax to libcu++ by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5322
- [Backport branch/3.1.x] Remove mentions of CUDA experimental that sneaked into libcu++ by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5326
- [Backport branch/3.1.x] Add a macro to disable PDL by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5329
- [Backport branch/3.1.x] Skip zero values in fast_mod_div unit test by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5337
- [Backport branch/3.1.x] [libcu++] Deprecate default stream_ref constructor and fix some few last usages by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5334
- [Backport branch/3.1.x] Add gitlab devcontainers by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5342
- [Backport branch/3.1.x] Fix nvrtc when there are more than one CTK include directories available by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5356
- [Backport 3.1.x] is_address_from fixes (#5349) by @fbusato in https://github.com/NVIDIA/cccl/pull/5363
- [Backport branch/3.1.x] Diagnose missing numeric_limits specialization in DeviceReduce Min/Max by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5376
- [Backport branch/3.1.x] [CUDAX->libcu++] Expose fill_bytes and copy_bytes in libcudacxx by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5403
- Backport #5442 to 3.1x by @shwina in https://github.com/NVIDIA/cccl/pull/5476
- [Backport branch/3.1.x] NV_TARGET and cuda::ptx for CTK 13 by @fbusato in https://github.com/NVIDIA/cccl/pull/5474
- Backport to 3.1: c.parallel: enable UBLKCP in transform (#4847) and Move TMA barrier in DeviceTransform into dynamic SMEM (#5414) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5457
- [BACKPORT 3.1]: Replace address space intrinsics with cuda::device::is_address_from (#4866) by @miscco in https://github.com/NVIDIA/cccl/pull/5465
- [Backport branch/3.1.x] Fix grid dependency sync in cub::DeviceMergeSort by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5460
- [Backport branch/3.1.x] move basic_any from cudax to libcudacxx by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5459
- [backport -> 3.1.x][libcu++] Rename memory resource concepts to indicate asynchronous allocations are the default ones (#5313) by @pciolkosz in https://github.com/NVIDIA/cccl/pull/5492
- [Backport branch/3.1.x] [libcu++] Remove experimental memory resource define check from around the concept, properties and the query. by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5501
- [Backport branch/3.1.x] Add SM_110a for non-supporting compilers by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5503
- Backport to 3.1: NVTX ranges for C2H, NVTX as system headers, and handle NVTX being disabled in C2H by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5519
- [Backport branch/3.1.x] [libcu++] Rename resource_ref to match the new async by default naming by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5545
- [Backport branch/3.1.x] [CUDAX] Rename type-erased memory resource wrappers by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5547
- [Backport branch/3.1.x] PR #5396 and #5566 by @elstehle in https://github.com/NVIDIA/cccl/pull/5611
- Backport to 3.1: Update cuda/ptx instructions to support all new SM architectures in CTK 13 (#5600) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/5612
- [Backport branch/3.1.x] Fixes thrust::unique for non-const equality_op by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5656
- [BACKPORT 3.1]: Update PTX ISA version for CUDA 13 (#5676) by @miscco in https://github.com/NVIDIA/cccl/pull/5699
- [Backport branch/3.1.x] Fix thrust::malloc for void by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5718
- [BACKPORT 3.1]: Fix problematic clang attribute namespace (#5748) by @miscco in https://github.com/NVIDIA/cccl/pull/5756
- [Backport 3.1]: Work around submdspan compiler issue on MSVC (#5885) by @miscco in https://github.com/NVIDIA/cccl/pull/5902
- [Backport branch/3.1.x] Ignore -Wmaybe-uninitialized in dispatch_reduce.cuh. by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/5936
- Detect QNX for atomics support by @miscco in https://github.com/NVIDIA/cccl/pull/5962
- [BACKPORT 3.1] Use forward declarations of extended floating point types instead of including the headers (#5846) by @miscco in https://github.com/NVIDIA/cccl/pull/5978
- [Backport 3.1] Backport iterator fixes by @miscco in https://github.com/NVIDIA/cccl/pull/5977
- [Backport branch/3.1.x] [libcu++] Switch to use cuGetProcAddress to get driver functions by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6011
- [Backport branch/3.1.x] Enable __grid_constant__ with clang-cuda-20 and nvrtc by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6013
- [Backport branch/3.1.x] Fix libcu++ compilation with clang-20 by @davebayer in https://github.com/NVIDIA/cccl/pull/5985
- [Backport branch/3.1.x] Fix throwing functions marked as noexcept by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6052
- [Backport 3.1]: add missing InitT tparam to specialization of DispatchSegmentedReduce (#6048) by @miscco in https://github.com/NVIDIA/cccl/pull/6054
- [Backport 3.1]: Fix addressof shadowing issue with libc++ (#6032) by @miscco in https://github.com/NVIDIA/cccl/pull/6053
- Backport to 3.1: Do not require int128 in for_each_canceled (#5822) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6057
- [Backport branch/3.1.x] Fix nvc++ 25.9 with format_parse_context tests by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6059
- Backport to 3.1: Add SM_110 arch traits by @pciolkosz in https://github.com/NVIDIA/cccl/pull/6071
- [Backport 3.1]: Change PARALLEL_LEVEL default from nproc to nproc-1 in build_common.sh (#6046) by @miscco in https://github.com/NVIDIA/cccl/pull/6055
- [Backport to 3.1] Fix dereferencing nullptr in thrust::device_reference by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6002
- [Backport to 3.1] add a specialization of __make_tuple_types for complex<T> (#6102) by @davebayer in https://github.com/NVIDIA/cccl/pull/6116
- [Backport to 3.1] Remove iterator workarounds for lack of operator+= (#6094) by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6115
- [Backport 3.1] Fix imports from cudax to libcu++ (#6105) by @davebayer in https://github.com/NVIDIA/cccl/pull/6144
- [Backport branch/3.1.x] Implement operator<< for cuda::std::string_view by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6148
- [Backport branch/3.1.x] [libcu++] Fix blocks per SM in arch traits by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6187
- [Backport to 3.1]: Backport bad bad alloc by @bernhardmgruber in https://github.com/NVIDIA/cccl/pull/6197
- [Backport 3.1]: Work around NVRTC bug with virtual default ctors/dtors (#5704) by @miscco in https://github.com/NVIDIA/cccl/pull/6193
- [Backport 3.1] Cache device name and peers (#6110) by @davebayer in https://github.com/NVIDIA/cccl/pull/6145
- [Backport 3.1] Replace CUDA Runtime calls with Driver calls in libcu++ by @davebayer in https://github.com/NVIDIA/cccl/pull/6211
- [Backport 3.1]: [CUB] Replace several direct uses of __clz (#6099) by @miscco in https://github.com/NVIDIA/cccl/pull/6202
- [Backport branch/3.1.x] Add missing sm121 to nv/target and CUB tests by @github-actions[bot] in https://github.com/NVIDIA/cccl/pull/6210
- [Backport 3.1] Backport recent <cuda/device> changes by @davebayer in https://github.com/NVIDIA/cccl/pull/6215
- [Backport 3.1] Backport PRs #4591, #6176, #6201 and #6006 by @miscco in https://github.com/NVIDIA/cccl/pull/6222
- [Backport 3.1] Backport #6184 and #6224 by @davebayer in https://github.com/NVIDIA/cccl/pull/6228
- [Backport 3.1] Backport #5305 and #6093 by @davebayer in https://github.com/NVIDIA/cccl/pull/6232
New Contributors
- @hwabis made their first contribution in #4426
- @SAtacker made their first contribution in #2234
- @jakirkham made their first contribution in https://github.com/NVIDIA/cccl/pull/5104
- @akifcorduk made their first contribution in https://github.com/NVIDIA/cccl/pull/5177
Full Changelog: v3.0.3...v3.1.0