NVIDIA/cccl v3.3.0 on GitHub

What's Changed

📚 Libcudacxx

[libcudacxx] Fix a typo in the documentation by @caugonnet in #7330
Add a test for <nv/target> to validate old dialect support. by @wmaxey in #7241

🔄 Other Changes

Implement cudax::cufile by @davebayer in #6122
Update linear_congruential_generator with constexpr, tests and a fast discard by @RAMitchell in #6402
Replace _CCCL_HAS_CUDA_COMPILER() with _CCCL_CUDA_COMPILATION() by @davebayer in #6399
Remove unnecessary casts in complex multiplication/division by @davebayer in #6670
Add benchmark batch script by @bernhardmgruber in #6661
Improvements and testing for inspect_changes CI functionality. by @alliepiper in #6535
Improve clarity of CCCL assert macro documentation by @jrhemstad in #6675
Fix oversubscription issue with lit precompile, label hack by @alliepiper in #6554
Make missing sccache nonfatal. by @alliepiper in #6582
Address pending comments for make_tma_descriptor by @fbusato in #6662
Add nvhpc 25.9. by @alliepiper in #6003
Test building for all arches. by @alliepiper in #6113
Add nvbench_helper tests to CI. by @alliepiper in #6679
Add more targets to pytorch build. by @alliepiper in #6685
Add host std lib version detection by @davebayer in #6678
Improve CUB benchmark docs by @bernhardmgruber in #6640
Use if consteval in libcu++ by @davebayer in #6424
Update docs for _CCCL_IF_CONSTEVAL by @davebayer in #6692
Fixes issue with select close to int_max by @elstehle in #6641
Update libcudacxx C++ dialect handling. by @alliepiper in #6693
Simplifies env usage in DeviceTopK tests by @elstehle in #6680
Switch to S3 preprocessor cache by @alliepiper in #6561
fix omp scan bug by @charan-003 in #6560
Refactor out variant from transform tunings by @bernhardmgruber in #6669
[libcu++] Waive hierarchy constexpr testing on GCC8 by @pciolkosz in #6707
Use wrapper with void* argument types for iterator advance/dereference signature by @shwina in #6634
Restore libcudacxx dialect presets. by @alliepiper in #6705
Refactor error handling in radix sort dispatch by @bernhardmgruber in #6681
Remove special dialect handling from cudax build system. by @alliepiper in #6702
Segmented scan followup by @oleksandr-pavlyk in #6706
Fix electing leader from any group in cuda::memcpy_async by @bernhardmgruber in #6710
Avoid scaling twice in ReduceNondeterministicPolicy by @bernhardmgruber in #6711
Remove special handling of C++ dialect in CUB's build system by @alliepiper in #6713
[libcu++] Use resource test fixture members through this by @pciolkosz in #6717
Improves top-k examples to illustrate stream usage by @elstehle in #6723
Tweak sol.py a bit by @bernhardmgruber in #6721
Implement PCG64 as extension by @RAMitchell in #6292
Use PDL in cub::DeviceScan by @bernhardmgruber in #6639
Fix header in libcudacxx test by @alliepiper in #6726
Remove dead code. by @alliepiper in #6725
Add deps on thrust/cub to libcudacxx. by @alliepiper in #6694
Remove special handling for dialect in Thrust's build system. by @alliepiper in #6722
[libcu++] Automatically bump up the release threshold of default mempools by @pciolkosz in #6718
Backport cuda::std::reference_wrapper C++20 features by @davebayer in #6709
Relax error tolerance for deterministic_device_reduce (RFA) test by @srinivasyadav18 in #6720
[DOC] Add temp_storage_bytes usage guide by @Aminsed in #6208
Improve charconv test compile times by @davebayer in #6687
Move source location builtins directly to <cuda/std/source_location> by @davebayer in #6738
Small improvements for cuda::ipow by @davebayer in #6736
Add support for clang's alignment builtins by @davebayer in #6741
Disable test that is failing in multiple configurations by @miscco in #6745
Implement std::normal_distribution by @RAMitchell in #6585
Update cuda::std::span concepts by @davebayer in #6744
Improve bit builtins support by @davebayer in #6737
Implement ranges::drop_view by @miscco in #5049
Improve fp decompose by @davebayer in #6749
Enable caching of advance/dereference methods for ZipIterator and PermutationIterator by @shwina in #6753
implement indeterminate_domain from P3826R2 by @ericniebler in #6628
Fix cuda::std::reference_wrapper noexcept test with gcc-8 by @davebayer in #6757
cuda.compute: In TransformIterator, use type annotations (if available) to determine the output type of user-provided op by @shwina in #6760
cuda.compute: Fixes and improvements to function caching by @shwina in #6758
Fix __throw_cuda_error availability with nvrtc by @davebayer in #6759
Implement ranges::find_if and ranges::find_if_not by @miscco in #6752
Fix radix_sort tuning namespace by @bernhardmgruber in #6755
[libcu++] Add sm_62 arch traits by @pciolkosz in #6772
fix(readme): Update broken Godbolt example link by @miyanyan in #6773
Implement CUDA backend for parallel cuda::std::for_each by @miscco in #5610
Ensure that we properly warn about device lambdas that need to query the return type by @miscco in #6765
Add missing test for thrust::reduce_into by @Pansysk75 in #6572
cuda.compute: Add select algorithm based on three_way_partition by @shwina in #6766
Add queries for CUB ptx version as arch_id by @bernhardmgruber in #6776
Add operator<< to some CUB enums by @bernhardmgruber in #6774
cuda.compute: Fix caching of functions that call other functions by @shwina in #6770
Implement std::exponential_distribution by @RAMitchell in #6750
Fix issue with libcudacxx header tests. by @alliepiper in #6785
Add a type and operation enum to CUB by @bernhardmgruber in #6780
Use conventional order of _CCCL_API friend consistently by @miscco in #6781
Implement std::binomial_distribution by @RAMitchell in #6747
Fixes i32 overflow for benchmark data generation of more than INT_MAX number of items by @elstehle in #6809
Temporarily add upper bound to numba-cuda dependency by @shwina in #6815
Make cuda capabilities part of cccl config by @davebayer in #6806
Update std::uniform_real_distribution by @RAMitchell in #6798
[cub] Implement cub::MaxPotentialDynamicSmemBytes by @davebayer in #6818
libcudacxx: streamline simple trait aliases by @Aminsed in #6740
Fix a typo in compute.rst by @shwina in #6826
Improve our WarpReduce implementation by @miscco in #6814
Implement cuda::sincos by @davebayer in #6742
Replace inline ptx with intrinsics by @davebayer in #6810
[cudax->libcu++] Move buffer type from cudax to libcu++ by @pciolkosz in #6627
Improve CMake package handling, add MSVC compat flags to libcudacxx's public interface. by @alliepiper in #6791
Fix arch related cuda::device:: APIs for nvhpc in CUDA mode by @davebayer in #6829
Implement the new tuning API for DeviceReduce by @bernhardmgruber in #6544
Replace internal assert with _CCCL_ASSERT in libcu++ by @davebayer in #6825
Implement std::gamma_distribution by @RAMitchell in #6786
Run scan benchmark for 2^32 elements by @bernhardmgruber in #6834
Remove upper bound on numba-cuda by @shwina in #6835
Use lit for cuda::arch_id and cuda::compute_capability tests by @davebayer in #6775
Extends DeviceScan tests in preparation for the warpspeed scan implementation by @elstehle in #6836
smoke test for all_of algorithm by @viralbhadeshiya in #6828
[CUB][device] Add a env-based overload of the device segmented reductions primitives by @rbourgeois33 in #6674
Avoid use of cccl namespace macros in cub by @davebayer in #6844
Beautify vector mismatch reporting by @bernhardmgruber in #6837
Implement std::lognormal_distribution by @RAMitchell in #6789
Implement std::weibull_distribution by @RAMitchell in #6797
[PTX] Add cp.async.bulk.dst.src.mbarrier::complete_tx::bytes.ignore_oob by @bernhardmgruber in #6854
Complex asinh accuracy refinement by @s-oboyle in #6428
Allow numpy struct types as initial value for ZipIterator inputs by @shwina in #6861
In test_device_segmented_scan_api change type from int to unsigned by @oleksandr-pavlyk in #6868
Add missing doc strings to support old CMake. by @alliepiper in #6869
avoid error adding pointer to reference in any_resource by @ericniebler in #6875
smoke test for adjacent_difference by @viralbhadeshiya in #6872
Bump minimum CMake to 3.18, add CI testing of public packages with it. by @alliepiper in #6871
[PTX] Regenerate by @bernhardmgruber in #6859
Implement std::poisson_distribution by @RAMitchell in #6748
[libcu++] Add memory_pool header and correct legacy resources namespace by @pciolkosz in #6852
[cuda.compute]: Fix issue with get_dtype() not working anymore for pytorch arrays by @NaderAlAwar in #6882
[cuda.compute]: Add fast path to extract PyTorch array pointer by @NaderAlAwar in #6884
accommodate new behavior of clang's __builtin_structured_binding_size by @ericniebler in #6888
[libcu++] Don't require accessibility property on type erased wrappers by @pciolkosz in #6851
Move launch API from cudax to libcu++ by @pciolkosz in #6667
[libcu++] Fix minor version compatibility in 13.X by @pciolkosz in #6895
[libcu++] Leak static CUDA resources and add missing release on memory pool by @pciolkosz in #6892
Add limited RTX PRO 6000 coverage. by @alliepiper in #6841
Add quotes and error checking to devcontainer init. by @alliepiper in #6886
Update std::uniform_int_distribution by @RAMitchell in #6799
portability macro for checking whether an expression satisfies a concept in a _CCCL_REQUIRES_EXPR clause by @ericniebler in #6890
Add ptxas local memory usage warnings to cub builds by @davebayer in #6838
Pcg64 uint128 fallback implementation for MSVC by @RAMitchell in #6746
[cuda.compute] Add dependency on nvidia-nvvm by @shwina in #6909
Add missing OpKind docs entries by @ktaletsk in #6910
Fix overflow issue in histogram even benchmark when the number of bins exceeds what LevelT can represent by @NaderAlAwar in #6908
[libcu++] Add as_ref() to memory pool types by @pciolkosz in #6900
Fix exhaustive policy chain pruning test by @bernhardmgruber in #6903
Extend transform benchmarks to 2^32 elements by @bernhardmgruber in #6920
Implement std::cauchy_distribution by @RAMitchell in #6787
Implement std::extreme_value_distribution by @RAMitchell in #6788
Implement std::fisher_f_distribution by @RAMitchell in #6857
change concepts portability macros to avoid use of macro EXPAND by @ericniebler in #6366
Implement cuda::__all_arch_ids and cuda::__is_specific_arch by @davebayer in #6916
[libcu++] Rename device_transform back to launch_transform by @pciolkosz in #6927
Add unsupported compiler flag to .clangd by @ericniebler in #6911
Add an option to use CCCL from CTK for C2H by @bernhardmgruber in #6848
Avoid warning about missing braces for subobject by @miscco in #6929
Add missing nvrtc nv target archs by @davebayer in #6880
Make sure we actually use overflow builtins by @davebayer in #6904
Implement std::chi_squared_distribution by @RAMitchell in #6856
[libcu++] Static assert that resource is copyable in buffer constructors by @pciolkosz in #6928
Use vectorized transform kernel for sizeof(T) < 4 workloads of arity >1 on Hopper by @bernhardmgruber in #6921
port the trampoline_scheduler from stdexec to cudax::execution by @ericniebler in #6894
Properly specialize cub functions for __nv_bfloat16 by @miscco in #6931
[CUB] Fix mask types in block_radix_rank.cuh by @Aminsed in #6189
[CUB]: Use the new tuning API for nondeterministic reduce by @NaderAlAwar in #6932
clean up some allocator and memory utilities by @ericniebler in #6939
Unify operator handling in cuda.compute by @shwina in #6938
Use integer promotion for warp_reduce by @miscco in #6819
implement task_scheduler from C++26 ([exec.task.scheduler]) by @ericniebler in #5975
Implement std::negative_binomial_distribution by @RAMitchell in #6879
Implement std::student_t_distribution by @RAMitchell in #6858
Remove [[nodiscard]] from barrier's .arrive(...) method by @davebayer in #6947
Implement std::geometric_distribution by @RAMitchell in #6924
[cuda.compute] Refactor code for creating void* wrappers by @shwina in #6941
Expose not guaranteed determinism to reduce in cuda.compute by @NaderAlAwar in #6926
Make __cccl_is_floating_point_v consistent with __cccl_is_integer_v by @davebayer in #6952
Disable __builtin_structured_binding_size with nvcc by @davebayer in #6961
Provide thrust::find_if benchmark by @gonidelis in #6956
Remove all usage of old experimental MR macro by @pciolkosz in #6962
Fix cuda::std::abs for floating points by @davebayer in #6958
Expose <cuda/std/charconv> by @davebayer in #6672
Don't use __builtin_bswap128 during constant evaluation by @davebayer in #6967
Avoid using _CCCL_UNREACHABLE() unless it's necessary by @davebayer in #6948
Cleanups for random module by @RAMitchell in #6951
Remove need for hardcoded LevelT for histogram in c.parallel and cuda.compute by @NaderAlAwar in #6915
Use ublkcp/memcpy_async in transform when dtype size is not a power of two by @NaderAlAwar in #6972
Add internal cuda::__is_device_or_managed_memory by @fbusato in #6918
Improve and apply _CCCL_THROW by @fbusato in #6684
Define _CCCL_ASSERT_IMPL_HOST correctly for clang on Windows by @asmelko in #6971
Allow if consteval in device code with nvcc 13.1 by @davebayer in #6902
c.parallel: reuse CUB agent policies for histogram by @NaderAlAwar in #6974
[libcu++] Dynamically load CUDA library instead of using the runtime by @pciolkosz in #6899
[libcu++] Uncomment some tests and fix launch include after launch was moved to libcu++ by @pciolkosz in #6966
Re-enable using std::meow builtins by @davebayer in #6978
use cooperative_groups in execution::bulk to synchronize across thread blocks by @ericniebler in #6992
Provide Shared Memory mdspan and accessor by @fbusato in #6703
Add non-throwing overloads to is_pointer_accessible by @fbusato in #6988
[cuda.compute]: fix alignment not being set properly for gpu_struct types by @NaderAlAwar in #6995
Workaround for a potential bug in the driver related to TMA descriptor by @fbusato in #6985
do not introduce a pack in a structured binding with nvcc by @ericniebler in #6994
Extract environment boilerplate code from within the device interfaces to a separate header by @gonidelis in #6622
Remove _CCCL_HAS_CUDA_COMPILER() by @davebayer in #6984
[libcu++] Fix memory pool and buffer test issues on Windows by @pciolkosz in #6993
fix off-by-one error in the implementation of cuda::std::__tuple by @ericniebler in #6996
Small fixes around warpspeed scan by @bernhardmgruber in #6998
Remove <version> include by @davebayer in #7001
Upgrade GitHub Actions to latest versions by @salmanmkc in #6991
cuda.coop: Use cuda.core.experimental.Linker instead of internal numba-cuda _Linker by @shwina in #7011
Make c2h vector utils constexpr by @davebayer in #7009
Improves comments on decoupled look-back code example by @elstehle in #7015
Extract reduce_op_sync into a free function by @bernhardmgruber in #7004
Remove experimental namespace from cuda.core import by @NaderAlAwar in #7022
reexpress completion signature transform alias to make clangd happy by @ericniebler in #7026
Qualify call to __launch_impl in launch.h to avoid ambiguity errors by @ericniebler in #7024
Rework hierarchy levels by @davebayer in #6957
[CUB]: use vectorized kernel for triad and add benchmark for dtypes of size 2 by @NaderAlAwar in #7019
[libcu++] Fix synchronous resource adapter property passing by @pciolkosz in #6976
[libcu++] Remove _view from the shared memory getter name by @pciolkosz in #6997
[thrust] Ignore CUDA free errors in thrust memory resource by @pciolkosz in #7002
[libcu++] Correctly handle extended lambda in cuda::launch by @pciolkosz in #6987
Use <stdexcept> header unconditionally by @fbusato in #7028
Error out when nvrtcc cannot parse cuda_thread_count by @bernhardmgruber in #7035
Allow all public headers to be included with host compilers only by @davebayer in #7012
[cuda.compute]: Fixes and updates to benchmarks by @shwina in #6999
Support operations with side-effects (state) in cuda.compute by @shwina in #7008
Fix cuda::memcpy_async edge cases and add more tests by @bernhardmgruber in #6608
Explicitly set CCCL_TOPLEVEL_PROJECT to OFF when needed by @KyleFromNVIDIA in #7016
[libcu++] Add explicit alignment specification in buffer by @pciolkosz in #7005
Use the sccache-dist build cluster for RAPIDS CI jobs by @trxcllnt in #7014
tidy up the primitive variant type used by cudax::execution by @ericniebler in #7029
Fix docs by @gevtushenko in #7052
Disable LDL/STL checks, for failures seen with NVRTC 13.1 by @shwina in #7054
Enhance DLPack compatibility by @fbusato in #7045
Support lambdas as operators in cuda.compute by @shwina in #7058
[libcu++] Make kernel_config member private and allow it in hierarchy queries by @pciolkosz in #7034
[libcu++] Remove mentions of cuda/event header from docs by @pciolkosz in #7066
[BUG] use references for mdspan internal methods by @fbusato in #7059
Avoid invalid compiler warning with VS2026 by @miscco in #7077
Avoid compiler issue with MSVC _CCCL_UNREACHABLE by @miscco in #7080
cuda.compute: Allow multiple uses of the same function in single compilation by @shwina in #7072
Refactor c2h generator to ensure teardown before main exits by @bernhardmgruber in #7067
Remove cumlprims_mg from RAPIDS workflows/devcontainers by @bdice in #7082
[DOCS] Clarifies DeviceTopK docs that input and output ranges may not overlap by @elstehle in #7078
Enhance RDC detection and add _CCCL_HAS_DEVICE_RUNTIME() macro by @davebayer in #7049
Expand warning suppression for braces around subobject by @miscco in #7087
[STF] Document how to enable assertions by @caugonnet in #7084
Simplify cuda::host_launch API by @davebayer in #6689
Improvements to cuda.compute documentation by @shwina in #7061
[libcu++] Add tests for some buffer members and alignment passing by @pciolkosz in #7055
[libcu++] Fix driver api test after curand changes by @pciolkosz in #7095
Add DeviceTransform to device wide CUB docs by @bernhardmgruber in #7101
Fix incorrect if else logic in fmax by @miscco in #7107
Add -device-type128 flags only once by @davebayer in #7100
[libcu++] Check if managed pools are accessible in is_pointer_accessible test by @pciolkosz in #7096
Fix calculation of necessary bits in feistel projection by @miscco in #7098
Fix deferred annotations handling in gpu_struct by @shwina in #7121
Use cudaMemcpyDefault for trivial copies by @bdice in #7006
Disable NVHPC builds for pull request CI by @miscco in #7135
Generator for prologue/epilogue by @davebayer in #7099
Refactor mdspan cuda::std::__detectably_invalid by @fbusato in #6733
Fix nvrtcc minimum arch for __float128 support by @davebayer in #7119
Disable cudax with msvc in CI for now by @pciolkosz in #7139
Simplify namespace definitions by @davebayer in #7104
Move DLPack include to separate file by @davebayer in #7108
Replace and deprecate compute_capability::major() and compute_capability::minor() by @davebayer in #7118
Disable reference_wrapper test for VS2026 by @miscco in #7088
Clean up hierarchy by @davebayer in #7023
Implement new tuning API arch dispatching by @bernhardmgruber in #7093
Improve std:: builtin handling with nvrtc by @davebayer in #7131
libcu++: silence msvc+nvcc12.9 warning plaguing c.parallel. by @griwes in #7144
Implement cub::DeviceFind::FindIf by @gonidelis in #2405
Fix/modernize thrust examples by @Flawxd in #7094
Modularize chrono by @miscco in #6671
Those are unused internal traits by @miscco in #7148
Implement ranges::reverse_view by @miscco in #6751
Fixes for shuffle_iterator by @RAMitchell in #7130
[STF] Use execution places without STF contexts by @caugonnet in #7149
Revert nested namespace change to <nv/target> by @wmaxey in #7151
Replace internal uses of thrust::tuple with cuda::std::tuple by @miscco in #6629
Add Android-specific assert handling in __cccl/assert.h by @fbusato in #7156
Fix make_tma_descriptor() unit test by @fbusato in #7152
Rename new tuning API policies and fix MSVC warning by @bernhardmgruber in #7103
Align local vector storage arrays in vec transform by @bernhardmgruber in #7162
Try and avoid GCC-15 warning about expected ) by @miscco in #7166
Fix build issues with documentation by @miscco in #7122
[DOCS] Improves docs for DeviceTopK, clarifying that input and output ranges must not overlap by @elstehle in #7086
Test passing a custom policy to DispatchRadixSort by @bernhardmgruber in #7170
Avoid benign overflow in __calloc_device by @miscco in #7176
Fixes for thrust::shuffle by @RAMitchell in #7172
cub, c.parallel: {lower,upper}_bound by @griwes in #7007
Initial nvrtcc implementation by @davebayer in #7051
Implement comparison operators for thrust::reference and thrust::pointer by @miscco in #7190
Agent Updates by @alliepiper in #7194
Add missing CUB_RUNTIME_FUNCTION annotations. by @alliepiper in #7195
Define methods for test ranges by @miscco in #7220
Fix noexcept specification of extreme_value_distribution by @miscco in #7219
Drop thrust::detail::is_commutative by @miscco in #7218
Try and work around NVHPC issue with is_metafunction_defined by @miscco in #7217
Reenable MSVC cudax CI by @miscco in #7221
Add support for [[lifetimebound]] by @fbusato in #7155
[cccl.c] Use function try blocks by @davebayer in #7236
Do not try to run catch2 tests with nvrtc by @miscco in #7242
Add .branch_notes. by @alliepiper in #7238
Make cuda.compute importable in a CPU-only environment by @shwina in #7171
Drop libcudacxx ABI Evolution clause by @bernhardmgruber in #7247
[CI] MSVC sccache auth -> file by @alliepiper in #7257
Implement the new tuning API for DeviceTransform by @bernhardmgruber in #6914
Refactor some bits of DeviceRadixSort by @bernhardmgruber in #7193
Drop leftover code after tuning API migration by @bernhardmgruber in #7264
Fix extracting CUDA stream in cub::DeviceTransform by @bernhardmgruber in #7239
Run build/test commands with 5h30m timeout on CI. by @alliepiper in #7213
Change the order of conditions in cuda::barrier by @davebayer in #7259
Don't run CPU-only import test if the wheel artifact doesn't exist by @shwina in #7270
Test passing more stream types to cub::DeviceTransform by @bernhardmgruber in #7278
Add versionadded directives to all public API functions by @cliffburdick in #7215
Refactor cub::DeviceRadixSort by @bernhardmgruber in #7282
[FEA]: Add DevEx/Infra ticket templates by @alliepiper in #7261
Fix __query_or CPO by @miscco in #7266
Test passing a custom policy to DispatchUniqueByKey by @bernhardmgruber in #7296
Test passing a custom policy to DispatchSelectIf by @bernhardmgruber in #7294
Test passing a custom policy to DispatchHistogram by @bernhardmgruber in #7288
Fix is_address_from for cluster_shared for pre-sm_90 by @davebayer in #7245
Test passing a custom policy to DispatchThreeWayPartitionIf by @bernhardmgruber in #7295
cuda.compute: Consolidate caching logic across all algorithms by @shwina in #7281
Test passing a custom policy to DispatchAdjacentDifference, DispatchMergeSort, DispatchScan, DispatchBatchMemcpy by @bernhardmgruber in #7289
Test passing a custom policy to DispatchSegmentedSort by @bernhardmgruber in #7307
Test passing a custom policy to DispatchSegmentedRadixSort by @bernhardmgruber in #7308
Test passing a custom policy to DispatchSegmentedReduce by @bernhardmgruber in #7311
Add accessor methods to shared_resource by @bdice in #7315
cub, c.parallel: change {lower,upper}_bound to return indices. by @griwes in #7320
[nvrtcc] Add __NVRTCC_USE_NVRTC__ macro to nvrtcc by @davebayer in #7293
Remove fmtlib from CCCL by @davebayer in #7300
Fix clang warning about missing braces again by @miscco in #7302
Test passing a custom policy to DispatchReduceByKey by @bernhardmgruber in #7310
Fix missing newline for spdx/pragma once by @msarahan in #7306
mdspan to DLPack by @fbusato in #7027
Test passing a custom policy to DeviceRleDispatch by @bernhardmgruber in #7314
Skip checking build prereqs if installing by @wmaxey in #7316
Implement the new tuning API for DeviceRadixSort by @bernhardmgruber in #6767
Add cuda.compute APIs for upper_bound and lower_bound by @shwina in #7250
cuda.compute: Fix deferred annotations handling in signature_from_annotations by @shwina in #7321
Test passing a custom policy to DispatchScanByKey by @bernhardmgruber in #7309
Fixup missed feedback on #7311 by @bernhardmgruber in #7323
Enhance internal vector type utilities by @fbusato in #7327
Fix MSVC version detection by @miscco in #7305
Implement parallel cuda::std::reduce by @miscco in #6777
Remove _CCCL_HAS_INCLUDE by @davebayer in #7304
Fix narrowing conversion in __is_valid_address_range when compiling on 32-bit systems by @davebayer in #7333
[nvrtcc] Add nvrtcc dependency by @davebayer in #7287
Fix double destroy in vector by @miscco in #7331
Optimize cuda::add_overflow for unsigned types by @davebayer in #7340
[STF] Make data place more extensible by @caugonnet in #7252
Formatters for cuda::arch_id and cuda::compute_capability by @davebayer in #7335
Drop most parts of thrust::allocator_traits by @miscco in #7286
[CI] Re-enable RAPIDS builds in PRs by @alliepiper in #7255
Use arch dispatch workaround on GCC 8-9 as well by @bernhardmgruber in #7349
Drop partial specialization in friend functions of fast_mod_div by @miscco in #7348
Add blackduck-sca.yml by @alliepiper in #7360
[cccl.c]: Disable SASS check for merge_sort pairs by @NaderAlAwar in #7357
Assert before terminating on throw in device code by @davebayer in #7358
DLPack to mdspan by @fbusato in #7047
Test VerifyCodegen only with latest CTK by @davebayer in #7351
Ensure device_find_if works with non-default-constructible types by @miscco in #7337
Fix _CCCL_THROW in dlpack_to_mdspan by @fbusato in #7363
Expose CUDA vector type traits by @fbusato in #7364
Add some ramblings about symbol visibility by @miscco in #6114
Fix handling of boolean types in cuda.compute by @shwina in #7389
Decouple numba-cuda from type system and other internals by @shwina in #7342
Fix DeviceTransform docs by @bernhardmgruber in #7392
[libcu++] Add runtime check if memory pools are supported by @pciolkosz in #7339
Pass -std flag from CLI to cmake in c.parallel build scripts by @NaderAlAwar in #7394
Migrate cuco HLL by @srinivasyadav18 in #6666
Add CTK 13.1 CI jobs, devcontainers. by @alliepiper in #6887
make the abi of __basic_any compatible between c++17 and c++20 by @ericniebler in #7401
Enforce supported long double format by @davebayer in #7345
Implement passing stream and memory resource to execution policies by @miscco in #7299
Reenable overflow builtins with nvc++ 26.1+ by @davebayer in #7414
Implement the new tuning API for DeviceSegmentedReduce by @bernhardmgruber in #7334
part deux: make the abi of __basic_any compatible between c++17 and c++20 by @ericniebler in #7405
Move host stdlib wrappers to <cuda/std/__host_stdlib> directory by @davebayer in #7411
Improves error handling in CUB algorithms using thrust::triple_chevron calls by @elstehle in #7415
[DOCS] Clarifies that thrust::reduce_by_key actually selects the last item of a range of consecutively equal keys by @elstehle in #7408
fix determinism rejection logic for scan by @srinivasyadav18 in #7133
Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #6990
[libcu++] Fix handle-type mask checks in memory pool tests. by @pciolkosz in #7428
Avoid deallocate throwing by @gonidelis in #7233
Drop accidental [[nodiscard]] on constructor by @miscco in #7413
Use SPDX license headers in thrust/thrust/system/detail/generic by @bernhardmgruber in #7422
Use SPDX license headers in thrust/thrust/detail by @bernhardmgruber in #7423
Update Catch2 to 3.12 by @bernhardmgruber in #6067
Add DeviceTransform benchmarks from pytorch by @bernhardmgruber in #7391
Refactor cub::ThreadLoad by @bernhardmgruber in #7419
cuda.compute: Don't attempt to set host_advance by @shwina in #7425
Use SPDX license headers in thrust/system/cpp by @bernhardmgruber in #7430
Use SPDX license headers in thrust/random by @bernhardmgruber in #7435
Use SPDX license headers in thrust/system/cuda by @bernhardmgruber in #7432
Use SPDX license headers in thrust/iterator by @bernhardmgruber in #7433
Use SPDX license headers in some thrust files 2/2 by @bernhardmgruber in #7437
Use SPDX license headers in thrust/system/detail/sequential by @bernhardmgruber in #7431
Use SPDX license headers in some thrust files 1/2 by @bernhardmgruber in #7436
Optimize cuda::sub_overflow by @davebayer in #7344
[c.parallel]: migrate transform to use jit templates instead of string based implementations by @NaderAlAwar in #7399
Fix build against libc++ by @miscco in #7448
cuda.compute: Fix struct comparison (ordering matters) by @shwina in #7451
Disable batch benchmarks for DeviceTransform by @bernhardmgruber in #7450
Initial version of DeviceSegmentedTopk for fixed-size segments by @elstehle in #6980
Document random module by @RAMitchell in #7412
Check for __cpp_xxx value, not definition by @davebayer in #4811
Fix missing c2h symbol when compiling with clang-cuda by @davebayer in #7454
Optimize cuda::add_overflow for signed types by @davebayer in #7343
Implement parallel cuda::std::transform by @miscco in #7395
[cuda.coop]: add device-side coop.warp.sum benchmark with pynvbench by @NaderAlAwar in #6846
Replace typedef with using by @davebayer in #7271
Refactor cub::ThreadStore by @bernhardmgruber in #7418
Fix non default constructible input types test for cub::FindIf by @gonidelis in #7447
Add versionadded annotations to CUB public APIs by @cliffburdick in #7406
Fix random doc formatting by @RAMitchell in #7492
Include ninja, ctest, and sccache logs in build artifacts by @trxcllnt in #7487
Suppress MSVC-specific warnings on linux. by @alliepiper in #7462
Drop usage of pickle by @miscco in #7491
Add cub::DeviceTransform N->M API entrypoint by @bernhardmgruber in #7473
Improve reduce implementation by @miscco in #7493
Minor libcu++ lit config improvements by @trxcllnt in #7486
Implement parallel cuda::std::replace by @miscco in #7407
cuda.compute: improve caching performance by not relying on isinstance() checks for protocols by @shwina in #7501
Retry failed image pulls 10 times by @trxcllnt in #7488
Fix DeviceReduce env test for rfa by @bernhardmgruber in #7481
Pass PARALLEL_LEVEL to cmake --build in ci/build_stdpar.sh by @trxcllnt in #7483
cuda.compute: Use native CCCL.c support for stateful ops by @shwina in #7500
[STF] Add exec_place_guard RAII helper for scoped exec_place activation by @caugonnet in #7434
Implement parallel cuda::std::replace_copy by @miscco in #7410
Provide Operator Properties by @fbusato in #7240
Implement parallel cuda::std::generate by @miscco in #7416
[STF] Fix equality operators for places by @caugonnet in #7494
Implement parallel cuda::std::count by @miscco in #7382
Clarify Docker install step for WSL in .devcontainer README by @acosmicflamingo in #7516
Restrict to numba-cuda less than 0.27 by @shwina in #7529
Fix caching of functions referencing numpy ufuncs (or any "dotted" functions) by @shwina in #7535
Add docs for cccl-runtime 3.2 additions and cccl-runtime landing page by @pciolkosz in #7489
Tweak random docs by @RAMitchell in #7534
Use modern __syncthreads_or primitive by @gonidelis in #7509
Fix identity_element tests by @miscco in #7526
[STF] Use relaxed capture mode by @caugonnet in #7566
Remove recursion from __internal_is_address_from by @dkolsen-pgi in #7561
Fix operator identity and absorbing element for char by @bernhardmgruber in #7568
Rewrite iterators to not depend on Numba by @shwina in #7441
[Backport branch/3.3.x] Fix ranges_overlap for nvc++ -cuda by @github-actions[bot] in #7599
[Backport branch/3.3.x] Fix cuda::device::current_arch_id by @github-actions[bot] in #7602
[Backport branch/3.3.x] Fix cuda::barrier missing accounting of results in try_wait by @github-actions[bot] in #7635
[Backport branch/3.3.x] Check for _GLIBCXX_USE_CXX11_ABI only when compiling with libstdc++ by @github-actions[bot] in #7631

New Contributors

@miyanyan made their first contribution in #6773
@Pansysk75 made their first contribution in #6572
@rbourgeois33 made their first contribution in #6674
@ktaletsk made their first contribution in #6910
@asmelko made their first contribution in #6971
@salmanmkc made their first contribution in #6991
@KyleFromNVIDIA made their first contribution in #7016
@Flawxd made their first contribution in #7094
@msarahan made their first contribution in #7306
@acosmicflamingo made their first contribution in #7516

Full Changelog: v3.3.0.dev...v3.3.0