Full Changelog: v3.3.0...v3.3.0
What's Changed
📚 Libcudacxx
- [libcudacxx] Fix a typo in the documentation by @caugonnet in #7330
- Add a test for <nv/target> to validate old dialect support. by @wmaxey in #7241
🔄 Other Changes
- Implement cudax::cufile by @davebayer in #6122
- Update linear_congruential_generator with constexpr, tests and a fast discard by @RAMitchell in #6402
- Replace _CCCL_HAS_CUDA_COMPILER() with _CCCL_CUDA_COMPILATION() by @davebayer in #6399
- Remove unnecessary casts in complex multiplication/division by @davebayer in #6670
- Add benchmark batch script by @bernhardmgruber in #6661
- Improvements and testing for inspect_changes CI functionality. by @alliepiper in #6535
- Improve clarity of CCCL assert macro documentation by @jrhemstad in #6675
- Fix oversubscription issue with lit precompile, label hack by @alliepiper in #6554
- Make missing sccache nonfatal. by @alliepiper in #6582
- Address pending comments for make_tma_descriptor by @fbusato in #6662
- Add nvhpc 25.9. by @alliepiper in #6003
- Test building for all arches. by @alliepiper in #6113
- Add nvbench_helper tests to CI. by @alliepiper in #6679
- Add more targets to pytorch build. by @alliepiper in #6685
- Add host std lib version detection by @davebayer in #6678
- Improve CUB benchmark docs by @bernhardmgruber in #6640
- Use if consteval in libcu++ by @davebayer in #6424
- Update docs for _CCCL_IF_CONSTEVAL by @davebayer in #6692
- Fixes issue with select close to int_max by @elstehle in #6641
- Update libcudacxx C++ dialect handling. by @alliepiper in #6693
- Simplifies env usage in DeviceTopK tests by @elstehle in #6680
- Switch to S3 preprocessor cache by @alliepiper in #6561
- fix omp scan bug by @charan-003 in #6560
- Refactor out variant from transform tunings by @bernhardmgruber in #6669
- [libcu++] Waive hierarchy constexpr testing on GCC8 by @pciolkosz in #6707
- Use wrapper with void* argument types for iterator advance/dereference signature by @shwina in #6634
- Restore libcudacxx dialect presets. by @alliepiper in #6705
- Refactor error handling in radix sort dispatch by @bernhardmgruber in #6681
- Remove special dialect handling from cudax build system. by @alliepiper in #6702
- Segmented scan followup by @oleksandr-pavlyk in #6706
- Fix electing leader from any group in cuda::memcpy_async by @bernhardmgruber in #6710
- Avoid scaling twice in ReduceNondeterministicPolicy by @bernhardmgruber in #6711
- Remove special handling of C++ dialect in CUB's build system by @alliepiper in #6713
- [libcu++] Use resource test fixture members through this by @pciolkosz in #6717
- Improves top-k examples to illustrate stream usage by @elstehle in #6723
- Tweak sol.py a bit by @bernhardmgruber in #6721
- Implement PCG64 as extension by @RAMitchell in #6292
- Use PDL in cub::DeviceScan by @bernhardmgruber in #6639
- Fix header in libcudacxx test by @alliepiper in #6726
- Remove dead code. by @alliepiper in #6725
- Add deps on thrust/cub to libcudacxx. by @alliepiper in #6694
- Remove special handling for dialect in Thrust's build system. by @alliepiper in #6722
- [libcu++] Automatically bump up the release threshold of default mempools by @pciolkosz in #6718
- Backport cuda::std::reference_wrapper C++20 features by @davebayer in #6709
- Relax error tolerance for deterministic_device_reduce (RFA) test by @srinivasyadav18 in #6720
- [DOC] Add temp_storage_bytes usage guide by @Aminsed in #6208
- Improve charconv test compile times by @davebayer in #6687
- Move source location builtins directly to <cuda/std/source_location> by @davebayer in #6738
- Small improvements for cuda::ipow by @davebayer in #6736
- Add support for clang's alignment builtins by @davebayer in #6741
- Disable test that is failing in multiple configurations by @miscco in #6745
- Implement std::normal_distribution by @RAMitchell in #6585
- Update cuda::std::span concepts by @davebayer in #6744
- Improve bit builtins support by @davebayer in #6737
- Implement ranges::drop_view by @miscco in #5049
- Improve fp decompose by @davebayer in #6749
- Enable caching of advance/dereference methods for Zipiterator and PermutationIterator by @shwina in #6753
- implement indeterminate_domain from P3826R2 by @ericniebler in #6628
- Fix cuda::std::reference_wrapper noexcept test with gcc-8 by @davebayer in #6757
- cuda.compute: In TransformIterator, use type annotations (if available) to determine the output type of user-provided op by @shwina in #6760
- cuda.compute: Fixes and improvements to function caching by @shwina in #6758
- Fix __throw_cuda_error availability with nvrtc by @davebayer in #6759
- Implement ranges::find_if and ranges::find_if_not by @miscco in #6752
- Fix radix_sort tuning namespace by @bernhardmgruber in #6755
- [libcu++] Add sm_62 arch traits by @pciolkosz in #6772
- fix(readme): Update broken Godbolt example link by @miyanyan in #6773
- Implement CUDA backend for parallel cuda::std::for_each by @miscco in #5610
- Ensure that we properly warn about device lambdas that need to query the return type by @miscco in #6765
- Add missing test for thrust::reduce_into by @Pansysk75 in #6572
- cuda.compute: Add select algorithm based on three_way_partition by @shwina in #6766
- Add queries for CUB ptx version as arch_id by @bernhardmgruber in #6776
- Add operator<< to some CUB enums by @bernhardmgruber in #6774
- cuda.compute: Fix caching of functions that call other functions by @shwina in #6770
- Implement std::exponential_distribution by @RAMitchell in #6750
- Fix issue with libcudacxx header tests. by @alliepiper in #6785
- Add a type and operation enum to CUB by @bernhardmgruber in #6780
- Use conventional order of _CCCL_API friend consistently by @miscco in #6781
- Implement std::binomial_distribution by @RAMitchell in #6747
- Fixes i32 overflow for benchmark data generation of more than INT_MAX number of items by @elstehle in #6809
- Temporarily add upper bound to numba-cuda dependency by @shwina in #6815
- Make cuda capabilities part of cccl config by @davebayer in #6806
- Update std::uniform_real_distribution by @RAMitchell in #6798
- [cub] Implement cub::MaxPotentialDynamicSmemBytes by @davebayer in #6818
- libcudacxx: streamline simple trait aliases by @Aminsed in #6740
- Fix a typo in compute.rst by @shwina in #6826
- Improve our WarpReduce implementation by @miscco in #6814
- Implement cuda::sincos by @davebayer in #6742
- Replace inline ptx with intrinsics by @davebayer in #6810
- [cudax->libcu++] Move buffer type from cudax to libcu++ by @pciolkosz in #6627
- Improve CMake package handling, add MSVC compat flags to libcudacxx's public interface. by @alliepiper in #6791
- Fix arch related cuda::device:: APIs for nvhpc in CUDA mode by @davebayer in #6829
- Implement the new tuning API for DeviceReduce by @bernhardmgruber in #6544
- Replace internal assert with _CCCL_ASSERT in libcu++ by @davebayer in #6825
- Implement std::gamma_distribution by @RAMitchell in #6786
- Run scan benchmark for 2^32 elements by @bernhardmgruber in #6834
- Remove upper bound on numba-cuda by @shwina in #6835
- Use lit for cuda::arch_id and cuda::compute_capability tests by @davebayer in #6775
- Extends DeviceScan tests in preparation for the warpspeed scan implementation by @elstehle in #6836
- smoke test for all_of algorithm by @viralbhadeshiya in #6828
- [CUB][device] Add a env-based overload of the device segmented reductions primitives by @rbourgeois33 in #6674
- Avoid use of cccl namespace macros in cub by @davebayer in #6844
- Beautify vector mismatch reporting by @bernhardmgruber in #6837
- Implement std::lognormal_distribution by @RAMitchell in #6789
- Implement std::weibull_distribution by @RAMitchell in #6797
- [PTX] Add cp.async.bulk.dst.src.mbarrier::complete_tx::bytes.ignore_oob by @bernhardmgruber in #6854
- Complex asinh accuracy refinement by @s-oboyle in #6428
- Allow numpy struct types as initial value for Zipiterator inputs by @shwina in #6861
- In test_device_segmented_scan_api change type from int to unsigned by @oleksandr-pavlyk in #6868
- Add missing doc strings to support old CMake. by @alliepiper in #6869
- avoid error adding pointer to reference in any_resource by @ericniebler in #6875
- smoke test for adjacent_difference by @viralbhadeshiya in #6872
- Bump minimum CMake to 3.18, add CI testing of public packages with it. by @alliepiper in #6871
- [PTX] Regenerate by @bernhardmgruber in #6859
- Implement std::poisson distribution by @RAMitchell in #6748
- [libcu++] Add memory_pool header and correct legacy resources namespace by @pciolkosz in #6852
- [cuda.compute]: Fix issue with get_dtype() not working anymore for pytorch arrays by @NaderAlAwar in #6882
- [cuda.compute]: Add fast path to extract PyTorch array pointer by @NaderAlAwar in #6884
- accommodate new behavior of clang's __builtin_structured_binding_size by @ericniebler in #6888
- [libcu++] Don't require accessibility property on type erased wrappers by @pciolkosz in #6851
- Move launch API from cudax to libcu++ by @pciolkosz in #6667
- [libcu++] Fix minor version compatibility in 13.X by @pciolkosz in #6895
- [libcu++] Leak static CUDA resources and add missing release on memory pool by @pciolkosz in #6892
- Add limited RTX PRO 6000 coverage. by @alliepiper in #6841
- Add quotes and error checking to devcontainer init. by @alliepiper in #6886
- Update std::uniform_int_distribution by @RAMitchell in #6799
- portability macro for checking whether an expression satisfies a concept in a _CCCL_REQUIRES_EXPR clause by @ericniebler in #6890
- Add ptxas local memory usage warnings to cub builds by @davebayer in #6838
- Pcg64 uint128 fallback implementation for MSVC by @RAMitchell in #6746
- [cuda.compute] Add dependency on nvidia-nvvm by @shwina in #6909
- Add missing OpKind docs entries by @ktaletsk in #6910
- Fix overflow issue in histogram even benchmark when the number of bins exceeds what LevelT can represent by @NaderAlAwar in #6908
- [libcu++] Add as_ref() to memory pool types by @pciolkosz in #6900
- Fix exhaustive policy chain pruning test by @bernhardmgruber in #6903
- Extend transform benchmarks to 2^32 elements by @bernhardmgruber in #6920
- Implement std::cauchy_distribution by @RAMitchell in #6787
- Implement std::extreme_value_distribution by @RAMitchell in #6788
- Implement std::fisher_f_distribution by @RAMitchell in #6857
- change concepts portability macros to avoid use of macro EXPAND by @ericniebler in #6366
- Implement cuda::__all_arch_ids and cuda::__is_specific_arch by @davebayer in #6916
- [libcu++] Rename device_transform back to launch_transform by @pciolkosz in #6927
- Add unsupported compiler flag to .clangd by @ericniebler in #6911
- Add an option to use CCCL from CTK for C2H by @bernhardmgruber in #6848
- Avoid warning about missing braces for subobject by @miscco in #6929
- Add missing nvrtc nv target archs by @davebayer in #6880
- Make sure we actually use overflow builtins by @davebayer in #6904
- Implement std::chi_squared_distribution by @RAMitchell in #6856
- [libcu++] Static assert that resource is copyable in buffer constructors by @pciolkosz in #6928
- Use vectorized transform kernel for sizeof(T) < 4 workloads of arity >1 on Hopper by @bernhardmgruber in #6921
- port the trampoline_scheduler from stdexec to cudax::execution by @ericniebler in #6894
- Properly specialize cub functions for __nv_bfloat16 by @miscco in #6931
- [CUB] Fix mask types in block_radix_rank.cuh by @Aminsed in #6189
- [CUB]: Use the new tuning API for nondeterministic reduce by @NaderAlAwar in #6932
- clean up some allocator and memory utilities by @ericniebler in #6939
- Unify operator handling in cuda.compute by @shwina in #6938
- Use integer promotion for warp_reduce by @miscco in #6819
- implement task_scheduler from C++26 ([exec.task.scheduler]) by @ericniebler in #5975
- Implement std::negative_binomial_distribution by @RAMitchell in #6879
- Implement std::student_t_distribution by @RAMitchell in #6858
- Remove [[nodiscard]] from barrier's .arrive(...) method by @davebayer in #6947
- Implement std::geometric_distribution by @RAMitchell in #6924
- [cuda.compute] Refactor code for creating void* wrappers by @shwina in #6941
- Expose not guaranteed determinism to reduce in cuda.compute by @NaderAlAwar in #6926
- Make __cccl_is_floating_point_v consistent with __cccl_is_integer_v by @davebayer in #6952
- Disable __builtin_structured_binding_size with nvcc by @davebayer in #6961
- Provide thrust::find_if benchmark by @gonidelis in #6956
- Remove all usage of old experimental MR macro by @pciolkosz in #6962
- Fix cuda::std::abs for floating points by @davebayer in #6958
- Expose <cuda/std/charconv> by @davebayer in #6672
- Don't use __builtin_bswap128 during constant evaluation by @davebayer in #6967
- Avoid using _CCCL_UNREACHABLE() unless it's necessary by @davebayer in #6948
- Cleanups for random module by @RAMitchell in #6951
- Remove need for hardcoded LevelT for histogram in c.parallel and cuda.compute by @NaderAlAwar in #6915
- Use ublkcp/memcpy_async in transform when dtype size is not a power of two by @NaderAlAwar in #6972
- Add internal cuda::__is_device_or_managed_memory by @fbusato in #6918
- Improve and apply _CCCL_THROW by @fbusato in #6684
- Define _CCCL_ASSERT_IMPL_HOST correctly for clang on Windows by @asmelko in #6971
- Allow if consteval in device code with nvcc 13.1 by @davebayer in #6902
- c.parallel: reuse CUB agent policies for histogram by @NaderAlAwar in #6974
- [libcu++] Dynamically load CUDA library instead of using the runtime by @pciolkosz in #6899
- [libcu++] Uncomment some tests and fix launch include after launch was moved to libcu++ by @pciolkosz in #6966
- Re-enable using std::meow builtins by @davebayer in #6978
- use cooperative_groups in execution::bulk to synchronize across thread blocks by @ericniebler in #6992
- Provide Shared Memory mdspan and accessor by @fbusato in #6703
- Add non-throwing overloads to is_pointer_accessible by @fbusato in #6988
- [cuda.compute]: fix alignment not being set properly for gpu_struct types by @NaderAlAwar in #6995
- Workaround for a potential bug in the driver related to TMA descriptor by @fbusato in #6985
- do not introduce a pack in a structured binding with nvcc by @ericniebler in #6994
- Extract environment boilerplate code from within the device interfaces to a separate header by @gonidelis in #6622
- Remove _CCCL_HAS_CUDA_COMPILER() by @davebayer in #6984
- [libcu++] Fix memory pool and buffer test issues on Windows by @pciolkosz in #6993
- fix off-by-one error in the implementation of cuda::std::__tuple by @ericniebler in #6996
- Small fixes around warpspeed scan by @bernhardmgruber in #6998
- Remove <version> include by @davebayer in #7001
- Upgrade GitHub Actions to latest versions by @salmanmkc in #6991
- cuda.coop: Use cuda.core.experimental.Linker instead of internal numba-cuda _Linker by @shwina in #7011
- Make c2h vector utils constexpr by @davebayer in #7009
- Improves comments on decoupled look-back code example by @elstehle in #7015
- Extract reduce_op_sync into a free function by @bernhardmgruber in #7004
- Remove experimental namespace from cuda.core import by @NaderAlAwar in #7022
- reexpress completion signature transform alias to make clangd happy by @ericniebler in #7026
- Qualify call to __launch_impl in launch.h to avoid ambiguity errors by @ericniebler in #7024
- Rework hierarchy levels by @davebayer in #6957
- [CUB]: use vectorized kernel for triad and add benchmark for dtypes of size 2 by @NaderAlAwar in #7019
- [libcu++] Fix synchronous resource adapter property passing by @pciolkosz in #6976
- [libcu++] Remove _view from the shared memory getter name by @pciolkosz in #6997
- [thrust] Ignore CUDA free errors in thrust memory resource by @pciolkosz in #7002
- [libcu++] Correctly handle extended lambda in cuda::launch by @pciolkosz in #6987
- Use <stdexcept> header unconditionally by @fbusato in #7028
- Error out when nvrtcc cannot parse cuda_thread_count by @bernhardmgruber in #7035
- Allow all public headers to be included with host compilers only by @davebayer in #7012
- [cuda.compute]: Fixes and updates to benchmarks by @shwina in #6999
- Support operations with side-effects (state) in cuda.compute by @shwina in #7008
- Fix cuda::memcpy async edge cases and add more tests by @bernhardmgruber in #6608
- Explicitly set CCCL_TOPLEVEL_PROJECT to OFF when needed by @KyleFromNVIDIA in #7016
- [libcu++] Add explicit alignment specification in buffer by @pciolkosz in #7005
- Use the sccache-dist build cluster for RAPIDS CI jobs by @trxcllnt in #7014
- tidy up the primitive variant type used by cudax::execution by @ericniebler in #7029
- Fix docs by @gevtushenko in #7052
- Disable LDL/STL checks, for failures seen with NVRTC 13.1 by @shwina in #7054
- Enhance DLPack compatibility by @fbusato in #7045
- Support lambdas as operators in cuda.compute by @shwina in #7058
- [libcu++] Make kernel_config member private and allow it in hierarchy queries by @pciolkosz in #7034
- [libcu++] Remove mentions of cuda/event header from docs by @pciolkosz in #7066
- [BUG] use references for mdspan internal methods by @fbusato in #7059
- Avoid invalid compiler warning with VS2026 by @miscco in #7077
- Avoid compiler issue with MSVC _CCCL_UNREACHABLE by @miscco in #7080
- cuda.compute: Allow multiple uses of the same function in single compilation by @shwina in #7072
- Refactor c2h generator to ensure teardown before main exits by @bernhardmgruber in #7067
- Remove cumlprims_mg from RAPIDS workflows/devcontainers by @bdice in #7082
- [DOCS] Clarifies DeviceTopK docs that inputs and output ranges may not overlap by @elstehle in #7078
- Enhance RDC detection and add _CCCL_HAS_DEVICE_RUNTIME() macro by @davebayer in #7049
- Expand warning suppression for braces around subobject by @miscco in #7087
- [STF] Document how to enable assertions by @caugonnet in #7084
- Simplify cuda::host_launch API by @davebayer in #6689
- Improvements to cuda.compute documentation by @shwina in #7061
- [libcu++] Add tests for some buffer members and alignment passing by @pciolkosz in #7055
- [libcu++] Fix driver api test after curand changes by @pciolkosz in #7095
- Add DeviceTransform to device wide CUB docs by @bernhardmgruber in #7101
- Fix incorrect if else logic in fmax by @miscco in #7107
- Add -device-type128 flags only once by @davebayer in #7100
- [libcu++] Check if managed pools are accessible in is_pointer_accessible test by @pciolkosz in #7096
- Fix calculation of necessary bits in feistel projection by @miscco in #7098
- Fix deferred annotations handling in gpu_struct by @shwina in #7121
- Use cudaMemcpyDefault for trivial copies by @bdice in #7006
- Disable NVHPC builds for pull request CI by @miscco in #7135
- Generator for prologue/epilogue by @davebayer in #7099
- Refactor mdspan cuda::std::__detectably_invalid by @fbusato in #6733
- Fix nvrtcc minimum arch for __float128 support by @davebayer in #7119
- Disable cudax with msvc in CI for now by @pciolkosz in #7139
- Simplify namespace definitions by @davebayer in #7104
- Move DLPack include to separate file by @davebayer in #7108
- Replace and deprecate compute_capability::major() and compute_capability::minor() by @davebayer in #7118
- Disable reference_wrapper test for VS2026 by @miscco in #7088
- Clean up hierarchy by @davebayer in #7023
- Implement new tuning API arch dispatching by @bernhardmgruber in #7093
- Improve std:: builtin handling with nvrtc by @davebayer in #7131
- libcu++: silence msvc+nvcc12.9 warning plaguing c.parallel. by @griwes in #7144
- Implement cub::DeviceFind::FindIf by @gonidelis in #2405
- Fix/modernize thrust examples by @Flawxd in #7094
- Modularize chrono by @miscco in #6671
- Those are unused internal traits by @miscco in #7148
- Implement ranges::reverse_view by @miscco in #6751
- Fixes for shuffle_iterator by @RAMitchell in #7130
- [STF] Use execution places without STF contexts by @caugonnet in #7149
- Revert nested namespace change to <nv/target> by @wmaxey in #7151
- Replace internal uses of thrust::tuple with cuda::std::tuple by @miscco in #6629
- Add Android-specific assert handling in __cccl/assert.h by @fbusato in #7156
- Fix make_tma_descriptor() unit test by @fbusato in #7152
- Rename new tuning API policies and fix MSVC warning by @bernhardmgruber in #7103
- Align local vector storage arrays in vec transform by @bernhardmgruber in #7162
- Try and avoid GCC-15 warning about expected ) by @miscco in #7166
- Fix build issues with documentation by @miscco in #7122
- [DOCS] Improves docs for DeviceTopK, clarifying that inputs and output ranges must not overlap by @elstehle in #7086
- Test passing a custom policy to DispatchRadixSort by @bernhardmgruber in #7170
- Avoid benign overflow in __calloc_device by @miscco in #7176
- Fixes for thrust::shuffle by @RAMitchell in #7172
- cub, c.parallel: {lower,upper}_bound by @griwes in #7007
- Initial nvrtcc implementation by @davebayer in #7051
- Implement comparison operators for thrust::reference and thrust::pointer by @miscco in #7190
- Agent Updates by @alliepiper in #7194
- Add missing CUB_RUNTIME_FUNCTION annotations. by @alliepiper in #7195
- Define methods for test ranges by @miscco in #7220
- Fix noexcept specification of extreme_value_distribution by @miscco in #7219
- Drop thrust::detail::is_commutative by @miscco in #7218
- Try and work around NVHPC issue with is_metafunction_defined by @miscco in #7217
- Reenable MSVC cudax CI by @miscco in #7221
- Add support for [[lifetimebound]] by @fbusato in #7155
- [cccl.c] Use function try blocks by @davebayer in #7236
- Do not try to run catch2 tests with nvrtc by @miscco in #7242
- Add .branch_notes. by @alliepiper in #7238
- Make cuda.compute importable in a CPU-only environment by @shwina in #7171
- Drop libcudacxx ABI Evolution clause by @bernhardmgruber in #7247
- [CI] MSVC sccache auth -> file by @alliepiper in #7257
- Implement the new tuning API for DeviceTransform by @bernhardmgruber in #6914
- Refactor some bits of DeviceRadixSort by @bernhardmgruber in #7193
- Drop leftover code after tuning API migration by @bernhardmgruber in #7264
- Fix extracting CUDA stream in cub::DeviceTransform by @bernhardmgruber in #7239
- Run build/test commands with 5h30m timeout on CI. by @alliepiper in #7213
- Change the order of conditions in cuda::barrier by @davebayer in #7259
- Don't run CPU-only import test if the wheel artifact doesn't exist by @shwina in #7270
- Test passing more stream types to cub::DeviceTransform by @bernhardmgruber in #7278
- Add versionadded directives to all public API functions by @cliffburdick in #7215
- Refactor cub::DeviceRadixSort by @bernhardmgruber in #7282
- [FEA]: Add DevEx/Infra ticket templates by @alliepiper in #7261
- Fix __query_or CPO by @miscco in #7266
- Test passing a custom policy to DispatchUniqueByKey by @bernhardmgruber in #7296
- Test passing a custom policy to DispatchSelectIf by @bernhardmgruber in #7294
- Test passing a custom policy to DispatchHistogram by @bernhardmgruber in #7288
- Fix is_address_from for cluster_shared for pre-sm_90 by @davebayer in #7245
- Test passing a custom policy to DispatchThreeWayPartitionIf by @bernhardmgruber in #7295
- cuda.compute: Consolidate caching logic across all algorithms by @shwina in #7281
- Test passing a custom policy to DispatchAdjacentDifference, DispatchMergeSort, DispatchScan, DispatchBatchMemcpy by @bernhardmgruber in #7289
- Test passing a custom policy to DispatchSegmentedSort by @bernhardmgruber in #7307
- Test passing a custom policy to DispatchSegmentedRadixSort by @bernhardmgruber in #7308
- Test passing a custom policy to DispatchSegmentedReduce by @bernhardmgruber in #7311
- Add accessor methods to shared_resource by @bdice in #7315
- cub, c.parallel: change {lower,upper}_bound to return indices. by @griwes in #7320
- [nvrtcc] Add __NVRTCC_USE_NVRTC__ macro to nvrtcc by @davebayer in #7293
- Remove fmtlib from CCCL by @davebayer in #7300
- Fix clang warning about missing braces again by @miscco in #7302
- Test passing a custom policy to DispatchReduceByKey by @bernhardmgruber in #7310
- Fix missing newline for spdx/pragma once by @msarahan in #7306
- mdspan to DLPack by @fbusato in #7027
- Test passing a custom policy to DeviceRleDispatch by @bernhardmgruber in #7314
- Skip checking build prereqs if installing by @wmaxey in #7316
- Implement the new tuning API for DeviceRadixSort by @bernhardmgruber in #6767
- Add cuda.compute APIs for upper_bound and lower_bound by @shwina in #7250
- cuda.compute: Fix deferred annotations handling in signature_from_annotations by @shwina in #7321
- Test passing a custom policy to DispatchScanByKey by @bernhardmgruber in #7309
- Fixup missed feedback on #7311 by @bernhardmgruber in #7323
- Enhance internal vector type utilities by @fbusato in #7327
- Fix MSVC version detection by @miscco in #7305
- Implement parallel cuda::std::reduce by @miscco in #6777
- Remove _CCCL_HAS_INCLUDE by @davebayer in #7304
- Fix narrow conversion in __is_valid_address_range when compiling on 32-bit systems by @davebayer in #7333
- [nvrtcc] Add nvrtcc dependency by @davebayer in #7287
- Fix double destroy in vector by @miscco in #7331
- Optimize cuda::add_overflow for unsigned types by @davebayer in #7340
- [STF] Make data place more extensible by @caugonnet in #7252
- Formatters for cuda::arch_id and cuda::compute_capability by @davebayer in #7335
- Drop most parts of thrust::allocator_traits by @miscco in #7286
- [CI] Re-enable RAPIDS builds in PRs by @alliepiper in #7255
- Use arch dispatch workaround on GCC 8-9 as well by @bernhardmgruber in #7349
- Drop partial specialization in friend functions of fast_mod_div by @miscco in #7348
- Add blackduck-sca.yml by @alliepiper in #7360
- [cccl.c]: Disable SASS check for merge_sort pairs by @NaderAlAwar in #7357
- Assert before terminating on throw in device code by @davebayer in #7358
- DLPack to mdspan by @fbusato in #7047
- Test VerifyCodegen only with latest CTK by @davebayer in #7351
- Ensure device_find_if works with non-default-constructible types by @miscco in #7337
- Fix _CCCL_THROW in dlpack_to_mdspan by @fbusato in #7363
- Expose CUDA vector type traits by @fbusato in #7364
- Add some ramblings about symbol visibility by @miscco in #6114
- Fix handling of boolean types in cuda.compute by @shwina in #7389
- Decouple numba-cuda from type system and other internals by @shwina in #7342
- Fix DeviceTransform docs by @bernhardmgruber in #7392
- [libcu++] Add runtime check if memory pools are supported by @pciolkosz in #7339
- Pass -std flag from CLI to cmake in c.parallel build scripts by @NaderAlAwar in #7394
- Migrate cuco HLL by @srinivasyadav18 in #6666
- Add CTK 13.1 CI jobs, devcontainers. by @alliepiper in #6887
- make the abi of __basic_any compatible between c++17 and c++20 by @ericniebler in #7401
- Enforce supported long double format by @davebayer in #7345
- Implement passing stream and memory resource to execution policies by @miscco in #7299
- Reenable overflow builtins with nvc++ 26.1+ by @davebayer in #7414
- Implement the new tuning API for DeviceSegmentedReduce by @bernhardmgruber in #7334
- part deux: make the abi of __basic_any compatible between c++17 and c++20 by @ericniebler in #7405
- Move host stdlib wrappers to <cuda/std/__host_stdlib> directory by @davebayer in #7411
- Improves error handling in CUB algorithms using thrust::triple_chevron calls by @elstehle in #7415
- [DOCS] Clarifies that thrust::reduce_by_key actually selects the last item of a range of consecutively equal keys by @elstehle in #7408
- fix determinism rejection logic for scan by @srinivasyadav18 in #7133
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #6990
- [libcu++] Fix handle-type mask checks in memory pool tests. by @pciolkosz in #7428
- Avoid deallocate throwing by @gonidelis in #7233
- Drop accidental [[nodiscard]] on constructor by @miscco in #7413
- Use SPDX license headers in thrust/thrust/system/detail/generic by @bernhardmgruber in #7422
- Use SPDX license headers in thrust/thrust/detail by @bernhardmgruber in #7423
- Update Catch2 to 3.12 by @bernhardmgruber in #6067
- Add DeviceTransform benchmarks from pytorch by @bernhardmgruber in #7391
- Refactor cub::ThreadLoad by @bernhardmgruber in #7419
- cuda.compute: Don't attempt to set host_advance by @shwina in #7425
- Use SPDX license headers in thrust/system/cpp by @bernhardmgruber in #7430
- Use SPDX license headers in thrust/random by @bernhardmgruber in #7435
- Use SPDX license headers in thrust/system/cuda by @bernhardmgruber in #7432
- Use SPDX license headers in thrust/iterator by @bernhardmgruber in #7433
- Use SPDX license headers some thrust files 2/2 by @bernhardmgruber in #7437
- Use SPDX license headers in thrust/system/detail/sequential by @bernhardmgruber in #7431
- Use SPDX license headers some thrust files 1/2 by @bernhardmgruber in #7436
- Optimize cuda::sub_overflow by @davebayer in #7344
- [c.parallel]: migrate transform to use jit templates instead of string based implementations by @NaderAlAwar in #7399
- Fix build against libc++ by @miscco in #7448
- cuda.compute: Fix struct comparison (ordering matters) by @shwina in #7451
- Disable batch benchmarks for DeviceTransform by @bernhardmgruber in #7450
- Initial version of DeviceSegmentedTopk for fixed-size segments by @elstehle in #6980
- Document random module by @RAMitchell in #7412
- Check for __cpp_xxx value, not definition by @davebayer in #4811
- Fix missing c2h symbol when compiling with clang-cuda by @davebayer in #7454
- Optimize cuda::add_overflow for signed types by @davebayer in #7343
- Implement parallel cuda::std::transform by @miscco in #7395
- [cuda.coop]: add device-side coop.warp.sum benchmark with pynvbench by @NaderAlAwar in #6846
- Replace typedef with using by @davebayer in #7271
- Refactor cub::ThreadStore by @bernhardmgruber in #7418
- Fix non default constructible input types test for cub::FindIf by @gonidelis in #7447
- Add versionadded annotations to CUB public APIs by @cliffburdick in #7406
- Fix random doc formatting by @RAMitchell in #7492
- Include ninja, ctest, and sccache logs in build artifacts by @trxcllnt in #7487
- Suppress MSVC-specific warnings on linux. by @alliepiper in #7462
- Drop usage of pickle by @miscco in #7491
- Add cub::DeviceTransform N->M API entrypoint by @bernhardmgruber in #7473
- Improve reduce implementation by @miscco in #7493
- Minor libcu++ lit config improvements by @trxcllnt in #7486
- Implement parallel cuda::std::replace by @miscco in #7407
- cuda.compute: improve caching performance by not relying on isinstance() checks for protocols by @shwina in #7501
- Retry failed image pulls 10 times by @trxcllnt in #7488
- Fix DeviceReduce env test for rfa by @bernhardmgruber in #7481
- Pass PARALLEL_LEVEL to cmake --build in ci/build_stdpar.sh by @trxcllnt in #7483
- cuda.compute: Use native CCCL.c support for stateful ops by @shwina in #7500
- [STF] Add exec_place_guard RAII helper for scoped exec_place activation by @caugonnet in #7434
- Implement parallel cuda::std::replace_copy by @miscco in #7410
- Provide Operator Properties by @fbusato in #7240
- Implement parallel cuda::std::generate by @miscco in #7416
- [STF] Fix equality operators for places by @caugonnet in #7494
- Implement parallel cuda::std::count by @miscco in #7382
- Clarify Docker install step for WSL in .devcontainer README by @acosmicflamingo in #7516
- Restrict to numba-cuda less than 0.27 by @shwina in #7529
- Fix caching of functions referencing numpy ufuncs (or any "dotted" functions) by @shwina in #7535
- Add docs for cccl-runtime 3.2 additions and cccl-runtime landing page by @pciolkosz in #7489
- Tweak random docs by @RAMitchell in #7534
- Use modern __syncthreads_or primitive by @gonidelis in #7509
- Fix identity_element tests by @miscco in #7526
- [STF] Use relaxed capture mode by @caugonnet in #7566
- Remove recursion from __internal_is_address_from by @dkolsen-pgi in #7561
- Fix operator identity and absorbing element for char by @bernhardmgruber in #7568
- Rewrite iterators to not depend on Numba by @shwina in #7441
- [Backport branch/3.3.x] Fix ranges_overlap for nvc++ -cuda by @github-actions[bot] in #7599
- [Backport branch/3.3.x] Fix cuda::device::current_arch_id by @github-actions[bot] in #7602
- [Backport branch/3.3.x] Fix cuda::barrier missing accounting of results in try_wait by @github-actions[bot] in #7635
- [Backport branch/3.3.x] Check for _GLIBCXX_USE_CXX11_ABI only when compiling with libstdc++ by @github-actions[bot] in #7631
New Contributors
- @miyanyan made their first contribution in #6773
- @Pansysk75 made their first contribution in #6572
- @rbourgeois33 made their first contribution in #6674
- @ktaletsk made their first contribution in #6910
- @asmelko made their first contribution in #6971
- @salmanmkc made their first contribution in #6991
- @KyleFromNVIDIA made their first contribution in #7016
- @Flawxd made their first contribution in #7094
- @msarahan made their first contribution in #7306
- @acosmicflamingo made their first contribution in #7516
Full Changelog: v3.3.0.dev...v3.3.0