What's Changed
- Implement remaining ranges iterator concepts and modernize array by @miscco in #627
- Fix C++11 support of recently added tests by @ahendriksen in #651
- Update CUDA newest to CTK 12.3 by @jrhemstad in #629
- Add `cuda::ptx::*` namespace by @ahendriksen in #574
- The test seems to pass just fine by @miscco in #654
- Fixes discard_memory compilation failure for pre-Volta by @elstehle in #637
- Reduce benchmarking time by @gevtushenko in #657
- Add CCCL_VERSION and script for updating version by @jrhemstad in #652
- Fixes compiler error for extended fp type data gen by @elstehle in #666
- fixup `___CUDA_VPTX` -> `_CUDA_VPTX` by @wmaxey in #664
- Attempt to WAR CUB / RDC / MSVC issue by @gevtushenko in #669
- Rework our system header approach to be more error proof by @miscco in #661
- Project automation - fix sync action and draft setting step by @jarmak-nv in #625
- Fix fallback when checking git repo by @wmaxey in #1085
- Currently the verbose option does not work because of a typo in the argument handling by @miscco in #1088
- Adds virtual shared memory helper and tests by @elstehle in #619
- Add `cuda::ptx::st_async` by @ahendriksen in #1078
- Add `cuda::ptx::red_async` by @ahendriksen in #1080
- Remove libcudacxx symlinks by @wmaxey in #1075
- Move PTX tests that missed the symlink PR by @wmaxey in #1098
- Fix truncation of constant value by @gevtushenko in #1097
- Add `cuda::ptx:mbarrier_{try/test}_wait{_parity}` by @ahendriksen in #674
- Initial CUB/NVRTC support by @gevtushenko in #1081
- Fix `cuda::ptx::red.async` for int32_t types by @ahendriksen in #1102
- Fix local test runs with lit by @miscco in #1108
- Fix config when only non-CDPv1 arches are enabled. by @alliepiper in #1109
- Do not replace the sccache binary for windows by @miscco in #1115
- Test cuda graph capture by @gevtushenko in #1112
- Fix overflow bug for >2^32 elements in thrust::shuffle by @djns99 in #1074
- Introduce CUB transform reduce by @gevtushenko in #1091
- Add infrastructure for compile-time CUB tests by @gevtushenko in #1124
- Fix GCC6 / FP8 warning by @gevtushenko in #1130
- Fix thrust transform reduce bench by @gevtushenko in #1133
- Fix `ptx.st.async.compile.pass.cpp` failing in C++11. by @wmaxey in #1132
- Fix `_LIBCUDACXX_UNREACHABLE` for old MSVC by @miscco in #1114
- Allow filtering P0 benchmarks by @gevtushenko in #1135
- Update barrier_arrive_tx.md docs by @gonzalobg in #1147
- Update std iterators by @miscco in #672
- Fix argument name in windows CI by @miscco in #1145
- Fix XFAIL condition for subsumption tests by @miscco in #1144
- Project Automation - remove draft automation + reduce permissions by @jarmak-nv in #1154
- Use rst in block-scope docs by @gevtushenko in #1150
- Fix errors when find_package(CCCL) is called twice. by @alliepiper in #1157
- Fix icc / cub by @gevtushenko in #1152
- Abort testing on unsupported dialect flags by @wmaxey in #1158
- Run with latest nvbench by @robertmaynard in #583
- Set finer-grain workflow permissions by @jrhemstad in #1163
- Port device docs to rst by @gevtushenko in #1160
- CI log improvements by @jrhemstad in #621
- Setup documentation and corresponding github action by @wmaxey in #1118
- Update Docs links in README.md by @wmaxey in #1169
- Fix GCC 13 by @gevtushenko in #1175
- Add missing exit from `run-as-coder` by @jrhemstad in #1176
- Adds new virtual shared memory facility to DeviceMergeSort by @elstehle in #1117
- Add first batch of Catch2 tests for DeviceRadixSort by @alliepiper in #1164
- Implement math functions for `thrust::complex` by @miscco in #1178
- Use anchors in matrix.yaml by @jrhemstad in #1193
- Ensure the targets that Thrust creates are global. by @robertmaynard in #1182
- Fix availability of `is_constant_evaluated` on old MSVC by @miscco in #1180
- Enable std::variant for libcu++ by @miscco in #1076
- Implement `enable_borrowed_range` by @miscco in #1196
- Reduce thrust benchmarks noise by @gevtushenko in #1203
- Prepare more algorithms by @miscco in #1161
- Add icc compiler to CI matrix by @jrhemstad in #1159
- Unify handling of dialects by @miscco in #1200
- Add argument to build/test scripts for additional cmake options by @jrhemstad in #620
- Move definitions of execution space macros into `cccl` by @miscco in #1199
- Adds new virtual shared memory facility to `DeviceSelect::UniqueByKey` by @elstehle in #1197
- Add Catch2 tests for cub::DeviceSegmentedRadixSort by @alliepiper in #1214
- Fix the example on README.md by @so298 in #1220
- Add missing overloads for thrust::pow by @miscco in #1222
- Fix 'nvc++ -stdpar' by @dkolsen-pgi in #1224
- Fix examples in reduce docs by @gevtushenko in #1230
- Do not benchmark small problem sizes by @gevtushenko in #1243
- Implement `enable_view` by @miscco in #1208
- Refactors `thrust::unique_by_key` to use `cub::DeviceSelect::UniqueByKey` by @elstehle in #1245
- Fix merge conflict from incoming PR by @miscco in #1250
- Disable `fast-math` for ICC by @miscco in #1252
- Fix a typo in thrust-config.cmake by @valgur in #1259
- Implement `ranges::{c}begin` and `ranges::{c}end` by @miscco in #1256
- Switch to entropy-based stopping criterion by @gevtushenko in #1280
- Fix a sync bug in `stream_ref::wait` by @PointKernel in #1238
- Silence some static asserts in ptx helpers by @miscco in #1257
- Restore docs images by @jrhemstad in #1285
- Clarify Thrust/CUB ABI guarantees by @jrhemstad in #1269
- Fix MSVC issues by @miscco in #1261
- Ensure that `cuda::std::pair` is potentially trivially copyable by @miscco in #1249
- Update packman to fix CUB docs by @gevtushenko in #1291
- Implement `ranges::{c}rbegin` by @miscco in #1295
- Make `cuda::stream_ref` universally available by @miscco in #1293
- Properly test internal headers by @miscco in #1258
- Remove remaining C++03 compatibility from unit tests by @Blonck in #1228
- Add some documentation for `memory_resource` by @miscco in #1217
- Filter axis values in perf analysis by @gevtushenko in #1304
- Get CCCL revision outside of git repo by @gevtushenko in #1305
- [DOC]: Move ptx.md out of extended API by @ahendriksen in #1308
- Implement `ranges::{c}rend` by @miscco in #1301
- thrust/mr: fix the case of reusing a block for a smaller alloc. by @griwes in #1232
- Allow offloading samples by @gevtushenko in #1316
- [DOC]: Fix documentation links by @ahendriksen in #1311
- Separate windows and Linux CI matrix by @jrhemstad in #1206
- Revert "Separate windows and Linux CI matrix " by @jrhemstad in #1324
- Introduce CUB ForEach algorithms by @gevtushenko in #1302
- Cleanup transitive includes of `<cuda/std/functional>` by @miscco in #1253
- Implement `ranges::{c}data` by @miscco in #1313
- Remove stale comments from README by @jrhemstad in #1328
- Ports `cub::DeviceMergeSort` tests to Catch2 by @elstehle in #1319
- Implement `ranges::size` and `ranges::ssize` by @miscco in #1330
- PTX: Add helper functions for dsmem by @ahendriksen in #1336
- Remove double "ignore" in discard_iterator.h docs by @gonidelis in #1342
- PTX: Add `cuda::ptx::fence` by @ahendriksen in #1341
- Replace deprecated `_VSTD` macro with `std` by @rupprecht in #1331
- PTX: Add `cuda::ptx::mapa` and `cuda::ptx::getctarank` by @ahendriksen in #1345
- Cleanup our `__cccl_config` by @miscco in #1322
- Update to devcontainers 24.04 by @jrhemstad in #1357
- ♻️📝 Update `mode` example to use `thrust::unique_count` by @codereport in #1354
- Switch to NV runners for Windows. by @wmaxey in #1356
- Implement `ranges::empty` by @miscco in #1338
- PTX: Add `cuda::ptx::get_sreg` by @ahendriksen in #1351
- Fix godbolt link. by @jrhemstad in #1369
- Implement ranges concepts by @miscco in #1364
- Print helpful error message in test scripts when no GPU is found by @jrhemstad in #1362
- Implement `ranges::dangling` by @miscco in #1371
- Ensure that thrust fancy iterators are `trivially_copy_constructible` when possible by @miscco in #1368
- Improve compiler detection defines by @Yaraslaut in #1320
- Use relative includes for our public headers by @miscco in #1325
- Implement `ranges::view_interface` by @miscco in #1377
- Use checked allocators in CUB catch2 tests by @alliepiper in #1271
- small update to docs for CTK by @ZelboK in #1378
- Fix order of system_header suppression and includes by @miscco in #1323
- Hide API accepting kernel pointers by @gevtushenko in #1395
- Refactors `ChooseOffsetT` to use `::cuda::std` and introduces alias template `choose_offset_t` by @elstehle in #1405
- Cleanup our delegated constructor workaround by @miscco in #1404
- Implement `ranges::subrange` by @miscco in #1387
- Test large arrays in device radix sort by @alliepiper in #1349
- CMake support absolute CMAKE_INSTALL_LIBDIR values by @robertmaynard in #1393
- Fixes integer overflows in index computation when indexes approach `numeric_limits<OffsetT>::max()` by @elstehle in #1419
- Fix ptx usage to account for PTX ISA availability by @miscco in #1359
- Refactors `thrust::copy_if` to use `cub::DeviceSelect` by @elstehle in #1379
- Fix include of <thrust/random.h> with NVC++ by @dkolsen-pgi in #1417
- Do not use VLAs in `cp_async_bulk_tensor_*` tests by @miscco in #1423
- Add support for sm_90a in <nv/target> API by @ahendriksen in #1411
- Add additional build job for sm90 by @jrhemstad in #1428
- Rework `<span>` to be latest revision by @miscco in #1415
- PTX: Add `cuda::ptx:cp_async_bulk_*` by @ahendriksen in #1403
- Prepare namespace `ranges::views` by @miscco in #1434
- PTX: Add `cuda::ptx:barrier_cluster_{arrive,wait}` by @ahendriksen in #1366
- Refactor `thrust::[stable_]partition[_copy]` to use `cub::DevicePartition` by @elstehle in #1435
- Fix `common_reference` of `pair` by @miscco in #1438
- Properly check whether a string is alphanumeric by @miscco in #1443
- Remove `cuda::ptx::mapa` by @ahendriksen in #1442
- Add `cuda::ptx:tensormap_{replace,cp_fenceproxy}` by @ahendriksen in #1441
- Enable more algorithms for internal use by @miscco in #1432
- Cleanup diagnostic handling by @miscco in #1420
- Create patch 2.4.0 by @wmaxey in #1455
- Address various issues from internal CI by @miscco in #1462
- Extend gcc miscompilation workaround for replace.cu by @miscco in #1461
- Fix CUB docs image fetcher by @gevtushenko in #1466
- Add `cuda::ptx::cp_reduce_async_bulk` by @ahendriksen in #1445
- Restore disabling benchmarks from ci scripts (removed in #493) by @wmaxey in #1458
- Add test coverage for SM90 without PTX ISA 8.0 by @miscco in #1468
- Ensure that we can use `std::ignore` on device by @miscco in #1470
- Move `.multicast` tests out into their own file by @miscco in #1478
- Ensure that we can test libcu++ against architectures < 70 by @miscco in #1475
- Reduce number of instantiations in `set_symmetric_difference` tests by @miscco in #1476
- Fix test issues against gcc-6 by @miscco in #1477
- Improve code block CSS in libcu++ docs by @Nyrio in #1483
- Address issues with MSVC2017 by @miscco in #1479
- Remove libcxx tests by @miscco in #1480
- Separate CUB's catch2 test binaries by default for CI. by @alliepiper in #1482
- Add Dev Containers guide for WSL by @gonidelis in #1394
- PTX: add `cuda::mbarrier_init` by @ahendriksen in #1491
- Remove legacy Thrust/CUB CI files. by @bdice in #1504
- Fix issues with ambiguous calls to `addressof` in `thrust::optional` by @miscco in #1499
- Ensure that we play nicely with std::iterators by @miscco in #1511
- Try harder to unwrap nested `thrust::tuple_of_iterator_references` by @miscco in #1469
- Match_any testing single bit by fusing into single LOP3 instruction by @IlyaGrebnov in #1372
- Revert "Refactor `thrust::complex` as a struct derived from `cuda::std::complex` (#454)" by @miscco in #1497
- Removes arch filtering of sm 90 for rdc builds by @elstehle in #1506
- Adds test for `cub::PtxVersion` by @elstehle in #1521
- Fix tuple backwards compatibility by @miscco in #1522
- [FEA] Split ptx.h by @ahendriksen in #1520
- Make libcudacxx's codegen part of CI and add it to the project. by @wmaxey in #1526
- Ensure that we can run `reduce_by_key` with const inputs by @miscco in #1528
- Disallow float offset type in cub::segmented_reduce by @gonidelis in #1430
- cuda::std::complex specializations for half and bfloat by @griwes in #1140
- Rebase 2.4.x with main. by @wmaxey in #1472
- [BACKPORT]: Provide backfills for missing `__half` functionality by @miscco in #1544
- [BACKPORT] Fix usage of naked array with 0 elements in sm90 barrier tests. (#1546) by @wmaxey in #1549
- [BACKPORT] Fix unused variable warning for __can_use_complete_tx (#1547) by @wmaxey in #1550
New Contributors
- @djns99 made their first contribution in #1074
- @so298 made their first contribution in #1220
- @valgur made their first contribution in #1259
- @PointKernel made their first contribution in #1238
- @rupprecht made their first contribution in #1331
- @codereport made their first contribution in #1354
- @Yaraslaut made their first contribution in #1320
- @Nyrio made their first contribution in #1483
- @IlyaGrebnov made their first contribution in #1372
Full Changelog: v2.3.2...v2.4.0-rc0