github NVIDIA/cccl v2.3.0
CCCL 2.3.0

What’s New

In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

System Headers and Warnings

Users don't want to see warnings from CCCL headers. The typical way to suppress warnings from a header library is to include it with -isystem. However, this causes problems when using CCCL from GitHub, because those headers then conflict with the CCCL headers in the CTK. Therefore, you should always include CCCL headers via -I.

To achieve the same effect as -isystem, CCCL headers will now use the system_header pragma. For more information, see #527.

TL;DR: You should never see warnings emitted from a CCCL header ever again!

Linkage Issues

Using CUB and Thrust in shared libraries is a known source of issues. Previously, the solution was to define the THRUST_CUB_WRAPPED_NAMESPACE macro so that different shared libraries get different symbol names. However, this solution has poor discoverability, since the issues present themselves as segmentation faults, hangs, wrong results, etc. As of the 2.3 release, linkage issues are addressed by default without the need for THRUST_CUB_WRAPPED_NAMESPACE. Although the fix is API compatible, it might cause ABI compatibility issues. For more details, see issue #443.
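The idea behind namespace wrapping can be sketched in plain C++ without CUDA or CCCL. The sketch below is only an illustration under stated assumptions: DEMO_DEFINE_ALGORITHMS, lib_a, lib_b, and demo::sum are hypothetical names, and the macro stamps the same declarations into two namespaces in one file to mimic two shared libraries that each set a wrapping macro to a unique name. It does not reproduce how THRUST_CUB_WRAPPED_NAMESPACE is implemented or how the 2.3 fix works.

```cpp
// Standalone sketch of namespace wrapping; it needs neither CUDA nor CCCL.
// DEMO_DEFINE_ALGORITHMS, lib_a, lib_b, and demo::sum are hypothetical names.
#include <cassert>
#include <type_traits>

// Stamps out the same "header" contents inside the namespace NS, the way a
// wrapped-namespace macro would when each library sets it to a unique name.
#define DEMO_DEFINE_ALGORITHMS(NS)               \
  namespace NS {                                 \
  namespace demo {                               \
  struct tag {};                                 \
  inline int sum(int a, int b) { return a + b; } \
  }                                              \
  }

DEMO_DEFINE_ALGORITHMS(lib_a) // as if built with the macro set to lib_a
DEMO_DEFINE_ALGORITHMS(lib_b) // as if built with the macro set to lib_b

// The two copies are unrelated entities with different qualified names, so
// one library's copy is never mistaken for the other's.
inline bool wrapped_copies_are_distinct() {
  return !std::is_same<lib_a::demo::tag, lib_b::demo::tag>::value;
}
```

Calling wrapped_copies_are_distinct() returns true, and lib_a::demo::sum and lib_b::demo::sum can coexist in one program because their qualified names differ.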

Thrust

thrust::tuple, thrust::pair, and thrust::complex have been replaced with cuda::std alternatives. This can be a breaking change, but should be source compatible.

CUB

cub::DeviceSelect::UniqueByKey, cub::DeviceScan::ExclusiveSumByKey, and cub::DeviceReduce::ReduceByKey are up to 60% faster on A100. cub::DeviceSegmentedReduce now supports 64-bit indexing.
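To illustrate why 64-bit indexing matters for a segmented reduction, here is a minimal CPU reference, assuming segments are described by a list of offsets into the input. It is not the CUB API: segmented_sum_reference and fits_in_int32 are hypothetical helpers written for this sketch, and the code runs on the host only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Returns true when an offset can be represented by a 32-bit signed index.
inline bool fits_in_int32(std::int64_t offset) {
  return offset >= std::numeric_limits<std::int32_t>::min() &&
         offset <= std::numeric_limits<std::int32_t>::max();
}

// CPU reference for a segmented sum: segment i covers the half-open range
// [offsets[i], offsets[i + 1]) of `values`. Offsets are 64-bit integers.
inline std::vector<std::int64_t> segmented_sum_reference(
    const std::vector<int>& values, const std::vector<std::int64_t>& offsets) {
  std::vector<std::int64_t> sums;
  for (std::size_t seg = 0; seg + 1 < offsets.size(); ++seg) {
    std::int64_t total = 0;
    for (std::int64_t i = offsets[seg]; i < offsets[seg + 1]; ++i) {
      total += values[static_cast<std::size_t>(i)];
    }
    sums.push_back(total);
  }
  return sums;
}
```

With values {1, 2, 3, 4, 5, 6} and offsets {0, 2, 2, 6}, the three segments sum to 3, 0 (an empty segment), and 18. An offset such as 3,000,000,000 fails fits_in_int32, which is the kind of input that requires 64-bit offsets.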

libcudacxx

  • The cuda::ptx namespace and the <cuda/ptx> header are now available. They provide access to inline PTX functions that expose various async memcpy and barrier intrinsics.
  • #379 - Added experimental bulk TMA memcpy under <cuda/barrier>

What's Changed

  • Port cub::DeviceSegmentedReduce tests to catch2 by @elstehle in #303
  • Branch/2.2.x by @gevtushenko in #305
  • Tune unique by key on A100 by @gevtushenko in #306
  • Merge branch/2.2.x to main by @jrhemstad in #308
  • Add example cmake project by @jrhemstad in #177
  • Adds catch2 tests for reduce-by-key by @elstehle in #311
  • Tune scan by key on A100 by @gevtushenko in #325
  • Replace diag_suppress by nv_diag_suppress in documentation by @ahendriksen in #281
  • Fix MSVC / CUB tests build by @gevtushenko in #336
  • gdb pretty printer: handle non-cuda device vectors by @siboehm in #264
  • Add a nvrtc configuration for libcu++ by @miscco in #202
  • GH Infra: project automation and issue template fixes by @jarmak-nv in #297
  • Tune reduce by key on A100 by @gevtushenko in #346
  • Merge commits from 2.2 branch by @miscco in #350
  • Fix a shadow warning in thrust's execute_with_dependencies.h by @hageboeck in #334
  • Assorted fixes for MSVC 2017 by @miscco in #341
  • [skip-tests] Guard inline variables with _LIBCUDACXX_INLINE_VAR macro by @miscco in #355
  • Port cub::DeviceScan tests to catch2 by @elstehle in #347
  • Remove _NOEXCEPT macro in favor of noexcept in libcu++ by @Blonck in #349
  • Project Automation: add conditional steps due to context errors by @jarmak-nv in #353
  • Work around strange gcc bug by @miscco in #363
  • Implement iter_swap CPO by @miscco in #332
  • Replace default, constexpr, and delete macros by original keywords by @Blonck in #360
  • Add clang16 devcontainer and CI job by @miscco in #362
  • [skip-tests] Skip merge conflict from old iter_swap PR by @miscco in #369
  • [skip-tests] Also skip all CI runs that require a GPU when [skip-tests] is set by @miscco in #370
  • Remove _LIBCUDACXX_CXX03_LANG macro and all encapsulated code by @Blonck in #368
  • Remove checks against _LIBCUDACXX_STD_VER < 11 by @Blonck in #375
  • Use copy-pr-bot by @ajschmidt8 in #381
  • Implement the permutable concept by @miscco in #367
  • [NFC] We missed some _NOEXCEPT_ macro uses by @miscco in #371
  • Implement identity changes for c++20 by @miscco in #383
  • Hide third party cmake options in our cmake developer builds. by @allisonvacanti in #300
  • Port cub::DeviceScanByKey tests to Catch2 by @elstehle in #380
  • Fixes a race in DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #399
  • Add commit information to the test output by @miscco in #401
  • Project Automation: Handle PRs opened as non-draft + multiple bug fixes by @jarmak-nv in #387
  • Project Automation: set Roadmap project value on issue/pr close and Auto-type new issues by @jarmak-nv in #389
  • Add support for tests that should fail at runtime by @ahendriksen in #418
  • Port DeviceAdjacentDifference::SubtractRight tests to catch2 by @miscco in #390
  • Project automation - Fix indentation for continue-on-error by @jarmak-nv in #425
  • [BUG] Ensure that all headers build on their own by @miscco in #200
  • Remove util_device.cuh from iterator headers to enable online compilation by @leofang in #412
  • Fix ci-overview example by @gevtushenko in #428
  • Port cub::DeviceRunLengthEncode tests to catch2 by @miscco in #411
  • Add cuda::device::barrier_arrive tx by @ahendriksen in #358
  • Fix CubDebug by @gevtushenko in #430
  • Do not use static member functions to initialize static member variables. by @miscco in #438
  • Implement the projected helper struct by @miscco in #385
  • Add PTX wrapping functions for TMA features by @ahendriksen in #379
  • Clarify docstring for num_items parameter of DeviceSegmentedRadixSort by @HapeMask in #320
  • Enable lit to determine the compute architectures by @miscco in #447
  • Add NVRTC_SKIP_KERNEL_RUN tag to compile, but skip running NVRTC test by @ahendriksen in #434
  • Improve documentation of cuda::barrier by @ahendriksen in #440
  • Extend thrust::complex unit tests to prepare for upcoming replacement with std::complex by @Blonck in #413
  • Remove having two install rules for -header-search.cmake by @robertmaynard in #298
  • Run .devcontainer/launch.sh with bash + add error checking by @wence- in #407
  • Remove C++03 compatibility from unit tests by @Blonck in #378
  • [libcu++] Fix use of __ppc64__ by @miscco in #451
  • Update the README by @jrhemstad in #291
  • [libcu++] Try to avoid gcc miscompilation issues by @miscco in #452
  • Consolidate matrix logic into single script/job by @jrhemstad in #361
  • Implement the indirectly_comparable concept by @miscco in #445
  • Fix compute matrix dropping trailing zeros by @jrhemstad in #466
  • Avoid integer promotion warnings with MSVC by @miscco in #460
  • Implement ranges comparison objects by @miscco in #464
  • Fix CUB/MSVC/RDC tests by @gevtushenko in #469
  • Fix Thrust/CUB Linkage Issues by @gevtushenko in #443
  • Script for Running CUB Benchmarks by @gevtushenko in #472
  • [skip ci] Add list of CCCL users to README by @jrhemstad in #474
  • constexpr all the things by @pb-dseifert in #476
  • Add Gonzalo/Allard to trustees by @jrhemstad in #482
  • Implement the sortable concept by @miscco in #471
  • [libcu++] Add _LIBCUDACXX_CUDACC_BELOW_12_3 macro by @gonzalobg in #479
  • Refactor thrust::complex as a struct derived from cuda::std::complex by @Blonck in #454
  • Add ci scripts for windows by @miscco in #251
  • Enable complex interop on MSVC by @miscco in #490
  • [skip ci] Add related projects to readme. by @jrhemstad in #492
  • Reenable nvrtc tests by @miscco in #488
  • Implement the mergeable concept by @miscco in #484
  • 64-bit indexing for DeviceSegmentedReduce by @jecs in #414
  • Implement move_sentinel by @miscco in #496
  • Support skipped benches in run script by @gevtushenko in #508
  • Implement unreachable_sentinel by @miscco in #506
  • Disable flaky barrier tests by @miscco in #510
  • Add constant initialization of managed variable to silence gcc warning by @miscco in #509
  • Add verbose flag to ninja build. by @jrhemstad in #491
  • Add devcontainer readme by @jrhemstad in #481
  • Add contributor guide by @jrhemstad in #500
  • [skip ci] Fix devcontainer guide link by @jrhemstad in #518
  • [skip ci] Add example godbolt link. by @jrhemstad in #519
  • Replace cuda::atomic with legacy functions for old arch compatibility. by @allisonvacanti in #516
  • Simplify examples matrix. by @jrhemstad in #517
  • Disable PR workflow triggering on pushes to main. by @jrhemstad in #532
  • Add CI job to verify devcontainers are always up to date by @jrhemstad in #514
  • [CI] Sink error when git repo is missing from build. by @wmaxey in #533
  • Rework our tuple implementation to work with older MSVC by @miscco in #530
  • Add jobs using clang as CUDA compiler by @jrhemstad in #493
  • Remove cudaDeviceSetSharedMemConfig from CUB tests by @gevtushenko in #538
  • Implement __bounded_iter by @miscco in #540
  • Fix cub::BlockAdjacentDifference documentation by @pauleonix in #542
  • Add cuda::device::memcpy_async_tx by @ahendriksen in #405
  • Introduce Thrust benchmarks by @gevtushenko in #534
  • Fix MSVC benchmarks build by @gevtushenko in #536
  • Fix nvc++ as host compiler by @gevtushenko in #560
  • Add missing overload definition of thrust::complex operator!= by @srinivasyadav18 in #564
  • Make template parameters consistent in thrust::complex operators by @srinivasyadav18 in #555
  • Migrate CI configs to CMake presets. by @allisonvacanti in #324
  • Replace thrust::detail::integral_constant with libcudacxx implementation by @ZelboK in #561
  • Add cuda::device::barrier_expect_tx by @ahendriksen in #498
  • Add ARM build configs for latest gcc/clang. by @jrhemstad in #468
  • Fea/486 Improve thrust::complex operators compile time throughput by @srinivasyadav18 in #567
  • Define compiler env vars for CMake in dev containers. by @allisonvacanti in #576
  • Revert back to working nvbench commit by @miscco in #582
  • use clang-format in dev containers by @miscco in #513
  • Introduce CCCL clang-format by @gevtushenko in #551
  • Add cp.async.bulk global -> shared support to cuda::memcpy_async by @ahendriksen in #501
  • [skip ci] Also update the base image by @miscco in #584
  • Replace thrust::tuple implementation with cuda::std::tuple by @miscco in #262
  • Fix clangd integration by @gevtushenko in #588
  • Always treat CCCL as system headers by @miscco in #531
  • Refactor inline comments by @gevtushenko in #581
  • Relax Catch2 include order requirements by @gevtushenko in #601
  • Project Automation - Fix issue/pr sync workflow by @jarmak-nv in #504
  • [skip-tests] Add a preset that builds all configs of all projects. by @allisonvacanti in #580
  • Implement ranges::advance by @miscco in #546
  • Update status check job to check status of precursor jobs by @jrhemstad in #605
  • Report times for libcudacxx tests in CI by @jrhemstad in #606
  • Fix bug in the construct_at optimization by @miscco in #608
  • [skip-tests] Disable rdc tests for windows. by @miscco in #615
  • Implement ranges::next by @miscco in #611
  • Support FP8 in radix sort by @gevtushenko in #623
  • Fix examples/cccl_infra mixup in ci. by @wmaxey in #633
  • Fixes block-scope run-length decode one-past-the-end memory access into smem TempStorage by @elstehle in #626
  • Harmonize CUB includes by @gevtushenko in #632
  • Create NVRTCC, a utility for running tests under NVRTC by @wmaxey in #494
  • Fix typo and grammar errors by @VaibhavWakde52 in #639
  • [Backport branch/2.3.x] Add CCCL_VERSION and script for updating version by @github-actions in #667
  • Backport 574 ptx by @miscco in #663
  • [Backport branch/2.3.x] Fix C++11 support of recently added tests by @github-actions in #658
  • [Backport branch/2.3.x] Update CUDA newest to CTK 12.3 by @github-actions in #1072
  • [Backport to branch/2.3.x] Rework our system header approach to be more error proof (#661) by @miscco in #675
  • [Backport branch/2.3.x] Fix fallback when checking git repo by @github-actions in #1086
  • [Backport branch/2.3.x] Currently the verbose option does not work because of a typo in the argument handling by @github-actions in #1090
  • [Backport branch/2.3.x] Add cuda::ptx::st_async by @github-actions in #1093
  • [Backport branch/2.3.x] Add cuda::ptx::red_async by @github-actions in #1094
  • Backport PR #1075 by @wmaxey in #1100
  • [Backport branch/2.3.x] Add cuda::ptx::mbarrier_{try/test}_wait{_parity} by @github-actions in #1106
  • [Backport branch/2.3.x] Fix cuda::ptx::red.async for int32_t types by @github-actions in #1107
  • [Backport branch/2.3.x] Fix local test runs with lit by @github-actions in #1110
  • [Backport branch/2.3.x] Fix config when only non-CDPv1 arches are enabled. by @github-actions in #1111
  • [Backport branch/2.3.x] Fix GCC6 / FP8 warning by @github-actions in #1131
  • [Backport branch/2.3.x] Fix ptx.st.async.compile.pass.cpp failing in C++11. by @github-actions in #1136
  • BACKPORT: Fix _LIBCUDACXX_UNREACHABLE for old MSVC (#1114) by @miscco in #1143
  • [2.3.x] Backport benchmarking PRs by @wmaxey in #1168
  • Backport P0 filter commit. by @wmaxey in #1172
  • [BACKPORT] Implement math functions for thrust::complex by @miscco in #1191
  • Backport fix icc / cub (#1152) by @wmaxey in #1171
  • [BACKPORT]: Fix availability of is_constant_evaluated on old MSVC by @miscco in #1198
  • [BACKPORT] Add icc to the ci matrix by @miscco in #1209
  • [BACKPORT]: Add missing overloads for thrust::pow by @miscco in #1223

Full Changelog: v2.2.0...2.3.0
