3.0.0 (2023-01-23)
- CuTe, a new core library and backend for CUTLASS 3.0 that defines a single Layout vocabulary type and an associated algebra of layouts for a much more expressive and composable abstraction for tensors, sets of parallel agents, and operations by said agents on tensors.
- A new conceptual operation hierarchy that replaces the architecture-centric hierarchy of CUTLASS 2.x and documentation for CUTLASS 3.0's GEMM API changes.
- Strict API backwards compatibility that exposes both 2.x and 3.x API kernels through the same device::GemmUniversalAdapter and kernel::GemmUniversal types, allowing users to include both APIs in the same translation units. More information can be found in the 3.x backwards compatibility section.
- Updates to Functionality which directs users on which kernels are supported via CUTLASS-2 and CUTLASS-3.
- Updates to the Compatibility section regarding supported compilers, operating systems, CUDA Toolkits, hardware architectures, and target architecture.
- New warp-specialized GEMM kernel schedules and mainloops targeting the Hopper architecture that achieve high performance by using TMA, WGMMA, and threadblock clusters.
- Extensions to CUTLASS profiler to support threadblock cluster shapes in library and profiler tile configurations.
- CUTLASS library integration for 3.x API kernels built through the new CollectiveBuilder API, enabling CUTLASS profiler.
- Support for Hopper GEMMs through the new 3.0 API with CuTe-based exposure of the Hopper Tensor Memory Accelerator and WGMMA Tensor Core features.
- Set of examples that demonstrate the usage of the new 3.0 API to easily build GEMM kernels targeting Hopper: examples 48, 49, and 50.
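
For readers new to the layout idea behind the CuTe entry above, the following is a minimal sketch in plain C++ of a layout as a function from a logical coordinate to a linear offset, defined by a shape and a stride. It is an illustration only and does not use the CuTe API: the names Layout2D, make_col_major, make_row_major, and transpose are hypothetical, and the sketch is limited to two dimensions.

```cpp
#include <cstddef>

// Hypothetical stand-in for the layout concept; this is NOT cute::Layout.
// A layout pairs a shape (extents) with a stride (offset step per coordinate).
struct Layout2D {
    std::size_t shape[2];   // (rows, cols)
    std::size_t stride[2];  // offset step for one unit of (row, col)

    // Map a logical coordinate (i, j) to a linear offset.
    constexpr std::size_t operator()(std::size_t i, std::size_t j) const {
        return i * stride[0] + j * stride[1];
    }

    // Number of logical elements described by the layout.
    constexpr std::size_t size() const { return shape[0] * shape[1]; }
};

// Column-major: stepping one row moves one element in memory.
constexpr Layout2D make_col_major(std::size_t rows, std::size_t cols) {
    return Layout2D{{rows, cols}, {1, rows}};
}

// Row-major: stepping one column moves one element in memory.
constexpr Layout2D make_row_major(std::size_t rows, std::size_t cols) {
    return Layout2D{{rows, cols}, {cols, 1}};
}

// Transposing swaps shape and stride entries; the underlying data is untouched.
constexpr Layout2D transpose(const Layout2D& l) {
    return Layout2D{{l.shape[1], l.shape[0]}, {l.stride[1], l.stride[0]}};
}

// Compile-time checks of the mappings for a 4x8 tensor.
static_assert(make_col_major(4, 8)(1, 2) == 9, "1*1 + 2*4");
static_assert(make_row_major(4, 8)(1, 2) == 10, "1*8 + 2*1");
static_assert(transpose(make_col_major(4, 8))(2, 1) == 9, "same element, swapped coordinate");
```

The transpose case hints at why a single layout type composes well: a transformed view is just another layout over the same storage, so no data has to move.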