NVIDIA/cutlass v2.2.0 on GitHub

NVIDIA Ampere Architecture features:
- Fast Tensor Core operations:
  - Maximum performance via mma.sync
  - Tensor Float 32, BFloat16, and double-precision data types
  - Mixed integer data types (int8, int4, bin1)
- Asynchronous copy for deep software pipelines via cp.async
- Described in GTC 2020 Webinar (SR 21745) (free registration required)
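
The Tensor Float 32 and BFloat16 types listed above share float's sign bit and 8-bit exponent but keep fewer mantissa bits (10 for TF32, 7 for BFloat16). The following host-side sketch illustrates the resulting loss of precision by zeroing the discarded mantissa bits of a float. The helper names are hypothetical and not part of CUTLASS, and plain truncation is used as a simplifying assumption; real hardware and library conversions typically round rather than truncate.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as a 32-bit unsigned integer (hypothetical helper).
inline std::uint32_t float_bits(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    return u;
}

// Reinterpret a 32-bit unsigned integer as a float (hypothetical helper).
inline float bits_float(std::uint32_t u) {
    float x;
    std::memcpy(&x, &u, sizeof(x));
    return x;
}

// Keep the sign, the 8 exponent bits, and the top `kept` of float's 23
// mantissa bits; zero the remaining low bits. This truncates toward zero,
// which is an assumption made for illustration only.
inline float truncate_mantissa(float x, int kept) {
    const std::uint32_t mask = ~((std::uint32_t{1} << (23 - kept)) - 1u);
    return bits_float(float_bits(x) & mask);
}

// TF32 keeps 10 mantissa bits; BFloat16 keeps 7.
inline float to_tf32_truncated(float x) { return truncate_mantissa(x, 10); }
inline float to_bf16_truncated(float x) { return truncate_mantissa(x, 7); }
```

For example, 1 + 2^-10 survives to_tf32_truncated unchanged, while 1 + 2^-11 collapses to 1; for to_bf16_truncated the corresponding threshold is 2^-7.
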
Features:
- SDK examples showing GEMM fused with bias+relu and fused GEMM+GEMM
- Complex-valued GEMMs targeting NVIDIA Ampere Tensor Cores in double-precision and Tensor Float 32
- Gaussian complex GEMMs using the 3m complex multiply algorithm (three real multiplications per complex product instead of four)
- Universal GEMM kernel supporting two batch modes and two algorithms for parallel reductions
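
The 3m complex multiply mentioned above can be illustrated with the textbook Gauss identity, which forms a complex matrix product from three real matrix products plus some extra additions. The sketch below is a naive host-side reference and not CUTLASS's kernel; the type and function names (Matrix, ComplexMatrix, real_gemm, add_scaled, complex_gemm_3m) are hypothetical, and square n x n row-major matrices are assumed for brevity.

```cpp
#include <cstddef>
#include <vector>

// Row-major n x n real matrix (hypothetical helper type).
using Matrix = std::vector<double>;

// Complex matrix stored as separate real and imaginary planes.
struct ComplexMatrix {
    Matrix re;
    Matrix im;
};

// Naive real GEMM: returns A * B for n x n row-major matrices.
Matrix real_gemm(const Matrix& A, const Matrix& B, std::size_t n) {
    Matrix C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}

// Element-wise X + s * Y.
Matrix add_scaled(const Matrix& X, const Matrix& Y, double s) {
    Matrix R(X.size());
    for (std::size_t i = 0; i < X.size(); ++i) R[i] = X[i] + s * Y[i];
    return R;
}

// Complex GEMM from three real GEMMs:
//   P1 = (Ar + Ai) * Br
//   P2 = Ar * (Bi - Br)
//   P3 = Ai * (Br + Bi)
//   Cr = P1 - P3 = Ar*Br - Ai*Bi
//   Ci = P1 + P2 = Ar*Bi + Ai*Br
ComplexMatrix complex_gemm_3m(const ComplexMatrix& A, const ComplexMatrix& B,
                              std::size_t n) {
    Matrix P1 = real_gemm(add_scaled(A.re, A.im, 1.0), B.re, n);
    Matrix P2 = real_gemm(A.re, add_scaled(B.im, B.re, -1.0), n);
    Matrix P3 = real_gemm(A.im, add_scaled(B.re, B.im, 1.0), n);
    return {add_scaled(P1, P3, -1.0), add_scaled(P1, P2, 1.0)};
}
```

With 1 x 1 inputs A = 1 + 2i and B = 3 + 4i, the three products are 9, 1, and 14, giving -5 + 10i, which matches the direct four-multiply result. The trade-off is one fewer real GEMM in exchange for additional matrix additions.
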
Policy updates:
- CUDA 11 Toolkit is required to enable NVIDIA Ampere Architecture features
- F16C is disabled by default for compatibility; enable it on the CMake command line with -DCUTLASS_ENABLE_F16C=ON
