- NVIDIA Ampere Architecture features
  - Fast Tensor Core operations:
    - Maximum performance via `mma.sync`
    - Tensor Float 32, BFloat16, and double-precision data types
    - Mixed integer data types (int8, int4, bin1)
  - Asynchronous copy for deep software pipelines via `cp.async`
  - Described in GTC 2020 Webinar (SR 21745) (free registration required)
- Features:
  - SDK examples showing GEMM fused with bias+relu and fused GEMM+GEMM
  - Complex-valued GEMMs targeting NVIDIA Ampere Tensor Cores in double-precision and Tensor Float 32
  - Gaussian complex GEMMs using the 3m complex multiply algorithm
  - Universal GEMM kernel supporting two batch modes and two algorithms for parallel reductions
- Policy updates:
  - CUDA 11 Toolkit is required to enable NVIDIA Ampere Architecture features
  - F16C is disabled by default for compatibility; enable it on the CMake command line with `-DCUTLASS_ENABLE_F16C=ON`
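
The "3m complex multiply algorithm" mentioned under Features refers to Gauss's trick of forming a complex product from three real multiplications instead of four. The following is a minimal host-side C++ sketch of that idea applied to matrices. It is not the CUTLASS API or its Tensor Core implementation; the names `Matrix`, `ComplexMatrix`, `real_gemm`, `add_scaled`, and `gemm_3m` are hypothetical and chosen only for illustration, and the split real/imaginary (planar) storage is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major real matrix stored as a flat vector (hypothetical helper type).
using Matrix = std::vector<double>;

// Complex matrix in planar form: separate real and imaginary parts (assumed layout).
struct ComplexMatrix {
  Matrix re;
  Matrix im;
};

// Reference real GEMM: returns C (m x n) = A (m x k) * B (k x n).
Matrix real_gemm(const Matrix& A, const Matrix& B,
                 std::size_t m, std::size_t n, std::size_t k) {
  assert(A.size() == m * k && B.size() == k * n);
  Matrix C(m * n, 0.0);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t j = 0; j < n; ++j)
        C[i * n + j] += A[i * k + p] * B[p * n + j];
  return C;
}

// Element-wise a + scale * b for equally sized matrices.
Matrix add_scaled(const Matrix& a, const Matrix& b, double scale) {
  assert(a.size() == b.size());
  Matrix out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + scale * b[i];
  return out;
}

// 3m complex GEMM: three real GEMMs instead of the usual four.
//   P1 = (Ar + Ai) * Br
//   P2 = Ar * (Bi - Br)
//   P3 = Ai * (Br + Bi)
//   Re(C) = P1 - P3 = Ar*Br - Ai*Bi
//   Im(C) = P1 + P2 = Ar*Bi + Ai*Br
ComplexMatrix gemm_3m(const ComplexMatrix& A, const ComplexMatrix& B,
                      std::size_t m, std::size_t n, std::size_t k) {
  Matrix P1 = real_gemm(add_scaled(A.re, A.im, +1.0), B.re, m, n, k);
  Matrix P2 = real_gemm(A.re, add_scaled(B.im, B.re, -1.0), m, n, k);
  Matrix P3 = real_gemm(A.im, add_scaled(B.re, B.im, +1.0), m, n, k);
  return {add_scaled(P1, P3, -1.0), add_scaled(P1, P2, +1.0)};
}
```

The sketch trades one real GEMM for a few extra element-wise additions. Because the results are formed by adding and subtracting intermediate products, rounding behavior can differ from the conventional four-multiply formulation.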