NVIDIA/cutlass v2.4.0 on GitHub

CUTLASS 2.4

Implicit GEMM convolution kernels supporting CUDA and Tensor Cores on NVIDIA GPUs:
- Operators: forward (Fprop), backward data gradient (Dgrad), and backward weight gradient (Wgrad) convolution
- Data types: FP32, complex<FP32>, Tensor Float 32 (TF32), BFloat16 (BF16), Float16, Int4, Int8, Int32
- Spatial dimensions: 1-D, 2-D, and 3-D
- Layouts: NHWC, NCxHWx
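
To make the three operators concrete, the sketch below shows how a 2-D convolution problem is commonly viewed as a GEMM of shape (M, N, K) for Fprop, Dgrad, and Wgrad. It is a standalone illustration, not the CUTLASS API: `Conv2dProblem`, `GemmShape`, `ConvOperator`, and `implicit_gemm_shape` are hypothetical names, and the mapping assumes an NHWC activation tensor with K filters of size R x S.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified 2-D convolution description (not a CUTLASS type).
struct Conv2dProblem {
  int N, H, W, C;              // activations: batch, height, width, channels
  int K, R, S;                 // filters: count, height, width
  int pad_h, pad_w;            // zero padding on each side
  int stride_h, stride_w;      // filter stride
  int dilation_h, dilation_w;  // filter dilation

  // Output height and width from the usual convolution size formula.
  int P() const { return (H + 2 * pad_h - dilation_h * (R - 1) - 1) / stride_h + 1; }
  int Q() const { return (W + 2 * pad_w - dilation_w * (S - 1) - 1) / stride_w + 1; }
};

struct GemmShape {
  std::int64_t M, N, K;
};

enum class ConvOperator { Fprop, Dgrad, Wgrad };

// Maps a convolution onto the GEMM extents it is equivalent to.
inline GemmShape implicit_gemm_shape(ConvOperator op, const Conv2dProblem& p) {
  assert(p.stride_h > 0 && p.stride_w > 0);
  const std::int64_t npq = std::int64_t(p.N) * p.P() * p.Q();  // output pixels
  const std::int64_t nhw = std::int64_t(p.N) * p.H * p.W;      // input pixels
  const std::int64_t crs = std::int64_t(p.C) * p.R * p.S;      // filter taps x input channels
  const std::int64_t krs = std::int64_t(p.K) * p.R * p.S;      // filter taps x output channels

  switch (op) {
    case ConvOperator::Fprop: return {npq, p.K, crs};  // output = activations * filters
    case ConvOperator::Dgrad: return {nhw, p.C, krs};  // d(activations) = d(output) * filters
    case ConvOperator::Wgrad: return {p.K, crs, npq};  // d(filters) = d(output) * activations
  }
  return {0, 0, 0};
}
```

For a 1x8x8x16 input with 32 filters of size 3x3, padding 1, and unit stride and dilation, this gives a 64 x 32 x 144 GEMM for Fprop. The 1-D and 3-D cases would follow the same pattern with one fewer or one more spatial extent.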
Implicit GEMM convolution components:
- Global memory iterators supporting Fprop, Dgrad, and Wgrad
- MmaMultistage for implicit GEMM convolution on the NVIDIA Ampere architecture
- MmaPipeline for implicit GEMM convolution on the NVIDIA Volta and Turing architectures
- Documentation describing the implicit GEMM convolution algorithm and implementation
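
As a rough picture of what "implicit" means here, the following CPU sketch computes a forward convolution as a GEMM loop nest without ever materializing the im2col matrix: each GEMM coordinate is decoded into tensor coordinates on the fly, which is the role the global memory iterators play on the GPU. It is an unoptimized illustration under stated assumptions (NHWC activations, KRSC filters, NPQK output, cross-correlation, symmetric padding), and `implicit_gemm_fprop_nhwc` is a hypothetical name, not a CUTLASS function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference Fprop as an implicit GEMM: M = N*P*Q, N = K, K = R*S*C.
// act is NHWC, filt is KRSC, and the result is NPQK (row-major M x N).
std::vector<float> implicit_gemm_fprop_nhwc(
    const std::vector<float>& act, const std::vector<float>& filt,
    int N, int H, int W, int C, int K, int R, int S,
    int pad, int stride, int dilation) {
  assert(stride > 0 && dilation > 0);
  assert(act.size() == std::size_t(N) * H * W * C);
  assert(filt.size() == std::size_t(K) * R * S * C);

  const int P = (H + 2 * pad - dilation * (R - 1) - 1) / stride + 1;
  const int Q = (W + 2 * pad - dilation * (S - 1) - 1) / stride + 1;
  const int gemm_m = N * P * Q;
  const int gemm_n = K;
  const int gemm_k = R * S * C;

  std::vector<float> out(std::size_t(gemm_m) * gemm_n, 0.0f);

  for (int m = 0; m < gemm_m; ++m) {
    // Decode the GEMM row into an output pixel (n, p, q).
    const int n = m / (P * Q);
    const int p = (m / Q) % P;
    const int q = m % Q;

    for (int k = 0; k < gemm_n; ++k) {
      float acc = 0.0f;
      for (int kk = 0; kk < gemm_k; ++kk) {
        // Decode the reduction index into a filter tap (r, s) and channel c.
        const int r = kk / (S * C);
        const int s = (kk / C) % S;
        const int c = kk % C;

        // Input pixel touched by this tap; skip taps that land in the padding.
        const int h = p * stride - pad + r * dilation;
        const int w = q * stride - pad + s * dilation;
        if (h < 0 || h >= H || w < 0 || w >= W) continue;

        acc += act[((std::size_t(n) * H + h) * W + w) * C + c] *
               filt[((std::size_t(k) * R + r) * S + s) * C + c];
      }
      out[std::size_t(m) * gemm_n + k] = acc;
    }
  }
  return out;
}
```

A quick sanity check is a 3x3 single-channel input of ones convolved with a 3x3 filter of ones at padding 1: each output value equals the number of filter taps that fall inside the image.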