- TF32x3: emulated single-precision using Tensor Cores
  - 45+ TFLOPs on NVIDIA A100
  - GEMM SDK example (real)
  - COMPLEX GEMM SDK example (complex)
  - Implicit GEMM Convolution SDK example
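The idea behind the "x3" can be sketched on the host in plain C++. This is a minimal illustration, not the CUTLASS implementation: each FP32 operand is split into a TF32-representable "big" part and a TF32-representable "small" residual, and three products recover most of the precision that a single TF32 product loses. The helper names `to_tf32`, `mul_tf32`, and `mul_tf32x3` are hypothetical, and truncation is assumed for the FP32-to-TF32 conversion for simplicity (hardware may round instead).

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Keep only the bits a TF32 value holds (1 sign, 8 exponent, 10 mantissa)
// by clearing the low 13 mantissa bits of an FP32 value.
// Assumption: truncation stands in for the real conversion.
inline float to_tf32(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

// 1xTF32 product: both operands lose their low mantissa bits.
inline float mul_tf32(float a, float b) {
    return to_tf32(a) * to_tf32(b);
}

// 3xTF32 product: a = a_big + a_small and b = b_big + b_small, so
// a * b ~= a_big * b_big + a_big * b_small + a_small * b_big.
// The a_small * b_small term is dropped because it is negligible in FP32.
inline float mul_tf32x3(float a, float b) {
    const float a_big = to_tf32(a);
    const float b_big = to_tf32(b);
    const float a_small = to_tf32(a - a_big);
    const float b_small = to_tf32(b - b_big);
    // Add the small cross terms first so they are not absorbed by the large term.
    return (a_small * b_big + a_big * b_small) + a_big * b_big;
}
```

On Tensor Cores each of the three products would be a TF32 matrix multiply-accumulate with FP32 accumulation, which is why the emulation costs roughly three times the math of a plain TF32 GEMM.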
- Mainloop fusion for Convolution: convolution with fused per-channel scale-bias-relu
  - Conv Fprop SDK example
  - Conv WGrad SDK example
  - cutlass::conv::device::ImplicitGemmConvolutionFusion
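As a rough host-side reference for what "fused per-channel scale-bias-relu" computes, the sketch below applies `relu(x * scale[c] + bias[c])` to each activation as it is read, immediately before the multiply-accumulate. It assumes the transform is applied to the activation operand, and it uses a 1x1 convolution with a channels-last layout to stay short. The function name and signature are hypothetical and are not part of the CUTLASS API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference 1x1 convolution with the per-channel transform fused into the
// inner loop. Layouts (assumed): activations are pixels x C, filter is K x C,
// output is pixels x K, all row-major.
std::vector<float> conv1x1_scale_bias_relu(
    const std::vector<float>& activations,
    const std::vector<float>& filter,
    const std::vector<float>& scale,
    const std::vector<float>& bias,
    std::size_t pixels, std::size_t C, std::size_t K) {
    assert(activations.size() == pixels * C);
    assert(filter.size() == K * C);
    assert(scale.size() == C && bias.size() == C);

    std::vector<float> output(pixels * K, 0.0f);
    for (std::size_t p = 0; p < pixels; ++p) {
        for (std::size_t k = 0; k < K; ++k) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < C; ++c) {
                // Fused step: scale, bias, and ReLU happen on load, so no
                // separate pass over the activation tensor is needed.
                const float x = std::max(
                    activations[p * C + c] * scale[c] + bias[c], 0.0f);
                acc += x * filter[k * C + c];
            }
            output[p * K + k] = acc;
        }
    }
    return output;
}
```

The benefit of fusing is that the transformed activations never have to be written to and re-read from global memory as an intermediate tensor.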
- Grouped GEMM: similar to batched GEMM with distinct problem size per group
  - SDK example with performance comparison with Batched Strided GEMM
  - cutlass::gemm::device::GemmGrouped
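The difference from batched GEMM can be illustrated with a small host-side reference: every group carries its own M, N, and K, whereas a strided batched GEMM uses one shape for all problems. The `GemmProblem` struct and `grouped_gemm_reference` function are hypothetical names for this sketch and do not mirror the `cutlass::gemm::device::GemmGrouped` interface; the sketch simply loops over the groups on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One GEMM problem with its own dimensions (hypothetical helper type).
struct GemmProblem {
    std::size_t m, n, k;
    std::vector<float> a;  // m x k, row-major
    std::vector<float> b;  // k x n, row-major
    std::vector<float> c;  // m x n, row-major (output)
};

// Computes C = A * B independently for every group. Each group may have a
// different (m, n, k), which is what distinguishes this from batched GEMM.
void grouped_gemm_reference(std::vector<GemmProblem>& groups) {
    for (GemmProblem& g : groups) {
        assert(g.a.size() == g.m * g.k);
        assert(g.b.size() == g.k * g.n);
        g.c.assign(g.m * g.n, 0.0f);
        for (std::size_t i = 0; i < g.m; ++i) {
            for (std::size_t j = 0; j < g.n; ++j) {
                float acc = 0.0f;
                for (std::size_t p = 0; p < g.k; ++p) {
                    acc += g.a[i * g.k + p] * g.b[p * g.n + j];
                }
                g.c[i * g.n + j] = acc;
            }
        }
    }
}
```

For example, one call can mix a 1x1x2 problem with a 2x1x1 problem, which a single strided batched GEMM could not express without padding.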
- Implicit GEMM Convolution fusion supports staging the first convolution's output accumulator in shared memory on Turing. This allows more flexible warp tile sizes and reduces register pressure.
- Optimal performance using CUDA 11.5
- Updates from the community (thanks!)
- Deprecation announcement: CUTLASS plans to deprecate the following:
  - Maxwell and Pascal GPU architectures
  - Ubuntu 16.04
  - CUDA 10.2