NVIDIA/cutlass v2.10.0 on GitHub

CUTLASS 2.10.0

CUTLASS Python now supports GEMM, Convolution and Grouped GEMM for different data types as well as different epilogue flavors.
Optimizations for CUTLASS's Grouped GEMM kernel. It can move some scheduling into the host side if applicable.
Optimizations for GEMM+Softmax.
Grouped GEMM for Multihead Attention is a general MHA that does not require equal sequence length in every GEMM.
GEMM + Layer norm fusion for Ampere can fuse the layernorm into GEMMs before and after.
GEMM Epilogue Permutation Fusion can permute the GEMM output before storing.
Grouped convolution targeting implicit GEMM introduces the first group convolution implementation to CUTLASS. It is an Analytical implementation, not an Optimized.
Depthwise separable convolution introduces the first depthwise convolution which is also Analytical for now.
Standalone Layernorm and Pooling kernels.
Back-to-back GEMM enhancements.
Updates and bugfixes from the community (thanks!)

NVIDIA/cutlass v2.10.0 CUTLASS 2.10.0 on GitHub