NVIDIA/cutlass v3.1.0 on GitHub

New CUTLASS Python interface that aims to provide an easy-to-use way to instantiate, emit, compile, and run CUTLASS kernels from Python. See the Python interface documentation and the new examples for details.
New efficient epilogues using TMA for Hopper.
Support for fused epilogues, such as Bias, ReLU, and GELU, using the new efficient epilogues.
New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper.
An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best possible epilogue while exposing incremental opt-ins for greater customization.
Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
Performance optimizations for the warp-specialized persistent ping-pong kernel.
Changes to the GEMM API 3.x, involving the host-facing arguments and the underlying Params structs.
FMHA Backward Pass from Meta xFormers.
StreamK GEMM with Broadcast, which enables epilogue broadcast with StreamK GEMM.
Batched B2B GEMM can now run multiple back-to-back GEMMs with the same problem size in parallel.
Batched Strided GEMV supports both row-major and column-major input matrices.
Permute + GEMM fusion can now fuse a Permute with the GEMM that follows it. Previously, only fusing a GEMM with a Permute in the epilogue was supported.
Row Broadcast can be fused in the epilogue.
The GitHub branch is renamed from master to main in this release.
Optimal performance using CUDA 12.1
Updates and bugfixes from the community (thanks!)
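
As a rough illustration of what the fused epilogues listed above compute, here is a minimal pure-Python reference, not CUTLASS code. It assumes the conventional linear-combination form D = activation(alpha * (A @ B) + beta * C + bias), with the bias given as a row vector broadcast to every row of the output. The function name `gemm_fused_epilogue`, its parameters, and the choice of the erf-based GELU are all hypothetical and chosen for this sketch only. The real kernels apply the same arithmetic per tile on the GPU.

```python
import math


def relu(x):
    return max(0.0, x)


def gelu(x):
    # Exact (erf-based) GELU; an actual kernel may use an approximation instead.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gemm_fused_epilogue(A, B, C, bias, alpha=1.0, beta=0.0, activation=relu):
    """Reference for D = activation(alpha * (A @ B) + beta * C + bias).

    A is M x K, B is K x N, C is M x N, all as nested lists.
    bias is a row vector of length N, broadcast to every row of D
    (an assumption made for this sketch).
    """
    M, K, N = len(A), len(B), len(B[0])
    D = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            # Mainloop: accumulate the dot product for one output element.
            acc = sum(A[m][k] * B[k][n] for k in range(K))
            # Epilogue: scale, add source and bias, apply the activation,
            # all before the result is written out once.
            D[m][n] = activation(alpha * acc + beta * C[m][n] + bias[n])
    return D


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]   # identity, so A @ B == A
C = [[0.0, 0.0], [0.0, 0.0]]
bias = [-2.0, 1.0]

D_relu = gemm_fused_epilogue(A, B, C, bias, activation=relu)
D_gelu = gemm_fused_epilogue(A, B, C, bias, activation=gelu)
```

The point of fusing is visible in the inner loop: bias and activation are applied to the accumulator before the single store to `D`, rather than in a separate pass over the output.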
