NVIDIA/cutlass v3.2.0 on GitHub

New warp-specialized persistent FP8 GEMM kernel schedules and mainloops targeting the Hopper architecture that achieve high performance with TMA, WGMMA, and threadblock clusters. Includes an example showcasing Hopper warp-specialized FP8 GEMMs.
New Epilogue Visitor Tree (EVT) support for Hopper TMA epilogues. EVTs allow user-defined epilogue fusion patterns without having to write a new epilogue.
Stream-K feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release.
Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the Tile Scheduler.
Improved performance for warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
Hopper GEMM+Permute, an example of fusing tensor reordering (permutation) with GEMM mainloop or epilogue.
New CUTLASS 2D Convolution Python interface, with a new accompanying example.
Support for Windows (MSVC) builds.
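
For readers unfamiliar with the stream-K item above, the scheduling idea can be sketched in plain Python. This is not CUTLASS code and does not use its API: the function name `stream_k_partition` and its parameters are hypothetical, and the sketch assumes every output tile needs the same number of K iterations. It shows only how the total iteration space is split evenly across persistent CTAs, so that one CTA's range may end partway through a tile.

```python
def stream_k_partition(num_tiles, iters_per_tile, num_ctas):
    """Split all (tile, k-iteration) work evenly across CTAs.

    Hypothetical illustration of stream-K scheduling, not CUTLASS code.
    Returns one list per CTA of (tile, k_begin, k_end) segments, where
    k_end is exclusive.
    """
    total = num_tiles * iters_per_tile
    base, extra = divmod(total, num_ctas)

    work = []
    start = 0
    for cta in range(num_ctas):
        # The first `extra` CTAs take one additional iteration.
        count = base + (1 if cta < extra else 0)
        end = start + count

        segments = []
        it = start
        while it < end:
            tile = it // iters_per_tile
            k_begin = it % iters_per_tile
            # Stop at the tile boundary or at the end of this CTA's range.
            k_end = min(iters_per_tile, k_begin + (end - it))
            segments.append((tile, k_begin, k_end))
            it += k_end - k_begin

        work.append(segments)
        start = end
    return work


if __name__ == "__main__":
    for cta, segments in enumerate(stream_k_partition(3, 4, 2)):
        print(f"CTA {cta}: {segments}")
```

With 3 tiles of 4 K iterations each and 2 CTAs, the middle tile is shared by both CTAs, so its partial accumulators would need a cross-CTA reduction before the epilogue. That reduction is omitted from this sketch.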
