- New warp-specialized persistent FP8 GEMM kernel kernel schedules and mainloops targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters. An example showcasing Hopper warp-specialized FP8 GEMMs.
- New Epilogue Visitor Tree (EVT) support for Hopper TMA epilogues. EVTs allows for user-defined customized epilogue fusion patterns without having to write a new epilogue.
- Stream-K feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release.
- Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the Tile Scheduler.
- Improved performance for warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
- Hopper GEMM+Permute, an example of fusing tensor reordering (permutation) with GEMM mainloop or epilogue.
- New CUTLASS 2D Convolution Python interface. New example here.
- Support for Windows (MSVC) builds.