- Hopper structured sparse GEMM.
  - FP16
  - FP8
  - INT8
  - TF32
- A refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal`. The 3.x convolution API is no longer considered a beta API.
- An improved mixed-input GEMM and a lookup table implementation for `INT4` x `FP8` scale-only mode.
- EVT nodes for Top-K selection and softmax, and a GEMM example using them.
- Programmatic Dependent Launch (PDL), which leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding documentation.
- A new debugging tool, synclog, for dumping out all synchronization events from within a kernel to a file. Please see synclog documentation for details.
- A new TMA-enabled epilogue for grouped GEMM that brings a significant performance improvement, along with EVT support for it.
- A SIMT-enabled pointer-array epilogue.
- A new Ping-Pong kernel schedule for Grouped GEMM and some other optimizations.
- A new instantiation strategy for CUTLASS profiler kernels along with improved documentation for instantiation level in CUTLASS profiler.
- New hardware support for comparisons and computations of `cutlass::bfloat16_t`.
- Fixed use of `isnan` on Windows for `half_t`.
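
For readers unfamiliar with the Top-K selection and softmax item above, the following is a minimal host-side sketch of what such a fused operation might compute for a single output row. It is not the CUTLASS EVT API. The function name `topk_softmax_row` and the exact semantics (keep the `k` largest entries, apply a softmax over only those, zero the rest) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference routine, not part of CUTLASS: keeps the k largest
// entries of one row, applies a softmax over them, and zeroes the rest.
std::vector<float> topk_softmax_row(const std::vector<float>& row, std::size_t k) {
  std::vector<float> out(row.size(), 0.0f);
  if (row.empty() || k == 0) return out;
  k = std::min(k, row.size());

  // Order indices so the first k point at the largest values,
  // breaking ties in favor of the lower index.
  std::vector<std::size_t> idx(row.size());
  for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
  std::partial_sort(idx.begin(),
                    idx.begin() + static_cast<std::ptrdiff_t>(k),
                    idx.end(),
                    [&](std::size_t a, std::size_t b) {
                      return row[a] > row[b] || (row[a] == row[b] && a < b);
                    });

  // Subtract the row maximum before exponentiating for numerical stability.
  const float max_val = row[idx[0]];
  float sum = 0.0f;
  for (std::size_t j = 0; j < k; ++j) {
    out[idx[j]] = std::exp(row[idx[j]] - max_val);
    sum += out[idx[j]];
  }
  // Normalize so the k kept entries sum to 1.
  for (std::size_t j = 0; j < k; ++j) out[idx[j]] /= sum;
  return out;
}
```

Calling `topk_softmax_row({1.0f, 3.0f, 2.0f, 0.0f}, 2)` keeps the entries at positions 1 and 2 and zeroes the other two. A routine like this could serve as a reference when checking the output of a fused kernel, provided the kernel's actual semantics match the assumptions stated here.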