NVIDIA/cutlass v3.6.0 on GitHub

Hopper structured sparse GEMM.
- FP16
- FP8
- INT8
- TF32
A refactor of the CUTLASS 3.x convolution kernel::ConvUniversal API to bring it in line with gemm::GemmUniversal. The 3.x convolution API is no longer considered a beta API.
An improved mixed input GEMM and a lookup table implementation for INT4xFP8 scale-only mode.
EVT nodes for Top-K selection and softmax, and a GEMM example that uses them.
Programmatic Dependent Launch (PDL), which leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding documentation.
A new debugging tool, synclog, for dumping out all synchronization events from within a kernel to a file. Please see synclog documentation for details.
A new TMA-enabled epilogue for grouped GEMM that brings a significant performance improvement, as well as its EVT support.
A SIMT-enabled pointer-array epilogue.
A new Ping-Pong kernel schedule for Grouped GEMM and some other optimizations.
A new instantiation strategy for CUTLASS profiler kernels along with improved documentation for instantiation level in CUTLASS profiler.
New hardware support for comparisons and computations of cutlass::bfloat16_t.
Fixed use of isnan on Windows for half_t.
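
For readers new to the structured sparse GEMM item above, here is a minimal host-side sketch of the 2:4 pattern that NVIDIA sparse tensor cores commonly use for 16-bit and 8-bit operands (at most two nonzeros in every four consecutive elements). It is an illustration only, not the CUTLASS compressor. The `Compressed24` struct, the `compress_2_4` function, the use of `float` as the element type, and the metadata packing are all assumptions made for this example and do not reflect the hardware metadata layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for this sketch: two kept values per group of four,
// plus one metadata byte per group holding the two kept positions.
struct Compressed24 {
    std::vector<float> values;       // 2 entries per group of 4 inputs
    std::vector<std::uint8_t> meta;  // bits [1:0] = first kept index, bits [3:2] = second
};

// Compresses one row under the 2:4 assumption.
// Returns false if the length is not a multiple of 4 or if any group of four
// holds more than two nonzeros; `out` may be partially filled in that case.
bool compress_2_4(const std::vector<float>& row, Compressed24& out) {
    if (row.size() % 4 != 0) return false;
    out.values.clear();
    out.meta.clear();
    for (std::size_t g = 0; g < row.size(); g += 4) {
        bool keep[4] = {false, false, false, false};
        int kept = 0;
        for (int i = 0; i < 4; ++i) {
            if (row[g + i] != 0.0f) {
                if (kept == 2) return false;  // group violates the 2:4 pattern
                keep[i] = true;
                ++kept;
            }
        }
        // Fewer than two nonzeros: keep explicit zeros so every group emits two values.
        for (int i = 0; i < 4 && kept < 2; ++i) {
            if (!keep[i]) {
                keep[i] = true;
                ++kept;
            }
        }
        // Emit kept values in ascending position order and pack their indices.
        std::uint8_t m = 0;
        int slot = 0;
        for (int i = 0; i < 4; ++i) {
            if (keep[i]) {
                out.values.push_back(row[g + i]);
                m |= static_cast<std::uint8_t>(i << (2 * slot));
                ++slot;
            }
        }
        out.meta.push_back(m);
    }
    return true;
}
```

Calling `compress_2_4` on a row of eight elements yields four stored values and two metadata bytes, which is the halving of operand storage that makes the sparse kernels attractive. Other element types may use a different ratio, so treat the 2:4 grouping here as an assumption rather than a description of every listed data type.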
