- A new Hopper blockwise scaling FP8 GEMM where the operands and block scaling tensor are staged via shared memory.
- Distributed GEMM is an experimental pipelined Tensor Parallelism implementation that utilizes existing CUTLASS kernels and CUDA runtime features, and can hide most of the communication behind computation.
- Improved persistent grid launch for Hopper kernels with large cluster sizes (>= 4) using the new `make_kernel_hardware_info` API, as shown in example 48.
- Enabled high precision accumulation for Hopper FP8 Sparse GEMM.