- A new Hopper blockwise scaling FP8 GEMM where the operands and block scaling tensor are staged via shared memory.
- Distributed GEMM is an experimental pipelined Tensor Parallelism implementation that utilizes existing CUTLASS kernels and CUDA runtime features, and can hide most of the communication behind computation.
- Improved persistent grid launch for Hopper kernels with large cluster sizes (>= 4) using the new `make_kernel_hardware_info` API, as shown in example 48.
- Enabled high precision accumulation for Hopper FP8 Sparse GEMM.