- Support for Blackwell SM120 kernels for GeForce GPUs in the CUTLASS 3.x API (a builder sketch follows this list):
- Collective mainloops targeting:
- Blockscaled datatypes with support for dense GEMM
- Blockscaled datatypes with support for sparse GEMM
- New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
- Blackwell SM120 epilogue and full set of EVT fusions.
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM120 architecture:
- Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor.
- Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation.
- Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor.
- Grouped GEMM with NVFP4 input datatype.
- Sparse Blockscaled GEMM with MXFP8 input datatype and BF16 output tensor.
- Sparse Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor.
- Set of unit tests that demonstrate the usage of both sparse and dense Blackwell SM120 blockscaled GEMM.
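
To make the moving pieces concrete, below is a minimal sketch of composing an SM120 blockscaled GEMM (NVFP4 inputs, BF16 output) through the 3.x CollectiveBuilder API. The tile shape, cluster shape, alignments, and `Auto` schedule tags are illustrative assumptions; the shipped examples are the authoritative reference.

```cpp
// Hedged sketch: SM120 blockscaled NVFP4 -> BF16 GEMM via the CUTLASS 3.x builders.
// Tile shape, cluster shape, and alignments below are illustrative assumptions.
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using namespace cute;

using ElementA = cutlass::nv_float4_t<cutlass::float_e2m1_t>;  // NVFP4 data + scale factors
using ElementB = cutlass::nv_float4_t<cutlass::float_e2m1_t>;
using ElementD = cutlass::bfloat16_t;                          // BF16 output tensor
using MmaTileShape = Shape<_128, _128, _128>;                  // assumed
using ClusterShape = Shape<_1, _1, _1>;                        // SM120 (GeForce): no multi-CTA clusters

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm120, cutlass::arch::OpClassBlockScaledTensorOp,
    ElementA, cutlass::layout::RowMajor, 32,
    ElementB, cutlass::layout::ColumnMajor, 32,
    float,                                         // accumulator
    MmaTileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto  // let the builder pick an SM120 schedule
  >::CollectiveOp;

using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm120, cutlass::arch::OpClassBlockScaledTensorOp,
    MmaTileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,                                  // accumulator / compute
    ElementD, cutlass::layout::RowMajor, 8,        // source C
    ElementD, cutlass::layout::RowMajor, 8,        // output D
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;

// Kernel layer + device adapter, as in the other 3.x examples.
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,                     // problem shape (M, N, K, L)
    CollectiveMainloop, CollectiveEpilogue>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```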
- Support for Blackwell SM100 Sparse kernels (a mainloop sketch follows this list):
- Collective mainloop targeting:
- SM100 Sparse GEMM
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 Sparse GEMM:
- Sparse GEMM
- Blockscaled Sparse GEMM with NVFP4 input datatype
- Blockscaled Sparse GEMM with mixed input datatype (MXFP8 and MXFP4)
- Set of unit tests that demonstrate the usage of sparse and blockscaled sparse Blackwell SM100 GEMM.
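
The sparse path reuses the same builder machinery as the dense sketch above: swapping in the sparse tensor-op class selects the sparse mainloop. A minimal sketch follows, with element types, shapes, and alignments chosen purely for illustration; note that the A operand must first be compressed into packed values plus metadata, as the new examples and unit tests demonstrate.

```cpp
// Hedged sketch: selecting the SM100 sparse mainloop through the same builder.
// Element types, tile/cluster shapes, and alignments are illustrative assumptions.
using SparseMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm100, cutlass::arch::OpClassSparseTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor, 8,     // A: structured-sparse operand
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,  // B: dense operand
    float,                                             // accumulator
    Shape<_128, _128, _64>, Shape<_1, _1, _1>,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;
```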
- A new Multi-head Latent Attention (MLA) CUTLASS example for the SM100 Blackwell architecture, covering the FlashMLA-like weight-absorbed decoding use case.
- A new FMHA backward kernel for the SM100 Blackwell architecture extends the CUTLASS FMHA example, showing how the five backward-pass MMAs can be fused into a single kernel for high performance.
- A new distributed GEMM example for the SM100 Blackwell architecture.
- Enhancements and new support for blockwise and groupwise GEMM on the Hopper and Blackwell architectures (a reference sketch of the scaling semantics follows this list):
- Enhancement of blockwise GEMM for Hopper architecture.
- Enhancement of groupwise GEMM for Hopper architecture.
- Support for grouped GEMM with blockwise and groupwise scaling for Hopper architecture.
- Support for groupwise GEMM in the CUTLASS profiler.
- Support for blockwise GEMM for Blackwell architecture.
- Support for groupwise GEMM for Blackwell architecture.
- Support for grouped GEMM with blockwise and groupwise scaling for Blackwell architecture.
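
For readers new to these scaling modes: blockwise scaling attaches one scale factor per tile of A and per tile of B, and each partial product is multiplied by the two corresponding scales; groupwise scaling simply shrinks the granularity along one mode (e.g. per-row scales for A). Below is a minimal CPU reference of the semantics, not the CUTLASS kernel itself, with assumed scale granularities.

```cpp
// Hedged CPU reference for blockwise-scaled GEMM semantics (not the CUTLASS kernel).
// One scale per (ScaleGranM x ScaleGranK) tile of A and (ScaleGranN x ScaleGranK)
// tile of B; shrinking ScaleGranM toward 1 yields groupwise scaling for A.
#include <vector>

constexpr int ScaleGranM = 128, ScaleGranN = 128, ScaleGranK = 128;  // assumed

void blockwise_scaled_gemm_ref(
    int M, int N, int K,
    std::vector<float> const& A,    // M x K, row-major (already-decoded values)
    std::vector<float> const& B,    // K x N, column-major (element (k,n) at B[n*K+k])
    std::vector<float> const& SFA,  // ceil(M/ScaleGranM) x ceil(K/ScaleGranK), row-major
    std::vector<float> const& SFB,  // ceil(N/ScaleGranN) x ceil(K/ScaleGranK), row-major
    std::vector<float>& D) {        // M x N, row-major
  int sf_k = (K + ScaleGranK - 1) / ScaleGranK;  // number of scale tiles along K
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        float sa = SFA[(m / ScaleGranM) * sf_k + (k / ScaleGranK)];
        float sb = SFB[(n / ScaleGranN) * sf_k + (k / ScaleGranK)];
        acc += sa * sb * A[m * K + k] * B[n * K + k];
      }
      D[m * N + n] = acc;
    }
  }
}
```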
- Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler:
- Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
- Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
- Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
- A more detailed introduction and examples of how to leverage this feature can be found in profiler.md.
- Support for `void` as the D element in SM100 kernel epilogues.
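
A hedged sketch of what this enables: declaring an SM100 epilogue whose D element is `void`, so that nothing is stored to a D tensor (e.g. when results flow out entirely through EVT auxiliary outputs). All tags other than the `void` D element are assumptions.

```cpp
// Hedged sketch: an SM100 epilogue with ElementD = void; the epilogue skips the D store.
// Tile/cluster shapes, schedules, and the void C source are illustrative assumptions.
using VoidDEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp,
    Shape<_128, _128, _64>, Shape<_1, _1, _1>,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,                        // accumulator / compute
    void, cutlass::layout::RowMajor, 8,  // no source C
    void, cutlass::layout::RowMajor, 8,  // void D element: no D tensor is written
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;
```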