- Support for Blackwell SM120 kernels for GeForce GPUs in the CUTLASS 3.x API (a builder sketch follows this list):
- Collective mainloops targeting:
- Blockscaled datatypes with support for dense GEMM
- Blockscaled datatypes with support for sparse GEMM
- New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
- Blackwell SM120 epilogue and full set of EVT fusions.
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM120 architecture:
- Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor.
- Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation.
- Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor.
- Grouped GEMM with NVFP4 input datatype.
- Sparse Blockscaled GEMM with MXFP8 input datatype and BF16 output tensor.
- Sparse Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor.
- Set of unit tests that demonstrate the usage of both sparse and dense Blackwell SM120 blockscaled GEMM.
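
To make the moving pieces concrete, below is a minimal sketch of composing an SM120 blockscaled GEMM (NVFP4 inputs, BF16 output) through the 3.x CollectiveBuilder API. The tile shape, cluster shape, alignments, and `Auto` schedule tags are illustrative assumptions; the shipped examples are the authoritative reference.

```cpp
// Hedged sketch: SM120 blockscaled NVFP4 -> BF16 GEMM via the CUTLASS 3.x builders.
// Tile shape, cluster shape, and alignments below are illustrative assumptions.
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using namespace cute;

using ElementA = cutlass::nv_float4_t<cutlass::float_e2m1_t>;  // NVFP4 data + scale factors
using ElementB = cutlass::nv_float4_t<cutlass::float_e2m1_t>;
using ElementD = cutlass::bfloat16_t;                          // BF16 output tensor
using MmaTileShape = Shape<_128, _128, _128>;                  // assumed
using ClusterShape = Shape<_1, _1, _1>;                        // SM120 (GeForce): no multi-CTA clusters

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm120, cutlass::arch::OpClassBlockScaledTensorOp,
    ElementA, cutlass::layout::RowMajor, 32,
    ElementB, cutlass::layout::ColumnMajor, 32,
    float,                                         // accumulator
    MmaTileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto  // let the builder pick an SM120 schedule
  >::CollectiveOp;

using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm120, cutlass::arch::OpClassBlockScaledTensorOp,
    MmaTileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,                                  // accumulator / compute
    ElementD, cutlass::layout::RowMajor, 8,        // source C
    ElementD, cutlass::layout::RowMajor, 8,        // output D
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;

// Kernel layer + device adapter, as in the other 3.x examples.
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,                     // problem shape (M, N, K, L)
    CollectiveMainloop, CollectiveEpilogue>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```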
- Support for Blackwell SM100 Sparse kernels (a mainloop sketch follows this list):
- Collective mainloop targeting:
- SM100 Sparse GEMM
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 Sparse GEMM:
- Sparse GEMM
- Blockscaled Sparse GEMM with NVFP4 input datatype
- Blockscaled Sparse GEMM with mixed input datatype (MXFP8 and MXFP4)
- Set of unit tests that demonstrate the usage of sparse and blockscaled sparse Blackwell SM100 GEMM.
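
The sparse path reuses the same builder machinery as the dense sketch above: swapping in the sparse tensor-op class selects the sparse mainloop. A minimal sketch follows, with element types, shapes, and alignments chosen purely for illustration; note that the A operand must first be compressed into packed values plus metadata, as the new examples and unit tests demonstrate.

```cpp
// Hedged sketch: selecting the SM100 sparse mainloop through the same builder.
// Element types, tile/cluster shapes, and alignments are illustrative assumptions.
using SparseMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm100, cutlass::arch::OpClassSparseTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor, 8,     // A: structured-sparse operand
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,  // B: dense operand
    float,                                             // accumulator
    Shape<_128, _128, _64>, Shape<_1, _1, _1>,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;
```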
- A new Multi-head Latent Attention (MLA) CUTLASS example for the SM100 Blackwell architecture, covering the FlashMLA-like weight-absorbed decoding use case.
- A new FMHA backward kernel for the SM100 Blackwell architecture extends the CUTLASS FMHA example, showing how the five backward-pass MMAs can be fused into a single kernel for high performance.
- A new distributed GEMM example for the SM100 Blackwell architecture.
- Enhancements and new support for blockwise and groupwise GEMM on the Hopper and Blackwell architectures (a reference sketch of the scaling semantics follows this list):
- Enhancement of blockwise GEMM for Hopper architecture.
- Enhancement of groupwise GEMM for Hopper architecture.
- Support for grouped GEMM with blockwise and groupwise scaling for Hopper architecture.
- Support for groupwise GEMM in the CUTLASS profiler.
- Support for blockwise GEMM for Blackwell architecture.
- Support for groupwise GEMM for Blackwell architecture.
- Support for grouped GEMM with blockwise and groupwise scaling for Blackwell architecture.
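
For readers new to these scaling modes: blockwise scaling attaches one scale factor per tile of A and per tile of B, and each partial product is multiplied by the two corresponding scales; groupwise scaling simply shrinks the granularity along one mode (e.g. per-row scales for A). Below is a minimal CPU reference of the semantics, not the CUTLASS kernel itself, with assumed scale granularities.

```cpp
// Hedged CPU reference for blockwise-scaled GEMM semantics (not the CUTLASS kernel).
// One scale per (ScaleGranM x ScaleGranK) tile of A and (ScaleGranN x ScaleGranK)
// tile of B; shrinking ScaleGranM toward 1 yields groupwise scaling for A.
#include <vector>

constexpr int ScaleGranM = 128, ScaleGranN = 128, ScaleGranK = 128;  // assumed

void blockwise_scaled_gemm_ref(
    int M, int N, int K,
    std::vector<float> const& A,    // M x K, row-major (already-decoded values)
    std::vector<float> const& B,    // K x N, column-major (element (k,n) at B[n*K+k])
    std::vector<float> const& SFA,  // ceil(M/ScaleGranM) x ceil(K/ScaleGranK), row-major
    std::vector<float> const& SFB,  // ceil(N/ScaleGranN) x ceil(K/ScaleGranK), row-major
    std::vector<float>& D) {        // M x N, row-major
  int sf_k = (K + ScaleGranK - 1) / ScaleGranK;  // number of scale tiles along K
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        float sa = SFA[(m / ScaleGranM) * sf_k + (k / ScaleGranK)];
        float sb = SFB[(n / ScaleGranN) * sf_k + (k / ScaleGranK)];
        acc += sa * sb * A[m * K + k] * B[n * K + k];
      }
      D[m * N + n] = acc;
    }
  }
}
```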
- Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler:
- Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
- Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
- Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
- A more detailed introduction and examples of how to leverage this feature can be found in profiler.md.
- Support for `void` as the D element in SM100 kernel epilogues.
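
A hedged sketch of what this enables: declaring an SM100 epilogue whose D element is `void`, so that nothing is stored to a D tensor (e.g. when results flow out entirely through EVT auxiliary outputs). All tags other than the `void` D element are assumptions.

```cpp
// Hedged sketch: an SM100 epilogue with ElementD = void; the epilogue skips the D store.
// Tile/cluster shapes, schedules, and the void C source are illustrative assumptions.
using VoidDEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp,
    Shape<_128, _128, _64>, Shape<_1, _1, _1>,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,                        // accumulator / compute
    void, cutlass::layout::RowMajor, 8,  // no source C
    void, cutlass::layout::RowMajor, 8,  // void D element: no D tensor is written
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;
```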