CUTLASS 3.9.0

  • Support for Blackwell SM120 kernels for GeForce GPUs in the CUTLASS 3.x API (a builder sketch follows this group of items):
    • Collective mainloops that target:
      • Blockscaled datatypes with support for dense GEMM
      • Blockscaled datatypes with support for sparse GEMM
    • New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
    • Blackwell SM120 epilogues and the full set of EVT fusions.
  • A set of examples that demonstrates the usage of the 3.x API for targeting the Blackwell SM120 architecture:
    • Blockscaled GEMM with NVFP4 input data type and BF16 output tensor.
    • Blockscaled GEMM with NVFP4 input data type and NVFP4 output tensor with scale factor generation.
    • Blockscaled GEMM with mixed input data types (MXFP8 and MXFP6) and BF16 output tensor.
    • Grouped GEMM with NVFP4 data type.
    • Sparse Blockscaled GEMM with MXFP8 input data type and BF16 output tensor.
    • Sparse Blockscaled GEMM with NVFP4 input data type and NVFP4 output tensor.
  • A set of unit tests that demonstrates the usage of both sparse and dense Blackwell SM120 blockscaled GEMMs.
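
To make the new SM120 surface concrete, here is a minimal sketch of composing a blockscaled dense GEMM (NVFP4 inputs, BF16 output) with the 3.x CollectiveBuilder. It follows the pattern of the existing SM100 blockscaled examples; the tile shape, alignments, op-class tag, and Auto schedule choices are illustrative assumptions, and the new SM120 examples in the repository are the authoritative reference.

```cpp
// Sketch: SM120 blockscaled dense GEMM (NVFP4 in, BF16 out) via the 3.x
// CollectiveBuilder. Names follow the pattern of the SM100 blockscaled
// examples; tile shape, alignments, and schedule tags are assumptions.
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cute/tensor.hpp"

using namespace cute;

// NVFP4 = FP4 (e2m1) values paired with per-block scale factors.
using ElementA = cutlass::nv_float4_t<cutlass::float_e2m1_t>;
using ElementB = cutlass::nv_float4_t<cutlass::float_e2m1_t>;
using ElementC = cutlass::bfloat16_t;
using ElementD = cutlass::bfloat16_t;
using ElementAccumulator = float;

using LayoutA = cutlass::layout::RowMajor;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutC = cutlass::layout::RowMajor;

using MmaTileShape = Shape<_128, _128, _128>;  // assumed tile shape
using ClusterShape = Shape<_1, _1, _1>;        // cluster launch kept trivial in this sketch

using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm120,
    cutlass::arch::OpClassBlockScaledTensorOp,  // assumed to match the mainloop op class
    MmaTileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    ElementAccumulator, ElementAccumulator,
    ElementC, LayoutC, 8,
    ElementD, LayoutC, 8,
    cutlass::epilogue::collective::EpilogueScheduleAuto>::CollectiveOp;

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm120,
    cutlass::arch::OpClassBlockScaledTensorOp,
    ElementA, LayoutA, 32,  // 32-element alignment for 4-bit operands
    ElementB, LayoutB, 32,
    ElementAccumulator,
    MmaTileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::collective::KernelScheduleAuto>::CollectiveOp;

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,  // problem shape (M, N, K, L)
    CollectiveMainloop,
    CollectiveEpilogue>;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```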
  • Support for Blackwell SM100 Sparse kernels:
    • Collective mainloop that targets:
      • SM100 Sparse GEMM
  • A set of examples that demonstrates the usage of the 3.x API for targeting Blackwell SM100 Sparse GEMM:
    • Sparse GEMM
    • Blockscaled Sparse GEMM with NVFP4 input data type
    • Blockscaled Sparse GEMM with mixed input data type (MXFP8 and MXFP4)
  • A set of unit tests that demonstrates the usage of sparse and blockscaled sparse Blackwell SM100 GEMMs.
  • A new Multi-head Latent Attention (MLA) CUTLASS example for the SM100 Blackwell architecture covers the FlashMLA-like weight-absorbed decoding use case.
  • A new FMHA backward kernel for the SM100 Blackwell architecture extends the CUTLASS FMHA example to show how the five backward-pass MMAs can be fused into a single kernel to achieve high performance.
  • A new distributed GEMM example for the SM100 Blackwell architecture.
  • Enhancements and new support for blockwise and groupwise GEMM on the Hopper and Blackwell architectures (a scaling-semantics sketch follows this group):
    • Enhancement of blockwise GEMM for Hopper architecture.
    • Enhancement of groupwise GEMM for Hopper architecture.
    • Support for grouped GEMM with blockwise and groupwise scaling for Hopper architecture.
    • Support for groupwise GEMM in the CUTLASS profiler.
    • Support for blockwise GEMM for Blackwell architecture.
    • Support for groupwise GEMM for Blackwell architecture.
    • Support for grouped GEMM with blockwise and groupwise scaling for Blackwell architecture.
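
As a point of reference for the terms above, the sketch below spells out the blockwise/groupwise scaling semantics as a plain C++ reference computation, independent of the CUTLASS API: each granule of A and B carries one scale factor that is folded into every partial product the granule contributes. The granule sizes (and using gM = 1 for groupwise, tile-sized granules for blockwise) are illustrative assumptions; the granularities CUTLASS actually supports are configured in the kernels and documented with the examples.

```cpp
// Library-free reference for blockwise / groupwise scaled GEMM. Each
// (gM x gK) granule of A and (gK x gN) granule of B carries one float
// scale factor. "Blockwise" uses tile-sized granules for both operands;
// "groupwise" shrinks a granule dimension (e.g. gM = 1 for per-row scales).
#include <vector>

void scaled_gemm_reference(
    int M, int N, int K, int gM, int gN, int gK,
    std::vector<float> const& A,       // M x K, row-major (dequantized values)
    std::vector<float> const& B,       // K x N, row-major
    std::vector<float> const& scaleA,  // ceil(M/gM) x ceil(K/gK), row-major
    std::vector<float> const& scaleB,  // ceil(K/gK) x ceil(N/gN), row-major
    std::vector<float>& D) {           // M x N, row-major
  int scaleA_cols = (K + gK - 1) / gK;
  int scaleB_cols = (N + gN - 1) / gN;
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        // Look up the scale granule covering this (row, k) / (k, col) pair.
        float sA = scaleA[(i / gM) * scaleA_cols + k / gK];
        float sB = scaleB[(k / gK) * scaleB_cols + j / gN];
        acc += sA * sB * A[i * K + k] * B[k * N + j];
      }
      D[i * N + j] = acc;
    }
  }
}
```

The fused kernels apply the scales once per K-granule of the accumulation rather than per element, which is numerically equivalent since the scale is constant within a granule.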
  • Added support for enhanced kernel performance search (auto-tuning) in the CUTLASS profiler:
    • Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
    • Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
    • Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
    • A more detailed introduction and examples showing how to leverage these features can be found in profiler.md.
  • Support for void as the D element in SM100 kernel epilogues, i.e., epilogues that do not store a D tensor (see the sketch below).
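
A minimal sketch of what the void-D support enables, assuming (by analogy with the existing void-C, no-source support) that void is passed in the ElementD slot of the epilogue CollectiveBuilder; the arch and op-class tags, tile shape, and alignments here are illustrative assumptions.

```cpp
// Sketch: an SM100 epilogue that elides the D store by passing void as the
// D element. Expected use is alongside an EVT fusion that writes the outputs
// you do want (e.g. an auxiliary tensor or a reduction).
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cute/tensor.hpp"

using namespace cute;

using ElementAccumulator = float;
using ElementCompute     = float;
using ElementC           = cutlass::half_t;

using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp,
    Shape<_128, _128, _64>, Shape<_1, _1, _1>,
    cutlass::epilogue::collective::EpilogueTileAuto,
    ElementAccumulator, ElementCompute,
    ElementC, cutlass::layout::RowMajor, 8,
    void,                           // ElementD = void: no D tensor is stored
    cutlass::layout::RowMajor, 8,   // layout/alignment slots are still supplied
    cutlass::epilogue::collective::EpilogueScheduleAuto>::CollectiveOp;
```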
