NVIDIA/cutlass v3.8.0
CUTLASS 3.8.0


CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture.
For background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.

  • Support for new CuTe building blocks specifically for Blackwell SM100 architecture:
    • 5th generation Blackwell Tensor Core instructions (TCGen05) via CuTe MMA atoms.
    • Extensions to Tensor Memory Accelerator via CuTe Copy atoms.
    • Exposure of Blackwell's new tensor memory (note: distinct from TMA) as tmem, a first-class data locale across CuTe.
    • Exposure of tmem->rmem, rmem->tmem and smem->tmem data movement instructions as copy atoms in CuTe.
    • make_tmem_copy() utility method to ease creation of tiled copies for tmem copy atoms.
    • Support for new variants of LDSM on Blackwell via CuTe Copy atoms.
  • Support for new CUTLASS building blocks specifically for Blackwell SM100 architecture:
    • Various narrow-precision FP4, FP6, and FP8 formats, as well as their block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8.
    • Pipelines that implement Blackwell specific synchronization.
    • Cluster launch control API supporting preferred and fallback cluster shapes.
    • Data types including NVFP4, MXFP4, MXFP6, and MXFP8, together with all their supported element and scale factor types.
    • Tile schedulers using Blackwell's Cluster Launch Control (CLC) feature to implement dynamic persistent scheduling for GEMMs, including stream-K.
    • Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
  • Full support for Blackwell SM100 kernels in CUTLASS 3.x API:
    • Blackwell specific kernel layers that
      • Implement a new warp-specialization recipe tuned specifically for Blackwell SM100 architecture.
      • Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.
      • Support stream-K load balancing for all kernel types via composable scheduler support.
    • Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) via TMA, for:
      • Non-block scaled data types without pointer array and grouped GEMM support
      • Non-block scaled data types with pointer array and grouped GEMM support
      • Block scaled data types without pointer array and grouped GEMM support
      • Block scaled data types with pointer array and grouped GEMM support
    • Blackwell collective mainloop for convolution kernels supporting non-block scaled data types for fprop, dgrad, and wgrad.
    • New GEMM, convolution, and epilogue dispatch policies for collectives, kernel layers, and builders.
    • Blackwell epilogue that supports loading accumulators from tmem and the full set of EVT fusions.
  • CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
    • Support for preferred and fallback cluster shapes via profiler command-line arguments that set dynamic cluster shapes.
    • Support for dynamic data types via profiler command-line arguments that set the dynamic data type in TCGen05 MMA instruction descriptors.
    • Support for mixed input GEMM kernels on Hopper in the profiler.
  • New CUTLASS profiler flag use-cuda-graphs to reduce overheads when benchmarking launch-bound kernels.
  • A new 3.x version of grouped GEMM added to the CUTLASS library, generating kernels for Hopper and Blackwell. Grouped GEMM support is now enabled in the CUTLASS profiler (run ./cutlass_profiler --operation=GroupedGemm --help for details).
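For reference, the two new profiler entry points above can be exercised as follows. This is an illustrative sketch: the --operation=GroupedGemm --help invocation is stated in these notes, while the =true value syntax for use-cuda-graphs is an assumption based on the profiler's usual flag style.

```shell
# Enumerate options for the newly added grouped GEMM operation:
./cutlass_profiler --operation=GroupedGemm --help

# Use the new use-cuda-graphs flag to reduce overheads when benchmarking
# launch-bound kernels (value syntax assumed):
./cutlass_profiler --operation=Gemm --use-cuda-graphs=true
```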
  • Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 architecture:
    • Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples, demonstrating ease of migration for off-the-shelf kernels using the 3.x collective builder API.
    • GEMM with opt-in collective builder schedules showcasing available recipes for Blackwell.
    • Block scaled data type GEMMs targeting Blackwell's native block scaled Tensor Cores:
      • NVFP4 inputs with BF16 output
      • NVFP4 inputs with NVFP4 output
      • Mixed MXFP8 and MXFP6 inputs with BF16 output
    • GEMM example demonstrating Blackwell's new preferred cluster support via dynamic cluster shapes for increased occupancy.
    • GEMM with CLC based StreamK scheduler for load balancing.
    • Grouped GEMM for vanilla FP8 data inputs and NVFP4 block scaled inputs.
    • Convolution kernels for fprop, dgrad, and wgrad.
    • Fused multi-head attention fprop kernel supporting fp16/bf16/fp8 data types across head dims of 32, 64, and 128.
    • A new BF16x9 GEMM kernel that emulates FP32 GEMM (SGEMM) using BF16 operations.
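As the basic examples above suggest, migrating an off-the-shelf Hopper 3.x kernel to Blackwell largely amounts to changing the architecture tag and shapes handed to the collective builder. The following is a minimal sketch of a mainloop declaration, written after the Hopper builder pattern; the element types, alignments, and tile/cluster shapes are illustrative, and Blackwell-specific schedule tags in the shipped examples may differ.

```cpp
#include "cutlass/cutlass.h"
#include "cutlass/gemm/collective/collective_builder.hpp"

// Hedged sketch: declaring a Blackwell SM100 collective mainloop via the
// 3.x CollectiveBuilder API. Shapes and types below are illustrative.
using ElementA           = cutlass::half_t;  // FP16 inputs, as in the basic example
using ElementB           = cutlass::half_t;
using ElementAccumulator = float;

using MmaTileShape = cute::Shape<cute::_128, cute::_128, cute::_64>;  // illustrative tile
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;       // static cluster shape

using CollectiveMainloop = cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp,  // target TCGen05 Tensor Cores
    ElementA, cutlass::layout::RowMajor, 8,                // A: type, layout, alignment
    ElementB, cutlass::layout::ColumnMajor, 8,             // B: type, layout, alignment
    ElementAccumulator,
    MmaTileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,             // let the builder pick stage count
    cutlass::gemm::collective::KernelScheduleAuto          // let the builder pick a schedule
  >::CollectiveOp;
```

Relative to a Hopper kernel, only the architecture tag (cutlass::arch::Sm100) and the shapes typically need to change; the builder selects a Blackwell-tuned schedule and stage count.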
  • Set of examples that demonstrate the usage of the 3.x API for targeting Hopper architecture:
    • A set of new Hopper grouped GEMM kernels that support mixed A and B datatypes.
    • A new Hopper FP8 GEMM with groupwise scaling.
  • Documentation updates:
    • Quickstart - instantiating a Blackwell block-scaled GEMM.
    • Detailed Blackwell block-scaled GEMM functionality documentation.
    • New functionality documentation specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA toolkit support, etc. for 3.x-supported architectures.
    • Updates to compatibility section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and Target Architecture.

Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
