CUTLASS 4.5.0


CuTe DSL

  • New features

    • New Block API block_copy() to simplify TMA and S2T copies. With block_copy(), users can ignore the details of multicast and 2-CTA partitioning for TMA and no longer need to invoke tma_partition(); likewise, the bulk of S2T initialization can be removed.
    • MXF8F6F4 mixed precision support
      • BlockScaled MMA now supports mixed MXF8 x MXF4 or MXF8 x MXF6 operands
    • Block Scaled MMA for SM120 now works on Spark
    • EFC broadcast semantics support
      • EFC epilogue functions can now broadcast and remap tensor modes via C.remap_modes[:, 0, 1] subscript syntax (where : marks a broadcast dimension and integers select source mode indices). This covers scalar broadcast, row/column broadcast, and arbitrary mode permutations (e.g. transpose). The PyTorch reference evaluator mirrors the same transformations (see the sketch after this list).
    • Initial linter support: Improved type hints on CuTe DSL APIs to support static type checkers like MyPy
    • dataclasses.dataclass is now supported for JIT compilation and cute.compile, for both the plain and tvm-ffi paths
    • cute.copy now supports user-specified loop unrolling
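
The EFC mode-remapping rules above can be mirrored in plain PyTorch, as the note on the reference evaluator suggests. The sketch below only illustrates those semantics; apply_remap, the shapes, and the spec encoding are made up for the example and are not the CuTe DSL API:

```python
import torch

def apply_remap(src: torch.Tensor, spec, out_shape):
    # spec mirrors the subscript described above:
    # slice(None) (i.e. ':') marks a broadcast dimension, integers select source modes.
    perm = [s for s in spec if isinstance(s, int)]
    t = src.permute(*perm) if perm else src          # reorder the selected source modes
    for i, s in enumerate(spec):
        if isinstance(s, slice):
            t = t.unsqueeze(i)                       # insert a size-1 dim to broadcast over
    return t.expand(out_shape)

src = torch.randn(8, 4)
# analogue of C.remap_modes[:, 0, 1]: broadcast over the first output mode,
# feed the remaining modes from source modes 0 and 1
print(apply_remap(src, [slice(None), 0, 1], (16, 8, 4)).shape)  # torch.Size([16, 8, 4])
# analogue of a transpose via mode permutation, e.g. C.remap_modes[1, 0]
print(apply_remap(src, [1, 0], (4, 8)).shape)                   # torch.Size([4, 8])
```
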
  • Bug fixing and improvements

    • Improved source code correlation for profiling/debugging
    • Fixed an aarch64 segfault issue with tvm-ffi
    • Re-organized CuTe DSL examples/tutorials for better discoverability
  • More examples of authoring peak-performance kernels

    • MoE examples
      • A new style of grouped GEMM that aligns with torch's grouped_mm and scaled_grouped_mm interfaces (a plain-PyTorch reference sketch follows this list).
      • Per-expert tensormap descriptors are set up by a cheap helper kernel (~2 µs) to avoid the long latency of tile switching, so the kernel structure is much closer to a normal GEMM.
      • Compared to torch_210_cu13 on B200, very few problems show worse performance:
        • mxfp8_2dx3d: avg 1.29x speedup
        • mxfp8_2dx2d: avg 1.41x speedup
        • nvfp4_2dx3d: avg 1.11x speedup
        • nvfp4_2dx2d: avg 1.12x speedup (worst case 0.98x)
        • bf16_2dx3d: avg 1.15x speedup (worst case 0.98x)
        • bf16_2dx2d: avg 1.17x speedup (worst case 0.96x)
        • Note: performance is measured with the torch profiler; this implementation's timing includes the helper kernel plus the main kernel, while torch's includes its setup kernel and its cutlass_cpp main kernel.
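
For reference, here is a plain-PyTorch sketch of the "2dx3d" grouped-GEMM semantics the new example aligns with: activations for all experts are concatenated along the token dimension (2D) while weights are stored per expert (3D). The helper name, shapes, and the offs convention are illustrative assumptions, not the kernel's or torch's actual API:

```python
import torch

def grouped_mm_2dx3d_ref(a: torch.Tensor, b: torch.Tensor, offs: torch.Tensor) -> torch.Tensor:
    """a: (total_tokens, K); b: (num_experts, K, N);
    offs: cumulative end offset of each expert's token range."""
    out = a.new_empty(a.shape[0], b.shape[-1])
    start = 0
    for e in range(b.shape[0]):
        end = int(offs[e])
        out[start:end] = a[start:end] @ b[e]   # one GEMM per expert
        start = end
    return out

a = torch.randn(96, 64)                # tokens for 3 experts, concatenated (2D)
b = torch.randn(3, 64, 128)            # per-expert weights (3D)
offs = torch.tensor([32, 64, 96])      # token-range ends per expert
print(grouped_mm_2dx3d_ref(a, b, offs).shape)   # torch.Size([96, 128])
```
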
  • API changes

    • ab_dtype is deprecated in make_trivial_tiled_mma and make_blockscaled_trivial_tiled_mma from blackwell_helpers.py. Please specify a_dtype and b_dtype separately instead.
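
A minimal before/after sketch of this deprecation. Only the a_dtype/b_dtype change is taken from the note above; the import alias and the remaining arguments are assumptions and are elided:

```python
import cutlass
import cutlass.utils.blackwell_helpers as helpers  # assumed import path for blackwell_helpers.py

# Deprecated: a single dtype for both operands
# tiled_mma = helpers.make_trivial_tiled_mma(ab_dtype=cutlass.Float8E4M3FN, ...)

# New: specify operand dtypes separately (also allows mixed-precision A/B)
# tiled_mma = helpers.make_trivial_tiled_mma(
#     a_dtype=cutlass.Float8E4M3FN,
#     b_dtype=cutlass.Float8E4M3FN,
#     ...,  # remaining arguments unchanged
# )
```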

CUTLASS C++

  • Add 2SM MMA instruction support to mixed TMA+CpAsync SM100 vanilla GEMM kernels.
    • Mixed TMA+CpAsync can now accept static but non-trivial cluster shapes.
    • Uses TMA multicast for the A tile when the cluster size along the N mode is non-trivial.
    • Uses an additional barrier (mma_trampoline_barrier) to track cp.async arrivals in both CTAs.
    • Changes included in example 92.
  • Add support for 128x32xK and 128x64xK tile sizes for SM120 blockscaled MMA collective builders, yielding up to 30% performance improvement on Blackwell SM121-related kernels.
  • Add support for static loads to tensor memory, included in example 77.
  • Use 64-bit adds for SM100 MMA descriptor offsets and reduce move instructions for improved code generation.
  • Add example 95 demonstrating green context SM partitioning.
    • Enables launching GEMM on stream with partial SM allocation.
  • Add Snake activation functor for EVT.
  • Fix some kernel issues:
    • Fix l2_capacity=0 handling in Blackwell SM100/SM120 kernel templates
    • Fix CUTLASS clang build issues
    • Fix atomicCAS read-modify-write loop in ConstSubbyteReference
    • Replace __nv_atomic_load_n with volatile for CUDA 11.4 compatibility in subbyte reference
    • Remove PipelineStorage shadowing in SM100 complex epilogue
    • Fix build issue in SM90 epilogue fusion visitor TMA warpspecialized
  • Fix some profiler issues:
    • Add missing reference kernels for blockwise GEMM profiler.
