github NVIDIA/cutlass v4.4.2
CUTLASS 4.4.2

6 hours ago

CuTe DSL

  • New features
    • CuTe DSL now supports Python 3.14 for both x86_64 and aarch64
    • Runtime Pointer/Tensor/FakeTensor now support cache_key, a stable, hashable representation that simplifies and improves caching of compiled functions.
  • Bug fixes and improvements
    • Fixed Hopper FMHA causal attention performance regression on CUDA toolkit 13.1 by
      optimizing mbarrier synchronization to avoid unnecessary convergence barriers.
    • Fixed a kernel-loading race condition when multiple GPUs are present in the same process in JAX.
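
The idea behind a cache_key for compiled function caching can be sketched in plain Python. This is only an illustration of the general technique, not the CuTe DSL API: TensorSpec, fake_compile, and get_kernel are hypothetical names, and the assumption that the key covers dtype, shape, and stride but not the data pointer is ours, not something the release notes state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class TensorSpec:
    """Hypothetical stand-in for a runtime tensor argument (not a CuTe DSL class)."""
    dtype: str
    shape: Tuple[int, ...]
    stride: Tuple[int, ...]
    data_ptr: int  # changes from call to call; deliberately excluded from the key

    @property
    def cache_key(self) -> tuple:
        # Assumption: only properties that would affect code generation
        # belong in the key, so it stays stable across different buffers.
        return (self.dtype, self.shape, self.stride)


_compiled: Dict[tuple, Callable[[], str]] = {}
compile_count = 0


def fake_compile(spec: TensorSpec) -> Callable[[], str]:
    """Placeholder for an expensive compilation step."""
    global compile_count
    compile_count += 1
    return lambda: f"kernel<{spec.dtype}, {spec.shape}>"


def get_kernel(spec: TensorSpec) -> Callable[[], str]:
    """Return a cached kernel when one exists for this cache_key."""
    key = spec.cache_key
    if key not in _compiled:
        _compiled[key] = fake_compile(spec)
    return _compiled[key]


a = TensorSpec("float16", (128, 64), (64, 1), data_ptr=0x1000)
b = TensorSpec("float16", (128, 64), (64, 1), data_ptr=0x2000)
c = TensorSpec("float32", (128, 64), (64, 1), data_ptr=0x3000)

kernel_a = get_kernel(a)
kernel_b = get_kernel(b)  # same key as a: reuses the cached kernel
kernel_c = get_kernel(c)  # different dtype: triggers a second compile
```

In this sketch, a and b share one compiled kernel because they differ only in data_ptr, while c compiles separately. A hashable key of this kind lets a plain dict act as the compilation cache.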

CUTLASS C++

  • Enabled Blackwell SM120f compilation of examples and exposed NVFP4/MX Grouped GEMM in the CUTLASS Profiler.
