CUTLASS 3.5.1

  • Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code.
  • Exposure of L2 cache_hints in TMA copy atoms.
  • Exposure of raster order and tile swizzle extent in the CUTLASS library profiler and in example 48.
  • TMA-store-based and EVT-supported epilogues for Hopper pointer-array batched kernels.
  • A new GemmSparseUniversal API for CUTLASS 2.x Ampere kernels that enables serial and parallel split-K for sparse tensor cores, along with new tiny tile sizes to better support LLM inference (a split-K reference sketch follows this list).
  • CUDA host adapter extensions to support the driver APIs for TMA descriptor construction (see the descriptor sketch after this list).
  • Inclusion of more Hopper fprop, dgrad, and wgrad convolution kernels in the CUTLASS library and profiler.
  • Support for residual add (beta != 0) in convolution kernels (the epilogue arithmetic is sketched after this list).
  • A new convolution epilogue for CUTLASS 2.x to support non-packed NHWC output.
  • A refactor of include files throughout CUTLASS core directories to reduce circular dependencies, along with tests to guard against them.
  • A guide for setting up VSCode to work well with CUTLASS, and an expanded code style guide.
  • Better support for MSVC as a host compiler.
  • Many performance optimizations, improvements, and bug fixes, including fixes for FlashAttention-2.
  • Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
  • NOTICE:
    • The upcoming CUTLASS 3.6 release will include a breaking refactor of the CUTLASS 3.x convolution kernel::ConvUniversal API to bring it in line with gemm::GemmUniversal. After this, the 3.x convolution API will no longer be considered a beta API.
    • The upcoming CUTLASS 3.6 release will include a breaking refactor of the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.
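
A split-K reference, as mentioned in the GemmSparseUniversal item above: the sketch below is plain host-side C++ that illustrates only the decomposition. The function name split_k_gemm_reference and the dense row-major layouts are assumptions for illustration; the actual kernels run on sparse tensor cores and reduce partial results either serially or with a separate parallel reduction kernel.

```cpp
// Hypothetical host-side reference for serial split-K: C = A (MxK) * B (KxN).
// Each slice computes a partial product over its own K range; the partials
// are then reduced into C.
#include <algorithm>
#include <cstddef>
#include <vector>

void split_k_gemm_reference(int M, int N, int K, int split_k_slices,
                            std::vector<float> const& A,   // row-major M x K
                            std::vector<float> const& B,   // row-major K x N
                            std::vector<float>& C) {       // row-major M x N
  std::vector<float> partial(static_cast<size_t>(split_k_slices) * M * N, 0.0f);
  int k_per_slice = (K + split_k_slices - 1) / split_k_slices;

  // Each slice owns a contiguous K range [k_begin, k_end).
  for (int slice = 0; slice < split_k_slices; ++slice) {
    int k_begin = slice * k_per_slice;
    int k_end   = std::min(K, k_begin + k_per_slice);
    for (int m = 0; m < M; ++m) {
      for (int n = 0; n < N; ++n) {
        for (int k = k_begin; k < k_end; ++k) {
          partial[(static_cast<size_t>(slice) * M + m) * N + n] +=
              A[static_cast<size_t>(m) * K + k] * B[static_cast<size_t>(k) * N + n];
        }
      }
    }
  }

  // Reduce the per-slice partials (serial split-K does this within the GEMM;
  // parallel split-K launches a separate reduction kernel).
  C.assign(static_cast<size_t>(M) * N, 0.0f);
  for (int slice = 0; slice < split_k_slices; ++slice) {
    for (size_t i = 0; i < C.size(); ++i) {
      C[i] += partial[static_cast<size_t>(slice) * M * N + i];
    }
  }
}
```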
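The host adapter item above refers to the CUDA driver's TMA descriptor construction path. The sketch below shows that underlying driver call, cuTensorMapEncodeTiled, for a row-major 2-D fp16 tensor; the helper name make_2d_fp16_tma_desc, the 64x64 box, and the 128B swizzle are illustrative assumptions, not what CUTLASS itself emits.

```cpp
// Sketch of TMA descriptor construction via the CUDA driver API (CUDA 12.x).
// Tensor shape, box size, and swizzle mode are illustrative assumptions only.
#include <cuda.h>

CUresult make_2d_fp16_tma_desc(CUtensorMap* desc, void* gmem_ptr,
                               cuuint64_t rows, cuuint64_t cols) {
  // Global extents, innermost dimension first, and the byte stride of every
  // dimension except the innermost (must be a multiple of 16 bytes).
  cuuint64_t global_dim[2]    = {cols, rows};
  cuuint64_t global_stride[1] = {cols * 2};   // row pitch in bytes (fp16)

  // Shape of the box (tile) each TMA copy moves, plus element strides inside
  // the box. 64 fp16 elements = 128 bytes, matching the 128B swizzle below.
  cuuint32_t box_dim[2]        = {64, 64};
  cuuint32_t element_stride[2] = {1, 1};

  return cuTensorMapEncodeTiled(
      desc,
      CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
      /*tensorRank=*/2,
      gmem_ptr,
      global_dim,
      global_stride,
      box_dim,
      element_stride,
      CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_128B,
      CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
}
```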
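The residual add item above amounts to D = alpha * conv(A, B) + beta * C: when beta != 0, the epilogue reads the source tensor C and adds it to the scaled accumulator. The snippet below is only a plain C++ reference of that arithmetic with a hypothetical function name; it is not the CUTLASS epilogue API.

```cpp
// Reference for the residual-add epilogue: D = alpha * accumulator + beta * C.
// With beta == 0 the source tensor is never read; beta != 0 enables the
// residual add.
#include <cstddef>

void residual_epilogue_reference(float alpha, float beta,
                                 float const* accumulator,  // conv(A, B) result
                                 float const* source,       // residual tensor C
                                 float* output,             // destination D
                                 std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    output[i] = alpha * accumulator[i] + beta * source[i];
  }
}
```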
