CUTLASS 3.5.1

  • Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code.
  • Exposure of L2 cache_hints in TMA copy atoms.
  • Exposure of raster order and tile swizzle extent in the CUTLASS library profiler and in example 48.
  • TMA-store-based and EVT-supported epilogues for Hopper pointer-array batched kernels.
  • A new GemmSparseUniversal API for CUTLASS 2.x Ampere kernels that enables serial and parallel split-K for sparse tensor cores, along with new tiny tile sizes to better support LLM inference (a split-K reference sketch follows this list).
  • CUDA host adapter extensions to support the driver APIs for TMA descriptor construction (see the descriptor sketch after this list).
  • Inclusion of more Hopper fprop, dgrad, and wgrad convolution kernels in the CUTLASS library and profiler.
  • Support for residual add (beta != 0) in convolution kernels (the epilogue arithmetic is sketched after this list).
  • A new convolution epilogue for CUTLASS 2.x to support non-packed NHWC output.
  • A refactor of include files throughout CUTLASS core directories to reduce circular dependencies, along with tests to guard against them.
  • A guide for setting up VSCode to work well with CUTLASS, and an expanded code style guide.
  • Better support for MSVC as a host compiler.
  • Many performance optimizations, improvements, and bug fixes, including fixes for FlashAttention-2.
  • Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
  • NOTICE:
    • The upcoming CUTLASS 3.6 release will include a breaking refactor of the CUTLASS 3.x convolution kernel::ConvUniversal API to bring it in line with gemm::GemmUniversal. After this, the 3.x convolution API will no longer be considered a beta API.
    • The upcoming CUTLASS 3.6 release will include a breaking refactor of the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.
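
A split-K reference, as mentioned in the GemmSparseUniversal item above: the sketch below is plain host-side C++ that illustrates only the decomposition. The function name split_k_gemm_reference and the dense row-major layouts are assumptions for illustration; the actual kernels run on sparse tensor cores and reduce partial results either serially or with a separate parallel reduction kernel.

```cpp
// Hypothetical host-side reference for serial split-K: C = A (MxK) * B (KxN).
// Each slice computes a partial product over its own K range; the partials
// are then reduced into C.
#include <algorithm>
#include <cstddef>
#include <vector>

void split_k_gemm_reference(int M, int N, int K, int split_k_slices,
                            std::vector<float> const& A,   // row-major M x K
                            std::vector<float> const& B,   // row-major K x N
                            std::vector<float>& C) {       // row-major M x N
  std::vector<float> partial(static_cast<size_t>(split_k_slices) * M * N, 0.0f);
  int k_per_slice = (K + split_k_slices - 1) / split_k_slices;

  // Each slice owns a contiguous K range [k_begin, k_end).
  for (int slice = 0; slice < split_k_slices; ++slice) {
    int k_begin = slice * k_per_slice;
    int k_end   = std::min(K, k_begin + k_per_slice);
    for (int m = 0; m < M; ++m) {
      for (int n = 0; n < N; ++n) {
        for (int k = k_begin; k < k_end; ++k) {
          partial[(static_cast<size_t>(slice) * M + m) * N + n] +=
              A[static_cast<size_t>(m) * K + k] * B[static_cast<size_t>(k) * N + n];
        }
      }
    }
  }

  // Reduce the per-slice partials (serial split-K does this within the GEMM;
  // parallel split-K launches a separate reduction kernel).
  C.assign(static_cast<size_t>(M) * N, 0.0f);
  for (int slice = 0; slice < split_k_slices; ++slice) {
    for (size_t i = 0; i < C.size(); ++i) {
      C[i] += partial[static_cast<size_t>(slice) * M * N + i];
    }
  }
}
```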
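The host adapter item above refers to the CUDA driver's TMA descriptor construction path. The sketch below shows that underlying driver call, cuTensorMapEncodeTiled, for a row-major 2-D fp16 tensor; the helper name make_2d_fp16_tma_desc, the 64x64 box, and the 128B swizzle are illustrative assumptions, not what CUTLASS itself emits.

```cpp
// Sketch of TMA descriptor construction via the CUDA driver API (CUDA 12.x).
// Tensor shape, box size, and swizzle mode are illustrative assumptions only.
#include <cuda.h>

CUresult make_2d_fp16_tma_desc(CUtensorMap* desc, void* gmem_ptr,
                               cuuint64_t rows, cuuint64_t cols) {
  // Global extents, innermost dimension first, and the byte stride of every
  // dimension except the innermost (must be a multiple of 16 bytes).
  cuuint64_t global_dim[2]    = {cols, rows};
  cuuint64_t global_stride[1] = {cols * 2};   // row pitch in bytes (fp16)

  // Shape of the box (tile) each TMA copy moves, plus element strides inside
  // the box. 64 fp16 elements = 128 bytes, matching the 128B swizzle below.
  cuuint32_t box_dim[2]        = {64, 64};
  cuuint32_t element_stride[2] = {1, 1};

  return cuTensorMapEncodeTiled(
      desc,
      CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
      /*tensorRank=*/2,
      gmem_ptr,
      global_dim,
      global_stride,
      box_dim,
      element_stride,
      CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_128B,
      CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
}
```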
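The residual add item above amounts to D = alpha * conv(A, B) + beta * C: when beta != 0, the epilogue reads the source tensor C and adds it to the scaled accumulator. The snippet below is only a plain C++ reference of that arithmetic with a hypothetical function name; it is not the CUTLASS epilogue API.

```cpp
// Reference for the residual-add epilogue: D = alpha * accumulator + beta * C.
// With beta == 0 the source tensor is never read; beta != 0 enables the
// residual add.
#include <cstddef>

void residual_epilogue_reference(float alpha, float beta,
                                 float const* accumulator,  // conv(A, B) result
                                 float const* source,       // residual tensor C
                                 float* output,             // destination D
                                 std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    output[i] = alpha * accumulator[i] + beta * source[i];
  }
}
```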
