NVIDIA/cutlass v2.8.0
CUTLASS 2.8

  • TF32x3: emulated single-precision using Tensor Cores (see the 3xTF32 sketch after this list)

    • 45+ TFLOPs on NVIDIA A100
    • GEMM SDK example (real)
    • COMPLEX GEMM SDK example (complex)
    • Implicit GEMM Convolution SDK example
  • Mainloop fusion for Convolution: convolution with fused per-channel scale-bias-relu (see the fusion sketch after this list)

    • Conv Fprop SDK example
    • Conv WGrad SDK example
    • cutlass::conv::device::ImplicitGemmConvolutionFusion
  • Grouped GEMM: similar to batched GEMM, but with a distinct problem size per group (see the grouped GEMM sketch after this list)

    • SDK example with a performance comparison against Batched Strided GEMM
    • cutlass::gemm::device::GemmGrouped
  • Implicit GEMM Convolution fusion supports staging the 1st convolution's output accumulator in shared memory on Turing. This allows more flexible warp tile sizes and lower register pressure.

  • Optimal performance using CUDA 11.5

  • Updates from the community (thanks!)

  • Deprecation announcement: CUTLASS plans to deprecate the following:

    • Maxwell and Pascal GPU architectures
    • Ubuntu 16.04
    • CUDA 10.2
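
A minimal host-side sketch of the 3xTF32 idea behind the TF32x3 item above: each fp32 operand is split into a "big" TF32 part plus a "small" TF32 correction, and three TF32 products approximate one fp32 product. The `to_tf32` helper and its truncation-based rounding are illustrative assumptions, not CUTLASS APIs; the real kernels perform these products on Tensor Cores.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Approximate TF32 rounding on the host by truncating the fp32 mantissa
// from 23 bits to the 10 bits that TF32 keeps.
float to_tf32(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFFE000u;            // clear the low 13 mantissa bits
  std::memcpy(&x, &bits, sizeof(bits));
  return x;
}

// One fp32 multiply emulated with three TF32 multiplies:
//   a*b ~= a_big*b_big + a_big*b_small + a_small*b_big
// Dropping the a_small*b_small term is what keeps the cost at 3 products.
float mul_3xtf32(float a, float b) {
  float a_big = to_tf32(a), a_small = to_tf32(a - a_big);
  float b_big = to_tf32(b), b_small = to_tf32(b - b_big);
  return a_big * b_big + a_big * b_small + a_small * b_big;
}

int main() {
  float a = 1.2345678f, b = 9.8765432f;
  printf("fp32  : %.9f\n", a * b);
  printf("tf32  : %.9f\n", to_tf32(a) * to_tf32(b));  // single TF32 product
  printf("3xtf32: %.9f\n", mul_3xtf32(a, b));         // much closer to fp32
  return 0;
}
```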

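A minimal sketch of the math that the convolution mainloop fusion folds together, assuming the fprop case where the per-channel scale-bias-relu is applied to the activations as they are consumed, so the intermediate tensor is never materialized. This 1D, single-output-channel host loop and its names are illustrative only and do not reflect the cutlass::conv::device::ImplicitGemmConvolutionFusion implementation.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

void fused_conv1d_fprop(const std::vector<float>& x,      // [W][C] activations
                        const std::vector<float>& scale,  // [C] per-channel scale
                        const std::vector<float>& bias,   // [C] per-channel bias
                        const std::vector<float>& filt,   // [R][C] filter taps
                        std::vector<float>& y,            // [W - R + 1] output
                        int W, int C, int R) {
  for (int p = 0; p < W - R + 1; ++p) {
    float acc = 0.f;
    for (int r = 0; r < R; ++r)
      for (int c = 0; c < C; ++c) {
        // scale-bias-relu fused into the load: no intermediate tensor is written
        float a = std::max(scale[c] * x[(p + r) * C + c] + bias[c], 0.f);
        acc += a * filt[r * C + c];
      }
    y[p] = acc;
  }
}

int main() {
  int W = 6, C = 2, R = 3;
  std::vector<float> x(W * C, 1.f), scale(C, 2.f), bias(C, -1.f), filt(R * C, 1.f);
  std::vector<float> y(W - R + 1);
  fused_conv1d_fprop(x, scale, bias, filt, y, W, C, R);
  printf("y[0] = %.1f\n", y[0]);  // relu(2*1 - 1) = 1 per tap, 3*2 taps -> 6.0
  return 0;
}
```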
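
A conceptual host-side reference for the Grouped GEMM semantics, not the cutlass::gemm::device::GemmGrouped API: every group carries its own (M, N, K), pointers, and leading dimensions, which a single batched strided GEMM cannot express, while GemmGrouped covers all groups from one kernel launch. The struct and function names below are illustrative assumptions.

```cpp
#include <cstdio>
#include <vector>

struct GemmProblem {
  int m, n, k;                 // per-group problem size (batched GEMM shares one size)
  const float *A, *B;          // row-major A (m x k), row-major B (k x n)
  float *C;                    // row-major C (m x n)
};

void grouped_gemm_reference(const std::vector<GemmProblem>& problems) {
  for (const GemmProblem& p : problems) {  // one independent GEMM per group
    for (int i = 0; i < p.m; ++i)
      for (int j = 0; j < p.n; ++j) {
        float acc = 0.f;
        for (int kk = 0; kk < p.k; ++kk)
          acc += p.A[i * p.k + kk] * p.B[kk * p.n + j];
        p.C[i * p.n + j] = acc;
      }
  }
}

int main() {
  // Two groups with distinct shapes, something batched strided GEMM cannot express.
  std::vector<float> A0(2 * 3, 1.f), B0(3 * 4, 1.f), C0(2 * 4);
  std::vector<float> A1(5 * 2, 2.f), B1(2 * 3, 2.f), C1(5 * 3);
  grouped_gemm_reference({{2, 4, 3, A0.data(), B0.data(), C0.data()},
                          {5, 3, 2, A1.data(), B1.data(), C1.data()}});
  printf("C0[0]=%.1f C1[0]=%.1f\n", C0[0], C1[0]);  // 3.0 and 8.0
  return 0;
}
```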