github NVIDIA/cutlass v2.6.0
CUTLASS 2.6.0

3 years ago

  • Optimal performance when compiled with the CUDA 11.4 Toolkit
    • Adopt the new L2 prefetch feature in cp.async and global load
  • Fused operators with GEMM and Convolution
    • Fused broadcast in epilogue
    • Fused partial reduction in epilogue
  • 64b tensor strides and leading dimensions support for GEMMs
  • Affine rank=2 matrix layouts
    • Row stride and column stride for matrices using cutlass::layout::AffineRank2
    • Supported for FP64 tensor core and SIMT GEMM
  • Batched GEMV preview implementation
  • New strided Dgrad implementation
    • Faster than the previous implementation, cutting redundant math by 4x
    • Supports the new Dy and w analytic iterators and the existing cutlass::conv::device::ImplicitGemmConvolution interface
  • Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores)
    • Updates to quaternion.h and functional.h
    • SDK Example for GEMM and Convolution
    • Unit tests for GEMM and Convolution
  • Many improvements to the epilogue
    • Option to not fully unroll the epilogue, which reduces code size and improves performance when using complicated elementwise operations
    • Performance improvement for FP16 tensor core kernels
    • Bug fixes
  • Enhanced Clang support: the combination of Clang 13 and CUDA 11.4 can build and run kernels for Pascal and Ampere
  • Updated minimum CUDA Toolkit requirement to 10.2
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!
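
The affine rank=2 layout and 64b stride items above can be illustrated with a minimal sketch. It is not the CUTLASS implementation: AffineRank2Sketch, row_major and column_major are hypothetical names invented for this example, and the only assumptions are that an affine rank-2 layout maps a (row, column) coordinate to a linear offset through two independent strides, and that 64-bit strides keep that offset from overflowing on very large tensors.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an affine rank-2 layout (NOT the CUTLASS class).
// Both strides are 64-bit so that offsets past 2^31 elements stay exact.
struct AffineRank2Sketch {
    std::int64_t stride_row;  // elements between consecutive rows
    std::int64_t stride_col;  // elements between consecutive columns

    // Linear offset of element (row, col): row * stride_row + col * stride_col.
    std::int64_t offset(std::int64_t row, std::int64_t col) const {
        return row * stride_row + col * stride_col;
    }
};

// Row-major and column-major are the special cases where one stride is 1
// and the other is the leading dimension `ld`.
inline AffineRank2Sketch row_major(std::int64_t ld) { return {ld, 1}; }
inline AffineRank2Sketch column_major(std::int64_t ld) { return {1, ld}; }
```

With strides {6, 2}, element (1, 2) lands at offset 1*6 + 2*2 = 10. In CUTLASS itself the layout is cutlass::layout::AffineRank2; consult its header for the actual constructor and stride types.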
