NVIDIA/cutlass v2.6.0 on GitHub

CUTLASS 2.6.0

Optimal performance when compiled with the CUDA 11.4 Toolkit
- Adopt the new L2 prefetch feature in cp.async and global load
Fused operators with GEMM and Convolution
- Fused broadcast in epilogue
- Fused partial reduction in epilogue
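To illustrate what fusing these steps into the epilogue means, here is a minimal host-side sketch. It is not the CUTLASS kernel or its API: the function name `gemm_fused_epilogue_sketch`, the per-row bias vector, and the `tile_n` tiling are all assumptions made for illustration. It assumes "broadcast" adds one vector element to every output in a row, and "partial reduction" leaves one partial row sum per column tile for a later pass to finish.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type: the GEMM output plus per-tile partial row sums.
struct FusedEpilogueResult {
  std::vector<float> d;        // m x n output, row-major
  std::vector<float> partial;  // m x tiles partial row sums, row-major
};

// Reference GEMM (row-major A: m x k, B: k x n) with a fused "epilogue":
//  - broadcast: bias[i] is added to every element of output row i
//  - partial reduction: each column tile of width tile_n accumulates its own
//    row sum, so the full reduction is left to a later pass
// tile_n must be greater than zero.
FusedEpilogueResult gemm_fused_epilogue_sketch(
    const std::vector<float>& a, const std::vector<float>& b,
    const std::vector<float>& bias,
    std::size_t m, std::size_t n, std::size_t k, std::size_t tile_n) {
  const std::size_t tiles = (n + tile_n - 1) / tile_n;
  FusedEpilogueResult out{std::vector<float>(m * n, 0.0f),
                          std::vector<float>(m * tiles, 0.0f)};
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        acc += a[i * k + p] * b[p * n + j];
      }
      const float v = acc + bias[i];            // fused broadcast
      out.d[i * n + j] = v;
      out.partial[i * tiles + j / tile_n] += v; // fused partial reduction
    }
  }
  return out;
}
```

Doing both steps while the accumulators are still in hand avoids a second pass over the output matrix, which is the usual motivation for epilogue fusion.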
64-bit tensor strides and leading dimensions supported for GEMMs
Affine rank=2 matrix layouts
- Row stride and column stride for matrices using cutlass::layout::AffineRank2
- Supports FP64 Tensor Core and SIMT GEMM
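As a rough picture of an affine rank-2 layout with 64-bit strides, the sketch below maps a (row, column) coordinate to a linear offset using independent row and column strides. `AffineRank2Sketch` and `read_strided` are hypothetical stand-ins written for this note, not the interface of `cutlass::layout::AffineRank2`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an affine rank-2 layout: element (row, col) lives
// at row * row_stride + col * col_stride. Strides are 64-bit so offsets can
// exceed the 32-bit range.
struct AffineRank2Sketch {
  std::int64_t row_stride;
  std::int64_t col_stride;

  std::int64_t offset(std::int64_t row, std::int64_t col) const {
    return row * row_stride + col * col_stride;
  }
};

// Views every other column of a row-major buffer with 6 columns as a matrix
// with 3 columns: row stride 6, column stride 2.
inline float read_strided(const std::vector<float>& buf,
                          std::int64_t row, std::int64_t col) {
  const AffineRank2Sketch layout{6, 2};
  const std::int64_t idx = layout.offset(row, col);
  assert(idx >= 0 && idx < static_cast<std::int64_t>(buf.size()));
  return buf[static_cast<std::size_t>(idx)];
}
```

With both strides free, neither dimension has to be contiguous, so one description covers row-major, column-major, and strided sub-views of a larger buffer.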
Batched GEMV preview implementation
New strided Dgrad implementation
- Faster than the previous implementation, cutting redundant math by 4x
- Uses new Dy and w analytic iterators with the existing cutlass::conv::device::ImplicitGemmConvolution interface
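One plausible reading of the 4x figure, assuming a 2x2 convolution stride: for a given input-gradient pixel, only filter taps whose offsets line up with the stride actually touch a Dy element, so an implementation that visits every tap does roughly four times the useful work. The sketch below only counts taps. `count_dgrad_taps` is a hypothetical helper, not part of CUTLASS, and it ignores the upper bounds of Dy for simplicity.

```cpp
// Tap counts for one input-gradient pixel.
struct TapCount {
  int visited;  // all filter taps (r, s)
  int useful;   // taps that map onto a real Dy element
};

// For input-gradient pixel (h, w), filter tap (r, s) reads
// Dy[(h + pad - r) / stride][(w + pad - s) / stride], and contributes only
// when both numerators are non-negative multiples of the stride.
// Assumption: Dy is large enough that upper-bound checks can be skipped.
TapCount count_dgrad_taps(int h, int w, int filter_r, int filter_s,
                          int stride, int pad) {
  TapCount count{0, 0};
  for (int r = 0; r < filter_r; ++r) {
    for (int s = 0; s < filter_s; ++s) {
      ++count.visited;
      const int ph = h + pad - r;
      const int pw = w + pad - s;
      if (ph >= 0 && pw >= 0 && ph % stride == 0 && pw % stride == 0) {
        ++count.useful;
      }
    }
  }
  return count;
}
```

For a 4x4 filter with stride 2, a pixel far from the border sees 16 taps visited but only 4 useful, which matches the 4x ratio under these assumptions.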
Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores)
- Updates to quaternion.h and functional.h
- SDK Example for GEMM and Convolution
- Unit tests for GEMM and Convolution
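For readers unfamiliar with quaternion-valued GEMM, the sketch below shows the arithmetic involved: each multiply-accumulate in the inner loop is a Hamilton product followed by a component-wise add. `Quat`, `quat_mul`, and `quat_gemm_sketch` are hypothetical names for this illustration and do not reflect the types in quaternion.h.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical quaternion type: w + x*i + y*j + z*k.
struct Quat {
  float w, x, y, z;
};

// Hamilton product; note that it is not commutative.
inline Quat quat_mul(const Quat& a, const Quat& b) {
  return Quat{
      a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
      a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
      a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
      a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w};
}

inline Quat quat_add(const Quat& a, const Quat& b) {
  return Quat{a.w + b.w, a.x + b.x, a.y + b.y, a.z + b.z};
}

// Reference quaternion GEMM, row-major: C(i, j) = sum over p of A(i, p) * B(p, j).
// Operand order in quat_mul matters because the product is not commutative.
std::vector<Quat> quat_gemm_sketch(const std::vector<Quat>& a,
                                   const std::vector<Quat>& b,
                                   std::size_t m, std::size_t n, std::size_t k) {
  std::vector<Quat> c(m * n, Quat{0.0f, 0.0f, 0.0f, 0.0f});
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      Quat acc{0.0f, 0.0f, 0.0f, 0.0f};
      for (std::size_t p = 0; p < k; ++p) {
        acc = quat_add(acc, quat_mul(a[i * k + p], b[p * n + j]));
      }
      c[i * n + j] = acc;
    }
  }
  return c;
}
```

Each Hamilton product costs 16 real multiplies, so the work per output element is considerably higher than in a real-valued GEMM.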
Many improvements to the epilogue
- Option to not fully unroll the epilogue, which reduces code size and improves performance when using complicated elementwise operations
- Performance improvement for FP16 tensor core kernels
- Bug fixes
Enhanced Clang support: Clang 13 combined with CUDA 11.4 can build and run kernels for Pascal and Ampere
Updated minimum CUDA Toolkit requirement to 10.2
- CUDA 11.4 Toolkit recommended
Corrections and bug fixes reported by the CUTLASS community
- Thank you for filing these issues!
