CUTLASS 2.9.0
- First layer Convolution kernels specialized for small channel counts and reduced alignment
  - Few channels specialization for reduced alignment capabilities
  - Fixed channels further specialized when channel count perfectly matches the access vector size
  - Unit tests
  - Python-based instance emitter in the CUTLASS Library and support in the Profiler
- BLAS3 operators accelerated by Tensor Cores
  - Supported types: f32, cf32, f64, cf64
  - HERK with emitter
  - SYRK with emitter
  - SYMM with emitter
  - TRMM with emitter
  - Unit tests
- CUTLASS Python demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using CUDA Python
  - Python-based runtime interoperable with existing emitters
- GEMM + Softmax example
- Optimal performance using CUDA 11.6u2
- Updates and bugfixes from the community (thanks!)
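
For readers unfamiliar with the BLAS3 names above, here is a minimal reference sketch of what the SYRK and HERK rank-k updates compute, following the standard BLAS definitions (SYRK: C = alpha * A * A^T + beta * C; HERK: C = alpha * A * A^H + beta * C with real alpha and beta). It is plain Python on nested lists and is not CUTLASS code: the function names `syrk_reference` and `herk_reference` are hypothetical, and unlike a real BLAS routine the sketch fills the whole output matrix rather than only the triangle selected by a fill-mode argument.

```python
# Reference (non-accelerated) definitions of two BLAS3 rank-k updates.
# Illustrative only: names and signatures are assumptions, not a CUTLASS API.

def syrk_reference(alpha, a, beta, c):
    """Return alpha * A * A^T + beta * C for an n-by-k matrix A and n-by-n C."""
    n, k = len(a), len(a[0])
    return [
        [alpha * sum(a[i][p] * a[j][p] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(n)
    ]


def herk_reference(alpha, a, beta, c):
    """Return alpha * A * A^H + beta * C, where A^H is the conjugate transpose.

    alpha and beta are expected to be real, as in the BLAS definition.
    """
    n, k = len(a), len(a[0])
    return [
        [alpha * sum(a[i][p] * a[j][p].conjugate() for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    zeros = [[0, 0], [0, 0]]
    # Symmetric result for a real input matrix.
    print(syrk_reference(1, [[1, 2], [3, 4]], 0, zeros))
    # Hermitian result for a complex input matrix.
    print(herk_reference(1.0, [[1j, 1], [2, 1j]], 0.0, zeros))
```

The SYRK result is symmetric and the HERK result is Hermitian, which is why production implementations only need to compute and store one triangle of C.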