NVIDIA/cutlass v3.3.0 on GitHub

New Mixed-input Hopper GEMMs covering 16-bit x 8-bit input types with optimal performance.
New Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operandB {fp16, bf16} x {s8, u8} and upcast on operandA {s8, u8} x {fp16, bf16}. These kernels also include fast numeric conversion recipes and warp-level shuffles to achieve optimal performance.
New Copy Async based Hopper GEMMs, which support input tensors with less than 16B alignment (across s8/fp8/fp16/bf16/tf32 types) with optimal performance. As part of this, new kernel schedules and the Copy Ops SM80_CP_ASYNC_CACHE_* were also added.
EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
Various subbyte enhancements like tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
Support for Clang as a host compiler.
Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface.
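
To illustrate what "upcast on operandB" means numerically for the mixed-input GEMMs listed above, here is a minimal pure-Python reference sketch. It is not CUTLASS code and says nothing about how the kernels are implemented: the function names `to_fp16` and `mixed_input_gemm_ref` are hypothetical, fp16 rounding is emulated with the half-precision format of the standard `struct` module, and accumulating in Python floats is an assumption made for simplicity.

```python
import struct


def to_fp16(x):
    """Round a Python number to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", float(x)))[0]


def mixed_input_gemm_ref(a_fp16, b_s8):
    """Reference C = A @ upcast(B): A holds M x K fp16 values, B holds K x N int8 values."""
    m, k, n = len(a_fp16), len(b_s8), len(b_s8[0])
    # Upcast the narrow operand to fp16 before the multiply.
    # Every int8 value is exactly representable in fp16, so this step is lossless.
    b_up = [[to_fp16(v) for v in row] for row in b_s8]
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0  # wider accumulator (an assumption of this sketch)
            for p in range(k):
                acc += a_fp16[i][p] * b_up[p][j]
            c[i][j] = acc
    return c


a = [[to_fp16(0.5), to_fp16(-1.25)], [to_fp16(2.0), to_fp16(0.0)]]
b = [[-128, 3], [4, 127]]
c = mixed_input_gemm_ref(a, b)
print(c)
```

Swapping the roles of the two operands gives the "upcast on operandA" case; the arithmetic is the same, only the narrow operand changes sides.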

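The idea behind the RELU with Aux bitmap item can be sketched as follows: the forward pass stores one bit per element recording whether the input was positive, and the backward pass (dRELU) uses that bitmap to mask the incoming gradient instead of re-reading the forward input. This is a pure-Python illustration under stated assumptions, not the SM90 EVT implementation: the function names are hypothetical, and the LSB-first bit packing is an arbitrary choice that may differ from the layout CUTLASS uses.

```python
def relu_with_bitmap(x):
    """Return (relu(x), bitmap), where bit i of the bitmap is 1 iff x[i] > 0."""
    out = [v if v > 0 else 0.0 for v in x]
    bitmap = bytearray((len(x) + 7) // 8)  # one bit per element, rounded up to whole bytes
    for i, v in enumerate(x):
        if v > 0:
            bitmap[i // 8] |= 1 << (i % 8)  # LSB-first packing (assumption of this sketch)
    return out, bytes(bitmap)


def drelu_from_bitmap(grad_out, bitmap):
    """Pass the gradient through where the stored bit is set, zero it elsewhere."""
    return [
        g if (bitmap[i // 8] >> (i % 8)) & 1 else 0.0
        for i, g in enumerate(grad_out)
    ]


x = [1.5, -2.0, 0.0, 3.0, -0.5, 4.0, 7.0, -1.0, 2.0]
y, bits = relu_with_bitmap(x)
dx = drelu_from_bitmap([1.0] * len(x), bits)
print(y, list(bits), dx)
```

Storing a bitmap rather than the full forward tensor reduces the auxiliary store to one bit per element, which is the motivation for keeping it alongside the fused epilogue.
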