github NVIDIA/cutlass v3.3.0
CUTLASS 3.3.0

9 months ago
  • New mixed-input Hopper GEMMs covering 16-bit x 8-bit input types with optimal performance.
  • New mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operandB {fp16, bf16} x {s8, u8} and upcast on operandA {s8, u8} x {fp16, bf16}. These kernels also include fast numeric conversion recipes and warp-level shuffles to achieve optimal performance.
  • New Copy Async based Hopper GEMMs, which support input tensors aligned to less than 16B (across s8/fp8/fp16/bf16/tf32 types) with optimal performance. As part of this, new kernel schedules and the Copy Ops SM80_CP_ASYNC_CACHE_* were also added.
  • EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
  • Various subbyte enhancements like tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
  • Support for Clang as a host compiler.
  • Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface.
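
To make the mixed-input items above concrete, here is a minimal host-side reference sketch of what "upcast on operandB" means numerically: the 8-bit operand is converted to the wider type before each multiply-accumulate. This is not CUTLASS code and uses no CUTLASS API. The function name mixed_input_gemm_tn is hypothetical, float stands in for fp16/bf16 (standard C++ has no portable half type), and TN is assumed to mean A row-major and B column-major, so both operands are contiguous along K.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference routine (not part of CUTLASS).
// Computes C = A * B with mixed input types:
//   A: M x K, row-major, float (stand-in for fp16/bf16)
//   B: K x N, column-major, int8, so element (k, n) lives at B[n * K + k]
//   C: M x N, row-major, float
std::vector<float> mixed_input_gemm_tn(const std::vector<float>& A,
                                       const std::vector<std::int8_t>& B,
                                       std::size_t M, std::size_t N,
                                       std::size_t K) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                const float a = A[m * K + k];
                // Upcast the narrow operand to the wide type before the multiply.
                const float b = static_cast<float>(B[n * K + k]);
                acc += a * b;
            }
            C[m * N + n] = acc;
        }
    }
    return C;
}
```

A routine like this is only useful as a correctness check against a GPU kernel's output; the fast conversion recipes and warp-level shuffles mentioned in the notes are performance techniques that a scalar loop does not model.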
