github NVIDIA/cutlass v3.6.0
CUTLASS 3.6.0

9 days ago
  • Hopper structured sparse GEMM.
    • FP16
    • FP8
    • INT8
    • TF32
  • A refactor of the CUTLASS 3.x convolution kernel::ConvUniversal API to bring it in line with gemm::GemmUniversal. The 3.x convolution API is no longer considered a beta API.
  • An improved mixed input GEMM and a lookup table implementation for INT4xFP8 scale-only mode.
  • EVT nodes for Top-K selection and softmax, and a GEMM example using them.
  • Programmatic Dependent Launch (PDL), which leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding documentation.
  • A new debugging tool, synclog, for dumping all synchronization events from within a kernel to a file. See the synclog documentation for details.
  • A new TMA-enabled epilogue for grouped GEMM that significantly improves performance, along with EVT support for it.
  • A SIMT-enabled pointer-array epilogue.
  • A new Ping-Pong kernel schedule for Grouped GEMM and some other optimizations.
  • A new instantiation strategy for CUTLASS profiler kernels along with improved documentation for instantiation level in CUTLASS profiler.
  • New hardware support for comparisons and computations of cutlass::bfloat16_t.
  • Fixed use of isnan on Windows for half_t.
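
To make the Top-K selection and softmax item above more concrete, here is a minimal host-side sketch of what such a fused epilogue computes for one output row. It assumes the operation keeps the k largest entries of the row, normalizes only those with a softmax, and zeroes the rest; that behaviour is an assumption, not something these notes confirm. The function name topk_softmax_reference is hypothetical and is not part of CUTLASS.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical reference helper, NOT a CUTLASS API.
// Assumption: the k largest entries of `row` are kept and normalized with a
// softmax; every other entry is set to zero.
std::vector<float> topk_softmax_reference(const std::vector<float>& row, std::size_t k) {
    std::vector<float> out(row.size(), 0.0f);
    if (row.empty() || k == 0) {
        return out;
    }
    k = std::min(k, row.size());

    // Order indices so that the first k point at the k largest values.
    std::vector<std::size_t> idx(row.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&row](std::size_t a, std::size_t b) { return row[a] > row[b]; });

    // Subtract the row maximum before exponentiating for numerical stability.
    const float row_max = row[idx[0]];
    float sum = 0.0f;
    for (std::size_t i = 0; i < k; ++i) {
        out[idx[i]] = std::exp(row[idx[i]] - row_max);
        sum += out[idx[i]];
    }

    // Normalize the kept entries so they sum to one.
    for (std::size_t i = 0; i < k; ++i) {
        out[idx[i]] /= sum;
    }
    return out;
}
```

A reference like this can be used to check the output of a fused GEMM epilogue row by row: calling topk_softmax_reference({1.0f, 3.0f, 2.0f, 0.0f}, 2) keeps only the entries at positions 1 and 2, and their outputs sum to one.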
