CUTLASS 4.3.0

CUTLASS C++

  • Further enhance Blackwell SM100 Attention kernels in example 77.
    • Add softmax skip correction.
    • Fix a shared memory allocation bug: the kernel must explicitly opt in to the maximum dynamic shared memory size once the allocation exceeds 48 KB (a minimal sketch of this CUDA mechanism follows at the end of these notes).
    • Fix a hang caused by a warp returning early.
  • Add command-line arguments for batch, no_verif, cluster_shape, and cluster_shape_fallback in example 89.
  • Add a Ragged Contiguous Grouped GEMM kernel in example 92.
    • This kernel uses a 3D TMA load for the weight matrix and the tensormap update method to load the activations.
  • Add 256x128 tile size support for Hopper SM90 deepgemm in example 67.
    • Performance is tuned to align with the DeepSeek implementation.
  • Simplify the API for MoE GEMMs.
    • Instead of requiring users to call several CuTe utilities to set up strides, a new moe_stride_utils API helps set up the strides for the kernel.
    • Instead of requiring users to set up vectors such as problem_shapes_device and problem_shapes_hosts, a new problem shape struct, MoEProblemShape, takes max_m, max_n, max_k, and a counts vector as input and deduces the per-group problem shapes internally whenever required (an illustrative sketch of the idea follows at the end of these notes).
  • Enable GEMM_K = 0 in grouped GEMM (the expected semantics are sketched at the end of these notes).
  • Optimize grouped GEMM kernels by enabling asynchronous TMA descriptor updates.
  • Support Blackwell SM100 convolution stream-K kernel.
  • Add Blackwell SM100 sparse gemm compressor unit tests.
    • Unit tests: compressor_fp16.
    • Add sub-byte and runtime data type support to the compressor unit test testbed.
  • Add profiler support for:
    • Blackwell SM100 and SM120 blockscaled sparse kernels.
    • New MoE grouped gemm API.
    • Blackwell SM100 cpasync kernel.
  • Fix some kernel issues:
    • Fix a racecheck issue in Blackwell SM103 kernels by adding the missing elect-one for prefetch barrier initialization.
    • Allow users to directly specify the number of stages for the Hopper SM90 mixed input GEMM.
    • Remove warnings caused by the CUDA vector type alignment settings in CUDA 13.
    • Remove problematic cutlass::int8_t and replace it with int8_t.
    • Fix a few bugs in the distributed GEMM API and examples.
    • Fix the handling of negative zero in the sparse compressor.
    • Add the missing wait_on_dependent_grids call for the PDL (programmatic dependent launch) use case (the pattern is sketched at the end of these notes).
  • Fix some profiler issues:
    • Add some missing reference kernels.
    • Support VoidC reference kernels.
    • Account for the scale factors of A and B in the bytes_with_problem_shape function of the block-scaled profiler.
    • Fix an issue when the epilogue tile N is not divisible by the default subtile N.
  • Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
  • Optimal code generation with CUDA Toolkit version 13.0U1.
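
As referenced in the example 77 shared memory note above, CUDA requires an explicit opt-in before a kernel can be launched with more than 48 KB of dynamic shared memory per block. The sketch below illustrates that general CUDA mechanism with a placeholder kernel; it is not the actual attention kernel code from example 77.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel standing in for any kernel that uses a large,
// dynamically sized shared memory buffer.
__global__ void big_smem_kernel(float* out) {
  extern __shared__ float smem[];
  smem[threadIdx.x] = static_cast<float>(threadIdx.x);
  __syncthreads();
  if (threadIdx.x == 0) {
    out[blockIdx.x] = smem[0];
  }
}

int main() {
  // Request more than the 48 KB default limit of dynamic shared memory.
  size_t smem_bytes = 64 * 1024;

  // Without this opt-in, a launch requesting more than 48 KB of dynamic
  // shared memory fails.
  cudaFuncSetAttribute(big_smem_kernel,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       static_cast<int>(smem_bytes));

  float* out = nullptr;
  cudaMalloc(&out, sizeof(float));
  big_smem_kernel<<<1, 256, smem_bytes>>>(out);
  printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(out);
  return 0;
}
```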
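For the MoEProblemShape change above, the sketch below illustrates the underlying idea of deducing per-group problem shapes from max_m, max_n, max_k and a counts vector instead of storing explicit host and device problem-shape vectors. The struct, its fields, and the deduction rule are hypothetical stand-ins for illustration only; they are not the actual CUTLASS MoEProblemShape type or its interface.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in illustrating the idea behind MoEProblemShape:
// the caller supplies upper bounds (max_m, max_n, max_k) plus a per-group
// token count, and per-group GEMM shapes are deduced on demand instead of
// being stored as explicit host/device problem-shape vectors.
struct MoeProblemShapeSketch {
  int max_m, max_n, max_k;
  std::vector<int> counts;  // tokens routed to each expert (group)

  // Deduce the (M, N, K) shape of one group's GEMM. Here M varies with the
  // token count while N and K stay fixed; the real kernel's rule may differ.
  void shape(int group, int& m, int& n, int& k) const {
    m = counts[group] < max_m ? counts[group] : max_m;
    n = max_n;
    k = max_k;
  }
};

int main() {
  MoeProblemShapeSketch ps{/*max_m=*/4096, /*max_n=*/2048, /*max_k=*/1024,
                           /*counts=*/{128, 0, 4096, 517}};
  for (int g = 0; g < static_cast<int>(ps.counts.size()); ++g) {
    int m, n, k;
    ps.shape(g, m, n, k);
    printf("group %d: M=%d N=%d K=%d\n", g, m, n, k);
  }
  return 0;
}
```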
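For the GEMM_K = 0 change above, the expected behavior follows from the standard GEMM definition D = alpha * A * B + beta * C: when K is zero the product term contributes nothing, so the output reduces to beta * C. The host-side reference below is a standalone sketch of that semantics, not CUTLASS code.

```cpp
#include <cstdio>
#include <vector>

// Naive reference GEMM: D = alpha * A(MxK) * B(KxN) + beta * C(MxN).
// When K == 0 the inner loop never runs, so the accumulator stays zero
// and the result is simply beta * C.
void reference_gemm(int M, int N, int K, float alpha, const float* A,
                    const float* B, float beta, const float* C, float* D) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k) {
        acc += A[i * K + k] * B[k * N + j];
      }
      D[i * N + j] = alpha * acc + beta * C[i * N + j];
    }
  }
}

int main() {
  int M = 2, N = 3, K = 0;  // a group with K = 0, as now allowed
  std::vector<float> C(M * N, 7.f), D(M * N, 0.f);
  reference_gemm(M, N, K, /*alpha=*/1.f, /*A=*/nullptr, /*B=*/nullptr,
                 /*beta=*/0.5f, C.data(), D.data());
  printf("D[0] = %.1f (expect 3.5)\n", D[0]);  // 0.5 * 7
  return 0;
}
```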
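The wait_on_dependent_grids fix above concerns CUDA programmatic dependent launch (PDL), where a dependent kernel may start before its predecessor has finished and must synchronize explicitly before reading the predecessor's output. The sketch below shows the underlying CUDA pattern with generic producer and consumer kernels; it assumes a Hopper or newer GPU compiled for sm_90 or later, and it is not the CUTLASS-internal code path that was fixed.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Producer kernel: writes its output and signals that the dependent grid
// may begin launching before this grid fully completes.
__global__ void producer(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = 2.0f * i;
  cudaTriggerProgrammaticLaunchCompletion();
}

// Consumer kernel: may start while the producer is still in flight, so it
// must wait on its dependent grids before reading the producer's output.
// Omitting this wait is the class of bug the release note refers to.
__global__ void consumer(const float* buf, float* out, int n) {
  cudaGridDependencySynchronize();  // wait for producer results to be visible
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = buf[i] + 1.0f;
}

int main() {
  const int n = 1024;
  float *buf = nullptr, *out = nullptr;
  cudaMalloc(&buf, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));

  producer<<<4, 256>>>(buf, n);

  // Launch the consumer with the PDL attribute so it can overlap with the
  // tail of the producer; both kernels run in the same (default) stream.
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t cfg{};
  cfg.gridDim = 4;
  cfg.blockDim = 256;
  cfg.attrs = &attr;
  cfg.numAttrs = 1;
  cudaLaunchKernelEx(&cfg, consumer, (const float*)buf, out, n);

  cudaDeviceSynchronize();
  printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(buf);
  cudaFree(out);
  return 0;
}
```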
