CuTe DSL
- New features:
- Supported Apache TVM-FFI to further reduce host runtime overhead for JIT functions and to improve interoperability with PyTorch and other ML frameworks
- Added fake tensor and stream support to decouple JIT function compilation from the "from_dlpack" flow. Users no longer need real tensors when compiling a JIT function.
- Added FastDivmodDivisor with Python operator overloads, new APIs, Cute dialect integration, and optimized static tile scheduler performance for faster index mapping.
- Added L2 cache eviction priority for TMA-related ops, giving users fine-grained L2 cache control.
- Debuggability improvements:
- Supported source location tracking for DSL APIs (allows profiling tools like Nsight to correlate performance metrics with Python source code)
- Supported dumping PTX and CUBIN code: Hello World Example
- More examples and notebooks to get started with CuTe DSL:
- Improved performance of elementwise example:
- Generalized code to handle a list of input tensors
- Generalized TV layout computation to handle different data types
- Improved Blackwell SM100 persistent dense GEMM with static scheduling:
- To demonstrate usage of the new Pipeline APIs PipelineProducer and PipelineConsumer, which simplify code without explicit pipeline state management (existing APIs are still maintained)
- Separated epilogue code for non-TMA and TMA implementations
- Tutorial for Blackwell GEMM: Basic Blackwell SM100 GEMM
- Baseline Blackwell GEMM achieves 84% of speed-of-light (SOL) performance at MNK = 8K
- More examples are coming to demonstrate optimizations: Baseline + X
- Tutorial for Async Pipeline API
- Reworked elementwise add notebook with a more detailed explanation of the TV layout
- Updated implementation to handle general data types and multiple inputs
- Updated explanation of the TV layout in simpler language
- Added visualization of the TV layout with third-party utils
- Benchmark and autotune demonstration
- More examples of authoring peak-performance kernels:
- Blackwell SM100 mixed-input GEMM
- Blackwell SM100 persistent blockwise dense GEMM
- Blackwell SM100 persistent blockwise contiguous grouped dense GEMM
- Blackwell SM100 persistent blockwise masked grouped dense GEMM
- Blackwell SM100 fmha bwd
- Blackwell SM100 mla
- Hopper SM90 persistent dense GEMM with static scheduling
- Blackwell GeForce batched dense GEMM
- Ampere HSTU Attention
- API updates:
- Please refer to DSL API changelog for details
- Bug fixes and improvements
- Add mma_tiler_n=64 and mma_tiler_n=192 support in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- Fixed TensorSSA.reduce to support a static value as the initial value
- Updated docstrings for the following APIs to be more concise and easier to understand: make_layout_tv, is_static, PipelineAsync, SmemAllocator
- Fixed documentation for pipeline, utils and cute.math
- Added overlapping accumulator optimization for the block tile N = 256 case for better epilogue latency hiding in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- Fixed TensorSSA.getitem indexing to match CuTe's indexing convention
- Fixed an issue with cutlass.max and cutlass.min
- Fixed an issue with mark_compact_shape_dynamic
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add softmax skip correction.
- Fix a shared memory allocation bug: a kernel must explicitly opt in to the maximum dynamic shared memory size once it exceeds 48 KB (see the sketch below).
- Fix a dead hang issue caused by an early-returning warp.
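For context on the 48 KB limit above: CUDA caps per-block dynamic shared memory at 48 KB by default, and kernels requesting more must raise the limit through the runtime before launch. Below is a minimal sketch of that opt-in, using a hypothetical kernel my_kernel and an assumed 100 KB request; it is illustrative only, not code from example 77.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that uses dynamic shared memory.
__global__ void my_kernel(float* out) {
  extern __shared__ float smem[];  // size is set at launch time
  smem[threadIdx.x] = static_cast<float>(threadIdx.x);
  __syncthreads();
  if (threadIdx.x == 0) out[0] = smem[0];
}

int main() {
  int smem_bytes = 100 * 1024;  // assumed request; anything above 48 KB needs the opt-in below
  cudaError_t err = cudaFuncSetAttribute(
      my_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes);
  if (err != cudaSuccess) {
    std::printf("opt-in failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  float* out = nullptr;
  cudaMalloc(&out, sizeof(float));
  my_kernel<<<1, 128, smem_bytes>>>(out);  // third launch parameter = dynamic smem bytes
  cudaDeviceSynchronize();
  cudaFree(out);
  return 0;
}
```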
- Add support through cmdline argument lists for batch, no_verif, cluster_shape and cluster_shape_fallback in example 89.
- Add Ragged Contiguous Grouped gemm kernel in example 92.
- This kernel uses a TMA 3D load for the weights matrix and the tensormap update method to load activations.
- Add 256x128 tile size support for Hopper SM90 deepgemm in example 67.
- Performance is optimized to align with Deepseek implementation.
- Simplification of API for MoE gemms.
- Instead of requiring users to call several cute utilities to set up the stride, the API moe_stride_utils is introduced to help set up strides in the kernel.
- Instead of requiring users to set vectors like problem_shapes_device and problem_shapes_host, a new problem shape struct called MoEProblemShape is introduced, which takes max_m, max_n, max_k and a counts vector as input and deduces problem shapes internally whenever required (see the sketch after this list).
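To illustrate the direction of this change, here is a minimal sketch of the shape-deduction idea: maximum extents plus a per-group counts vector go in, and per-group GEMM shapes are produced on demand. The struct name, its members, and the assumption that only M varies per group are illustrative; this is not the actual MoEProblemShape definition or the moe_stride_utils API.

```cpp
#include <cstdio>
#include <tuple>
#include <vector>

// Illustrative stand-in for the idea behind MoEProblemShape: the caller supplies
// maximum extents plus a per-group counts vector, and each group's GEMM shape is
// deduced on demand instead of being passed in as explicit shape vectors.
struct MoEProblemShapeSketch {
  int max_m, max_n, max_k;
  std::vector<int> counts;  // e.g. tokens routed to each expert (group)

  int groups() const { return static_cast<int>(counts.size()); }

  // Assume a contiguous grouped GEMM where only M varies per group;
  // N and K stay at their maxima.
  std::tuple<int, int, int> group_shape(int g) const {
    return {counts[g], max_n, max_k};
  }
};

int main() {
  MoEProblemShapeSketch shape{/*max_m=*/4096, /*max_n=*/7168, /*max_k=*/2048,
                              /*counts=*/{1024, 256, 3072, 512}};
  long long flops = 0;
  for (int g = 0; g < shape.groups(); ++g) {
    auto [m, n, k] = shape.group_shape(g);   // deduced per-group problem shape
    flops += 2LL * m * n * k;                // a real kernel would tile and schedule this GEMM
  }
  std::printf("total GEMM FLOPs across groups: %lld\n", flops);
  return 0;
}
```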
- Enable GEMM_K = 0 in grouped gemm.
- Optimize group gemm kernels by enabling async TMA desc update.
- Support Blackwell SM100 convolution stream-K kernel.
- Unit tests: fprop_streamK, dgrad_streamK, wgrad_streamK.
- Add Blackwell SM100 sparse gemm compressor unit tests.
- Unit tests: compressor_fp16.
- Add sub-bytes and runtime data type support in compressor unit test testbed.
- Add profiler support for:
- Blackwell SM100 and SM120 blockscaled sparse kernels.
- New MoE grouped gemm API.
- Blackwell SM100 cpasync kernel.
- Fix some kernel issues:
- Fix a race-check issue in Blackwell SM103 kernels by adding a missing elect-one for prefetch barrier initialization.
- Allow user to directly specify the number of stages for Hopper sm90 mixed input gemm.
- Remove warnings caused by cuda vector type alignment setting in CUDA 13.
- Remove problematic cutlass::int8_t and replace it with int8_t.
- Fix a few bugs in distributed gemm API and examples.
- Fix handling of negative zero in the sparse compressor.
- Add missing wait_on_dependent_grids for PDL use case.
- Fix some profiler issues:
- Add some missing reference kernels.
- Support VoidC reference kernels.
- Add calculation of scale factors A and B in function bytes_with_problem_shape of the block scaled profiler.
- Fix an issue when epilogue tile N is not divisible by the default subtile N.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit version 13.0U1.