CuTe DSL
- Bug fixes and improvements
CUTLASS C++
- Fix SM100 F8F6F4 SS MMA (1SM and 2SM) traits to use typed op templates.
- Add UE8M0 (uniform exponent distribution) initialization support in tensor fill utilities.
- Add cvt.rn.bf16x2.e4m3x2 conversion instruction support to numeric_conversion.h.
- Update example 93 with paged KV cache support for Blackwell low-latency GQA.