NVIDIA/cutlass v4.4.2 on GitHub

New features
- CuTe DSL now supports Python 3.14 on both x86_64 and aarch64.
- Runtime Pointer/Tensor/FakeTensor now support cache_key, a stable, hashable representation that simplifies and improves compiled function caching.

Bug fixes and improvements
- Fixed Hopper FMHA causal attention performance regression on CUDA toolkit 13.1 by
  optimizing mbarrier synchronization to avoid unnecessary convergence barriers.
- Fixed a kernel-loading race condition in JAX when multiple GPUs are present in the same process.

Enabled Blackwell SM120f compilation of examples and exposed NVFP4/MX Grouped GEMM in the CUTLASS Profiler.
