Dao-AILab/flash-attention fa4-v4.0.0.beta16 on GitHub

What's Changed

Bump AITER submodule to commit 3b2e6f4 by @sstamenk in #2540
Clamp kv_stage to avoid SMEM overflow for small head_dims on SM100 by @Johnsonms in #2594
[CuTe,Sm100] fix: decode/prefill exp2 emulation consistency by @Luosuu in #2595
NFC: replace deprecated APIs: cute.make_fragment and cute.core.ThrMma by @brandon-yujie-sun in #2602
Bump nvidia-cutlass-dsl to >=4.5.2 and quack-kernels to >=0.5.0 by @Johnsonms in #2605
[CuTe,Fwd,Sm100] Refactor MLA SM100 forward and add page table by @jayhshah in #2558
ci: bump Jimver/cuda-toolkit to v0.2.35 for CUDA 13.2 support by @ko3n1g in #2617
[ROCm] Bump Triton to >=3.6.0 and update aiter submodule by @micmelesse in #2614
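
The dependency floors raised in #2605 can be checked against a local environment before upgrading. The following is a minimal sketch, not part of the release: the helper names (`parse_release`, `meets_minimum`, `check_installed`) are hypothetical, and it assumes the installed distributions are registered under the names `nvidia-cutlass-dsl` and `quack-kernels` as written above. The comparison looks only at the leading numeric components, so a pre-release such as `0.5.0rc1` is treated like `0.5.0`; `packaging.version` would be the stricter choice.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the #2605 entry above.
# The distribution names are assumed to match the names in that entry.
MINIMUMS = {
    "nvidia-cutlass-dsl": (4, 5, 2),
    "quack-kernels": (0, 5, 0),
}


def parse_release(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    # Drop any local version label such as "+cu130" before splitting.
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. "dev1"
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare an installed version string against a minimum version tuple."""
    release = parse_release(installed)
    width = max(len(release), len(minimum))
    # Pad both sides with zeros so (4, 6) compares like (4, 6, 0).
    release = release + (0,) * (width - len(release))
    floor = tuple(minimum) + (0,) * (width - len(minimum))
    return release >= floor


def check_installed(minimums=MINIMUMS):
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            report[name] = None
    return report
```

Calling `check_installed()` returns a dictionary keyed by package name; a `False` entry means the installed version is below the floor listed above, and `None` means the package was not found. On ROCm, the Triton floor from #2614 could be added to `MINIMUMS` in the same way, under whatever distribution name Triton is installed as in that environment.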

Full Changelog: fa4-v4.0.0.beta15...fa4-v4.0.0.beta16