Dao-AILab/flash-attention fa4-v4.0.0.beta10 on GitHub

What's Changed

Disable 2CTA fwd non-causal on CUDA 12 to work around codegen regression by @Johnsonms in #2461
Add CLC scheduler heuristic by @drisspg in #2455
expose num_splits for FA2 and add option for kernel blocksize alignment by @liangel-02 in #2448
[Cute,Fwd,Sm100] fp8 e4m3 and e5m2 support by @dcw02 in #2109
Expose --pack-gqa and --num-splits in benchmark_attn.py by @Johnsonms in #2473
Fix: pass num_splits through varlen_fwd Python wrapper (fixes #2448 regression) by @hsyysy in #2476
[Cute,Fwd,Sm100] Fix the crash when seqlen_k=0 by @Johnsonms in #2470
fix causal calcs by @drisspg in #2463
[cute,bwd] fix PDL race in bwd_preprocess, which was corrupting dpsum on SM90+ by @geruome in #2481

Full Changelog: fa4-v4.0.0.beta9...fa4-v4.0.0.beta10