What's Changed
- Disable 2CTA fwd non-causal on CUDA 12 to work around codegen regression by @Johnsonms in #2461
- Add CLC scheduler heuristic by @drisspg in #2455
- Expose num_splits for FA2 and add option for kernel blocksize alignment by @liangel-02 in #2448
- [Cute,Fwd,Sm100] fp8 e4m3 and e5m2 support by @dcw02 in #2109
- Expose --pack-gqa and --num-splits in benchmark_attn.py by @Johnsonms in #2473
- Fix: pass num_splits through varlen_fwd Python wrapper (fixes #2448 regression) by @hsyysy in #2476
- [Cute,Fwd,Sm100] Fix the crash when seqlen_k=0 by @Johnsonms in #2470
- Fix causal calculations by @drisspg in #2463
- [cute,bwd] Fix PDL race in bwd_preprocess that was corrupting dpsum on SM90+ by @geruome in #2481
New Contributors
- @dcw02 made their first contribution in #2109
- @hsyysy made their first contribution in #2476
- @geruome made their first contribution in #2481
Full Changelog: fa4-v4.0.0.beta9...fa4-v4.0.0.beta10