Dao-AILab/flash-attention fa4-v4.0.0.beta12 on GitHub

What's Changed

Fix long MSVC linker commands on Windows by @jammm in #2517
Fix test_flash_attn_fast varlen call after qv positional insert by @henrylhtsang in #2527
[Cute,Bwd,Sm90] Fix determinism for GQA, port Sm100 approach by @v0i0 in #2510
benchmarks/tune_ex2_emu: hd256 sweep support and clock lock/unlock by @Johnsonms in #2495
[FA4][hd256] Backward TMA bulk-store epilogue + LSE/dpsum coalesce by @Johnsonms in #2497
[hd256] Add TMA paged KV support to SM100 2CTA forward kernel by @Johnsonms in #2489
Deterministic backward for blocksparse impl by @drisspg in #2253

Full Changelog: fa4-v4.0.0.beta11...fa4-v4.0.0.beta12