What's Changed
- Fix long MSVC linker commands on Windows by @jammm in #2517
- Fix test_flash_attn_fast varlen call after qv positional insert by @henrylhtsang in #2527
- [Cute,Bwd,Sm90] Fix determinism for GQA, port Sm100 approach by @v0i0 in #2510
- benchmarks/tune_ex2_emu: hd256 sweep support and clock lock/unlock by @Johnsonms in #2495
- [FA4][hd256] Backward TMA bulk-store epilogue + LSE/dpsum coalesce by @Johnsonms in #2497
- [hd256] Add TMA paged KV support to SM100 2CTA forward kernel by @Johnsonms in #2489
- Deterministic backward for blocksparse impl by @drisspg in #2253
New Contributors
Full Changelog: fa4-v4.0.0.beta11...fa4-v4.0.0.beta12