Dao-AILab/flash-attention fa4-v4.0.0.beta20 on GitHub

What's Changed

fix: sync callers with new _flash_attn_fwd 4-tuple return signature by @hhy3 in #2674
Fix compatibility issues with CuTe DSL 4.6.0+ by @anakinxc in #2648
[CuTe] Pass tmem scalar fields as .ptr to TmemAllocator on SM100 by @pashu-cohere in #2679
Add FLASHATTENTION_DISABLE_SPLIT_ALIGNMENT by @janeyx99 in #2680
ci: rebake cu130 image for cutlass-dsl 4.6.0.dev0 floor by @Johnsonms in #2684
Update FA4 cute quack compatibility by @Luosuu in #2676
ci: install cutlass-dsl/quack at runtime to decouple from the baked image by @Johnsonms in #2685
[CuTe] Assume 16B stride divisibility for SM100 backward LSE/dPsum bulk-copy inputs by @Johnsonms in #2686
fix(hd256/sm100): forward reads actual input strides, drop .contiguous() patch by @oattia in #2670
ci: add MLA absorbed coverage to FA4 CI by @Johnsonms in #2690
Parallelize splitkv alignment templated kernels, remove flag by @janeyx99 in #2683

Full Changelog: fa4-v4.0.0.beta19...fa4-v4.0.0.beta20