Dao-AILab/flash-attention fa4-v4.0.0.beta5 on GitHub

What's Changed

[Cute][Testing] Minor improvements on pytest-xdist workflow by @Alkaid-Benetnash in #2311
Nicer headdim error message by @drisspg in #2227
[Fwd,Sm100] Extract named barriers by @drisspg in #2309
Change 2CTA opt-in to require min seqlen > 2*m_block_size by @drisspg in #2320
[CuteDSL][SM90] varlen bwd works by @KareemMusleh in #2275
Add Logging helper by @drisspg in #2327
[CuTeDSL][Sm80] basic fix for new api by @zhuochenKIDD in #2297
fix: duplicate softmax_scale param by @NanoCode012 in #2328
Fix FA2 + FA4 co-existence by @drisspg in #2331
[Cute,Sm100] Introduce a flexible lambda-based R2P masking by @Alkaid-Benetnash in #2313
[Cute, SM90, bwd] Wire seqused_q/k through backward pass by @NJX-njx in #2315
SM120 forward pass (Blackwell GeForce / DGX Spark) by @blake-snc in #2329
[cutlass] Allow compilation of cutlass FA3 for sm100 via enable_sm90 by @henrylhtsang in #2332
[Cute] fix: rename logging module to avoid circular import at building by @Luosuu in #2335
BUG: SeqlenInfo.create has a tile parameter that defaults to 128 by @risan-raja in #2337
[Fwd,SM100,CuTe] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch by @MatthewBonanni in #2338
[AMD] Migrate Triton Backend to Aiter by @micmelesse in #2230
[Bwd,Sm120] Add SM120 backward pass support by @blake-snc in #2330
[Bwd, SM80] Fix tdKrdS typo by @henrylhtsang in #2341
Add SM120 varlen attention support by @blake-snc in #2333
fix the create_ragged_tensor_for_tma issue by @rainj-me in #2345
[Sm90] Fix test_mask_mod and bwd block-sparse kwarg mismatch by @henrylhtsang in #2365
[Cute, Testing] Fix aot + tvm-ffi EnvStream related parameter mismatch by @Alkaid-Benetnash in #2369
[Cute, Testing] Bump cutedsl to 4.4.2 and remove prior aot cache management workarounds by @Alkaid-Benetnash in #2370
[Cute] fix: FA4 paged attention kv load for DeepSeek (192,128) on SM100 by @Luosuu in #2368
[AMD ROCm] Update ROCm/CK backend to align with latest ComposableKernel API changes by @rocking5566 in #2363
[ROCm] Auto-detect Triton backend if C++ extension is missing by @Soddentrough in #2343
[Fwd,Sm90] Add paged KV attention support (tma and cp.async) by @henrylhtsang in #2360
[CuTe,Flex] limit vec_size to 2 for score mod when not on Sm100 by @reubenconducts in #2371
Support 2CTA for sliding window hdim 192 by @Inodayy in #2347
[Cute,Fwd,Sm100] support irregular qhead / kvhead ratios by @timmy-feng in #2186
benchmarks: add MFU% column to benchmark output by @Johnsonms in #2377
Update flow to enable beta weekly releases by @drisspg in #2378

New Contributors

@NJX-njx made their first contribution in #2315
@blake-snc made their first contribution in #2329
@Luosuu made their first contribution in #2335
@risan-raja made their first contribution in #2337
@MatthewBonanni made their first contribution in #2338
@rainj-me made their first contribution in #2345
@Soddentrough made their first contribution in #2343
@Inodayy made their first contribution in #2347
@Johnsonms made their first contribution in #2377

Full Changelog: fa4-v4.0.0.beta4...fa4-v4.0.0.beta5