What's Changed
- [BugFix] Fix flash_attn_with_kvcache with scalar cache_seqlen by @stepinto in #1795
- Add sorting and head swizzle to varlen scheduler by @jayhshah in #1823
- Fixes incorrect variable reference in comment by @LoserCheems in #1775
- Update the initialization of dk/dv_semaphore by @y-sq in #1839
- FA3 tensor size parameter fix for long context len (seqlen >=4M) by @ghadiaravi13 in #1841
- ci: Move build job to workflow template by @ko3n1g in #1835
- ci: Build via workflow template by @ko3n1g in #1844
- ci: Switch to workflow_dispatch by @ko3n1g in #1847
- [FA3] Allow returning LSE via kwarg by @vasqu in #1851
- [BugFix] fix flash_fwd.FlashAttentionForwardSm80 bugs by @mingyangHao in #1856
- [FIX] Allow m_block_size == 192 and mma_pv_is_rs == False in Sm90 CuTe DSL by @reubenconducts in #1858
- [BUG] CUDA 13: make FA3 compatible with CUDA 13 Builds by @johnnynunez in #1860
- [BUILD] SBSA wheels + CUDA 13 Support by @johnnynunez in #1865
- benchmark: qualify all attention backends by methods list by @rajesh-s in #1881
- ABI stable fa3 by @mikaylagawarecki in #1791
- [NVIDIA] Enable Blackwell Family Specific by @johnnynunez in #1882
- Fix typo in flops calculation for local attention by @henrylhtsang in #1883
- flash-attn-cute bwd sm90 by @tzadouri in #1868
- [Cute] Make testing utils standalone for cute by @drisspg in #1892
- [Cute] Bump pin for CuTeDSL by @drisspg in #1891
- Improve causal backward determinism perf with SPT schedule by @jayhshah in #1893
- Upgrade to cutlass v4.2.1 by @johnnynunez in #1905
- Switch to use cutlass.utils.get_smem_capacity_in_bytes by @brandon-yujie-sun in #1906
- Add Missing None Gradient in FA3 QKVPacked by @JackCharlesZhang in #1908
- C++11 fix warnings by @johnnynunez in #1904
- [CuteDSL] Explicitly cast for Flash Combine by @drisspg in #1925
- Refactors to enable FlexAttention by @drisspg in #1840
- feat: Adding varlen support to cute-dsl sm80 bwd by @imbr92 in #1934
- Remove self refs in softmax for-loop by @kevin-tong-augment in #1924
- [AMD] Torch Compile Issues by @micmelesse in #1756
- [CUTE] Enable Pack GQA for score mods by @drisspg in #1937
- Add precommit list and then uncomment in chunks by @drisspg in #1941
- [ROCm] prepare CK sources for pytorch hipify v2 APIs by @jeffdaily in #1944
- Blackwell FlashAttention-BWD (v1.0) by @tzadouri in #1945
- Sm100 BWD (barrier) by @tzadouri in #1946
- Fix hopper cuda 13 build by @kevmo314 in #1949
- [CuteDSL] Fix hash function for cute.jit decorator by @drisspg in #1953
- Block Sparsity and Flex Attention mask mod support by @reubenconducts in #1942
- [NVIDIA] cutlass v4.3.0 by @johnnynunez in #1952
- [CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs by @reubenconducts in #1961
- Fix FA3 segfault with custom CUDA streams in ABI stable build by @kevmo314 in #1957
- [Cute] Blocks tweaks by @drisspg in #1964
- BlockSparse Tweaks by @drisspg in #1970
- [Cute] Fix main by @drisspg in #1982
- [Cute,Fwd,Sm100] Implement SplitKV by @timmy-feng in #1940
- [Cute] Extract block-sparse utilities from SM80/90 by @drisspg in #1984
- Enable python-3.10+ by @drisspg in #1998
- [Cute, Bwd, Sm100] Add GQA support by @jayhshah in #2004
- [Cute,Fwd,Sm100] fix major regression with split kv by @jayhshah in #2006
- [CuTe DSL] Block sparsity computation kernel by @reubenconducts in #1983
- [NVIDIA] bump github actions by @johnnynunez in #1996
- [Cute,Fwd,Sm100] Support paged attention by @timmy-feng in #1999
- [Cute] Block sparse support Sm100 by @drisspg in #1985
- [Cute,Sm100,Fwd] use correction warps for epi when not using TMA by @jayhshah in #2014
- add fastdivmod for oob reads in mask_mods by @drisspg in #2020
- [Cute,Fwd,Sm100] don't pass mask_fn to softmax_step generically by @jayhshah in #2026
- [CuTeDSL] Swap order of decorators by @anakinxc in #2029
- [Cute,Bwd,Sm100] enable deterministic mode for sm100 bwd and fix race conditions by @jayhshah in #2033
- [NFC] Trivial fix to silence linter by @jduprat in #1928
- Add LICENSE and AUTHORS to flash_attn/cute by @jduprat in #2032
- [Cute,Fwd] enable mask mod without blocksparsity by @reubenconducts in #2031
- Bump pin by @drisspg in #2025
- ruff all the smaller files by @drisspg in #2040
- [Cute] Fix head dim 64 bwd by @drisspg in #2035
- Add headdim64 tests to race condition by @drisspg in #2041
- Add torch.compile support to flash attention 3 by @guilhermeleobas in #1769
- [Cute,Bwd,Sm100] Add local for sm100 bwd by @jayhshah in #2046
- Add hash attr to shortcut expensive check by @drisspg in #2048
- [AMD ROCm] Update to latest composable_kernel to improve performance by @rocking5566 in #2052
- fixing cute bwd func def by @liangel-02 in #2056
- Fix use-after-free in FA3 deterministic mode. by @skarupke in #2063
- [CUTE] Allow grads to be preallocated by @drisspg in #2065
- [Cute,Fwd] Extend score_mod to variable sequence length by @reubenconducts in #2043
- [CUTE] Enabling TVM-FFI to reduce cpu overhead by @drisspg in #2042
- Fix softcap scoremod kwargs typo. by @LeoZDong in #2072
- Add score-mod bwd support by @drisspg in #2070
- Add blocksparse support for bwd on blackwell by @drisspg in #2085
- Fix IMA in fwd on m boundary by @drisspg in #2091
- cutedsl 4.3.4 by @drisspg in #2092
- README for AMD ROCm by @seungrokj in #2068
- [Cute] Fix shuffle sync and enable pack gqa for varlen sm100 by @jayhshah in #2097
- [NVIDIA] Enable Jetson Thor FA4 by @johnnynunez in #2108
- Add pack-gqa fwd support for sparse impl w/ broadcasted H dim by @drisspg in #2098
- [Cute,Fwd] improved block sparsity by @reubenconducts in #2100
- Misc tests that should be xfailed for now by @drisspg in #2127
- Update CUTLASS to fix undefined symbol: cuDriverGetVersion by @HydraQYH in #2142
- [Cute,Fwd,Sm100] Support q_stage=1 for inference by @timmy-feng in #1993
- [Cute] Fix two tests that were failing by @henrylhtsang in #2149
- [Cute, Bwd, Sm100] Add varlen for sm100 bwd by @jayhshah in #2150
- block-sparse backward SM90 by @drisspg in #2136
- score-mod backward SM90 by @drisspg in #2137
- [Cute] Clarify and fix subtle cachekey bug by @drisspg in #2143
- [CUTE][SM100] Fix backward gqa on sm100 post mask-mod semantic change by @drisspg in #2146
- [CUTE][SM90] Enable pack-gqa with broadcasted maskmods by @drisspg in #2145
- [CUTE][SM90] GQA backward non deterministic by @drisspg in #2158
- [Cute,Bwd,Sm100] fix seqused in varlen bwd by @jayhshah in #2167
- [CUTE] Bump cutedsl to 4.3.5 by @drisspg in #2170
- Improve flash.cute paged_kv cpasync by @v0i0 in #2156
- [Cute,Flex] Add option to create and cache cute_hash by @reubenconducts in #2171
- [Cute][Flex] Remove no longer needed contig by @drisspg in #2172
- [Cute] update row_max before safe overwrite for online_softmax by @jayhshah in #2174
- [Cute][Flex] add back in contig by @drisspg in #2177
- [Cute][Flex]Add pack-gqa divmod by @drisspg in #2180
- [Cute,Fwd,Sm100] distributed offset calculation for paged KV by @timmy-feng in #2104
- [Cute,Fwd,Sm100] Add r2p for local mask by @henrylhtsang in #2185
- [Cute][Flex] Fix expanded tensor bug by @drisspg in #2189
- [Cute, SM90] fix fwd varlen Cute implementation bug for H100 by @KareemMusleh in #2194
- Reduce Chance of Build OOM by @Qubitium in #2079
- [Cute][Flex] Allow q_offset 1 and add block-sizes to disambiguate edge cases by @drisspg in #2187
- ci: Use 1 ninja job for cu13 by @ko3n1g in #2195
- [README.md] Update README to include psutil package as build dependency by @wanglc02 in #2210
- [Flex][SM100] Replay expand fix on sm100 by @drisspg in #2209
- [AMD] Triton Backend for ROCm #3 by @micmelesse in #2178
- fix compute_block_sparsity broken in benchmark_mask_mod.py by @zhuochenKIDD in #2221
- Fix shared-memory race by @drisspg in #2229
- Use TORCH_TARGET_VERSION over TORCH_STABLE_ONLY by @janeyx99 in #2155
- short readme for flex flash by @v0i0 in #2231
- [FA3] Mark current main version as v3.0.0 stable by @lw in #2223
- [Cute,Fwd,Sm100] hdim 192 smem fix by @jayhshah in #2235
- Add FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON env var support by @alexheretic in #2239
- [CUTE] Bump to CuteDSL by @drisspg in #2216
- pytest-dist round robin to gpus by @drisspg in #2241
- Fix Hopper tests by @drisspg in #2242
- [Cute,Flex,Fwd] Allow vectorized score_mod definitions by @reubenconducts in #2236
- [FA2] Fix int32 overflow by @drisspg in #2260
- [Cute][Flex] Fix kernel hang w/ multiple empty tiles by @drisspg in #2258
- Bump to 4.4.0 cute dsl pin by @drisspg in #2262
- BWD sm100 2cta by @tzadouri in #2202
- [Cute] Handle window_size=(-1, -1) for non-local attention by @henrylhtsang in #2251
- Document usage with 🤗 Kernels by @sayakpaul in #2272
- [Cute,Sm100,Bwd] Add hdim 192 hdimv 128 backward for sm100 by @jayhshah in #2270
- Correct Cutlass Error Handling by @ankutalev in #2273
- guard use_2cta_instrs on sm90 by @reubenconducts in #2274
- [cute] Add return_lse by @erikwijmans in #2271
- [Flex, Sm100] fix mask mod bugs by @reubenconducts in #2276
- [Cute,Sm100,Bwd] Fix and enable 2CTA path for hdim 128 backward by @jayhshah in #2280
- Bump to 4.4.1 to avoid segfault by @drisspg in #2291
- Fix sm100 fwd missing tSrQs init regression by @drisspg in #2293
- [clang build] Fix clang parse error of missing 'typename' prior to a dependent type name, which occurs because LLVM/Clang strictly adheres to the C++ standard by @tomflinda in #2295
- [CuTe] Include broadcast dims in backward compile cache keys by @bonpyt in #2298
- [Cute][Testing] Add fake tensor mode support for compile-only test passes by @Alkaid-Benetnash in #2283
- Enable hdim=96 bwd by @v0i0 in #2302
- Fix GQA crash in cute FLASH backend: init load_Q before conditional by @platers in #2301
- [Cute,Fwd,Sm100] fix paged kv by @jayhshah in #2303
- Add FA4 publishing strategy by @drisspg in #2282
- [Cute][Testing] Add persistent compile cache for cutedsl AOT compilation by @Alkaid-Benetnash in #2304
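The FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON environment variable added in #2239 can be set before the library is loaded; a minimal sketch is below. Only the variable name comes from this changelog: the file path and the idea of pointing it at a JSON config file are illustrative assumptions, not documented behavior.

```python
import os

# Hypothetical path to a JSON file of Triton kernel configs for the AMD
# backend; the variable name is from PR #2239, the path is an assumption.
os.environ["FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON"] = "/tmp/fa_triton_config.json"

# Downstream code would read the variable at import or launch time.
config_path = os.environ.get("FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON")
print(config_path)
```

Set the variable in the launching shell or early in the entry-point script so it is visible before any backend initialization runs.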
New Contributors
- @stepinto made their first contribution in #1795
- @LoserCheems made their first contribution in #1775
- @y-sq made their first contribution in #1839
- @ghadiaravi13 made their first contribution in #1841
- @mingyangHao made their first contribution in #1856
- @reubenconducts made their first contribution in #1858
- @johnnynunez made their first contribution in #1860
- @rajesh-s made their first contribution in #1881
- @mikaylagawarecki made their first contribution in #1791
- @henrylhtsang made their first contribution in #1883
- @brandon-yujie-sun made their first contribution in #1906
- @JackCharlesZhang made their first contribution in #1908
- @imbr92 made their first contribution in #1934
- @kevin-tong-augment made their first contribution in #1924
- @jeffdaily made their first contribution in #1944
- @timmy-feng made their first contribution in #1940
- @anakinxc made their first contribution in #2029
- @guilhermeleobas made their first contribution in #1769
- @liangel-02 made their first contribution in #2056
- @skarupke made their first contribution in #2063
- @LeoZDong made their first contribution in #2072
- @seungrokj made their first contribution in #2068
- @HydraQYH made their first contribution in #2142
- @v0i0 made their first contribution in #2156
- @KareemMusleh made their first contribution in #2194
- @wanglc02 made their first contribution in #2210
- @zhuochenKIDD made their first contribution in #2221
- @lw made their first contribution in #2223
- @alexheretic made their first contribution in #2239
- @sayakpaul made their first contribution in #2272
- @ankutalev made their first contribution in #2273
- @erikwijmans made their first contribution in #2271
- @tomflinda made their first contribution in #2295
- @bonpyt made their first contribution in #2298
- @Alkaid-Benetnash made their first contribution in #2283
- @platers made their first contribution in #2301
Full Changelog: v2.8.3...fa4-v4.0.0.beta0