What's Changed
- [BugFix] Fix flash_attn_with_kvcache with scalar cache_seqlen by @stepinto in #1795
- Add sorting and head swizzle to varlen scheduler by @jayhshah in #1823
- Fixes incorrect variable reference in comment by @LoserCheems in #1775
- Update the initialization of dk/dv_semaphore by @y-sq in #1839
- FA3 tensor size parameter fix for long context len (seqlen >=4M) by @ghadiaravi13 in #1841
- ci: Move build job to workflow template by @ko3n1g in #1835
- ci: Build via workflow template by @ko3n1g in #1844
- ci: Switch to workflow_dispatch by @ko3n1g in #1847
- [FA3] Allow returning LSE via kwarg by @vasqu in #1851
- [BugFix] fix flash_fwd.FlashAttentionForwardSm80 bugs by @mingyangHao in #1856
- [FIX] Allow m_block_size == 192 and mma_pv_is_rs == False in Sm90 CuTe DSL by @reubenconducts in #1858
- [BUG] CUDA 13: make FA3 compatible with CUDA 13 Builds by @johnnynunez in #1860
- [BUILD] SBSA wheels + CUDA 13 Support by @johnnynunez in #1865
- benchmark: qualify all attention backends by methods list by @rajesh-s in #1881
- ABI stable fa3 by @mikaylagawarecki in #1791
- [NVIDIA] Enable Blackwell Family Specific by @johnnynunez in #1882
- Fix typo in flops calculation for local attention by @henrylhtsang in #1883
- flash-attn-cute bwd sm90 by @tzadouri in #1868
- [Cute] Make testing utils standalone for cute by @drisspg in #1892
- [Cute] Bump pin for CuTeDSL by @drisspg in #1891
- Improve causal backward determinism perf with SPT schedule by @jayhshah in #1893
- Upgrade to cutlass v4.2.1 by @johnnynunez in #1905
- Switch to use cutlass.utils.get_smem_capacity_in_bytes by @brandon-yujie-sun in #1906
- Add Missing None Gradient in FA3 QKVPacked by @JackCharlesZhang in #1908
- C++11 fix warnings by @johnnynunez in #1904
- [CuteDSL] Explicitly cast for Flash Combine by @drisspg in #1925
- Refactors to enable FlexAttention by @drisspg in #1840
- feat: Adding varlen support to cute-dsl sm80 bwd by @imbr92 in #1934
- Remove self refs in softmax for-loop by @kevin-tong-augment in #1924
- [AMD] Torch Compile Issues by @micmelesse in #1756
- [CUTE] Enable Pack GQA for score mods by @drisspg in #1937
- Add precommit list and then uncomment in chunks by @drisspg in #1941
- [ROCm] prepare CK sources for pytorch hipify v2 APIs by @jeffdaily in #1944
- Blackwell FlashAttention-BWD (v1.0) by @tzadouri in #1945
- Sm100 BWD (barrier) by @tzadouri in #1946
- Fix hopper cuda 13 build by @kevmo314 in #1949
- [CuteDSL] Fix hash function for cute.jit decorator by @drisspg in #1953
- Block Sparsity and Flex Attention mask mod support by @reubenconducts in #1942
- [NVIDIA] cutlass v4.3.0 by @johnnynunez in #1952
- [CuTe DSL] Update "buffers" name to "aux_tensors"; fix flex bugs by @reubenconducts in #1961
- Fix FA3 segfault with custom CUDA streams in ABI stable build by @kevmo314 in #1957
- [Cute] Blocks tweaks by @drisspg in #1964
- BlockSparse Tweaks by @drisspg in #1970
- [Cute] Fix main by @drisspg in #1982
- [Cute,Fwd,Sm100] Implement SplitKV by @timmy-feng in #1940
- [Cute] Extract block-sparse utilities from SM80/90 by @drisspg in #1984
- Enable python-3.10+ by @drisspg in #1998
- [Cute, Bwd, Sm100] Add GQA support by @jayhshah in #2004
- [Cute,Fwd,Sm100] fix major regression with split kv by @jayhshah in #2006
- [CuTe DSL] Block sparsity computation kernel by @reubenconducts in #1983
- [NVIDIA] bump github actions by @johnnynunez in #1996
- [Cute,Fwd,Sm100] Support paged attention by @timmy-feng in #1999
- [Cute] Block sparse support Sm100 by @drisspg in #1985
- [Cute,Sm100,Fwd] use correction warps for epi when not using TMA by @jayhshah in #2014
- add fastdivmod for oob reads in mask_mods by @drisspg in #2020
- [Cute,Fwd,Sm100] don't pass mask_fn to softmax_step generically by @jayhshah in #2026
- [CuTeDSL] Swap order of decorators by @anakinxc in #2029
- [Cute,Bwd,Sm100] enable deterministic mode for sm100 bwd and fix race conditions by @jayhshah in #2033
- [NFC] Trivial fix to silence linter by @jduprat in #1928
- Add LICENSE and AUTHORS to flash_attn/cute by @jduprat in #2032
- [Cute,Fwd] enable mask mod without blocksparsity by @reubenconducts in #2031
- Bump pin by @drisspg in #2025
- ruff all the smaller files by @drisspg in #2040
- [Cute] Fix head dim 64 bwd by @drisspg in #2035
- Add headdim64 tests to race condition by @drisspg in #2041
- Add torch.compile support to flash attention 3 by @guilhermeleobas in #1769
- [Cute,Bwd,Sm100] Add local for sm100 bwd by @jayhshah in #2046
- Add hash attr to shortcut expensive check by @drisspg in #2048
- [AMD ROCm] Update to latest composable_kernel to improve performance by @rocking5566 in #2052
- fixing cute bwd func def by @liangel-02 in #2056
- Fix use-after-free in FA3 deterministic mode. by @skarupke in #2063
- [CUTE] Allow grads to be preallocated by @drisspg in #2065
- [Cute,Fwd] Extend score_mod to variable sequence length by @reubenconducts in #2043
- [CUTE] Enabling TVM-FFI to reduce cpu overhead by @drisspg in #2042
- Fix softcap scoremod kwargs typo. by @LeoZDong in #2072
- Add score-mod bwd support by @drisspg in #2070
- Add blocksparse support for bwd on blackwell by @drisspg in #2085
- Fix IMA in fwd on m boundary by @drisspg in #2091
- cutedsl 4.3.4 by @drisspg in #2092
- README for AMD ROCm by @seungrokj in #2068
- [Cute] Fix shuffle sync and enable pack gqa for varlen sm100 by @jayhshah in #2097
- [NVIDIA] Enable Jetson Thor FA4 by @johnnynunez in #2108
- Add pack-gqa fwd support for sparse impl w/ broadcasted H dim by @drisspg in #2098
- [Cute,Fwd] improved block sparsity by @reubenconducts in #2100
- Misc tests that should be xfailed for now by @drisspg in #2127
- Update CUTLASS to fix undefined symbol: cuDriverGetVersion by @HydraQYH in #2142
- [Cute,Fwd,Sm100] Support q_stage=1 for inference by @timmy-feng in #1993
- [Cute] Fix two tests that were failing by @henrylhtsang in #2149
- [Cute, Bwd, Sm100] Add varlen for sm100 bwd by @jayhshah in #2150
- block-sparse backward SM90 by @drisspg in #2136
- score-mod backward SM90 by @drisspg in #2137
- [Cute] Clarify and fix subtle cachekey bug by @drisspg in #2143
- [CUTE][SM100] Fix backward gqa on sm100 post mask-mod semantic change by @drisspg in #2146
- [CUTE][SM90] Enable pack-gqa with broadcasted maskmods by @drisspg in #2145
- [CUTE][SM90] GQA backward non deterministic by @drisspg in #2158
- [Cute,Bwd,Sm100] fix seqused in varlen bwd by @jayhshah in #2167
- [CUTE] Bump cutedsl to 4.3.5 by @drisspg in #2170
- Improve flash.cute paged_kv cpasync by @v0i0 in #2156
- [Cute,Flex] Add option to create and cache cute_hash by @reubenconducts in #2171
- [Cute][Flex] Remove no longer needed contig by @drisspg in #2172
- [Cute] update row_max before safe overwrite for online_softmax by @jayhshah in #2174
- [Cute][Flex] add back in contig by @drisspg in #2177
- [Cute][Flex]Add pack-gqa divmod by @drisspg in #2180
- [Cute,Fwd,Sm100] distributed offset calculation for paged KV by @timmy-feng in #2104
- [Cute,Fwd,Sm100] Add r2p for local mask by @henrylhtsang in #2185
- [Cute][Flex] Fix expanded tensor bug by @drisspg in #2189
- [Cute, SM90] fix fwd varlen Cute implementation bug for H100 by @KareemMusleh in #2194
- Reduce Chance of Build OOM by @Qubitium in #2079
- [Cute][Flex] Allow q_offset 1 and add block-sizes to disambiguate edge cases by @drisspg in #2187
- ci: Use 1 ninja job for cu13 by @ko3n1g in #2195
- [README.md] Update README to include psutil package as build dependency by @wanglc02 in #2210
- [Flex][SM100] Replay expand fix on sm100 by @drisspg in #2209
- [AMD] Triton Backend for ROCm #3 by @micmelesse in #2178
- fix compute_block_sparsity broken in benchmark_mask_mod.py by @zhuochenKIDD in #2221
- Fix shared-memory race by @drisspg in #2229
- Use TORCH_TARGET_VERSION over TORCH_STABLE_ONLY by @janeyx99 in #2155
- short readme for flex flash by @v0i0 in #2231
- [FA3] Mark current main version as v3.0.0 stable by @lw in #2223
- [Cute,Fwd,Sm100] hdim 192 smem fix by @jayhshah in #2235
- Add FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON env var support by @alexheretic in #2239
- [CUTE] Bump to CuteDSL by @drisspg in #2216
- pytest-dist round robin to gpus by @drisspg in #2241
- Fix Hopper tests by @drisspg in #2242
- [Cute,Flex,Fwd] Allow vectorized score_mod definitions by @reubenconducts in #2236
- [FA2] Fix int32 overflow by @drisspg in #2260
- [Cute][Flex] Fix kernel hang w/ multiple empty tiles by @drisspg in #2258
- Bump to 4.4.0 cute dsl pin by @drisspg in #2262
- BWD sm100 2cta by @tzadouri in #2202
- [Cute] Handle window_size=(-1, -1) for non-local attention by @henrylhtsang in #2251
- Document usage with 🤗 Kernels by @sayakpaul in #2272
- [Cute,Sm100,Bwd] Add hdim 192 hdimv 128 backward for sm100 by @jayhshah in #2270
- Correct Cutlass Error Handling by @ankutalev in #2273
- guard use_2cta_instrs on sm90 by @reubenconducts in #2274
- [cute] Add return_lse by @erikwijmans in #2271
- [Flex, Sm100] fix mask mod bugs by @reubenconducts in #2276
- [Cute,Sm100,Bwd] Fix and enable 2CTA path for hdim 128 backward by @jayhshah in #2280
- Bump to 4.4.1 to avoid segfault by @drisspg in #2291
- Fix sm100 fwd missing tSrQs init regression by @drisspg in #2293
- [clang build] Fix clang parse error of missing 'typename' prior to a dependent type name, which occurs because LLVM/Clang strictly adheres to the C++ standard by @tomflinda in #2295
- [CuTe] Include broadcast dims in backward compile cache keys by @bonpyt in #2298
- [Cute][Testing] Add fake tensor mode support for compile-only test passes by @Alkaid-Benetnash in #2283
- Enable hdim=96 bwd by @v0i0 in #2302
- Fix GQA crash in cute FLASH backend: init load_Q before conditional by @platers in #2301
- [Cute,Fwd,Sm100] fix paged kv by @jayhshah in #2303
- Add FA4 publishing strategy by @drisspg in #2282
- [Cute][Testing] Add persistent compile cache for cutedsl AOT compilation by @Alkaid-Benetnash in #2304
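The FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON environment variable added in #2239 can be set before the library is loaded; a minimal sketch is below. Only the variable name comes from this changelog: the file path and the idea of pointing it at a JSON config file are illustrative assumptions, not documented behavior.

```python
import os

# Hypothetical path to a JSON file of Triton kernel configs for the AMD
# backend; the variable name is from PR #2239, the path is an assumption.
os.environ["FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON"] = "/tmp/fa_triton_config.json"

# Downstream code would read the variable at import or launch time.
config_path = os.environ.get("FLASH_ATTENTION_TRITON_AMD_CONFIG_JSON")
print(config_path)
```

Set the variable in the launching shell or early in the entry-point script so it is visible before any backend initialization runs.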
New Contributors
- @stepinto made their first contribution in #1795
- @LoserCheems made their first contribution in #1775
- @y-sq made their first contribution in #1839
- @ghadiaravi13 made their first contribution in #1841
- @mingyangHao made their first contribution in #1856
- @reubenconducts made their first contribution in #1858
- @johnnynunez made their first contribution in #1860
- @rajesh-s made their first contribution in #1881
- @mikaylagawarecki made their first contribution in #1791
- @henrylhtsang made their first contribution in #1883
- @brandon-yujie-sun made their first contribution in #1906
- @JackCharlesZhang made their first contribution in #1908
- @imbr92 made their first contribution in #1934
- @kevin-tong-augment made their first contribution in #1924
- @jeffdaily made their first contribution in #1944
- @timmy-feng made their first contribution in #1940
- @anakinxc made their first contribution in #2029
- @guilhermeleobas made their first contribution in #1769
- @liangel-02 made their first contribution in #2056
- @skarupke made their first contribution in #2063
- @LeoZDong made their first contribution in #2072
- @seungrokj made their first contribution in #2068
- @HydraQYH made their first contribution in #2142
- @v0i0 made their first contribution in #2156
- @KareemMusleh made their first contribution in #2194
- @wanglc02 made their first contribution in #2210
- @zhuochenKIDD made their first contribution in #2221
- @lw made their first contribution in #2223
- @alexheretic made their first contribution in #2239
- @sayakpaul made their first contribution in #2272
- @ankutalev made their first contribution in #2273
- @erikwijmans made their first contribution in #2271
- @tomflinda made their first contribution in #2295
- @bonpyt made their first contribution in #2298
- @Alkaid-Benetnash made their first contribution in #2283
- @platers made their first contribution in #2301
Full Changelog: v2.8.3...fa4-v4.0.0.beta0