What's Changed
- fix: sync callers with new _flash_attn_fwd 4-tuple return signature by @hhy3 in #2674
- Fix compatibility issues with CuTe DSL 4.6.0+ by @anakinxc in #2648
- [CuTe] Pass tmem scalar fields as .ptr to TmemAllocator on SM100 by @pashu-cohere in #2679
- Add FLASHATTENTION_DISABLE_SPLIT_ALIGNMENT by @janeyx99 in #2680
- ci: rebake cu130 image for cutlass-dsl 4.6.0.dev0 floor by @Johnsonms in #2684
- Update FA4 cute quack compatibility by @Luosuu in #2676
- ci: install cutlass-dsl/quack at runtime to decouple from the baked image by @Johnsonms in #2685
- [CuTe] Assume 16B stride divisibility for SM100 backward LSE/dPsum bulk-copy inputs by @Johnsonms in #2686
- fix(hd256/sm100): forward reads actual input strides, drop .contiguous() patch by @oattia in #2670
- ci: add MLA absorbed coverage to FA4 CI by @Johnsonms in #2690
- Parallelize splitkv alignment templated kernels, remove flag by @janeyx99 in #2683
New Contributors
- @hhy3 made their first contribution in #2674
- @pashu-cohere made their first contribution in #2679
- @oattia made their first contribution in #2670
Full Changelog: fa4-v4.0.0.beta19...fa4-v4.0.0.beta20