github Dao-AILab/flash-attention fa4-v4.0.0.beta20

pre-release, 5 hours ago

What's Changed

  • fix: sync callers with new _flash_attn_fwd 4-tuple return signature by @hhy3 in #2674
  • Fix compatibility issues with CuTe DSL 4.6.0+ by @anakinxc in #2648
  • [CuTe] Pass tmem scalar fields as .ptr to TmemAllocator on SM100 by @pashu-cohere in #2679
  • Add FLASHATTENTION_DISABLE_SPLIT_ALIGNMENT by @janeyx99 in #2680
  • ci: rebake cu130 image for cutlass-dsl 4.6.0.dev0 floor by @Johnsonms in #2684
  • Update FA4 cute quack compatibility by @Luosuu in #2676
  • ci: install cutlass-dsl/quack at runtime to decouple from the baked image by @Johnsonms in #2685
  • [Cute] Assume 16B stride divisibility for SM100 backward LSE/dPsum bulk-copy inputs by @Johnsonms in #2686
  • fix(hd256/sm100): forward reads actual input strides, drop .contiguous() patch by @oattia in #2670
  • ci: add MLA absorbed coverage to FA4 CI by @Johnsonms in #2690
  • Parallelize splitkv alignment templated kernels, remove flag by @janeyx99 in #2683

Full Changelog: fa4-v4.0.0.beta19...fa4-v4.0.0.beta20
