What's Changed
- [ROCm Windows] fix build failure by @Apophis3158 in #2519
- [CuTe,Bwd,Sm100] don't disable 2cta due to CUDA 12 in bwd by @reubenconducts in #2543
- [CuTe,Bwd] guard softcap for varlen backward by @reubenconducts in #2544
- [CuTe,Flex] varlen blocksparsity by @reubenconducts in #2224
- [FA4][hd256] Fix layout of non-contiguous qkv in backward kernel by @wangsiyu in #2545
- [CuTe,Bwd,Sm100] fix incorrect calculation of n_block global max for bwd deterministic by @jayhshah in #2549
- fix varlen w/ paging split kv bug by @liangel-02 in #2550
New Contributors
- @Apophis3158 made their first contribution in #2519
Full Changelog: fa4-v4.0.0.beta12...fa4-v4.0.0.beta13