What's Changed
- Feat([FA4][CUTE DSL]) Add head_dim=256 support (forward + backward) by @wangsiyu in #2412
- [Cute,hd256] Post-merge cleanup: dead code, duplicate imports by @Johnsonms in #2487
- [CuTe,Flex] Wire up interface for flex autograd support by @reubenconducts in #2485
- [CuTe,Flex] Add score_mod_bwd param to flash_attn_varlen_func by @reubenconducts in #2496
- fix: typos and missing comments in FA4 cute kernel files by @dxasu in #2502
- [SM100] Guard gO None in empty-tile correction by @geruome in #2504
- [CuTe, Flex] simplify blocksparse interface in flash_attn_func by @reubenconducts in #2506
- Fix: pass `stream` to SM100 MLA kernel by @MatthewBonanni in #2505
- Fix clc scheduling request bug by @drisspg in #2508
- [Tests,MLA] Close coverage gaps in test_flash_attn_mla_absorbed by @Johnsonms in #2483
- Add cache utils logging test by @drisspg in #2509
- [hd256] Improve forward kernel with exp2 FMA emulation (3% to 9% performance gain) by @Johnsonms in #2488
- SM90 FA4 QuACK 0.4 Compatibility by @EduardDurech in #2513
- ci: use /tmp for apptainer tmpdir to fix xattrerror on VAST by @Johnsonms in #2511
New Contributors
- @wangsiyu made their first contribution in #2412
- @dxasu made their first contribution in #2502
- @EduardDurech made their first contribution in #2513
Full Changelog: fa4-v4.0.0.beta10...fa4-v4.0.0.beta11