What's Changed
- [Cute][Testing] Minor improvements on pytest-xdist workflow by @Alkaid-Benetnash in #2311
- Nicer headdim error message by @drisspg in #2227
- [Fwd,Sm100] Extract named barriers by @drisspg in #2309
- Change 2CTA opt-in to require min seqlen > 2*m_block_size by @drisspg in #2320
- [CuteDSL][SM90] varlen bwd works by @KareemMusleh in #2275
- Add Logging helper by @drisspg in #2327
- [CuTeDSL][Sm80] basic fix for new api by @zhuochenKIDD in #2297
- fix: duplicate softmax_scale param by @NanoCode012 in #2328
- Fix FA2 + FA4 co-existence by @drisspg in #2331
- [Cute,Sm100] Introduce a flexible lambda-based R2P masking by @Alkaid-Benetnash in #2313
- [Cute, SM90, bwd] Wire seqused_q/k through backward pass by @NJX-njx in #2315
- SM120 forward pass (Blackwell GeForce / DGX Spark) by @blake-snc in #2329
- [cutlass] Allow compilation of cutlass FA3 for sm100 via enable_sm90 by @henrylhtsang in #2332
- [Cute] fix: rename logging module to avoid circular import at building by @Luosuu in #2335
- BUG: SeqlenInfo.create has a tile parameter that defaults to 128 by @risan-raja in #2337
- [Fwd,SM100,CuTe] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch by @MatthewBonanni in #2338
- [AMD] Migrate Triton Backend to Aiter by @micmelesse in #2230
- [Bwd,Sm120] Add SM120 backward pass support by @blake-snc in #2330
- [Bwd, SM80] Fix tdKrdS typo by @henrylhtsang in #2341
- Add SM120 varlen attention support by @blake-snc in #2333
- fix the create_ragged_tensor_for_tma issue by @rainj-me in #2345
- [Sm90] Fix test_mask_mod and bwd block-sparse kwarg mismatch by @henrylhtsang in #2365
- [Cute, Testing] Fix aot + tvm-ffi EnvStream related parameter mismatch by @Alkaid-Benetnash in #2369
- [Cute, Testing] Bump cutedsl to 4.4.2 and remove prior aot cache management workarounds by @Alkaid-Benetnash in #2370
- [Cute] fix: FA4 paged attention kv load for DeepSeek (192,128) on SM100 by @Luosuu in #2368
- [AMD ROCm] Update ROCm/CK backend to align with latest ComposableKernel API changes by @rocking5566 in #2363
- [ROCm] Auto-detect Triton backend if C++ extension is missing by @Soddentrough in #2343
- [Fwd,Sm90] Add paged KV attention support (tma and cp.async) by @henrylhtsang in #2360
- [CuTe,Flex] limit vec_size to 2 for score mod when not on Sm100 by @reubenconducts in #2371
- Support 2CTA for sliding window hdim 192 by @Inodayy in #2347
- [Cute,Fwd,Sm100] support irregular qhead / kvhead ratios by @timmy-feng in #2186
- benchmarks: add MFU% column to benchmark output by @Johnsonms in #2377
- Update flow to enable beta weekly releases by @drisspg in #2378
New Contributors
- @NJX-njx made their first contribution in #2315
- @blake-snc made their first contribution in #2329
- @Luosuu made their first contribution in #2335
- @risan-raja made their first contribution in #2337
- @MatthewBonanni made their first contribution in #2338
- @rainj-me made their first contribution in #2345
- @Soddentrough made their first contribution in #2343
- @Inodayy made their first contribution in #2347
- @Johnsonms made their first contribution in #2377
Full Changelog: fa4-v4.0.0.beta4...fa4-v4.0.0.beta5