What's Changed
- [Triton] Fix graph capture issues and env var by @micmelesse in #2620
- [CuTe,Bwd,Sm100] allow 2cta with score mod and mask mod in bwd by @reubenconducts in #2557
- [CuTe] Fix lint failures by @drisspg in #2625
- [CuTe] Fix lint failure in flash_bwd_sm100.py by @Johnsonms in #2627
- fix: add weights_only=True to all torch.load call sites by @aryanputta in #2622
- [CuTe,Sm100,Fwd] use correction warps if not tma store; remove outdated packgqa guard by @jayhshah in #2629
- Add aux-scalars to interface to enable dynamic ints and floats in expressions by @drisspg in #2616
- fix: build and select cu13.2 prebuilt wheels by @ko3n1g in #2618
- ci(fa4): enforce cutlass-dsl/quack dep floors and rebake cu130 image by @Johnsonms in #2636
New Contributors
- @aryanputta made their first contribution in #2622
Full Changelog: fa4-v4.0.0.beta16...fa4-v4.0.0.beta17