Dao-AILab/flash-attention fa4-v4.0.0.beta17 on GitHub

What's Changed

[Triton] Fix graph capture issues and env var by @micmelesse in #2620
[CuTe,Bwd,Sm100] allow 2cta with score mod and mask mod in bwd by @reubenconducts in #2557
[CuTe] Fix lint failures by @drisspg in #2625
[CuTe] Fix lint failure in flash_bwd_sm100.py by @Johnsonms in #2627
fix: add weights_only=True to all torch.load call sites by @aryanputta in #2622
[CuTe,Sm100,Fwd] use correction warps if not tma store; remove outdated packgqa guard by @jayhshah in #2629
Add aux-scalars to interface to enable dynamic ints and floats in expressions by @drisspg in #2616
fix: build and select cu13.2 prebuilt wheels by @ko3n1g in #2618
ci(fa4): enforce cutlass-dsl/quack dep floors and rebake cu130 image by @Johnsonms in #2636
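
Note on #2622: `torch.load(path, weights_only=True)` restricts unpickling to tensors and other allowlisted types, so a tampered checkpoint cannot run arbitrary code on load. The snippet below is a rough sketch of that idea using only the Python standard library. It is not PyTorch's implementation, and `SAFE_GLOBALS`, `RestrictedUnpickler`, `restricted_loads`, and `Malicious` are hypothetical names made up for illustration.

```python
import io
import pickle

# Hypothetical allowlist for illustration only; PyTorch's real allowlist differs.
SAFE_GLOBALS = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global not in SAFE_GLOBALS."""

    def find_class(self, module, name):
        if (module, name) in SAFE_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def restricted_loads(data: bytes):
    """Deserialize bytes while rejecting globals outside the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Malicious:
    """Stand-in for a tampered checkpoint payload."""

    def __reduce__(self):
        # A plain pickle.loads would call print(...) while loading;
        # a real attack would substitute something far more harmful.
        return (print, ("arbitrary code ran during unpickling",))


# Plain containers of primitives need no globals, so they load normally.
state = restricted_loads(pickle.dumps({"step": 3, "lr": 0.1}))

# The payload references builtins.print, which is not allowlisted.
try:
    restricted_loads(pickle.dumps(Malicious()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

With an unrestricted `pickle.loads`, the `Malicious` payload would execute its callable during loading; the restricted loader raises `pickle.UnpicklingError` instead.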

Full Changelog: fa4-v4.0.0.beta16...fa4-v4.0.0.beta17