ggml-org/llama.cpp b9075 on GitHub

cuda: fuse snake activation (mul, sin, sqr, mul, add) (#22667)

Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The
matcher recognizes the naive five-op decomposition (mul, sin, sqr, mul,
add) that audio decoders (BigVGAN, Vocos) emit for the snake activation
y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise
kernel.
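
The rewrite described above can be illustrated on the CPU. The sketch
below is not the code from this PR: it is a plain C++ comparison of the
five-pass decomposition against a single fused loop. The function names
and the row-major [C][T] layout, with one a and one inv_b value per
channel, are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive path: five separate elementwise passes (mul, sin, sqr, mul, add).
// Assumed layout (hypothetical): x is [C][T] row-major, a and inv_b hold one value per channel.
std::vector<float> snake_naive(const std::vector<float>& x,
                               const std::vector<float>& a,
                               const std::vector<float>& inv_b,
                               std::size_t T) {
    const std::size_t n = x.size();
    std::vector<float> t(n), y(n);
    for (std::size_t i = 0; i < n; ++i) t[i] = a[i / T] * x[i];      // mul
    for (std::size_t i = 0; i < n; ++i) t[i] = std::sin(t[i]);       // sin
    for (std::size_t i = 0; i < n; ++i) t[i] = t[i] * t[i];          // sqr
    for (std::size_t i = 0; i < n; ++i) t[i] = t[i] * inv_b[i / T];  // mul
    for (std::size_t i = 0; i < n; ++i) y[i] = x[i] + t[i];          // add
    return y;
}

// Fused path: one pass computing y = x + sin(a*x)^2 * inv_b per element.
std::vector<float> snake_fused(const std::vector<float>& x,
                               const std::vector<float>& a,
                               const std::vector<float>& inv_b,
                               std::size_t T) {
    const std::size_t n = x.size();
    std::vector<float> y(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t c = i / T;              // channel index of element i
        const float s = std::sin(a[c] * x[i]);
        y[i] = x[i] + s * s * inv_b[c];
    }
    return y;
}
```

The fused loop reads x once and writes y once instead of streaming an
intermediate buffer through five passes. The arithmetic is the same, so
the two paths should agree up to rounding.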

Add test_snake_fuse comparing CPU naive vs CUDA fused across
F32 / F16 / BF16.

cuda: address review feedback from @am17an

Use ggml_cuda_cast for F32/F16/BF16 conversions and rename
kernel_snake to snake_kernel to match upstream conventions.

cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an

Update tests/test-backend-ops.cpp

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
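
For context on the fastdiv change on T_len: the commit message does not
say how the index is used, so the sketch below only shows the general
technique, assuming a flat element index is divided by a row length
that is fixed for the whole launch. It is a host-side C++ version of
division by an invariant integer via multiply-high and shift; the names
FastDiv, make_fastdiv and fastdiv_u32 are hypothetical and this is not
the ggml implementation.

```cpp
#include <cstdint>

// Precomputed constants for dividing by a fixed d >= 1 (hypothetical helper type).
struct FastDiv {
    std::uint32_t mp;  // magic multiplier
    std::uint32_t L;   // shift amount, ceil(log2(d))
};

// Compute the constants once on the host, e.g. for d = T_len.
FastDiv make_fastdiv(std::uint32_t d) {
    std::uint32_t L = 0;
    while (L < 32 && (std::uint32_t{1} << L) < d) {
        ++L;
    }
    const std::uint64_t pow2L = std::uint64_t{1} << L;
    const std::uint32_t mp =
        static_cast<std::uint32_t>(((std::uint64_t{1} << 32) * (pow2L - d)) / d + 1);
    return {mp, L};
}

// n / d using one multiply-high, one add and one shift.
// The add is done in 64 bits so large n cannot overflow.
std::uint32_t fastdiv_u32(std::uint32_t n, FastDiv f) {
    const std::uint64_t hi = (static_cast<std::uint64_t>(n) * f.mp) >> 32;
    return static_cast<std::uint32_t>((hi + n) >> f.L);
}
```

On a GPU the high half of the product would typically come from a
multiply-high instruction, which is cheaper than a hardware integer
divide executed per element.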

cuda: snake fusion check add->type matches x->type

Moved for readability (equivalent).
Address review feedback from @am17an

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
