github ggml-org/llama.cpp b7739

CUDA: Factor out and re-use block_reduce function (#18785)

  • CUDA: Refactor and expose two_stage_warp_reduce_* function

  • Use two_stage_warp_reduce also in softmax kernel, move smem out of it

Moving the shared memory out of the __device__ function and into the __global__
function allows for explicit smem reuse, as neither the compiler nor the CUDA
runtime seems to free it afterwards (cudaFuncSetAttribute fails when the memory
is not accounted for once per call to two_stage_warp_reduce)
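
As a rough sketch of the pattern, assuming a simple sum reduction and a block size that is a multiple of WARP_SIZE (the names block_reduce_sum and row_sum_kernel are illustrative, not the exact common.cuh API):

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Stage 1: butterfly reduction within a single warp using shuffles.
static __device__ float warp_reduce_sum(float x) {
#pragma unroll
    for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xFFFFFFFF, x, offset);
    }
    return x;
}

// Stage 2: per-warp partials go through shared memory owned by the calling
// __global__ kernel, so several reductions can reuse one buffer instead of
// each __device__ call keeping its own __shared__ array alive.
static __device__ float block_reduce_sum(float x, float * smem) {
    const int lane    = threadIdx.x % WARP_SIZE;
    const int warp_id = threadIdx.x / WARP_SIZE;
    const int n_warps = blockDim.x / WARP_SIZE;

    x = warp_reduce_sum(x);
    if (n_warps == 1) {
        return x;                          // single warp: no smem round-trip
    }
    if (lane == 0) {
        smem[warp_id] = x;                 // publish this warp's partial sum
    }
    __syncthreads();
    x = lane < n_warps ? smem[lane] : 0.0f;
    return warp_reduce_sum(x);             // every warp re-reduces the partials
}

// The kernel declares the shared memory once; a softmax-style kernel could
// reuse the same buffer for both its max and its sum reduction.
__global__ void row_sum_kernel(const float * src, float * dst, const int ncols) {
    __shared__ float smem[WARP_SIZE];      // one slot per warp (at most 32 warps)

    float acc = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
        acc += src[blockIdx.x*ncols + col];
    }
    acc = block_reduce_sum(acc, smem);

    if (threadIdx.x == 0) {
        dst[blockIdx.x] = acc;
    }
}
```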

  • Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

  • Use two_stage_warp_reduce in group_norm_f32

  • Use two_stage_warp_reduce in rms_norm_f32

  • Fix smem calculation which expects bytes
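
For context, dynamic shared-memory sizes (the launch-configuration parameter and the value passed to cudaFuncSetAttribute) are byte counts rather than element counts, so the per-warp partial buffer has to be scaled by the element size. A hypothetical helper, not the actual common.cuh code:

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Hypothetical helper: shared-memory sizes passed to the kernel launch and
// to cudaFuncSetAttribute are in bytes, not in number of floats.
static size_t block_reduce_smem_bytes(const int block_dim) {
    const int n_warps = (block_dim + WARP_SIZE - 1) / WARP_SIZE;
    return n_warps * sizeof(float);        // bytes, not element count
}
```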

  • Make two_stage_warp_reduce accept all values warp_reduce accepts

Also integrate it into the norm_f32 function

  • Use two_stage_warp_reduce in l2_norm_f32

  • Use type traits for block reduction for better legibility

Also address other requests by @am17an, such as variable renaming
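
One hypothetical shape the type-traits approach could take (the names are illustrative, not the actual common.cuh definitions): a small traits struct per reduction keeps the generic reduce loop free of per-operation branching.

```cuda
#include <math.h>

enum class block_reduce_method { sum, max };

template <block_reduce_method method> struct block_reduce_traits;

template <> struct block_reduce_traits<block_reduce_method::sum> {
    static __device__ float identity()                { return 0.0f;  }
    static __device__ float combine(float a, float b) { return a + b; }
};

template <> struct block_reduce_traits<block_reduce_method::max> {
    static __device__ float identity()                { return -INFINITY;   }
    static __device__ float combine(float a, float b) { return fmaxf(a, b); }
};

// The warp-level stage then reads as a single generic loop.
template <block_reduce_method method>
static __device__ float warp_reduce(float x) {
    using traits = block_reduce_traits<method>;
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        x = traits::combine(x, __shfl_xor_sync(0xFFFFFFFF, x, offset));
    }
    return x;
}
```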

  • Make norm tests cover all CUDA paths

  • Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK

Unit tests passed locally; let's see if they pass in CI as well

  • Use enum class for block_reduce_method

This is more type-safe than a plain enum
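
For illustration only (the enumerator names are hypothetical): a plain enum's values convert implicitly to int and leak into the enclosing scope, whereas an enum class is scoped and requires explicit conversions, so mixed-up arguments become compile errors.

```cuda
// Plain enum: unscoped names, implicit conversion to int compiles silently.
enum block_reduce_method_plain { BLOCK_REDUCE_SUM, BLOCK_REDUCE_MAX };
static_assert(BLOCK_REDUCE_SUM + 1 == BLOCK_REDUCE_MAX, "implicit int arithmetic is allowed");

// enum class: scoped, no implicit conversion in either direction.
enum class block_reduce_method { sum, max };
// int i = block_reduce_method::sum;   // error: no implicit conversion to int
// block_reduce_method m = 0;          // error: cannot convert int to enum class

// Typical use as a non-type template parameter selecting the reduction.
template <block_reduce_method method>
__device__ float block_reduce(float x, float * smem);
```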

  • Rename variables as suggested in code review by @am17an

  • Rename two_stage_warp_reduce -> block_reduce

  • Fix trailing whitespace in common.cuh

  • Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
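
A minimal illustration of the dependent-condition idiom (the helper name and message are placeholders, not the actual common.cuh code):

```cuda
#include <type_traits>

// static_assert(false, "...") inside a template may be evaluated while the
// template is merely parsed, breaking the build even if this overload is
// never used. Making the condition depend on T defers the check until the
// template is instantiated with a concrete type.
template <typename T> struct dependent_false : std::false_type {};

template <typename T>
static __device__ float block_reduce_unsupported(T x) {
    static_assert(dependent_false<T>::value, "unsupported type for block_reduce");
    return 0.0f;
}
```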

  • Inline definitions

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

Prebuilt binaries: macOS/iOS, Linux, Windows, openEuler
