github ggml-org/llama.cpp b7739

CUDA: Factor out and re-use block_reduce function (#18785)

  • CUDA: Refactor and expose two_stage_warp_reduce_* function

  • Use two_stage_warp_reduce also in softmax kernel, move smem out of it

Moving the shared memory out of the __device__ function and into the __global__
function allows for explicit smem reuse, as neither the compiler nor the CUDA
runtime seems to free it afterwards (cudaFuncSetAttribute fails when the memory
is not accounted for once per call to two_stage_warp_reduce)
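
As a rough sketch of the pattern, assuming a simple sum reduction and a block size that is a multiple of WARP_SIZE (the names block_reduce_sum and row_sum_kernel are illustrative, not the exact common.cuh API):

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Stage 1: butterfly reduction within a single warp using shuffles.
static __device__ float warp_reduce_sum(float x) {
#pragma unroll
    for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xFFFFFFFF, x, offset);
    }
    return x;
}

// Stage 2: per-warp partials go through shared memory owned by the calling
// __global__ kernel, so several reductions can reuse one buffer instead of
// each __device__ call keeping its own __shared__ array alive.
static __device__ float block_reduce_sum(float x, float * smem) {
    const int lane    = threadIdx.x % WARP_SIZE;
    const int warp_id = threadIdx.x / WARP_SIZE;
    const int n_warps = blockDim.x / WARP_SIZE;

    x = warp_reduce_sum(x);
    if (n_warps == 1) {
        return x;                          // single warp: no smem round-trip
    }
    if (lane == 0) {
        smem[warp_id] = x;                 // publish this warp's partial sum
    }
    __syncthreads();
    x = lane < n_warps ? smem[lane] : 0.0f;
    return warp_reduce_sum(x);             // every warp re-reduces the partials
}

// The kernel declares the shared memory once; a softmax-style kernel could
// reuse the same buffer for both its max and its sum reduction.
__global__ void row_sum_kernel(const float * src, float * dst, const int ncols) {
    __shared__ float smem[WARP_SIZE];      // one slot per warp (at most 32 warps)

    float acc = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
        acc += src[blockIdx.x*ncols + col];
    }
    acc = block_reduce_sum(acc, smem);

    if (threadIdx.x == 0) {
        dst[blockIdx.x] = acc;
    }
}
```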

  • Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

  • Use two_stage_warp_reduce in group_norm_f32

  • Use two_stage_warp_reduce in rms_norm_f32

  • Fix smem calculation which expects bytes
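
For context, dynamic shared-memory sizes (the launch-configuration parameter and the value passed to cudaFuncSetAttribute) are byte counts rather than element counts, so the per-warp partial buffer has to be scaled by the element size. A hypothetical helper, not the actual common.cuh code:

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Hypothetical helper: shared-memory sizes passed to the kernel launch and
// to cudaFuncSetAttribute are in bytes, not in number of floats.
static size_t block_reduce_smem_bytes(const int block_dim) {
    const int n_warps = (block_dim + WARP_SIZE - 1) / WARP_SIZE;
    return n_warps * sizeof(float);        // bytes, not element count
}
```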

  • Make two_stage_warp_reduce accept all values warp_reduce accepts

Also integrate it into the norm_f32 function

  • Use two_stage_warp_reduce in l2_norm_f32

  • Use type traits for block reduction for better legibility

Also address other requests by @am17an, such as variable renaming
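
One hypothetical shape the type-traits approach could take (the names are illustrative, not the actual common.cuh definitions): a small traits struct per reduction keeps the generic reduce loop free of per-operation branching.

```cuda
#include <math.h>

enum class block_reduce_method { sum, max };

template <block_reduce_method method> struct block_reduce_traits;

template <> struct block_reduce_traits<block_reduce_method::sum> {
    static __device__ float identity()                { return 0.0f;  }
    static __device__ float combine(float a, float b) { return a + b; }
};

template <> struct block_reduce_traits<block_reduce_method::max> {
    static __device__ float identity()                { return -INFINITY;   }
    static __device__ float combine(float a, float b) { return fmaxf(a, b); }
};

// The warp-level stage then reads as a single generic loop.
template <block_reduce_method method>
static __device__ float warp_reduce(float x) {
    using traits = block_reduce_traits<method>;
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        x = traits::combine(x, __shfl_xor_sync(0xFFFFFFFF, x, offset));
    }
    return x;
}
```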

  • Make norm tests cover all CUDA paths

  • Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK

Unit tests passed locally; let's see if they pass in CI as well

  • Use enum class for block_reduce_method

This is more type-safe than a plain enum
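
For illustration only (the enumerator names are hypothetical): a plain enum's values convert implicitly to int and leak into the enclosing scope, whereas an enum class is scoped and requires explicit conversions, so mixed-up arguments become compile errors.

```cuda
// Plain enum: unscoped names, implicit conversion to int compiles silently.
enum block_reduce_method_plain { BLOCK_REDUCE_SUM, BLOCK_REDUCE_MAX };
static_assert(BLOCK_REDUCE_SUM + 1 == BLOCK_REDUCE_MAX, "implicit int arithmetic is allowed");

// enum class: scoped, no implicit conversion in either direction.
enum class block_reduce_method { sum, max };
// int i = block_reduce_method::sum;   // error: no implicit conversion to int
// block_reduce_method m = 0;          // error: cannot convert int to enum class

// Typical use as a non-type template parameter selecting the reduction.
template <block_reduce_method method>
__device__ float block_reduce(float x, float * smem);
```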

  • Rename variables as suggested in code review by @am17an

  • Rename two_stage_warp_reduce -> block_reduce

  • Fix trailing whitespace in common.cuh

  • Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
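
A minimal illustration of the dependent-condition idiom (the helper name and message are placeholders, not the actual common.cuh code):

```cuda
#include <type_traits>

// static_assert(false, "...") inside a template may be evaluated while the
// template is merely parsed, breaking the build even if this overload is
// never used. Making the condition depend on T defers the check until the
// template is instantiated with a concrete type.
template <typename T> struct dependent_false : std::false_type {};

template <typename T>
static __device__ float block_reduce_unsupported(T x) {
    static_assert(dependent_false<T>::value, "unsupported type for block_reduce");
    return 0.0f;
}
```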

  • Inline definitions

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

Prebuilt binaries: macOS/iOS, Linux, Windows, openEuler
