Details
CUDA: Factor out and re-use block_reduce function (#18785)
- CUDA: Refactor and expose two_stage_warp_reduce_* function
- Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it
  Moving smem out of the `__device__` function into the `__global__` function allows for explicit smem reuse, as either the compiler or the CUDA runtime seems not to free it afterwards (`cudaFuncSetAttribute` fails unless the memory is accounted for once for each call to `two_stage_warp_reduce`). A sketch of this pattern follows the change list below.
- Update ggml/src/ggml-cuda/common.cuh
  Co-authored-by: Aman Gupta <amangupta052@gmail.com>
- Use two_stage_warp_reduce in group_norm_f32
- Use two_stage_warp_reduce in rms_norm_f32
- Fix smem calculation, which expects bytes
- Make `two_stage_warp_reduce` accept all values `warp_reduce` accepts
  Also integrate it into the norm_f32 function
- Use two_stage_warp_reduce in l2_norm_f32
- Use type traits for block reduction for better legibility
  Also address other requests by @am17an, such as variable renaming
- Make norm tests cover all CUDA paths
- Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK
  Unit tests passed locally; let's see if they pass in the CI as well
- Use `enum class` for `block_reduce_method`
  This is more type-safe than a plain enum (see the dispatch sketch after this list)
- Rename variables as suggested in code review by @am17an
- Rename two_stage_warp_reduce -> block_reduce
- Fix trailing whitespace in common.cuh
- Make condition of static_assert type-dependent
  This delays evaluation until the template is actually instantiated. Otherwise, some compilers may evaluate the assert when parsing the template, resulting in build errors as observed here: https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785 (see the static_assert sketch after this list)
- Inline definitions
  Co-authored-by: Aman Gupta <amangupta052@gmail.com>
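
A minimal sketch of the smem pattern described in the softmax change above, using assumed names (`block_reduce_sum`, `norm_like_kernel`) rather than the exact ggml code: the `__global__` kernel owns the dynamic shared-memory buffer and passes it into the `__device__` reduction, so consecutive reductions reuse a single allocation and the host accounts for it only once.

```cpp
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Stage 1 primitive: butterfly reduction within one warp.
static __device__ float warp_reduce_sum(float x) {
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xffffffff, x, offset);
    }
    return x;
}

// Two-stage block reduction. The smem scratch buffer is supplied by the
// caller instead of being declared here, so the kernel controls the
// allocation and can reuse it across calls.
static __device__ float block_reduce_sum(float x, float * smem) {
    const int lane   = threadIdx.x % WARP_SIZE;
    const int warp   = threadIdx.x / WARP_SIZE;
    const int nwarps = blockDim.x  / WARP_SIZE;  // assumes blockDim.x % WARP_SIZE == 0

    x = warp_reduce_sum(x);           // stage 1: per-warp partial sums
    if (lane == 0) {
        smem[warp] = x;               // one partial per warp
    }
    __syncthreads();

    x = lane < nwarps ? smem[lane] : 0.0f;
    return warp_reduce_sum(x);        // stage 2: reduce the partials, result in every thread
}

// The kernel declares the dynamic smem once; both reductions below share it,
// so the launch (and cudaFuncSetAttribute, if the size exceeds the default
// limit) accounts for nwarps * sizeof(float) a single time.
static __global__ void norm_like_kernel(const float * x, float * dst, const int ncols) {
    extern __shared__ float smem[];

    float sum = 0.0f, sumsq = 0.0f;
    for (int i = threadIdx.x; i < ncols; i += blockDim.x) {
        const float v = x[blockIdx.x * ncols + i];
        sum   += v;
        sumsq += v * v;
    }

    sum = block_reduce_sum(sum, smem);
    __syncthreads();                  // don't overwrite smem while stage 2 may still read it
    sumsq = block_reduce_sum(sumsq, smem);

    if (threadIdx.x == 0) {
        const float mean = sum / ncols;
        dst[blockIdx.x] = sumsq / ncols - mean * mean;  // per-row variance
    }
}
```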
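
The `enum class` / type-traits dispatch could look roughly like this (reusing `WARP_SIZE` from the previous sketch; the member names and the `warp_reduce` signature here are assumptions, not the exact common.cuh code):

```cpp
// Type-safe operator selection: unlike a plain enum, an enum class does not
// implicitly convert to int, so a stray integer cannot be passed where a
// reduction method is expected.
enum class block_reduce_method {
    SUM,
    MAX,
};

template <block_reduce_method method>
static __device__ float warp_reduce(float x) {
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
        const float other = __shfl_xor_sync(0xffffffff, x, offset);
        if constexpr (method == block_reduce_method::SUM) {
            x += other;
        } else {
            x = x > other ? x : other;  // block_reduce_method::MAX
        }
    }
    return x;
}
```

Call sites then pick the operator at compile time, e.g. `warp_reduce<block_reduce_method::SUM>(val)`, and the `if constexpr` branch that is not taken is discarded rather than compiled into the kernel.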
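
The type-dependent `static_assert` can be sketched the same way: wrapping `false` in a trait that depends on a template parameter defers evaluation to instantiation time, whereas a literal `static_assert(false, ...)` may already be rejected while the template is merely parsed (assumed names again; `dependent_false` is one common spelling of the idiom):

```cpp
#include <type_traits>

// Always false, but formally dependent on T, so it is only evaluated once the
// enclosing template is actually instantiated.
template <typename T>
struct dependent_false : std::false_type {};

template <typename T>
static __device__ T block_reduce(T x, T * smem) {
    if constexpr (std::is_same_v<T, float>) {
        return block_reduce_sum(x, smem);  // float path from the first sketch
    } else {
        // A plain static_assert(false, ...) here can fail at parse time on
        // some compilers; the dependent condition cannot.
        static_assert(dependent_false<T>::value, "unsupported type for block_reduce");
    }
}
```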
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: