ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (#21168)
-
ds_read_b128 for q4_0 and q4_1 mmq kernels
The current for loop generates ds_read_b32 instructions with the HIP compiler; the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT; it's faster on both.
-
Vectorized LDS load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 for a generic implementation
-
Explicit for loop in mmq, renamed vec into tmp
-
Fixed max_cpy usage in the loading loop
-
Fixed typo in q4_1 kernel
-
Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
-
Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
-
Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
-
Removed trailing white line 500
-
Update mmq.cuh: removed other white lines
-
Remove trailing whitespaces
Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: iacopPBK <iacop@deneb.com>
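
The idea behind the ds_read_b32 to ds_read_b128 change in the commit message above can be sketched on the host side as follows. This is only an illustration in plain C++, not the kernel code from mmq.cuh: the names `vec16`, `load_scalar`, and `load_vector` are hypothetical, and the mapping to specific GPU instructions is an assumption about what a compiler may emit, not something this snippet guarantees.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte chunk type, standing in for a 128-bit vector type.
struct alignas(16) vec16 {
    int32_t v[4];
};

// Scalar variant: four separate 4-byte copies.
// On a GPU this pattern tends to become four 32-bit shared-memory loads.
inline void load_scalar(int32_t * dst, const int32_t * src) {
    for (int i = 0; i < 4; ++i) {
        dst[i] = src[i];
    }
}

// Vectorized variant: the same 16 bytes moved as one fixed-size chunk.
// With a compile-time size and 16-byte-aligned pointers, a compiler is free
// to turn this into a single 128-bit load and store.
inline void load_vector(int32_t * dst, const int32_t * src) {
    vec16 tmp;
    std::memcpy(&tmp, src, sizeof(tmp)); // one 16-byte read into a temporary
    std::memcpy(dst, &tmp, sizeof(tmp)); // one 16-byte write from the temporary
}
```

Both functions produce the same result; the difference is only in how many memory operations are needed to move the data. The generic implementation named in the commit message presumably chooses the chunk size from the maximum copy width the target supports rather than hard-coding 16 bytes as this sketch does.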
macOS/iOS:
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: