ggml-org/llama.cpp b8701 on GitHub

Details

ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (#21168)

ds_read_b128 for q4_0 and q4_1 mmq kernels
The current for loop generates ds_read_b32 instructions with the HIP compiler; the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, it is faster on both.
Vectorized LDS load update: used the ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for a generic implementation
Explicit for loop in mmq, renamed vec into tmp
Fixed max_cpy usage in the loading loop
Fixed typo in q4_1 kernel
Update ggml/src/ggml-cuda/mmq.cuh
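
The difference between the per-element loop and the vectorized load can be sketched on the host as follows. This is only an illustration of the data-movement pattern, not the code from mmq.cuh: max_cpy_bytes, memcpy_chunk, load_scalar and load_vectorized are hypothetical stand-ins, and the 16-byte chunk width is an assumption chosen to match a 128-bit load. The signatures of the real ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 helpers are not shown in this release note and are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a "widest copy" query.
// Assumption: 16 bytes, i.e. the width of one 128-bit load.
constexpr int max_cpy_bytes() { return 16; }

// Hypothetical helper: copy exactly NBYTES as a single fixed-size chunk.
// With a compile-time size and known alignment, a compiler is free to
// emit one wide load/store instead of several narrow ones.
template <int NBYTES>
inline void memcpy_chunk(void * dst, const void * src) {
    static_assert(NBYTES == 4 || NBYTES == 8 || NBYTES == 16, "unsupported chunk size");
    assert(reinterpret_cast<std::uintptr_t>(src) % NBYTES == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % NBYTES == 0);
    std::memcpy(dst, src, NBYTES);
}

// Baseline: one 32-bit element per iteration (the pattern that maps to
// narrow loads).
inline void load_scalar(int * dst, const int * src, int n) {
    for (int i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
}

// Vectorized: the same N ints, moved in chunks of max_cpy_bytes().
template <int N>
inline void load_vectorized(int * dst, const int * src) {
    constexpr int ints_per_chunk = max_cpy_bytes() / static_cast<int>(sizeof(int));
    static_assert(N % ints_per_chunk == 0, "N must be a multiple of the chunk size");
    for (int i = 0; i < N; i += ints_per_chunk) {
        memcpy_chunk<max_cpy_bytes()>(dst + i, src + i);
    }
}
```

Both functions produce identical results; only the number and width of the memory accesses differ. In the kernel the source buffer lives in LDS (shared memory), which is where the wider reads save bandwidth.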

Co-authored-by: Johannes Gäßler johannesg@5d6.de

Co-authored-by: iacopPBK iacopPBK@users.noreply.github.com
Co-authored-by: iacopPBK iacop@deneb.com
