GitHub: ggml-org/llama.cpp b8701


ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (#21168)

  • ds_read_b128 for q4_0 and q4_1 mmq kernels

    The current for loop generates ds_read_b32 instructions with the HIP compiler; the new implementation generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, it's faster on both.

  • Vectorized LDS load update: used the ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for a generic implementation

  • Explicit for loop in mmq; renamed vec to tmp

  • Fixed max_cpy usage in the loading loop

  • Fixed typo in q4_1 kernel

  • Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler johannesg@5d6.de

  • Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler johannesg@5d6.de

  • Update ggml/src/ggml-cuda/mmq.cuh

Co-authored-by: Johannes Gäßler johannesg@5d6.de

  • Removed trailing blank line at line 500

  • Update mmq.cuh: removed other blank lines

  • Remove trailing whitespaces


Co-authored-by: iacopPBK iacopPBK@users.noreply.github.com
Co-authored-by: Johannes Gäßler johannesg@5d6.de
Co-authored-by: iacopPBK iacop@deneb.com
