github ggml-org/llama.cpp b8064


cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)

  • cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization
  • load all 8 int8 for a grid position in one load
  • calculate signs via popcnt instead of fetching from ksigns table
  • broadcast signs to drop individual shift/mask
  • cuda: iq2xxs: simplify sum scaling

express (sum * scale + sum / 2) / 4 as (sum * (scale * 2 + 1)) / 8
express ((aux32 >> 28) * 2 + 1) as (aux32 >> 27 | 1)

saves 3 registers for mul_mat_vec_q (152 -> 149) according to Nsight.
AFAICT no overflow can occur here, as iq2xxs values are far too small
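Both rewrites are exact integer identities for the non-negative ranges involved: doubling numerator and denominator turns `/4` into `/8`, and `(aux32 >> 28) * 2` equals `(aux32 >> 27)` with its low bit cleared, so `| 1` replaces the `* 2 + 1`. A minimal host-side C check (the function name and test ranges are mine, not from the PR):

```c
#include <stdint.h>

/* Exhaustively check the two rewrites from the commit message for
 * non-negative operands. The ranges swept here are illustrative,
 * not taken from the kernel. Returns 1 if both identities hold. */
int check_identities(void) {
    /* (sum * scale + sum / 2) / 4  ==  (sum * (scale * 2 + 1)) / 8 */
    for (uint32_t sum = 0; sum < 4096; ++sum) {
        for (uint32_t scale = 0; scale < 16; ++scale) {
            if ((sum * scale + sum / 2) / 4 != (sum * (scale * 2 + 1)) / 8)
                return 0;
        }
    }
    /* ((aux32 >> 28) * 2 + 1)  ==  ((aux32 >> 27) | 1)
     * only bits 27..31 of aux32 matter, so sweep them with a few
     * different low-bit patterns mixed in */
    for (uint32_t hi = 0; hi < 32; ++hi) {
        const uint32_t lows[3] = { 0u, 1u, 0x07FFFFFFu };
        for (int i = 0; i < 3; ++i) {
            uint32_t aux32 = (hi << 27) | lows[i];
            if ((aux32 >> 28) * 2 + 1 != ((aux32 >> 27) | 1))
                return 0;
        }
    }
    return 1;
}
```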

  • uint -> uint32_t

`uint` is not a standard C/C++ type, so some toolchains rejected it:
error: identifier "uint" is undefined
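The popcnt sign trick from the dequantization commit relies on how these formats encode signs: only 7 sign bits per group of 8 values are stored, and the 8th is the parity bit that makes the popcount of the full byte even, which is what the precomputed ksigns table encodes. A host-side sketch (function name is illustrative; on the GPU this would use `__popc`, and I'm assuming the even-parity convention of the ksigns table):

```c
#include <stdint.h>

/* Reconstruct the full 8-bit sign mask from the 7 stored sign bits,
 * instead of fetching it from a 128-entry ksigns lookup table.
 * Bit 7 is set iff the low 7 bits have odd popcount, so the whole
 * byte always has even popcount. Uses the GCC/Clang builtin. */
uint8_t signs_from_bits(uint8_t bits7) {
    return (uint8_t)(bits7 | ((__builtin_popcount(bits7) & 1) << 7));
}
```

Computing this inline trades a dependent table load for a single popcount instruction, which also enables the subsequent broadcast of all 8 signs at once.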
