cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)
- cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization
- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask
- cuda: iq2xxs: simplify sum scaling
express (sum * scale + sum / 2) / 4 as (sum * (scale * 2 + 1)) / 8
express ((aux32 >> 28) * 2 + 1) as (aux32 >> 27 | 1)
saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight
AFAICT no overflow can occur here as iq2xxs values are far too small
- uint -> uint32_t
fixes error: identifier "uint" is undefined
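The sign and scaling changes listed above can be illustrated with a small host-side sketch. It is not the code from the PR: every function name here is hypothetical, the per-byte helpers are plain-C++ stand-ins for what device code would likely do with SIMD intrinsics such as `__vcmpne4` and `__vsub4`, and the popcount helper stands in for `__popc`. The sketch assumes the sign table stores each 7-bit index with bit 7 set to the parity of the lower bits, which is what makes a popcount-based replacement possible, and that a grid entry packs 8 int8 values into one 64-bit word.

```cpp
#include <cstdint>

// Portable popcount for the host; device code would use the __popc intrinsic.
static inline uint32_t popcount32(uint32_t x) {
    uint32_t n = 0;
    while (x) { x &= x - 1; ++n; }  // clear the lowest set bit each round
    return n;
}

// Rebuild the 8-bit sign pattern from its 7-bit index (assumption: bit 7 is
// the parity of the low 7 bits, so every pattern has an even number of set bits).
static inline uint32_t signs_from_index(uint32_t idx7) {
    idx7 &= 0x7Fu;
    return idx7 | ((popcount32(idx7) & 1u) << 7);
}

// Host stand-in for a per-byte "!= 0" compare: 0xFF in every nonzero byte lane.
static inline uint32_t bytes_nonzero(uint32_t x) {
    uint32_t m = 0;
    for (int i = 0; i < 32; i += 8) {
        if ((x >> i) & 0xFFu) m |= 0xFFu << i;
    }
    return m;
}

// Host stand-in for a per-byte conditional negate: (b ^ m) - m in each lane.
static inline uint32_t negate_bytes(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i += 8) {
        const uint32_t b  = (x >> i) & 0xFFu;
        const uint32_t mm = (m >> i) & 0xFFu;
        r |= (((b ^ mm) - mm) & 0xFFu) << i;
    }
    return r;
}

// Apply 8 sign bits to the 8 int8 values packed in one 64-bit grid entry.
// The sign byte is broadcast to all four lanes once, then one mask per half
// selects the bit that belongs to each lane.
static inline uint64_t apply_signs(uint64_t grid, uint32_t signs) {
    const uint32_t bcast = (signs & 0xFFu) * 0x01010101u;
    const uint32_t m_lo  = bytes_nonzero(bcast & 0x08040201u);  // bits 0..3 -> lanes 0..3
    const uint32_t m_hi  = bytes_nonzero(bcast & 0x80402010u);  // bits 4..7 -> lanes 4..7
    const uint32_t lo    = negate_bytes((uint32_t)grid, m_lo);
    const uint32_t hi    = negate_bytes((uint32_t)(grid >> 32), m_hi);
    return ((uint64_t)hi << 32) | lo;
}

// Sum scaling as quoted in the commit message, before the change.
static inline int scale_sum_old(int sum, uint32_t aux32) {
    const int scale = (int)(aux32 >> 28);  // top 4 bits
    return (sum * scale + sum / 2) / 4;
}

// Sum scaling after the change: (aux32 >> 27 | 1) equals scale * 2 + 1.
static inline int scale_sum_new(int sum, uint32_t aux32) {
    const int ls = (int)((aux32 >> 27) | 1u);
    return (sum * ls) / 8;
}
```

With truncating integer division and a non-negative scale, the two scaling forms agree for every sum that does not overflow, so a brute-force comparison over a range of sums and all 16 scale values is a cheap way to check the rewrite.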
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: