llama-quant : correct n_attention_wv usage (#20357)
- llama-quant : correct `n_attention_wv` usage
In #19770, I introduced a regression in how the `quantize_state_impl` counter values were initialized: I was incrementing and using `n_attention_wv` in the same loop, when it should already be its final value by the time we're deciding tensor types in `llama_tensor_get_type_impl` (for `use_more_bits`).
I never observed a difference in any of my tests; it was only after @bartowski kindly pointed this out that I realized it was incorrect. (Thanks!)
- simplify
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: