Details
ggml : add native AVX512-FP16 support for F16 operations (#20529)
The overall benchmark speed remains almost unchanged because the CPU now computes faster than the RAM can deliver the data (see the perf stat results below, showing roughly 2.7 billion fewer instructions).
Also note that this path is only enabled for native builds or with custom flags.
now:
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        189,073.52 msec task-clock                #   14.658 CPUs utilized
               404      context-switches          #    2.137 /sec
                19      cpu-migrations            #    0.100 /sec
           372,390      page-faults               #    1.970 K/sec
   310,877,195,595      instructions              #    0.54  insn per cycle
   581,071,530,602      cycles                    #    3.073 GHz
    19,352,107,994      branches                  #  102.352 M/sec
        48,304,438      branch-misses             #    0.25% of all branches
    84,998,431,152      L1-dcache-loads           #  449.552 M/sec
    12,186,410,279      L1-dcache-load-misses     #   14.34% of all L1-dcache accesses

      12.899358742 seconds time elapsed

     187.823044000 seconds user
       1.253416000 seconds sys
```
before:
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        190,594.56 msec task-clock                #   14.652 CPUs utilized
               436      context-switches          #    2.288 /sec
                22      cpu-migrations            #    0.115 /sec
           372,782      page-faults               #    1.956 K/sec
   313,574,921,966      instructions              #    0.54  insn per cycle
   586,064,970,425      cycles                    #    3.075 GHz
    19,585,778,563      branches                  #  102.761 M/sec
        48,437,488      branch-misses             #    0.25% of all branches
    86,219,336,628      L1-dcache-loads           #  452.370 M/sec
    12,232,085,771      L1-dcache-load-misses     #   14.19% of all L1-dcache accesses

      13.007923164 seconds time elapsed

     189.395316000 seconds user
       1.202612000 seconds sys
```
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: