github ggml-org/llama.cpp b8197


ggml : use a simple std::thread in AMX without OpenMP (#20074)

Disabling OpenMP generally gives better inference performance (at least in
my testing), but model loading becomes slightly slower.

Benchmark results for convert_B_packed_format():

Before this commit:

     N      K |  No OpenMP     OpenMP |    Diff |  Speedup
------------------------------------------------------------
   512   2880 |    640.9us    263.5us |  -58.9% |    0.41x
  2880   4096 |     2.55ms    261.7us |  -89.8% |    0.10x
201088   2880 |   256.44ms    21.61ms |  -91.6% |    0.08x
------------------------------------------------------------

Total: 325.43ms vs 31.05ms

After:

     N      K |  No OpenMP     OpenMP |    Diff |  Speedup
------------------------------------------------------------
   512   2880 |     1.49ms    263.5us |  -82.3% |    0.18x
  2880   4096 |     1.55ms    261.7us |  -83.1% |    0.17x
201088   2880 |    24.03ms    21.61ms |  -10.1% |    0.90x
------------------------------------------------------------

Total: 78.97ms vs 31.05ms

Tested with unsloth/gpt-oss-20b-GGUF:Q4_K_M.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
