github ggml-org/llama.cpp b9330

2 hours ago

model: tag ffn_latent as MUL_MAT to fix buft probe (#23664)

ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but
nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks
the backend about the declared op, so it tested an elementwise MUL on a
q8_0 weight. That used to return true unconditionally and the weight
stayed on GPU by luck. Once supports_op told the truth, the probe got a
no and the loader pushed the weight and its matmul to CPU, splitting the
graph. Tagging it MUL_MAT asks the real question; the math is unchanged.
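
The placement decision described above can be sketched with a toy model. This is only an illustration: `Op`, `WeightType`, `Device`, `gpu_supports_op` and `pick_device` are hypothetical names, not the ggml or llama.cpp API, and the support rule (elementwise MUL accepted only for F32 weights) is an assumption chosen to mirror the behaviour in this commit message.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not the real ggml/llama.cpp types.
enum class Op { MUL, MUL_MAT };
enum class WeightType { F32, Q8_0 };
enum class Device { GPU, CPU };

// Toy backend capability check (assumed rule): matmul accepts quantized
// weights, elementwise MUL is only accepted for F32 weights.
bool gpu_supports_op(Op op, WeightType type) {
    if (op == Op::MUL_MAT) {
        return true;
    }
    return type == WeightType::F32;
}

// Toy loader probe: it asks the backend about the op the tensor is
// *declared* with, not the op the graph actually runs on it.
Device pick_device(Op declared_op, WeightType type) {
    return gpu_supports_op(declared_op, type) ? Device::GPU : Device::CPU;
}
```

Under these assumptions, `pick_device(Op::MUL, WeightType::Q8_0)` falls back to `Device::CPU`, while `pick_device(Op::MUL_MAT, WeightType::Q8_0)` keeps the weight on `Device::GPU`. Only the declared tag differs between the two calls, which is why the fix changes placement without changing the math.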

Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s.

