ggml-org/llama.cpp b8956 on GitHub

Details

CANN: add new ops, optimize existing ops (#21204)

New operators:

- GGML_OP_SET: implement via aclnnInplaceCopy on target region
- GGML_OP_CUMSUM: implement via aclnnCumsum
- GGML_OP_FILL: implement via aclnnInplaceFillScalar
- GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides
- GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via
  aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets
- GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve
- GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus
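
The diagonal offsets used for GGML_OP_TRI can be illustrated with a small CPU reference. This is only a sketch of the masking semantics, assuming the four variants map to Tril(-1), Tril(0), Triu(0) and Triu(1) in the order listed above; `TriType` and `tri_reference` are hypothetical names, not part of ggml or CANN.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical enum for illustration; it mirrors the four variants named above.
enum class TriType { Lower, LowerDiag, UpperDiag, Upper };

// Reference semantics on a row-major n x n matrix:
//   tril(k) keeps elements with (col - row) <= k
//   triu(k) keeps elements with (col - row) >= k
// Everything else is set to zero.
std::vector<float> tri_reference(const std::vector<float>& src, int64_t n, TriType type) {
    assert(static_cast<int64_t>(src.size()) == n * n);
    std::vector<float> dst(src.size(), 0.0f);
    for (int64_t r = 0; r < n; ++r) {
        for (int64_t c = 0; c < n; ++c) {
            const int64_t d = c - r;  // signed distance from the main diagonal
            bool keep = false;
            switch (type) {
                case TriType::Lower:     keep = d <= -1; break;  // tril(-1)
                case TriType::LowerDiag: keep = d <= 0;  break;  // tril(0)
                case TriType::UpperDiag: keep = d >= 0;  break;  // triu(0)
                case TriType::Upper:     keep = d >= 1;  break;  // triu(1)
            }
            if (keep) {
                dst[r * n + c] = src[r * n + c];
            }
        }
    }
    return dst;
}
```

For the 2x2 input {1, 2, 3, 4}, `TriType::Lower` keeps only the 3, while `TriType::UpperDiag` keeps 1, 2 and 4.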

Optimizations:

- GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu /
  aclnnGeGluV3 when applicable; fallback conditions now checked inside
  each function rather than at the call site
- CROSS_ENTROPY_LOSS: replace 5-kernel sequence (LogSoftmax→Mul→
  ReduceSum×2→Muls) with single aclnnSoftmaxCrossEntropyWithLogits call
- L2_NORM: fix in-place ClampMin on norm result (was clamping wrong
  tensor); add eps clamping before division to avoid divide-by-zero
- PAD_REFLECT_1D: eliminate per-ne[3] loop; assert contiguity and call
  ReflectionPad1d once on the full 4-D view; remove redundant nb copies
- GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor
  helper into gather_batched lambda with batch loop inlined
- SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice;
  refactor helper into scatter_batched lambda with batch loop inlined
- OUT_PROD: replace O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with
  per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast
  batch dims where ne02/ne03 may differ from ne2/ne3
- backend memset_tensor: implement via aclrtMemset (was NULL)
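
The OUT_PROD change relies on the identity that a sum of rank-1 (outer product) updates equals one matrix product A · B^T per 2-D slice. The sketch below checks that identity on the CPU. It assumes plain row-major storage rather than ggml's tensor layout, and `out_prod_rank1` and `out_prod_matmul` are hypothetical helpers, not the backend's functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<float>;  // row-major storage

// Shape of the old approach: one rank-1 update (outer product + add) per
// shared index p. a is m x k, b is n x k, the result is m x n.
Mat out_prod_rank1(const Mat& a, const Mat& b, size_t m, size_t n, size_t k) {
    assert(a.size() == m * k && b.size() == n * k);
    Mat c(m * n, 0.0f);
    for (size_t p = 0; p < k; ++p) {
        for (size_t i = 0; i < m; ++i) {
            for (size_t j = 0; j < n; ++j) {
                c[i * n + j] += a[i * k + p] * b[j * k + p];
            }
        }
    }
    return c;
}

// Shape of the new approach: a single matmul C = A * B^T for the slice.
Mat out_prod_matmul(const Mat& a, const Mat& b, size_t m, size_t n, size_t k) {
    assert(a.size() == m * k && b.size() == n * k);
    Mat c(m * n, 0.0f);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[j * k + p];
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}
```

Both functions return the same matrix; the second form maps to one kernel launch per slice instead of one per rank-1 update.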

Bug fixes:

- COUNT_EQUAL: use non-inplace EqTensor into a same-type temporary
  buffer instead of InplaceEqTensor, avoiding corruption of src0
- ACL graph cache (USE_ACL_GRAPH): restore node_type and src_type[]
  fields in ggml_graph_node_properties; has_matching_properties() was
  missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to
  incorrectly share cached graphs and produce wrong results (ERR≈679)
- graph cache op_params matching: compare full GGML_MAX_OP_PARAMS
  bytes so that ops differing only in parameters are not incorrectly
  replayed from cache
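
The two graph cache fixes can be pictured with a simplified matcher. This is a sketch under stated assumptions, not the backend's code: `NodeProps`, `ElemType`, `kMaxOpParams`, `matches_strides_only` and `matches_full` are hypothetical stand-ins for ggml_graph_node_properties, the ggml type enum, GGML_MAX_OP_PARAMS and has_matching_properties().

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified stand-ins; the real structure has more fields.
enum class ElemType : int32_t { F16, BF16, F32 };

constexpr size_t kMaxOpParams = 64;  // stands in for GGML_MAX_OP_PARAMS
constexpr int    kMaxSrc      = 2;   // illustrative number of sources

struct NodeProps {
    ElemType      node_type;
    ElemType      src_type[kMaxSrc];
    size_t        nb0;                      // byte stride of dimension 0
    unsigned char op_params[kMaxOpParams];  // raw operator parameters
};

// Stride-only comparison: F16 and BF16 both have nb0 == 2, so this
// cannot tell them apart.
bool matches_strides_only(const NodeProps& a, const NodeProps& b) {
    return a.nb0 == b.nb0;
}

// Comparison with type checks and a full-length op_params compare.
bool matches_full(const NodeProps& a, const NodeProps& b) {
    if (a.node_type != b.node_type) {
        return false;
    }
    for (int i = 0; i < kMaxSrc; ++i) {
        if (a.src_type[i] != b.src_type[i]) {
            return false;
        }
    }
    if (a.nb0 != b.nb0) {
        return false;
    }
    // Compare every parameter byte, not just a prefix.
    return std::memcmp(a.op_params, b.op_params, kMaxOpParams) == 0;
}
```

With this matcher an F16 node and a BF16 node with identical strides no longer match, and neither do two nodes that differ only in the last byte of op_params.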
