ggml-org/llama.cpp b8956

CANN: add new ops, optimize existing ops (#21204)

New operators:

  • GGML_OP_SET: implement via aclnnInplaceCopy on target region
  • GGML_OP_CUMSUM: implement via aclnnCumsum
  • GGML_OP_FILL: implement via aclnnInplaceFillScalar
  • GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides
  • GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via
    aclnnTril(-1/0) and aclnnTriu(0/1) with the appropriate diagonal
    offsets (see the sketch after this list)
  • GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve
  • GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus
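
The TRI mapping folds four variants into two kernels plus a diagonal offset k. Below is a minimal CPU sketch of the masking rule (the tri_mask helper is hypothetical, not the CANN kernel): tril(k) keeps element (i, j) when j - i <= k, triu(k) keeps it when j - i >= k, which yields exactly the -1/0/0/1 offsets listed above.

```cpp
#include <cstdio>

// Masking rule behind the TRI -> Tril/Triu mapping (hypothetical
// reference helper, not the CANN kernel): tril(k) keeps (i, j) when
// j - i <= k; triu(k) keeps (i, j) when j - i >= k.
static float tri_mask(int i, int j, int k, bool lower, float x) {
    const int d = j - i;
    return (lower ? d <= k : d >= k) ? x : 0.0f;
}

int main(void) {
    struct variant { const char * name; int k; bool lower; };
    const variant variants[] = {
        { "lower",      -1, true  },  // strictly below diagonal: Tril(-1)
        { "lower_diag",  0, true  },  // diagonal included:       Tril(0)
        { "upper_diag",  0, false },  // diagonal included:       Triu(0)
        { "upper",       1, false },  // strictly above diagonal: Triu(1)
    };
    for (const variant & v : variants) {
        printf("%s:\n", v.name);
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++) {
                printf("%.0f ", tri_mask(i, j, v.k, v.lower, 1.0f));
            }
            printf("\n");
        }
    }
    return 0;
}
```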

Optimizations:

  • GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu /
    aclnnGeGluV3 when applicable; fallback conditions are now checked
    inside each function rather than at the call site (sketch below)
  • CROSS_ENTROPY_LOSS: replace the 5-kernel sequence (LogSoftmax→Mul→
    ReduceSum×2→Muls) with a single aclnnSoftmaxCrossEntropyWithLogits
    call (sketch below)
  • L2_NORM: fix in-place ClampMin on the norm result (it was clamping
    the wrong tensor); add eps clamping before the division to avoid
    divide-by-zero (sketch below)
  • PAD_REFLECT_1D: eliminate per-ne[3] loop; assert contiguity and call
    ReflectionPad1d once on the full 4-D view; remove redundant nb copies
  • GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor
    helper into gather_batched lambda with batch loop inlined
  • SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice;
    refactor helper into scatter_batched lambda with batch loop inlined
  • OUT_PROD: replace the O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with
    a per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast
    batch dims where ne02/ne03 may differ from ne2/ne3 (sketch below)
  • backend memset_tensor: implement via aclrtMemset (was NULL; sketch
    below)
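
For the GLU fusion, here is a scalar sketch of the semantics the single fused aclnnSwiGlu call has to reproduce, assuming the standard SwiGLU definition; swiglu_ref is illustrative, not the backend code.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Scalar semantics of the SwiGLU that the CANN path fuses into one
// aclnnSwiGlu call: out[i] = silu(gate[i]) * up[i]. The GeGLU variants
// replace silu with the exact/erf/quick GELU. Sketch only.
static std::vector<float> swiglu_ref(const std::vector<float> & gate,
                                     const std::vector<float> & up) {
    std::vector<float> out(gate.size());
    for (size_t i = 0; i < gate.size(); i++) {
        const float silu = gate[i] / (1.0f + std::exp(-gate[i]));
        out[i] = silu * up[i];
    }
    return out;
}

int main() {
    const auto out = swiglu_ref({ 1.0f, -2.0f }, { 0.5f, 3.0f });
    printf("%f %f\n", out[0], out[1]);  // ~0.3655, ~-0.7152
    return 0;
}
```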
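For CROSS_ENTROPY_LOSS, a scalar sketch of what the replaced 5-kernel chain computed, so the fused call can be checked against it; the -1/nrows scaling is assumed from ggml's cross_entropy_loss semantics (mean over rows of -sum(labels * log_softmax(logits))).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Scalar reference for the chain the fused call replaces:
// LogSoftmax -> Mul -> ReduceSum (classes) -> ReduceSum (rows) -> Muls.
static float cross_entropy_ref(const std::vector<std::vector<float>> & logits,
                               const std::vector<std::vector<float>> & labels) {
    double total = 0.0;
    for (size_t r = 0; r < logits.size(); r++) {
        float maxv = logits[r][0];                        // max for stability
        for (float v : logits[r]) maxv = std::max(maxv, v);
        double z = 0.0;
        for (float v : logits[r]) z += std::exp(v - maxv);
        const double logz = std::log(z) + maxv;
        for (size_t j = 0; j < logits[r].size(); j++) {
            total += labels[r][j] * (logits[r][j] - logz); // Mul + ReduceSum
        }
    }
    return (float) (-total / logits.size());              // trailing Muls: -1/nrows
}

int main() {
    // one row, uniform logits, one-hot label -> loss = log(3) ~ 1.0986
    printf("%f\n", cross_entropy_ref({{ 0.f, 0.f, 0.f }}, {{ 0.f, 1.f, 0.f }}));
    return 0;
}
```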
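For the L2_NORM fix, a sketch of the corrected order of operations: clamp the norm to eps, then divide. l2_norm_ref is a plain CPU reference, not the CANN code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Corrected L2_NORM flow: ClampMin is applied to the *norm* (the bug
// clamped the wrong tensor), so the divisor is always >= eps and the
// division cannot hit zero. Sketch only.
static void l2_norm_ref(std::vector<float> & x, float eps) {
    double ss = 0.0;
    for (float v : x) ss += (double) v * v;
    const float norm = std::max((float) std::sqrt(ss), eps); // eps clamp first
    for (float & v : x) v /= norm;                           // then divide
}

int main() {
    std::vector<float> zeros = { 0.0f, 0.0f };
    l2_norm_ref(zeros, 1e-12f);                // stays finite instead of NaN
    printf("%f %f\n", zeros[0], zeros[1]);
    return 0;
}
```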
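For OUT_PROD, a sketch showing why the rewrite is valid: accumulating one rank-1 (Ger) update per column on a slice gives exactly one matmul src0 @ src1^T on that slice. The modulo mapping for broadcast batch dims is an assumption based on the description above, not the merged code.

```cpp
#include <vector>

// OUT_PROD on one (i2, i3) slice: the old path accumulated k rank-1
// Ger updates; summing them equals dst = A * B^T with A [m x k] and
// B [n x k], row-major. For strided broadcast, slice indices would map
// as i02 = i2 % ne02, i03 = i3 % ne03 (assumed). Sketch only.
static std::vector<float> out_prod_ref(const std::vector<float> & A,
                                       const std::vector<float> & B,
                                       int m, int n, int k) {
    std::vector<float> dst((size_t) m * n, 0.0f);
    for (int kk = 0; kk < k; kk++) {          // old path: one Ger per kk
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                dst[(size_t) i * n + j] +=
                    A[(size_t) i * k + kk] * B[(size_t) j * k + kk];
            }
        }
    }
    return dst;                               // == single matmul A * B^T
}
```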
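For memset_tensor, a sketch of what the callback might look like on top of aclrtMemset. The callback signature follows the public ggml-backend buffer interface; the function name and the ACL_CHECK error macro are assumptions, not necessarily the merged code.

```cpp
#include <acl/acl.h>
#include "ggml-backend-impl.h"

// Sketch of a memset_tensor buffer callback built on aclrtMemset
// (previously this callback was NULL). ACL_CHECK is the error-check
// macro used in the CANN backend (assumed).
static void ggml_backend_cann_buffer_memset_tensor(
        ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
        uint8_t value, size_t offset, size_t size) {
    GGML_UNUSED(buffer);
    void * dst = (char *) tensor->data + offset;
    // aclrtMemset(devPtr, maxCount, value, count): set `size` bytes
    ACL_CHECK(aclrtMemset(dst, size, value, size));
}
```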

Bug fixes:

  • COUNT_EQUAL: use non-inplace EqTensor into a same-type temporary
    buffer instead of InplaceEqTensor, avoiding corruption of src0
    (see the first sketch after this list)
  • ACL graph cache (USE_ACL_GRAPH): restore the node_type and src_type[]
    fields in ggml_graph_node_properties; has_matching_properties() was
    missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to
    incorrectly share cached graphs and produce wrong results (ERR≈679);
    see the second sketch after this list
  • graph cache op_params matching: compare the full GGML_MAX_OP_PARAMS
    bytes so that ops differing only in parameters are not incorrectly
    replayed from the cache (also covered by the second sketch)
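
First, a CPU sketch of the COUNT_EQUAL fix: comparing into a temporary and reducing over that leaves src0 intact. count_equal_ref is illustrative, not the CANN code.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// COUNT_EQUAL after the fix: the comparison result goes into a
// same-type temporary and the reduction runs on that, so src0 survives
// (InplaceEqTensor was overwriting it). Sketch only.
static int64_t count_equal_ref(const std::vector<int32_t> & src0,
                               const std::vector<int32_t> & src1) {
    std::vector<int32_t> eq(src0.size());   // same-type temporary buffer
    for (size_t i = 0; i < src0.size(); i++) {
        eq[i] = src0[i] == src1[i];         // non-inplace EqTensor
    }
    int64_t count = 0;
    for (int32_t e : eq) count += e;        // reduce over the temporary
    return count;                           // src0 unchanged for later nodes
}

int main() {
    printf("%lld\n", (long long) count_equal_ref({ 1, 2, 3 }, { 1, 0, 3 })); // 2
    return 0;
}
```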
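Second, a sketch of the restored graph-cache matching covering both fixes: the type fields and the full op_params comparison. The struct layout and the GGML_TYPE_COUNT placeholder for missing sources are assumptions based on the description, not necessarily the merged code.

```cpp
#include <cstring>
#include "ggml.h"

// Cached per-node properties; node_type and src_type[] are the fields
// restored by this PR (layout assumed from the description).
struct ggml_graph_node_properties {
    ggml_op   node_op;
    ggml_type node_type;
    ggml_type src_type[GGML_MAX_SRC];
    int64_t   ne[GGML_MAX_DIMS];
    size_t    nb[GGML_MAX_DIMS];
    int32_t   op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];
};

static bool has_matching_properties(const ggml_tensor * node,
                                    const ggml_graph_node_properties * p) {
    // F16 and BF16 tensors can have identical shapes and strides
    // (nb[0] == 2 for both), so the element type must be compared too.
    if (node->op != p->node_op || node->type != p->node_type) {
        return false;
    }
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        const ggml_type st = node->src[i] ? node->src[i]->type : GGML_TYPE_COUNT;
        if (st != p->src_type[i]) {
            return false;
        }
    }
    // shape/stride checks elided for brevity; compare the FULL op_params
    // block so ops differing only in parameters are not replayed from cache
    return memcmp(node->op_params, p->op_params, GGML_MAX_OP_PARAMS) == 0;
}
```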

Platform builds: macOS/iOS, Linux, Android, Windows, openEuler.
