ggml-org/llama.cpp release b8846


ggml : reduce CPU overhead in meta backend (#22041)

  • cache subgraph splits when cgraph is unchanged

Skip per-call subgraph construction in ggml_backend_meta_graph_compute when the same ggml_cgraph is used consecutively.

Assign a uid to every subgraph so that CUDA's fast uid-check path is hit as well.

  • Address review comments

  • Keep the scope as is

  • Rename the last_uid and last_n_subgraphs fields, remove the last_max_tmp_size field, and refactor the code.

  • Address review comments

  • Update ggml/src/ggml-backend-meta.cpp

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>


Binaries are available for macOS/iOS, Linux, Android, Windows, and openEuler.
