ggml-org/llama.cpp b8754 on GitHub

Details

hexagon: improved Op queuing, buffer and cache management (#21705)

The host now prepares batches of requests and dispatches them via a single dspqueue message.

Buffers are mapped explicitly by NPU while processing batches.
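
As a rough illustration of the batching described above, the sketch below accumulates op requests on the host and emits one message per full batch. All names (`OpReq`, `BatchQueue`, `flush`) are hypothetical, and the dispatch is simulated with a counter; the real backend writes to a dspqueue, which is not shown. The default of 256 ops per batch mirrors the opreq batch size mentioned in this release.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical op request record; the real opreq layout is not shown here.
struct OpReq {
    uint32_t op;         // operation id
    uint32_t n_tensors;  // number of tensors the op references
};

// Accumulates requests and "dispatches" them one batch at a time.
class BatchQueue {
public:
    explicit BatchQueue(size_t max_ops = 256) : max_ops_(max_ops) {
        assert(max_ops_ > 0);
    }

    // Queue one op; a full batch is dispatched immediately.
    void enqueue(const OpReq & req) {
        pending_.push_back(req);
        if (pending_.size() == max_ops_) {
            flush();
        }
    }

    // Dispatch whatever is pending as a single message (simulated).
    void flush() {
        if (pending_.empty()) {
            return;
        }
        ++messages_sent_;               // stands in for one queue write
        ops_sent_ += pending_.size();
        pending_.clear();
    }

    size_t messages_sent() const { return messages_sent_; }
    size_t ops_sent() const { return ops_sent_; }

private:
    size_t             max_ops_;
    std::vector<OpReq> pending_;
    size_t             messages_sent_ = 0;
    size_t             ops_sent_      = 0;
};
```

With this sketch, queuing 600 ops and then calling `flush()` produces three messages (256 + 256 + 88) instead of 600 individual ones.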

hex-dma: disable l2 bypass to work around new issue due to no flushes between Ops
hex-utils: add explicit l2flush and l2clear helpers
hex-opreq: use fine-grain per tensor l2 management
hex-opreq: avoid redundant invalidates for tensors we already flushed
hex-opreq: update debug messages
htp-opreq: reuse ops_context
hex-opreq: do not flush or invalidate cache lines beyond buffer boundary
hex-opreq: fix errors in log message
Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundary"

This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.

Looks like 4MB of contiguous virtual space should cover the 1MB cache.
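
The boundary question in the reverted commit above comes down to which cache lines a buffer actually touches. The helper below is a minimal sketch of that arithmetic, not code from the backend: `lines_covering` is a hypothetical name, and the line size is passed in as a parameter because the real value is hardware-specific.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of cache lines touched by the byte range [addr, addr + size).
// line_size must be a non-zero power of two.
inline size_t lines_covering(uintptr_t addr, size_t size, size_t line_size) {
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    if (size == 0) {
        return 0;
    }
    const uintptr_t mask  = ~static_cast<uintptr_t>(line_size - 1);
    const uintptr_t first = addr & mask;               // line holding the first byte
    const uintptr_t last  = (addr + size - 1) & mask;  // line holding the last byte
    return static_cast<size_t>((last - first) / line_size) + 1;
}
```

For example, with an assumed 128-byte line, a 256-byte buffer that starts 16 bytes past a line boundary touches three lines, so a line-granular flush or invalidate necessarily reaches bytes outside the buffer on both ends.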

hexagon: drop cache flush size to 2MB
hex-opreq: start reworking opreq packing
hex-opreq: introduce new way of packing opbatch where tensors are stored separately
hex-opreq: add a simple fastrpc call to force unmap all buffers
hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size
hex-opreq: bump opreq batch size to 256
hex-mm: place src1 spad at the top of vtcm for easy reuse
hex-ops: introduce internal types and disable src1 reuse for now

Nothing new, just formalizing the repack / dyn.quant types we've been using.

This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.

Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner.

hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx
hex-bufs: introduce pinned mappings and use non-pinned ones for model buffers
hex-buf: add support for allocating shared/pinned buffer for opreqs
hex-opbatch: make opbatches configurable
hex-naming: better name for ggml_hexagon_shared_buffer
hex-naming: add session->c_name() helper
hex-opbatch: start using shm but still copy for now
hex-opbatch: use shared buffer for packing opbatch
hex-opbatch: better naming for opbatch related classes and code
hex-opbatch: reuse batched tensors with same data/dims/strides
hex-opbatch: update logging
hex-opbatch: add support for vmem limit for op batching
hex-opbatch: update htp side to properly support dynamic mmap/unmap
hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing
hex-opbatch: fixed src1 handling in act ops
hex-act: fix empty src1 handling in swiglu and friends

Simplify preamble macro while at it
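
The tensor reuse mentioned in the hex-opbatch entries above (storing tensors separately and reusing entries with the same data/dims/strides) can be pictured as a small interning table. This is a sketch under assumptions: `TensorDesc` and `TensorTable` are hypothetical names, the four-dimension layout is borrowed from ggml's usual tensor shape, and a linear scan is used only for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tensor descriptor: data location plus dims and strides.
struct TensorDesc {
    uint64_t                data;  // address or offset of the tensor data
    std::array<uint32_t, 4> ne;    // number of elements per dimension
    std::array<uint32_t, 4> nb;    // stride in bytes per dimension

    bool operator==(const TensorDesc & o) const {
        return data == o.data && ne == o.ne && nb == o.nb;
    }
};

// Per-batch table: ops refer to tensors by index instead of embedding them.
class TensorTable {
public:
    // Returns the index of an identical descriptor, appending it if new.
    uint32_t intern(const TensorDesc & t) {
        for (size_t i = 0; i < tensors_.size(); ++i) {
            if (tensors_[i] == t) {
                return static_cast<uint32_t>(i);
            }
        }
        tensors_.push_back(t);
        return static_cast<uint32_t>(tensors_.size() - 1);
    }

    size_t size() const { return tensors_.size(); }

private:
    std::vector<TensorDesc> tensors_;
};
```

In a batch where the output of one op feeds the next, the shared tensor is then stored once and referenced by index from both ops.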

cleaning up some left-overs from merges

If an Op matches the regex, the hex backend will reject it.

hex-opbatch: wire up newer ops missed in merge and update main switch to detect this in the future
hexagon: improved vtcm acquisition to remove inter-op overhead

Fully compatible with QNN-HTP coex

This also fixes an issue with newer LLVM merging repack and non-repack
functions. We use those pointers to distinguish between buffer types.

Just a cleanup. We don't need separate contexts at this point.

It's no longer useful. Please use the more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.
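
A regex-based op filter of the kind GGML_HEXAGON_OPFILTER provides could look roughly like the sketch below. The function name `op_filtered`, the choice of a partial match (`std::regex_search`), the treatment of an empty filter as "no filtering", and the exact string an Op is matched against are all assumptions for illustration; the backend's actual matching rules are not described in these notes.

```cpp
#include <regex>
#include <string>

// Returns true if the op description matches the filter and should be rejected.
// Assumptions: an empty filter disables filtering; a partial match is enough.
inline bool op_filtered(const std::string & filter, const std::string & op_desc) {
    if (filter.empty()) {
        return false;
    }
    const std::regex re(filter);  // throws std::regex_error on an invalid pattern
    return std::regex_search(op_desc, re);
}
```

Under these assumptions, a filter such as `MUL_MAT|ADD` would reject those two ops and leave the rest on the hex backend.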

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Trivikram Reddy tamarnat@qti.qualcomm.com
