github ggml-org/llama.cpp b8754


hexagon: improved Op queuing, buffer and cache management (#21705)

  • hexagon: introduce op request batching and rewrite buffer management

The host now prepares batches of requests and dispatches them via a single dspqueue message.

Buffers are mapped explicitly by the NPU while it processes batches.
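
The batching idea can be illustrated with a minimal host-side sketch. It is not the backend's code: `OpRequest`, `OpBatcher`, and the `send` callback are hypothetical names, and the callback merely stands in for whatever writes a single dspqueue message. The batch size of 256 used in the note below is the value mentioned later in these notes.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical op request; the real descriptor layout is not shown in these notes.
struct OpRequest {
    uint32_t op;        // operation code
    uint32_t tensor_id; // index of a tensor described elsewhere in the batch
};

// Collects requests and hands each full batch to `send` as one message.
class OpBatcher {
public:
    using SendFn = std::function<void(const std::vector<OpRequest> &)>;

    OpBatcher(size_t max_ops, SendFn send) : max_ops_(max_ops), send_(std::move(send)) {
        pending_.reserve(max_ops_);
    }

    // Queue one request; dispatch automatically once the batch is full.
    void enqueue(const OpRequest & req) {
        pending_.push_back(req);
        if (pending_.size() >= max_ops_) {
            flush();
        }
    }

    // Dispatch whatever is pending (for example at the end of a graph).
    void flush() {
        if (pending_.empty()) {
            return;
        }
        send_(pending_);
        pending_.clear();
    }

private:
    size_t                 max_ops_;
    SendFn                 send_;
    std::vector<OpRequest> pending_;
};
```

With `max_ops` set to 256, queuing 600 requests and then calling `flush()` results in three calls to `send` instead of 600 individual dispatches.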

  • hex-dma: disable l2 bypass to work around a new issue caused by the lack of flushes between Ops

  • hex-utils: add explicit l2flush and l2clear helpers

  • hex-opreq: use fine-grain per tensor l2 management

  • hex-opreq: avoid redundant invalidates for tensors we already flushed

  • hex-opreq: update debug messages

  • htp-opreq: reuse ops_context

  • hex-opreq: do not flush or invalidate cache lines beyond buffer boundary

  • hex-opreq: fix errors in log message

  • Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundary"

This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.

  • hexagon: limit l2 flushes to 1MB which covers l2 cache

  • hex-opreq: limit cache flush to 4MB

It looks like 4MB of contiguous virtual address space should cover the 1MB cache.

  • hexagon: drop cache flush size to 2MB

  • hex-opreq: start reworking opreq packing

  • hex-opreq: introduce new way of packing opbatch where tensors are stored separately

  • hex-opreq: add a simple fastrpc call to force unmap all buffers

  • hex-l2flush: 2MB does not seem robust; also clean up the step size to use the cache line size
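
As a rough illustration of stepping a flush by cache line with an upper bound on the range, consider the sketch below. Both constants are assumptions chosen for the example (the notes do not state the final cap or the line size), and `LineOp` is a hypothetical stand-in for the per-line cache maintenance operation.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed values for illustration only; neither is taken from the commit.
constexpr size_t kLineSize = 128;              // cache line size in bytes (assumption)
constexpr size_t kMaxFlush = 4u * 1024 * 1024; // upper bound on bytes walked (assumption)

// Stand-in for a per-line cache maintenance operation.
using LineOp = void (*)(uintptr_t line_addr, void * ctx);

// Walks [addr, addr + size) one cache line at a time, capped at kMaxFlush bytes.
// The first and last lines are rounded to line boundaries.
// Returns the number of lines visited.
inline size_t for_each_line(uintptr_t addr, size_t size, LineOp op, void * ctx) {
    if (size == 0) {
        return 0;
    }
    if (size > kMaxFlush) {
        size = kMaxFlush;
    }
    const uintptr_t mask  = ~(uintptr_t) (kLineSize - 1);
    const uintptr_t first = addr & mask;
    const uintptr_t last  = (addr + size - 1) & mask;
    size_t n = 0;
    for (uintptr_t p = first;; p += kLineSize) {
        op(p, ctx);
        n++;
        if (p == last) {
            break;
        }
    }
    return n;
}
```

Using the line size as the step means no line in the range is skipped or visited twice, regardless of how the buffer is aligned.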

  • hex-opreq: bump opreq batch size to 256

  • hex-mm: place src1 spad at the top of vtcm for easy reuse

  • hex-ops: introduce internal types and disable src1 reuse for now

Nothing new, just formalizing the repack / dyn.quant types we've been using.

  • htp-opreq: use tensor pointers instead of copies

  • hex-opreq: introduce more robust way for tracking vtcm/spad reuse

This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.

  • hex-cumsum: fix error post opreq merge

  • hex-opreq: move request batch handling into the session

Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner.

  • hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx

  • hex-bufs: introduce pinned mappings and use non-pinned ones for model buffers

  • hex-buf: add support for allocating shared/pinned buffer for opreqs

  • hex-opbatch: make opbatches configurable

  • hex-naming: better name for ggml_hexagon_shared_buffer

  • hex-naming: add session->c_name() helper

  • hex-opbatch: start using shm but still copy for now

  • hex-opbatch: use shared buffer for packing opbatch

  • hex-opbatch: better naming for opbatch related classes and code

  • hex-opbatch: reuse batched tensors with same data/dims/strides
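
One way to picture reusing tensors with identical data/dims/strides is an interning table that returns the existing slot when a matching descriptor was already packed. This is only a sketch under assumptions: `TensorDesc` and `TensorTable` are hypothetical, and the real batch layout is not described in these notes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical tensor descriptor: base address plus 4-D extents and byte strides.
struct TensorDesc {
    uintptr_t               data;
    std::array<uint32_t, 4> ne; // elements per dimension
    std::array<uint32_t, 4> nb; // stride in bytes per dimension

    // Lexicographic ordering over all three fields, so equal descriptors compare equal.
    bool operator<(const TensorDesc & o) const {
        return std::tie(data, ne, nb) < std::tie(o.data, o.ne, o.nb);
    }
};

class TensorTable {
public:
    // Returns the slot of an identical descriptor, or appends a new one.
    uint32_t intern(const TensorDesc & t) {
        auto it = index_.find(t);
        if (it != index_.end()) {
            return it->second;
        }
        const uint32_t slot = (uint32_t) tensors_.size();
        tensors_.push_back(t);
        index_.emplace(t, slot);
        return slot;
    }

    size_t size() const { return tensors_.size(); }

private:
    std::vector<TensorDesc>        tensors_;
    std::map<TensorDesc, uint32_t> index_;
};
```

Ops in the batch can then refer to tensors by slot, so a tensor that feeds several ops is described only once.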

  • hex-opbatch: update logging

  • hex-opbatch: add support for vmem limit for op batching

  • hex-opbatch: update htp side to properly support dynamic mmap/unmap

  • hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing

  • hex-opbatch: fixed src1 handling in act ops

  • hex-act: fix empty src1 handling in swiglu and friends

Simplify preamble macro while at it

  • hex-mm: minor fix for vtcm and dma handling in matmul

cleaning up some left-overs from merges

  • hex-opbatch: allocate extra 1KB for dspqueue overhead

  • hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc

  • hex-mm: properly handle hmx_disabled flag

  • hex-ops: update comments

  • hex-ops: add debug output for get/set-rows

  • hex-mmap: optimize un/mapping of buffers

  • hex-opreq: global cache flush and invalidate beyond 128KB threshold
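
The threshold rule can be sketched as a small decision function. The 128KB value comes from the commit title; the names, the strict comparison, and the choice of what size is measured (per tensor or per op) are assumptions made for illustration.

```cpp
#include <cstddef>

enum class FlushKind { None, Ranged, Global };

// Threshold from the commit title; whether the comparison is inclusive is an assumption.
constexpr size_t kGlobalFlushThreshold = 128 * 1024;

// Small regions are maintained line by line; larger ones fall back to a
// whole-cache operation, which avoids walking many individual lines.
inline FlushKind choose_flush(size_t bytes) {
    if (bytes == 0) {
        return FlushKind::None;
    }
    return bytes > kGlobalFlushThreshold ? FlushKind::Global : FlushKind::Ranged;
}
```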

  • hex-ops: add super simple opfilter regex for debugging

If an Op matches the regex, the hex backend will reject it.
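
A minimal sketch of such a filter, using the standard library's `std::regex`, might look like the following. The helper names are hypothetical, and which string the backend actually matches against (op name, tensor name, or a longer description) and whether it uses a search or a full match are assumptions here. The environment variable name is the one given later in these notes.

```cpp
#include <cstdlib>
#include <regex>
#include <string>

// Returns true when an op whose textual description matches the filter
// should be rejected. An empty filter rejects nothing.
// Note: an invalid pattern makes std::regex throw std::regex_error.
inline bool op_rejected(const std::string & filter, const std::string & op_desc) {
    if (filter.empty()) {
        return false;
    }
    const std::regex re(filter);
    return std::regex_search(op_desc, re);
}

// Reads the filter from the environment; unset means "no filter".
inline std::string opfilter_from_env() {
    const char * v = std::getenv("GGML_HEXAGON_OPFILTER");
    return v ? std::string(v) : std::string();
}
```

Under these assumptions, a filter of `MUL_MAT|SOFT_MAX` would reject ops described as `MUL_MAT` and leave `ADD` on the backend.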

  • hex-opbatch: wire up newer ops missed in merge and update main switch to detect this in future

  • hexagon: improved vtcm acquisition to remove inter-op overhead

Fully compatible with QNN-HTP coex

  • hex-mm: fixed hvx fallback path

  • hex-mm: lower the vmem threshold a bit further to ~3GB

  • hexagon: update debug & error logs

This also fixes an issue with newer llvm merging repack and non-repack
functions. We use those pointers to distinguish between buffer types.

  • hexagon: move ops context into main context

Just a cleanup. We don't need separate contexts at this point.

  • hex-opbatch: cleanup naming and headers for opbatch and related descriptors

  • hex-fa: it's now better to enable FA during TG to reduce graph splits

  • hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var

It's no longer useful. Please use the more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.

  • hexagon: fixed editorconfig check

  • Update ggml/src/ggml-hexagon/ggml-hexagon.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
