Details
hexagon: improved Op queuing, buffer and cache management (#21705)
- hexagon: introduce op request batching and rewrite buffer management
The host now prepares batches of requests and dispatches them via a single dspqueue message.
Buffers are mapped explicitly by the NPU while processing batches.
- hex-dma: disable l2 bypass to work around new issue due to no flushes between Ops
- hex-utils: add explicit l2flush and l2clear helpers
- hex-opreq: use fine-grain per tensor l2 management
- hex-opreq: avoid redundant invalidates for tensors we already flushed
- hex-opreq: update debug messages
- htp-opreq: reuse ops_context
- hex-opreq: do not flush or invalidate cache lines beyond buffer boundary
- hex-opreq: fix errors in log message
- Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundary"
This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.
- hexagon: limit l2 flushes to 1MB which covers l2 cache
- hex-opreq: limit cache flush to 4MB
Looks like 4MB of contiguous virtual space should cover the 1MB cache.
- hexagon: drop cache flush size to 2MB
- hex-opreq: start reworking opreq packing
- hex-opreq: introduce new way of packing opbatch where tensors are stored separately
- hex-opreq: add a simple fastrpc call to force unmap all buffers
- hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size
- hex-opreq: bump opreq batch size to 256
- hex-mm: place src1 spad at the top of vtcm for easy reuse
- hex-ops: introduce internal types and disable src1 reuse for now
Nothing new, just formalizing the repack / qyn.quant types we've been using.
- htp-opreq: use tensor pointers instead of copies
- hex-opreq: introduce more robust way for tracking vtcm/spad reuse
This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.
- hex-cumsum: fix error post opreq merge
- hex-opreq: move request batch handling into the session
Prepping everything for using dspqueue buffers; doing that inside the session is much cleaner.
- hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx
- hex-bufs: introduce pinned mmappings and use non-pinned ones for model buffers
- hex-buf: add support for allocating shared/pinned buffer for opreqs
- hex-opbatch: make opbatches configurable
- hex-naming: better name for ggml_hexagon_shared_buffer
- hex-naming: add session->c_name() helper
- hex-opbatch: start using shm but still copy for now
- hex-opbatch: use shared buffer for packing opbatch
- hex-opbatch: better naming for opbatch related classes and code
- hex-opbatch: reuse batched tensors with same data/dims/strides
- hex-opbatch: update logging
- hex-opbatch: add support for vmem limit for op batching
- hex-opbatch: update htp side to properly support dynamic mmap/unmap
- hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing
- hex-opbatch: fixed src1 handling in act ops
- hex-act: fix empty src1 handling in swiglu and friends
Simplify preamble macro while at it.
- hex-mm: minor fix for vtcm and dma handling in matmul
Cleaning up some left-overs from merges.
- hex-opbatch: allocate extra 1KB for dspqueue overhead
- hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc
- hex-mm: properly handle hmx_disabled flag
- hex-ops: update comments
- hex-ops: add debug output for get/set-rows
- hex-mmap: optimize un/mapping of buffers
- hex-opreq: global cache flush and invalidate beyond 128KB threshold
- hex-ops: add super simple opfilter regex for debugging
If an Op matches the regex, the hex backend will reject it.
- hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future
- hexagon: improved vtcm acquisition to remove inter-op overhead
Fully compatible with QNN-HTP coex
- hex-mm: fixed hvx fallback path
- hex-mm: lower the vmem threshold a bit further to ~3GB
- hexagon: update debug & error logs
This also fixes an issue with newer llvm merging repack and non-repack
functions. We use those pointers to distinguish between buffer types.
- hexagon: move ops context into main context
Just a cleanup. We don't need separate contexts at this point.
- hex-opbatch: cleanup naming and headers for opbatch and related descriptors
- hex-fa: it's now better to enable FA during TG to reduce graph splits
- hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var
It's no longer useful. Please use the more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.
- hexagon: fixed editorconfig check
- Update ggml/src/ggml-hexagon/ggml-hexagon.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Co-authored-by: Trivikram Reddy tamarnat@qti.qualcomm.com
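For readers who want a feel for the opfilter mentioned above, here is a minimal sketch of how a regex-based op filter could work. It is not the backend's actual code: the `op_filter` type, its `rejects()` method and `op_filter_from_env()` are hypothetical names, and both the string the backend matches against and the use of a substring search (rather than a full match) are assumptions. Only the `GGML_HEXAGON_OPFILTER` variable name comes from the notes above.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <regex>
#include <string>

// Hypothetical filter type: holds an optional compiled pattern.
// A null or empty pattern means "no filter", so nothing is rejected.
struct op_filter {
    std::optional<std::regex> re;

    // Note: std::regex throws std::regex_error on an invalid pattern.
    explicit op_filter(const char * pattern) {
        if (pattern != nullptr && *pattern != '\0') {
            re.emplace(pattern, std::regex::ECMAScript);
        }
    }

    // Returns true when the op description matches the pattern anywhere
    // (assumption: substring search, not a full-string match).
    bool rejects(const std::string & op_desc) const {
        return re.has_value() && std::regex_search(op_desc, *re);
    }
};

// Assumption: the pattern is read once from the environment variable
// named in the release notes.
static op_filter op_filter_from_env() {
    return op_filter(std::getenv("GGML_HEXAGON_OPFILTER"));
}
```

With a pattern such as `MUL_MAT|SOFT_MAX`, `rejects("MUL_MAT")` returns true and `rejects("ADD")` returns false under these assumptions, so a rejected op would presumably fall back to another backend.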
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled)
- macOS Intel (x64)
- iOS XCFramework
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: