github ggml-org/llama.cpp b8797


hexagon: optimization for HMX mat_mul (#21554)

  • hexagon: add async HMX worker

Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.
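The patch itself is not shown here, but the worker pattern it describes can be sketched as follows — a dedicated thread drains submitted jobs so the caller can run the next dequant/DMA stage while a matmul is in flight. All names (`worker`, `worker_submit`, `worker_drain`, and the single-slot queue) are illustrative assumptions, not the llama.cpp implementation:

```c
// Hypothetical sketch of the hmx-worker pattern: a dedicated thread runs
// jobs off the main thread so "HVX dequant/DMA" work can overlap them.
// Not the actual llama.cpp code; names and structure are assumptions.
#include <pthread.h>
#include <stdbool.h>

typedef void (*job_fn)(void *arg);

typedef struct {
    pthread_t       thread;
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    job_fn          fn;       // single-slot queue keeps the sketch short
    void           *arg;
    bool            queued;   // a job is waiting to run
    bool            busy;     // a job is currently running
    bool            stop;
} worker;

static void *worker_main(void *p) {
    worker *w = p;
    pthread_mutex_lock(&w->mtx);
    for (;;) {
        while (!w->queued && !w->stop)
            pthread_cond_wait(&w->cv, &w->mtx);
        if (!w->queued && w->stop)
            break;
        job_fn fn = w->fn;
        void *arg = w->arg;
        w->queued = false;
        w->busy   = true;
        pthread_cond_broadcast(&w->cv);      // slot free: submit may refill it
        pthread_mutex_unlock(&w->mtx);
        fn(arg);                             // the "HMX matmul" runs off-thread
        pthread_mutex_lock(&w->mtx);
        w->busy = false;
        pthread_cond_broadcast(&w->cv);      // wake anyone draining
    }
    pthread_mutex_unlock(&w->mtx);
    return NULL;
}

static void worker_submit(worker *w, job_fn fn, void *arg) {
    pthread_mutex_lock(&w->mtx);
    while (w->queued)                        // one job may queue behind the
        pthread_cond_wait(&w->cv, &w->mtx);  // running one: 2-deep pipeline
    w->fn     = fn;
    w->arg    = arg;
    w->queued = true;
    pthread_cond_broadcast(&w->cv);
    pthread_mutex_unlock(&w->mtx);
}

static void worker_drain(worker *w) {        // block until all work finished
    pthread_mutex_lock(&w->mtx);
    while (w->queued || w->busy)
        pthread_cond_wait(&w->cv, &w->mtx);
    pthread_mutex_unlock(&w->mtx);
}

static void worker_init(worker *w) {
    *w = (worker){0};
    pthread_mutex_init(&w->mtx, NULL);
    pthread_cond_init(&w->cv, NULL);
    pthread_create(&w->thread, NULL, worker_main, w);
}

static void worker_stop(worker *w) {
    pthread_mutex_lock(&w->mtx);
    w->stop = true;
    pthread_cond_broadcast(&w->cv);
    pthread_mutex_unlock(&w->mtx);
    pthread_join(w->thread, NULL);
}

// Example job: stands in for an HMX matmul on one tile.
static void bump(void *p) { ++*(int *)p; }
```

Because `worker_submit` returns as soon as the slot is queued, the caller overlaps its next stage with the running job and only blocks in `worker_drain`.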

  • hexagon: cost-based VTCM chunk search for out-stationary matmul
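The commit does not show its cost model, but a generic cost-based chunk search of this shape scores candidate chunk sizes by estimated time and keeps the cheapest one that fits the VTCM budget. Everything below — the parameters, the per-chunk setup cost, the power-of-two candidates — is an assumption for illustration only:

```c
// Illustrative (not llama.cpp's) cost-based chunk search: score each
// candidate chunk size by estimated per-row time, where a fixed
// per-chunk DMA setup cost favors larger chunks and the VTCM budget
// caps them. The cost model and all names are assumptions.
#include <stddef.h>

typedef struct {
    double setup_us;      // fixed cost to kick off one chunk's DMA
    double bytes_per_us;  // DMA bandwidth into VTCM
    double rows_per_us;   // matmul throughput
} cost_params;

static double chunk_cost_per_row(size_t rows, size_t row_bytes,
                                 const cost_params *p) {
    double dma = (double)(rows * row_bytes) / p->bytes_per_us;
    double mm  = (double)rows / p->rows_per_us;
    double pipelined = dma > mm ? dma : mm;  // stages overlap; slower one wins
    return (p->setup_us + pipelined) / (double)rows;
}

// Try power-of-two row counts that fit the VTCM budget, keep the cheapest.
static size_t pick_chunk_rows(size_t vtcm_bytes, size_t row_bytes,
                              const cost_params *p) {
    size_t best = 1;
    double best_cost = chunk_cost_per_row(1, row_bytes, p);
    for (size_t rows = 2; rows * row_bytes <= vtcm_bytes; rows *= 2) {
        double c = chunk_cost_per_row(rows, row_bytes, p);
        if (c < best_cost) {
            best_cost = c;
            best = rows;
        }
    }
    return best;
}
```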

  • hexagon: fix futex race in hmx_worker_drain
    Store the boolean in a local variable to avoid loading the atomic twice
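The general shape of this bug class can be illustrated as follows (hypothetical code, not the actual patch): loading an atomic flag twice lets it change between the loads, so the decision and the action can disagree; snapshotting it into a local guarantees both see the same value:

```c
// Illustration of the race class fixed in hmx_worker_drain (hypothetical
// code, not the actual llama.cpp patch).
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool work_pending;

// Racy shape: the flag can flip between the two atomic_load calls, e.g.
// the worker finishes right after the if(), and the second load would
// feed a futex wait that then sleeps on stale state.
static bool drain_should_wait_racy(void) {
    if (atomic_load(&work_pending))
        return atomic_load(&work_pending);  // second load: may differ
    return false;
}

// Fixed shape: one load, stored in a local; every later use refers to
// the same observed state.
static bool drain_should_wait(void) {
    bool pending = atomic_load(&work_pending);  // single snapshot
    if (pending) {
        // a futex wait here would compare against `pending`,
        // not a fresh load of the atomic
    }
    return pending;
}
```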

  • hex-mm: hmx optimize scatter/transpose and use HMX intrinsics

  • hex-vmem: drop vmem limit a touch under 3GB on v73

  • hexagon: add fwd declaration of htp_context

  • hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface

Simplifies the overall implementation and reduces thread wakeup round trips.
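The hmx-queue itself is not shown here; queue interfaces of the dma-queue style are commonly built on a single-producer/single-consumer descriptor ring, which can be sketched as below. All names (`hq_queue`, `hq_push`, `hq_pop`, the capacity) are hypothetical:

```c
// Hypothetical dma-queue-style descriptor ring: the producer pushes work
// descriptors, the consumer pops and executes them, no mutex on the fast
// path. A sketch of the interface shape only, not the llama.cpp code.
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define HQ_CAP 8u  // power of two so head/tail wraparound stays cheap

typedef struct {
    void (*fn)(void *arg);  // work callback (e.g. one HMX matmul tile)
    void *arg;
} hq_desc;

typedef struct {
    hq_desc     ring[HQ_CAP];
    atomic_uint head;  // next slot the producer writes
    atomic_uint tail;  // next slot the consumer reads
} hq_queue;

// Producer side: returns false when the ring is full.
static bool hq_push(hq_queue *q, hq_desc d) {
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == HQ_CAP)
        return false;                       // full
    q->ring[h % HQ_CAP] = d;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

// Consumer side: returns false when the ring is empty.
static bool hq_pop(hq_queue *q, hq_desc *out) {
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t == h)
        return false;                       // empty
    *out = q->ring[t % HQ_CAP];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}
```

A queue of this shape avoids a wakeup round trip per job: the producer only needs to signal the consumer when the ring transitions from empty, and the consumer can batch-drain everything that accumulated meanwhile.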

  • hex-mm: add debug log to hmx work func called from hmx-queue

  • Update hmx-queue.h

Co-authored-by: Max Krasnyansky max.krasnyansky@gmail.com


Co-authored-by: Kim-Chyan Gan kgan@qti.qualcomm.com
Co-authored-by: Max Krasnyansky maxk@qti.qualcomm.com

