github ggml-org/llama.cpp b8797


hexagon: optimization for HMX mat_mul (#21554)

  • hexagon: add async HMX worker

Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.
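The patch itself is not shown here, but the worker pattern it describes can be sketched as follows — a dedicated thread drains submitted jobs so the caller can run the next dequant/DMA stage while a matmul is in flight. All names (`worker`, `worker_submit`, `worker_drain`, and the single-slot queue) are illustrative assumptions, not the llama.cpp implementation:

```c
// Hypothetical sketch of the hmx-worker pattern: a dedicated thread runs
// jobs off the main thread so "HVX dequant/DMA" work can overlap them.
// Not the actual llama.cpp code; names and structure are assumptions.
#include <pthread.h>
#include <stdbool.h>

typedef void (*job_fn)(void *arg);

typedef struct {
    pthread_t       thread;
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    job_fn          fn;       // single-slot queue keeps the sketch short
    void           *arg;
    bool            queued;   // a job is waiting to run
    bool            busy;     // a job is currently running
    bool            stop;
} worker;

static void *worker_main(void *p) {
    worker *w = p;
    pthread_mutex_lock(&w->mtx);
    for (;;) {
        while (!w->queued && !w->stop)
            pthread_cond_wait(&w->cv, &w->mtx);
        if (!w->queued && w->stop)
            break;
        job_fn fn = w->fn;
        void *arg = w->arg;
        w->queued = false;
        w->busy   = true;
        pthread_cond_broadcast(&w->cv);      // slot free: submit may refill it
        pthread_mutex_unlock(&w->mtx);
        fn(arg);                             // the "HMX matmul" runs off-thread
        pthread_mutex_lock(&w->mtx);
        w->busy = false;
        pthread_cond_broadcast(&w->cv);      // wake anyone draining
    }
    pthread_mutex_unlock(&w->mtx);
    return NULL;
}

static void worker_submit(worker *w, job_fn fn, void *arg) {
    pthread_mutex_lock(&w->mtx);
    while (w->queued)                        // one job may queue behind the
        pthread_cond_wait(&w->cv, &w->mtx);  // running one: 2-deep pipeline
    w->fn     = fn;
    w->arg    = arg;
    w->queued = true;
    pthread_cond_broadcast(&w->cv);
    pthread_mutex_unlock(&w->mtx);
}

static void worker_drain(worker *w) {        // block until all work finished
    pthread_mutex_lock(&w->mtx);
    while (w->queued || w->busy)
        pthread_cond_wait(&w->cv, &w->mtx);
    pthread_mutex_unlock(&w->mtx);
}

static void worker_init(worker *w) {
    *w = (worker){0};
    pthread_mutex_init(&w->mtx, NULL);
    pthread_cond_init(&w->cv, NULL);
    pthread_create(&w->thread, NULL, worker_main, w);
}

static void worker_stop(worker *w) {
    pthread_mutex_lock(&w->mtx);
    w->stop = true;
    pthread_cond_broadcast(&w->cv);
    pthread_mutex_unlock(&w->mtx);
    pthread_join(w->thread, NULL);
}

// Example job: stands in for an HMX matmul on one tile.
static void bump(void *p) { ++*(int *)p; }
```

Because `worker_submit` returns as soon as the slot is queued, the caller overlaps its next stage with the running job and only blocks in `worker_drain`.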

  • hexagon: cost-based VTCM chunk search for out-stationary matmul
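The commit does not show its cost model, but a generic cost-based chunk search of this shape scores candidate chunk sizes by estimated time and keeps the cheapest one that fits the VTCM budget. Everything below — the parameters, the per-chunk setup cost, the power-of-two candidates — is an assumption for illustration only:

```c
// Illustrative (not llama.cpp's) cost-based chunk search: score each
// candidate chunk size by estimated per-row time, where a fixed
// per-chunk DMA setup cost favors larger chunks and the VTCM budget
// caps them. The cost model and all names are assumptions.
#include <stddef.h>

typedef struct {
    double setup_us;      // fixed cost to kick off one chunk's DMA
    double bytes_per_us;  // DMA bandwidth into VTCM
    double rows_per_us;   // matmul throughput
} cost_params;

static double chunk_cost_per_row(size_t rows, size_t row_bytes,
                                 const cost_params *p) {
    double dma = (double)(rows * row_bytes) / p->bytes_per_us;
    double mm  = (double)rows / p->rows_per_us;
    double pipelined = dma > mm ? dma : mm;  // stages overlap; slower one wins
    return (p->setup_us + pipelined) / (double)rows;
}

// Try power-of-two row counts that fit the VTCM budget, keep the cheapest.
static size_t pick_chunk_rows(size_t vtcm_bytes, size_t row_bytes,
                              const cost_params *p) {
    size_t best = 1;
    double best_cost = chunk_cost_per_row(1, row_bytes, p);
    for (size_t rows = 2; rows * row_bytes <= vtcm_bytes; rows *= 2) {
        double c = chunk_cost_per_row(rows, row_bytes, p);
        if (c < best_cost) {
            best_cost = c;
            best = rows;
        }
    }
    return best;
}
```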

  • hexagon: fix futex race in hmx_worker_drain
    Store the boolean in a local variable to avoid loading the atomic twice
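The general shape of this bug class can be illustrated as follows (hypothetical code, not the actual patch): loading an atomic flag twice lets it change between the loads, so the decision and the action can disagree; snapshotting it into a local guarantees both see the same value:

```c
// Illustration of the race class fixed in hmx_worker_drain (hypothetical
// code, not the actual llama.cpp patch).
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool work_pending;

// Racy shape: the flag can flip between the two atomic_load calls, e.g.
// the worker finishes right after the if(), and the second load would
// feed a futex wait that then sleeps on stale state.
static bool drain_should_wait_racy(void) {
    if (atomic_load(&work_pending))
        return atomic_load(&work_pending);  // second load: may differ
    return false;
}

// Fixed shape: one load, stored in a local; every later use refers to
// the same observed state.
static bool drain_should_wait(void) {
    bool pending = atomic_load(&work_pending);  // single snapshot
    if (pending) {
        // a futex wait here would compare against `pending`,
        // not a fresh load of the atomic
    }
    return pending;
}
```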

  • hex-mm: hmx optimize scatter/transpose and use HMX intrinsics

  • hex-vmem: drop vmem limit a touch under 3GB on v73

  • hexagon: add fwd declaration of htp_context

  • hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface

Simplifies the overall implementation and reduces thread wakeup round trips.
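The hmx-queue itself is not shown here; queue interfaces of the dma-queue style are commonly built on a single-producer/single-consumer descriptor ring, which can be sketched as below. All names (`hq_queue`, `hq_push`, `hq_pop`, the capacity) are hypothetical:

```c
// Hypothetical dma-queue-style descriptor ring: the producer pushes work
// descriptors, the consumer pops and executes them, no mutex on the fast
// path. A sketch of the interface shape only, not the llama.cpp code.
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define HQ_CAP 8u  // power of two so head/tail wraparound stays cheap

typedef struct {
    void (*fn)(void *arg);  // work callback (e.g. one HMX matmul tile)
    void *arg;
} hq_desc;

typedef struct {
    hq_desc     ring[HQ_CAP];
    atomic_uint head;  // next slot the producer writes
    atomic_uint tail;  // next slot the consumer reads
} hq_queue;

// Producer side: returns false when the ring is full.
static bool hq_push(hq_queue *q, hq_desc d) {
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == HQ_CAP)
        return false;                       // full
    q->ring[h % HQ_CAP] = d;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

// Consumer side: returns false when the ring is empty.
static bool hq_pop(hq_queue *q, hq_desc *out) {
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t == h)
        return false;                       // empty
    *out = q->ring[t % HQ_CAP];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}
```

A queue of this shape avoids a wakeup round trip per job: the producer only needs to signal the consumer when the ring transitions from empty, and the consumer can batch-drain everything that accumulated meanwhile.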

  • hex-mm: add debug log to hmx work func called from hmx-queue

  • Update hmx-queue.h

Co-authored-by: Max Krasnyansky max.krasnyansky@gmail.com


Co-authored-by: Kim-Chyan Gan kgan@qti.qualcomm.com
Co-authored-by: Max Krasnyansky maxk@qti.qualcomm.com

