github ggml-org/llama.cpp b9857

2 hours ago
Details

hexagon: flash attention rework (optimizations, accuracy improvements, etc) (#25085)

  • hex-mm: fold mm quant tasks into the main matmul threads

  • hex-mm: minor formatting fixes

  • hex-mm: cleanup is_quant checks in dma dispatch

  • hex-mm: fix dst-spad alignment

  • hex-mm: move fp kernels in the hvx-mm-kernels header

  • hex-mm: fuse with ADD

  • hex-fa: factor out ukernels into separate headers and unify the rest

  • hex-fa: move kernel-params compute into the host

  • hex-fa: refactor vtcm alloc for consistency

  • hex-fa: add support for FA_SELECT

  • hex-fa: update tracing insrumentation to cover all functions

  • hex-fa: update hvx fallback thresholds to recover t/g regressions

  • hex-fa: update tracing instrumentation

  • hex-fa: improved tracing with additional events

  • hex-fa: optimize mask processing (fastdiv, etc)

  • hex-fa: improve mask dma caching

  • hmx-fa: change loop order to maximize mask cache hits

  • hex-fa: remove over instrumentation

  • hex-fa: breakdown QKV prep trace events

  • hmx-fa: further mask proc optimizations

  • hex-fa: mask broadcast is the common case, optimize for that

  • hex-fa: use aligned loads where possible

  • hex-fa: update loops to use uint32_t indices

  • hmx-fa: fold vtcm init into q prep task

  • hex-fa: update rest of the hmx funcs to use uint32_t

  • hmx-fa: fold build_d into the main softmax loop

  • hmx-fa: start kv dmas earlier

  • hmx-fa: start mask dma a bit earlier

  • hex-fa: precompute rows per task to avoid divs

  • hmx-fa: specialize fa_o_store for f16 and f32

  • hmx-fa: prelim support for Sinks

  • hmx-fa: keep softmax accumulators in fp32

  • hex-fa: add tanh_f16 and exp2_f16 and use that in FA

  • hex-fa: use fp16 math in the hvx kernel

  • hex-fa: avoid expensive float -> __fp16 cast for slopes and softcap

  • hex-fa: replace most vec_exp_f32 with vec_exp2_f16

  • hmx-fa: vectorize sinks update

  • hex-fa: minor formatting

  • hmx-fa: fold softcap loop into the tile load

  • hmx-fa: use vectoralias to populate sinks

  • hex-fa: remove redudant check

  • hex-fa: fix vtcm size compute to use fp32 for accumulators

  • hex-mm: fix trailing spaces

  • hmx-fa: dont use -inf to init mask to avoid conversion overflows

  • hex-fa: no need to explicitly guard -inf in the f16->f32 converter now

  • hmx-fa: cleanup fa sinks handling

  • hex-mm: fixed src2 stride handling when mm is fused with add

  • hex-fa: make lto happy

macOS/iOS:

Linux:

Android:

Windows:

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)

UI:

Don't miss a new llama.cpp release

NewReleases is sending notifications on new releases.