hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (#23989)
- hex-mm: initial support for F32 * F32 -> F32 matmuls
- hex-rms-norm: fix src1 stride use in fused rms_norm_mul
- hex-ops: clear spad pointers in the ops that clobber them
This fixes an odd case where fused rms-norm-mul was failing, but only in qwen3.5-2B and only at certain op-batch sizes.
- hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX
Decided to use Q4_0 * F32 -> F32 matmul for this.
Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16.
Super simple and pretty efficient.
- hmx-mm: route f16 2D matmuls through the same kernel used for all other types
- hmx-mm: re-introduce the pipelined vs non-pipelined modes that we used to have, but in a much more generic way
This update further improves matmul performance and at the same time removes most of the redundant logic
we had in different paths.
- hmx-fa: slightly improved pipeline, similar to the matmul updates
- hmx-mm: initial version of MUL_MAT_ID support for HMX
- hmx-mm: fixed mxfp4 handling for MUL_MAT_ID
- hex-gdn: optimize GATED_DELTA_NET
DMA prefetch/double-buffering, vectorize everything with HVX; in other words, the usual :)
- hmx-mm: missed one more case where we can use fastmod
- hexagon: update DCVS settings for a slight perf bump
- hmx-fa: use fastdiv in hmx-flash-attn
- hmx-fa: precompute slope values to avoid disrupting the inner loop
- hvx-utils/fa: new HVX helpers for powf and logf, and use those to speed up FA alibi
hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty
-
hex-fa: correctly fallback to HVX if we have sinks or the dims are not quite right
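For readers unfamiliar with the fastdiv/fastmod entries above: the general idea is to replace integer division by a loop-invariant divisor with a multiply and a shift, using constants computed once up front. The following is a minimal sketch of that general technique only, not the Hexagon backend's actual code. The names `fastdiv_t`, `fastdiv_init`, `fastdiv_u32` and `fastmod_u32` are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper type: constants precomputed once per divisor d.
typedef struct {
    uint32_t mp; // magic multiplier
    uint32_t l;  // shift amount, ceil(log2(d))
    uint32_t d;  // the divisor itself, kept for fastmod
} fastdiv_t;

// Precompute the multiplier and shift for a non-zero 32-bit divisor.
static fastdiv_t fastdiv_init(uint32_t d) {
    assert(d != 0);
    uint32_t l = 0;
    while (l < 32 && ((uint64_t)1 << l) < d) {
        l++;
    }
    // mp = floor(2^32 * (2^l - d) / d) + 1, which always fits in 32 bits.
    uint64_t mp = (((uint64_t)1 << 32) * (((uint64_t)1 << l) - d)) / d + 1;
    fastdiv_t v = { (uint32_t)mp, l, d };
    return v;
}

// n / d using one widening multiply, one add and one shift.
static uint32_t fastdiv_u32(uint32_t n, fastdiv_t v) {
    uint64_t hi = ((uint64_t)n * v.mp) >> 32; // high half of n * mp
    return (uint32_t)((hi + n) >> v.l);       // 64-bit add avoids overflow
}

// n % d derived from the quotient.
static uint32_t fastmod_u32(uint32_t n, fastdiv_t v) {
    return n - fastdiv_u32(n, v) * v.d;
}
```

In a kernel, `fastdiv_init` would typically run once per op on the host or in setup code, so the inner loop only pays for the multiply and shift. For example, `fastdiv_u32(100, fastdiv_init(7))` yields 14 and `fastmod_u32(100, fastdiv_init(7))` yields 2.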
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
- macOS Intel (x64)
- iOS XCFramework
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
- Ubuntu x64 (SYCL FP32) DISABLED
Android:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.3 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL) DISABLED
- Windows x64 (HIP)
openEuler:
- DISABLED
- openEuler x86 (310p)
- openEuler x86 (910b, ACL Graph)
- openEuler aarch64 (310p)
- openEuler aarch64 (910b, ACL Graph)
UI: