Details
hexagon: Q4_0 and MXFP4 repack fixes (#20527)
-
hexagon: fix tail corruption with row sizes not a multiple of 256
-
hexagon: use different stride for repacking partial blocks
-
hex-mm: update repack and kernels to avoid shuffles for full 256-element blocks
The previous commit changed the repacking to use even:odd (0:1,2:3,...) packing
instead of the original (0:128,1:129,...) packing in order to fix tail corruption.
Since the mm kernels already deal with partial tails, we can use even:odd
packing only for the last block.
This avoids the performance penalty of having to shuffle to zip the elements
in the common case.
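
The two layouts can be illustrated with a minimal scalar sketch. This is not the Hexagon code from the commit: the function names, the one-value-per-byte input, and the choice of low versus high nibble are assumptions made for illustration, and the real kernels operate on HVX vectors. It only shows how two 4-bit values could share a byte under each scheme, and why even:odd packing suits a partial last block: it touches exactly n/2 bytes instead of relying on a fixed offset of 128.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_ELEMS 256 /* block size named in the commit message */

/* Split layout (0:128, 1:129, ...): byte i holds element i in the low nibble
 * and element i + 128 in the high nibble. Requires a full 256-element block.
 * The nibble order is an assumption of this sketch. */
static void pack_split(const uint8_t *q, uint8_t *out) {
    for (size_t i = 0; i < BLOCK_ELEMS / 2; i++) {
        out[i] = (uint8_t)((q[i] & 0x0F) | ((q[i + BLOCK_ELEMS / 2] & 0x0F) << 4));
    }
}

/* Even:odd layout (0:1, 2:3, ...): byte i holds elements 2i and 2i + 1.
 * Works for any even n <= 256 and writes exactly n / 2 bytes. */
static void pack_even_odd(const uint8_t *q, size_t n, uint8_t *out) {
    for (size_t i = 0; i < n / 2; i++) {
        out[i] = (uint8_t)((q[2 * i] & 0x0F) | ((q[2 * i + 1] & 0x0F) << 4));
    }
}

/* Hypothetical row repack: full blocks use the split layout, and only the
 * partial last block (if any) falls back to even:odd. */
static void repack_row(const uint8_t *q, size_t n, uint8_t *out) {
    size_t full = n / BLOCK_ELEMS;
    size_t tail = n % BLOCK_ELEMS;
    for (size_t b = 0; b < full; b++) {
        pack_split(q + b * BLOCK_ELEMS, out + b * (BLOCK_ELEMS / 2));
    }
    if (tail != 0) {
        pack_even_odd(q + full * BLOCK_ELEMS, tail, out + full * (BLOCK_ELEMS / 2));
    }
}
```

With this arrangement a kernel reading a full block can take the low and high nibbles as two contiguous 128-element halves without a shuffle, while only the tail block needs the interleaved handling.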
-
hex-mm: update rmpy x8 for better optimizations
-
hex-mm: tighten supported MUL_MAT checks to avoid spurious failures
-
hex-mm: use vzero to init accumulators
-
hex-mm: properly call partial rmpy_x8
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: