hexagon: general DMA and Binary Op fixes for large strides (#20918)
- hex-dma: make chained dma the default to handle newer models
This also includes some new instrumentation that we can remove later.
- hexagon: add uint32 dump helper
- hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv
ssm-conv uses the HVX gather instruction, which cannot handle cases where base+offset
spans a page boundary.
- hexagon: update ssm-conv to make base-addr compute a bit easier to read
- hex-dma: use 1d mode for reshaping, it supports sizes up to 24-bits (>16MB)
- hex-bin: fix incorrect stride logic
- hexagon: make sure repack buffs are dumped for verbose > 2
- hex-bin: consistently use dma_queue_push even for dummy dst transactions
- hex-dma: start using 2d-wide mode on v75 and up
This removes the need to deal with the 16-bit limitation on the strides.
- hex-bin: cleanup kernel selection logic
- hex-bin: cleanup binary op core and fix transposed tensor handling
- snapdragon: update run-bench to use larger ubatch and fa-on
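
For readers unfamiliar with the two constraints mentioned above (gather ranges that must not span a page boundary, and strides limited to 16 bits), here is a minimal sketch of the kind of checks involved. It is not code from this change: the helper names `spans_page_boundary` and `stride_fits_16bit` are hypothetical, the page size is passed in by the caller because the notes do not state it, and the 16-bit stride field is assumed to be unsigned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: true if the byte range [base, base + len) touches
// more than one page. page_size must be a power of two; the actual VTCM
// page size is an assumption supplied by the caller.
static bool spans_page_boundary(uintptr_t base, size_t len, size_t page_size) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    if (len == 0) {
        return false; // an empty range touches no page at all
    }
    uintptr_t mask  = ~(uintptr_t)(page_size - 1);
    uintptr_t first = base & mask;             // page holding the first byte
    uintptr_t last  = (base + len - 1) & mask; // page holding the last byte
    return first != last;
}

// Hypothetical helper: true if a byte stride fits in a 16-bit field
// (assumed unsigned here), i.e. no wider descriptor mode is needed.
static bool stride_fits_16bit(size_t stride) {
    return stride <= UINT16_MAX;
}
```

A range that fails the first check would need a single-page allocation, and a stride that fails the second would need a wider descriptor mode or a different transfer layout.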
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: