Performance improvements in S/D/ZGEMM on Zen3/4/5
SGEMM optimizations for tiny matrices
New Thread Control APIs with global and thread-local variants
Support for OpenMP 2.5 and earlier versions
Optional support for reproducibility using compiler options
Updates to aocl-gemm add-on module:
  Column-major support for BF16 and FP32
  FP32 RD kernels for AVX512 and AVX2 ISA
  GEMV kernel for m=1 case using AVX2 and AVX512 YMM registers