AOCL-BLAS 5.2.2 Release Notes
Overview
AOCL-BLAS 5.2.2 is an incremental release building on the 5.2 GA release, delivering performance optimizations, bug fixes, improved threading stability, and expanded test coverage.
Performance Optimizations
- Optimized SGEMM rd kernels on Zen3
- Improved SGEMM rd kernel on Zen4/Zen5
- SGEMM tiny path tuning for Zen4 and Zen5
- Added tiny path for SGEMM
- Added fast path for single-threaded AVX512 DGEMV kernel
- Tuned decision logic for DGEMV multithreading for skinny sizes
- Tuned DGEMV no-transpose thresholds
- Re-tuned GEMV thresholds
- Optimal rerouting of GEMV inputs to avoid packing
- Tuned input threshold for tiny DGEMM interface
- Tuned ZGEMM threshold for Zen5
- Changed ZGEMM SUP threshold logic for Zen5 to fix performance regression
- Replaced intrinsics with inline assembly for bli_saxpyv_zen4_int and bli_saxpyf_zen_int_5
- Improved fringe case handling for AXPYV kernel
- Disabled small_gemm for Zen4/Zen5 and added single-thread check for tiny path
- Disabled GEMV (M1) rerouting in BF16 APIs (AVX512)
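The AXPYV bullets above concern "fringe" handling: the leftover elements when the vector length is not a multiple of the kernel's vector width. The sketch below is a portable C illustration of that split, not AOCL's kernel. The actual Zen kernels use AVX2/AVX512 intrinsics or inline assembly, and the function name saxpyv_ref is hypothetical. It computes y := alpha*x + y with unit stride, processing blocks of 8 floats (one 256-bit vector) in the main loop and the remaining 0 to 7 elements in a separate fringe loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical portable reference for y := alpha*x + y (unit stride).
 * Not the AOCL-BLAS implementation; it only illustrates the
 * main-loop / fringe-loop split used by vectorized kernels. */
void saxpyv_ref(size_t n, float alpha, const float *x, float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    size_t i = 0;

    /* Main loop: full blocks of 8 elements, the width of one 256-bit
     * vector of floats. A SIMD kernel would issue one vector operation here. */
    for (; i + 8 <= n; i += 8)
        for (size_t k = 0; k < 8; ++k)
            y[i + k] += alpha * x[i + k];

    /* Fringe loop: the 0..7 elements left over when n % 8 != 0. */
    for (; i < n; ++i)
        y[i] += alpha * x[i];
}
```

With n = 11, the main loop covers elements 0 to 7 and the fringe loop covers 8 to 10. In a vectorized kernel the fringe is typically where out-of-bounds reads or inconsistent results creep in, which is why it is often handled with masked loads (AVX512) or scalar code.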
Bug Fixes
- Fixed memory leak in DGEMV kernel
- Fixed extreme values handling in GEMV
- Fixed an integer division in GEMV that should have been a floating-point (double) division
- Fixed integer overflow in TPSV
- Fixed out-of-bound access in F32 matrix add/mul ops
- Fixed BF16-to-F32 conversion in the AVX2 F32 code path
- Fixed the BF16 AVX2 conversion path
- Fixed F32-to-BF16 conversion and AVX512 ISA support checks
- Fixed cblas_ctrmm invalid diag handling
- Coverity issue fix for ZTRSM
- Fixed Coverity static analysis issue in DTRSM
- Fixed high priority Coverity issues in LPGEMM
- Fixed Coverity issues with CID: 23269 and CID: 137049
- Resolved operator precedence warning in Zen5 DCOMPLEX threshold logic
- Modified AXPY kernel to ensure consistency of numerical results
- Additional GCC 15 workaround for SUP kernels
- Disabled no post-ops path in LPGEMM F32 kernels for certain GCC versions
- Updated guards in the s8s8s32of32_sym_quant framework
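Two of the fixes above name generic bug classes: an integer division where a floating-point division was intended (GEMV) and an integer overflow (TPSV). The notes do not show the affected code, so the sketch below only illustrates each class in isolation; all function names are hypothetical. For the overflow case it assumes upper-triangular packed storage, column by column and 0-based, where column j starts at offset j*(j+1)/2. With 32-bit int, that product overflows for j of 46341 or more.

```c
#include <assert.h>
#include <stdint.h>

/* Integer-division pitfall: m / n on ints truncates BEFORE the result
 * is converted to double, so 3 / 2 yields 1.0 rather than 1.5. */
double ratio_truncating(int m, int n)
{
    assert(n != 0);
    return m / n;                      /* bug: integer division */
}

double ratio_correct(int m, int n)
{
    assert(n != 0);
    return (double)m / (double)n;      /* convert first, then divide */
}

/* Packed upper-triangular storage (column-major, 0-based):
 * column j begins at offset j*(j+1)/2. Doing the multiplication in
 * 64-bit arithmetic avoids the 32-bit overflow for large j. */
int64_t packed_col_offset(int64_t j)
{
    assert(j >= 0);
    return j * (j + 1) / 2;
}
```

For example, packed_col_offset(70000) is 2450035000, which does not fit in a 32-bit signed int, so the same expression evaluated in int would overflow.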
Threading & Stability
- Fixed data race in native code-path
- Added an OpenMP barrier before releasing threadinfo and the global communicator to avoid a race
- Replaced OMP barrier with bli_thread_barrier and added similar fixes
- Global communicator is now freed outside the parallel region
- Threading: the global communicator is freed after the parallel region completes
- Initialized mem_t structures safely and handled NULL communicator in threading
- Fixed DTL dynamic thread logging in BLAS operations
- Added dynamic threads and actual threads in the DTL log of SAXPY
- Enabled disable-sba-pools feature in AOCL-BLAS
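Several threading fixes above move the release of shared state (the global communicator) so that it happens only after the parallel region has ended. The sketch below shows that lifetime rule with plain OpenMP. It is not BLIS's communicator API: shared_comm_t and sum_with_shared_comm are hypothetical. The shared object is created before the region, each thread writes only its own slot, and the object is freed after the region's implicit barrier, when no thread can still be using it. Without OpenMP enabled, the pragma is ignored and the code runs serially with the same result.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for state shared by all threads in a region. */
typedef struct {
    int     nslots;   /* number of per-thread slots allocated */
    double *partial;  /* per-thread partial sums */
} shared_comm_t;

/* Sums x[0..n-1] using up to nthreads threads. */
double sum_with_shared_comm(const double *x, int n, int nthreads)
{
    assert(n >= 0 && nthreads >= 1);
    assert(n == 0 || x != NULL);

    /* Create the shared object BEFORE the parallel region. */
    shared_comm_t comm;
    comm.nslots  = nthreads;
    comm.partial = calloc((size_t)nthreads, sizeof(double));
    if (comm.partial == NULL)
        abort();

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = 0, nt = 1;
#ifdef _OPENMP
        tid = omp_get_thread_num();   /* nt never exceeds nthreads */
        nt  = omp_get_num_threads();
#endif
        double acc = 0.0;
        for (int i = tid; i < n; i += nt)
            acc += x[i];
        comm.partial[tid] = acc;      /* each thread writes only its own slot */
    } /* implicit barrier: every thread has finished here */

    double total = 0.0;
    for (int t = 0; t < comm.nslots; ++t)
        total += comm.partial[t];     /* unused slots stay 0.0 from calloc */

    /* Release the shared object only AFTER the region has ended. */
    free(comm.partial);
    return total;
}
```

Freeing comm.partial from inside the region instead (for example, by whichever thread finishes first) would let other threads write to released memory, which is the kind of race the fixes above address.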
Build System & Infrastructure
- Updates to the build systems (CMake and Make) for LPGEMM compilation
- CMake: Added targets and aliases so that BLIS works with FetchContent
- Enabled security flags by default
- Added getpid support for DTL on Windows
- Added compiler information to make showconfig and bench_getlibraryInfo
- Made all bench applications consistent
- Standardized Zen kernel names
Test Suite (GTestSuite)
- Added Banded API tests: gbmv, hbmv, sbmv, tbmv, tbsv
- Added Packed API tests: hpmv, spmv, tpmv, tpsv, hpr, hpr2, spr, spr2
- Added conjugate dot and ger IIT_ERS tests
- Added data pool support
- Moved data generator definitions to a cpp file
- Computediff improvements
- Fixed swap tests
- Split up tests for better organization
- Multiple miscellaneous test fixes
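The computediff improvements and the GEMV extreme-value fix both concern comparing computed results against a reference when the data can contain NaN or Inf. GTestSuite's actual computediff is not shown in these notes. The sketch below only illustrates the general idea of a tolerance-based comparison that treats special values explicitly. values_match and its tolerance rule (relative error, falling back to absolute error for magnitudes below 1) are assumptions for illustration.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical comparison helper: does `got` match reference `ref`?
 * - NaN matches only NaN.
 * - Infinities must match exactly, including sign.
 * - Finite values use a relative tolerance, switching to an absolute
 *   tolerance when |ref| < 1 so values near zero are not over-penalized. */
bool values_match(double ref, double got, double rel_tol)
{
    assert(rel_tol >= 0.0);

    if (isnan(ref) || isnan(got))
        return isnan(ref) && isnan(got);
    if (isinf(ref) || isinf(got))
        return ref == got;

    double diff  = fabs(ref - got);
    double scale = fabs(ref) > 1.0 ? fabs(ref) : 1.0;
    return diff <= rel_tol * scale;
}
```

A naive check such as fabs(ref - got) <= tol fails on exactly these inputs: NaN compares false against everything, and Inf minus Inf is NaN, so both would be reported as mismatches even when the computation is correct.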
Documentation
- Fixed documentation on building the bench applications
- Added external PR integration process and flowchart to CONTRIBUTING.md
- Updated LICENSE and NOTICES files for AOCL-5.2 release
Compiler Warnings
- Multiple rounds of compiler warnings fixes
- Added bli_print_msg before bli_abort() in bli_thrinfo_sup_create_for_cntl
- DTL log updates
- Code tidying