OpenMathLib/OpenBLAS v0.3.29 on GitHub

general:

fixed a potential NULL pointer dereference in multithreaded builds
added function aliases for GEMMT using its new name GEMMTR adopted by Reference-BLAS
fixed a build failure when building without LAPACK_DEPRECATED functions
the minimum required CMake version for CMake-based builds was raised to 3.16.0 in order to remove many compatibility and deprecation warnings
added more detailed CMake rules for OpenMP builds (mainly to support recent LLVM)
fixed the behavior of the recently added CBLAS_?GEMMT functions with row-major data
improved thread scaling of multithreaded SBGEMV
improved thread scaling of multithreaded TRTRI
fixed compilation of the CBLAS testsuite with gcc14 (and no Fortran compiler)
added support for option handling changes in flang-new from LLVM18 onwards
added support for recent calling convention changes in Cray and NVIDIA compilers
added support for compilation with the NAG Fortran compiler
fixed placement of the -fopenmp flag and libsuffix in the generated pkgconfig file
improved the CMakeConfig file generated by the Makefile build
fixed const-correctness of cblas_?geadd in cblas.h
fixed a potential inaccuracy in multithreaded BLAS3 calls
fixed empty implementations of get/set_affinity that print a warning in OpenMP builds
fixed function signatures for TRTRS in the converted C version of LAPACK
fixed omission of several single-precision LAPACK symbols in the shared library
improved build instructions for the provided "pybench" benchmarks
improved documentation, including added build instructions for WoA and HarmonyOS as well as descriptions of environment variables that affect build and runtime behavior
added a separate "make install_tests" target for use with cross-compilations
integrated improvements and corrections from Reference-LAPACK:
- removed a comparison in LAPACKE ?tpmqrt that is always false (LAPACK PR 1062)
- fixed the leading dimension for B in tests for GGEV (LAPACK PR 1064)
- replaced the ?LARFT functions with a recursive implementation (LAPACK PR 1080)
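
As a reading aid for the GEMMT/GEMMTR entries above: the operation computes C := alpha*A*B + beta*C but updates only the upper or lower triangle of the square matrix C. The following is a minimal pure-Python sketch of that behaviour for row-major data. The function name gemmt_reference and the list-of-rows layout are illustrative assumptions, not part of the OpenBLAS API, and the transposition options of the real routine are omitted.

```python
def gemmt_reference(uplo, n, k, alpha, a, b, beta, c):
    """Reference sketch of GEMMT semantics (no transposes).

    a is n x k, b is k x n, c is n x n, each a list of rows (row-major).
    Only the triangle of c selected by uplo ("U" or "L") is updated;
    entries in the other triangle are left untouched.
    """
    if uplo not in ("U", "L"):
        raise ValueError("uplo must be 'U' or 'L'")
    for i in range(n):
        # Upper triangle: columns i..n-1; lower triangle: columns 0..i.
        cols = range(i, n) if uplo == "U" else range(0, i + 1)
        for j in cols:
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = alpha * acc + beta * c[i][j]
    return c
```

Comparing the output of such a reference against a library call on small matrices is one way to check which triangle a row-major call actually writes.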

arm:

fixed build with recent versions of the NDK (missing .type declaration of symbols)

arm64:

fixed a long-standing bug in the (generic) c/zgemm_beta kernel that could lead to reads and writes outside the array bounds in some circumstances
rewrote cpu autodetection to scan all cores and return the highest performing type
improved the DGEMM performance for SVE targets and small matrix sizes
improved dimension criteria for forwarding from GEMM to GEMV kernels
added SVE kernels for ROT and SWAP
improved SVE kernels for SGEMV and DGEMV on A64FX and NEOVERSEV1
added support for using the "small matrix" kernels with CMake as well
fixed compilation on Windows on Arm
improved compile-time detection of SVE capability
added cpu autodetection and initial support for Apple M4
added support for compilation on systems running iOS
added support for compilation on NetBSD ("evbarm" architecture)
fixed NRM2 implementations for generic SVE targets and the Neoverse N2
fixed compilation for SVE-capable targets with the NVIDIA compiler

x86_64:

fixed a wrong storage size in the SBGEMV kernel for Cooper Lake
added cpu autodetection for Intel Granite Rapids
added cpu autodetection for AMD Ryzen 5 series
added optimized SOMATCOPY_CT for AVX-capable targets
fixed the fallback implementation of GEMM3M in GENERIC builds
tentatively re-enabled builds with the EXPRECISION option
worked around a miscompilation of tests with mingw32-gfortran14
added support for compilation with the Intel oneAPI 2025.0 compiler on Windows

power:

fixed multithreaded SBGEMM
fixed a CMake build problem on POWER10
improved the performance of SGEMV
added vectorized implementations of SBGEMV and support for forwarding 1xN SBGEMM to them
fixed illegal instructions and potential memory overflow in SGEMM on PPCG4
fixed handling of NaN and Inf arguments in SSCAL and DSCAL on PPC440, G4 and 970
added improved CGEMM and ZGEMM kernels for POWER10
added Makefile logic to remove all optimization flags in DEBUG builds

mips64:

fixed compilation with gcc14
fixed GEMM parameter selection for the MIPS64_GENERIC target
fixed a potential build failure when compiling with OpenMP

loongarch64:

fixed compilation for Loongson3 with recent versions of gmake
fixed a potential loss of precision in Loongson3A GEMM
fixed a potential build failure when compiling with OpenMP
added optimized SOMATCOPY for LASX-capable targets
introduced a new cpu naming scheme while retaining compatibility
added support for cross-compiling Loongarch64 targets with CMake
added support for compilation with LLVM

riscv64:

removed thread yielding overhead caused by sched_yield
replaced some non-standard intrinsics with their official names
fixed and sped up the implementations of CGEMM/ZGEMM TCOPY for vector lengths 128 and 256
improved the performance of SNRM2/DNRM2 for RVV1.0 targets
added optimized ?OMATCOPY_CN kernels for RVV1.0 targets

md5sums
d7df28656fa28616da028d1e94eab216 OpenBLAS-0.3.29.zip
853a0c5c0747c5943e7ef4bbb793162d OpenBLAS-0.3.29.tar.gz
195aff920ba64329cbd358a54d4fefc5 OpenBLAS-0.3.29_x64.zip
bd44474436a81e1d7ac66a5d38124d24 OpenBLAS-0.3.29_x64_64.zip
483119550c3414cbc44027a6c782f581 OpenBLAS-0.3.29_x86.zip
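
A downloaded archive can be checked against the md5sums listed above. The following is a minimal sketch using Python's standard hashlib module. The helper names md5_of_file and verify_download are hypothetical, and the file is assumed to keep the name shown in the list. An MD5 match indicates an uncorrupted download but is not a strong guarantee against deliberate tampering.

```python
import hashlib
import os

# Checksums copied from the md5sums list of this release.
EXPECTED_MD5 = {
    "OpenBLAS-0.3.29.zip": "d7df28656fa28616da028d1e94eab216",
    "OpenBLAS-0.3.29.tar.gz": "853a0c5c0747c5943e7ef4bbb793162d",
    "OpenBLAS-0.3.29_x64.zip": "195aff920ba64329cbd358a54d4fefc5",
    "OpenBLAS-0.3.29_x64_64.zip": "bd44474436a81e1d7ac66a5d38124d24",
    "OpenBLAS-0.3.29_x86.zip": "483119550c3414cbc44027a6c782f581",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the listed value for its name.

    Raises KeyError if the file name is not one of the listed archives.
    """
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected
```

For example, verify_download("OpenBLAS-0.3.29.tar.gz") would return True only if the file in the current directory hashes to the value listed for that archive.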
