This release provides major new functionality in the core BLIS framework and many other bugfixes and small changes. This is a release candidate; please try it out and provide feedback on performance and stability!
Improvements present in 2.0:
Known Issues:
- There is a performance regression in the `ztrmm` and `ztrsm` operations. On the Ampere Altra, performance is impacted by up to 30%; it is currently unknown if and how much this bug affects other architectures, but the effect should be much smaller in most cases.
Framework:
- BLIS now supports "plugins", which provide additional functionality through user-defined kernels, blocksizes, and kernel preferences. Users can use an installed copy of BLIS (even a binary-only distribution) to create a plugin outside of the BLIS source tree. User-written reference kernels can then be registered into BLIS, and are compiled by the BLIS build system for all configured architectures. This also means that user-provided kernels participate in run-time kernel selection based on the actual hardware used! Additionally, users can provide and register optimized kernels for specific architectures which are automatically selected as appropriate. See `docs/PluginHowTo.md` for more information.
- A new API has been added which allows users to modify the default "control tree". This data structure defines the specific algorithmic steps used to implement a level-3 BLAS operation such as `gemm` or `syrk`. Users can start with a predefined control tree for one of the level-3 BLAS operations (except `trsm` currently) and then modify it to produce a custom operation. Users can change the kernels for packing and computation and their associated blocksizes, and can provide additional information (such as external parameters or additional data) which is passed directly to the kernels. See `docs/PluginHowTo.md` for more information and a working example.
- All level-3 BLAS operations (except `trsm`) now support full mixed-precision mixed-domain computation. The A, B, and C matrices, as well as the alpha and beta scalars, may be provided in any of the supported data types (single/double precision and real/complex domain, currently), and an additionally-provided computational precision controls how the computation is actually performed internally. The computational precision can be set on the `obj_t` structure representing the C matrix.
- Added a `func2_t` struct for dealing with 2-type kernels (see Kernels below). A `func2_t` can be safely cast to `func_t` to refer to only kernels with equal type parameters. (Devin Matthews)
- The `bli_*_front` functions have been removed.
- Extensive other back-end changes and improvements.
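To illustrate what a separate computational precision means in the mixed-precision item above, here is a minimal, self-contained sketch in plain C. It does not use the BLIS API: the function name `gemm_f32_comp_f64` and its signature are hypothetical, chosen only for this example. It stores A, B, and C in single precision but performs the multiply-accumulate internally in double precision.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT a BLIS function: computes
 *   C := beta * C + alpha * A * B
 * where A (m x k), B (k x n), and C (m x n) are column-major matrices
 * stored in single precision, while all arithmetic is carried out in
 * double precision (the "computational precision" of this sketch). */
static void gemm_f32_comp_f64(size_t m, size_t n, size_t k,
                              float alpha, const float *a, size_t lda,
                              const float *b, size_t ldb,
                              float beta, float *c, size_t ldc)
{
    /* Leading dimensions must cover a full column. */
    assert(lda >= m && ldb >= k && ldc >= m);

    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            /* Accumulate the dot product in double precision. */
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += (double)a[i + p * lda] * (double)b[p + j * ldb];

            /* Apply the scalars in double, then round once to float. */
            double out = (double)alpha * acc
                       + (double)beta * (double)c[i + j * ldc];
            c[i + j * ldc] = (float)out;
        }
    }
}
```

For example, multiplying the 1x3 row `{1.0e8f, 1.0f, -1.0e8f}` by a 3x1 column of ones yields exactly `1.0f` here, because the intermediate sum `1e8 + 1` is representable in double precision. An accumulation carried out purely in single precision would typically lose that `1`. In BLIS itself the computational precision is selected on the `obj_t` for C, as described above, rather than by choosing a differently named function.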
Compatibility:
- Added a ScaLAPACK compatibility mode which disables some conflicting BLAS definitions. (Field Van Zee)
- Fixed issues with improperly escaped strings in Python scripts for compatibility with Python 3.12+. (@AngryLoki)
- Added a user-defined macro `BLIS_ENABLE_STD_COMPLEX` which uses `std::complex` typedefs in `blis.h` for C++ code. (Devin Matthews)
- Fixed a bug in the definition of some scalar level-0 macros affecting compatibility of `bli_creal` and `bli_zreal`, for example. (Devin Matthews)
- Fixed improperly-quoted strings in Python scripts which affected compatibility with Python 3.12+. (@AngryLoki)
- The static initializer macros (`BLIS_*_INITIALIZER`) have been fixed for compatibility with C++. (Devin Matthews)
- Install "helper" `blis.h` and `cblas.h` headers directly to `INCDIR` (in addition to the full files in `INCDIR/blis`). (Field Van Zee, Jed Brown, Mo Zhou)
Kernels:
- Fixed an out-of-bounds read bug in the `haswell` `gemmsup` kernels. (John Mather)
- Fixed a bug in the complex-domain `gemm` kernels for `piledriver`. (@rmast)
- Kernel, blocksize, and preference lookup functions now use `siz_t` rather than specific enums. (Devin Matthews)
- Fixed some issues with run-time kernel detection and added more ARM part numbers/manufacturer codes. (John Mather)
- Kernels can now be added which have two datatype parameters. Kernel IDs are assigned such that 1-type and 2-type kernels cannot be interchanged accidentally. (Devin Matthews)
- The packing microkernels and computational microkernels (`gemm` and `gemmtrsm`) now receive offsets into the global matrix. For the computational microkernels, the offsets are passed via the `auxinfo_t` struct. (Devin Matthews)
- The separate "MRxk" and "NRxk" packing kernels have been merged into one generic packing kernel. Packing kernels are now expected to pack any size micropanel, but may optimize for specific shapes. (Devin Matthews)
- Added explicit packing kernels for diagonal portions of matrices, and for certain mixed-domain/1m cases. (Devin Matthews)
- Improved support for duplication during packing ("broadcast-B") across all packing kernels.
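As a rough illustration of the packing items above, the following self-contained C sketch packs a micropanel of any height `m` up to the register blocksize `mr`, zero-pads the remaining rows, and optionally duplicates each element `dup` times in the spirit of "broadcast-B". The function name, parameter list, and packed layout are assumptions made for this example only; they are not the BLIS packing kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, NOT the BLIS kernel signature.
 * Packs an m x k micropanel of a matrix with row stride rs and column
 * stride cs into the buffer p. Each packed column holds mr logical rows;
 * rows m..mr-1 are filled with zeros. Every element is written dup times
 * in a row (dup == 1 means no duplication), so a packed column occupies
 * mr * dup entries and p must hold k * mr * dup entries in total. */
static void pack_micropanel(size_t m, size_t k, size_t mr, size_t dup,
                            const double *a, size_t rs, size_t cs,
                            double *p)
{
    assert(m <= mr && dup >= 1);

    const size_t ldp = mr * dup; /* stride between packed columns */

    for (size_t j = 0; j < k; ++j) {
        double *col = p + j * ldp;
        for (size_t i = 0; i < mr; ++i) {
            /* Read from the source inside the panel, pad with zero outside. */
            const double v = (i < m) ? a[i * rs + j * cs] : 0.0;
            for (size_t d = 0; d < dup; ++d)
                col[i * dup + d] = v;
        }
    }
}
```

A single routine of this shape handles both full and edge micropanels. An optimized kernel could special-case `m == mr` with unit stride while still falling back to the general loop for other shapes.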
Build system:
- The `cblas.h` file is now "flattened" immediately after `blis.h` is (if enabled), rather than later in the build process. (Jeff Diamond, Field Van Zee)
- Added script to help with preparing release candidate branches. (Field Van Zee)
- The configure script has been overhauled. In particular, using spaces in `CC`/`CXX` is now supported. (Devin Matthews)
- Improved support for C++ source files in BLIS or in plugins. (Devin Matthews)
Testing:
- test/3 drivers now allow using the "default" induced method, rather than forcing native or 1m operation. (Field Van Zee, Leick Robinson)
- Fixed some segfaults in the test/3 drivers. (Field Van Zee, Leick Robinson)
- The testsuite now tests all possible type combinations when requested. (Devin Matthews)
- Improved detection of problems in `make check-blis` and related targets. (Devin Matthews)
Documentation:
- Added documentation for the new plugin system and for creating custom operations by modifying the BLIS control tree. (Devin Matthews)
- Updated documentation for downloading BLIS in `README.md` and instructions for maintainers in `RELEASING`. (Field Van Zee)