github flame/blis 2.0-rc0
BLIS 2.0-rc0

Pre-release, 21 hours ago

This release adds major new functionality to the core BLIS framework, along with many bug fixes and small changes. This is a release candidate; please try it out and provide feedback on performance and stability!

Improvements present in 2.0:

Known Issues:

  • There is a performance regression in the ztrmm and ztrsm operations. On the Ampere Altra, performance drops by up to 30%; it is not yet known whether, or by how much, this bug affects other architectures, but the effect should be much smaller in most cases.

Framework:

  • BLIS now supports "plugins", which provide additional functionality through user-defined kernels, blocksizes, and kernel preferences. Users can use an installed copy of BLIS (even a binary-only distribution) to create a plugin outside of the BLIS source tree. User-written reference kernels can then be registered into BLIS, and are compiled by the BLIS build system for all configured architectures. This also means that user-provided kernels participate in run-time kernel selection based on the actual hardware used! Additionally, users can provide and register optimized kernels for specific architectures, which are automatically selected as appropriate. See docs/PluginHowTo.md for more information.
  • A new API has been added which allows users to modify the default "control tree". This data structure defines the specific algorithmic steps used to implement a level-3 BLAS operation such as gemm or syrk. Users can start with a predefined control tree for one of the level-3 BLAS operations (except trsm currently) and then modify it to produce a custom operation. Users can change kernels for packing and computation, associated blocksizes, and provide additional information (such as external parameters or additional data) which is passed directly to the kernels. See docs/PluginHowTo.md for more information and a working example.
  • All level-3 BLAS operations (except trsm) now support full mixed-precision mixed-domain computation. The A, B, and C matrices, as well as the alpha and beta scalars, may be provided in any of the currently supported data types (single/double precision, real/complex domain), and a separately specified computational precision controls how the computation is actually performed internally. The computational precision can be set on the obj_t structure representing the C matrix.
  • Added a func2_t struct for dealing with 2-type kernels (see below). A func2_t can be safely cast to func_t to refer to only kernels with equal type parameters. (Devin Matthews)
  • The bli_*_front functions have been removed.
  • Extensive other back-end changes and improvements.
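
To make the mixed-precision item above concrete, here is a minimal sketch in plain C of a real-domain gemm whose operands are stored in different precisions while the accumulation runs in a separately chosen computational precision. It does not use the BLIS API: `comp_prec_t` and `gemm_mixed_sketch` are hypothetical names invented for this illustration, the matrices are assumed to be row-major, and the complex domain is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selector for the internal (computational) precision.
   This is an illustration only, not a BLIS type. */
typedef enum { COMP_PREC_SINGLE, COMP_PREC_DOUBLE } comp_prec_t;

/* C := beta*C + alpha*A*B with row-major storage:
   A is m x k (float), B is k x n (double), C is m x n (float).
   The dot products are accumulated in the requested precision;
   the result is rounded to C's storage type only at the end. */
void gemm_mixed_sketch(size_t m, size_t n, size_t k,
                       double alpha, const float *a, const double *b,
                       double beta, float *c, comp_prec_t prec)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float  acc_s = 0.0f;  /* single-precision accumulator */
            double acc_d = 0.0;   /* double-precision accumulator */

            for (size_t p = 0; p < k; ++p) {
                if (prec == COMP_PREC_SINGLE)
                    acc_s += a[i * k + p] * (float)b[p * n + j];
                else
                    acc_d += (double)a[i * k + p] * b[p * n + j];
            }

            double ab = (prec == COMP_PREC_SINGLE) ? (double)acc_s : acc_d;
            c[i * n + j] = (float)(beta * (double)c[i * n + j] + alpha * ab);
        }
    }
}
```

The point of the sketch is that storage types and computational precision are independent choices: the same float and double inputs can be multiplied with either accumulator. In BLIS itself, per the notes above, the computational precision is a property set on the obj_t for C rather than a function argument.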

Compatibility:

  • Added a ScaLAPACK compatibility mode which disables some conflicting BLAS definitions. (Field Van Zee)
  • Fixed issues with improperly escaped strings in Python scripts for compatibility with Python 3.12+. (@AngryLoki)
  • Added a user-defined macro BLIS_ENABLE_STD_COMPLEX which uses std::complex typedefs in blis.h for C++ code. (Devin Matthews)
  • Fixed a bug in the definition of some scalar level-0 macros affecting compatibility of bli_creal and bli_zreal, for example. (Devin Matthews)
  • The static initializer macros (BLIS_*_INITIALIZER) have been fixed for compatibility with C++. (Devin Matthews)
  • "Helper" blis.h and cblas.h headers are now installed directly to INCDIR (in addition to the full files in INCDIR/blis). (Field Van Zee, Jed Brown, Mo Zhou)

Kernels:

  • Fixed an out-of-bounds read bug in the haswell gemmsup kernels. (John Mather)
  • Fixed a bug in the complex-domain gemm kernels for piledriver. (@rmast)
  • Kernel, blocksize, and preference lookup functions now use siz_t rather than specific enums. (Devin Matthews)
  • Fixed some issues with run-time kernel detection and added more ARM part numbers/manufacturer codes. (John Mather)
  • Kernels can now be added which have two datatype parameters. Kernel IDs are assigned such that 1-type and 2-type kernels cannot be interchanged accidentally. (Devin Matthews)
  • The packing microkernels and computational microkernels (gemm and gemmtrsm) now receive offsets into the global matrix. The latter are passed via the auxinfo_t struct. (Devin Matthews)
  • The separate "MRxk" and "NRxk" packing kernels have been merged into one generic packing kernel. Packing kernels are now expected to pack any size micropanel, but may optimize for specific shapes. (Devin Matthews)
  • Added explicit packing kernels for diagonal portions of matrices, and for certain mixed-domain/1m cases. (Devin Matthews)
  • Improved support for duplication during packing ("broadcast-B") across all packing kernels.
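
The merged packing kernel described above can be pictured with a small sketch in plain C: one routine packs a micropanel of any size, zero-pads up to the register blocksize, and takes a fast path for one specific shape. This is not the BLIS packing kernel interface. `packm_generic_sketch` and its parameter list are hypothetical, and the layout (column by column, `mr` entries per column) is an assumption made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pack an m x k block of A (row stride rs, column stride cs) into the
   micropanel p, stored column by column with mr entries per column.
   Rows m..mr-1 of each packed column are zero-filled, so a consumer
   can always read a full mr-row panel. Hypothetical signature. */
void packm_generic_sketch(size_t m, size_t k, size_t mr,
                          const double *a, size_t rs, size_t cs,
                          double *p)
{
    assert(a != NULL && p != NULL && m <= mr);

    for (size_t j = 0; j < k; ++j) {
        double *pj = p + j * mr;

        if (m == mr && rs == 1) {
            /* Shape-specific fast path: a full-height, unit-stride
               column can be copied in one contiguous block. */
            memcpy(pj, a + j * cs, mr * sizeof *pj);
            continue;
        }

        /* Generic path: strided copy followed by zero padding. */
        for (size_t i = 0; i < m; ++i)
            pj[i] = a[i * rs + j * cs];
        for (size_t i = m; i < mr; ++i)
            pj[i] = 0.0;
    }
}
```

For example, packing a 3 x 2 row-major block with `mr = 4` yields two packed columns of four entries each, with the last entry of each column set to zero. The generic path handles every shape, while the fast path only changes how quickly the common full-panel case is copied.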

Build system:

  • The cblas.h file is now "flattened" immediately after blis.h is (if enabled), rather than later in the build process. (Jeff Diamond, Field Van Zee)
  • Added a script to help with preparing release candidate branches. (Field Van Zee)
  • The configure script has been overhauled. In particular, using spaces in CC/CXX is now supported. (Devin Matthews)
  • Improved support for C++ source files in BLIS or in plugins. (Devin Matthews)

Testing:

  • test/3 drivers now allow using the "default" induced method, rather than forcing native or 1m operation. (Field Van Zee, Leick Robinson)
  • Fixed some segfaults in the test/3 drivers. (Field Van Zee, Leick Robinson)
  • The testsuite now tests all possible type combinations when requested. (Devin Matthews)
  • Improved detection of problems in make check-blis and related targets. (Devin Matthews)

Documentation:

  • Added documentation for the new plugin system and for creating custom operations by modifying the BLIS control tree. (Devin Matthews)
  • Updated documentation for downloading BLIS in README.md and instructions for maintainers in RELEASING. (Field Van Zee)
