libxsmm/libxsmm 1.5 on GitHub

A major addition to LIBXSMM is the DNN API, which can be used, for example, for Convolutional Neural Networks (CNNs). As a consequence, the banner description of LIBXSMM has been updated:

Library targeting Intel Architecture (x86) for small, dense or sparse matrix multiplications, and small convolutions.

The small convolutions currently focus on Intel AVX-512, but compiler-generated fallback code is in place as well. Besides AVX-512, forward convolutions (along with support for different storage formats) are also covered with Intel AVX2. In addition to LIBXSMM's internal storage scheme, the library supports a variety of other popular data formats, one of which is TensorFlow's native NHWK storage scheme. With respect to the supported data types, single-precision convolution kernels (FP32) are fully supported by the JIT code generator, and initial code for Int16-based data is already in place. During the past development cycle, Google Inc. expressed interest in LIBXSMM and contributed the Linux perf support to confirm the commitment. For others who would like to join our efforts, a preliminary Wiki page about contributions has been added (https://github.com/hfp/libxsmm/wiki/Contribute).
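
To make the term "small convolution" concrete, here is a plain reference loop nest for a forward convolution on a channels-last layout. This is only a sketch under stated assumptions: the function name conv_fwd_nhwc, the argument order, the filter layout [R][S][C][K], unit stride, and the absence of padding are all choices made for illustration and are not part of the LIBXSMM DNN API. As is common in DNN frameworks, the filter is not flipped (strictly speaking a cross-correlation).

```c
#include <assert.h>
#include <stddef.h>

/* Reference forward convolution (hypothetical helper, not a LIBXSMM function).
 * Layouts (last index is fastest):
 *   input  [N][H][W][C]
 *   filter [R][S][C][K]
 *   output [N][P][Q][K] with P = H - R + 1 and Q = W - S + 1
 * Assumes unit stride and no padding. */
void conv_fwd_nhwc(const float* input, const float* filter, float* output,
  size_t N, size_t H, size_t W, size_t C, size_t K, size_t R, size_t S)
{
  size_t n, p, q, k, r, s, c;
  size_t P, Q;
  assert(NULL != input && NULL != filter && NULL != output);
  assert(0 < R && R <= H && 0 < S && S <= W);
  P = H - R + 1;
  Q = W - S + 1;
  for (n = 0; n < N; ++n) {
    for (p = 0; p < P; ++p) {
      for (q = 0; q < Q; ++q) {
        for (k = 0; k < K; ++k) {
          float sum = 0.f;
          /* accumulate over the filter window and all input channels */
          for (r = 0; r < R; ++r) {
            for (s = 0; s < S; ++s) {
              for (c = 0; c < C; ++c) {
                sum += input[((n * H + (p + r)) * W + (q + s)) * C + c]
                     * filter[((r * S + s) * C + c) * K + k];
              }
            }
          }
          output[((n * P + p) * Q + q) * K + k] = sum;
        }
      }
    }
  }
}
```

A JIT-generated kernel computes the same result but is specialized for fixed sizes and a given instruction set; the loop nest above is merely the kind of fallback a compiler could generate code for.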

INTRODUCED

New DNN API, sample code, and benchmarks (Googlenetv1, DeepBench, and Overfeat)
Enabled tiled GEMM support in static/dynamic wrapper; MT support via libxsmmext
More format variations of sparse matrix multiplication (dense/sparse etc.)
Sample code showing sparse matrix multiplication (PyFR examples collection)
Published synchronization layer (atomics, and simple/bare OS-thread/lock abstraction)
Introduced mini-API for optimized barrier implementation (general multicore support)
Introduced API for memory allocation (malloc interface); mostly exposed from internal API
Besides Intel VTune, Linux perf and jitdump are now supported (Thank you Maciej D.!)
SPECFEM sample: received nicely written example contribution (Thank you Daniel P.!)
OSX (incl. "El Capitan") now supports Intel Compiler, Apple/Clang, and GNU GCC
Cray Compiling Environment (CCE) is now supported
PGI compiler is now supported
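
The idea behind tiled GEMM (see the list above) can be sketched as follows: a larger multiplication is decomposed into tiles, and each tile is handled by a small matrix multiplication kernel. The code below is an illustration only; the names tiled_gemm and smm_kernel, the fixed operation C += A * B, and the caller-supplied tile sizes are assumptions and do not reflect LIBXSMM's interface or its tile-size selection. Matrices are column-major with leading dimensions LDA, LDB, and LDC, as in BLAS.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Small kernel (hypothetical): C(m x n) += A(m x k) * B(k x n),
 * column-major storage with leading dimensions lda, ldb, and ldc. */
static void smm_kernel(const float* a, const float* b, float* c,
  size_t m, size_t n, size_t k, size_t lda, size_t ldb, size_t ldc)
{
  size_t i, j, l;
  for (j = 0; j < n; ++j) {
    for (l = 0; l < k; ++l) {
      const float bv = b[j * ldb + l];
      for (i = 0; i < m; ++i) {
        c[j * ldc + i] += a[l * lda + i] * bv;
      }
    }
  }
}

/* Tiled driver (hypothetical): walks the M, N, and K extents in steps of
 * tm, tn, and tk, and calls the small kernel per tile. Remainder tiles
 * are simply smaller. */
void tiled_gemm(const float* a, const float* b, float* c,
  size_t m, size_t n, size_t k, size_t lda, size_t ldb, size_t ldc,
  size_t tm, size_t tn, size_t tk)
{
  size_t i, j, l;
  assert(0 < tm && 0 < tn && 0 < tk);
  assert(m <= lda && k <= ldb && m <= ldc);
  for (j = 0; j < n; j += tn) {
    for (i = 0; i < m; i += tm) {
      for (l = 0; l < k; l += tk) {
        smm_kernel(a + l * lda + i, b + j * ldb + l, c + j * ldc + i,
          min_size(tm, m - i), min_size(tn, n - j), min_size(tk, k - l),
          lda, ldb, ldc);
      }
    }
  }
}
```

Since every (i, j) tile of C is written by exactly one iteration of the two outer loops, those loops are natural candidates for multi-threading, for example with OpenMP.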

CHANGES

Solidified API/impl. for out-of-place (OOP) transposes; ST/MT support (MT via libxsmmext)
Type-optimized OOP-transpose implementations, and generic/full support for any element type
Shared OpenMP infrastructure/abstraction for transposes and GEMMs
Introduced and documented LIBXSMM_MT environment variable (ST/MT/sync control)
Performance enhancements for sparse matrix multiplication (code gen., prefetches)
Support for SMM kernels (BIG=1) with larger extent(s) in terms of M, N, K, LDA, LDB, or LDC
Support for "ease of use" APIs (internal multi-threading), and external MT runtimes
Include "secondary" APIs in the first place (libxsmm.h) i.e., malloc, timer, sync.h
Included statistics in the LIBXSMM_VERBOSE table for kernels which exceed the MNK threshold
Updated documentation to cover the new DNN API; added sample code (samples/dnn)
Enhanced infrastructure and portability for Variable Length Arrays (VLAs)
Library infrastructure (templates) for different element/pixel types (F32, I16, I8)
Improved development infrastructure (merging version.txt, and commit msg. hook)
Improved Travis-CI turnaround time (due to commit msg. hook [skip ci], and upload timeout)
Improved support for Clang, and bleeding edge compiler/architectures (intrinsic layer, etc)
CPUID distinction between AVX-512/Core, AVX-512/MIC, and AVX-512/Common
Better build-time support for AVX-512 (AVX=3 MIC=0|1, etc.)
JIT support is no longer disabled under Windows (still, the calling convention is not in place)
Better intro-style/banner (license, Travis, etc.) for online documentation (README.sh, etc.)
Improved info message when building LIBXSMM (compiler, code path info, etc.)
Revised wrapper mechanism, static wrapper now req. special build of libxsmmext (WRAP=1|2)
Improved dispatching LIBXSMM_PREFETCH strategy (common, GEMM, tiled GEMM)
Introduced LIBXSMM_GEMM_PREFETCH=-1|0...10 environment variable for tiled GEMM
Debug helper (internal): libxsmm_meta_image_typeinfo, libxsmm_meta_image_write, libxsmm_gemm_dump
Renamed libxsmm_[get|set]_verbose_mode to libxsmm_[get|set]_verbosity (verbosity level)
Improved verbose mode: the TRY-counter now collects rejected JIT requests (unsupported GEMM calls)
Verbose mode (>1) prints rejected GEMM calls (console), or dumps (<-1) data in MHD format
Meta Image (MHD) format for data dumps (inspection via ITK-SNAP, ParaView, or similar)
TSC-based (not about CPU cycles!) libxsmm_timer_xtick (in addition to libxsmm_timer_tick)
Improved calculating tile sizes for tiled GEMM (LIBXSMM_CLMP, LIBXSMM_SQRT2)
Improved header-only support, and related/new CI test target (Travis CI)
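
For readers unfamiliar with the Meta Image (MHD) format mentioned in the list above, the following sketch writes a single-precision 2-D array into one self-contained file that viewers such as ITK-SNAP or ParaView can open. It is not LIBXSMM's dump routine: the function names mhd_format_header_f32 and mhd_write_f32 are made up for this example, the header contains only a minimal set of MetaImage fields, and the data is assumed to be little-endian.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats a minimal MetaImage header for a 2-D image of 32-bit floats
 * (hypothetical helper). The first DimSize entry is the fastest-varying
 * dimension. Returns the header length, or -1 if the buffer is too small. */
int mhd_format_header_f32(char* buffer, size_t size, size_t width, size_t height)
{
  int n;
  assert(NULL != buffer && 0 < size);
  n = snprintf(buffer, size,
    "ObjectType = Image\n"
    "NDims = 2\n"
    "BinaryData = True\n"
    "BinaryDataByteOrderMSB = False\n" /* assumption: little-endian data */
    "DimSize = %lu %lu\n"
    "ElementType = MET_FLOAT\n"
    "ElementDataFile = LOCAL\n", /* LOCAL: raw data follows the header */
    (unsigned long)width, (unsigned long)height);
  return (0 <= n && (size_t)n < size) ? n : -1;
}

/* Writes header and raw data into a single file (hypothetical helper).
 * Returns 0 on success, and -1 otherwise. */
int mhd_write_f32(const char* filename, const float* data, size_t width, size_t height)
{
  char header[256];
  size_t length, count;
  FILE* file;
  int ok;
  if (NULL == filename || NULL == data) return -1;
  if (0 > mhd_format_header_f32(header, sizeof(header), width, height)) return -1;
  file = fopen(filename, "wb");
  if (NULL == file) return -1;
  length = strlen(header);
  count = width * height;
  ok = (length == fwrite(header, 1, length, file))
    && (count == fwrite(data, sizeof(float), count, file));
  return (0 == fclose(file) && ok) ? 0 : -1;
}
```

For a column-major M x N matrix, passing width = M and height = N keeps the element order unchanged; the viewer then shows the matrix transposed relative to the usual row/column picture.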

FIXES

Improvements and fixes of the backend support for sparse matrix multiplication
Bug fixes wrt code dispatch, medium-sized GEMMs, and the wrapper mechanism
Fixed issue where certain GEMM API did not respect the JIT-bypass/BLAS-fallback
Support for "no BLAS dependency" (which previously broke the static wrapper)
Correctly handle user-documented prefetch id vs. internal prefetch flag/bits
Disarm MKL_DIRECT_CALL/MKL_DIRECT_CALL_SEQ when determining original BLAS symbol
Adjusted/fixed support for dispatching statically generated SMM kernels
Fixed issue where BIG SMM kernels returned the wrong code from the registry
Fixed inline assembly for CPUID detection; issue was only exposed with Clang
Fixed/disabled LIBXSMM_ATOMIC_STORE_ZERO issue (may hang) for non-LIBXSMM_GCCATOMICS
Fixed lazy initialization for certain cases/tool chains (related to c'tor/d'tor attr.)
Fixed compiler warnings with older Intel Compiler (atomics layer)
