libxsmm/libxsmm 1.8 on GitHub

This set of changes brings the Padding API to life and implements the mechanisms needed to cover a wider range of cases, which may allow a larger variety of TensorFlow workloads to run with LIBXSMM. The release also adds Winograd-based convolutions (chosen automatically when using LIBXSMM_DNN_CONV_ALGO_AUTO). Moreover, support for the Intel Xeon Phi processor code-named "Knights Mill" ("KNM") has been added (QFMA and VNNI instructions can be executed using the Intel SDE).

INTRODUCED

A summary of code samples has been added (PDF), as well as a guide (mainly for contributors) on "Getting Started using TensorFlow with LIBXSMM" (PDF)
Additional sparse matrix primitives (fsspmdm domain); see "pyfr" and "edge" sample code
Support for OpenMP SIMD directive on GCC (-fopenmp-simd) used in some translation units
Improved code path selection for legacy compiler versions (functions with multiple compilation targets)
DNN: Winograd-based convolutions, incl. a threshold to automatically select (LIBXSMM_DNN_CONV_ALGO_AUTO) between LIBXSMM_DNN_CONV_ALGO_DIRECT and LIBXSMM_DNN_CONV_ALGO_WINOGRAD
DNN: logically padded data, incl. support for the Winograd-based implementation
DNN: support for Intel Knights Mill (KNM) instruction set extension (AVX-512)
DNN: support for another custom format that blocks the minibatch dimension
SMM: support of FORTRAN 77 for manual JIT-dispatch (libxsmm_xmmdispatch, libxsmm_xmmcall)
SPMDM: narrowed scope of "sum" array to improve optimization on LLVM
SMM/EXT/OMP: introduced a table of block sizes depending on problem size; this already yields improved performance for big(ger), i.e., tiled, matrix multiplications (the xgemm sample now includes a hyperparameter tuning script)
SMM/DNN: JIT'ted matrix copy functions (already used in CNN domain); both matcopy and (upcoming) JIT'ted transpose will fully unlock performance of big(ger) GEMMs
AUX/MEM: scope-oriented multi-pool scratch memory allocator with heuristic for buffers of different lifetime
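
The idea behind a table of block sizes that depends on the problem size (SMM/EXT/OMP item above) can be sketched as follows. This is only an illustration and not LIBXSMM code: the names tile_table, select_tile, and tiled_gemm as well as the table values are hypothetical, and the multiplication assumes column-major storage with tightly packed leading dimensions.

```c
#include <stddef.h>

/* Hypothetical table entry: problems whose largest dimension is at most
 * max_dim use a square tile with the given edge length. */
typedef struct tile_entry { size_t max_dim; size_t tile; } tile_entry;

/* Illustrative values only (assumption, not tuned for any hardware). */
static const tile_entry tile_table[] = { { 64, 16 }, { 512, 32 }, { 4096, 64 } };

/* Select a tile edge based on the largest of the three problem dimensions. */
size_t select_tile(size_t m, size_t n, size_t k)
{
  const size_t count = sizeof(tile_table) / sizeof(*tile_table);
  size_t largest = (m > n ? m : n), i;
  if (k > largest) largest = k;
  for (i = 0; i < count; ++i) {
    if (largest <= tile_table[i].max_dim) return tile_table[i].tile;
  }
  return 128; /* fallback for problems beyond the last table entry */
}

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B with column-major A (m x k), B (k x n), and C (m x n).
 * The loops walk over tiles; each tile is clamped at the matrix border. */
void tiled_gemm(size_t m, size_t n, size_t k,
                const double* a, const double* b, double* c)
{
  const size_t t = select_tile(m, n, k);
  size_t i0, j0, p0, i, j, p;
  for (j0 = 0; j0 < n; j0 += t) {
    const size_t j1 = min_size(j0 + t, n);
    for (p0 = 0; p0 < k; p0 += t) {
      const size_t p1 = min_size(p0 + t, k);
      for (i0 = 0; i0 < m; i0 += t) {
        const size_t i1 = min_size(i0 + t, m);
        for (j = j0; j < j1; ++j) {
          for (p = p0; p < p1; ++p) {
            for (i = i0; i < i1; ++i) {
              c[i + j * m] += a[i + p * m] * b[p + j * k];
            }
          }
        }
      }
    }
  }
}
```

In a tuned implementation the inner three loops would be replaced by a small-matrix kernel, and the table values would come from measurements such as those produced by a tuning script.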

CHANGES

Removed LIBXSMM_MT and LIBXSMM_TASKS environment variables, and updated documentation
COMPATIBLE=1 setting is now automatically applied (e.g., useful with Cray Compiler)
LIBXSMM_TRYLOCK=1 now uses a single lock and thereby reduces code duplication for the contended case; the trylock property is meant for user code that can handle a NULL pointer as the result of the code dispatch, i.e., code that implements a fallback code path (BLAS)
AUX/MEM: superseded libxsmm_malloc_size function with libxsmm_get_malloc_info
Revised termination message wrt scratch memory allocation (LIBXSMM_VERBOSE)
Other: updated "spack" (HPC package manager) to use more reasonable build options
SPMDM: improved load balance
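
The LIBXSMM_TRYLOCK item above expects user code to tolerate a NULL pointer from the code dispatch and to fall back to another code path. A minimal sketch of that calling pattern is shown below. It does not use the LIBXSMM API: try_dispatch, kernel_2x2x2, fallback_gemm, and multiply are hypothetical stand-ins, and the lock_is_busy flag merely simulates a contended dispatch.

```c
#include <stddef.h>

/* Hypothetical kernel type: computes C += A * B for a fixed shape. */
typedef void (*gemm_kernel)(const double* a, const double* b, double* c);

/* Stand-in for a specialized kernel (2x2x2, column-major). */
void kernel_2x2x2(const double* a, const double* b, double* c)
{
  c[0] += a[0] * b[0] + a[2] * b[1];
  c[1] += a[1] * b[0] + a[3] * b[1];
  c[2] += a[0] * b[2] + a[2] * b[3];
  c[3] += a[1] * b[2] + a[3] * b[3];
}

/* Stand-in for the fallback code path (a BLAS call in real code). */
void fallback_gemm(int m, int n, int k,
                   const double* a, const double* b, double* c)
{
  int i, j, p;
  for (j = 0; j < n; ++j) {
    for (p = 0; p < k; ++p) {
      for (i = 0; i < m; ++i) c[i + j * m] += a[i + p * m] * b[p + j * k];
    }
  }
}

/* Hypothetical dispatch: returns NULL if no kernel is available for the
 * shape, or if the (simulated) lock could not be acquired. */
gemm_kernel try_dispatch(int m, int n, int k, int lock_is_busy)
{
  if (0 != lock_is_busy) return NULL;
  return (2 == m && 2 == n && 2 == k) ? kernel_2x2x2 : NULL;
}

/* Caller that handles a NULL result; returns 1 if the fallback was used. */
int multiply(int m, int n, int k, const double* a, const double* b,
             double* c, int lock_is_busy)
{
  const gemm_kernel kernel = try_dispatch(m, n, k, lock_is_busy);
  if (NULL != kernel) {
    kernel(a, b, c);
    return 0;
  }
  fallback_gemm(m, n, k, a, b, c);
  return 1;
}
```

Both paths must produce the same result, so the caller never needs to know which one ran; the return value of multiply exists only to make the chosen path observable in this sketch.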

FIXES

Implemented FORTRAN dispatch interface (F2K) differently to get it working with CCE (Cray Compiler)
Worked around problem/crashes due to an outdated TCMALLOC replacement of malloc/free (CCE)
TMM: tiled MM fallback code path in multi-threaded tiled GEMM exposed an issue with LIBXSMM_TRYLOCK=1
TMM: fixed incorrect OpenMP usage in the task-based implementation; it is now always selected when called inside an external parallel region
SPMDM: fixed handling of the last block of k to avoid out-of-bounds accesses
Minor: fixed all flake8 complaints of our Python scripts, fixed code issues pointed out by static analysis
Fixed transpose FORTRAN sample code
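
The class of problem behind the SPMDM fix above (last block of k) can be illustrated with a blocked reduction whose final block is shorter than the block size. The following sketch is not the SPMDM code; blocked_dot and its parameters are hypothetical and only show how clamping the last block keeps all accesses in bounds.

```c
#include <stddef.h>

/* Dot product over k elements, processed in blocks of bk elements.
 * The last block is clamped to the remaining length, so no element
 * beyond index k-1 is read even if k is not a multiple of bk. */
double blocked_dot(const double* x, const double* y, size_t k, size_t bk)
{
  double sum = 0.0;
  size_t k0, i;
  if (0 == bk) bk = 1; /* guard against an endless loop */
  for (k0 = 0; k0 < k; k0 += bk) {
    const size_t remaining = k - k0;
    const size_t len = (remaining < bk ? remaining : bk);
    for (i = 0; i < len; ++i) sum += x[k0 + i] * y[k0 + i];
  }
  return sum;
}
```

Without the clamp, a loop that always processes bk elements would read past the end of both arrays whenever k is not divisible by bk.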
