This set of changes brings the Padding API to life and implements the mechanisms needed to cover a wider range of cases, which may allow a larger variety of TensorFlow workloads to run using LIBXSMM. The implementation also brings Winograd-based convolutions (chosen automatically when using LIBXSMM_DNN_CONV_ALGO_AUTO). Moreover, support for the Intel Xeon Phi processor code-named "Knights Mill" ("KNM") has been added (QFMA and VNNI instructions can be executed using the Intel SDE).
INTRODUCED
- A summary of code samples has been added (PDF), along with a guide (mainly for contributors) to "Getting Started using TensorFlow with LIBXSMM" (PDF)
- Additional sparse matrix primitives (fsspmdm domain); see "pyfr" and "edge" sample code
- Support for OpenMP SIMD directive on GCC (-fopenmp-simd) used in some translation units
- Improved code path selection for legacy compiler versions (functions with multiple compilation targets)
- DNN: Winograd-based convolutions incl. a threshold to automatically select (LIBXSMM_DNN_CONV_ALGO_AUTO) between LIBXSMM_DNN_CONV_ALGO_DIRECT and LIBXSMM_DNN_CONV_ALGO_WINOGRAD
- DNN: logically padded data incl. support for the Winograd-based implementation
- DNN: support for Intel Knights Mill (KNM) instruction set extension (AVX-512)
- DNN: support another custom format that blocks the minibatch dimension
- SMM: support of FORTRAN 77 for manual JIT-dispatch (libxsmm_xmmdispatch, libxsmm_xmmcall)
- SPMDM: narrowed scope of "sum" array to improve optimization on LLVM
- SMM/EXT/OMP: introduced a table of block sizes depending on problem size; already yields improved performance for big(ger), i.e., tiled matrix multiplications (the xgemm sample now includes a hyperparameter tuning script)
- SMM/DNN: JIT'ted matrix copy functions (already used in CNN domain); both matcopy and (upcoming) JIT'ted transpose will fully unlock performance of big(ger) GEMMs
- AUX/MEM: scope-oriented multi-pool scratch memory allocator with heuristic for buffers of different lifetime
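
As an illustration of the block-size table mentioned under SMM/EXT/OMP, the following is a minimal sketch of a tiled multiplication that looks up its tile size by problem size. The table values, the selection rule (largest dimension), and the names `tile_table`, `select_tile`, and `tiled_gemm` are assumptions made for this example; they are not LIBXSMM's tuned values or API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table: upper limit of the problem size -> tile size.
 * The numbers are illustrative only, not LIBXSMM's tuned values. */
typedef struct { size_t max_size; size_t tile; } tile_entry;

static const tile_entry tile_table[] = {
  {   64, 16 },
  {  512, 32 },
  { 4096, 64 }
};

/* Pick the tile of the first entry whose limit covers the largest dimension. */
static size_t select_tile(size_t m, size_t n, size_t k)
{
  const size_t count = sizeof(tile_table) / sizeof(*tile_table);
  size_t size = m, i;
  if (size < n) size = n;
  if (size < k) size = k;
  for (i = 0; i < count; ++i) {
    if (size <= tile_table[i].max_size) return tile_table[i].tile;
  }
  return tile_table[count - 1].tile; /* beyond the table: use the largest tile */
}

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for column-major A (m x k), B (k x n), and C (m x n). */
static void tiled_gemm(size_t m, size_t n, size_t k,
  const double* a, const double* b, double* c)
{
  const size_t t = select_tile(m, n, k);
  size_t i0, j0, p0, i, j, p;
  assert(NULL != a && NULL != b && NULL != c);
  for (j0 = 0; j0 < n; j0 += t) {
    for (p0 = 0; p0 < k; p0 += t) {
      for (i0 = 0; i0 < m; i0 += t) {
        /* Clamp each tile so that partial tiles at the edges stay in bounds. */
        const size_t i1 = min_size(i0 + t, m);
        const size_t j1 = min_size(j0 + t, n);
        const size_t p1 = min_size(p0 + t, k);
        for (j = j0; j < j1; ++j) {
          for (p = p0; p < p1; ++p) {
            const double bpj = b[p + j * k];
            for (i = i0; i < i1; ++i) c[i + j * m] += a[i + p * m] * bpj;
          }
        }
      }
    }
  }
}
```

A tuning script would typically vary the table entries and measure the effect; the lookup itself stays unchanged.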
CHANGES
- Removed LIBXSMM_MT and LIBXSMM_TASKS environment variables, and updated documentation
- COMPATIBLE=1 setting is now automatically applied (e.g., useful with Cray Compiler)
- LIBXSMM_TRYLOCK=1 now uses a single lock, thereby reducing code duplication for the contended case; the trylock property is meant for user code that can handle a NULL pointer as the result of the code dispatch, i.e., code that implements a fallback code path (BLAS)
- AUX/MEM: superseded libxsmm_malloc_size function with libxsmm_get_malloc_info
- Revised termination message wrt scratch memory allocation (LIBXSMM_VERBOSE)
- Other: updated "spack" (HPC package manager) to use more reasonable build options
- SPMDM: improved load balance
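
The fallback pattern that LIBXSMM_TRYLOCK=1 expects from user code can be sketched as follows. This is a self-contained illustration and does not call LIBXSMM: `try_dispatch`, `kernel_2x2x2`, `fallback_gemm`, and the `lock_acquired` flag are hypothetical stand-ins for a code dispatch that may return NULL and for a BLAS call taken in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Signature of a specialized kernel computing C += A * B (column-major). */
typedef void (*gemm_kernel)(const double* a, const double* b, double* c);

/* Stand-in for a JIT-generated kernel (hypothetical, fixed to 2x2x2). */
static void kernel_2x2x2(const double* a, const double* b, double* c)
{
  int i, j, p;
  for (j = 0; j < 2; ++j) {
    for (p = 0; p < 2; ++p) {
      for (i = 0; i < 2; ++i) c[i + j * 2] += a[i + p * 2] * b[p + j * 2];
    }
  }
}

/* Hypothetical dispatch: yields NULL if no kernel can be handed out, e.g.,
 * because the lock was not acquired or the shape is not supported. */
static gemm_kernel try_dispatch(int m, int n, int k, int lock_acquired)
{
  if (!lock_acquired) return NULL;
  return (2 == m && 2 == n && 2 == k) ? kernel_2x2x2 : NULL;
}

/* Generic fallback; stands in for a BLAS call such as DGEMM. */
static void fallback_gemm(int m, int n, int k,
  const double* a, const double* b, double* c)
{
  int i, j, p;
  for (j = 0; j < n; ++j) {
    for (p = 0; p < k; ++p) {
      for (i = 0; i < m; ++i) c[i + j * m] += a[i + p * m] * b[p + j * k];
    }
  }
}

/* Returns 1 if the specialized kernel ran, and 0 if the fallback was taken. */
static int multiply(int m, int n, int k, int lock_acquired,
  const double* a, const double* b, double* c)
{
  const gemm_kernel kernel = try_dispatch(m, n, k, lock_acquired);
  assert(NULL != a && NULL != b && NULL != c);
  if (NULL != kernel) {
    kernel(a, b, c);
    return 1;
  }
  fallback_gemm(m, n, k, a, b, c); /* NULL result: take the fallback path */
  return 0;
}
```

Both paths must produce the same result, so the caller only needs the NULL check to remain correct under contention.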
FIXES
- Implemented FORTRAN dispatch interface (F2K) differently to get it working with CCE (Cray Compiler)
- Worked around problem/crashes due to an outdated TCMALLOC replacement of malloc/free (CCE)
- TMM: tiled MM fallback code path in multi-threaded tiled GEMM exposed an issue with LIBXSMM_TRYLOCK=1
- TMM: fixed incorrect OpenMP usage in the task-based implementation, which is now always selected when called inside an external parallel region
- SPMDM: fixed handling of the last block of k to avoid out-of-bounds accesses
- Minor: fixed all flake8 complaints of our Python scripts, fixed code issues pointed out by static analysis
- Fixed transpose FORTRAN sample code
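
The class of issue behind the SPMDM fix for the last block of k can be illustrated with a generic blocked loop. This is a sketch under assumptions and not the SPMDM code: `BLOCK_K` and `blocked_dot` are hypothetical names, and a dot product stands in for the actual kernel. The point is that the end of the last block is clamped to k whenever k is not a multiple of the block size.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_K 4 /* hypothetical block size along k */

/* Dot product processed in blocks of BLOCK_K elements along k. */
static double blocked_dot(const double* x, const double* y, size_t k)
{
  double result = 0;
  size_t k0, p;
  assert(NULL != x && NULL != y);
  for (k0 = 0; k0 < k; k0 += BLOCK_K) {
    /* The last block may be shorter than BLOCK_K: clamp its end to k
     * instead of reading BLOCK_K elements unconditionally. */
    const size_t k1 = (k0 + BLOCK_K < k) ? (k0 + BLOCK_K) : k;
    double partial = 0;
    for (p = k0; p < k1; ++p) partial += x[p] * y[p];
    result += partial;
  }
  return result;
}
```

With k = 10 and a block size of 4, the blocks cover the index ranges [0, 4), [4, 8), and [8, 10); without the clamp, the third block would read two elements past the end of both arrays.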