Version 1.8


This set of changes brings the Padding API to life and implements the necessary mechanisms to cover a wider range of cases, which may allow a larger variety of TensorFlow workloads to run with LIBXSMM. The release also brings Winograd-based convolutions (chosen automatically when using LIBXSMM_DNN_CONV_ALGO_AUTO). Moreover, support for the Intel Xeon Phi processor code-named "Knights Mill" ("KNM") has been added (its QFMA and VNNI instructions can be exercised using the Intel SDE).
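
Below is a minimal C sketch of how the selected code path can be inspected or overridden at runtime, for example to exercise the new KNM support under the Intel SDE. The "knm" target string is an assumption based on the KNM support introduced in this release; setting the LIBXSMM_TARGET environment variable achieves the same effect without code changes.

    #include <libxsmm.h>
    #include <stdio.h>

    int main(void)
    {
      libxsmm_init();
      /* Report the code path selected by CPUID detection. */
      printf("detected target: %s\n", libxsmm_get_target_arch());
      /* Override the target, e.g., to generate KNM code under the Intel SDE
       * (assumed "knm" string; equivalent to setting LIBXSMM_TARGET=knm). */
      libxsmm_set_target_arch("knm");
      printf("requested target: %s\n", libxsmm_get_target_arch());
      libxsmm_finalize();
      return 0;
    }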

INTRODUCED

  • A summary of the code samples has been added (PDF), along with a guide (mainly for contributors) on "Getting Started using TensorFlow with LIBXSMM" (PDF)
  • Additional sparse matrix primitives (fsspmdm domain); see "pyfr" and "edge" sample code
  • Support for the OpenMP SIMD directive with GCC (-fopenmp-simd), used in some translation units
  • Improved code path selection for legacy compiler versions (functions with multiple compilation targets)
  • DNN: Winograd-based convolutions, including a threshold (LIBXSMM_DNN_CONV_ALGO_AUTO) to automatically select between LIBXSMM_DNN_CONV_ALGO_DIRECT and LIBXSMM_DNN_CONV_ALGO_WINOGRAD
  • DNN: logically padded data, including support for the Winograd-based implementation
  • DNN: support for Intel Knights Mill (KNM) instruction set extension (AVX-512)
  • DNN: support for another custom format that blocks the minibatch dimension
  • SMM: FORTRAN 77 support for manual JIT dispatch (libxsmm_xmmdispatch, libxsmm_xmmcall)
  • SPMDM: narrowed scope of "sum" array to improve optimization on LLVM
  • SMM/EXT/OMP: introduced a table of block sizes depending on the problem size; this already yields improved performance for big(ger), i.e., tiled matrix multiplications (the xgemm sample now includes a hyperparameter tuning script)
  • SMM/DNN: JIT'ted matrix-copy functions (already used in the CNN domain); both matcopy and the (upcoming) JIT'ted transpose will fully unlock the performance of big(ger) GEMMs
  • AUX/MEM: scope-oriented multi-pool scratch memory allocator with a heuristic for buffers of different lifetimes; see the sketch after this list
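
The scratch-memory allocator mentioned in the last bullet can be used as sketched below. This is a minimal C sketch: the buffer size and the zero (automatic) alignment are placeholders, and the actual pooling/reuse behavior depends on the allocation scopes of the application.

    #include <libxsmm.h>
    #include <string.h>

    void compute_step(size_t nbytes)
    {
      /* Request temporary memory from the scratch-memory pools;
       * an alignment of zero requests a suitable default alignment. */
      void* const scratch = libxsmm_aligned_scratch(nbytes, 0/*auto*/);
      if (NULL != scratch) {
        memset(scratch, 0, nbytes); /* placeholder for the actual work */
        /* Returning the buffer ends its lifetime; the underlying pool
         * is kept and can serve subsequent allocations. */
        libxsmm_free(scratch);
      }
    }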

CHANGES

  • Removed the LIBXSMM_MT and LIBXSMM_TASKS environment variables and updated the documentation
  • The COMPATIBLE=1 setting is now applied automatically (useful, e.g., with the Cray compiler)
  • LIBXSMM_TRYLOCK=1 now uses a single lock and thereby reduces code duplication for the contended case; the trylock property is for user code that can handle a NULL pointer as the result of code dispatch, i.e., that implements a fallback code path (BLAS); a fallback sketch follows after this list
  • AUX/MEM: superseded the libxsmm_malloc_size function with libxsmm_get_malloc_info (see the second sketch after this list)
  • Revised the termination message regarding scratch memory allocation (LIBXSMM_VERBOSE)
  • Other: updated "spack" (HPC package manager) to use more reasonable build options
  • SPMDM: improved load balance
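
To illustrate the LIBXSMM_TRYLOCK behavior described above: with the trylock property, a dispatch may return a NULL pointer instead of waiting on the lock, so the calling code is expected to provide a fallback. The sketch below uses the C dispatch interface with default leading dimensions, alpha, and beta; the naive loop merely stands in for a BLAS call and only shows the fallback structure.

    #include <libxsmm.h>

    /* C += A * B (column-major), with a fallback whenever the dispatch yields
     * NULL, e.g., under LIBXSMM_TRYLOCK=1 while the code registry is contended. */
    void gemm_with_fallback(libxsmm_blasint m, libxsmm_blasint n, libxsmm_blasint k,
                            const double* a, const double* b, double* c)
    {
      const libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(m, n, k,
        NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/, NULL/*alpha*/, NULL/*beta*/,
        NULL/*flags*/, NULL/*prefetch*/);
      if (NULL != kernel) { /* JIT'ted kernel is available */
        kernel(a, b, c);
      }
      else { /* fallback code path, e.g., a call into BLAS */
        libxsmm_blasint i, j, p;
        for (j = 0; j < n; ++j) for (p = 0; p < k; ++p) for (i = 0; i < m; ++i) {
          c[i+j*m] += a[i+p*m] * b[p+j*k];
        }
      }
    }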
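
For the libxsmm_get_malloc_info change, a small sketch follows; it assumes the info structure exposes the allocation size via a size member (as the successor of libxsmm_malloc_size), which should be checked against the installed headers.

    #include <libxsmm.h>
    #include <stdio.h>
    #include <stdlib.h>

    void print_allocation_size(const void* buffer)
    {
      libxsmm_malloc_info info;
      /* Query the properties of a buffer that was allocated through LIBXSMM;
       * the size member is an assumption to be verified against the header. */
      if (EXIT_SUCCESS == libxsmm_get_malloc_info(buffer, &info)) {
        printf("allocated size: %llu bytes\n", (unsigned long long)info.size);
      }
    }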

FIXES

  • Implemented FORTRAN dispatch interface (F2K) differently to get it working with CCE (Cray Compiler)
  • Worked around problems/crashes due to an outdated TCMALLOC replacement of malloc/free (CCE)
  • TMM: fixed an issue with LIBXSMM_TRYLOCK=1 that was exposed by the fallback code path of the multi-threaded tiled GEMM
  • TMM: fixed incorrect OpenMP usage in the task-based implementation; it is now always selected when inside an external parallel region
  • SPMDM: fixed handling of the last block of k and avoided out-of-bounds accesses
  • Minor: fixed all flake8 complaints about our Python scripts and fixed code issues pointed out by static analysis
  • Fixed transpose FORTRAN sample code
