github libxsmm/libxsmm 1.8.1
Version 1.8.1

7 years ago

This release brings some new features (matcopy/2d-copy and tcopy based on JIT-generated code) as well as a number of bug fixes (TGEMM), improvements (KNM), and refinements (LIBXSMM_GEMM_WRAP control, etc.). With copy/transpose support now complete, this release prepares for complete stand-alone GEMM routines.

INTRODUCED

  • Choice between tiled/small GEMM during call-interception (LIBXSMM_GEMM_WRAP=1|2).
  • Introduced JIT'ted transpose kernels including tiling for larger matrices.
  • Transpose routines now auto-dispatch JIT-kernels incl. auto-tuned tiles.
  • Introduced matcopy routines similar to the transpose routines (C/C++/F).
  • LIBXSMM_DNN_CONV_OPTION_OVERWRITE for faster initial forward convolution.
  • Implemented/documented named JIT routines in TF when using VTune.
  • Additional statistics about MCOPY/TCOPY (LIBXSMM_VERBOSE=2).
  • Lowered overhead of tiled/parallelized GEMM/MCOPY/TCOPY.
  • Made libxsmm_hash function available (MEM/AUX module).
  • Initial support for lower precision (backward conv.)
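
To illustrate what a matcopy (2d-copy) routine does, here is a minimal sketch in plain C. It assumes a column-major layout with separate leading dimensions for input and output, as is common for BLAS-like interfaces. The name sketch_matcopy and its parameter list are hypothetical and only serve this example; this is not the LIBXSMM API and it does not use JIT-generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (NOT the LIBXSMM API): copies an m x n column-major
 * matrix whose elements are typesize bytes each. The input uses leading
 * dimension ldi and the output uses leading dimension ldo (both counted
 * in elements). Padding rows beyond m are left untouched in the output. */
static void sketch_matcopy(void* out, const void* in, size_t typesize,
                           size_t m, size_t n, size_t ldi, size_t ldo)
{
  const char* src = (const char*)in;
  char* dst = (char*)out;
  size_t j;
  assert(ldi >= m && ldo >= m);
  for (j = 0; j < n; ++j) {
    /* one column is contiguous in memory, so it can be copied at once */
    memcpy(dst + j * ldo * typesize, src + j * ldi * typesize, m * typesize);
  }
}
```

A caller would pass, for example, sizeof(double) as typesize together with the matrix shape and both leading dimensions. Copying column by column keeps each memcpy contiguous, which is the reason for the loop order in this sketch.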

CHANGES

  • AVX-512 based CPUID-dispatched input/output of Winograd transformation (forward conv.).
  • Adjusted build system to pick-up RPM_OPT_FLAGS (RPM based Linux distributions).
  • Moved extensive Q&A to Wiki page and cleaned up the reference documentation.
  • Improved/extended Getting Started Guide for TensorFlow with LIBXSMM.
  • Improved general backend error propagation and avoided duplicate messages.
  • Iterative subdivision of large matrix transposes (tcopy) and matcopy (mcopy).
  • Non-task based and (optional) task based parallelization of tcopy and mcopy.
  • Mentioned KNM target key ("knm") in reference documentation.
  • Improved prefetches in KNM code path of weight update.
  • Adjusted initialization sequence during startup.
  • Improved parallelization grammar.
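
The subdivision of large transposes into tiles mentioned above can be sketched in plain C as follows. This is only an illustration of the general tiling idea under stated assumptions: the function name sketch_otrans, the fixed tile size SKETCH_TILE, and the restriction to double-precision elements are all hypothetical, and the loop nest is a simple static tiling rather than the iterative subdivision, auto-tuned tiles, or JIT kernels used by the library.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TILE 32 /* assumed fixed tile size (hypothetical) */

static size_t sketch_min(size_t a, size_t b) { return a < b ? a : b; }

/* Hypothetical helper (NOT the LIBXSMM API): out-of-place transpose.
 * in is an m x n column-major matrix with leading dimension ldi;
 * out receives the n x m transpose with leading dimension ldo. */
static void sketch_otrans(double* out, const double* in,
                          size_t m, size_t n, size_t ldi, size_t ldo)
{
  size_t i0, j0, i, j;
  assert(ldi >= m && ldo >= n);
  for (j0 = 0; j0 < n; j0 += SKETCH_TILE) { /* tiles along columns */
    const size_t j1 = sketch_min(j0 + SKETCH_TILE, n);
    for (i0 = 0; i0 < m; i0 += SKETCH_TILE) { /* tiles along rows */
      const size_t i1 = sketch_min(i0 + SKETCH_TILE, m);
      /* transpose one tile; remainder tiles are simply smaller */
      for (j = j0; j < j1; ++j) {
        for (i = i0; i < i1; ++i) {
          out[i * ldo + j] = in[j * ldi + i];
        }
      }
    }
  }
}
```

Working tile by tile keeps both the reads and the writes within a small region of memory, which is the usual motivation for tiling a transpose. The tiles are independent of each other, so the two outer loops are also the natural place to distribute work across threads or tasks.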

FIXES

  • Fixed pruned tile sizes and division-by-zero error in tiled GEMM.
  • Propagated backend errors in case of an insufficient JIT buffer.
  • Fixed CRC32 SW implementation issues unveiled by the Cray compiler.
  • Fixed C++ interface to call the parallelized transpose when requested.
  • Fixed VTune support (named JIT code); broken in v1.8.
  • Fixed incorrect prefetch locations in KNM code path.
  • Fixed alignment condition in tcopy/mcopy code.
  • Fixed TF allocator integration with GCC 7.1.0.
  • Fixed some more warnings in sample codes.
