This release brings some new features (matcopy/2d-copy and tcopy based on JIT-generated code) as well as a number of bug fixes (TGEMM), improvements (KNM), and refinements (LIBXSMM_GEMM_WRAP control, etc.). With copy/transpose support now complete, this release prepares for complete stand-alone GEMM routines.
INTRODUCED
- Choice between tiled/small GEMM during call-interception (LIBXSMM_GEMM_WRAP=1|2).
- Introduced JIT'ted transpose kernels including tiling for larger matrices.
- Transpose routines now auto-dispatch JIT-kernels incl. auto-tuned tiles.
- Introduced matcopy routines similar to the transpose routines (C/C++/F).
- LIBXSMM_DNN_CONV_OPTION_OVERWRITE for faster initial forward convolution.
- Implemented/documented named JIT routines in TF when using VTune.
- Additional statistics about MCOPY/TCOPY (LIBXSMM_VERBOSE=2).
- Lowered overhead of tiled/parallelized GEMM/MCOPY/TCOPY.
- Made libxsmm_hash function available (MEM/AUX module).
- Initial support for lower precision (backward conv.).
CHANGES
- AVX-512 based CPUID-dispatched input/output of Winograd transformation (forward conv.).
- Adjusted build system to pick up RPM_OPT_FLAGS (RPM based Linux distributions).
- Moved extensive Q&A to Wiki page and cleaned up the reference documentation.
- Improved/extended Getting Started Guide for TensorFlow with LIBXSMM.
- Improved general backend error propagation and avoided duplicate messages.
- Iterative subdivision of large matrix transposes (tcopy) and matcopy (mcopy).
- Non-task based and (optional) task based parallelization of tcopy and mcopy.
- Mentioned KNM target key ("knm") in reference documentation.
- Improved prefetches in KNM code path of weight update.
- Adjusted initialization sequence during startup.
- Improved parallelization grammar.
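
The subdivision of large transposes (tcopy) listed above can be pictured with a minimal sketch in plain C. This is not LIBXSMM's implementation or API: the names transpose_tile, transpose_tiled, and TILE are hypothetical, the tile size is fixed rather than auto-tuned, each tile is handled by ordinary loops instead of a JIT-generated kernel, and column-major storage is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed tile edge; the library auto-tunes its tile sizes. */
#define TILE 32

/* Transpose the tile [m0,m1) x [n0,n1) of a column-major matrix:
 * out(j,i) = in(i,j), with leading dimensions ldi (input) and ldo (output). */
static void transpose_tile(double* out, const double* in,
  size_t m0, size_t m1, size_t n0, size_t n1, size_t ldi, size_t ldo)
{
  size_t i, j;
  for (j = n0; j < n1; ++j) {
    for (i = m0; i < m1; ++i) {
      out[j + i * ldo] = in[i + j * ldi];
    }
  }
}

/* Out-of-place transpose of an m x n column-major matrix into an n x m
 * matrix, walking the input tile by tile so that each tile stays small. */
static void transpose_tiled(double* out, const double* in,
  size_t m, size_t n, size_t ldi, size_t ldo)
{
  size_t i, j;
  assert(ldi >= m && ldo >= n); /* leading dimensions must cover the rows */
  for (j = 0; j < n; j += TILE) {
    /* clamp the tile at the matrix border (remainder tiles) */
    const size_t n1 = (n - j < TILE) ? n : (j + TILE);
    for (i = 0; i < m; i += TILE) {
      const size_t m1 = (m - i < TILE) ? m : (i + TILE);
      transpose_tile(out, in, i, m1, j, n1, ldi, ldo);
    }
  }
}
```

For example, calling transpose_tiled(out, in, 3, 2, 3, 2) on a 3x2 matrix stored as {1, 2, 3, 4, 5, 6} fills out with {1, 4, 2, 5, 3, 6}. Since the tiles are independent of each other, the two tile loops are also the natural place to distribute work across threads or tasks.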
FIXES
- Fixed pruned tile sizes and division-by-zero error in tiled GEMM.
- Propagate backend errors in case of an insufficient JIT buffer.
- Fixed CRC32 SW implementation issues unveiled by the Cray compiler.
- Call parallelized transpose (C++ interface) when requested.
- Fixed VTune support (named JIT code); broken in v1.8.
- Fixed incorrect prefetch locations in KNM code path.
- Fixed alignment condition in tcopy/mcopy code.
- Fixed TF allocator integration with GCC 7.1.0.
- Fixed some more warnings in sample codes.