Performance optimizations
Intel Architecture processors
- Introduced initial int8 optimizations for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control (see the sketch after this list).
- Improved matmul and inner product performance with bfloat16 data type.
- Improved performance of `tanh` algorithm for eltwise primitive and LSTM cells.
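A minimal sketch of the CPU dispatcher control mentioned in the first item, written against the oneDNN C++ API. The `DNNL_MAX_CPU_ISA` environment variable and `dnnl::set_max_cpu_isa()` are the documented dispatcher controls; the specific ISA value used here (`avx512_core_amx`) is an assumption and may be named differently in a given release.

```cpp
// Minimal sketch: opting in to instruction sets that the CPU dispatcher
// leaves disabled by default. The avx512_core_amx value is an assumption;
// check the cpu_isa enum in dnnl.hpp for the exact name in your release.
#include "dnnl.hpp"

int main() {
    // Must be called before the first primitive is created. Alternatively, set
    // the environment variable: DNNL_MAX_CPU_ISA=AVX512_CORE_AMX ./my_app
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx512_core_amx);

    // ... create engine, primitives, etc. as usual ...
    return 0;
}
```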
Intel Processor Graphics and Xe architecture-based Graphics
- Improved performance of Convolution, RNN, Inner Product and Matmul functionality for all supported GPUs.
- Improved performance of int8 convolutions with activations in NHWC format for Xe architecture-based Graphics (code named DG1 and Tiger Lake).
AArch64-based processors
- Added support for the Arm Performance Libraries (ArmPL) to improve performance of functionality relying on GEMM (matmul, inner product, convolutions).
New Functionality
- Introduced support for processors based on the IBM POWER architecture.
- Introduced Linear-Before-Reset GRU for GPU.
- Extended eltwise primitive with support for `round` operation (see the sketch after this list).
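A minimal sketch of the new `round` algorithm, written against the 2.x-style oneDNN C++ API; the tensor shape and format below are arbitrary placeholders.

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // Arbitrary placeholder shape and layout.
    memory::desc md({8, 32}, memory::data_type::f32, memory::format_tag::ab);
    memory src(md, eng), dst(md, eng);

    // Eltwise primitive with the round algorithm: dst[i] = round(src[i]).
    auto ew_d = eltwise_forward::desc(
            prop_kind::forward_inference, algorithm::eltwise_round, md);
    auto ew_pd = eltwise_forward::primitive_desc(ew_d, eng);
    eltwise_forward(ew_pd).execute(
            s, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    s.wait();
    return 0;
}
```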
Usability
- Reduced primitive creation time by enabling the OpenCL pre-compiled headers feature available in recent versions of the OpenCL driver.
- Reduced entitlement required on macOS with hardened runtime to `allow-jit`.
- Extended documentation on runtime and build time controls for JIT profilers support, primitive cache, CPU dispatcher controls, and verbose mode (a sketch of the runtime controls follows this list).
Validation
- Introduced a validation mode for out-of-memory situations.
Thanks to the contributors
This release contains contributions from the project core team as well as Alberto Gonzalez Palomo @AlbertoGP, Arthur Mitrano @aaraujom, Ilia Taraban @itaraban, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.