Performance optimizations
Intel Architecture processors
- Introduced initial int8 optimizations for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control (see the sketch after this list).
- Improved matmul and inner product performance with bfloat16 data type.
- Improved performance of `tanh` algorithm for eltwise primitive and LSTM cells.
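A minimal sketch of the CPU dispatcher control mentioned in the first item, written against the oneDNN C++ API. The `DNNL_MAX_CPU_ISA` environment variable and `dnnl::set_max_cpu_isa()` are the documented dispatcher controls; the specific ISA value used here (`avx512_core_amx`) is an assumption and may be named differently in a given release.

```cpp
// Minimal sketch: opting in to instruction sets that the CPU dispatcher
// leaves disabled by default. The avx512_core_amx value is an assumption;
// check the cpu_isa enum in dnnl.hpp for the exact name in your release.
#include "dnnl.hpp"

int main() {
    // Must be called before the first primitive is created. Alternatively, set
    // the environment variable: DNNL_MAX_CPU_ISA=AVX512_CORE_AMX ./my_app
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx512_core_amx);

    // ... create engine, primitives, etc. as usual ...
    return 0;
}
```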
Intel Processor Graphics and Xe architecture-based Graphics
- Improved performance of Convolution, RNN, Inner Product and Matmul functionality for all supported GPUs.
- Improved performance of int8 convolutions with activations in NHWC format for Xe architecture-based Graphics (code named DG1 and Tiger Lake).
AArch64-based processors
- Added support for the Arm Performance Libraries (ArmPL) to improve performance of functionality relying on GEMM (matmul, inner product, convolutions).
New Functionality
- Introduced support for processors based on the IBM POWER architecture.
- Introduced Linear-Before-Reset GRU for GPU.
- Extended eltwise primitive with support for `round` operation (see the sketch after this list).
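A minimal sketch of the new `round` algorithm, written against the 2.x-style oneDNN C++ API; the tensor shape and format below are arbitrary placeholders.

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // Arbitrary placeholder shape and layout.
    memory::desc md({8, 32}, memory::data_type::f32, memory::format_tag::ab);
    memory src(md, eng), dst(md, eng);

    // Eltwise primitive with the round algorithm: dst[i] = round(src[i]).
    auto ew_d = eltwise_forward::desc(
            prop_kind::forward_inference, algorithm::eltwise_round, md);
    auto ew_pd = eltwise_forward::primitive_desc(ew_d, eng);
    eltwise_forward(ew_pd).execute(
            s, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    s.wait();
    return 0;
}
```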
Usability
- Reduced primitive creation time by enabling the OpenCL pre-compiled headers feature available in recent versions of the OpenCL driver.
- Reduced entitlement required on macOS with hardened runtime to `allow-jit`.
- Extended documentation on runtime and build time controls for JIT profilers support, primitive cache, CPU dispatcher controls, and verbose mode (a sketch of the runtime controls follows this list).
Validation
- Introduced a validation mode for out-of-memory situations.
Thanks to the contributors
This release contains contributions from the project core team as well as Alberto Gonzalez Palomo @AlbertoGP, Arthur Mitrano @aaraujom, Ilia Taraban @itaraban, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.