Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
- Introduced initial optimizations for future Intel Xeon Scalable processor (code name Sierra Forest). The functionality is disabled by default and should be enabled via CPU dispatcher control.
Intel Graphics Products:
- Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
- Improved concat primitive performance with per-argument scales on Intel GPUs.
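For readers unfamiliar with per-argument scales, the following is a minimal reference sketch of what a scaled concat computes. It assumes each source is multiplied by its own scale before being copied into the destination. It is plain C++ for illustration only and does not call into oneDNN; the function name `concat_with_scales` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference (non-optimized) concat with one scale per source argument.
// Hypothetical helper for illustration; this is not a oneDNN API.
std::vector<float> concat_with_scales(
        const std::vector<std::vector<float>> &srcs,
        const std::vector<float> &scales) {
    // Assumption: exactly one scale is supplied for each source.
    assert(srcs.size() == scales.size());

    std::vector<float> dst;
    for (std::size_t i = 0; i < srcs.size(); ++i) {
        // Scale every element of source i, then append it to the output.
        for (float v : srcs[i])
            dst.push_back(scales[i] * v);
    }
    return dst;
}
```

Calling `concat_with_scales({{1.0f, 2.0f}, {3.0f}}, {2.0f, 0.5f})` scales the first source by 2 and the second by 0.5 before joining them.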
AArch64-based Processors:
- Improved layer normalization primitive performance with ACL.
AMD GPUs:
- Introduced optimized matmul implementation.
RISC-V-based Processors:
- Improved pooling primitive performance for processors with RISC-V vector extension (RVV) support.

Functionality

Enabled Graph API as a production feature. Graph API is intended to simplify oneDNN integration into frameworks.
Added an option to zero out the weight gradient in the RNN primitive. See the corresponding RFC for details.
[experimental] Added support for sparse memory and dense-by-sparse matrix-matrix multiplication in the matmul primitive. The functionality is supported on processors with Intel AVX2 and Intel AVX-512 instruction support.
Introduced support for out-of-order queues in the OpenCL runtime. See the OpenCL Interoperability section in the developer guide for more details.
Added support for non-zero alpha parameter in batch normalization ReLU post-op on Intel GPUs.
Enabled layer normalization primitive with f64 datatype support on Intel GPUs.
Added support for per-argument scales in matmul, convolution, inner product, and reorder primitives on NVIDIA GPUs.
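As an illustration of the dense by sparse matrix-matrix multiplication mentioned above for the matmul primitive, here is a minimal reference sketch. It assumes the sparse operand is stored in CSR form and that the dense operand is row-major. The `CsrMatrix` struct and `dense_by_sparse_matmul` function are hypothetical names used for this example; the code is plain C++ and does not use the oneDNN sparse memory API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container for the sparse K x N matrix B.
struct CsrMatrix {
    std::size_t rows;                  // K
    std::size_t cols;                  // N
    std::vector<float> values;         // non-zero values, stored row by row
    std::vector<std::size_t> col_idx;  // column index of each non-zero value
    std::vector<std::size_t> row_ptr;  // rows + 1 offsets into values
};

// Computes C (M x N, row-major) = A (M x K, row-major, dense) * B (CSR).
std::vector<float> dense_by_sparse_matmul(const std::vector<float> &a,
        std::size_t m, const CsrMatrix &b) {
    assert(a.size() == m * b.rows);
    assert(b.row_ptr.size() == b.rows + 1);

    std::vector<float> c(m * b.cols, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t k = 0; k < b.rows; ++k) {
            const float a_ik = a[i * b.rows + k];
            // Only the stored non-zeros of row k of B contribute to row i of C.
            for (std::size_t p = b.row_ptr[k]; p < b.row_ptr[k + 1]; ++p)
                c[i * b.cols + b.col_idx[p]] += a_ik * b.values[p];
        }
    }
    return c;
}
```

The inner loop visits only stored non-zeros, so the work scales with the number of non-zero entries in B instead of with K x N. This is the general motivation for a sparse format; an optimized implementation would differ considerably from this sketch.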

[experimental] Extended benchdnn with functional and performance validation for Graph API.

Builds with OpenCL runtime will fail unless Graph API is disabled with ONEDNN_BUILD_GRAPH=OFF.

Graph API constant cache feature is disabled with SYCL CPU runtime due to an issue with oneAPI DPC++ Compiler runtime. This results in lower performance in some scenarios.

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, Annop Wongwathanarat @annop-w, @arlesniak, @bdmoore1, Crefeda Rodrigues @cfRod, David Svantesson @davsva01, Fadi Arafeh @fadara01, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Pavel Zamelin @pazamelin, Pawel Piotrowicz @pawelpiotrowicz, Peter Caday @petercad, @ranzhejiang, and Sanchit Grover @sanchit-grover-intel. We would also like to thank everyone who asked questions and reported issues.