Performance optimizations
- Reduced overheads associated with primitive cache.

Intel Processor Graphics and Xe architecture-based Graphics:
- Improved performance of Winograd convolution.
- Improved performance of functionality operating on padded memory formats.
- Improved performance of reorder and shuffle primitives for multiple formats and all dimensions.
- Improved performance of pooling primitive for float16 data type.
- Improved performance of layer normalization (lnorm) primitive for plain formats.
- Improved performance of resampling primitive for blocked formats.

Intel Architecture processors:
- Introduced initial optimizations for bfloat16 functionality for future Intel Xeon Scalable processors with Intel AMX support (code name Sapphire Rapids).
- Improved performance of int8 and bfloat16 RNN and inner product primitives.
- Improved performance of shuffle primitive for bfloat16 data type.
- Introduced a CPU ISA hints environment variable and API. The new API enables dispatching implementations that use YMM registers to improve performance on processors with a single Intel AVX-512 compute unit.
- Improved forward convolution performance for Intel AVX-512 systems.
- Introduced initial performance optimizations for future Intel Core processors with Intel AVX2 and Intel DL Boost instruction support (code name Alder Lake).
- Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
- Improved convolution and batch normalization performance with the threadpool runtime.
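For frameworks that link oneDNN, the ISA hint can also be supplied through the environment before the library initializes. A minimal sketch, assuming the `DNNL_CPU_ISA_HINTS` variable and `PREFER_YMM` value exposed by the new dispatcher controls:

```python
import os

# Must be set before the oneDNN-backed library is loaded, since the
# dispatcher reads the hint once at initialization.
# DNNL_CPU_ISA_HINTS / PREFER_YMM are assumed names from the new controls.
os.environ["DNNL_CPU_ISA_HINTS"] = "PREFER_YMM"

# e.g. importing an oneDNN-backed framework would happen after this point.
print(os.environ["DNNL_CPU_ISA_HINTS"])
```

Setting the hint programmatically through the new API achieves the same effect for applications that call oneDNN directly.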

AArch64-based processors:
- Improved performance of Winograd convolution with ArmCL.
- Improved performance of int8 convolution with ArmCL.
- Added JIT support for AArch64, along with JIT implementations of the reorder, eltwise, pooling, and batch normalization primitives.

NVIDIA GPUs:
- (preview) Introduced support for NVIDIA GPU. The implementation relies on DPC++ Compiler, cuDNN, and cuBLAS libraries.

New Functionality
- Introduced int8 support for LSTM primitive with projection for CPU.
- Introduced a binary post-op for (de)convolution, pooling, eltwise, binary, inner product, matmul, and reduction (GPU only) primitives, along with performance optimizations for CPUs and GPUs.
- Extended the number of supported post-ops for primitives to 20.
- Extended eltwise primitive with support for logsigmoid and clip_v2 algorithms.
- Introduced support for PReLU primitive.
- Extended matmul implementation with support for per-output channel zero-points for quantization.
- Extended support for broadcasting in binary primitive to both inputs for CPU.
- Introduced float16 support in reduction primitive for GPU.
- Introduced support for mixed input and output types in binary primitive for GPU.
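A binary post-op fuses an elementwise binary operation (for example, add or multiply with a second tensor) into a primitive's epilogue, so the extra operation happens while the output is still in cache rather than in a separate pass over memory. The plain-Python sketch below illustrates the computation only; it is not oneDNN API, and the helper name is hypothetical:

```python
def matmul_with_binary_postop(a, b, other, op):
    """2-D matmul followed by an elementwise binary post-op with `other`."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
           for i in range(rows)]
    # Fused epilogue: apply the binary op to the freshly computed output.
    return [[op(out[i][j], other[i][j]) for j in range(cols)] for i in range(rows)]

# Matmul result [[19, 22], [43, 50]] with an elementwise add of ones fused in.
result = matmul_with_binary_postop(
    [[1, 2], [3, 4]], [[5, 6], [7, 8]],
    [[1, 1], [1, 1]], lambda x, y: x + y)
print(result)  # [[20, 23], [44, 51]]
```

In oneDNN itself the same effect is expressed by attaching a binary post-op to the primitive's attributes at creation time, so the fusion is handled inside the library.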

Usability
- Added an API to enable timestamps in oneDNN verbose mode. Timestamps make it possible to correlate oneDNN verbose output with external profiling tools.
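Timestamped verbose output can also be enabled from the environment before oneDNN initializes. A sketch under the assumption that the new toggle follows the naming of the existing `DNNL_VERBOSE` control (the `DNNL_VERBOSE_TIMESTAMP` name is an assumption):

```python
import os

# Enable verbose tracing; values must be set before oneDNN initializes.
os.environ["DNNL_VERBOSE"] = "1"            # existing verbose-mode switch
os.environ["DNNL_VERBOSE_TIMESTAMP"] = "1"  # assumed name for the new timestamp toggle

# With timestamps on, each `dnnl_verbose` line carries a start time, so a
# profiler can align primitive executions with its own timeline.
print(os.environ["DNNL_VERBOSE"], os.environ["DNNL_VERBOSE_TIMESTAMP"])
```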

Validation
- Extended benchdnn to report operation bandwidth.
- Added ability to choose target GPU in benchdnn.

Thanks to the contributors
This release contains contributions from the project core team as well as Alejandro Alvarez, Aleksandr Nikolaev @alenik01, araki.kenichi @qnet-araki, Arthur Mitrano @aaraujom, Benjamin Fitch, Ben Tracy @CodeplayBen, Daniel Soutar @danielsoutar, @dylan-angus-codeplay, Diana Bite @diaena, higuchi.motoko @higuchi-motoko, Jacob Kahn @jacobkahn, Kentaro Kawakami @kawakami-k, Kumudha KN @KumudhaN, kurihara @Koji-Kurihara, Mehdi Goli @mehdi-goli, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Rafik Saliev @rfsaliev, Xinyu Chen @xinyu-intel, yuri@FreeBSD @yurivict. We would also like to thank everyone who asked questions and reported issues.