Performance optimizations
- Reduced overheads associated with primitive cache.

Intel Processor Graphics and Xe architecture-based Graphics:
- Improved performance of Winograd convolution.
- Improved performance of functionality operating on padded memory formats.
- Improved performance of reorder and shuffle primitives for multiple formats and all dimensions.
- Improved performance of pooling primitive for float16 data type.
- Improved performance of layer normalization (lnorm) primitive for plain formats.
- Improved performance of resampling primitive for blocked formats.

Intel Architecture processors:
- Introduced initial optimizations for bfloat16 functionality for future Intel Xeon Scalable processors with Intel AMX support (code name Sapphire Rapids).
- Improved performance of int8 and bfloat16 RNN and inner product primitives.
- Improved performance of shuffle primitive for bfloat16 data type.
- Introduced a CPU ISA hints environment variable and API. The new API enables dispatching implementations that use YMM registers to improve performance on processors with a single Intel AVX-512 compute unit.
- Improved forward convolution performance for Intel AVX-512 systems.
- Introduced initial performance optimizations for future Intel Core processors with Intel AVX2 and Intel DL Boost instruction support (code name Alder Lake).
- Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
- Improved convolution and batch normalization performance with the threadpool runtime.
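For frameworks that link oneDNN, the ISA hint can also be supplied through the environment before the library initializes. A minimal sketch, assuming the `DNNL_CPU_ISA_HINTS` variable and `PREFER_YMM` value exposed by the new dispatcher controls:

```python
import os

# Must be set before the oneDNN-backed library is loaded, since the
# dispatcher reads the hint once at initialization.
# DNNL_CPU_ISA_HINTS / PREFER_YMM are assumed names from the new controls.
os.environ["DNNL_CPU_ISA_HINTS"] = "PREFER_YMM"

# e.g. importing an oneDNN-backed framework would happen after this point.
print(os.environ["DNNL_CPU_ISA_HINTS"])
```

Setting the hint programmatically through the new API achieves the same effect for applications that call oneDNN directly.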

AArch64-based processors:
- Improved performance of Winograd convolution with ArmCL.
- Improved performance of int8 convolution with ArmCL.
- Added JIT support for AArch64, along with JIT implementations of the reorder, eltwise, pooling, and batch normalization primitives.

NVIDIA GPUs:
- (preview) Introduced support for NVIDIA GPU. The implementation relies on DPC++ Compiler, cuDNN, and cuBLAS libraries.

New Functionality
- Introduced int8 support for LSTM primitive with projection for CPU.
- Introduced a binary post-op for (de)convolution, pooling, eltwise, binary, inner product, matmul, and reduction (GPU only) primitives, along with performance optimizations for CPUs and GPUs.
- Extended the number of supported post-ops for primitives to 20.
- Extended eltwise primitive with support for logsigmoid and clip_v2 algorithms.
- Introduced support for PReLU primitive.
- Extended matmul implementation with support for per-output channel zero-points for quantization.
- Extended support for broadcasting in binary primitive to both inputs for CPU.
- Introduced float16 support in reduction primitive for GPU.
- Introduced support for mixed input and output types in binary primitive for GPU.
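A binary post-op fuses an elementwise binary operation (for example, add or multiply with a second tensor) into a primitive's epilogue, so the extra operation happens while the output is still in cache rather than in a separate pass over memory. The plain-Python sketch below illustrates the computation only; it is not oneDNN API, and the helper name is hypothetical:

```python
def matmul_with_binary_postop(a, b, other, op):
    """2-D matmul followed by an elementwise binary post-op with `other`."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
           for i in range(rows)]
    # Fused epilogue: apply the binary op to the freshly computed output.
    return [[op(out[i][j], other[i][j]) for j in range(cols)] for i in range(rows)]

# Matmul result [[19, 22], [43, 50]] with an elementwise add of ones fused in.
result = matmul_with_binary_postop(
    [[1, 2], [3, 4]], [[5, 6], [7, 8]],
    [[1, 1], [1, 1]], lambda x, y: x + y)
print(result)  # [[20, 23], [44, 51]]
```

In oneDNN itself the same effect is expressed by attaching a binary post-op to the primitive's attributes at creation time, so the fusion is handled inside the library.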

Usability
- Added an API to enable timestamps in oneDNN verbose mode. Timestamps make it possible to correlate oneDNN verbose output with external profiling tools.
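Timestamped verbose output can also be enabled from the environment before oneDNN initializes. A sketch under the assumption that the new toggle follows the naming of the existing `DNNL_VERBOSE` control (the `DNNL_VERBOSE_TIMESTAMP` name is an assumption):

```python
import os

# Enable verbose tracing; values must be set before oneDNN initializes.
os.environ["DNNL_VERBOSE"] = "1"            # existing verbose-mode switch
os.environ["DNNL_VERBOSE_TIMESTAMP"] = "1"  # assumed name for the new timestamp toggle

# With timestamps on, each `dnnl_verbose` line carries a start time, so a
# profiler can align primitive executions with its own timeline.
print(os.environ["DNNL_VERBOSE"], os.environ["DNNL_VERBOSE_TIMESTAMP"])
```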

Validation
- Extended benchdnn to report operation bandwidth.
- Added ability to choose target GPU in benchdnn.

Thanks to the contributors
This release contains contributions from the project core team as well as Alejandro Alvarez, Aleksandr Nikolaev @alenik01, araki.kenichi @qnet-araki, Arthur Mitrano @aaraujom, Benjamin Fitch, Ben Tracy @CodeplayBen, Daniel Soutar @danielsoutar, @dylan-angus-codeplay, Diana Bite @diaena, higuchi.motoko @higuchi-motoko, Jacob Kahn @jacobkahn, Kentaro Kawakami @kawakami-k, Kumudha KN @KumudhaN, kurihara @Koji-Kurihara, Mehdi Goli @mehdi-goli, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Rafik Saliev @rfsaliev, Xinyu Chen @xinyu-intel, yuri@FreeBSD @yurivict. We would also like to thank everyone who asked questions and reported issues.