oneDNN v2.0-beta10

This is a preview release for oneDNN v2.0. The release is based on oneDNN v1.7.

A binary distribution of this software is available as the Intel(R) oneAPI Deep Neural Network Library in Intel(R) oneAPI.

Performance optimizations

  • Intel Processor Graphics and Xe architecture-based Graphics:
    • Improved performance of convolutions and matmul primitives.
    • Improved performance of int8 convolutions for NHWC activations format.
  • Intel Architecture processors:
    • Improved performance of primitives for the NHWC activations format (see the layout sketch after this list).
    • Improved fp32 GEMM performance for small N.
    • Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
  • AArch64-based processors:
    • Added support for Arm Performance Library (ArmPL). ArmPL provides an optimized GEMM implementation for AArch64.
    • Added support for [Arm Compute Library (ArmCL)](https://github.com/arm-software/ComputeLibrary). ArmCL provides an optimized convolution implementation for AArch64.
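
The NHWC optimizations above take effect when activations use the NHWC memory format. Below is a minimal C++ sketch of requesting that layout through the memory descriptor API; the shapes are arbitrary illustration values. Note that logical dimensions are always passed in NCHW order, and the format tag selects the physical layout.

    #include "dnnl.hpp"

    int main() {
        using namespace dnnl;
        engine eng(engine::kind::cpu, 0);

        // Logical dims are always given in NCHW order; format_tag::nhwc
        // selects the physical NHWC layout targeted by the optimizations above.
        memory::desc act_md({8, 32, 14, 14}, memory::data_type::f32,
                memory::format_tag::nhwc);

        memory activations(act_md, eng); // library-allocated NHWC buffer
        (void)activations;
        return 0;
    }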

New Functionality

  • Added support for IBM Z (s390x) and IBM POWER (powerpc64) architectures.
  • Introduced RNN GRU for GPU.
  • Introduced int8 RNN GRU for CPU.
  • Introduced asymmetric quantization support for convolutions, matmul, and inner product.
  • Introduced dilated pooling support.
  • Extended the matmul primitive to support multiple batch dimensions and batch broadcast on CPU.
  • (preview) Introduced binary post-op for (de)convolution, pooling, eltwise, binary, inner product, and matmul (see the matmul sketch after this list).
  • (preview) Extended the number of supported post-ops for primitives to 20.
  • (preview) Introduced reduction primitive for CPU. Together with post-ops, this functionality makes it possible to implement normalization.
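
The following minimal sketch combines two of the features above: a matmul whose weights are broadcast across the batch dimension, with a binary post-op adding a broadcast tensor to the result. The shapes are illustrative, and the post-op argument macro shown (DNNL_ARG_ATTR_MULTIPLE_POST_OP) follows the 2.x preview API; check the headers shipped with this release for the exact spelling.

    #include <vector>
    #include "dnnl.hpp"

    using namespace dnnl;

    int main() {
        engine eng(engine::kind::cpu, 0);
        stream s(eng);

        // 3D matmul: src has batch size 4; weights use batch size 1 and are
        // broadcast across the batch.
        memory::desc src_md({4, 3, 5}, memory::data_type::f32, memory::format_tag::abc);
        memory::desc wei_md({1, 5, 2}, memory::data_type::f32, memory::format_tag::abc);
        memory::desc dst_md({4, 3, 2}, memory::data_type::f32, memory::format_tag::abc);

        // Binary post-op: add a tensor (broadcast over batch and rows) to the result.
        memory::desc bin_md({1, 1, 2}, memory::data_type::f32, memory::format_tag::abc);
        post_ops po;
        po.append_binary(algorithm::binary_add, bin_md);
        primitive_attr attr;
        attr.set_post_ops(po);

        matmul::desc md(src_md, wei_md, dst_md);
        matmul::primitive_desc pd(md, attr, eng);
        matmul prim(pd);

        std::vector<float> src(4 * 3 * 5, 1.f), wei(5 * 2, 1.f);
        std::vector<float> dst(4 * 3 * 2), bin(2, 0.5f);
        memory src_m(src_md, eng, src.data()), wei_m(wei_md, eng, wei.data());
        memory dst_m(dst_md, eng, dst.data()), bin_m(bin_md, eng, bin.data());

        prim.execute(s, {{DNNL_ARG_SRC, src_m}, {DNNL_ARG_WEIGHTS, wei_m},
                {DNNL_ARG_DST, dst_m},
                {DNNL_ARG_ATTR_MULTIPLE_POST_OP(0) | DNNL_ARG_SRC_1, bin_m}});
        s.wait();
        return 0;
    }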

Thanks to the contributors

This release contains contributions from the project core team as well as Ben Fitch, Brian Shi, David Edelsohn @edelsohn, Diana Bite @diaena, Moaz Reyad @moazreyad, Nathan John Sircombe @nSircombe, Niels Dekker @N-Dekker, Peter Caday @petercad, Pinzhen Xu @pinzhenx, pkubaj @pkubaj, Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.

Known Issues and Limitations

  • f32 convolutions may hang sporadically on Intel Processor Graphics Gen11. No workaround available.
  • Pooling, batch normalization, and binary primitives may segfault when executed on Xe architecture-based graphics. No workaround available.
  • oneDNN functionality may corrupt memory and crash the application on all GPU platforms when using the Level Zero runtime in USM mode. As a workaround, use SYCL buffers or the OpenCL runtime:
    export SYCL_BE=PI_OPENCL
  • The matmul primitive may hang on GPU with the Level Zero runtime on Windows. As a workaround, use the OpenCL runtime:
    export SYCL_BE=PI_OPENCL
  • Convolution may hang on GPU for shapes with 3 input channels. No workaround available.
  • Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check that the GPU device is an Intel device. For more control, users can create a DNNL engine by passing a SYCL device and context explicitly (see the sketch after this list).
  • GPU kernels that run longer than a certain time (which depends on the OS and system settings) may trigger the system's GPU watchdog timeout, making the application appear to hang. Driver or system settings can be configured to disable this timeout so that long-running DPC++ or OpenCL programs, including oneDNN examples, are not affected:
    • On Linux* (see more details in OpenCL™ Driver for Intel® HD, Iris™, and Iris™ Pro Graphics for Linux):
      $ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'
    • On Windows* (see more details in Timeout Detection and Recovery (TDR) Registry Keys):
      Increase the TdrDelay and TdrDdiDelay values in the registry.
  • See DPC++ limitations that impact the library as well.
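
As noted in the non-Intel GPU limitation above, an engine can be created from an explicit SYCL device and context instead of by index. A minimal sketch follows, assuming the SYCL interop API shape of oneDNN 2.x (dnnl_sycl.hpp and dnnl::sycl_interop::make_engine); check the headers shipped with this beta for the exact spelling.

    #include <CL/sycl.hpp>
    #include "dnnl.hpp"
    #include "dnnl_sycl.hpp" // SYCL interop header; 2.x naming assumed here

    int main() {
        // Select a GPU explicitly so the engine never lands on an
        // unsupported (non-Intel) device picked by index.
        cl::sycl::device dev{cl::sycl::gpu_selector{}};
        cl::sycl::context ctx{dev};

        dnnl::engine eng = dnnl::sycl_interop::make_engine(dev, ctx);
        dnnl::stream strm(eng);
        (void)strm;
        return 0;
    }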
