Performance optimizations
- Intel Processor Graphics and Xe architecture-based Graphics:
  - Improved performance of convolutions and matmul primitives.
  - Improved performance of int8 convolutions for NHWC activations format.
- Intel Architecture processors:
  - Improved performance of primitives for NHWC activations format.
  - Improved fp32 GEMM performance for small N.
  - Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
- AArch64-based processors:
  - Added support for Arm Performance Libraries (ArmPL). ArmPL provides an optimized GEMM implementation for AArch64.
  - Added support for Arm Compute Library (ArmCL). ArmCL provides optimized convolution implementations for AArch64.
New Functionality
- Added support for IBM Z (s390x) and IBM POWER (powerpc64) architectures.
- Introduced RNN GRU for GPU.
- Introduced int8 RNN GRU for CPU.
- Introduced asymmetric quantization support for convolutions and matmul.
- Introduced dilated pooling support.
- Extended the matmul primitive to support multiple batch dimensions and batch broadcast on CPU.
- (preview) Introduced a binary post-op for (de)convolution, pooling, eltwise, binary, inner product, and matmul (see the sketch after this list).
- (preview) Increased the number of supported post-ops per primitive to 20.
- (preview) Introduced a reduction primitive for CPU. Together with post-ops, this functionality allows implementing normalization.
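As an illustration of the preview binary post-op combined with the extended matmul batch broadcast, below is a minimal sketch using the oneDNN C++ API. The shapes, variable names, and the choice of fusing an element-wise addition are assumptions for demonstration only; since the post-op is a preview feature, support may vary with build and configuration.

```cpp
#include <unordered_map>
#include "dnnl.hpp"

using namespace dnnl;
using tag = memory::format_tag;
using dt = memory::data_type;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Batched matmul C = A * B, where B has batch size 1 and is
    // broadcast across the batch dimension of A (hypothetical shapes).
    memory::desc a_md({8, 16, 32}, dt::f32, tag::abc);
    memory::desc b_md({1, 32, 64}, dt::f32, tag::abc);
    memory::desc c_md({8, 16, 64}, dt::f32, tag::abc);

    // Binary post-op: fuse an element-wise addition of a second tensor
    // into the matmul epilogue.
    memory::desc add_md({8, 16, 64}, dt::f32, tag::abc);
    post_ops po;
    po.append_binary(algorithm::binary_add, add_md);
    primitive_attr attr;
    attr.set_post_ops(po);

    auto pd = matmul::primitive_desc(matmul::desc(a_md, b_md, c_md), attr, eng);
    auto prim = matmul(pd);

    // Memory objects backing the tensors (left uninitialized here).
    memory a_mem(a_md, eng), b_mem(b_md, eng), c_mem(c_md, eng), add_mem(add_md, eng);
    prim.execute(strm,
            {{DNNL_ARG_SRC, a_mem}, {DNNL_ARG_WEIGHTS, b_mem},
                    {DNNL_ARG_DST, c_mem},
                    {DNNL_ARG_ATTR_MULTIPLE_POST_OP(0) | DNNL_ARG_SRC_1, add_mem}});
    strm.wait();
    return 0;
}
```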
Thanks to the contributors
This release contains contributions from the project core team as well as Ben Fitch, Brian Shi, David Edelsohn @edelsohn, Diana Bite @diaena, Moaz Reyad @moazreyad, Nathan John Sircombe @nSircombe, Niels Dekker @N-Dekker, Peter Caday @petercad, Pinzhen Xu @pinzhenx, pkubaj @pkubaj, Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.