Performance optimizations
Intel Architecture processors
- Improved performance of functionality related to convolutional neural networks (CNN) with NHWC activations on all supported processors
- Improved binary primitive performance for the broadcast case
- Improved performance of eltwise primitive backpropagation and corresponding post-ops
- Improved performance of pooling, resampling, and LRN primitives
- Improved performance of bfloat16 and fp32 weights gradient convolutions with groups
- Improved performance of int8 convolutions with 1x1 kernel and spatial strides
Intel Processor Graphics and Xe architecture-based Graphics
- Introduced initial optimizations for Xe architecture-based Graphics (code named DG1 and Tiger Lake).
- Improved performance of functionality related to convolutional neural networks (CNN) with NHWC activations.
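For readers unfamiliar with the term, NHWC is the channels-last activation layout, in contrast to the channels-first NCHW layout. The following is a minimal sketch of how a logical element maps to a linear offset in each layout, assuming dense, unpadded buffers. The `Dims` struct and both function names are hypothetical and exist only for illustration; they are not part of the library's API.

```cpp
#include <cassert>
#include <cstddef>

// Logical tensor sizes: batch (N), channels (C), height (H), width (W).
// Hypothetical helper type, used only in this sketch.
struct Dims {
    std::size_t n, c, h, w;
};

// Offset of logical element (n, c, h, w) in a dense NCHW buffer.
// Width varies fastest; each channel is stored as a contiguous H x W plane.
std::size_t offset_nchw(const Dims &d, std::size_t n, std::size_t c,
        std::size_t h, std::size_t w) {
    assert(n < d.n && c < d.c && h < d.h && w < d.w);
    return ((n * d.c + c) * d.h + h) * d.w + w;
}

// Offset of the same logical element in a dense NHWC buffer.
// Channels vary fastest; all channels of one pixel are adjacent in memory.
std::size_t offset_nhwc(const Dims &d, std::size_t n, std::size_t c,
        std::size_t h, std::size_t w) {
    assert(n < d.n && c < d.c && h < d.h && w < d.w);
    return ((n * d.h + h) * d.w + w) * d.c + c;
}
```

With sizes N=2, C=3, H=4, W=5, moving from channel 0 to channel 1 of the first pixel advances the offset by 1 in NHWC but by H*W = 20 in NCHW. Both layouts place the last logical element at the last position of the buffer.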
Usability
- Introduced support for Arm* 64-bit Architecture (AArch64) and other non-x86 processors.
- Separated the primitive cache state from the engine, making the cache persistent.
- Introduced an API for managing primitive cache state.
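To illustrate what it means for the cache to be separate from the engine, here is a toy sketch of a process-wide, capacity-bounded LRU cache with an adjustable capacity. It is not the library's implementation and does not use the library's API. `ToyPrimitive`, `ToyPrimitiveCache`, the string keys, and the default capacity are all assumptions made for this example.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a compiled primitive; not a library type.
struct ToyPrimitive {
    std::string key;
};

// Process-wide LRU cache that is not owned by any engine object, so its
// contents can outlive the engines that populated it.
class ToyPrimitiveCache {
public:
    static ToyPrimitiveCache &instance() {
        static ToyPrimitiveCache cache;
        return cache;
    }

    std::size_t capacity() const { return capacity_; }
    std::size_t size() const { return entries_.size(); }

    // Shrinking the capacity evicts the least recently used entries.
    // In this sketch, a capacity of 0 empties the cache and disables caching.
    void set_capacity(std::size_t capacity) {
        capacity_ = capacity;
        evict();
    }

    // Returns the entry for `key`, creating it on a miss.
    // `hit` reports whether the entry was already cached.
    ToyPrimitive get_or_create(const std::string &key, bool &hit) {
        auto it = index_.find(key);
        hit = (it != index_.end());
        if (hit) {
            // Mark as most recently used by moving it to the front.
            entries_.splice(entries_.begin(), entries_, it->second);
            return *it->second;
        }
        ToyPrimitive p{key};
        if (capacity_ == 0) return p; // caching disabled
        entries_.push_front(p);
        index_[key] = entries_.begin();
        evict();
        return p;
    }

private:
    ToyPrimitiveCache() = default;

    // Drop entries from the back (least recently used) until within capacity.
    void evict() {
        while (entries_.size() > capacity_) {
            index_.erase(entries_.back().key);
            entries_.pop_back();
        }
    }

    std::size_t capacity_ = 1024; // arbitrary default for this sketch
    std::list<ToyPrimitive> entries_;
    std::unordered_map<std::string, std::list<ToyPrimitive>::iterator> index_;
};
```

Because the cache in this sketch is a singleton rather than a member of an engine, a second request for the same key is a hit even if the requesting engine was created after the first request, and the capacity can be queried or changed from anywhere in the program.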
Validation
- Introduced a validation mode to detect out-of-bounds access.
Thanks to the contributors
This release contains contributions from the project core team as well as Anuj Mittal @anujm1, Arthur Mitrano @aaraujom, Benjamin Fitch, Ilia Taraban @itaraban, Leona C. @indie, Nathan John Sircombe @nSircombe, Sergey Nesterov @cepera, Tsao Zhong @CaoZhongZ, yuri@FreeBSD @yurivict. We would also like to thank everyone who asked questions and reported issues.