Performance Optimizations

Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
- Improved performance for future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via CPU dispatcher control.
- Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
- Improved s32 binary primitive performance.
- Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instruction set support.
- Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
- Improved performance of convolution for depthwise cases with Graph API.
- [experimental] Improved performance of LLAMA2 MLP block with Graph Compiler.
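
The CPU dispatcher control mentioned for the future Intel Xeon Scalable processors can be driven through the `ONEDNN_MAX_CPU_ISA` environment variable, which has to be set before the application first uses oneDNN. The following is a minimal sketch, not the project's recommended tooling: the helper names and the `./my_app` binary are hypothetical, and the values `AVX512_CORE_AMX_FP16` (Granite Rapids) and `AVX2_VNNI_2` (Sierra Forest) are assumptions to verify against the CPU dispatcher control documentation for your build.

```python
import os
import subprocess


def env_with_max_cpu_isa(isa, base_env=None):
    """Return a copy of the environment with ONEDNN_MAX_CPU_ISA set to `isa`."""
    env = dict(os.environ if base_env is None else base_env)
    env["ONEDNN_MAX_CPU_ISA"] = isa
    return env


def run_with_max_cpu_isa(command, isa):
    """Run `command` (a list of arguments) in a child process with the ISA ceiling applied."""
    return subprocess.run(command, env=env_with_max_cpu_isa(isa), check=True)


# Hypothetical usage; "./my_app" stands in for any application linked against oneDNN:
# run_with_max_cpu_isa(["./my_app"], "AVX512_CORE_AMX_FP16")
```

Launching the workload as a child process keeps the setting scoped to that run and guarantees the variable is in place before the library initializes.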
Intel Graphics Products:
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
- Reduced RNN primitive initialization time on Intel GPUs.
AArch64-based Processors:
- Improved fp32 to bf16 reorder performance.
- Improved max pooling performance with Arm Compute Library (ACL).
- Improved dilated convolution performance for depthwise cases with ACL.

Functionality

Introduced group normalization primitive support. The functionality is currently available on CPUs.
Intel CPUs:
- Introduced support for zero points in int8 convolution with groups and 3D spatial dimensions.

Usability

Extended verbose mode output:
- Improved diagnostics on engine creation errors.
- Added information on Graph API calls.
- Added information on strides for non-dense memory objects.
- Added values of runtime dimensions.
- Added an indication that a primitive descriptor was created with the `any` memory format tag.
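
To inspect the extended verbose output, an application can be run with the `ONEDNN_VERBOSE` environment variable set. The sketch below is illustrative only: it assumes verbose lines are printed to standard output with the `onednn_verbose` prefix, and the helper name `collect_verbose_lines` is hypothetical. It does not try to parse individual fields, since the exact line layout depends on the library version.

```python
import os
import subprocess


def collect_verbose_lines(command, level="1"):
    """Run `command` with ONEDNN_VERBOSE set and return only the oneDNN verbose lines."""
    env = dict(os.environ, ONEDNN_VERBOSE=str(level))
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )
    # Keep lines emitted by the library; drop the application's own output.
    return [
        line for line in result.stdout.splitlines()
        if line.startswith("onednn_verbose")
    ]


# Hypothetical usage; "./my_app" stands in for any application linked against oneDNN:
# for line in collect_verbose_lines(["./my_app"], level="2"):
#     print(line)
```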
Introduced examples for Graph API.
Graph API constant tensor cache is now disabled by default and requires opt-in with [dnnl::graph::set_constant_tensor_cache()](https://oneapi-src.github.io/oneDNN/group_dnnl_graph_api_constant_tensor_cache.html#doxid-group-dnnl-graph-api-constant-tensor-cache-1ga9e37974d35ff5aafe1cbae2f69a2ab00) call.
Reduced oneDNN Graph API memory consumption in certain scenarios.

Validation

Extended benchdnn performance reporting with primitive creation time.
Introduced cold cache mode in benchdnn.
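
A cold cache run might be assembled as in the sketch below. It assumes the benchdnn knobs `--mode=P` (performance mode) and `--cold-cache=MODE` with the modes `none`, `wei`, `all`, and `custom`; confirm these against the benchdnn documentation shipped with your build. The helper name, the binary path, and the convolution problem descriptor are placeholders chosen for illustration.

```python
def benchdnn_cold_cache_command(driver, problem, cold_cache="all",
                                benchdnn_path="./benchdnn"):
    """Build an argument list for a benchdnn performance run in cold cache mode."""
    allowed = {"none", "wei", "all", "custom"}  # assumed set of supported modes
    if cold_cache not in allowed:
        raise ValueError(f"unsupported cold cache mode: {cold_cache!r}")
    return [
        benchdnn_path,
        f"--{driver}",                 # driver selection, e.g. --conv
        "--mode=P",                    # performance mode (assumed prerequisite)
        f"--cold-cache={cold_cache}",
        problem,                       # problem descriptor for the chosen driver
    ]


# Illustrative descriptor only; substitute a problem relevant to your workload.
command = benchdnn_cold_cache_command("conv", "mb1ic16ih7oc16oh7kh3ph1", cold_cache="wei")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting issues.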

Known Limitations

The current GPU OpenCL runtime for Linux has an issue that causes convolution to produce incorrect results on integrated GPUs based on Xe architecture. SYCL configuration is not affected.
Pooling, resampling, prelu, batch normalization, layer normalization, and eltwise primitives may sporadically produce incorrect results on Intel Arc GPUs on Windows.
The current GPU driver for Linux has an issue resulting in program hangs or crashes when oneDNN primitives are executed concurrently on the Intel Data Center GPU Max Series.
Extensive use of RNN primitive on Intel GPUs with default primitive cache setting may lead to a device reboot. Workaround: consider reducing primitive cache size to 100.
Int8 deconvolution with signed weights and activations may produce incorrect results on processors with Intel AMX support.
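
The RNN workaround above (reducing the primitive cache size to 100) can be applied without rebuilding the application by setting the `ONEDNN_PRIMITIVE_CACHE_CAPACITY` environment variable before the process starts; the C++ API offers `dnnl::set_primitive_cache_capacity()` for the same purpose. A minimal sketch, with a hypothetical helper name and a placeholder `./my_rnn_app` binary:

```python
import os


def env_with_primitive_cache_capacity(capacity, base_env=None):
    """Return a copy of the environment with the oneDNN primitive cache capacity set."""
    if not isinstance(capacity, int) or capacity < 0:
        raise ValueError("capacity must be a non-negative integer")
    env = dict(os.environ if base_env is None else base_env)
    env["ONEDNN_PRIMITIVE_CACHE_CAPACITY"] = str(capacity)
    return env


# Hypothetical usage; "./my_rnn_app" stands in for the affected application:
# import subprocess
# subprocess.run(["./my_rnn_app"], env=env_with_primitive_cache_capacity(100), check=True)
```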

Thanks to these Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, @baibeta, Benjamin Taylor @bentaylorhk-arm, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, @snadampal, @sparkyrider, Thomas Köppe @tkoeppe. We would also like to thank everyone who asked questions and reported issues.

oneapi-src/oneDNN v3.3 on GitHub