# Performance Optimizations
## Intel Architecture Processors
- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of `int8` and `fp32` forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of `fp8` matmul primitives with `bf16` and `fp16` bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of `int8` RNN primitive on processors with Intel AVX2 and Intel AVX-512 instruction set support.
- Improved performance of `int8` depthwise separable convolution primitive with per-channel zero points on processors with Intel AVX2 and Intel AVX-512 instruction set support.
- Improved `fp16` and `bf16` softmax performance with relaxed accumulation mode.
- Improved performance of `int8` matmul primitive with `fp16` output data type.
- Improved performance of the following subgraphs with Graph API:
## Intel Graphics Products
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
- Improved performance of convolution with source zero points by pre-packing compensation.
- Improved performance of backward by data convolution with strides for large filters.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot-Product Attention (SDPA) with implicit causal mask.
  - SDPA with `int8` or `int4` compressed key and value.
  - Gated MLP.
## AArch64-based Processors
- Improved `bf16` matmul performance with `fp32` destination with Arm Compute Library (ACL).
- Improved `bf16` to `fp32` reorder performance.
- Improved `bf16` reorder performance.
- Improved `bf16` convolution performance with ACL.
## NVIDIA GPUs
- Improved matmul performance using cuBLASLt-based implementation.
# Functionality
## Common
- Introduced support for `select` algorithm in binary primitive. The functionality is optimized for Intel CPUs.
- Extended quantization support in matmul and reorder with grouped scales and zero-points for weights. This functionality is optimized for Intel CPUs and GPUs (see the sketch after this list).
- Introduced initial support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as `e8m0` scales data type in matmul and reorder. This functionality is available on Intel CPUs and GPUs.
- Introduced `GenIndex` and `GreaterEqual` operations in Graph API.
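As an illustration of the grouped-quantization item above, here is a minimal sketch of attaching grouped weight scales to a matmul through the attribute API. The shapes, the group size `G`, and the `bf16` fpmath decompression hint are assumptions made for the example, not requirements stated in the release.

```cpp
// Sketch: int8-weight matmul with grouped (per G x 1 block) weight scales.
// Assumes the oneDNN C++ API (dnnl.hpp) and an implementation that supports
// this configuration on the chosen engine.
#include <dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dim M = 32, K = 128, N = 64, G = 32; // G: scale group size along K

    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    // Allow integer weights to be converted to a floating-point type for
    // compute (weight decompression); grouped zero points can be attached
    // analogously via set_zero_points().
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);
    // One f32 scale per G x 1 block of the K x N weights tensor.
    attr.set_scales(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) + (1 << 1), {G, 1},
            memory::data_type::f32);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    matmul prim(pd);

    memory::desc scale_md({K / G, N}, memory::data_type::f32, memory::format_tag::ab);
    memory src(src_md, eng), wei(wei_md, eng), dst(dst_md, eng);
    memory scales(scale_md, eng);

    prim.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei},
            {DNNL_ARG_DST, dst},
            {DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS, scales}});
    strm.wait();
    return 0;
}
```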
## Intel Architecture Processors
- Introduced support for `fp32` matmul with `fp16` and `bf16` weights.
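A minimal sketch of what this descriptor combination looks like; the helper name and shapes are hypothetical, an engine with `fp16` support is assumed, and whether an additional fpmath hint is needed may depend on the build and hardware.

```cpp
#include <dnnl.hpp>
using namespace dnnl;

// Hypothetical helper: an f32 matmul whose weights stay in f16 (substitute
// memory::data_type::bf16 for bf16 weights). Assumes the target implementation
// accepts this mixed-precision combination directly.
matmul make_f32_matmul_with_f16_weights(const engine &eng,
        memory::dim M, memory::dim K, memory::dim N) {
    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);
    return matmul(matmul::primitive_desc(eng, src_md, wei_md, dst_md));
}
```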
## Intel Graphics Products
- Introduced stochastic rounding support for convolution, matmul and reorder based on the Philox counter-based random number generator.
- Introduced support for strided memory formats in convolution (see the sketch below).
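A sketch of the strided-format item: the convolution source is described through explicit strides (here, a channel-cropped view of a larger buffer). The shapes, strides, and GPU engine index are assumptions for illustration.

```cpp
#include <dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::gpu, 0); // assumes an Intel GPU is present
    const memory::dim N = 1, IC = 32, IH = 28, IW = 28, OC = 32, KH = 3, KW = 3;

    // Source is a view into a larger buffer holding 64 channels, so the
    // layout is described by strides rather than a dense nchw format tag.
    memory::dims src_strides = {64 * IH * IW, IH * IW, IW, 1};
    memory::desc src_md({N, IC, IH, IW}, memory::data_type::f16, src_strides);
    memory::desc wei_md({OC, IC, KH, KW}, memory::data_type::f16,
            memory::format_tag::oihw);
    memory::desc dst_md({N, OC, IH, IW}, memory::data_type::f16,
            memory::format_tag::nchw);

    convolution_forward::primitive_desc pd(eng, prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_md, dst_md,
            /*strides=*/{1, 1}, /*padding_l=*/{1, 1}, /*padding_r=*/{1, 1});
    (void)pd; // memory allocation and execution omitted for brevity
    return 0;
}
```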
## Generic GPU vendor
- Introduced support for reduction primitive (see the sketch after this list).
- Introduced support for inner product primitive forward propagation.
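The sketch referenced above uses the standard reduction primitive API; the SYCL GPU engine and the sum-over-last-axis shape are assumptions made for illustration.

```cpp
#include <dnnl.hpp>
using namespace dnnl;

int main() {
    // With a generic (vendor-agnostic SYCL) build, engine::kind::gpu can map
    // to a non-Intel, non-NVIDIA device; index 0 is assumed here.
    engine eng(engine::kind::gpu, 0);
    stream strm(eng);

    // Sum-reduce a 2 x 64 x 128 tensor over its last dimension.
    memory::desc src_md({2, 64, 128}, memory::data_type::f32, memory::format_tag::abc);
    memory::desc dst_md({2, 64, 1}, memory::data_type::f32, memory::format_tag::abc);

    reduction::primitive_desc pd(eng, algorithm::reduction_sum, src_md, dst_md,
            /*p=*/0.f, /*eps=*/0.f);
    reduction prim(pd);

    memory src(src_md, eng), dst(dst_md, eng);
    prim.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    strm.wait();
    return 0;
}
```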
# Usability
## Common
- With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive for the duration of the primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines (see the sketch after this list).
- Added Graph API examples for Gated MLP and `int4` Gated MLP patterns.
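A sketch of what the relaxed lifetime rule allows, assuming a SYCL (DPC++) build; the ReLU primitive and shapes are only stand-ins for illustration.

```cpp
#include <dnnl.hpp>
using namespace dnnl;

void run_relu(const engine &cpu_eng, stream &strm) {
    memory::desc md({64, 64}, memory::data_type::f32, memory::format_tag::ab);
    eltwise_forward::primitive_desc pd(cpu_eng, prop_kind::forward_inference,
            algorithm::eltwise_relu, md, md, /*alpha=*/0.f, /*beta=*/0.f);
    eltwise_forward relu(pd);

    {
        memory src(md, cpu_eng), dst(md, cpu_eng);
        relu.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
        // src and dst go out of scope here, possibly before the asynchronous
        // execution completes; with reference-counted SYCL CPU memory objects
        // the underlying buffers now stay alive until the primitive finishes.
    }
    strm.wait();
}
```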
## Intel Architecture Processors
- Improved verbose diagnostics to better identify issues during dispatching, primitive and kernel creation for Intel CPU and Intel GPU implementations.
- Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
## Intel Processor Graphics
- Improved verbose diagnostics for Intel GPU driver compatibility issues.
- Improved support for large tensors in convolution, matmul and reduction primitives on Intel GPUs.
- Reduced scratchpad usage for NCHW convolution on Intel GPUs.
## AArch64-based Processors
- Added support for the Arm Compute Library (ACL) thread_local scheduler via ThreadpoolScheduler.
- Improved memory efficiency in ACL matmuls by fixing a bug where scratchpad memory was not being used.
- Made the ACL matmul primitive thread-safe, which allows concurrent execution.
# Validation
- Extended benchdnn with support and validation for fp8 matmul patterns.
- Extended benchdnn with support for tensor tags in RNN primitive validation.
- Extended benchdnn with support for rewriting data types in the test JSON files in the graph driver.
- Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.
# Deprecated Functionality
- Experimental Graph Compiler is deprecated and will be removed in future releases.
# Breaking Changes
- Updated minimal supported CMake version to 3.13 (was 2.8.12).
- Updated minimal supported GCC version to 8.0 (was 4.8).
- Updated minimal supported Clang version to 11.0 (was 3.0).
- Updated minimal supported ACL version to 24.11.1 (was 24.09).
- Removed support for SYCL standards preceding SYCL 2020.
- Enforced `fp32` accumulation mode in `fp16` matmul and inner product primitives on Intel Graphics products without Intel XMX cores. Previous behavior can be enabled with relaxed accumulation mode (see the sketch below).
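For reference, a sketch of requesting the previous behavior again through the accumulation-mode attribute; the helper and shapes are illustrative, and an engine for an XMX-less Intel GPU is assumed.

```cpp
#include <dnnl.hpp>
using namespace dnnl;

// Hypothetical helper: opts an f16 matmul back into reduced-precision
// accumulation instead of the new default f32 accumulation.
matmul make_f16_matmul_relaxed_acc(const engine &eng) {
    memory::desc a({128, 256}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc b({256, 64}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc c({128, 64}, memory::data_type::f16, memory::format_tag::ab);

    primitive_attr attr;
    attr.set_accumulation_mode(accumulation_mode::relaxed);

    return matmul(matmul::primitive_desc(eng, a, b, c, attr));
}
```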
# Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Karasev @karasjoh000, John Osorio @kala855, Keola Wierschem @kwiersch, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nicolò Scipione @s-Nick, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Tadej Ciglarič @t4c1, Varad Ahirwadkar @varad-ahirwadkar, Viktoriia Gvozdeva @vgvozdeva, @vishwascm, @yair-obodovsky, Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.