Performance Optimizations
Intel Architecture Processors
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
- Improved performance of group normalization primitive.
- Improved bf16 matmul performance with int4 compressed weights on processors with Intel AMX instruction set support.
- Improved performance of fp8 matmul, pooling, and eltwise primitives on processors with Intel AMX instruction set support.
- Improved fp32 RNN primitive performance on processors with Intel AVX2 instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - convolution and binary operation fusions with better layout selection in Graph API.
  - fp8 convolution and unary or binary on processors with Intel AMX instruction set.
  - Scaled Dot Product Attention (SDPA) without scale, Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
  - LayerNorm, GroupNorm, and SoftMax with int8 quantized output and zero-points.
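For readers unfamiliar with the attention patterns named above, the following is a minimal reference sketch in plain Python, not oneDNN code. It assumes the usual definitions: SDPA computes softmax(Q K^T / sqrt(d)) V, "without scale" drops the 1/sqrt(d) factor, MQA shares a single key/value head across all query heads, and GQA shares one key/value head per group of query heads. All function and parameter names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, use_scale=True):
    """Single-head attention: softmax(Q K^T [* 1/sqrt(d)]) V.

    q is [seq_q][d], k is [seq_kv][d], v is [seq_kv][d_v] (lists of lists).
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d) if use_scale else 1.0
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        probs = softmax(scores)
        out.append([sum(p * v_row[j] for p, v_row in zip(probs, v))
                    for j in range(len(v[0]))])
    return out


def grouped_query_attention(q_heads, k_heads, v_heads, use_scale=True):
    """Each group of query heads shares one key/value head.

    One K/V head gives MQA; as many K/V heads as query heads gives
    ordinary multi-head attention.
    """
    n_q, n_kv = len(q_heads), len(k_heads)
    if n_kv == 0 or n_q % n_kv != 0:
        raise ValueError("query heads must divide evenly across K/V heads")
    group = n_q // n_kv
    return [attention(q, k_heads[h // group], v_heads[h // group], use_scale)
            for h, q in enumerate(q_heads)]
```

Passing `use_scale=False` corresponds to the "SDPA without scale" variant; a fused implementation would compute the same result without materializing the intermediate score matrix.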
Intel Graphics Products
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Introduced broad production quality optimizations for Intel Arc Graphics for Intel Core Ultra Processors (Series 2) (formerly Lunar Lake).
- Introduced broad production quality optimizations for future discrete GPU based on Xe2 architecture (code name Battlemage).
- Introduced support for Intel Arc Graphics for future Intel Core Ultra Processor (code name Arrow Lake-H).
- Improved performance of fp8_e5m2 primitives on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved matmul and inner product primitives performance for shapes relevant to large language models (LLMs) on GPUs with Intel XMX support.
- Improved int8 convolution performance with weight zero points.
- Reduced primitive creation time for softmax, layer normalization, and concat primitives via kernel reuse.
- Improved performance of the following subgraphs with Graph API:
  - SDPA without scale, MQA, and GQA patterns. f16 variants of these patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
  - fp8 convolution and unary or binary on Intel Data Center GPU Max Series.
  - LayerNorm, GroupNorm, and SoftMax with int8 quantized output and zero-points.
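To illustrate what "int8 quantized output and zero-points" means for a pattern such as SoftMax, here is a small sketch in plain Python. It assumes the common affine quantization convention q = clamp(round(x / scale) + zero_point, -128, 127); the exact rounding and ordering used inside oneDNN may differ, and the function names, default scale, and default zero point are illustrative assumptions only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def quantize_int8(values, scale, zero_point):
    """Affine quantization to signed 8-bit integers (assumed convention)."""
    quantized = []
    for x in values:
        q = round(x / scale) + zero_point  # Python rounds halves to even
        quantized.append(max(-128, min(127, q)))  # saturate to the int8 range
    return quantized


def softmax_int8(row, scale=1.0 / 256, zero_point=-128):
    """Softmax followed by quantization of the output.

    With scale 1/256 and zero point -128 (an illustrative choice), the
    softmax range [0, 1] maps onto the int8 range [-128, 127].
    """
    return quantize_int8(softmax(row), scale, zero_point)
```

A fused subgraph produces the int8 result directly, so the floating-point softmax output never has to be written to memory as a separate tensor.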
AArch64-based Processors
- Improved fp32 convolution backpropagation performance on processors with SVE support.
- Improved reorder performance for blocked format on processors with SVE support.
- Improved bf16 softmax performance on processors with SVE support.
- Improved batch normalization performance on processors with SVE support.
- Improved matmul performance on processors with SVE support.
- Improved fp16 convolution with Arm Compute Library (ACL).
- Improved matmul performance with ACL.
- Switched matmul and convolution implementations with ACL to the stateless API, significantly improving primitive creation time and increasing caching efficiency and performance for these operators.
Functionality
- Introduced generic GPU support. This implementation relies on portable SYCL kernels and can be used as a starting point to enable new devices in oneDNN.
- Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL based implementations.
- Enabled support for int8 activations with grouped scales and int8 or int4 compressed weights in matmul primitive. This functionality is implemented on Intel GPUs.
- Introduced support for stochastic rounding for fp8 data type functionality.
- [experimental] Extended microkernel API:
  - Introduced int8 quantization support.
  - Extended transform microkernel with transposition support and support for arbitrary strides.
  - Introduced verbose diagnostics support.
- [experimental] Extended sparse API:
  - Introduced support for sparse memory with coordinate (COO) storage format.
  - Extended matmul primitive to work with sparse memory in COO format. This functionality is implemented on CPUs and Intel GPUs.
- Introduced int8 support in eltwise primitive with 'clip' algorithm. This functionality is implemented on CPUs.
- Graph API:
  - Introduced GroupNorm operation and fusions in Graph API.
  - Introduced support for standalone StaticReshape and StaticTranspose operations.
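As background for the sparse API item above, the coordinate (COO) format stores a sparse matrix as parallel arrays of row indices, column indices, and nonzero values. The sketch below shows, in plain Python, how a COO matrix can be multiplied by a dense matrix. It is a conceptual illustration only and does not use or mirror the oneDNN API; the function name and argument layout are hypothetical.

```python
def coo_matmul(shape, rows, cols, values, dense):
    """Multiply a sparse m x k matrix in COO form by a dense k x n matrix.

    shape is (m, k); rows, cols, and values are equal-length sequences
    holding one entry per stored nonzero; dense is a list of k rows.
    """
    m, k = shape
    if len(dense) != k:
        raise ValueError("inner dimensions do not match")
    if not (len(rows) == len(cols) == len(values)):
        raise ValueError("rows, cols, and values must have equal length")
    n = len(dense[0])
    out = [[0.0] * n for _ in range(m)]
    # Only stored nonzeros contribute, so work scales with their count.
    for r, c, val in zip(rows, cols, values):
        for j in range(n):
            out[r][j] += val * dense[c][j]
    return out


# The 2 x 3 matrix [[1, 0, 2], [0, 0, 3]] stored as three nonzeros.
sparse_shape = (2, 3)
sparse_rows = [0, 0, 1]
sparse_cols = [0, 2, 2]
sparse_values = [1.0, 2.0, 3.0]
dense_matrix = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
product = coo_matmul(sparse_shape, sparse_rows, sparse_cols,
                     sparse_values, dense_matrix)
```

Because duplicates are accumulated with `+=`, repeated coordinates simply sum, which is one common COO convention; whether a given library permits duplicate coordinates is a separate question.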
Usability
- Added examples for SDPA, MQA, and GQA patterns implementation with Graph API.
- Added an example for deconvolution primitive.
- Added examples for Vanilla RNN and LBR GRU RNN cells.
- Introduced support for Intel DPC++/C++ Compiler 2025.0.
- Introduced interoperability with SYCL Graph record/replay mode.
- Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
- [experimental] Introduced logging mechanism based on spdlog library.
- Introduced support for ONEDNN_ENABLE_WORKLOAD build knob for Graph API.
- Improved performance of get_partitions() function in Graph API.
Validation
- Introduced protection from out-of-memory scenarios in benchdnn Graph API driver.
Breaking Changes
- Experimental microkernel API in this release is not compatible with the version available in oneDNN v3.5.
- Updated minimum supported ACL version to 24.08.1 (was 24.04).
Thanks to these Contributors
This release contains contributions from the project core team as well as Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron, Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts @apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph, Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha, Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm, @matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich, Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu, Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick, Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen, Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov @vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone who asked questions and reported issues.