Performance Optimizations
Intel Architecture Processors
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
- Improved performance of group normalization primitive.
- Improved `bf16` matmul performance with `int4` compressed weights on processors with Intel AMX instruction set support.
- Improved performance of `fp8` matmul, pooling, and eltwise primitives on processors with Intel AMX instruction set support.
- Improved `fp32` RNN primitive performance on processors with Intel AVX2 instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - `convolution` and `binary` operation fusions with better layout selection in Graph API.
  - `fp8` `convolution` and `unary` or `binary` on processors with Intel AMX instruction set support.
  - Scaled Dot Product Attention (SDPA) without scale, Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns (see the sketch after this list).
  - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.
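For readers new to the Graph API, the sketch below shows how an SDPA-without-scale subgraph (the pattern referenced above) can be expressed so that `get_partitions()` can return it as a fused partition. The tensor shapes, IDs, `f16` data type, and CPU engine kind are illustrative assumptions, not requirements of the feature.

```cpp
// Minimal sketch: build Q x K^T -> SoftMax -> x V (SDPA without scale) with the
// oneDNN Graph API and ask the library for fused partitions. Shapes, tensor IDs,
// and the CPU engine kind are illustrative assumptions.
#include <vector>
#include "oneapi/dnnl/dnnl_graph.hpp"

int main() {
    using namespace dnnl::graph;
    using dt = logical_tensor::data_type;
    using lt = logical_tensor::layout_type;

    const logical_tensor::dims qkv_shape = {1, 16, 128, 64};    // [batch, heads, seq, head_size]
    const logical_tensor::dims score_shape = {1, 16, 128, 128}; // [batch, heads, seq, seq]

    logical_tensor q {0, dt::f16, qkv_shape, lt::strided};
    logical_tensor k {1, dt::f16, qkv_shape, lt::strided};
    logical_tensor v {2, dt::f16, qkv_shape, lt::strided};
    logical_tensor score {3, dt::f16, score_shape, lt::strided};
    logical_tensor probs {4, dt::f16, score_shape, lt::strided};
    logical_tensor out {5, dt::f16, qkv_shape, lt::strided};

    op qk {0, op::kind::MatMul, "qk_matmul"}; // Q x K^T
    qk.set_attr<bool>(op::attr::transpose_b, true);
    qk.add_inputs({q, k});
    qk.add_outputs({score});

    op softmax {1, op::kind::SoftMax, "softmax"}; // softmax over the last axis
    softmax.set_attr<int64_t>(op::attr::axis, -1);
    softmax.add_inputs({score});
    softmax.add_outputs({probs});

    op pv {2, op::kind::MatMul, "pv_matmul"}; // probs x V
    pv.add_inputs({probs, v});
    pv.add_outputs({out});

    graph g {dnnl::engine::kind::cpu};
    g.add_op(qk);
    g.add_op(softmax);
    g.add_op(pv);
    g.finalize();

    // When the pattern is supported, the three ops come back as one fused partition.
    std::vector<partition> partitions = g.get_partitions();
    return partitions.empty() ? 1 : 0;
}
```

Each returned partition is then compiled and executed as usual; the same graph-building flow applies to the MQA and GQA variants.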
Intel Graphics Products
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Introduced broad production quality optimizations for Intel Arc Graphics for Intel Core Ultra Processors (Series 2) (formerly Lunar Lake).
- Introduced broad production quality optimizations for a future discrete GPU based on Xe2 architecture (code name Battlemage).
- Introduced support for Intel Arc Graphics for a future Intel Core Ultra Processor (code name Arrow Lake-H).
- Improved performance of `fp8_e5m2` primitives on the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance of matmul and inner product primitives for shapes relevant to large language models (LLMs) on GPUs with Intel XMX support.
- Improved `int8` convolution performance with weight zero-points (see the sketch after this list).
- Reduced primitive creation time for softmax, layer normalization, and concat primitives via kernel reuse.
- Improved performance of the following subgraphs with Graph API:
  - SDPA without scale, MQA, and GQA patterns. `f16` variants of these patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
  - `fp8` `convolution` and `unary` or `binary` on the Intel Data Center GPU Max Series.
  - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.
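As a usage note for the weight zero-point improvement above, the sketch below shows how a weight zero-point is attached to an `int8` convolution through primitive attributes and supplied at execution time. The GPU engine kind, shapes, and memory formats are illustrative assumptions; supported configurations depend on the device and build.

```cpp
// Minimal sketch: int8 convolution whose weights carry a zero-point, configured
// through primitive attributes. Engine kind, shapes, and formats are illustrative
// assumptions; availability depends on the device and oneDNN build.
#include <unordered_map>
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::gpu, 0);
    stream strm(eng);

    memory::desc src_md({1, 64, 56, 56}, memory::data_type::s8, memory::format_tag::nhwc);
    memory::desc wei_md({64, 64, 3, 3}, memory::data_type::s8, memory::format_tag::any);
    memory::desc dst_md({1, 64, 56, 56}, memory::data_type::s8, memory::format_tag::nhwc);

    // Common (per-tensor) zero-point on the weights; its value is supplied at
    // execution time via DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_WEIGHTS.
    primitive_attr attr;
    attr.set_zero_points_mask(DNNL_ARG_WEIGHTS, /* mask = */ 0);

    auto pd = convolution_forward::primitive_desc(eng, prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_md, dst_md,
            /* strides  */ {1, 1}, /* padding_l */ {1, 1}, /* padding_r */ {1, 1}, attr);
    auto conv = convolution_forward(pd);

    memory src(pd.src_desc(), eng), wei(pd.weights_desc(), eng), dst(pd.dst_desc(), eng);
    memory wei_zp({{1}, memory::data_type::s32, memory::format_tag::x}, eng);

    std::unordered_map<int, memory> args {
            {DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei}, {DNNL_ARG_DST, dst},
            {DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_WEIGHTS, wei_zp}};
    conv.execute(strm, args);
    strm.wait();
    return 0;
}
```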
AArch64-based Processors
- Improved `fp32` convolution backpropagation performance on processors with SVE support.
- Improved reorder performance for blocked format on processors with SVE support.
- Improved `bf16` softmax performance on processors with SVE support.
- Improved batch normalization performance on processors with SVE support.
- Improved matmul performance on processors with SVE support.
- Improved `fp16` convolution performance with Arm Compute Library (ACL).
- Improved matmul performance with ACL.
- Switched matmul and convolution implementations with ACL to the stateless API, significantly improving primitive creation time and increasing caching efficiency and performance for these operators.
Functionality
- Introduced generic GPU support. This implementation relies on portable SYCL kernels and can be used as a starting point to enable new devices in oneDNN.
- Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL-based implementations.
- Enabled support for `int8` activations with grouped scales and `int8` or `int4` compressed weights in the matmul primitive. This functionality is implemented on Intel GPUs (see the sketch after this list).
- Introduced support for stochastic rounding for `fp8` data type functionality.
- [experimental] Extended microkernel API:
  - Introduced `int8` quantization support.
  - Extended transform microkernel with transposition support and support for arbitrary strides.
  - Introduced verbose diagnostics support.
- [experimental] Extended sparse API:
  - Introduced support for sparse memory with coordinate (COO) storage format.
  - Extended matmul primitive to work with sparse memory in COO format. This functionality is implemented on CPUs and Intel GPUs.
- Introduced `int8` support in the eltwise primitive with the 'clip' algorithm. This functionality is implemented on CPUs (see the sketch after this list).
- Graph API:
  - Introduced `GroupNorm` operation and fusions in Graph API.
  - Introduced support for standalone `StaticReshape` and `StaticTranspose` operations.
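To illustrate the matmul item above (`int8` activations with grouped scales and compressed weights), the sketch below configures primitive attributes for a matmul with `int4` weights and per-group scales. The shapes, the group size of 128, the scale/zero-point data types, the mask/group convention for the activation scales, and the GPU engine are assumptions patterned on the existing weights-decompression flow, not a definitive recipe.

```cpp
// Rough sketch: matmul with int8 activations and int4 compressed weights, where
// scales (and weight zero-points) are applied per group of K elements through
// primitive attributes. Shapes, group size, scale/zero-point data types, and the
// activation-side mask/group layout are assumptions for illustration.
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::gpu, 0);

    const memory::dim M = 1, K = 4096, N = 4096, group = 128;

    memory::desc src_md({M, K}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s4, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::f16, memory::format_tag::ab);

    primitive_attr attr;
    // Grouped scales along K for the int8 activations (assumed convention).
    attr.set_scales(DNNL_ARG_SRC, /* mask = */ 1 << 1, {1, group}, memory::data_type::f32);
    // Grouped scales and zero-points along K for the int4 compressed weights.
    attr.set_scales(DNNL_ARG_WEIGHTS, /* mask = */ (1 << 0) + (1 << 1), {group, 1},
            memory::data_type::f16);
    attr.set_zero_points(DNNL_ARG_WEIGHTS, /* mask = */ (1 << 0) + (1 << 1), {group, 1},
            memory::data_type::s8);

    // Primitive descriptor creation fails if the configuration is not supported
    // on the selected device.
    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    auto prim = matmul(pd);
    (void)prim;
    return 0;
}
```

At execution time the scale and zero-point tensors are passed alongside the regular arguments via `DNNL_ARG_ATTR_SCALES | DNNL_ARG_SRC`, `DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS`, and so on.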
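For the eltwise item above, a minimal sketch of an `int8` 'clip' eltwise primitive on CPU; the tensor shape and clip bounds are arbitrary illustrative choices.

```cpp
// Minimal sketch: int8 eltwise with the 'clip' algorithm on CPU. The tensor
// shape and the clip bounds (alpha/beta) are arbitrary illustrative choices.
#include <unordered_map>
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    memory::desc md({1, 64, 56, 56}, memory::data_type::s8, memory::format_tag::nhwc);

    // alpha/beta are the lower/upper clip bounds.
    auto pd = eltwise_forward::primitive_desc(eng, prop_kind::forward_inference,
            algorithm::eltwise_clip, md, md, /* alpha = */ -8.f, /* beta = */ 8.f);
    auto clip = eltwise_forward(pd);

    memory src(md, eng), dst(md, eng);
    clip.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    strm.wait();
    return 0;
}
```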
Usability
- Added examples for SDPA, MQA, and GQA pattern implementation with Graph API.
- Added an example for the deconvolution primitive.
- Added examples for Vanilla RNN and LBR GRU RNN cells.
- Introduced support for Intel DPC++/C++ Compiler 2025.0.
- Introduced interoperability with SYCL Graph record/replay mode.
- Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
- [experimental] Introduced a logging mechanism based on the spdlog library.
- Introduced support for the `ONEDNN_ENABLE_WORKLOAD` build knob for Graph API.
- Improved performance of the `get_partitions()` function in Graph API.
Validation
- Introduced protection from out-of-memory scenarios in benchdnn Graph API
driver.
Deprecated Functionality
- Experimental Graph Compiler is deprecated and will be removed in future releases.
Breaking Changes
- Experimental microkernel API in this release is not compatible with the version available in oneDNN v3.5.
- Updated minimal supported ACL version to 24.08.1 (was 24.04).
Thanks to these Contributors
This release contains contributions from the project core team as well as
Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron,
Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts
@apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph,
Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha,
Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm,
@matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich,
Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu,
Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros
Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick,
Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen,
Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov
@vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone
who asked questions and reported issues.