Performance Optimizations
Intel Architecture Processors
- Improved matmul and inner product primitives performance on processors with Intel AMX instruction set support.
- Improved performance of convolution and inner product primitives on processors with Intel AVX2 instruction set support.
- Improved `int8` convolution performance with zero points.
- Improved `fp32` convolution performance with `fp16` and `bf16` compressed weights on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved `fp16`/`bf16` depthwise convolution performance with `fp32` bias, `sum` post-ops, or dilation.
- Improved `bf16` pooling backpropagation performance.
- Improved binary post-ops performance with `per_w` broadcast.
Intel Graphics Products
- Improved performance on Intel GPUs based on Xe3 architecture.
- Improved convolution performance on:
- Intel Arc Graphics for Intel Core Ultra (Series 2, formerly Lunar Lake).
- Intel Arc B-series discrete graphics (formerly Battlemage).
- Improved `int8` matmul performance with zero-points support for source and weight tensors.
- Improved `f4_e2m1` and `f4_e3m0` matmul and reorder performance.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot Product Attention (SDPA) with `int4` and `int8` compressed key and value.
  - `fp16`/`bf16` SDPA with `fp32` intermediate data types. Using `fp32` intermediate data types is recommended.
  - SDPA with head size 512 and 576.
  - Grouped Query Attention (GQA) with 5D input tensors.
AArch64-based Processors
- Improved `fp16` reorder performance.
- Improved `int8` matmul performance.
- Improved `bf16` inner product forward propagation performance with Arm Compute Library (ACL).
- Improved convolution performance with ACL on processors with SVE support.
Functionality
Common
- Extended the Graph API `Softmax` operation to support `inf_as_zero` mode. This enables SDPA subgraphs compliant with PyTorch Safe Softmax semantics.
Intel Architecture Processors
- Introduced support for `f32` convolution with `fp16` compressed weights.
- Enabled `int8`/`int4` compressed weights support in the matmul primitive.
Intel Graphics Products
- Introduced select algorithm support in binary primitive.
- Introduced support for `f4_e2m1` and `f4_e3m0` data types in convolution.
- Introduced support for the GenIndex operation in Graph API.
Generic GPU Vendor
- Introduced support for:
- Vanilla RNN forward propagation
- Inner product backpropagation
- Group normalization
- Improved accuracy of inner product primitive with sum post-ops for large shapes.
NVIDIA GPUs
- Introduced Graph API support.
Usability
- Added support for the Group Normalization primitive with the `ONEDNN_ENABLE_PRIMITIVE` build option.
- Enabled support for ROCm 6 on AMD GPUs.
- Improved CMake integration for oneDNN installation with Nvidia backend enabled.
- Reduced memory footprint for matmul primitive when using ACL.
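As an illustration of the `ONEDNN_ENABLE_PRIMITIVE` change above, a trimmed build that keeps Group Normalization alongside a few other primitives might be configured as follows. The primitive token names used here (e.g. `GROUP_NORMALIZATION`) are assumptions based on the option's naming scheme, so check the build documentation for your oneDNN version:

```shell
# Sketch: configure a oneDNN build that compiles only selected primitives.
# ONEDNN_ENABLE_PRIMITIVE takes a semicolon-separated list of primitive
# names; the exact tokens below are assumed, not quoted from this release.
cmake .. -DONEDNN_ENABLE_PRIMITIVE="CONVOLUTION;MATMUL;GROUP_NORMALIZATION"
```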
Validation
- Added benchdnn option `--execution-mode` to test oneDNN functionality with SYCL Graph record/execute mode.
- Extended benchdnn option `--cold-cache` with support for cold TLB mode.
- Added benchdnn option `--bia-dt` to control the bias data type for matmul, inner product, convolution, and deconvolution.
- Extended the syntax of the benchdnn `--dt` option in the Graph API driver to manage data types of individual tensors in a pattern.
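As a hedged sketch of how the benchdnn options above might be combined on the command line (the problem descriptors and option values here are illustrative assumptions, not taken from this release, and require a built benchdnn binary):

```shell
# Sketch: example benchdnn invocations exercising the new/extended options.
# Problem descriptors and option values are illustrative assumptions.

# Force an f32 bias data type on a matmul problem.
./benchdnn --matmul --bia-dt=f32 64x128:128x256

# Performance run of a convolution with cold caches enabled.
./benchdnn --conv --mode=P --cold-cache=all mb1ic16ih7oc16oh7kh3ph1
```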
Breaking Changes
- Removed the experimental Graph Compiler backend for Graph API.
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, Denis @redradist, Dmitriy Ovchinnikov @inteldimitrius, Eliezer Weissmann @eliezerweissmann, Hubert Maciak @hmaciak, Ilya Lavrenov @ilya-lavrenov, James McGregor @Jmc18134, @jstachowintel, Marek Michalowski @michalowski-arm, Maria Zhukova @mzhukova, Orel Yehuda @yehudaorel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, @Shreyas-fuj, Shu Chen @shu1chen, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, and @zhangfeiv0.