Performance Optimizations
Intel Architecture Processors
- Improved performance on future Intel Xeon processors with Intel AVX10.2 and Intel AMX instruction sets support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2` (a programmatic alternative is sketched after this list).
- Improved performance on future Intel Core processors with Intel AVX10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
- Improved performance of matmul primitive on processors with Intel AMX support.
- Improved performance of `f32` matmul primitive for GEMV cases on processors with Intel AVX2 instruction set support.
- Improved matmul performance with `int4` and `int8` compressed weights and per-channel zero-points.
- Improved `f32` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX-512 and Intel AVX2 instruction set support.
- Improved `bf16` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX-512, Intel DL Boost, and bfloat16 instruction set support.
- Improved performance of `int8` convolution primitive when using zero points.
- Improved performance of `int8` matmul and inner product primitives with `fp16` destination.
- Improved performance of `f32` and `bf16` convolution primitive with `int8` destination.
- Improved performance of RNN primitive on processors with Intel AVX2 instruction set support when using OpenMP runtime.
- Improved performance of subgraphs containing a sequence of multiple binary ops with Graph API.
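For the AVX10.2 opt-ins above, the environment variable can also be mirrored at runtime. Below is a minimal sketch using `dnnl::set_max_cpu_isa`; the `cpu_isa::avx10_2_512` enumerator name is an assumption mirroring the `ONEDNN_MAX_CPU_ISA` value, so verify it against `dnnl.hpp` in your release:

```cpp
#include "dnnl.hpp"

int main() {
    // Equivalent to running with ONEDNN_MAX_CPU_ISA=AVX10_2_512.
    // Must be called before the first primitive is created.
    // NOTE: cpu_isa::avx10_2_512 is an assumed enumerator name derived
    // from the environment variable value; check your headers.
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx10_2_512);

    // Primitives created from here on may dispatch AVX10.2 kernels.
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    return 0;
}
```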
Intel Graphics Products
- Improved GEMM performance for small batch size on Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved matmul performance for Qwen2-7B shapes on Intel Arc graphics (formerly DG2) and Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
- Improved `int8` matmul performance with `int4` weights and per-tensor zero-points.
- Improved `bf16` matmul performance with `fp8` weights.
- Graph API optimizations:
  - Improved Scaled Dot Product Attention (SDPA) subgraph performance for inference when relaxed accumulation mode is enabled on Intel Core Ultra processors (formerly Meteor Lake).
  - Improved SDPA and Grouped Query Attention (GQA) subgraphs performance when using host-side scalars.
  - Improved performance of GQA subgraph for 2nd token scenarios.
  - Improved performance of subgraphs containing a sequence of multiple binary ops.
  - Improved performance of GQA subgraphs for training forward and backward propagation.
AArch64-based Processors
- Improved performance of reorder primitive.
- Improved performance of `bf16` convolutions.
- Improved performance of convolutions on 128-bit SVE platforms.
- Improved performance of eltwise primitive on Arm® Neoverse™ N1.
Functionality
Functional API
- Introduced host-side scalar memory objects. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported in matmul and convolution primitives on Intel GPUs.
- Introduced support for pre-computed reductions in matmul primitive. This functionality is intended to improve performance in the case of `int8` activations and `int8` weights with zero-points.
Graph API
- Introduced `host_scalar` property for logical tensors. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported to define the attention scale, sequence length, and negative infinity value in SDPA/GQA subgraphs (see the sketch after this list).
- Introduced accumulation mode attribute support in `MatMul` op. This attribute allows relaxing `fp32` accumulation requirements to achieve performance benefits on some platforms.
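As an illustration of the `host_scalar` property, here is a minimal sketch of a logical tensor describing a host-side attention scale for an SDPA subgraph. The 0-dimensional shape and the `property_type::host_scalar` enumerator spelling are assumptions based on the note above; verify them against `dnnl_graph.hpp` in your release:

```cpp
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

int main() {
    // Logical tensor with id 0 describing an f32 host-side scalar
    // (empty dims), e.g. the attention scale input of an SDPA subgraph.
    // NOTE: property_type::host_scalar spelling is assumed from the
    // release note above; check your headers.
    logical_tensor scale_lt(0, logical_tensor::data_type::f32,
                            logical_tensor::dims{},
                            logical_tensor::layout_type::strided,
                            logical_tensor::property_type::host_scalar);
    (void)scale_lt;
    return 0;
}
```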
Intel Graphics Products
- Introduced support for `fp4` weights in matmul primitive.
- Introduced support for grouped quantization with group size 16 in matmul with `int8` compressed weights.
- Introduced group size 16 support for `int8` decompressed weights with regular weights decompression.
Intel Architecture Processors
- Introduced `fp4` weights support for `fp32` matmul and convolution on future Intel Xeon processors with Intel AVX10.2 instruction set support.
Usability
- Extended diagnostics available in verbose mode for primitive descriptor creation issues.
- Extended dispatch diagnostics in verbose mode output for primitive implementations on Intel GPUs.
AArch64-based Processors
- Fixed crashes in backward-pass convolutions.
- Fixed numerical errors in 4D matmul primitives.
- Fixed numerical errors in low-precision convolutions.
- Fixed numerical errors in reorders with compensation.
- Fixed illegal-instruction crashes on Arm® Neoverse™ N1.
- Fixed crashes in binary primitive in Debug builds.
- Fixed segmentation fault in `eltwise_log` post-ops for large kernels.
Deprecated Functionality
- The BLAS-like API, including the `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions, is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive; a migration sketch follows below.
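As a migration illustration, here is a minimal sketch of a row-major `f32` product C = A × B (what `dnnl::sgemm` computes with `alpha = 1` and `beta = 0`) expressed through the matmul primitive. Shapes are illustrative; sgemm's `alpha`/`beta` would map to scale attributes and a sum post-op, omitted here:

```cpp
#include <vector>
#include "dnnl.hpp"

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    const memory::dim M = 128, N = 256, K = 64;

    // Row-major ("ab") f32 descriptors: C[MxN] = A[MxK] * B[KxN].
    memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    std::vector<float> a(M * K, 1.f), b(K * N, 1.f), c(M * N, 0.f);
    memory a_m(a_md, eng, a.data());
    memory b_m(b_md, eng, b.data());
    memory c_m(c_md, eng, c.data());

    // Create and execute the matmul primitive in place of dnnl::sgemm.
    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(s, {{DNNL_ARG_SRC, a_m},
                           {DNNL_ARG_WEIGHTS, b_m},
                           {DNNL_ARG_DST, c_m}});
    s.wait();
    return 0;
}
```

Unlike the fixed GEMM entry points, the matmul primitive also covers the quantized and compressed-weight configurations listed earlier in these notes.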
Breaking Changes
AArch64-based Processors
- Bumped the minimum required Arm® Compute Library version to 52.4.0.
Thanks to our Contributors
This release contains contributions from the project core team as well as Andrei Hutu @Anndrey24,
Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, Daniel Kuts @apach301, Daniel Whittaker @danwhittaker-arm, Deeksha Kasture @kasturedeeksha, George Nash @georgen117, Henry Gardiner @henry-gar, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw, Marek Michalowski @michalowski-arm, Sheldon Robinson @sheldonrobinson, @Shreyas-fuj, Viktoriia Gvozdeva @vgvozdeva, Xiang1 Guo, Yejing Lai @Yejing-Lai, Yonghao Gu, Yusuf Butt @UseTheForce007, Zhibo Li @zhili03, @almayne, @co63oc, @focusunsink, @gassan-arm, @jstachowintel, @pmanczak, @puneetmatharu, @raistefintel, @vishwascm, @vyevtyus, @zhangfeiv0, @zhangjian29, and @xiazhuozhao.