Performance Optimizations
Intel Architecture Processors
- Improved performance on future Intel Xeon processors with Intel AVX10.2 and Intel AMX instruction sets support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2` (see the sketch after this list).
- Improved performance on future Intel Core processors with Intel AVX10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
- Improved performance of matmul primitive on processors with Intel AMX support.
- Improved performance of `f32` matmul primitive for GEMV cases on processors with Intel AVX2 instruction set support.
- Improved matmul performance with `int4` and `int8` compressed weights and per-channel zero-points.
- Improved `f32` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX2 and Intel AVX-512 instruction set support.
- Improved `bf16` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX-512, Intel DL Boost, and bfloat16 instruction set support.
- Improved performance of `int8` convolution primitive when using zero points.
- Improved performance of `int8` matmul and inner product primitives with `fp16` destination.
- Improved performance of `f32` and `bf16` convolution primitives with `int8` destination.
- Improved performance of RNN primitive on processors with Intel AVX2 instruction set support when using OpenMP runtime.
- Improved performance of subgraphs containing sequence of multiple binary ops with Graph API.
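For the AVX10.2 opt-in above, the environment variable is the documented mechanism (for example, launching an application as `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2 ./app`). The sketch below shows an equivalent in-process route via the CPU dispatch control API; the `avx10_2_512_amx_2` enumerator spelling is an assumption mirroring the environment variable value.

```cpp
// A minimal opt-in sketch using the CPU dispatch control API. The
// cpu_isa::avx10_2_512_amx_2 enumerator name is an assumption mirroring
// the ONEDNN_MAX_CPU_ISA environment variable value.
#include <iostream>
#include "dnnl.hpp"

int main() {
    // The ISA cap must be set before the first primitive is created;
    // dispatching is fixed after that point.
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx10_2_512_amx_2); // assumed enumerator

    // Query what oneDNN will actually dispatch on this machine.
    std::cout << "effective ISA id: "
              << static_cast<int>(dnnl::get_effective_cpu_isa()) << "\n";
    return 0;
}
```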
Intel Graphics Products
- Improved GEMM performance for small batch size on Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved matmul performance for Qwen2-7B shapes on Intel Arc graphics (formerly Alchemist) and Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
- Improved `int8` matmul performance with `int4` weights and per-tensor zero-points.
- Improved `bf16` matmul performance with `fp8` weights.
- Graph API optimizations:
  - Improved Scaled Dot Product Attention (SDPA) subgraph performance for inference when relaxed accumulation mode is enabled on Intel Core Ultra processors (formerly Meteor Lake).
  - Improved SDPA and Grouped Query Attention (GQA) subgraph performance when using host-side scalars.
  - Improved performance of GQA subgraph for 2nd token scenarios.
  - Improved performance of subgraphs containing a sequence of multiple binary ops.
  - Improved performance of GQA subgraphs for training forward and backward propagation.
AArch64-based Processors
- Improved reorder primitive performance.
- Improved `bf16` convolution performance.
- Improved convolution performance on CPUs with 128-bit SVE support.
- Improved eltwise primitive performance on Arm(R) Neoverse(TM) N1 processor.
Functionality
Functional API
- Introduced host-side scalar memory objects. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported in matmul and convolution primitives on Intel GPUs.
- Introduced support for pre-computed reductions in matmul primitive. This functionality is intended to improve performance in case of `int8` activations and `int8` weights with zero-point.
Graph API
- Introduced `host_scalar` property for logical tensors. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported to define attention scale, sequence length, and the negative infinity value in SDPA/GQA subgraphs (see the sketch after this list).
- Introduced accumulation mode attribute support in `MatMul` op. This attribute allows relaxing `fp32` accumulation requirements to achieve performance benefits on some platforms.
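As a sketch of the `host_scalar` property, the snippet below declares an SDPA attention scale as a host-side scalar logical tensor. The constructor and enumerations follow the existing Graph API; the 0-dimensional shape for the scalar is an assumption, so consult the Graph API documentation for the exact contract.

```cpp
// A hedged sketch: an SDPA attention-scale input declared with the new
// host_scalar property, so its value can be passed from the host at
// execution time instead of through a device memory object.
#include "oneapi/dnnl/dnnl_graph.hpp"
using namespace dnnl::graph;

int main() {
    logical_tensor scale {
            /*tid=*/100, logical_tensor::data_type::f32,
            logical_tensor::dims {}, // scalar (0-dim) shape is an assumption
            logical_tensor::layout_type::strided,
            logical_tensor::property_type::host_scalar};
    // ... use `scale` as the scale input when building the SDPA subgraph.
    (void)scale;
    return 0;
}
```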
Intel Graphics Products
- Introduced support for `fp4` weights in matmul primitive.
- Introduced support for weight scales and zero-points with group size 16 in matmul with compressed weights.
Intel Architecture Processors
- Introduced `fp4` weights support for `fp32` matmul and convolution on future Intel Xeon processors with Intel AVX10.2 instruction set support.
Usability
- Extended diagnostics available in verbose mode for primitive descriptor creation issues.
- Extended dispatch diagnostics in verbose mode output for primitive implementations on Intel GPUs (see the example below).
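For example, the dispatch diagnostics are enabled with `ONEDNN_VERBOSE=dispatch`, typically set in the shell (`ONEDNN_VERBOSE=dispatch ./app`). The sketch below sets the variable in-process on POSIX systems, which works as long as it happens before the library starts dispatching.

```cpp
// A minimal sketch of enabling dispatch diagnostics in-process; the usual
// route is to set the variable in the shell (ONEDNN_VERBOSE=dispatch ./app).
#include <cstdlib>
#include "dnnl.hpp"

int main() {
    // Must happen before any oneDNN primitive is created, since the
    // verbose setting is read once the library starts dispatching.
    setenv("ONEDNN_VERBOSE", "dispatch", /*overwrite=*/1); // POSIX-only call

    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    // ... creating primitives now prints why candidate implementations
    // were skipped, which helps diagnose unexpected dispatching.
    return 0;
}
```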
Known Limitations
- Convolution primitive may require an excessive amount of scratchpad memory for shapes with a large input width value on Intel CPUs.
- `bf16` convolution primitive has a performance regression on Intel Arc B-series graphics.
- Reduction primitive may produce incorrect results for tensors exceeding 4 GB on Intel Arc graphics (formerly DG2) and Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
- Concat primitive may produce incorrect results for certain shapes on Intel Arc A-series GPUs.
- `fp16` matmul primitive has a performance regression on Intel GPUs based on the Xe2 architecture.
- `f32` matmul primitive may sporadically produce incorrect results on Intel Arc B-series graphics.
- `int8` inner product primitive with tensors exceeding 4 GB in size may produce incorrect results on Intel Datacenter GPU Max Series.
- `bf16` layer normalization backpropagation may produce incorrect results on Intel Datacenter GPU Max Series.
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive (a minimal migration sketch follows).
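As a sketch of the migration path, the snippet below replaces a row-major `dnnl::sgemm` call computing C = A × B with an equivalent `f32` matmul primitive; error handling and primitive caching are omitted for brevity.

```cpp
// A minimal migration sketch: C = A * B in f32, row-major, replacing
// dnnl::sgemm with the matmul primitive. Error handling is omitted.
#include "dnnl.hpp"
using namespace dnnl;

void gemm_via_matmul(const float *A, const float *B, float *C,
        memory::dim M, memory::dim N, memory::dim K) {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // Plain row-major 2D layouts ("ab").
    memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // Memory objects wrap the user pointers without copying on CPU.
    memory a_m(a_md, eng, const_cast<float *>(A));
    memory b_m(b_md, eng, const_cast<float *>(B));
    memory c_m(c_md, eng, C);

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(s,
            {{DNNL_ARG_SRC, a_m}, {DNNL_ARG_WEIGHTS, b_m},
                    {DNNL_ARG_DST, c_m}});
    s.wait();
}
```

The `alpha` and `beta` scaling parameters of the GEMM signature can be expressed with matmul primitive attributes (scales and a `sum` post-op, respectively).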
Breaking Changes
AArch64-based Processors
- Bumped the minimum required Arm(R) Compute Library version to 52.4.0.
Thanks to our Contributors
This release contains contributions from the project core team as well as Andrei Hutu @Anndrey24,
Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, Daniel Kuts @apach301,
Daniel Whittaker @danwhittaker-arm, Deeksha Kasture @kasturedeeksha, George Nash @georgen117,
Henry Gardiner @henry-gar, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw,
Marek Michalowski @michalowski-arm, Sheldon Robinson @sheldonrobinson, @Shreyas-fuj, Viktoriia Gvozdeva @vgvozdeva,
Xiang1 Guo, Yejing Lai @Yejing-Lai, Yonghao Gu, Yusuf Butt @UseTheForce007, Zhibo Li @zhili03, @almayne, @co63oc,
@focusunsink, @gassan-arm, @jstachowintel, @pmanczak, @puneetmatharu, @raistefintel, @vishwascm, @vyevtyus, @zhangfeiv0,
@zhangjian29, and @xiazhuozhao.