Performance Optimizations
Intel 64/AMD64 Processors
- Improved performance on future Intel Core Ultra processors with Intel AVX10.2 instruction set support (code name Nova Lake). These optimizations are now enabled by default on compatible processors.
- Improved performance on future Intel Xeon processors with Intel AVX10.2 and Intel AMX instruction set support (code name Diamond Rapids). These optimizations are now enabled by default on compatible processors.
- Improved performance of `fp8` and `int8` matmul with transposed source on processors with Intel AMX instruction set support.
- Improved performance of `bf16` and `f16` matmul with transposed source on processors with Intel AVX2 instruction set support.
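"Matmul with transposed source" refers to the source tensor being laid out transposed in memory (strided access) rather than physically reordered first. A minimal NumPy sketch of the computation — purely illustrative, not the oneDNN API:

```python
import numpy as np

# Source activations stored transposed: memory holds src^T with shape (K, M).
M, K, N = 4, 8, 3
src_t = np.arange(K * M, dtype=np.float16).reshape(K, M)  # transposed layout
weights = np.ones((K, N), dtype=np.float16)

# The matmul still computes src @ weights; an optimized kernel reads src
# through transposed strides instead of copying it into a row-major buffer.
dst = src_t.T.astype(np.float32) @ weights.astype(np.float32)
print(dst.shape)  # (M, N)
```

The optimization avoids the extra reorder (copy) pass that a transposed source would otherwise require before the kernel runs.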
Intel Graphics
- Introduced initial performance optimizations for future integrated GPUs based on Xe3p-LPG architecture.
- Introduced initial performance optimizations for future discrete GPUs based on Xe3p-XPC architecture.
- Improved `f16` matmul performance on Intel Arc Graphics for Intel Core Ultra processors Series 3 (formerly Panther Lake).
- Improved performance of matmul with host-side scalar arguments.
- Improved matmul performance for cases with small M/N and large K.
- Improved SDPA forward and backpropagation subgraph performance with Graph API.
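The "small M/N and large K" matmul case above is the tall-and-skinny reduction shape typical of token-by-token LLM inference, where the output tile is tiny but the reduction axis is long. An illustrative NumPy sketch of such a shape (not oneDNN code):

```python
import numpy as np

# "Small M/N, large K": tiny output tile, very long reduction axis.
M, K, N = 2, 65536, 8
a = np.full((M, K), 0.5, dtype=np.float32)
b = np.full((K, N), 2.0, dtype=np.float32)

# Each of the M*N outputs accumulates over K terms, so performance is
# dominated by how well the K-dimension reduction is parallelized.
c = a @ b
print(c.shape)
```

In this regime there is too little output parallelism for M/N alone, so splitting the K dimension across cores and reducing partial sums is what the optimization targets.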
AArch64 Processors
- Improved `f16` and `f32` softmax performance across Arm Neoverse cores.
- Improved eltwise performance on Arm Neoverse N1 cores.
- Improved matmul and convolution performance on Arm Neoverse V2 cores.
RISC-V Processors
- Improved `f32` matmul, inner product, convolution, softmax, and layer normalization primitive performance on processors with `V` extension support.
- Improved `f16` softmax primitive performance on processors with `Zvfh` extension support.
Functionality
Functional API
- [experimental] Introduced grouped memory format and grouped matmul support to improve performance of AI models based on the Mixture-of-Experts (MoE) architecture. This is an experimental feature that requires opt-in with the `ONEDNN_EXPERIMENTAL_GROUPED_MEMORY=ON` build option. An optimized version of this functionality is implemented for Intel GPUs.
- [experimental] Extended grouped matmul with an optional execution-time hint `DNNL_ARG_HINT_MAX_GROUP_SIZE` to communicate the maximum size of a group across the variable dimension for the execution call.
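To clarify what a grouped matmul computes for MoE workloads, here is a conceptual NumPy sketch (not the oneDNN API): tokens are routed into variable-size groups, and each group is multiplied by its own expert weight matrix within one logical operation.

```python
import numpy as np

# Conceptual grouped matmul for MoE: one matmul per token group, where group
# sizes vary per execution call (the "variable dimension" mentioned above).
rng = np.random.default_rng(0)
hidden, out = 16, 8
group_sizes = [3, 0, 5]                       # tokens routed to each expert
experts = [rng.standard_normal((hidden, out)) for _ in group_sizes]
tokens = rng.standard_normal((sum(group_sizes), hidden))

outputs, start = [], 0
for w, n in zip(experts, group_sizes):
    outputs.append(tokens[start:start + n] @ w)   # per-group matmul
    start += n
result = np.concatenate(outputs)              # (num_tokens, out)
print(result.shape)
```

Because group sizes are only known at execution time, a hint like the maximum group size lets an implementation pick kernels without inspecting every group.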
Graph API
- Introduced `Dropout` operation. Extended supported fusion patterns to enable fusion of `Dropout` with `MatMul`, `Softmax`, and elementwise operations.
Usability
Common
- Extended information about primitive execution available in VTune Profiler with the same level of details as reported by oneDNN verbose mode. This feature requires VTune Profiler 2025.7 or later.
Intel Graphics
- [experimental] Introduced support for the Level Zero runtime on Intel GPUs. New functionality includes a Level Zero interoperability API and the `ONEDNN_GPU_RUNTIME=ZE` build knob.
AArch64 Processors
- Introduced support for correctly querying processor cache sizes.
- Reduced memory usage of certain convolutions on Arm Neoverse V1/V2 cores.
- Fixed a bug causing high memory usage and crashes in convolution with certain post-ops.
Validation
- Extended benchdnn with support for integer masks in quantization attributes.
- Improved consistency of benchdnn performance results when data compression is enabled by default on Intel Graphics.
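For context on the quantization masks mentioned above, a NumPy sketch of the usual mask convention — bit *i* set means scales vary along tensor dimension *i* (mask 0 means one common scale). This is an illustrative assumption about the semantics, not benchdnn or oneDNN code:

```python
import numpy as np

# Mask convention (illustrative): bit i set -> scales vary along dim i.
#   mask = 0      -> single common scale for the whole tensor
#   mask = 1 << 1 -> per-channel scales along dimension 1 of a 2D tensor
q = np.array([[10, 20, 30],
              [40, 50, 60]], dtype=np.int8)

mask = 1 << 1
scales = np.array([0.1, 0.01, 0.001], dtype=np.float32)  # one per column

# Broadcast the scales along every dimension whose mask bit is clear.
deq = q.astype(np.float32) * scales[np.newaxis, :]
print(deq.shape)
```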
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
- `f4_e3m0` data type is deprecated and will be removed in future releases.
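For migration context: `dnnl::sgemm` follows the standard BLAS contract `C := alpha * op(A) @ op(B) + beta * C`, and the matmul primitive covers the same math (alpha via scaling attributes, the `beta * C` term via a sum post-op). A NumPy sketch of that contract, not oneDNN code:

```python
import numpy as np

# Standard BLAS sgemm semantics: C := alpha * op(A) @ op(B) + beta * C.
alpha, beta = 2.0, 1.0
a = np.array([[1., 2.], [3., 4.]], dtype=np.float32)
b = np.array([[5., 6.], [7., 8.]], dtype=np.float32)
c = np.ones((2, 2), dtype=np.float32)

# The update that a matmul-primitive replacement must reproduce.
c = alpha * (a @ b) + beta * c
print(c)
```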
Thanks to our Contributors
This release contains contributions from the project core team as well as Alexandre de Limas Santana @alexandrelimassantana, Andrei (Andrey) Khropov @andrey-khropov, Andrei Hutu @Anndrey24, Fadi Arafeh @fadara01, George Nash @georgen117, Kamil Wieloch @kwieloch-intel, Kasture Deeksha, MarkVeerasingam @MarkVeerasingam, Nikhil Gupta @nikhil-arm, @pmanczak, @vishwascm, and Xia Zhuozhao @xiazhuozhao.