## Performance Optimizations

### Intel Architecture Processors
- Improved `fp32` matmul performance with `fp4` compressed weights.
- Improved `fp32` matmul performance for cases when one of the tensors has a trivial dimension on processors with Intel AVX-512 instruction set support.

### Intel Graphics Products
- Improved `fp16`/`bf16` matmul performance for large tensor cases on Intel Arc graphics for Intel Core Ultra processor series 3 (formerly Panther Lake).
- Improved matmul performance for cases with 4-byte alignment on Intel GPUs based on Xe2 architecture.
- Improved performance of `fp16`/`bf16` matmul with `mxfp4` weights.
- Improved convolution performance with host-side scalar scales and zero points.

### AArch64-based Processors
- Improved performance of `s8`/`u8` eltwise post-ops on Arm(R) Neoverse(TM) V1 processors.
- Improved `f16` and `bf16` eltwise performance for `abs`, `relu`, `square`, `sqrt`, `clip`, and `clip_v2`.
- Improved `exp` eltwise performance on Arm(R) Neoverse(TM) N1 processors.
- Improved reorder primitive performance.
- Added matmul optimizations for GEMVs.
- Improved performance of `bf16` matmul.
- Improved performance of `bf16`/`int8` convolutions.
- Convolutions with large spatial filters now consume significantly less memory during primitive setup.

### RISC-V-based Processors
- Improved eltwise and binary primitives performance.
- Improved `f32` GEMM performance.
- Improved `f32` matmul, softmax, convolution, and inner product primitives performance.
- Improved `f32` batch, group, and layer normalization primitives performance.
- Improved `f32` and `fp16` pooling primitive performance.
- Improved reorder (`fp32` to `u8`) primitive performance.

## Functionality

### Functional API
- Introduced destination tensor dynamic quantization in the matmul primitive following the Open Compute Microscaling (MX) formats specification (see the sketch after this list). See the MXFP8 matmul tutorial for a quick introduction to the MX capabilities in oneDNN.
- Introduced support for the NVFP4 quantization scheme. The changes include support for `fp8_e4m3` grouped scales and dynamic quantization support for the destination tensor with the NVFP4-specific formula for scales computation.
- Introduced support for dropout as a primitive attribute for matmul, softmax, and eltwise primitives.
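
For orientation, below is a minimal C++ sketch of how destination-side MX quantization can plug into the existing grouped-scales attribute API. The group size of 32 and the `e8m0` scale type come from the MX specification; treat the destination-scales output wiring (`DNNL_ARG_ATTR_SCALES | DNNL_ARG_DST`) as an assumption and consult the MXFP8 matmul tutorial for the authoritative setup.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Sketch: matmul that writes an MXFP8 destination. The primitive
// computes per-group e8m0 scales for dst (dynamic quantization) and
// stores them in a user-provided buffer.
void mxfp8_dst_matmul(engine &eng, stream &strm, memory &src, memory &wei,
        memory &dst /* f8_e4m3 */, memory &dst_scales /* e8m0 */) {
    const memory::dim group = 32; // MX spec block size

    primitive_attr attr;
    // Scales vary over both dst dimensions, grouped by 32 along the
    // innermost one. Assumed wiring; see the MXFP8 tutorial.
    attr.set_scales(DNNL_ARG_DST, (1 << 0) | (1 << 1), {1, group},
            memory::data_type::e8m0);

    matmul::primitive_desc pd(
            eng, src.get_desc(), wei.get_desc(), dst.get_desc(), attr);
    matmul(pd).execute(strm,
            {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei},
                    {DNNL_ARG_DST, dst},
                    // Output: scales computed by the primitive.
                    {DNNL_ARG_ATTR_SCALES | DNNL_ARG_DST, dst_scales}});
}
```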

### Graph API
- Introduced support for the RMS Normalization operation.
- Introduced support for the output gradient of the attention mask for SDPA and GQA training.

### Intel Graphics Products
- Introduced support for convolution with `u8` weights.
- Introduced support for 2D grouped scales in `fp8` matmul (see the sketch below).
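
As a hedged illustration, the sketch below requests 2D grouped scales on the weights of an `fp8` matmul through the grouped-scales attribute API. The 128x128 group shape is an arbitrary example, not a library requirement.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Sketch: fp8 matmul whose weight scales are laid out in 2D groups
// (128x128 blocks here, chosen only for illustration).
matmul::primitive_desc fp8_matmul_pd(const engine &eng,
        const memory::desc &src_md /* f8_e4m3 */,
        const memory::desc &wei_md /* f8_e4m3 */,
        const memory::desc &dst_md /* f32 */) {
    primitive_attr attr;
    // Mask bits 0 and 1: scales vary along both weight dimensions;
    // each scale value covers a 128x128 block of weights.
    attr.set_scales(DNNL_ARG_WEIGHTS, (1 << 0) | (1 << 1), {128, 128},
            memory::data_type::f32);
    return matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
}
```

At execution time the scales tensor is passed via `DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS`.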

### Intel Architecture Processors
- Introduced support for different data types of source and destination in pooling forward propagation.
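
A minimal sketch of the mixed-type pooling support follows; the specific `f32` source with `bf16` destination combination is an assumed example of the newly allowed configurations.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Sketch: 2x2 average pooling that reads f32 and writes bf16, fusing
// the down-conversion that previously required a separate reorder.
pooling_forward::primitive_desc mixed_pool_pd(const engine &eng) {
    memory::desc src_md({8, 64, 56, 56}, memory::data_type::f32,
            memory::format_tag::nchw);
    memory::desc dst_md({8, 64, 28, 28}, memory::data_type::bf16,
            memory::format_tag::nchw);
    return pooling_forward::primitive_desc(eng,
            prop_kind::forward_inference,
            algorithm::pooling_avg_exclude_padding, src_md, dst_md,
            /*strides=*/{2, 2}, /*kernel=*/{2, 2}, /*dilation=*/{0, 0},
            /*padding_l=*/{0, 0}, /*padding_r=*/{0, 0});
}
```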

### AArch64-based Processors
- Added limited support for the BRGEMM Microkernel API.
- Added limited support for Windows on Arm builds with MSVC.

## Usability
- Extended quantization attributes documentation to cover all quantization schemes supported by the library.
- Added a matmul fp8 quantization example demonstrating use of the matmul primitive with `fp8` source, destination, and weights.
- Extended the oneDNN threadpool runtime with an option to support asynchronous execution and updated all CPU implementations accordingly. This extension makes oneDNN compatible with the OpenXLA "thunk" runtime (see the sketch after this list).
- Extended information about primitive execution available in VTune(TM) Profiler with the same level of detail as reported by oneDNN verbose mode. This feature requires VTune Profiler 2025.7 or later.
- Introduced the `ONEDNN_SAFE_RBP` build knob that instructs x64 implementations to preserve the value of the `rbp` register for tools that rely on stack unwinding. This option may have a visible performance impact on some workloads.
- Removed the build-time dependency on the OpenCL runtime in the SYCL build configuration.
- The `ONEDNN_ENABLE_GRAPH_DUMP` build knob is now enabled by default.
- Fixed a potential overflow on AArch64 builds with Arm Compute Library.
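
To make the asynchronous threadpool option concrete, here is a minimal sketch of a `threadpool_iface` implementation that opts in via the `ASYNCHRONOUS` flag. The detached-thread dispatch is a placeholder for a real pool (such as the one OpenXLA provides), not a recommendation.

```cpp
#include <functional>
#include <thread>
#include "oneapi/dnnl/dnnl_threadpool_iface.hpp"

// Sketch: a threadpool that reports the ASYNCHRONOUS flag, telling
// oneDNN that parallel_for may return before the submitted closures
// finish; completion is then synchronized through the stream.
class async_threadpool : public dnnl::threadpool_interop::threadpool_iface {
public:
    int get_num_threads() const override {
        return (int)std::thread::hardware_concurrency();
    }
    bool get_in_parallel() const override { return false; }
    uint64_t get_flags() const override { return ASYNCHRONOUS; }
    void parallel_for(int n, const std::function<void(int, int)> &fn) override {
        // Placeholder dispatch: a real implementation would hand the
        // closures to an actual pool instead of detaching threads.
        for (int i = 0; i < n; i++)
            std::thread([fn, i, n] { fn(i, n); }).detach();
    }
};
```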

## Deprecated Functionality
- The BLAS-like API, including the `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions, is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive (see the sketch below).
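
For migration, here is a minimal sketch of expressing a plain `C = A * B` (i.e., `dnnl::sgemm` with alpha = 1 and beta = 0) through the matmul primitive; leading dimensions map onto row-major strides.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Sketch: dnnl::sgemm('N', 'N', M, N, K, 1.f, A, lda, B, ldb, 0.f,
// C, ldc) rewritten with the matmul primitive.
void sgemm_via_matmul(memory::dim M, memory::dim N, memory::dim K,
        float *A, memory::dim lda, float *B, memory::dim ldb, float *C,
        memory::dim ldc) {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Row-major matrices; leading dimensions become row strides.
    memory::desc a_md({M, K}, memory::data_type::f32, {lda, 1});
    memory::desc b_md({K, N}, memory::data_type::f32, {ldb, 1});
    memory::desc c_md({M, N}, memory::data_type::f32, {ldc, 1});

    memory a_m(a_md, eng, A), b_m(b_md, eng, B), c_m(c_md, eng, C);

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(strm, {{DNNL_ARG_SRC, a_m},
                                     {DNNL_ARG_WEIGHTS, b_m},
                                     {DNNL_ARG_DST, c_m}});
    strm.wait();
}
```

Non-default alpha and beta values map onto scale attributes and a sum post-op, respectively.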

## Thanks to our Contributors
This release contains contributions from the project core team as well as Andrei Hutu @Anndrey24, Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, czekun @ZackyLake, Deeksha Kasture @kasturedeeksha, Fadi Arafeh @fadara01, Gassan Salama @gassan-arm, Henry Gardiner @henry-gar, @jstachowintel, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw, Murray Steele @murste01, Narendra Bagria @narenbagria, Joseph Kuo @PershingSquare, @pmanczak, @vishwascm, Yejing Lai @Yejing-Lai, 夏卓昭 @xiazhuozhao