## Performance Optimizations

### Intel Architecture Processors
- Improved `fp32` matmul performance with `fp4` compressed weights.
- Improved `fp32` matmul performance for cases when one of the tensors has a trivial dimension on processors with Intel AVX-512 instruction set support.

### Intel Graphics Products
- Improved `fp16`/`bf16` matmul performance for large tensor cases on Intel Arc graphics for Intel Core Ultra processor series 3 (formerly Panther Lake).
- Improved matmul performance for cases with 4-byte alignment on Intel GPUs based on Xe2 architecture.
- Improved performance of `fp16`/`bf16` matmul with `mxfp4` weights.
- Improved convolution performance with host-side scalar scales and zero points.

### AArch64-based Processors
- Improved performance of `s8`/`u8` eltwise post-ops on Arm(R) Neoverse(TM) V1 processors.
- Improved `f16` and `bf16` eltwise performance for `abs`, `relu`, `square`, `sqrt`, `clip`, and `clip_v2`.
- Improved `exp` eltwise performance on Arm(R) Neoverse(TM) N1 processors.
- Improved reorder primitive performance.
- Added matmul optimizations for GEMVs.
- Improved performance of `bf16` matmul.
- Improved performance of `bf16`/`int8` convolutions.
- Convolutions with large spatial filters now consume significantly less memory during primitive setup.

### RISC-V-based Processors
- Improved eltwise and binary primitives performance.
- Improved `f32` GEMM performance.
- Improved `f32` matmul, softmax, convolution, and inner product primitives performance.
- Improved `f32` batch, group, and layer normalization primitives performance.
- Improved `f32` and `fp16` pooling primitive performance.
- Improved reorder (`fp32` to `u8`) primitive performance.

## Functionality

### Functional API
- Introduced destination tensor dynamic quantization in the matmul primitive following the Open Compute Microscaling (MX) formats specification (see the sketch after this list). See the MXFP8 matmul tutorial for a quick introduction to the MX capabilities in oneDNN.
- Introduced support for the NVFP4 quantization scheme. The changes include support for `fp8_e4m3` grouped scales and dynamic quantization support for the destination tensor with the NVFP4-specific formula for scales computation.
- Introduced support for dropout as a primitive attribute for matmul, softmax, and eltwise primitives.
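
For orientation, below is a minimal C++ sketch of how destination-side MX quantization can plug into the existing grouped-scales attribute API. The group size of 32 and the `e8m0` scale type come from the MX specification; treat the destination-scales output wiring (`DNNL_ARG_ATTR_SCALES | DNNL_ARG_DST`) as an assumption and consult the MXFP8 matmul tutorial for the authoritative setup.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Sketch: matmul that writes an MXFP8 destination. The primitive
// computes per-group e8m0 scales for dst (dynamic quantization) and
// stores them in a user-provided buffer.
void mxfp8_dst_matmul(engine &eng, stream &strm, memory &src, memory &wei,
        memory &dst /* f8_e4m3 */, memory &dst_scales /* e8m0 */) {
    const memory::dim group = 32; // MX spec block size

    primitive_attr attr;
    // Scales vary over both dst dimensions, grouped by 32 along the
    // innermost one. Assumed wiring; see the MXFP8 tutorial.
    attr.set_scales(DNNL_ARG_DST, (1 << 0) | (1 << 1), {1, group},
            memory::data_type::e8m0);

    matmul::primitive_desc pd(
            eng, src.get_desc(), wei.get_desc(), dst.get_desc(), attr);
    matmul(pd).execute(strm,
            {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei},
                    {DNNL_ARG_DST, dst},
                    // Output: scales computed by the primitive.
                    {DNNL_ARG_ATTR_SCALES | DNNL_ARG_DST, dst_scales}});
}
```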

### Graph API
- Introduced support for the RMS Normalization operation.
- Introduced support for the output gradient of the attention mask for SDPA and GQA training.

### Intel Graphics Products
- Introduced support for convolution with `u8` weights.
- Introduced support for 2D grouped scales in `fp8` matmul (see the sketch below).
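
As a hedged illustration, the sketch below requests 2D grouped scales on the weights of an `fp8` matmul through the grouped-scales attribute API. The 128x128 group shape is an arbitrary example, not a library requirement.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Sketch: fp8 matmul whose weight scales are laid out in 2D groups
// (128x128 blocks here, chosen only for illustration).
matmul::primitive_desc fp8_matmul_pd(const engine &eng,
        const memory::desc &src_md /* f8_e4m3 */,
        const memory::desc &wei_md /* f8_e4m3 */,
        const memory::desc &dst_md /* f32 */) {
    primitive_attr attr;
    // Mask bits 0 and 1: scales vary along both weight dimensions;
    // each scale value covers a 128x128 block of weights.
    attr.set_scales(DNNL_ARG_WEIGHTS, (1 << 0) | (1 << 1), {128, 128},
            memory::data_type::f32);
    return matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
}
```

At execution time the scales tensor is passed via `DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS`.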

### Intel Architecture Processors
- Introduced support for different data types of source and destination in pooling forward propagation.
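
A minimal sketch of the mixed-type pooling support follows; the specific `f32` source with `bf16` destination combination is an assumed example of the newly allowed configurations.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Sketch: 2x2 average pooling that reads f32 and writes bf16, fusing
// the down-conversion that previously required a separate reorder.
pooling_forward::primitive_desc mixed_pool_pd(const engine &eng) {
    memory::desc src_md({8, 64, 56, 56}, memory::data_type::f32,
            memory::format_tag::nchw);
    memory::desc dst_md({8, 64, 28, 28}, memory::data_type::bf16,
            memory::format_tag::nchw);
    return pooling_forward::primitive_desc(eng,
            prop_kind::forward_inference,
            algorithm::pooling_avg_exclude_padding, src_md, dst_md,
            /*strides=*/{2, 2}, /*kernel=*/{2, 2}, /*dilation=*/{0, 0},
            /*padding_l=*/{0, 0}, /*padding_r=*/{0, 0});
}
```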

### AArch64-based Processors
- Added limited support for the BRGEMM Microkernel API.
- Added limited support for Windows on Arm builds with MSVC.

## Usability
- Extended quantization attributes documentation to cover all quantization schemes supported by the library.
- Added a matmul fp8 quantization example demonstrating use of the matmul primitive with `fp8` source, destination, and weights.
- Extended the oneDNN threadpool runtime with an option to support asynchronous execution and updated all CPU implementations accordingly. This extension makes oneDNN compatible with the OpenXLA "thunk" runtime (see the sketch after this list).
- Extended information about primitive execution available in VTune(TM) Profiler with the same level of detail as reported by oneDNN verbose mode. This feature requires VTune Profiler 2025.7 or later.
- Introduced the `ONEDNN_SAFE_RBP` build knob that instructs x64 implementations to preserve the value of the `rbp` register for tools that rely on stack unwinding. This option may have a visible performance impact on some workloads.
- Removed the build-time dependency on the OpenCL runtime in the SYCL build configuration.
- The `ONEDNN_ENABLE_GRAPH_DUMP` build knob is now enabled by default.
- Fixed a potential overflow on AArch64 builds with Arm Compute Library.
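
To make the asynchronous threadpool option concrete, here is a minimal sketch of a `threadpool_iface` implementation that opts in via the `ASYNCHRONOUS` flag. The detached-thread dispatch is a placeholder for a real pool (such as the one OpenXLA provides), not a recommendation.

```cpp
#include <functional>
#include <thread>
#include "oneapi/dnnl/dnnl_threadpool_iface.hpp"

// Sketch: a threadpool that reports the ASYNCHRONOUS flag, telling
// oneDNN that parallel_for may return before the submitted closures
// finish; completion is then synchronized through the stream.
class async_threadpool : public dnnl::threadpool_interop::threadpool_iface {
public:
    int get_num_threads() const override {
        return (int)std::thread::hardware_concurrency();
    }
    bool get_in_parallel() const override { return false; }
    uint64_t get_flags() const override { return ASYNCHRONOUS; }
    void parallel_for(int n, const std::function<void(int, int)> &fn) override {
        // Placeholder dispatch: a real implementation would hand the
        // closures to an actual pool instead of detaching threads.
        for (int i = 0; i < n; i++)
            std::thread([fn, i, n] { fn(i, n); }).detach();
    }
};
```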

## Deprecated Functionality
- The BLAS-like API, including the `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions, is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive (see the sketch below).
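
For migration, here is a minimal sketch of expressing a plain `C = A * B` (i.e., `dnnl::sgemm` with alpha = 1 and beta = 0) through the matmul primitive; leading dimensions map onto row-major strides.

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Sketch: dnnl::sgemm('N', 'N', M, N, K, 1.f, A, lda, B, ldb, 0.f,
// C, ldc) rewritten with the matmul primitive.
void sgemm_via_matmul(memory::dim M, memory::dim N, memory::dim K,
        float *A, memory::dim lda, float *B, memory::dim ldb, float *C,
        memory::dim ldc) {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Row-major matrices; leading dimensions become row strides.
    memory::desc a_md({M, K}, memory::data_type::f32, {lda, 1});
    memory::desc b_md({K, N}, memory::data_type::f32, {ldb, 1});
    memory::desc c_md({M, N}, memory::data_type::f32, {ldc, 1});

    memory a_m(a_md, eng, A), b_m(b_md, eng, B), c_m(c_md, eng, C);

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(strm, {{DNNL_ARG_SRC, a_m},
                                     {DNNL_ARG_WEIGHTS, b_m},
                                     {DNNL_ARG_DST, c_m}});
    strm.wait();
}
```

Non-default alpha and beta values map onto scale attributes and a sum post-op, respectively.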

## Thanks to our Contributors
This release contains contributions from the project core team as well as Andrei Hutu @Anndrey24, Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, czekun @ZackyLake, Deeksha Kasture @kasturedeeksha, Fadi Arafeh @fadara01, Gassan Salama @gassan-arm, Henry Gardiner @henry-gar, @jstachowintel, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw, Murray Steele @murste01, Narendra Bagria @narenbagria, Joseph Kuo @PershingSquare, @pmanczak, @vishwascm, Yejing Lai @Yejing-Lai, 夏卓昭 @xiazhuozhao