📢 Announcements & Breaking Changes
Build & Platform
- C++20 is now required to build ONNX Runtime from source. Minimum toolchains: MSVC 19.29+, GCC 10+, Clang 10+. Users of prebuilt packages are unaffected. (#27178)
- CUDA minimum version raised to 12.0; CUDA 11.x is no longer supported. Users pinned to CUDA 11.x should stay on ORT 1.24.x or upgrade their CUDA toolkit/driver. (#27570)
- ONNX upgraded to 1.21.0 (#27601)
- sympy is now an optional dependency for Python builds. (#27200)
Execution Provider Changes
- ArmNN EP has been removed. Users should remove any `--use_armnn` build flags and migrate to the MLAS/KleidiAI-backed CPU EP or QNN EP for Qualcomm hardware. (#27447)
API Version
- ORT_API_VERSION updated to 25. (#27280)
🔒 Security Fixes
- Fixed potential integer truncation leading to heap out-of-bounds read/write (#27544)
- Addressed Pad Reflect vulnerability (#27652)
- Security fix for transpose optimizer (#27555)
- Upgraded minimatch 3.1.2 → 3.1.4 for CVE-2026-27904 (#27667)
- Hardened shell command handling for constant strings (#27840)
- Added validation of `onnx::TensorProto` data size before allocation (#27547)
- Cleaned up external data path validation (#27539)
- Fixed misaligned address reads for tensor attributes from raw data buffers (#27312)
- Fixed CPU Attention overflow issue (#27822)
- Fixed CPU LRN integer overflow issues (#27886)
- Additional input validation hardening:
- Tile kernel dim overflow (#27566)
- Out-of-bounds read in cross entropy (#27568)
- TreeEnsembleClassifier attributes (#27571)
- AffineGrid (#27572)
- EmbedLayerNorm position_ids (#27573)
- RotaryEmbedding position_ids (#27597)
- RoiAlign batch_indices (#27603)
- MaxUnpool indices (#27432)
- QMoECPU swiglu OOB (#27748)
- SVMClassifier initializer (#27699)
- Col2Im SafeInt (#27625)
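Most of the hardening items above share one pattern: validate externally controllable sizes and indices before using them in an allocation or a memory access. The sketch below illustrates that pattern in a self-contained way; it is not ORT's actual `SafeInt` code, and the function name and byte limit are illustrative only.

```python
def checked_alloc_size(num_elements: int, element_size: int,
                       max_bytes: int = 2**31 - 1) -> int:
    """Return num_elements * element_size, rejecting negative or
    oversized inputs instead of silently truncating or overflowing."""
    if num_elements < 0 or element_size <= 0:
        raise ValueError("invalid tensor dimensions")
    total = num_elements * element_size  # Python ints cannot overflow
    if total > max_bytes:
        raise ValueError(f"allocation of {total} bytes exceeds limit")
    return total
```

In C++, SafeInt-style wrappers automate the same checks so that an `int64 * int64` product can never silently wrap before reaching the allocator.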
✨ New Features
🔌 Execution Provider Plugin API & CUDA Plugin EP
ORT 1.25.0 introduces the CUDA Plugin EP, the first core implementation that enables third-party CUDA-backed EPs to be delivered as dynamically loaded plugins without rebuilding ORT.
- CUDA Plugin EP: Core implementation (#27816)
- CUDA Plugin EP: BFC-style arena and CUDA mempool allocators for stream-aware memory management (#27931)
- Plugin EP Sync API for synchronous execution (#27538)
- Plugin EP event profiling APIs (#27649)
- Plugin EP APIs to retrieve ONNX operator schemas (#27713)
- Annotation-based graph partitioning with resource accounting (#27595, #27972)
- EP API adapter improvements: header-only adapter, `OpKernelInfo::GetConfigOptions`, `LoggingManager::HasDefaultLogger()` (#26879, #26919, #27540, #27541, #27587)
- WebGPU EP made compatible with EP API (#26907)
🔧 Core APIs
- Per-session thread pool work callbacks API (#27253)
- `enable_profiling` in RunOptions (#26846)
- KernelInfo string-array attribute APIs for C and C++ (#27599)
- `OrtModel` input support for Compile API (#27332)
- Session config to create weightless EPContext models during compilation (#27197)
- Compiled model compatibility APIs in example plugin EP (#27088)
- Model Package support (preview): Initial infrastructure for automatically selecting compiled EPContext model variants from a packaged collection based on EP, device, and hardware constraints. The directory structure is not yet finalized. (#27786)
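Since the Model Package directory structure is not yet finalized, the selection logic can only be sketched abstractly. The keys, paths, and matching rule below are entirely hypothetical; the sketch only shows the idea of picking a compiled EPContext variant by EP and device, with a generic fallback.

```python
def select_variant(variants, ep, device_type):
    """Pick the first compiled variant whose constraints match the
    requested EP and device; otherwise fall back to the generic model.
    (Hypothetical schema -- the real package format is not finalized.)"""
    for v in variants:
        if v.get("ep") == ep and v.get("device") in (device_type, None):
            return v["path"]
    return next(v["path"] for v in variants if v.get("ep") is None)

variants = [
    {"ep": "QNN",  "device": "npu", "path": "model.qnn_ctx.onnx"},
    {"ep": "CUDA", "device": "gpu", "path": "model.cuda_ctx.onnx"},
    {"ep": None,   "device": None,  "path": "model.onnx"},  # generic fallback
]
```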
🆕 New ONNX Ops & Opset Coverage
- Attention opset 23 on CUDA with GQA, boolean masks, softcap, and softmax precision (#26466, #27030, #27082, #27428, #27714)
- Attention opset 24 on CUDA, disjoint from the contrib op (#27542); non-padded KV sequence lengths on CPU (#27384)
- TensorScatter-24 for CPU and CUDA (#27389, #27446)
- DeformConv for CPU/CUDA (#27393)
- LpNormalization-22 (#27164)
- CUDA opset gap fills:
- Control flow & misc: Flatten, Identity, If, Loop, Scan, ConstantOfShape, Size (opset 21/23) (#27728)
- Pooling: GlobalAveragePool/GlobalMaxPool (→22) (#27733)
- Shape ops: Shape (→25), Squeeze/Unsqueeze (→25) (#27734, #27739)
- TopK (→24, BF16) (#27735), GRU (→22) (#27738)
- Pad (→25, wrap mode) (#27774), Resize v19 (#27415), RoiAlign v16/v22 (#27646)
🖥️ Execution Provider Updates
NVIDIA CUDA EP
- GQA with XQA and quantized KV cache, including FP8 (E4M3) KV cache support (#27246, #27321)
- CUDA graph capture compatibility for LLM ops and pre-compiled paths (#27484, #27477)
- Volumetric (3-D) GridSample support (#27201)
- Optimized 3D nearest resize kernel for 5D tensors (#27578)
- Optional `router_weights` input to QMoE (#27687)
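Grouped-query attention (GQA), which several of the items above extend, shares each KV head across a group of query heads so that the KV cache (quantized or not) stays small. A minimal sketch of the head mapping, purely illustrative and unrelated to the actual CUDA kernel:

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int,
                           num_kv_heads: int) -> int:
    """Map a query head index to the KV head it attends over in GQA.
    num_q_heads must be a multiple of num_kv_heads."""
    assert num_q_heads % num_kv_heads == 0
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size
```

For example, with 32 query heads and 8 KV heads, query heads 0-3 all read KV head 0, so the KV cache is a quarter the size of standard multi-head attention.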
NVIDIA TensorRT RTX EP
- D3D12 external resource import support (#26948)
Qualcomm QNN EP
- Disabled file mapping for embedded cache (#27627)
- Fixed use-after-free of logger object (#27804)
- Fixed wheel build issues on WSL and Linux SDK version propagation (#27730, #27800)
Other EPs
- VitisAI EP: Added PE version info to provider DLL (#27626)
- DML EP: Fixed overflow in DmlGraphFusionHelper::ProcessInputData (#27815), fixed new-delete mismatch in QuantizeLinear (#27823)
🌐 Web & JavaScript
WebGPU EP: Performance
- Gemm/MatMul optimization using subgroup features (#26433)
- MatMulNBits: 2-bit zero-point support (#27285, #27325), higher K-parallelism (#27834), DP4A SmallM tiling (#27910)
- Flash Attention: head_sink support (#27410), configurable multi rotary cache concat offset (#27434)
- Optimized 4D Transpose (#26942), string stream optimization (#27223)
WebGPU EP: New Op Support
- Added TopK (#27560), Softplus (#27457), Identity (#27067)
- Added Conv3D support (#27917), LpNorm support (#27876)
- int64/bool support for Range, Expand, Flatten, Gather, Unsqueeze (#26673, #27478, #27561)
- DequantizeLinear fixes (#27706), Einsum 5D tensor fixes (#27779)
WebGPU EP: Stability
- Fixed device destroyed on session release breaking recreation (#27634)
- Fixed static destruction crash on exit (#27470, #27569)
- Backward compat: Legacy WebGPU/WebNN memory info names are now accepted again (#27637)
- Deterministic Split-K handling (#27086), buffer segment alignment fix (#27853)
- Binary size reduction for WebAssembly builds (#27370, #27371)
WebNN EP
- Broader GQA support and improved MultiHeadAttention (#27234, #27494)
- Added DepthToSpace support (#27508)
Node.js & React Native
- Fixed float16 tensor support in Node.js and React Native (#27327, #27549)
- Added 16KB page size alignment for Android (required for Android 15+) (#27523)
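Android 15+ devices may use 16 KB memory pages, so native library segments must be aligned to 16384 bytes rather than the traditional 4096. The alignment itself is the standard power-of-two round-up; a quick self-contained sketch (the function name is illustrative):

```python
PAGE_16K = 16 * 1024  # 16384-byte pages on Android 15+ devices

def align_up(size: int, alignment: int = PAGE_16K) -> int:
    """Round size up to the next multiple of a power-of-two alignment."""
    assert alignment & (alignment - 1) == 0, "alignment must be a power of two"
    return (size + alignment - 1) & ~(alignment - 1)
```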
🧮 CPU & Core Optimizations
MLAS / KleidiAI / Quantization
- KleidiAI BF16 SME2 kernel integration (#26773), asymmetric 4-bit MatMulNBits on ARM64 (#27751)
- Fused Silu and Gelu kernels for AVX512 (#27690)
- Depthwise conv kernel for NCHW on AVX512 (#27874)
- ARM64 NCHWc NEON asm kernels (#27099, #27788), BF16 KAI SBGemm on NCHWc ARM (#27703)
- POWER10 Sgemm PackA optimization (#27575)
- Improved pre-packing for 2-bit LUT kernels (#27131)
DQ→MatMulNBits Fusion
Extended to cover significantly more quantized LLM inference scenarios on CPU:
- 2-bit and 8-bit weights with Cast(fp16→fp32) patterns (#27614)
- FP16 models on CPU EP (#27640), fp16 8-bit on ARM64 (#27692)
- Gemm + per-tensor/per-channel quantization (#27769)
- FP16 quantized weight compatibility: models with HQNBIT quantized weights now route through the FP32 MLAS path for broader CPU compatibility (#27820)
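MatMulNBits consumes blockwise-quantized weights, and the fusion above rewrites an explicit DequantizeLinear + MatMul pair into that form. The dequantization being folded away can be sketched as follows; this is a simplified illustration (one block of unpacked unsigned 4-bit values), not ORT's actual packed weight layout.

```python
def dequantize_block(qvals, scale, zero_point=8):
    """Dequantize one block of unsigned 4-bit values (0..15):
    w = (q - zero_point) * scale.
    zero_point=8 is the symmetric midpoint for 4-bit quantization."""
    assert all(0 <= q <= 15 for q in qvals), "values must fit in 4 bits"
    return [(q - zero_point) * scale for q in qvals]
```

Fusing this into MatMulNBits lets the kernel operate on the packed 4-bit data directly instead of materializing a full-precision weight tensor first.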
Model Optimizer & Fusions
- Qwen3 model type support and RotaryEmbedding fusion for Qwen3 RoPE patterns (#27556, #27590)
- MobileClip attention fusion for both attention block patterns (#27883)
- Nemotron speech conformer encoder MHA fusion (#27764)
- Fixed GPT-2 no-past attention fusion for transformers ≥ 4.27 (#27449)
- Fixed BART attention fusion for SDPA pattern from transformers ≥ 4.49 (#27458)
- Pre-layer normalization support in attention fusion (#27418)
- SkipLayerNorm fusion with bias Add (#27765), broadcasting skip shapes (#27489)
- SpaceToDepth fusion pattern (#27747)
- NCHWc transformer: more patterns and ONNX-domain Gelu/HardSigmoid activations (#27691, #27821)
- Optimized qMoE code path for single-token execution (#27383)
- ONNX Attention KV cache optimization with ConcatNewToPast (#27613)
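The RotaryEmbedding fusions above target the standard RoPE computation, which rotates each pair of channels by a position-dependent angle before attention. A minimal sketch of the per-pair rotation (rotate-half convention; the fused op additionally handles cos/sin caching and interleaved layouts):

```python
import math

def rope_pair(x1: float, x2: float, pos: int, inv_freq: float):
    """Rotate one channel pair (x1, x2) by the angle pos * inv_freq."""
    angle = pos * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x1 * c - x2 * s, x1 * s + x2 * c
```

Recognizing this pattern in Qwen3-style graphs lets the optimizer replace a subgraph of Mul/Sin/Cos/Concat nodes with a single RotaryEmbedding op.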
🌍 Language Bindings
Python
- Exposed `OrtDeviceVendorId` enum for vendor-aware `OrtDevice` aliases (#27594)
- Added bindings for `GetCompatibilityInfoFromModel`/`GetCompatibilityInfoFromModelBytes` (#27565)
- Fixed `OrtValue.from_dlpack` rejecting zero-size tensors as non-contiguous (#27451)
C#
- Added bindings for `GetCompatibilityInfoFromModel`/`GetCompatibilityInfoFromModelBytes` (#27565)
Java
- Avoid provider resource extraction when the library already exists in `onnxruntime.native.path` (#27668)
🐛 Bug Fixes
Critical Fixes
- Fixed CPU Attention overflow issue (#27822)
- Fixed CPU LRN integer overflow issues (#27886)
- Fixed incorrect pad indices in AveragePool `count_include_pad` computation (a silent correctness issue) (#27375)
- Fixed integer division/modulo by zero in CPU EP Div and Mod operators (#27693, #27833)
- Fixed non-ASCII Unicode model path crash (#27724)
- Fixed arithmetic overflow in Det operator (#27070)
- Fixed narrow-to-wide string conversion bugs in DLL load error reporting (#27777)
Operator & Graph Fixes
- Fixed 3D attention mask broadcasting in MHA (#27464)
- Fixed GQA shape inference for present outputs (#27250)
- Fixed Einsum bugs for reduction and empty input cases (#27225, #27226)
- Prevented cross-EP Cast fusion in `RemoveDuplicateCastTransformer` (#27363)
- Fixed ConvTranspose bias input validation on CPU/CUDA (#27209)
- Fixed Cast node naming collisions in float16 conversion (#27469)
- Fixed concat/slice elimination and unsqueeze elimination against optional attrs and invalid models (#27638)
- Improved EPContext error message when node is not assigned to an EP (#27474)
EP-Specific Fixes
- Fixed MiGraphX EP double allocation (#27551)
- Fixed MLAS qgemm dispatch and kernel regressions in quantized conv tests (#27671)
- Fixed run-level profiling for subgraph operators (#27870)
- Fixed `--build_wasm_static_lib` implicitly enabling `--build_wasm` (#27342)
🙌 Contributors
Thanks to our 72 contributors for this release!
@tianleiwu, @fs-eire, @edgchen1, @titaiwangms, @hariharans29, @eserscor, @Rishi-Dave, @guschmue, @adrianlizarraga, @jambayk, @qjia7, @skottmckay, @adrastogi, @sanaa-hamel-microsoft, @yuslepukhin, @ingyukoh, @Jiawei-Shao, @vraspar, @xhcao, @chilo-ms, @Honry, @JonathanC-ARM, @kunal-vaishnavi, @ShirasawaSama, @chaya2350, @derdeljan-msft, @gedoensmax, @HectorSVC, @milpuz01, @quic-calvnguy, @xenova, @akholodnamdcom, @AlekseiNikiforovIBM, @amd-genmingz, @ashrit-ms, @bachelor-dou, @BODAPATIMAHESH, @Colm-in-Arm, @daijh, @dodokw, @fanchenkong1, @ivarusic-amd, @JanSellner, @jchen10, @jiafatom, @jnagi-intel, @johannes-rehm-snkeos, @justinchuby, @keshavv27, @Kevin-Taha, @kevinlam92, @kpkbandi, @Laan33, @melkap01-Arm, @miaobin, @n-v-k, @nico-martin, @patryk-kaiser-ARM, @praneshgo, @prathikr, @qc-tbhardwa, @sagarbhure-msft, @sdotpeng, @the0cp, @TsofnatMaman, @umangb-09, @walidbr, @wenqinI, @xadupre, @xhan65, @xiaofeihan1
Full Changelog: v1.24.4...v1.25.0