Announcements

GCC versions below 7 are no longer supported
CMAKE_SYSTEM_PROCESSOR needs to be set when cross-compiling on Linux because pytorch cpuinfo was introduced as a dependency for ARM big.LITTLE support. Set it to the output of uname -m on your target device.
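
As a minimal sketch of the step above: the helper below formats the standard CMake `-D` define from a target's `uname -m` value. The function name `cmake_processor_define` is hypothetical and only for illustration; how the define is forwarded to the build depends on your own build invocation.

```python
import platform


def cmake_processor_define(target_machine):
    """Format a CMake define from the target device's `uname -m` output.

    `cmake_processor_define` is an illustrative helper, not part of any
    ONNX Runtime tooling.
    """
    machine = target_machine.strip()
    if not machine:
        raise ValueError("target machine string is empty")
    return f"-DCMAKE_SYSTEM_PROCESSOR={machine}"


if __name__ == "__main__":
    # When run ON the target device, platform.machine() reports the same
    # value as `uname -m` on Linux (for example "aarch64").
    detected = platform.machine()
    if detected:
        print(cmake_processor_define(detected))
    else:
        print("Could not detect the machine type; run `uname -m` manually.")
```

Run the script on the target device, then pass the printed define to the CMake configuration on the build host.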

General

ONNX 1.10 support
- opset 15
- ONNX IR 8 (SparseTensor type, model-local FunctionProtos, and Optional type are not yet fully supported in this release)
Improved documentation of C/C++ APIs
IBM Power support
WinML - DLL dependency fix supports learning models on Windows 8.1
Support for sub-building onnxruntime-extensions and statically linking into onnxruntime binary for custom builds
- Add --use_extensions option to run models with custom operators implemented in onnxruntime-extensions

Registration of a custom allocator for sharing between multiple sessions. (See RegisterAllocator and UnregisterAllocator APIs in onnxruntime_c_api.h)
SessionOptionsAppendExecutionProvider_TensorRT API is deprecated; use SessionOptionsAppendExecutionProvider_TensorRT_V2
New APIs: SessionOptionsAppendExecutionProvider_TensorRT_V2, CreateTensorRTProviderOptions, UpdateTensorRTProviderOptions, GetTensorRTProviderOptionsAsString, ReleaseTensorRTProviderOptions, EnableOrtCustomOps, RegisterAllocator, UnregisterAllocator, IsSparseTensor, CreateSparseTensorAsOrtValue, FillSparseTensorCoo, FillSparseTensorCsr, FillSparseTensorBlockSparse, CreateSparseTensorWithValuesAsOrtValue, UseCooIndices, UseCsrIndices, UseBlockSparseIndices, GetSparseTensorFormat, GetSparseTensorValuesTypeAndShape, GetSparseTensorValues, GetSparseTensorIndicesTypeShape, GetSparseTensorIndices

Performance improvement on ARM
- Added S8S8 (signed int8, signed int8) matmul kernel. This avoids extending uint8 to int16, for better performance on ARM64 devices without dot-product instructions
- Expanded GEMM udot kernel to 8x8 accumulator
- Added sgemm and qgemm optimized kernels for ARM64EC
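
To make the S8S8 arithmetic above concrete, here is a plain-Python sketch of a signed int8 by signed int8 matrix multiply with wide integer accumulation. It illustrates the math only and is not the ONNX Runtime kernel; the function names and the symmetric (zero point 0) quantization scheme are assumptions chosen for the example.

```python
def quantize_s8(values, scale):
    """Symmetrically quantize floats to signed int8 (zero point assumed 0)."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def matmul_s8s8(a, b):
    """Multiply two int8 matrices given as lists of rows.

    Products are summed in Python ints, standing in for the int32
    accumulators a real kernel would use.
    """
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def dequantize(acc, scale_a, scale_b):
    """Map integer accumulators back to floats using both input scales."""
    return [[v * scale_a * scale_b for v in row] for row in acc]


if __name__ == "__main__":
    a = [quantize_s8([0.5, -1.0], 0.5), quantize_s8([1.5, 2.0], 0.5)]
    b = [[5, 6], [-7, 8]]
    acc = matmul_s8s8(a, b)
    print(acc)
    print(dequantize(acc, 0.5, 1.0))
```

Because both operands are already signed 8-bit values, no unsigned operand has to be widened before the multiply-accumulate step.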
Operator improvements
- Improved performance for quantized operators: DynamicQuantizeLSTM, QLinearAvgPool
- Added new quantized operator QGemm for quantizing Gemm directly
- Fused HardSigmoid and Conv
Quantization tool - subgraph support
Transformers tool improvements
- Fused Attention for BART encoder and Megatron GPT-2
- Integrated mixed precision ONNX conversion and parity test for GPT-2
- Updated graph fusion for embed layer normalization for BERT
- Improved symbolic shape inference for operators: Attention, EmbedLayerNormalization, Einsum and Reciprocal

Official ORT GPU packages (except Python) now include both CUDA and TensorRT Execution Providers.
- Python packages will be updated next release. Please note that EPs should be explicitly registered to ensure the correct provider is used.
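
A sketch of explicit EP registration from Python, as recommended above. `select_providers`, `PREFERRED`, and the `model.onnx` path are hypothetical names for illustration; `get_available_providers` and the `providers` argument of `InferenceSession` belong to the onnxruntime Python package. Which providers are actually available depends on the installed package.

```python
def select_providers(available, preferred):
    """Keep preferred providers that are available, in preference order.

    CPUExecutionProvider is appended as the final fallback if missing.
    """
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Illustrative preference order: TensorRT first, then CUDA.
PREFERRED = ["TensorrtExecutionProvider", "CUDAExecutionProvider"]

if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed; skipping session creation.")
    else:
        providers = select_providers(ort.get_available_providers(), PREFERRED)
        print("Registering providers:", providers)
        # "model.onnx" is a placeholder path; uncomment with a real model.
        # session = ort.InferenceSession("model.onnx", providers=providers)
```

Passing the list explicitly makes the provider order visible in code instead of relying on package defaults.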
GPU packages are built with CUDA 11.4 and should be compatible with 11.x on systems with the minimum required driver version. See: CUDA minor version compatibility
PyPI
- ORT + DirectML Python packages now available: onnxruntime-directml
- GPU package can be used on both CPU-only and GPU machines
NuGet
- C#: Added support for using netstandard2.0 as a target framework
- Windows symbol (PDB) files are no longer included in the NuGet package, reducing the size of the binary NuGet package by 85%. To download them, please see the artifacts below on GitHub.

CUDA EP
- Framework improvements that boost CUDA performance of subgraph heavy models (#8642, #8702)
- Support for sequence ops for improved performance for models using sequence type
- Kernel perf improvements for Pad and Upsample (up to 4.5x faster)
TensorRT EP
- Added support for TensorRT 8.0 (x64 Windows/Linux, ARM Jetson), which includes new TensorRT explicit-quantization features (ONNX Q/DQ support)
- General fixes and quality improvements
OpenVINO EP
- Added support for OpenVINO 2021.4
DirectML EP
- Bug fix for Identity with non-float inputs affecting DynamicQuantizeLinear ONNX backend test

WebAssembly
- SIMD (Single Instruction, Multiple Data) support
- Option to load WebAssembly from worker thread to avoid blocking main UI thread
- wasm file path override
WebGL
- Simpler workflow for WebGL kernel implementation
- Improved performance with Conv kernel enhancement

Added more example mobile apps
CoreML and NNAPI EP enhancements
Reduced peak memory usage when initializing session with ORT format model as bytes
Enhanced partitioning to improve performance when using NNAPI and CoreML
- Reduce number of NNAPI/CoreML partitions required
- Add ability to force usage of CPU for post-processing in SSD models
  - Improves performance by avoiding expensive device copy to/from NPU for cheap post-processing section of the model
Changed to using xcframework in the iOS package
- Supports usage of arm64 iPhone simulator on Mac with Apple silicon

Expanded supported input formats to include dictionaries and lists
Enable user defined autograd functions
Support for fallback to PyTorch for execution
Added support for deterministic compute to enable reproducibility with ORTModule
Added DebugOptions and LogLevels to the ORTModule API to improve debuggability
Improvements and additions to kernels/gradients: Concat, Split, MatMul, ReluGrad, PadOp, Tile, BatchNormInternal
Support for ROCm 4.3.1 on AMD GPU