Announcements
- Security issues addressed by this release
- A protobuf security issue (CVE-2022-1941) that impacts users who load ONNX models from untrusted sources, for example, a deep learning inference service that allows users to upload their models and then runs the inferences in a shared environment.
- An ONNX security vulnerability that allows reading of tensor_data outside the model directory, which in turn allows attackers to read or write arbitrary files on an affected system that loads ONNX models from untrusted sources. (#12915)
- Deprecations
- CUDA 10.x support at the source code level
- Windows 8.x support in Nuget/C API prebuilt binaries. Support for Windows 7+ Desktop versions (including Windows servers) will be retained by building ONNX Runtime from source.
- NUPHAR EP code is removed
- Dependency versioning updates
- A C++17 compiler is now required to build ORT from source. On Linux, GCC version >=7.0 is required.
- Minimum numpy version bumped to 1.21.6 (from 1.21.0) for ONNX Runtime Python packages
- Official ONNX Runtime GPU packages now require CUDA version >=11.6 instead of 11.4.
General
- Expose all arena configs in the Python API in an extensible way (see the sketch below)
- Fix ARM64 NuGet packaging
- Fix EP allocator setup issue affecting TVM EP
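For illustration, here is a minimal sketch of the extensible arena configuration flow. It assumes the dict-based OrtArenaCfg constructor and the shared-allocator registration API exposed by the Python bindings; the config values are placeholders, so check the Python API docs for the exact names in your build.

```python
import onnxruntime as ort

# Assumed: OrtArenaCfg accepts a dict of arena config names, so new configs
# can be exposed without changing the constructor signature.
arena_cfg = ort.OrtArenaCfg({
    "max_mem": 0,                          # 0 = no limit
    "arena_extend_strategy": 0,            # 0 = kNextPowerOfTwo
    "initial_chunk_size_bytes": 1024,
    "max_dead_bytes_per_chunk": 128,
    "initial_growth_chunk_size_bytes": 256,
})

# Assumed: register a shared CPU arena allocator that sessions can opt into.
mem_info = ort.OrtMemoryInfo("Cpu", ort.OrtAllocatorType.ORT_ARENA_ALLOCATOR,
                             0, ort.OrtMemType.DEFAULT)
ort.create_and_register_allocator(mem_info, arena_cfg)

so = ort.SessionOptions()
so.add_session_config_entry("session.use_env_allocators", "1")
```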
Performance
- Transformers CUDA improvements
- Quantization on GPU for BERT - notebook, documentation on QAT, transformer optimization toolchain and quantized kernels.
- Add fused attention CUDA kernels for BERT.
- Fuse Add (bias) and Transpose of Q/K/V into one kernel for Attention and LongformerAttention.
- Reduce GEMM computation in LongformerAttention with a new weight format.
- General quantization (tool and kernel)
- Quantization debugging tool - identify sensitive nodes/layers from accuracy-drop discrepancies
- New quantize API based on QuantConfig (see the sketch below)
- New quantized operators: SoftMax, Split, Where
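As a rough sketch of what the QuantConfig-based flow can look like, the example below uses dynamic quantization so no calibration data reader is needed. The model file names are placeholders, and the class names should be verified against onnxruntime.quantization in this release.

```python
from onnxruntime.quantization import DynamicQuantConfig, QuantType, quantize

# Placeholder model paths for illustration.
config = DynamicQuantConfig(weight_type=QuantType.QInt8)
quantize("model_fp32.onnx", "model_int8.onnx", config)
```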
Execution Providers
- CUDA EP
- Official ONNX Runtime GPU packages are now built with CUDA version 11.6 instead of 11.4, but should still be backwards compatible with 11.4
- TensorRT EP
- Build option to link against pre-built onnx-tensorrt parser; this enables potential "no-code" TensorRT minor version upgrades and can be used to build against TensorRT 8.5 EA
- Improved nested control flow support
- Improved HashId generation used for uniquely identifying TRT engines; addresses issues such as the TRT Engine Cache Regeneration Issue
- TensorRT uint8 support
- OpenVINO EP
- OpenVINO version upgraded to 2022.2.0
- Support for INT8 QDQ models from NNCF
- Support for Intel 13th Gen Core Processors (Raptor Lake)
- Preview support for Intel discrete graphics cards (Intel Data Center GPU Flex Series and Intel Arc GPU)
- Increased test coverage for GPU Plugin
- SNPE EP
- Add support for Windows Dev Kit 2023
- Nuget Package is now available
- DirectML EP
- Update to DML 1.9.1
- New ops: LayerNormalization, Gelu, MatMulScale, DFT, FusedMatMul (contrib)
- Bug fixes: DML EP Fix InstanceNormalization with 3D tensors (#12693), DML EP squeeze all axes when empty (#12649), DirectML GEMM broken in opset 11 and 13 when optional tensor C not provided (#12568)
- [new] CANN EP - Initial integration of CANN EP contributed by Huawei to support Ascend 310 (#11477)
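For reference, registering the new CANN EP from Python follows the usual provider-list pattern. The provider name and model path below are assumptions to check against an Ascend-enabled build.

```python
import onnxruntime as ort

# "CANNExecutionProvider" is the assumed provider name; it is only available
# in builds compiled with CANN/Ascend support.
sess = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=["CANNExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())
```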
Mobile
- EP infrastructure
- Implemented support for additional EPs that use static kernels
- Required for EPs like XNNPACK to be supported in minimal build
- Removes need for kernel hashes to reduce maintenance overhead for developers
- NOTE: ORT format models will need to be regenerated as the format change is NOT backwards compatible. We're replacing hashes for the CPU EP kernels with operator constraint information for operators used by the model so that we can match any static kernels available at runtime.
- XNNPACK
- Added more kernels, including QDQ format model support
- AveragePool, Softmax
- QLinearConv, QLinearAveragePool, QLinearSoftmax
- Added support for using the XNNPACK threadpool
- See the documentation for recommendations on how to configure the XNNPACK threadpool (a sketch follows below)
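As a hedged sketch of splitting work between the ORT and XNNPACK threadpools (see the XNNPACK EP documentation for the authoritative guidance), the provider name and the intra_op_num_threads provider option below are assumptions:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Let XNNPACK's own threadpool handle intra-op parallelism; keep ORT's at 1.
so.intra_op_num_threads = 1

sess = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    sess_options=so,
    providers=[
        ("XnnpackExecutionProvider", {"intra_op_num_threads": 4}),  # assumed option name
        "CPUExecutionProvider",
    ],
)
```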
- ORT format model peak memory usage
- Added ability to use the ORT format model bytes directly for initializers to reduce peak memory usage (see the sketch below)
- Enabled via SessionOptions config
- https://onnxruntime.ai/docs/reference/ort-format-models.html#load-ort-format-model-from-an-in-memory-byte-array
- Set "session.use_ort_model_bytes_directly" and "session.use_ort_model_bytes_for_initializers" to "1"
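A minimal sketch of loading an ORT format model from an in-memory byte array with both config entries set; the model path is a placeholder, and the byte array must stay alive for the lifetime of the session when its bytes are used directly.

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Use the provided model bytes directly, including for initializers,
# instead of copying them into session-owned buffers.
so.add_session_config_entry("session.use_ort_model_bytes_directly", "1")
so.add_session_config_entry("session.use_ort_model_bytes_for_initializers", "1")

with open("model.ort", "rb") as f:  # placeholder ORT format model
    model_bytes = f.read()

sess = ort.InferenceSession(model_bytes, sess_options=so)
```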
Web
- Support for 4GB memory in WebAssembly
- Upgraded emscripten to 3.1.19
- Build from source support for onnxruntime-extensions and sentencepiece
- Initial support for XNNPACK optimizations in WebAssembly
Training
- Training packages updated to CUDA version 11.6; CUDA 10.2 and 11.3 packages removed
- Performance improvements via op fusions such as BiasSoftmax fusion, Dropout fusion, and Gather-to-Split fusion, targeting SOTA models
- Added Aten support for GroupNorm, InstanceNormalization, Upsample nearest
- Bug fixes for SimplifiedLayerNorm and a segfault in alltoall
Contributions
Contributors to ONNX Runtime include members across teams at Microsoft, along with our community members:
snnn, baijumeswani, edgchen1, iK1D, skottmckay, cloudhan, tianleiwu, fs-eire, mszhanyi, WilBrady, hariharans29, chenfucn, fdwr, yuslepukhin, wejoncy, PeixuanZuo, pengwa, yufenglee, jchen351, justinchuby, dependabot[bot], RandySheriffH, sumitsays, wschin, wangyems, YUNQIUGUO, ytaous, pranavsharma, vvchernov, natke, Craigacp, RandyShuai, smk2007, zhangyaobit, jcwchen, yihonglyu, georgen117, chilo-ms, ashbhandare, faxu, jstoecker, gramalingam, garymm, jeffbloo, xadupre, jywu-msft, askhade, RyanUnderhill, thiagocrepaldi, mindest, jingyanwangms, wenbingl, ashari4, sfatimar, MaajidKhan, souptc, HectorSVC, weixingzhang, zhanghuanrong