We are excited to announce the Intel® Extension for PyTorch* 1.11.0-cpu release, which closely follows the PyTorch 1.11 release. In version 1.11 we focused on continually improving the out-of-box (OOB) user experience and performance. Highlights include:
- Support a single binary with runtime dynamic dispatch based on AVX2/AVX512 hardware ISA detection
- Support installing the binary from `pip` with the package name only (without needing to specify the URL)
- Provide a C++ SDK installer to facilitate C++ app development and deployment
- Add more optimizations, including graph fusions for speeding up Transformer-based models, CNNs, etc.
- Reduce the binary size for both the PIP wheel and C++ SDK (2X to 5X reduction from the previous version)
Highlights
- Combine the AVX2 and AVX512 binaries into a single binary that automatically dispatches to the appropriate implementation based on hardware ISA detection at runtime. The typical use case is serving a data center that mixes AVX2-only and AVX512 platforms; unlike the previous version, there is no need to deploy a separate binary per ISA.
NOTE: The extension uses the oneDNN library as its backend. However, the BF16 and INT8 operator sets and features differ between AVX2 and AVX512. Please refer to the oneDNN documentation for details.
When one input is of type u8, and the other one is of type s8, oneDNN assumes that it is the user’s responsibility to choose the quantization parameters so that no overflow/saturation occurs. For instance, a user can use u7 [0, 127] instead of u8 for the unsigned input, or s7 [-64, 63] instead of the s8 one. It is worth mentioning that this is required only when the Intel AVX2 or Intel AVX512 Instruction Set is used.
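The u8/s8 caveat above can be illustrated with a small sketch: scaling the unsigned input to the u7 range [0, 127] leaves headroom so the u8×s8 products cannot saturate. The helper below is purely illustrative and is not part of the extension's or oneDNN's API.

```python
def quantize_u7(values):
    """Quantize non-negative floats to the u7 range [0, 127] instead of
    u8 [0, 255], so mixed u8*s8 products cannot saturate on AVX2/AVX512.
    Hypothetical helper for illustration only."""
    scale = max(values) / 127.0
    quantized = [min(127, max(0, round(v / scale))) for v in values]
    return quantized, scale

q, s = quantize_u7([0.0, 1.0, 2.0])
```

The same idea applies symmetrically to the signed input: using s7 [-64, 63] instead of the full s8 range.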
- The extension wheel packages have been uploaded to pypi.org. Users can now install the extension directly with `pip`/`pip3` without explicitly specifying the binary location URL.
| v1.10.100-cpu | v1.11.0-cpu |
| --- | --- |
| `python -m pip install intel_extension_for_pytorch==1.10.100 -f https://software.intel.com/ipex-whl-stable` | `pip install intel_extension_for_pytorch` |
- Compared to the previous version, this release provides a dedicated installation file for the C++ SDK. The installation file automatically detects the PyTorch C++ SDK location and installs the extension's C++ SDK files into it, so users no longer need to manually add the extension's C++ SDK source files and CMake configuration to the PyTorch SDK. In addition, the installation file reduces the C++ SDK binary size from ~220MB to ~13.5MB.
| v1.10.100-cpu | v1.11.0-cpu |
| --- | --- |
| intel-ext-pt-cpu-libtorch-shared-with-deps-1.10.0+cpu.zip (220M)<br>intel-ext-pt-cpu-libtorch-cxx11-abi-shared-with-deps-1.10.0+cpu.zip (224M) | libintel-ext-pt-1.11.0+cpu.run (13.7M)<br>libintel-ext-pt-cxx11-abi-1.11.0+cpu.run (13.5M) |
- Add more optimizations, including more custom operators and fusions.
- Fuse the QKV Linear operators into a single Linear to accelerate the Transformer* (BERT-*) encoder part. #278
- Remove Multi-Head-Attention fusion limitations to support 64-byte-unaligned tensor shapes. #531
- Fold binary operators into Convolution and Linear operators to reduce computation. #432 #438 #602
- Replace out-of-place operators with their corresponding in-place versions to reduce the memory footprint. The extension currently supports `silu`, `sigmoid`, `tanh`, `hardsigmoid`, `hardswish`, `relu6`, `relu`, `selu`, and `softmax`. #524
- Fuse Concat + BN + ReLU into a single operator. #452
- Optimize Conv3D for both imperative and JIT by enabling NHWC and pre-packing the weight. #425
- Reduce the binary size. The C++ SDK is reduced from ~220MB to ~13.5MB, and the wheel package from ~100MB to ~40MB.
- Update oneDNN and oneDNN Graph to 2.5.2 and 0.4.2, respectively.
Known Issues
- BF16 AMP (auto-mixed-precision) runs abnormally with the extension on AVX2-only machines if the topology contains `Conv`, `Matmul`, `Linear`, and `BatchNormalization`.
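For reference, the CPU BF16 auto-mixed-precision path mentioned above can be exercised with stock PyTorch's autocast; this minimal sketch (a toy `Linear` model, no extension involved) just shows the API in question.

```python
import torch

# CPU BF16 auto-mixed-precision: matmul/linear ops run in bfloat16
# inside the autocast region (stock PyTorch API).
model = torch.nn.Linear(8, 8).eval()

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    y = model(torch.rand(2, 8))  # y is computed in bfloat16
```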
- The Runtime extension does not support the scenario where the batch size is not divisible by the number of streams.
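A simple guard makes the constraint concrete; the helper below is hypothetical and only illustrates the documented limitation, not the extension's actual API.

```python
def check_stream_batch(batch_size, num_streams):
    """Illustrative guard: the Runtime extension requires the batch to
    split evenly across streams. Hypothetical helper, not part of the
    extension's API."""
    if batch_size % num_streams != 0:
        raise ValueError(
            f"batch size {batch_size} is not divisible by {num_streams} streams"
        )
    return batch_size // num_streams
```

For example, `check_stream_batch(8, 4)` returns a per-stream batch of 2, while a batch of 7 over 4 streams raises.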
- Incorrect Conv and Linear results if the number of OMP threads is changed at runtime.
The oneDNN memory layout depends on the number of OMP threads, so the caller would need to detect changes to the OMP thread count; this release does not implement that detection yet.
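A practical mitigation, given the issue above, is to pin the intra-op thread count once before the first inference and never change it mid-run; the sketch below shows this with stock PyTorch (the value 4 is arbitrary and illustrative).

```python
import torch

# Fix the intra-op (OpenMP) thread count once, before any inference,
# and keep it constant for the process lifetime: oneDNN's cached
# memory layouts depend on the thread count.
torch.set_num_threads(4)  # illustrative value; choose per deployment

model = torch.nn.Linear(8, 8).eval()
with torch.no_grad():
    y = model(torch.rand(2, 8))
```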
- INT8 performance of EfficientNet and DenseNet with the extension is slower than that of FP32.
- Low performance with INT8 support for dynamic shapes.
Support for dynamic shapes in the Intel® Extension for PyTorch* INT8 integration is still a work in progress. For use cases where input shapes are dynamic, for example inputs of variable image sizes in an object detection task or of variable sequence lengths in NLP tasks, the Intel® Extension for PyTorch* INT8 path may slow down model inference. In such cases, please use stock PyTorch INT8 functionality.
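The suggested fallback can be sketched with stock PyTorch dynamic quantization, which tolerates variable input shapes such as varying sequence lengths; the toy model below is for illustration only.

```python
import torch

# Stock PyTorch dynamic quantization: weights are quantized to int8
# ahead of time, activations are quantized on the fly, so the same
# quantized model accepts inputs of varying shapes.
model = torch.nn.Sequential(torch.nn.Linear(16, 4)).eval()
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Two different "sequence lengths" through the same quantized model.
out_a = qmodel(torch.rand(3, 16))
out_b = qmodel(torch.rand(7, 16))
```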
- Low throughput with DLRM FP32 training.
A 'Sparse Add' PR is pending review; the issue will be fixed when the PR is merged.
- If inference is done with a custom function, the conv+bn folding feature of `ipex.optimize()` doesn't work.

```python
import torch
import intel_pytorch_extension as ipex

class Module(torch.nn.Module):
    def __init__(self):
        super(Module, self).__init__()
        self.conv = torch.nn.Conv2d(1, 10, 5, 1)
        self.bn = torch.nn.BatchNorm2d(10)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        return x

    def inference(self, x):
        return self.forward(x)

if __name__ == '__main__':
    m = Module()
    m.eval()
    m = ipex.optimize(m, dtype=torch.float32, level="O0")
    d = torch.rand(1, 1, 112, 112)
    with torch.no_grad():
        m.inference(d)
```
This is a PyTorch FX limitation. Users can avoid this error by calling `m = ipex.optimize(m, level="O0")`, which doesn't apply the extension optimizations, or by disabling conv+bn folding with `m = ipex.optimize(m, level="O1", conv_bn_folding=False)`.
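For context, the underlying FX behavior can be reproduced with stock `torch.fx` alone (the `Net` class below is a toy model for illustration): the tracer only records `forward()`, so a custom entry point like `inference()` is invisible to FX-based graph passes such as conv+bn folding.

```python
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(1, 10, 5, 1)

    def forward(self, x):
        return self.conv(x)

    def inference(self, x):
        # Custom entry point: torch.fx does not trace this method.
        return self.forward(x)

# symbolic_trace builds a graph from forward() only; calls routed
# through inference() bypass the traced graph and any rewrites on it.
gm = torch.fx.symbolic_trace(Net())
y = gm(torch.rand(1, 1, 12, 12))  # shape (1, 10, 8, 8)
```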
What's Changed
Full Changelog: v1.10.100...v1.11.0