We are excited to announce the release of Intel® Extension for PyTorch* 2.1.0+cpu, which accompanies PyTorch 2.1. This release mainly brings our latest optimizations for Large Language Models (LLMs), torch.compile backend optimization that leverages TorchInductor’s capability, and performance optimization of static quantization under dynamic shapes, together with a set of bug fixes and small optimizations. We sincerely thank our dedicated community for your contributions. As always, we encourage you to try this release and give us feedback so that we can further improve the product.
Highlights
-
Large Language Model (LLM) optimization (Experimental): Intel® Extension for PyTorch* provides many LLM-specific optimizations in this release. At the operator level, we provide a highly efficient GEMM kernel to speed up the Linear layer, and customized operators to reduce the memory footprint. To better trade off performance and accuracy, different low-precision solutions are also enabled, e.g., SmoothQuant for INT8 and weight-only quantization for INT4 and INT8. In addition, tensor parallelism can be adopted to get lower latency for LLMs.
A new API function, ipex.optimize_transformers, is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations at both the model level and the content-generation level. You just need to invoke the ipex.optimize_transformers function instead of the ipex.optimize function to apply all optimizations transparently. More detailed information can be found at Large Language Model optimizations overview. Specifically, this new release includes support for SmoothQuant and weight-only quantization (both INT8 weight and INT4 weight) to provide better performance and accuracy in low-precision scenarios.
Typical usage of this new feature is simple:
    import torch
    import intel_extension_for_pytorch as ipex
    ...
    model = ipex.optimize_transformers(model, dtype=dtype)
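To make the weight-only quantization idea mentioned above more concrete, here is a minimal conceptual sketch in plain Python. It is not the extension's implementation: the function names `quantize_rows_int8` and `linear_woq` are hypothetical, and the scheme shown (symmetric, per-output-channel INT8) is only one common way to do weight-only quantization. Weights are stored as small integers plus one scale per row, while activations stay in floating point.

```python
from typing import List, Tuple


def quantize_rows_int8(weight: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-output-channel INT8 quantization (illustrative only)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # One scale per output channel; guard against an all-zero row.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def linear_woq(x: List[float], q_rows: List[List[int]], scales: List[float]) -> List[float]:
    """Compute y = W x, dequantizing weights on the fly; x stays in floating point."""
    return [
        scale * sum(q * xi for q, xi in zip(row, x))
        for row, scale in zip(q_rows, scales)
    ]


# Toy 2x3 weight matrix and a single input vector (made-up values).
weight = [[0.50, -1.20, 0.03], [2.00, 0.10, -0.75]]
x = [1.0, 2.0, -1.0]

q_rows, scales = quantize_rows_int8(weight)
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
y_woq = linear_woq(x, q_rows, scales)
```

Each weight is off by at most half a scale step after rounding, so the output error of each row is bounded by `scale / 2 * sum(|x_i|)`. This is the accuracy cost paid for storing weights in a quarter of the FP32 footprint.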
-
torch.compile backend optimization with PyTorch Inductor (Experimental): We optimized Intel® Extension for PyTorch* to leverage PyTorch Inductor’s capability when working as a backend of torch.compile. This makes better use of torch.compile’s graph capture and Inductor’s scalable fusion capability, while still keeping the customized optimizations from Intel® Extension for PyTorch*.
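As a rough sketch of how the extension can be selected as a torch.compile backend, the helper below assumes that importing the extension registers a backend named "ipex" and that ipex.optimize accepts weights_prepack=False, as in the extension's documented torch.compile example; please verify both against the documentation for your installed version. The helper names `pick_compile_backend` and `compile_for_inference`, and the fallback to PyTorch's default "inductor" backend, are hypothetical choices made for illustration.

```python
import importlib.util


def pick_compile_backend() -> str:
    """Return "ipex" if the extension is importable, else PyTorch's default "inductor"."""
    if importlib.util.find_spec("intel_extension_for_pytorch") is not None:
        return "ipex"
    return "inductor"


def compile_for_inference(model):
    """Compile an inference model with the backend chosen above (assumed API usage)."""
    import torch  # imported lazily so pick_compile_backend() works without PyTorch

    model = model.eval()
    backend = pick_compile_backend()
    if backend == "ipex":
        # Assumption: importing the extension registers the "ipex" backend.
        import intel_extension_for_pytorch as ipex

        # Optional frontend optimizations; weight prepacking is turned off here,
        # following the extension's torch.compile example.
        model = ipex.optimize(model, weights_prepack=False)
    return torch.compile(model, backend=backend)
```

A call such as `compiled = compile_for_inference(model)` followed by `compiled(data)` under `torch.no_grad()` would then trigger graph capture and compilation on the first invocation.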
-
Performance optimization of static quantization under dynamic shape: We optimized the static quantization performance of Intel® Extension for PyTorch* for dynamic shapes. The usage is the same as the workflow for static shapes, except that inputs of variable shapes can be provided at runtime.
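The reason the workflow does not change can be illustrated with a small conceptual sketch in plain Python. It is not the extension's code, and the names `calibrate`, `quantize`, and `dequantize` are hypothetical. In static quantization the scale and zero point are fixed once during calibration, so the same parameters apply to inputs of any shape at runtime.

```python
from typing import List, Tuple


def calibrate(samples: List[List[float]]) -> Tuple[float, int]:
    """Min/max calibration to an asymmetric uint8 scale and zero point."""
    lo = min(0.0, min(min(s) for s in samples))  # keep 0.0 representable
    hi = max(0.0, max(max(s) for s in samples))
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x: List[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to uint8 using the fixed, pre-calibrated parameters."""
    return [max(0, min(255, round(v / scale) + zero_point)) for v in x]


def dequantize(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Map uint8 values back to floats."""
    return [(v - zero_point) * scale for v in q]


# Calibrate once on samples of several lengths (made-up values).
calibration_samples = [[-1.0, 0.5], [0.25, 2.0, -0.5], [1.5, 0.0, -0.75, 1.0, 0.1]]
scale, zero_point = calibrate(calibration_samples)

# At runtime, inputs of other lengths reuse the same scale and zero point.
short_input = [0.3]
long_input = [-1.0, 2.0, 0.0, 1.25, -0.2, 0.7, 1.9]
short_q = quantize(short_input, scale, zero_point)
long_q = quantize(long_input, scale, zero_point)
```

Because no quantization parameter depends on the input length, only the compute kernels need to handle varying shapes, which is where the performance work described above applies.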
-
Bug fixes and other optimizations