2.3.110+xpu
We are excited to announce the release of Intel® Extension for PyTorch* v2.3.110+xpu. This release supports Intel® GPU platforms (Intel® Data Center GPU Flex Series, Intel® Data Center GPU Max Series, and Intel® Arc™ A-Series Graphics) and is based on PyTorch* 2.3.1.
Highlights
- Intel® oneDNN v3.5.3 integration
- Intel® oneAPI Base Toolkit 2024.2.1 compatibility
- Large Language Model (LLM) optimization

Intel® Extension for PyTorch* provides a new dedicated module, `ipex.llm`, to host LLM-specific APIs. With `ipex.llm`, Intel® Extension for PyTorch* provides comprehensive LLM optimization for the FP16 and INT4 data types. For low precision specifically, Weight-Only Quantization is supported for various scenarios. Users can also run Intel® Extension for PyTorch* with Tensor Parallel to fit multi-rank or multi-node scenarios and get even better performance.

A typical API under this new module is `ipex.llm.optimize`, which is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides both model-level and content-generation-level optimizations. `ipex.llm.optimize` is an upgraded API that replaces the previous `ipex.optimize_transformers`, bringing a more consistent LLM experience and better performance. Below is a simple example of `ipex.llm.optimize` for FP16 inference:

```python
import torch
import intel_extension_for_pytorch as ipex
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()

dtype = torch.float16
model = ipex.llm.optimize(model, dtype=dtype, device="xpu")

model.generate(YOUR_GENERATION_PARAMS)
```
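The snippet above leaves `model_name_or_path` and `YOUR_GENERATION_PARAMS` as placeholders. For context, here is a minimal end-to-end sketch of what a full generation call might look like, assuming a standard Hugging Face tokenizer; the checkpoint name, prompt, and `max_new_tokens` value are illustrative placeholders, not recommendations.

```python
import torch
import intel_extension_for_pytorch as ipex
import transformers

# Placeholder checkpoint; substitute any supported causal LM.
model_name_or_path = "meta-llama/Llama-2-7b-hf"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name_or_path, torch_dtype=torch.float16
).eval().to("xpu")

# Apply the LLM-specific optimizations for FP16 inference on XPU.
model = ipex.llm.optimize(model, dtype=torch.float16, device="xpu")

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```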
More examples of this API can be found at LLM optimization API.
Besides that, we optimized more LLM inference models. A full list of optimized models can be found at LLM Optimizations Overview.
- Serving framework support

Typical LLM serving frameworks, including vLLM and TGI, can work with Intel® Extension for PyTorch* on Intel® GPU platforms (Intel® Data Center GPU Max 1550 and Intel® Arc™ A-Series Graphics). Besides integrating these serving frameworks with the `ipex.llm` module-level APIs, we enhanced the performance and quality of the underlying Intel® Extension for PyTorch* operators, such as paged attention and flash attention, for better end-to-end model performance. A rough vLLM usage sketch is shown below.
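As an illustration of the offline-inference path through such a serving framework, the sketch below uses vLLM's generic Python API; it assumes a vLLM installation built with its Intel® GPU (XPU) backend enabled, and the checkpoint name and sampling parameters are placeholders rather than a validated configuration.

```python
# Offline-inference sketch with vLLM. Assumes a vLLM installation built with
# XPU support so that the Intel® Extension for PyTorch* kernels (for example,
# paged attention) are used underneath; checkpoint and sampling values are
# placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-1.3b")  # placeholder checkpoint
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What is the capital of France?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```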
- Prototype support of full fine-tuning and LoRA PEFT with mixed precision
Intel® Extension for PyTorch* also adds prototype support for popular fine-tuning recipes, covering both full fine-tuning and LoRA PEFT with BF16/FP32 mixed precision. We optimized many typical LLM models, including the Llama 2 (7B and 70B), Llama 3 8B, and Phi-3-Mini 3.8B model families as well as the Chinese model Qwen-7B, for both single-GPU and multi-GPU (distributed fine-tuning based on PyTorch FSDP) use cases. A minimal single-GPU LoRA sketch follows below.
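As a rough single-GPU illustration of the LoRA PEFT path, the sketch below wraps a causal LM with Hugging Face PEFT and runs one BF16 training step on XPU. The checkpoint, LoRA hyperparameters, target module names, and dummy batch are placeholders; a real recipe would use a proper dataset, an FSDP setup for multi-GPU runs, and a BF16/FP32 mixed-precision policy.

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device
import transformers
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint and hyperparameters; substitute your own.
model_name_or_path = "meta-llama/Llama-2-7b-hf"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name_or_path, torch_dtype=torch.bfloat16
)

# Attach LoRA adapters; target module names depend on the model architecture.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config).to("xpu")
model.print_trainable_parameters()

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# One illustrative training step on a dummy batch.
batch = tokenizer(["Hello, world!"], return_tensors="pt").to("xpu")
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

For multi-GPU runs, the release notes describe distributed fine-tuning based on PyTorch FSDP, which would wrap the model before the training loop instead of running a single-device step as above.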
Known Issues
Please refer to the Known Issues webpage.