2.3.110+xpu
We are excited to announce the release of Intel® Extension for PyTorch* v2.3.110+xpu. This is the new release which supports Intel® GPU platforms (Intel® Data Center GPU Max Series, Intel® Arc™ A-Series Graphics, Intel® Core™ Ultra Processors with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics and Intel® Data Center GPU Flex Series) based on PyTorch* 2.3.1.
Highlights
-
oneDNN v3.5.3 integration
-
Intel® oneAPI Base Toolkit 2024.2.1 compatibility
-
Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics support on Windows (Prototype)
-
Large Language Model (LLM) optimization
Intel® Extension for PyTorch* provides a new dedicated module,
ipex.llm, to host for Large Language Models (LLMs) specific APIs. Withipex.llm, Intel® Extension for PyTorch* provides comprehensive LLM optimization on FP16 and INT4 datatypes. Specifically for low precision, Weight-Only Quantization is supported for various scenarios. And user can also run Intel® Extension for PyTorch* with Tensor Parallel to fit in the multiple ranks or multiple nodes scenarios to get even better performance.A typical API under this new module is
ipex.llm.optimize, which is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise.ipex.llm.optimizeis an upgrade API to replace previousipex.optimize_transformers, which will bring you more consistent LLM experience and performance. Below shows a simple example ofipex.llm.optimizefor fp16 inference:import torch import intel_extension_for_pytorch as ipex import transformers model= transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval() dtype = torch.float16 model = ipex.llm.optimize(model, dtype=dtype, device="xpu") model.generate(YOUR_GENERATION_PARAMS)
More examples of this API can be found at LLM optimization API.
Besides that, we optimized more LLM inference models. A full list of optimized models can be found at LLM Optimizations Overview.
-
Serving framework support
Typical LLM serving frameworks including vLLM and TGI can co-work with Intel® Extension for PyTorch* on Intel® GPU platforms (Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics). Besides the integration of LLM serving frameworks with
ipex.llmmodule level APIs, we enhanced the performance and quality of underneath Intel® Extension for PyTorch* operators such as paged attention and flash attention for better end to end model performance. -
Prototype support of full fine-tuning and LoRA PEFT with mixed precision
Intel® Extension for PyTorch* also provides new capability for supporting popular recipes with both full fine-tuning and LoRA PEFT for mixed precision with BF16 and FP32. We optimized many typical LLM models including Llama 2 (7B and 70B), Llama 3 8B, Phi-3-Mini 3.8B model families and Chinese model Qwen-7B, on both single GPU and Multi-GPU (distributed fine-tuning based on PyTorch FSDP) use cases.
Breaking Changes
- Block format support: oneDNN Block format integration support is being deprecated and will no longer be available starting from the release after v2.3.110+xpu.
Known Issues
Please refer to Known Issues webpage.