github intel/intel-extension-for-pytorch v2.3.110+xpu
Intel® Extension for PyTorch* v2.3.110+xpu Release Notes

latest release: v2.5.0+cpu
one month ago

2.3.110+xpu

We are excited to announce the release of Intel® Extension for PyTorch* v2.3.110+xpu. This is the new release which supports Intel® GPU platforms (Intel® Data Center GPU Flex Series, Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics) based on PyTorch* 2.3.1.

Highlights

  • Intel® oneDNN v3.5.3 integration

  • Intel® oneAPI Base Toolkit 2024.2.1 compatibility

  • Large Language Model (LLM) optimization

    Intel® Extension for PyTorch* provides a new dedicated module, ipex.llm, to host for Large Language Models (LLMs) specific APIs. With ipex.llm, Intel® Extension for PyTorch* provides comprehensive LLM optimization on FP16 and INT4 datatypes. Specifically for low precision, Weight-Only Quantization is supported for various scenarios. And user can also run Intel® Extension for PyTorch* with Tensor Parallel to fit in the multiple ranks or multiple nodes scenarios to get even better performance.

    A typical API under this new module is ipex.llm.optimize, which is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise. ipex.llm.optimize is an upgrade API to replace previous ipex.optimize_transformers, which will bring you more consistent LLM experience and performance. Below shows a simple example of ipex.llm.optimize for fp16 inference:

      import torch
      import intel_extension_for_pytorch as ipex
      import transformers
    
      model= transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()
    
      dtype = torch.float16
      model = ipex.llm.optimize(model, dtype=dtype, device="xpu")
    
      model.generate(YOUR_GENERATION_PARAMS)

    More examples of this API can be found at LLM optimization API.

    Besides that, we optimized more LLM inference models. A full list of optimized models can be found at LLM Optimizations Overview.

  • Serving framework support

    Typical LLM serving frameworks including vLLM and TGI can co-work with Intel® Extension for PyTorch* on Intel® GPU platforms (Intel® Data Center GPU Max 1550 and Intel® Arc™ A-Series Graphics). Besides the integration of LLM serving frameworks with ipex.llm module level APIs, we enhanced the performance and quality of underneath Intel® Extension for PyTorch* operators such as paged attention and flash attention for better end to end model performance.

  • Prototype support of full fine-tuning and LoRA PEFT with mixed precision

    Intel® Extension for PyTorch* also provides new capability for supporting popular recipes with both full fine-tuning and LoRA PEFT for mixed precision with BF16 and FP32. We optimized many typical LLM models including Llama 2 (7B and 70B), Llama 3 8B, Phi-3-Mini 3.8B model families and Chinese model Qwen-7B, on both single GPU and Multi-GPU (distributed fine-tuning based on PyTorch FSDP) use cases.

Known Issues

Please refer to Known Issues webpage.

Don't miss a new intel-extension-for-pytorch release

NewReleases is sending notifications on new releases.