We are excited to announce the release of Intel® Extension for PyTorch* 2.4.0+cpu, which accompanies PyTorch 2.4. This release brings support for Llama 3.1, basic support for LLM serving frameworks such as vLLM and TGI, and a set of optimizations that further improve LLM performance. It also broadens the list of optimized LLM models and includes a number of bug fixes and minor optimizations. We sincerely thank our dedicated community for your contributions. As always, we encourage you to try this release and provide feedback so we can continue to improve the product.
Highlights
- Llama 3.1 support
Meta has recently released Llama 3.1 with new features such as a longer context length (128K). Intel® Extension for PyTorch* has supported Llama 3.1 since its launch date through an early release version, and that support is now included in this official release.
- Serving framework support
Typical LLM serving frameworks, including vLLM and TGI, can now work with Intel® Extension for PyTorch* to deliver optimized performance on Xeon® Scalable processors. Besides integrating LLM serving frameworks through the ipex.llm module level APIs, we continue to optimize the performance and quality of the underlying Intel® Extension for PyTorch* operators such as paged attention and flash attention. The ipex.llm module level APIs also add new support for 4-bit AWQ quantization based on weight-only quantization, and for distributed communication with shared-memory optimization.
- Large Language Model (LLM) optimization:
Intel® Extension for PyTorch* further improves the performance of the weight-only quantization kernels, enables more fusion pattern variants for LLMs, and extends the set of optimized models to include whisper, falcon-11b, Qwen2, and of course Llama 3.1. A full list of optimized models can be found in the LLM optimization documentation; a minimal usage sketch of the ipex.llm APIs is shown after this list.
- Bug fixing and other optimizations
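
As a minimal sketch of how the ipex.llm module level API is typically applied to a Hugging Face causal LM for bfloat16 inference on CPU (the model identifier and generation settings below are illustrative assumptions, not the official recipe; see the LLM optimization documentation for verified examples):

```python
# Minimal sketch: optimizing a Hugging Face causal LM with ipex.llm.optimize for
# bfloat16 CPU inference. Model id and prompt are placeholders for illustration.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed model id for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Apply Intel® Extension for PyTorch* LLM-specific optimizations
# (operator fusion, optimized attention kernels, etc.) for inference.
model = ipex.llm.optimize(model, dtype=torch.bfloat16, inplace=True, deployment_mode=True)

with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    inputs = tokenizer("What is new in Llama 3.1?", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```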
Full Changelog: v2.3.0+cpu...v2.4.0+cpu