intel/intel-extension-for-pytorch v2.5.0+cpu on GitHub

We are excited to announce the release of Intel® Extension for PyTorch* 2.5.0+cpu, which accompanies PyTorch 2.5. This release mainly brings support for Llama 3.2, optimization for the newly launched Intel® Xeon® 6 P-core platform, GPTQ/AWQ format support, and the latest optimizations to improve LLM performance. It also includes a set of bug fixes and small optimizations. We sincerely thank our dedicated community for your contributions. As always, we encourage you to try this release and send feedback so that we can further improve the product.

Highlights

Llama 3.2 support
Meta has newly released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B). Intel® Extension for PyTorch* has supported Llama 3.2 since its launch date through an early release version, and now supports it in this official release.
Optimization for Intel® Xeon® 6
Intel® Xeon® 6 delivers new degrees of performance with more cores, a choice of microarchitecture, additional memory bandwidth, and exceptional input/output (I/O) across a range of workloads. Intel® Extension for PyTorch* provides dedicated optimization for this new processor family, covering features such as Multiplexed Rank DIMM (MRDIMM) and the SNC=3 scenario.
Large Language Model (LLM) optimization
Intel® Extension for PyTorch* extends its weight-only quantization support with GPTQ/AWQ format support and symmetric quantization of activations and weights, and adds chunked prefill/prefix prefill support to the LLM module API. These features enable better adoption of community model weights and provide better performance in low-precision scenarios. This release also extends the optimized models to include the newly published Llama 3.2 vision models. A full list of optimized models can be found at LLM optimization.
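To make the idea of symmetric quantization concrete, here is a minimal pure-Python sketch. It is not the extension's implementation or API. The helper names `quantize_symmetric` and `dequantize` are hypothetical, and the per-tensor int8 scheme is an assumption chosen for brevity.

```python
# Illustrative sketch only: these helpers are hypothetical and are not part of
# Intel Extension for PyTorch. They show per-tensor symmetric quantization.

def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using a range centered on zero."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max((abs(v) for v in values), default=0.0)
    # A symmetric scheme needs only a scale; the zero point is fixed at 0.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 1.00]
q_weights, scale = quantize_symmetric(weights)
restored = dequantize(q_weights, scale)

# Each restored value differs from the original by at most half a step.
max_error = max(abs(r - w) for r, w in zip(restored, weights))
print(q_weights, scale, max_error)
```

Because the zero point is fixed at 0, only the scale has to be stored alongside the integers. Asymmetric schemes, by contrast, also store a zero point.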
Bug fixing and other optimization
- Optimized the performance of the IndirectAccessKVCacheAttention kernel
  #3185 #3209 #3214 #3218 #3248
- Fixed the segmentation fault in the IndirectAccessKVCacheAttention kernel #3246
- Fixed the correctness issue in the PagedAttention kernel for Llama-68M-Chat-v1 #3307
- Fixed ipex.llm.optimize so that model.generate returns the correct output type when return_dict_in_generate is set to True #3333
- Optimized the performance of the Flash Attention kernel #3291
- Upgraded oneDNN to v3.6 #3305
