Intel® Extension for PyTorch* v2.6.0+cpu Release Notes


We are excited to announce the release of Intel® Extension for PyTorch* 2.6.0+cpu, which accompanies PyTorch 2.6. This release mainly brings full optimization for the latest Intel® Xeon® 6 P-core platform, new LLM model support including Falcon3/Jamba/DeepSeek V2.5, and the latest LLM optimizations including FP8 KV cache, GPTQ/AWQ support under Tensor Parallel mode, and INT8 computation for WOQ. This release also includes a set of bug fixes and small optimizations. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try this release and provide feedback so we can further improve the product.

Highlights

  • Comprehensive optimization for Intel® Xeon® 6
    Intel® Xeon® 6 processors deliver new levels of performance with more cores, a choice of microarchitecture, additional memory bandwidth, and exceptional input/output (I/O) across a range of workloads. Intel® Extension for PyTorch* v2.5 introduced basic optimizations for Intel® Xeon® 6; this release adds more comprehensive optimization, reflected in a set of typical AI models such as DLRM, Bert-Large, ViT, Stable Diffusion, LCM, GPT-J, and Llama. A minimal usage sketch follows the highlights list below.

  • Large Language Model (LLM) optimization:
    Intel® Extension for PyTorch* provides more feature support for weight-only quantization, including INT8-based computation leveraging AMX-INT8 on Intel® Xeon® 6, GPTQ/AWQ support under Tensor Parallel mode, and FP8 KV cache and FP16 general datatype support in the LLM module API. These features enable better adoption of community model weights and provide better performance for low-precision scenarios. This release also extends the set of optimized models to include newly published models such as Falcon3, DeepSeek V2.5, and Jamba. A full list of optimized models can be found at LLM optimization; a hedged weight-only quantization sketch also follows the highlights list below.

  • Bug fixes and other optimizations

    • Optimized LLM performance #3420 #3441 #3406 #3376
    • Supported loading INT4 checkpoint with Tensor Parallel #3328
    • Enabled TP=3 with INT4 checkpoint for Weight Only Quantization #3400
    • Supported sharding checkpoint with GPTQ policy #3423
    • Enabled lowp-mode=INT8 for NF4 weight #3395
    • Fixed the correctness issue in the Weight Only Quantization kernel for Llama3-11b-vision #3469
    • Upgraded oneDNN to v3.6.2 #3399
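
To benefit from the platform optimizations mentioned above, users typically wrap an eval-mode model with `ipex.optimize`. The snippet below is a minimal sketch assuming BF16 inference on a placeholder model; it is illustrative rather than a tuned recipe for the models listed in the highlights.

```python
# Minimal sketch: enabling Intel® Extension for PyTorch* CPU optimizations
# (operator fusion, weight prepacking, ISA dispatch such as AMX) for inference.
# The model and tensor shapes below are placeholders.
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).eval()

# Apply IPEX graph/kernel optimizations for BF16 inference on CPU.
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.amp.autocast("cpu", dtype=torch.bfloat16):
    output = model(torch.randn(8, 1024))
```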
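
For the LLM path, the weight-only quantization flow is typically driven through `ipex.llm.optimize` together with a WOQ qconfig. The sketch below assumes the `get_weight_only_quant_qconfig_mapping` helper and the `WoqWeightDtype`/`WoqLowpMode` enums with INT4 weights and INT8 computation; these names and the model id are assumptions based on prior releases, so please refer to the LLM optimization documentation for the authoritative 2.6.0 API.

```python
# Hedged sketch: weight-only quantization (WOQ) with INT8 computation for an LLM.
# The qconfig helper, enum names, and model id are assumptions; check the LLM
# optimization docs for the exact 2.6.0 API.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# INT4 weights with lowp-mode=INT8 computation (mapping to AMX-INT8 where available).
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=ipex.quantization.WoqWeightDtype.INT4,
    lowp_mode=ipex.quantization.WoqLowpMode.INT8,
)

model = ipex.llm.optimize(
    model,
    dtype=torch.bfloat16,
    quantization_config=qconfig,
    inplace=True,
)

inputs = tokenizer("An example prompt", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```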
