v1.13.0: Neuron support, IPEX removal, and distributed training fixes


AWS Neuron support

We now have support for AWS Neuron (Trainium and Inferentia) devices. Thanks to @michaelbenayoun for adding this.

XPU Improvements

We've removed the IPEX dependency and improved device-agnostic code for XPU.

FSDP2 Improvements

We've added several important fixes for FSDP2 users: upcasting only parameters that require gradients, clearer errors for tied embeddings, DCP optimizer state loading, a fix for an optimizer.step crash with bf16 models, and compatibility with torch < 2.7.0.

  • Upcast FSDP2 parameters only if requires_grad by @ojh31 in #3848
  • Fix FSDP2 tied embedding errors with targeted ValueError guidance by @amanzoni1 in #3878
  • bug: fsdp cannot load optimizer state using dcp by @flymin in #3904
  • fix crash in optimizer.step when fsdp2 is enabled and model is bfloat16 by @sywangyi in #3905
  • Fix FSDP2 crash with ignored_params on torch < 2.7.0 by @Mr-Neutr0n in #3924
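To illustrate what the upcast fix (#3848) changes, here is a toy sketch of the behavior. `Param` and string dtypes are illustrative stand-ins, not torch or Accelerate classes:

```python
from dataclasses import dataclass


@dataclass
class Param:
    # Stand-in for a torch parameter; the real fix operates on torch.nn.Parameter.
    name: str
    dtype: str
    requires_grad: bool


def upcast_for_mixed_precision(params):
    """Upcast only trainable parameters to fp32.

    Sketch of the fixed behavior: frozen (requires_grad=False) parameters
    keep their low-precision dtype instead of being upcast and wasting memory.
    """
    for p in params:
        if p.requires_grad and p.dtype == "bf16":
            p.dtype = "fp32"
    return params


params = [
    Param("lm_head.weight", "bf16", True),
    Param("frozen_embedding.weight", "bf16", False),
]
upcast_for_mixed_precision(params)
print([(p.name, p.dtype) for p in params])
# [('lm_head.weight', 'fp32'), ('frozen_embedding.weight', 'bf16')]
```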

DeepSpeed Sequence Parallelism

We've added several fixes to the DeepSpeed + Sequence Parallelism integration introduced in v1.12.0, including evaluation support during SP training and proper process group handling.

  • [SP] fix loss computation example by @kashif in #3858
  • [SP and CP] error out if both CP and SP enabled by @kashif in #3862
  • DeepSpeed has its own process group by @kashif in #3916
  • [Deepspeed] skip device mesh creation when deepspeed and sp_size >1 by @kashif in #3914
  • Enable evaluation during deepspeed Sequence Parallel by @jp1924 in #3917

FP8

We've improved FP8 training. Thanks to @shimizust for fixing torchao support.

Performance

Accelerate now imports faster by deferring heavy dependencies, and torch.compile hooks are disabled lazily.

Minor fixes
