## AWS Neuron support
Accelerate now supports AWS Neuron (Trainium/Inferentia) devices. Thanks @michaelbenayoun for adding this.
- Neuron integration by @michaelbenayoun in #3935
## XPU Improvements
We've removed the IPEX dependency and made more code paths device-agnostic for XPU.
- using spawn instead of fork for XPU device by @kaixuanliu in #3884
- Remove ipex by @yao-matrix in #3883
- enhance new codes to XPU, and make them be device agnostic by @yao-matrix in #3890
- Fix KMP_AFFINITY incorrectly set for non-CPU training by @hexfaker in #3912
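The KMP_AFFINITY fix addresses thread pinning that only makes sense for CPU training. A toy sketch of the intended behavior (the helper name and the affinity string are made up for illustration; this is not Accelerate's actual code):

```python
import os

def maybe_set_kmp_affinity(device_type: str) -> None:
    # Thread-affinity pinning helps CPU training, but on accelerator
    # devices (xpu/cuda/...) it can starve data-loader threads, so the
    # variable should only be set for CPU runs.
    if device_type == "cpu":
        os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")
    else:
        os.environ.pop("KMP_AFFINITY", None)

maybe_set_kmp_affinity("xpu")
print("KMP_AFFINITY" in os.environ)  # -> False
```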
## FSDP2 Improvements
We've added a bunch of important fixes for FSDP2 users: upcasting only grad-requiring params, better tied embedding errors, DCP optimizer loading, bf16 optimizer step crash fix, and torch < 2.7.0 compatibility.
- Upcast FSDP2 parameters only if requires_grad by @ojh31 in #3848
- Fix FSDP2 tied embedding errors with targeted ValueError guidance by @amanzoni1 in #3878
- bug: fsdp cannot load optimizer state using dcp by @flymin in #3904
- fix crash in optimizer.step when fsdp2 is enabled and model is bfloat16 by @sywangyi in #3905
- Fix FSDP2 crash with ignored_params on torch < 2.7.0 by @Mr-Neutr0n in #3924
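The requires_grad-only upcast from #3848 can be illustrated with a toy stand-in (`FakeParam` and `upcast_trainable` are hypothetical names; the real fix operates on torch parameters inside FSDP2):

```python
from dataclasses import dataclass

@dataclass
class FakeParam:
    # Minimal stand-in for a torch parameter, for illustration only.
    dtype: str
    requires_grad: bool

def upcast_trainable(params):
    # Mirrors the fixed behavior: only gradient-requiring params are
    # upcast to fp32 for stable optimization; frozen params keep their
    # low-precision dtype and save memory.
    for p in params:
        if p.requires_grad and p.dtype != "float32":
            p.dtype = "float32"
    return params

params = [FakeParam("bfloat16", True), FakeParam("bfloat16", False)]
upcast_trainable(params)
print([p.dtype for p in params])  # -> ['float32', 'bfloat16']
```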
## DeepSpeed Sequence Parallelism
We've added several fixes to the DeepSpeed + Sequence Parallelism integration introduced in v1.12.0, including evaluation support during SP training and proper process group handling.
- [SP] fix loss computation example by @kashif in #3858
- [SP and CP] error out if both CP and SP enabled by @kashif in #3862
- DeepSpeed has its own process group by @kashif in #3916
- [Deepspeed] skip device mesh creation when deepspeed and sp_size >1 by @kashif in #3914
- Enable evaluation during deepspeed Sequence Parallel by @jp1924 in #3917
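The mutual-exclusion guard from #3862 can be sketched as follows (`validate_parallelism` is a hypothetical name; the real check lives inside Accelerate's parallelism configuration):

```python
def validate_parallelism(cp_size: int, sp_size: int) -> None:
    # Sketch of the guard: context parallelism (CP) and DeepSpeed
    # sequence parallelism (SP) cannot be combined, so fail fast with
    # a clear error instead of silently producing wrong results.
    if cp_size > 1 and sp_size > 1:
        raise ValueError(
            "Context parallelism and sequence parallelism cannot both "
            "be enabled; set cp_size=1 or sp_size=1."
        )

validate_parallelism(1, 4)  # SP alone is fine
```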
## FP8
We've improved FP8 training: the torchao default config now works with padding and FSDP2 all-gather, execution with Transformer Engine is fixed, and MS-AMP now emits deprecation warnings. Thanks @shimizust for fixing torchao support.
- Fix FP8 torchao default config with padding and FSDP2 all-gather support by @shimizust in #3831
- Fix execution with Transformer Engine by @ksivaman in #3852
- add MS-AMP deprecation warnings by @neha222222 in #3857
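The padding part of the torchao fix relates to a general FP8 constraint: FP8 GEMM kernels typically require tensor dimensions divisible by 16. A minimal sketch of that rounding (`pad_to_multiple` is a hypothetical helper, not Accelerate's implementation):

```python
def pad_to_multiple(dim: int, multiple: int = 16) -> int:
    # Round a dimension up to the next multiple (e.g. of 16 for FP8
    # kernels); the extra rows/columns would be zero-padded.
    return ((dim + multiple - 1) // multiple) * multiple

print(pad_to_multiple(100))  # -> 112
```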
## Performance
Accelerate now imports faster by deferring heavy dependencies, and torch.compile hooks are disabled lazily.
- Faster import by @SunMarc in #3953
- lazy compile disable by @SunMarc in #3947
- Disable hook compile by @SunMarc in #3888
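The faster-import change follows the common deferred-import pattern: heavy dependencies are loaded at first use rather than at package import time. A minimal stdlib sketch, using `json` as a stand-in for a genuinely heavy dependency:

```python
def dumps_config(config: dict) -> str:
    # The heavy module is imported only when this function is first
    # called, so importing the package itself stays fast.
    import json  # stand-in for a heavy optional dependency
    return json.dumps(config, sort_keys=True)

print(dumps_config({"fp8": True}))  # -> {"fp8": true}
```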
## Minor fixes
- Allow non-Tensor values in a batch with dispatch_batches=True by @tomaarsen in #3850
- fix module and optimizer parameter mismatch before prepare_tp_ by @naomili0924 in #3845
- Fix KeyError in extract_model_from_parallel for partial torch.compile by @amanzoni1 in #3881
- Fix hf_device_map device index comparison in prepare_model by @rezaqorbani in #3895
- Fix StatefulDataLoader KeyError with num_workers > 0 by @veeceey in #3931
- Fix stateful dataloader DDP by @SunMarc in #3952
- Fix: Remove duplicate W&B initialization in offline mode by @shantanugupta2004 in #3886
- Avoid using nvidia-smi on a CPU-only Colab instance by @FlorianVal in #3872
- Fix logging logic when in_order is set to True by @yuxinyuan in #3280
- Fix cpu offload check by @SunMarc in #3946
- fix bug when both cpu_ram_efficient_loading and cpu_offload are enabled by @kaixuanliu in #3910
- Fix async compatibility across python versions by @SunMarc in #3901
- fix tp only bug by @sywangyi in #3908
- fix parallelism_config None error by @jp1924 in #3927
- Np parall fix by @sywangyi in #3900
- change the default value of fsdp_min_num_params to int by @CodeMan62 in #3902
- Fix mutable default in Megatron init and IndexError on empty ModuleList by @jashshah999 in #3944
- Prepare TP fix by @michaelbenayoun in #3945
- feat: added fine tuning example focused on TPUs by @tengomucho in #3847
- Remove 8bit force hook for bnb by @SunMarc in #3907
- docs: flag MS-AMP as deprecated in low-precision training guides by @ManasVardhan in #3929
- fix: correct typo 'guarentee' to 'guarantee' by @thecaptain789 in #3922
- Updating support of Megatron-LM by @pengdurice in #3842
- Update support of Megatron-LM PR 2 by @pengdurice in #3887
- Fix RNG state setting for HPU by @michaelbenayoun in #3936
- fix: load the HPU RNG state by @michaelbenayoun in #3937