github huggingface/accelerate v0.22.0
v0.22.0: Distributed operation framework, Gradient Accumulation enhancements, FSDP enhancements, and more!


Experimental distributed operations checking framework

A new framework has been introduced that can catch timeout errors caused by failing distributed operations before they occur. Because it adds a small amount of overhead, it is opt-in: simply run your code with ACCELERATE_DEBUG_MODE="1" to enable it. Read more in the docs; introduced via #1756
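Since the flag is a plain environment variable, it can also be enabled from inside a launcher script. A minimal sketch (only the ACCELERATE_DEBUG_MODE variable itself comes from these notes; the setup order is an assumption about when the value is read):

```python
import os

# Opt in to the distributed-operations debug checks. Set the variable
# before importing accelerate so it is visible at startup.
os.environ["ACCELERATE_DEBUG_MODE"] = "1"

# ... then import accelerate and build the Accelerator as usual.
```

Equivalently, set it on the command line when launching, e.g. ACCELERATE_DEBUG_MODE="1" accelerate launch train.py (train.py being a placeholder script name).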

Accelerator.load_state can now load the most recent checkpoint automatically

If a ProjectConfiguration has been set up, calling accelerator.load_state() with no arguments will now automatically find and load the most recent checkpoint. Introduced via #1741

Multiple enhancements to gradient accumulation

This release adds multiple enhancements to distributed gradient accumulation.

  • accelerator.accumulate() now supports passing in multiple models, introduced via #1708
  • A util has been introduced to perform multiple forward and backward passes, syncing the gradients only on the last .backward(), via #1726
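The idea behind that util is to skip DDP's cross-process gradient all-reduce on intermediate micro-batches and synchronize only on the final .backward() of each accumulation window. A framework-free sketch of the control flow (the should_sync helper and the toy loop are illustrative, not Accelerate's implementation):

```python
def should_sync(step: int, accumulation_steps: int) -> bool:
    """Sync gradients only on the last micro-batch of each accumulation window."""
    return (step + 1) % accumulation_steps == 0

# Toy "training loop": gradients pile up locally on every micro-batch,
# and the (simulated) all-reduce fires once per window.
accumulation_steps = 4
synced_at = []
for step in range(8):
    # forward + backward would happen here
    if should_sync(step, accumulation_steps):
        synced_at.append(step)  # stand-in for the cross-process all-reduce
```

With 8 micro-batches and a window of 4, the sync fires at steps 3 and 7, so the expensive communication happens twice instead of eight times.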

FSDP Changes

  • FSDP support has been added for NPU and XPU devices via #1803 and #1806
  • A new method for supporting RAM-efficient loading of models with FSDP has been added via #1777

DataLoader Changes

  • Custom slice functions are now supported in the DataLoaderDispatcher added via #1846
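A slice function decides which piece of a dispatched batch each process receives. The exact signature added in #1846 is not shown here; the sketch below is a generic, framework-free illustration of sharding a batch across processes (the slice_batch name and the even-split policy are assumptions):

```python
def slice_batch(batch, process_index: int, num_processes: int):
    """Return this process's contiguous shard of a list-like batch.

    Illustrative only: a custom slice function could instead group by
    sequence length, interleave samples, etc.
    """
    shard = len(batch) // num_processes
    start = process_index * shard
    # The last process picks up any remainder.
    end = start + shard if process_index < num_processes - 1 else len(batch)
    return batch[start:end]

batch = list(range(10))
shards = [slice_batch(batch, rank, 3) for rank in range(3)]
# Every sample ends up on exactly one process.
```

The point of making this pluggable is that the default even split is not always right, e.g. when samples must stay grouped.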

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @yuxinyuan
    • Support wrapping multiple models in Accelerator.accumulate() (#1708)
    • Fix errors when optimizer is not a Pytorch optimizer. (#1733)
    • Get rid of calling get_scale() by patching the step method of optimizer. (#1720)
  • @NouamaneTazi
    • Better control over DDP's no_sync (#1726)
  • @abhilash1910
    • Add FSDP for XPU (#1803)
    • Ipex bug fix for device properties in modelling (#1834)
  • @statelesshz
    • Add FSDP for NPU (#1806)
    • fix failing test on 8GPU (#1724)
    • fix the bug in npu (#1728)
  • @thevasudevgupta
    • support custom slice function in DataLoaderDispatcher (#1846)

Full Changelog: v0.21.0...v0.22.0
