Experimental distributed operations checking framework
A new framework has been introduced which can help catch errors in distributed operations before they lead to a hang or timeout. As it adds a tiny bit of overhead, it is opt-in: simply run your code with `ACCELERATE_DEBUG_MODE="1"` to enable it. Read more in the docs, introduced via #1756
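As a quick illustration, here is a minimal sketch of opting in from inside a script rather than from the shell; the mismatched-shape gather is only an example of the kind of operation the checker is meant to flag, not a prescribed usage.

```python
# Minimal sketch: opting in to the experimental operations checker.
# ACCELERATE_DEBUG_MODE can also be exported in the shell before launching;
# setting it here just keeps the example self-contained.
import os

os.environ["ACCELERATE_DEBUG_MODE"] = "1"

import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Each process builds a tensor of a different shape. Without debug mode, a
# collective op on mismatched shapes can hang until the distributed timeout;
# with debug mode enabled, the framework is intended to surface a clear error
# up front instead.
tensor = torch.ones(accelerator.process_index + 1, device=accelerator.device)
gathered = accelerator.gather(tensor)
```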
`Accelerator.load_state` can now load the most recent checkpoint automatically
If a `ProjectConfiguration` has been made, calling `accelerator.load_state()` with no arguments will now automatically find and load the latest checkpoint saved, introduced via #1741
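A minimal sketch of the new behavior, assuming automatic checkpoint naming; the project directory path is a placeholder.

```python
# Minimal sketch: letting load_state() locate the newest checkpoint on its own.
from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

project_config = ProjectConfiguration(
    project_dir="runs/my-experiment",   # placeholder path
    automatic_checkpoint_naming=True,   # saves land under <project_dir>/checkpoints/
)
accelerator = Accelerator(project_config=project_config)

# ... training; periodically call accelerator.save_state() ...

# With no argument, load_state() finds and restores the most recent checkpoint
# saved under the project directory.
accelerator.load_state()
```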
Multiple enhancements to gradient accumulation
This release adds multiple enhancements to distributed gradient accumulation:
- `accelerator.accumulate()` now supports passing in multiple models (see the sketch below the list), introduced via #1708
- A util has been introduced to perform multiple forwards, then multiple backwards, and finally sync the gradients only on the last `.backward()`, via #1726
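Below is a minimal sketch of the multi-model form of `accumulate()`; the toy generator/discriminator pair, the loss, and the random data are placeholders for illustration.

```python
# Minimal sketch: gradient accumulation over more than one model at once.
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

generator = torch.nn.Linear(8, 8)
discriminator = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(
    list(generator.parameters()) + list(discriminator.parameters()), lr=1e-3
)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 8))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)

generator, discriminator, optimizer, dataloader = accelerator.prepare(
    generator, discriminator, optimizer, dataloader
)

for (batch,) in dataloader:
    # Passing both models means gradient synchronization for each of them is
    # skipped until the last micro-batch of the accumulation window.
    with accelerator.accumulate(generator, discriminator):
        loss = discriminator(generator(batch)).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```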
FSDP Changes
- FSDP support has been added for NPU and XPU devices via #1803 and #1806
- Support for RAM-efficient loading of models with FSDP has been added via #1777
DataLoader Changes
- Custom slice functions are now supported in the `DataLoaderDispatcher`, added via #1846
What's New?
- fix failing test on 8GPU by @statelesshz in #1724
- Better control over DDP's `no_sync` by @NouamaneTazi in #1726
- Get rid of calling `get_scale()` by patching the step method of optimizer. by @yuxinyuan in #1720
- fix the bug in npu by @statelesshz in #1728
- Adding a shape check for `set_module_tensor_to_device`. by @Narsil in #1731
- Fix errors when optimizer is not a Pytorch optimizer. by @yuxinyuan in #1733
- Make balanced memory able to work with non contiguous GPUs ids by @thomwolf in #1734
- Fixed typo in `__repr__` of AlignDevicesHook by @KacperWyrwal in #1735
- Update docs by @muellerzr in #1736
- Fixed the bug that split dict incorrectly by @yuangpeng in #1742
- Let load_state automatically grab the latest save by @muellerzr in #1741
- fix `KwargsHandler.to_kwargs` not working with `os.environ` initialization in `__post_init__` by @CyCle1024 in #1738
- fix typo by @cauyxy in #1747
- Check for misconfiguration of single node & single GPU by @muellerzr in #1746
- Remove unused constant by @muellerzr in #1749
- Rework new constant for operations by @muellerzr in #1748
- Expose `autocast` kwargs and simplify `autocast` wrapper by @muellerzr in #1740
- Fix FSDP related issues by @pacman100 in #1745
- FSDP enhancements and fixes by @pacman100 in #1753
- Fix check failure in `Accelerator.save_state` using multi-gpu by @CyCle1024 in #1760
- Fix error when `max_memory` argument is in unexpected order by @ranchlai in #1759
- Fix offload on disk when executing on CPU by @sgugger in #1762
- Change `is_aim_available()` function to not match aim >= 4.0.0 by @alberttorosyan in #1769
- Introduce an experimental distributed operations framework by @muellerzr in #1756
- Support wrapping multiple models in Accelerator.accumulate() by @yuxinyuan in #1708
- Contiguous on gather by @muellerzr in #1771
- [FSDP] Fix `load_fsdp_optimizer` by @awgu in #1755
- simplify and correct the deepspeed example by @pacman100 in #1775
- Set ipex default in state by @muellerzr in #1776
- Fix import error when torch>=2.0.1 and `torch.distributed` is disabled by @natsukium in #1800
- reserve 10% GPU in `get_balanced_memory` to avoid OOM by @ranchlai in #1798
- add support of float memory size in `convert_file_size_to_int` by @ranchlai in #1799
- Allow users to resume from previous wandb runs with `allow_val_change` by @SumanthRH in #1796
- Add FSDP for XPU by @abhilash1910 in #1803
- Add FSDP for NPU by @statelesshz in #1806
- Fix pytest import by @muellerzr in #1808
- More specific logging in `gather_for_metrics` by @dleve123 in #1784
- Detect device map auto and raise a helpful error when trying to not use model parallelism by @muellerzr in #1810
- Typo fix by @muellerzr in #1812
- Expand device-map warning by @muellerzr in #1819
- Update bibtex to reflect team growth by @muellerzr in #1820
- Improve docs on grad accumulation by @vwxyzjn in #1817
- add warning when using to and cuda by @SunMarc in #1790
- Fix bnb import by @muellerzr in #1813
- Update docs and docstrings to match `load_and_quantize_model` arg by @JonathanRayner in #1822
- Expose a bit of args/docstring fixup by @muellerzr in #1824
- Better test by @muellerzr in #1825
- Minor idiomatic change for fp8 check. by @float-trip in #1829
- Use device as context manager for `init_on_device` by @shingjan in #1826
- Ipex bug fix for device properties in modelling by @abhilash1910 in #1834
- FIX: Bug with `unwrap_model` and `keep_fp32_wrapper=False` by @BenjaminBossan in #1838
- Fix `verify_device_map` by @Rexhaif in #1842
- Change CUDA check by @muellerzr in #1833
- Fix the noneffective parameter: `gpu_ids` (Rel. Issue #1848) by @devymex in #1850
- support for ram efficient loading of model with FSDP by @pacman100 in #1777
- Loading logic safetensors by @SunMarc in #1853
- fix dispatch for quantized model by @SunMarc in #1855
- Update `fsdp_with_peak_mem_tracking.py` by @pacman100 in #1856
- Add env variable for `init_on_device` by @shingjan in #1852
- remove casting to FP32 when saving state dict by @pacman100 in #1868
- support custom slice function in `DataLoaderDispatcher` by @thevasudevgupta in #1846
- Include a note to the forums in the bug report by @muellerzr in #1871
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @yuxinyuan
- @NouamaneTazi
  - Better control over DDP's `no_sync` (#1726)
- @abhilash1910
- @statelesshz
- @thevasudevgupta
  - support custom slice function in `DataLoaderDispatcher` (#1846)
Full Changelog: v0.21.0...v0.22.0