Better multinode support in the launcher
The accelerate launch command did not work well for distributed training using several machines. This is fixed in this version.
- Use torchrun for multinode by @muellerzr in #631
- Fix multi-node issues from launch by @muellerzr in #672
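For readers unfamiliar with torchrun, here is a minimal sketch of what a per-machine torchrun invocation looks like for a two-machine job. The helper name build_torchrun_command, the script name, the IP address, and the port are hypothetical values chosen for illustration; this is not Accelerate's internal launcher code, only an outline of the torchrun flags involved.

```python
import shlex


def build_torchrun_command(script, num_machines, machine_rank,
                           procs_per_machine, main_ip, main_port):
    """Hypothetical helper: assemble the torchrun arguments for one machine.

    Every machine runs the same command, except that node_rank differs.
    """
    return [
        "torchrun",
        f"--nnodes={num_machines}",               # total number of machines
        f"--node_rank={machine_rank}",            # index of this machine, starting at 0
        f"--nproc_per_node={procs_per_machine}",  # processes (usually GPUs) on this machine
        f"--master_addr={main_ip}",               # address of the rank-0 machine
        f"--master_port={main_port}",             # free port on the rank-0 machine
        script,
    ]


# Illustrative values only: 2 machines with 8 processes each.
cmd = build_torchrun_command("train.py", 2, 0, 8, "10.0.0.1", 29500)
print(shlex.join(cmd))
```

On the second machine the same command would be built with machine_rank set to 1.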
Launch training on specific GPUs only
Instead of prefixing your launch command with CUDA_VISIBLE_DEVICES=xxx, you can now specify the GPUs you want to use in your Accelerate config.
- Allow for GPU-ID specification on CLI by @muellerzr in #732
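As a reminder of what the older approach does, the sketch below builds an environment in which only selected GPUs are visible to a child process. The helper name env_for_gpus is hypothetical and the GPU IDs are illustrative; the point is only that CUDA_VISIBLE_DEVICES takes a comma-separated list of device IDs, which is the restriction the new config option is meant to express for you.

```python
import os


def env_for_gpus(gpu_ids, base_env=None):
    """Hypothetical helper: copy an environment and restrict the visible GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable as a comma-separated list of device IDs.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


# Illustrative: expose only GPUs 0 and 2 to a process started with this env.
env = env_for_gpus([0, 2], base_env={})
print(env["CUDA_VISIBLE_DEVICES"])  # prints 0,2
```

The resulting dictionary could be passed as the env argument of subprocess.run when starting a training script by hand.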
Better tracebacks and rich support
Tracebacks are now cleaned up to avoid printing the same error several times, and rich is integrated as an optional dependency.
- Integrate Rich into Accelerate by @muellerzr in #613
- Make rich an optional dep by @muellerzr in #673
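The optional-dependency pattern can be sketched as follows: rich tracebacks are enabled only when the rich package is installed, and the program otherwise falls back to Python's default tracebacks. The function name enable_rich_tracebacks is hypothetical, and this is an illustration of the general pattern rather than Accelerate's actual integration code.

```python
import importlib.util


def enable_rich_tracebacks():
    """Hypothetical helper: turn on rich tracebacks if rich is available.

    Returns True when rich was found and installed as the traceback handler,
    False when rich is not installed and the default handler is kept.
    """
    if importlib.util.find_spec("rich") is None:
        return False
    from rich.traceback import install

    # Replace the default exception hook with rich's formatted output.
    install(show_locals=False)
    return True


enabled = enable_rich_tracebacks()
print("rich tracebacks enabled:", enabled)
```

Because the import happens only after the availability check, a missing rich installation never raises an ImportError.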
What's new?
- Fix typo in docs/index.mdx by @mishig25 in #610
- Fix DeepSpeed CI by @muellerzr in #612
- Added GANs example to examples by @EyalMichaeli in #619
- Fix example by @muellerzr in #620
- Update README.md by @ezhang7423 in #622
- Fully remove subprocess from the multi-gpu launcher by @muellerzr in #623
- M1 mps fixes by @pacman100 in #625
- Fix multi-node issues and simplify param logic by @muellerzr in #627
- update MPS support docs by @pacman100 in #629
- minor tracker fixes for complete* examples by @pacman100 in #630
- Put back in place the guard by @muellerzr in #634
- make init_trackers to launch on main process by @Gladiator07 in #642
- remove check for main process for trackers initialization by @Gladiator07 in #643
- fix link by @philschmid in #645
- Add static_graph arg to DistributedDataParallelKwargs. by @rom1504 in #637
- Small nits to grad accum docs by @muellerzr in #656
- Saving hyperparams in yaml file for Tensorboard for #521 by @Shreyz-max in #657
- Use debug for loggers by @muellerzr in #655
- Improve docstrings more by @muellerzr in #666
- accelerate bibtex by @pacman100 in #660
- Cache torch_tpu check by @muellerzr in #670
- Manim animation of big model inference by @muellerzr in #671
- Add aim tracker for accelerate by @muellerzr in #649
- Specify local network on multinode by @muellerzr in #674
- Test for min torch version + fix all issues by @muellerzr in #638
- deepspeed enhancements and fixes by @pacman100 in #676
- DeepSpeed launcher related changes by @pacman100 in #626
- adding torchrun elastic params by @pacman100 in #680
- 🐛 fix by @pacman100 in #683
- Fix skip in dispatch dataloaders by @sgugger in #682
- Clean up DispatchDataloader a bit more by @sgugger in #686
- rng state sync for FSDP by @pacman100 in #688
- Fix DataLoader with samplers that are batch samplers by @sgugger in #687
- fixing support for Apple Silicon GPU in notebook_launcher by @pacman100 in #695
- fixing rng sync when using custom sampler and batch_sampler by @pacman100 in #696
- Improve init_empty_weights to override tensor constructor by @thomasw21 in #699
- override DeepSpeed grad_acc_steps from accelerator obj by @pacman100 in #698
- [doc] Fix 404'd link in memory usage guides by @tomaarsen in #702
- Add in report generation for test failures and make fail-fast false by @muellerzr in #703
- Update runners with report structure, adjust env variable by @muellerzr in #704
- docs: examples readability improvements by @ryanrussell in #709
- docs: utils readability fixups by @ryanrussell in #711
- refactor(test_tracking): key_occurrence readability fixup by @ryanrussell in #710
- docs: hooks readability improvements by @ryanrussell in #712
- sagemaker fixes and improvements by @pacman100 in #708
- refactor(accelerate): readability improvements by @ryanrussell in #713
- More docstring nits by @muellerzr in #715
- Allow custom device placements for different objects by @sgugger in #716
- Specify gradients in model preparation by @muellerzr in #722
- Fix regression issue by @muellerzr in #724
- Fix default for num processes by @sgugger in #726
- Build and Release docker images on a release by @muellerzr in #725
- Make running tests more efficient by @muellerzr in #611
- Fix old naming by @muellerzr in #727
- Fix issue with one-cycle logic by @muellerzr in #728
- Remove auto-bug label in issue template by @sgugger in #735
- Add a tutorial on proper benchmarking by @muellerzr in #734
- Add an example zoo to the documentation by @muellerzr in #737
- trlx by @muellerzr in #738
- Fix memory leak by @muellerzr in #739
- Include examples for CI by @muellerzr in #740
- Auto grad accum example by @muellerzr in #742