FSDP2 Improvements
This release brings a large batch of FSDP2 fixes and quality-of-life improvements: correct dtype handling on load, sharding of embeddings/norms, QLoRA crash prevention, and a more robust auto-wrap policy.
- Fsdp2 fully_shard embedding and norm by @SunMarc in #4015
- Fix fsdp2 load full state dict dtype mismatch by @SunMarc in #4021
- Fix region compilation fsdpv2 by @SunMarc in #4022
- [FSDP2] Cast model to uniform dtype before fully_shard to fix mixed-dtype AssertionError by @roycho96 in #3985
- [FSDP2] Auto-exclude non-floating frozen Params4bit from fully_shard to prevent QLoRA crash by @roycho96 in #3987
- fix(FSDP2): auto-wrap policy ignoring _no_split_modules fallback by @JohnGiorgi in #3999
- fix: use key-based matching in fsdp2_load_full_state_dict by @roycho96 in #3982
- fix: add missing model_has_params4bit guard to fsdp2_load_full_state_dict call by @roycho96 in #3981
- Fix to-fsdp2: drop REMOVED / NOT_YET_IMPLEMENTED FSDP1 keys instead of leaking them by @lollinng in #4065
- Prevent double-wrapping models in prepare_model() by @joshuaswanson in #3977
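The key-based matching fix (#3982) can be illustrated with plain dictionaries. The following is a minimal sketch, not Accelerate's implementation: `load_by_position` and `load_by_key` are hypothetical helpers, and strings stand in for tensors. It shows one plausible way that pairing entries by iteration order goes wrong when the sharded model and the full state dict list parameters in different orders.

```python
def load_by_position(sharded_keys, full_state_dict):
    """Pair each sharded key with the full-state value at the same index."""
    return dict(zip(sharded_keys, full_state_dict.values()))


def load_by_key(sharded_keys, full_state_dict):
    """Look each sharded key up by name, failing loudly if one is missing."""
    missing = [key for key in sharded_keys if key not in full_state_dict]
    if missing:
        raise KeyError(f"missing keys in full state dict: {missing}")
    return {key: full_state_dict[key] for key in sharded_keys}


# Same parameters, but the full state dict happens to be ordered differently.
sharded_keys = ["embed.weight", "norm.weight", "head.weight"]
full_state_dict = {
    "norm.weight": "NORM",
    "embed.weight": "EMBED",
    "head.weight": "HEAD",
}

by_position = load_by_position(sharded_keys, full_state_dict)
by_key = load_by_key(sharded_keys, full_state_dict)
```

In this toy case, positional pairing silently assigns the norm weights to `embed.weight`, while the key-based lookup is independent of ordering and raises when a name is absent.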
AMD ROCm support
Accelerate now works end-to-end on AMD ROCm devices. Thanks @Abdennacer-Badaoui!
- Make accelerate work end-to-end on AMD ROCm by @Abdennacer-Badaoui in #4025
Neuron
This release continues the Neuron work: padded allgather and broadcast reduce recompilation, and a missing device case is now handled.
- Add padded allgather and broadcast for Neuron devices to reduce recompilation by @czkkkkkk in #4000
- fix: add missing neuron device case by @michaelbenayoun in #4042
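The idea behind padded collectives can be sketched in pure Python. This is an illustration of the general technique only, under the assumption that recompilation is triggered by new input shapes; `next_bucket`, `pad`, and `padded_all_gather` are hypothetical names, lists stand in for tensors, and the bucket size of 8 is arbitrary. Inputs are padded up to a bucketed length so that only a small set of shapes ever reaches the collective, and the padding is trimmed off afterwards using the true lengths.

```python
def next_bucket(length, bucket=8):
    """Round length up to a multiple of bucket (at least one bucket)."""
    return max(bucket, -(-length // bucket) * bucket)


def pad(values, target_len, pad_value=0):
    """Right-pad a list with pad_value up to target_len."""
    return values + [pad_value] * (target_len - len(values))


def padded_all_gather(per_rank_values, bucket=8):
    """Simulate an all-gather over ranks whose inputs differ in length."""
    lengths = [len(values) for values in per_rank_values]
    target = next_bucket(max(lengths), bucket)
    # Every rank contributes the same padded shape.
    padded = [pad(values, target) for values in per_rank_values]
    # In a real collective, each rank would receive `padded` here.
    trimmed = [row[:n] for row, n in zip(padded, lengths)]
    return trimmed, target


per_rank = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
gathered, padded_len = padded_all_gather(per_rank)
```

With bucketing, inputs of length 1 through 8 all share one padded shape, so a shape-specialized compiler would see a single shape instead of up to eight.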
Quantization & Offloading
This release improves offloading for quantized models (Torchao and int8), fixes keep_in_fp32_modules for tied weights, and corrects dtype_byte_size for additional FP8 dtypes.
- Torchao offload by @SunMarc in #3973
- Fix int8 offload hook detachment statistics restoration by @jiqing-feng in #4044
- Fix keep_in_fp32_modules not working for tied weights in load_and_quantize_model by @jiqing-feng in #4043
- Fix dtype_byte_size for FP8 fnuz / e8m0fnu dtypes by @lollinng in #4063
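To see why dtype names such as `float8_e4m3fnuz` or `float8_e8m0fnu` can trip up a size lookup, consider a parser that reads the bit width from the trailing digits of the name. The sketch below is an assumption for illustration, not the code changed in #4063: both helper functions are hypothetical and operate on dtype names as plain strings. It contrasts a trailing-digit parser with one that reads the width right after the type family.

```python
import re


def bits_from_trailing_digits(dtype_name):
    """Read the bit width from digits at the very end of the name."""
    match = re.search(r"[^\d](\d+)$", dtype_name)
    return int(match.group(1)) if match else None


def bits_from_leading_width(dtype_name):
    """Read the bit width that follows the type family, e.g. 'float8'."""
    match = re.match(r"(?:torch\.)?[a-z]+?(\d+)", dtype_name)
    return int(match.group(1)) if match else None


def byte_size(dtype_name):
    """Return the size in bytes for widths that are multiples of 8."""
    bits = bits_from_leading_width(dtype_name)
    if bits is None:
        raise ValueError(f"cannot infer bit width from {dtype_name!r}")
    return bits // 8


names = ["torch.bfloat16", "torch.float8_e4m3fnuz", "torch.float8_e8m0fnu"]
trailing = {name: bits_from_trailing_digits(name) for name in names}
sizes = {name: byte_size(name) for name in names}
```

The trailing-digit parser finds nothing for the two FP8 names because they end in letters, whereas reading the width after the family name yields 8 bits, that is one byte.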
Data Loading
- Feat: Support dynamic batch size in BatchSamplerShard with even_batches by @yuxinyuan in #3969
- Fix iterable dataset sharding condition when n_shards == num_processes by @SunMarc in #3958
- Fix implicit padding in split_between_processes when apply_padding=False and num_samples < num_processes by @3manifold in #4052
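The padding fix (#4052) concerns the case where there are fewer samples than processes. The sketch below shows the behavior the PR title describes, using a hypothetical `split_for_process` function rather than Accelerate's own `split_between_processes`: without padding, processes beyond the data receive an empty chunk, and only `apply_padding=True` repeats the last sample.

```python
def split_for_process(items, num_processes, process_index, apply_padding=False):
    """Return the slice of items owned by one process."""
    per_process, extras = divmod(len(items), num_processes)
    # The first `extras` processes each take one additional item.
    start = process_index * per_process + min(process_index, extras)
    end = start + per_process + (1 if process_index < extras else 0)
    chunk = list(items[start:end])
    if apply_padding and items:
        # Pad with the last item so every process gets the same length.
        target = per_process + (1 if extras > 0 else 0)
        chunk += [items[-1]] * (target - len(chunk))
    return chunk


items = ["a", "b"]
unpadded = [split_for_process(items, 4, rank) for rank in range(4)]
padded = [split_for_process(items, 4, rank, apply_padding=True) for rank in range(4)]
```

Here `unpadded` leaves the last two processes with empty lists, while `padded` gives each of them a copy of the final sample.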
Minor fixes
- [DeepSpeed] allow kernels flash-attn in SP by @kashif in #3959
- Fix: Conditionally import torch.distributed.algorithms.join in accelerator.py by @0xDELUXA in #3962
- Fix is_hf_initialized attribute by @SunMarc in #3976
- feat(utils): add max reduction type by @imstevenpmwork in #4027
- fix(state): make MLU backend part of the _prepare_backend elif chain by @Anai-Guo in #4057
- fix notebook launcher cuda init by @SunMarc in #4059
- pytorch-triton-xpu rename to triton-xpu by @sywangyi in #4007
- Relax numerical tolerance for XPU in test_big_modeling by @YangKai0616 in #4001
- Fix gloo backend error in test_load_checkpoint_and_dispatch_with_broadcast on XPU by @kaixuanliu in #4056
- Raise ValueError instead of a bare string in ParallelismConfig.get_device_mesh by @lollinng in #4064
- tests: Gracefully handle missing set_device for mps by @booxter in #4028
- test: add regression test for no_split_module_classes accepting set type by @UFO0506 in #4048
- Fix all tests by @SunMarc in #4072
- docs: add aggregate profiler memory example by @aryanputta in #4054
- DOC: document missing parameters in load_accelerator_state, find_executable_batch_size, and send_to_device by @kratos0718 in #4051
- docs: Fix docstring of fsdp2_prepare_auto_wrap_policy by @slocoro in #4037
- Fix DistributedType documentation by @3manifold in #3980
- Fix grammar, spelling, and consistency issues across docs and examples by @cihandemir in #3961
- docs: fix typos in docstrings, comments, and user docs by @mokashang in #4040
- chore: update doc-builder workflow SHA by @rtrompier in #4009
- chore: bump doc-builder SHA for main doc build workflow by @rtrompier in #4018
- [CI] Bump style-bot SHA + switch to GitHub App by @paulinebm in #4031
- Fix TrackioTracker.log() ignoring step parameter by @joshuaswanson in #3975
- fix: pass step parameter in TrackioTracker.log() by @liuyun7345 in #3970
- fix(tracking): default step=None on tracker.log and accept extra kwargs in MLflowTracker by @1fanwang in #4039
- Fix MLflowTracker.store_init_configuration mutating the caller's config dict by @ATOM00blue in #4046
- fix(tracker): guard init_trackers and log against None kwargs by @xodn348 in #4026
- 🔒 Pin GitHub Actions to commit SHAs by @paulinebm in #3992
- chore: update build-docker-images-release.yml by @hf-security-analysis[bot] in #4069
- chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #4049
- Bump the actions group with 8 updates by @dependabot[bot] in #4068
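Several of the tracker fixes above (#3970, #3975, #4039, #4046) share two patterns: `log` accepts an optional `step` plus extra keyword arguments, and storing the initial configuration must not modify the caller's dictionary. The sketch below illustrates both with a hypothetical in-memory `DemoTracker`; it is not an Accelerate class and does not reproduce the real tracker code.

```python
import copy


class DemoTracker:
    """Hypothetical in-memory tracker used only for illustration."""

    def __init__(self):
        self.config = {}
        self.records = []

    def store_init_configuration(self, values):
        # Work on a copy so converting values never alters the caller's dict.
        values = copy.deepcopy(values)
        for key, value in values.items():
            if not isinstance(value, (int, float, str, bool)):
                values[key] = str(value)
        self.config = values

    def log(self, values, step=None, **kwargs):
        # step is optional and recorded as given; extra kwargs are ignored.
        self.records.append({"values": dict(values), "step": step})


tracker = DemoTracker()
config = {"lr": 0.1, "layers": [1, 2]}
tracker.store_init_configuration(config)
tracker.log({"loss": 1.0})
tracker.log({"loss": 0.5}, step=3, commit=True)
```

After these calls the caller's `config` still holds the original list, and the second record carries the step that was passed in.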
Full Changelog: v1.13.0...v1.14.0