accelerate 1.14.0 on Python PyPI

FSDP2 Improvements

This release brings a large batch of FSDP2 fixes and quality-of-life improvements: correct dtype handling on load, sharding of embeddings/norms, QLoRA crash prevention, and a more robust auto-wrap policy.

Fsdp2 fully_shard embedding and norm by @SunMarc in #4015
Fix fsdp2 load full state dict dtype mismatch by @SunMarc in #4021
Fix region compilation fsdpv2 by @SunMarc in #4022
[FSDP2] Cast model to uniform dtype before fully_shard to fix mixed-dtype AssertionError by @roycho96 in #3985
[FSDP2] Auto-exclude non-floating frozen Params4bit from fully_shard to prevent QLoRA crash by @roycho96 in #3987
fix(FSDP2): auto-wrap policy ignoring _no_split_modules fallback by @JohnGiorgi in #3999
fix: use key-based matching in fsdp2_load_full_state_dict by @roycho96 in #3982
fix: add missing model_has_params4bit guard to fsdp2_load_full_state_dict call by @roycho96 in #3981
Fix to-fsdp2: drop REMOVED / NOT_YET_IMPLEMENTED FSDP1 keys instead of leaking them by @lollinng in #4065
Prevent double-wrapping models in prepare_model() by @joshuaswanson in #3977
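
To illustrate why key-based matching (#3982) is more robust than relying on iteration order when loading a full state dict, here is a minimal pure-Python sketch. Plain dicts stand in for real state dicts, and the helper names `load_by_position` and `load_by_key` are hypothetical; this is not the code in fsdp2_load_full_state_dict, only an illustration of the idea.

```python
def load_by_position(target_keys, full_state):
    """Pair target keys with source values by iteration order (fragile)."""
    return dict(zip(target_keys, full_state.values()))


def load_by_key(target_keys, full_state):
    """Look each target key up by name and fail loudly if one is missing."""
    missing = [k for k in target_keys if k not in full_state]
    if missing:
        raise KeyError(f"missing keys in full state dict: {missing}")
    return {k: full_state[k] for k in target_keys}


# The sharded model lists its parameters in one order...
target_keys = ["embed.weight", "norm.weight", "head.weight"]
# ...while the full state dict happens to store them in another.
full_state = {"norm.weight": "N", "embed.weight": "E", "head.weight": "H"}

by_position = load_by_position(target_keys, full_state)  # silently mismatched
by_key = load_by_key(target_keys, full_state)            # matched by name
```

With positional pairing, "embed.weight" silently receives the norm weights; with key-based lookup every parameter gets its own value, and a missing key raises instead of shifting everything after it.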

AMD ROCm support

Accelerate now works end-to-end on AMD ROCm devices. Thanks @Abdennacer-Badaoui!

Make accelerate work end-to-end on AMD ROCm by @Abdennacer-Badaoui in #4025

Neuron

This release further improves Neuron support: padded allgather and broadcast reduce recompilation, and a missing Neuron device case is now handled.

Add padded allgather and broadcast for Neuron devices to reduce recompilation by @czkkkkkk in #4000
fix: add missing neuron device case by @michaelbenayoun in #4042
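
The idea behind padding collectives can be sketched in pure Python, assuming that the device compiles one graph per input shape, so each new buffer length would trigger a recompilation. Lists stand in for tensors, the exchange itself is simulated, and the helper names and the power-of-two bucketing are assumptions made for this illustration, not the scheme used in #4000.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (and at least 1)."""
    size = 1
    while size < n:
        size *= 2
    return size


def pad_to(values, size, pad_value=0):
    """Right-pad a list to exactly `size` entries."""
    return values + [pad_value] * (size - len(values))


def padded_allgather(per_rank_values):
    """Simulate an allgather in which every rank sends a same-sized buffer."""
    lengths = [len(v) for v in per_rank_values]
    bucket = next_power_of_two(max(lengths))
    # Every rank contributes a buffer of identical, bucketed length,
    # so the collective only ever sees a small set of distinct shapes.
    buffers = [pad_to(v, bucket) for v in per_rank_values]
    # After the exchange, trim each buffer back to its true length.
    gathered = [buf[:n] for buf, n in zip(buffers, lengths)]
    return bucket, gathered


bucket, gathered = padded_allgather([[1, 2, 3], [4], [5, 6, 7, 8, 9]])
```

Here the longest rank holds 5 items, so all ranks exchange buffers of length 8; any later call whose longest input has between 5 and 8 items reuses the same shape.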

Quantization & Offloading

We improved offloading support for quantized models: torchao offload, int8 offload hook handling, and keep_in_fp32_modules for tied weights.

Torchao offload by @SunMarc in #3973
Fix int8 offload hook detachment statistics restoration by @jiqing-feng in #4044
Fix keep_in_fp32_modules not working for tied weights in load_and_quantize_model by @jiqing-feng in #4043
Fix dtype_byte_size for FP8 fnuz / e8m0fnu dtypes by @lollinng in #4063
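
FP8 dtype names such as float8_e4m3fnuz and float8_e8m0fnu carry their bit width in the middle of the name, followed by a format suffix. The sketch below shows one way to derive a byte size from such names. It works on plain strings rather than torch.dtype objects, and `dtype_byte_size_from_name` is a hypothetical helper written for this illustration, not accelerate's dtype_byte_size.

```python
import re


def dtype_byte_size_from_name(dtype_name):
    """Return bytes per element, parsed from a name like 'float8_e4m3fnuz'."""
    # Take the first run of digits after the alphabetic prefix:
    # "float8_e4m3fnuz" -> 8, "bfloat16" -> 16, "int32" -> 32.
    match = re.match(r"[a-z]+(\d+)", dtype_name)
    if match is None:
        raise ValueError(f"cannot infer a bit width from dtype name: {dtype_name}")
    return int(match.group(1)) // 8


sizes = {
    name: dtype_byte_size_from_name(name)
    for name in ["float8_e4m3fnuz", "float8_e5m2fnuz", "float8_e8m0fnu", "bfloat16", "float32"]
}
```

Anchoring the match at the start of the name means the digits inside the format suffix (the 4 and 3 in e4m3, for example) are never mistaken for the bit width.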

Data Loading

Feat: Support dynamic batch size in BatchSamplerShard with even_batches by @yuxinyuan in #3969
Fix iterable dataset sharding condition when n_shards == num_processes by @SunMarc in #3958
Fix implicit padding in split_between_processes when apply_padding=False and num_samples < num_processes by @3manifold in #4052
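
The last fix concerns the case where there are fewer samples than processes. The sketch below shows the splitting behaviour one would expect in that case, under the assumption that with apply_padding=False the surplus processes should receive an empty chunk rather than a repeated sample. `split_for_process` is a hypothetical stand-in written for this illustration; it borrows the apply_padding name from split_between_processes but is not the library's implementation.

```python
def split_for_process(items, num_processes, process_index, apply_padding=False):
    """Return the slice of `items` owned by `process_index`."""
    per_process, extras = divmod(len(items), num_processes)
    # The first `extras` processes each take one additional item.
    start = process_index * per_process + min(process_index, extras)
    end = start + per_process + (1 if process_index < extras else 0)
    chunk = items[start:end]
    if apply_padding and items:
        # Pad with the last item so every process sees the same chunk length.
        target = per_process + (1 if extras else 0)
        chunk = chunk + [items[-1]] * (target - len(chunk))
    return chunk


items = ["a", "b"]  # two samples, four processes
no_padding = [split_for_process(items, 4, i) for i in range(4)]
with_padding = [split_for_process(items, 4, i, apply_padding=True) for i in range(4)]
```

Without padding, processes 2 and 3 get empty lists; with padding, they each get a copy of the last sample so that all chunks have the same length.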

Minor fixes

[DeepSpeed] allow kernels flash-attn in SP by @kashif in #3959
Fix: Conditionally import torch.distributed.algorithms.join in accelerator.py by @0xDELUXA in #3962
Fix is_hf_initialized attribute by @SunMarc in #3976
feat(utils): add max reduction type by @imstevenpmwork in #4027
fix(state): make MLU backend part of the _prepare_backend elif chain by @Anai-Guo in #4057
fix notebook launcher cuda init by @SunMarc in #4059
pytorch-triton-xpu rename to triton-xpu by @sywangyi in #4007
Relax numerical tolerance for XPU in test_big_modeling by @YangKai0616 in #4001
Fix gloo backend error in test_load_checkpoint_and_dispatch_with_broadcast on XPU by @kaixuanliu in #4056
Raise ValueError instead of a bare string in ParallelismConfig.get_device_mesh by @lollinng in #4064
tests: Gracefully handle missing set_device for mps by @booxter in #4028
test: add regression test for no_split_module_classes accepting set type by @UFO0506 in #4048
Fix all tests by @SunMarc in #4072
docs: add aggregate profiler memory example by @aryanputta in #4054
DOC: document missing parameters in load_accelerator_state, find_executable_batch_size, and send_to_device by @kratos0718 in #4051
docs: Fix docstring of fsdp2_prepare_auto_wrap_policy by @slocoro in #4037
Fix DistributedType documentation by @3manifold in #3980
Fix grammar, spelling, and consistency issues across docs and examples by @cihandemir in #3961
docs: fix typos in docstrings, comments, and user docs by @mokashang in #4040
chore: update doc-builder workflow SHA by @rtrompier in #4009
chore: bump doc-builder SHA for main doc build workflow by @rtrompier in #4018
[CI] Bump style-bot SHA + switch to GitHub App by @paulinebm in #4031
Fix TrackioTracker.log() ignoring step parameter by @joshuaswanson in #3975
fix: pass step parameter in TrackioTracker.log() by @liuyun7345 in #3970
fix(tracking): default step=None on tracker.log and accept extra kwargs in MLflowTracker by @1fanwang in #4039
Fix MLflowTracker.store_init_configuration mutating the caller's config dict by @ATOM00blue in #4046
fix(tracker): guard init_trackers and log against None kwargs by @xodn348 in #4026
🔒 Pin GitHub Actions to commit SHAs by @paulinebm in #3992
chore: update build-docker-images-release.yml by @hf-security-analysis[bot] in #4069
chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #4049
Bump the actions group with 8 updates by @dependabot[bot] in #4068

Full Changelog: v1.13.0...v1.14.0
