This release fixes the following issues (regressions / silent correctness):
- Fix missing OpenMP support on Apple Silicon binaries (pytorch/builder#1697)
- Fix crash when mixing lazy and non-lazy tensors in one operation (#117653)
- Fix PyTorch performance regression on Linux aarch64 (pytorch/builder#1696)
- Fix silent correctness in DTensor `_to_copy` operation (#116426)
- Fix properly assigning `param.grad_fn` for next forward (#116792)
- Ensure gradients clear out pending `AsyncCollectiveTensor` in FSDP Extension (#116122)
- Fix processing unflatten tensor on compute stream in FSDP Extension (#116559)
- Fix FSDP `AssertionError` on tensor subclass when setting `sync_module_states=True` (#117336)
- Fix DCP state_dict cannot correctly find FQN when the leaf module is wrapped by FSDP (#115592)
- Fix OOM when returning an `AsyncCollectiveTensor` by forcing `_gather_state_dict()` to be synchronous with respect to the main stream (#118197) (#119716)
- Fix Windows runtime `torch.distributed.DistNetworkError`: [WinError 32] The process cannot access the file because it is being used by another process (#118860)
- Update supported Python versions in package description (#119743)
- Fix SIGILL crash during `import torch` on CPUs that do not support SSE4.1 (#116623)
- Fix DCP RuntimeError in `get_state_dict` and `set_state_dict` (#119573)
- Fixes for HSDP + TP integration with device_mesh (#112435) (#118620) (#119064) (#118638) (#119481)
- Fix numerical error with `mixedmm` on NVIDIA V100 (#118591)
- Fix RuntimeError when using SymInt input invariant when splitting graphs (#117406)
- Fix compile `DTensor.from_local` in trace_rule lookup (#119659)
- Improve torch.compile integration with CUDA 11.8 binaries (#119750)
Release tracker #119295 contains all relevant pull requests related to this release as well as links to related issues.