torch 2.4.0 on Python PyPI

PyTorch 2.4 Release Notes

Highlights
Tracked Regressions
Backward incompatible changes
Deprecations
New features
Improvements
Bug Fixes
Performance
Documentation
Developers
Security

Highlights

We are excited to announce the release of PyTorch® 2.4!
PyTorch 2.4 adds support for the latest version of Python (3.12) for torch.compile.
AOTInductor freezing gives developers running AOTInductor more performance based optimizations by allowing the
serialization of MKLDNN weights. As well, a new default TCPStore server backend utilizing libuv has been introduced
which should significantly reduce initialization times for users running large-scale jobs.
Finally, a new Python Custom Operator API makes it easier than before to integrate custom kernels
into PyTorch, especially for torch.compile.

This release is composed of 3661 commits and 475 contributors since PyTorch 2.3.
We want to sincerely thank our dedicated community for your contributions.
As always, we encourage you to try these out and report any issues as we improve 2.4.

Tracked Regressions

Subproc exception with torch.compile and onnxruntime-training

There is a reported issue when using torch.compile if onnxruntime-training lib is
installed. The issue will be fixed in v2.4.1. It can be solved locally by setting the environment variable
TORCHINDUCTOR_WORKER_START=fork before executing the script.

cu118 wheels will not work with pre-cuda12 drivers

It was also reported that the new version of triton uses cuda features that are not compatible with pre-cuda12 drivers.
In this case, the workaround is to set
TRITON_PTXAS_PATH manually as follows (adapt the code according to the local installation path):

TRITON_PTXAS_PATH=/usr/local/lib/python3.10/site-packages/torch/bin/ptxas  python script.py

Backwards Incompatible Change

Python frontend

Default `TreadPool` size to number of physical cores (#125963)

Changed the default number of threads used for intra-op parallelism from the number of logical cores to the number of
physical cores. This should reduce core oversubscribing when running CPU workload and improve performance.
Previous behavior can be recovered by using torch.set_num_threads to set the number of threads to the desired value.

Fix `torch.quasirandom.SobolEngine.draw` default dtype handling (#126781)

The default dtype value has been changed from torch.float32 to the current default dtype as given by
torch.get_default_dtype() to be consistent with other APIs.

Forbid subclassing `torch._C._TensorBase` directly (#125558)

This is an internal subclass that a user used to be able to create an object that is almost a Tensor in Python and was
advertised as such in some tutorials. This is not allowed anymore to improve consistency and all users should
subclass torch.Tensor directly.

Composability

Non-compositional usages of as_strided + mutation under `torch.compile` will raise an error (#122502)

The torch.compile flow involves functionalizing any mutations inside of the region being compiled. Torch.as_strided is
an existing view op that can be used non-compositionally: meaning when you call x.as_strided(...), as_strided will only
consider the underlying storage size of x, and ignore its current size/stride/storage_offset when creating a new view.
This makes it difficult to safely functionalize mutations on views of as_strided that are created non-compositionally,
so we ban them rather than risking silent correctness issues under torch.compile.

An example of a non-compositional usage of as_strided followed by mutation that we will error on is below. You can avoid
this issue by re-writing your usage of as_strided so that it is compositional (for example: either use a different set
of view ops instead of as_strided, or call as_strided directly on the base tensor instead of an existing view of it).

@torch.compile
def foo(a):
    e = a.diagonal()
    # as_strided is being called on an existing view (e),
    # making it non-compositional. mutations to f under torch.compile
    # are not allowed, as we cannot easily functionalize them safely
    f = e.as_strided((2,), (1,), 0)
    f.add_(1.0)
    return a

We now verify schemas of custom ops at registration time (#124520)

Previously, you could register a custom op through the operator registration APIs, but give it a schema that contained
types unknown to the PyTorch Dispatcher. This behavior came from TorchScript, where “unknown” types were implicitly
treated by the TorchScript interpreter as type variables. However, calling such a custom op through regular pytorch
would result in an error later. As of 2.4, we will raise an error at registration time, when you first register the
custom operator. You can get the old behavior by constructing the schema with allow_typevars=true.

TORCH_LIBRARY(my_ns, m) {
  // this now raises an error at registration time: bar/baz are unknown types
  m.def("my_ns::foo(bar t) -> baz");
  // you can get back the old behavior with the below flag
  m.def(torch::schema("my_ns::foo(bar t) -> baz", /*allow_typevars*/ true));
}

Autograd frontend

Delete torch.autograd.function.traceable APIs (#122817)

The torch.autograd.function.traceable(...) API, which sets the is_traceable class attribute
on a torch.autograd.Function class was deprecated in 2.3 and is now being deleted.
This API does not do anything and was only meant for internal purposes.
The following raised an warning in 2.3, and now errors because the API has been deleted:

@torch.autograd.function.traceable
class Func(torch.autograd.Function):
    ...

Release engineering

Remove caffe2 db and distributed from build system (#125092)

Optim

Remove SparseAdam weird allowance of raw Tensor input (#127081).

Distributed

DeviceMesh

Update get_group and add get_all_groups (#128097)
In 2.3 and before, users can do:

mesh_2d = init_device_mesh(
    "cuda", (2, 2), mesh_dim_names=("dp", "tp")
)
mesh_2d.get_group()  # This will return all sub-pgs within the mesh
assert mesh_2d.get_group()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_group()[1] == mesh_2d.get_group(1)

But from 2.4 forward, if users call get_group without passing in the dim, users will get a RuntimeError.
Instead, they should use get_all_groups:

mesh_2d = init_device_mesh(
    "cuda", (2, 2), mesh_dim_names=("dp", "tp")
)
mesh_2d.get_group()  # This will throw a RuntimeError
assert mesh_2d.get_all_groups()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_all_groups()[1] == mesh_2d.get_group(1)

Pipelining

Retire torch.distributed.pipeline (#127354)
In 2.3 and before, users can do:

import torch.distributed.pipeline # warning saying that this will be removed and users need to migrate to torch.distributed.pipelining

But from 2.4 forward, if users write the code above, users will get a ModuleNotFound error.
Instead, they should use torch.distributed.pipelining:

import torch.distributed.pipeline # -> ModuleNotFoundError
import torch.distributed.pipelining

jit

Fix serialization/deepcopy behavior for tensors that are aliasing but not equal (#126126)

Fx

Complete revamp of float/promotion sympy handling (#126905)

ONNX

Remove caffe2 contrib and experiments (#125038)

Deprecations

Python frontend

User warning when using torch.load with default weights_only=False value (#129239, #129396, #129509).
A warning is now raised if the weights_only value is not specified during a call to torch.load, encouraging users to
adopt the safest practice when loading weights.
Deprecate device-specific autocast API (#126062)
All the autocast APIs are unified under torch.amp and it can be used as a drop-in replacement for torch.{device}.amp APIs (passing a device argument where applicable)..
Export torch.newaxis=None for Python Array API/Numpy consistency (#125026)

Composability

Deprecate calling FakeTensor.data_ptr in eager-mode. FakeTensors are tensors without a valid data pointer, so in
general their data pointer is not safe to access. This makes it easier for torch.compile to provide a nice error
message when tracing custom ops into a graph that are not written in a PT2-friendly way (because, for example, they
try to directly access a tensor’s data pointer from a region of code being traced). More details on integrating custom
ops with torch.compile can be found here (#123292)
Dynamic shapes:
- SymInt-ify mem-efficient attention forward op signature (#125418)
- Don't call item() into torch.scalar_tensor uselessly (#125373)
- Fix scalar type for constraint_range to Long (#121752)
- Guard oblivious on meta registrations (#122216), vector_norm (#126772), and unbind (#124959)
- Make expected stride test in torch._prims_common size oblivious (#122370)
- Use torch._check for safety assert in _reshape_view_helper (#125187)
- Add a code comment about torch._check_is_size in tensor_split (#125292)
- Make min(stride, strides[idx]) in collapse_view_helper size oblivious (#125301)
- Don't short circuit if shape is same (#125188)

CPP

Refactor autocast C++ APIs to be device-agnostic (#124359)

Release Engineering

Remove of QNNPACK third-party module (#126941)

Optim

Deprecate LRScheduler.print_lr (#126105)

nn

torch.nn.hardtahn allowed min_val to be greater than max_val (#121627)

Distributed

Distributed Checkpointing (DCP)
Deprecated submodules feature for distributed_state_dict (#127793)
In 2.3 and before, users can do:

model = AnyModel(device=torch.device("cuda"))
model_state_dict = get_model_state_dict(model)
set_model_state_dict(
    model,
    model_state_dict=new_model_state_dict,
    options=StateDictOptions(strict=False),
)

# Below way of calling API is also legit
model_state_dict2 = get_model_state_dict(model, submodules={model.submodule})
set_model_state_dict(
    model,
    model_state_dict={model.submodule: new_submodel_state_dict},
    options=StateDictOptions(strict=False),
)

But from 2.4 forward, if users call get_model_state_dict or set_model_state_dict with a submodule path or
state_dict, users will see a warning about the feature. To achieve the same functionality, users can manually
filter out the state_dict returned from get_state_dict API and preprocess the model_state_dict before
calling set_state_dict API:

model = AnyModel(device=torch.device("cuda"))
model_state_dict = get_model_state_dict(model)
set_model_state_dict(
    model,
    model_state_dict=new_model_state_dict,
    options=StateDictOptions(strict=False),
)
# Deprecating warnings thrown for the below way of calling API
model_state_dict2 = get_model_state_dict(model, submodules={model.submodule})
set_model_state_dict(
    model,
    model_state_dict={model.submodule: new_submodel_state_dict},
    options=StateDictOptions(strict=False),
)

FullyShardedDataParallel (FSDP)
Deprecate FSDP.state_dict_type and redirect users to distributed_state_dict (#127794)
In 2.3 and before, users can do:

model = AnyModel(device=torch.device("cuda"))
fsdp_model = FSDP(model)
# Users can do both ways below
get_model_state_dict(model)
with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT):
    fsdp_model.state_dict()

But from 2.4 forward, if users call state_dict or set state_dict with the FSDP.state_dict_type, users will see warnings. And the recommended solution now is to use get_model_state_dict and set_model_state_dict directly:

model = AnyModel(device=torch.device("cuda"))
fsdp_model = FSDP(model)

get_model_state_dict(model)
# Deprecating warnings thrown for the below way of calling API
with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT):
    fsdp_model.state_dict()

Profiler

Remove FlameGraph usage steps from export_stacks docstring (#123102)
The export_stacks API will continue to work as before, however we’ve removed the docstring to use FrameGraph.
PyTorch doesn’t own FrameGraph, and cannot guarantee that it functions properly.

Quantization

Remove deprecated torch._aminmax operator (#125995).
torch._aminmax -> torch.aminmax instead

Export

Start deprecation of capture_pre_autograd_graph (#125848, #126403)

XPU

Refactor autocast C++ APIs to be device-agnostic(#124359)
at::autocast::get_autocast_gpu_dtype() -> at::autocast::get_autocast_dtype(at::kCUDA)
at::autocast::get_autocast_cpu_dtype() -> at::autocast::get_autocast_dtype(at::kCPU)
Refactor autocast Python APIs(#124479)
torch.get_autocast_gpu_dtype() -> torch.get_autocast_dtype(“cuda”),
torch.set_autocast_gpu_dtype(dtype) -> torch.set_autocast_dtype(“cuda”, dtype),
torch.is_autocast_enabled() -> torch.is_autocast_enabled(“cuda”),
torch.set_autocast_enabled(enabled) -> torch.set_autocast_enabled(”cuda”, enabled),
torch.get_autocast_cpu_dtype() -> torch.get_autocast_dtype(“cpu”)
Make torch.amp.autocast more generic (#125103)
torch.cuda.amp.autocast(args…) -> torch.amp.autocast(“cuda”,args…),
torch.cpu.amp.autocast(args…) -> torch.amp.autocast(“cpu”, args…),
Deprecate device-specific GradScaler autocast API(#126527)
torch.cuda.amp.GradScaler(args…) -> torch.amp.GradScaler(“cuda”, args…),
torch.cuda.amp.GradScaler(args…) -> torch.amp.GradScaler(“cpu”, args…),
Generalize custom_fwd&custom_bwd to be device-agnostic (#126531)
torch.cuda.amp.custom_fwd(args…) -> torch.amp.custom_fwd(args…, device_type=’cuda’),

ONNX

Remove more caffe2 files (#126628)

New Features

Python frontend

Add
- support for unsigned int sizes for torch.unique (#123643)
- torch.OutOfMemoryError to signify out of memory error from any device (#121702)
- new device-agnostic API for autocast in torch.amp.* (#124938)
- new device-agnostic API for Stream/Event in torch.{Stream,Event} (#125757)
- channels last support to max, average and adaptive pooling functions (#116305)
- torch.serialization.add_safe_globals that allows users to allowlist classes for weights_only
  load (#124331, #124330, #127808)
- pickling support for torch.Generator (#126271)
- torch.utils.module_tracker to track position within torch.nn.Module hierarchy (#125352)

Composability

Add
- OpOverload.redispatch; use it in new custom ops API (#124089)
- mutated_args field to custom_op (#123129)
- new Python Custom Operators API
- register_autograd to register backward formulas for custom ops (#123110)
- torch.library.opcheck (#124496), torch.library.register_autograd (#124071), torch.library.register_kernel (#124299)
Blanket ban kwarg-only Tensors (#124805)
Change register_autograd to reflect ordering of setup_context and backward (#124403)
Ensure torch.library doctests runs under xdoctest (#123282)
Fix torch.library.register_fake's module reporting (#125037)
New Custom Ops Documentation landing page (#127400)
Refresh OpOverloadPacket if a new OpOverload gets added (#126863, #128000)
Rename
- impl_abstract to register_fake, part 1/2 (#123937)
- register_impl to register_kernel (#124200)
Schema inference now includes default values (#123453)
Stop requiring a pystub for register_fake by default (#124064)
Support TensorList inputs/outputs (#123615)
Update the functionalization error message (#123261)
add ability to provide manual schema (#124180)
fix schema inference for kwarg-only args (#124637)
mutated_args -> mutates_args (#123437)
register_autograd supports non-tensor kwargonly-args (#124806)
set some tags when constructing the op (#124414)
setup_context fills in default values (#124852)
torch.library.register_fake accepts more types (#124066)
use new python custom ops API on prims ops (#124665)

Optim

Enable torch.compile support for LRScheduler with Tensor LRs (#123751, #123752, #123753, #127190)

nn frontend

Add RMSNorm module (#121364)

linalg

Implement svd_lowrank and pca_lowrank for complex numbers (#125580)
Extend preferred_backend on ROCm backend.
Add cuBLASLt gemm implementation (#122106)

Distributed

c10d

Implemented IntraNodeComm primitives for allgather_matmul (#118038)
Add first differentiable collective all_to_all_single_grad (#123599)
Add P2P versions of send/recv_object_list operations (#124379)
Add a new Collectives API for doing distributed collectives operations in the Elastic
store with more performant and debuggable primitives (#126695)

FullyShardedDataParallel v2 (FSDP2)

FSDP2 is a new fully sharded data parallel implementation that uses DTensor-based dim-0 per-parameter
sharding for improved flexibility (e.g. mixed-dtype all-gather, no constraints on requires_grad) without
significant cost to performance.
See the document for more details and a
comparison with FSDP1 (#122888, #122907, #123142, #123362, #123491, #123857, #119302, #122908, #123953, #120952, #123988, #124293, #124318, #124319, #120256, #124513, #124955, #125191, #125269, #125394, #126070, #126267, #126305, #126166, #127585, #127776, #127832, #128138, #128117, #128242)

Pipelining

PyTorch Distributed pipeline parallelism APIs were upstreamed from the
PiPPy project and are available as a prototype release in
PyTorch 2.4.
The package is under torch.distributed.pipelining
and consists of two parts: a splitting frontend and a distributed runtime.
The splitting frontend takes your model code as-is, splits it up into “model partitions”, and captures the data-flow relationship.
The distributed runtime executes the pipeline stages on different devices in parallel, handling things like micro-batch splitting,
scheduling, communication, and gradient propagation.
For more information please check out the
documentation and
tutorial (#126322, #124776, #125273, #125729, #125975, #126123, #126419, #126539, #126582, #126732, #126653, #127418, #127084, #127673, #127332, #127946, #128157, #128163, #127796, #128201, #128228, #128240, #128236, #128273, #128279, #128276, #128278, #127066)

Profiler

Add profiler support for PrivateUse1 (#124818)

Dynamo

torch.compile is compatible with Python 3.12.
Guarding on nn modules attributes (#125202) - TorchDynamo guards on nn module attributes. This was a frequently raised
issue in the past (examples issue1, issue2, issue3, issue4, issue5,
issue6 and issue7). This increases TorchDynamo soundness with minimal perf impact.
Hardened the recently introduced tracing rules infrastructure. This allows torch.compile users to easily control TorchDynamo tracing of PyTorch internal code.
Extended torch.compile support for RAdam and Adamax optimizer. Compiler optimizers now demonstrate SOTA performance.
Experimental feature - We introduced a new experimental flag torch._dynamo.config.inline_inbuilt_nn_modules to enable torch.compile to reuse compiled
artifacts on repeated blocks in the models. This gives another point in the tradeoff space of compilation time and performance speedup.
By moving torch.compile from full model to a repeated block (e.g. moving torch.compile from full LLM to a repeated Transformer block),
we can now achieve faster compilation time with some performance dip compared to full model.
We plan to make this flag default to True in the 2.5 release.

Export

Introduce ShapesCollection, a dynamic shapes builder API (#124898)

Inductor

Add higher order associative scan operator (#119430)

jit

Add aten::sort.any op for sorting lists of arbitrary elements (#123982)

MPS

Conform torch.mps to device module interface (#124676)

XPU

Inductor Intel GPU backend (#121895)
a new autocast API torch.amp.is_autocast_available(#124938)
attributes to xpu device prop (#121898)
XPU implementation for PyTorch ATen operators (#120891)
generic stream/event on XPU backend (#125751)
gpu trace on XPU (#121795)
Switch to torch.float16 on XPU AMP mode (#127741)

ONNX

quantized layer norm op to opset 17 (#127640)
symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828)
Support for Some Bitwise Ops in Onnx Exporter (#126229)
Allow ONNX models without parameters (#121904)
Integrate onnxscript optimizer (#123379)

Vulkan

quantized transposed 2D convolutions (#120151, #122547)
the quantized ReLU operator (#123004)

Improvements

Python frontend

bfloat16 support for torch.binary_cross_entropy on CPU (#123823)
MAP_SHARED option for torch.load when mmap=True (#124889)
default value when printing function signature (#127059)
all variants of upsampling functions to be done in high precision in autocast (#121324)

Composability

FakeTensors, meta tensors and python decompositions are used to perform shape propagation when tracing out a graph in
torch.compile. There were a number of op coverage improvements this release:
- New metas / fake tensor rules:
  - aten._embedding_bag_dense_backward, aten._embedding_bag_per_sample_weights_backward (#125785), aten.randint.out, aten.rand.out (#122375), aten.unique2 (#124306), aten.histc (#124548), aten.channel_shuffle (#123033), aten._masked_scale (#127389), aten.addcdiv.ScalarList, aten.addcmul.ScalarList (#123486)
New decomps:
- Aten.resize_as (#122317), several out= variants of ops with existing decomps (#122979, #115437)

Autograd frontend

nn.functional.batch_norm: add forward AD rule for miopen backend (#125069)
nn.functional.scaled_dot_product_attention: add backward rule for cuDNN backend (#122510)

Release Engineering

Add CI support for aarch64 linux.
This is a phase-1 support with a subset of unit tests to cover AWS Graviton processors.
The CI is triggered when the ciflow/linux-aarch64 label is added.
It is currently auto triggered for PyTorch core, c10, mkldnn, and torch inductor related PRs.
(#120931, #121284, #125255, #121136, #124781, #125599)
Add experimental CUDA pip wheels for ARM architectures supporting the NVIDIA Hopper architecture as nightly binaries and a prototype for the PyTorch 2.4.0 release.
The nightly binaries support torchvision as well as torchaudio in addition.
(#126174, #127514)
Add support for CUDA 12.4 in CI/CD (#121684, #121956, #127825, #125944, #128250)
Add support for numpy 2.0.0rc1 in CI and CD (#123286, #122157)
Enable support for torch.compile and triton with Python 3.12 CI/CD (#127547, #123307, #126218)
Intel GPU enablement in CI (#122254, #123920, #125655)
Migrated CI/CD jobs to macOS 14 (#127582, #127853, #125801)
ROCM: upgrade CI/CD to 6.1 (#124811, #118216, #124300, #125646)
CUDNN version 9.1.0.70 for CUDA 11.8, 12.1, 12.4 builds (#123475)
NCCL submodule v2.20.5 (#121635)
submodule oneDNN v3.4.2 (#126137)
Wrapped deprecated function/class with typing_extensions.deprecated (#127689)

nn frontend

Add swap_tensors path to nn parametrizations (#124130)
Relax use_count constraints for swap_tensors when AccumulateGrad node holds a reference (#127313)
Increase numel limit to 2^63 for replicatepad1d (#122199)
Use int64_t indexing for Upsample2d backwards (#123682)
Remove warning from LazyModuleMixin constructor (#123968)

Optim

Radam and Nadam support the flag for "maximize" (#126765, #127214)
Include scheduler_on_plateau in optim.h (#121722)

Foreach

Allow foreach ops to run for any backend, not just CPU (#127412)

cuda

Update CUDA out of memory message with private pool info (#124673)
Add autocast rule for torch.vdot (#125697)
Fix type hint for cuda.get_device_name() and cuda. get_device_capability() (#126743)

Quantization

X86 Inductor backend
- Enable linear and linear-unary post-op gelu quant recipe for X86InductorQuantizer (#114853)
- Add Quantization recipe filter per operator type for X86InductorQuantizer (#122775)
- Add Matmul recipe into X86InductorQuantizer (#122776)
- Improve performance of qconv by reducing integration overhead (#123240)
PT2E quantization flow
- Add support for conv transpose + bn + {relu} weights fusion in PTQ and QAT (#122046, #123652)
- Simplify fake_quant_per_channel (#123186)
- Support fp8 quantization (#123161)
- Propagate get_attr meta through known ops only (#124415)
- Fix issue of lowering nn.linear ops with kwargs (#126331)

Distributed

c10d

TORCH_NCCL_HIGH_PRIORITY option for ProcessGroupNCCL (#122830)
__repr__ to P2POp class (#126538)
commCreateFromRanks to c10d (#127421, #127982)
dist.get_node_local_rank helper (#123992)
an option to enable TCPStore libuv backed for c10d rendezvous (#124684)
Captured dtype in Flight Recorder (#126581)
Enable ncclCommDevIdxMap unconditionally (#122049)
Extended the flight recorder dump from timeout to any exception (#123023)
Make TCPStore server use libuv by default (#127957)
Make get_node_local_rank() accept fallback_rank (#126737)
Make abort communicators in destroy_process_group call on default and code cleanup (#124334)
Mapped float8 types to uint8 for allgather (#126556)
Optionally avoided rethrowing CUDA Errors in NCCL Watchdog (#126587)
Wrapped TCPStore check in a try/catch (#127030)
ProcessGroupWrapper support custom backend (#124447)
ncclComm is not aborted before checking exception (#124466)

DeviceMesh

Add a private init backend option (#124780)
Initialized mesh tensor with CPU context (#124767)
Add DeviceMesh.from_group() (#124787)
Make _validate_tp_mesh_dim support 3D (#125763)
Supported N groups in from_group (#126258)
Make sure device mesh can be imported from torch.distributed (#126119)

Distributed quantization

Used BFloat16 in distributed quantization when supported by NCCL (#125113)

DistributedDataParallel (DDP)

Add a mode to avoid clone() in DDPSink (#122927)

Distributed Checkpointing (DCP)

Add type_check param to copy state dict utils (#127417)
Add strict option to DefaultPlanner (#123869)
Always created requests for non-tensor objects (#125334)
Always flattened mapping even if no tensors present (#125335)
Correctly handle _extra_state (#125336)
Implement broadcast_from_rank0 option for model/optim state_dict (#125338, #125339)
Introduced async staging extension points (#122965)
Make distributed state_dict support torch.distributed is not initialized case (#127385)
Make param name consistent with overridden function (#124770)
Remove the support of Dict[nn.Module, Dict[str, Any]] state_dict (#127070)
Supported flattening the optimizer state_dict when saving and unflattening when loading (#127071)
Unified the API signatures of set_model_state_dict and set_optimizer_state_dict (#127384)

DTensor

backward support for scaled_dot_product_attention (flash-attention) (#122541)
more foreach ops (#123214)
op support for view_as_complex and view_as_real (#122569)
op support for memory efficient attention (#122996)
support for fused_adam and fused_adamw when lr is a tensor (#126750)
ASGD foreach optimizer with associated unit tests (#121942)
the handle of DTensor.device_mesh.device_type in dynamo (#118803)
the support of placement kwargs for DTensor.to_local() in dynamo (#119947)
scatter op with simple replication (#126713)
distributed topk operator (#126711)
Make Partial placement public (#127338, #127420)
ensure expected input spec have correct tensor meta (#122949)
ensure meta tensor random op does not alternate rng state (#125693)
Move early return check into redistribute autograd function (#121653)
Move some modules to private namespace (#127339)
Standardized multi mesh-dim strategy with utils (#126712)
2D clip_grad_norm_ (#121945)
simple replicate strategy for SVD (#127004)
Turned on foreach implementation for (1) clip_grad_norm_ for DTensor by default (#126423), (2) optimizer for DTensor by default (#123394)

FullyShardedDataParallel (FSDP)

device in pin_memory argument (#119878)
private _unshard API (#124304)
privateuse1 in FSDP's sharded grad scaler (#126971)
Avoided CPU sync in clip_grad_norm_ (#122001)
Marked pre_backward_hook unserializable (#125464)
Skipped FSDP hooks base on dynamo config (#123021)
Used generic device handle instead of cuda (#121620)

ShardedTensor

Supported non-contiguous rank validation in sharded tensor (#123230)

TorchElastic

debug info logging interface for expired timers (#123883)
health check server hook in torch elastic (#122750, #123504)
option for sharing TCPStore created by rendezvous handlers (#125743)
support for binding to TCP in WorkerServer (#127986)
Applied "distributed debug handlers" (#127805)
Cleared timer for already terminated process (#122324)
Skipped expired timer logging for empty expired timers (#125039)

Tensor Parallel

wildcard support for Tensor Parallel parallelize_plan (#122968)
kwargs support to prepare_module_input (#124114)

Profiler

Profiler `torch.profiler`:

metrics for performance timing and other statistics collection (#123412)
Kineto traces will export ns granularity for finer timestamps (#122425, #123650)
Unified the device (CUDA, XPU, PrivateUse1) in profiler’s post processing (#123247)
Improve profiler post processing by iterating frontend function events rather than all function events (#124596)
Report strides in json traces (#125851)
Register COLLECTIVE_COMM profiler activity type when available (#121461)
Support third-party devices emit a range for each autograd operator (#125822)

Memory Snapshot `torch.cuda.memory._dump_snapshot`:

Improve the description of blocks with missing frames in the Memory Visualizer (#124784)
Add recordAnnotations to capture record_function annotations (#124179)

Profiler `record_function`:

For with_effects, skip over profiler.record_function_exit (#121829)
support for RecordFunctionFast to take inputs (#123208)
support for kwargs in RecordFunctionFast (#123600)
Collecting autograd sequence numbers on PythonTLSSnapshot dispatch keys for Nested Tensor (#123304)

Export

a printer to the unflattened module (#124315)
disable_forced_specializations flag (#124949, #126925)
export support for auto_functionalize (#121990, #122177, #122246)
readable placeholder names to ExportedProgram nodes (#123587, #123590, #124765)
set_grad_enabled higher order operator (#123391, #125066, #121736)
stack_trace for non-strict export (#121034)
torch_fn, a more consistent metadata across strict and non-strict export (#122693)
torchbind tracing support (#122619, #123370, #122622, #125490)
Allow static constraints in dynamic_shapes (#121860)
Ignore logging.Logger.* calls during dynamo export (#123402)
Make metadata serialization more strict (#124411)
Populate ShapeEnv's var_to_val during deserialization (#121759)
Prototype TorchScript 2 ExportedProgram Converter (#126920, #127466)
Provide refine function for automatically accepting dynamic shapes suggested fixes (#127436)
Save/load example inputs in the ExportedProgram (#122618)
Suggest constant dim values in dynamic shapes fixes (#125458)
Support map in pre-dispatch functionalization (#121444)
We introduced the concept of “effect tokens”, which is how we allow side-effectful operators in torch.compile/export (#121552, #122357)

Fx

shape inference tool (#120097)
device_ordinal to Subgraph in splitter_base (#125616)
exclusion function to minimizer base (#124504)
missing forbidden mutation methods in immutable collections (#125468)
option to turn on return_tuple in _SplitterBase (#123868)
prefix option to CapabilityBasedPartitioner (#126382)
Create block traverse mode in minimizer for graph aware debugging (#125613)
Implement Graph Transform Observer (#127427)
Option to include stride and device annotation in gm.print_readable() (#123690)
Register create_node_hook (#126671)

Dynamo

We performed a careful audit and fixed all known memory leaks in TorchDynamo.
We hardened torch.compile + __torch_function__ support by onboarding Scaled Dot Product Attention (SDPA) and TensorDict.

Inductor

0 initialization to Triton masked loads (#127311)
HalideCodeCache (#126416)
clone if output is a view from constant (#123200)
config to allow buffer mutation (#126584)
decompose_mem_bound_mm to the customization pre and post grad passes (#123376)
inductor support (#123709)
kernel_code logging artifact (#126631)
lowering for avg_pool{1, 3}d (#116085), cummax, cummin (#120429)
missing files to torch_key (#128230)
mode to MemoryDep to track atomic accumulates (#123223)
pybind for tensor_converter util functions (#121744)
qlinear_pointwise.binary op for X86Inductor backend (#123144)
support for multiple flexattention calls in a single compile (#125516)
tensor_constantX to pass constant buffer update's check (#122562, #122690)
the quant lift up pass in convert phase (#122777)
a decomposition for select_scatter (#124426)
Allow multiple cudagraph recordings per compiled graph (#126822)
Automatic detection for buffer mutation and binary linking (#126706)
Change
- OverridesData to take callables instead of strings (#123397)
- aot_compile callsites (#122225)
Clean up for removing 2 decompose patterns (#123422)
Codegen runtime asserts in Inductor (#124874)
Customize pre grad and post grad patterns (#121915)
Disallow fusions of foreach and reductions (#127048)
Enable
- lowering of qlinear-binary(-unary) fusion for X86Inductor (#122593)
- mmaped weights when CUDA is used (#124346)
- meta internal AOTInductor compilation on ROCM (#124123)
Enhance RecordFunctionFast input args and use input args in triton_heuristics.py (#123459)
Filter non input symexprs from codecache guards (#128052)
Get PT2 Cutlass backend working under fbcode (#125688)
Hipifying aoti code_wrapper (#124241)
Improve group batch fusion with same parent/users fusion enablement (#127648)
Inductor respects strides for custom ops by default (#126986)
Initial implementation of Inductor FX Graph Remote Cache (#124669)
Make
- torch._inductor.dependencies.Dep a proper class (#124407)
- c10/util ostream function implementations to their headers (#123847)
- some cudagraphs checks into C++ (#122251)
Pass triton kernel info to record function (#123871)
Read the patterns from the config instead of hard-code passes (#125136)
Remove
- API that allows for extra deferred runtime asserts during lowering (#124864)
- assertion for cat target_func (#125540)
Serialize large weights (#123002)
Specialize on unguarded alignment of example inputs (#123319)
Split cat customization (#123045)
Support
- CUDA_INC_PATH env variable when compiling extensions (#126808)
- custom op in JIT with cpp wrapper (#122554)
- pytrees as associative_scan input (#122137)
- use_runtime_constant_folding for CPU (#122563)
Try to reuse old symbol name rather than new symbol name when renaming (#124782)
Update the cpp_wrapper entry function signature (#121745)
Use source code hash instead of torch version (#126092)
Various improvements to error handling during autotuning (#126847)
batch pointwise op + unbind stack pass in post grad (#126959)
config target platform (#126306)
disable comprehensive padding in fbcode (#124191)
enable software pipelining on AMD devices (#125858)
epilogue support for gemm template (#126019)
make mask_rcnn inference work in max-autotune mode (#123008)
pt2 dper passes: run shape prop before each pass (#122451)
remove 2 decompose patterns (#123371)
switch assume_aligned_inputs to False (#124336)
unified the vectorized conversion with at::vec::convert for all data types (#119979)

jit

Shape function fix for _batch_norm_with_update (#122430)
Attach target function to OSError when source can't be found (#125248)
Support getattr/hasattr on NamedTuple (#121863)

ONNX

Allow fake models to run with ONNXProgram.call (#122230)
Fix ONNX export with print (#123368)
Improve torch.onnx.export runtime from O(n^2) to O(n) (#123025, #123027, #123063, #124909, #123028, #123028, #123029, #123026, #124912)
Make ONNXProgram.model_proto and disk file the same (#122196)
Skip optimizer when it fails (#127349)
Update decomposition table to core ATen ops (#127353)
beartype to emit warning instead of error by default (#123205)

MPS

Add naive quantized int4_mm, int8_mm and .gputrace capture hooks (#125163)
Better error-check for linear op (#124952)
Enable
- index_select for complex types (#122590)
- torch.mm and other ops for complex dtypes (#127241)
Implemented isin_Tensor_Tensor_out for MPS backend (#124896)
Improve F.adaptive_avg_pool2d error messages on MPS backend (#124143)
Native non-zero op implementation (#125355)

XPU

Generalize host allocator to be device-agnostic(#123079)
Make macro with AMP more generic(#124050)
Refactor
- CUDA’s AMP autocast policy to be generic(#124051)
- gpu trace to be device-agnostic(#121794)
Support generic Stream/Event on CUDA/HIP backend(#125757)

Bug fixes

Python frontend fixes

DtoH sync in torch.index_put_ (#125952)
torch.load map_location for wrapper subclass and device being serialized through numpy (#126728)
memory leak in torch.dtype.to_complex() (#125154)
nn.Parameter constructor type hint (#125106)
parameter name in torch.can_cast to from_ (#126030)
support of paths with space in torch.utils.cpp_extensions (#122974)
Support numpy array in Tensor.eq (#122249)

Composability fixes

FakeTensors, meta tensors and python decompositions are used to perform shape propagation when tracing out a graph in
torch.compile. There were a number of bug fixes improvements this release:
- FakeTensor fixes:
  - Handle symbolic size access in FakeTensor (#124760)
  - Avoid cuda init in FakeTensorMode (#124413)
  - Do not run CUDA lazy init if it is triggered with fake mode on (#122636)
  - Refactor faketensor ops that produce unbacked symints to memoize (#125623)
- Meta device fixes:
  - fix meta tensor set_() incorrectly modifying nbytes of the storage (#123880)
  - Fix aten._weight_int4pack_mm meta registration for float16 inputs (#124136)
- Fixes to python decompositions:
  - aten.upsample_bicubic2d: support for uint8 (#120411)
  - aten.upsample_nearest* ops: properly registered decomp to dispatch keys (#122782), (#122783)
  - _refs.masked_fill: support privateuse1 device when value.device.type is cpu (#124835)
  - _refs._reshape_view_helper: specialization shortcut for converting n-d to 1-d and 1-d to 2-d views (#127641)
  - Fix decomp for torch.tensor(...) constructor with nested python lists(#125639)
  - Aten.rrellu_: fix decomp when default values are missing (#126978)
AOTDispatcher is the component of the torch.compile stack that functionalizes and normalizes the graph, and adds
support for compiling the backward during training. There were several bugfixes and improvements to AOTDispatcher:
- Fix torch.compile used with triton kernels under inference_mode (#124489)
- Fix incorrect graph when functionalizing aten.expand followed by mutation (#122114)
- Properly keep input mutations in the graph when they are under torch.no_grad, even if there are outstanding aliases (#122433)
- Replay original views from the user code instead of falling back to as_strided in a few cases, which can improve
  performance of the backward pass in cases where torch.compile captures small graphs with outputs that alias graph inputs (#121007)
For __torch_dispatch__-based tensor subclasses, support custom layout overrides under torch dispatch mode (#125379)

cuda fixes

cuda array for empty arrays (#121458)
a perf regression in kernel launcher for the foreach_* family of ops (#123566)
CUDA out of memory error message formatting (#123984)
CUblasLt compilation on windows (#125792)

Autograd frontend fixes

torch.utils.checkpoint: Use pytrees to improve determination of what RNG state to stash (#121462)
Fix error message of autograd (#123154)

Release Engineering fixes

Fix mypy issues in fake_tensor.py (#124428)
Fix running of: lintrunner --all-files --take FLAKE8 (#124771)
Fix libc and libstdcxx installation on conda environments (#121556)
Release engineering tooling and CI fixes. Workflows, Trymerge, Bot Labeler, Mergebot (#125042, #121762, #121920, #124965, #122155, #123301, #121733, #127567, #128080)

nn frontend fixes

access to unitialized memory in VSX vector functions for quantized values (#122399)
swap_tensors path in nn.Module._apply for modules that inherit from RNNBase (RNN, GRU, LSTM) (#122800)
ctc_loss zero/negative length corner cases (#123193)
_LazyConvXdMixin.initialize_parameters and add related tests (#123756)
load_state_dict with unexpected key whose prefix matches a valid key (#124385)
requires_grad propagation in nn.utils.parametrize (#124888)
nan with large bfloat16 values for FlashAttention backend of nn.functional.scaled_dot_product_attention
issue in affine_grid_backward when grad_grid is non-contiguous (#124370)
Add error checks for invalid inputs on thnn_conv2d (#121906)(#122135)

Optim fixes fixes

Wrong ASGD implementation (#125440, #126375)
loading optimizer options from archive (#125215)

linalg fixes

svd_lowrank(..., M) in the presence of broadcasting (#122681)
linalg.vector_norm when used with autocast(cuda) (#125175)

CPP fixes

Handle all types c10::isSigned (#125637)
crash for AVX512 int4 matrix multiplication if weights are unaligned (#124128)
loading custom C++ extension within DataParallel-ized model (#125404)

Distributed fixes

c10d

coalescedCollective op Flight Recording (#120430)
group_name/group_desc set up in eager initialization (#127053)
bug in _update_process_group API (#128262)
bug in update_process_group DDP API (#128092)
excepthook crash on exit after destroy_process_group (#126739)
various errors in TCPStoreLibUvBackend.cpp (#127230)
work handle for coalescing manager (#122849)
Add check gloo availability when doing _ProcessGroupWrapper check (#124233)
Add initialize lastEnqueuedSeq_ and lastCompletedSeq_ in ProcessGroupNCCL (#121980)
Ensured gil is not released when calling to PyBytes (#128212)
Guarded gpu context during abort (#127363)
Make monitorThread sleep when we try to dump flight recorder (#123788)
Only included NCCL related header file with macro USE_C10D_NCCL (#127501)
Prevented wait_tensor() calls on graph inputs from getting DCEd for AsyncCollectiveTensor (#125677)

DeviceMesh

hash and eq not match (#123572)
device type issue in _get_device_handle (#124390)
Enable cache and reuse of sliced result to prevent funky behaviors and NCCL deadlock at large scale (#122975)
Make dtype of mesh tensor from init_device_mesh() consistent with directly calling DeviceMesh() (#123677)

DistributedDataParallel (DDP)

DDP no_sync when find_unused_parameters is True (#124193)

Distributed Checkpointing (DCP)

to remove non_persistent buffer in distributed state dict (#125337)
set_optimizer_state_dict() changes the parameters with some optimizers (#125708)
various bugs for broadcast_from_rank0 (#127635)
Remove the check of FSDP has root (#121544)
Kept params in torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#127644)

FullyShardedDataParallel (FSDP)

FSDP 2D state_dict to use run_check=False (#123802)
HSDP: sharding placement (#123778), validation error msg (#123019)
summon_full_params on submodule (#123290)

TorchElastic

Make torch.multiprocessing.ProcessContext.join() wait for all child procs to exit before return (#125969)

Profiler fixes

an asynchronous trace bug where end timestamp overflows and events are years in the future (#124080)
torch.profiler Schedule Function (Function Event only) to accumulate events (#125510)
Add a sanity test to the unit testing (#124773)
Add missing field device_resource_id in profiler events (#121480)
Cleaned up deprecated use_cuda by default (#126180)
Do not emit a warning when using CPU profiler only (#125654)
Handle more cases of symbolic sizes/strides detection (#123696)
Reduced warning msg in torch.profiler when using AMD (#124469)
Release gil in prepareProfiler (#121949)
Remove a redundant *1000 to timestamp since we already have ns precision (#124374)
Split up profiler test file (#124856)

Dynamo fixes

'Could not infer dtype of SymBool' on torch.tensor call (#125656)
'get_attr' call in dynamo 'run_node' (#127696)
'get_real_value' on placeholder nodes (#127698)
assume_constant_result for UnspecializedNNModuleVariable methods (#127695)
guard_size_oblivious on non-symbolic expression (#123743)
tvm backend interface (#126529)
Add support for tensor's is_complex method (#124927)
Allow asserts to fail (#126661)
Forward OptimizedModule.setattr to the wrapped module (#122098)
Initial exception handling support in dynamo (#126923)
Keep track of ViewMeta with symbolic inputs (#125876)
Support macOS and Linux/aarch64 platforms (#128124)

Export fixes

GraphModuleDeserializer handling of signature (#122342)
bug in get_update_constraint (#125194)
conv decomp when decomposing to core-aten (#123283)
mode not on stack error for while loop (#122323)
runtime assertions to add call_function (#125878)
to_copy to be inserted in the exported graph (#125628)
unflattening with duplicate tensors (#125192)
up nn_module_stack for nodes occurred around tracepoint ops (#124457)
leaky fake tensor on attribute assignment, support buffer assignment (#122337)
Allow Dim(1,2) for export dynamic shapes (v2 after revert) (#121910)
Allow modules to be created in the forward (#125725)
Correctly serialize empty list based on argument type (#123748)
Forward fix failures for torch.export switch to predispatch (#126081)
Handle param aliasing (#127471, #125509, #125758)
Make error name private (#126715)
More strictly respect scope when removing inputs in unflattener (#127607)
Skip nn_module_stack verifier for non-fx.GraphModule modules (#122210)

Fx fixes

fx graph triton import bug (#122041)
graph partitioner and make runtime assertion work with submodules in export (#125793)
infinite recursion in API BC test (#125706)
mem size mismatch from split/chunk in const folding (#125199)
triton import time cycles (#122059)
Don't intersect when clamping for size oblivious (#123675)
Don't use Proxy torch function in the sym size calls (#121981)
FakeTensorProp assert consistency of sizes when metadata previously existed (#124059)
Keep set_() input mutations in the AOTDispatcher graph, ban other cases (#122981)
Make
- check_is_size clamp to sys.maxsize - 1, so sys.maxsize comparison returns False (#122372)
- torch._check understand Eq commutativity (#125629)
Preserve
- node.meta when fusing subgraph (#125261)
- partitioner order (#122111)
- unbacked SymInt on SymNode (#120816)
Remove
- duplicated nodes in dfs_iter_find_cycle (#125585)
- incorrect check (#123616)
Skip index_put_ in dce (#122683)

Inductor fixes

AFOC QPS Regression (#122944)
C++ compilation error for tensor array in abi_compatible mode
FakeTensorUpdater logic for updating fake tensors (#116168)
a bool value codegen issue when calling custom ops (#127398)
a bug when mutated buffer meets .to (#127671)
a codegen issue when .item() is used for kernel arg (#126575)
a dynamic shape problem when lowering diagonal (#121881)
an internal test regression (#123481)
another out-of-bounds access (#122580)
cat backwards wrapping on symints (#121527)
compilation_latency regression caused by #127060 (#127326)
constant propagation pass (#114471)
cuda compilation under fbcode remote execution (#126408)
cummax and cummin lowering for empty case (#126461)
cutlass path in inductor (#125463)
edge case in JIT vs. AOT fusion after finalizing MultiTemplateBuffer (#126622)
includes to system Python (#125285)
issue with randint + symbolic shapes (#122428)
issues in pre_grad passes (#123181)
mask propagation in the presence of where (#125574)
memory planning compile error (#123867)
missing unbacked def for unbacked in input expr (#127770)
nextafter in inductor CPP codegen (#126876)
ops.scan for non-commutative operators (#126633)
out-of-bounds read/write in cvt_int64_to_[fp32|int32] (#122511)
scheduler typehints (#127769)
test with inlining flag (#128200)
to #126656 (#127050)
triton codegen main do_bench_gpu import error (#126213)
unbacked symbol in stride when using item() (#122298)
unsupported type of output=s1 (#126797)
ScatterFallback codegen (#124580)
a constant tensor device move issue (#128265)
an assertion for node debug str (#127021)
grid z bug for large grid (#127448)
invalid call to aoti_torch_tensor_copy_ (#126668)
linear_add_bias path (#127597)
loop ordering test (#127807)
miss isa bool check (#128274)
post_grad pattern (#127457)
redis-related env vars in remote_cache.py (#127583)
Add missing acosh op to vec256_float_neon.h (#122513)
Back out
- "Added a check in register_lowering to avoid decomposed ops (#117632)" (#122709)
- "Precompile triton templates (#121998)" (#123305)
Backport triton-lang/triton#3433 (#122470)
Correctly calculate the numel with symint in DDP fusion (#124422)
Disable stack allocation when there is a fallback op (#122367)
Do not forward parent's value range to CSE variable for variables created within codegen (#123099)
Do not propogate (#124769)
Don't clamp slices generated from cat kernel (#124139)
Enable B019 - flags memory leaks through LRU cache on method (#127686)
FX graph cache: Fix bug handling constants (#121925)
Fall back to eager mode when viewing with differing bitwidths (#120998, #121786)
Implement masked_load for integral types (#122608)
Improve unbacked SymInt input support in Inductor (#124739)
Inductor: fix Conv output stride for dynamic shapes (#121400)
Remove symbol exports in C shim for Windows (#125472)
Revert "Inductor respects strides for custom ops by default (#126986)" (#127923)
Use pexpr, not texpr in Triton launch codegen (#128038)
turn off triton memcache for amd devices (#122560)
typing scheduler.py [1/2]: Bug fix (#126610)
use two pass reduction for deterministic reduction order (#115620)
Forward fixes
- for D56289438 (#124882)
- for templates + views (#127446)

ONNX fixes

Fix list dtype finding bug in dispatcher (#122327)
Rename ort to maia in dynamo's ort backend (#124967)
Cast checkpoint weights to match model parameter's dtype (#122100)
Reduce excessive warning to info (#122442)
Prevent dup initializers when ONNXProgram.save is called many times (#122435)

MPS fixes

FFT descriptor fields to resolve precision issue (#125328)
FFT implementation bug dropping negative frequency components (#123274)
GELU, LeakyRELU and MISH on non-contiguous tensors (#123049)
abs for complex types (#125662)
copies larger than 4GB (#124635)
crash with binary_cross_entropy is invoked for half dtypes (#124258)
for MPS regression in scalar creation (#123234)
for addcdiv contiguous problem (#124442)
naive matmul for BFloat16 (#121731)
nextafter for negative values (#125029)
overflow in cumsum when dtype is bool (#125318)
strided ELU op correctness issue (#125692) and mse_loss correctness issue (#125696)
Fwd-fix for clamp regression (#122148)
Remove in place views fixing various crashes (#124895)

XPU fixes

record issue on XPUGuardImpl (#123523)

Performance

Python frontend

Use sleef on macOS Apple silicon by default (#126509)

cuda

Speed up torch.softmax kernel (#122970)

nn frontend

Parallelize upsampling ops across the batch/channel dimension (#127082)

Optim

Add fast fused kernels for Adam, AdamW, SGD, and Adagrad on CPU (#123074, #123629, #124905)

linalg

Improvements
- the CPU performance of linalg.vector_norm when reducing over a dimension of length 1 (#122143)
- performance of FP16 gemv on ARM (#126297, #126745, #126746, #126877, #127033) and BF16 gemm fallback on ARM (#126592)
- autotuning through TunableOp on ROCm (#124362)

Foreach

Allow int vals to go down the fastpath for _foreach_max (#127303)
_foreach_copy now supports different source/dest dtypes on the fast path (#127186)

Distributed

C10d

Disable compute of collective duration by default (#122138)

DTensor

Used str for reduce_op instead of c10d enum (#125172)
Make early return for _split_tensor (#125810)
Make directly return local_tensor under no_grad (#128145)

Distributed Checkpointing (DCP)

Improve the performance of distributed state_dict (#125501)

TorchElastic

Changed monitor_interval for torchelastic default value to 0.1 sec (#124692)
Add timing events to different stages of rendezvous (#125636)

jit

Fix exponential memory usage when TorchScript types share the same name (#121874), (#121928)

Fx

Add side table to FX Graph for O(1) op/target query (#121565)
Apply guard knowledge to all simplifications (#123342)
Do not calculate hint in advice_is_size (#124472)
Enable FX graph and symbolic shape caching (#121697, #125258, #123724, #124610)
Flatten/Unflatten micro optimization in proxy_tensor.py (#121993)
Minor compile time optimization in has_free_symbols (#122144)
Skip assert in check_is_size (#124209)
Teach ShapeEnv that a <= b => a < b + 1 (#123436)
Use sympy xreplace instead of subs (#124208)
_find not update unchanged replacements (#124274)
eval_static: guards, unbacked compute once (#124217)

Inductor

Speedup convert<float>(Vectorized<half>::loadu(ptr, 8)) on ARM (#125889)
Add more mm kernel choices (#125000)
Add NEON ISA support on
- arm64 Macs (#122217)
- aarch64 (#123584)

MPS

Improvements to perf of int4pack_mm (#125983, #127135, #125704)
Making copy_cast, softmax and cat_out unranked (#123191)

XPU

Intel GPU
- Convolution&Deconvolution aten operators(#117529)
- Matmul aten operators(addmm, badbmm, etc.)(#117202)
Support xpu host allocator (#123080)
oneDNN
- Conv primitive integration (#117512)
- Matmul primitive integration (#117112)
- library compilation for Intel GPU support (#117098)

Documentation

Python frontend

Add doc for
- torch.distributions.utils.clamp_probs (#128136)
- the legacy constructor for Tensor (#122625)
- torch.Size.numel (#124186)
- torch.utils.benchmark.utils.compare.Compare (#125009)
- torch.utils.collect_env.get_env_info (#128021)
Clarify Security Policy (#120531)
Fixes doc
- example of torch.masked_scatter (#123664)
- for torch.load map_location (#125473)
Improve doc for
- torch.set_default_dtype (#121730)
- torch.load weights_only argument (#127575)
Update doc for
- functions in torch.multinomial (#125495)
- functions in torch.random (#125265)
- torch.dot (#125908)

Composability

Add extended debugging options for troubleshooting torch.compile issues (#122028)

cuda

Add doc for torch.cuda.nccl.version (#128022)
Add documentation for nvtx.range (#121699)

Autograd frontend

torch.autograd.Function: update docs for separate context and forward functions (#121955)
torch.utils.checkpoint: Improve error message when use_reentrant=True is used with .grad() (#125155)
Improve the clarity of the torch.Tensor.backward doc (#127201)
Fix typing for torch.autograd.Function with ctx-less forward (#122167)

Release Engineering

Fix torch and torch.compile links (#121823, #121824)
Add
- fuzzer instructions to pt2 bug template (#123156)
- better instructions for pytorchbot merge command on cancel (#124947)
- instructions on how to run doc coverage locally (#123688)

nn frontend

Fixes
- KLDiv example (#126857)
- torch.nn.TripletMarginLoss allowing margin less or equal to 0 (#121978)
- example and typo in nn.ChannelShuffle and nn.PReLU docs (#123959)
- redundant tensor in nn.MaxUnpool2d example (#127850)
- wording in nn.Linear docstring (#127240)
Improvements
- NLLLoss documentation (#127346)
- documentation of torch.nn.utils.rnn (#123559)
- return value documentation for nn.Module.load_state_dict (#123637)
- the example description for torch.nn.utils.rnn.pad_sequence (#123183)
Update the is_causal explanation in the nn.functional.scaled_dot_product_attention doc (#127209)
Warn SDPA users about dropout behavior (#126294)

Optim

Document complex optimizer semantic behavior (#121667)
Add missing parameter doc of Adagrad (#125886)

linalg

Improve docs on the sorting of eig/eigvals (#127492)

Distributed

c10d

Add
- a doc page for NCCL ENVs (#128235)
- migration notes for --local-rank option style change for torchrun for PyTorch 2.0 onwards (#109480)
Documents
- 'tag' limitation for nccl send/recv (#125278)
- destroy_process_group usage (#122358)
Fixes
- example in torch.distributed.new_subgroups docstring (#123492)
- the document of distributed.new_group() (#122703)

Distributed Checkpointing (DCP)

Corrected typos in assert (#122633)

DTensor

Add comment on replicate -> partial for _NormPartial (#121976)
Updated public API docs for DTensor (#127340)

FullyShardedDataParallel (FSDP)

Remove excessive warnings and rewrite FSDP docstrings (#123281)
Fix docs for inter/intra node PG helpers (#126288)
Updated docstring to include device_mesh arg (#126589)

Profiler

Updated PT2+Profiler docs (#122272)

Export

Fix documentation for register_fake_class (#126422)

Fx

Document for add_var_to_val (#121850)

Dynamo

Add a Dynamo deepdive to documentation (#122305)
Update compile doc to suggest Module.compile (#123951)
Fixes
- links rendering when surrounding code in Dynamo deepdive (#123427)
- the link to torch.compiler_custom_backends (#125865)
- typos in torch._dynamo.config.py (#126150)
- NumPy + backward example (#126872)

Inductor

Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672)
documentation for pattern_matcher.py (#127459)

ONNX

Fix pytorch version for onnx in doc (#124182)
Add docstring to masked_fill, expand, select, unsqueeze, cat fns (#128055)
Documenting torch.onnx.operator.shape_as_tensor (#128051)
Init sigmoid comments (#127983) (edited)

XPU

PyTorch 2.4 XPU Getting Started (#127872)
Update Intel GPU Support on README (#126001)
Tensor (#126383 #127280)
Stream (#121398)
AMP (#127276 #127278)
torch.compile with XPU support (#127879)

Developers

Composability

cpu_fallback for aten::triu_indices on custom device crash (#121306)
API to check whether running in torch_dispatch mode (#122339)
clarify c10::Dispatcher kernel static asserts (#124519)

Release Engineering

TD (target determination) reorders tests in CI based on heuristics and
removes tests it believes to be irrelevant to the changes in the PR.
This has led to an approximate 25% reduction in TTS between PRs and the main branch,
but also in ~15 reverts (#121835, #121836, #122279, #122615, #122901, #124082, #122976, #125931)
torchbench on-demand test workflow (#122624).
BE: Ruff lint improvements (#124743, #124570)
ability to save TORCH_COMPILE_DEBUG logs for CI failures (#124408)
freezing option for cpu inductor accuracy test in inductor CI (#124715)

Optim

Modify device check in capturable optimizer to support more devices (#124919)
Improve typing and error messages in LRScheduler (#125556, #127943, #121633, #125161)
Only initialize state if needed in SGD (#123757)
Exempt torch.compile from more checks in Adamax (#123498)
Merged the pyi files into py files of optimizer (#125153, #125452)
Tighten fallback conditions for compiled optimizer (#125825)

Distributed

c10d

Updated error message for sparse all-reduce (#121644)
Add
- generic scuba logging capability into c10d (#121859)
- log the target of Flight Recorder dump (#122345)
- the source rank in the logs when detecting the timeout (#122850)
- more fields for periodic logging (#123860)
- pg_name and pg_desc to logger (#126409)
- Work's numel to logger for debugging purposes (#127468)
Allow user to pass process group description for ProcessGroupNCCL (#123472)
Print the duration of the broadcast of ncclunique_id (#123963)
Pass and recorded process_group_name when creating ProcessGroupNCCL (#123117)
Pass pg name and desc to NCCL communicator (#124149)
Make only PG0 should dump when monitoring thread timed out (#125356)
split seq_id to collective_seq_id and p2p_seq_id (#125727)
Print certain logs only on the head rank of each node (#125432)
Make warn env vars only once during program (#127046)

DTensor

Add some initial c10d ops to CommDebugMode (#125475)
Remove unused failed_reason (#126710)
Add all_reduce_coalesced tracing to CommDebugMode (#127025)

Distributed Checkpointing (DCP)

additional logging for improved observability in DCP (#121352)

FullyShardedDataParallel (FSDP)

Remove unnecessary warnings (#126365)
warnings on wrapping ModuleList/ModuleDict (#124764)

Miscellaneous

Remove dist_ prefix from TORCH_LOGS shortcuts (#126499)
Make torch.distributed.breakpoint() to work under Python/Meta contexts (#118645)

TorchElastic

Make log directory creation idempotent (#126496)

Fx

Suggest TORCHDYNAMO_EXTENDED_DEBUG_ envvars when appropriate (#122473)

Inductor

aoti_torch_item as a util function (#126352)
model_type and global_rank for the scuba log for the dashboard Optimus pattern frequency monitor (#123398)
Change the log for the group batch fusion (#122245)
Do not use importlib.load_module (#122542)
Enable FX graph caching on another round of inductor tests (#121994)
Improves
- exception typing. Remove NOQAs (#125535)
- generate_extern_kernel_out's signature (#123351)
- logging (#122932)
- the optimus scuba log (#122361)
Misc refactors (#126945)
Only print bw result for the first time we benchmark a kernel (#123568)
Refactor
- MultiOutput. codegen_list_tuple_access to use subclass type checks (#121662)
- indexing() into triton.py
- part of IterationRangesEntry into triton.py (#126944)
- some fallback op util functions (#126182)
- is_legacy_abi_kernel and abi_compatible_kernel (#121523)
Renamed mutationlayout/aliasedlayout (#122474)
Unify val_to_arg_str and val_to_cpp_arg_str (#126916)
Update
- DTYPE_TO_CPP mapping (#126915)
- opinfo tests (flattened diff) (#124657)
- tensor_converter util functions (#121743)
- triton pin (#121268)
Use C++17 helper templates (#122607)
delete inductor config.trace.compile_profile (#127143)
log pt2 config dict to signpost from inductor post grad (#124593)
refactor
- code to use define_kernel and call_kernel similar to CUDA (#123704)
- device dispatch inside do_bench (#125736)

MPS

Reorganize logics and naming in copy.mm (#123310)
Pointer to the non-zero limit ticket#124244
Introduce MetalShaderLibrary class (#125550)
Include MPSGraphVenturaOps.h for complex types on macOS12 (#127859)
Define _compute_tolerances (#121754)

XPU

Support general device runtime Interface for Intel GPU (#121883)
Enable triton installation for Intel GPU (#122254)
Reuse inductor test for Intel GPU (#122866, #124147)
Update Intel triton for Pytorch 2.4 release (#128615)
Support reduction split for Intel GPU (#129337)
call empty_cache for dynamo tests (#126377)
Support xpu autocast policy (#124052)

Security

Python frontend

warning for weights_only (#129239, #129396, #129509) (see Deprecations section)

Release Engineering

Vulnerability related updates of packages used in CI (#124614, #124675, #124976, #124983, #125698, #126805, #126989)

torch 2.4.0 PyTorch 2.4: Python 3.12, AOTInductor freezing, libuv backend for TCPStore on Python PyPI

PyTorch 2.4 Release Notes

Highlights

Tracked Regressions

Subproc exception with torch.compile and onnxruntime-training

cu118 wheels will not work with pre-cuda12 drivers

Backwards Incompatible Change

Python frontend

Default TreadPool size to number of physical cores (#125963)

Fix torch.quasirandom.SobolEngine.draw default dtype handling (#126781)

Forbid subclassing torch._C._TensorBase directly (#125558)

Composability

Non-compositional usages of as_strided + mutation under torch.compile will raise an error (#122502)

We now verify schemas of custom ops at registration time (#124520)

Autograd frontend

Delete torch.autograd.function.traceable APIs (#122817)

Release engineering

Optim

Distributed

DeviceMesh

Pipelining

jit

Fx

ONNX

Deprecations

Python frontend

Composability

CPP

Release Engineering

Optim

nn

Distributed

Profiler

Quantization

Export

XPU

ONNX

New Features

Python frontend

Composability

Optim

nn frontend

linalg

Distributed

c10d

FullyShardedDataParallel v2 (FSDP2)

Pipelining

Profiler

Dynamo

Export

Inductor

jit

MPS

XPU

ONNX

Vulkan

Improvements

Python frontend

Composability

Autograd frontend

Release Engineering

nn frontend

Optim

Foreach

cuda

Quantization

Distributed

c10d

DeviceMesh

Distributed quantization

DistributedDataParallel (DDP)

Distributed Checkpointing (DCP)

DTensor

FullyShardedDataParallel (FSDP)

ShardedTensor

TorchElastic

Tensor Parallel

Profiler

Profiler torch.profiler:

Memory Snapshot torch.cuda.memory._dump_snapshot:

torch 2.4.0
PyTorch 2.4: Python 3.12, AOTInductor freezing, libuv backend for TCPStore

on Python PyPI

Default `TreadPool` size to number of physical cores (#125963)

Fix `torch.quasirandom.SobolEngine.draw` default dtype handling (#126781)

Forbid subclassing `torch._C._TensorBase` directly (#125558)

Non-compositional usages of as_strided + mutation under `torch.compile` will raise an error (#122502)

Profiler `torch.profiler`:

Memory Snapshot `torch.cuda.memory._dump_snapshot`:

Profiler `record_function`: