pypi torch 2.0.0
PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever


PyTorch 2.0 Release notes

  • Highlights
  • Backwards Incompatible Changes
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes
  • Performance
  • Documentation


We are excited to announce the release of PyTorch® 2.0 (release note) which we highlighted during the PyTorch Conference on 12/2/22! PyTorch 2.0 offers the same eager-mode development and user experience, while fundamentally changing and supercharging how PyTorch operates at compiler level under the hood with faster performance and support for Dynamic Shapes and Distributed.

This next-generation release includes a Stable version of Accelerated Transformers (formerly called Better Transformers). Beta features include torch.compile as the main API for PyTorch 2.0, the scaled_dot_product_attention function as part of torch.nn.functional, the MPS backend, and the functorch APIs in the torch.func module, along with other Beta/Prototype improvements across inference, performance, and training optimization features on GPUs and CPUs. For a comprehensive introduction and technical overview of torch.compile, please visit the 2.0 Get Started page.

Along with 2.0, we are also releasing a series of beta updates to the PyTorch domain libraries, including those that are in-tree, and separate libraries including TorchAudio, TorchVision, and TorchText. An update for TorchX is also being released as it moves to community supported mode. More details can be found in this library blog.

This release is composed of over 4,541 commits and 428 contributors since 1.13.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.0 and the overall 2-series this year.


  • torch.compile is the main API for PyTorch 2.0, which wraps your model and returns a compiled model. It is a fully additive (and optional) feature and hence 2.0 is 100% backward compatible by definition.
  • As an underpinning technology of torch.compile, TorchInductor with Nvidia and AMD GPUs will rely on the OpenAI Triton deep learning compiler to generate performant code and hide low-level hardware details. OpenAI Triton-generated kernels achieve performance that's on par with hand-written kernels and specialized CUDA libraries such as cuBLAS.
  • Accelerated Transformers introduce high-performance support for training and inference using a custom kernel architecture for scaled dot product attention (SDPA). The API is integrated with torch.compile() and model developers may also use the scaled dot product attention kernels directly by calling the new scaled_dot_product_attention() operator.
  • Metal Performance Shaders (MPS) backend provides GPU-accelerated PyTorch training on Mac platforms with added support for the top 60 most-used ops, bringing coverage to over 300 operators.
  • Amazon AWS optimized PyTorch CPU inference on AWS Graviton3-based C7g instances. PyTorch 2.0 improves inference performance on Graviton compared to previous releases, including improvements for ResNet50 and BERT.
  • New prototype features and technologies across TensorParallel, DTensor, 2D parallel, TorchDynamo, AOTAutograd, PrimTorch and TorchInductor.
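As a minimal sketch of the first highlight, the snippet below wraps a small module with torch.compile and compares it with eager execution. The toy model is hypothetical, and backend="eager" is an assumption made here so the sketch runs without a C++/Triton toolchain: it traces through TorchDynamo but skips TorchInductor code generation. Omitting the backend argument selects the default TorchInductor path.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model, used only to illustrate the API.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# torch.compile wraps the module and returns a compiled module.
# backend="eager" is an illustrative choice so no compiler toolchain is needed;
# drop it to use the default TorchInductor backend.
compiled_model = torch.compile(model, backend="eager")

x = torch.randn(2, 8)
with torch.no_grad():
    eager_out = model(x)
    compiled_out = compiled_model(x)

# The compiled module shares parameters with the original module.
print(torch.allclose(eager_out, compiled_out))
```

Because the feature is additive, the original `model` remains usable in eager mode alongside `compiled_model`.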
Stable:
  • Accelerated PT 2 Transformers

Beta:
  • torch.compile
  • PyTorch MPS Backend
  • Scaled dot product attention
  • Functorch
  • Dispatchable Collectives
  • torch.set_default_device and torch.device as context manager
  • X86 quantization backend
  • GNN inference and training performance

Prototype:
  • DTensor
  • TensorParallel
  • 2D Parallel
  • Torch.compile (dynamic=True)

Platform Changes:
  • CUDA support for 11.7 & 11.8 (deprecating CUDA 11.6)
  • Python 3.8 (deprecating Python 3.7)
  • AWS Graviton3

*To see a full list of public 2.0, 1.13 and 1.12 feature submissions click here

Backwards Incompatible Changes

Drop support for Python versions <= 3.7 (#93155)

Previously the minimum supported version of Python for PyTorch was 3.7. This PR updates the minimum version to require 3.8 in order to install PyTorch. See Hardware / Software Support for more information.

Drop support for CUDA 10 (#89582)

This PR updates the minimum CUDA version to 11.0. See the getting-started for installation or building from source for more information.

Gradients are now set to None instead of zeros by default in torch.optim.*.zero_grad() and torch.nn.Module.zero_grad() (#92731)

This changes the default behavior of zero_grad() to zero out the grads by setting them to None instead of zero tensors. In other words, the set_to_none kwarg is now True by default instead of False. Setting grads to None reduces peak memory usage and increases performance. This will break code that directly accesses data or does computation on the grads after calling zero_grad() as they will now be None. To revert to the old behavior, pass in zero_grad(set_to_none=False).

1.13:
>>> import torch
>>> from torch import nn
>>> module = nn.Linear(2, 2)
>>> i = torch.randn(2, 2, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
False
>>> module.weight.grad.data
tensor([[0., 0.],
        [0., 0.]])
>>> module.weight.grad + 1.0
tensor([[1., 1.],
        [1., 1.]])

2.0:
>>> import torch
>>> from torch import nn
>>> module = nn.Linear(5, 5)
>>> i = torch.randn(2, 5, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
True
>>> module.weight.grad.data
AttributeError: 'NoneType' object has no attribute 'data'
>>> module.weight.grad + 1.0
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'
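Code that reads gradients after zero_grad() can be adapted in two ways: guard against None, or opt back into zero-filled tensors. The following is a small sketch of both; the module size and the norm computation are illustrative assumptions, not part of the release notes.

```python
import torch
from torch import nn

module = nn.Linear(5, 5)
module(torch.randn(2, 5)).sum().backward()

# Default in 2.0: gradients are released and become None.
module.zero_grad()
grads_are_none = all(p.grad is None for p in module.parameters())

# Code that reads gradients after zero_grad() should guard against None.
grad_norms = [
    0.0 if p.grad is None else p.grad.norm().item()
    for p in module.parameters()
]

# Opting back into the 1.13 behavior keeps zero-filled gradient tensors.
module(torch.randn(2, 5)).sum().backward()
module.zero_grad(set_to_none=False)
grads_are_zero = all(
    p.grad is not None and torch.count_nonzero(p.grad).item() == 0
    for p in module.parameters()
)

print(grads_are_none, grad_norms, grads_are_zero)
```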

Update torch.tensor and nn.Parameter to serialize all their attributes (#88913)

Any attribute stored on torch.Tensor and torch.nn.Parameter will now be serialized. This aligns the serialization behavior of torch.nn.Parameter, torch.Tensor and other tensor subclasses.

1.13:
# torch.Tensor behavior
>>> a = torch.Tensor()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> _ = buffer.seek(0)
>>> b = torch.load(buffer)

>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Tensor' object has no attribute 'foo'

# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> _ = buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Parameter' object has no attribute 'foo'

# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
...   pass

>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> _ = buffer.seek(0)
>>> b = torch.load(buffer)

2.0:
# torch.Tensor behavior
>>> a = torch.Tensor()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> _ = buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey

# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> _ = buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey

# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
...   pass

>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> _ = buffer.seek(0)
>>> b = torch.load(buffer)

If you have an attribute that you don't want to be serialized, do not store it as an attribute on the Tensor or Parameter. Instead, it is recommended to use torch.utils.weak.WeakTensorKeyDictionary:

>>> from torch.utils import weak
>>> foo_dict = weak.WeakTensorKeyDictionary()
>>> foo_dict[a] = 'hey'
>>> print(foo_dict[a])
hey

Algorithms {Adadelta, Adagrad, Adam, Adamax, AdamW, ASGD, NAdam, RAdam, RMSProp, RProp, SGD} default to faster foreach implementation when on CUDA + differentiable=False

When applicable, this changes the default behavior of step() and anything that calls into adadelta(...), adagrad(...), adam(...), adamax(...), adamw(...), asgd(...), nadam(...), radam(...), rmsprop(...), rprop(...), sgd(...) directly to use the foreach implementation instead of the for-loop for better performance. However, this change can potentially be backward incompatible since there may be small numerical differences between the results computed with the foreach implementation and the previous default. The foreach implementation will be the default only if the following conditions are met.

  1. The user has not specified kwargs relating to implementation (foreach, fused, or differentiable),
  2. All tensors are native tensors (not subclasses) and on CUDA,
  3. torch.jit.is_scripting is False.

When these conditions are satisfied, the implementation used will match the implementation used when one passes foreach=True. The user defined flag for foreach will NOT be overwritten in order to preserve user selections. For more details, check the documentation. There should be no significant differences between the results returned by these optimizers. To revert to the old behavior, say, for adam, pass in adam(..., foreach=False, ...) or initialize Adam with Adam(..., foreach=False, ...).

Pull Requests: #92306, #92716, #92723,#92724, #92726, #92727, #92728, #92715, #91896, #92730, #90865, #93184, #92181, #92923, #95415, #95818, #95811
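As a short sketch of the opt-out described above, the example below constructs Adam twice, once leaving foreach unset and once pinning foreach=False. The model and learning rate are illustrative assumptions; on CPU the default path is not foreach in either case, so this only shows how the user selection is recorded and preserved.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)

# Leaving foreach unset lets PyTorch pick the implementation
# (foreach is only chosen by default for native CUDA tensors).
default_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Passing foreach=False explicitly pins the for-loop implementation.
pinned_opt = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=False)

model(torch.randn(3, 4)).sum().backward()
pinned_opt.step()

print(default_opt.defaults["foreach"], pinned_opt.defaults["foreach"])
```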

torch.nn.utils.stateless.functional_call now respects tied weights (#90477)

Assume a module has two tied weights, x and x_tied. Previously, invoking functional_call(module, parameters_and_buffers, args, kwargs=None, *, strict=False) with a parameter dictionary of only one of the tied weights would result in the other one(s) not being updated.

We’ve changed the behavior so that providing one of the tied weights in the parameter dictionary will update all other tied weights. If you would like the behavior in previous versions of PyTorch, please set tie_weights=False.

Please also see the related deprecation section "torch.nn.utils.stateless.functional_call in favor of torch.func.functional_call".

1.13:
>>> class Foo(nn.Module):
...    def __init__(self):
...        super().__init__()
...        self.x = nn.Parameter(torch.zeros([]))
...        self.x_tied = self.x
...    def forward(self, inp):
...        return self.x + self.x_tied

>>> foo = Foo()
>>> params = {'x': torch.ones([])}
>>> result = functional_call(foo, params, torch.randn([]))
>>> print(result)

2.0:
>>> class Foo(nn.Module):
...    def __init__(self):
...        super().__init__()
...        self.x = nn.Parameter(torch.zeros([]))
...        self.x_tied = self.x
...    def forward(self, inp):
...        return self.x + self.x_tied

>>> foo = Foo()
>>> params = {'x': torch.ones([])}
>>> # tie_weights=False keeps the 1.13 behavior
>>> result = functional_call(foo,
...                         params,
...                         torch.randn([]),
...                         tie_weights=False)
>>> print(result)
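To make the difference concrete, here is a runnable sketch that calls the same module with and without tie_weights. It uses torch.func.functional_call, which these notes describe as the replacement for the stateless variant; the printed values follow from x being replaced by 1 while the module's own parameter is 0.

```python
import torch
from torch import nn
from torch.func import functional_call

class Foo(nn.Module):
    def __init__(self):
        super().__init__()
        self.x = nn.Parameter(torch.zeros([]))
        self.x_tied = self.x  # same Parameter object under a second name

    def forward(self, inp):
        return self.x + self.x_tied

foo = Foo()
params = {"x": torch.ones([])}
inp = torch.randn([])

# Default (tie_weights=True): replacing "x" also replaces "x_tied".
tied = functional_call(foo, params, (inp,))

# tie_weights=False: only "x" is replaced; "x_tied" keeps the module's value.
untied = functional_call(foo, params, (inp,), tie_weights=False)

print(tied.item(), untied.item())
```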

Require return_complex to be passed explicitly to torch.stft for real input (#86724)

torch.stft takes an optional return_complex parameter that indicates whether the output should be a floating point tensor or a complex tensor. return_complex previously defaulted to False for real input tensors. This PR removes the default and makes return_complex a required argument for real inputs. However, complex inputs will continue to default to return_complex=True.

1.13:
>>> a = torch.rand(1024)
>>> _ = torch.stft(a, n_fft=128)

2.0:
>>> t = torch.rand(1024)
>>> _ = torch.stft(t, n_fft=128, return_complex=False)
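An alternative to return_complex=False is to request the complex result and convert it where the old real layout is still needed. The sketch below assumes a Hann window and the default hop length; both are illustrative choices rather than requirements stated in these notes.

```python
import torch

t = torch.rand(1024)
window = torch.hann_window(128)

# Request a complex tensor explicitly.
spec = torch.stft(t, n_fft=128, window=window, return_complex=True)

# If downstream code still expects the old real layout with a trailing
# dimension of size 2 (real, imaginary), convert the complex result.
spec_real = torch.view_as_real(spec)

print(spec.dtype, tuple(spec.shape), tuple(spec_real.shape))
```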

Require inputs to torch.istft to be complex valued

torch.istft no longer supports input in the form of real tensors
with shape (..., 2) to mimic complex tensors. Instead, convert
inputs to a complex tensor first before calling torch.istft.

1.13:
>>> t = torch.rand(65, 33, 2)
>>> _ = torch.istft(t, n_fft=128, length=1024)

2.0:
>>> t = torch.rand(65, 33, 2)
>>> _ = torch.istft(t, n_fft=128, length=1024)
RuntimeError: istft requires a complex-valued input tensor matching the output from stft with return_complex=True.
>>> t_complex = torch.view_as_complex(t)
>>> _ = torch.istft(t_complex, n_fft=128, length=1024)
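The conversion can be checked end to end with a round trip. In this sketch the "legacy" real-valued spectrogram is produced from a known signal so the reconstruction can be compared against it; the Hann window is an assumption made so the inverse transform is well conditioned.

```python
import torch

signal = torch.rand(1024)
window = torch.hann_window(128)

# Stand-in for a legacy real-valued spectrogram of shape (..., 2).
legacy = torch.view_as_real(
    torch.stft(signal, n_fft=128, window=window, return_complex=True)
)

# Convert to a complex tensor before calling istft, as 2.0 requires.
spec = torch.view_as_complex(legacy)
recovered = torch.istft(spec, n_fft=128, window=window, length=1024)

print(tuple(legacy.shape), tuple(recovered.shape))
```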

Change default behavior of sparse tensor construction to not do component verification (#92094)

We now disable the costly component verification of torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor by default. The user can use the new check_invariants flag or torch.sparse.check_sparse_tensor_invariants to locally enable component verification. This allows users to constrain these costly checks to specific regions of their code and enables better overall performance. Previously users had no access to public constructors that disable these checks.

1.13:
>>> i = [[0, 1, 1],
...      [2, 0, 5]]
>>> v =  [3, 4, 5]
>>> s = torch.sparse_coo_tensor(i, v, (2, 3))
RuntimeError: size is inconsistent with indices: for dim 1, size is 3 but found index 5

2.0:
>>> i = [[0, 1, 1],
...      [2, 0, 5]]
>>> v =  [3, 4, 5]
>>> s = torch.sparse_coo_tensor(i,
...                            v,
...                            (2, 3),
...                            check_invariants=True)
RuntimeError: size is inconsistent with indices: for dim 1, size is 3 but found index 5
>>> with torch.sparse.check_sparse_tensor_invariants():
...     s = torch.sparse_coo_tensor(i, v, (2, 3))
RuntimeError: size is inconsistent with indices: for dim 1, size is 3 but found index 5
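The two opt-in mechanisms can be exercised in a script as follows. This is a sketch only: the helper raises_runtime_error is hypothetical, and the invalid indices are the same ones used in the example above.

```python
import torch

indices = [[0, 1, 1],
           [2, 0, 5]]  # column index 5 is out of range for size (2, 3)
values = [3, 4, 5]

def raises_runtime_error(build):
    """Return True if calling build() raises RuntimeError."""
    try:
        build()
    except RuntimeError:
        return True
    return False

enabled_before = torch.sparse.check_sparse_tensor_invariants.is_enabled()

# Opt in for a single constructor call.
flag_caught = raises_runtime_error(
    lambda: torch.sparse_coo_tensor(indices, values, (2, 3), check_invariants=True)
)

# Opt in for a region of code.
def build_in_checked_region():
    with torch.sparse.check_sparse_tensor_invariants():
        torch.sparse_coo_tensor(indices, values, (2, 3))

region_caught = raises_runtime_error(build_in_checked_region)

# The previous checking state is restored when the region exits.
enabled_after = torch.sparse.check_sparse_tensor_invariants.is_enabled()

print(flag_caught, region_caught, enabled_before == enabled_after)
```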

Remove deprecated functionality from torch.testing

Historically, torch.testing exposed a lot of private and undocumented functionality publicly. The 2.0 release completes the deprecation cycle for the following items and removes them:

Hooks registered on tensor to always run, even if they are the inputs to .grad() (#85849)

This is a bug fix. Per the docs, hooks registered to a Tensor should fire any time gradients are computed w.r.t. that tensor. This change corrects the behavior to be consistent with the documentation. See documentation for more details about backward hooks execution.


a = torch.tensor(1., requires_grad=True)
b = a.clone()
b.register_hook(hook)  # the hook registered here didn't fire before!
torch.autograd.grad(b.clone(), inputs=(b,))
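A self-contained version of the snippet above, with a hypothetical hook that only records its invocations, could look like this:

```python
import torch

calls = []

def hook(grad):
    # Record that the hook ran; returning None leaves the gradient unchanged.
    calls.append(grad.clone())

a = torch.tensor(1.0, requires_grad=True)
b = a.clone()
b.register_hook(hook)

# b is passed as an input to autograd.grad; in 2.0 its hook still fires.
(grad_b,) = torch.autograd.grad(b.clone(), inputs=(b,))

print(len(calls), grad_b)
```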

grad_fn post-hooks can always observe the modifications to gradient by any grad_fn pre-hooks or hooks registered to Tensor, even if this is a leaf tensor (#85849)

This corrects the behavior of hooks to be consistent with the documentation in the case where the tensor is a leaf tensor, i.e. the node is a grad accumulator node. See documentation for more details about backward hooks execution.


def hook(grad):
   # updates grad
   return grad * 3

def hook2(grad_input, grad_output):
   # Before this change, grad_output would NOT see the x3
   print(grad_output)

a = torch.tensor(1., requires_grad=True)
b = a.clone()
acc_grad = b.grad_fn.next_functions[0][0]
a.register_hook(hook)
acc_grad.register_hook(hook2)
torch.autograd.backward(b.clone(), inputs=(a,))  # hooks fire

Remove FSDP params_with_grad (#87480)

In FSDP, we used to have an API params_with_grad for users to get parameters which have gradients from the FSDP module. We decided not to expose this helper because it is not a common paradigm.

1.13:
m = FullyShardedDataParallel(module)
m.params_with_grad()

2.0:
m = FullyShardedDataParallel(module)
m.params_with_grad()  # Runtime error thrown
# For work-around, users can still do
[p for p in m.parameters() if p.grad is not None]

Users doing wildcard import of torch.distributed.fsdp.fully_sharded_data_parallel will no longer get non-public symbols (#87917)

Users could previously import both public and non-public symbols:

1.13:
from torch.distributed.fsdp.fully_sharded_data_parallel import *
ShardingStrategy.FULL_SHARD # Non-public API
FullyShardedDataParallel(module) # public API

2.0:
from torch.distributed.fsdp.fully_sharded_data_parallel import *
ShardingStrategy.FULL_SHARD # Non-public API, this will fail now
FullyShardedDataParallel(module) # public API
# Users can instead
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    FullyShardedDataParallel,
    ShardingStrategy,
)
FullyShardedDataParallel(module, sharding_strategy=ShardingStrategy.FULL_SHARD)

Signature of FSDP auto_wrap_policy related APIs were changed in (#88450).

1.13:
lambda_auto_wrap_policy(m, unwrapped_params=...)
transformer_auto_wrap_policy(m, unwrapped_params=...)
size_based_auto_wrap_policy(m, unwrapped_params=...)

2.0:
lambda_auto_wrap_policy(m, nonwrapped_numel=...)
transformer_auto_wrap_policy(m, nonwrapped_numel=...)
size_based_auto_wrap_policy(m, nonwrapped_numel=...)

Updated alltoall signature to be consistent with other c10d APIs (#90569)

The keyword argument names have been changed.

1.13:
alltoall(output=..., input=...)

2.0:
alltoall(output_tensors=..., input_tensors=...)

Remove unused functions in (#90025)

This commit removes the following unused functions from both the torch.quantization and the namespaces:

  • graph_pretty_str
  • get_per_tensor_qparams
  • quantize_node
  • get_qconv_op
  • create_qparam_nodes
  • node_return_type_is_int
  • is_get_tensor_info_node

Make accept inputs in the right order (#90698)

The existing BackendConfig fusion pattern uses a "reversed nested tuple" format that is unintuitive.
This pattern format also complicates the signatures of the user specified "fuser methods", which needed to accept arguments in reverse nested order to match
the patterns:

1.13:
import torch.nn as nn
import as nni
from import (
def fuse_linear_relu(is_qat, relu, bn_conv):
    (bn, conv) = bn_conv
    return nni.ConvBnReLU2d(conv, bn, relu)

config = (
    BackendPatternConfig((nn.ReLU, (nn.BatchNorm2d, nn.Conv2d)))

backend_config.configs  # returns Dict[Pattern, BackendPatternConfig]

2.0:
def fuse_linear_relu(is_qat, conv, bn, relu):
    return nni.ConvBnReLU2d(conv, bn, relu)

config = (
    BackendPatternConfig((nn.Conv2d, nn.BatchNorm2d, nn.ReLU))

# Or for backward-compatibility
def fuse_linear_relu(is_qat, relu, bn_conv):
    (bn, conv) = bn_conv
    return nni.ConvBnReLU2d(conv, bn, relu)

config = (
    ._set_pattern_complex_format((nn.ReLU, (nn.BatchNorm2d, nn.Conv2d)))

backend_config.configs  # returns List[BackendPatternConfig]

Make the AO codebase compliant with the public vs private API guidelines of pytorch Public-API-definition-and-documentation

If users were using any of the AO private APIs then these would have to be accessed with a preceding _ to conform with the guidelines.


Pull Requests: (#86029, #87515, #87516, #87517, #87518, #87519, #88392, #88394, #88396, #88397, #87521, #88395, #87883, #88399, #88398, #86022, #86023, #86024, #86025, #86026, #86027, #86028, #86030, #86031, #86032, #86033, #86034, #86037, #90315, #88391, #90554, #87520)

Remove overwrite_output_observer and represent the observer constraints for fixed qparams ops through the existing DTypeWithConstraints mechanism (#88620)

This commit removes overwrite_output_observer and overwrite_output_fake_quantize overwrite observer settings in the BackendConfig. Instead, we represent the observer constraints for
fixed qparams ops through the existing DTypeWithConstraints mechanism. Note that, however, to be consistent with other DTypeWithConstraints checks, we no longer throw an error if an incorrect observer is specified, but simply ignore the offending QConfig and log a warning instead. This is the BC-breaking part of the change.

from import default_qconfig
from import prepare_fx

model = ModelWithFixedQParamsOps()
qconfig_mapping = QConfigMapping().set_global(default_qconfig)
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)

Before this commit, running the above leads to an exception because the wrong observers are used for fixed qparams ops. After this commit, the above will only encounter a warning, and the fixed qparams ops will not be quantized. In both cases, switching to get_default_qconfig_mapping will cause the fixed qparams ops to be quantized.

Remove and

The following classes under the namespace are migrated to the

  • QuantizeHandler
  • BinaryOpQuantizeHandler
  • CatQuantizeHandler
  • ConvReluQuantizeHandler
  • LinearReLUQuantizeHandler
  • BatchNormQuantizeHandler
  • EmbeddingQuantizeHandler
  • RNNDynamicQuantizeHandler
  • DefaultNodeQuantizeHandler
  • FixedQParamsOpQuantizeHandler
  • CopyNodeQuantizeHandler
  • GeneralTensorShapeOpQuantizeHandler
  • CustomModuleQuantizeHandler
  • StandaloneModuleQuantizeHandler

The following classes under the namespace are migrated to the

  • DefaultFuseHandler
  • FuseHandler

Remove public APIs under the namespace(#89810)

The following APIs that were mistakenly public under the namespace are removed in this commit.

  • get_quantize_handler_cls
  • get_fusion_pattern_to_fuse_handler_cls
  • get_native_quant_patterns
  • get_pattern_to_quantize_handlers
1.13 2.0
from import (
all_quant_patterns = get_native_quant_patterns()
from import (
from import (
from import (
all_quant_patterns = _get_pattern_to_quantize_handlers(

Update torch.{slice|select|diagonal|as_strided}_scatter ops to preserve input stride/storage_offset (#91029)

These operators are primarily used by the functionalization pass, used in AOTAutograd. Previously, they would always return contiguous tensors. Now, they return a tensor with the same striding as their first argument.

1.13:
>>> x = torch.zeros(2, 2, 2)
>>> base = x[:, :, 1]
>>> base.stride()
(4, 2)
>>> torch.diagonal_scatter(base, torch.ones(2)).stride()
# returns a contiguous tensor
(2, 1)

2.0:
>>> x = torch.zeros(2, 2, 2)
>>> base = x[:, :, 1]
>>> base.stride()
(4, 2)
>>> torch.diagonal_scatter(base, torch.ones(2)).stride()
# returns a tensor with same strides as base.
(4, 2)

Remove ONNX deprecated monkey patches to torch.Graph (#94747)

The deprecated monkey patches that torch.onnx applied to the classes torch.Graph, torch.Block and torch.Node have been removed. This means the methods torch.Graph.op(), torch.Block.op(), torch.Graph.constant(), and torch.Node.__getitem__ are no longer available.

Users creating custom symbolic functions for the torch.onnx exporter can continue to assume the g.op() interface for creating an operator in the graph, which is now exposed via the GraphContext class. Users should not assume any other methods from the GraphContext class other than those defined natively by torch.Graph and .op().

Code change to existing symbolic functions is not expected with this change.

Add full checker mode in torch.onnx.export (#83186)

This removes the boolean full_check parameter from the check_onnx_proto API and always performs the full check, emitting a warning message if it fails.

Also, the API previously didn’t check types in the graph even with full_check=True. With this change, a warning message is shown if the graph contains a type error.

C++ API specific BC-Breaking Changes:

Deleted torch::deploy from PyTorch Core (#85953)

torch::deploy has been migrated to MultiPy. Ongoing development will continue in that repository.

Remove the use of lazy::View (#87822)

The view and aliasing infrastructure in lazy tensor core has been deprecated in favor of functionalization.

Renamed c10::fromIntArrayRef to c10::fromIntArrayRefSlow and changed call sites (#86235)

The function has been renamed to more accurately reflect its performance characteristics.

Deprecations


torch.func aka functorch

We’ve deprecated the functorch module in favor of the new torch.func module

We’re excited to announce that, as the final step of upstreaming and integrating functorch into PyTorch, the functorch APIs are now available in the torch.func module. Our function transform APIs are identical to before, but we have changed how the interaction with NN modules work.

We’ve deprecated functorch.* function transforms (e.g. vmap, grad, jvp) in favor of their identical torch.func.* counterparts (#92279).
PyTorch has consolidated on torch.func.functional_call as the NN module functional API. Please migrate from functorch.{make_functional, make_functional_with_buffers} to it. For more details see this Guide
Please migrate from functorch.combine_state_for_ensemble to torch.func.stack_module_state. For more details see this Guide
We are no longer supporting functorch.compile (also known as AOTAutograd) as a frontend for compilation in PyTorch; we have integrated AOTAutograd into PyTorch’s compilation story. If you are a user, please use torch.compile() instead.
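As a rough sketch of the migration away from functorch.make_functional, the example below computes parameter gradients with torch.func.functional_call and torch.func.grad. The linear model, the data, and the MSE loss are illustrative assumptions.

```python
import torch
from torch import nn
from torch.func import functional_call, grad

torch.manual_seed(0)
model = nn.Linear(3, 1)
x, y = torch.randn(4, 3), torch.randn(4, 1)

# Instead of functorch.make_functional(model), pass parameters as a dict.
params = {name: p.detach() for name, p in model.named_parameters()}

def loss_fn(params, x, y):
    pred = functional_call(model, params, (x,))
    return nn.functional.mse_loss(pred, y)

# grad differentiates with respect to the first argument (the params dict).
grads = grad(loss_fn)(params, x, y)

print({name: tuple(g.shape) for name, g in grads.items()})
```

The result is a dictionary with the same keys as params, so it can be consumed by code that previously iterated over the tuple returned by make_functional.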

Python API

Deprecate TypedStorage, its derived classes, and all of their public methods (#85303)

Typed storages have been removed from the C++ side and torch.UntypedStorage is used in place. The use of torch.TypedStorage and all of its subclasses is now deprecated.


If you need to access individual elements in a storage as a particular dtype, you can simply create a tensor to view it:

torch.tensor(storage, dtype=...)

Deprecate tensor.mT, tensor.T, tensor.mH, tensor.H on 0D-tensors (#92143)

1.13:
>>> a = torch.tensor(10)
>>> a.T
>>> a.H

2.0:
>>> a = torch.tensor(10)
>>> a.T
UserWarning: Tensor.T is deprecated on 0-D tensors. This function is the identity in these cases.
>>> a.H
UserWarning: Tensor.H is deprecated on 0-D tensors. Consider using x.conj().
Autograd API

Deprecate decorating classes with torch.no_grad (#89522)

Decorating classes with torch.no_grad is now deprecated. You should be decorating its functions or methods instead. To preserve the current behavior of class decoration, you can directly decorate the __init__ method and nothing else.

1.13:
@torch.no_grad()
class Blah():
  pass

2.0:
class Blah():
  @torch.no_grad()
  def __init__(self):
    pass
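The method-level form can be verified with a small sketch. The class body and the method names evaluate and train_step are hypothetical; the point is that only the decorated methods run with gradient tracking disabled.

```python
import torch

class Blah:
    @torch.no_grad()
    def __init__(self):
        # Gradient tracking is disabled only while __init__ runs.
        self.grad_enabled_in_init = torch.is_grad_enabled()

    @torch.no_grad()
    def evaluate(self, x):
        return x * 2

    def train_step(self, x):
        return x * 2

blah = Blah()
x = torch.ones(3, requires_grad=True)
eval_tracks_grad = blah.evaluate(x).requires_grad
train_tracks_grad = blah.train_step(x).requires_grad

print(blah.grad_enabled_in_init, eval_tracks_grad, train_tracks_grad)
```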


Remove the use of overload at::frobenius_norm(const Tensor&) (#81762)

In continuation of the deprecation process from release 1.12, the tensor overload for this function has been removed. This function was not used in the bindings of PyTorch and should not impact users of torch.norm.

torch.nn API

Canceling deprecation of functional.{tanh, sigmoid} functions (#86905)

Both these ops are heavily used and so will not be removed. Deprecation warnings have been removed.

Deprecated torch.nn.utils.stateless.functional_call in favor of torch.func.functional_call (#92280)

We’ve moved torch.nn.utils.stateless.functional_call under the torch.func module to reflect how it is useful for working with nn.Modules in a functional style. As of PyTorch 2.0, torch.func.functional_call is a drop-in replacement for torch.nn.utils.stateless.functional_call and we will remove torch.nn.utils.stateless.functional_call in a future version of PyTorch. However, please note that we did change the default behavior of torch.nn.utils.stateless.functional_call in PyTorch 2.0 (see “torch.nn.utils.stateless.functional_call now respects tied weights” under BC-breaking notes).


Deprecated private API torch._six (#94709)

Removed the Python 2 and 3 compatibility library six and future and torch._six.

# from torch._six import string_classes
# from torch._six import int_classes
# from torch._six import inf, nan
from torch import inf, nan
# torch._six.string_classes


Deprecated Caffe2 ONNX exporter support (#95071)

Users must use PyTorch 1.x versions to use Caffe2 ONNX exporter. This capability will be completely removed from PyTorch 2.x series.

New Features

torch.nn API

  • Add torch.nn.functional.scaled_dot_product_attention() to allow writing fast Transformer-like functions and use it to speed up nn.Transformer() (#91362, #91066, #90413, #87312, #94008, #89470, #90776, #92189)
  • Add hooks for Module.register_{buffer,module,parameter} functions (#86148, #87369)
  • Add Module.full_backward_pre_hook (#86700)
  • Add Module.state_dict_pre_hook (#90435)
  • Add Module.call_super_init: bool flag that can be used to ensure Module initialization is properly calling parent’s __init__ (#91819)
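To illustrate the first item in this list, the sketch below calls scaled_dot_product_attention with a causal mask and compares it against a plain softmax(QK^T / sqrt(d))V reference. The tensor sizes and the tolerance are arbitrary assumptions chosen for a quick CPU check.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# (batch, heads, sequence length, head dimension); sizes are arbitrary.
q, k, v = (torch.randn(2, 4, 6, 8) for _ in range(3))

# Fused operator with a causal mask.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Reference: softmax(Q K^T / sqrt(d)) V with the same causal mask.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
causal = torch.ones(6, 6, dtype=torch.bool).tril()
scores = scores.masked_fill(~causal, float("-inf"))
reference = torch.softmax(scores, dim=-1) @ v

print(torch.allclose(fused, reference, atol=1e-4))
```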



  • Introduce CUDA Device Assertions Infrastructure (#84609)
  • Logcumsumexp for complex dtypes for CUDA (build-time optimized) (#94310)
  • Caching allocator tracing (#86241)
  • Add Pluggable CUDA allocator backend (#86786)
  • Add cudaMallocAsync as an alternative backend for the CUDA allocator (#82682)


  • Add set_to_none flag for C++ optim endpoint (#92989)

NestedTensor API

  • Add support for for NestedTensor backend (#87146)
  • Add backwards support for gelu and relu operators (#94776)
  • Add support for torch.neg operator (#88131)




Foreach API


  • Add XNNPACK Delegate Framework.
  • Add support for better benchmarking
    • Add support in lite_predictor benchmark binary to select event lists and perform benchmarking using Linux perf through Kineto profiler (#87876)
    • List all missing ops at once (#94205)

Sparse API

  • Add torch.sparse.check_sparse_tensor_invariants context manager that allows users to opt into more checks at runtime for better debugging. (#92094)
  • Add check_invariants flag to torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor to allow users to verify components at construction time. (#92094)
  • Add reduce flag for CPU to with support for sum, mean, amax, amin (#83727)

Optimizer API

  • Make {Adadelta, Adagrad, Adamax, AdamW, ASGD, NAdam, RAdam, RProp} differentiable (#86096, #86258, #86183)
  • Publicly expose _LRScheduler to LRScheduler (#88503)
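With the base class now public, a custom scheduler can subclass LRScheduler directly. The following is a minimal sketch; HalvingLR is a hypothetical scheduler written for illustration and is not part of PyTorch.

```python
import torch
from torch.optim.lr_scheduler import LRScheduler

class HalvingLR(LRScheduler):
    """Hypothetical scheduler that halves the learning rate every step."""

    def get_lr(self):
        return [base_lr * 0.5 ** self.last_epoch for base_lr in self.base_lrs]

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = HalvingLR(optimizer)

# Record the learning rate after construction and after each step.
lrs = [scheduler.get_last_lr()[0]]
for _ in range(2):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

print(lrs)
```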


  • Add a transform for positive-definite matrices. (#76777)


  • Set up new module (#85599)
  • Add the Nuttall window to signals/ (#90103)
  • Implement old signal/windows in Python (#87082, #87330)



  • Add Vulkan support for several torch operators:
    • torch.abs (#87414)
    • for height and width dimensions (#94612)
  • Vulkan optimization passes now automatically apply data transfers between the CPU and GPU for input and output tensors (#87432)
    • If the requires_backend_transfers flag of a model is set to false, then input tensors do not need to be transferred to the GPU (via tensor.gpu()) and output tensors do not need to be transferred back to the CPU (via tensor.cpu()), since these transfers are inserted into the model
    • To avoid inserting data transfers into a model, add MobileOptimizer.VULKAN_AUTOMATIC_GPU_TRANSFER under torch.utils.mobile_optimizer to the optimization_blocklist argument of optimize_for_mobile (#92081)


  • hipGraph support for pytorch mainline (#88202)



  • Allow freezing JIT modules that contain mutable interfaces (#86039, #91020)
  • Apply Linear-BatchNormNd folding during torch.jit.freeze (#86706)
  • Add an option to skip loading of debug traces, in order to reduce memory usage (#91430)
  • Introduce torch.jit._drop function modifier to avoid compiling a method on a non-nn.Module class (#93012)
  • Allow providing a kwargs-like dict of example inputs to torch.jit.trace with the new example_kwarg_inputs argument (#81623, #94032)
  • Include example input shapes when serializing jit.traced modules to assist with debugging (#90744)


  • Add Ada Lovelace (cuda arch sm8.9) support (#87436)
  • Add an option to disable TORCH_WARN and TORCH_WARN_ONCE log (#87188)
  • Enable memory map file support for Android, Apple, and CXX (#88545)
  • Support DNNL_GRAPH_CPU_RUNTIME=TBB build option (#87512)



  • Add an environment variable to skip cudnn version compatibility check (#89184)
  • Enable cuDNN Frontend v8 API by Default (#91117)


Python API

  • Set std/var correction overloads default value to None (#56398)
  • Implement correction argument in torch.masked.{std,var} (#87118)
  • Update torch.squeeze to allow squeezing multiple dimensions at once (#89017)
  • Add support for int32 indices in index/index_put ops (#86309)
  • Enable where to have cpu scalar args (#87022)
  • Add support for NumPy scalars to torch.asarray (#90914)
  • Update opt_einsum to have more reasonable defaults (#86985)
  • Improve error message for Tensor.set_ when dtypes mismatch (#88804)
  • Enable out variant of torch.max (#85926)
  • Implement faster gradient clipping using foreach function (#91846)
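Two of the items above are easy to try directly: squeezing several dimensions at once and indexing with int32 tensors. The shapes and values in this sketch are illustrative.

```python
import torch

x = torch.zeros(1, 3, 1, 4)

# Squeeze several dimensions in one call by passing a tuple.
squeezed = torch.squeeze(x, dim=(0, 2))

# Index tensors may be int32 as well as int64.
values = torch.tensor([10.0, 20.0, 30.0, 40.0])
idx = torch.tensor([3, 0], dtype=torch.int32)
picked = values[idx]

print(tuple(squeezed.shape), picked.tolist())
```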

Autograd API

  • Add backward support for torch.ormqr (#86800)
  • Pre-hooks registered on tensor are guaranteed to run before pre-hooks registered on grad_fn (#85849)
  • Add a new overridable method setup_context (#89859, #92312)
    • You must override this method if you plan to use your autograd Function with functorch
    • If you choose to override this method, forward should no longer take ctx as an input.
  • Add context manager torch.autograd.set_multithreading_enabled for disabling multithreading in the autograd engine (#86245)
  • Add backward AD support for unary foreach functions (#89591)
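
The `setup_context` item above can be sketched roughly as follows. `Square` is a hypothetical example Function, not something from the release. Note that `forward` no longer receives `ctx`, and the tensors needed for backward are saved in `setup_context` instead.

```python
import torch


class Square(torch.autograd.Function):
    """Hypothetical example: y = x * x with a hand-written backward."""

    @staticmethod
    def forward(x):
        # No ctx argument here once setup_context is overridden.
        return x * x

    @staticmethod
    def setup_context(ctx, inputs, output):
        # inputs is the tuple of arguments that was passed to forward.
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output


# Regular autograd usage.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
Square.apply(x).sum().backward()
eager_grad = x.grad

# The same Function used under a torch.func transform.
func_grad = torch.func.grad(lambda t: Square.apply(t).sum())(
    torch.tensor([1.0, 2.0, 3.0])
)
```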

torch.nn API

  • Add remove_duplicate flag to Module.named_buffers() method (#84984) and Module.named_parameters() (#88090)
  • Add kwarg support for Module forward-pre and forward hooks (#89389)
  • Improve error message for Transformer() fast path (#90783) and kernel selection (#90783)
  • Add support for torch.bf16 for Embedding (#94163)
  • Add freeze argument to Embedding() (#86769)
  • Add torch.channels_last_3d support for SyncBatchNorm() (#88401)
  • Add torch.bfloat16 support on CPU for functional.{mish,hardtanh,silu} (#82460)
  • Add support for inputs with different data types for LayerNorm() (#81851, #88064), BatchNorm{1,2,3}d() (#84410), GroupNorm() (#89485, #81852, #88663, #92671, #92668)
  • Improve printing of ModuleList() (#90452)
  • Add torch.uint8 support for functional.interpolate() on CPU (#90771)
  • Make functional.max_pool1d error checking consistent between CPU and CUDA (#90211)
  • Add SyncBatchNorm() fallback to BatchNorm() when it is used in a non-distributed setting (#89706)
  • Add channels-last support for GroupNorm() on XPU (#87680)
  • Add is_causal kwarg to TransformerEncoder() layer (#90508)
  • Add prepend argument to Module hooks to register a hook that will be called before the existing ones (#87370)
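
A small sketch of the hook-related items above (`with_kwargs` and `prepend`), assuming PyTorch 2.0 or later. The `Scale` module and the hook names are hypothetical and only serve the illustration.

```python
import torch
from torch import nn


class Scale(nn.Module):
    """Hypothetical module whose forward takes a keyword argument."""

    def forward(self, x, factor=1.0):
        return x * factor


calls = []


def first_hook(module, args):
    calls.append("first")


def prepended_hook(module, args):
    calls.append("prepended")


def kwargs_hook(module, args, kwargs):
    # With with_kwargs=True the hook also receives the keyword arguments.
    calls.append(("kwargs", sorted(kwargs)))


layer = Scale()
layer.register_forward_pre_hook(first_hook)
# prepend=True places this hook ahead of the hooks registered so far.
layer.register_forward_pre_hook(prepended_hook, prepend=True)
layer.register_forward_pre_hook(kwargs_hook, with_kwargs=True)

out = layer(torch.ones(2), factor=3.0)
print(calls)
```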


  • Activation checkpointing
    • Return None from apply_activation_checkpointing (#87871)
    • Enable non-reentrant support for checkpoint_sequential (#86331)
    • Separate CPU offload activation to its own wrapper (#85459)
  • DistributedDataParallel
    • Add PackedSequence support when device_ids is specified (#86614)
    • Enable DDP to handle custom dataclass forward outputs (#92334)
  • Distributed (c10d)
    • Add sequence number support for UCC PG (#85047)
  • FullyShardedDataParallel
    • Default to BACKWARD_PRE for the backward_prefetch of FSDP (#88428)
    • Skip collective communications for NO_SHARD in clip_grad_norm_ (#89137)
    • Allow handle training state to be both BACKWARD_PRE and BACKWARD_POST in the post-backward assert (#89791)
    • Limit all gather after pre-unshard (#89057)
    • Include module classes in ModuleWrapPolicy.__repr__ (#89058)
    • Apply the "largest" dtype across all parameters/gradients as defined by PyTorch's type promotion semantics for the total norm returned in clip_grad_norm_ for low prec grads (#90028)
    • Introduce ModuleWrapPolicy for simplicity in FSDP autowrap (#88450)
    • Enable mixed hybrid/non-hybrid sharding strategies (#90846)
    • Re-support model dtype change after FSDP init (#91192)
    • Enable use_orig_params=True, no_sync and mixed precision to work together (#91193)
    • Enable summon_full_params(with_grads=True) (#85738, #87314)
    • Add keep_low_precision_grads support when CPU offloading (#86495)
    • Consolidate FSDP state_dict offload_to_cpu settings (#86211)
    • Add set_state_dict_type API to setup state_dict_type without using context manager (#86243)
    • Enable the support of use_orig_param for FSDP’s optim_state_dict (#89898, #89899, #89900)
    • Enable nested FSDP wrapper to use different mixed precision (#90523)
    • Enable input cast skip in MixedPrecision (#90620)
    • Publish optim_state_dict and optim_state_dict_to_load for FSDP (#90798, #91343, #92744, #92118, #92991, #92992, #93285, #93318, #94109, #94129)
    • Make default input casting in root module only and enable the ability to set different mixed precisions for different submodules (#91365)
  • Torch Elastic
    • Update torchrun and TorchElastic to take optional local_addr param to allow skip local IP lookup if specified (#88922)
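
For the `checkpoint_sequential` item in the list above, a minimal single-process sketch follows. It assumes the item refers to `torch.utils.checkpoint.checkpoint_sequential` and its `use_reentrant` flag. The model and sizes are arbitrary.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8), nn.ReLU())
x = torch.randn(4, 8, requires_grad=True)

# Split the model into 2 segments; activations inside each segment are
# recomputed during backward instead of being stored.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()

# Reference forward pass without checkpointing, for comparison.
reference = model(x)
```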


  • Update vmap to accept None(s) in out_dim (#91644)
  • torch.func.jacrev: Support chunked computation (#89376, #91326)
  • vmap: chunk_size support (#91157)
  • torch.vmap: Implement checks (rather than internal asserts) for vmap escaped errors (#89585)
  • Avoid calling allclose in the backward if there are tensor subclasses (#91444)
  • Refactor NN stateless APIs by swapping module tensors (#92536)
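
The chunking options mentioned above can be tried as in this rough sketch, assuming the `chunk_size` keyword of `torch.func.vmap` and `torch.func.jacrev`. The functions `row_energy` and `cube` are made up for the example.

```python
import torch
from torch.func import jacrev, vmap


def row_energy(row):
    return row.pow(2).sum()


batch = torch.arange(24, dtype=torch.float32).reshape(8, 3)

# Same result either way; chunk_size trades speed for lower peak memory
# by processing the batch two rows at a time.
full = vmap(row_energy)(batch)
chunked = vmap(row_energy, chunk_size=2)(batch)


def cube(x):
    return x ** 3


point = torch.tensor([1.0, 2.0, 3.0])
# Compute the Jacobian one row at a time.
jac = jacrev(cube, chunk_size=1)(point)
```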


  • Use binary units for CUDA memory summary (#91854)
  • Improve perf by avoiding implicit string creation in c10_cuda_check_implementation (#88350)
  • Add option to record C++ backtraces in _record_memory_history (#86145)
  • Set CUDA_MODULE_LOADING to LAZY when not set by the user (#85692)
  • Add warning if captured graph is empty (#88754)
  • Add option to dump a captured graph for debugging (#85519)
  • Add support to foreach torch zero for bfloat16s (#90437)
  • Enable bfloat16 for hardtanh_backward_cuda (#91511)
  • Use pytree to allow any input format for cuda graph (#90941)
  • Add requested_bytes to CUDA Caching Allocator Stats (#88575)
  • Add an option to disable reduced precision reductions for BF16 GEMM (#89172)
  • Add an env variable to disable addmm_cuda_lt kernel (#91436)


  • Add XPU backend to support and torch.load (#89679)


  • Reduce ambiguity in Tensor namespace collisions (#92266)

Dataloader API

  • Add support for pin memory on xpu device (#86545)
  • Add type annotation to get_worker_info (#87017)
  • Allow prefetch factor to be optional (#88972)
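
As a small illustration of the optional prefetch factor, the sketch below builds a single-process loader with `prefetch_factor=None`. The dataset is a made-up toy tensor.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, get_worker_info

dataset = TensorDataset(torch.arange(10, dtype=torch.float32))

# With num_workers=0 there is nothing to prefetch, so None is accepted.
loader = DataLoader(dataset, batch_size=4, num_workers=0, prefetch_factor=None)
batch_sizes = [batch[0].shape[0] for batch in loader]

# In the main process (outside a worker) get_worker_info() returns None.
main_process_info = get_worker_info()
```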

NestedTensor API

  • Add add/mul for nested dense [B, *, D], [B, 1, D] case (CUDA-only) (#88289)
  • Add support for over irregular dimensions (#88585)
  • Add torch.nested.nested_tensor() constructor (#88213)
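
A minimal sketch of the `torch.nested.nested_tensor()` constructor listed above, assuming PyTorch 2.0 or later. The component shapes are arbitrary.

```python
import torch

# Two sequences of different lengths sharing the same feature size.
a = torch.ones(2, 3)
b = torch.ones(4, 3)
nt = torch.nested.nested_tensor([a, b])

# Recover the components, or pad to a regular dense tensor.
parts = nt.unbind()
padded = torch.nested.to_padded_tensor(nt, 0.0)
```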

Complex API

  • Improve complex support for: torch.nn.functional.conv_transpose3d (#87967), torch.log1p (#89214, #90422), torch.lerp (#75584), torch.logcumsumexp for CPU (#93153)
  • Solve under/overflow for complex division (#92539)


  • Improve coverage of primtorch and torch._ref decompositions: prims.clone (#86705), ndtr, ndtri, log_ndtr, erfcx (#86077), NLL loss (#81128), conv backward (#87047), xlogy and xlog1py (#77712), alpha_dropout (#87989)
  • More operations now work with meta tensors: _adaptive_avg_pool2d_backward (#86359), (#87074), avg_pool2d and avg_pool2d_backward (#87043), scalar_tensor and argmax (#88590), topk (#88694), max_pool2d_with_indices_backward (#88743), grid_sampler_2d_backward (#88745), linalg_cholesky and linalg_cholesky_ex (#89430), aten._cdist_forward (#90042), aten.pixel_shuffle (#91605)

Linalg API

  • Fix typos in messages under aten (#88964)


  • Improve CoreML logging and dependent libraries.
    • Updated Cocoapods (#88075)
    • Preserved CoreML errors by using special throw macro when encountering CoreML API errors (#86938)
  • Clean Up MobileOptimizerType Rewrite Flags Public API and Documentation (#91600)
  • Clean up flatbuffer lib dependency and fixed its test to match pkl models (#86041, #93022)
  • Type corrections to avoid unnecessary static_casts (#93898)
  • Add flake8-logging-format linter (#90805, #94840)

Sparse API

  • Add autograd support for linear (#86137, #86302), mm, log1p (#86301, #88155), to_sparse_* (#90281)
  • Improve support for sparse_dim, dense_dim (#86203, #86203), torch.sum (#86300, #92979), torch.sparse.sampled_addmm (#86401), frac, deg2rad, rad2deg, relu (#88153, #88156, #88442, #86749), conj() (#91695), to_sparse (#90718), sparse_mask (#92248, #94829)
  • Add support for per batch index contiguity in CSR/CSC/BSR/BSC (#91243), non-contiguous values in CSR/CSC/BSR/BSC (#91243), non-zero dense_dim to COO/CSC/BSR/BSC/Strided conversions. (#90177), uncoalesced operands to sparse_mask (#91964)
  • Improve error messages for indices, values, (c)row_indices, (c)col_indices (#93149) and addmm (#94843)
  • Extend gradcheck to BSR and BSC inputs. (#90719)
  • Sort BSR indices as part of CSR to BSR conversion (#90918)


  • Implement aten::native_batch_norm.out for CPU (#88604)
  • Log1p for complex in CPU (#89691)
  • Enable oneDNN implementation for LSTM (#91158)


  • Add better debugging for torch.package (#92939)


  • Remove weight arg from DTypeConfig for non-weighted ops (#86335)
  • Add get_symmetric_qnnpack_qconfig_mapping for XNNPACK quantized ops (#87002)
  • Add assert for backend correctness in get_default_qconfig related apis (#86259)
  • Replacing List[QConfigMapping] in parallel numeric profiler (#86922)
  • Check the fixedqparam op qconfig based on backend_config (#87425)
  • Explicitly set default quantized engine instead of relying on the order of supported_qengines (#89804)
  • Support setting qconfig by module_type in QConfigMapping in PT 2.0 export flow (#92355)
  • Migration of quantization code from torch._ to (#86171, #86172)
  • Improvements to qnnpack fully connected sparse ops (#85243, #85244, #85245, #85246, #85247)
  • Support lowering of channel shuffle in FX (#83731)
  • Remove explicitly default QConfigMapping settings (#90066)
  • quant: make various configs printable (#91419)
  • Enable FX quant for patterns like x.view(x.size(...), ...) (#90001)
  • X86 qengine always uses fbgemm kernels on OS other than Linux (#93218)
  • Change prepare_fx and convert_fx to preserve the GraphModule type of input (#94412)
  • update xnnpack to newer version and update API usage in pytorch (#94330)
  • Remove _input_output_observed from backend_config (#92589)
  • Add support for LSTM Structured Pruning prune_functions + pattern (#90801)
  • Enable FX static quantization for LSTM (#85068)
  • Allow setting fixed quantization params for inner LSTM ops (#88456)
  • Add support for GRU in fx graph mode quantization (#91976)


  • Operator support col2im opset 18 (#84594), mse_loss (#90717), aten::contains (#91660), src/index dynamic axes support for aten::scatter_add (#90090), aten::zero (#91731), Raise Unsupported for GridSample with volumetric 5D input (#92212)
  • Pretty print diagnostic logging (#88261)
  • Bump onnx to 1.13.1, onnxruntime to 1.14.0 (#90332, #94767)
  • Add full graph checker option for torch.onnx.export API (#83186)
  • Integrate all ONNX operators with a new JitScalarType API (#87245)
  • Add share_from_this to torch::jit::Graph (#87343)
  • Use optional op to keep None in results for ONNX internal tests (#84789)
  • Add support for autograd function inlining in ONNX_ATEN_FALLBACK mode (#85736)
  • Default runtime type checking to raising errors (#86555)
  • Remove the INT64_MAX magic numbers (#88341)


  • Refactor graph partition to check for cyclic dependency (#86511)
  • Enable nvprims.transpose fusions for nvFuser (#86967)
  • Simplify magic method definition code. (#88017)
  • Add sym_floor, sym_sqrt, sym_int (#88760)
  • Propagate .meta info when replacing subgraphs in fx (#87255)
  • Make torch.fx compatible with Python-3.11 (#92895)
  • Add type(module) to be stored in the module stack (#87149)
  • Ensure that symbolic variables incorporate fresh constraints before they're used (#87254)
  • Add type annotation to getitem node before split_module (#88510)
  • Implement pass for annotating getitem nodes (#90237)
  • Guard Symbol and ShapeGuardPrinter behind HAS_SYMPY (#90704)
  • Copy meta field in fx.GraphModule on deepcopy (#92062, #92623)
  • Match get_attr when comparing nodes (#91657)
  • Make deepcopy of fx.GraphModule handle circular reference. (#93038)
  • Populate memo in deepcopy BEFORE copying children. (#93295)


  • Add fp16 support for torch.nn.Linear (#89774), torch.nn.GELU (#86218)
  • Add support for empty Tensors in torch.bitwise_not (#87286), torch.nn.LayerNorm (#94212), many backward functions (#94343), torch.nn.functional.hardswish (#94342), torch.topk (#91884), torch.arange (#94485), torch.linalg.inv (#94551)
  • Improve error message for nn.Conv2d when inputs are on different devices (#86303)
  • Add support via fallback for torch.nn.{Fold, UnFold} (#94491)
  • Add support for reduction ops on multiple axis at a time (#91734)
  • Add support for k greater than 16 for torch.topk (#94639)


  • Add @pytorch in tools/bazel.bzl (#91424)
  • Change visibility for //c10:headers (#91422)
  • Simplify OpenMP detection in CMake (#91576)
  • Use @pytorch// in bazel build files which improves embedding usecases (#89660)
  • Enable USE_CUDA for bazel build (#92640)
  • Add missing default initializers to class members (#94049)


  • Skip builtins while enumerating class methods (#91805)
  • Support lovelace for NVRTC (#87611)
  • Expanded symbolic shape support (movedim) (#91696)


  • Update CI test environment; Add symbolic functions (#94564)
  • Import Literal, Protocol, and Final from standard library typing as of Python 3.8+ (#94490)
  • Add cpuinfo to for new issues reporting which helps triaging on CPU (#93899)
  • Refactor nvfuser build (#89621)
  • Add error checking to flaky test bot platform parser (#86632)
  • Make LazyGraphExecutor extensible (#87218)
  • Delete BUILD_SPLIT_CUDA option (#87502)
  • Use faster cache flush in triton benchmarking (#88557)
  • Guard global observer init against Edge profiler (#86347)

Bug fixes

Python API

  • Fix as_strided_scatter derivative formula (#87646)
  • Add bfloat16 support to (#87205)
  • Disable dimension wrapping for scalar tensors (#89234)
  • Fix SIGSEGV on a big-endian machine when reading pickle data (#92810)
  • Fix BC-breaking change to reduction arguments amin/amax (#93091)
  • Fix incorrect tensor storage check (#86845)
  • Ensure einsum contracts left to right (#87199)
  • Add nondeterministic error for torch.tensor.scatter (#88244)
  • Fix multi-index for torch.tensor.index_select over scalar tensor (#94347)
  • Add scalar support for torch.tensor.where (#92849)
  • Improve error message for unsupported argument types (#87601)
  • Change as_strided_scatter’s storage offset default to None from 0 (#87481)
  • Make torch.histc consistent between CPU and CUDA (#87832)
  • Add float to list of allowed ops for serialization (#94910)
  • Fix numpy1.24 deprecations in unittests (#93997)
  • Properly move segment_reduce to be private, as expected (#93166)

Autograd API

  • Fix behavior of hooks registered to Tensors that had previously been modified in-place (#92734)
    • Previously, a hook registered to a tensor after an in-place modification would erroneously receive the gradients of the output w.r.t. that tensor as it was before the modification, if the tensor had also had a hook registered to it before it was modified in-place.
    • See documentation for more details about backward hooks execution when tensors are modified in-place.
  • Update saved variable hooks to no longer trigger on wrapped numbers (#87316)
  • Modifying a view created in no-grad mode in-place no longer triggers an internal assert (#88243)
  • Improve error message when saved tensor is detached inplace (#88860)
  • Prevent module full_backward_hook from erroring in double backward (#88357)
  • Fix forward AD custom Function non-differentiable outputs (#90787)
  • Don't materialize forward grad for non-differentiable types (#91183)
  • Return input as-is if marked dirty even when requires_grad=False (#91214)
  • Fix saved tensor hooks to propagate errors back to Python as-is (#94456)
  • Fix NumPy broadcasting for backward of linalg.solve (#91456), linalg.lstsq (#91460)
  • Fix torch.var backward when input numel == correction (#94546)
  • Fix CopySlices logic to ensure wrapped node runs properly. (#89812)

torch.nn API

  • Fix for RNN-like Modules to work with stateless.functional_call() (#91111), better error messages (#87442)
  • Add missing dim checks EmbeddingBag (#85433)
  • Fix Upsample and EmbeddingBag module printing (#93850)
  • Fix segfault in Conv3D CPU implementation (#94325)
  • Fix overflow issue in Upsample (#94290)
  • Fix functional.pixel_{shuffle,unshuffle} to consistently return views or not (#86608)
  • Fix 64-bit indexing for Conv3d() (#87527), Upsample() (#87901)
  • Fix preserving requires_grad-ness in fusion utils (#89100)
  • Fix support for empty inputs/outputs for Conv{1,2,3}d() (#86521), functional.adaptive_{avg, max}_pool() (#88906)
  • Fix buffer overflow in Upsample() (#89252), MaxUnpool3d() (#94372)
  • Fix functional.grid_sample() loss of precision for torch.float16 inputs (#90427)
  • Fix functional.interpolate() bicubic interpolation to properly preserve memory format (#90470)


  • Fix cross to match unbatched behavior (#86926)
  • Properly error on complex inputs or outputs in jacrev, jacfwd (#94805)
  • Fix batching rule for dropout (#92975)
  • Fix vmap and anomaly mode interaction (#92672)
  • Fix and update type hints for (#91579)
  • torch.tril & torch.triu: add out of bound checks (#89384)
  • Fix batching rule (#86932)
  • Fix reduction boxed batching rules (#91109)


  • Check SM version before calling flash attention with BFloat16 (#86600)
  • Add range check to multi margin loss target (#89008)
  • Fix NVML visible device parsing (#92315)
  • Take CUDA_VISIBLE_DEVICES into account for nvml calls (#94568)
  • Fix topk IMA (#93095)
  • Fix: half reduction with multiple sub-iterators (#85596)
  • Fix segfault when swapping custom allocator (#89613)
  • Conditionally set device in autograd engine (#91191)
  • Store autocast_gpu_dtype in custom_fwd and custom_bwd for BFloat16 autocast (#88029)
  • Do not use at::cuda::getDefaultCUDAStream() (#91180)
  • Ensure that our error handling runs with the GIL enabled (#92848)
  • Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally (#93192)
  • Fixes a memory leak by making autocast cache global instead of thread-local (#86492)
  • Explicitly set the workspace for cuBLAS handles (#86645)


  • Fix CUDNN_PATH handling on Windows (#88898)
  • Fix typos in warning/error messages (#88961)
  • Remove unneeded checks from embedding bag impl (#92982)
  • Fix C++ segfault in modulelist and moduledict (#93074)


  • Fix overflow issue in tensorboard image summary (#90423)
  • Remove deprecated call to (#89832)

NestedTensor API

  • Enable non-contiguous Nested Tensors for BMM inputs for NT on CUDA (#88108), linear backward (#94317)
  • Fix bug in unsqueeze_nested stride calculation (#88688)


  • Distributed(c10d)
    • Fix a static initialization order fiasco in c10d (#90149)
    • Fix send, recv return type (#92152)
    • Fix MPI backend PG initialization (#92847)
    • Fix header-filter for clang-tidy c10 and apply some fixes to c10 and c10d (#91178)
    • Fix backend_type for backend/PG plugin (#93129)
    • Fix UCC PG barrier (#86961)
    • Properly finalize unsuccessful UCC collective posts (#89306)
    • Add pre & post processing for UCC CPU collectives (#89030)
    • Re-enable isinstance with torch.distributed.ReduceOp (#87303, #88275)
    • Ameliorate custom __eq__ for ReduceOp (#90088)
    • Fix warning if backend registers timer (#91702)
  • DistributedDataParallel
    • Fix DDP when the number of output features is zero (#87793)
  • FullyShardedDataParallel
    • Fix use_orig_params=True for reentrant activation checkpointing by disabling the post-backward hooks (#87413)
    • Re-establish the wrapped module in _lazy_init in case module changing after FSDP constructor (#87837)
    • Fix the incorrect norm calculation for NO_SHARD by handling sharded and non-sharded parameters differently in FSDP.clip_grad_norm_ (#88955)
    • Pass through ActivationWrapper directly to the inner wrapped module to fix state_dict issues (#87950)
    • Remove the clean of FQNs even for use_orig_params=True in FSDP (#91767, #92662)
    • Restrict meta model check to non ignored modules in FSDP (#86766)
    • Fix keep_low_precision_grads=True for use_orig_params=True (#90027)
    • Fix for use_orig_params=True + no_sync (#90546)
    • Fix no_sync, use_orig_params=True, mixed precision, sharded (#92874)
    • Fix input grad propagation when using param mixed precision (#90921)
    • Fix _mp_shard in record_stream (#91096)
    • Fix "use-after-free" in reshard logic (#94859)
    • Fix clip_grad_norm_ issues (#94835), (#86337)
    • Fix load_sharded_state_dict FQN mismatches for shared parameters (#86524)
    • Fix grad zero vs. None edge case (#87308)
    • Fix FSDP state_dict transformations of modules with persistent buffers failure with mixed precision enabled (#93396)
    • Fix nn.Parameter usage for 2D and use_orig_params=True (#89782, #89845, #90562)
  • RPC
    • Fix use-after-free in tensorpipe agent (#87627)
  • Torch Elastic
    • Make TorchElastic timer importable on Windows (#88522)
  • Tensor parallel & 2D parallel
    • Fix the logic to trigger load hooks for 2D parallel integration with FSDP. (#86272)


Foreach API

  • Fix _foreach_norm on some tensor sizes (#91844)
  • Exempt _foreach_norm from autograd_not_implemented_fallback check (#93995)

Complex API

  • Fix serialization of conj and neg_view (#88182)

Linalg API

  • Add empty tensor check to _compute_linear_combination (#94245)

Optimizer API

  • Fix discrepancy between mt vs st impl (#92699)
  • Do NOT inplace modify gradients (#92706)
  • Fix memory leak in _LRScheduler.step() (#85602)
  • Look up group["capturable"], not defaults["capturable"] in Adam(W) (#94149)
  • FusedAdam(W) should take OptState into account before unscaling grads (#94060)
  • Fix LinearLR scheduler start_factor (#86695)
  • Keep AveragedModel buffers in sync when use_buffers=False (#84054)
  • Fix OneCycleLR error log (#92040)
  • Fix SparseAdam consuming iterator (#86210)
  • Fix empty grad support for SparseAdam (#86459)


  • Fix set pickle_module if not specified (#88570)
  • Explicitly check filelike arg of (#88867)
  • Fix dtype mismatch for unallocated storage deserialization (#91285)
  • Add float to list of allowed ops (#94910)


  • Fix segfault in has_torch_function (#88559)
  • Fix for usages of torch_dispatch with operators that take in an OptionalTensorList argument (#88887)
  • Allow direct Tensor constructor to return preexisting PyObject (#92754)
  • Add fallthrough kernel for AutogradMeta key (#94603)
  • Several fixes to existing primtorch and reference decompositions:
    • cat: fix striding (#89332)
    • prelu: Fix prelu ref when a.ndim < 2 (#89809)
    • huber_loss_backward fix (#86955)
    • uniform fix (#90094)
    • unfold_copy fix (#86371)
  • Fix aliasing for primtorch view meta kernels (#86285)
  • Properly compute device for elementwise operations with CPU scalar tensor (#93073)
  • Several fixes to existing operators’ meta tensor kernels:
  • Several bug fixes as part of hardening functionalization, which is used in AOTAutograd:
    • fix detach() in functionalization (#87750)
    • fix torch.as_strided_scatter_backward memory initialization (#88342)
    • fix functionalization resize stride compute (#94018)
    • fix x.is_contiguous(channels_last) in functionalization (#94195)
    • fix set_() with functionalization (#90722)
    • check for undefined tensors in advanced indexing during functionalization (#90791)
    • fix some composite compliance ops for functionalization (#86470)
    • Make aten.copy preserve strides (#89464)

Sparse API

  • Fixes to (#90763), (#90917), (#91094)
  • Fix CSR to CSC conversion when given indices of int32 dtype (#91061)
  • Fix mul when given CUDA CSR Tensor and scalar (#91239)
  • Fix conversion from CSC, BSC to COO to only result in coalesced Tensors when appropriate (#91440)
  • Fix numel after resizing a CSR/BSR/CSC/BSC tensor. (#91831)
  • Fix torch.triangular_solve for CSR on CPU when unitriangular=True. (#93352)


  • Fix philox randn to follow standard normal distribution (#91945)


  • Fix access to uninitialized memory in VSX vector functions (#89833)
  • Fix buffer overflow from AddressSanitizer checks due to inaccurate bfloat16 representation of large integer (#89210)
  • Make torch.histc ignore NaNs on CPU (consistent with CUDA) (#85870)
  • Fix vectorized trigonometric functions for VSX (#86453)
  • Call symint::sizes() instead of sizes() on convolution error messages. (#89549)
  • Make torch.linspace result on CPU consistent with numpy (#89048)
  • Remove variable_excluded_from_dispatch() assertion from mkldnncommon (#92168)
  • A few fixes for exponential_: (1) lambda > 0, (2) mkl kernel to continuous, (3) better error log on dtype (#92891)
  • Vectorize more stable complex division (#93277)
  • A few fixes for cauchy_: (1) check gamma > 0, (2) better dtype error log (#93314)


  • Fix CPU autocast for due to the new type ITensorListRef (#87756)
  • Add parameters check for torch._mkldnn_transpose (#85318)
  • Fix build with Intel compiler due to c10/util/TypeIndex.h (#89610)


  • Treat builtins as default extern module (#88385)
  • Support pickle version 4 by adding missing ops (#90223)
  • Check spec for module source before falling back to file in package exporter (#90258)


  • Fix the call to get_executorch_backend_config (#86338)
  • Fix weight_dtype and bias_dtype backend_config checks (#86719)
  • Respect non_leaf_module_list for activation modules (#88498)
  • Fix incorrect integer cast on histogram observer bounds (#90355)
  • Improve numerical stability of HistogramObserver (#86522)
  • Quant_min typo bugfix in (#88024)
  • Fix fuse_func method overwrite (#87791)
  • Fix get_default_qat_qconfig for PT 1.13 (#88876)
  • Check the value of numel to avoid segfault (#81547)
  • Fix mkldnn quantization issue for weight reorder error (#86876)
  • Fix Memory Leak in QNNPACK QSoftmax Op (#89544)
  • Copy MHA's batch_first attribute in prepare() (#91680)
  • Fix for swap_custom_module_to_observer doing duplicate swaps on the same (#91905)


  • Correctly restore pybind11 error_already_set (#93238)
  • Remove proxy tensor's check for data dependent output (#93265)
  • Make ShapeEnv deepcopy-able (#93403)
  • Fix SubgraphMatcher for case of no anchor found (#86421)
  • Fix for partitioner with symbolic shapes (#86425)
  • Fix getitem in partitioner and make metadata storage more consistent (#87012)
  • Fix magic method try reverse protocol (#88030)
  • Fix FakeTensorProp on Module with Parameters or Buffers (#88700)
  • Fix PassManager to not use a class variable mutable list (#89108)
  • Prevent tracing when we track_tensor_tree (#89139)
  • Make all make_fx invocations isolated (opaque to higher make_fx invocations) by default (#93290)
  • Fix matching args in PatternMatcher (#94375)
  • Allow FakeTensorProp to run on graphs traced with some None inputs (#94569)
  • Copy codegen in legalize_graph (#90023)
  • Fix proxy unwrapping for cond() (#91907)


  • Fix triu/tril operator export with diagonal input (#86843)
  • Skip tensor printing during model tracing (#86223)
  • Fix aten::index_put(self, mask, v) export when rank(mask) < rank(self) (#92862)
  • Fix 0d-tensor broadcast export (#87211)
  • Fix device type detection based on strings (#86168)
  • Fix scatter_add with different static shape of src and index (#89787)
  • Fix _pad_circular export (#86984)
  • Fix concat with empty tensors (#87620)
  • Disable ONNX ceil_mode and count_include_pad to align torch ceil_mode results in corner case (#87892)
  • Fix ignored small eps in layer normalization in fp16 (#89869)
  • Fix unconvertible_ops as per #89261 (#89299)
  • Fix Gather replacement in RNN peephole (#93120)
  • Fix cat operator for tensors with unknown rank (#94870)
  • Fix scalar type analysis for copied constant (#86716)
  • Fix scalar type detection for optional tensors (#94427)
  • Fix 'prim::PackPadded' shape inference (#91829)
  • Add onnx::Max into standard Op for scalar type alignment (#88750)
  • Add setType from user into InferredType and Reliable in ConstantValueMap (#88622)
  • Integrate ONNX ATen Fallback export with the new operator registry (#87735)
  • Fix ONNX ATen Fallback integration for BUILD_CAFFE2=0 builds (#88504)
  • Fix torch.autograd.Function.symbolic method support (#94746)
  • Fix FindCommonAncestor in function_extraction (#86650)
  • Update training state logic to support ScriptedModule (#86745)


  • Fix hipify mapping for cuDeviceGet (#90726)


  • Fix issues with non-contiguous Tensor handling (#86956, #86958)
  • Fix issues with ops implementation torch.median (#90326, #88807), torch.{std,var} correction argument (#91203), torch.index_select (#94117, #91064), torch.cumsum (#94119), torch.where (#86240), torch.nn.Embedding (#82809), torch.nn.Softplus (#88555), torch.nn.functional.pad (#89864), torch.max (#91520), padding functions (#91522), torch.nn.functional.upsample (#91669), pooling functions (#91519, #94348), torch.nn.{NLLLoss,SmoothL1Loss} (#94226), torch.nn.SoftPlus (#94256), torch.masked_fill (#94263), torch.fill_ (#94479), torch.median (#94489), torch.nonzero (#94442), torch.nn.BatchNorm (#94351), torch.{min,max} (#94386), torch.nn.GELU (#94529), torch.nn.LSTM (#94889, #95137), torch.nn.Conv2d (#95078), torch.nn.functional.bilinear (#94892), torch.copy_ (#95272), torch.max_pool2d (#94963), torch.div (#95769)
  • Fix issues with torch.bool for Unary ops (#91120), scatter ops (#94464)
  • Fix issues with torch.float16 for torch.nan_to_num (#94220), torch.nn.HuberLoss (#94567)
  • Properly raise error for torch.int64 inputs for (#94270), torch.floor_divide (#94488), torch.square (#94766)
  • Properly cast torch.int64 to torch.int32 for reduction ops and raise warning. (#94484)
  • Properly raise unimplemented error for torch.nn.Conv3d (#94492)
  • Fix data type issues with index_add for non-torch.float inputs by casting them to torch.float (#88542)
  • Fix the high watermark value for unified memory allocation on x86 (#91268)
  • Fix handling of ops taking multiple dtypes as input (#91197, #91514)
  • Fix handling of channels last for (#91786, #94662), torch.Conv2d (#91822, #94384), torch.nn.{ELU,ReLU,Hardswish} (#94664), torch.nn.BatchNorm (#94760), torch.nn.MaxPool2d (#94877)
  • Fix view operations handling (#94259, #94278, #95145, #95762, #95905)
  • Fix numerical stability issues with various ops (#94889)
  • Fix TORCH_WARN_ONCE (#95559)


  • Move incorrectly placed closing curly brace of extern "C" block (#87853)
  • Set INTERFACE_LINK_DIRECTORIES on caffe2::mkl (#89359)
  • Also include MKL_THREAD_LIB in link libraries for caffe2::mkl (#89378)
  • Fix MSVC compiler error in basic_ops.h (#93322)
  • Fix a bug that redefines __STDC_FORMAT_MACROS (#89310)
  • Fix ReplaceWithMaybeCopy test in OSS (#88099)


  • Fix out-of-bounds error in torch.jit.script for functions with many decorators (#87804)
  • Assorted fixes for NNC cpu fuser (#85056, #86788, #88798, #89978)
  • Set the correct size of aten tensor in presence of MKL-DNN padding (#86767)
  • Fix Scalar(bool) handling in toIValue (#87179)


  • Fix an issue with Vulkan not being able to be compiled on Windows (#92207)
  • Fix a possible empty vector dereference in the Vulkan optimization pass (#92918)


  • Fix cudnn RNN reproducibility issue (#90522)
  • Fix benchmark_limit ignoring failed kernels in FIND (#91032)


  • Set nvfuser default to disabled, keep CI (#86369)
  • Add manual cuda deps search logic (#90411)
  • Workaround for NumPy builds that ship with a broken Dlpack deleter (#89759)
  • Workaround MSVC ICE due to constexpr char* template argument (#86288)
  • Add define to fix issue with compatibility with latest Windows SDK (#85408)
  • Remove invalid git option when updating submodules (#91132)


Python API

  • Improve torch.lerp performance on cpu (#84845)
  • Improve torch.istft performance (#88060)
  • Call view within einsum to remediate MPS regression (#87135)
  • Remove unnecessary calls to Python builtins (#94323)
  • Improve type hints for Module forward hooks (#92061)

Autograd API

  • Use in-place input accumulation fast path for dense Tensors. (#90217)

torch.nn API

  • Improve functional.interpolate() speed for torch.channels_last (#86361, #86361, #90302)
  • Improve performance for functional.multi_head_attention_forward() (#93234, #89847)
  • Improve performance for TransformerEncoderLayer() and MultiheadAttention() (#87377, #88488, #88831, #88854, #88970, #91171)
  • Improve SyncBatchNorm() performance by using the right gathering ops (#89521)
  • Improve ConvTransposed2D() CPU performance for torch.{float32, bfloat16} (#92530)
  • Improve functional.local_response_norm() performance for 3d inputs (#91052)



  • Layer norm backward speed gain with warp shuffles (#87445, #87814)
  • Avoid unnecessary type casts (#86086)
  • Use atomicAdd for bfloat16 in Ampere and above (#84981)


  • Vectorize torch.exp2 on CPU and add complex support (#92115)
  • Add various performance fixes to C++ STL usage (#94034)

NestedTensor API

  • Improve performance for NestedTensor torch.bmm (#86856, #85894)
  • Remove unnecessary check in select_nested (#89150)


  • Do not call pad in no-padding case (#88769)

Complex API

  • Improve complex lerp performance (#84844)


  • Passing serialized XNNPACK model by reference (#89089)
  • Fix to add multiple outputs for the CoreML delegate (#88345)

Sparse API

  • Improve performance of mul when given COO (#86269)
  • Improve to(dtype) support for all sparse compressed formats (#89055)
  • Improve conversion of BSR/BSC to COO using to_sparse (#91389)
  • Improve sparse_mask (#91964)
  • Improve to_dense backward by removing redundant call to coalesce (#92001)
  • Improve validation of CSR/CSC/BSR/BSC tensors for low dimensional inputs (#94048)
  • Improve torch.sparse.sampled_addmm performance on CPU for CSR inputs (#90978)

Optimizer API


  • Optimizations for flip (#89414, #91806, #88989, #90013)
  • Add fmsub to vectorization primitives (#86568)
  • Optimize GELU BFloat16 Impl in CPU path (#79378)
  • Fix biasadd OMP perf issue for the packed MKL SGEMM (#92300)
  • Optimize LogSoftmax by improving thread-allocation in _vec_log_softmax_lastdim (#85398)
  • BF16 autocast conv transpose 1d/2d/3d for CPU (#92527)
  • Add mkl implementation for exponential on CPU (#69967)


  • Use deque instead of list for BFS (#91139)
  • Refactor the dfs cyclic search from recursive to iterative approach (#91042)
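The two items above name general graph-traversal techniques. The following is a minimal, generic sketch of both, not the code from #91139 or #91042: the function names `bfs_order` and `has_cycle` and the adjacency-dict graph representation are assumptions made for illustration only. It shows why a `collections.deque` suits a BFS queue (popping from the left is O(1), while `list.pop(0)` shifts every remaining element) and how a recursive DFS cycle search can be rewritten with an explicit stack so that deep graphs do not hit Python's recursion limit.

```python
from collections import deque


def bfs_order(graph, start):
    """Return the nodes reachable from `start` in breadth-first order.

    `graph` is a hypothetical adjacency dict: node -> iterable of successors.
    """
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()  # O(1); list.pop(0) would be O(n)
        order.append(node)
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def has_cycle(graph):
    """Detect a cycle in a directed graph with an iterative DFS.

    An explicit stack of (node, successor-iterator) pairs replaces the
    call stack, so long chains do not raise RecursionError.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    for root in graph:
        if color.get(root, WHITE) != WHITE:
            continue
        color[root] = GRAY
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, children = stack[-1]
            for child in children:
                state = color.get(child, WHITE)
                if state == GRAY:
                    return True  # back edge to a node on the current path
                if state == WHITE:
                    color[child] = GRAY
                    stack.append((child, iter(graph.get(child, ()))))
                    break  # descend into the child first
            else:
                # All successors explored: the node is finished.
                color[node] = BLACK
                stack.pop()
    return False
```

For example, `bfs_order({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}, "a")` visits `a`, then `b` and `c`, then `d`, and `has_cycle` on a 5000-node chain completes without touching the recursion limit.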



  • Add BFloat16 dtype support for oneDNN Graph JIT fuser (#85591)


  • Improve hot path heuristics performance in V8 (#90811)


Python API

Autograd API

torch.nn API

  • Improve documentation for: MaxPool2d (#86559), utils.clip_grad_norm_() (#91312), Module() (#87142), {Unfold,Fold}() (#88819), torch.nn.functional.gelu (#89061), functional.conv2d padding (#85004), functional.leaky_relu() (#94090), MaxUnpool{1,2,3}D (#94629)

NestedTensor API

  • Update Persons of Interest (#90069)
  • Fix path to nested_tensor in example (#86891)


  • Add 'mps' to the tensor attributes doc page (#86585)


  • Activation checkpointing
    • Clean up comments in activation checkpoint (#86622)
  • Distributed (c10d)
  • DistributedDataParallel
  • RPC
    • Fix non-existing parameters in docstrings in benchmarks (#91115)
  • Tensor parallelism and DTensor:
    • Add more clarifications and fix errors in tensor parallelism docs (#94786)
    • Update 2D parallelism API naming and docs (#94771)
  • FullyShardedDataParallel
    • Add docs explaining how to run the forward pass of submodules in FSDP (#86343)
    • Clarify warnings to mention collectives (#87478)
    • Remove HSDP Zero-2 from doc (#90503)
    • Improve the comments for FSDP (#92359)
  • Distributed Checkpoint
    • Enable documentation for Distributed Checkpoint (#92813)
  • Torch Elastic
    • Fix a minor typo in documentation (#90667)
    • Fix init connect timeout by comparing host with the current IP list (#90221)


  • Downgrade the warning about forward-mode AD coverage (#87383)
  • Add version selector back to functorch docs (#86602)
  • Add documentation for torch.func (#91319)
  • Fix AOTAutograd tutorial (#87415)
  • Add migration guide from functorch (#91811)
  • Improve inplace/view note on copy slices (#89856)
  • Add more details to the functorch install page (#86823)

Linalg API

  • Add a note on the stability of linalg functions (#88313)
  • Improve documentation for various linalg functions (#89013, #89383, #91129)


  • Fix ScalarTensor repr in Extending PyTorch example (#86330)
  • Fix incorrect wrapping of function decorator (#94446)
  • Add __all__ to torch.{autograd, fx, cuda} submodules (#85343)

Dataloader API

  • Update dataloader docstring mentioning prefetch factor behavior (#89874)

Sparse API

  • Extend documentation for to_sparse (#89912)
  • Small correction to torch.sparse overview documentation (#93258)

Optimizer API


  • Fix various spelling and grammatical errors (#90662, #91253)


  • Improve documentation for various distributions (#91091, #87577)
  • Add original sources/references to in distributions (#86543)


  • Improvements to various READMEs (#89319, #86914, #86523, #89795, #90403)
  • Add docstrings for operators defined in torch.ops.quantized_decomposed namespace (#89547)
  • Add x86 backend as default backend of server inference (#86794)
  • Fix non-existing parameters in docstrings in torch/ao (#90875)
  • Move parts of BackendConfig tutorial (#91999)


  • Fix non-existing parameters in docstrings in torch/onnx (#90593)
  • Update diagnostics system (#94565)


  • Enabled xdoctest runner in CI (#83816)
