PyTorch 1.13 Release Notes
- Highlights
- Backwards Incompatible Changes
- New Features
- Improvements
- Performance
- Documentation
- Developers
Highlights
We are excited to announce the release of PyTorch 1.13! This includes stable versions of BetterTransformer. We deprecated CUDA 10.2 and 11.3 and completed migration to CUDA 11.6 and 11.7. Beta includes improved support for Apple M1 chips and functorch, a library that offers composable vmap (vectorization) and autodiff transforms, which is now included in-tree with the PyTorch release. This release is composed of over 3,749 commits from 467 contributors since 1.12.1. We sincerely thank our dedicated community for your contributions.
Summary:
- The BetterTransformer feature set supports fastpath execution for common Transformer models during inference out-of-the-box, without the need to modify the model. Additional improvements include accelerated add+matmul linear algebra kernels for sizes commonly used in Transformer models, and Nested Tensors are now enabled by default.
- Timely deprecating older CUDA versions allows us to proceed with introducing the latest CUDA versions as they are released by Nvidia®, and hence allows support for C++17 in PyTorch and new NVIDIA Open GPU Kernel Modules.
- Previously, functorch was released out-of-tree in a separate package. After installing PyTorch, a user will be able to import functorch and use functorch without needing to install another package.
- PyTorch is offering native builds for Apple® silicon machines that use Apple's new M1 chip as a beta feature, providing improved support across PyTorch's APIs.
You can check the blogpost that shows the new features here.
Backwards Incompatible changes
Python API
uint8 and all integer dtype masks are no longer allowed in Transformer (#87106)
Prior to 1.13, key_padding_mask
could be set to uint8 or other integer dtypes in TransformerEncoder
and MultiheadAttention
, which might generate unexpected results. In this release, these dtypes are not allowed for the mask anymore. Please convert them to torch.bool
before using.
1.12.1
>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.uint8)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
# works before 1.13
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)
1.13
>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.bool)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)
Updated torch.floor_divide
to perform floor division (#78411)
Prior to 1.13, torch.floor_divide
erroneously performed truncation division (i.e. truncated the quotients). In this release, it has been fixed to perform floor division. To replicate the old behavior, use torch.div
with rounding_mode='trunc'
.
1.12.1
>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -1.])
1.13
>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -2.])
# Old behavior can be replicated using torch.div with rounding_mode='trunc'
>>> torch.div(a, b, rounding_mode='trunc')
tensor([ 2., -1.])
Fixed torch.index_select
on CPU to error that index is out of bounds when the source
tensor is empty (#77881)
Prior to 1.13, torch.index_select
would return an appropriately sized tensor filled with random values on CPU if the source tensor was empty. In this release, we have fixed this bug so that it errors out. A consequence of this is that torch.nn.Embedding
which utilizes index_select
will error out rather than returning an empty tensor when embedding_dim=0
and input
contains indices which are out of bounds. The old behavior cannot be reproduced with torch.nn.Embedding
, however since an Embedding layer with embedding_dim=0
is a corner case this behavior is unlikely to be relied upon.
1.12.1
>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
tensor([], size=(1, 0), grad_fn=<EmbeddingBackward0>)
1.13
>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
RuntimeError: INDICES element is out of DATA bounds, id=4 axis_dim=3
Disallow overflows when tensors are constructed from scalars (#82329)
Prior to this PR, overflows during tensor construction from scalars would not throw an error. In 1.13, such cases will error.
1.12.1
>>> torch.tensor(1000, dtype=torch.int8)
tensor(-24, dtype=torch.int8)
1.13
>>> torch.tensor(1000, dtype=torch.int8)
RuntimeError: value cannot be converted to type int8 without overflow
Remove deprecated torch.eig
, torch.matrix_rank
, torch.lstsq
(#70982, #70981, #70980)
The deprecation cycle for the above functions has been completed and they have been removed in the 1.13 release.
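The torch.linalg equivalents cover these use cases. A rough migration sketch (note that return conventions differ slightly, e.g. torch.linalg.eig always returns complex eigenvalues, so downstream code may need adjusting):
A = torch.randn(3, 3)
B = torch.randn(3, 2)
# torch.eig(A, eigenvectors=True)  ->  torch.linalg.eig(A)
eigenvalues, eigenvectors = torch.linalg.eig(A)
# torch.matrix_rank(A)             ->  torch.linalg.matrix_rank(A)
rank = torch.linalg.matrix_rank(A)
# torch.lstsq(B, A)                ->  torch.linalg.lstsq(A, B); note the swapped argument order
solution = torch.linalg.lstsq(A, B).solution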
torch.nn
Enforce that the bias
has the same dtype as input
and weight
for convolutions on CPU (#83686)
To align with the implementation on other devices, the CPU implementation for convolutions was updated to enforce that the dtype
of the bias
matches the dtype
of the input
and weight
.
1.12.1
# input and weight are dtype torch.int64
# bias is torch.float32
>>> out = torch.nn.functional.conv2d(input, weight, bias, ...)
1.13
# input and weight are dtype torch.int64
# bias is torch.float32
>>> with assertRaisesError():
>>> out = torch.nn.functional.conv2d(input, weight, bias, ...)
# Updated code to avoid the error
>>> out = torch.nn.functional.conv2d(input, weight, bias.to(input.dtype), ...)
Autograd
Disallow setting the .data
of a tensor that requires_grad=True
with an integer tensor (#78436)
Setting the .data
of a tensor that requires_grad
with an integer tensor now raises an error.
1.12.1
>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
>>> x
tensor([0, 0], requires_grad=True)
1.13
>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: data set to a tensor that requires gradients must be floating point or complex dtype
Added variable_list support to ExtractVariables struct (#84583)
Prior to this change, a C++ custom autograd Function did not consider tensors passed in a TensorList to be tensors for the purposes of recording the backward graph. After this change, custom Functions that receive a TensorList must modify their backward functions to also compute gradients for these additional tensor inputs. Note that this behavior now differs from that of custom autograd Functions in Python.
1.12.1
struct MyFunction : public Function<MyFunction> {
static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
return 2 * tensors[0] + 3 * t;
}
static variable_list backward(
AutogradContext* ctx,
variable_list grad_output) {
return {3 * grad_output[0]};
}
};
1.13
struct MyFunction : public Function<MyFunction> {
static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
return 2 * tensors[0] + 3 * t;
}
static variable_list backward(
AutogradContext* ctx,
variable_list grad_output) {
return {3 * grad_output[0], 2 * grad_output[0]};
}
};
Don't detach when making views; force kernel to detach (#84893)
View operations registered as CompositeExplicitAutograd kernels are no longer allowed to return input tensors as-is. You must explicitly create a new tensor (e.g., using .alias()
).
1.12.1
torch::Tensor view_op(const torch::Tensor& self) {
return self;
}
1.13
torch::Tensor view_op(const torch::Tensor& self) {
return self.alias();
}
ONNX
torch.onnx.register_custom_op_symbolic
now only registers the symbolic function at the specified opset version (#85636)
This updates register_custom_op_symbolic
's behavior to only register the symbolic function at a single version. This is more aligned with the semantics of the API signature. Previously, the API registered a symbolic function to all versions up to the specified version. As a result of this change, users will need to register a symbolic function at the exact version when they want to override an existing symbolic function. Users are not affected if (1) an implementation does not exist for the op, or (2) the symbolic function is already registered at the exact version used for export.
1.12.1
# Assuming an implemented symbolic function `custom_op_function`
torch.onnx.register_custom_op_symbolic("aten::foo", custom_op_function, 16)
1.13
# Assuming an implemented symbolic function `custom_op_function`
for opset in range(1, 17):
torch.onnx.register_custom_op_symbolic("aten::foo", custom_op_function, opset)
Default ONNX opset is updated to 14 (#83284)
The default opset is updated regularly to stay in sync with ONNX releases. Users can specify opset_version
in torch.onnx.export
to maintain opset version 13.
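For example, to keep exporting at opset 13 (the model and inputs below are placeholders):
torch.onnx.export(model, example_inputs, "model.onnx", opset_version=13)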
torch.onnx.symbolic_registry
is removed (#84382)
We removed the symbolic_registry
module and hid it as an internal implementation detail. Users previously relying on the register_op
function to register custom symbolic functions should move to use the torch.onnx.register_custom_op_symbolic
API.
ScalarType
and global variables in torch.onnx.symbolic_helper
are removed (#82995)
The ScalarType
class in torch.onnx.symbolic_helper
, along with the global variables cast_pytorch_to_onnx
, pytorch_name_to_type
, scalar_name_to_pytorch
, scalar_type_to_onnx
and scalar_type_to_pytorch_type
are removed from the module. Users previously using these global variables for PyTorch JIT-ONNX type conversion in symbolic functions should move to use the torch.onnx.JitScalarType
class.
1.12.1
# 1
torch.onnx.symbolic_helper.scalar_type_to_onnx[
symbolic_helper.scalar_type_to_pytorch_type.index(x.dtype)
].value
# 2
torch.onnx.symbolic_helper.scalar_name_to_pytorch[element_type] in cast_pytorch_to_onnx.keys()
# 3
torch.onnx.symbolic_helper.cast_pytorch_to_onnx["Long"]
# 4
torch.onnx.symbolic_helper.cast_pytorch_to_onnx[tensor.type().scalarType()]
1.13
# 1
torch.onnx.JitScalarType.from_dtype(x.dtype).onnx_type()
# 2
torch.onnx.JitScalarType.from_name(element_type).onnx_compatible()
# 3
torch.onnx.TensorProtoDataType.INT64
# 4
torch.onnx.JitScalarType.from_name(tensor.type().scalarType()).onnx_type()
Distributed
In c10d collectives, input tensor dtypes must now be the same (#84664)
We added a check to validate that all input tensors have the same dtype. Previously, users were allowed to pass in tensors with different dtypes for c10d collectives. Now, passing in tensors with different dtypes will throw a RuntimeError with the following message: “Invalid usage of tensors with different dtypes Found torch.float
and torch.half
”. Users can use tensor.to(dtype={some_dtype})
to fix this.
1.12.1
# users could pass inputs having different dtypes
>>> tensor = torch.ones(2, 2) * 7
>>> tensor_h = tensor.half()
>>> tensor_list = [torch.zeros(2, 2) for _ in range(4)] # Assume world_size = 4
# Both cases work.
>>> dist.all_gather(tensor_list, tensor)
>>> dist.all_gather(tensor_list, tensor_h)
...
1.13
# all inputs of c10d collectives need to have the same dtype
>>> tensor = torch.ones(2, 2) * 7
>>> tensor_h = tensor.half()
>>> tensor_list = [torch.zeros(2, 2) for _ in range(4)] # Assume world_size = 4
# Only allow same dtype for all input tensors.
>>> dist.all_gather(tensor_list, tensor) # RuntimeError thrown
...
Users doing wildcard imports of torch.distributed.distributed_c10d will no longer get non-public symbols (#84872)
We limit the usage of c10d APIs to public APIs, so if a user does a wildcard import and calls an internal API, it will fail. Please see the example below:
1.12.1
# users could import both public and non-public symbols:
from torch.distributed.distributed_c10d import *
>>> is_nccl_available() # public API
>>> _check_single_tensor(...) # Non-public API
...
1.13
# users can only import public symbols
from torch.distributed.distributed_c10d import *
is_nccl_available() # public API
_check_single_tensor(...) # Non-public API, this will fail now
...
Process Group C++ extensions must use absolute path when importing ProcessGroup.hpp (#86257), ProcessGroup::Work object moved out of work to its own Work class (#83680):
Details of the changes and the updated tutorial can be found in the PyTorch tutorial PR #2099
1.12.1
// users use relative path to import C++ headers and Work resides in ProcessGroup class
#include <c10d/ProcessGroup.hpp>
#include <c10d/Store.hpp>
#include <c10d/Types.hpp>
#include <c10d/Utils.hpp>
...
class WorkDummy : public ProcessGroup::Work {
...
}
1.13
// users must use absolute path of import C++ files and Work is its own class
#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>
#include <torch/csrc/distributed/c10d/Store.hpp>
#include <torch/csrc/distributed/c10d/Types.hpp>
#include <torch/csrc/distributed/c10d/Utils.hpp>
...
#include <torch/csrc/distributed/c10d/Work.hpp>
class WorkDummy : public Work {
...
}
Quantization
Add required example_args
argument to prepare_fx
and prepare_qat_fx
(#249) (#77608)
We added an additional required example_inputs
argument to prepare_fx
and prepare_qat_fx
APIs, this can be used to do type inference to figure out the type information for each of the fx Node in the graph.
1.12.1
m = resnet18(...)
m = prepare_fx(m, qconfig_dict)
# or
m = prepare_qat_fx(m, qconfig_dict)
1.13
m = resnet18(...)
m = prepare_fx(m, qconfig_dict, example_inputs=(torch.randn(1, 3, 224, 224),))
# or
m = prepare_qat_fx(m, qconfig_dict, example_inputs=(torch.randn(1, 3, 224, 224),))
Stop moving models to CPU in quantization convert (#80555)
Previously, we automatically moved the model to CPU in torch.ao.quantization.fx.convert
to work around the issue where certain functions called by convert expect CPU arguments. This commit pushes this responsibility to the caller, since it is the user's decision which device to use.
1.12.1
model = resnet18(...)
model = prepare_fx(model, qconfig_mapping, example_inputs)
# calibrate
model = convert_fx(model)
1.13
model = resnet18(...)
model.cpu() # if needed
model = prepare_fx(model, qconfig_mapping, example_inputs)
# calibrate
model = convert_fx(model)
Replace the is_reference
flag of the torch.ao.quantize_fx.convert_fx
function with the convert_to_reference
function (#80091, #81326)
This PR removes the is_reference flag from the existing convert_fx
API and replaces it with a new convert_to_reference
function. This separates (1) converting the prepared model to a reference model from (2) lowering the reference model to a quantized model, enabling users to call their custom lowering function for
custom backends.
1.12.1
from torch.ao.quantization.quantize_fx import (
prepare_fx,
convert_to_reference,
)
prepared = prepare_fx(model, ...)
reference = convert_to_reference(prepared, ...)
1.13
from torch.ao.quantization.quantize_fx import (
prepare_fx,
convert_to_reference_fx,
)
prepared = prepare_fx(model, ...)
reference = convert_to_reference_fx(prepared, ...)
Add default configs for fixed qparams ops (#80184)
This commit adds qconfigs with special observers for fixed qparams ops (operators whose corresponding quantized version has fixed quantized parameters for output) like sigmoid in get_default_qconfig_mapping
and get_default_qat_qconfig_mapping
. For correctness, we also require users to use these special observers if we detect these fixed qparams ops in prepare.
1.12.1 (fails after this PR):
from torch.ao.quantization.quantize_fx import prepare_fx
model = ModelWithFixedQParamsOps()
qconfig_mapping = QConfigMapping()
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
1.13
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx
model = ModelWithFixedQParamsOps()
qconfig_mapping = get_default_qconfig_mapping()
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
Replace qconfig_dict
with a typed QConfigMapping
object (#78452, #79618)
Previously, FX graph mode quantization configurations were specified through a dictionary of qconfigs. However, this
API was not in line with other core APIs in PyTorch. This commit replaces this dictionary with a config object that users will
create and pass to prepare and convert. This leads to better type safety and better user experience in notebook settings
due to improved auto completion.
1.12.1 (deprecated)
from torch.ao.quantization.quantize_fx import prepare_fx
qconfig_dict = {
"": qconfig,
"object_type": [
(torch.nn.Linear, qconfig),
],
"module_name_regex": [
("foo.*bar", qconfig),
],
"module_name": [
("mod", qconfig),
],
}
prepare_fx(model, qconfig_dict)
1.13
from torch.ao.quantization import QConfigMapping
from torch.ao.quantization.quantize_fx import prepare_fx
qconfig_mapping = QConfigMapping()
.set_global(qconfig)
.set_object_type(torch.nn.Linear, qconfig)
.set_module_name_regex("foo.*bar", qconfig)
.set_module_name("mod", qconfig)
prepare_fx(model, qconfig_mapping)
Replace *custom_config_dict
with typed config objects (#79066)
This commit replaces the following config dicts with python objects:
- prepare_custom_config_dict → PrepareCustomConfig
- convert_custom_config_dict → ConvertCustomConfig
- fuse_custom_config_dict → FuseCustomConfig
This leads to better type safety and better user experience in
notebook settings due to improved auto completion.
1.12.1
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
prepare_custom_config_dict = {
"float_to_observed_custom_module_class": {
"static": {
FloatClass: ObservedClass
}
},
"non_traceable_module_name": ["mod1", "mod2"],
"non_traceable_module_class": [class1, class2],
"input_quantized_idxs": [0, 1],
"output_quantized_idxs": [0],
"preserved_attributes": ["attr1", "attr2"],
}
convert_custom_config_dict = {
"observed_to_quantized_custom_module_class": {
"static": {
FloatClass: ObservedClass
}
},
"preserved_attributes": ["attr1", "attr2"],
}
model = prepare_fx(
model,
qconfig_mapping,
example_inputs,
prepare_custom_config_dict=prepare_custom_config_dict)
model(data)
model = convert_fx(model, convert_custom_config_dict=convert_custom_config_dict)
1.13
from torch.ao.quantization.fx.custom_config import (
PrepareCustomConfig,
ConvertCustomConfig,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
prepare_custom_config = PrepareCustomConfig() \
.set_float_to_observed_mapping(float_class, observed_class) \
.set_non_traceable_module_names(["mod1", "mod2"]) \
.set_non_traceable_module_classes([class1, class2]) \
.set_input_quantized_indexes([0, 1]) \
.set_output_quantized_indexes([0]) \
.set_preserved_attributes(["attr1", "attr2"])
convert_custom_config = ConvertCustomConfig() \
.set_observed_to_quantized_mapping(observed_class, quantized_class) \
.set_preserved_attributes(["attr1", "attr2"])
model = prepare_fx(
model,
qconfig_mapping,
example_inputs,
prepare_custom_config=prepare_custom_config)
model(data)
model = convert_fx(model, convert_custom_config=convert_custom_config)
Remove remove_quant_dequant_pairs
and fix tests (#84203)
This PR removed some passes in convert_fx
, and also fixes the way we quantize layer_norm operator, so the qconfig
for layer_norm op needs to be updated as well.
1.12.1
import torch
from torch.ao.quantization.qconfig_mapping import QConfigMapping, QConfig
from torch.ao.quantization.observer import default_weight_observer
from torch.ao.quantization.backend_config import (
DTypeConfig,
ObservationType,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
qconfig = QConfig(activation=qconfig.activation, weight=default_weight_observer)
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.LayerNorm, qconfig) \
.set_object_type(torch.nn.functional.layer_norm, qconfig)
# assuming mymodel contains a LayerNorm layer or torch.nn.functional.layer_norm
m = MyModel()
example_inputs = (torch.rand(3, 3),)
m = prepare_fx(m, qconfig_mapping, example_inputs)
1.13
import torch
from torch.ao.quantization.qconfig_mapping import QConfigMapping, QConfig
from torch.ao.quantization.observer import default_placeholder_observer
from torch.ao.quantization.backend_config import (
DTypeConfig,
ObservationType,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
qconfig = QConfig(activation=qconfig.activation, weight=default_placeholder_observer)
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.LayerNorm, qconfig) \
.set_object_type(torch.nn.functional.layer_norm, qconfig)
# assuming mymodel contains a LayerNorm layer or torch.nn.functional.layer_norm
m = MyModel()
example_inputs = (torch.rand(3, 3),)
m = prepare_fx(m, qconfig_mapping, example_inputs)
Align observer dtype with reference model spec (#85345)
Before this PR, the dtype
attribute of observers was not clearly defined. It originally meant interface_dtype
in the eager mode workflow, which is how the codebase before this PR is using it. In the new reference model spec, dtype
attribute of an observer represents the dtype
value which needs to be passed into a quantize
function in the reference model spec. This PR aligns the codebase to this definition of dtype
.
1.12.1
dynamic_quant_observer = PlaceholderObserver.with_args(
dtype=torch.float, compute_dtype=torch.quint8)
1.13
dynamic_quant_observer = PlaceholderObserver.with_args(
dtype=torch.quint8, compute_dtype=torch.quint8)
Composability
Changed the backend C++ kernel representation for some operators that take in lists of tensors (#73350)
If an operator in ATen takes in a list of tensors, and is marked as “structured” in native_functions.yaml (example), then previously, TensorList was represented as at::TensorList
, or c10::ArrayRef<at::Tensor>
. Now, it is represented as a more efficient type: const ITensorListRef&
.
1.12.1
at::Tensor cat_kernel(at::TensorList tensors, int64_t dim) {
...
}
TORCH_LIBRARY_IMPL(aten, dispatch_key, m) {
...
m.impl("cat", &cat_kernel);
}
1.13
at::Tensor cat_kernel(const at::ITensorListRef& tensors, int64_t dim) {
...
}
TORCH_LIBRARY_IMPL(aten, dispatch_key, m) {
...
m.impl("cat", &cat_kernel);
}
C++ API
Lowered randint default dtype to the C++ API (#81410)
Prior to 1.13, the default for the dtype
argument of torch.randint
, torch.long
, was set via manual python binding. However, in the C++ API, torch::randint
would default to the global default data type, which is usually float
. In 1.13 we changed the default for dtype
in the C++ API to int64
in order to match the python API. To reproduce the old behavior, one can set the dtype
argument.
1.12.1
torch::randint(/*low=*/0, /*high=*/10, {2, 3});
1.13
// assuming default dtype is float
torch::randint(/*low=*/0, /*high=*/10, {2, 3}, torch::kFloat);
Enabled dim=None
for torch.{std, var, std_mean, var_mean}
(#81845, #82765, #82912)
Prior to 1.13, a C++ API call that has argument types torch::{std, var, std_mean, var_mean}(Tensor, OptionalIntArrayRef, int64_t, bool)
used to resolve to the {std, var, std_mean, var_mean}.correction
overload. In this release, it resolves to the {std, var, std_mean, var_mean}.dim
overload. With the .correction
overload, the third argument of type int64_t
could be used to pass a correction δN other than 1. In order to call the {std, var, std_mean, var_mean}.correction
overload in 1.13, the old int64_t
argument can be wrapped in a c10::optional
.
1.12.1
// using std as an example
int64_t correction = 2;
torch::std(t, /*dim=*/dim, /*correction=*/correction, /*keepdim=*/True);
1.13
// To replicate in 1.13 using std as an example
auto correction = c10::make_optional<int64_t>(2);
torch::std(t, /*dim=*/dim, /*correction=*/correction, /*keepdim=*/True);
Deprecations
Distributed
We are deprecating the following APIs of c10d: *_coalesced
APIs (#85959), *_multigpu
APIs (#85961) and ProcessGroupRoundRobin
(#85158)
We added warnings when users call c10d’s *_coalesced
, *_multigpu
and ProcessGroupRoundRobin
APIs. Previously, users could use these APIs without any warnings, but now they will see warnings like “torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions”. There are still workarounds for *_coalesced
APIs but no workarounds will be provided for the other two.
1.12.1
# users could use the following APIs with no warnings:
all_reduce_coalesced(...)
all_gather_coalesced(...)
broadcast_multigpu(...)
all_reduce_multigpu(...)
reduce_multigpu(...)
all_gather_multigpu(...)
reduce_scatter_multigpu(...)
...
1.13
# users can still use these APIs but it will come with warnings:
all_reduce_coalesced(...)
# Warnings:
# torch.distributed.all_reduce_coalesced will be deprecated. If you must
# use it, please revisit our documentation later at
# https://pytorch.org/docs/master/distributed.html#collective-functions"
# Potential workaround:
reqs = []
with dist._coalescing_manager(group, reqs):
reqs.append(dist.all_reduce(tensor1, async_op=True))
reqs.append(dist.all_reduce(tensor2, async_op=True))
for req in reqs:
req.wait()
...
We are deprecating passing optim_input
into the FSDP optimizer state checkpointing APIs. The user can simply not pass the optim_input
argument, and all behavior is preserved. No fix is needed on the user's side for now.
1.12.1
# the user can use the following APIs with no warnings
full_optim_state_dict(...)
sharded_optim_state_dict(...)
shard_full_optim_state_dict(...)
flatten_sharded_optim_state_dict(...)
scatter_full_optim_state_dict(...)
rekey_optim_state_dict(...)
1.13
# users can still use these APIs, but they will come with warnings
# The `optim_input` argument is deprecated and will be removed after PyTorch 1.13.
# You may remove it from your code without changing its functionality.
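A minimal sketch of the change for full_optim_state_dict, assuming model is an FSDP-wrapped module and optim is its optimizer; the other APIs listed above follow the same pattern:
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
# 1.12.1 style: passing optim_input (now emits a deprecation warning)
osd = FSDP.full_optim_state_dict(model, optim, optim_input=list(model.parameters()))
# 1.13 style: simply omit optim_input; behavior is unchanged
osd = FSDP.full_optim_state_dict(model, optim)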
LinAlg
Deprecate torch.lu in favor of linalg.lu_factor (#77636)
The new operation has a cleaner API and better docs. The update rule is as follows:
1.12.1
LU2, pivots2, info = torch.lu(A, compute_pivots, get_infos=True)
LU1, pivots1 = torch.lu(A, compute_pivots)
1.13
LU2, pivots2, info = torch.linalg.lu_factor_ex(A, compute_pivots)
LU1, pivots1 = torch.linalg.lu_factor(A, compute_pivots)
Deprecate torch.lu_solve in favor of linalg.lu_solve (#77637)
The new operation has a notation consistent with linalg.solve
, and has an extra parameter adjoint=False
. The update rule is as follows:
1.12.1
X = torch.lu_solve(B, LU, pivots)
1.13
X = linalg.lu_solve(LU, pivots, B)
ONNX
Monkey patched convenience method on torch._C.Graph
, torch._C.Block
and torch._C.Node
are deprecated. (#83006)
Deprecated methods include Graph.op()
, Graph.constant()
, Graph.at()
, Block.op()
, and Node.__getitem__()
. Previously, these methods were patched into the classes above when users called torch.onnx.export()
and were typically used in custom symbolic functions. Users can continue to expect g.op()
and g.at()
in symbolic functions to work. The g
parameter has been substituted by the GraphContext
object (#84728). The methods are now exposed by the GraphContext
class with APIs unchanged. Users should not rely on the Graph.op()
, Graph.constant()
, Graph.at()
, Block.op()
, Node.__getitem__()
methods when they are directly interacting with the C classes. Users should use only the op()
and at()
methods of the GraphContext
object, as other fields in the class will change in future releases.
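For reference, a typical custom symbolic function is unaffected as long as it sticks to g.op(): the g argument is now a GraphContext rather than a raw torch._C.Graph, but the call pattern is the same. The override below is purely illustrative:
def relu_symbolic(g, self):
    # g is a GraphContext in 1.13; g.op() behaves as before
    return g.op("Relu", self)
torch.onnx.register_custom_op_symbolic("aten::relu", relu_symbolic, 14)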
New features
Python API
- Added a deterministic implementation of scatter_add on CUDA for all input sizes (#79466)
- Added torch.concatenate that aliases torch.cat (#85073)
- Added Tensor.is_cpu() that returns whether a tensor is on CPU (#78887)
- Added a force kwarg to Tensor.numpy() that enables returning a numpy ndarray that does not share storage with the tensor (#78564)
- Added torch.special.{airy_ai, bessel_j0, bessel_j1, bessel_y0, bessel_y1, modified_bessel_i0, modified_bessel_i1, modified_bessel_k0, modified_bessel_k1, scaled_modified_bessel_k0, scaled_modified_bessel_k1, spherical_bessel_j0} (#78900), (#78901), (#78902), (#78912), (#78451)
- Added torch.special.{chebyshev_polynomial_t, chebyshev_polynomial_u, chebyshev_polynomial_v, chebyshev_polynomial_w, hermite_polynomial_h, hermite_polynomial_he, laguerre_polynomial_l, legendre_polynomial_p, shifted_chebyshev_polynomial_t, shifted_chebyshev_polynomial_u, shifted_chebyshev_polynomial_v, shifted_chebyshev_polynomial_w} (#78196), (#78293), (#78304), (#78366), (#78352), (#78357)
- Added a weights_only option to torch.load that restricts load to state_dict only, enabling safe loading. This can also be set using the TORCH_FORCE_WEIGHTS_ONLY_LOAD environment variable (#86812)
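A few of the items above in use (a quick sketch; the file path is a placeholder):
>>> t = torch.arange(4.)
>>> t.is_cpu
True
>>> torch.concatenate([t, t]).shape
torch.Size([8])
>>> torch.save({"w": t}, "weights.pt")
>>> torch.load("weights.pt", weights_only=True)  # only tensors and primitive containers are unpickled
{'w': tensor([0., 1., 2., 3.])}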
Build
- Added
-Werror=unused-but-set-variable
build flag (#79305) - Added ability to get release versions based on the current tag (#78584)
- Added
-Werror=type-limits
in Bazel CPU build (#79139) - Added
-Werror=unused-variable
in Bazel CPU build (#79156) - Added --config=shell to bazelrc file for easier debugging (#79350)
- Added clang
-Wconstant-conversion
to catch errors detected in #75400 (#80461) - Added
-Werror=non-virtual-dtor
build flag (#81012) - Turned on pocketfft flag for third-party pocket_fft library (#81670)
- Updated NCCL to v2.13.4-1 (#82775)
- Added
-Wunused-local-typedef
build flag (#86154) - Increased max python version to include 3.10 (#84815)
Complex
- Added complex half support for:
- [CPU]
torch.{index_select, index_add}
(#79217), (#79897). - [CUDA]
torch.roll
(#79970),torch.fft.{fftshift, ifftshift}
(#79970),torch.{acos, acosh, asinh, atanh}
, (#80030),torch.{cos, sinh, cosh, tanh}
(#78718),torch.sqrt, rsqrt
(#77490),torch.{triu, tril, diag, trace}
(#78062). - [CPU and CUDA]
torch.where
(#78665),torch.{where, pow, masked_fill, sgn, tan, angle}
(#78665)
- [CPU]
- Added complex support for
torch.nn.ConvTranspose1d
(#79694).
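For instance, the new complex support for nn.ConvTranspose1d can be exercised as follows (a minimal sketch):
m = torch.nn.ConvTranspose1d(2, 3, kernel_size=3, dtype=torch.cfloat)
x = torch.randn(1, 2, 8, dtype=torch.cfloat)
y = m(x)  # complex-valued output of shape (1, 3, 10)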
torch.nn
- Added
pop
function tonn.Sequential
andnn.ModuleList
(#81601) - Added deepcopy support for parametrized
nn.Module
(#80811)
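A minimal sketch of the new Sequential.pop and deepcopy support for parametrized modules:
import copy
from torch.nn.utils import parametrizations
seq = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU(), torch.nn.Linear(4, 2))
removed = seq.pop(2)                          # removes and returns the last Linear layer
layer = parametrizations.orthogonal(torch.nn.Linear(4, 4))
layer_copy = copy.deepcopy(layer)             # deepcopy of a parametrized module now works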
torch.nn.optim
- Added maximization support via the
maximize
kwarg foroptim.SparseAdam
(#80336),optim.ASGD
(#81875),optim.Rprop
(#81864),optim.RMSprop
(#80326) - Added support for differentiable optimizers via the
differentiable
kwargoptim.SGD
(#80938),optim.Adam
(#82205),optim.RMSprop
(#83578) - Added support for complex number for
optim.Adam
(#80279),optim.AdamW
(#80280),optim.Adamax
(#80319),optim.RMSprop
(#83860),optim.Rprop
(#83858), - Handled complex params as independent real params in
optim.{RMSprop, ASGD}
(#83860), (#84472) - Added
optim.lr_scheduler.PolynomialLR
(#82769)
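For example, combining a couple of the items above (the tiny parameter is just for illustration):
w = torch.nn.Parameter(torch.randn(3))
opt = torch.optim.RMSprop([w], lr=0.1, maximize=True)  # performs gradient ascent on the objective
sched = torch.optim.lr_scheduler.PolynomialLR(opt, total_iters=5, power=1.0)
for _ in range(5):
    opt.zero_grad()
    w.pow(2).sum().backward()
    opt.step()
    sched.step()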
BetterTransformer
- Allowed user to assert no mask contiguous check is necessary (#82533)
- Added support for norm_first in nn.TransformerEncoderLayer fast path (#78269)
- Added custom scaled dot product implementations for dense tensors (#85984)
- Added Better Transformer fastpath diagnostics (#81013)
ForEach
- Implemented inplace
foreach
maximum
andminimum
(#82523)
LinAlg
- Added
linalg.lu_solve
,linalg.solve_ex
,linalg.vecdot
,linalg.vander
(#77634, #80073, #70542, #76303)
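A quick sketch of a few of the new functions:
A = torch.randn(3, 3)
b = torch.randn(3)
x, info = torch.linalg.solve_ex(A, b)  # like linalg.solve, but also returns an info tensor
d = torch.linalg.vecdot(torch.randn(4, 3), torch.randn(4, 3))  # batched dot product over the last dim
V = torch.linalg.vander(torch.tensor([1., 2., 3.]))  # Vandermonde matrix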
Sparse
- Added
torch.sparse.spdiags
for easier creation of diagonal sparse matrices (#78439)
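For example (a minimal sketch):
d = torch.tensor([[1., 2., 3.]])
S = torch.sparse.spdiags(d, torch.tensor([0]), (3, 3))  # 3x3 sparse matrix with [1, 2, 3] on the main diagonal
S.to_dense()
# tensor([[1., 0., 0.],
#         [0., 2., 0.],
#         [0., 0., 3.]])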
torch.fx
- Enabled symbolic shapes (#82063, #82317, #82209, #83380, #85808, #84113, #84829, #84918, #85185, #85261, #85260, #85754, #85768, #86050, #86098, #86067)
- Created an improved version of subgraph matcher (#82090, #82853, #85444, #85456, #85617)
- Rewrite subgraph_rewriter with subgraph_matcher (#83717)
- Added PassBase for writing passes, PassResult for the return value of passes, and a PassManager for managing the workflow of passes (#79878, #81366, #80531, #82485, #83933, #84094, #84425, #84232)
- Added an FX graph partitioner and fuser (#79439, #80292)
- Added a reinplacing FX pass (#80897, #83626, #83845, #83846)
- Added a CSE pass to the common passes (#81512, #81530, #81742)
- Created DecompositionInterpreter for decomposing aten → prims after an initial make_fx call (#79989)
- Created a Backend for NvFuser based graph partitioner + Prims (#80591, #81311, #81436, #81911)
- Created a Backend for Cudagraphs from dynamo (#80566)
- Created a type constraint generator to Z3 (#79912, #80084, #80095, #80102, #80110, #80147, #80744, #80799, #80823, #80847, #80909, #80925, #80976, #81159, #81175, #81189, #81190, #81265, #81274, #81344, #81360, #81376, #81445, #81516, #81527, #81714, #82163, #82590, #82597, #82614, #82742, #82856, #82923,#82938,#83087, #83109, #83194, #83334, #83682, #83945)
JIT
- Added new NVFuser Python Frontend Record Keeping for Cache enablement. (#81578)
- Added
torch.ops.nvprims
namespace for nvFuser-specific prims (#82155) - Enabled fusion of conv with elementwise OP in NNC (#77157)
- Added symbolic shape functions for
conv_transpose2d.input, convolution, convolution_backward
(#77283, #83557, #80860) - Added support in symbolic shapes for generalized lists of tensor shapes, tuple outputs, optional None, upper and lower bounds (#77389, #83092, #83222, #78679)
- Added support for
aten::_convolution
when it is 2D conv in NNC (#84038) - Exposed
ProcessGroup::Work.wait()
API to TorchScript (#83303)
ONNX
- Inlined
prim::PythonOp
for Autograd Function Export (#74765)
AMD
- Enabled nvfuser (#82498)
CUDA
- Added CUDA trace Python hooks (#82824)
- Added CUDA Sanitizer (#83984)
- Added support for multiple outputs in python jiterator (#77921, #78139)
Intel
- Added a launch script with Best Recipe of Deep Learning on Intel Xeon CPU (#63932)
- Enabled Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (#63289)
- Added unified x86 quantization backend (#84329)
MPS
- Added
aten::index_add.out
operator for MPS backend (#79935) - Added
aten::prelu operator
for MPS backend (#82401) - Added
aten::bitwise-not
operator native support for MPS backend (#83678) - Added
aten::tensor::index_put
operator for MPS backend (#85672) - Added
aten::upsample_nearest1d
operator for MPS backend (#81303) - Added
aten::bitwise_{and|or|xor}
operators for MPS backend (#82307) - Added
aten::index.Tensor_out
operator for MPS backend (#82507) - Added
aten::masked_select
operator for MPS backend (#85818) - Added
aten::multinomial
operator for MPS backend (#80760)
Profiler
- Integrated Execution Graph Observer into PyTorch Profiler (#75358, #79753, #82895, #84285)
- TorchTidy: experimental tool to identify anti-patterns from traces (#79631, #79874, #79993, #80094, #80108, #80572, #81056, #81273, #81501, #81733, #81740, #81921, #82421, #82248, #82261, #82782)
- Added reporting for OOM events to the Pytorch Profiler. (#80050)
Vulkan
- Added Vulkan support for the following operators:
- Prototype implementations for Quantized Tensors were added (#81491). These implementations still need to be exposed to Torchscript, but so far prototype implementations for the following ops have been added:
Mobile
- Added support for dtypes and custom classes in model tracer (#84795)
- Extended Flatbuffer to get mobile_info for NMLML workflows (#78306)
- Added serialization/deserialization of Sparse Quantize Linear Packed Params (#80474)
- Added qnnpack bcsr matrix unpacking and use unpacking in Linear module (#80475)
- Added OwnedOrBorrowedVector for QNNPack BCSR Indices/Values (#80476)
Distributed
Distributed Checkpointing
(Prototyping)
- This is a prototyping effort which enables loading and saving PyTorch models from one or more hosts. Models can use features such as DDP, FSDP and ShardedTensor and they can have a different configuration between saving and loading - for example, save from 4 hosts and load from a single host. Distributed checkpointing has an extensibility API that enables full control of how a model is saved; and a pluggable IO backend. (#83781, #83419, #84952, #84881)
Distributed(c10d)
- Made c10d collective ops dispatcher passable. It allows tracing mechanisms such as LazyTensor and AOTAutograd to observe communications, e.g., : broadcast(#76722), allreduce(#79582), allgather (#79669), reduce_scatter (#79683), reduce (#79686), gather (#79687), scatter (#79688), alltoall (#79691), barrier (#79777), send/recv (#79779).
- Added UCC process group (#79918)
- Enabled uneven input support for
all_gather
(#83713) and uneven output support forreduce_scatter
(#87010) - Added NCCL PreMul Sum to c10d
ReduceOp
(#84243)
DistributedDataParallel
- Made DDP work with Python process group (#79176)
- Enabled Zero1's ddp_with_overlap for hpu backend (#80438)
FullyShardedDataParallel
- Added forward prefetching option in FSDP API (#85177)
- Added fp16 and bf16 hooks for FSDP (#81711)
- Implemented
sharded_optim_state_dict
andflatten_sharded_optim_state_dict
- Added rate limiter (#83917). Thanks to the IBM Research team, @lchu-ibm for his contributions to FSDP, and @hfwen0502 for the experimental testbed that identified the issues.
- Added an option to keep grads in lower prec (#85223)
torch.distributed.elastic
- Added watchdog to TorchElastic agent and trainers (#84081)
Activation Memory Management
(Prototyping)
- We offer a new API,
torch.distributed.algorithms.checkpoint.checkpoint_wrapper
to wrapnn.Modules
with activation checkpointing or activation offloading to easily use and experiment with activation checkpoint techniques without modifying model code. This makes it simpler to leverage activation checkpointing to reduce memory footprint of your training applications and train larger models. (#83035, #78704, #78854, #79830, #80089, #84907, #84908, #85448, #85449)
Infra (RelEng)
- Enabled multigpu unittests on FSDP (#77947)
- Added feature to do rebase (via comment) onto any branch (#78772)
- Added implementation to allow PR collaborators to revert their PRs (#82360)
- Added torchvision onto the commit pins file (#79151)
- Turned on
-Werror=all
with a few exceptions in Bazel build for CUDA (#79306) - Prepared for running PyTorch tests with TorchDynamo and skips for known failing tests (#80106)
- Added ROCm build to pull request jobs (#80149)
- Added dynamo test configuration (#80342)
- Enabled ROCm CI for trunk test (#80920)
- Added linux cuda 11.7 workflows (#81089)
- Updated CI docker images and jobs to ROCm5.2 (#81168)
- Added UCC PG build in CI (#81583)
- Enabled periodic builds for CUDA 11.7 (#81688)
- Enabled distributed tests for ROCm (#81751)
- Added New TORCH_UCC_BLOCKING_WAIT env variable (#81791)
- Change functorch pin mechanism to test functorch in pytorch/pytorch now that functorch is inside pytorch/pytorch (#81918)
- Added Python 3.11 nightlies for Linux PyPi (Please note that 3.11 binaries are not fully functional) (#82302)
- Updated ROCm nightly builds to rocm5.2 (#82353)
- Add functorch target to cmake (#83464)
- Upgraded CUDNN version for cuda 11.7 (#84964)
- Enabled pytest-shard for functorch (#85321)
- Enabled CI to run test_ops in parallel (#85528)
- Updated trunk CUDA-10.2 to CUDA-11.7 (#85943)
- Added support for building and running Metal tests in CI (#86073)
- Bumped nvidia docker version and using python 3.10 for cuda11.7 (#82472)
Improvements
Python API
- Added
float16
support fortorch.{arange, linspace}
(#80492) - Added integer support to
torch.index_reduce
(#80464) - Added a
stable
kwarg totorch.argsort
that controls the relative order of equivalent elements (#75162) - Improved stability of
torch.distributions.kl_divergence
for two Bernoulli distributions (#79944) - Improved type annotations for
torch.{as_tensor, as_subclass}
(#86105) - Added type promotion support for
torch.{addcmul, addcdiv}
(#74234) - Added
bfloat16
support fortorch.save
with XLA/HPU tensors (#77534) - Improved wrapper subclass detection for serialization (#81105)
- Updated python API
TensorOption
signatures for consistency with JIT schemas (#82241) - Allowed disabling of
torch.library.Library
with PYTORCH_DISABLE_LIBRARY (#85190) - Enabled
dim=None
fortorch.{mean, sum, nanmean, nansum}
(#81286), (#79881), (#82912) - Added feature to enable registration of extension device modules as a native module under the torch namespace (#78329)
- Added
logsumexp
toamp.autocast
(#76330)
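Two of the items above in action: the stable flag for torch.argsort and dim=None reductions:
>>> x = torch.tensor([2, 1, 2, 1])
>>> torch.argsort(x, stable=True)  # equal elements keep their original order
tensor([1, 3, 0, 2])
>>> torch.sum(torch.ones(2, 3), dim=None)  # dim=None now reduces over all dimensions
tensor(6.)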
C++ API
- Allowed
const T&
access toListElementReference
(#83177) - Redirected print messages to
stderr
intorch.utils.cpp_extension
(#82097) - Updated CUDA compiler matrix in
torch.utils.cpp_extension
(#82860) - Added
__all__
totorch.utils.cpp_extension
,torch.utils.hooks
andtorch.utils.show_pickle
(#85331)
Autograd
- Added forward AD coverage for
torch.{amin, amax, nansum, nanmean}
(#80082),torch.scatter_reduce
(exceptreduction=prod
) (#85000),torch.linalg.det
(#79487),torch.{elu_, celu_, selu_}
(#83080) - Added forward-over-reverse AD coverage for
nn.functional.{binary_cross_entropy}
(#77852) ,nn.functional.{embedding}
(#79699),nn.functional.{mse_loss, softplus, l1_loss, smooth_l1_loss, prelu, hardswish}
(#78740),nn.functional.{nll_loss, batch_norm, layer_norm, group_norm, cross_entropy, soft_min}
(#84976)torch.
{log_softmax, softmax}
(#84976),torch.amin, amax, nansum
(#80082) - Added support a stable double backward on
torch.linalg.det
for real inputs (#80217) - Added support for kwargs input to function when
torch.utils.checkpoint
withuse_reentrant=False
(#80987) - Added context manager to disable saved tensor hooks:
torch.autograd.graph.disable_saved_tensors_hooks
(#85971) - Added new cpp custom function API to inform the backward function whether a gradient is necessary to compute:
ctx->needs_input_grad(idx)
(#82544) - Added all device types in the pybinded DeviceType enum (#83676)
- Added
check_nan
flag totorch.autograd.detect_anomaly
which enables users to run anomaly mode without nan checking (#83481)
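A short sketch of the new anomaly-mode flag and the saved-tensor-hooks context manager:
x = torch.randn(3, requires_grad=True)
# Run anomaly mode without the extra NaN checks
with torch.autograd.detect_anomaly(check_nan=False):
    (x * 2).sum().backward()
# Disallow saved-tensor hooks in a region; registering one here raises with this message
with torch.autograd.graph.disable_saved_tensors_hooks("saved-tensor hooks are disabled in this region"):
    y = x * x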
Build
- Specify "Generic" BLAS library name to ensure PyTorch can find the BLAS llibrary (#74269)
- Generate CUDAConfig.h only for CUDA builds (#78218)
- Moved build_variables.bzl and ufunc_defs.bzl from pytorch-root/tools/ to PyTorch root directory (#78542)
- Made lintrunner compatible with M1 (#78628)
- BLAS library is linked privately instead of being linked publicly (#78883)
- Updated build targets to include generated enum_tag.cpp (#79668)
- Use miopen_LIBRARIES and rccl_LIBRARIES directly, when they are valid target for RCCL (#80446)
- Deleted Win specific case for CMake older than 3.1 (#81411)
- Split
.cu
to improve compile times (#81193) - Added
append_cxx_flag_if_supported
macro (#82883)
torch.nn
- Improved
groups
argument validation fornn.Conv{1,2,3}d
modules (#77919) - Improved error message for convolution backward fallback kernel (#81538)
- Reduced memory usage of
nn.Module
full backward hooks by removing reference cycles (#80139) - Improved
kl_div
at boundary and its general implementation (#80334) - Improved input shape validation for MKL-backed convolution operations (#76526)
- Improved input validation for
nn.AdaptiveAvgPool2d
(#84061) - Improved
groups
argument validation fornn.Conv{1,2,3}d
(#85248) - Improved input index validation for
nn.MaxUnpool{2,3}d
(#78280) - Improved listing of public APIs for
optim
andnn
(#80237) - Added new operator for
nn.Sequential
:+
(#81170),extend
(#81179),insert
(#81402),+=
,*
and*=
- Added deepcopy support for uninitialized parameters (#83809)
- Added nondeterministic alert for
nn.MaxUnpool
{1,2,3}d
(#84766) - Added Bfloat16 support for the backward pass of
nn.functional.kl_div
on CUDA (#77676)
torch.nn.optim
- Added support for optimizers with more than 2 betas for LRScheduler (#84486)
- Added
fused
kwarg tooptim.Adam
to enable a fused implementation on CUDA (#85739)
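For example, the fused implementation can be selected as follows (requires floating-point CUDA parameters):
w = torch.nn.Parameter(torch.randn(10, device="cuda"))
opt = torch.optim.Adam([w], lr=1e-3, fused=True)  # fused CUDA kernel instead of the for-loop/foreach paths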
Composability
- Significant hardening and improvements to the
functionalize()
API that lives with functorch (#77129, #77126, #77125, #78199, #77132, #77713, #77714, #78819, #78820, #82008, #82009, #81702, #80416, #80418, #80251, #80526, #82326, #81454, #81471, #83542, #83701, #85975) - Allow
__torch_dispatch__
subclasses and modes to override more tensor metadata: device/size/stride/dim (#77684, #77970, #78646, #78691) - Improvements to the
torch.library
API, for registering python functions to the pytorch dispatcher: - Ported
cholesky
,linalg_qr
,linalg_eigh
andlinalg_eighvalsh
to structured kernels, giving them support with meta tensors (#79300, #79054, #79072) - Added python decompositions for many torch operators. This adds meta tensor coverage for a large number of pytorch operators (#77930, #79768, #79808, #84062, #84350, #80219, #78350, #79667, #81003, #81420, #81113, #81241, #81765, #82284, #80497, #80358, #80182, #80737, #81734, #81826, #78461, #78468, #78525, #78914, #78919, #79900, #79225, #80964, #83235, #84108, #84451, #78602, #78603, #78527, #78604, #78992, #78993, #78997, #79278, #79341, #79311, #79411, #79581, #81800, #79834, #82309, #79975, #82587, #82603, #83191, #84349, #84460, #85793, #86057)
- Beefed up API for printing out operators registered to the dispatcher (#78995)
- Trued up
c10::FunctionSchema::operator<<
to print native_functions.yaml syntax (#79645) - Made it so that it is valid to set metadata after detach calls, like
x.detach().resize_(...)
(#83590) - Optimized
torch.ops.ns.opname.overload
accessor in__torch_dispatch__
(#85132)
Dataloader
- Added shape checking on argument weights for WeightedRandomSampler (#78585)
- Added support for random_split to accept percentages as lengths (#78877)
- Extended the collate function so that collate functions can be registered to handle specific batch types (#85748)
Functorch
functorch.jacfwd
now accepts arandomness
kwarg (#84220)- Improved the error message when using
vmap
on a function with no Tensor inputs (#83016) - Relaxed the
Tensor.as_strided
batching rule. This is a primitive used in forward-mode AD (among other things) and improves composability of vmap with other transforms (like jvp). functorch.functionalize
: added support for in-place views on inputs (#83993)functorch.functionalize
: moved this API out of thefunctorch.experimental
namespace (#85742)- Added vmap support for
linalg.cholesky
,linalg.eigvals
,linalg.eigvalsh
,linalg.matrix_norm
,linalg.matrix_power
,linalg.norm
,linalg.tensorinv
,linalg.solve_triangular
(#82177) - Added vmap support for
linalg.solve
(#82814) - Added vmap support for
linalg.cross
(#83759) - Added vmap support for
linalg.matrix_rank
(#83760) - Added vmap support for
linalg.pinv
(#83761) - Added vmap support for
Tensor.fill_
(#84015) - Added vmap support for
linalg.lstsq
(#82325) - Added vmap support for
linalg.lu_solve
(#85175)
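For example, the new vmap coverage makes batched linear-algebra calls straightforward (a minimal sketch):
import functorch
A = torch.randn(5, 3, 3)
B = torch.randn(5, 3)
X = functorch.vmap(torch.linalg.solve)(A, B)  # solves the 5 systems independently; X has shape (5, 3)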
LinAlg
- Added a
driver=
kwarg totorch.linalg.svd
andsvdvals
. Add cusolver gesvdaStridedBatched driver tolinalg.svd
(#74521) - Added opteinsum backend to
torch.einsum
(#86219) - Added path optimize kwarg to
einsum
(#84890) - Call view instead of sum in
einsum
to remediate MPS regression (#87135) - Ensure that we contract left to right in
einsum
(#87199) - Fixed opt_einsum defaults to be more reasonable (#86985)
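Assuming the opt_einsum package is installed, multi-operand contractions are routed through an optimized path, and the backend can be tuned via torch.backends.opt_einsum:
a, b, c = torch.randn(8, 16), torch.randn(16, 32), torch.randn(32, 4)
out = torch.einsum("ij,jk,kl->il", a, b, c)      # contraction order chosen by opt_einsum when available
torch.backends.opt_einsum.strategy = "optimal"   # or "auto" / "greedy"
torch.backends.opt_einsum.enabled = False        # fall back to the default left-to-right contraction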
Sparse
- Added
sparse_dim
anddense_dim
for batched, hybrid CSR/CSC/BSR/BSC (#80565, #80901) - Added support for conversion between batched CSR/CSC/BSR/BSC and dense Tensors (#80781, #83084, #83086, #78025, #80354, #82120)
- Added support for conversion between CSR and CSC (#85091)
- Added support for conversion between BSR and BSC (#85091)
- Added partial support for CSR/CSC/BSR/BSC inputs to
mm
,addmm
,matmul
andF.linear
(#85551, #85308, #85379, #85307) - Added support for COO to
permute
(#79707) - Added support for ComplexHalf to
torch.nonzero
andadd(dense, CSR)
(#79062) - Added support for CSC/BSR/BSC to unary zero-preserving functions. (#78173, #85031)
- Added support for batched BSR/BSC to
transpose
(#82122) - Added support for scalar together with COO inputs to
mul
(#82962) - Added support for CSC/BSR/BSC to
empty_like
(#82310) - Added support for batch dims of CSR/CSC/BSR/BSC to
select
(#82119)
torch.fx
- In constant folding, added
device_for_folded_attrs
parameter and sets therequires_grad
option for a folded tensor (#79067) - Mode-based tracing in make_fx (#79638, #84238)
- Made executor handle kwargs (#79858)
- Added
ignore_parameters_and_buffers
flag to FxGraphDrawer (#79982) - Enabled an
is_fx_tracing
flag in the FX tracer (#80255) - Attached ProxyTorchDispatchMode to ProxyTensor and use it in
__torch_dispatch__
(#82549) - Used
enable_tracing
flag for ProxyTorchDispatchMode instead of modifying torch dispatch mode stack inner attributes (#82643) - Improved legalize_graph pass in FX (#82874)
- Implemented
__deepcopy__
for fx.Tracer (#83130) - Hackde up make_fx to natively support varargs (#83210)
- Updated proxy_tensor.py to support List input/output (#83302)
- Added *_only and all/any pytree utilities (#83316)
- Deleted ProxyTensor wrapper subclass (#83330, #83646)
- Added support for partial decompositions in make_fx (#83770)
- Added metadata field to fx.GraphModule (#84378)
- Added option to maintain the FX graph execution order after splitting_module (#85188)
JIT
- Added PReLU to MKLDNN convertible Ops in JIT optimize_for_inference (#79011)
- Enabled
torch._refs.var
for nvFuser executor (#79517) - Fixed nvFuser's
where
(tensor, python_scalar, tensor) type promotion (#80347) - Added ComplexDouble scalar creation bindings to nvFuser's Python API (#80522)
- Added real and imag to NVFuser and its python frontend (#79824)
- Added Nvfuser opt in for decomposition (#81134)
- Added
torch.jit.fuser()
option for disabling all fusers (#81731) - Added support for symbolic diff for
silu
(#81724) - Added NVFuser support for (
prims.sign, refs.sign, squeeze, native_batch_norm, transpose
) (#83167, #85562, #84629, #84117) - Use high precision accumulate buffer for bf16 accumulation in NNC (#84402)
Quantization
- Improved quantization support for
masked_fill
(#78368, #85108) - Improved quantization support for
index_put
(#78384, #85685) - Improved quantization support for
LSTM
andMultiHeadAttention
(#79959, #79956, #79960, #83304, #85068) - Added support for quantized
matmul
(#83885) - Introduced a more stable conv_bn fusion for QAT training (#85744)
- Removed warnings from using torch.tensor(value) (#84277)
ONNX
- Added operator support for
torch.tensor_split
(#77437),torch.lerp
(#78891),torch.movedim
andtorch.moveaxis
(#78931),torch.scatter_add
(#79103),torch.argsort
(#80234),aten::native_dropout
(#81743),aten::native_layer_norm
(#81754),aten::convolution
(#81815),aten::_log_softmax
(#81804),aten::layer_norm
for ONNX opset version 17 using LayerNormalization (#84293),nn.init.normal
(#84149) - Added quantization support to more single output ops (#83008)
aten::reshape
,aten::reshape_as
,aten::t
,aten::transpose
,aten::numpy_T
,aten::expand
,aten::expand_as
,aten::embedding
,aten::embedding_bag
,aten::view
,aten::select
,aten::eq
,aten::ne
,aten::gt
,aten::lt
,aten::le
,aten::ge
,aten::elu
,aten::selu
,aten::hardtanh
,aten::hardswish
,aten::as_strided
,quantized::sigmoid
,quantized::layer_norm
,quantized::group_norm
,quantized::leaky_relu
,quantized::instance_norm
- ONNX operators are exported with names containing their associated scope from
nn.module
(#82038), (#82039), (#82040) - Introduced runtime type checking with the beartype library in all public APIs (#83673), (#84091)
- All
torch.onnx
APIs now support runtime type checking when @beartype is present in the Python environment. A warning is emitted when a type mismatch is detected. - This feature is experimental. To turn all warnings into errors, set the environment variable
TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=ERRORS
. To disable this behavior, setTORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=DISABLED
which effectively makes it a no-op. - Improved shape type inference (#78999)
- Turn on ONNX shape inference by default (#82767)
- Enabled data propagation from ONNX (#80730)
- Introduced SARIF (#85428) for
torch.onnx
submodule - Improved warnings and errors (#78441), (#78309), (#83332), (#85179), (#83007)
- Updated ONNX submodule to 1.12 (#79585)
- Apply Common Subexpression Elimination pass to ONNX export (#85665)
AMD
- Support benchmark flag for MIOpen (#77438)
- Correctly handle the error codes of hipGetDeviceCount (#80405)
- Use torch._C._cuda_getArchFlags to get list of gfx archs pytorch was built for (#80498)
torch.cuda.is_bf16_supported()
returns True (#80410)- Workaround missing hipProfilerStart/Stop (#82778)
- Enabled jiterator on ROCm (#77982)
- Enabled MIOpen fused convolution relu (#82002)
- Restore MIOpen benchmark flag default to true (#82656)
- embedded_interpreter_hip to enable torch::deploy on AMD (#83329)
- Add HIP libs into torch deploy init list & corresponding dependency for CURE benchmark running on AMD (#83434)
CUDA
- Added synchronize hooks (#84427)
- Added CSAN support for CPU synchronizations (#84428)
- Return device count using nvml (#84879)
- Reworked printing tensor aliases in CSAN error message (#85008)
- Added jiterator support when dtype is
complex32
fortan
,atan
,sin
,asin
(#77802),(#77606) - Added jiterator support when dtype is complex for
logical_{or, xor}
(#75947) - Reduced overhead of
get_current_stream
(#78066) - Added an argument to specify warmup iterations in make_graphed_callables (#78124)
- Small improvements to
device_count
(#85192) - Memoize
torch.cuda.device_count
(#84878) - Remove the construction of unused tensors in fallback convolution implementation (#79183)
__launch_bounds__
fortorch.mode
with CUDA 11.7 (#79710)- Removed synchronization for D2H copy with a different dtype (#80607)
- Added nondeterministic alert to CUDA
cumsum
- Annotated CUDACachingAllocator snapshots (#82146)
- CUDACachingAllocator snapshots from C++ (#86190)
- Propagate CUDAOutOfMemoryError to Python. (#83146)
- Set cublas workspace size to 4M (#74159)
- Allow changing the cuda allocator settings even after the process started (#84970)
- Fixed exception handling, improve overheads and avoid constructing storage for element size for DLPack (#84612)
- Added BFloat16 for fast layernorm (#83971)
- Added BFloat16 support for
torch.{im2col,col2im}
on CUDA (#84372) - Added Bfloat16 support for
ReflectionPad
(#84949) - Added explicit
__all__
to torch.cuda (#85193) - Set CUDA_MODULE_LOADING to LAZY when not set by the user (#85692)
- Support cuDNN Errata Filter (#73934)
- Allowed the number of kernels profiled under torch.backends.cudnn.benchmark = True to be limited (cuDNN v8 benchmark limit) (#78299)
- Update tests and dispatching for CUDNN V8 API behavior for bfloat16 convs (#81139)
Intel
- [RFC] Enable oneMKL & oneDNN on-demand verbose functionality (#63212)
- Updated ideep for NNC post-op (#82705)
- Enabled native 1d spatial input for Intel xpu (#82301)
- Added loss operators to fp32 cast policy of AutocastCPU (#81689)
- Added bfloat16 support for
lerp
on CPU (#84327) - Added
prelu
op and module for quantized CPU backend (#73491) - Enabled mkldnn matmul for aarch64 bf16 devices (#85546)
MPS
- Added ranked tensors for addcmul ops in MPS instead of constants and update MacOS version check (#78354)
- Moved MPS compat check into common comparison machinery of
TensorLikePair
(#77836) - Made MPS buildable with either XCode or CommandLineTools (#79430)
- Improved MPS
aten::softplus
operator by adding RankedPlaceholder for graph nodes instead of constants (#81169) - Extended MPS Conv1D operation for NHWC format (#83121)
- Added support for 1D weights in MPS linear layer (#85752)
- Added full support for serialization of MPS Tensors (#79465)
- Added support for 1D bias in MPS operation
torch.addmm
(#81519) - Added torch dispatch stub code for MPS backend (#82612)
- Use convenience helper function
dispatch1DJob
for MPS native implementations (#82982) - Enabled support in MPS for
torch.adaptive_avgpool_2d
for larger output sizes (#85726) - Extended support in MPS for
torch.constant_pad_nd
for 4D+ padding (#85991)
Profiler
- Propagate metadata into
Engine::evaluate_function
event. (#77696) - Switched to nanoseconds for Result's internal representation (#77697)
- Made profiler table column widths changeable via arguments (#85203)
Vulkan
- Enabled higher dimensional input in
torch.nn.linear
(#81773) - Vulkan tensor views now infers dim size when -1 is provided as input (#81668)
- Vulkan prepacked op contexts will now release the deserialized CPU tensors from memory upon construction (#83587)
- Vulkan shader codegen is now Windows compatible (#85241)
Mobile
- Allowed tracing multiple input models at once (#84833)
- Leaky
relu
in metal shader (#78544) - Added detailed error message for iOS test (#79140)
- Removed code duplication and refactored (#79184)
- Optionally run fbgemm in tracer (#83531)
- Added hardshrink op to metal backend (#82224)
- New flatbuffer_loader functions that do not depend on flatbuffers.h (#82618)
- Added
max_pool2d
,linear
,conv2d
FP32 operator tests for XNNPACK (#83131) - Removed flatbuffer types/headers from flatbuffer_serializer[_jit].h (#82619)
- Migrated remaining pytorch code to use new flatbuffer_loader.h APIs (#82620)
- Remove flatbuffer types/headers from flatbuffer_loader.h (#82893)
- Use flatbuffer of alternate namespace (#82952)
- Hide flatbuffer build dependencies (#82953)
- Renamed flatbuffer_all to flatbuffers_jit (#82826)
- Renamed flatbuffer_serializer to _mobile or _full_jit (#82827)
- Created flatbuffers_mobile (#82828)
- Added API for profiling backend memory events for Edge CPU profiler (#80350)
- Switched mobile targets to flatbuffers_mobile (#82829)
- Added an option to avoid adding base ops to static op library for Edge (#84360)
- Fixed load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855)
- Remove unused field 'order_' in nnapi.h (#84067)
Distributed
Distributed(c10d)
- c10d API improvements:
- Improvements to c10d error messages:
- Passed group ranks and options to third party distributed backends (#73164)
- Enabled NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG is set to DETAIL (#83881)
- Added a soft error handling mode `NCCL_ASYNC_ERROR_HANDLING=2` that does not crash the process (#84386) (see the sketch after this list)
- Upgraded NCCL to 2.14.3 (#85367)
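For the soft error-handling mode above, the environment variable has to be set before the process group is created. A minimal sketch (an assumption of this example: the job is launched with `torchrun`, which supplies the rendezvous environment variables):

```python
import os

import torch.distributed as dist

# Opt into soft NCCL error handling: asynchronous errors are reported without tearing down the process.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "2")

# torchrun sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT, so the default env:// init can be used.
dist.init_process_group(backend="nccl")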
Distributed Optimizer
- Added functionality to save and restore the step counter for the model averager in PostLocalSGDOptimizer (#78988)
DistributedDataParallel
- Enabled the static graph to print unused parameters in debug mode for DDP. (#81929)
- The stateful PowerSGD communication hook can now be saved and reloaded to resume training (#79334)
FullyShardedDataParallel
- Allowed different `optim_input` orders across ranks (#78599)
- Added profiling range for FSDP.backward (#78479)
- Enabled NamedTuple support for FSDP (#83055)
- Added FSDP communication hook interface for NO_SHARD strategy (#79833)
- Moved the `sharded_state_dict` logic to the post hook to avoid OOM (#82613)
- Added ability to iterate through dataclasses in fsdp.utils (#82638)
- Enabled passing kwargs to load_state_dict (#83309)
- Used `_init_from_local_tensor` to create ShardedTensor to avoid communication overhead (#82911)
- Added communication hook for sharded strategies (#83254)
- Changed to print exec order only in debug mode (#83868)
- Ensured that all ranks use the same order to iterate through optimizer states (#84654)
- Copied optimizer states to GPU before gathering, since they may be on CPU (#84708)
- Handled the `state_dict` on CPU cases (#85640)
- Added `FSDPExtensions` for TP support (#85039)
- Ignored buffers that are non-persistent (#85740)
- Delayed moving tensor to CPU until necessary for optim_state_dict() (#85761)
- Dequeue one event instead of flushing for rate limit (#86165)
torch.distributed.elastic
- Implemented a named pipe based watchdog timer (#83695)
Infra (RelEng)
- Consolidated all python targets in the tools folder (#80408)
- Improved ios simulator test in CI (#80459)
- Add functorch testing shard in CI (#81283)
- Added functorch shards for windows CI (#82161)
- Added functorch shard for mac x86 tests, linux cu102 tests (#82000)
- Added CI workflow to build official docker images with multiarch (#83437)
- Sharded `trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (default)` from 2 -> 4 (#83424)
- Migrated workflows from 18.04 to 22.04 (#83861)
Bug fixes
Python API
- Fixed `dim` out of range check for `logcumsumexp` on CUDA when the source tensor is empty (#78284)
- Added missing `__init__.py` for `torch.utils.jit` (#78629)
- Fixed backward crash for `gather` with an empty index tensor when `sparse_grad=True` (#78698)
- Added type annotations to `torch.distributions.kl_divergence` (#78432)
- Fixed erroneous inclusion of `end` in the output of `torch.arange` for some inputs (#80758)
- Fixed `torch.distributions.Transform` to be pickle-able (#81707)
- Added check that `self` and `mask` are on the same device for `torch.masked_fill` (#82737)
- Fixed potential ref cycle creation in `torch.utils.checkpoint` (#82776)
- Fixed `Tensor.__hash__` for Tensor subclasses (#83174)
- Fixed `torch.cat` for 0-dim tensors with different dtypes (#83391)
- Fixed `torch.equal` on CPU when inputs have different dtypes (#83350)
- Fixed data-dependent shapes in `torch.distributions.{HalfCauchy, HalfNormal}` (#84322)
- Added check that the size of the last dimension of `tau` is less than or equal to that of `input` in `torch.ormqr` (#85278)
- Added check that `weights` is a 1D tensor in `torch.bincount` (#85881)
- Fixed segfault for `out` arguments that have a large number of dims (#85294)
- Fixed comparison ops with scalar arguments by removing overflow check (#78881)
- Normalized `torch.utils.dlpack` strides to 1 where the size of the corresponding dimension is < 2 (#83158)
- Added a check in `torch.empty_strided` that `sizes` has the same dimensionality as `strides` (#82422)
- Fixed `torch.istft` default output length to prevent trimming of the last element (#80031)
C++ API
- Fixed missing antialiasing path to the interpolation for bicubic mode (#84599)
- Added `IListRefTag::Materialized` to `IListRefIterator` destructor (#85467)
- Fixed `im2col` by adding a check that `pad_width` and `pad_height` are non-negative (#85541)
- Fixed `check_compiler_ok_for_platform` on non-English locales in `torch.utils.cpp_extension` (#85891)
Autograd
- Corrected the forward AD formula of `torch.sgn`, which fixed forward-over-backward for `torch.linalg.svd` and other spectral decompositions, as well as `torch.norm` and `torch.linalg.{norm, matrix_norm}` (#80082)
- Fixed derivatives of convolution overridable backward (#80840)
- Updated setting non-float, non-complex values for a forward AD dual tensor to properly error (#78361)
- Fixed forward AD to not set tangent as-is in some situations (#79664, #79653)
- Fixed cpp hooks, retains grad, and `backward(inputs=)` behavior in-place (#79996)
- Relaxed storage layout checks for forward AD with zero-numel tensors (#81055)
- Fixed leak when `create_graph=True` and a full backward hook is registered (#82788)
- Fixed view and in-place interaction when grad_fn is first accessed in no-grad mode (#83872)
- Updated backward of `torch.stack` to correctly handle implicit real->complex casting (#84993)
- Fixed gradients for `torch.nn.functional.{leaky_relu, threshold}` when inplace=True (#85634)
- Corrected autocasting behavior in `torch.utils.checkpoint` when use_reentrant=False (#81766)
- Fixed gradcheck when outputs that don't require grad precede those that do (#77743)
- Fixed backward and double backward for `nn.functional.binary_cross_entropy_with_logits` (#80083)
- Fixed derivatives of `norm(p=inf)` (#78105)
- Fixed forward AD when the conj-ness of the primal and tangent of the dual tensor do not match (#78358) (see the sketch after this list)
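Several of the fixes above exercise forward-mode AD dual tensors. A minimal sketch of that API using the public `torch.autograd.forward_ad` helpers (illustrative only, not code from the PRs):

```python
import torch
import torch.autograd.forward_ad as fwAD

primal = torch.randn(3)
tangent = torch.ones(3)

with fwAD.dual_level():
    dual = fwAD.make_dual(primal, tangent)   # a "dual tensor" carrying primal + tangent
    out = torch.sgn(dual)                    # torch.sgn's forward AD formula was corrected above
    jvp = fwAD.unpack_dual(out).tangent      # Jacobian-vector product of sgn at `primal`
```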
Build
- Use C++17 for RocksDB 7 header. (#75741)
- Fixed Windows builds with _DEBUG flag (bbe8d01)
- Pass WITH_BLAS option from environment to CMake (#78037)
- Remove `-Wno-unused-but-set-variable` for clang 13.0.0 (#79666)
- Fixed variable typo for USE_SYSTEM_PYBIND11 (#80272)
- Fixed compilation errors during build with clang13 (#80916)
- Added missing -fexceptions flags during PyTorch build (#81394)
- Fixed CMake dev warning (#81580)
- Fixed false positive AVX, AVX2 and AVX512 detection with MSVC (#82554)
- Fixed NCCL detection issues of the Gloo library (#82773)
- Fixed objcopy version detection in NCCL cmake process (#82774)
- Fixed build error by changing COLORIZE_OUTPUT option to USE_COLORIZE_OUTPUT in cmake file (#83716)
- Set default value for NCCL make to MAX_JOBS if ProcessorCount returns 0 (#84231)
- Fixed intermittent link errors in NCCL build (#84245)
- Deleted `torch._dl` extension (#84361)
- Used unified source file list for BUCK build (#84770)
Complex
- Fixed the derivative of `torch.acosh` for complex numbers (#80841)
- Removed unused conjugate kernels for real dtypes (2.2MB reduction in CUDA binary size) (#80374)
torch.nn
- Fixed `nn.Embedding`'s `max_norm` argument when forward mode AD is used (#78560)
- Fixed `nn.ChannelShuffle` when given empty Tensors (#77029)
- Fixed `nn.RReLU` backward on CUDA (#80434)
- Fixed spurious warnings in `torch.nn.parallel.*` APIs (#81476)
- Fixed `nn.Conv2d` fallback implementation for single channel inputs and channels last weight (#82392)
- Fixed segfault in adaptive pooling for specific index values (#84010)
- Fixed type annotation in `nn.Conv{1,2,3}d` for in_channels (#84302)
- Fixed `nn.GeLU` for empty inputs (#84926)
- Fixed correctness issues for `nn.Conv2d` on ARM-based machines (#85711)
- Fixed `nn.ParameterList` printing of Tensors on the “meta” device (#78529)
- Fixed channels-first behavior for `nn.MaxPool3D` on CUDA (#80748)
- Fixed input shape validation for `nn.MaxPool1d` (#85594)
- Fixed `nn.Softmax` for large input tensors (#84182)
- Fixed lower and upper bound checks for `nn.RReLU` (#84996)
- Fixed edge cases in `torch.nn.grad` by calling into the C++ backward kernel directly (#81839)
- Fixed `torch.nn.PixelShuffle` for empty inputs (#86262)
- Fixed consistency of output and input dtypes for `torch.nn.BatchNorm` (#84410)
torch.nn.optim
- Fixed `optim.SGD`'s `maximize` flag when `momentum` is involved (#81859) (see the sketch after this list)
- Fixed temporary bug where checkpoints from optimizers created with an older PyTorch version could not be loaded (#83588)
- Fixed memory leak in `optim.lr_scheduler.CyclicLR` (#85462)
- Fixed initialization of `lr` in `optim.lr_scheduler.SequentialLR` (#72856)
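As a minimal sketch of the `maximize` + `momentum` combination covered by the first fix above (illustrative usage of the existing `torch.optim.SGD` flags):

```python
import torch

param = torch.nn.Parameter(torch.zeros(3))
# maximize=True flips the update direction; its interaction with momentum is what #81859 fixed.
opt = torch.optim.SGD([param], lr=0.1, momentum=0.9, maximize=True)

for _ in range(5):
    opt.zero_grad()
    objective = (param * torch.tensor([1.0, 2.0, 3.0])).sum()  # gradient ascent should increase this
    objective.backward()
    opt.step()
```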
BetterTransformer
- Cleaned up native transformer implementation (#78265)
- Added fastpath test for mask check flag (#82999)
- Added check for contiguous well-formed mask (#79927)
- Introduced mask contiguity check function (#79186)
- Fixed issue in softmax.cu with transformer error when mask `seqlen > 1024` (#83639)
- Disabled Transformer/MHA fast path when autocast is enabled (#84722)
- Moved odd `num_head` in TransformerEncoder to `slow_path` (#83483)
Composability
- Fixed `__torch_function__` bug in getindex that caused an "error not set" exception (#78781)
- Fixed `__torch_dispatch__` usage with inplace views (#79902)
Dataloader
- Fixed "`NoneType` object has no attribute `python_exit_status`" when `DataLoader` exits (#83985)
Functorch
- `functorch.grad`: fixed silent correctness issue from calling a view operation on a captured tensor followed by an in-place operation (#85374)
- `functorch.jacrev`, `functorch.jacfwd`: fixed loud in-place errors when passing in inputs to the transforms and mutating them (#84914, #84915)
- `functorch.vmap`: Fixed support for in-place view operations (`Tensor.unsqueeze_`, `Tensor.transpose_`, `Tensor.t_`, `Tensor.squeeze_`) (#82899, #82903, #82972)
- `functorch.vmap`: added an error on incorrect `weight` shape to `torch.nn.functional.prelu` (#83106)
- `functorch.vmap`: fixed support for multinomial (#83838)
- `functorch.vmap`: fixed incorrect support for `conv_transpose` with `groups > 1` (#84938)
- Fixed `vmap` x `vjp` x `vjp` composition for `torch.nn.functional.prelu` (#84939)
- Fixed printing tensors that are not being transformed over inside functorch transforms (#85556)
- Disallowed saved tensor hooks in functorch transforms to avoid silently incorrect behavior (#85972)
- Fixed `cross` to match unbatched behavior (#86926) (a sketch of composed functorch transforms follows this list)
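Most of the fixes above concern composed transforms. A minimal sketch of such a composition, computing per-sample gradients with `vmap(grad(f))` from the in-tree functorch package (illustrative only):

```python
import torch
from functorch import grad, vmap

def f(x):
    # scalar-valued function of a single sample
    return (x.sin() ** 2).sum()

samples = torch.randn(5, 3)
per_sample_grads = vmap(grad(f))(samples)   # gradient of f for each row, vectorized over dim 0
print(per_sample_grads.shape)               # torch.Size([5, 3])
```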
LinAlg
- Strengthen the preconditions of `linalg.cross` (#83798)
- Fix memory issues in `linalg.lstsq` (#85357)
- Fix `linalg.lu_solve`/`torch.unpack` to prevent bad memory usage on CPU (#85922)
- Preserve the dim of the input in `matrix_exp` (#81330)
Sparse
- Fixed COO Tensors with fewer than two non-zero elements to always be marked coalesced (#82426, #82085)
- Fixed CUDA kernel launch misconfiguration for `mul` on tiny COO tensors (#80254)
- Fixed silent type promotion bug by `select` if given all zero integer COO tensors (#82215)
- Fixed CUDA kernel coverage on 0-sized dense inputs for `torch.sparse.sampled_addmm` (#85194)
torch.fx
- Fixed bug where curly brackets were not properly escaped in FxGraphDrawer (#83604)
- Fixed torch.fx.wrap to use the callable `function.__name__` rather than `function.__code__.co_name` (#84373)
- Added strictness check and made tensors into leaves if input tensors were leaves (#77474)
- Used getattr_recursive instead of getattr when splitting (#80011)
- Stopped ProxyTensor from turning aten::lift tensors into proxy objects (#81024)
- Fixed named_modules to be subscriptable (#81258)
- Fixed `to_folder` by adding custom_builtins to dump (#81433)
- Correctly unpacked constants when used in multi-return output (#82568)
- Replaced module name for torch.ops (#82395)
- Removed unnecessary `import warnings` (#82760)
- Don't constant propagate through nondeterministic functions (#83650)
- Don't extract tensor metadata from sparse tensors (#83669)
- Skipped folding side-effectful functions (#84016)
- Fixed make_fx issue by introducing get_attr into symbolic tracing (#84011)
- Disabled autocast cache during aotdispatch (#84035)
- Modified split_by_tags to retain output order (#84136)
- Made NormalizeArgs preserve node type (#85637)
- Fixed PyTree unpacking carrying forward type annotations (#81906)
JIT
- Fixed conv-batchnorm folding for previously-broken datatype inputs during JIT freezing (#78241)
- Fixed lightweight dispatch OOM error by introducing selective build (#79215)
- Used signed integers in `CalculatedNecessaryArgs` to avoid underflow with schemas where all args have defaults (#79331)
- Fixed indexing into a tensor with a tuple (#79335)
- Propagate `map_location` arg to `torch.jit.load` in `torch.load` (#78733)
- Improved JIT autodiff heuristics for determining whether outputs require gradients (#78392, #79498)
- Used streams for `import_ir_module` for pickle case to reduce memory usage (#80131)
- Added scripting support for the "start" kwarg in `enumerate()` (#80585) (see the sketch after this list)
- Turned off ARC in the CoreML backend, because throwing exceptions in ARC code leaks memory (#79928)
- Suppressed virtual-dtor check on llvm_jit to fix NNC build (#81449)
- Fixed annotation extraction for Python 3.10 (#81334, #81506)
- Fixed `std::out_of_range` when using NNC and `ConstantChunk` input shapes are unknown (#82698)
- Limits constant chunk propagation for pw-node-only in NVFuser (#83083)
- When encountering dynamic types, cast them recursively (#83218)
- Fixed handling of empty dim list in `sum_mean_dim` symbolic shape fn (#83357)
- Check existence of the array ref when tracing `resize_` to avoid `_MapBase::at` runtime error (#81422)
- Fixed `define_constant` pybind signature to match `std::complex` scalar in NVFuser (#83684)
- Cast to signed char to fix aarch64 build (#84429)
- Support `torch.ScriptObject` in `torch::jit::as_object` (#84398)
- NVFuser torchbench patch to take nvprim fallback when no cuda tensors are provided as inputs (#84411)
- Fixed coreml gpu flag not set (#84725)
- Print the real type for function schema arguments (#85103)
- Fixed `torch.jit.trace` check that was causing tracing to fail for MPS inputs (#84850)
- Throw an error instead of segfaulting when passing `None` to futures (#85304)
- Cherry-picked sorting patch for the NVFuser fusion segmenter (#85620)
- Support freezing modules that don't have a forward method (#85779)
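As a small illustration of the scripting fix for the `enumerate()` `start` kwarg above (a sketch, not code from the PR):

```python
import torch
from typing import List

@torch.jit.script
def one_indexed_positions(values: List[int]) -> List[int]:
    out: List[int] = []
    for i, v in enumerate(values, start=1):  # `start=` is now accepted by the TorchScript compiler
        if v > 0:
            out.append(i)
    return out

print(one_indexed_positions([3, -1, 7]))  # [1, 3]
```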
Quantization
- Added channel axis bound checking in `fused_moving_avg_obs_fake_quant_*` (#78148)
- Disable use of qnnpack with `ceil_mode` of the `avgpool` op (#79028)
- Improve subpackage import in `torch.nn.quantized` (#84141)
- Fix segmentation fault in `QTensor.choose_qparams_optimized` (#85552)
- Enhance the `_rebuild_qtensor` function to support device types other than CPU (#78234)
- Fix `at::from_blob_quantized_per_tensor_affine` strides calculation (#79314)
- Fix embedding quantization issue when memory format is not `contiguous` (#82605)
- Fix dispatch declaration bug about quantized op (#83649)
- Moved the order of x86 engine to avoid changing the default qengine (#86631)
ONNX
- Fixed `aten::mul` with Boolean inputs (#81671)
- Fixed `add` and `sub` for non-tensor inputs (#81736)
- Fixed `RReLU` eval mode behavior (#82678)
- Fixed onnx optional node type in for/if block (#83599)
- Fixed `Interpolate`: use `half_pixel` instead of `pytorch_half_pixel` (#80003)
- Fixed `argmin` and `argmax` edge case consistency with PyTorch (#79503)
- Shape Type Inference and Propagation:
- Fixed shape inconsistency when exporting scalar `log2` (#78701)
- Fixed inconsistent `rand` dtype (#79193)
- Fixed linalg `norm` output's shapes and dtypes (#79506)
- Fixed `any` and `all` outputs' shape (#79371)
- Fixed `prelu` output's shape (#79846)
- Fixed onnx logical functions' dtype (#79339)
- Fixed `hardshrink` and `softshrink` output's shape (#79695)
- Fixed quantization outputs' dtype (#79690)
- Fixed reduce node shape inference (#85765)
- Fixed bug using `std::copy_if` (#80999)
- Fixed default function value in `_optimize_graph` (#83996)
- Fixed constant folding unexpectedly adding folded constant as initializer (#79552)
- Fixed autograd subgraph recording with nested graphs (#82852)
- Disabled autocast cache in exporter (#84219)
- Removed static None graph output (#82623)
- Fixed float point detection for optional tensor (with unknown rank) within a list (#81386)
- Support `device().type()` string comparison with constant (#86168)
- Fixed `scalar_type_analysis` metadata for copied constant (#86716)
- Fixed triu/tril export with diagonal input (#86843)
- Ignore `print(Tensor)` during tracing (#86223)
- Updated training state logic to support ScriptedModule (#86745)
AMD
- Fixed memory cross-border access on the ROCM platform (#76100)
- Set nvfuser default to disabled (#86369)
CUDA
- Fix how we handle host memory in CUDA `getDeviceFromPtr` (#76902)
- Only sync CUDA if the operation is run on GPU (#80328)
- Do not use `thrust::lower_bound` on device (#80746)
- Fix `set_requires_cuda_init` (#81183)
- Fix behaviour of index_add / atomicAdd(bool,bool) (#85100)
- Fix IMA for topk (#83042)
- Use `opmath_t` for activation functions in Activation.cu (#77949)
- Fixed the invalid configuration argument error when running layer norm backward (#80893)
- Support non-standard bools in CUDA unique (#79392)
- Accept non-standard bools in more CUDA kernels (#78957)
- Fix cuda-mode and add more tests (#81898)
- Clear autocast amp cache in CUDA Graphs (#81896)
- Properly compute `batch_element_count` in `warp_softmax` (#82927)
- Disabled autocast cache in torch.cuda.make_graphed_callables (#84289)
- Store RNG seed for CUDA graphs (#84967)
- Assert `lambda >= 0` in poisson distribution cuda kernel (#85906)
- Work-around 32-bit indexing failures in cuDNN batchnorm (#87861)
- Fixed 3d convolution_add_relu in V8 (#85055)
Intel
- Fixed bug for thnn_conv2d when input's C is 1 and weight is channels last (#82392)
- Fixed oneDNN channels_last path issue (#83653)
- Fixed the issue that torch.config does not respect the USE_MKLDNN flag (#75001)
- Made the data types of output and input consistent for batchnorm (#86784)
- Fixed the issue that cat result would be incorrect for channels-last (#85076)
- Fixed the performance issue that the for-loop before ExternalCall could not be parallelized (#85056)
- Fixed a performance issue with the for-loop before ExternalCall (#86516)
MPS
- Fixed MPS operator torch.full for boolean types (#82575)
- Extend MPS Unary operators for empty tensors which should be a no-op (#82650)
- Fixed MPS operator `torch.scatter` for boolean types (#82685)
- Fixed MPS operator `torch.cat` for boolean inputs (#81480)
- Fixed typo in MPS allocator (#83465)
- Fixed MPS operator torch.full to handle uint8 types (#83697)
- Fixed creation of `MPS::Placeholder` behavior for transposed view operations (#85689)
- Fixed handling of output shape for empty inputs to binary ops in MPS backend (#85836)
- Added support for handling scalar inputs to MPS operations of `torch.scatter` and `torch.gather` (#85842)
- Support for handling compatible inputs to MPS operation of torch.where (#85946)
- Added support for inputs with datatypes Short, Byte & Char to torch.dot MPS operation by casting to int32 when needed (#86140)
- Remove incorrect asserts in MPS backend from Copy.mm file (#86184)
- Added support for handling of 1D inputs for MPS operation `torch.nll_loss` (#81290)
- Get correct size of the view tensor when copying from cpu to mps device (#81730)
- Fix issues exposed in MPS testConsistency tests. The fix includes correct handling of types in smooth l1 loss, 0 dimensions for torch.repeat and empty inputs for torch.cat operations (#81735)
- Handle Integer inputs for MPS linear layer by returning error of unsupported data types (#82183)
- Workaround int8 datatype outputs in MPS for View operations (gather) by casting it to int8 (#82315)
- Improve handling of empty outputs and fix MPS linear layer’s handling of transposed Tensors in test consistency (#83124)
- Fixed handling of conv1D and conv2D MPS operations with non-matching strides/paddings (#83522)
- Fixed handling of MPS::Placeholder when View operation is missing gather graph (#83744)
- Fixed the index handling in MPS for torch.constant_pad_nd operations with single-dimension input (#83745)
- Handle casting for MPS torch.div operation in case of type mismatch (#84742)
- Fix device (MPS) to host (cpu) copy by casting from a smaller dtype to a bigger dtype (#84928)
- Ensure as_strided_tensorimpl is never called with MPS (#85020)
- Fixed integer rounding crash in torch.div MPS operation on M1 (#85016)
- Fixed crash in MPS bitwise ops on Mac x86 platforms. (#85285)
- Fixed crash in MPS Conv1d backward operation for NHWC (#85283)
- Added support for MPS reduction operations of scalar edge-cases (#83743)
- Fixed memory corruption in torch.var operation for MPS (#85571)
- Fixed memory leaks in MPS that cause the MTLBuffers not to be released and cause OOM (#85661)
- Fix test consistency error in MPS due to type mismatch between int8 and uint8 types (#85666)
- Fixed shape issues for torch.clamp op in MPS (#85673)
- Fixed handling of TensorBase shapes for view ops in MPS for case of multiple slices on a Tensor (#85934)
- Fix the dimension of padding to match the input's dimension for MPS Pad operations (#85990)
- Fix non-contiguous to contiguous copy of MPS tensors (#86056)
- Remove `std::cout` from MPS `multinomial` operation (#86246)
- Do not dispatch empty job in bitwise_not (#87286)
- Made copy from CPU always add storageOffset (#86958)
- Revamped `copy_to_mps_` implementation (#86956)
Package
- Added fix for implicit numpy dependency (#78979)
- Allowed torch._C to be recognized as a module in torch.package (#80917)
- Ignore return value of function declared with 'warn_unused_result' for torch::deploy (#84862)
- Removed torch::deploy from pytorch (#85953)
Profiler
- Fixed build failure in python 3.10 (#81812)
- Pop `KinetoThreadLocalState` at the start of post processing (#77996)
- Fixed record function inputs_valid_ check (#78002)
- Weakened ordering check during post processing. (#78563)
- Fixed Python parent id (#79356)
- GIL acquire needed in ValueCache::trimPrefixes (#81061)
- Added ephemeral inputs to the value cache. (#81958)
- Fixed profiling with record_shapes=True and nested tensor (#82854)
- Proper reset execution graph data in remove callback registration (#82910)
- Solved two syntax issues when dumping execution graph result to json file. (#81854)
- Set end time on python events when profiling stops. (#83621)
- Don't try to collect strides for non-strided tensors (#83935)
- Add null handling to `AppendOnlyList::copy` memcpy path (#83963)
- Add quoted metadata API to remove empty trace cpu_op metadata (#84128)
- Make `RecordQueue` manage the lifetime of `PythonTracer` (#83964)
- Don't assign in AppendOnlyList::emplace_back (#85716)
- Fixed traversal utility (#85717)
- Fixed python object reference counting (#85847)
Visualization
- Removed dependency on `torch.onnx` in `graph` (#82628)
- Updated `Image.ANTIALIAS` to `Image.Resampling.LANCZOS` in summary (#85679)
Vulkan
- Fixed the `aten::cat` operator registration (#78806)
- Fixed a bug in GRU where incorrect behaviour was being observed when `H_in != H_out` (#78945)
- Fixed a possible null pointer dereference in the `aten::mm` operator when passing an empty bias (#79701)
- Code under `ATen/native/vulkan/api` was essentially rewritten (more details below); as a result of these refactors, it is now possible to concurrently execute multiple Vulkan models due to correct synchronization when recording to a Vulkan command buffer (#80959)
Mobile
- Moved saving storage to the last step. (#78024)
- Fixed build For Model Tracer (#84755)
- Skip TestNNAPI tests if QNNPACK is not supported (#82882)
- Extended LinearPackedParamsBase getstate/setstate deadline in `check_forward_backward_compatibility.py` Allowlist (#81135)
- Removed LinearPackedParamsBase getstate/setstate from `check_forward_backward_compatibility.py` Allowlist (#81048)
- Fixed `ao::sparse::BCSR` missing in qlinear serialize and deserialize when USE_FBGEMM and USE_PYTORCH_QNNPACK are not set (#81256)
- Updated `model_ops.yaml` (#82444)
- Fixed signed/unsigned compare for Metal (#86068)
- Re-added benchmarking files to ios TestApp (#85539)
Distributed
Distributed(c10d)
- Ensured tensors are contiguous for autograd-enabled `all_gather` (#79747)
- Fixed data race condition of `batch_isend_irecv` (#82450)
- Fixed `distributed_test.py` flakiness by turning off async_error_handling (#78797)
- Re-enabled `isinstance` with `torch.distributed.ReduceOp` (#87303)
DistributedDataParallel
- Enabled `AllReduceCommHook` to accept `intrusive_ptr` (#80975)
FullyShardedDataParallel
- Fixed `full_optim_state_dict()` hang (#80712)
- Fixed exec order validation for ignored modules across ranks (#79533)
- Cleaned prefixes when searching for params / buffers to ignore (#78278)
- Returned the original module when calling .module on an FSDP-wrapped model (#78671)
- Fixed a small bug in pre_backward_hook param prefetching (#78851)
- Fixed param name prefixes for ignored modules (#79955)
- Fixed FSDP when not all outputs get gradient in backward (#80245)
- Fixed MP config not being passed to FSDP (#80869)
- Fixed FSDP device_id when CPU offloading (#82892)
- Fixed FSDP when not all outputs are used in the loss (#83195)
- Fixed the FQN not found issue for load sharded_state_dict when using activation checkpoint (#84253)
- Fixed `pin_memory()` for CPU offloading (#85048)
- Fixed memory regression (#85087)
- Implemented a short-term fix to remove `optim_input` (#84201)
torch.distributed.elastic
- Ensured that exit code is propagated from Child to parent process (#81408)
torch.distributed.rpc
- Only initialize CUDA if there are devices specified in `init_rpc` (#80180)
- Fixed the wrong usage of `RRefContext::handleException` by adding a new API `RRefContext::handleExceptionSilent` (#83166)
- Changed to avoid initializing storage for empty Optionals (#78947)
Infra (RelEng)
- Made bazel changes to make “bazel query ...” work (#78870)
- Fixed C API to be compatible with latest Python 3.11 beta (Please note that 3.11 binaries are not fully functional) (#81242)
Performance
Python API
- Fixed use of temporary buffers for tensors in `torch.save` (#80404)
- Fixed and improved the efficiency of the backward for `torch.xlog{*}` functions (#82713)
- Vectorized `.copy()` acting between different dtypes on CPU (#80905)
- Vectorized `bfloat16` conversions on CPU (#80906)
Autograd
- Codegened autograd nodes are now smarter about which gradients to compute (#82544)
- Made the derivative of masked_fill more efficient (#83515)
- `torch.where` no longer materializes a zero-filled tensor in its backward (#83043)
torch.nn
- Speed up `nn.Module` constructor by not calling custom `setattr` (#77098)
- Speed up CPU `nn.BatchNorm` implementation by using `torch.zeros()` directly (#82558)
- Speed up `nn.Module.load_state_dict` (#85743)
BetterTransformer
- Added nn.Module activation support in BetterTransformer (#78394), in addition to functional support which is not available in TorchScript
- Added mask identifier for multiplexed src_mask/src_key_padding_mask in BT (#81947)
- Added a small fastpath test for native multi-head attention (#81432)
Composability
- Release GIL when doing shared memory copies on Tensors (#85389)
- Some micro-optimizations in `RecordFunction`, the core util used by the profiler (#76266)
- `c10::detail::ReplaceAll`: avoid some unnecessary allocations (#79915)
Dataloader
- Moved loop content into a function to ensure we don't preserve `Tensor` in the `pin_memory` thread (#83595)
LinAlg
- Simplified and optimized `linalg.solve` (#74046)
- Improved heuristics for `linalg.lu_solve` when B is a matrix (#79838)
- Small optimization of `linalg.cholesky` (#81316)
- Prefer contiguous output from mkldnn_bf16_gemm (#82968)
- CPUBlas: Use mkldnn optimized BFloat16 matmul for gemm (#65840)
- Updated and improved the heuristics for `linalg.lu_solve` (#73878)
- Optimized `linalg.householder_product` backward to be more memory-efficient (#84627)
Sparse
- Improved `to_sparse_bsr` for batched dense inputs (#83085)
- Improved `to_dense` for CSC (#79635)
- Improved `index_select` performance for COO input on CUDA (#77551)
- Improved `mul(COO, COO)` performance with broadcasting in dense dims (#83428, #85336)
JIT
- Improved coreml load time by loading cpu model first, while asynchronously loading a model (#80941)
- Improved `torch::jit::as_{module,object}` performance (#84399)
- Replaced `IValue::toString()->string()` with `IValue::toStringRef()` (#85437)
Quantization
- Allow contiguous inputs to run into `qcat_nhwc_stub` when dim is the last dimension (#72575)
- Enable qlinear dynamic parallelization with fbgemm (#84033)
CUDA
- Fixed perf regression introduced in #70943 (#78588)
- Improved small sort performance on CUDA (#79627)
- Use cub::BlockRadixSort to improve medium length sort performance (#79628)
- Increased size limit on calling CublasLt in addmm by 32x (#82922)
- Don't synchronize single element any/all reductions (#84465)
- Added col2im_batched kernel (#84543)
- Exposed fast get_current_stream (#78165)
- Pool cudaEvents in CUDACachingAllocator (#78279)
Intel
- Optimize the copy of BFloat16 to Float and Float to BFloat16 (#79685)
- Improve performance of ONEDNN backend (#84470)
- Optimize softmax backward and logsoftmax backward (#80114)
- Improve sort multi-core perf by adjusting grain_size w.r.t. dim_size (#74897)
- Add fast path of `qmean`/`qstd` for quantized CPU (#80579)
- Use direct memcpy in `qcat` when all the inputs and output share the same scale and zero_point (#71903)
- Vectorize scalar remainder in quantized kernel for normalization (#79673)
- Enhance add_out_dense_sparse_cpu for hybrid sparse tensor (#23057)
MPS
- Performance improvements for the MPS backend by changing commitAndWait to commit & fixing high memory consumption for View operations. Also improved scalar handling in MPS Allocator (#81951)
- Improved performance for MPS backend by reducing the number of command buffers created and hence CPU overhead. It uses commitAndContinue feature in MPS (#81338)
- Added direct MPS implementation for constant_pad_nd operation which improved performance as the generic implementation was heavily reliant on View ops which are slow (#82366)
- Removed checks that incur unnecessary syncs for MPS device with tensor.item() (#82505)
- Enabled Graph caching in MPS for torch random ops with Philox engine (#85833)
- Added specialized memory pool for scalar values in MPS which improved performance in torchbench networks (#85817)
- Improved memory usage and performance by utilizing garbage collector and adaptive commit feature in MPS (#86119)
Profiler
- Optimize getStepCallbacks for common case of no active callbacks for kineto (#77804)
- Use custom AppendOnlyList for op_events to reduce the number of atomic operations (#78643)
Vulkan
- When waiting on the result of a `VkFence`, busy polling is now used instead of a single call to `VkWaitForFences` with no timeout. This can improve benchmark performance by up to 50% by ensuring that the CPU stays at a high frequency when waiting for work on the GPU to complete (#81470)
Mobile
- Added compilation_preference & relax_f32_to_f16 APIs (#78758)
- Made flatbuffer loads faster if loading as mobile module. (#78998)
- Stream pkl (#79931)
- Used Apple's Accelerate framework for blas acceleration (#80449)
- Read via FileAdapter when loading files in torch if not flatbuffer for lite_interpreter (#84028, #84296)
Documentation
Python API
- Fixed `torch.as_array` documentation formatting (#78485)
- Fixed default value for `storage_offset` in `torch.as_strided` documentation (#78202)
- Removed warning in documentation that `torch.real` is only supported on complex types (#78644)
- Improved reproducibility documentation for the random number generator and `torch.use_deterministic_algorithms` (#78849)
- Fixed example in documentation for serialization (#79454)
- Fixed `torch.linspace` documentation for dtype (#81371)
- Fixed typo in documentation for `torch.distributions.Dirichlet` (#82062)
- Fixed example in `torch.dist` documentation (#82104)
- Updated `torch.narrow` documentation to reflect that `start` can be a Tensor (#85180)
- Added documentation for `pin_memory` and `layout` arguments to `torch.new_{zeros, ones, full}` (#85605)
- Added documentation for `pin_memory` argument to `torch.{rand, randn}` (#85219, #85221)
- Added argument default values to documentation for `torch.rot90` (#85610)
- Removed `out` argument from documentation for `torch.squeeze` (#85222)
- Fixed `torch.log` example (#78776)
- Fixed `torch.argmin` docs for `keepdim` argument (#78888)
- Updated examples in documentation for `torch.use_deterministic_algorithms` (#82003)
- Changed docstring type `callable` to `Callable` for consistency (#82487)
- Added documentation for `torch.narrow_copy` (#84493)
- Improved documentation for `torch.signbit` (#78349)
- Added doc string for `torch.library.Library.impl` (#81047)
- Renamed `_Typed/_UntypedStorage` to `Typed/UntypedStorage` and updated documentation for `torch.storage` (#82438)
- Added documentation for `torch.unflatten()` (#81399)
Autograd
- Improved autograd custom function docs (#81340)
- Added randomness case to the autograd notes (#78617)
Complex
- Added a note on CUDA 11.6 (#80363)
torch.nn
- Fixed docstring and image for `nn.LeakyReLU` (#78508, #79102), `nn.ELU` (#78909), `nn.GRU` (#79380), `nn.Hardswish` (#70993), `nn.GeLU` (#85790)
- Fixed docstring for `nn.CrossEntropyLoss` (#79568 and #82538), `nn.MultiMarginLoss` (#84513)
- Fixed high level `nn.init` module doc to reflect that all functions run with `torch.no_grad` (#80882)
- Fixed docstring for `nn.Module.state_dict` (#83104)
- Updated docstring for `scale_factor` in `nn.functional.interpolate` (#80807)
torch.nn.optim
- Fixed docstring for `optim.lr_scheduler.ChainedScheduler` (#79775)
- Fixed docstring for `optim.swa_utils.SWALR` (#79836)
Composability
Functorch
- Fixed the model description in the functorch ensembling notebook (#83603)
- Fixed indentation in functorch limitations docs (#85346)
- Updated functorch installation instructions (#85854)
- Fixed functorch whirlwind tour notebook to be runnable (#85974)
- Documented new installation instructions for functorch (#86823)
LinAlg
Sparse
- Updated `scatter_add_` documentation to fix argument name (#80223)
- Updated `torch.sparse` docs to better cover CSR/CSC/BSR/BSC (#82108)
- Added torch.sparse overview section (#85265)
- Updated documentation for `mm` family ops and `F.linear` to note limited sparse support (#86220)
torch.fx
- Fixed decomposition example (#79807)
- Added `__all__` to various submodules in torch.fx, distributions, distributed, package (#80367)
- Added warning about DCE in FX being unsound with mutation (#81818)
Quantization
- Replace `qconfig_dict` with `QConfigMapping` in docs (#78533) (a minimal sketch follows this list)
- Corrects typo in quantization docs (#81687)
- Additional fixes for `quantize_fx` docs (#84587)
- Add example for the error message for fixed qparam ops (#84666)
- Add types for scale and zero_point tensor for `torch.fake_quantize_per_channel_affine` docs (#85733)
- Updated quantization docs to show per channel support for `conv1d` (#81349)
- Add `torch.ao.nn.quantizeable` modules documentation (#79957)
- Add more detailed docs for `torch.ao.quantization.quantize_fx.{prepare_fx|prepare_qat_fx|convert_fx}` (#83132)
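For context on the `QConfigMapping` change referenced above, here is a minimal FX-graph-mode sketch (illustrative only; it assumes an x86 build where the `fbgemm` backend is available):

```python
import torch
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx

model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU()).eval()

# QConfigMapping replaces the old qconfig_dict argument in the FX workflow docs.
qconfig_mapping = QConfigMapping().set_global(get_default_qconfig("fbgemm"))
example_inputs = (torch.randn(1, 16),)

prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)           # calibration pass
quantized = convert_fx(prepared)
```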
ONNX
- Added a table of unsupported aten operators in the documentation (#84496)
CUDA
- Fixed jiterator doc format (#78471)
- Use generic amp autocast in example and specified dtype (#79579)
- Fixed small typo in cuda.rst (#84012)
- Added user facing documentation for CSAN (#84689)
- Fixed broken docstring for `set_float32_matmul_precision` (#78949)
MPS
Package
- PackageExporter does not have file_structure (#79948)
- Updated package.rst to not include hermetic claim (#81019)
- Fixed typos in `torch.package` documentation (#82994)
- Fixed typo in torch/package/_mock.py (#84508)
Distributed
Distributed(c10d)
- Fixed some links in torch/distributed/CONTRIBUTING.md (#79855)
- Updated dist.scatter() documentation (#86069)
- Fixed docstring of `scatter_object_list` (#84596)
- Fixed doc string in `reduce_scatter` (#84983)
DistributedDataParallel
- Corrected the DDP wrap example by removing pg in DDP wrap (#83034)
FullyShardedDataParallel
- Improved auto wrap policy doc (#78400)
- Corrected comments in FSDP for gradient averaging (#80456)
- Updated `ShardingStrategy` and `_free_full_params()` docs (#80894)
- Added mention of `optim_input` being removed after 1.13 in the BC breakage warning (#85963)
torch.distributed.rpc
- Updated distributed/CONTRIBUTING.md to remove ProcessGroupAgent references and add test instructions (#78625)
Infra (RelEng)
- Added some documentation about the stats uploading process for CI (#79504)
- Fixed release doc builds (#79865)
- Updated release.md with release candidate validation steps (#79889)
Developers
Autograd
- Added the ability to register a hook to grad_fn with `.register_prehook` (#83226)
Build
- Modified nccl_dependency to take dev mode (#79169)
- Moved pytorch buck targets to shared build (#79330)
- Added kineto and flatbuffers to OSS BUCK (#79860)
- Updated llvm deps for Buck build (#79919)
- Moved aten targets to shared buck file (#79966)
- Updated buck_setup.sh (#80467)
- Minor fix for shared build (#80739)
- Deleted CCACHE_DISABLE and SCCACHE_DISABLE from nccl.cmake file (#84007)
Composability
- `TorchDispatchMode` and `TorchFunctionMode` extension points have been added. They are similar to their `__torch_function__` and `__torch_dispatch__` counterparts, but can be used as context managers that intercept all torch operator calls, including factory functions. These APIs are still experimental and aren't quite user facing yet, and we will add more documentation as they are hardened. See this post for more details. (#78214, #78822, #78847, #84774, #83925, #79143, #77667, #80992, #80995, #80998, #82647, #83372) (a minimal sketch follows this list)
- A large amount of hardening to `FakeTensor` and `FakeTensorMode`, a `__torch_dispatch__`-style mode that allows you to run shape/dtype/device inference. This is similar to the “meta” device, but fake tensors also faithfully store device metadata, and the logic lives in Python. (#77969, #77972, #77971, #78516, #78090, #78836, #78895, #78536, #78677, #78522, #78523, #78972, #79170, #80115, #80193, #80544, #81739, #82281, #82574, #82066, #82449, #82337, #82571, #82593, #82172, #84387, #85065, #82846, #85658, #85759, #85920)
- Added some new tags and beefed up tags support for operators in the dispatcher:
- Add data_dependent_output tag (#83312)
- Add nondeterministic tags in tags.yaml and add the nondeterministic_seeded tag to all functions in native_functions.yaml defined as nondeterministic by alias_analysis.cpp (#81440)
- Allow specifying operator tags when registering an operator to the dispatcher (#79322)
- add `inplace_view` tag to `resize_()` (#82667)
- Make string serialization of C++ FunctionSchema consistent with torchgen.model.FunctionSchema (#77926)
- Added support for custom namespaces in `torchgen` (#78015, #79733, #81362, #81581)
- Generate kernels for codegen’d `out=` operators (#78626, #81437)
- Added a new alias dispatch key for functional to view op decompositions (#79615)
- Added an env var for dispatcher debug logging (#81846, #82277)
- Fixed printing of DispatchKey in operator not found message (#81637)
- Added test that all BackendComponents are covered by toString (#81713)
- Refactored functionality and backend keys to reduce duplication (#81752)
- Made factory functions `CompositeExplicitAutograd`, so they show up as primitives in `__torch_dispatch__` (#82470)
- Added an `OpOverload.decompose()` API, for running an operator's decomposition if one exists (#83075)
- Fixed our dispatcher schema parser when parsing tensor list alias annotations (#84005)
- Allowed subclasses of `c10::TensorImpl()` to override non-virtual tensor methods (#84806)
- Made pytorch headers consumable from C++20 code bases (#79985)
- Added meta device support to `_UntypedStorage` and `_TypedStorage` (#78008)
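As a minimal sketch of the mode-based extension points described at the top of this list (illustrative only; these APIs were still experimental in 1.13, so treat this as a sketch rather than a stable recipe):

```python
import torch
from torch.overrides import TorchFunctionMode

class LoggingMode(TorchFunctionMode):
    """Log every torch-level API call intercepted while the mode is active."""

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print("intercepted:", getattr(func, "__name__", func))
        return func(*args, **kwargs)

with LoggingMode():
    x = torch.randn(3)   # factory functions are intercepted too, unlike plain __torch_function__
    y = x + 1
```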
torch.fx
- Added debug statements for small ACC subgraphs elimination (#80117)
- Checked node type before fetching users (#80166)
- Detected ProxyTensor layering violations (#80994)
- Increased stack level for get_attr warning (#81041)
- Preserved a node’s stack trace (#82670, #83050, #83558, #83706, #83960)
- For quantization, removed `WEIGHT_INDEX_DICT` and `BIAS_INDEX_DICT` and replaced with `node_arg_is_weight` and `node_arg_is_bias` (#83263, #83848)
- Asserted that ProxyTensorMode does not accidentally bake in constants (#83297)
- Improvements to FX Minimizer (#83833)
- Ported matmul compositeimplicitautograd impl into core (#85239)
- OpInfo for Slice (#85554)
- Raised errors in fx.Interpreter with Node info (#85810)
Quantization
- Enabled support for quantized fill of nhwc tensors (#79025)
- Tests for code snippets in quantization docs (#79923)
- Eliminate Named tensor warnings in XNNPACK and QNNPACK (#77762)
- Added earlier termination and improved error message for calling `min` and `max` ops on per channel quantized tensors (#79036)
- Added warnings to quantized dynamic conv and linear ops when `reduce_range=true` (#79273)
- Add assertions to fix `torch::jit::load` bugs (#79192)
- Optionally clamp weights post quantization (#83438)
ONNX
- `onnx.verification`: tool to verify exported model discrepancy between sets of inputs (#78323)
- Symbolic function registration is now done via decorators (#84709)
- `g.op` methods now exposed via the GraphContext class (#84728)
- Initial version of diagnostics infrastructure (#85107)
- Add dtype check in onnx verification (#79263)
Intel
- Added native impl for group norm on quantized CPU for channels-last inputs (#70520)
- Added `qscheme` check for quantization observer (#80126)
- Added oneDNN graph fuser context API and unittest (#82491)
- Added eltwise OPs for NNC: `mish` and `elu` (#80586)
- Support BF16ImmPtr (#84041)
- Enabled fusion of conv with elementwise OP for NNC (#77157)
- Channels last propagation within NNC fusion group (#76948)
- Lowering function generates the output buffer with the specified stride for NNC (#76529)
- Simplified IfThenElse and CompareSelect within for-loop for NNC (#76793)
- Do not pull in autocast* ops into NNC (#85140)
MPS
- Improve MPS test by extending `test_no_warnings_on_input` by capturing any output (#79163)
- Add testcase in test_mps for circular mode in torch.pad (#81455)
- Fixed build warnings while building with MPS on Mac platforms (#83048)
- Add per-op MPS gradient tests and update skips for TestConsistency (#84242)
Profiler
- New event representation in profiler (#77693, #77694, #77695, #78163, #79173, #81965, #80797, #81319, #81320, #81321, #81322, #80822, #82993)
- Build call tree for profiled events (#77698, #80810)
- Copy rollbear/strong_type to `c10/util` (#78162)
- Collect Layout and expose TensorMetadata (#81155)
- Added support for storing scalar values in profiling (#81843)
- Added support for Device (#82787)
- Added SOFT_ASSERT to gracefully recover from invariant violations (#82689)
- Added support for accessing strides and scalars (#80072)
- Record nn.Module's parameters (#83209)
- Extend Python bindings (#83622)
- Capture storage data pointer (#84276)
- Cleaned up Tensor representation (#85161)
- Compute unique IDs for Tensors (#85162)
- set_class util (part 1 of Record Optimizer) (#84779)
- Tracking Optimizer (part 2 of Record Optimizer) (#84920)
- Optimizer param_groups (part 3 of Record Optimizer) (#85784)
- Optimizer states (part 4 of Record Optimizer) (#85840)
- Extend ID assignment to allocations and frees (#85719)
- Made `name` a property (#85720)
- Added dtype to `_TensorMetadata` (#85721)
- Updated python binding type annotations (#85722)
- Started moving python bindings out of autograd (#82584)
Vulkan
- Vulkan operators that use prepacking have switched from individual `OpContext` classes to `PackedContext` classes that inherit from a generic `VulkanOpContext` class, which should reduce boilerplate code when implementing new ops that require prepacking (#78814, #78815, #78816, #78817, #78818, #82730, #83526)
- Code under `ATen/native/vulkan/api` was essentially rewritten to improve code organization and readability. The refactor implements RAII patterns for the classes used to represent Vulkan handles to facilitate proper resource management and re-designed how the `Context` class functions in order to enable concurrent execution of multiple Vulkan models. The stack of PRs containing these refactors can be found at #80699
- Lint is now enforced in `ATen/native/vulkan` (#81390)
- The VulkanMemoryAllocator version used was upgraded to 3.0.1, which now lives under `third_party` (#81472, #83906, #83934)
- Shader layouts are now automatically generated based on the GLSL code (#81715, #81716)
Distributed
torch.distributed
- Added `__all__` to torch.distributed and tensorboard submodules (#80444)
- Added `__all__` to torch.{fx, distributed, backends} submodules (#85079)
- Added `__all__` to fx, distributed and cuda submodules (#85080)
- Added `__all__` to torch.distributed, futures, fx, nn, package, benchmark submodules (#80520)
- Added `__all__` to torch.distributed submodules (#80523)
- Eliminated code duplication in distributed rendezvous (#81577)
- Refactored distributed to use absolute header path (#85780)
torch.distributed.elastic
- Added `__all__` for torch.nn.modules, torch.distributed.elastic, torch.nn.utils submodules (#80240)
- Fixed macos public bindings failures (#80970)
Distributed(c10d)
- Logged full rank fingerprint mismatches in ProcessGroupWrapper (#79901)
- Added environment parse function that supports default value (#85563)
- Added host and port to TCPStore pyi definition (#84636)
- Added underlying_store property for PrefixStore (#84640)
- Enabled per-thread ProcessGroup for testing (#84153)
- Moved ProcessGroup::Work into a separate class (#83680)
- Install c10d headers with absolute path (#86257)
Infra (RelEng)
- Migrated off xenial gcc5.4 from merge rules (#78137)
- Added functionality for rebasebot to rebase onto viable/strict branch (#78276)
- Pinned protobuf version to 3.20.1 in docker CI build (#78369)
- Removed gcc5.4 from docker/build.sh (#78405)
- Removed gcc5.4 jobs from CircleCI config (#78555)
- Added merge rules for “pytorch distributed” module (#78751)
- Added onnx / test to required merge rules (#78790)
- Added userbenchmark support to TorchBench CI (#78794)
- Installed torchdynamo as part of most CI jobs (#79051)
- Removed linux-xenial-py3_7-clang7-asan from merge rules (#79088)
- Ran torchdynamo tests on PyTorch Linux CI (#79099)
- Centralized commit pins in a folder (#79150)
- Moved CUDA flags out of --per_file_copts into the cu_library macro (#79414)
- Moved CI to cuda-11.6 (#79921)
- Enabled pytest to run test_ops, test_ops_gradients, test_ops_jit in non linux cuda environments (#79898)
- Upgraded pytorch nightly docker python version to 3.8 (#80051)
- Updated Dockerfile to install cmake as part of conda install (#80258)
- Re-enabled vulkan test (#81368)
- Enhanced mergebot with the feature of posting the PR Comment on cancel (#82744)
- Changed nccl build to be single-threaded (#83173)
- Added process for maintaining Build + CI contributors list (#83869)
- Implemented mechanisms to block land checks if the PR hasn't been approved yet (#84239)
- Allowed External Scripts (e.g. vscode) To Discover and Execute unittest Tests (#85584)
- Updated the pinned torchdynamo hash to `6ead5cae0d1234aa64db06fe230ef56e12ec76fe` (#85683)
- Updated the pinned torchvision hash to `d7d90f56117ce0955332846a5f90b8d1346c4c09` (#85776)
- Modified all functions (except factory functions) to support SymInt and updated the xla hash to `f2b36df6a1a80137eff7644e6d0f4eeb7ff429d6` (#86078)