PyTorch 1.13 Release Notes
- Highlights
- Backwards Incompatible Changes
- New Features
- Improvements
- Performance
- Documentation
- Developers
Highlights
We are excited to announce the release of PyTorch 1.13! This includes stable versions of BetterTransformer. We deprecated CUDA 10.2 and 11.3 and completed migration to CUDA 11.6 and 11.7. Beta includes improved support for Apple M1 chips and functorch, a library that offers composable vmap (vectorization) and autodiff transforms, which is now included in-tree with the PyTorch release. This release is composed of over 3,749 commits from 467 contributors since 1.12.1. We sincerely thank our dedicated community for your contributions.
Summary:
- The BetterTransformer feature set supports fastpath execution for common Transformer models during inference out-of-the-box, without the need to modify the model. Additional improvements include accelerated add+matmul linear algebra kernels for sizes commonly used in Transformer models, and Nested Tensors are now enabled by default.
- Timely deprecating older CUDA versions allows us to proceed with introducing the latest CUDA versions as they are released by Nvidia®, and hence allows support for C++17 in PyTorch and new NVIDIA Open GPU Kernel Modules.
- Previously, functorch was released out-of-tree in a separate package. After installing PyTorch, a user will be able to import functorch and use functorch without needing to install another package.
- PyTorch is offering native builds for Apple® silicon machines that use Apple's new M1 chip as a beta feature, providing improved support across PyTorch's APIs.
You can check the blogpost that shows the new features here.
Backwards Incompatible changes
Python API
uint8 and all integer dtype masks are no longer allowed in Transformer (#87106)
Prior to 1.13, key_padding_mask
could be set to uint8 or other integer dtypes in TransformerEncoder
and MultiheadAttention
, which might generate unexpected results. In this release, these dtypes are not allowed for the mask anymore. Please convert them to torch.bool
before using.
1.12.1
>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.uint8)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
# works before 1.13
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)
1.13
>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.bool)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)
Updated torch.floor_divide
to perform floor division (#78411)
Prior to 1.13, torch.floor_divide
erroneously performed truncation division (i.e. truncated the quotients). In this release, it has been fixed to perform floor division. To replicate the old behavior, use torch.div
with rounding_mode='trunc'
.
1.12.1
>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -1.])
1.13
>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -2.])
# Old behavior can be replicated using torch.div with rounding_mode='trunc'
>>> torch.div(a, b, rounding_mode='trunc')
tensor([ 2., -1.])
Fixed torch.index_select
on CPU to error that index is out of bounds when the source
tensor is empty (#77881)
Prior to 1.13, torch.index_select
would return an appropriately sized tensor filled with random values on CPU if the source tensor was empty. In this release, we have fixed this bug so that it errors out. A consequence of this is that torch.nn.Embedding
which utilizes index_select
will error out rather than returning an empty tensor when embedding_dim=0
and input
contains indices which are out of bounds. The old behavior cannot be reproduced with torch.nn.Embedding
, however since an Embedding layer with embedding_dim=0
is a corner case this behavior is unlikely to be relied upon.
1.12.1
>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
tensor([], size=(1, 0), grad_fn=<EmbeddingBackward0>)
1.13
>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
RuntimeError: INDICES element is out of DATA bounds, id=4 axis_dim=3
Disallow overflows when tensors are constructed from scalars (#82329)
Prior to this PR, overflows during tensor construction from scalars would not throw an error. In 1.13, such cases will error.
1.12.1
>>> torch.tensor(1000, dtype=torch.int8)
tensor(-24, dtype=torch.int8)
1.13
>>> torch.tensor(1000, dtype=torch.int8)
RuntimeError: value cannot be converted to type int8 without overflow
Remove deprecated torch.eig
, torch.matrix_rank
, torch.lstsq
(#70982, #70981, #70980)
The deprecation cycle for the above functions has been completed and they have been removed in the 1.13 release.
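The torch.linalg equivalents cover these use cases. A rough migration sketch (note that return conventions differ slightly, e.g. torch.linalg.eig always returns complex eigenvalues, so downstream code may need adjusting):
A = torch.randn(3, 3)
B = torch.randn(3, 2)
# torch.eig(A, eigenvectors=True)  ->  torch.linalg.eig(A)
eigenvalues, eigenvectors = torch.linalg.eig(A)
# torch.matrix_rank(A)             ->  torch.linalg.matrix_rank(A)
rank = torch.linalg.matrix_rank(A)
# torch.lstsq(B, A)                ->  torch.linalg.lstsq(A, B); note the swapped argument order
solution = torch.linalg.lstsq(A, B).solution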
torch.nn
Enforce that the bias
has the same dtype as input
and weight
for convolutions on CPU (#83686)
To align with the implementation on other devices, the CPU implementation for convolutions was updated to enforce that the dtype
of the bias
matches the dtype
of the input
and weight
.
1.12.1
# input and weight are dtype torch.int64
# bias is torch.float32
>>> out = torch.nn.functional.conv2d(input, weight, bias, ...)
1.13
# input and weight are dtype torch.int64
# bias is torch.float32
>>> with assertRaisesError():
>>> out = torch.nn.functional.conv2d(input, weight, bias, ...)
# Updated code to avoid the error
>>> out = torch.nn.functional.conv2d(input, weight, bias.to(input.dtype), ...)
Autograd
Disallow setting the .data
of a tensor that requires_grad=True
with an integer tensor (#78436)
Setting the .data
of a tensor that requires_grad
with an integer tensor now raises an error.
1.12.1
>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
>>> x
tensor([0, 0], requires_grad=True)
1.13
>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: data set to a tensor that requires gradients must be floating point or complex dtype
Added variable_list support to ExtractVariables struct (#84583)
Prior to this change, a C++ custom autograd Function did not consider tensors passed in a TensorList to be tensors for the purposes of recording the backward graph. After this change, custom Functions that receive a TensorList must modify their backward functions to also compute gradients for these additional tensor inputs. Note that this behavior now differs from that of custom autograd Functions in Python.
1.12.1
struct MyFunction : public Function<MyFunction> {
static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
return 2 * tensors[0] + 3 * t;
}
static variable_list backward(
AutogradContext* ctx,
variable_list grad_output) {
return {3 * grad_output[0]};
}
};
1.13
struct MyFunction : public Function<MyFunction> {
static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
return 2 * tensors[0] + 3 * t;
}
static variable_list backward(
AutogradContext* ctx,
variable_list grad_output) {
return {3 * grad_output[0], 2 * grad_output[0]};
}
};
Don't detach when making views; force kernel to detach (#84893)
View operations registered as CompositeExplicitAutograd kernels are no longer allowed to return input tensors as-is. You must explicitly create a new tensor (e.g., using .alias()
).
1.12.1
torch::Tensor view_op(const torch::Tensor& self) {
return self;
}
1.13
torch::Tensor view_op(const torch::Tensor& self) {
return self.alias();
}
ONNX
torch.onnx.register_custom_op_symbolic
now only registers the symbolic function at the specified opset version (#85636)
This updates register_custom_op_symbolic
's behavior to only register the symbolic function at a single version. This is more aligned with the semantics of the API signature. Previously, the API registered a symbolic function to all versions up to the specified version. As a result of this change, users will need to register a symbolic function at the exact version when they want to override an existing symbolic function. Users are not affected if (1) an implementation does not exist for the op, or (2) the symbolic function is already registered at the exact version used for export.
1.12.1
# Assuming an implemented symbolic function `custom_op_function`
torch.onnx.register_custom_op_symbolic("aten::foo", custom_op_function, 16)
1.13
# Assuming an implemented symbolic function `custom_op_function`
for opset in range(1, 17):
torch.onnx.register_custom_op_symbolic("aten::foo", custom_op_function, opset)
Default ONNX opset is updated to 14 (#83284)
The default opset is updated regularly to stay in sync with ONNX releases. Users can specify opset_version
in torch.onnx.export
to maintain opset version 13.
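For example, to keep exporting at opset 13 (the model and inputs below are placeholders):
torch.onnx.export(model, example_inputs, "model.onnx", opset_version=13)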
torch.onnx.symbolic_registry
is removed (#84382)
We removed the symbolic_registry
module and hid it as an internal implementation detail. Users previously relying on the register_op
function to register custom symbolic functions should move to use the torch.onnx.register_custom_op_symbolic
API.
ScalarType
and global variables in torch.onnx.symbolic_helper
are removed (#82995)
The ScalarType
class in torch.onnx.symbolic_helper
, along with the global variables cast_pytorch_to_onnx
, pytorch_name_to_type
, scalar_name_to_pytorch
, scalar_type_to_onnx
and scalar_type_to_pytorch_type
are removed from the module. Users previously using these global variables for PyTorch JIT-ONNX type conversion in symbolic functions should move to use the torch.onnx.JitScalarType
class.
1.12.1
# 1
torch.onnx.symbolic_helper.scalar_type_to_onnx[
symbolic_helper.scalar_type_to_pytorch_type.index(x.dtype)
].value
# 2
torch.onnx.symbolic_helper.scalar_name_to_pytorch[element_type] in cast_pytorch_to_onnx.keys()
# 3
torch.onnx.symbolic_helper.cast_pytorch_to_onnx["Long"]
# 4
torch.onnx.symbolic_helper.cast_pytorch_to_onnx[tensor.type().scalarType()]
1.13
# 1
torch.onnx.JitScalarType.from_dtype(x.dtype).onnx_type()
# 2
torch.onnx.JitScalarType.from_name(element_type).onnx_compatible()
# 3
torch.onnx.TensorProtoDataType.INT64
# 4
torch.onnx.JitScalarType.from_name(tensor.type().scalarType()).onnx_type()
Distributed
In c10d collectives, input tensor dtypes must now be the same (#84664)
We added a check to validate that all input tensors have the same dtype. Previously, users were allowed to pass in tensors with different dtypes for c10d collectives. Now, passing in tensors with different dtypes will throw a RuntimeError with the following message: “Invalid usage of tensors with different dtypes Found torch.float
and torch.half
”. Users can use tensor.to(dtype={some_dtype})
to fix this.
1.12.1
# users could pass inputs having different dtypes
>>> tensor = torch.ones(2, 2) * 7
>>> tensor_h = tensor.half()
>>> tensor_list = [torch.zeros(2, 2) for _ in range(4)] # Assume world_size = 4
# Both cases work.
>>> dist.all_gather(tensor_list, tensor)
>>> dist.all_gather(tensor_list, tensor_h)
...
1.13
# all inputs of c10d collectives need to have the same dtype
>>> tensor = torch.ones(2, 2) * 7
>>> tensor_h = tensor.half()
>>> tensor_list = [torch.zeros(2, 2) for _ in range(4)] # Assume world_size = 4
# Only allow same dtype for all input tensors.
>>> dist.all_gather(tensor_list, tensor) # RuntimeError thrown
...
Users doing wildcard imports of torch.distributed.distributed_c10d will no longer get non-public symbols (#84872)
We limit the usage of c10d APIs to public APIs, so if a user does a wildcard import and calls an internal API, it will fail. Please see the example below:
1.12.1
# users could import both public and non-public symbols:
from torch.distributed.distributed_c10d import *
>>> is_nccl_available() # public API
>>> _check_single_tensor(...) # Non-public API
...
1.13
# users can only import public symbols
from torch.distributed.distributed_c10d import *
is_nccl_available() # public API
_check_single_tensor(...) # Non-public API, this will fail now
...
Process Group C++ extensions must use absolute path when importing ProcessGroup.hpp (#86257), ProcessGroup::Work object moved out of work to its own Work class (#83680):
Details of the changes and the updated tutorial can be found in the PyTorch tutorial PR #2099
1.12.1
// users use relative path to import C++ headers and Work resides in ProcessGroup class
#include <c10d/ProcessGroup.hpp>
#include <c10d/Store.hpp>
#include <c10d/Types.hpp>
#include <c10d/Utils.hpp>
...
class WorkDummy : public ProcessGroup::Work {
...
}
1.13
// users must use absolute path of import C++ files and Work is its own class
#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>
#include <torch/csrc/distributed/c10d/Store.hpp>
#include <torch/csrc/distributed/c10d/Types.hpp>
#include <torch/csrc/distributed/c10d/Utils.hpp>
...
#include <torch/csrc/distributed/c10d/Work.hpp>
class WorkDummy : public Work {
...
}
Quantization
Add required example_args
argument to prepare_fx
and prepare_qat_fx
(#249) (#77608)
We added an additional required example_inputs
argument to prepare_fx
and prepare_qat_fx
APIs, this can be used to do type inference to figure out the type information for each of the fx Node in the graph.
1.12.1
m = resnet18(...)
m = prepare_fx(m, qconfig_dict)
# or
m = prepare_qat_fx(m, qconfig_dict)
1.13
m = resnet18(...)
m = prepare_fx(m, qconfig_dict, example_inputs=(torch.randn(1, 3, 224, 224),))
# or
m = prepare_qat_fx(m, qconfig_dict, example_inputs=(torch.randn(1, 3, 224, 224),))
Stop moving models to CPU in quantization convert (#80555)
Previously, we automatically moved the model to CPU in torch.ao.quantization.fx.convert
to work around the issue where certain functions called by convert expect CPU arguments. This commit pushes this responsibility to the caller, since it is the user's decision which device to use.
1.12.1
model = resnet18(...)
model = prepare_fx(model, qconfig_mapping, example_inputs)
# calibrate
model = convert_fx(model)
1.13
model = resnet18(...)
model.cpu() # if needed
model = prepare_fx(model, qconfig_mapping, example_inputs)
# calibrate
model = convert_fx(model)
Replace the is_reference
flag of the torch.ao.quantize_fx.convert_fx
function with the convert_to_reference
function (#80091, #81326)
This PR removes the is_reference flag from the existing convert_fx
API and replaces it with a new convert_to_reference
function. This separates (1) converting the prepared model to a reference model from (2) lowering the reference model to a quantized model, enabling users to call their custom lowering function for
custom backends.
1.12.1
from torch.ao.quantization.quantize_fx import (
prepare_fx,
convert_to_reference,
)
prepared = prepare_fx(model, ...)
reference = convert_to_reference(prepared, ...)
1.13
from torch.ao.quantization.quantize_fx import (
prepare_fx,
convert_to_reference_fx,
)
prepared = prepare_fx(model, ...)
reference = convert_to_reference_fx(prepared, ...)
Add default configs for fixed qparams ops (#80184)
This commit adds qconfigs with special observers for fixed qparams ops (operators whose corresponding quantized version has fixed quantized parameters for output) like sigmoid in get_default_qconfig_mapping
and get_default_qat_qconfig_mapping
. For correctness, we also require users to use these special observers if we detect these fixed qparams ops in prepare.
1.12.1 (fails after this PR):
from torch.ao.quantization.quantize_fx import prepare_fx
model = ModelWithFixedQParamsOps()
qconfig_mapping = QConfigMapping()
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
1.13
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx
model = ModelWithFixedQParamsOps()
qconfig_mapping = get_default_qconfig_mapping()
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
Replace qconfig_dict
with a typed QConfigMapping
object (#78452, #79618)
Previously, FX graph mode quantization configurations were specified through a dictionary of qconfigs. However, this
API was not in line with other core APIs in PyTorch. This commit replaces this dictionary with a config object that users will
create and pass to prepare and convert. This leads to better type safety and better user experience in notebook settings
due to improved auto completion.
1.12.1 (deprecated)
from torch.ao.quantization.quantize_fx import prepare_fx
qconfig_dict = {
"": qconfig,
"object_type": [
(torch.nn.Linear, qconfig),
],
"module_name_regex": [
("foo.*bar", qconfig),
],
"module_name": [
("mod", qconfig),
],
}
prepare_fx(model, qconfig_dict)
1.13
from torch.ao.quantization import QConfigMapping
from torch.ao.quantization.quantize_fx import prepare_fx
qconfig_mapping = QConfigMapping()
.set_global(qconfig)
.set_object_type(torch.nn.Linear, qconfig)
.set_module_name_regex("foo.*bar", qconfig)
.set_module_name("mod", qconfig)
prepare_fx(model, qconfig_mapping)
Replace *custom_config_dict
with typed config objects (#79066)
This commit replaces the following config dicts with python objects:
- prepare_custom_config_dict → PrepareCustomConfig
- convert_custom_config_dict → ConvertCustomConfig
- fuse_custom_config_dict → FuseCustomConfig
This leads to better type safety and better user experience in
notebook settings due to improved auto completion.
1.12.1
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
prepare_custom_config_dict = {
"float_to_observed_custom_module_class": {
"static": {
FloatClass: ObservedClass
}
},
"non_traceable_module_name": ["mod1", "mod2"],
"non_traceable_module_class": [class1, class2],
"input_quantized_idxs": [0, 1],
"output_quantized_idxs": [0],
"preserved_attributes": ["attr1", "attr2"],
}
convert_custom_config_dict = {
"observed_to_quantized_custom_module_class": {
"static": {
FloatClass: ObservedClass
}
},
"preserved_attributes": ["attr1", "attr2"],
}
model = prepare_fx(
model,
qconfig_mapping,
example_inputs,
prepare_custom_config_dict=prepare_custom_config_dict)
model(data)
model = convert_fx(model, convert_custom_config_dict=convert_custom_config_dict)
1.13
from torch.ao.quantization.fx.custom_config import (
PrepareCustomConfig,
ConvertCustomConfig,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
prepare_custom_config = PrepareCustomConfig() \
.set_float_to_observed_mapping(float_class, observed_class) \
.set_non_traceable_module_names(["mod1", "mod2"]) \
.set_non_traceable_module_classes([class1, class2]) \
.set_input_quantized_indexes([0, 1]) \
.set_output_quantized_indexes([0]) \
.set_preserved_attributes(["attr1", "attr2"])
convert_custom_config = ConvertCustomConfig() \
.set_observed_to_quantized_mapping(observed_class, quantized_class) \
.set_preserved_attributes(["attr1", "attr2"])
model = prepare_fx(
model,
qconfig_mapping,
example_inputs,
prepare_custom_config=prepare_custom_config)
model(data)
model = convert_fx(model, convert_custom_config=convert_custom_config)
Remove remove_quant_dequant_pairs
and fix tests (#84203)
This PR removed some passes in convert_fx
, and also fixes the way we quantize layer_norm operator, so the qconfig
for layer_norm op needs to be updated as well.
1.12.1
import torch
from torch.ao.quantization.qconfig_mapping import QConfigMapping, QConfig
from torch.ao.quantization.observer import default_weight_observer
from torch.ao.quantization.backend_config import (
DTypeConfig,
ObservationType,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
qconfig = QConfig(activation=qconfig.activation, weight=default_weight_observer)
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.LayerNorm, qconfig) \
.set_object_type(torch.nn.functional.layer_norm, qconfig)
# assuming mymodel contains a LayerNorm layer or torch.nn.functional.layer_norm
m = MyModel()
example_inputs = (torch.rand(3, 3),)
m = prepare_fx(m, qconfig_mapping, example_inputs)
1.13
import torch
from torch.ao.quantization.qconfig_mapping import QConfigMapping, QConfig
from torch.ao.quantization.observer import default_placeholder_observer
from torch.ao.quantization.backend_config import (
DTypeConfig,
ObservationType,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
qconfig = QConfig(activation=qconfig.activation, weight=default_placeholder_observer)
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.LayerNorm, qconfig) \
.set_object_type(torch.nn.functional.layer_norm, qconfig)
# assuming mymodel contains a LayerNorm layer or torch.nn.functional.layer_norm
m = MyModel()
example_inputs = (torch.rand(3, 3),)
m = prepare_fx(m, qconfig_mapping, example_inputs)
Align observer dtype with reference model spec (#85345)
Before this PR, the dtype
attribute of observers was not clearly defined. It originally meant interface_dtype
in the eager mode workflow, which is how the codebase before this PR is using it. In the new reference model spec, dtype
attribute of an observer represents the dtype
value which needs to be passed into a quantize
function in the reference model spec. This PR aligns the codebase to this definition of dtype
.
1.12.1
dynamic_quant_observer = PlaceholderObserver.with_args(
dtype=torch.float, compute_dtype=torch.quint8)
1.13
dynamic_quant_observer = PlaceholderObserver.with_args(
dtype=torch.quint8, compute_dtype=torch.quint8)
Composability
Changed the backend C++ kernel representation for some operators that take in lists of tensors (#73350)
If an operator in ATen takes in a list of tensors, and is marked as “structured” in native_functions.yaml (example), then previously, TensorList was represented as at::TensorList
, or c10::ArrayRef<at::Tensor>
. Now, it is represented as a more efficient type: const ITensorListRef&
.
1.12.1
at::Tensor cat_kernel(at::TensorList tensors, int64_t dim) {
...
}
TORCH_LIBRARY_IMPL(aten, dispatch_key, m) {
...
m.impl("cat", &cat_kernel);
}
1.13
at::Tensor cat_kernel(const at::ITensorListRef& tensors, int64_t dim) {
...
}
TORCH_LIBRARY_IMPL(aten, dispatch_key, m) {
...
m.impl("cat", &cat_kernel);
}
C++ API
Lowered randint default dtype to the C++ API (#81410)
Prior to 1.13, the default for the dtype
argument of torch.randint
, torch.long
, was set via manual python binding. However, in the C++ API, torch::randint
would default to the global default data type, which is usually float
. In 1.13 we changed the default for dtype
in the C++ API to int64
in order to match the python API. To reproduce the old behavior, one can set the dtype
argument.
1.12.1
torch::randint(/*low=*/0, /*high=*/10, {2, 3});
1.13
// assuming default dtype is float
torch::randint(/*low=*/0, /*high=*/10, {2, 3}, torch::kFloat);
Enabled dim=None
for torch.{std, var, std_mean, var_mean}
(#81845, #82765, #82912)
Prior to 1.13, a C++ API call that has argument types torch::{std, var, std_mean, var_mean}(Tensor, OptionalIntArrayRef, int64_t, bool)
used to resolve to the {std, var, std_mean, var_mean}.correction
overload. In this release, it resolves to the {std, var, std_mean, var_mean}.dim
overload. With the .correction
overload, the third argument of type int64_t
could be used to pass a correction δN other than 1. In order to call the {std, var, std_mean, var_mean}.correction
overload in 1.13, the old int64_t
argument can be wrapped in a c10::optional
.
1.12.1
// using std as an example
int64_t correction = 2;
torch::std(t, /*dim=*/dim, /*correction=*/correction, /*keepdim=*/True);
1.13
// To replicate in 1.13 using std as an example
auto correction = c10::make_optional<int64_t>(2);
torch::std(t, /*dim=*/dim, /*correction=*/correction, /*keepdim=*/True);
Deprecations
Distributed
We are deprecating the following APIs of c10d: *_coalesced
APIs (#85959), *_multigpu
APIs (#85961) and ProcessGroupRoundRobin
(#85158)
We added warnings when users call c10d’s *_coalesced
, *_multigpu
and ProcessGroupRoundRobin
APIs. Previously, users could use these APIs without any warnings, but now they will see warnings like “torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions”. There are still workarounds for *_coalesced
APIs but no workarounds will be provided for the other two.
1.12.1
# users could use the following APIs with no warnings:
all_reduce_coalesced(...)
all_gather_coalesced(...)
broadcast_multigpu(...)
all_reduce_multigpu(...)
reduce_multigpu(...)
all_gather_multigpu(...)
reduce_scatter_multigpu(...)
...
1.13
# users can still use these APIs but it will come with warnings:
all_reduce_coalesced(...)
# Warnings:
# torch.distributed.all_reduce_coalesced will be deprecated. If you must
# use it, please revisit our documentation later at
# https://pytorch.org/docs/master/distributed.html#collective-functions"
# Potential workaround:
reqs = []
with dist._coalescing_manager(group, reqs):
reqs.append(dist.all_reduce(tensor1, async_op=True))
reqs.append(dist.all_reduce(tensor2, async_op=True))
for req in reqs:
req.wait()
...
We are deprecating passing optim_input
into the FSDP optimizer state checkpointing APIs. The user can simply not pass the optim_input
argument, and all behavior is preserved. No fix is needed on the user's side for now.
1.12.1
# the user can use the following APIs with no warnings
full_optim_state_dict(...)
sharded_optim_state_dict(...)
shard_full_optim_state_dict(...)
flatten_sharded_optim_state_dict(...)
scatter_full_optim_state_dict(...)
rekey_optim_state_dict(...)
1.13
# users can still use these APIs, but they will come with warnings
# The `optim_input` argument is deprecated and will be removed after PyTorch 1.13.
# You may remove it from your code without changing its functionality.
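A minimal sketch of the change for full_optim_state_dict, assuming model is an FSDP-wrapped module and optim is its optimizer; the other APIs listed above follow the same pattern:
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
# 1.12.1 style: passing optim_input (now emits a deprecation warning)
osd = FSDP.full_optim_state_dict(model, optim, optim_input=list(model.parameters()))
# 1.13 style: simply omit optim_input; behavior is unchanged
osd = FSDP.full_optim_state_dict(model, optim)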
LinAlg
Deprecate torch.lu in favor of linalg.lu_factor (#77636)
The new operation has a cleaner API and better docs. The update rule is as follows:
1.12.1
LU2, pivots2, info = torch.lu(A, compute_pivots, get_infos=True)
LU1, pivots1 = torch.lu(A, compute_pivots)
1.13
LU2, pivots2, info = torch.linalg.lu_factor_ex(A, compute_pivots)
LU1, pivots1 = torch.linalg.lu_factor(A, compute_pivots)
Deprecate torch.lu_solve in favor of linalg.lu_solve (#77637)
The new operation has a notation consistent with linalg.solve
, and has an extra parameter adjoint=False
. The update rule is as follows:
1.12.1
X = torch.lu_solve(B, LU, pivots)
1.13
X = linalg.lu_solve(LU, pivots, B)
ONNX
Monkey patched convenience method on torch._C.Graph
, torch._C.Block
and torch._C.Node
are deprecated. (#83006)
Deprecated methods include Graph.op()
, Graph.constant()
, Graph.at()
, Block.op()
, and Node.__getitem__()
. Previously, these methods were patched into the classes above when users called torch.onnx.export()
and were typically used in custom symbolic functions. Users can continue to expect g.op()
and g.at()
in symbolic functions to work. The g
parameter has been substituted by the GraphContext
object (#84728). The methods are now exposed by the GraphContext
class with APIs unchanged. Users should not rely on the Graph.op()
, Graph.constant()
, Graph.at()
, Block.op()
, Node.__getitem__()
methods when they are directly interacting with the C classes. Users should use only the op()
and at()
methods of the GraphContext
object, as other fields in the class will change in future releases.
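For reference, a typical custom symbolic function is unaffected as long as it sticks to g.op(): the g argument is now a GraphContext rather than a raw torch._C.Graph, but the call pattern is the same. The override below is purely illustrative:
def relu_symbolic(g, self):
    # g is a GraphContext in 1.13; g.op() behaves as before
    return g.op("Relu", self)
torch.onnx.register_custom_op_symbolic("aten::relu", relu_symbolic, 14)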
New features
Python API
- Added a deterministic implementation of scatter_add on CUDA for all input sizes (#79466)
- Added torch.concatenate that aliases torch.cat (#85073)
- Added Tensor.is_cpu() that returns whether a tensor is on CPU (#78887)
- Added a force kwarg to Tensor.numpy() that enables returning a numpy ndarray that does not share storage with the tensor (#78564)
- Added torch.special.{airy_ai, bessel_j0, bessel_j1, bessel_y0, bessel_y1, modified_bessel_i0, modified_bessel_i1, modified_bessel_k0, modified_bessel_k1, scaled_modified_bessel_k0, scaled_modified_bessel_k1, spherical_bessel_j0} (#78900), (#78901), (#78902), (#78912), (#78451)
- Added torch.special.{chebyshev_polynomial_t, chebyshev_polynomial_u, chebyshev_polynomial_v, chebyshev_polynomial_w, hermite_polynomial_h, hermite_polynomial_he, laguerre_polynomial_l, legendre_polynomial_p, shifted_chebyshev_polynomial_t, shifted_chebyshev_polynomial_u, shifted_chebyshev_polynomial_v, shifted_chebyshev_polynomial_w} (#78196), (#78293), (#78304), (#78366), (#78352), (#78357)
- Added a weights_only option to torch.load that restricts load to state_dict only, enabling safe loading. This can also be set using the TORCH_FORCE_WEIGHTS_ONLY_LOAD environment variable (#86812)
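A few of the items above in use (a quick sketch; the file path is a placeholder):
>>> t = torch.arange(4.)
>>> t.is_cpu
True
>>> torch.concatenate([t, t]).shape
torch.Size([8])
>>> torch.save({"w": t}, "weights.pt")
>>> torch.load("weights.pt", weights_only=True)  # only tensors and primitive containers are unpickled
{'w': tensor([0., 1., 2., 3.])}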
Build
- Added
-Werror=unused-but-set-variable
build flag (#79305) - Added ability to get release versions based on the current tag (#78584)
- Added
-Werror=type-limits
in Bazel CPU build (#79139) - Added
-Werror=unused-variable
in Bazel CPU build (#79156) - Added --config=shell to bazelrc file for easier debugging (#79350)
- Added clang
-Wconstant-conversion
to catch errors detected in #75400 (#80461) - Added
-Werror=non-virtual-dtor
build flag (#81012) - Turned on pocketfft flag for third-party pocket_fft library (#81670)
- Updated NCCL to v2.13.4-1 (#82775)
- Added
-Wunused-local-typedef
build flag (#86154) - Increased max python version to include 3.10 (#84815)
Complex
- Added complex half support for:
- [CPU]
torch.{index_select, index_add}
(#79217), (#79897). - [CUDA]
torch.roll
(#79970),torch.fft.{fftshift, ifftshift}
(#79970),torch.{acos, acosh, asinh, atanh}
, (#80030),torch.{cos, sinh, cosh, tanh}
(#78718),torch.sqrt, rsqrt
(#77490),torch.{triu, tril, diag, trace}
(#78062). - [CPU and CUDA]
torch.where
(#78665),torch.{where, pow, masked_fill, sgn, tan, angle}
(#78665)
- [CPU]
- Added complex support for
torch.nn.ConvTranspose1d
(#79694).
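For instance, the new complex support for nn.ConvTranspose1d can be exercised as follows (a minimal sketch):
m = torch.nn.ConvTranspose1d(2, 3, kernel_size=3, dtype=torch.cfloat)
x = torch.randn(1, 2, 8, dtype=torch.cfloat)
y = m(x)  # complex-valued output of shape (1, 3, 10)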
torch.nn
- Added
pop
function tonn.Sequential
andnn.ModuleList
(#81601) - Added deepcopy support for parametrized
nn.Module
(#80811)
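A minimal sketch of the new Sequential.pop and deepcopy support for parametrized modules:
import copy
from torch.nn.utils import parametrizations
seq = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU(), torch.nn.Linear(4, 2))
removed = seq.pop(2)                          # removes and returns the last Linear layer
layer = parametrizations.orthogonal(torch.nn.Linear(4, 4))
layer_copy = copy.deepcopy(layer)             # deepcopy of a parametrized module now works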
torch.nn.optim
- Added maximization support via the
maximize
kwarg foroptim.SparseAdam
(#80336),optim.ASGD
(#81875),optim.Rprop
(#81864),optim.RMSprop
(#80326) - Added support for differentiable optimizers via the
differentiable
kwargoptim.SGD
(#80938),optim.Adam
(#82205),optim.RMSprop
(#83578) - Added support for complex number for
optim.Adam
(#80279),optim.AdamW
(#80280),optim.Adamax
(#80319),optim.RMSprop
(#83860),optim.Rprop
(#83858), - Handled complex params as independent real params in
optim.{RMSprop, ASGD}
(#83860), (#84472) - Added
optim.lr_scheduler.PolynomialLR
(#82769)
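For example, combining a couple of the items above (the tiny parameter is just for illustration):
w = torch.nn.Parameter(torch.randn(3))
opt = torch.optim.RMSprop([w], lr=0.1, maximize=True)  # performs gradient ascent on the objective
sched = torch.optim.lr_scheduler.PolynomialLR(opt, total_iters=5, power=1.0)
for _ in range(5):
    opt.zero_grad()
    w.pow(2).sum().backward()
    opt.step()
    sched.step()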
BetterTransformer
- Allowed user to assert no mask contiguous check is necessary (#82533)
- Added support for norm_first in nn.TransformerEncoderLayer fast path (#78269)
- Added custom scaled dot product implementations for dense tensors (#85984)
- Added Better Transformer fastpath diagnostics (#81013)
ForEach
- Implemented inplace
foreach
maximum
andminimum
(#82523)
LinAlg
- Added
linalg.lu_solve
,linalg.solve_ex
,linalg.vecdot
,linalg.vander
(#77634, #80073, #70542, #76303)
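A quick sketch of a few of the new functions:
A = torch.randn(3, 3)
b = torch.randn(3)
x, info = torch.linalg.solve_ex(A, b)  # like linalg.solve, but also returns an info tensor
d = torch.linalg.vecdot(torch.randn(4, 3), torch.randn(4, 3))  # batched dot product over the last dim
V = torch.linalg.vander(torch.tensor([1., 2., 3.]))  # Vandermonde matrix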
Sparse
- Added
torch.sparse.spdiags
for easier creation of diagonal sparse matrices (#78439)
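For example (a minimal sketch):
d = torch.tensor([[1., 2., 3.]])
S = torch.sparse.spdiags(d, torch.tensor([0]), (3, 3))  # 3x3 sparse matrix with [1, 2, 3] on the main diagonal
S.to_dense()
# tensor([[1., 0., 0.],
#         [0., 2., 0.],
#         [0., 0., 3.]])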
torch.fx
- Enabled symbolic shapes (#82063, #82317, #82209, #83380, #85808, #84113, #84829, #84918, #85185, #85261, #85260, #85754, #85768, #86050, #86098, #86067)
- Created an improved version of subgraph matcher (#82090, #82853, #85444, #85456, #85617)
- Rewrite subgraph_rewriter with subgraph_matcher (#83717)
- Added PassBase for writing passes, PassResult for the return value of passes, and a PassManager for managing the workflow of passes (#79878, #81366, #80531, #82485, #83933, #84094, #84425, #84232)
- Added an FX graph partitioner and fuser (#79439, #80292)
- Added a reinplacing FX pass (#80897, #83626, #83845, #83846)
- Added a CSE pass to the common passes (#81512, #81530, #81742)
- Created DecompositionInterpreter for decomposing aten → prims after an initial make_fx call (#79989)
- Created a Backend for NvFuser based graph partitioner + Prims (#80591, #81311, #81436, #81911)
- Created a Backend for Cudagraphs from dynamo (#80566)
- Created a type constraint generator to Z3 (#79912, #80084, #80095, #80102, #80110, #80147, #80744, #80799, #80823, #80847, #80909, #80925, #80976, #81159, #81175, #81189, #81190, #81265, #81274, #81344, #81360, #81376, #81445, #81516, #81527, #81714, #82163, #82590, #82597, #82614, #82742, #82856, #82923,#82938,#83087, #83109, #83194, #83334, #83682, #83945)
JIT
- Added new NVFuser Python Frontend Record Keeping for Cache enablement. (#81578)
- Added
torch.ops.nvprims
namespace for nvFuser-specific prims (#82155) - Enabled fusion of conv with elementwise OP in NNC (#77157)
- Added symbolic shape functions for
conv_transpose2d.input, convolution, convolution_backward
(#77283, #83557, #80860) - Added support in symbolic shapes for generalized lists of tensor shapes, tuple outputs, optional None, upper and lower bounds (#77389, #83092, #83222, #78679)
- Added support for
aten::_convolution
when it is 2D conv in NNC (#84038) - Exposed
ProcessGroup::Work.wait()
API to TorchScript (#83303)
ONNX
- Inlined
prim::PythonOp
for Autograd Function Export (#74765)
AMD
- Enabled nvfuser (#82498)
CUDA
- Added CUDA trace Python hooks (#82824)
- Added CUDA Sanitizer (#83984)
- Added support for multiple outputs in python jiterator (#77921, #78139)
Intel
- Added a launch script with Best Recipe of Deep Learning on Intel Xeon CPU (#63932)
- Enabled Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (#63289)
- Added unified x86 quantization backend (#84329)
MPS
- Added
aten::index_add.out
operator for MPS backend (#79935) - Added
aten::prelu operator
for MPS backend (#82401) - Added
aten::bitwise-not
operator native support for MPS backend (#83678) - Added
aten::tensor::index_put
operator for MPS backend (#85672) - Added
aten::upsample_nearest1d
operator for MPS backend (#81303) - Added
aten::bitwise_{and|or|xor}
operators for MPS backend (#82307) - Added
aten::index.Tensor_out
operator for MPS backend (#82507) - Added
aten::masked_select
operator for MPS backend (#85818) - Added
aten::multinomial
operator for MPS backend (#80760)
Profiler
- Integrated Execution Graph Observer into PyTorch Profiler (#75358, #79753, #82895, #84285)
- TorchTidy: experimental tool to identify anti-patterns from traces (#79631, #79874, #79993, #80094, #80108, #80572, #81056, #81273, #81501, #81733, #81740, #81921, #82421, #82248, #82261, #82782)
- Added reporting for OOM events to the Pytorch Profiler. (#80050)
Vulkan
- Added Vulkan support for the following operators:
- Prototype implementations for Quantized Tensors were added (#81491). These implementations still need to be exposed to Torchscript, but so far prototype implementations for the following ops have been added:
Mobile
- Added support for dtypes and custom classes in model tracer (#84795)
- Extended Flatbuffer to get mobile_info for NMLML workflows (#78306)
- Added serialization/deserialization of Sparse Quantize Linear Packed Params (#80474)
- Added qnnpack bcsr matrix unpacking and use unpacking in Linear module (#80475)
- Added OwnedOrBorrowedVector for QNNPack BCSR Indices/Values (#80476)
Distributed
Distributed Checkpointing
(Prototyping)
- This is a prototyping effort which enables loading and saving PyTorch models from one or more hosts. Models can use features such as DDP, FSDP and ShardedTensor and they can have a different configuration between saving and loading - for example, save from 4 hosts and load from a single host. Distributed checkpointing has an extensibility API that enables full control of how a model is saved; and a pluggable IO backend. (#83781, #83419, #84952, #84881)
Distributed(c10d)
- Made c10d collective ops dispatcher passable. It allows tracing mechanisms such as LazyTensor and AOTAutograd to observe communications, e.g., : broadcast(#76722), allreduce(#79582), allgather (#79669), reduce_scatter (#79683), reduce (#79686), gather (#79687), scatter (#79688), alltoall (#79691), barrier (#79777), send/recv (#79779).
- Added UCC process group (#79918)
- Enabled uneven input support for
all_gather
(#83713) and uneven output support forreduce_scatter
(#87010) - Added NCCL PreMul Sum to c10d
ReduceOp
(#84243)
DistributedDataParallel
- Made DDP work with Python process group (#79176)
- Enabled Zero1's ddp_with_overlap for hpu backend (#80438)
FullyShardedDataParallel
- Added forward prefetching option in FSDP API (#85177)
- Added fp16 and bf16 hooks for FSDP (#81711)
- Implemented
sharded_optim_state_dict
andflatten_sharded_optim_state_dict
- Added rate limiter (#83917). Thanks to the IBM Research team, @lchu-ibm for his contributions to FSDP, and @hfwen0502 for the experimental testbed that identified the issues.
- Added an option to keep grads in lower prec (#85223)
torch.distributed.elastic
- Added watchdog to TorchElastic agent and trainers (#84081)
Activation Memory Management
(Prototyping)
- We offer a new API,
torch.distributed.algorithms.checkpoint.checkpoint_wrapper
to wrapnn.Modules
with activation checkpointing or activation offloading to easily use and experiment with activation checkpoint techniques without modifying model code. This makes it simpler to leverage activation checkpointing to reduce memory footprint of your training applications and train larger models. (#83035, #78704, #78854, #79830, #80089, #84907, #84908, #85448, #85449)
Infra (RelEng)
- Enabled multigpu unittests on FSDP (#77947)
- Added feature to do rebase (via comment) onto any branch (#78772)
- Added implementation to allow PR collaborators to revert their PRs (#82360)
- Added torchvision onto the commit pins file (#79151)
- Turned on
-Werror=all
with a few exceptions in Bazel build for CUDA (#79306) - Prepared for running PyTorch tests with TorchDynamo and skips for known failing tests (#80106)
- Added ROCm build to pull request jobs (#80149)
- Added dynamo test configuration (#80342)
- Enabled ROCm CI for trunk test (#80920)
- Added linux cuda 11.7 workflows (#81089)
- Updated CI docker images and jobs to ROCm5.2 (#81168)
- Added UCC PG build in CI (#81583)
- Enabled periodic builds for CUDA 11.7 (#81688)
- Enabled distributed tests for ROCm (#81751)
- Added New TORCH_UCC_BLOCKING_WAIT env variable (#81791)
- Change functorch pin mechanism to test functorch in pytorch/pytorch now that functorch is inside pytorch/pytorch (#81918)
- Added Python 3.11 nightlies for Linux PyPi (Please note that 3.11 binaries are not fully functional) (#82302)
- Updated ROCm nightly builds to rocm5.2 (#82353)
- Add functorch target to cmake (#83464)
- Upgraded CUDNN version for cuda 11.7 (#84964)
- Enabled pytest-shard for functorch (#85321)
- Enabled CI to run test_ops in parallel (#85528)
- Updated trunk CUDA-10.2 to CUDA-11.7 (#85943)
- Added support for building and running Metal tests in CI (#86073)
- Bumped nvidia docker version and using python 3.10 for cuda11.7 (#82472)
Improvements
Python API
- Added
float16
support fortorch.{arange, linspace}
(#80492) - Added integer support to
torch.index_reduce
(#80464) - Added a
stable
kwarg totorch.argsort
that controls the relative order of equivalent elements (#75162) - Improved stability of
torch.distributions.kl_divergence
for two Bernoulli distributions (#79944) - Improved type annotations for
torch.{as_tensor, as_subclass}
(#86105) - Added type promotion support for
torch.{addcmul, addcdiv}
(#74234) - Added
bfloat16
support fortorch.save
with XLA/HPU tensors (#77534) - Improved wrapper subclass detection for serialization (#81105)
- Updated python API
TensorOption
signatures for consistency with JIT schemas (#82241) - Allowed disabling of
torch.library.Library
with PYTORCH_DISABLE_LIBRARY (#85190) - Enabled
dim=None
fortorch.{mean, sum, nanmean, nansum}
(#81286), (#79881), (#82912) - Added feature to enable registration of extension device modules as a native module under the torch namespace (#78329)
- Added
logsumexp
toamp.autocast
(#76330)
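Two of the items above in action: the stable flag for torch.argsort and dim=None reductions:
>>> x = torch.tensor([2, 1, 2, 1])
>>> torch.argsort(x, stable=True)  # equal elements keep their original order
tensor([1, 3, 0, 2])
>>> torch.sum(torch.ones(2, 3), dim=None)  # dim=None now reduces over all dimensions
tensor(6.)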
C++ API
- Allowed
const T&
access toListElementReference
(#83177) - Redirected print messages to
stderr
intorch.utils.cpp_extension
(#82097) - Updated CUDA compiler matrix in
torch.utils.cpp_extension
(#82860) - Added
__all__
totorch.utils.cpp_extension
,torch.utils.hooks
andtorch.utils.show_pickle
(#85331)
Autograd
- Added forward AD coverage for
torch.{amin, amax, nansum, nanmean}
(#80082),torch.scatter_reduce
(exceptreduction=prod
) (#85000),torch.linalg.det
(#79487),torch.{elu_, celu_, selu_}
(#83080) - Added forward-over-reverse AD coverage for
nn.functional.{binary_cross_entropy}
(#77852) ,nn.functional.{embedding}
(#79699),nn.functional.{mse_loss, softplus, l1_loss, smooth_l1_loss, prelu, hardswish}
(#78740),nn.functional.{nll_loss, batch_norm, layer_norm, group_norm, cross_entropy, soft_min}
(#84976)torch.
{log_softmax, softmax}
(#84976),torch.amin, amax, nansum
(#80082) - Added support a stable double backward on
torch.linalg.det
for real inputs (#80217) - Added support for kwargs input to function when
torch.utils.checkpoint
withuse_reentrant=False
(#80987) - Added context manager to disable saved tensor hooks:
torch.autograd.graph.disable_saved_tensors_hooks
(#85971) - Added new cpp custom function API to inform the backward function whether a gradient is necessary to compute:
ctx->needs_input_grad(idx)
(#82544) - Added all device types in the pybinded DeviceType enum (#83676)
- Added
check_nan
flag totorch.autograd.detect_anomaly
which enables users to run anomaly mode without nan checking (#83481)
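A short sketch of the new anomaly-mode flag and the saved-tensor-hooks context manager:
x = torch.randn(3, requires_grad=True)
# Run anomaly mode without the extra NaN checks
with torch.autograd.detect_anomaly(check_nan=False):
    (x * 2).sum().backward()
# Disallow saved-tensor hooks in a region; registering one here raises with this message
with torch.autograd.graph.disable_saved_tensors_hooks("saved-tensor hooks are disabled in this region"):
    y = x * x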
Build
- Specify "Generic" BLAS library name to ensure PyTorch can find the BLAS llibrary (#74269)
- Generate CUDAConfig.h only for CUDA builds (#78218)
- Moved build_variables.bzl and ufunc_defs.bzl from pytorch-root/tools/ to PyTorch root directory (#78542)
- Made lintrunner compatible with M1 (#78628)
- BLAS library is linked privately instead of being linked publicly (#78883)
- Updated build targets to include generated enum_tag.cpp (#79668)
- Use miopen_LIBRARIES and rccl_LIBRARIES directly, when they are valid target for RCCL (#80446)
- Deleted Win specific case for CMake older than 3.1 (#81411)
- Split
.cu
to improve compile times (#81193) - Added
append_cxx_flag_if_supported
macro (#82883)
torch.nn
- Improved
groups
argument validation fornn.Conv{1,2,3}d
modules (#77919) - Improved error message for convolution backward fallback kernel (#81538)
- Reduced memory usage of
nn.Module
full backward hooks by removing reference cycles (#80139) - Improved
kl_div
at boundary and its general implementation (#80334) - Improved input shape validation for MKL-backed convolution operations (#76526)
- Improved input validation for
nn.AdaptiveAvgPool2d
(#84061) - Improved
groups
argument validation fornn.Conv{1,2,3}d
(#85248) - Improved input index validation for
nn.MaxUnpool{2,3}d
(#78280) - Improved listing of public APIs for
optim
andnn
(#80237) - Added new operator for
nn.Sequential
:+
(#81170),extend
(#81179),insert
(#81402),+=
,*
and*=
- Added deepcopy support for uninitialized parameters (#83809)
- Added nondeterministic alert for
nn.MaxUnpool
{1,2,3}d
(#84766) - Added Bfloat16 support for the backward pass of
nn.functional.kl_div
on CUDA (#77676)
torch.nn.optim
- Added support for optimizers with more than 2 betas for LRScheduler (#84486)
- Added
fused
kwarg tooptim.Adam
to enable a fused implementation on CUDA (#85739)
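For example, the fused implementation can be selected as follows (requires floating-point CUDA parameters):
w = torch.nn.Parameter(torch.randn(10, device="cuda"))
opt = torch.optim.Adam([w], lr=1e-3, fused=True)  # fused CUDA kernel instead of the for-loop/foreach paths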
Composability
- Significant hardening and improvements to the
functionalize()
API that lives with functorch (#77129, #77126, #77125, #78199, #77132, #77713, #77714, #78819, #78820, #82008, #82009, #81702, #80416, #80418, #80251, #80526, #82326, #81454, #81471, #83542, #83701, #85975) - Allow
__torch_dispatch__
subclasses and modes to override more tensor metadata: device/size/stride/dim (#77684, #77970, #78646, #78691) - Improvements to the
torch.library
API, for registering python functions to the pytorch dispatcher: - Ported
cholesky
,linalg_qr
,linalg_eigh
andlinalg_eighvalsh
to structured kernels, giving them support with meta tensors (#79300, #79054, #79072) - Added python decompositions for many torch operators. This adds meta tensor coverage for a large number of pytorch operators (#77930, #79768, #79808, #84062, #84350, #80219, #78350, #79667, #81003, #81420, #81113, #81241, #81765, #82284, #80497, #80358, #80182, #80737, #81734, #81826, #78461, #78468, #78525, #78914, #78919, #79900, #79225, #80964, #83235, #84108, #84451, #78602, #78603, #78527, #78604, #78992, #78993, #78997, #79278, #79341, #79311, #79411, #79581, #81800, #79834, #82309, #79975, #82587, #82603, #83191, #84349, #84460, #85793, #86057)
- Beefed up API for printing out operators registered to the dispatcher (#78995)
- Trued up
c10::FunctionSchema::operator<<
to print native_functions.yaml syntax (#79645) - Made it so that it is valid to set metadata after detach calls, like
x.detach().resize_(...)
(#83590) - Optimized
torch.ops.ns.opname.overload
accessor in__torch_dispatch__
(#85132)
Dataloader
- Added shape checking on argument weights for WeightedRandomSampler (#78585)
- Added support for random_split to accept percentages as lengths (#78877)
- Extended the collate function so that collate functions can be registered to handle specific batch types (#85748)
Functorch
functorch.jacfwd
now accepts arandomness
kwarg (#84220)- Improved the error message when using
vmap
on a function with no Tensor inputs (#83016) - Relaxed the
Tensor.as_strided
batching rule. This is a primitive used in forward-mode AD (among other things) and improves composability of vmap with other transforms (like jvp). functorch.functionalize
: added support for in-place views on inputs (#83993)functorch.functionalize
: moved this API out of thefunctorch.experimental
namespace (#85742)- Added vmap support for
linalg.cholesky
,linalg.eigvals
,linalg.eigvalsh
,linalg.matrix_norm
,linalg.matrix_power
,linalg.norm
,linalg.tensorinv
,linalg.solve_triangular
(#82177) - Added vmap support for
linalg.solve
(#82814) - Added vmap support for
linalg.cross
(#83759) - Added vmap support for
linalg.matrix_rank
(#83760) - Added vmap support for
linalg.pinv
(#83761) - Added vmap support for
Tensor.fill_
(#84015) - Added vmap support for
linalg.lstsq
(#82325) - Added vmap support for
linalg.lu_solve
(#85175)
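For example, the new vmap coverage makes batched linear-algebra calls straightforward (a minimal sketch):
import functorch
A = torch.randn(5, 3, 3)
B = torch.randn(5, 3)
X = functorch.vmap(torch.linalg.solve)(A, B)  # solves the 5 systems independently; X has shape (5, 3)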
LinAlg
- Added a
driver=
kwarg totorch.linalg.svd
andsvdvals
. Add cusolver gesvdaStridedBatched driver tolinalg.svd
(#74521) - Added opteinsum backend to
torch.einsum
(#86219) - Added path optimize kwarg to
einsum
(#84890) - Call view instead of sum in
einsum
to remediate MPS regression (#87135) - Ensure that we contract left to right in
einsum
(#87199) - Fixed opt_einsum defaults to be more reasonable (#86985)
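Assuming the opt_einsum package is installed, multi-operand contractions are routed through an optimized path, and the backend can be tuned via torch.backends.opt_einsum:
a, b, c = torch.randn(8, 16), torch.randn(16, 32), torch.randn(32, 4)
out = torch.einsum("ij,jk,kl->il", a, b, c)      # contraction order chosen by opt_einsum when available
torch.backends.opt_einsum.strategy = "optimal"   # or "auto" / "greedy"
torch.backends.opt_einsum.enabled = False        # fall back to the default left-to-right contraction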
Sparse
- Added
sparse_dim
anddense_dim
for batched, hybrid CSR/CSC/BSR/BSC (#80565, #80901) - Added support for conversion between batched CSR/CSC/BSR/BSC and dense Tensors (#80781, #83084, #83086, #78025, #80354, #82120)
- Added support for conversion between CSR and CSC (#85091)
- Added support for conversion between BSR and BSC (#85091)
- Added partial support for CSR/CSC/BSR/BSC inputs to
mm
,addmm
,matmul
andF.linear
(#85551, #85308, #85379, #85307) - Added support for COO to
permute
(#79707) - Added support for ComplexHalf to
torch.nonzero
andadd(dense, CSR)
(#79062) - Added support for CSC/BSR/BSC to unary zero-preserving functions. (#78173, #85031)
- Added support for batched BSR/BSC to
transpose
(#82122) - Added support for scalar together with COO inputs to
mul
(#82962) - Added support for CSC/BSR/BSC to
empty_like
(#82310) - Added support for batch dims of CSR/CSC/BSR/BSC to
select
(#82119)
torch.fx
- In constant folding, added
device_for_folded_attrs
parameter and sets therequires_grad
option for a folded tensor (#79067) - Mode-based tracing in make_fx (#79638, #84238)
- Made executor handle kwargs (#79858)
- Added
ignore_parameters_and_buffers
flag to FxGraphDrawer (#79982) - Enabled an
is_fx_tracing
flag in the FX tracer (#80255) - Attached ProxyTorchDispatchMode to ProxyTensor and use it in
__torch_dispatch__
(#82549) - Used
enable_tracing
flag for ProxyTorchDispatchMode instead of modifying torch dispatch mode stack inner attributes (#82643) - Improved legalize_graph pass in FX (#82874)
- Implemented
__deepcopy__
for fx.Tracer (#83130) - Hackde up make_fx to natively support varargs (#83210)
- Updated proxy_tensor.py to support List input/output (#83302)
- Added *_only and all/any pytree utilities (#83316)
- Deleted ProxyTensor wrapper subclass (#83330, #83646)
- Added support for partial decompositions in make_fx (#83770)
- Added metadata field to fx.GraphModule (#84378)
- Added option to maintain the FX graph execution order after splitting_module (#85188)
JIT
- Added PReLU to MKLDNN convertible Ops in JIT optimize_for_inference (#79011)
- Enabled
torch._refs.var
for nvFuser executor (#79517) - Fixed nvFuser's
where
(tensor, python_scalar, tensor) type promotion (#80347) - Added ComplexDouble scalar creation bindings to nvFuser's Python API (#80522)
- Added real and imag to NVFuser and its python frontend (#79824)
- Added Nvfuser opt in for decomposition (#81134)
- Added
torch.jit.fuser()
option for disabling all fusers (#81731) - Added support for symbolic diff for
silu
(#81724) - Added NVFuser support for (
prims.sign, refs.sign, squeeze, native_batch_norm, transpose
) (#83167, #85562, #84629, #84117) - Use high precision accumulate buffer for bf16 accumulation in NNC (#84402)
Quantization
- Improved quantization support for
masked_fill
(#78368, #85108) - Improved quantization support for
index_put
(#78384, #85685) - Improved quantization support for
LSTM
andMultiHeadAttention
(#79959, #79956, #79960, #83304, #85068) - Added support for quantized
matmul
(#83885) - Introduced a more stable conv_bn fusion for QAT training (#85744)
- Removed warnings from using torch.tensor(value) (#84277)
ONNX
- Added operator support for
torch.tensor_split
(#77437),torch.lerp
(#78891),torch.movedim
andtorch.moveaxis
(#78931),torch.scatter_add
(#79103),torch.argsort
(#80234),aten::native_dropout
(#81743),aten::native_layer_norm
(#81754),aten::convolution
(#81815),aten::_log_softmax
(#81804),aten::layer_norm
for ONNX opset version 17 using LayerNormalization (#84293),nn.init.normal
(#84149) - Added quantization support to more single output ops (#83008)
aten::reshape
,aten::reshape_as
,aten::t
,aten::transpose
,aten::numpy_T
,aten::expand
,aten::expand_as
,aten::embedding
,aten::embedding_bag
,aten::view
,aten::select
,aten::eq
,aten::ne
,aten::gt
,aten::lt
,aten::le
,aten::ge
,aten::elu
,aten::selu
,aten::hardtanh
,aten::hardswish
,aten::as_strided
,quantized::sigmoid
,quantized::layer_norm
,quantized::group_norm
,quantized::leaky_relu
,quantized::instance_norm
- ONNX operators are exported with names containing their associated scope from
nn.module
(#82038), (#82039), (#82040) - Introduced runtime type checking with the beartype library in all public APIs (#83673), (#84091)
- All
torch.onnx
APIs now support runtime type checking when @beartype is present in the Python environment. A warning is emitted when a type mismatch is detected. - This feature is experimental. To turn all warnings into errors, set the environment variable
TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=ERRORS
. To disable this behavior, setTORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=DISABLED
which effectively makes it a no-op. - Improved shape type inference (#78999)
- Turn on ONNX shape inference by default (#82767)
- Enabled data propagation from ONNX (#80730)
- Introduced SARIF (#85428) for
torch.onnx
submodule - Improved warnings and errors (#78441), (#78309), (#83332), (#85179), (#83007)
- Updated ONNX submodule to 1.12 (#79585)
- Apply Common Subexpression Elimination pass to ONNX export (#85665)
AMD
- Support benchmark flag for MIOpen (#77438)
- Correctly handle the error codes of hipGetDeviceCount (#80405)
- Use torch._C._cuda_getArchFlags to get list of gfx archs pytorch was built for (#80498)
torch.cuda.is_bf16_supported()
returns True (#80410)- Workaround missing hipProfilerStart/Stop (#82778)
- Enabled jiterator on ROCm (#77982)
- Enabled MIOpen fused convolution relu (#82002)
- Restore MIOpen benchmark flag default to true (#82656)
- embedded_interpreter_hip to enable torch::deploy on AMD (#83329)
- Add HIP libs into torch deploy init list & corresponding dependency for CURE benchmark running on AMD (#83434)
CUDA
- Added synchronize hooks (#84427)
- Added CSAN support for CPU synchronizations (#84428)
- Return device count using nvml (#84879)
- Reworked printing tensor aliases in CSAN error message (#85008)
- Added jiterator support when dtype is
complex32
fortan
,atan
,sin
,asin
(#77802),(#77606) - Added jiterator support when dtype is complex for
logical_{or, xor}
(#75947) - Reduced overhead of
get_current_stream
(#78066) - Added an argument to specify warmup iterations in make_graphed_callables (#78124)
- Small improvements to
device_count
(#85192) - Memoize
torch.cuda.device_count
(#84878) - Remove the construction of unused tensors in fallback convolution implementation (#79183)
__launch_bounds__
fortorch.mode
with CUDA 11.7 (#79710)- Removed synchronization for D2H copy with a different dtype (#80607)
- Added nondeterministic alert to CUDA
cumsum
- Annotated CUDACachingAllocator snapshots (#82146)
- CUDACachingAllocator snapshots from C++ (#86190)
- Propagate CUDAOutOfMemoryError to Python. (#83146)
- Set cublas workspace size to 4M (#74159)
- Allow changing the cuda allocator settings even after the process started (#84970)
- Fixed exception handling, improve overheads and avoid constructing storage for element size for DLPack (#84612)
- Added BFloat16 for fast layernorm (#83971)
- Added BFloat16 support for
torch.{im2col,col2im}
on CUDA (#84372) - Added Bfloat16 support for
ReflectionPad
(#84949) - Added explicit
__all__
to torch.cuda (#85193) - Set CUDA_MODULE_LOADING to LAZY when not set by the user (#85692)
- Support cuDNN Errata Filter (#73934)
- Allowed the number of kernels profiled under torch.backends.cudnn.benchmark = True to be limited (cuDNN v8 benchmark limit) (#78299)
- Update tests and dispatching for CUDNN V8 API behavior for bfloat16 convs (#81139)
Intel
- [RFC] Enable oneMKL & oneDNN on-demand verbose functionality (#63212)
- Updated ideep for NNC post-op (#82705)
- Enabled native 1d spatial input for Intel xpu (#82301)
- Added loss operators to fp32 cast policy of AutocastCPU (#81689)
- Added bfloat16 support for
lerp
on CPU (#84327) - Added
prelu
op and module for quantized CPU backend (#73491) - Enabled mkldnn matmul for aarch64 bf16 devices (#85546)
MPS
- Added ranked tensors for addcmul ops in MPS instead of constants and update MacOS version check (#78354)
- Moved MPS compat check into common comparison machinery of
TensorLikePair
(#77836) - Made MPS buildable with either XCode or CommandLineTools (#79430)
- Improved MPS
aten::softplus
operator by adding RankedPlaceholder for graph nodes instead of constants (#81169) - Extended MPS Conv1D operation for NHWC format (#83121)
- Added support for 1D weights in MPS linear layer (#85752)
- Added full support for serialization of MPS Tensors (#79465)
- Added support for 1D bias in MPS operation
torch.addmm
(#81519) - Added torch dispatch stub code for MPS backend (#82612)
- Use convenience helper function
dispatch1DJob
for MPS native implementations (#82982) - Enabled support in MPS for
torch.adaptive_avgpool_2d
for larger output sizes (#85726) - Extended support in MPS for
torch.constant_pad_nd
for 4D+ padding (#85991)
Profiler
- Propagate metadata into
Engine::evaluate_function
event. (#77696) - Switched to nanoseconds for Result's internal representation (#77697)
- Made profiler table column widths changeable via arguments (#85203)
Vulkan
- Enabled higher dimensional input in
torch.nn.linear
(#81773) - Vulkan tensor views now infers dim size when -1 is provided as input (#81668)
- Vulkan prepacked op contexts will now release the deserialized CPU tensors from memory upon construction (#83587)
- Vulkan shader codegen is now Windows compatible (#85241)
Mobile
- Allowed tracing multiple input models at once (#84833)
- Leaky
relu
in metal shader (#78544) - Added detailed error message for iOS test (#79140)
- Removed code duplication and refactored (#79184)
- Optionally run fbgemm in tracer (#83531)
- Added hardshrink op to metal backend (#82224)
- New flatbuffer_loader functions that do not depend on flatbuffers.h (#82618)
- Added
max_pool2d
,linear
,conv2d
FP32 operator tests for XNNPACK (#83131) - Removed flatbuffer types/headers from flatbuffer_serializer[_jit].h (#82619)
- Migrated remaining pytorch code to use new flatbuffer_loader.h APIs (#82620)
- Remove flatbuffer types/headers from flatbuffer_loader.h (#82893)
- Use flatbuffer of alternate namespace (#82952)
- Hide flatbuffer build dependencies (#82953)
- Renamed flatbuffer_all to flatbuffers_jit (#82826)
- Renamed flatbuffer_serializer to _mobile or _full_jit (#82827)
- Created flatbuffers_mobile (#82828)
- Added API for profiling backend memory events for Edge CPU profiler (#80350)
- Switched mobile targets to flatbuffers_mobile (#82829)
- Added an option to avoid adding base ops to static op library for Edge (#84360)
- Fixed load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855)
- Remove unused field 'order_' in nnapi.h (#84067)
Distributed
Distributed(c10d)
- c10d API improvements:
- Improvements to c10d error messages:
- Passed group ranks and options to third party distributed backends (#73164)
- Enabled NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG is set to DETAIL (#83881)
- Added a soft error handling mode `NCCL_ASYNC_ERROR_HANDLING=2` that does not crash the process (#84386) (see the sketch after this list)
- Upgraded NCCL to 2.14.3 (#85367)
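For the soft error-handling mode above, the environment variable has to be set before the process group is created. A minimal sketch (an assumption of this example: the job is launched with `torchrun`, which supplies the rendezvous environment variables):

```python
import os

import torch.distributed as dist

# Opt into soft NCCL error handling: asynchronous errors are reported without tearing down the process.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "2")

# torchrun sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT, so the default env:// init can be used.
dist.init_process_group(backend="nccl")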
Distributed Optimizer
- Added functionality to save and restore the step counter for the model averager in PostLocalSGDOptimizer (#78988)
DistributedDataParallel
- Enabled the static graph to print unused parameters in debug mode for DDP. (#81929)
- The stateful PowerSGD communication hook can now be saved and reloaded to resume training (#79334)
FullyShardedDataParallel
- Allowed different `optim_input` orders across ranks (#78599)
- Added profiling range for FSDP.backward (#78479)
- Enabled NamedTuple support for FSDP (#83055)
- Added FSDP communication hook interface for NO_SHARD strategy (#79833)
- Moved the `sharded_state_dict` logic to the post hook to avoid OOM (#82613)
- Added ability to iterate through dataclasses in fsdp.utils (#82638)
- Enabled passing kwargs to load_state_dict (#83309)
- Used `_init_from_local_tensor` to create ShardedTensor to avoid communication overhead (#82911)
- Added communication hook for sharded strategies (#83254)
- Changed to print exec order only in debug mode (#83868)
- Ensured that all ranks use the same order to iterate through optimizer states (#84654)
- Copied optimizer states to GPU before gathering, since they may be on CPU (#84708)
- Handled the `state_dict` on CPU cases (#85640)
- Added `FSDPExtensions` for TP support (#85039)
- Ignored buffers that are non-persistent (#85740)
- Delayed moving tensor to CPU until necessary for optim_state_dict() (#85761)
- Dequeue one event instead of flushing for rate limit (#86165)
torch.distributed.elastic
- Implemented a named pipe based watchdog timer (#83695)
Infra (RelEng)
- Consolidated all python targets in the tools folder (#80408)
- Improved ios simulator test in CI (#80459)
- Add functorch testing shard in CI (#81283)
- Added functorch shards for windows CI (#82161)
- Added functorch shard for mac x86 tests, linux cu102 tests (#82000)
- Added CI workflow to build official docker images with multiarch (#83437)
- Sharded `trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (default)` from 2 -> 4 (#83424)
- Migrated workflows from 18.04 to 22.04 (#83861)
Bug fixes
Python API
- Fixed `dim` out of range check for `logcumsumexp` on CUDA when the source tensor is empty (#78284)
- Added missing `__init__.py` for `torch.utils.jit` (#78629)
- Fixed backward crash for `gather` with an empty index tensor when `sparse_grad=True` (#78698)
- Added type annotations to `torch.distributions.kl_divergence` (#78432)
- Fixed erroneous inclusion of `end` in the output of `torch.arange` for some inputs (#80758)
- Fixed `torch.distributions.Transform` to be pickle-able (#81707)
- Added check that `self` and `mask` are on the same device for `torch.masked_fill` (#82737)
- Fixed potential ref cycle creation in `torch.utils.checkpoint` (#82776)
- Fixed `Tensor.__hash__` for Tensor subclasses (#83174)
- Fixed `torch.cat` for 0-dim tensors with different dtypes (#83391)
- Fixed `torch.equal` on CPU when inputs have different dtypes (#83350)
- Fixed data-dependent shapes in `torch.distributions.{HalfCauchy, HalfNormal}` (#84322)
- Added check that the size of the last dimension of `tau` is less than or equal to that of `input` in `torch.ormqr` (#85278)
- Added check that `weights` is a 1D tensor in `torch.bincount` (#85881)
- Fixed segfault for `out` arguments that have a large number of dims (#85294)
- Fixed comparison ops with scalar arguments by removing overflow check (#78881)
- Normalized `torch.utils.dlpack` strides to 1 where the size of the corresponding dimension is < 2 (#83158)
- Added a check in `torch.empty_strided` that `sizes` has the same dimensionality as `strides` (#82422)
- Fixed `torch.istft` default output length to prevent trimming of the last element (#80031)
C++ API
- Fixed missing antialiasing path to the interpolation for bicubic mode (#84599)
- Added `IListRefTag::Materialized` to `IListRefIterator` destructor (#85467)
- Fixed `im2col` by adding a check that `pad_width` and `pad_height` are non-negative (#85541)
- Fixed `check_compiler_ok_for_platform` on non-English locales in `torch.utils.cpp_extension` (#85891)
Autograd
- Corrected the forward AD formula of `torch.sgn`, which fixed forward-over-backward for `torch.linalg.svd` and other spectral decompositions, as well as `torch.norm` and `torch.linalg.{norm, matrix_norm}` (#80082)
- Fixed derivatives of convolution overridable backward (#80840)
- Updated setting non-float, non-complex values for a forward AD dual tensor to properly error (#78361)
- Fixed forward AD to not set tangent as-is in some situations (#79664, #79653)
- Fixed cpp hooks, retains grad, and `backward(inputs=)` behavior in-place (#79996)
- Relaxed storage layout checks for forward AD with zero-numel tensors (#81055)
- Fixed leak when `create_graph=True` and a full backward hook is registered (#82788)
- Fixed view and in-place interaction when grad_fn is first accessed in no-grad mode (#83872)
- Updated backward of `torch.stack` to correctly handle implicit real->complex casting (#84993)
- Fixed gradients for `torch.nn.functional.{leaky_relu, threshold}` when inplace=True (#85634)
- Corrected autocasting behavior in `torch.utils.checkpoint` when use_reentrant=False (#81766)
- Fixed gradcheck when outputs that don't require grad precede those that do (#77743)
- Fixed backward and double backward for `nn.functional.binary_cross_entropy_with_logits` (#80083)
- Fixed derivatives of `norm(p=inf)` (#78105)
- Fixed forward AD when the conj-ness of the primal and tangent of the dual tensor do not match (#78358) (see the sketch after this list)
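Several of the fixes above exercise forward-mode AD dual tensors. A minimal sketch of that API using the public `torch.autograd.forward_ad` helpers (illustrative only, not code from the PRs):

```python
import torch
import torch.autograd.forward_ad as fwAD

primal = torch.randn(3)
tangent = torch.ones(3)

with fwAD.dual_level():
    dual = fwAD.make_dual(primal, tangent)   # a "dual tensor" carrying primal + tangent
    out = torch.sgn(dual)                    # torch.sgn's forward AD formula was corrected above
    jvp = fwAD.unpack_dual(out).tangent      # Jacobian-vector product of sgn at `primal`
```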
Build
- Use C++17 for RocksDB 7 header. (#75741)
- Fixed Windows builds with _DEBUG flag (bbe8d01)
- Pass WITH_BLAS option from environment to CMake (#78037)
- Remove `-Wno-unused-but-set-variable` for clang 13.0.0 (#79666)
- Fixed variable typo for USE_SYSTEM_PYBIND11 (#80272)
- Fixed compilation errors during build with clang13 (#80916)
- Added missing -fexceptions flags during PyTorch build (#81394)
- Fixed CMake dev warning (#81580)
- Fixed false positive AVX, AVX2 and AVX512 detection with MSVC (#82554)
- Fixed NCCL detection issues of the Gloo library (#82773)
- Fixed objcopy version detection in NCCL cmake process (#82774)
- Fixed build error by changing COLORIZE_OUTPUT option to USE_COLORIZE_OUTPUT in cmake file (#83716)
- Set default value for NCCL make to MAX_JOBS if ProcessorCount returns 0 (#84231)
- Fixed intermittent link errors in NCCL build (#84245)
- Deleted `torch._dl` extension (#84361)
- Used unified source file list for BUCK build (#84770)
Complex
- Fixed the derivative of `torch.acosh` for complex numbers (#80841)
- Removed unused conjugate kernels for real dtypes (2.2MB reduction in CUDA binary size) (#80374)
torch.nn
- Fixed `nn.Embedding`'s `max_norm` argument when forward mode AD is used (#78560)
- Fixed `nn.ChannelShuffle` when given empty Tensors (#77029)
- Fixed `nn.RReLU` backward on CUDA (#80434)
- Fixed spurious warnings in `torch.nn.parallel.*` APIs (#81476)
- Fixed `nn.Conv2d` fallback implementation for single channel inputs and channels last weight (#82392)
- Fixed segfault in adaptive pooling for specific index values (#84010)
- Fixed type annotation in `nn.Conv{1,2,3}d` for in_channels (#84302)
- Fixed `nn.GeLU` for empty inputs (#84926)
- Fixed correctness issues for `nn.Conv2d` on ARM-based machines (#85711)
- Fixed `nn.ParameterList` printing of Tensors on the “meta” device (#78529)
- Fixed channels-first behavior for `nn.MaxPool3D` on CUDA (#80748)
- Fixed input shape validation for `nn.MaxPool1d` (#85594)
- Fixed `nn.Softmax` for large input tensors (#84182)
- Fixed lower and upper bound checks for `nn.RReLU` (#84996)
- Fixed edge cases in `torch.nn.grad` by calling into the C++ backward kernel directly (#81839)
- Fixed `torch.nn.PixelShuffle` for empty inputs (#86262)
- Fixed consistency of output and input dtypes for `torch.nn.BatchNorm` (#84410)
torch.nn.optim
- Fixed `optim.SGD`'s `maximize` flag when `momentum` is involved (#81859) (see the sketch after this list)
- Fixed temporary bug where checkpoints from optimizers created with an older PyTorch version could not be loaded (#83588)
- Fixed memory leak in `optim.lr_scheduler.CyclicLR` (#85462)
- Fixed initialization of `lr` in `optim.lr_scheduler.SequentialLR` (#72856)
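As a minimal sketch of the `maximize` + `momentum` combination covered by the first fix above (illustrative usage of the existing `torch.optim.SGD` flags):

```python
import torch

param = torch.nn.Parameter(torch.zeros(3))
# maximize=True flips the update direction; its interaction with momentum is what #81859 fixed.
opt = torch.optim.SGD([param], lr=0.1, momentum=0.9, maximize=True)

for _ in range(5):
    opt.zero_grad()
    objective = (param * torch.tensor([1.0, 2.0, 3.0])).sum()  # gradient ascent should increase this
    objective.backward()
    opt.step()
```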
BetterTransformer
- Cleaned up native transformer implementation (#78265)
- Added fastpath test for mask check flag (#82999)
- Added check for contiguous well-formed mask (#79927)
- Introduced mask contiguity check function (#79186)
- Fixed issue in softmax.cu with transformer error when mask `seqlen > 1024` (#83639)
- Disabled Transformer/MHA fast path when autocast is enabled (#84722)
- Moved odd `num_head` in TransformerEncoder to `slow_path` (#83483)
Composability
- Fixed `__torch_function__` bug in getindex that caused an "error not set" exception (#78781)
- Fixed `__torch_dispatch__` usage with inplace views (#79902)
Dataloader
- Fixed "`NoneType` object has no attribute `python_exit_status`" when `DataLoader` exits (#83985)
Functorch
- `functorch.grad`: fixed silent correctness issue from calling a view operation on a captured tensor followed by an in-place operation (#85374)
- `functorch.jacrev`, `functorch.jacfwd`: fixed loud in-place errors when passing in inputs to the transforms and mutating them (#84914, #84915)
- `functorch.vmap`: Fixed support for in-place view operations (`Tensor.unsqueeze_`, `Tensor.transpose_`, `Tensor.t_`, `Tensor.squeeze_`) (#82899, #82903, #82972)
- `functorch.vmap`: added an error on incorrect `weight` shape to `torch.nn.functional.prelu` (#83106)
- `functorch.vmap`: fixed support for multinomial (#83838)
- `functorch.vmap`: fixed incorrect support for `conv_transpose` with `groups > 1` (#84938)
- Fixed `vmap` x `vjp` x `vjp` composition for `torch.nn.functional.prelu` (#84939)
- Fixed printing tensors that are not being transformed over inside functorch transforms (#85556)
- Disallowed saved tensor hooks in functorch transforms to avoid silently incorrect behavior (#85972)
- Fixed `cross` to match unbatched behavior (#86926) (a sketch of composed functorch transforms follows this list)
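Most of the fixes above concern composed transforms. A minimal sketch of such a composition, computing per-sample gradients with `vmap(grad(f))` from the in-tree functorch package (illustrative only):

```python
import torch
from functorch import grad, vmap

def f(x):
    # scalar-valued function of a single sample
    return (x.sin() ** 2).sum()

samples = torch.randn(5, 3)
per_sample_grads = vmap(grad(f))(samples)   # gradient of f for each row, vectorized over dim 0
print(per_sample_grads.shape)               # torch.Size([5, 3])
```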
LinAlg
- Strengthen the preconditions of `linalg.cross` (#83798)
- Fix memory issues in `linalg.lstsq` (#85357)
- Fix `linalg.lu_solve`/`torch.unpack` to prevent bad memory usage on CPU (#85922)
- Preserve the dim of the input in `matrix_exp` (#81330)
Sparse
- Fixed COO Tensors with fewer than two non-zero elements to always be marked coalesced (#82426, #82085)
- Fixed CUDA kernel launch misconfiguration for `mul` on tiny COO tensors (#80254)
- Fixed silent type promotion bug by `select` if given all zero integer COO tensors (#82215)
- Fixed CUDA kernel coverage on 0-sized dense inputs for `torch.sparse.sampled_addmm` (#85194)
torch.fx
- Fixed bug where curly brackets were not properly escaped in FxGraphDrawer (#83604)
- Fixed torch.fx.wrap to use the callable `function.__name__` rather than `function.__code__.co_name` (#84373)
- Added strictness check and made tensors into leaves if input tensors were leaves (#77474)
- Used getattr_recursive instead of getattr when splitting (#80011)
- Stopped ProxyTensor from turning aten::lift tensors into proxy objects (#81024)
- Fixed named_modules to be subscriptable (#81258)
- Fixed `to_folder` by adding custom_builtins to dump (#81433)
- Correctly unpacked constants when used in multi-return output (#82568)
- Replaced module name for torch.ops (#82395)
- Removed unnecessary `import warnings` (#82760)
- Don't constant propagate through nondeterministic functions (#83650)
- Don't extract tensor metadata from sparse tensors (#83669)
- Skipped folding side-effectful functions (#84016)
- Fixed make_fx issue by introducing get_attr into symbolic tracing (#84011)
- Disabled autocast cache during aotdispatch (#84035)
- Modified split_by_tags to retain output order (#84136)
- Made NormalizeArgs preserve node type (#85637)
- Fixed PyTree unpacking carrying forward type annotations (#81906)
JIT
- Fixed conv-batchnorm folding for previously-broken datatype inputs during JIT freezing (#78241)
- Fixed lightweight dispatch OOM error by introducing selective build (#79215)
- Used signed integers in `CalculatedNecessaryArgs` to avoid underflow with schemas where all args have defaults (#79331)
- Fixed indexing into a tensor with a tuple (#79335)
- Propagate `map_location` arg to `torch.jit.load` in `torch.load` (#78733)
- Improved JIT autodiff heuristics for determining whether outputs require gradients (#78392, #79498)
- Used streams for `import_ir_module` for pickle case to reduce memory usage (#80131)
- Added scripting support for the "start" kwarg in `enumerate()` (#80585) (see the sketch after this list)
- Turned off ARC in the CoreML backend, because throwing exceptions in ARC code leaks memory (#79928)
- Suppressed virtual-dtor check on llvm_jit to fix NNC build (#81449)
- Fixed annotation extraction for Python 3.10 (#81334, #81506)
- Fixed `std::out_of_range` when using NNC and `ConstantChunk` input shapes are unknown (#82698)
- Limits constant chunk propagation for pw-node-only in NVFuser (#83083)
- When encountering dynamic types, cast them recursively (#83218)
- Fixed handling of empty dim list in `sum_mean_dim` symbolic shape fn (#83357)
- Check existence of the array ref when tracing `resize_` to avoid `_MapBase::at` runtime error (#81422)
- Fixed `define_constant` pybind signature to match `std::complex` scalar in NVFuser (#83684)
- Cast to signed char to fix aarch64 build (#84429)
- Support `torch.ScriptObject` in `torch::jit::as_object` (#84398)
- NVFuser torchbench patch to take nvprim fallback when no cuda tensors are provided as inputs (#84411)
- Fixed coreml gpu flag not set (#84725)
- Print the real type for function schema arguments (#85103)
- Fixed `torch.jit.trace` check that was causing tracing to fail for MPS inputs (#84850)
- Throw an error instead of segfaulting when passing `None` to futures (#85304)
- Cherry-picked sorting patch for the NVFuser fusion segmenter (#85620)
- Support freezing modules that don't have a forward method (#85779)
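As a small illustration of the scripting fix for the `enumerate()` `start` kwarg above (a sketch, not code from the PR):

```python
import torch
from typing import List

@torch.jit.script
def one_indexed_positions(values: List[int]) -> List[int]:
    out: List[int] = []
    for i, v in enumerate(values, start=1):  # `start=` is now accepted by the TorchScript compiler
        if v > 0:
            out.append(i)
    return out

print(one_indexed_positions([3, -1, 7]))  # [1, 3]
```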
Quantization
- Added channel axis bound checking in `fused_moving_avg_obs_fake_quant_*` (#78148)
- Disable use of qnnpack with `ceil_mode` of the `avgpool` op (#79028)
- Improve subpackage import in `torch.nn.quantized` (#84141)
- Fix segmentation fault in `QTensor.choose_qparams_optimized` (#85552)
- Enhance the `_rebuild_qtensor` function to support device types other than CPU (#78234)
- Fix `at::from_blob_quantized_per_tensor_affine` strides calculation (#79314)
- Fix embedding quantization issue when memory format is not `contiguous` (#82605)
- Fix dispatch declaration bug about quantized op (#83649)
- Moved the order of x86 engine to avoid changing the default qengine (#86631)
ONNX
- Fixed `aten::mul` with Boolean inputs (#81671)
- Fixed `add` and `sub` for non-tensor inputs (#81736)
- Fixed `RReLU` eval mode behavior (#82678)
- Fixed onnx optional node type in for/if block (#83599)
- Fixed `Interpolate`: use `half_pixel` instead of `pytorch_half_pixel` (#80003)
- Fixed `argmin` and `argmax` edge case consistency with PyTorch (#79503)
- Shape Type Inference and Propagation:
- Fixed shape inconsistency when exporting scalar `log2` (#78701)
- Fixed inconsistent `rand` dtype (#79193)
- Fixed linalg `norm` output's shapes and dtypes (#79506)
- Fixed `any` and `all` outputs' shape (#79371)
- Fixed `prelu` output's shape (#79846)
- Fixed onnx logical functions' dtype (#79339)
- Fixed `hardshrink` and `softshrink` output's shape (#79695)
- Fixed quantization outputs' dtype (#79690)
- Fixed reduce node shape inference (#85765)
- Fixed bug using `std::copy_if` (#80999)
- Fixed default function value in `_optimize_graph` (#83996)
- Fixed constant folding unexpectedly adding folded constant as initializer (#79552)
- Fixed autograd subgraph recording with nested graphs (#82852)
- Disabled autocast cache in exporter (#84219)
- Removed static None graph output (#82623)
- Fixed float point detection for optional tensor (with unknown rank) within a list (#81386)
- Support `device().type()` string comparison with constant (#86168)
- Fixed `scalar_type_analysis` metadata for copied constant (#86716)
- Fixed triu/tril export with diagonal input (#86843)
- Ignore `print(Tensor)` during tracing (#86223)
- Updated training state logic to support ScriptedModule (#86745)
AMD
- Fixed memory cross-border access on the ROCM platform (#76100)
- Set nvfuser default to disabled (#86369)
CUDA
- Fix how we handle host memory in CUDA `getDeviceFromPtr` (#76902)
- Only sync CUDA if the operation is run on GPU (#80328)
- Do not use `thrust::lower_bound` on device (#80746)
- Fix `set_requires_cuda_init` (#81183)
- Fix behaviour of index_add / atomicAdd(bool,bool) (#85100)
- Fix IMA for topk (#83042)
- Use `opmath_t` for activation functions in Activation.cu (#77949)
- Fixed the invalid configuration argument error when running layer norm backward (#80893)
- Support non-standard bools in CUDA unique (#79392)
- Accept non-standard bools in more CUDA kernels (#78957)
- Fix cuda-mode and add more tests (#81898)
- Clear autocast amp cache in CUDA Graphs (#81896)
- Properly compute `batch_element_count` in `warp_softmax` (#82927)
- Disabled autocast cache in torch.cuda.make_graphed_callables (#84289)
- Store RNG seed for CUDA graphs (#84967)
- Assert `lambda >= 0` in poisson distribution cuda kernel (#85906)
- Work-around 32-bit indexing failures in cuDNN batchnorm (#87861)
- Fixed 3d convolution_add_relu in V8 (#85055)
Intel
- Fixed bug for thnn_conv2d when input's C is 1 and weight is channels last (#82392)
- Fixed oneDNN channels_last path issue (#83653)
- Fixed the issue that torch.config does not respect the USE_MKLDNN flag (#75001)
- Made the data types of output and input consistent for batchnorm (#86784)
- Fixed the issue that cat result would be incorrect for channels-last (#85076)
- Fixed the performance issue that the for-loop before ExternalCall could not be parallelized (#85056)
- Fixed a performance issue with the for-loop before ExternalCall (#86516)
MPS
- Fixed MPS operator torch.full for boolean types (#82575)
- Extend MPS Unary operators for empty tensors which should be a no-op (#82650)
- Fixed MPS operator `torch.scatter` for boolean types (#82685)
- Fixed MPS operator `torch.cat` for boolean inputs (#81480)
- Fixed typo in MPS allocator (#83465)
- Fixed MPS operator torch.full to handle uint8 types (#83697)
- Fixed creation of `MPS::Placeholder` behavior for transposed view operations (#85689)
- Fixed handling of output shape for empty inputs to binary ops in MPS backend (#85836)
- Added support for handling scalar inputs to MPS operations of `torch.scatter` and `torch.gather` (#85842)
- Support for handling compatible inputs to MPS operation of torch.where (#85946)
- Added support for inputs with datatypes Short, Byte & Char to torch.dot MPS operation by casting to int32 when needed (#86140)
- Remove incorrect asserts in MPS backend from Copy.mm file (#86184)
- Added support for handling of 1D inputs for MPS operation `torch.nll_loss` (#81290)
- Get correct size of the view tensor when copying from cpu to mps device (#81730)
- Fix issues exposed in MPS testConsistency tests. The fix includes correct handling of types in smooth l1 loss, 0 dimensions for torch.repeat and empty inputs for torch.cat operations (#81735)
- Handle Integer inputs for MPS linear layer by returning error of unsupported data types (#82183)
- Workaround int8 datatype outputs in MPS for View operations (gather) by casting it to int8 (#82315)
- Improve handling of empty outputs and fix MPS linear layer’s handling of transposed Tensors in test consistency (#83124)
- Fixed handling of conv1D and conv2D MPS operations with non-matching strides/paddings (#83522)
- Fixed handling of MPS::Placeholder when View operation is missing gather graph (#83744)
- Fixed the index handling in MPS for torch.constant_pad_nd operations with single-dimension input (#83745)
- Handle casting for MPS torch.div operation in case of type mismatch (#84742)
- Fix device (MPS) to host (cpu) copy by casting from a smaller dtype to a bigger dtype (#84928)
- Ensure as_strided_tensorimpl is never called with MPS (#85020)
- Fixed integer rounding crash in torch.div MPS operation on M1 (#85016)
- Fixed crash in MPS bitwise ops on Mac x86 platforms. (#85285)
- Fixed crash in MPS Conv1d backward operation for NHWC (#85283)
- Added support for MPS reduction operations of scalar edge-cases (#83743)
- Fixed memory corruption in torch.var operation for MPS (#85571)
- Fixed memory leaks in MPS that cause the MTLBuffers not to be released and cause OOM (#85661)
- Fix test consistency error in MPS due to type mismatch between int8 and uint8 types (#85666)
- Fixed shape issues for torch.clamp op in MPS (#85673)
- Fixed handling of TensorBase shapes for view ops in MPS for case of multiple slices on a Tensor (#85934)
- Fix the dimension of padding to match the input's dimension for MPS Pad operations (#85990)
- Fix non-contiguous to contiguous copy of MPS tensors (#86056)
- Remove `std::cout` from MPS `multinomial` operation (#86246)
- Do not dispatch empty job in bitwise_not (#87286)
- Made copy from CPU always add storageOffset (#86958)
- Revamped `copy_to_mps_` implementation (#86956)
Package
- Added fix for implicit numpy dependency (#78979)
- Allowed torch._C to be recognized as a module in torch.package (#80917)
- Ignore return value of function declared with 'warn_unused_result' for torch::deploy (#84862)
- Removed torch::deploy from pytorch (#85953)
Profiler
- Fixed build failure in python 3.10 (#81812)
- Pop `KinetoThreadLocalState` at the start of post processing (#77996)
- Fixed record function inputs_valid_ check (#78002)
- Weakened ordering check during post processing. (#78563)
- Fixed Python parent id (#79356)
- GIL acquire needed in ValueCache::trimPrefixes (#81061)
- Added ephemeral inputs to the value cache. (#81958)
- Fixed profiling with record_shapes=True and nested tensor (#82854)
- Proper reset execution graph data in remove callback registration (#82910)
- Solved two syntax issues when dumping execution graph result to json file. (#81854)
- Set end time on python events when profiling stops. (#83621)
- Don't try to collect strides for non-strided tensors (#83935)
- Add null handling to `AppendOnlyList::copy` memcpy path (#83963)
- Add quoted metadata API to remove empty trace cpu_op metadata (#84128)
- Make `RecordQueue` manage the lifetime of `PythonTracer` (#83964)
- Don't assign in AppendOnlyList::emplace_back (#85716)
- Fixed traversal utility (#85717)
- Fixed python object reference counting (#85847)
Visualization
- Removed dependency on `torch.onnx` in `graph` (#82628)
- Updated `Image.ANTIALIAS` to `Image.Resampling.LANCZOS` in summary (#85679)
Vulkan
- Fixed the `aten::cat` operator registration (#78806)
- Fixed a bug in GRU where incorrect behaviour was being observed when `H_in != H_out` (#78945)
- Fixed a possible null pointer dereference in the `aten::mm` operator when passing an empty bias (#79701)
- Code under `ATen/native/vulkan/api` was essentially rewritten (more details below); as a result of these refactors, it is now possible to concurrently execute multiple Vulkan models due to correct synchronization when recording to a Vulkan command buffer (#80959)
Mobile
- Moved saving storage to the last step. (#78024)
- Fixed build For Model Tracer (#84755)
- Skip TestNNAPI tests if QNNPACK is not supported (#82882)
- Extended LinearPackedParamsBase getstate/setstate deadline in `check_forward_backward_compatibility.py` Allowlist (#81135)
- Removed LinearPackedParamsBase getstate/setstate from `check_forward_backward_compatibility.py` Allowlist (#81048)
- Fixed `ao::sparse::BCSR` missing in qlinear serialize and deserialize when USE_FBGEMM and USE_PYTORCH_QNNPACK are not set (#81256)
- Updated `model_ops.yaml` (#82444)
- Fixed signed/unsigned compare for Metal (#86068)
- Re-added benchmarking files to ios TestApp (#85539)
Distributed
Distributed(c10d)
- Ensured tensors are contiguous for autograd-enabled `all_gather` (#79747)
- Fixed data race condition of `batch_isend_irecv` (#82450)
- Fixed `distributed_test.py` flakiness by turning off async_error_handling (#78797)
- Re-enabled `isinstance` with `torch.distributed.ReduceOp` (#87303)
DistributedDataParallel
- Enabled `AllReduceCommHook` to accept `intrusive_ptr` (#80975)
FullyShardedDataParallel
- Fixed `full_optim_state_dict()` hang (#80712)
- Fixed exec order validation for ignored modules across ranks (#79533)
- Cleaned prefixes when searching for params / buffers to ignore (#78278)
- Returned the original module when calling .module on an FSDP-wrapped model (#78671)
- Fixed a small bug in pre_backward_hook param prefetching (#78851)
- Fixed param name prefixes for ignored modules (#79955)
- Fixed FSDP when not all outputs get gradient in backward (#80245)
- Fixed MP config not being passed to FSDP (#80869)
- Fixed FSDP device_id when CPU offloading (#82892)
- Fixed FSDP when not all outputs are used in the loss (#83195)
- Fixed the FQN not found issue for load sharded_state_dict when using activation checkpoint (#84253)
- Fixed `pin_memory()` for CPU offloading (#85048)
- Fixed memory regression (#85087)
- Implemented a short-term fix to remove `optim_input` (#84201)
torch.distributed.elastic
- Ensured that exit code is propagated from Child to parent process (#81408)
torch.distributed.rpc
- Only initialize CUDA if there are devices specified in `init_rpc` (#80180)
- Fixed the wrong usage of `RRefContext::handleException` by adding a new API `RRefContext::handleExceptionSilent` (#83166)
- Changed to avoid initializing storage for empty Optionals (#78947)
Infra (RelEng)
- Made bazel changes to make “bazel query ...” work (#78870)
- Fixed C API to be compatible with latest Python 3.11 beta (Please note that 3.11 binaries are not fully functional) (#81242)
Performance
Python API
- Fixed use of temporary buffers for tensors in `torch.save` (#80404)
- Fixed and improved the efficiency of the backward for `torch.xlog{*}` functions (#82713)
- Vectorized `.copy()` acting between different dtypes on CPU (#80905)
- Vectorized `bfloat16` conversions on CPU (#80906)
Autograd
- Codegened autograd nodes are now smarter about which gradients to compute (#82544)
- Made the derivative of masked_fill more efficient (#83515)
- `torch.where` no longer materializes a zero-filled tensor in its backward (#83043)
torch.nn
- Speed up `nn.Module` constructor by not calling custom `setattr` (#77098)
- Speed up CPU `nn.BatchNorm` implementation by using `torch.zeros()` directly (#82558)
- Speed up `nn.Module.load_state_dict` (#85743)
BetterTransformer
- Added nn.Module activation support in BetterTransformer (#78394), in addition to functional support which is not available in TorchScript
- Added mask identifier for multiplexed src_mask/src_key_padding_mask in BT (#81947)
- Added a small fastpath test for native multi-head attention (#81432)
Composability
- Release GIL when doing shared memory copies on Tensors (#85389)
- Some micro-optimizations in `RecordFunction`, the core util used by the profiler (#76266)
- `c10::detail::ReplaceAll`: avoid some unnecessary allocations (#79915)
Dataloader
- Moved loop content into a function to ensure we don't preserve `Tensor` in the `pin_memory` thread (#83595)
LinAlg
- Simplified and optimized `linalg.solve` (#74046)
- Improved heuristics for `linalg.lu_solve` when B is a matrix (#79838)
- Small optimization of `linalg.cholesky` (#81316)
- Prefer contiguous output from mkldnn_bf16_gemm (#82968)
- CPUBlas: Use mkldnn optimized BFloat16 matmul for gemm (#65840)
- Updated and improved the heuristics for `linalg.lu_solve` (#73878)
- Optimized `linalg.householder_product` backward to be more memory-efficient (#84627)
Sparse
- Improved `to_sparse_bsr` for batched dense inputs (#83085)
- Improved `to_dense` for CSC (#79635)
- Improved `index_select` performance for COO input on CUDA (#77551)
- Improved `mul(COO, COO)` performance with broadcasting in dense dims (#83428, #85336)
JIT
- Improved coreml load time by loading cpu model first, while asynchronously loading a model (#80941)
- Improved `torch::jit::as_{module,object}` performance (#84399)
- Replaced `IValue::toString()->string()` with `IValue::toStringRef()` (#85437)
Quantization
- Allow contiguous inputs to run into `qcat_nhwc_stub` when dim is the last dimension (#72575)
- Enable qlinear dynamic parallelization with fbgemm (#84033)
CUDA
- Fixed perf regression introduced in #70943 (#78588)
- Improved small sort performance on CUDA (#79627)
- Use cub::BlockRadixSort to improve medium length sort performance (#79628)
- Increased size limit on calling CublasLt in addmm by 32x (#82922)
- Don't synchronize single element any/all reductions (#84465)
- Added col2im_batched kernel (#84543)
- Exposed fast get_current_stream (#78165)
- Pool cudaEvents in CUDACachingAllocator (#78279)
Intel
- Optimize the copy of BFloat16 to Float and Float to BFloat16 (#79685)
- Improve performance of ONEDNN backend (#84470)
- Optimize softmax backward and logsoftmax backward (#80114)
- Improve sort multi-core perf by adjusting grain_size w.r.t. dim_size (#74897)
- Add fast path of `qmean`/`qstd` for quantized CPU (#80579)
- Use direct memcpy in `qcat` when all the inputs and output share the same scale and zero_point (#71903)
- Vectorize scalar remainder in quantized kernel for normalization (#79673)
- Enhance add_out_dense_sparse_cpu for hybrid sparse tensor (#23057)
MPS
- Performance improvements for the MPS backend by changing commitAndWait to commit & fixing high memory consumption for View operations. Also improved scalar handling in MPS Allocator (#81951)
- Improved performance for MPS backend by reducing the number of command buffers created and hence CPU overhead. It uses commitAndContinue feature in MPS (#81338)
- Added direct MPS implementation for constant_pad_nd operation which improved performance as the generic implementation was heavily reliant on View ops which are slow (#82366)
- Removed checks that incur unnecessary syncs for MPS device with tensor.item() (#82505)
- Enabled Graph caching in MPS for torch random ops with Philox engine (#85833)
- Added specialized memory pool for scalar values in MPS which improved performance in torchbench networks (#85817)
- Improved memory usage and performance by utilizing garbage collector and adaptive commit feature in MPS (#86119)
Profiler
- Optimize getStepCallbacks for common case of no active callbacks for kineto (#77804)
- Use custom AppendOnlyList for op_events to reduce the number of atomic operations (#78643)
Vulkan
- When waiting on the result of a `VkFence`, busy polling is now used instead of a single call to `VkWaitForFences` with no timeout. This can improve benchmark performance by up to 50% by ensuring that the CPU stays at a high frequency when waiting for work on the GPU to complete (#81470)
Mobile
- Added compilation_preference & relax_f32_to_f16 APIs (#78758)
- Made flatbuffer loads faster if loading as mobile module. (#78998)
- Stream pkl (#79931)
- Used Apple's Accelerate framework for blas acceleration (#80449)
- Read via FileAdapter when loading files in torch if not flatbuffer for lite_interpreter (#84028, #84296)
Documentation
Python API
- Fixed `torch.as_array` documentation formatting (#78485)
- Fixed default value for `storage_offset` in `torch.as_strided` documentation (#78202)
- Removed warning in documentation that `torch.real` is only supported on complex types (#78644)
- Improved reproducibility documentation for the random number generator and `torch.use_deterministic_algorithms` (#78849)
- Fixed example in documentation for serialization (#79454)
- Fixed `torch.linspace` documentation for dtype (#81371)
- Fixed typo in documentation for `torch.distributions.Dirichlet` (#82062)
- Fixed example in `torch.dist` documentation (#82104)
- Updated `torch.narrow` documentation to reflect that `start` can be a Tensor (#85180)
- Added documentation for `pin_memory` and `layout` arguments to `torch.new_{zeros, ones, full}` (#85605)
- Added documentation for `pin_memory` argument to `torch.{rand, randn}` (#85219, #85221)
- Added argument default values to documentation for `torch.rot90` (#85610)
- Removed `out` argument from documentation for `torch.squeeze` (#85222)
- Fixed `torch.log` example (#78776)
- Fixed `torch.argmin` docs for `keepdim` argument (#78888)
- Updated examples in documentation for `torch.use_deterministic_algorithms` (#82003)
- Changed docstring type `callable` to `Callable` for consistency (#82487)
- Added documentation for `torch.narrow_copy` (#84493)
- Improved documentation for `torch.signbit` (#78349)
- Added doc string for `torch.library.Library.impl` (#81047)
- Renamed `_Typed/_UntypedStorage` to `Typed/UntypedStorage` and updated documentation for `torch.storage` (#82438)
- Added documentation for `torch.unflatten()` (#81399)
Autograd
- Improved autograd custom function docs (#81340)
- Added randomness case to the autograd notes (#78617)
Complex
- Added a note on CUDA 11.6 (#80363)
torch.nn
- Fixed docstring and image for `nn.LeakyReLU` (#78508, #79102), `nn.ELU` (#78909), `nn.GRU` (#79380), `nn.Hardswish` (#70993), `nn.GeLU` (#85790)
- Fixed docstring for `nn.CrossEntropyLoss` (#79568 and #82538), `nn.MultiMarginLoss` (#84513)
- Fixed high level `nn.init` module doc to reflect that all functions run with `torch.no_grad` (#80882)
- Fixed docstring for `nn.Module.state_dict` (#83104)
- Updated docstring for `scale_factor` in `nn.functional.interpolate` (#80807)
torch.nn.optim
- Fixed docstring for `optim.lr_scheduler.ChainedScheduler` (#79775)
- Fixed docstring for `optim.swa_utils.SWALR` (#79836)
Composability
Functorch
- Fixed the model description in the functorch ensembling notebook (#83603)
- Fixed indentation in functorch limitations docs (#85346)
- Updated functorch installation instructions (#85854)
- Fixed functorch whirlwind tour notebook to be runnable (#85974)
- Documented new installation instructions for functorch (#86823)
LinAlg
Sparse
- Updated `scatter_add_` documentation to fix argument name (#80223)
- Updated `torch.sparse` docs to better cover CSR/CSC/BSR/BSC (#82108)
- Added torch.sparse overview section (#85265)
- Updated documentation for `mm` family ops and `F.linear` to note limited sparse support (#86220)
torch.fx
- Fixed decomposition example (#79807)
- Added `__all__` to various submodules in torch.fx, distributions, distributed, package (#80367)
- Added warning about DCE in FX being unsound with mutation (#81818)
Quantization
- Replace `qconfig_dict` with `QConfigMapping` in docs (#78533) (a minimal sketch follows this list)
- Corrects typo in quantization docs (#81687)
- Additional fixes for `quantize_fx` docs (#84587)
- Add example for the error message for fixed qparam ops (#84666)
- Add types for scale and zero_point tensor for `torch.fake_quantize_per_channel_affine` docs (#85733)
- Updated quantization docs to show per channel support for `conv1d` (#81349)
- Add `torch.ao.nn.quantizeable` modules documentation (#79957)
- Add more detailed docs for `torch.ao.quantization.quantize_fx.{prepare_fx|prepare_qat_fx|convert_fx}` (#83132)
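For context on the `QConfigMapping` change referenced above, here is a minimal FX-graph-mode sketch (illustrative only; it assumes an x86 build where the `fbgemm` backend is available):

```python
import torch
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx

model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU()).eval()

# QConfigMapping replaces the old qconfig_dict argument in the FX workflow docs.
qconfig_mapping = QConfigMapping().set_global(get_default_qconfig("fbgemm"))
example_inputs = (torch.randn(1, 16),)

prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)           # calibration pass
quantized = convert_fx(prepared)
```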
ONNX
- Added a table of unsupported aten operators in the documentation (#84496)
CUDA
- Fixed jiterator doc format (#78471)
- Use generic amp autocast in example and specified dtype (#79579)
- Fixed small typo in cuda.rst (#84012)
- Added user facing documentation for CSAN (#84689)
- Fixed broken docstring for `set_float32_matmul_precision` (#78949)
MPS
Package
- PackageExporter does not have file_structure (#79948)
- Updated package.rst to not include hermetic claim (#81019)
- Fixed typos in `torch.package` documentation (#82994)
- Fixed typo in torch/package/_mock.py (#84508)
Distributed
Distributed(c10d)
- Fixed some links in torch/distributed/CONTRIBUTING.md (#79855)
- Updated dist.scatter() documentation (#86069)
- Fixed docstring of `scatter_object_list` (#84596)
- Fixed doc string in `reduce_scatter` (#84983)
DistributedDataParallel
- Corrected the DDP wrap example by removing pg in DDP wrap (#83034)
FullyShardedDataParallel
- Improved auto wrap policy doc (#78400)
- Corrected comments in FSDP for gradient averaging (#80456)
- Updated `ShardingStrategy` and `_free_full_params()` docs (#80894)
- Added mention of `optim_input` being removed after 1.13 in the BC breakage warning (#85963)
torch.distributed.rpc
- Updated distributed/CONTRIBUTING.md to remove ProcessGroupAgent references and add test instructions (#78625)
Infra (RelEng)
- Added some documentation about the stats uploading process for CI (#79504)
- Fixed release doc builds (#79865)
- Updated release.md with release candidate validation steps (#79889)
Developers
Autograd
- Added the ability to register a hook to grad_fn with `.register_prehook` (#83226)
Build
- Modified nccl_dependency to take dev mode (#79169)
- Moved pytorch buck targets to shared build (#79330)
- Added kineto and flatbuffers to OSS BUCK (#79860)
- Updated llvm deps for Buck build (#79919)
- Moved aten targets to shared buck file (#79966)
- Updated buck_setup.sh (#80467)
- Minor fix for shared build (#80739)
- Deleted CCACHE_DISABLE and SCCACHE_DISABLE from nccl.cmake file (#84007)
Composability
- `TorchDispatchMode` and `TorchFunctionMode` extension points have been added. They are similar to their `__torch_function__` and `__torch_dispatch__` counterparts, but can be used as context managers that intercept all torch operator calls, including factory functions. These APIs are still experimental and aren't quite user facing yet, and we will add more documentation as they are hardened. See this post for more details. (#78214, #78822, #78847, #84774, #83925, #79143, #77667, #80992, #80995, #80998, #82647, #83372) (a minimal sketch follows this list)
- A large amount of hardening to `FakeTensor` and `FakeTensorMode`, a `__torch_dispatch__`-style mode that allows you to run shape/dtype/device inference. This is similar to the “meta” device, but fake tensors also faithfully store device metadata, and the logic lives in Python. (#77969, #77972, #77971, #78516, #78090, #78836, #78895, #78536, #78677, #78522, #78523, #78972, #79170, #80115, #80193, #80544, #81739, #82281, #82574, #82066, #82449, #82337, #82571, #82593, #82172, #84387, #85065, #82846, #85658, #85759, #85920)
- Added some new tags and beefed up tags support for operators in the dispatcher:
- Add data_dependent_output tag (#83312)
- Add nondeterministic tags in tags.yaml and add the nondeterministic_seeded tag to all functions in native_functions.yaml defined as nondeterministic by alias_analysis.cpp (#81440)
- Allow specifying operator tags when registering an operator to the dispatcher (#79322)
- add `inplace_view` tag to `resize_()` (#82667)
- Make string serialization of C++ FunctionSchema consistent with torchgen.model.FunctionSchema (#77926)
- Added support for custom namespaces in `torchgen` (#78015, #79733, #81362, #81581)
- Generate kernels for codegen’d `out=` operators (#78626, #81437)
- Added a new alias dispatch key for functional to view op decompositions (#79615)
- Added an env var for dispatcher debug logging (#81846, #82277)
- Fixed printing of DispatchKey in operator not found message (#81637)
- Added test that all BackendComponents are covered by toString (#81713)
- Refactored functionality and backend keys to reduce duplication (#81752)
- Made factory functions `CompositeExplicitAutograd`, so they show up as primitives in `__torch_dispatch__` (#82470)
- Added an `OpOverload.decompose()` API, for running an operator's decomposition if one exists (#83075)
- Fixed our dispatcher schema parser when parsing tensor list alias annotations (#84005)
- Allowed subclasses of `c10::TensorImpl()` to override non-virtual tensor methods (#84806)
- Made pytorch headers consumable from C++20 code bases (#79985)
- Added meta device support to `_UntypedStorage` and `_TypedStorage` (#78008)
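As a minimal sketch of the mode-based extension points described at the top of this list (illustrative only; these APIs were still experimental in 1.13, so treat this as a sketch rather than a stable recipe):

```python
import torch
from torch.overrides import TorchFunctionMode

class LoggingMode(TorchFunctionMode):
    """Log every torch-level API call intercepted while the mode is active."""

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print("intercepted:", getattr(func, "__name__", func))
        return func(*args, **kwargs)

with LoggingMode():
    x = torch.randn(3)   # factory functions are intercepted too, unlike plain __torch_function__
    y = x + 1
```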
torch.fx
- Added debug statements for small ACC subgraphs elimination (#80117)
- Checked node type before fetching users (#80166)
- Detected ProxyTensor layering violations (#80994)
- Increased stack level for get_attr warning (#81041)
- Preserved a node’s stack trace (#82670, #83050, #83558, #83706, #83960)
- For quantization, removed `WEIGHT_INDEX_DICT` and `BIAS_INDEX_DICT` and replaced with `node_arg_is_weight` and `node_arg_is_bias` (#83263, #83848)
- Asserted that ProxyTensorMode does not accidentally bake in constants (#83297)
- Improvements to FX Minimizer (#83833)
- Ported matmul compositeimplicitautograd impl into core (#85239)
- OpInfo for Slice (#85554)
- Raised errors in fx.Interpreter with Node info (#85810)
Quantization
- Enabled support for quantized fill of nhwc tensors (#79025)
- Tests for code snippets in quantization docs (#79923)
- Eliminate Named tensor warnings in XNNPACK and QNNPACK (#77762)
- Added earlier termination and improved error message for calling `min` and `max` ops on per channel quantized tensors (#79036)
- Added warnings to quantized dynamic conv and linear ops when `reduce_range=true` (#79273)
- Add assertions to fix `torch::jit::load` bugs (#79192)
- Optionally clamp weights post quantization (#83438)
ONNX
- `onnx.verification`: tool to verify exported model discrepancy between sets of inputs (#78323)
- Symbolic function registration is now done via decorators (#84709)
- `g.op` methods now exposed via the GraphContext class (#84728)
- Initial version of diagnostics infrastructure (#85107)
- Add dtype check in onnx verification (#79263)
Intel
- Added native impl for group norm on quantized CPU for channels-last inputs (#70520)
- Added `qscheme` check for quantization observer (#80126)
- Added oneDNN graph fuser context API and unittest (#82491)
- Added eltwise OPs for NNC: `mish` and `elu` (#80586)
- Support BF16ImmPtr (#84041)
- Enabled fusion of conv with elementwise OP for NNC (#77157)
- Channels last propagation within NNC fusion group (#76948)
- Lowering function generates the output buffer with the specified stride for NNC (#76529)
- Simplified IfThenElse and CompareSelect within for-loop for NNC (#76793)
- Do not pull in autocast* ops into NNC (#85140)
MPS
- Improve MPS test by extending `test_no_warnings_on_input` by capturing any output (#79163)
- Add testcase in test_mps for circular mode in torch.pad (#81455)
- Fixed build warnings while building with MPS on Mac platforms (#83048)
- Add per-op MPS gradient tests and update skips for TestConsistency (#84242)
Profiler
- New event representation in profiler (#77693, #77694, #77695, #78163, #79173, #81965, #80797, #81319, #81320, #81321, #81322, #80822, #82993)
- Build call tree for profiled events (#77698, #80810)
- Copy rollbear/strong_type to `c10/util` (#78162)
- Collect Layout and expose TensorMetadata (#81155)
- Added support for storing scalar values in profiling (#81843)
- Added support for Device (#82787)
- Added SOFT_ASSERT to gracefully recover from invariant violations (#82689)
- Added support for accessing strides and scalars (#80072)
- Record nn.Module's parameters (#83209)
- Extend Python bindings (#83622)
- Capture storage data pointer (#84276)
- Cleaned up Tensor representation (#85161)
- Compute unique IDs for Tensors (#85162)
- set_class util (part 1 of Record Optimizer) (#84779)
- Tracking Optimizer (part 2 of Record Optimizer) (#84920)
- Optimizer param_groups (part 3 of Record Optimizer) (#85784)
- Optimizer states (part 4 of Record Optimizer) (#85840)
- Extend ID assignment to allocations and frees (#85719)
- Made `name` a property (#85720)
- Added dtype to `_TensorMetadata` (#85721)
- Updated python binding type annotations (#85722)
- Started moving python bindings out of autograd (#82584)
Vulkan
- Vulkan operators that use prepacking have switched from individual `OpContext` classes to `PackedContext` classes that inherit from a generic `VulkanOpContext` class, which should reduce boilerplate code when implementing new ops that require prepacking (#78814, #78815, #78816, #78817, #78818, #82730, #83526)
- Code under `ATen/native/vulkan/api` was essentially rewritten to improve code organization and readability. The refactor implements RAII patterns for the classes used to represent Vulkan handles to facilitate proper resource management and re-designed how the `Context` class functions in order to enable concurrent execution of multiple Vulkan models. The stack of PRs containing these refactors can be found at #80699
- Lint is now enforced in `ATen/native/vulkan` (#81390)
- The VulkanMemoryAllocator version used was upgraded to 3.0.1, which now lives under `third_party` (#81472, #83906, #83934)
- Shader layouts are now automatically generated based on the GLSL code (#81715, #81716)
Distributed
torch.distributed
- Added `__all__` to torch.distributed and tensorboard submodules (#80444)
- Added `__all__` to torch.{fx, distributed, backends} submodules (#85079)
- Added `__all__` to fx, distributed and cuda submodules (#85080)
- Added `__all__` to torch.distributed, futures, fx, nn, package, benchmark submodules (#80520)
- Added `__all__` to torch.distributed submodules (#80523)
- Eliminated code duplication in distributed rendezvous (#81577)
- Refactored distributed to use absolute header path (#85780)
torch.distributed.elastic
- Added `__all__` for torch.nn.modules, torch.distributed.elastic, torch.nn.utils submodules (#80240)
- Fixed macos public bindings failures (#80970)
Distributed(c10d)
- Logged full rank fingerprint mismatches in ProcessGroupWrapper (#79901)
- Added environment parse function that supports default value (#85563)
- Added host and port to TCPStore pyi definition (#84636)
- Added underlying_store property for PrefixStore (#84640)
- Enabled per-thread ProcessGroup for testing (#84153)
- Moved ProcessGroup::Work into a separate class (#83680)
- Install c10d headers with absolute path (#86257)
Infra (RelEng)
- Migrated off xenial gcc5.4 from merge rules (#78137)
- Added functionality for rebasebot to rebase onto viable/strict branch (#78276)
- Pinned protobuf version to 3.20.1 in docker CI build (#78369)
- Removed gcc5.4 from docker/build.sh (#78405)
- Removed gcc5.4 jobs from CircleCI config (#78555)
- Added merge rules for “pytorch distributed” module (#78751)
- Added onnx / test to required merge rules (#78790)
- Added userbenchmark support to TorchBench CI (#78794)
- Installed torchdynamo as part of most CI jobs (#79051)
- Removed linux-xenial-py3_7-clang7-asan from merge rules (#79088)
- Ran torchdynamo tests on PyTorch Linux CI (#79099)
- Centralized commit pins in a folder (#79150)
- Moved CUDA flags out of --per_file_copts into the cu_library macro (#79414)
- Moved CI to cuda-11.6 (#79921)
- Enabled pytest to run test_ops, test_ops_gradients, test_ops_jit in non linux cuda environments (#79898)
- Upgraded pytorch nightly docker python version to 3.8 (#80051)
- Updated Dockerfile to install cmake as part of conda install (#80258)
- Re-enabled vulkan test (#81368)
- Enhanced mergebot with the feature of posting the PR Comment on cancel (#82744)
- Changed nccl build to be single-threaded (#83173)
- Added process for maintaining Build + CI contributors list (#83869)
- Implemented mechanisms to block land checks if the PR hasn't been approved yet (#84239)
- Allowed External Scripts (e.g. vscode) To Discover and Execute unittest Tests (#85584)
- Updated the pinned torchdynamo hash to `6ead5cae0d1234aa64db06fe230ef56e12ec76fe` (#85683)
- Updated the pinned torchvision hash to `d7d90f56117ce0955332846a5f90b8d1346c4c09` (#85776)
- Modified all functions (except factory functions) to support SymInt and updated the xla hash to `f2b36df6a1a80137eff7644e6d0f4eeb7ff429d6` (#86078)