PyTorch 1.13 Release Notes
- Highlights
- Backwards Incompatible Changes
- New Features
- Improvements
- Performance
- Documentation
- Developers
Highlights
We are excited to announce the release of PyTorch 1.13! This includes stable versions of BetterTransformer. We deprecated CUDA 10.2 and 11.3 and completed migration of CUDA 11.6 and 11.7. Beta includes improved support for Apple M1 chips and functorch, a library that offers composable vmap (vectorization) and autodiff transforms, being included in-tree with the PyTorch release. This release is composed of over 3,749 commits and 467 contributors since 1.12.1. We want to sincerely thank our dedicated community for your contributions.
Summary:
- The BetterTransformer feature set supports fastpath execution for common Transformer models during inference out-of-the-box, without the need to modify the model. Additional improvements include accelerated add+matmul linear algebra kernels for sizes commonly used in Transformer models, and Nested Tensors are now enabled by default. See the sketch at the end of this section.
- Timely deprecation of older CUDA versions allows us to adopt the latest CUDA versions as they are released by NVIDIA®, and hence enables support for C++17 in PyTorch and the new NVIDIA Open GPU Kernel Modules.
- Previously, functorch was released out-of-tree in a separate package. After installing PyTorch, a user will be able to `import functorch` and use functorch without needing to install another package.
- PyTorch is offering native builds for Apple® silicon machines that use Apple's new M1 chip as a beta feature, providing improved support across PyTorch's APIs.
The release blog post covering these new features, including the full Stable / Beta / Prototype feature table, can be found here.
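A minimal sketch of the BetterTransformer fastpath described above, assuming an encoder-only model run in inference mode (the layer sizes and shapes are illustrative, not taken from the release notes):
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6).eval()  # eval mode is required for the fastpath

src = torch.rand(32, 10, 512)                     # (batch, seq, feature)
pad_mask = torch.zeros(32, 10, dtype=torch.bool)  # True marks padded positions

with torch.inference_mode():                      # no autograd, so the fused fastpath can be taken
    out = encoder(src, src_key_padding_mask=pad_mask)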
Backwards Incompatible changes
Python API
uint8 and all integer dtype masks are no longer allowed in Transformer (#87106)
Prior to 1.13, key_padding_mask could be set to uint8 or other integer dtypes in TransformerEncoder and MultiheadAttention, which might generate unexpected results. In this release, these dtypes are not allowed for the mask anymore. Please convert them to torch.bool before using.
1.12.1
>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.uint8)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
# works before 1.13
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)
1.13
>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.bool)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)
Updated torch.floor_divide to perform floor division (#78411)
Prior to 1.13, torch.floor_divide erroneously performed truncation division (i.e. truncated the quotients). In this release, it has been fixed to perform floor division. To replicate the old behavior, use torch.div with rounding_mode='trunc'.
1.12.1
>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -1.])
1.13
>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -2.])
# Old behavior can be replicated using torch.div with rounding_mode='trunc'
>>> torch.div(a, b, rounding_mode='trunc')
tensor([ 2., -1.])
Fixed torch.index_select on CPU to error that index is out of bounds when the source tensor is empty (#77881)
Prior to 1.13, torch.index_select would return an appropriately sized tensor filled with random values on CPU if the source tensor was empty. In this release, we have fixed this bug so that it errors out. A consequence of this is that torch.nn.Embedding, which utilizes index_select, will error out rather than returning an empty tensor when embedding_dim=0 and the input contains out-of-bounds indices. The old behavior cannot be reproduced with torch.nn.Embedding; however, since an Embedding layer with embedding_dim=0 is a corner case, this behavior is unlikely to be relied upon.
1.12.1
>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
tensor([], size=(1, 0), grad_fn=<EmbeddingBackward0>)
1.13
>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
RuntimeError: INDICES element is out of DATA bounds, id=4 axis_dim=3
Disallow overflows when tensors are constructed from scalars (#82329)
Prior to this PR, overflows during tensor construction from scalars would not throw an error. In 1.13, such cases will error.
1.12.1
>>> torch.tensor(1000, dtype=torch.int8)
tensor(-24, dtype=torch.int8)
1.13
>>> torch.tensor(1000, dtype=torch.int8)
RuntimeError: value cannot be converted to type int8 without overflow
Remove deprecated torch.eig, torch.matrix_rank, torch.lstsq (#70982, #70981, #70980)
The deprecation cycle for the above functions has been completed and they have been removed in the 1.13 release.
torch.nn
Enforce that the bias has the same dtype as input and weight for convolutions on CPU (#83686)
To align with the implementation on other devices, the CPU implementation for convolutions was updated to enforce that the dtype of the bias matches the dtype of the input and weight.
1.12.1
# input and weight are dtype torch.int64
# bias is torch.float32
>>> out = torch.nn.functional.conv2d(input, weight, bias, ...)
1.13
# input and weight are dtype torch.int64
# bias is torch.float32
>>> with assertRaisesError():
>>> out = torch.nn.functional.conv2d(input, weight, bias, ...)
# Updated code to avoid the error
>>> out = torch.nn.functional.conv2d(input, weight, bias.to(input.dtype), ...)
Autograd
Disallow setting the .data of a tensor that requires_grad=True with an integer tensor (#78436)
Setting the .data of a tensor that requires_grad with an integer tensor now raises an error.
1.12.1
>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
>>> x
tensor([0, 0], requires_grad=True)
1.13
>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: data set to a tensor that requires gradients must be floating point or complex dtype
Added variable_list support to ExtractVariables struct (#84583)
Prior to this change, C++ custom autograd Functions considered tensors passed in a TensorList not to be tensors for the purposes of recording the backward graph. After this change, custom Functions that receive a TensorList must modify their backward functions to also compute gradients for these additional tensor inputs. Note that this behavior now differs from that of custom autograd Functions in Python.
1.12.1
struct MyFunction : public Function<MyFunction> {
static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
return 2 * tensors[0] + 3 * t;
}
static variable_list backward(
AutogradContext* ctx,
variable_list grad_output) {
return {3 * grad_output[0]};
}
};
1.13
struct MyFunction : public Function<MyFunction> {
static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
return 2 * tensors[0] + 3 * t;
}
static variable_list backward(
AutogradContext* ctx,
variable_list grad_output) {
return {3 * grad_output[0], 2 * grad_output[0]};
}
};
Don't detach when making views; force kernel to detach (#84893)
View operations registered as CompositeExplicitAutograd kernels are no longer allowed to return input tensors as-is. You must explicitly create a new tensor (e.g., using .alias()).
1.12.1
torch::Tensor view_op(const torch::Tensor& self) {
return self;
}
1.13
torch::Tensor view_op(const torch::Tensor& self) {
return self.alias();
}
ONNX
torch.onnx.register_custom_op_symbolic now only registers the symbolic function at the specified opset version (#85636)
This updates register_custom_op_symbolic's behavior to only register the symbolic function at a single version. This is more aligned with the semantics of the API signature. Previously, the API registered a symbolic function for all versions up to the specified version. As a result of this change, users will need to register a symbolic function at the exact version when they want to override an existing symbolic function. Users are not affected if (1) an implementation does not exist for the op, or (2) the symbolic function is already registered at the exact version used for export.
1.12.1
# Assuming an implemented symbolic function `custom_op_function`
torch.onnx.register_custom_op_symbolic("aten::foo", custom_op_function, 16)1.13
# Assuming an implemented symbolic function `custom_op_function`
for opset in range(1, 17):
torch.onnx.register_custom_op_symbolic("aten::foo", custom_op_function, opset)Default ONNX opset is updated to 14 (#83284)
This update is done regularly to stay in sync with ONNX releases. Users can specify opset_version in torch.onnx.export to keep exporting at opset version 13.
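For example, a minimal sketch of pinning the previous default opset during export (the model class and file path here are hypothetical):
import torch

model = MyModel().eval()                  # MyModel is an assumed user-defined module
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=13)  # keep exporting at opset 13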
torch.onnx.symbolic_registry is removed (#84382)
We removed the symbolic_registry module and hid it as an internal implementation detail. Users previously relying on the register_op function to register custom symbolic functions should move to use the torch.onnx.register_custom_op_symbolic API.
ScalarType and global variables in torch.onnx.symbolic_helper are removed (#82995)
The ScalarType class in torch.onnx.symbolic_helper, along with the global variables cast_pytorch_to_onnx, pytorch_name_to_type, scalar_name_to_pytorch, scalar_type_to_onnx and scalar_type_to_pytorch_type are removed from the module. Users previously using these global variables for PyTorch JIT-ONNX type conversion in symbolic functions should move to use the torch.onnx.JitScalarType class.
1.12.1
# 1
torch.onnx.symbolic_helper.scalar_type_to_onnx[
symbolic_helper.scalar_type_to_pytorch_type.index(x.dtype)
].value
# 2
torch.onnx.symbolic_helper.scalar_name_to_pytorch[element_type] in cast_pytorch_to_onnx.keys()
# 3
torch.onnx.symbolic_helper.cast_pytorch_to_onnx["Long"]
# 4
torch.onnx.symbolic_helper.cast_pytorch_to_onnx[tensor.type().scalarType()]
1.13
# 1
torch.onnx.JitScalarType.from_dtype(x.dtype).onnx_type()
# 2
torch.onnx.JitScalarType.from_name(element_type).onnx_compatible()
# 3
torch.onnx.TensorProtoDataType.INT64
# 4
torch.onnx.JitScalarType.from_name(tensor.type().scalarType()).onnx_type()
Distributed
In c10d collectives, input tensors must now all have the same dtype (#84664)
We added a check validating that all input tensors have the same dtype. Previously, users were allowed to pass in tensors with different dtypes for c10d collectives. Now, passing in tensors with different dtypes will throw a RuntimeError with the following message: “Invalid usage of tensors with different dtypes Found torch.float and torch.half”. Users can use tensor.to(dtype={some_dtype}) to fix this.
1.12.1
# users could pass inputs having different dtypes
>>> tensor = torch.ones(2, 2) * 7
>>> tensor_h = tensor.half()
>>> tensor_list = [torch.zeros(2, 2) for _ in range(4)] # Assume world_size = 4
# Both cases work.
>>> dist.all_gather(tensor_list, tensor)
>>> dist.all_gather(tensor_list, tensor_h)
...
1.13
# all inputs of c10d collectives need to have the same dtype
>>> tensor = torch.ones(2, 2) * 7
>>> tensor_h = tensor.half()
>>> tensor_list = [torch.zeros(2, 2) for _ in range(4)] # Assume world_size = 4
# Only allow same dtype for all input tensors.
>>> dist.all_gather(tensor_list, tensor) # RuntimeError thrown
...
Users doing wildcard imports of torch.distributed.distributed_c10d will no longer get non-public symbols (#84872)
We limit the usage of c10d APIs to public APIs, so if a user does a wildcard import and calls an internal API, it will fail. Please see the example below:
1.12.1
# users could import both public and non-public symbols:
from torch.distributed.distributed_c10d import *
>>> is_nccl_available() # public API
>>> _check_single_tensor(...) # Non-public API
...
1.13
# users can only import public symbols
from torch.distributed.distributed_c10d import *
is_nccl_available() # public API
_check_single_tensor(...) # Non-public API, this will fail now
...
Process Group C++ extensions must use absolute path when importing ProcessGroup.hpp (#86257), ProcessGroup::Work object moved out of ProcessGroup to its own Work class (#83680):
Details of the changes and the updated tutorial can be found in the PyTorch tutorial PR #2099
1.12.1
// users use relative path to import C++ headers and Work resides in ProcessGroup class
#include <c10d/ProcessGroup.hpp>
#include <c10d/Store.hpp>
#include <c10d/Types.hpp>
#include <c10d/Utils.hpp>
...
class WorkDummy : public ProcessGroup::Work {
...
}
1.13
// users must use absolute paths to import C++ headers, and Work is its own class
#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>
#include <torch/csrc/distributed/c10d/Store.hpp>
#include <torch/csrc/distributed/c10d/Types.hpp>
#include <torch/csrc/distributed/c10d/Utils.hpp>
...
#include <torch/csrc/distributed/c10d/Work.hpp>
class WorkDummy : public Work {
...
}
Quantization
Add required example_args argument to prepare_fx and prepare_qat_fx (#249) (#77608)
We added an additional required example_inputs argument to the prepare_fx and prepare_qat_fx APIs; it is used to do type inference to figure out the type information for each FX node in the graph.
1.12.1
m = resnet18(...)
m = prepare_fx(m, qconfig_dict)
# or
m = prepare_qat_fx(m, qconfig_dict)
1.13
m = resnet18(...)
m = prepare_fx(m, qconfig_dict, example_inputs=(torch.randn(1, 3, 224, 224),))
# or
m = prepare_qat_fx(m, qconfig_dict, example_inputs=(torch.randn(1, 3, 224, 224),))
Stop moving models to CPU in quantization convert (#80555)
Previously, we automatically moved the model to CPU in torch.ao.quantization.fx.convert to work around the issue where certain functions called by convert expect CPU arguments. This commit pushes this responsibility to the caller since it is the user's decision of which device to use.
1.12.1
model = resnet18(...)
model = prepare_fx(model, qconfig_mapping, example_inputs)
# calibrate
model = convert_fx(model)
1.13
model = resnet18(...)
model.cpu() # if needed
model = prepare_fx(model, qconfig_mapping, example_inputs)
# calibrate
model = convert_fx(model)
Replace the is_reference flag of the torch.ao.quantize_fx.convert_fx function with the convert_to_reference function (#80091, #81326)
This PR removes the is_reference flag from the existing convert_fx API and replaces it with a new convert_to_reference function. This separates (1) converting the prepared model to a reference model from (2) lowering the reference model to a quantized model, enabling users to call their custom lowering function for custom backends.
1.12.1
from torch.ao.quantization.quantize_fx import (
prepare_fx,
convert_to_reference,
)
prepared = prepare_fx(model, ...)
reference = convert_to_reference(prepared, ...)
1.13
from torch.ao.quantization.quantize_fx import (
prepare_fx,
convert_to_reference_fx,
)
prepared = prepare_fx(model, ...)
reference = convert_to_reference_fx(prepared, ...)
Add default configs for fixed qparams ops (#80184)
This commit adds qconfigs with special observers for fixed qparams ops (operators whose corresponding quantized version has fixed quantized parameters for output) like sigmoid in get_default_qconfig_mapping and get_default_qat_qconfig_mapping. For correctness, we also require users to use these special observers if we detect these fixed qparams ops in prepare.
1.12.1 (fails after this PR):
from torch.ao.quantization.quantize_fx import prepare_fx
model = ModelWithFixedQParamsOps()
qconfig_mapping = QConfigMapping()
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
1.13
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx
model = ModelWithFixedQParamsOps()
qconfig_mapping = get_default_qconfig_mapping()
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
Replace qconfig_dict with a typed QConfigMapping object (#78452, #79618)
Previously, FX graph mode quantization configurations were specified through a dictionary of qconfigs. However, this API was not in line with other core APIs in PyTorch. This commit replaces this dictionary with a config object that users will create and pass to prepare and convert. This leads to better type safety and a better user experience in notebook settings due to improved auto completion.
1.12.1 (deprecated)
from torch.ao.quantization.quantize_fx import prepare_fx
qconfig_dict = {
"": qconfig,
"object_type": [
(torch.nn.Linear, qconfig),
],
"module_name_regex": [
("foo.*bar", qconfig),
],
"module_name": [
("mod", qconfig),
],
}
prepare_fx(model, qconfig_dict)
1.13
from torch.ao.quantization import QConfigMapping
from torch.ao.quantization.quantize_fx import prepare_fx
qconfig_mapping = QConfigMapping() \
    .set_global(qconfig) \
    .set_object_type(torch.nn.Linear, qconfig) \
    .set_module_name_regex("foo.*bar", qconfig) \
    .set_module_name("mod", qconfig)
prepare_fx(model, qconfig_mapping)
Replace *custom_config_dict with typed config objects (#79066)
This commit replaces the following config dicts with python objects:
- prepare_custom_config_dict → PrepareCustomConfig
- convert_custom_config_dict → ConvertCustomConfig
- fuse_custom_config_dict → FuseCustomConfig
This leads to better type safety and a better user experience in notebook settings due to improved auto completion.
1.12.1
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
prepare_custom_config_dict = {
"float_to_observed_custom_module_class": {
"static": {
FloatClass: ObservedClass
}
},
"non_traceable_module_name": ["mod1", "mod2"],
"non_traceable_module_class": [class1, class2],
"input_quantized_idxs": [0, 1],
"output_quantized_idxs": [0],
"preserved_attributes": ["attr1", "attr2"],
}
convert_custom_config_dict = {
"observed_to_quantized_custom_module_class": {
"static": {
FloatClass: ObservedClass
}
},
"preserved_attributes": ["attr1", "attr2"],
}
model = prepare_fx(
model,
qconfig_mapping,
example_inputs,
prepare_custom_config_dict=prepare_custom_config_dict)
model(data)
model = convert_fx(model, convert_custom_config_dict=convert_custom_config_dict)
1.13
from torch.ao.quantization.fx.custom_config import (
PrepareCustomConfig,
ConvertCustomConfig,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
prepare_custom_config = PrepareCustomConfig() \
.set_float_to_observed_mapping(float_class, observed_class) \
.set_non_traceable_module_names(["mod1", "mod2"]) \
.set_non_traceable_module_classes([class1, class2]) \
.set_input_quantized_indexes([0, 1]) \
.set_output_quantized_indexes([0]) \
.set_preserved_attributes(["attr1", "attr2"])
convert_custom_config = ConvertCustomConfig() \
.set_observed_to_quantized_mapping(observed_class, quantized_class) \
.set_preserved_attributes(["attr1", "attr2"])
model = prepare_fx(
model,
qconfig_mapping,
example_inputs,
prepare_custom_config=prepare_custom_config)
model(data)
model = convert_fx(model, convert_custom_config=convert_custom_config)
Remove remove_quant_dequant_pairs and fix tests (#84203)
This PR removes some passes in convert_fx and also fixes the way we quantize the layer_norm operator, so the qconfig for the layer_norm op needs to be updated as well.
1.12.1
import torch
from torch.ao.quantization.qconfig_mapping import QConfigMapping, QConfig
from torch.ao.quantization.observer import default_weight_observer
from torch.ao.quantization.backend_config import (
DTypeConfig,
ObservationType,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
qconfig = QConfig(activation=qconfig.activation, weight=default_weight_observer)
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.LayerNorm, qconfig) \
    .set_object_type(torch.nn.functional.layer_norm, qconfig)
# assuming mymodel contains a LayerNorm layer or torch.nn.functional.layer_norm
m = MyModel()
example_inputs = (torch.rand(3, 3),)
m = prepare_fx(m, qconfig_mapping, example_inputs)
1.13
import torch
from torch.ao.quantization.qconfig_mapping import QConfigMapping, QConfig
from torch.ao.quantization.observer import default_placeholder_observer
from torch.ao.quantization.backend_config import (
DTypeConfig,
ObservationType,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
qconfig = QConfig(activation=qconfig.activation, weight=default_placeholder_observer)
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.LayerNorm, qconfig) \
    .set_object_type(torch.nn.functional.layer_norm, qconfig)
# assuming mymodel contains a LayerNorm layer or torch.nn.functional.layer_norm
m = MyModel()
example_inputs = (torch.rand(3, 3),)
m = prepare_fx(m, qconfig_mapping, example_inputs)
Align observer dtype with reference model spec (#85345)
Before this PR, the dtype attribute of observers was not clearly defined. It originally meant interface_dtype in the eager mode workflow, which is how the codebase before this PR is using it. In the new reference model spec, dtype attribute of an observer represents the dtype value which needs to be passed into a quantize function in the reference model spec. This PR aligns the codebase to this definition of dtype.
1.12.1
dynamic_quant_observer = PlaceholderObserver.with_args(
    dtype=torch.float, compute_dtype=torch.quint8)
1.13
dynamic_quant_observer = PlaceholderObserver.with_args(
    dtype=torch.quint8, compute_dtype=torch.quint8)
Composability
Changed the backend C++ kernel representation for some operators that take in lists of tensors (#73350)
If an operator in ATen takes in a list of tensors, and is marked as “structured” in native_functions.yaml (example), then previously, TensorList was represented as at::TensorList, or c10::ArrayRef<at::Tensor>. Now, it is represented as a more efficient type: const ITensorListRef&.
1.12.1
at::Tensor cat_kernel(at::TensorList tensors,int64_t dim) {
...
}
TORCH_LIBRARY_IMPL(aten, dispatch_key, m) {
...
m.impl("cat", &cat_kernel);
}
1.13
at::Tensor cat_kernel(const at::ITensorListRef& tensors,int64_t dim) {
...
}
TORCH_LIBRARY_IMPL(aten, dispatch_key, m) {
...
m.impl("cat", &cat_kernel);
}
C++ API
Lowered randint default dtype to the C++ API (#81410)
Prior to 1.13, the default for the dtype argument of torch.randint, torch.long, was set via manual python binding. However, in the C++ API, torch::randint would default to the global default data type, which is usually float. In 1.13 we changed the default for dtype in the C++ API to int64 in order to match the python API. To reproduce the old behavior, one can set the dtype argument.
1.12.1
torch::randint(/*low=*/0, /*high=*/10, {2, 3});
1.13
// assuming default dtype is float
torch::randint(/*low=*/0, /*high=*/10, {2, 3}, torch::kFloat);
Enabled dim=None for torch.{std, var, std_mean, var_mean} (#81845, #82765, #82912)
Prior to 1.13, a C++ API call that has argument types torch::{std, var, std_mean, var_mean}(Tensor, OptionalIntArrayRef, int64_t, bool) used to resolve to the {std, var, std_mean, var_mean}.correction overload. In this release, it resolves to the {std, var, std_mean, var_mean}.dim overload. With the .correction overload, the third argument of type int64_t could be used to pass a correction δN other than 1. In order to call the {std, var, std_mean, var_mean}.correction overload in 1.13, the old int64_t argument can be wrapped in a c10::optional.
1.12.1
// using std as an example
int64_t correction = 2;
torch::std(t, /*dim=*/dim, /*correction=*/correction, /*keepdim=*/true);
1.13
// To replicate in 1.13 using std as an example
auto correction = c10::make_optional<int64_t>(2);
torch::std(t, /*dim=*/dim, /*correction=*/correction, /*keepdim=*/true);
Deprecations
Distributed
We are deprecating the following APIs of c10d: *_coalesced APIs (#85959), *_multigpu APIs (#85961) and ProcessGroupRoundRobin (#85158)
We added warnings when users call c10d’s *_coalesced, *_multigpu and ProcessGroupRoundRobin APIs. Previously, users could use these APIs without any warnings, but now they will see warnings like “torch.distributed.all_reduce_coalesced will be deprecated. If you must use it, please revisit our documentation later at https://pytorch.org/docs/master/distributed.html#collective-functions”. There are still workarounds for the *_coalesced APIs, but no workarounds will be provided for the other two.
1.12.1
# users could use the following APIs with no warnings:
all_reduce_coalesced(...)
all_gather_coalesced(...)
broadcast_multigpu(...)
all_reduce_multigpu(...)
reduce_multigpu(...)
all_gather_multigpu(...)
reduce_scatter_multigpu(...)
...
1.13
# users can still use these APIs but it will come with warnings:
all_reduce_coalesced(...)
# Warnings:
# torch.distributed.all_reduce_coalesced will be deprecated. If you must
# use it, please revisit our documentation later at
# https://pytorch.org/docs/master/distributed.html#collective-functions"
# Potential workaround:
reqs = []
with dist._coalescing_manager(group, reqs):
reqs.append(dist.all_reduce(tensor1, async_op=True))
reqs.append(dist.all_reduce(tensor2, async_op=True))
for req in reqs:
req.wait()
...
We are deprecating passing optim_input into the FSDP optimizer state checkpointing APIs
The user can simply not pass the optim_input argument, and all behavior is preserved. No fix is needed from the user's side for now.
1.12.1
# the user can use the following APIs with no warnings
full_optim_state_dict(...)
sharded_optim_state_dict(...)
shard_full_optim_state_dict(...)
flatten_sharded_optim_state_dict(...)
scatter_full_optim_state_dict(...)
rekey_optim_state_dict(...)
1.13
# users can still use these APIs, but they will come with warnings
# The `optim_input` argument is deprecated and will be removed after PyTorch 1.13.
# You may remove it from your code without changing its functionality.
LinAlg
Deprecate torch.lu in favor of linalg.lu_factor (#77636)
The new operation has a cleaner API and better docs. The update rule is as follows:
1.12.1
LU2, pivots2, info = torch.lu(A, compute_pivots, get_infos=True)
LU1, pivots1 = torch.lu(A, compute_pivots)
1.13
LU2, pivots2, info = torch.linalg.lu_factor_ex(A, compute_pivots)
LU1, pivots1 = torch.linalg.lu_factor(A, compute_pivots)
Deprecate torch.lu_solve in favor of linalg.lu_solve (#77637)
The new operation has a notation consistent with linalg.solve, and has an extra parameter adjoint=False. The update rule is as follows:
1.12.1
X = torch.lu_solve(B, LU, pivots)
1.13
X = linalg.lu_solve(LU, pivots, B)
ONNX
Monkey-patched convenience methods on torch._C.Graph, torch._C.Block and torch._C.Node are deprecated (#83006)
Deprecated methods include Graph.op(), Graph.constant(), Graph.at(), Block.op(), and Node.__getitem__(). Previously, these methods were patched into the classes above when users called torch.onnx.export() and are typically used in custom symbolic functions. Users can continue to expect g.op() and g.at() in symbolic functions to work. The g parameter has been substituted by the GraphContext object (#84728). The methods are now exposed by the GraphContext class with APIs unchanged. Users should not rely on the Graph.op(), Graph.constant(), Graph.at(), Block.op(), Node.__getitem__() methods when directly interacting with the C classes, and should use only the op() and at() methods of the GraphContext object, as other fields in the class will change in future releases.
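A hedged sketch of a custom symbolic function under the new scheme; the g parameter is the GraphContext object, and the operator name mylib::my_relu is hypothetical:
import torch

def my_relu_symbolic(g, input):
    # g.op() continues to work; only direct use of the monkey-patched torch._C.Graph methods is deprecated.
    return g.op("Relu", input)

torch.onnx.register_custom_op_symbolic("mylib::my_relu", my_relu_symbolic, 16)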
New features
Python API
- Added a deterministic implementation of `scatter_add` on CUDA for all input sizes (#79466)
- Added `torch.concatenate` that aliases `torch.cat` (#85073)
- Added `Tensor.is_cpu()` that returns whether a tensor is on CPU (#78887)
- Added a `force` kwarg to `Tensor.numpy()` that enables returning a numpy `ndarray` that does not share storage with the tensor (#78564)
- Added `torch.special.{airy_ai, bessel_j0, bessel_j1, bessel_y0, bessel_y1, modified_bessel_i0, modified_bessel_i1, modified_bessel_k0, modified_bessel_k1, scaled_modified_bessel_k0, scaled_modified_bessel_k1, spherical_bessel_j0}` (#78900), (#78901), (#78902), (#78912), (#78451)
- Added `torch.special.{chebyshev_polynomial_t, chebyshev_polynomial_u, chebyshev_polynomial_v, chebyshev_polynomial_w, hermite_polynomial_h, hermite_polynomial_he, laguerre_polynomial_l, legendre_polynomial_p, shifted_chebyshev_polynomial_t, shifted_chebyshev_polynomial_u, shifted_chebyshev_polynomial_v, shifted_chebyshev_polynomial_w}` (#78196), (#78293), (#78304), (#78366), (#78352), (#78357)
- Added `weights_only` option to `torch.load` that restricts loading to `state_dict` only, enabling safe loading. This can also be set using the `TORCH_FORCE_WEIGHTS_ONLY_LOAD` environment variable (#86812); see the sketch after this list
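A minimal sketch of safe loading with the new flag (the checkpoint path and model are hypothetical):
import torch

state_dict = torch.load("checkpoint.pt", weights_only=True)  # refuses arbitrary pickled objects
model.load_state_dict(state_dict)                            # `model` is assumed to be defined elsewhere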
Build
- Added `-Werror=unused-but-set-variable` build flag (#79305)
- Added ability to get release versions based on the current tag (#78584)
- Added `-Werror=type-limits` in Bazel CPU build (#79139)
- Added `-Werror=unused-variable` in Bazel CPU build (#79156)
- Added `--config=shell` to bazelrc file for easier debugging (#79350)
- Added clang `-Wconstant-conversion` to catch errors detected in #75400 (#80461)
- Added `-Werror=non-virtual-dtor` build flag (#81012)
- Turned on pocketfft flag for third-party pocket_fft library (#81670)
- Updated NCCL to v2.13.4-1 (#82775)
- Added `-Wunused-local-typedef` build flag (#86154)
- Increased max python version to include 3.10 (#84815)
Complex
- Added complex half support for:
  - [CPU] `torch.{index_select, index_add}` (#79217), (#79897)
  - [CUDA] `torch.roll` (#79970), `torch.fft.{fftshift, ifftshift}` (#79970), `torch.{acos, acosh, asinh, atanh}` (#80030), `torch.{cos, sinh, cosh, tanh}` (#78718), `torch.{sqrt, rsqrt}` (#77490), `torch.{triu, tril, diag, trace}` (#78062)
  - [CPU and CUDA] `torch.{where, pow, masked_fill, sgn, tan, angle}` (#78665)
- Added complex support for `torch.nn.ConvTranspose1d` (#79694)
torch.nn
- Added `pop` function to `nn.Sequential` and `nn.ModuleList` (#81601); see the sketch after this list
- Added deepcopy support for parametrized `nn.Module` (#80811)
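A small sketch of the new pop method on nn.Sequential (the layer sizes are illustrative):
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
removed = model.pop(1)  # removes and returns the ReLU; two submodules remain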
torch.optim
- Added maximization support via the `maximize` kwarg for `optim.SparseAdam` (#80336), `optim.ASGD` (#81875), `optim.Rprop` (#81864), `optim.RMSprop` (#80326); see the sketch after this list
- Added support for differentiable optimizers via the `differentiable` kwarg for `optim.SGD` (#80938), `optim.Adam` (#82205), `optim.RMSprop` (#83578)
- Added complex number support for `optim.Adam` (#80279), `optim.AdamW` (#80280), `optim.Adamax` (#80319), `optim.RMSprop` (#83860), `optim.Rprop` (#83858)
- Handled complex params as independent real params in `optim.{RMSprop, ASGD}` (#83860), (#84472)
- Added `optim.lr_scheduler.PolynomialLR` (#82769)
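A minimal sketch of the maximize kwarg, shown here with optim.RMSprop (the toy objective is illustrative):
import torch

params = [torch.randn(3, requires_grad=True)]
opt = torch.optim.RMSprop(params, lr=1e-2, maximize=True)  # ascend the objective instead of descending

objective = (params[0] ** 2).sum()
objective.backward()
opt.step()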
BetterTransformer
- Allowed user to assert no mask contiguous check is necessary (#82533)
- Added support for norm_first in nn.TransformerEncoderLayer fast path (#78269)
- Added custom scaled dot product implementations (dense) (#85984)
- Added Better Transformer fastpath diagnostics (#81013)
ForEach
- Implemented inplace foreach `maximum` and `minimum` (#82523)
LinAlg
- Added `linalg.lu_solve`, `linalg.solve_ex`, `linalg.vecdot`, `linalg.vander` (#77634, #80073, #70542, #76303)
Sparse
- Added `torch.sparse.spdiags` for easier creation of diagonal sparse matrices (#78439); see the sketch after this list
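A small sketch of torch.sparse.spdiags, assuming the (diagonals, offsets, shape) calling convention:
import torch

diags = torch.tensor([[1., 1., 1., 1., 1.],
                      [2., 2., 2., 2., 2.]])
offsets = torch.tensor([0, 1])                    # main diagonal and first super-diagonal
A = torch.sparse.spdiags(diags, offsets, (5, 5))  # 5x5 sparse diagonal matrix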
torch.fx
- Enabled symbolic shapes (#82063, #82317, #82209, #83380, #85808, #84113, #84829, #84918, #85185, #85261, #85260, #85754, #85768, #86050, #86098, #86067)
- Created an improved version of subgraph matcher (#82090, #82853, #85444, #85456, #85617)
- Rewrite subgraph_rewriter with subgraph_matcher (#83717)
- Added PassBase for writing passes, PassResult for the return value of passes, and a PassManager for managing the workflow of passes (#79878, #81366, #80531, #82485, #83933, #84094, #84425, #84232)
- Added an FX graph partitioner and fuser (#79439, #80292)
- Added a reinplacing FX pass (#80897, #83626, #83845, #83846)
- Added a CSE pass to the common passes (#81512, #81530, #81742)
- Created DecompositionInterpreter for decomposing aten → prims after an initial make_fx call (#79989)
- Created a Backend for NvFuser based graph partitioner + Prims (#80591, #81311, #81436, #81911)
- Created a Backend for Cudagraphs from dynamo (#80566)
- Created a type constraint generator to Z3 (#79912, #80084, #80095, #80102, #80110, #80147, #80744, #80799, #80823, #80847, #80909, #80925, #80976, #81159, #81175, #81189, #81190, #81265, #81274, #81344, #81360, #81376, #81445, #81516, #81527, #81714, #82163, #82590, #82597, #82614, #82742, #82856, #82923,#82938,#83087, #83109, #83194, #83334, #83682, #83945)
JIT
- Added new NVFuser Python Frontend Record Keeping for Cache enablement. (#81578)
- Added `torch.ops.nvprims` namespace for nvFuser-specific prims (#82155)
- Enabled fusion of conv with elementwise OP in NNC (#77157)
- Added symbolic shape functions for `conv_transpose2d.input`, `convolution`, `convolution_backward` (#77283, #83557, #80860)
- Added support in symbolic shapes for generalized lists of tensor shapes, tuple outputs, optional None, upper and lower bounds (#77389, #83092, #83222, #78679)
- Added support for `aten::_convolution` when it is a 2D conv in NNC (#84038)
- Exposed `ProcessGroup::Work.wait()` API to TorchScript (#83303)
ONNX
- Inlined `prim::PythonOp` for Autograd Function Export (#74765)
AMD
- Enabled nvfuser (#82498)
CUDA
- Added CUDA trace Python hooks (#82824)
- Added CUDA Sanitizer (#83984)
- Added support for multiple outputs in python jiterator (#77921, #78139)
Intel
- Added a launch script with Best Recipe of Deep Learning on Intel Xeon CPU (#63932)
- Enabled Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (#63289)
- Added unified x86 quantization backend (#84329)
MPS
- Added `aten::index_add.out` operator for MPS backend (#79935)
- Added `aten::prelu` operator for MPS backend (#82401)
- Added `aten::bitwise_not` operator native support for MPS backend (#83678)
- Added `aten::index_put` operator for MPS backend (#85672)
- Added `aten::upsample_nearest1d` operator for MPS backend (#81303)
- Added `aten::bitwise_{and|or|xor}` operators for MPS backend (#82307)
- Added `aten::index.Tensor_out` operator for MPS backend (#82507)
- Added `aten::masked_select` operator for MPS backend (#85818)
- Added `aten::multinomial` operator for MPS backend (#80760)
Profiler
- Integrated Execution Graph Observer into PyTorch Profiler (#75358, #79753, #82895, #84285)
- TorchTidy: experimental tool to identify anti-patterns from traces (#79631, #79874, #79993, #80094, #80108, #80572, #81056, #81273, #81501, #81733, #81740, #81921, #82421, #82248, #82261, #82782)
- Added reporting for OOM events to the Pytorch Profiler. (#80050)
Vulkan
- Added Vulkan support for the following operators:
- Prototype implementations for Quantized Tensors were added (#81491). These implementations still need to be exposed to Torchscript, but so far prototype implementations for the following ops have been added:
Mobile
- Added support for dtypes and custom classes in model tracer (#84795)
- Extended Flatbuffer to get mobile_info for NMLML workflows (#78306)
- Added serialization/deserialization of Sparse Quantize Linear Packed Params (#80474)
- Added qnnpack bcsr matrix unpacking and use unpacking in Linear module (#80475)
- Added OwnedOrBorrowedVector for QNNPack BCSR Indices/Values (#80476)
Distributed
Distributed Checkpointing (Prototyping)
- This is a prototyping effort which enables loading and saving PyTorch models from one or more hosts. Models can use features such as DDP, FSDP and ShardedTensor and they can have a different configuration between saving and loading - for example, save from 4 hosts and load from a single host. Distributed checkpointing has an extensibility API that enables full control of how a model is saved; and a pluggable IO backend. (#83781, #83419, #84952, #84881)
Distributed(c10d)
- Made c10d collective ops dispatcher-passable, which allows tracing mechanisms such as LazyTensor and AOTAutograd to observe communications, e.g. broadcast (#76722), allreduce (#79582), allgather (#79669), reduce_scatter (#79683), reduce (#79686), gather (#79687), scatter (#79688), alltoall (#79691), barrier (#79777), send/recv (#79779).
- Added UCC process group (#79918)
- Enabled uneven input support for `all_gather` (#83713) and uneven output support for `reduce_scatter` (#87010)
- Added NCCL PreMul Sum to c10d `ReduceOp` (#84243)
DistributedDataParallel
- Made DDP work with Python process group (#79176)
- Enabled Zero1's ddp_with_overlap for hpu backend (#80438)
FullyShardedDataParallel
- Added forward prefetching option in FSDP API (#85177)
- Added fp16 and bf16 hooks for FSDP (#81711)
- Implemented `sharded_optim_state_dict` and `flatten_sharded_optim_state_dict` (#77628)
- Added rate limiter (#83917). Thanks to the IBM Research team, @lchu-ibm, for his contributions to FSDP and @hfwen0502 for the experimental testbed that identified the issues.
- Added an option to keep grads in lower prec (#85223)
torch.distributed.elastic
- Added watchdog to TorchElastic agent and trainers (#84081)
Activation Memory Management (Prototyping)
- We offer a new API, `torch.distributed.algorithms.checkpoint.checkpoint_wrapper`, to wrap `nn.Module`s with activation checkpointing or activation offloading, making it easy to use and experiment with activation checkpointing techniques without modifying model code. This makes it simpler to leverage activation checkpointing to reduce the memory footprint of your training applications and train larger models; see the sketch after this list. (#83035, #78704, #78854, #79830, #80089, #84907, #84908, #85448, #85449)
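A hedged sketch of wrapping a submodule with activation checkpointing; the import path below is an assumption on my part (the wrapper has also lived under a private _checkpoint submodule), so adjust it to the path documented for your release:
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import checkpoint_wrapper

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
wrapped = checkpoint_wrapper(block)  # activations are recomputed during backward instead of stored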
Infra (RelEng)
- Enabled multigpu unittests on FSDP (#77947)
- Added feature to do rebase (via comment) onto any branch (#78772)
- Added implementation to allow PR collaborators to revert their PRs (#82360)
- Added torchvision onto the commit pins file (#79151)
- Turned on `-Werror=all` with a few exceptions in Bazel build for CUDA (#79306)
- Prepared for running PyTorch tests with TorchDynamo and skips for known failing tests (#80106)
- Added ROCm build to pull request jobs (#80149)
- Added dynamo test configuration (#80342)
- Enabled ROCm CI for trunk test (#80920)
- Added linux cuda 11.7 workflows (#81089)
- Updated CI docker images and jobs to ROCm5.2 (#81168)
- Added UCC PG build in CI (#81583)
- Enabled periodic builds for CUDA 11.7 (#81688)
- Enabled distributed tests for ROCm (#81751)
- Added New TORCH_UCC_BLOCKING_WAIT env variable (#81791)
- Change functorch pin mechanism to test functorch in pytorch/pytorch now that functorch is inside pytorch/pytorch (#81918)
- Added Python 3.11 nightlies for Linux PyPi (Please note that 3.11 binaries are not fully functional) (#82302)
- Updated ROCm nightly builds to rocm5.2 (#82353)
- Add functorch target to cmake (#83464)
- Upgraded CUDNN version for cuda 11.7 (#84964)
- Enabled pytest-shard for functorch (#85321)
- Enabled CI to run test_ops in parallel (#85528)
- Updated trunk CUDA-10.2 to CUDA-11.7 (#85943)
- Added support for building and running Metal tests in CI (#86073)
- Bumped nvidia docker version and using python 3.10 for cuda11.7 (#82472)
Improvements
Python API
- Added `float16` support for `torch.{arange, linspace}` (#80492)
- Added integer support to `torch.index_reduce` (#80464)
- Added a `stable` kwarg to `torch.argsort` that controls the relative order of equivalent elements (#75162); see the sketch after this list
- Improved stability of `torch.distributions.kl_divergence` for two Bernoulli distributions (#79944)
- Improved type annotations for `torch.{as_tensor, as_subclass}` (#86105)
- Added type promotion support for `torch.{addcmul, addcdiv}` (#74234)
- Added `bfloat16` support for `torch.save` with XLA/HPU tensors (#77534)
- Improved wrapper subclass detection for serialization (#81105)
- Updated python API `TensorOptions` signatures for consistency with JIT schemas (#82241)
- Allowed disabling of `torch.library.Library` with PYTORCH_DISABLE_LIBRARY (#85190)
- Enabled `dim=None` for `torch.{mean, sum, nanmean, nansum}` (#81286), (#79881), (#82912)
- Added feature to enable registration of extension device modules as a native module under the torch namespace (#78329)
- Added `logsumexp` to `amp.autocast` (#76330)
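A one-line sketch of the stable sort order mentioned above:
import torch

x = torch.tensor([2, 1, 2, 1])
idx = torch.argsort(x, stable=True)  # equal elements keep their original relative order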
C++ API
- Allowed `const T&` access to `ListElementReference` (#83177)
- Redirected print messages to `stderr` in `torch.utils.cpp_extension` (#82097)
- Updated CUDA compiler matrix in `torch.utils.cpp_extension` (#82860)
- Added `__all__` to `torch.utils.cpp_extension`, `torch.utils.hooks` and `torch.utils.show_pickle` (#85331)
Autograd
- Added forward AD coverage for `torch.{amin, amax, nansum, nanmean}` (#80082), `torch.scatter_reduce` (except `reduction=prod`) (#85000), `torch.linalg.det` (#79487), `torch.{elu_, celu_, selu_}` (#83080)
- Added forward-over-reverse AD coverage for `nn.functional.binary_cross_entropy` (#77852), `nn.functional.embedding` (#79699), `nn.functional.{mse_loss, softplus, l1_loss, smooth_l1_loss, prelu, hardswish}` (#78740), `nn.functional.{nll_loss, batch_norm, layer_norm, group_norm, cross_entropy, soft_min}` (#84976), `torch.{log_softmax, softmax}` (#84976), `torch.{amin, amax, nansum}` (#80082)
- Added support for a stable double backward on `torch.linalg.det` for real inputs (#80217)
- Added support for kwargs input to the function when using `torch.utils.checkpoint` with `use_reentrant=False` (#80987)
- Added a context manager to disable saved tensor hooks: `torch.autograd.graph.disable_saved_tensors_hooks` (#85971); see the sketch after this list
- Added a new C++ custom function API to inform the backward function whether a gradient is necessary to compute: `ctx->needs_input_grad(idx)` (#82544)
- Added all device types in the pybinded DeviceType enum (#83676)
- Added `check_nan` flag to `torch.autograd.detect_anomaly` which enables users to run anomaly mode without NaN checking (#83481)
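A minimal sketch of the saved-tensor-hooks context manager mentioned above (the error message is arbitrary):
import torch

x = torch.randn(2, requires_grad=True)
with torch.autograd.graph.disable_saved_tensors_hooks("saved-tensor hooks are disabled in this region"):
    y = (x * x).sum()  # autograd works as usual; registering saved-tensor hooks here would raise
y.backward()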
Build
- Specify "Generic" BLAS library name to ensure PyTorch can find the BLAS llibrary (#74269)
- Generate CUDAConfig.h only for CUDA builds (#78218)
- Moved build_variables.bzl and ufunc_defs.bzl from pytorch-root/tools/ to PyTorch root directory (#78542)
- Made lintrunner compatible with M1 (#78628)
- BLAS library is linked privately instead of being linked publicly (#78883)
- Updated build targets to include generated enum_tag.cpp (#79668)
- Use miopen_LIBRARIES and rccl_LIBRARIES directly, when they are valid target for RCCL (#80446)
- Deleted Win specific case for CMake older than 3.1 (#81411)
- Split `.cu` to improve compile times (#81193)
- Added `append_cxx_flag_if_supported` macro (#82883)
torch.nn
- Improved `groups` argument validation for `nn.Conv{1,2,3}d` modules (#77919)
- Improved error message for convolution backward fallback kernel (#81538)
- Reduced memory usage of `nn.Module` full backward hooks by removing reference cycles (#80139)
- Improved `kl_div` at boundary and its general implementation (#80334)
- Improved input shape validation for MKL-backed convolution operations (#76526)
- Improved input validation for `nn.AdaptiveAvgPool2d` (#84061)
- Improved `groups` argument validation for `nn.Conv{1,2,3}d` (#85248)
- Improved input index validation for `nn.MaxUnpool{2,3}d` (#78280)
- Improved listing of public APIs for `optim` and `nn` (#80237)
- Added new operators for `nn.Sequential`: `+` (#81170), `extend` (#81179), `insert` (#81402), `+=`, `*` and `*=` (#81279)
- Added deepcopy support for uninitialized parameter (#83809)
- Added nondeterministic alert for `nn.MaxUnpool{1,2,3}d` (#84766)
- Added Bfloat16 support for the backward pass of `nn.functional.kl_div` on CUDA (#77676)
torch.optim
- Added support for optimizers with more than 2 betas for LRScheduler (#84486)
- Added `fused` kwarg to `optim.Adam` to enable a fused implementation on CUDA (#85739); see the sketch after this list
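A short sketch of the fused Adam implementation, which requires CUDA parameters (sizes are illustrative):
import torch

model = torch.nn.Linear(10, 10).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)  # single fused kernel per step

loss = model(torch.randn(4, 10, device="cuda")).sum()
loss.backward()
opt.step()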
Composability
- Significant hardening and improvements to the
functionalize()API that lives with functorch (#77129, #77126, #77125, #78199, #77132, #77713, #77714, #78819, #78820, #82008, #82009, #81702, #80416, #80418, #80251, #80526, #82326, #81454, #81471, #83542, #83701, #85975) - Allow
__torch_dispatch__subclasses and modes to override more tensor metadata: device/size/stride/dim (#77684, #77970, #78646, #78691) - Improvements to the
torch.libraryAPI, for registering python functions to the pytorch dispatcher: - Ported
cholesky,linalg_qr,linalg_eighandlinalg_eighvalshto structured kernels, giving them support with meta tensors (#79300, #79054, #79072) - Added python decompositions for many torch operators. This adds meta tensor coverage for a large number of pytorch operators (#77930, #79768, #79808, #84062, #84350, #80219, #78350, #79667, #81003, #81420, #81113, #81241, #81765, #82284, #80497, #80358, #80182, #80737, #81734, #81826, #78461, #78468, #78525, #78914, #78919, #79900, #79225, #80964, #83235, #84108, #84451, #78602, #78603, #78527, #78604, #78992, #78993, #78997, #79278, #79341, #79311, #79411, #79581, #81800, #79834, #82309, #79975, #82587, #82603, #83191, #84349, #84460, #85793, #86057)
- Beefed up API for printing out operators registered to the dispatcher (#78995)
- Trued up
c10::FunctionSchema::operator<<to print native_functions.yaml syntax (#79645) - Made it so that it is valid to set metadata after detach calls, like
x.detach().resize_(...)(#83590) - Optimized
torch.ops.ns.opname.overloadaccessor in__torch_dispatch__(#85132)
Dataloader
- Added shape checking on argument `weights` for `WeightedRandomSampler` (#78585)
- Added support for `random_split` to accept percentages as `lengths` (#78877); see the sketch after this list
- Extended collate function so that users can register collate functions to handle specific batch types (#85748)
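A small sketch of random_split with fractional lengths (the dataset is a toy example):
import torch
from torch.utils.data import TensorDataset, random_split

ds = TensorDataset(torch.arange(10).float())
train, val = random_split(ds, [0.8, 0.2])  # fractions are now accepted in addition to absolute lengths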
Functorch
- `functorch.jacfwd` now accepts a `randomness` kwarg (#84220)
- Improved the error message when using `vmap` on a function with no Tensor inputs (#83016)
- Relaxed the `Tensor.as_strided` batching rule. This is a primitive used in forward-mode AD (among other things) and improves composability of vmap with other transforms (like jvp)
- `functorch.functionalize`: added support for in-place views on inputs (#83993)
- `functorch.functionalize`: moved this API out of the `functorch.experimental` namespace (#85742)
- Added vmap support for `linalg.cholesky`, `linalg.eigvals`, `linalg.eigvalsh`, `linalg.matrix_norm`, `linalg.matrix_power`, `linalg.norm`, `linalg.tensorinv`, `linalg.solve_triangular` (#82177)
- Added vmap support for `linalg.solve` (#82814)
- Added vmap support for `linalg.cross` (#83759)
- Added vmap support for `linalg.matrix_rank` (#83760)
- Added vmap support for `linalg.pinv` (#83761)
- Added vmap support for `Tensor.fill_` (#84015)
- Added vmap support for `linalg.lstsq` (#82325)
- Added vmap support for `linalg.lu_solve` (#85175)
LinAlg
- Added a `driver=` kwarg to `torch.linalg.svd` and `svdvals`; added the cusolver gesvdaStridedBatched driver to `linalg.svd` (#74521); see the sketch after this list
- Added opt_einsum backend to `torch.einsum` (#86219)
- Added path optimize kwarg to `einsum` (#84890)
- Call view instead of sum in `einsum` to remediate MPS regression (#87135)
- Ensure that we contract left to right in `einsum` (#87199)
- Fixed opt_einsum defaults to be more reasonable (#86985)
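A hedged sketch of the driver= hint on CUDA; the driver name below is a cuSOLVER-specific value and should be treated as an assumption:
import torch

A = torch.randn(64, 128, device="cuda")
U, S, Vh = torch.linalg.svd(A, full_matrices=False, driver="gesvdj")  # pick a cuSOLVER SVD driver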
Sparse
- Added `sparse_dim` and `dense_dim` for batched, hybrid CSR/CSC/BSR/BSC (#80565, #80901)
- Added support for conversion between batched CSR/CSC/BSR/BSC and dense Tensors (#80781, #83084, #83086, #78025, #80354, #82120)
- Added support for conversion between CSR and CSC (#85091)
- Added support for conversion between BSR and BSC (#85091)
- Added partial support for CSR/CSC/BSR/BSC inputs to `mm`, `addmm`, `matmul` and `F.linear` (#85551, #85308, #85379, #85307)
- Added support for COO to `permute` (#79707)
- Added support for ComplexHalf to `torch.nonzero` and `add(dense, CSR)` (#79062)
- Added support for CSC/BSR/BSC to unary zero-preserving functions (#78173, #85031)
- Added support for batched BSR/BSC to `transpose` (#82122)
- Added support for scalar together with COO inputs to `mul` (#82962)
- Added support for CSC/BSR/BSC to `empty_like` (#82310)
- Added support for batch dims of CSR/CSC/BSR/BSC to `select` (#82119)
torch.fx
- In constant folding, added `device_for_folded_attrs` parameter and set the `requires_grad` option for a folded tensor (#79067)
- Mode-based tracing in make_fx (#79638, #84238)
- Made executor handle kwargs (#79858)
- Added `ignore_parameters_and_buffers` flag to FxGraphDrawer (#79982)
- Enabled an `is_fx_tracing` flag in the FX tracer (#80255)
- Attached ProxyTorchDispatchMode to ProxyTensor and used it in `__torch_dispatch__` (#82549)
- Used `enable_tracing` flag for ProxyTorchDispatchMode instead of modifying torch dispatch mode stack inner attributes (#82643)
- Improved legalize_graph pass in FX (#82874)
- Implemented `__deepcopy__` for fx.Tracer (#83130)
- Hacked up make_fx to natively support varargs (#83210)
- Updated proxy_tensor.py to support List input/output (#83302)
- Added *_only and all/any pytree utilities (#83316)
- Deleted ProxyTensor wrapper subclass (#83330, #83646)
- Added support for partial decompositions in make_fx (#83770)
- Added metadata field to fx.GraphModule (#84378)
- Added option to maintain the FX graph execution order after splitting_module (#85188)
JIT
- Added PReLU to MKLDNN convertible Ops in JIT optimize_for_inference (#79011)
- Enabled `torch._refs.var` for nvFuser executor (#79517)
- Fixed nvFuser's `where(tensor, python_scalar, tensor)` type promotion (#80347)
- Added ComplexDouble scalar creation bindings to nvFuser's Python API (#80522)
- Added real and imag to NVFuser and its python frontend (#79824)
- Added Nvfuser opt in for decomposition (#81134)
- Added `torch.jit.fuser()` option for disabling all fusers (#81731)
- Added support for symbolic diff for `silu` (#81724)
- Added NVFuser support for (`prims.sign`, `refs.sign`, `squeeze`, `native_batch_norm`, `transpose`) (#83167, #85562, #84629, #84117)
- Use high precision accumulate buffer for bf16 accumulation in NNC (#84402)
Quantization
- Improved quantization support for `masked_fill` (#78368, #85108)
- Improved quantization support for `index_put` (#78384, #85685)
- Improved quantization support for `LSTM` and `MultiHeadAttention` (#79959, #79956, #79960, #83304, #85068)
- Added support for quantized `matmul` (#83885)
- Introduced a more stable conv_bn fusion for QAT training (#85744)
- Removed warnings from using torch.tensor(value) (#84277)
ONNX
- Added operator support for `torch.tensor_split` (#77437), `torch.lerp` (#78891), `torch.movedim` and `torch.moveaxis` (#78931), `torch.scatter_add` (#79103), `torch.argsort` (#80234), `aten::native_dropout` (#81743), `aten::native_layer_norm` (#81754), `aten::convolution` (#81815), `aten::_log_softmax` (#81804), `aten::layer_norm` for ONNX opset version 17 using LayerNormalization (#84293), `nn.init.normal` (#84149)
- Added quantization support to more single output ops (#83008): `aten::reshape`, `aten::reshape_as`, `aten::t`, `aten::transpose`, `aten::numpy_T`, `aten::expand`, `aten::expand_as`, `aten::embedding`, `aten::embedding_bag`, `aten::view`, `aten::select`, `aten::eq`, `aten::ne`, `aten::gt`, `aten::lt`, `aten::le`, `aten::ge`, `aten::elu`, `aten::selu`, `aten::hardtanh`, `aten::hardswish`, `aten::as_strided`, `quantized::sigmoid`, `quantized::layer_norm`, `quantized::group_norm`, `quantized::leaky_relu`, `quantized::instance_norm`
- ONNX operators are exported with names containing their associated scope from `nn.module` (#82038), (#82039), (#82040)
- Introduced runtime type checking with the beartype library in all public APIs (#83673), (#84091)
  - All `torch.onnx` APIs now support runtime type checking when @beartype is present in the Python environment. A warning is emitted when a type mismatch is detected.
  - This feature is experimental. To turn all warnings into errors, set the environment variable `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=ERRORS`. To disable this behavior, set `TORCH_ONNX_EXPERIMENTAL_RUNTIME_TYPE_CHECK=DISABLED`, which effectively makes it a no-op.
- Improved shape type inference (#78999)
- Turn on ONNX shape inference by default (#82767)
- Enabled data propagation from ONNX (#80730)
- Introduced SARIF (#85428) for the `torch.onnx` submodule
- Improved warnings and errors (#78441), (#78309), (#83332), (#85179), (#83007)
- Updated ONNX submodule to 1.12 (#79585)
- Apply Common Subexpression Elimination pass to ONNX export (#85665)
AMD
- Support benchmark flag for MIOpen (#77438)
- Correctly handle the error codes of hipGetDeviceCount (#80405)
- Use torch._C._cuda_getArchFlags to get list of gfx archs pytorch was built for (#80498)
- `torch.cuda.is_bf16_supported()` returns True (#80410)
- Workaround missing hipProfilerStart/Stop (#82778)
- Enabled jiterator on ROCm (#77982)
- Enabled MIOpen fused convolution relu (#82002)
- Restore MIOpen benchmark flag default to true (#82656)
- embedded_interpreter_hip to enable torch::deploy on AMD (#83329)
- Add HIP libs into torch deploy init list & corresponding dependency for CURE benchmark running on AMD (#83434)
CUDA
- Added synchronize hooks (#84427)
- Added CSAN support for CPU synchronizations (#84428)
- Return device count using nvml (#84879)
- Reworked printing tensor aliases in CSAN error message (#85008)
- Added jiterator support when dtype is `complex32` for `tan`, `atan`, `sin`, `asin` (#77802), (#77606)
- Added jiterator support when dtype is complex for `logical_{or, xor}` (#75947)
- Reduced overhead of `get_current_stream` (#78066)
- Added an argument to specify warmup iterations in make_graphed_callables (#78124)
- Small improvements to `device_count` (#85192)
- Memoize `torch.cuda.device_count` (#84878)
- Remove the construction of unused tensors in fallback convolution implementation (#79183)
- `__launch_bounds__` for `torch.mode` with CUDA 11.7 (#79710)
- Removed synchronization for D2H copy with a different dtype (#80607)
- Added nondeterministic alert to CUDA `cumsum` (#75693)
- Annotated CUDACachingAllocator snapshots (#82146)
- CUDACachingAllocator snapshots from C++ (#86190)
- Propagate CUDAOutOfMemoryError to Python. (#83146)
- Set cublas workspace size to 4M (#74159)
- Allow changing the cuda allocator settings even after the process started (#84970)
- Fixed exception handling, improve overheads and avoid constructing storage for element size for DLPack (#84612)
- Added BFloat16 for fast layernorm (#83971)
- Added BFloat16 support for `torch.{im2col, col2im}` on CUDA (#84372)
- Added Bfloat16 support for `ReflectionPad` (#84949)
- Added explicit `__all__` to torch.cuda (#85193)
- Set CUDA_MODULE_LOADING to LAZY when not set by the user (#85692)
- Support cuDNN Errata Filter (#73934)
- Allow the number of kernels profiled under torch.backends.cudnn.benchmark = True to be limited (cuDNN v8 benchmark limit) (#78299)
- Update tests and dispatching for CUDNN V8 API behavior for bfloat16 convs (#81139)
Intel
- [RFC] Enable oneMKL&oneDNN on-demands verbose functionality (#63212)
- Updated ideep for NNC post-op (#82705)
- Enabled native 1d spatial input for Intel xpu (#82301)
- Added loss operators to fp32 cast policy of AutocastCPU (#81689)
- Added bfloat16 support for `lerp` on CPU (#84327)
- Added `prelu` op and module for quantized CPU backend (#73491)
- Enabled mkldnn matmul for aarch64 bf16 devices (#85546)
MPS
- Added ranked tensors for addcmul ops in MPS instead of constants and update MacOS version check (#78354)
- Moved MPS compat check into common comparison machinery of `TensorLikePair` (#77836)
- Made MPS buildable with either XCode or CommandLineTools (#79430)
- Improved MPS `aten::softplus` operator by adding RankedPlaceholder for graph nodes instead of constants (#81169)
- Extended MPS Conv1D operation for NHWC format (#83121)
- Added support for 1D weights in MPS linear layer (#85752)
- Added full support for serialization of MPS Tensors (#79465)
- Added support for 1D bias in MPS operation `torch.addmm` (#81519)
- Added torch dispatch stub code for MPS backend (#82612)
- Use convenience helper function `dispatch1DJob` for MPS native implementations (#82982)
- Enabled support in MPS for `torch.adaptive_avgpool_2d` for larger output sizes (#85726)
- Extended support in MPS for `torch.constant_pad_nd` for 4D+ padding (#85991)
Profiler
- Propagate metadata into `Engine::evaluate_function` event (#77696)
- Switched to nanoseconds for Result's internal representation (#77697)
- Made profiler table column widths changeable via arguments (#85203)
Vulkan
- Enabled higher dimensional input in `torch.nn.linear` (#81773)
- Vulkan tensor views now infer the dim size when -1 is provided as input (#81668)
- Vulkan prepacked op contexts will now release the deserialized CPU tensors from memory upon construction (#83587)
- Vulkan shader codegen is now Windows compatible (#85241)
Mobile
- Allowed tracing multiple input models at once (#84833)
- Leaky `relu` in metal shader (#78544)
- Removed code duplications and refactored (#79184)
- Optionally run fbgemm in tracer (#83531)
- Added hardshrink op to metal backend (#82224)
- New flatbuffer_loader functions that do not depend on flatbuffers.h (#82618)
- Added `max_pool2d`, `linear`, `conv2d` FP32 operator tests for XNNPACK (#83131)
- Migrated remaining pytorch code to use new flatbuffer_loader.h APIs (#82620)
- Remove flatbuffer types/headers from flatbuffer_loader.h (#82893)
- Use flatbuffer of alternate namespace (#82952)
- Hide flatbuffer build dependencies (#82953)
- Renamed flatbuffer_all to flatbuffers_jit (#82826)
- Renamed flatbuffer_serializer to _mobile or _full_jit (#82827)
- Created flatbuffers_mobile (#82828)
- Added API for profiling backend memory events for Edge CPU profiler (#80350)
- Switched mobile targets to flatbuffers_mobile (#82829)
- Added an option to avoid adding base ops to static op library for Edge (#84360)
- Fixed load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855)
- Remove unused field 'order_' in nnapi.h (#84067)
Distributed
Distributed(c10d)
- c10d API improvements:
- Improvements to c10d error messages:
- Passed group ranks and options to third party distributed backends (#73164)
- Enabled NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG is set to DETAIL (#83881)
- Added a soft error handling mode `NCCL_ASYNC_ERROR_HANDLING=2` that does not crash the process (#84386); see the sketch after this list
- Upgraded NCCL to 2.14.3 (#85367)
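For reference, a minimal sketch of opting into the new soft error handling mode from Python before the NCCL process group is created, assuming a standard `torchrun` launch so the usual rendezvous environment variables are already set:

```python
import os
import torch.distributed as dist

# Request the new soft error handling mode: asynchronous NCCL errors are
# reported without crashing the process (per #84386). Setting
# TORCH_DISTRIBUTED_DEBUG=DETAIL also enables NCCL desync debugging.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "2")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

dist.init_process_group(backend="nccl")
```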
Distributed Optimizer
- Added functionality to save and restore the step counter for the model averager in PostLocalSGDOptimizer (#78988)
DistributedDataParallel
- Enabled the static graph to print unused parameters in debug mode for DDP. (#81929)
- The stateful PowerSGD communication hook can now be saved and reloaded to resume training (#79334)
FullyShardedDataParallel
- Allowed different `optim_input` orders across ranks (#78599)
- Added profiling range for FSDP.backward (#78479)
- Enabled NamedTuple support for FSDP (#83055)
- Added FSDP communication hook interface for NO_SHARD strategy (#79833)
- Moved the `sharded_state_dict` logic to the post hook to avoid OOM (#82613)
- Added ability to iterate through dataclasses in fsdp.utils (#82638)
- Enabled passing kwargs to load_state_dict (#83309)
- Used `_init_from_local_tensor` to create ShardedTensor to avoid communication overhead (#82911)
- Added communication hook for sharded strategies (#83254)
- Changed to print exec order only in debug mode (#83868)
- Ensured that all ranks use the same order to iterate through optimizer states (#84654)
- Optimizer states may be on CPU; copy them to GPU before gathering (#84708)
- Handled the `state_dict` on CPU cases (#85640)
- Added `FSDPExtensions` for TP support (#85039)
- Ignored buffers that are non-persistent (#85740)
- Delayed moving tensor to CPU until necessary for optim_state_dict() (#85761)
- Dequeue one event instead of flushing for rate limit (#86165)
torch.distributed.elastic
- Implemented a named pipe based watchdog timer (#83695)
Infra (RelEng)
- Consolidated all python targets in the tools folder (#80408)
- Improved ios simulator test in CI (#80459)
- Add functorch testing shard in CI (#81283)
- Added functorch shards for windows CI (#82161)
- Added functorch shard for mac x86 tests, linux cu102 tests (#82000)
- Added CI workflow to build official docker images with multiarch (#83437)
- Sharded `trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (default)` from 2 -> 4 (#83424)
- Migrated workflows from 18.04 to 22.04 (#83861)
Bug fixes
Python API
- Fixed `dim` out of range check for `logcumsumexp` on CUDA when the source tensor is empty (#78284)
- Added missing `__init__.py` for `torch.utils.jit` (#78629)
- Fixed backward crash for `gather` with an empty index tensor when `sparse_grad=True` (#78698)
- Added type annotations to `torch.distributions.kl_divergence` (#78432)
- Fixed erroneous inclusion of `end` in the output of `torch.arange` for some inputs (#80758)
- Fixed `torch.distributions.Transform` to be pickle-able (#81707)
- Added check that `self` and `mask` are on the same device for `torch.masked_fill` (#82737)
- Fixed potential ref cycle creation in `torch.utils.checkpoint` (#82776)
- Fixed `Tensor.__hash__` for Tensor subclasses (#83174)
- Fixed `torch.cat` for 0-dim tensors with different dtypes (#83391)
- Fixed `torch.equal` on CPU when inputs have different dtypes (#83350)
- Fixed data-dependent shapes in `torch.distributions.{HalfCauchy, HalfNormal}` (#84322)
- Added check that the size of the last dimension of `tau` is less than or equal to that of `input` in `torch.ormqr` (#85278)
- Added check that `weights` is a 1D tensor in `torch.bincount` (#85881)
- Fixed segfault for `out` arguments that have a large number of dims (#85294)
- Fixed comparison ops with scalar arguments by removing overflow check (#78881)
- Normalized `torch.utils.dlpack` strides to 1 where size of corresponding dimensions < 2 (#83158)
- Added a check in `torch.empty_strided` that `sizes` has the same dimensionality as `strides` (#82422)
- Fixed `torch.istft` default output length to prevent trimming of last element (#80031)
C++ API
- Fixed missing antialiasing path to the interpolation for bicubic mode (#84599)
- Added `IListRefTag::Materialized` to `IListRefIterator` destructor (#85467)
- Fixed `im2col` by adding a check that `pad_width` and `pad_height` are non-negative (#85541)
- Fixed `check_compiler_ok_for_platform` on non-English locales in `torch.utils.cpp_extension` (#85891)
Autograd
- Corrected the forward AD formula of `torch.sgn`, which fixed forward-over-backward for `torch.linalg.svd` and other spectral decompositions, as well as `torch.norm` and `torch.linalg.{norm, matrix_norm}` (#80082)
- Fixed derivatives of convolution overridable backward (#80840)
- Updated setting non-float, non-complex values for a forward AD dual tensor to properly error (#78361)
- Fixed forward AD to not set tangent as-is in some situations (#79664, #79653)
- Fixed cpp hooks, retains grad, and `backward(inputs=)` behavior in-place (#79996)
- Relaxed storage layout checks for forward AD for zero-numel tensors (#81055)
- Fixed leak when `create_graph=True` and a full backward hook is registered (#82788)
- Fixed view and in-place interaction when grad_fn is first accessed in no-grad mode (#83872)
- Updated backward of `torch.stack` to correctly handle implicit real->complex casting (#84993)
- Fixed gradients for `torch.nn.functional.{leaky_relu, threshold}` when inplace=True (#85634)
- Corrected autocasting behavior in `torch.utils.checkpoint` when use_reentrant=False (#81766)
- Fixed gradcheck when outputs that don't require grad precede those that do (#77743)
- Fixed backward and double backward for `nn.functional.binary_cross_entropy_with_logits` (#80083)
- Fixed derivatives of `norm(p=inf)` (#78105)
- Fixed forward AD when the conj-ness of the primal and tangent of the dual tensor do not match (#78358)
Build
- Use C++17 for RocksDB 7 header. (#75741)
- Fixed Windows builds with _DEBUG flag (bbe8d01)
- Pass WITH_BLAS option from environment to CMake (#78037)
- Removed `-Wno-unused-but-set-variable` for clang 13.0.0 (#79666)
- Fixed variable typo for USE_SYSTEM_PYBIND11 (#80272)
- Fixed compilation errors during build with clang13 (#80916)
- Added missing -fexceptions flags during PyTorch build (#81394)
- Fixed CMake dev warning (#81580)
- Fixed false positive AVX, AVX2 and AVX512 detection with MSVC (#82554)
- Fixed NCCL detection issues of the Gloo library (#82773)
- Fixed objcopy version detection in NCCL cmake process (#82774)
- Fixed build error by changing COLORIZE_OUTPUT option to USE_COLORIZE_OUTPUT in cmake file (#83716)
- Set default value for NCCL make to MAX_JOBS if ProcessorCount returns 0 (#84231)
- Fixed intermittent link errors in NCCL build (#84245)
- Deleted `torch._dl` extension (#84361)
- Used unified source file list for BUCK build (#84770)
Complex
- Fixed the derivative of `torch.acosh` for complex numbers (#80841)
- Removed unused conjugate kernels for real dtypes (2.2MB reduction in CUDA binary size) (#80374)
torch.nn
- Fixed `nn.Embedding`'s `max_norm` argument when forward mode AD is used (#78560)
- Fixed `nn.ChannelShuffle` when given empty Tensors (#77029)
- Fixed `nn.RReLU` backward on CUDA (#80434)
- Fixed spurious warnings in `torch.nn.parallel.*` APIs (#81476)
- Fixed `nn.Conv2d` fallback implementation for single channel inputs and channels last weight (#82392)
- Fixed segfault in adaptive pooling for specific index values (#84010)
- Fixed type annotation in `nn.Conv{1,2,3}d` for in_channels (#84302)
- Fixed `nn.GeLU` for empty inputs (#84926)
- Fixed correctness issues for `nn.Conv2d` on ARM-based machines (#85711)
- Fixed `nn.ParameterList` printing of Tensors on the "meta" device (#78529)
- Fixed channels-first behavior for `nn.MaxPool3D` on CUDA (#80748)
- Fixed input shape validation for `nn.MaxPool1d` (#85594)
- Fixed `nn.Softmax` for large input tensors (#84182)
- Fixed lower and upper bound checks for `nn.RReLU` (#84996)
- Fixed edge cases in `torch.nn.grad` by calling into the C++ backward kernel directly (#81839)
- Fixed `torch.nn.PixelShuffle` for empty inputs (#86262)
- Fixed consistency of output and input dtypes for `torch.nn.BatchNorm` (#84410)
torch.optim
- Fixed `optim.SGD` `maximize` flag when `momentum` is involved (#81859); see the sketch after this list
- Fixed temporary bug where checkpoints from optimizers created with an older PyTorch version could not be loaded (#83588)
- Fixed memory leak in `optim.lr_scheduler.CyclicLR` (#85462)
- Fixed initialization of `lr` in `optim.lr_scheduler.SequentialLR` (#72856)
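For context on the first fix above, a minimal sketch of the `maximize=True` + momentum combination it concerns; the toy objective below is invented for illustration.

```python
import torch

param = torch.tensor([0.0], requires_grad=True)
# maximize=True ascends the objective; the 1.13 fix makes this interact
# correctly with momentum.
opt = torch.optim.SGD([param], lr=0.1, momentum=0.9, maximize=True)

for _ in range(50):
    opt.zero_grad()
    objective = -(param - 3.0).pow(2).sum()  # maximized at param == 3
    objective.backward()
    opt.step()

print(param)  # should approach tensor([3.])
```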
BetterTransformer
- Cleaned up native transformer implementation (#78265)
- Added fastpath test for mask check flag (#82999)
- Added check for contiguous well-formed mask (#79927)
- Introduced mask contiguity check function (#79186)
- Fixed issue in softmax.cu with transformer error when mask `seqlen > 1024` (#83639)
- Disabled Transformer/MHA fast path when autocast is enabled (#84722)
- Moved odd `num_head` in TransformerEncoder to `slow_path` (#83483)
Composability
- Fixed `__torch_function__` bug in getindex that causes an "error not set" exception (#78781)
- Fixed `__torch_dispatch__` usage with inplace views (#79902)
Dataloader
- Fixed "`NoneType` object has no attribute `python_exit_status`" when `DataLoader` exits (#83985)
Functorch
- `functorch.grad`: fixed silent correctness issue from calling a view operation on a captured tensor followed by an in-place operation (#85374)
- `functorch.jacrev`, `functorch.jacfwd`: fixed loud in-place errors when passing in inputs to the transforms and mutating them (#84914, #84915)
- `functorch.vmap`: fixed support for in-place view operations (`Tensor.unsqueeze_`, `Tensor.transpose_`, `Tensor.t_`, `Tensor.squeeze_`) (#82899, #82903, #82972)
- `functorch.vmap`: added an error on incorrect `weight` shape to `torch.nn.functional.prelu` (#83106)
- `functorch.vmap`: fixed support for multinomial (#83838)
- `functorch.vmap`: fixed incorrect support for `conv_transpose` with `groups > 1` (#84938)
- Fixed `vmap` x `vjp` x `vjp` composition for `torch.nn.functional.prelu` (#84939)
- Fixed printing tensors that are not being transformed over inside functorch transforms (#85556)
- Disallowed saved tensor hooks in functorch transforms to avoid silently incorrect behavior (#85972)
- Fixed `cross` to match unbatched behavior (#86926)
LinAlg
- Strengthen the preconditions of `linalg.cross` (#83798)
- Fix memory issues in `linalg.lstsq` (#85357)
- Fix `linalg.lu_solve`/`torch.unpack` to prevent bad memory usage on CPU (#85922)
- Preserve the dim of the input in `matrix_exp` (#81330)
Sparse
- Fixed COO Tensors with less than two non-zero elements to always be marked coalesced. (#82426, #82085)
- Fixed CUDA kernel launch misconfiguration for `mul` on tiny COO tensors (#80254)
- Fixed silent type promotion bug by `select` if given all zero integer COO tensors (#82215)
- Fixed CUDA kernel coverage on 0-sized dense inputs for `torch.sparse.sampled_addmm` (#85194)
torch.fx
- Fixed bug where curly brackets were not properly escaped in FxGraphDrawer (#83604)
- Fixed torch.fx.wrap to use the callable `function.__name__` rather than `function.__code__.co_name` (#84373)
- Added strictness check and made tensors into leaves if input tensors were leaves (#77474)
- Used getattr_recursive instead of getattr when splitting (#80011)
- Stopped ProxyTensor from turning aten::lift tensors into proxy objects (#81024)
- Fixed named_modules to be subscriptable (#81258)
- Fixed `to_folder` by adding custom_builtins to dump (#81433)
- Correctly unpacked constants when used in multi-return output (#82568)
- Replaced module name for torch.ops (#82395)
- Removed unnecessary `import warnings` (#82760)
- Don't constant propagate through nondeterministic functions (#83650)
- Don't extract tensor metadata from sparse tensors (#83669)
- Skipped folding side-effectful functions (#84016)
- Fixed make_fx issue by introducing get_attr into symbolic tracing (#84011)
- Disabled autocast cache during aotdispatch (#84035)
- Modified split_by_tags to retain output order (#84136)
- Made NormalizeArgs preserve node type (#85637)
- Fixed PyTree unpacking carrying forward type annotations (#81906)
JIT
- Fixed conv-batchnorm folding for previously-broken datatype inputs during JIT freezing (#78241)
- Fixed lightweight dispatch OOM error by introducing selective build (#79215)
- Used signed integers in `CalculatedNecessaryArgs` to avoid underflow with schemas where all args have defaults (#79331)
- Fixed indexing into a tensor with a tuple (#79335)
- Propagate `map_location` arg to `torch.jit.load` in `torch.load` (#78733)
- Improved JIT autodiff heuristics for determining whether outputs require gradients (#78392, #79498)
- Used streams for `import_ir_module` for pickle case to reduce memory usage (#80131)
- Added scripting support for "start" kwarg in `enumerate()` (#80585); see the sketch after this list
- Turned off ARC in the CoreML backend, because throwing exceptions in ARC code leaks memory (#79928)
- Suppressed virtual-dtor check on llvm_jit to fix NNC build (#81449)
- Fixed annotation extraction for Python 3.10 (#81334, #81506)
- Fixed `std::out_of_range` when using NNC and `ConstantChunk` input shapes are unknown (#82698)
- Limits constant chunk propagation for pw-node-only in NVFuser (#83083)
- When encountering dynamic types, cast them recursively (#83218)
- Fixed handling of empty dim list in `sum_mean_dim` symbolic shape fn (#83357)
- Check existence of the array ref when tracing `resize_` to avoid `_MapBase::at` runtime error (#81422)
- Fixed `define_constant` pybind signature to match `std::complex` scalar in NVFuser (#83684)
- Cast to signed char to fix aarch64 build (#84429)
- Support `torch.ScriptObject` in `torch::jit::as_object` (#84398)
- NVFuser torchbench patch to take nvprim fallback when no CUDA tensors are provided as inputs (#84411)
- Fixed coreml gpu flag not set (#84725)
- Print the real type for function schema arguments (#85103)
- Fixed `torch.jit.trace` check that was causing tracing to fail for MPS inputs (#84850)
- Throw an error instead of segfaulting when passing `None` to futures (#85304)
- Cherry-picked sorting patch for NVFuser fusion segmenter (#85620)
- Support freezing modules that don't have a forward method (#85779)
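As a quick illustration of the `enumerate()` scripting change noted above, a minimal sketch; the function and variable names are invented for the example.

```python
import torch
from typing import List

@torch.jit.script
def weighted_sum(xs: List[torch.Tensor]) -> torch.Tensor:
    total = torch.zeros(1)
    # the "start" keyword argument of enumerate() is now accepted when scripting
    for i, x in enumerate(xs, start=1):
        total = total + i * x
    return total

print(weighted_sum([torch.ones(1), torch.ones(1)]))  # tensor([3.])
```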
Quantization
- Added channel axis bound checking in `fused_moving_avg_obs_fake_quant_*` (#78148)
- Disable use of qnnpack with `ceil_mode` of the `avgpool` op (#79028)
- Improve subpackage import in `torch.nn.quantized` (#84141)
- Fix segmentation fault in `QTensor.choose_qparams_optimized` (#85552)
- Enhance the `_rebuild_qtensor` function to support device types other than CPU (#78234)
- Fix `at::from_blob_quantized_per_tensor_affine` strides calculation (#79314)
- Fix embedding quantization issue when memory format is not `contiguous` (#82605)
- Fix dispatch declaration bug about quantized op (#83649)
- Moved the order of x86 engine to avoid changing the default qengine (#86631)
ONNX
- Fixed `aten::mul` with Boolean inputs (#81671)
- Fixed `add` and `sub` for non-tensor inputs (#81736)
- Fixed `RReLU` eval mode behavior (#82678)
- Fixed onnx optional node type in for/if block (#83599)
- Fixed `Interpolate`: use `half_pixel` instead of `pytorch_half_pixel` (#80003)
- Fixed `argmin` and `argmax` edge case consistency with PyTorch (#79503)
- Shape Type Inference and Propagation
  - Fixed shape inconsistency when exporting scalar `log2` (#78701)
  - Fixed inconsistent `rand` dtype (#79193)
  - Fixed linalg `norm` output's shapes and dtypes (#79506)
  - Fixed `any` and `all` outputs' shape (#79371)
  - Fixed `prelu` output's shape (#79846)
  - Fixed onnx logical functions' dtype (#79339)
  - Fixed `hardshrink` and `softshrink` output's shape (#79695)
  - Fixed quantization outputs' dtype (#79690)
  - Fixed reduce node shape inference (#85765)
- Fixed bug using `std::copy_if` (#80999)
- Fixed default function value in `_optimize_graph` (#83996)
- Fixed constant folding unexpectedly adding folded constant as initializer (#79552)
- Fixed autograd subgraph recording with nested graphs (#82852)
- Disabled autocast cache in exporter (#84219)
- Removed static None graph output (#82623)
- Fixed float point detection for optional tensor (with unknown rank) within a list (#81386)
- Support `device().type()` string comparison with constant (#86168)
- Fixed `scalar_type_analysis` metadata for copied constant (#86716)
- Fixed triu/tril export with diagonal input (#86843)
- Ignore `print(Tensor)` during tracing (#86223)
- Updated training state logic to support ScriptedModule (#86745)
AMD
- Fixed memory cross-border access on the ROCM platform (#76100)
- Set nvfuser default to disabled (#86369)
CUDA
- Fix how we handle host memory in CUDA `getDeviceFromPtr` (#76902)
- Only sync CUDA if the operation is run on GPU (#80328)
- Do not use `thrust::lower_bound` on device (#80746)
- Fix `set_requires_cuda_init` (#81183)
- Fix behaviour of index_add / atomicAdd(bool,bool) (#85100)
- Fix IMA for topk (#83042)
- Use `opmath_t` for activation functions in Activation.cu (#77949)
- Fixed the invalid configuration argument error when running layer norm backward (#80893)
- Support non-standard bools in CUDA unique (#79392)
- Accept non-standard bools in more CUDA kernels (#78957)
- Fix cuda-mode and add more tests (#81898)
- Clear autocast amp cache in CUDA Graphs (#81896)
- Properly compute `batch_element_count` in `warp_softmax` (#82927)
- Disabled autocast cache in torch.cuda.make_graphed_callables (#84289)
- Store RNG seed for CUDA graphs (#84967)
- Assert `lambda >= 0` in Poisson distribution CUDA kernel (#85906)
- Work around 32-bit indexing failures in cuDNN batchnorm (#87861)
- Fixed 3d convolution_add_relu in V8 (#85055)
Intel
- Fixed bug for thnn_conv2d when input's C is 1 and weight is channels last (#82392)
- Fixed oneDNN channels_last path issue (#83653)
- Fixed issue where torch.config did not respect the USE_MKLDNN flag (#75001)
- Made the data types of output and input consistent for batchnorm (#86784)
- Fixed an issue where `cat` results would be incorrect for channels-last (#85076)
- Fixed a performance issue where the for-loop before `ExternalCall` could not be parallelized (#85056)
- Fixed a performance issue with the for-loop before `ExternalCall` (#86516)
MPS
- Fixed MPS operator torch.full for boolean types (#82575)
- Extend MPS Unary operators for empty tensors which should be a no-op (#82650)
- Fixed MPS operator `torch.scatter` for boolean types (#82685)
- Fixed MPS operator `torch.cat` for boolean inputs (#81480)
- Fixed typo in MPS allocator (#83465)
- Fixed MPS operator torch.full to handle uint8 types (#83697)
- Fixed creation of `MPS::Placeholder` behavior for transposed view operations (#85689)
- Fixed handling of output shape for empty inputs to binary ops in MPS backend (#85836)
- Added support for handling scalar inputs to MPS operations of `torch.scatter` and `torch.gather` (#85842)
- Support for handling compatible inputs to MPS operation of torch.where (#85946)
- Added support for inputs with datatypes Short, Byte & Char to torch.dot MPS operation by casting to int32 when needed (#86140)
- Remove incorrect asserts in MPS backend from Copy.mm file (#86184)
- Added support for handling of 1D inputs for MPS operation `torch.nll_loss` (#81290)
- Get correct size of the view tensor when copying from cpu to mps device (#81730)
- Fix issues exposed in MPS testConsistency tests. The fix includes correct handling of types in smooth l1 loss, 0 dimensions for torch.repeat and empty inputs for torch.cat operations (#81735)
- Handle Integer inputs for MPS linear layer by returning error of unsupported data types (#82183)
- Workaround int8 datatype outputs in MPS for View operations (gather) by casting it to int8 (#82315)
- Improve handling of empty outputs and fix MPS linear layer’s handling of transposed Tensors in test consistency (#83124)
- Fixed handling of conv1D and conv2D MPS operations with non-matching strides/paddings (#83522)
- Fixed handling of MPS::Placeholder when View operation is missing gather graph (#83744)
- Fixed the index handling in MPS for torch.constant_pad_nd operations with single-dimension input (#83745)
- Handle casting for MPS torch.div operation in case of type mismatch (#84742)
- Fix device (MPS) to host (cpu) copy by casting from a smaller dtype to a bigger dtype (#84928)
- Ensure as_strided_tensorimpl is never called with MPS (#85020)
- Fixed integer rounding crash in torch.div MPS operation on M1 (#85016)
- Fixed crash in MPS bitwise ops on Mac x86 platforms. (#85285)
- Fixed crash in MPS Conv1d backward operation for NHWC (#85283)
- Added support for MPS reduction operations of scalar edge-cases (#83743)
- Fixed memory corruption in torch.var operation for MPS (#85571)
- Fixed memory leaks in MPS that cause the MTLBuffers not to be released and cause OOM (#85661)
- Fix test consistency error in MPS due to type mismatch between int8 and uint8 types (#85666)
- Fixed shape issues for torch.clamp op in MPS (#85673)
- Fixed handling of TensorBase shapes for view ops in MPS for case of multiple slices on a Tensor (#85934)
- Fix the dimension of padding to match the input's dimension for MPS Pad operations (#85990)
- Fix non-contiguous to contiguous copy of MPS tensors (#86056)
- Remove `std::cout` from MPS `multinomial` operation (#86246)
- Do not dispatch empty job in bitwise_not (#87286)
- Made copy from CPU always add storageOffset (#86958)
- Revamped `copy_to_mps_` implementation (#86956)
Package
- Added fix for implicit numpy dependency (#78979)
- Allowed torch._C to be recognized as a module in torch.package (#80917)
- Ignore return value of function declared with 'warn_unused_result' for torch::deploy (#84862)
- Removed torch::deploy from pytorch (#85953)
Profiler
- Fixed build failure in python 3.10 (#81812)
- Pop `KinetoThreadLocalState` at the start of post processing (#77996)
- Fixed record function inputs_valid_ check (#78002)
- Weakened ordering check during post processing. (#78563)
- Fixed Python parent id (#79356)
- GIL acquire needed in ValueCache::trimPrefixes (#81061)
- Added ephemeral inputs to the value cache. (#81958)
- Fixed profiling with record_shapes=True and nested tensor (#82854)
- Properly reset execution graph data in remove callback registration (#82910)
- Solved two syntax issues when dumping execution graph result to json file. (#81854)
- Set end time on python events when profiling stops. (#83621)
- Don't try to collect strides for non-strided tensors (#83935)
- Add null handling to `AppendOnlyList::copy` memcpy path (#83963)
- Add quoted metadata API to remove empty trace cpu_op metadata (#84128)
- Make `RecordQueue` manage the lifetime of `PythonTracer` (#83964)
- Don't assign in AppendOnlyList::emplace_back (#85716)
- Fixed traversal utility (#85717)
- Fixed python object reference counting (#85847)
Visualization
- Removed dependency on `torch.onnx` in `graph` (#82628)
- Updated `Image.ANTIALIAS` to `Image.Resampling.LANCZOS` in summary (#85679)
Vulkan
- Fixed the `aten::cat` operator registration (#78806)
- Fixed a bug in GRU where incorrect behaviour was being observed when `H_in != H_out` (#78945)
- Fixed a possible null pointer dereference in the `aten::mm` operator when passing an empty bias (#79701)
- Code under `ATen/native/vulkan/api` was essentially rewritten (more details below); as a result of these refactors, it is now possible to concurrently execute multiple Vulkan models due to correct synchronization when recording to a Vulkan command buffer (#80959)
Mobile
- Moved saving storage to the last step. (#78024)
- Fixed build For Model Tracer (#84755)
- Skip TestNNAPI tests if QNNPACK is not supported (#82882)
- Extended LinearPackedParamsBase getstate/setstate deadline in `check_forward_backward_compatibility.py` Allowlist (#81135)
- Removed LinearPackedParamsBase getstate/setstate from `check_forward_backward_compatibility.py` Allowlist (#81048)
- Fixed `ao::sparse::BCSR` missing in qlinear serialize and deserialize when USE_FBGEMM and USE_PYTORCH_QNNPACK are not set (#81256)
- Updated `model_ops.yaml` (#82444)
- Fixed signed/unsigned compare for Metal (#86068)
- Re-added benchmarking files to ios TestApp (#85539)
Distributed
Distributed(c10d)
- Ensured tensors are contiguous for autograd-enabled `all_gather` (#79747)
- Fixed data race condition of `batch_isend_irecv` (#82450)
- Fixed `distributed_test.py` flakiness by turning off async_error_handling (#78797)
- Reenabled `isinstance` with `torch.distributed.ReduceOp` (#87303)
DistributedDataParallel
- Enabled `AllReduceCommHook` to accept `intrusive_ptr` (#80975)
FullyShardedDataParallel
- Fixed `full_optim_state_dict()` hang (#80712)
- Fixed exec order validation for ignored modules across ranks (#79533)
- Cleaned prefixes when searching for params / buffers to ignore (#78278)
- Returned the original module when `.module` is called on an FSDP-wrapped model (#78671)
- Fixed a small bug in pre_backward_hook params prefetch (#78851)
- Fixed param name prefixes for ignored modules (#79955)
- Fixed FSDP when not all outputs get gradient in backward (#80245)
- Fixed MP config not being passed to FSDP (#80869)
- Fixed FSDP device_id when CPU offloading (#82892)
- Fixed FSDP when not all outputs are used in the loss (#83195)
- Fixed the FQN-not-found issue when loading a sharded_state_dict with activation checkpointing (#84253)
- Fixed `pin_memory()` for CPU offloading (#85048)
- Fixed memory regression (#85087)
- Implemented a short-term fix to remove `optim_input` (#84201)
torch.distributed.elastic
- Ensured that exit code is propagated from Child to parent process (#81408)
torch.distributed.rpc
- Only initialize CUDA if there are devices specified in `init_rpc` (#80180)
- Fixed the wrong usage of `RRefContext::handleException` by adding a new API `RRefContext::handleExceptionSilent` (#83166)
- Changed to avoid initializing storage for empty Optionals (#78947)
Infra (RelEng)
- Made bazel changes to make “bazel query ...” work (#78870)
- Fixed C API to be compatible with latest Python 3.11 beta (Please note that 3.11 binaries are not fully functional) (#81242)
Performance
Python API
- Fixed use of temporary buffers for tensors in `torch.save` (#80404)
- Fixed and improved the efficiency of the backward for `torch.xlog{*}` functions (#82713)
- Vectorized `.copy()` acting between different dtypes on CPU (#80905)
- Vectorized `bfloat16` conversions on CPU (#80906)
Autograd
- Codegened autograd nodes are now smarter about which gradients to compute (#82544)
- Made the derivative of masked_fill more efficient (#83515)
- `torch.where` no longer materializes a zero-filled tensor in its backward (#83043)
torch.nn
- Speed up `nn.Module` constructor by not calling custom `setattr` (#77098)
- Speed up CPU `nn.BatchNorm` implementation by using `torch.zeros()` directly (#82558)
- Speed up `nn.Module.load_state_dict` (#85743)
BetterTransformer
- Added nn.module activation support in BetterTransformer (#78394), in addition to functional support which is not available in Torchscript
- Added mask identifier for multiplexed src_mask/src_key_padding_mask in BT (#81947)
- Added a small fastpath test for native multi-head attention (#81432)
Composability
- Release GIL when doing shared memory copies on Tensors (#85389)
- Some micro-optimizations in `RecordFunction`, the core util used by the profiler (#76266)
- `c10::detail::ReplaceAll`: avoid some unnecessary allocations (#79915)
Dataloader
- Moved loop content into a function to ensure we don't preserve `Tensor` in the `pin_memory` thread (#83595)
LinAlg
- Simplified and optimized `linalg.solve` (#74046)
- Improved heuristics for `linalg.lu_solve` when B is a matrix (#79838)
- Small optimization of `linalg.cholesky` (#81316)
- Prefer contiguous output from mkldnn_bf16_gemm (#82968)
- CPUBlas: Use mkldnn optimized BFloat16 matmul for gemm (#65840)
- Updated and improved the heuristics for `linalg.lu_solve` (#73878)
- Optimized `linalg.householder_product` backward to be more memory-efficient (#84627)
Sparse
- Improved `to_sparse_bsr` for batched dense inputs (#83085)
- Improved `to_dense` for CSC (#79635)
- Improved `index_select` performance for COO input on CUDA (#77551)
- Improved `mul(COO, COO)` performance with broadcasting in dense dims (#83428, #85336)
JIT
- Improved coreml load time by loading cpu model first, while asynchronously loading a model (#80941)
- Improved `torch::jit::as_{module,object}` performance (#84399)
- Replaced `IValue::toString()->string()` with `IValue::toStringRef()` (#85437)
Quantization
- Allow contiguous inputs to run into `qcat_nhwc_stub` when dim is the last dimension (#72575)
- Enable qlinear dynamic parallelization with fbgemm (#84033)
CUDA
- Fixed perf regression introduced in #70943 (#78588)
- Improved small sort performance on CUDA (#79627)
- Use cub::BlockRadixSort to improve medium length sort performance (#79628)
- Increased size limit on calling CublasLt in addmm by 32x (#82922)
- Don't synchronize single element any/all reductions (#84465)
- Added col2im_batched kernel (#84543)
- Exposed fast get_current_stream (#78165)
- Pool cudaEvents in CUDACachingAllocator (#78279)
Intel
- Optimize the copy of BFloat16 to Float and Float to BFloat16 (#79685)
- Improve performance of ONEDNN backend (#84470)
- Optimize softmax backward and logsoftmax backward (#80114)
- Improve sort multi-core perf by adjusting grain_size w.r.t. dim_size (#74897)
- Add fast path of `qmean`/`qstd` for quantized CPU (#80579)
- Use direct memcpy in `qcat` when all the inputs and output share the same scale and zero_point (#71903)
- Vectorize scalar remainder in quantized kernel for normalization (#79673)
- Enhance add_out_dense_sparse_cpu for hybrid sparse tensor (#23057)
MPS
- Performance improvements for the MPS backend by changing commitAndWait to commit & fixing high memory consumption for View operations. Also improved scalar handling in MPS Allocator (#81951)
- Improved performance for MPS backend by reducing the number of command buffers created and hence CPU overhead. It uses commitAndContinue feature in MPS (#81338)
- Added direct MPS implementation for constant_pad_nd operation which improved performance as the generic implementation was heavily reliant on View ops which are slow (#82366)
- Removed checks that incur unnecessary syncs for MPS device with tensor.item() (#82505)
- Enabled Graph caching in MPS for torch random ops with Philox engine (#85833)
- Added specialized memory pool for scalar values in MPS which improved performance in torchbench networks (#85817)
- Improved memory usage and performance by utilizing garbage collector and adaptive commit feature in MPS (#86119)
Profiler
- Optimize getStepCallbacks for common case of no active callbacks for kineto (#77804)
- Use custom AppendOnlyList for op_events to reduce the number of atomic operations (#78643)
Vulkan
- When waiting on the result of a `VkFence`, busy polling is now used instead of a single call to `VkWaitForFences` with no timeout. This can improve benchmark performance by up to 50% by ensuring that the CPU stays at a high frequency when waiting for work on the GPU to complete (#81470)
Mobile
- Added compilation_preference & relax_f32_to_f16 APIs (#78758)
- Made flatbuffer loads faster if loading as mobile module. (#78998)
- Stream pkl (#79931)
- Used Apple's Accelerate framework for blas acceleration (#80449)
- Read via FileAdapter when loading files in torch if not flatbuffer for lite_interpreter (#84028, #84296)
Documentation
Python API
- Fixed `torch.as_array` documentation formatting (#78485)
- Fixed default value for `storage_offset` in `torch.as_strided` documentation (#78202)
- Removed warning in documentation that `torch.real` is only supported on complex types (#78644)
- Improved reproducibility documentation for the random number generator and `torch.use_deterministic_algorithms` (#78849)
- Fixed example in documentation for serialization (#79454)
- Fixed `torch.linspace` documentation for dtype (#81371)
- Fixed typo in documentation for `torch.distributions.Dirichlet` (#82062)
- Fixed example in `torch.dist` documentation (#82104)
- Updated `torch.narrow` documentation to reflect that `start` can be a Tensor (#85180)
- Added documentation for `pin_memory` and `layout` arguments to `torch.new_{zeros, ones, full}` (#85605)
- Added documentation for `pin_memory` argument to `torch.{rand, randn}` (#85219, #85221)
- Added argument default values to documentation for `torch.rot90` (#85610)
- Removed `out` argument from documentation for `torch.squeeze` (#85222)
- Fixed `torch.log` example (#78776)
- Fixed `torch.argmin` docs for `keepdim` argument (#78888)
- Updated examples in documentation for `torch.use_deterministic_algorithms` (#82003)
- Changed docstring type `callable` to `Callable` for consistency (#82487)
- Added documentation for `torch.narrow_copy` (#84493)
- Improved documentation for `torch.signbit` (#78349)
- Added doc string for `torch.library.Library.impl` (#81047)
- Renamed `_Typed`/`_UntypedStorage` to `Typed`/`UntypedStorage` and updated documentation for `torch.storage` (#82438)
- Added documentation for `torch.unflatten()` (#81399)
Autograd
- Improved autograd custom function docs (#81340)
- Added randomness case to the autograd notes (#78617)
Complex
- Added a note on CUDA 11.6 (#80363)
torch.nn
- Fixed docstring and image for `nn.LeakyReLU` (#78508, #79102), `nn.ELU` (#78909), `nn.GRU` (#79380), `nn.Hardswish` (#70993), `nn.GeLU` (#85790)
- Fixed docstring for `nn.CrossEntropyLoss` (#79568, #82538), `nn.MultiMarginLoss` (#84513)
- Fixed high level `nn.init` module doc to reflect that all functions run with `torch.no_grad` (#80882)
- Fixed docstring for `nn.Module.state_dict` (#83104)
- Updated docstring for `scale_factor` in `nn.functional.interpolate` (#80807)
torch.optim
- Fixed docstring for `optim.lr_scheduler.ChainedScheduler` (#79775)
- Fixed docstring for `optim.swa_utils.SWALR` (#79836)
Composability
Functorch
- Fixed the model description in the functorch ensembling notebook (#83603)
- Fixed indentation in functorch limitations docs (#85346)
- Updated functorch installation instructions (#85854)
- Fixed functorch whirlwind tour notebook to be runnable (#85974)
- Documented new installation instructions for functorch (#86823)
LinAlg
Sparse
- Updated `scatter_add_` documentation to fix argument name (#80223)
- Updated `torch.sparse` docs to better cover CSR/CSC/BSR/BSC (#82108)
- Added torch.sparse overview section (#85265)
- Updated documentation for `mm` family ops and `F.linear` to note limited sparse support (#86220)
torch.fx
- Fixed decomposition example (#79807)
- Added `__all__` to various submodules in torch.fx, distributions, distributed, package (#80367)
- Added warning about DCE in FX being unsound with mutation (#81818)
Quantization
- Replace `qconfig_dict` with `QConfigMapping` in docs (#78533); see the sketch after this list
- Corrects typo in quantization docs (#81687)
- Additional fixes for `quantize_fx` docs (#84587)
- Add example for the error message for fixed qparam ops (#84666)
- Add types for scale and zero_point tensor for `torch.fake_quantize_per_channel_affine` docs (#85733)
- Updated quantization docs to show per channel support for `conv1d` (#81349)
- Add `torch.ao.nn.quantizeable` modules documentation (#79957)
- Add more detailed docs for `torch.ao.quantization.quantize_fx.{prepare_fx|prepare_qat_fx|convert_fx}` (#83132)
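For readers following the first item, a minimal sketch of the `QConfigMapping`-based FX workflow the updated docs describe; the toy model, inputs, and backend/qconfig choices are invented for illustration only.

```python
import torch
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 4),)

# QConfigMapping replaces the older qconfig_dict argument in the documentation.
qconfig_mapping = QConfigMapping().set_global(get_default_qconfig("fbgemm"))

prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)      # calibrate with representative data
quantized = convert_fx(prepared)
```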
ONNX
- Added a table of unsupported aten operators in the documentation (#84496)
CUDA
- Fixed jiterator doc format (#78471)
- Use generic amp autocast in example and specified dtype (#79579)
- Fixed small typo in cuda.rst (#84012)
- Added user facing documentation for CSAN (#84689)
- Fixed broken docstring for `set_float32_matmul_precision` (#78949)
MPS
Package
- PackageExporter does not have file_structure (#79948)
- Updated package.rst to not include hermetic claim (#81019)
- Fixed typos in `torch.package` documentation (#82994)
- Fixed typo in torch/package/_mock.py (#84508)
Distributed
Distributed(c10d)
- Fixed some links in torch/distributed/CONTRIBUTING.md (#79855)
- Updated dist.scatter() documentation (#86069)
- Fixed docstring of `scatter_object_list` (#84596)
- Fixed doc string in `reduce_scatter` (#84983)
DistributedDataParallel
- Corrected the DDP wrap example by removing pg in DDP wrap (#83034)
FullyShardedDataParallel
- Improved auto wrap policy doc (#78400)
- Corrected comments in FSDP for gradient averaging (#80456)
- Updated `ShardingStrategy` and `_free_full_params()` docs (#80894)
- Added mention that `optim_input` will be removed after 1.13 in the BC breakage warning (#85963)
torch.distributed.rpc
- Updated distributed/CONTRIBUTING.md to remove ProcessGroupAgent references and add test instructions (#78625)
Infra (RelEng)
- Added some documentation about the stats uploading process for CI (#79504)
- Fixed release doc builds (#79865)
- Updated release.md with release candidate validation steps (#79889)
Developers
Autograd
- Added the ability to register a hook to grad_fn with `.register_prehook` (#83226); a sketch follows below
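A brief sketch of the new hook registration; the assumed prehook signature (receiving the node's incoming grad_outputs) follows general autograd hook conventions rather than being stated in these notes.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()

def log_grad_outputs(grad_outputs):
    # runs before y.grad_fn's backward is evaluated (assumed signature)
    print("grad_outputs:", grad_outputs)

y.grad_fn.register_prehook(log_grad_outputs)
y.backward()
```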
Build
- Modified nccl_dependency to take dev mode (#79169)
- Moved pytorch buck targets to shared build (#79330)
- Added kineto and flatbuffers to OSS BUCK (#79860)
- Updated llvm deps for Buck build (#79919)
- Moved aten targets to shared buck file (#79966)
- Updated buck_setup.sh (#80467)
- Minor fix for shared build (#80739)
- Deleted CCACHE_DISABLE and SCCACHE_DISABLE from nccl.cmake file (#84007)
Composability
- `TorchDispatchMode` and `TorchFunctionMode` extension points have been added. They are similar to their `__torch_function__` and `__torch_dispatch__` counterparts, but can be used as context managers that intercept all torch operator calls, including factory functions. These APIs are still experimental and aren't quite user facing yet, and we will add more documentation as they are hardened. See this post for more details. (#78214, #78822, #78847, #84774, #83925, #79143, #77667, #80992, #80995, #80998, #82647, #83372) A minimal sketch follows after this list.
- A large amount of hardening to `FakeTensor` and `FakeTensorMode`, a `__torch_dispatch__`-style mode that allows you to run shape/dtype/device inference. This is similar to the "meta" device, but fake tensors also faithfully store device metadata, and the logic lives in Python. (#77969, #77972, #77971, #78516, #78090, #78836, #78895, #78536, #78677, #78522, #78523, #78972, #79170, #80115, #80193, #80544, #81739, #82281, #82574, #82066, #82449, #82337, #82571, #82593, #82172, #84387, #85065, #82846, #85658, #85759, #85920)
- Added some new tags and beefed up tags support for operators in the dispatcher:
  - Add data_dependent_output tag (#83312)
  - Add nondeterministic tags in tags.yaml and add the nondeterministic_seeded tag to all functions in native_functions.yaml defined as nondeterministic by alias_analysis.cpp (#81440)
  - Allow specifying operator tags when registering an operator to the dispatcher (#79322)
  - Add `inplace_view` tag to `resize_()` (#82667)
- Make string serialization of C++ FunctionSchema consistent with torchgen.model.FunctionSchema (#77926)
- Added support for custom namespaces in `torchgen` (#78015, #79733, #81362, #81581)
- Generate kernels for codegen'd `out=` operators (#78626, #81437)
- Added a new alias dispatch key for functional to view op decompositions (#79615)
- Added an env var for dispatcher debug logging (#81846, #82277)
- Fixed printing of DispatchKey in operator not found message (#81637)
- Added test that all BackendComponents are covered by toString (#81713)
- Refactored functionality and backend keys to reduce duplication (#81752)
- Made factory functions `CompositeExplicitAutograd`, so they show up as primitives in `__torch_dispatch__` (#82470)
- Added an `OpOverload.decompose()` API, for running an operator's decomposition if one exists (#83075)
- Fixed our dispatcher schema parser when parsing tensor list alias annotations (#84005)
- Allowed subclasses of `c10::TensorImpl()` to override non-virtual tensor methods (#84806)
- Made pytorch headers consumable from c++20 code bases (#79985)
- Added meta device support to `_UntypedStorage` and `_TypedStorage` (#78008)
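As referenced in the first item of this list, a minimal sketch of a mode used as a context manager is shown below. Since these APIs are explicitly experimental in this release, the import location (`torch.overrides.TorchFunctionMode`) is an assumption and may change.

```python
import torch
from torch.overrides import TorchFunctionMode  # assumed location; experimental API

class LoggingMode(TorchFunctionMode):
    """Logs every torch operator call made inside the context, including factory functions."""
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print("called:", func)
        return func(*args, **kwargs)

with LoggingMode():
    x = torch.ones(2, 2)   # the factory call is intercepted too
    y = (x + 1).sum()
```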
torch.fx
- Added debug statements for small ACC subgraphs elimination (#80117)
- Checked node type before fetching users (#80166)
- Detected ProxyTensor layering violations (#80994)
- Increased stack level for get_attr warning (#81041)
- Preserved a node’s stack trace (#82670, #83050, #83558, #83706, #83960)
- For quantization, removed `WEIGHT_INDEX_DICT` and `BIAS_INDEX_DICT` and replaced them with `node_arg_is_weight` and `node_arg_is_bias` (#83263, #83848)
- Asserted that ProxyTensorMode does not accidentally bake in constants (#83297)
- Improvements to FX Minimizer (#83833)
- Ported matmul compositeimplicitautograd impl into core (#85239)
- OpInfo for Slice (#85554)
- Raised errors in fx.Interpreter with Node info (#85810)
Quantization
- Enabled support for quantized fill of nhwc tensors (#79025)
- Tests for code snippets in quantization docs (#79923)
- Eliminate Named tensor warnings in XNNPACK and QNNPACK (#77762)
- Added earlier termination and improved error message for calling `min` and `max` ops on per channel quantized tensors (#79036)
- Added warnings to quantized dynamic conv and linear ops when `reduce_range=true` (#79273)
- Add assertions to fix `torch::jit::load` bugs (#79192)
- Optionally clamp weights post quantization (#83438)
ONNX
- `onnx.verification`: tool to verify exported model discrepancy between sets of inputs (#78323)
- Symbolic function registration is now done via decorators (#84709)
- `g.op` methods now exposed via the GraphContext class (#84728)
- Initial version of diagnostics infrastructure (#85107)
- Add dtype check in onnx verification (#79263)
Intel
- Added native impl for group norm on quantized CPU for channels-last inputs: (#70520)
- Added `qscheme` check for quantization observer (#80126)
- Added oneDNN graph fuser context API and unittest (#82491)
- Added eltwise OPs for NNC: `mish` and `elu` (#80586)
- Support BF16ImmPtr (#84041)
- Enabled fusion of conv with elementwise OP for NNC (#77157)
- Channels last propagation within NNC fusion group (#76948)
- Lowering function generates the output buffer with the specified stride for NNC(#76529)
- Simplified IfThenElse and CompareSelect within for-loop for NNC (#76793)
- Do not pull in autocast* ops into NNC (#85140)
MPS
- Improve MPS test by extending `test_no_warnings_on_input` by capturing any output (#79163)
- Add testcase in test_mps for circular mode in torch.pad (#81455)
- Fixed build warnings while building with MPS on Mac platforms (#83048)
- Add per-op MPS gradient tests and update skips for TestConsistency (#84242)
Profiler
- New event representation in profiler (#77693, #77694, #77695, #78163, #79173, #81965, #80797, #81319, #81320, #81321, #81322, #80822, #82993)
- Build call tree for profiled events (#77698, #80810)
- Copy rollbear/strong_type to `c10/util` (#78162)
- Collect Layout and expose TensorMetadata (#81155)
- Added support for storing scalar values in profiling (#81843)
- Added support for Device (#82787)
- Added SOFT_ASSERT to gracefully recover from invariant violations (#82689)
- Added support for accessing strides and scalars (#80072)
- Record nn.Module's parameters (#83209)
- Extend Python bindings (#83622)
- Capture storage data pointer (#84276)
- Cleaned up Tensor representation (#85161)
- Compute unique IDs for Tensors (#85162)
- set_class util (part 1 of Record Optimizer) (#84779)
- Tracking Optimizer (part 2 of Record Optimizer) (#84920)
- Optimizer param_groups (part 3 of Record Optimizer) (#85784)
- Optimizer states (part 4 of Record Optimizer) (#85840)
- Extend ID assignment to allocations and frees (#85719)
- Made `name` a property (#85720)
- Added dtype to `_TensorMetadata` (#85721)
- Updated python binding type annotations (#85722)
- Started moving python bindings out of autograd (#82584)
Vulkan
- Vulkan operators that use prepacking have switched from using individual `OpContext` classes to `PackedContext` classes that inherit from a generic `VulkanOpContext` class, which should reduce boilerplate code when implementing new ops that require prepacking (#78814, #78815, #78816, #78817, #78818, #82730, #83526)
- Code under `ATen/native/vulkan/api` was essentially rewritten to improve code organization and readability. The refactor implements RAII patterns for the classes used to represent Vulkan handles to facilitate proper resource management, and re-designed how the `Context` class functions in order to enable concurrent execution of multiple Vulkan models. The stack of PRs containing these refactors can be found at #80699
- Lint is now enforced in `ATen/native/vulkan` (#81390)
- The VulkanMemoryAllocator version used was upgraded to 3.0.1, which now lives under `third_party` (#81472, #83906, #83934)
- Shader layouts are now automatically generated based on the GLSL code (#81715, #81716)
Distributed
torch.distributed
- Added `__all__` to torch.distributed and tensorboard submodules (#80444)
- Added `__all__` to torch.{fx, distributed, backends} submodules (#85079)
- Added `__all__` to fx, distributed and cuda submodules (#85080)
- Added `__all__` to torch.distributed, futures, fx, nn, package, benchmark submodules (#80520)
- Added `__all__` to torch.distributed submodules (#80523)
- Eliminated code duplication in distributed rendezvous (#81577)
- Refactored distributed to use absolute header path (#85780)
torch.distributed.elastic
- Added `__all__` for torch.nn.modules, torch.distributed.elastic, torch.nn.utils submodules (#80240)
- Fixed macos public bindings failures (#80970)
Distributed(c10d)
- Logged full rank fingerprint mismatches in ProcessGroupWrapper (#79901)
- Added environment parse function that supports default value (#85563)
- Added host and port to TCPStore pyi definition (#84636)
- Added underlying_store property for PrefixStore (#84640)
- Enabled per-thread ProcessGroup for testing (#84153)
- Moved ProcessGroup::Work into a separate class (#83680)
- Install c10d headers with absolute path (#86257)
Infra (RelEng)
- Migrated off xenial gcc5.4 from merge rules (#78137)
- Added functionality for rebasebot to rebase onto viable/strict branch (#78276)
- Pinned protobuf version to 3.20.1 in docker CI build (#78369)
- Removed gcc5.4 from docker/build.sh (#78405)
- Removed gcc5.4 jobs from CircleCI config (#78555)
- Added merge rules for “pytorch distributed” module (#78751)
- Added onnx / test to required merge rules (#78790)
- Added userbenchmark support to TorchBench CI (#78794)
- Installed torchdynamo as part of most CI jobs (#79051)
- Removed linux-xenial-py3_7-clang7-asan from merge rules (#79088)
- Ran torchdynamo tests on PyTorch Linux CI (#79099)
- Centralized commit pins in a folder (#79150)
- Moved CUDA flags out of --per_file_copts into the cu_library macro (#79414)
- Moved CI to cuda-11.6 (#79921)
- Enabled pytest to run test_ops, test_ops_gradients, test_ops_jit in non linux cuda environments (#79898)
- Upgraded pytorch nightly docker python version to 3.8 (#80051)
- Updated Dockerfile to install cmake as part of conda install (#80258)
- Re-enabled vulkan test (#81368)
- Enhanced mergebot with the feature of posting the PR Comment on cancel (#82744)
- Changed nccl build to be single-threaded (#83173)
- Added process for maintaining Build + CI contributors list (#83869)
- Implemented mechanisms to block land checks if the PR hasn't been approved yet (#84239)
- Allowed External Scripts (e.g. vscode) To Discover and Execute unittest Tests (#85584)
- Updated the pinned torchdynamo hash to `6ead5cae0d1234aa64db06fe230ef56e12ec76fe` (#85683)
- Updated the pinned torchvision hash to `d7d90f56117ce0955332846a5f90b8d1346c4c09` (#85776)
- Modified all functions (except factory functions) to support SymInt and update xla hash to `f2b36df6a1a80137eff7644e6d0f4eeb7ff429d6` (#86078)