PyTorch 1.8.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- New Features
- Improvements
- Performance
- Documentation
Highlights
We are excited to announce the availability of PyTorch 1.8. This release is composed of more than 3,000 commits since 1.7. It includes major updates and new features for compilation, code optimization, frontend APIs for scientific computing, and AMD ROCm support through binaries that are available via pytorch.org. It also provides improved features for large-scale training for pipeline and model parallelism, and gradient compression. A few of the highlights include:
- Support for doing python to python functional transformations via torch.fx;
- Added or stabilized APIs to support FFTs (torch.fft), Linear Algebra functions (torch.linalg), added support for autograd for complex tensors and updates to improve performance for calculating hessians and jacobians; and
- Significant updates and improvements to distributed training including: Improved NCCL reliability; Pipeline parallelism support; RPC profiling; and support for communication hooks adding gradient compression.
See the full release notes here.
Along with 1.8, we are also releasing major updates to PyTorch libraries including TorchCSPRNG, TorchVision, TorchText and TorchAudio. For more on the library releases, see the post here. As previously noted, features in PyTorch releases are classified as Stable, Beta and Prototype. You can learn more about the definitions in the post here.
You can find more details on all the highlighted features in the PyTorch 1.8 Release blogpost.
Backwards Incompatible changes
Fix Tensor inplace modulo in python (#49390)
Inplace modulo in python, %=, was wrongfully done out of place for Tensors. This change fixes the behavior. Previous code that was relying on this operation being done out of place should be updated to use the out of place version t = t % other instead of t %= other.
1.7.1:
>>> a = torch.arange(0, 10)
>>> b = a
>>> b %= 3
>>> print(a)
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> print(b)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
1.8.0:
>>> a = torch.arange(0, 10)
>>> b = a
>>> b %= 3
>>> print(a)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
>>> print(b)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
Standardize torch.clamp edge cases (#43288)
For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.
This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). Previously, torch.clamp computed this in its vectorized CPU implementation but used different approaches for other backends. These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered:
>>> t = torch.arange(200).to(torch.float)
>>> torch.clamp(t, 4, 2)[0]
tensor(2.)
>>> torch.clamp(t.cuda(), 4, 2)[0]
tensor(4., device='cuda:0')
>>> torch.clamp(torch.tensor(0), 4, 2)
tensor(4)
This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max. Python has no standard clamp implementation.
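In 1.8.0 all backends follow the same rule; a minimal sketch of the now-consistent result (the second call assumes a CUDA device is available):
>>> import torch
>>> t = torch.arange(200).to(torch.float)
>>> torch.clamp(t, 4, 2)[0]          # min(max(0., 4), 2) == 2
tensor(2.)
>>> torch.clamp(t.cuda(), 4, 2)[0]   # CUDA now agrees with CPU
tensor(2., device='cuda:0')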
Tensor deepcopy now properly copies the .grad field (#50663)
The deepcopy protocol will now properly copy the .grad field of Tensors when it exists. The old behavior can be recovered by setting the .grad field to None after doing the deepcopy.
1.7.1:
>>> t.grad
tensor([0.8883, 0.5765])
>>> deepcopy(t).grad
None
1.8.0:
>>> t.grad
tensor([0.8883, 0.5765])
>>> deepcopy(t).grad
tensor([0.8883, 0.5765])
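If the old behavior is needed, clearing the copied gradient is enough; a minimal sketch:
>>> import torch
>>> from copy import deepcopy
>>> t = torch.rand(2, requires_grad=True)
>>> t.sum().backward()        # populates t.grad
>>> c = deepcopy(t)           # in 1.8.0, c.grad is a copy of t.grad
>>> c.grad = None             # recover the 1.7.1 behavior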
Fix torch.fmod type promotion (#47323, #48278)
1.7.1:
Raises RuntimeError when one input is an integral tensor and the other is a floating-point tensor. The dtype of the output is determined by the first input.
>>> x = torch.arange(start=1, end=6, dtype=torch.int32) # tensor([1, 2, 3, 4, 5])
>>> y = torch.arange(start=1.1, end=2.1, step=0.2, dtype=torch.float32) # tensor([1.1, 1.3, 1.5, 1.7, 1.9])
>>> torch.fmod(x, y)
RuntimeError: result type Float can't be cast to the desired output type Int
>>> z = torch.arange(start=0.2, end=1.1, step=0.2, dtype=torch.float64) # tensor([0.2, 0.4, 0.6, 0.8, 1.], dtype=torch.float64)
>>> torch.fmod(y, z).dtype
torch.float32
>>> torch.fmod(z, y).dtype
torch.float64
>>> torch.fmod(x, 1.2)
tensor([0, 0, 0, 0, 0], dtype=torch.int32)
1.8.0:
Supports an integral tensor and a floating-point tensor as inputs. The dtype of the output is determined by type promotion of both inputs.
>>> x = torch.arange(start=1, end=6, dtype=torch.int32) # tensor([1, 2, 3, 4, 5])
>>> y = torch.arange(start=1.1, end=2.1, step=0.2, dtype=torch.float32) # tensor([1.1, 1.3, 1.5, 1.7, 1.9])
>>> torch.fmod(x, y)
tensor([1.0000, 0.7000, 0.0000, 0.6000, 1.2000])
>>> z = torch.arange(start=0.2, end=1.1, step=0.2, dtype=torch.float64) # tensor([0.2, 0.4, 0.6, 0.8, 1.], dtype=torch.float64)
>>> torch.fmod(y, z).dtype
torch.float64
>>> torch.fmod(z, y).dtype
torch.float64
>>> torch.fmod(x, 1.2)
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])
Preserve non-dense or overlapping tensor's layout in *_like functions (#46046)
All the *_like factory functions will now generate the same striding as out of place operations would. This means in particular that non-contiguous tensors will produce non-contiguous outputs. If you require a contiguous output, you can pass the memory_format=torch.contiguous_format keyword argument to the factory function. Such factory functions include clone, to, float, cuda, *_like, zeros, rand{n}, etc.
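For example, a contiguous result can be requested explicitly (a minimal sketch, assuming a transposed and hence non-contiguous input; by default the output striding follows the input):
>>> import torch
>>> base = torch.rand(4, 4).t()                 # non-contiguous input
>>> torch.zeros_like(base).is_contiguous()      # output striding follows the input
False
>>> torch.zeros_like(base, memory_format=torch.contiguous_format).is_contiguous()
True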
Make output of torch.norm and torch.linalg.norm consistent for complex inputs (#48284)
Previously, when given a complex input, torch.linalg.norm and torch.norm would return a complex output. torch.linalg.cond would sometimes return a complex output and sometimes return a real output when given a complex input, depending on its p argument. This PR changes this behavior to match numpy.linalg.norm and numpy.linalg.cond, so that a complex input will result in a real number type, consistent with NumPy.
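For example, the norm of a complex tensor now has a real dtype (a minimal sketch):
>>> import torch
>>> z = torch.tensor([3 + 4j])
>>> torch.linalg.norm(z)
tensor(5.)
>>> torch.linalg.norm(z).dtype
torch.float32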
Make torch.svd return V, not V.conj() for complex inputs (#51012)
torch.svd added support for complex inputs in PyTorch 1.7, but was not documented as doing so. The complex V tensor returned was actually the complex conjugate of what's expected. This PR fixes the discrepancy. Users that were already using the previous version of torch.svd with complex inputs can recover the previous behavior by taking the complex conjugate of the returned V.
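A minimal sketch of recovering the 1.7-style value:
>>> import torch
>>> a = torch.randn(3, 3, dtype=torch.complex64)
>>> u, s, v = torch.svd(a)      # v now matches the mathematical V
>>> v_as_in_1_7 = v.conj()      # the value 1.7.x used to return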
torch.angle: properly handle pure real numbers (#49163)
This PR updates PyTorch's torch.angle operator to be consistent with NumPy's. Previously torch.angle would return zero for all real inputs (including NaN). Now angle returns pi for negative real inputs, zero for non-negative real inputs, and propagates NaNs.
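Illustrating the new rule (a minimal sketch; printed values are rounded by the default tensor formatting):
>>> import torch
>>> torch.angle(torch.tensor([-2.0, 0.0, 3.0, float('nan')]))
tensor([3.1416, 0.0000, 0.0000,    nan])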
Enable distribution validation by default for torch.distributions (#48743)
This may slightly slow down some models. Concerned users may disable validation by using torch.distributions.Distribution.set_default_validate_args(False) or by disabling individual distribution validation via MyDistribution(..., validate_args=False).
This may cause new ValueErrors in models that rely on unsupported behavior, e.g. Categorical.log_prob() applied to continuous-valued tensors (only {0,1}-valued tensors are supported). Such models should be fixed but the previous behavior can be recovered by disabling argument validation using the methods mentioned above.
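Both opt-out mechanisms, as a minimal sketch:
>>> import torch
>>> from torch import distributions as D
>>> D.Distribution.set_default_validate_args(False)                        # disable globally
>>> d = D.Categorical(probs=torch.tensor([0.3, 0.7]), validate_args=False) # or per instance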
Prohibit assignment to a sparse tensor (#50040)
Assigning to a sparse Tensor did not work properly and resulted in a no-op. The following code now properly raises an error:
>>> t = torch.rand(10).to_sparse()
>>> t[0] = 42
TypeError: Cannot assign to a sparse tensor
C++ API: operators that take a list of optional Tensors cannot be called with ArrayRef<Tensor> anymore (#49138)
This PR changes the C++ API representation of lists of optional Tensors (e.g. in the Tensor::index method) from ArrayRef<Tensor> to List<optional<Tensor>>. This change breaks backwards compatibility, since there is no implicit conversion from ArrayRef<Tensor> to List<optional<Tensor>>.
A common call pattern is tensor.index({indices_tensor}), where indices_tensor is a Tensor. This will continue to work because the {} initializer_list constructor for List<optional<Tensor>> can take Tensor elements that are implicitly converted to optional<Tensor>.
However, another common call pattern is tensor.index(indices_tensor), where previously the Tensor got implicitly converted to an ArrayRef<Tensor>. To implicitly convert Tensor -> optional<Tensor> -> List<optional<Tensor>> would chain two implicit conversions, which C++ doesn't allow. So those call sites should be rewritten to use the tensor.index({indices_tensor}) pattern.
Autograd view creation information is now properly propagated when views are chained
After this fix, an error will properly be thrown to avoid wrong gradients when an in-place operation is performed on a view of a view, if in-place operations were not allowed on the first view. This means that code that used to return wrong gradients in 1.7.1 (such as t.unbind()[0].select(0, 0).add_(1)) will now properly raise an error.
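A minimal sketch of the pattern that now fails loudly (the exact error message may vary):
>>> import torch
>>> t = torch.rand(3, 4, requires_grad=True)
>>> t.unbind()[0].select(0, 0).add_(1)   # silently produced wrong gradients in 1.7.1; raises a RuntimeError in 1.8.0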
End of deprecation cycle for spectral ops in the torch. namespace (#48594)
This PR removes the deprecated torch.{fft,rfft,ifft,irfft} and their corresponding methods on torch.Tensor. PyTorch programs using these functions must now update to use the torch.fft namespace.
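A minimal sketch of the migration (the commented-out call uses the removed 1.7-era signature):
>>> import torch
>>> import torch.fft                     # the namespace ships with torch; the explicit import is harmless
>>> x = torch.randn(8)
>>> # torch.rfft(x, signal_ndim=1)       # 1.7.1 and earlier; removed in 1.8.0
>>> torch.fft.rfft(x).shape              # one-sided complex spectrum
torch.Size([5])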
torch.digamma: properly handle all inputs (#48302)
This PR updates PyTorch's torch.digamma function to be consistent with SciPy's special.digamma function. This changes the result of the torch.digamma function on the nonpositive integers, where the gamma function is not defined. Since the gamma function is undefined at these points, the (typical) derivative of the logarithm of the gamma function is also undefined at these points, and for negative integers this PR updates torch.digamma to return NaN. For zero, however, it returns -inf to be consistent with SciPy.
Interestingly, SciPy made a similar change, which was noticed by at least one user: scipy/scipy#9663
SciPy's returning of negative infinity at zero is intentional:
https://github.com/scipy/scipy/blob/59347ae8b86bcc92c339efe213128f64ab6df98c/scipy/special/cephes/psi.c#L163
This change is consistent with the C++ standard for the gamma function:
https://en.cppreference.com/w/cpp/numeric/math/tgamma
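Illustrating the new values at the nonpositive integers (a minimal sketch):
>>> import torch
>>> torch.digamma(torch.tensor([0., -1., -2.]))
tensor([-inf, nan, nan])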
Fix torch.remainder type promotion (#48668)
1.7.1:
In the case where the second argument is a python number, the result is cast to the dtype of the first argument (x below is the int32 tensor from the torch.fmod example above).
>>> torch.remainder(x, 1.2)
tensor([0, 0, 0, 0, 0], dtype=torch.int32)
1.8.0:
In the case where the second argument is a python number, the dtype of the result is determined by type promotion of both inputs.
>>> torch.remainder(x, 1.2)
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])
Changes to onnx export API to better handle named arguments (#47367)
The args input argument of the torch.onnx.export function is updated to better support optional arguments. An optional dictionary can be passed in addition as the last argument in the args tuple, specifying inputs with the corresponding named parameter. Note that this is backward breaking for cases where the last input is also of a dictionary type. In the new API, for such cases, it is mandatory to have an empty dictionary as the last argument in the args tuple.
More details can be found at: https://pytorch.org/docs/1.8.0/onnx.html?highlight=onnx#using-dictionaries-to-handle-named-arguments-as-model-inputs.
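A minimal sketch of the new convention, using a hypothetical toy module (the names M, x and y are illustrative, not part of the API):
import torch

class M(torch.nn.Module):
    def forward(self, x, y=None):
        return x if y is None else x + y

m, x = M(), torch.randn(2, 3)
# Named/optional inputs go in a dict as the last element of the args tuple:
torch.onnx.export(m, (x, {"y": torch.randn(2, 3)}), "model.onnx")
# If the model's last positional input really is a dict, append an empty dict:
# torch.onnx.export(m, (x, some_dict_input, {}), "model.onnx")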
Update signature of torch.quantization.quantize function (#48537)
The run_args argument must now contain a list or tuple containing the positional arguments, even if there is only a single argument. In particular, code like qmodel = quantize(float_model, default_eval_fn, img_data) that was working in 1.7.1 will now raise the error TypeError: default_eval_fn() takes 2 positional arguments but 3 were given.
You should update this code to provide the image in a list, for example: qmodel = quantize(float_model, default_eval_fn, [img_data])
Change the way we quantize relu, leaky relu and sigmoid (#47415, #48038, #45702, #45711, #45883, #45882, #47660)
Starting with version 1.8.0, in the eager mode quantization flow, relu is not observed anymore as it is not needed.
In previous versions, quantized leaky_relu and sigmoid did not require observation and just inherited the quantization parameters from their input, but that does not work very well in eager mode quantization. Starting with version 1.8.0, they are observed operators so that they work better in eager mode quantization.
Update direction numbers to 21201 dims in the SobolEngine (#49710)
This update is BC-breaking because the values drawn by the engine will be different from the ones drawn in 1.7.1 even with the same seed.
1.7.1:
>>> from torch.quasirandom import SobolEngine
>>> eng = SobolEngine(1)
>>> eng.draw(3)
tensor([[0.5000],
        [0.7500],
        [0.2500]])
1.8.0:
>>> from torch.quasirandom import SobolEngine
>>> eng = SobolEngine(1)
>>> eng.draw(3)
tensor([[0.0000],
        [0.5000],
        [0.7500]])
Deprecations
Python API
Deprecate old style nn.Module backward hooks (#46163)
Old style nn.Module backward hooks have been broken for a long time (they do not behave as advertised in the documentation). We now have a new nn.Module.register_full_backward_hook that provides a fully working implementation of these hooks. The old function should not be used; code should be migrated to the new full version.
An example of this discrepancy is shown below, where a Linear layer takes as input a single Tensor of size 5 and returns a single Tensor of size 5, but the old style hook returns two gradients with respect to the input for only one input.
1.7.1:
import torch
from torch import nn
mod = nn.Linear(5, 5)
def hook(mod, grad_inp, grad_out):
print(f"grad input size: " + " ".join(str(g.size()) for g in grad_inp))
print(f"grad output size: " + " ".join(str(g.size()) for g in grad_out))
mod.register_backward_hook(hook)
mod(torch.rand(5, requires_grad=True)).sum().backward()
>>> grad input size: torch.Size([5]) torch.Size([5]) # One too many
>>> grad output size: torch.Size([5])
1.8.0:
Old style hooks are deprecated and will warn when providing a wrong result.
import torch
from torch import nn
mod = nn.Linear(5, 5)
def hook(mod, grad_inp, grad_out):
print(f"grad input size: " + " ".join(str(g.size()) for g in grad_inp))
print(f"grad output size: " + " ".join(str(g.size()) for g in grad_out))
mod.register_backward_hook(hook)
mod(torch.rand(5, requires_grad=True)).sum().backward()
>>> grad input size: torch.Size([5]) torch.Size([5]) # One too many
>>> grad output size: torch.Size([5])
>>> UserWarning: Using a non-full backward hook when the forward contains multiple
autograd Nodes is deprecated and will be removed in future versions. This hook
will be missing some grad_input.
Full hooks should be used to get the proper result all the time and avoid warnings:
mod.register_full_backward_hook(hook)
mod(torch.rand(5, requires_grad=True)).sum().backward()
>>> grad input size: torch.Size([5])
>>> grad output size: torch.Size([5])
torch.stft: Deprecate default value of the return_complex argument (#49022, #50102)
Previously torch.stft took an optional return_complex parameter that indicated whether the output would be a real tensor or a complex tensor. return_complex has the default value of False. This default value is deprecated (meaning that this optional argument is becoming mandatory) and will be removed in future versions. You can pass this argument explicitly to avoid this deprecation.
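Passing the argument explicitly avoids the deprecation; a minimal sketch:
>>> import torch
>>> x = torch.randn(1024)
>>> spec = torch.stft(x, n_fft=64, return_complex=True)   # explicit, so no deprecation warning
>>> spec.dtype
torch.complex64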
Deprecate torch.set_deterministic in favor of torch.use_deterministic_algorithms (#49904)
This beta feature is being renamed for improved clarity. Users should migrate to use the new name.
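Migration is a one-line change; a minimal sketch:
>>> import torch
>>> torch.use_deterministic_algorithms(True)   # new name
>>> # torch.set_deterministic(True)            # deprecated spelling, same effect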
Deprecate torch.* linear algebra functions in favor of the torch.linalg.* variants for cholesky (#51460), slogdet (#51354), inverse (#51672), pinverse (#51671)
All the linear algebra functions are being moved to the torch.linalg submodule, which provides a NumPy-compatible API. These new functions have the same set of features as the torch. ones and should be used instead.
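For example, for cholesky (a minimal sketch; the two spellings agree on a positive-definite input):
>>> import torch
>>> a = torch.randn(3, 3)
>>> spd = a @ a.T + 3 * torch.eye(3)    # symmetric positive-definite input
>>> torch.allclose(torch.linalg.cholesky(spd), torch.cholesky(spd))
True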
New features
Python API
- New functions (most of them to improve numpy compatibility): torch.nan_to_num (#44592), torch.tensor_split (#45168), torch.nanmedian (#45847), torch.ravel (#46098), torch.igamma (#46183), torch.igammac (#48171), torch.{column_stack,row_stack} (#46313), torch.kron (#45358), torch.copysign (#46396), Tensor.new_empty_strided (#47225), torch.{swapdims,swapaxes} (#46041), torch.tile (#47974), torch.float_power (#44937), torch.moveaxis (#48581), torch.inner (#46716), torch.msort (#48440), torch.sinc (#48740), torch.broadcast_to (#48997), torch.xlogy (#48777), torch.f{max,min} (#49312), torch.diff (#50569), torch.ldexp (#45370), torch.broadcast_shapes (#43935)
- torch.fft new features: 2D FFT functions (#45164), use new FFT operators in stft (#47601), helper functions (#44877), fuzzing benchmark (#47872)
- torch.linalg new features: linalg.tensorsolve (#46142), linalg.cholesky (#46083), linalg.tensorinv (#45969), linalg.{eigh,eigvalsh} (#45526), linalg.matrix_rank (#48206), linalg.solve (#48456), linalg.qr (#47764, #50046), linalg.svd (#45562), linalg.inv (#48261), linalg.pinv (#48399), linalg.slogdet (#49194), linalg.cond (#45832)
- New torch.nn Modules: nn.PixelUnshuffle (#49334), nn.GaussianNLLLoss (#50886)
- Automatic shape inference in torch.nn: new nn.LazyLinear (#44538), nn.LazyConv{1,2,3}d and nn.LazyConvTranspose{1,2,3}d (#47350)
- Add channels last support for torch.nn.AdaptiveAvgPool2d (#48916)
- Add option to produce standalone executable with cpp_extensions (#47862)
- Add sparse-sparse matrix multiplication support (#39526)
- Add torch.futures.Future.add_done_callback (#45675)
- Add three_phase optional argument to torch.optim.lr_scheduler.OneCycleLR (#42715)
- Add bicubic option for the mode argument of torch.nn.functional.grid_sampler (#44780)
- Add new distributions to torch.distributions: Kumaraswamy (#48285), LKJCholesky (#48798)
- Add reparameterization support to torch.distributions.OneHotCategorical (#46610)
- Add new transforms to torch.distributions: CorrCholeskyTransform (#48041)
- Add new constraint to torch.distributions: independent (#50547, #50302)
- Add zero annealing epochs to SWA optimizer (#47579)
- Add close method to torch.hub.tqdm mock (#46040)
- Add support for pruning based on custom importance scores via the importance_scores keyword argument (#48378)
- Add torch vitals (#51047)
Complex Numbers
- Complex Number support on CPU and CUDA for torch.symeig (#45121), torch.pinverse (#45819), torch.det (#45980), torch.diagflat (#47564), torch.{addcmul, addcdiv} (#46639), torch.lu_solve (#48028), torch.matrix_exp (#48363), torch.eig (#49168), torch.{acosh, asinh, atanh} (#50387), torch.masked_scatter (#51281), torch.bmm and torch.baddbmm (#42553), torch.orgqr (#50502), torch.index_fill_ (#50578), torch.cholesky_inverse (#50269)
- Complex Number support on CUDA for torch.qr (#45032), torch.lu (#45898), torch.prod (#45980), torch.triangular_solve (#46916), torch.solve (#47045), torch.cholesky_solve (#47047), torch.mean (#47048), torch.svd (#45795), torch.inverse (#47595), torch.Tensor.index_put_ (#51148)
- Complex Number support on CPU for torch.trace (#50380)
- Complex Number support for torch.nn.DataParallel (#48686), torch.nn.L1Loss (#49912), Padding functions (#50594)
- Complex Number support for torch.distributed.{all_reduce, all_gather} (#45879, #46270)
- Complex Autograd support for torch.{atan, log, log10, log1p, log2, reciprocal, tan, pow, rsqrt, tanh, asinh, acosh} (#46275), torch.{cholesky, triangular_solve, mm, mv, ger} (#45737), torch.take(), torch.Tensor.fill_() (#46860), torch.matrix_exp (#48363), torch.{baddbmm, addbmm, addmm, addmv} (#50632), torch.qr (#48489), torch.svd and torch.pinverse (#47761), torch.sqrt (#49461), torch.diag (#51268), torch.trace (#51537), torch.exp (#47194), torch.mean (#47566), torch.addr (#50667), torch.{stack, gather, index_select}, torch.Tensor.index_add_ (#49552), torch.{masked_scatter, masked_select} (#51281), torch.{addcmul, addcdiv} (#46639), torch.{acosh, asinh, atanh} (#50387), torch.solve (#47045), torch.cholesky_solve (#47047), torch.inverse (#47595)
- Add complex autograd support for named tensors (#47289)
- Allow converting parameters and buffers of torch.nn.Module to complex dtypes (#44788)
- Add complex support to IValues (#50883, #51476)
- Add TorchScript type annotation logic for complex numbers (#50884)
- Add serialization logic for complex numbers (#51287)
- Add support for complex number lists in JIT (#51145)
- Add support for complex valued keys for dict in TorchScript (#51472)
- Add scalar.conj() (#46596)
- Add Tensor.copy_() for ComplexHalf tensors (#45339)
Profiler
- New profiler API (#48280)
- Use libkineto in profiler (#46470)
- Add FLOPS computation support to the new profiler API (#51734)
- Add high level profiling trace for dataloading and optimizer (#47655)
- Add support for SVG visualization (#48438)
Autograd
- Add inputs argument to autograd.backward() both in python and c++ (#46855, #47214)
- Add support for Tensor-like objects in torch.autograd.gradcheck (#45732)
- Add experimental vectorize flag to torch.autograd.functional.{jacobian, hessian} (#50915, #51638)
- Add anomaly mode in C++ API (#46981, #47164)
- Make torch.lu differentiable. (#46284)
- Add support for generators in autograd decorators like torch.no_grad (#49017)
Dataloader
- Add BufferedShuffleDataset (#45290)
- Add warning if DataLoader is going to create an excessive number of threads (#46867)
- Add prototype of BatchIterDataPipe (#49186, #51880)
- Add prototype of SamplerIterDataPipe (#49363, #52104)
- Implement BucketBatchIterDataPipe (#51126, #51880)
- Add Tar DataPipe-s (#51398)
- Add MapIterDataPipe (#51488, #51879)
CUDA
- Allow user to specify a fraction of the GPU memory with set_per_process_memory_fraction. (#48172)
- CUDA BFloat16 TopK (#44755)
- Add LazyNVRTC (#45674)
- Enable CUDA Fuser for ROCm (#45965)
- Define the record_stream method in native_functions.yaml (#44301)
- Add CUDA 11.1 docker build (#46283)
- Add nvtx.range() context manager (#42925)
- CUDA BFloat16 gelu, hardswish, hardsigmoid (#44997)
- [ROCm] enable stream priorities (#47136)
- Add bfloat support for torch.randn and torch.norm (#47143)
- CUDA BFloat16 Dropout (#45005), batchnorm (non-cuDNN) (#44994), backwards (#48809), sparse (#48807), indexing (#48801), embedding (#44848), signal windows (#45155), norm (#48806), isinf and isfinite (#49356), gemms on arch other than ampere (#50442), clamp, remainder, lshift, rshift (#45247)
- Make CUDAGeneratorImpl capturable (#48694)
- Adding support for CuDNN-based LSTM with projections (#47725)
- Add torch.cuda.can_device_access_peer (#50446)
- Add torch::cuda::nccl::all2all (#45900)
C++ API
- Add distance-agnostic triplet margin loss (#45377)
- Add torch::nn::ModuleDict (#47707)
- Add torch::cuda::synchronize (#50072)
- Add new XPU backend type for Intel heterogeneous computation platform. (#49786)
TorchScript
- torch::jit::freeze C++ api introduced (#52337, #52392)
- Add API for ignoring arbitrary module attributes during compilation (#45262)
- Support tracing tensor __setitem__ with dynamic shape (#45828)
- Expose script_if_tracing as public API (#46494)
- Support %-based string formatting (#45976)
- Add torch.jit.isinstance support for typed containers (#46062)
- Allow for source code comments at any level of indentation (#46548)
- Support hashing of various data types by implementing generic hashing for IValues (#46441)
- Support doc string for TorchBind custom classes (#46576)
- Add API for selective lowering of modules to custom JIT backend (#43613)
- add list() support (#42382)
- Support using lambda function as TorchBind constructor (#47819)
- Support user defined classes as constants (#45556)
- Allow del statements with multiple targets (#48876)
- Tuple Slice with both negative and positive stepped size (#48660)
- Expose run_async function on torch::jit::Method (#48607)
- Add flag torch_jit_disable_warning_prints to allow disabling all warnings.warn (#49313)
- Add dict comprehension (#47774)
- Adding support for bitwise augassignment operators (+= style statements) (#44621)
- Support the in operator with str (#47057)
- Adding JIT support for cuda streams and events (#48020)
- Add Type::{castRaw,expectRef} (#50061)
- Allow arbitrary docstrings to be inside torchscript interface methods (#50271)
- Change list striding parameters to take optional integer (#48719)
- Add support for scripting and running module level hooks in JIT (#49544, #49975, #49545, #49546, #49547)
- Support default argument values of a method (#48863)
- Graceful invalidation of Python Node/Value/Block when C++ object is deleted (#50326)
- Support Union[NoneType, T] as input type (#51605)
- Allow implicit boolean conversion of lists, strings, and dictionaries (#51683)
Mobile
- Add instance_key into mobile stats logging. (#45517)
- Profiling allocator for mobile. (#43951)
- [Metal] Add Metal/MPSCNN support on iOS (#46112)
- [Metal] Introduce USE_PYTORCH_METAL (#46383)
- [Metal] Support Resnet models (b63ddd6)
- PyTorch NNAPI integration prototype (#46780)
- [Metal] Enable Metal on macosx (#47635)
- [Metal] Enable optimize_for_mobile on Linux (#46384)
- [Android] Fix YUV camera image to tensor (#50871)
- [Android] turn on USE_VULKAN for android builds by default (#51291)
- Add windows JNI support (#44257)
- Enable partial loading of GPU models on linux CPU machines (#51236)
Distributed
- Support
send
andrecv
in c10d NCCL backend (#44921, #44922) - Add support for NCCL alltoall (#44374)
- Upstream
fairscale.nn.Pipe
into PyTorch astorch.distributed.pipeline
(#44090) - Add a
--logdir
option to log subprocess output to files in DDP launcher. (#33193) - Support
RRef.backward()
for local RRefs. (#46568) and Owner RRefs. (#46641) - Support C++ implementation for DDP communication hook. (#46566)
- Provide 2 default C++ comm hooks for DDP (#46701)
- Support remote device format
"worker_name/device"
(#46773) - Enable creation and transfer of
ScriptModule
over RPC (#48293) - Enable TCPStore on Windows (#47749)
- Support
torch.distributed.irecv(src=None, ...)
asrecv_anysource
(#49383) - Implement layer-wise PowerSGD as a DDP comm hook (#49639)
- Support
alltoall_single
in TorchScript (#48345) - Enable GPU-to-GPU comm in
TensorPipeAgent
(#44418) - Support timeout in
rref._get_type()
(#50498) - Support timeout for RRef proxy functions (#50499)
- Add optimizer state sharding as ZeroRedundancyOptimizer (#46750)
- Add distributed functional Adam optimizer (#50624), sgd optimizer (#50618), Adadelta optimizer (#50623), RMSprop optimizer (#50619), AdamW optimizer (#50620)
- Create a DDPLoggingData struct and expose it to python interface (#50622)
- Implement autograd functions for c10d communication operations (#40762)
- Enable TensorPipe's SHM transport (#50760)
- Support device map for distributed autograd while using TensorPipe. (#44859)
- Create PyTorch DDP logging APIs for applications to use (#50637)
- Add
set_exception
API intorch.futures.Future
(#50983) - Add
scatter_object_list
API for c10d (#43930) - Provide parameter to pass GPU ID in barrier function (#49069)
- Enable TensorPipe CUDA fallback channel (#50675)
- Enable TensorPipe's InfiniBand transport (#50761)
torch.fx
- allow custom behavior for args, kwargs, and bool (#45193)
- Mutable Graph APIs (#45227)
- Make output a non-special Node (#45599)
- Make
Tracer.trace()
just return a Graph (#45704) - Preserve type annotations on generated code in Graph (#45880)
- Make
graph_copy
examine existing values in val_map (#46104) - Allow tracing free functions (#46268)
- Make sure args/kwargs are immutable (#46325)
- Make wrapped functions traceable (#46692)
- Added
GraphModule.to_folder
(#47544) - Support default args in symbolic tracing (#47615)
- Add
Node.all_input_nodes
(#48270) - Support torchbind as attribute in torch.fx symbolic tracing (#48732)
- Create subgraph rewriter API (#49540)
- Make len traceable and scriptable with wrap (#50184)
- Add Interpreter and Transformer APIs (#50420)
- Add alternative prettyprinting method to
Graph
(#50878) - Move some heavily used passes out of experimental (#51392)
- Added partial concrete values for symbolic tracing (#51609)
Quantization
- Quantized Operators and Modules
- Embedding and EmbeddingBag operator support
- creating quint4x2 dtype for quantized tensors (#44678)
- PerChannelFloatQParams support for quint4x2 dtype (#45594)
- Add 4-bit embedding_bag prepack/unpack support using quint4x2 (#45751)
- Support 4-bit embedding_bag operators using the dtype quint4x2 (#45752)
- Support for 4-bit quantized EmbeddingBag module (#45865)
- Refactor qembeddingbag to remove duplicate code (#45881)
- Rename the sparse argument for embedding_bag ops (#46003)
- Add support for pruned weights in embedding_bag_byte lookup (#47329)
- fp16 -> fp32 EmbeddingBag moved into CPU impl (#47076)
- Add non-fbgemm fallback implementation for embedding lookup ops (#50706)
- Out variant for embedding_bag_4bit_rowwise_offsets (#51324)
- Using int32 as indices for embedding_bag operators (#45878)
- Add transposed conv support for fbgemm backend for 1d, 2d, 3d (#46607, #46608)
- Add quantized flip dispatch (#46235)
- Add support for ReflectionPad2d (#48036)
- Dynamic GRU quantization support (#49448)
- Quantizable LSTM (#49671)
- Embedding and EmbeddingBag operator support
- Quantization Flow/API
- FX Graph Mode Quantization
- Add prepare_custom_config_dict and convert_custom_config_dict (#46223, #46364)
- Add FixedQParamsFakeQuantize module (#46657)
- Add support for additional_fuse_method_mapping (#46345), additional_{fusion/quant}_pattern (#46346)
- Support in qat sigmoid/hardsigmoid/tanh (#46871), convbn{relu}1d (#47248), FloatFunctional (#46634)
- custom_module support static/dynamic/weight_only quant (#46786)
- Support standalone_module_class (#47705)
- Embedding/EmbeddingBag works in static quant qconfig (#48062)
- Add MatchAllNode in pattern matching (#48979)
- Add support for dynamic quant for RNN and RNNCell (#49126), ConvTranspose{n}d (#49717), quantizing functional linear + {functional relu/module relu} (#50975), functional conv2d + relu (#51079), functional conv1d and conv3d (#51155) (#51254), Scalar as first input for add/mul (#46751), leaky relu (#45712), Embedding (#46677), EmbeddingBag (#46678)
- Remove inplace option for convert_fx (#46955)
- Support non_traceable_module/module_class (#46298)
- Add additional_object_mapping argument to convert (#46338)
- Keep linear op unchanged when qconfig is not supported (#48067)
- Move {input|output}_quantized_idxs cfg from convert to prepare (#49238)
- Allow user to specify qconfig for call_method (#49621)
- Do not observe bias on F.conv and F.linear (#49623, #49628)
- Linear work with float_qparam_dynamic_qconfig (#47068)
- Fix error that DefaultQuantizer is not inserted after a module configured with None qconfig (#47316)
- Scope support for call_method in QuantizationTracer (#50173)
- Support preserved_attributes in prepare_fx (#50306)
- Add option to leave graph inputs and/or outputs quantized (#48624)
- Support quantization for custom module (#44074)
- Remove inplace option for fuse_fx (#46953) and prepare_fx (#46954)
- Scope support for call_function in QuantizationTracer (#51086)
ONNX
- Preprocess index_put with bool inputs to
torch.masked_{scatter,fill}
(#45584) - Export
torch.{var,var_mean,std_mean}
ops (#45678) - Enable NoneType inputs to export API (#45792)
- Add export of prim::dtype, prim::tolist (#46019)
- Enable onnx shape inference in export by default (#46629)
- Add
torch.silu
operator support for onnx (#51519) - Support list remove for onnx export (#51526)
- Added
torch.hardswish
symbolic in opset 9 (#48423) - Add export of
aten::is_floating
point (#46442) - Add
torch.logical_{and,or,xor}
torch op support in pytorch exporter (#50909) - Add
torch.binary_cross_entropy_with_logits
op to ONNX opset version 12 (#50908) - Support opset13
nn.Squeeze
andnn.Unsqueeze
(#50906) - Add export of
prim::data
(#45747) - Support
torch.nonzero(*, as_tuple=True)
export (#47421) - Update Reducesum operator for opset 13 (#50907)
Misc
- Enable python code coverage on windows (#44548) and onnx (#47387)
- Fix PyTorch compilation on Apple M1 chips (#48275, #49701)
Improvements
Python API
- Add integer support (by promoting integer to float) to
torch.{cos,sin,tan}
(#45733, #46706),torch.log{2,10}
(#46810),torch.{a}tanh
(#47064),torch.a{cos, tan}
(#47005),torch.a{cosh, sinh}
(#47152),torch.sqrt
(#47293),torch.log1p
(#48002).torch.erf{c}
(#48472),torch.asin
(#48461),torch.sigmoid
(#47551),torch.sinh
(#48644),torch.cosh
(#48923),torch.exp{2, m1}
(#48926),torch.reciprocal
(#49102),torch.erfinv
(#49155),torch.rsqrt
(#47909),torch.exp
(#50093),torch.lgamma
(#50140) - Add optional
dtype
argument toTensor.view
(#47951) - Add
out
optional arguments totorch.{reshape,flatten}
(#51249),torch.tensordot
(#47278),torch.fft.*
(#49335),torch.narrow_copy
(#49502) - Add support for int32 indices and offset in
nn.Embedding
andnn.EmbeddingBag
(#46758) - Add boolean type support to
torch.where
(#47454),torch.mul
andTensor.__mul__
(#48637),torch.diag
(#47455),torch.{all,any}
(#44790),Tensor.to_dense
(#50019) - Add inplace version of
torch.cum{sum,prod}_
(#47651) - Add sparse support to
torch.sqrt
(#50088) - Add support for both
dtype
andord
arguments intorch.linalg.norm
(#46637) - Make
torch.nn
Module accept batch size of 0:nn.ReplicationPad
(#39137),nn.Unfold
(#40689),nn.PixelShuffle
(#49187),nn.AvgPool{1,2,3}d
(#50008),nn.MultiLabelMarginLoss
andnn.MultiMarginLoss
(#50007) utils.cpp_extensions
Ensure default extra_compile_args are properly handled (#45956)torch.LongTensor
legacy construction improved error message (#46147)torch.utils.checkpoint
allow having Tensors that don’t require gradients (#45934)torch.nan_to_num
: fix deprecated warnings (#46309)- Remove more use of “blacklist” (#45512, #45781)
- Add type annotation to submodules:
torch.nn.cpp
(#46490),torch.nn.parallel.comm
(#46736),torch.nn.modules.*
(#46828, #45772, #46013, #49957, #49479, #49045, #49035, #49494, #48969), autograd functions from c++ (#46622),torch.distributed
functions from c++ (#46623),torch.storage
(#46876),torch._tensor_str
(#48463, #48584),torch.nn.modules.pooling
(#48412),common_nn
(#48190),torch.lobpcg
(#47680),torch.nn.functional
(#50106),torch.overrides
(#50824),torch.generate_torch_version
(#51637),torch.distributions
(#45689),torch.quantization.quantize_jit
(#45548),torch.utils.tensorboard
(#49834),torch.multiprocessing
(#47756),torch.cuda
(#47134),torch._C._distributed_rpc
(#46624),torch.distributed.*
(#47531, #47532, #47533, #47534),torch.nn.parallel._functions
(#49687) - Make comparison fail when dtypes don’t match (#47288)
- Allow large inputs for
torch.svd
(#47440) - Add nondeterministic alerts to
torch.index_copy
,torch.median
on CUDA andtorch.kthvalue
on CUDA (#46942) - Add float16 and bfloat16 support to
torch.where
(#49004),torch.matmul
(#47873) - Add float16 support for CPU and bfloat16 support for CPU & CUDA to
torch.flip
andtorch.flip{lr, ud}
(#49895) - Add support for providing
indices
as a Tensor fortorch.tensor_split
(#49169) - Add support for SELU activation in
torch.nn.init.calculate_gain
(#50664) - Add function version of
torch.optim
optimizers and refactor existing classes to use the functional version: SGD (#45597), Adadelta (#50409), RMSProp (#50410), AdamW (#50411) - Improve error message when window is on wrong device for
torch.fft.stft
(#51128) - Add rounding_mode selection to
torch.div
(#51706, #52242) - Remove spurious numpy writable warning (#47271)
- Enable deterministic mode for rocBLAS (#48654)
- Hipify submodule revamp and improved integration with cpp_extensions (#48715)
- Remove warning about saving state in
torch.optim.lr_scheduler.LambdaLR
(#46813) - Improve typing of
torch.nn.Unflatten
(#49838) - Add exception classification to
torch.multiprocessing.spawn
Autograd
- Add double backward checks for the torch.fft submodule (#46004)
- Detect inplace modifications of views of leaf Tensors earlier to improve error (#46204)
torch.utils
- data.TensorDataset: Add more specific error message (#46905)
- data.DistributedSampler: Additional validation (#48865)
Complex Numbers
- Improve error message thrown by torch.sign for complex tensors (#43280)
- Remove unnecessary dtype checks for complex types and disable complex dispatch for CPU torch.{min,max} pointwise ops (#50465)
CUDA
- Allow consumer ops to sync on autograd engine base gradient (#45787)
- Add torch::cuda::nccl::{send,recv} (#45926)
- Cusolver inverse check info (#46625)
- Make numpy optional dependency for torch.cuda.amp (#48154)
- Support all visible cards when building a cuda extension (#48891)
- Enable using torch.utils.checkpoint.checkpoint and torch.cuda.amp at the same time (#49757)
- Make DeviceCachingAllocator's error handling more defensive and a bit easier to read (#51158)
Distributed
- Create NCCL communicator for send/recv on demand (#44922)
- Reduce the peak memory of fp16 compression DDP comm hook by avoiding converting to fp32 (#46078)
- Allow RPC framework to use rank in addition to
WorkerInfo
and name. (#46221) - Add to the
HashStore
getNumKeys()
(#46048) anddeleteKey()
(#46049) - Print exception message on both RPC caller and callee (#46372)
- Add RRef proxy support for
ScriptModule
methods (#48339) - Support retrieving the RRef to the remote module (#48983)
- Add a c++ interface in processGroup to get its backend name (#51066)
- Enable
NamedTuple
data type to work with DDP (#44220) - Support send/recv to/from self when communicator is created on demand (#45873)
- Add Error log when ProcessGroupNCCL takes down a process (#44988)
- Provide additional information about NCCL error codes. (#45950)
- Avoid scatter for single-device case in DDP (#46304)
- Use Blocking Wait if both Blocking Wait and Async Error Handling Are Set (#47926)
- Providing more information while crashing a process in async error handling (#47246)
- Add PowerSGD comm hook (#48060)
- Define a customized state for PowerSGD comm hook (#48348)
- Add a random generator to PowerSGD state for initializing low-rank matrix Q (#48507)
- Replace the key of
error_dict
in PowerSGD state with bucket index (#48867) - Make
CUDAFuture
remember and restore current device in callback (#48789) - Update pipeline API to accept arbitrary sequence of Tensors and not just Tuple (#48467)
- Use
group.WORLD
appropriately in process group initialization. (#48767) - Add error feedback to layerwise PowerSGD (#49418)
- Warm-start of PowerSGD by reusing states from previous iteration is possible (#49451)
- Change
wait()
tovalue()
in some callbacks of PowerSGD communication hook (#49709) - Ensure DDP + Pipe works with
find_unused_parameters
. (#49908) - Enable TensorPipe CUDA sending to self (#50674) and GDR channel (#50763)
- Add warning to distributed optimizer (#50630)
- Make python object collective API args consistent (#50625)
- Add option to make
rref.get_type
non-blocking. (#50977) - Unescape string in RPC error message (#49373)
- Event Logging for NCCL Async Error Handling Process Crash (#47244)
- Remove
balance
anddevices
parameter from Pipe. (#48432) - Error feedback for PowerSGD DDP comm hook (#48670)
- Add an index field to
GradBucket
for PowerSGD (#48757) - Have
FutureNCCL
record streams w/ allocator in addCallback (#48496) and events in current stream (#48497) - Use fresh stream from pool for each
FutureNCCL
callback (#48498) - Record CUDA events for "follow-up"
FutureNCCL
insidemarkCompleted()
(#48499) - Fix
FutureNCCL
'scompleted()
disagreeing withwait()
(#48503) - Fix
FutureNCCL
not recordingDataPtr
s with caching alloc inwait()
(#48563) - Add multi-GPU support to
FutureNCCL
(#48500) - Don't store device indices separately on
FutureNCCL
(#48501) - Support wider range of types in
FutureNCCL
(#48502) - Split
FutureNCCL
's CUDA-specific parts from generic future logic (#48504) - Merge common parts of FutureNCCL into
at::ivalue::Future
(#48505) - Split out reusable
CUDAFuture
fromFutureNCCL
(#48506) - Cache the
DataPtr
s inCUDAFuture
(#48788) - Modify
Pipe
to return an RRef. (#47829) - Cleanup APIs for pipeline parallelism. (#48630)
- Fix TCPStore type coercion (#49685)
- Simplify the implementation of error feedback and warm-start (#50981)
- Explicitly specify the
dtype
of the error tensor (#50985) - Check
start_PowerSGD_iter > 1
and add guidance on tuning PowerSGD configs. (#51427) - Check if the backend is NCCL when a DDP communication hook is registered (#51759)
TorchScript
- Add multiline string dedent support (#45580)
- Add string versions of argument funcs in jit Node (#45464)
- Make sure each
warnings.warn
only executes once inside TorchScript. (#45382) - Allow slicing multiple dimensions with indexes if not Tuple (#45239)
- Change type inferred from empty annotation (#45360)
- Fix stride printing/parsing formatting (#45156)
- Make objects throw Python AttributeError on nonexistent attr access (#45911)
- Make InsertInstruction overflow check a warning instead of fatal (#46369)
- Add an option to getWriteableTensorData to avoid copy CUDA tensor to CPU (#46524)
- Add error messages and workaround for RET failure of containers with a torch class type (#46543)
- Correctly mark unannotated NamedTuple field to be inferred TensorType (#46969)
- Enable ModuleDict non-literal indexing (#45716)
- Add an attribute to the torchscript model exported by metal (#47174)
- Print out interface mismatch for prim::ModuleDictIndex (#47300)
- better message for bad type annotation (#47464)
- Resolve string literal type annotations using
Resolver::resolveType
(#47731) - Resolve
torch.device
in recursive compilation of classes (#47734) - Metacompile boolean constants (#46721)
- Allow JIT unpickler to accept CUDA DataPtr from read_record_ (#46827)
- Skip None submodule during JIT-tracing (#49765)
- Add
__prepare_scriptable__
duck typing to allow replacingnn.Module
s with scriptable preparations (#45645) (#49242) - Fix deprecation warning in scalar_type_analysis (#50218)
- Support scripting classmethod called with object instances (#49967)
- Use FileStore in TorchScript for store registry (#50248)
- Treat has_torch_function and object_has_torch_function as static False when scripting (#48966)
- Print better error when class attribute IValue conversion fails (#50255)
- Clean up some type annotations in test/jit/...../test_class_type.py (#50156)
- Type annotations in test/jit (#50293)
- Eliminate static default_extra_files_mobile from header import.h (#50832)
- Dump torch::jit::AliasDb objects as Graphviz files (#50452)
- Fix test_jit_cuda_archflags on machine with more than one arch (#50405)
- Provide more info when attribute fails to convert (#50870)
- Adding correct error message for for..else (#51258)
- Handle error during dict expansion (#51374)
Mobile
- Update default output extension in optimize_for_mobile.cc (#45598)
- Add named tuple's error message and workaround for RET failure (#46347)
- [Metal] Add metal backend type (#46455)
- [Metal] Add the Python binding for optimize_for_mobile (#46456)
- [Metal] Add pin_memory check in empty_strided (#47228)
- [Metal] Calculate strides for metal tensors (#50309)
- [Metal] Clean up the operator tests (#50311)
- Add an overload for deserialize() that doesn't accept the extra_files map. (#50932)
- bundled_inputs: Preserve bundled input related methods when calling optimize_for_mobile (#49170)
- bundled_inputs: Preserved all functions generated by bundled inputs (#51496)
- bundled_inputs: Expanded Bundled Inputs To Any Public Function (#51153)
- Expose _export_operator_list to python (#51312)
Quantization
- Quantized Operators and Modules
- Add reflection padding to conv (#49011)
- Add support for 2D indices for quantized embedding operators (#47766)
- quantize_tensor_per_channel ARM implementation (#46018)
- Support either min or max in qclamp (#45937)
- Add preliminary support for advanced indexing (#49346)
- Add backend_independent option for quantized linear module (#48192)
- Add out-variant for the reflection pad (#48037)
- Support 2 dim input in quantized batchnorm 1d (#51597)
- Typing, Formatting, Error Messages, Logging and Tests
- numeric suite: add types to eager (#51168)
- Enable type check for torch.quantization.fake_quantize (#45701)
- Type check for
torch.quantization.observer
(#45630),torch.quantization._numeric_suite
(#46330),torch.quantization.stubs
(#46475),quantization.fx.Quantizer
(#48343),quantization.fx.Quantizer
(#48350),quantization_mappings.py
(#49179),fusion_patterns.py
(#49606),torch/nn/quantized/modules
(#49941), quantization-related files intorch/jit
(#49939), fuser (#48844), quantization_patterns (#48851), observed_module.py (#49607), quantization (#49942) - Enable mypy on
torch/quantization/fx/*
(#48331) - Make each line of fx/quantize.py <=80 chars (#48357)
- Add more typehints (#48774, #48794, #48792)
- Nice error message on convtranspose with per-channel weight (#49899)
- Throw a nice error message for allclose with quantized inputs (#49802)
- Add type annotations to torch.nn.quantized.modules.conv (#49702)
- Add type annotations to conv_fused/blas_compare/blas_compare_setup (#51235)
- Add API usage logging to numeric suite (#46504) and quantization (#46095)
- Sparsity
- Others
- Use tensor's quantized properties directly in pickler (#46267)
- Remove register api and rename get_mapping to get_default_mapping (#46337)
- Update HistogramObserver to be scriptable (#51081)
- Support varying size input in numeric suite (#47391)
- Backend string for the quantized types (#49965)
- Disable pruning on embedding look up operators when compressed_indices_mapping = {0} (#48672)
- Support out variant of embedding_bag_byte_rowwise_offsets_out (#49561)
ONNX
- Update embedding_bag export (#44693)
- Improve error handling for adaptive_pool (#45874)
- Support nd mask index in opset >= 11 (#45252)
- Update peephole pass for prim::ListUnpack (#46264)
- Slightly improve indexing with ellipsis under scripting (#46571)
- Update batch_norm symbolic to handle track_running_stats=False (#47135)
- Cast Gather index to Long if needed (#47653)
- Handle dynamic input axes for prim_ConstantChunk (#48176)
- Remove usage of isCompleteTensor() in symbolic functions (#48162)
- Changes to export API to better handle named arguments (#47367)
- Modified var_mean symbolic to support more combinations of dims (#48949)
- Support gelu for fp16 export (#50911)
- Enable Constant Folding for ONNX Opset 13 (#51523)
- Export and shape inference for prim uninitialized in If subblock (#46094)
- Scripting support for inputs to index_put (#46866)
- Track and list model params for scripting (#47348)
- Modifications in remove inplace ops passes to better handle binary inplace ops (#51572)
- Improve error message for parse_arg in symbolic functions (#51516)
- Update error message that displays when encountering an op unsupported for ONNX export (#51522)
- Preserve param names during in-place op removal (#50955)
- Handle sequence output shape and type inference (#50599)
- Update constant-folding of Gather op to include cases where rank of indices input is 0 (#51514)
- Update unsafe_chunk() method to support new version 13 of Split operator (#51524)
- Replace optional parameters of Resize with placeholder for ops13 (#50954)
Vulkan
This release brings about a complete rewrite of PyTorch’s Vulkan backend with primary focus on improved performance, robustness, and better code structure and organization. These changes are transparent to the end user. Considering that this is a rewrite, many of these changes also qualify as performance improvements.
- Add Vulkan Tensor factory. (#44016)
- Redo Vulkan command and descriptor pools. (#44496)
- Add low level utilities: image sampler (#45037), fence (#45148), tensor copy (#46481), job dispatch and flush (#46008)
- Add more ops: Add (#44017), Mul (#47021), Mm, Pool, Upsample (#47063), Conv2D (#46900, #48266, #48816), clamp (#47196), reshape (#47252), mean (#47312)
- Add CMake option to enable Vulkan [v2] API. (#46503)
- Add Tensor.is_vulkan (#46655)
Misc
- Factory operators (at::empty, at::zeros, ...) now have a new overload in the C++ API that takes ScalarType, Layout, Device and pin_memory parameters separately, in addition to the previously existing overload that takes one TensorOptions argument. (#44087)
Bug fixes
Python API
- Fix
torch.nn.BatchNorm{1,2,3}d
channels_last contiguity check (#50659) - Fix
torch.nn.ConstantPadNd
not preserving memory format (#50898) - Fix dtype of first sample in
torch.quasirandom.SobolEngine
(#51578) - Fixes bug in
torch.sspaddmm
(#45963) - Check
support_as_strided
before usingtorch.empty_strided
(#46746) - Fix internal assert for
torch.heaviside
with cuda tensor and cpu scalar tensor (#46831) - Fix negative column numbers for
torch.eye
(#46841) - Fix segfault with
torch.orgqr
(#46700) - Fix
torch.nn.functional.embedding
padding_idx behavior (#46714) - Fix
torch.nn.Embedding.from_pretrained
to properly handle thepadding_idx
argument (#47184) - Fix functions not handling discontiguous Tensors properly:
torch.dropout
(#47552),torch.median
(#46917) - Fix max_pool2d with ceil_mode (#46558)
- Fix type promotion for
torch.trace
on CPU (#47305) - Fix
torch.kthvalue
error for scalar input (#47600) - Fix multinomial when input has 0 probability (#47386)
- Fix incorrect warnings in
torch.nn.Parameter{List,Dict}
(#48315) - Fix printing of
torch.device
(#48655) - Fix parameter generator exhaustion in
torch.optim.SparseAdam
(#47724) - Fix
torch.pow
bug for complex exponents (#49809) - Fix gradient for
torch.norm
whenp=+inf
(#48611) - Fix
SyncBatchNorm
when stats tracking is disabled (#50126) - Fix
torch.elu
backward when alpha is negative (#49272) - Fix pickling for Tensor-like objects (#47732)
- Fix
torch.distributions.Half{Cauchy,Normal}
support forvalidate_args=True
(#50403, #50492) - Fix
torch.distributions.CatTransform
forevent_dim
> 0 (#49111) - Fix
torch.distributions.Binomial
to retain lazy logit initialization (#46055) - Fix
torch.pow
when exponent is provided as a scalar Tensor and on different device (#46185, #46320) - Fix classmethod override argument passing for Tensor-like objects (#47114)
- Fix internal assert when inputs are on the wrong device for
torch.
{maximum, minimum}
(#48446) - Fix
torch.distributions.utils.broadcast_all
crashing on Tensor-like objects (#48169) - Fix vectorized conversion of
-nan
from float16 to float32 (#41280) - Fix
torch.silu
backward for all backends other than CPU and CUDA (#49439) - Fix wrong output when
torch.kthvalue
out=
argument overlaps with input (#48254) - Fix advanced indexing for Tensor-like objects (#49324)
- Fix
torch.distributions.TransformedDistribution
shape logic(#50581) - Fix
torch.nn.functional.interpolate
backward on GPU for nearest interpolation (#51240) - Fix
torch.svd
ignoringsome
keyword argument for empty inputs (#51109) - Fix
torch.distributions.Dirichlet
arg_constraints
(#51369) - Use deterministic implementation of
torch.index_put
andtorch.index
backward CPU in deterministic mode (#51388) - Removes spurious warning in
torch.nonzero
(#51618) - Fix calculation of number of elements to not overflow in many c++ implementations (#46997)
- Fix Parameter detection as Tensor in c++ backend (#48963)
- Fix bug in miopen findAlgorithm (#46852)
Autograd
- Fix deadlock on Windows due to bad thread termination in autograd engine (#43532)
- Fix deadlock in tsan builds due to bad locking in the engine (#45867)
- Avoid NaN values in
torch.cdist
backward for p<1 (#45720) - Fix handling of
requires_grad
arg fortorch.new_
{full,empty,zeros}
(#46486) - Fix inplace check logic to be triggered when written-to Tensor does not require gradients (#46296)
- Set proper output differentiability for
torch.unique
(#47930),torch.count_nonzero
- Fix race in autograd engine that can lead to
std::out_of_range
error (#50164, #50372) - Fix autograd thread crash on destruction with python-3.9 (#50998)
- Fix autograd side effects when printing (#51364)
- Fix memory leak in anomaly mode (#51610)
- fix
torch.hardsigmoid
backward at boundary values (#51454)
CUDA
- Fix incorrect CUDA
torch.nn.Embedding
result whenmax_norm
is notNone
and indices are not sorted (#45248) - Ensure kernel launches are checked (#46474, #46727)
- Fix bit math (#46837)
- Fix test_inverse_singular for cublas path; fix cusolver inverse multi-stream issue (#47026)
- Fix indices computation for trilinear interpolate backwards (#50084)
- Fix for possible RNG offset calculation bug in cuda vectorized dropout with VEC=2 (#50110)
- Disable cuDNN persistent RNN on
sm_86
devices (#49534) - Fix Error with
torch.flip
for cuda tensors whendims=()
(#50325) - Fix replication_pad CUDA launch configuration (#50565)
- Workaround for MAGMA accessing illegal memory in batched cholesky (#50957)
- Fix
torch.cdist
backward CUDA error due to illegal gridDim setting (#51569) - Prevent CUDAFuture from using uninitialized device index (#51505)
- Fix incorrect usage of CUDACachingAllocator (#48817)
- Fix
torch.cuda.memory_allocated
to return{}
if not initialized (#51179) - Fix crash when trying to reset memory stats when no cuda device is available (#48406)
torch.utils
- data.DistributedSampler: Fix possible padding length overflow (#45329)
- data.DataLoader: Fix hang with large sampler (#48669)
- data.DataLoader: Fix unintended error when worker force kill happens #43455 (#43462)
- data.DataLoader: Fix persistent_workers + pin_memory (#48543)
Complex Number
- Make torch.view_as_real raise a proper error for backends where it is not supported (#47018)
- Fix bug in toComplexWithDefault (#43841)
- Fix torch.cat backward formula to return correct gradient values for R -> C case (#51681)
- Update backward formulas for torch.{add, sub} to correctly handle R -> C case. (#46596)
- Add custom implementation for torch.csqrt if libc++ is used (#52018)
C++ API
- Refine ConvParams::use_nnpack() to allow the NNPACK convolution algorithm to only be used for kernels up to 16x16. (#49464)
Distributed
- Record FutureNCCL callback stream on CUDA caching allocator (#45318)
- Fix object-based collectives API to use
torch.cuda.current_device
instead of rank (#46897) - Explicitly restrict the scope of
torch.cuda.synchronize
to the current device in PowerSGD (#49711) - Fix Hang in Async Error Handling due to Work logging (#46265)
- Add missing
recordStream
inProcessGroupNCCL::alltoall_base
(#46603) - Allow DataParallel to run zero input Module (#46565)
- Fix DDP issue where parameters share same
grad_accumulator
(#46755) - Fix ProcessGroupNCCL profiling when profiler is not run with
use_cuda
(#48946) - Refactor RPC
matchBuiltInOp
to get rid of exception swallowing (#49009) - Solve zombie process problem in DDP launcher (#49305)
- Fix memory leak in TensorPipeAgent. (#50564)
- Fix warm-start for PowerSGD layer-wise compression (#50283)
- Fix CUDA RPC Stream Synchronization (#50949)
- Fix
benchmarks/distributed/ddp/benchmark.py
(#51095) - Fix store based barrier to only use
add
(#49930)
Mobile
- Fix out-of-bounds access for caching allocator calls (#46439)
- Fix CPUCaching allocator guard bug (#46922)
- [Metal] Make the dst tensor contiguous when copying from metal (25833e5)
- [Metal] Fix the broken strides value for 2d transpose (#50310)
- [Android] Fix yuv conversion (#50951)
TorchScript
- Fix bugs in a number of ops in CUDA fuser (#47795, #49143, #49396 ,#48329 and others)
- Fix dict update (#45857)
- Fix Dict bug in constant hashing (#45929)
- Fix TypeError when
torch.jit.load
is passed a pathlib.Path (#45825) - Fix missing call to
__setstate__
when cloning modules (#45858) - Prevent caching of
graph
attribute. (#46960) - Fix traced training attribute (#47211)
- Correctly compare Stream IValues (#47303)
- Correctly print out sign of near-zero double values (#47081)
- Properly serialize types that only appear at function input (#47775)
- Fix bug in get_annotation_str for ast.Subscript (#48741)
- Fix include files for out-of-tree compilation (#48827)
- Fix constant propagation schemas (#49605)
- Fix return type Any for Ternary ops (#49165)
- Fix for module_has_exports (#50680)
- Properly convert Python strings implicitly to device (#51340)
- Add missing support for
torch.jit.Final
in python 3.6 (#47393)
torch.fx
- Fix recursion depth issue on Graph deepcopy (#46669)
- Fix handling of
inf
andnan
literals (#46894) - Fix corner case in name sanitization (#46958)
- Fix submodule naming for subgraph split (#47869)
- Fix create_arg for NamedTuple (#48986)
- Fix python code having spurious newlines from placeholders (#49720)
- Make
split_module
results deterministic (#50470) - Fix tracing a free function with embedded constant (#50639)
- Fix using
fx.wrap
as a decorator (#50677) - Fix annotation in generated code (#50777, #52021)
Quantization
- Remove fake_quant after add/mul nodes during eager mode QAT (#49213)
- `torch.mean`: add path for unsupported QNNPACK modes (#45533)
- Set type for GetAttr nodes in remapTypes (#46250)
- Avoid inserting fakequant for sigmoid/hardsigmoid/tanh in eval mode (#47297)
- Ensure observer respects device affinity (#47514)
- Fix quant type classification for float_qparam qconfig (#48069)
- Fix quant_type classification for fp16, fp16 (#48073)
- Fix a bug in leakyReLU (#48265)
- Fix quantization for qat.ConvBnReLU1d (#48059)
- Add bias once in conv_fused (#48593)
- Do not return uninitialized qscheme from getQSchemeAndQParamVector (#49391)
- Fix quantization for DeQuantStub (#49428)
- Ensure observers do not crash for empty Tensors (#49800)
- fake_quant: fix device affinity and buffer resizing for state_dict (#50868)
- Fix memory leak in qnnpack ops (#51612)
- Remove set_quantizer_ from native_functions.yaml (#49463)
- Make choose_qparams_optimized return Tensors to preserve dtype (#45530)
- Use PlaceholderObserver as default dynamic quant observer (#45343)
- FixedQParamsFakeQuantize: adjust default quant_min and quant_max (#47423)
- Add bias once in conv_fused (#48593) (#48661)
- Fix unused var warning when building for different archs. (#48730)
- Make the CUDA fake quantize logic consistent with CPU fake quantize logic (#49808)
- eager quant: fix error with removing forward hooks (#49813)
ONNX
- Fix `torch.flatten` operator (#45632)
- Reimplement _var_mean to ensure non-negative (#47240)
- Fix scripting of `torch.{rand,randn,where}` (#45793)
- Fix `torch.eye` export (#47016)
- Fix dtype for log_softmax export (#46627)
- Fix graph position to insert clone node for inplace op removal (#51520)
- Fix graph sequence output from loop node (#51521)
- Do not dereference nullptr in scalar type analysis (#50237)
- Fix bug in `torch.unfold` symbolic (#51515)
- Fix opset 11 ConstantChunk with negative dim (#51525)
- Fix bug in scatter_add (#51527)
Vulkan
- Fix interval midpoint calculation (#46839)
- Fix Vulkan `torch.empty` (and family) breakage as a result of API update (#47937)
- Fix Addmm prepacking to persist after GPU flush (#48313)
- Properly forbid dilation > 1 for conv2d (#48800)
Misc
- Fix c++ extension ninja CUDA build (#49344)
- Only include dataclasses for py < 3.8 to make `setup.py` compatible with older python versions (#45611)
Performance
Python API
- Rewrite `torch.kron` to improve performance and support more dtypes (#50927) (see the example after this list)
- Enable the faster combined weight branch in MHA when query/key/value is the same object with NaN (#48126)
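To illustrate the `torch.kron` entry above, a small Kronecker product example (values are arbitrary; integer inputs are among the additionally supported dtypes):

```python
import torch

a = torch.tensor([[1, 2], [3, 4]])
b = torch.eye(2, dtype=torch.int64)
print(torch.kron(a, b))
# tensor([[1, 0, 2, 0],
#         [0, 1, 0, 2],
#         [3, 0, 4, 0],
#         [0, 3, 0, 4]])
```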
Autograd
- `autograd.gradcheck` update to reduce computations (#45757)
- Reduce memory usage for `torch.mm` when only one input requires gradient (#45777)
- Reduce autograd engine startup cost (#47592)
- Make `torch.svd` backward formula more memory and computationally efficient (#50109)
CUDA
- Fix performance issue of GroupNorm on CUDA when the feature map is small (#46170)
- Concat fast path with empty tensor (#46805)
- Support strided tensors as inputs for `torch.cat` (#46859)
- Pin destination memory for `cuda_tensor.to("cpu", non_blocking=True)` (#46878) (see the sketch after this list)
- Add proper maximum number of threads per block for sm_86 as 1536 (#45889)
- Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas (#44778)
- Improve performance of CUDA trilinear interpolate backward (#52649)
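For the pinned-destination entry above, a minimal sketch (assumes a CUDA device is available): the non-blocking device-to-host copy can now overlap with other work because the destination CPU tensor is allocated in pinned memory; synchronize before reading the result on the host.

```python
import torch

if torch.cuda.is_available():
    gpu_t = torch.randn(1024, 1024, device="cuda")
    cpu_t = gpu_t.to("cpu", non_blocking=True)  # asynchronous D2H copy
    torch.cuda.synchronize()                    # wait for the copy to finish
    print(cpu_t.shape)
```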
C++ API
- Avoid computing AutogradKey if not needed to speed up low level C++ calls (#46252)
- VariableKernel calls into scattered C++ api (#44158)
- Make validate debug-only in Device constructor (#49123)
- Add macro to optionally devirtualize `TensorImpl::numel()` (#49766) and `TensorImpl::sizes()` (#50176)
- Inline access to low level Dispatcher (#50644)
Distributed
- Only track variables with grad accumulator for find_unused_parameters=True in DDP to save memory (#45942)
- Benchmark combining Distributed Data Parallel and Distributed RPC (#46993)
- Drop FutureNCCL in favor of vanilla CUDAFuture (#49014)
- Pytorch Distributed RPC Reinforcement Learning Benchmark (Throughput and Latency) (#46901)
TorchScript
- Optimized hot path in JIT graph executor (#47465, #48061, #48034)
- Added support for `is_nan`, `to`, and `lgamma` in CUDA fuser (#45791, #48973, #48976)
- Added additional optimizations as part of `torch.jit.freeze` (Conv-Batchnorm, Conv-Add, and Conv-Mul folding, Dropout removal) (#50222) (see the sketch after this list)
- Fast TypeMeta/ScalarType conversion (#45544)
- Fix getCustomClassType() perf (#48981)
- Avoid move-constructing a List in listConstruct (#49355)
- Specialize `list_element_from` for `IValue` to avoid extra move/copy (#50124)
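To show where the `torch.jit.freeze` optimizations above apply, a minimal sketch (the toy Conv/BatchNorm model is made up for illustration); freezing requires a scripted module in eval mode and then runs folding passes such as Conv-BatchNorm:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
)
scripted = torch.jit.script(model.eval())  # freeze expects eval mode
frozen = torch.jit.freeze(scripted)        # folds Conv-BN, removes dropout, etc.
out = frozen(torch.randn(1, 3, 32, 32))
```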
Mobile
- Avoid inlining kernel lambdas on mobile (#46249)
- Free original weight after prepacking in XNNPACK based op (#46541)
- [Metal] Make permuteWeights inline (#47634)
- [Metal] Use MPSCNN kernels for binary elementwise ops (c18403a)
Vulkan
- Enable prepacked addmm/mm for linear layers (#47815)
- Tweak memory use. (#47728)
- Add linear memory allocator. (#48569)
- Optimize Vulkan command buffer submission rate. (#49112)
torch.fx
- Speed up non-parameter tensor lookup (#47325)
Quantization
- Parallelize the quantization conversion operators (#45536)
- Add a more memory efficient version of fake quant (#50561)
- mem-efficient learnable fake quantization (#49315, #51255, #51159)
- Remove contiguous calls in qembeddingbag (#48993)
- Update embedding module to not store qweight (#50418)
Misc
- Extra sampling of record function events for the profiler (#49114)
Documentation
Python API
- Add information on how to control randomness in `DataLoader` (#45749) (see the example after this list)
- Revamp reproducibility notes (#45748)
- Revamp `torch.optim` doc for better understanding (#45944)
- Revamp `torch.sparse` tensor documentation (#45400)
- Add doc for `torch.overrides` submodule (#48170)
- Add note on `nn.Module` overview and design principles (#51536)
- Add helper functions section to `torch.fft` doc (#46032)
- Add object-based collective APIs to public docs (#48909)
- Fix diverse typos and rendering issues in `torch.` doc (#46328, #46589, #47545, #48316, #48328, #48673, #48787, #47762, #48970, #49136, #49388, #49413, #49584, #49667, #41887, #50254, #51053, #51212, #51439, #51286, #49648)
- Fix diverse typos and rendering issues in `torch.nn` doc (#45662, #45660, #45587, #45763, #46853, #48577, #48775, #49950, #50430, #48596)
- Fix diverse typos and rendering issues in `torch.linalg` doc (#51459, #51353, #51620, #51641, #51651, #51658, #51659, #51660)
- Update docs for `torch.nn`: in-place modification of weight in `nn.Embedding` (#45595)
- Update docs for `torch.distributions`: `NegativeBinomial` (#45693), `Categorical` (#45804), `LKJCholesky` (#52904)
- Improve `torch.matmul` doc regarding broadcasting (#45699)
- Add function signature for `torch.pixel_shuffle` (#45661)
- Fix signature for `torch.poisson` (#45656)
- Add 3D reduction example to `torch.tensordot` (#45697)
- Fix `torch.matrix_exp` (#45909)
- Fix typo in `torch.load` docstring for the `f` parameter (#49350)
- Document fix for `torch.logspace` and `torch.linspace` (#46056)
- Improve clarity of `torch.norm` (#42696)
- Fix info on the shape of pivots in `torch.lu` (#46844)
- Add `generator` param in `torch.randperm` doc (#47231)
- Updated doc for `torch.{v}dot` (#47242)
- Update doc of `torch.eig` about backward (#47598)
- Fix `torch.swap{dim/axes}` to properly appear in doc (#48376)
- Add global `nn.Module` hooks to nn doc (#48374)
- Added `torch.linalg.cond` to doc (#48941)
- Improve new_group example in the context of `torch.nn.SyncBatchNorm` (#48897)
- Update `is_floating_point()` docs to mention bfloat16 (#49611)
- Improve docs for `torch.{scatter,gather}` (#49679)
- Rename "Arguments:" to "Args:" in all docs (#49736)
- Fix a KaTeX crash and many docstring issues (#49684)
- Improve `torch.flatten` doc (#49501)
- Add note about `torch.flip` returning a new tensor and not a view (#50041)
- Add instructional error message for cudnn RNN double backward workaround (#33884)
- Add centered FFT example to `torch.fft.fftshift` doc (#51223)
- Add `torch.sgn` to doc (#51479)
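For the `DataLoader` randomness entry above, a minimal sketch of the kind of pattern the reproducibility notes describe (the dataset and seed values are arbitrary): pass an explicitly seeded generator and a `worker_init_fn` so shuffling and per-worker RNG state are reproducible across runs.

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Derive per-worker seeds from the base seed set by the generator below.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

dataset = TensorDataset(torch.arange(10).float())
loader = DataLoader(dataset, batch_size=2, shuffle=True,
                    num_workers=2, worker_init_fn=seed_worker, generator=g)
```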
Autograd
- Fix many typos and rendering issues in `torch.autograd` doc (#48765, #45849, #50166, #51035, #51335)
- Update the error message explaining when to use the `retain_grad` flag (#47084)
Complex Number
- Fix typo in complex autograd docs (#49755)
- Doc update for complex numbers (#51129, #51661)
- Document that `torch.remainder` does not support complex inputs (#48024)
CUDA
- Add a Note on CUDA Streams (#45754)
- Add docs on how to toggle TF32 flags on C++ (#47331)
- Fix syntax issue in C++ cuda api note (#48434)
- Change “truncating” to “rounding” in TF32 docs (#49625)
- Add docstring to `torch.cuda.get_device_properties` (#49792)
- Add doc for `cuda.memory_fraction` and `cuda.gpu_process` (#51372)
C++ API
- Add guide for choosing dispatch keys in `native_functions.yaml` (#46126)
- Add a few more comments on dispatch key computation methods (#46128)
- Improve error messages for operator registration API (#47636)
- Add Math/DefaultBackend to dispatch key guide, introduce `PythonDispatcher` (#50854)
Distributed
- Clarify callback behavior when future is completed (#50978)
- Enhance `new_group` doc to mention using NCCL concurrently (#48872)
- Adding c10d Store API docs (#45543)
- Fix distributed documentation for asynchronous collective Work objects (#45709)
- Fix DDP documentation (#46861)
- Fix inaccurate note in `DistributedDataParallel` (#47156)
- Minor doc fixes for `init_process_group` (#47644)
- Docs fixes for `HashStore` API (#47643)
- Update links in DDP note (#47663)
- Small documentation changes for `RRef` and Dist Autograd (#48123)
- Add examples for new object-based c10d APIs (#43932)
- Minor update of the comments on PowerSGD. (#49246)
- Updating `init_process_group` docs to indicate correct rank range (#49131)
- Store Python API docs fixes (#49130)
- Fix link in distributed contributing doc and add link (#49141)
- Updating docs to reflect `FileStore` changes (#49557)
- Improve documentation for pipeline parallelism (#48638)
- Reorder `torch.distributed.rpc.init_rpc` docstring arguments (#50419)
- Add documentation page for pipeline parallelism (#50791)
- Update the doc of `DistributedOptimizer` (#51314)
- Fix doc inconsistency about callback args in `torch.futures.Future` (#50979)
TorchScript
- Added a developer tutorial for tensor expressions - the core technology used in CUDA fuser (#45527)
- Fix jit model loading example (#48104)
- Fix archive file extension in examples and docs (#50649)
- Fix `ScriptModule` docstring (#48608)
- Clarify logic in `ir_emitter` (#51299)
torch.fx
- Add `torch.fx` section to doc (#48814, #50291, #50562, #50896, #50966, #51728)
- Add example on how to split up an FX graph into smaller subgraphs with their own submodules (#45404)
- Shape propagation example (#45637)
- Add many docstrings and improve their rendering (#47719, #48100, #48738, #48871, #50145, #50396, #50555)
- Document single op replacement (#50116, #50377)
- Document example of Proxy use (#50583)
- Add limitations of symbolic tracing (#50638)
- Added how to write transformations section (#51278)
- Added invert example (#51478)
- Document FX debugging (#51530)
- Write FX Subgraph Rewriter tutorial (#51531)
- Add note about more use cases of FX (#51576)
Quantization
- Add API summary section in quantization docs (#45848, #50681, #50187)
- Fix misleading doc string in quint8.h (#48418)
- Add fx graph mode quantization to quantization docs (#49515)
- Add common errors section (#49902)
- Adding a table comparing eager and fx graph mode (#50413)
- Add docs for embedding/embedding_bag (#51770)
- Add fake_quantize functions documentation (#51748)