torchvision 0.14.0

TorchVision 0.14, including new model registration API, new models, weights, augmentations, and more

Highlights

[BETA] New Model Registration API

Following up on the multi-weight support API released in the previous version, we have added a new model registration API to help users retrieve models and weights. There are now four new methods under the torchvision.models module: get_model, get_model_weights, get_weight, and list_models. Here are examples of how to use them:

import torchvision
from torchvision.models import get_model, get_model_weights, list_models


max_params = 5000000

# Collect every classification model that has at least one set of weights
# with no more than max_params parameters.
tiny_models = []
for model_name in list_models(module=torchvision.models):
    weights_enum = get_model_weights(model_name)
    if len([w for w in weights_enum if w.meta["num_params"] <= max_params]) > 0:
        tiny_models.append(model_name)

print(tiny_models)
# ['mnasnet0_5', 'mnasnet0_75', 'mnasnet1_0', 'mobilenet_v2', ...]

# Instantiate the first match with its best available weights.
model = get_model(tiny_models[0], weights="DEFAULT")
print(sum(x.numel() for x in model.state_dict().values()))
# 2239188
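The fourth method, get_weight, resolves a weights enum entry directly from its fully qualified string name. A minimal sketch (the particular weight name below is just an illustrative choice):

from torchvision.models import get_weight

# Look up a weights enum entry by its string name.
weights = get_weight("MobileNet_V2_Weights.IMAGENET1K_V1")
print(weights.meta["num_params"])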

This API is still in beta, and it may change in the future as we improve its usability based on your feedback.

New Architecture and Model Variants

Classification Models

We’ve added the Swin Transformer V2 architecture along with pre-trained weights for its tiny/small/base variants. In addition, we have added support for the MaxViT transformer. Here is an example of how to use the models:

import torch
from torchvision.models import maxvit_t, swin_v2_t

image = torch.rand(1, 3, 224, 224)
model = swin_v2_t(weights="DEFAULT").eval()
# model = maxvit_t(weights="DEFAULT").eval()
prediction = model(image)
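Each set of weights also bundles the preprocessing it was evaluated with, exposed via weights.transforms(). A minimal end-to-end sketch (the random tensor stands in for a real decoded image):

import torch
from torchvision.models import swin_v2_t, Swin_V2_T_Weights

weights = Swin_V2_T_Weights.DEFAULT
model = swin_v2_t(weights=weights).eval()
preprocess = weights.transforms()  # resize/crop/normalize preset for these weights

img = torch.rand(3, 300, 400)  # stand-in for a decoded image tensor
batch = preprocess(img).unsqueeze(0)
with torch.no_grad():
    logits = model(batch)
print(weights.meta["categories"][logits.argmax().item()])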

Here is a table showing the accuracy of the models tested on the ImageNet1K dataset.

Model       Acc@1    Acc@1 change over V1    Acc@5    Acc@5 change over V1
swin_v2_t   82.072   +0.598                  96.132   +0.356
swin_v2_s   83.712   +0.516                  96.816   +0.456
swin_v2_b   84.112   +0.530                  96.864   +0.224
maxvit_t    83.700   -                       96.722   -

We would like to thank Ren Pang and Teodor Poncu for contributing the 2 models to torchvision.

[BETA] Video Classification Model

We added two new video classification models, MViT and S3D. MViT is a state-of-the-art video classification transformer that reaches 80.757% accuracy on the Kinetics400 dataset, while S3D is a relatively small model with good accuracy for its size. The models can be used as follows:

import torch
from torchvision.models.video import mvit_v2_s, s3d

# A batch of one 16-frame clip in (B, C, T, H, W) format.
video = torch.rand(1, 3, 16, 224, 224)
model = mvit_v2_s(weights="DEFAULT")
# model = s3d(weights="DEFAULT")
model.eval()
prediction = model(video)
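Building on the snippet above, the weights metadata can turn the output logits into a human-readable Kinetics400 label (a small sketch, assuming the mvit_v2_s variant from the example):

from torchvision.models.video import MViT_V2_S_Weights

weights = MViT_V2_S_Weights.DEFAULT
label = weights.meta["categories"][prediction.argmax().item()]
print(label)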

Here is a table showing the accuracy of the new video classification models tested on the Kinetics400 dataset.

Model       Acc@1    Acc@5
mvit_v1_b   81.474   95.776
mvit_v2_s   83.196   96.36
s3d         83.582   96.64

We would like to thank Haoqi Fan, Yanghao Li, Christoph Feichtenhofer, and Wan-Yen Lo for their work on PyTorchVideo and their support during the development of the MViT model. We would also like to thank Sophia Zhi for implementing the S3D model in torchvision.

New Primitives & Augmentations

In this release we’ve added the SimpleCopyPaste augmentation to our reference scripts, and we upstreamed the PolynomialLR scheduler to PyTorch Core (see the sketch below). We would like to thank Lezwon Castelino and Federico Pozzi for their contributions. We are continuing our efforts to modernize TorchVision by adding more SoTA primitives, augmentations, and architectures with the help of our community. If you are interested in contributing, have a look at the following issue.
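For reference, here is a minimal sketch of the upstreamed PolynomialLR scheduler in PyTorch Core, using a toy model and optimizer:

import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Decay the learning rate to zero over 100 steps with a linear (power=1.0) schedule.
scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=100, power=1.0)

for _ in range(100):
    optimizer.step()  # in real training this follows loss.backward()
    scheduler.step()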

Upcoming Prototype APIs

We are currently working on extending our existing Transforms and Functional API to natively support Video, Object Detection, and Semantic and Instance Segmentation. This will let us offer better support for existing Computer Vision tasks and make SoTA augmentations such as MixUp, CutMix, Large Scale Jitter, and SimpleCopyPaste importable from the TorchVision binary. The API is still under development and thus was not included in this release, but you can read more about it in our blog post and provide your feedback on the dedicated GitHub issue.

Backward Incompatible Changes

We’ve removed some APIs that have been deprecated since version 0.12 (or before). Here is the list of things that we removed and their replacement:

  • The Kinetics400 class has been removed. Users must now use the newer Kinetics class, which is a direct replacement.
  • The classes _DeprecatedConvBNAct, ConvBNReLU, and ConvBNActivation were removed from torchvision.models.mobilenetv2 and are replaced by the more generic Conv2dNormActivation class.
  • The torchvision.models.mobilenetv3.SqueezeExcitation has been removed in favor of torchvision.ops.SqueezeExcitation.
  • The class methods convert_to_roi_format, infer_scale, setup_scales from torchvision.ops.MultiScaleRoiAlign have been removed.
  • We have removed the resample and fillcolor parameters from the Transforms API; they have been replaced with interpolation and fill, respectively.
  • We’ve removed the range parameter from torchvision.utils.make_grid, as it shadowed the Python built-in range; it has been replaced by the value_range parameter. A short migration sketch follows this list.
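As a migration aid, here is a minimal before/after sketch of the renamed parameters (the values are hypothetical, chosen only to illustrate the renames):

import torch
from torchvision import transforms, utils
from torchvision.transforms import InterpolationMode

# Before (now removed):
#   t = transforms.RandomAffine(degrees=10, resample=2, fillcolor=0)
#   grid = utils.make_grid(imgs, normalize=True, range=(0, 1))

# After (0.14):
t = transforms.RandomAffine(degrees=10, interpolation=InterpolationMode.BILINEAR, fill=0)
grid = utils.make_grid(torch.rand(4, 3, 32, 32), normalize=True, value_range=(0, 1))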

Detailed Changes (PRs)

Deprecations

[models] Remove cpp model in v0.14 due to deprecation (#6632)
[utils, ops, transforms, models, datasets] Remove deprecated APIs for 0.14 (#6258)

New Features

[datasets] Add various Stereo Matching datasets (#6345, #6346, #6311, #6347, #6349, #6348, #6350, #6351)
[models] Add the S3D architecture to TorchVision (#6412, #6537)
[models] add crestereo implementation (#6310, #6629)
[models] MaxVit model (#6342)
[models] Make get_model_builder public (#6560)
[models] Add registration mechanism for models (#6333, #6369)
[models] Add MViT architecture in TorchVision for both V1 and V2 (#6198, #6373)
[models] Add SwinV2 model variant (#6246, #6266)
[reference scripts] Add stereo matching reference scripts (#6549, #6554, #6605)
[transforms] Added elastic transform in torchvision.transforms (#4938)
[build] Add M1 binary builds (#5948, #6135, #6140, #6110, #6132, #6324, #6122, #6409)

Improvements

[build] Various torchvision binary build improvements (#6396, #6201, #6230, #6199)
[build] Install NVJPEG on Windows for 11.6 and 11.7 CUDA (#6578)
[models] Change weights return type to Mapping in models api (#6097)
[models] Vectorize box decoding and encoding in FCOS (#6203, #6278)
[ci] Add CUDA 11.7 builds (#6425)
[ci] Various CI improvements (#6590, #6290, #6170, #6218)
[documentation] Various documentation improvements (#6276, #6163, #6450, #6294, #6572, #6176, #6340, #6314, #6427, #6536, #6215, #6150)
[documentation] Add new .. betastatus:: directive and document Beta APIs (#6115)
[hub] Expose on Hub the public methods of the registration API (#6364)
[io, documentation] DOC: add limitation of decode_jpeg in the function docstring (#6637)
[models] Make the assert message more verbose in vision transformer (#6583)
[ops] Generalize ConvNormActivation function to accept tuple for some parameters (#6251)
[reference scripts] Update the dataset cache to factor input parameters (#6234)
[reference scripts] Adding video level accuracy for video_classification reference script (#6241)
[reference scripts] refactor: replace LambdaLR with PolynomialLR in segmentation training script (#6405, #6436)
[reference scripts, documentation] Introduce resize params, fix lr estimation, update docs. (#6444)
[reference scripts, transforms] Add SimpleCopyPaste augmentation (#5825)
[rocm, ci] Update to rocm5.2 wheels (#6571)
[tests] Various tests improvements (#6601, #6380, #6497, #6248, #6660, #6027, #6226, #6594, #6747, #6272)
[tests] Skip big models on CI tests (#6539, #6197, #6573)
[transforms] Added antialias arg to resized crop transform and op (#6193)
[transforms] Refactored and modified private api for resize functional op (#6191)
[utils] Throw ValueError in draw bounding boxes for invalid boxes (#6123)
[utils] Extend _log_api_usage_once to work for overwritten classes (#6237)
[video] Add more logging information for decoder (#6108)
[video] [FBcode->GH] Handle images with AV_PIX_FMT_PAL8 pixel format in decoder callback (#6359)
[io] Add an option to skip packets with empty data (#6442)
[datasets] Put back CelebA download (#6147)
[datasets, tests] Update link to download SBU dataset. Enable the test again (#6638)

Bug Fixes

[build] Set MACOSX_DEPLOYMENT_TARGET=10.9 for binary jobs (#6298)
[ci] Fixing issue with setup_env.sh in docker (#6106)
[datasets] swap MD5 checksums of PCAM val and test split (#6644)
[documentation] fix example galleries in documentation (#6701)
[hub] Add missing resnext101_64x4d to hubconf.py (#6228)
[io] Fix out-of-bounds read in decode_png (#6456)
[models] Fix swapped width and height in DefaultBoxGenerator (#6551)
[models] Fix the error message of _ovewrite_value_param (#6585)
[models] Add missing handle_legacy_interface() calls (#6565)
[models] Fix resnet model by checking if norm_layer weight is None before init (#6082)
[models] Adding _log_api_usage_once to Swin's reusable components. (#6174)
[models] Move out the pad operation from PatchMerging in swin transformer to make it fx compatible (#6252)
[models] Add missing _version to the MLPBlock (#6113)
[ops] Fix d/c IoU for different batch sizes (#6338)
[ops] update roipool to make it torch fx traceable (#6501)
[ops] Fix typing jit issue on RoIPool and RoIAlign (#6397)
[reference scripts] Fix copypaste collate pickle issues (#6181)
[reference scripts] Remove the unused/buggy --train-center-crop flag from Classification preset (#6642)
[tests] Add .float() before .mean() on test_backbone_utils.py because .mean() doesn't accept integer dtype (#6090)
[transforms] Update pil_constants.py (#6154)
[transforms] Fixed issue with F.crop when cropping outside the input image (#6615)
[transforms] Bugfix for accimage test on functional_pil.resize image (#6208)
[transforms] Fixed error condition in RandomCrop (#6548)
[video] [FBcode->GH] Move func calls outside of *CHECK* in io decoder (#6357)
[video] [bugfix] Fix the output format for VideoClips.subset (#6700) (#6706)
[video] fix bug in output format for pyav (#6672) (#6703)
[ci] Fix for cygpath windows issue (#6513)
[ops] Replaced CHECK_ by TORCH_CHECK_ (#6322)
[build] Fix typo in GHA nightly build condition (#6158)

Code Quality

[ci, test] Improvements on CI and test code quality (#6413, #6303, #6652, #6360, #6493, #6146, #6593, #6297, #6678, #6389)
[ci] Upgrade usort to 1.0.2 and black to 22.3.0 (#5106)
[reference scripts] [FBcode->GH] Rename asset files to remove spaces. (#6666)
[build] fix submodule imports by importing functions directly (#6188)
[datasets] Simplify _check_integrity for cifar and stl10 (#6335)
[datasets] Moved pfm file reading into dataset utils (#6270)
[documentation] Docs: build with Sphinx 5 (#5121)
[models] Typo fix in comment in mvit.py (#6618)
[models] cleanup for box encoding and decoding in FCOS (#6277)
[ops] Remove AffineQuantizer.h from qnms_kernel.cpp (#6141)
[reference scripts] Type fix in transformers.py (#6376)
[transforms] Fix typo in error message (#6291)
[transforms] Update typehint for fill arg in rotate (#6594)
[io] Free avPacket on EAGAIN decoder error (#6432) (#6443)
[android] [pytorch] Bump SoLoader version to 0.10.4 (#81946) (#6327)
[transforms, reference script] port FixedSizeCrop from detection references to prototype transforms (#6417)
[models, transforms] Update the expected removal date for several deprecated API for release v0.14 (#6654)
[tests] Replace torch.utils.data.graph.traverse with traverse_dps (#6657)
[build] Replacing cudatoolkit by cuda for 11.6 (#5996)
[ops] [FBcode->GH] [quant][core][better-engineering] Rename files in quantized directory… (#6133)
[build] [BE] Unify version computation (#6117)
[models] Refactor swin transformer so later we can reuse component for 3d version (#6088)
[models] [FBcode->GH] Fix vit model assert message to be compatible with torchmultimodal test (#6592)

Contributors

We're grateful for our community, which helps us improve torchvision by submitting issues and PRs and by providing feedback and suggestions. The following people have contributed patches for this release:

Abhijit Deo, Adam J. Stewart, Aditya Oke, Alexander Jipa, Ambuj Pawar, Andrey Talman, dzdang, Edward Wang (EcoF), Eli Uriegas, Erjia Guan, Federico Pozzi, inisis, Jithun Nair, Joao Gomes, Karan Desai, Kevin Tse, Lenz, Lezwon Castelino, Mayanand, Nicolas Granger, Nicolas Hug, Nikita Shulga, Oleksandr Voietsa, Philip Meier, Ponku, ptrblck, Sergii Dymchenko, Sergiy Bilobrov, Shantanu, Sim Sun, Sophia Zhi, Tinson Lai, Vasilis Vryniotis, vcarpani, vcwai, vfdev-5, Yakhyokhuja Valikhujaev, Yosua Michael Maranatha, Zachariah Carmichael, キツネさん
