This release brings several new features to torchvision, including models for semantic segmentation, object detection, instance segmentation and person keypoint detection, and custom C++ / CUDA ops specific to computer vision.

Note: torchvision 0.3 requires PyTorch 1.1 or newer.

Highlights

Reference training / evaluation scripts

The references/ folder now provides scripts for training and evaluating models on the following tasks: classification, semantic segmentation, object detection, instance segmentation and person keypoint detection.
Their purpose is twofold:

serve as a log of how to train a specific model
provide baseline training and evaluation scripts to bootstrap research

Each task has an entry point, train.py, which performs both training and evaluation for that task. Other helper files, specific to each training script, are also present in the folder, and they might get integrated into the torchvision library in the future.

We expect users to copy and modify these reference scripts for their own needs.

TorchVision Ops

TorchVision now contains custom C++ / CUDA operators in torchvision.ops. These operators are specific to computer vision and make it easier to build object detection models.
They do not currently support PyTorch script mode, but support for it is planned for future releases.

List of supported ops

roi_pool (and the module version RoIPool)
roi_align (and the module version RoIAlign)
nms, for non-maximum suppression of bounding boxes
box_iou, for computing the intersection over union metric between two sets of bounding boxes
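
As a rough illustration of the metric that box_iou computes, the following pure-Python sketch evaluates the intersection over union of two boxes in [x0, y0, x1, y1] format. The helper name iou_single is hypothetical and is not part of torchvision; the library op works on whole tensors of boxes rather than on one pair at a time.

```python
def iou_single(box_a, box_b):
    """IoU of two boxes given as (x0, y0, x1, y1).

    Illustrative helper only, not a torchvision function.
    """
    # corners of the intersection rectangle
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])

    # width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


a = (0.0, 0.0, 10.0, 10.0)
b = (5.0, 5.0, 15.0, 15.0)
# intersection is 5 x 5 = 25, union is 100 + 100 - 25 = 175
print(iou_single(a, b))
```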

All the other ops present in torchvision.ops and its subfolders are experimental, in particular:

FeaturePyramidNetwork is a module that adds an FPN on top of a module that returns a set of feature maps.
MultiScaleRoIAlign is a wrapper around roi_align that works with multiple feature map scales.

Here are a few examples of using torchvision ops:

import torch
import torchvision

# create 10 random boxes
boxes = torch.rand(10, 4) * 100
# they need to be in [x0, y0, x1, y1] format
boxes[:, 2:] += boxes[:, :2]
# create a random image
image = torch.rand(1, 3, 200, 200)
# extract regions in `image` defined in `boxes`, rescaling
# them to have a size of 3x3
pooled_regions = torchvision.ops.roi_align(image, [boxes], output_size=(3, 3))
# check the size
print(pooled_regions.shape)
# torch.Size([10, 3, 3, 3])

# or compute the intersection over union between
# all pairs of boxes
print(torchvision.ops.box_iou(boxes, boxes).shape)
# torch.Size([10, 10])

Models for more tasks

The 0.3 release of torchvision includes pre-trained models for tasks other than image classification on ImageNet.
We include two new categories of models: region-based models, like Faster R-CNN, and dense pixelwise prediction models, like DeepLabV3.

Object Detection, Instance Segmentation and Person Keypoint Detection models

Warning: The API is currently experimental and might change in future versions of torchvision

The 0.3 release contains pre-trained models for Faster R-CNN, Mask R-CNN and Keypoint R-CNN, all of them using a ResNet-50 backbone with FPN.
They have been trained on COCO train2017 following the reference scripts in references/, and give the following results on COCO val2017:

Network	box AP	mask AP	keypoint AP
Faster R-CNN ResNet-50 FPN	37.0	-	-
Mask R-CNN ResNet-50 FPN	37.9	34.6	-
Keypoint R-CNN ResNet-50 FPN	54.6	-	65.0

The implementations of the models for object detection, instance segmentation and keypoint detection are fast, especially during training.

The timings in the following table were measured on 8 V100 GPUs with CUDA 10.0 and cuDNN 7.4. Training uses a batch size of 2 per GPU, and testing uses a batch size of 1.

The test time covers model evaluation and post-processing (including pasting the masks into the image), but not the time for computing the precision-recall.

Network	train time (s / it)	test time (s / it)	memory (GB)
Faster R-CNN ResNet-50 FPN	0.2288	0.0590	5.2
Mask R-CNN ResNet-50 FPN	0.2728	0.0903	5.4
Keypoint R-CNN ResNet-50 FPN	0.3789	0.1242	6.8

You can load and use pre-trained detection and segmentation models with a few lines of code:

import PIL.Image
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
# set it to evaluation mode, as the model behaves differently
# during training and during evaluation
model.eval()

image = PIL.Image.open('/path/to/an/image.jpg')
image_tensor = torchvision.transforms.functional.to_tensor(image)

# pass a list of (potentially different sized) tensors
# to the model, in 0-1 range. The model will take care of
# batching them together and normalizing
output = model([image_tensor])
# output is a list of dicts (one per image) containing the post-processed predictions
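
For Mask R-CNN, each dict in the output holds the tensors boxes, labels, scores and masks. The sketch below shows one way the predictions could be filtered by confidence. The filter_predictions helper and the 0.5 threshold are hypothetical choices made for this example, and a hand-built dict stands in for a real model output so that the snippet runs without downloading weights.

```python
import torch


def filter_predictions(prediction, score_threshold=0.5):
    """Keep only detections scoring above the threshold (hypothetical helper)."""
    keep = prediction["scores"] > score_threshold
    # every per-detection tensor is indexed along its first dimension
    return {name: value[keep] for name, value in prediction.items()}


# stand-in for one element of the list returned by the model
prediction = {
    "boxes": torch.tensor([[10.0, 20.0, 50.0, 80.0], [0.0, 0.0, 5.0, 5.0]]),
    "labels": torch.tensor([1, 18]),
    "scores": torch.tensor([0.98, 0.12]),
    "masks": torch.rand(2, 1, 100, 100),
}

confident = filter_predictions(prediction)
print(confident["boxes"].shape, confident["labels"])
# torch.Size([1, 4]) tensor([1])
```

With a real model, the same helper would be applied to each element of output, for example filter_predictions(output[0]).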

Pixelwise Semantic Segmentation models

Warning: The API is currently experimental and might change in future versions of torchvision

The 0.3 release also contains models for dense pixelwise prediction on images.
It adds FCN and DeepLabV3 segmentation models, using ResNet50 and ResNet101 backbones.
Pre-trained weights for the ResNet101 backbone are available, and have been trained on a subset of COCO train2017, which contains the same 20 categories as those from Pascal VOC.
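
As a minimal sketch of how the predictions of such a model might be consumed: the segmentation models return a dict whose "out" entry holds per-class scores for every pixel, and taking the argmax over the class dimension gives a label map. The labels_from_output helper is hypothetical, and a random tensor stands in for a real model output so that the snippet runs without downloading weights. The 21 channels assume one background class plus the 20 categories.

```python
import torch


def labels_from_output(output):
    """Turn a segmentation model's output dict into a per-pixel label map.

    Hypothetical helper written for this example.
    """
    # output["out"] has shape [batch, num_classes, height, width]
    return output["out"].argmax(dim=1)


# With a real model this dict would come from something like:
#   model = torchvision.models.segmentation.deeplabv3_resnet101(pretrained=True)
#   model.eval()
#   output = model(batch)
# Here a random tensor with 21 classes stands in for it.
fake_output = {"out": torch.randn(1, 21, 8, 8)}

labels = labels_from_output(fake_output)
print(labels.shape)
# torch.Size([1, 8, 8])
```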

The pre-trained models give the following results on the subset of COCO val2017 which contains the same 20 categories as those present in Pascal VOC:

Network	mean IoU	global pixelwise acc
FCN ResNet101	63.7	91.9
DeepLabV3 ResNet101	67.4	92.4
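
For readers unfamiliar with the two metrics in the table, the sketch below shows how mean IoU and global pixelwise accuracy are commonly derived from a confusion matrix. It follows the usual definitions and is an assumption about how such numbers are computed, not code copied from the reference scripts. The function name and the toy 2-class matrix are made up for illustration.

```python
import torch


def segmentation_metrics(confusion):
    """Return (mean IoU, global pixel accuracy) from a confusion matrix.

    confusion[i, j] counts pixels of true class i predicted as class j.
    """
    confusion = confusion.float()
    correct = torch.diag(confusion)

    # fraction of all pixels that received the right label
    global_acc = correct.sum() / confusion.sum()

    # per-class IoU: true positives / (ground truth + predicted - true positives)
    iou = correct / (confusion.sum(1) + confusion.sum(0) - correct)
    return iou.mean().item(), global_acc.item()


# toy confusion matrix for two classes
confusion = torch.tensor([[50, 10], [5, 35]])
mean_iou, global_acc = segmentation_metrics(confusion)
print(round(mean_iou, 4), round(global_acc, 4))
# 0.7346 0.85
```

A class that never appears in either the ground truth or the predictions would produce a division by zero here; a fuller implementation would need to handle that case.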

New Datasets

Add Caltech101, Caltech256, and CelebA (#775)
ImageNet dataset (#764) (#858) (#870)
Added Semantic Boundaries Dataset (#808) (#865)
Add VisionDataset as a base class for all datasets (#749) (#859) (#838) (#876) (#878)

New Models

Classification

Add GoogLeNet (Inception v1) (#678) (#821) (#828) (#816)
Add MobileNet V2 (#818) (#917)
Add ShuffleNet v2 (#849) (#886) (#889) (#892) (#916)
Add ResNeXt-50 32x4d and ResNeXt-101 32x8d (#822) (#852) (#917)

Segmentation

Fully-Convolutional Network (FCN) with ResNet 101 backbone
DeepLabV3 with ResNet 101 backbone

Detection

Faster R-CNN R-50 FPN trained on COCO train2017 (#898) (#921)
Mask R-CNN R-50 FPN trained on COCO train2017 (#898) (#921)
Keypoint R-CNN R-50 FPN trained on COCO train2017 (#898) (#921) (#922)

Breaking changes

Make CocoDataset ids deterministically ordered (#868)

New Transforms

Add bias vector to LinearTransformation (#793) (#843) (#881)
Add Random Perspective transform (#781) (#879)

Bugfixes

Fix user warning when applying normalize (#810)
Fix logic error in check_integrity (#871)

Improvements

Fixing mutation of 2d tensors in to_pil_image (#762)
Replace tensor.view with tensor.unsqueeze(0) in make_grid (#765)
Change usage of view to reshape in resnet to enable running with mkldnn (#890)
Improve normalize to work with tensors located on any device (#787)
Raise an IndexError for FakeData.__getitem__() if the index would be out of range (#780)
Aspect ratio is now sampled from a logarithmic distribution in RandomResizedCrop. (#799)
Modernize inception v3 weight initialization code (#824)
Remove duplicate code from densenet load_state_dict (#827)
Replace endswith calls in a loop with a single endswith call in DatasetFolder (#832)
Added missing dot in webp image extensions (#836)
fix inconsistent behavior for ~ expression (#850)
Minor Compressions in statements in folder.py (#874)
Minor fix to evaluation formula of PILLOW_VERSION in transforms.functional.affine (#895)
added is_valid_file parameter to DatasetFolder (#867)
Add support for joint transformations in VisionDataset (#872)
Auto calculating return dimension of squeezenet forward method (#884)
Added progress flag to model getters (#875) (#910)
Add support for other normalizations (i.e., GroupNorm) in ResNet (#813)
Add dilation option to ResNet (#866)
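
To illustrate the RandomResizedCrop item above, here is a rough pure-Python sketch of what sampling an aspect ratio from a logarithmic distribution means. The sample_aspect_ratio helper is hypothetical and is not the torchvision implementation; the (3/4, 4/3) range is assumed as a typical default.

```python
import math
import random


def sample_aspect_ratio(ratio=(3.0 / 4.0, 4.0 / 3.0)):
    """Draw an aspect ratio uniformly in log space (illustrative helper)."""
    log_low, log_high = math.log(ratio[0]), math.log(ratio[1])
    return math.exp(random.uniform(log_low, log_high))


random.seed(0)
samples = [sample_aspect_ratio() for _ in range(10000)]

# fraction of samples wider than they are tall; expected to be close to 0.5
wide = sum(r > 1.0 for r in samples) / len(samples)
print(round(wide, 2))
```

Sampling in log space makes a ratio r and its inverse 1 / r equally likely, so wide and tall crops are balanced. Sampling uniformly between 3/4 and 4/3 would instead favor wide crops, since the interval above 1 is longer than the one below it.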

Testing

Add basic model testing. (#811)
Add test for num_class in test_model.py (#815)
Added test for normalize functionality in make_grid function. (#840)
Added downloaded directory not empty check in test_datasets_utils (#844)
Added test for save_image in utils (#847)
Added tests for check_md5 and check_integrity (#873)

Misc

Remove shebang in setup.py (#773)
configurable version and package names (#842)
More hub models (#851)
Update travis to use more recent GCC (#891)

Documentation

Add comments regarding downsampling layers of resnet (#794)
Remove unnecessary bullet point in InceptionV3 doc (#814)
Fix crop and resized_crop docs in functional.py (#817)
Added dimensions in the comments of googlenet (#788)
Update transform doc with random offset of padding due to pad_if_needed (#791)
Added the argument transform_input in docs of InceptionV3 (#789)
Update documentation for MNIST datasets (#778)
Fixed typo in normalize() function. (#823)
Fix typo in squeezenet (#841)
Fix typo in DenseNet comment (#857)
Typo and syntax fixes to transform docstrings (#887)

pytorch/vision v0.3.0 Training scripts, detection/segmentation models and more on GitHub