Unverified Commit cb99fe30 authored by Francisco Massa, committed by GitHub

Add detection and segmentation models to doc folder (#933)

parent face20bd
torchvision.models
##################
The models subpackage contains definitions of models for addressing
different tasks, including: image classification, pixelwise semantic
segmentation, object detection, instance segmentation and person
keypoint detection.
Classification
==============
The models subpackage contains definitions for the following model
architectures for image classification:

- `AlexNet`_
- `VGG`_
...@@ -182,8 +192,149 @@ MobileNet v2
.. autofunction:: mobilenet_v2

ResNext
-------

.. autofunction:: resnext50_32x4d
.. autofunction:: resnext101_32x8d
Semantic Segmentation
=====================
As with image classification models, all pre-trained models expect input images normalized in the same way.
The images have to be loaded into a range of ``[0, 1]`` and then normalized using
``mean = [0.485, 0.456, 0.406]`` and ``std = [0.229, 0.224, 0.225]``.
They have been trained on images resized such that their minimum size is 520.
The pre-trained models have been trained on a subset of COCO train2017, on the 20 categories that are
present in the Pascal VOC dataset. You can see more information on how the subset has been selected in
``references/segmentation/coco_utils.py``. The classes that the pre-trained model outputs are the following,
in order:
.. code-block:: python

    ['__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
     'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
     'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
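
For illustration, the snippet below is a minimal inference sketch, not part of the
reference documentation: it assumes the ``fcn_resnet101`` entry point documented
below, and that the segmentation models return a dict whose ``'out'`` entry holds
the per-class scores.

.. code-block:: python

    import torch
    import torchvision

    # Load a pre-trained segmentation model and switch to inference mode.
    model = torchvision.models.segmentation.fcn_resnet101(pretrained=True)
    model.eval()

    # Inputs must be in [0, 1] and normalized with the mean/std given above.
    mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
    img = torch.rand(3, 520, 520)           # stand-in for a real [0, 1] RGB image
    batch = ((img - mean) / std).unsqueeze(0)

    with torch.no_grad():
        out = model(batch)['out']           # [1, 21, H, W] per-class scores
    pred = out.argmax(1)                    # per-pixel indices into the class list above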
The accuracies of the pre-trained models evaluated on COCO val2017 are as follows:
================================ ============= ====================
Network mean IoU global pixelwise acc
================================ ============= ====================
FCN ResNet101 63.7 91.9
DeepLabV3 ResNet101 67.4 92.4
================================ ============= ====================
Fully Convolutional Networks
----------------------------
.. autofunction:: torchvision.models.segmentation.fcn_resnet50
.. autofunction:: torchvision.models.segmentation.fcn_resnet101
DeepLabV3
---------
.. autofunction:: torchvision.models.segmentation.deeplabv3_resnet50
.. autofunction:: torchvision.models.segmentation.deeplabv3_resnet101
Object Detection, Instance Segmentation and Person Keypoint Detection
=====================================================================
The pre-trained models for detection, instance segmentation and
keypoint detection are initialized with the classification models
in torchvision.
The models expect a list of ``Tensor[C, H, W]``, in the range ``0-1``.
The models internally resize the images so that they have a minimum size
of ``800``. This option can be changed by passing the option ``min_size``
to the constructor of the models.
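
As a hedged sketch (assuming keyword arguments are forwarded by the factory
functions documented below to the model constructors), the default can be
overridden like this:

.. code-block:: python

    import torchvision

    # Resize inputs to a minimum side of 600 instead of the default 800.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        pretrained=True, min_size=600)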
For object detection and instance segmentation, the pre-trained
models return the predictions of the following classes:
.. code-block:: python

    COCO_INSTANCE_CATEGORY_NAMES = [
        '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
        'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
        'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
        'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
        'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
        'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
        'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
        'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
        'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
        'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
        'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
        'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
    ]
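
For illustration, the following is a hedged sketch (not part of the reference
docs) of mapping predicted label indices back to names, assuming the
``boxes``/``labels``/``scores`` output fields described for the detection
models below:

.. code-block:: python

    import torch
    import torchvision

    # Run detection on a dummy image and decode the predicted labels.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    x = [torch.rand(3, 300, 400)]           # list of [C, H, W] tensors in [0, 1]
    with torch.no_grad():
        predictions = model(x)              # one Dict per input image

    for label, score in zip(predictions[0]['labels'], predictions[0]['scores']):
        if score > 0.5:                     # keep confident detections only
            print(COCO_INSTANCE_CATEGORY_NAMES[label])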
Here is a summary of the accuracies for the models trained on
the instances set of COCO train2017 and evaluated on COCO val2017:
================================ ======= ======== ===========
Network box AP mask AP keypoint AP
================================ ======= ======== ===========
Faster R-CNN ResNet-50 FPN 37.0 - -
Mask R-CNN ResNet-50 FPN 37.9 34.6 -
================================ ======= ======== ===========
For person keypoint detection, the accuracies for the pre-trained
models are as follows:
================================ ======= ======== ===========
Network box AP mask AP keypoint AP
================================ ======= ======== ===========
Keypoint R-CNN ResNet-50 FPN 54.6 - 65.0
================================ ======= ======== ===========
For person keypoint detection, the pre-trained model returns the
keypoints in the following order:
.. code-block:: python

    COCO_PERSON_KEYPOINT_NAMES = [
        'nose',
        'left_eye',
        'right_eye',
        'left_ear',
        'right_ear',
        'left_shoulder',
        'right_shoulder',
        'left_elbow',
        'right_elbow',
        'left_wrist',
        'right_wrist',
        'left_hip',
        'right_hip',
        'left_knee',
        'right_knee',
        'left_ankle',
        'right_ankle'
    ]
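
As a hedged sketch (assuming the ``keypoints`` output field of shape
``[N, K, 3]`` described for Keypoint R-CNN below), a named keypoint can be
looked up like this:

.. code-block:: python

    import torch
    import torchvision

    model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    x = [torch.rand(3, 300, 400)]
    with torch.no_grad():
        predictions = model(x)

    # Locate the nose of the first detected person, if any.
    kps = predictions[0]['keypoints']       # [N, 17, 3] in [x, y, visibility] format
    nose_idx = COCO_PERSON_KEYPOINT_NAMES.index('nose')
    if kps.shape[0] > 0:
        x_coord, y_coord, visibility = kps[0, nose_idx]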
Faster R-CNN
------------
.. autofunction:: torchvision.models.detection.fasterrcnn_resnet50_fpn
Mask R-CNN
----------
.. autofunction:: torchvision.models.detection.maskrcnn_resnet50_fpn
Keypoint R-CNN
--------------
.. autofunction:: torchvision.models.detection.keypointrcnn_resnet50_fpn
...@@ -32,19 +32,20 @@ class FasterRCNN(GeneralizedRCNN):
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:

- boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
  between 0 and H and 0 and W
- labels (Tensor[N]): the class label for each ground-truth box

The model returns a Dict[Tensor] during training, containing the classification and regression
losses for both the RPN and the R-CNN.

During inference, the model requires only the input tensors, and returns the post-processed
predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
follows:

- boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
  0 and H and 0 and W
- labels (Tensor[N]): the predicted labels for each image
- scores (Tensor[N]): the scores for each prediction

Arguments:
    backbone (nn.Module): the network used to compute the features for the model.
...@@ -257,6 +258,35 @@ def fasterrcnn_resnet50_fpn(pretrained=False, progress=True,
"""
Constructs a Faster R-CNN model with a ResNet-50-FPN backbone.
The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
image, and should be in ``0-1`` range. Different images can have different sizes.
The behavior of the model changes depending on whether it is in training or evaluation mode.
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:
- boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
between ``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the class label for each ground-truth box
The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
losses for both the RPN and the R-CNN.
During inference, the model requires only the input tensors, and returns the post-processed
predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
follows:
- boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the predicted labels for each image
- scores (``Tensor[N]``): the scores for each prediction
Example::
>>> model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
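>>> # A hedged sketch, not part of the original example: a training-mode call
>>> # with targets shaped like the boxes/labels fields described above.
>>> model.train()
>>> targets = [{'boxes': torch.tensor([[10., 20., 100., 200.]]),
...             'labels': torch.tensor([1])}]
>>> loss_dict = model([torch.rand(3, 300, 400)], targets)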
Arguments:
    pretrained (bool): If True, returns a model pre-trained on COCO train2017
    progress (bool): If True, displays a progress bar of the download to stderr
...
...@@ -26,22 +26,23 @@ class KeypointRCNN(FasterRCNN):
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:

- boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
  between 0 and H and 0 and W
- labels (Tensor[N]): the class label for each ground-truth box
- keypoints (Tensor[N, K, 3]): the K keypoints location for each of the N instances, in the
  format [x, y, visibility], where visibility=0 means that the keypoint is not visible

The model returns a Dict[Tensor] during training, containing the classification and regression
losses for both the RPN and the R-CNN, and the keypoint loss.

During inference, the model requires only the input tensors, and returns the post-processed
predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
follows:

- boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
  0 and H and 0 and W
- labels (Tensor[N]): the predicted labels for each image
- scores (Tensor[N]): the scores for each prediction
- keypoints (Tensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format

Arguments:
    backbone (nn.Module): the network used to compute the features for the model.
...@@ -228,6 +229,38 @@ def keypointrcnn_resnet50_fpn(pretrained=False, progress=True,
"""
Constructs a Keypoint R-CNN model with a ResNet-50-FPN backbone.
The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
image, and should be in ``0-1`` range. Different images can have different sizes.
The behavior of the model changes depending on whether it is in training or evaluation mode.
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:
- boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
between ``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the class label for each ground-truth box
- keypoints (``Tensor[N, K, 3]``): the ``K`` keypoints location for each of the ``N`` instances, in the
format ``[x, y, visibility]``, where ``visibility=0`` means that the keypoint is not visible.
The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
losses for both the RPN and the R-CNN, and the keypoint loss.
During inference, the model requires only the input tensors, and returns the post-processed
predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
follows:
- boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the predicted labels for each image
- scores (``Tensor[N]``): the scores for each prediction
- keypoints (``Tensor[N, K, 3]``): the locations of the predicted keypoints, in ``[x, y, v]`` format.
Example::
>>> model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
Arguments:
    pretrained (bool): If True, returns a model pre-trained on COCO train2017
    progress (bool): If True, displays a progress bar of the download to stderr
...
...@@ -28,23 +28,24 @@ class MaskRCNN(FasterRCNN):
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:

- boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
  between 0 and H and 0 and W
- labels (Tensor[N]): the class label for each ground-truth box
- masks (Tensor[N, H, W]): the segmentation binary masks for each instance

The model returns a Dict[Tensor] during training, containing the classification and regression
losses for both the RPN and the R-CNN, and the mask loss.

During inference, the model requires only the input tensors, and returns the post-processed
predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
follows:

- boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
  0 and H and 0 and W
- labels (Tensor[N]): the predicted labels for each image
- scores (Tensor[N]): the scores for each prediction
- masks (Tensor[N, H, W]): the predicted masks for each instance, in 0-1 range. In order to
  obtain the final segmentation masks, the soft masks can be thresholded, generally
  with a value of 0.5 (masks >= 0.5)

Arguments:
    backbone (nn.Module): the network used to compute the features for the model.
...@@ -226,6 +227,39 @@ def maskrcnn_resnet50_fpn(pretrained=False, progress=True,
"""
Constructs a Mask R-CNN model with a ResNet-50-FPN backbone.
The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
image, and should be in ``0-1`` range. Different images can have different sizes.
The behavior of the model changes depending on whether it is in training or evaluation mode.
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:
- boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
between ``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the class label for each ground-truth box
- masks (``Tensor[N, H, W]``): the segmentation binary masks for each instance
The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
losses for both the RPN and the R-CNN, and the mask loss.
During inference, the model requires only the input tensors, and returns the post-processed
predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
follows:
- boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the predicted labels for each image
- scores (``Tensor[N]``): the scores for each prediction
- masks (``Tensor[N, H, W]``): the predicted masks for each instance, in ``0-1`` range. In order to
  obtain the final segmentation masks, the soft masks can be thresholded, generally
  with a value of 0.5 (``masks >= 0.5``)
Example::
>>> model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
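>>> # A hedged follow-on, not in the original example: binarize the soft
>>> # instance masks described above; 0.5 is the conventional threshold.
>>> binary_masks = predictions[0]['masks'] >= 0.5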
Arguments:
    pretrained (bool): If True, returns a model pre-trained on COCO train2017
    progress (bool): If True, displays a progress bar of the download to stderr
...