OpenDAS / vision · Commit cb99fe30 (unverified)
Authored May 21, 2019 by Francisco Massa; committed by GitHub on May 21, 2019.
Add detection and segmentation models to doc folder (#933)
Parent: face20bd

Showing 4 changed files with 279 additions and 31 deletions (+279, -31):
- docs/source/models.rst (+154, -3)
- torchvision/models/detection/faster_rcnn.py (+37, -7)
- torchvision/models/detection/keypoint_rcnn.py (+43, -10)
- torchvision/models/detection/mask_rcnn.py (+45, -11)
docs/source/models.rst
torchvision.models
##################

The models subpackage contains definitions of models for addressing
different tasks, including: image classification, pixelwise semantic
segmentation, object detection, instance segmentation and person
keypoint detection.


Classification
==============

The models subpackage contains definitions for the following model
architectures for image classification:

- `AlexNet`_
- `VGG`_
...
@@ -182,8 +192,149 @@ MobileNet v2
.. autofunction:: mobilenet_v2


ResNext
-------

.. autofunction:: resnext50_32x4d
.. autofunction:: resnext101_32x8d
Semantic Segmentation
=====================

As with image classification models, all pre-trained models expect input images normalized in the same way.
The images have to be loaded into a range of ``[0, 1]`` and then normalized using
``mean = [0.485, 0.456, 0.406]`` and ``std = [0.229, 0.224, 0.225]``.
They have been trained on images resized such that their minimum size is 520.

The pre-trained models have been trained on a subset of COCO train2017, on the 20 categories that are
present in the Pascal VOC dataset. You can see more information on how the subset has been selected in
``references/segmentation/coco_utils.py``. The classes that the pre-trained model outputs are the following,
in order:

.. code-block:: python

    ['__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
     'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
     'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
The accuracies of the pre-trained models evaluated on COCO val2017 are as follows:

================================  =============  ====================
Network                           mean IoU       global pixelwise acc
================================  =============  ====================
FCN ResNet101                     63.7           91.9
DeepLabV3 ResNet101               67.4           92.4
================================  =============  ====================
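For illustration, here is a minimal inference sketch (not part of this commit) that applies the normalization described above and runs one of the pre-trained segmentation models; the random 520×520 tensor is a stand-in for a real, preprocessed image:

.. code-block:: python

    import torch
    import torchvision
    from torchvision import transforms

    # Normalization pipeline using the mean/std documented above; it would
    # be applied to a PIL image loaded from disk.
    preprocess = transforms.Compose([
        transforms.ToTensor(),  # converts to the [0, 1] range
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    model = torchvision.models.segmentation.fcn_resnet101(pretrained=True)
    model.eval()

    x = torch.rand(1, 3, 520, 520)  # stand-in batch; a real image would go through `preprocess`
    with torch.no_grad():
        out = model(x)['out']       # [N, 21, H, W]: one score map per class listed above
    pred = out.argmax(dim=1)        # [N, H, W] per-pixel class indices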
Fully Convolutional Networks
----------------------------

.. autofunction:: torchvision.models.segmentation.fcn_resnet50
.. autofunction:: torchvision.models.segmentation.fcn_resnet101


DeepLabV3
---------

.. autofunction:: torchvision.models.segmentation.deeplabv3_resnet50
.. autofunction:: torchvision.models.segmentation.deeplabv3_resnet101
Object Detection, Instance Segmentation and Person Keypoint Detection
=====================================================================

The pre-trained models for detection, instance segmentation and
keypoint detection are initialized with the classification models
in torchvision.

The models expect a list of ``Tensor[C, H, W]``, in the range ``0-1``.
The models internally resize the images so that they have a minimum size
of ``800``. This option can be changed by passing the option ``min_size``
to the constructor of the models (the sketch after the class list below
passes it explicitly).

For object detection and instance segmentation, the pre-trained
models return the predictions of the following classes:
.. code-block:: python

    COCO_INSTANCE_CATEGORY_NAMES = [
        '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
        'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
        'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
        'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
        'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
        'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
        'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
        'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
        'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
        'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
        'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
        'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
    ]
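A minimal inference sketch (not part of the diff) showing how the integer labels a detector returns map onto this list, and how the ``min_size`` option mentioned above is passed; the 0.5 score threshold is an arbitrary choice:

.. code-block:: python

    import torch
    import torchvision

    # `min_size` is the constructor option described above; 800 is its default.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        pretrained=True, min_size=800)
    model.eval()

    x = [torch.rand(3, 300, 400)]  # one image, values in the 0-1 range
    with torch.no_grad():
        pred = model(x)[0]         # one Dict per input image

    # Map integer labels onto the category names above and keep only
    # confident detections (0.5 is an arbitrary cut-off).
    names = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in pred['labels']]
    keep = pred['scores'] > 0.5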
Here is a summary of the accuracies for the models trained on
the instances set of COCO train2017 and evaluated on COCO val2017:

================================  =======  ========  ===========
Network                           box AP   mask AP   keypoint AP
================================  =======  ========  ===========
Faster R-CNN ResNet-50 FPN        37.0     -         -
Mask R-CNN ResNet-50 FPN          37.9     34.6      -
================================  =======  ========  ===========
For person keypoint detection, the accuracies for the pre-trained
models are as follows:

================================  =======  ========  ===========
Network                           box AP   mask AP   keypoint AP
================================  =======  ========  ===========
Keypoint R-CNN ResNet-50 FPN      54.6     -         65.0
================================  =======  ========  ===========
For person keypoint detection, the pre-trained model returns the
keypoints in the following order:

.. code-block:: python
    COCO_PERSON_KEYPOINT_NAMES = [
        'nose',
        'left_eye',
        'right_eye',
        'left_ear',
        'right_ear',
        'left_shoulder',
        'right_shoulder',
        'left_elbow',
        'right_elbow',
        'left_wrist',
        'right_wrist',
        'left_hip',
        'right_hip',
        'left_knee',
        'right_knee',
        'left_ankle',
        'right_ankle'
    ]
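A small sketch (again, not part of the diff) pairing a predicted ``Tensor[N, K, 3]`` with these names:

.. code-block:: python

    import torch
    import torchvision

    model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    x = [torch.rand(3, 300, 400)]
    with torch.no_grad():
        pred = model(x)[0]

    # `keypoints` is Tensor[N, K, 3] in [x, y, v] format; pair the K entries
    # of the first detected person with the names above.
    if len(pred['keypoints']) > 0:
        for name, (kx, ky, v) in zip(COCO_PERSON_KEYPOINT_NAMES, pred['keypoints'][0]):
            print(f'{name}: ({float(kx):.1f}, {float(ky):.1f}), visibility {float(v):.0f}')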
Faster R-CNN
------------

.. autofunction:: torchvision.models.detection.fasterrcnn_resnet50_fpn


Mask R-CNN
----------

.. autofunction:: torchvision.models.detection.maskrcnn_resnet50_fpn


Keypoint R-CNN
--------------

.. autofunction:: torchvision.models.detection.keypointrcnn_resnet50_fpn
torchvision/models/detection/faster_rcnn.py
...
@@ -32,19 +32,20 @@ class FasterRCNN(GeneralizedRCNN):
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:

- boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
  between 0 and H and 0 and W
- labels (Tensor[N]): the class label for each ground-truth box

The model returns a Dict[Tensor] during training, containing the classification and regression
losses for both the RPN and the R-CNN.

During inference, the model requires only the input tensors, and returns the post-processed
predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
follows:

- boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
  0 and H and 0 and W
- labels (Tensor[N]): the predicted labels for each image
- scores (Tensor[N]): the scores of each prediction

Arguments:
    backbone (nn.Module): the network used to compute the features for the model.
...
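The training-mode contract above can be exercised with a short sketch (not part of the commit); the box and label here are placeholder values, not real annotations:

.. code-block:: python

    import torch
    import torchvision

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        pretrained=False, num_classes=2)
    model.train()

    images = [torch.rand(3, 300, 400)]
    targets = [{
        'boxes': torch.tensor([[50.0, 60.0, 150.0, 180.0]]),  # [x0, y0, x1, y1]
        'labels': torch.tensor([1]),                          # class index per box
    }]

    loss_dict = model(images, targets)  # e.g. loss_classifier, loss_box_reg,
    losses = sum(loss_dict.values())    # loss_objectness, loss_rpn_box_reg
    losses.backward()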
@@ -257,6 +258,35 @@ def fasterrcnn_resnet50_fpn(pretrained=False, progress=True,
"""
"""
Constructs a Faster R-CNN model with a ResNet-50-FPN backbone.
Constructs a Faster R-CNN model with a ResNet-50-FPN backbone.
The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
image, and should be in ``0-1`` range. Different images can have different sizes.
The behavior of the model changes depending if it is in training or evaluation mode.
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:
- boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
between ``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the class label for each ground-truth box
The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
losses for both the RPN and the R-CNN.
During inference, the model requires only the input tensors, and returns the post-processed
predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
follows:
- boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the predicted labels for each image
- scores (``Tensor[N]``): the scores or each prediction
Example::
>>> model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
Arguments:
Arguments:
pretrained (bool): If True, returns a model pre-trained on COCO train2017
pretrained (bool): If True, returns a model pre-trained on COCO train2017
progress (bool): If True, displays a progress bar of the download to stderr
progress (bool): If True, displays a progress bar of the download to stderr
...
torchvision/models/detection/keypoint_rcnn.py
...
@@ -26,22 +26,23 @@ class KeypointRCNN(FasterRCNN):
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:

- boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
  between 0 and H and 0 and W
- labels (Tensor[N]): the class label for each ground-truth box
- keypoints (Tensor[N, K, 3]): the K keypoint locations for each of the N instances, in the
  format [x, y, visibility], where visibility=0 means that the keypoint is not visible.

The model returns a Dict[Tensor] during training, containing the classification and regression
losses for both the RPN and the R-CNN, and the keypoint loss.

During inference, the model requires only the input tensors, and returns the post-processed
predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
follows:

- boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
  0 and H and 0 and W
- labels (Tensor[N]): the predicted labels for each image
- scores (Tensor[N]): the scores of each prediction
- keypoints (Tensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format.

Arguments:
    backbone (nn.Module): the network used to compute the features for the model.
...
@@ -228,6 +229,38 @@ def keypointrcnn_resnet50_fpn(pretrained=False, progress=True,
"""
"""
Constructs a Keypoint R-CNN model with a ResNet-50-FPN backbone.
Constructs a Keypoint R-CNN model with a ResNet-50-FPN backbone.
The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
image, and should be in ``0-1`` range. Different images can have different sizes.
The behavior of the model changes depending if it is in training or evaluation mode.
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:
- boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
between ``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the class label for each ground-truth box
- keypoints (``Tensor[N, K, 3]``): the ``K`` keypoints location for each of the ``N`` instances, in the
format ``[x, y, visibility]``, where ``visibility=0`` means that the keypoint is not visible.
The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
losses for both the RPN and the R-CNN, and the keypoint loss.
During inference, the model requires only the input tensors, and returns the post-processed
predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
follows:
- boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the predicted labels for each image
- scores (``Tensor[N]``): the scores or each prediction
- keypoints (``Tensor[N, K, 3]``): the locations of the predicted keypoints, in ``[x, y, v]`` format.
Example::
>>> model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
Arguments:
Arguments:
pretrained (bool): If True, returns a model pre-trained on COCO train2017
pretrained (bool): If True, returns a model pre-trained on COCO train2017
progress (bool): If True, displays a progress bar of the download to stderr
progress (bool): If True, displays a progress bar of the download to stderr
...
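Analogous to the Faster R-CNN sketch earlier, the keypoint-specific targets described in this docstring can be illustrated with placeholder values (not part of the commit):

.. code-block:: python

    import torch
    import torchvision

    # Defaults shown explicitly: 2 classes (background + person), 17 keypoints.
    model = torchvision.models.detection.keypointrcnn_resnet50_fpn(
        pretrained=False, num_classes=2, num_keypoints=17)
    model.train()

    images = [torch.rand(3, 300, 400)]
    targets = [{
        'boxes': torch.tensor([[50.0, 60.0, 150.0, 180.0]]),   # [x0, y0, x1, y1]
        'labels': torch.tensor([1]),                           # person
        # Tensor[N, K, 3] in [x, y, visibility]; placeholder points inside the box
        'keypoints': torch.tensor([[[100.0, 120.0, 1.0]] * 17]),
    }]

    loss_dict = model(images, targets)  # includes the keypoint loss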
torchvision/models/detection/mask_rcnn.py
...
@@ -28,21 +28,22 @@ class MaskRCNN(FasterRCNN):
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:

- boxes (Tensor[N, 4]): the ground-truth boxes in [x0, y0, x1, y1] format, with values
  between 0 and H and 0 and W
- labels (Tensor[N]): the class label for each ground-truth box
- masks (Tensor[N, H, W]): the segmentation binary masks for each instance

The model returns a Dict[Tensor] during training, containing the classification and regression
losses for both the RPN and the R-CNN, and the mask loss.

During inference, the model requires only the input tensors, and returns the post-processed
predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
follows:

- boxes (Tensor[N, 4]): the predicted boxes in [x0, y0, x1, y1] format, with values between
  0 and H and 0 and W
- labels (Tensor[N]): the predicted labels for each image
- scores (Tensor[N]): the scores of each prediction
- mask (Tensor[N, H, W]): the predicted masks for each instance, in 0-1 range. In order to
  obtain the final segmentation masks, the soft masks can be thresholded, generally
  with a value of 0.5 (mask >= 0.5)
...
@@ -226,6 +227,39 @@ def maskrcnn_resnet50_fpn(pretrained=False, progress=True,
"""
"""
Constructs a Mask R-CNN model with a ResNet-50-FPN backbone.
Constructs a Mask R-CNN model with a ResNet-50-FPN backbone.
The input to the model is expected to be a list of tensors, each of shape ``[C, H, W]``, one for each
image, and should be in ``0-1`` range. Different images can have different sizes.
The behavior of the model changes depending if it is in training or evaluation mode.
During training, the model expects both the input tensors, as well as a targets dictionary,
containing:
- boxes (``Tensor[N, 4]``): the ground-truth boxes in ``[x0, y0, x1, y1]`` format, with values
between ``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the class label for each ground-truth box
- masks (``Tensor[N, H, W]``): the segmentation binary masks for each instance
The model returns a ``Dict[Tensor]`` during training, containing the classification and regression
losses for both the RPN and the R-CNN, and the mask loss.
During inference, the model requires only the input tensors, and returns the post-processed
predictions as a ``List[Dict[Tensor]]``, one for each input image. The fields of the ``Dict`` are as
follows:
- boxes (``Tensor[N, 4]``): the predicted boxes in ``[x0, y0, x1, y1]`` format, with values between
``0`` and ``H`` and ``0`` and ``W``
- labels (``Tensor[N]``): the predicted labels for each image
- scores (``Tensor[N]``): the scores or each prediction
- mask (``Tensor[N, H, W]``): the predicted masks for each instance, in ``0-1`` range. In order to
obtain the final segmentation masks, the soft masks can be thresholded, generally
with a value of 0.5 (``mask >= 0.5``)
Example::
>>> model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
Arguments:
Arguments:
pretrained (bool): If True, returns a model pre-trained on COCO train2017
pretrained (bool): If True, returns a model pre-trained on COCO train2017
progress (bool): If True, displays a progress bar of the download to stderr
progress (bool): If True, displays a progress bar of the download to stderr
...
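The thresholding note in the docstring above amounts to a single comparison at inference time; a minimal sketch (not part of the commit), using the suggested 0.5 threshold:

.. code-block:: python

    import torch
    import torchvision

    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    x = [torch.rand(3, 300, 400)]
    with torch.no_grad():
        pred = model(x)[0]

    # The soft masks are in the 0-1 range; thresholding at 0.5, as the
    # docstring suggests, yields binary segmentation masks.
    binary_masks = pred['masks'] > 0.5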