".github/vscode:/vscode.git/clone" did not exist on "b4172770ae2a5fb67c129a321a2f566a669f26d5"
Unverified Commit 730c5e1e authored by Vasilis Vryniotis, committed by GitHub

Add SSD architecture with VGG16 backbone (#3403)

* Early skeleton of API.

* Adding MultiFeatureMap and vgg16 backbone.

* Making vgg16 backbone same as paper.

* Making code generic to support all vggs.

* Moving vgg's extra layers to a separate class + L2 scaling.

* Adding header vgg layers.

* Fix maxpool patching.

* Refactoring code to allow for support of different backbones & sizes:
- Skeleton for Default Boxes generator class
- Dynamic estimation of configuration when possible
- Addition of types

* Complete the implementation of DefaultBox generator.

* Replace randn with empty.

* Minor refactoring

* Making clamping between 0 and 1 optional.

* Change xywh to xyxy encoding.

* Adding parameters and reusing objects in constructor.

* Temporarily inherit from Retina to avoid dup code.

* Implement forward methods + temp workarounds to inherit from retina.

* Inherit more methods from retinanet.

* Fix type error.

* Add Regression loss.

* Fixing JIT issues.

* Change JIT workaround to minimize new code.

* Fixing initialization bug.

* Add classification loss.

* Update todos.

* Add weight loading support.

* Support SSD512.

* Change kernel_size to get output size 1x1

* Add xavier init and refactoring.

* Adding unit-tests and fixing JIT issues.

* Add a test for dbox generator.

* Remove unnecessary import.

* Workaround on GeneralizedRCNNTransform to support fixed size input.

* Remove unnecessary random calls from the test.

* Remove more rand calls from the test.

* change mapping and handling of empty labels

* Fix JIT warnings.

* Speed up loss.

* Convert 0-1 dboxes to original size.

* Fix warning.

* Fix tests.

* Update comments.

* Fixing minor bugs.

* Introduce a custom DBoxMatcher.

* Minor refactoring

* Move extra layer definition inside feature extractor.

* handle no bias on init.

* Remove fixed image size limitation

* Change initialization values for bias of classification head.

* Refactoring and update test file.

* Adding ResNet backbone.

* Minor refactoring.

* Remove inheritance of retina and general refactoring.

* SSD should fix the input size.

* Fixing messages and comments.

* Silently ignoring exception if test-only.

* Update comments.

* Update regression loss.

* Restore Xavier init everywhere, update the negative sampling method, change the clipping approach.

* Fixing tests.

* Refactor to move the losses from the Head to the SSD.

* Removing resnet50 ssd version.

* Adding support for best performing backbone and its config.

* Refactor and clean up the API.

* Fix lint

* Update todos and comments.

* Adding RandomHorizontalFlip and RandomIoUCrop transforms.

* Adding necessary checks to our transforms.

* Adding RandomZoomOut.

* Adding RandomPhotometricDistort.

* Moving Detection transforms to references.

* Update presets

* fix lint

* leave compose and object

* Adding scaling for completeness.

* Adding params in the repr

* Remove unnecessary import.

* minor refactoring

* Remove unnecessary call.

* Give better names to DBox* classes

* Port num_anchors estimation in generator

* Remove rescaling and fix presets

* Add the ability to pass a custom head and refactoring.

* fix lint

* Fix unit-test

* Update todos.

* Change mean values.

* Change the default parameter of SSD to train the full VGG16 and remove the catch of exception for eval only.

* Adding documentation

* Adding weights and updating readmes.

* Update the model weights with a more performing model.

* Adding doc for head.

* Restore import.
parent 7c35e133
......@@ -381,17 +381,18 @@ Object Detection, Instance Segmentation and Person Keypoint Detection
The models subpackage contains definitions for the following model
architectures for detection:
- `Faster R-CNN ResNet-50 FPN <https://arxiv.org/abs/1506.01497>`_
- `Mask R-CNN ResNet-50 FPN <https://arxiv.org/abs/1703.06870>`_
- `Faster R-CNN <https://arxiv.org/abs/1506.01497>`_
- `Mask R-CNN <https://arxiv.org/abs/1703.06870>`_
- `RetinaNet <https://arxiv.org/abs/1708.02002>`_
- `SSD <https://arxiv.org/abs/1512.02325>`_
The pre-trained models for detection, instance segmentation and
keypoint detection are initialized with the classification models
in torchvision.
The models expect a list of ``Tensor[C, H, W]``, in the range ``0-1``.
The models internally resize the images so that they have a minimum size
of ``800``. This option can be changed by passing the option ``min_size``
to the constructor of the models.
The models internally resize the images but the behaviour varies depending
on the model. Check the constructor of the models for more information.
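For example, a minimal inference sketch with the new SSD model (assuming the released COCO weights can be downloaded):

```
import torch
import torchvision

# SSD300 with a VGG16 backbone, pre-trained on COCO train2017 (added in this PR).
model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
model.eval()

# A list of 0-1 ranged Tensor[C, H, W]; sizes may differ, the model resizes them
# internally (SSD uses a fixed 300x300 input).
images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
with torch.no_grad():
    predictions = model(images)  # List[Dict] with 'boxes', 'labels', 'scores'
```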
For object detection and instance segmentation, the pre-trained
......@@ -425,6 +426,7 @@ Faster R-CNN ResNet-50 FPN 37.0 - -
Faster R-CNN MobileNetV3-Large FPN 32.8 - -
Faster R-CNN MobileNetV3-Large 320 FPN 22.8 - -
RetinaNet ResNet-50 FPN 36.4 - -
SSD VGG16 25.1 - -
Mask R-CNN ResNet-50 FPN 37.9 34.6 -
====================================== ======= ======== ===========
......@@ -483,6 +485,7 @@ Faster R-CNN ResNet-50 FPN 0.2288 0.0590
Faster R-CNN MobileNetV3-Large FPN 0.1020 0.0415 1.0
Faster R-CNN MobileNetV3-Large 320 FPN 0.0978 0.0376 0.6
RetinaNet ResNet-50 FPN 0.2514 0.0939 4.1
SSD VGG16 0.2093 0.0744 1.5
Mask R-CNN ResNet-50 FPN 0.2728 0.0903 5.4
Keypoint R-CNN ResNet-50 FPN 0.3789 0.1242 6.8
====================================== =================== ================== ===========
......@@ -502,6 +505,12 @@ RetinaNet
.. autofunction:: torchvision.models.detection.retinanet_resnet50_fpn
SSD
------------
.. autofunction:: torchvision.models.detection.ssd300_vgg16
Mask R-CNN
----------
......
......@@ -48,6 +48,14 @@ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
--lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.01
```
### SSD VGG16
```
python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
--dataset coco --model ssd300_vgg16 --epochs 120\
--lr-steps 80 110 --aspect-ratio-group-factor 3 --lr 0.002 --batch-size 4\
--weight-decay 0.0005 --data-augmentation ssd
```
### Mask R-CNN
```
......
......@@ -2,12 +2,22 @@ import transforms as T
class DetectionPresetTrain:
def __init__(self, hflip_prob=0.5):
trans = [T.ToTensor()]
if hflip_prob > 0:
trans.append(T.RandomHorizontalFlip(hflip_prob))
self.transforms = T.Compose(trans)
def __init__(self, data_augmentation, hflip_prob=0.5, mean=(123., 117., 104.)):
if data_augmentation == 'hflip':
self.transforms = T.Compose([
T.RandomHorizontalFlip(p=hflip_prob),
T.ToTensor(),
])
elif data_augmentation == 'ssd':
self.transforms = T.Compose([
T.RandomPhotometricDistort(),
T.RandomZoomOut(fill=list(mean)),
T.RandomIoUCrop(),
T.RandomHorizontalFlip(p=hflip_prob),
T.ToTensor(),
])
else:
raise ValueError(f'Unknown data augmentation policy "{data_augmentation}"')
def __call__(self, img, target):
return self.transforms(img, target)
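A minimal usage sketch of the new 'ssd' augmentation policy (hypothetical dummy data; `presets` is the references/detection module above):

```
from PIL import Image
import torch
import presets  # references/detection/presets.py

preset = presets.DetectionPresetTrain(data_augmentation='ssd')
img = Image.new('RGB', (640, 480))  # stand-in for a real training image
target = {
    'boxes': torch.tensor([[10., 20., 200., 300.]]),  # xyxy
    'labels': torch.tensor([1]),
}
img, target = preset(img, target)  # img is now a 0-1 Tensor[C, H, W], boxes are updated
```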
......
......@@ -47,8 +47,8 @@ def get_dataset(name, image_set, transform, data_path):
return ds, num_classes
def get_transform(train):
return presets.DetectionPresetTrain() if train else presets.DetectionPresetEval()
def get_transform(train, data_augmentation):
return presets.DetectionPresetTrain(data_augmentation) if train else presets.DetectionPresetEval()
def main(args):
......@@ -60,8 +60,9 @@ def main(args):
# Data loading code
print("Loading data")
dataset, num_classes = get_dataset(args.dataset, "train", get_transform(train=True), args.data_path)
dataset_test, _ = get_dataset(args.dataset, "val", get_transform(train=False), args.data_path)
dataset, num_classes = get_dataset(args.dataset, "train", get_transform(True, args.data_augmentation),
args.data_path)
dataset_test, _ = get_dataset(args.dataset, "val", get_transform(False, args.data_augmentation), args.data_path)
print("Creating data loaders")
if args.distributed:
......@@ -179,6 +180,7 @@ if __name__ == "__main__":
parser.add_argument('--rpn-score-thresh', default=None, type=float, help='rpn score threshold for faster-rcnn')
parser.add_argument('--trainable-backbone-layers', default=None, type=int,
help='number of trainable layers of backbone')
parser.add_argument('--data-augmentation', default="hflip", help='data augmentation policy (default: hflip)')
parser.add_argument(
"--test-only",
dest="test_only",
......
import random
import torch
import torchvision
from torch import nn, Tensor
from torchvision.transforms import functional as F
from torchvision.transforms import transforms as T
from typing import List, Tuple, Dict, Optional
def _flip_coco_person_keypoints(kps, width):
......@@ -23,27 +27,213 @@ class Compose(object):
return image, target
class RandomHorizontalFlip(object):
def __init__(self, prob):
self.prob = prob
def __call__(self, image, target):
if random.random() < self.prob:
height, width = image.shape[-2:]
image = image.flip(-1)
bbox = target["boxes"]
bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
target["boxes"] = bbox
if "masks" in target:
target["masks"] = target["masks"].flip(-1)
if "keypoints" in target:
keypoints = target["keypoints"]
keypoints = _flip_coco_person_keypoints(keypoints, width)
target["keypoints"] = keypoints
class RandomHorizontalFlip(T.RandomHorizontalFlip):
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if torch.rand(1) < self.p:
image = F.hflip(image)
if target is not None:
width, _ = F._get_image_size(image)
target["boxes"][:, [0, 2]] = width - target["boxes"][:, [2, 0]]
if "masks" in target:
target["masks"] = target["masks"].flip(-1)
if "keypoints" in target:
keypoints = target["keypoints"]
keypoints = _flip_coco_person_keypoints(keypoints, width)
target["keypoints"] = keypoints
return image, target
class ToTensor(object):
def __call__(self, image, target):
class ToTensor(nn.Module):
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
image = F.to_tensor(image)
return image, target
class RandomIoUCrop(nn.Module):
def __init__(self, min_scale: float = 0.3, max_scale: float = 1.0, min_aspect_ratio: float = 0.5,
max_aspect_ratio: float = 2.0, sampler_options: Optional[List[float]] = None, trials: int = 40):
super().__init__()
# Configuration similar to https://github.com/weiliu89/caffe/blob/ssd/examples/ssd/ssd_coco.py#L89-L174
self.min_scale = min_scale
self.max_scale = max_scale
self.min_aspect_ratio = min_aspect_ratio
self.max_aspect_ratio = max_aspect_ratio
if sampler_options is None:
sampler_options = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
self.options = sampler_options
self.trials = trials
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if target is None:
raise ValueError("The targets can't be None for this transform.")
if isinstance(image, torch.Tensor):
if image.ndimension() not in {2, 3}:
raise ValueError('image should be 2/3 dimensional. Got {} dimensions.'.format(image.ndimension()))
elif image.ndimension() == 2:
image = image.unsqueeze(0)
orig_w, orig_h = F._get_image_size(image)
while True:
# sample an option
idx = int(torch.randint(low=0, high=len(self.options), size=(1,)))
min_jaccard_overlap = self.options[idx]
if min_jaccard_overlap >= 1.0: # a value larger than 1 encodes the leave as-is option
return image, target
for _ in range(self.trials):
# check the aspect ratio limitations
r = self.min_scale + (self.max_scale - self.min_scale) * torch.rand(2)
new_w = int(orig_w * r[0])
new_h = int(orig_h * r[1])
aspect_ratio = new_w / new_h
if not (self.min_aspect_ratio <= aspect_ratio <= self.max_aspect_ratio):
continue
# check for 0 area crops
r = torch.rand(2)
left = int((orig_w - new_w) * r[0])
top = int((orig_h - new_h) * r[1])
right = left + new_w
bottom = top + new_h
if left == right or top == bottom:
continue
# check for any valid boxes with centers within the crop area
cx = 0.5 * (target["boxes"][:, 0] + target["boxes"][:, 2])
cy = 0.5 * (target["boxes"][:, 1] + target["boxes"][:, 3])
is_within_crop_area = (left < cx) & (cx < right) & (top < cy) & (cy < bottom)
if not is_within_crop_area.any():
continue
# check at least 1 box with jaccard limitations
boxes = target["boxes"][is_within_crop_area]
ious = torchvision.ops.boxes.box_iou(boxes, torch.tensor([[left, top, right, bottom]],
dtype=boxes.dtype, device=boxes.device))
if ious.max() < min_jaccard_overlap:
continue
# keep only valid boxes and perform cropping
target["boxes"] = boxes
target["labels"] = target["labels"][is_within_crop_area]
target["boxes"][:, 0::2] -= left
target["boxes"][:, 1::2] -= top
target["boxes"][:, 0::2].clamp_(min=0, max=new_w)
target["boxes"][:, 1::2].clamp_(min=0, max=new_h)
image = F.crop(image, top, left, new_h, new_w)
return image, target
class RandomZoomOut(nn.Module):
def __init__(self, fill: Optional[List[float]] = None, side_range: Tuple[float, float] = (1., 4.), p: float = 0.5):
super().__init__()
if fill is None:
fill = [0., 0., 0.]
self.fill = fill
self.side_range = side_range
if side_range[0] < 1. or side_range[0] > side_range[1]:
raise ValueError("Invalid canvas side range provided {}.".format(side_range))
self.p = p
@torch.jit.unused
def _get_fill_value(self, is_pil):
# type: (bool) -> int
# We fake the type to make it work on JIT
return tuple(int(x) for x in self.fill) if is_pil else 0
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if isinstance(image, torch.Tensor):
if image.ndimension() not in {2, 3}:
raise ValueError('image should be 2/3 dimensional. Got {} dimensions.'.format(image.ndimension()))
elif image.ndimension() == 2:
image = image.unsqueeze(0)
if torch.rand(1) < self.p:
return image, target
orig_w, orig_h = F._get_image_size(image)
r = self.side_range[0] + torch.rand(1) * (self.side_range[1] - self.side_range[0])
canvas_width = int(orig_w * r)
canvas_height = int(orig_h * r)
r = torch.rand(2)
left = int((canvas_width - orig_w) * r[0])
top = int((canvas_height - orig_h) * r[1])
right = canvas_width - (left + orig_w)
bottom = canvas_height - (top + orig_h)
if torch.jit.is_scripting():
fill = 0
else:
fill = self._get_fill_value(F._is_pil_image(image))
image = F.pad(image, [left, top, right, bottom], fill=fill)
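# For tensor inputs the F.pad call above used a constant 0 fill, so paint the four
# padded bands with the requested fill value instead.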
if isinstance(image, torch.Tensor):
v = torch.tensor(self.fill, device=image.device, dtype=image.dtype).view(-1, 1, 1)
image[..., :top, :] = image[..., :, :left] = image[..., (top + orig_h):, :] = \
image[..., :, (left + orig_w):] = v
if target is not None:
target["boxes"][:, 0::2] += left
target["boxes"][:, 1::2] += top
return image, target
class RandomPhotometricDistort(nn.Module):
def __init__(self, contrast: Tuple[float] = (0.5, 1.5), saturation: Tuple[float] = (0.5, 1.5),
hue: Tuple[float] = (-0.05, 0.05), brightness: Tuple[float] = (0.875, 1.125), p: float = 0.5):
super().__init__()
self._brightness = T.ColorJitter(brightness=brightness)
self._contrast = T.ColorJitter(contrast=contrast)
self._hue = T.ColorJitter(hue=hue)
self._saturation = T.ColorJitter(saturation=saturation)
self.p = p
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if isinstance(image, torch.Tensor):
if image.ndimension() not in {2, 3}:
raise ValueError('image should be 2/3 dimensional. Got {} dimensions.'.format(image.ndimension()))
elif image.ndimension() == 2:
image = image.unsqueeze(0)
r = torch.rand(7)
if r[0] < self.p:
image = self._brightness(image)
contrast_before = r[1] < 0.5
if contrast_before:
if r[2] < self.p:
image = self._contrast(image)
if r[3] < self.p:
image = self._saturation(image)
if r[4] < self.p:
image = self._hue(image)
if not contrast_before:
if r[5] < self.p:
image = self._contrast(image)
if r[6] < self.p:
channels = F._get_image_num_channels(image)
permutation = torch.randperm(channels)
is_pil = F._is_pil_image(image)
if is_pil:
image = F.to_tensor(image)
image = image[..., permutation, :, :]
if is_pil:
image = F.to_pil_image(image)
return image, target
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
......@@ -44,6 +44,7 @@ script_model_unwrapper = {
"maskrcnn_resnet50_fpn": lambda x: x[1],
"keypointrcnn_resnet50_fpn": lambda x: x[1],
"retinanet_resnet50_fpn": lambda x: x[1],
"ssd300_vgg16": lambda x: x[1],
}
......
from collections import OrderedDict
import torch
from common_utils import TestCase
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.anchor_utils import AnchorGenerator, DefaultBoxGenerator
from torchvision.models.detection.image_list import ImageList
......@@ -22,6 +21,12 @@ class Tester(TestCase):
return anchor_generator
def _init_test_defaultbox_generator(self):
aspect_ratios = [[2]]
dbox_generator = DefaultBoxGenerator(aspect_ratios)
return dbox_generator
def get_features(self, images):
s0, s1 = images.shape[-2:]
features = [torch.rand(2, 8, s0 // 5, s1 // 5)]
......@@ -59,3 +64,26 @@ class Tester(TestCase):
self.assertEqual(tuple(anchors[1].shape), (9, 4))
self.assertEqual(anchors[0], anchors_output)
self.assertEqual(anchors[1], anchors_output)
def test_defaultbox_generator(self):
images = torch.zeros(2, 3, 15, 15)
features = [torch.zeros(2, 8, 1, 1)]
image_shapes = [i.shape[-2:] for i in images]
images = ImageList(images, image_shapes)
model = self._init_test_defaultbox_generator()
model.eval()
dboxes = model(images, features)
dboxes_output = torch.tensor([
[6.9750, 6.9750, 8.0250, 8.0250],
[6.7315, 6.7315, 8.2685, 8.2685],
[6.7575, 7.1288, 8.2425, 7.8712],
[7.1288, 6.7575, 7.8712, 8.2425]
])
self.assertEqual(len(dboxes), 2)
self.assertEqual(tuple(dboxes[0].shape), (4, 4))
self.assertEqual(tuple(dboxes[1].shape), (4, 4))
self.assertTrue(dboxes[0].allclose(dboxes_output))
self.assertTrue(dboxes[1].allclose(dboxes_output))
......@@ -139,6 +139,15 @@ class Tester(unittest.TestCase):
self.assertEqual(loss_dict["bbox_regression"], torch.tensor(0.))
def test_forward_negative_sample_ssd(self):
model = torchvision.models.detection.ssd300_vgg16(
num_classes=2, pretrained_backbone=False)
images, targets = self._make_empty_sample()
loss_dict = model(images, targets)
self.assertEqual(loss_dict["bbox_regression"], torch.tensor(0.))
if __name__ == '__main__':
unittest.main()
......@@ -2,3 +2,4 @@ from .faster_rcnn import *
from .mask_rcnn import *
from .keypoint_rcnn import *
from .retinanet import *
from .ssd import *
import math
import torch
from collections import OrderedDict
from torch import Tensor
from typing import List, Tuple
......@@ -344,6 +345,23 @@ class Matcher(object):
matches[pred_inds_to_update] = all_matches[pred_inds_to_update]
class SSDMatcher(Matcher):
def __init__(self, threshold):
super().__init__(threshold, threshold, allow_low_quality_matches=False)
def __call__(self, match_quality_matrix):
matches = super().__call__(match_quality_matrix)
# For each gt, find the prediction with which it has the highest quality
_, highest_quality_pred_foreach_gt = match_quality_matrix.max(dim=1)
matches[highest_quality_pred_foreach_gt] = torch.arange(highest_quality_pred_foreach_gt.size(0),
dtype=torch.int64,
device=highest_quality_pred_foreach_gt.device)
return matches
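A small worked example (hypothetical numbers) of what the forced matching above buys us: every ground-truth box keeps its best anchor even when that anchor's IoU is below the threshold.

```
import torch
from torchvision.models.detection import _utils as det_utils

matcher = det_utils.SSDMatcher(0.5)
# rows = 2 ground-truth boxes, columns = 3 anchors (IoU values)
match_quality_matrix = torch.tensor([[0.30, 0.10, 0.05],
                                     [0.20, 0.60, 0.55]])
print(matcher(match_quality_matrix))  # tensor([0, 1, 1])
# anchor 0 has max IoU 0.30 < 0.5, yet it is assigned to gt 0 because it is gt 0's
# best anchor; anchors 1 and 2 keep their above-threshold matches with gt 1.
```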
def overwrite_eps(model, eps):
"""
This method overwrites the default eps values of all the
......@@ -360,3 +378,33 @@ def overwrite_eps(model, eps):
for module in model.modules():
if isinstance(module, FrozenBatchNorm2d):
module.eps = eps
def retrieve_out_channels(model, size):
"""
This method retrieves the number of output channels of a specific model.
Args:
model (nn.Module): The model for which we estimate the out_channels.
It should return a single Tensor or an OrderedDict[Tensor].
size (Tuple[int, int]): The size (wxh) of the input.
Returns:
out_channels (List[int]): A list of the output channels of the model.
"""
in_training = model.training
model.eval()
with torch.no_grad():
# Use dummy data to retrieve the feature map sizes to avoid hard-coding their values
device = next(model.parameters()).device
tmp_img = torch.zeros((1, 3, size[1], size[0]), device=device)
features = model(tmp_img)
if isinstance(features, torch.Tensor):
features = OrderedDict([('0', features)])
out_channels = [x.size(1) for x in features.values()]
if in_training:
model.train()
return out_channels
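A quick sketch of how this helper behaves (hypothetical toy backbone):

```
import torch
from torch import nn
from torchvision.models.detection import _utils as det_utils

backbone = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU())
# Probes the model with a zero image of the given (w, h) size and reads the channel dims.
print(det_utils.retrieve_out_channels(backbone, (300, 300)))  # [16]
```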
import math
import torch
from torch import nn, Tensor
from typing import List
from typing import List, Optional
from .image_list import ImageList
......@@ -128,3 +129,97 @@ class AnchorGenerator(nn.Module):
anchors.append(anchors_in_image)
anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
return anchors
class DefaultBoxGenerator(nn.Module):
"""
This module generates the default boxes of SSD for a set of feature maps and image sizes.
Args:
aspect_ratios (List[List[int]]): A list with all the aspect ratios used in each feature map.
min_ratio (float): The minimum scale :math:`\text{s}_{\text{min}}` of the default boxes used in the estimation
of the scales of each feature map.
max_ratio (float): The maximum scale :math:`\text{s}_{\text{max}}` of the default boxes used in the estimation
of the scales of each feature map.
steps (List[int], optional): A hyper-parameter that affects the tiling of the default boxes. If not provided,
it will be estimated from the data.
clip (bool): Whether the standardized values of default boxes should be clipped between 0 and 1. The clipping
is applied while the boxes are encoded in format ``(cx, cy, w, h)``.
"""
def __init__(self, aspect_ratios: List[List[int]], min_ratio: float = 0.15, max_ratio: float = 0.9,
steps: Optional[List[int]] = None, clip: bool = True):
super().__init__()
if steps is not None:
assert len(aspect_ratios) == len(steps)
self.aspect_ratios = aspect_ratios
self.steps = steps
self.clip = clip
num_outputs = len(aspect_ratios)
# Estimation of default boxes scales
# Inspired from https://github.com/weiliu89/caffe/blob/ssd/examples/ssd/ssd_pascal.py#L311-L317
min_centile = int(100 * min_ratio)
max_centile = int(100 * max_ratio)
conv4_centile = min_centile // 2 # assume half of min_ratio as in paper
step = (max_centile - min_centile) // (num_outputs - 2)
centiles = [conv4_centile, min_centile]
for c in range(min_centile, max_centile + 1, step):
centiles.append(c + step)
self.scales = [c / 100 for c in centiles]
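# e.g. with the defaults (min_ratio=0.15, max_ratio=0.9) and 6 feature maps this gives
# scales of [0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05]; the extra last value is only
# used to compute s'_k of the final feature map below.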
self._wh_pairs = []
for k in range(num_outputs):
# Adding the 2 default width-height pairs for aspect ratio 1 and scale s'k
s_k = self.scales[k]
s_prime_k = math.sqrt(self.scales[k] * self.scales[k + 1])
wh_pairs = [(s_k, s_k), (s_prime_k, s_prime_k)]
# Adding 2 pairs for each aspect ratio of the feature map k
for ar in self.aspect_ratios[k]:
sq_ar = math.sqrt(ar)
w = self.scales[k] * sq_ar
h = self.scales[k] / sq_ar
wh_pairs.extend([(w, h), (h, w)])
self._wh_pairs.append(wh_pairs)
def num_anchors_per_location(self):
# Estimate num of anchors based on aspect ratios: 2 default boxes + 2 * ratios of feature map.
return [2 + 2 * len(r) for r in self.aspect_ratios]
def __repr__(self) -> str:
s = self.__class__.__name__ + '('
s += 'aspect_ratios={aspect_ratios}'
s += ', clip={clip}'
s += ', scales={scales}'
s += ', steps={steps}'
s += ')'
return s.format(**self.__dict__)
def forward(self, image_list: ImageList, feature_maps: List[Tensor]) -> List[Tensor]:
grid_sizes = [feature_map.shape[-2:] for feature_map in feature_maps]
image_size = image_list.tensors.shape[-2:]
dtype, device = feature_maps[0].dtype, feature_maps[0].device
# Default Boxes calculation based on page 6 of SSD paper
default_boxes: List[List[float]] = []
for k, f_k in enumerate(grid_sizes):
# Now add the default boxes for each width-height pair
for j in range(f_k[0]):
cy = (j + 0.5) / (float(f_k[0]) if self.steps is None else image_size[1] / self.steps[k])
for i in range(f_k[1]):
cx = (i + 0.5) / (float(f_k[1]) if self.steps is None else image_size[0] / self.steps[k])
default_boxes.extend([[cx, cy, w, h] for w, h in self._wh_pairs[k]])
dboxes = []
for _ in image_list.image_sizes:
dboxes_in_image = torch.tensor(default_boxes, dtype=dtype, device=device)
if self.clip:
dboxes_in_image.clamp_(min=0, max=1)
dboxes_in_image = torch.cat([dboxes_in_image[:, :2] - 0.5 * dboxes_in_image[:, 2:],
dboxes_in_image[:, :2] + 0.5 * dboxes_in_image[:, 2:]], -1)
dboxes_in_image[:, 0::2] *= image_size[1]
dboxes_in_image[:, 1::2] *= image_size[0]
dboxes.append(dboxes_in_image)
return dboxes
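A minimal sketch of the generator in isolation (dummy inputs, mirroring the new unit test):

```
import torch
from torchvision.models.detection.anchor_utils import DefaultBoxGenerator
from torchvision.models.detection.image_list import ImageList

images = torch.zeros(2, 3, 300, 300)
image_list = ImageList(images, [i.shape[-2:] for i in images])
features = [torch.zeros(2, 8, 1, 1)]  # a single 1x1 feature map

generator = DefaultBoxGenerator([[2]])
dboxes = generator(image_list, features)
print(len(dboxes), dboxes[0].shape)  # 2 torch.Size([4, 4]) -- xyxy, in pixels
```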
import torch
import torch.nn.functional as F
import warnings
from collections import OrderedDict
from torch import nn, Tensor
from typing import Any, Dict, List, Optional, Tuple
from . import _utils as det_utils
from .anchor_utils import DefaultBoxGenerator
from .backbone_utils import _validate_trainable_layers
from .transform import GeneralizedRCNNTransform
from .. import vgg
from ..utils import load_state_dict_from_url
from ...ops import boxes as box_ops
__all__ = ['SSD', 'ssd300_vgg16']
model_urls = {
'ssd300_vgg16_coco': 'https://download.pytorch.org/models/ssd300_vgg16_coco-b556d3b4.pth',
}
backbone_urls = {
# We port the features of a VGG16 backbone trained by amdegroot because unlike the one on TorchVision, it uses the
# same input standardization method as the paper. Ref: https://s3.amazonaws.com/amdegroot-models/vgg16_reducedfc.pth
'vgg16_features': 'https://download.pytorch.org/models/vgg16_features-amdegroot.pth'
}
def _xavier_init(conv: nn.Module):
for layer in conv.modules():
if isinstance(layer, nn.Conv2d):
torch.nn.init.xavier_uniform_(layer.weight)
if layer.bias is not None:
torch.nn.init.constant_(layer.bias, 0.0)
class SSDHead(nn.Module):
def __init__(self, in_channels: List[int], num_anchors: List[int], num_classes: int):
super().__init__()
self.classification_head = SSDClassificationHead(in_channels, num_anchors, num_classes)
self.regression_head = SSDRegressionHead(in_channels, num_anchors)
def forward(self, x: List[Tensor]) -> Dict[str, Tensor]:
return {
'bbox_regression': self.regression_head(x),
'cls_logits': self.classification_head(x),
}
class SSDScoringHead(nn.Module):
def __init__(self, module_list: nn.ModuleList, num_columns: int):
super().__init__()
self.module_list = module_list
self.num_columns = num_columns
def _get_result_from_module_list(self, x: Tensor, idx: int) -> Tensor:
"""
This is equivalent to self.module_list[idx](x),
but torchscript doesn't support this yet
"""
num_blocks = len(self.module_list)
if idx < 0:
idx += num_blocks
i = 0
out = x
for module in self.module_list:
if i == idx:
out = module(x)
i += 1
return out
def forward(self, x: List[Tensor]) -> Tensor:
all_results = []
for i, features in enumerate(x):
results = self._get_result_from_module_list(features, i)
# Permute output from (N, A * K, H, W) to (N, HWA, K).
N, _, H, W = results.shape
results = results.view(N, -1, self.num_columns, H, W)
results = results.permute(0, 3, 4, 1, 2)
results = results.reshape(N, -1, self.num_columns) # Size=(N, HWA, K)
all_results.append(results)
return torch.cat(all_results, dim=1)
class SSDClassificationHead(SSDScoringHead):
def __init__(self, in_channels: List[int], num_anchors: List[int], num_classes: int):
cls_logits = nn.ModuleList()
for channels, anchors in zip(in_channels, num_anchors):
cls_logits.append(nn.Conv2d(channels, num_classes * anchors, kernel_size=3, padding=1))
_xavier_init(cls_logits)
super().__init__(cls_logits, num_classes)
class SSDRegressionHead(SSDScoringHead):
def __init__(self, in_channels: List[int], num_anchors: List[int]):
bbox_reg = nn.ModuleList()
for channels, anchors in zip(in_channels, num_anchors):
bbox_reg.append(nn.Conv2d(channels, 4 * anchors, kernel_size=3, padding=1))
_xavier_init(bbox_reg)
super().__init__(bbox_reg, 4)
class SSD(nn.Module):
"""
Implements SSD architecture from `"SSD: Single Shot MultiBox Detector" <https://arxiv.org/abs/1512.02325>`_.
The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each
image, and should be in 0-1 range. Different images can have different sizes but they will be resized
to a fixed size before being passed to the backbone.
The behavior of the model changes depending if it is in training or evaluation mode.
During training, the model expects both the input tensors and the targets (a list of dictionaries),
containing:
- boxes (``FloatTensor[N, 4]``): the ground-truth boxes in ``[x1, y1, x2, y2]`` format, with
``0 <= x1 < x2 <= W`` and ``0 <= y1 < y2 <= H``.
- labels (Int64Tensor[N]): the class label for each ground-truth box
The model returns a Dict[Tensor] during training, containing the classification and regression
losses.
During inference, the model requires only the input tensors, and returns the post-processed
predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
follows:
- boxes (``FloatTensor[N, 4]``): the predicted boxes in ``[x1, y1, x2, y2]`` format, with
``0 <= x1 < x2 <= W`` and ``0 <= y1 < y2 <= H``.
- labels (Int64Tensor[N]): the predicted labels for each image
- scores (Tensor[N]): the scores for each prediction
Args:
backbone (nn.Module): the network used to compute the features for the model.
It should contain an out_channels attribute with the list of the output channels of
each feature map. The backbone should return a single Tensor or an OrderedDict[Tensor].
anchor_generator (DefaultBoxGenerator): module that generates the default boxes for a
set of feature maps.
size (Tuple[int, int]): the width and height to which images will be rescaled before feeding them
to the backbone.
num_classes (int): number of output classes of the model (excluding the background).
image_mean (Tuple[float, float, float]): mean values used for input normalization.
They are generally the mean values of the dataset on which the backbone has been trained.
image_std (Tuple[float, float, float]): std values used for input normalization.
They are generally the std values of the dataset on which the backbone has been trained.
head (nn.Module, optional): Module run on top of the backbone features. Defaults to a module containing
a classification and regression module.
score_thresh (float): Score threshold used for postprocessing the detections.
nms_thresh (float): NMS threshold used for postprocessing the detections.
detections_per_img (int): Number of best detections to keep after NMS.
iou_thresh (float): minimum IoU between the anchor and the GT box so that they can be
considered as positive during training.
topk_candidates (int): Number of best detections to keep before NMS.
positive_fraction (float): a number between 0 and 1 which indicates the proportion of positive
proposals used during the training of the classification head. It is used to estimate the negative to
positive ratio.
"""
__annotations__ = {
'box_coder': det_utils.BoxCoder,
'proposal_matcher': det_utils.Matcher,
}
def __init__(self, backbone: nn.Module, anchor_generator: DefaultBoxGenerator,
size: Tuple[int, int], num_classes: int,
image_mean: Optional[List[float]] = None, image_std: Optional[List[float]] = None,
head: Optional[nn.Module] = None,
score_thresh: float = 0.01,
nms_thresh: float = 0.45,
detections_per_img: int = 200,
iou_thresh: float = 0.5,
topk_candidates: int = 400,
positive_fraction: float = 0.25):
super().__init__()
self.backbone = backbone
self.anchor_generator = anchor_generator
self.box_coder = det_utils.BoxCoder(weights=(10., 10., 5., 5.))
if head is None:
if hasattr(backbone, 'out_channels'):
out_channels = backbone.out_channels
else:
out_channels = det_utils.retrieve_out_channels(backbone, size)
assert len(out_channels) == len(anchor_generator.aspect_ratios)
num_anchors = self.anchor_generator.num_anchors_per_location()
head = SSDHead(out_channels, num_anchors, num_classes)
self.head = head
self.proposal_matcher = det_utils.SSDMatcher(iou_thresh)
if image_mean is None:
image_mean = [0.485, 0.456, 0.406]
if image_std is None:
image_std = [0.229, 0.224, 0.225]
self.transform = GeneralizedRCNNTransform(min(size), max(size), image_mean, image_std,
size_divisible=1, fixed_size=size)
self.score_thresh = score_thresh
self.nms_thresh = nms_thresh
self.detections_per_img = detections_per_img
self.topk_candidates = topk_candidates
self.neg_to_pos_ratio = (1.0 - positive_fraction) / positive_fraction
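# with the default positive_fraction=0.25 this yields the 3:1 negative-to-positive ratio used in the SSD paper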
# used only on torchscript mode
self._has_warned = False
@torch.jit.unused
def eager_outputs(self, losses: Dict[str, Tensor],
detections: List[Dict[str, Tensor]]) -> Tuple[Dict[str, Tensor], List[Dict[str, Tensor]]]:
if self.training:
return losses
return detections
def compute_loss(self, targets: List[Dict[str, Tensor]], head_outputs: Dict[str, Tensor], anchors: List[Tensor],
matched_idxs: List[Tensor]) -> Dict[str, Tensor]:
bbox_regression = head_outputs['bbox_regression']
cls_logits = head_outputs['cls_logits']
# Match original targets with default boxes
num_foreground = 0
bbox_loss = []
cls_targets = []
for (targets_per_image, bbox_regression_per_image, cls_logits_per_image, anchors_per_image,
matched_idxs_per_image) in zip(targets, bbox_regression, cls_logits, anchors, matched_idxs):
# produce the matching between boxes and targets
foreground_idxs_per_image = torch.where(matched_idxs_per_image >= 0)[0]
foreground_matched_idxs_per_image = matched_idxs_per_image[foreground_idxs_per_image]
num_foreground += foreground_matched_idxs_per_image.numel()
# Calculate regression loss
matched_gt_boxes_per_image = targets_per_image['boxes'][foreground_matched_idxs_per_image]
bbox_regression_per_image = bbox_regression_per_image[foreground_idxs_per_image, :]
anchors_per_image = anchors_per_image[foreground_idxs_per_image, :]
target_regression = self.box_coder.encode_single(matched_gt_boxes_per_image, anchors_per_image)
bbox_loss.append(torch.nn.functional.smooth_l1_loss(
bbox_regression_per_image,
target_regression,
reduction='sum'
))
# Estimate ground truth for class targets
gt_classes_target = torch.zeros((cls_logits_per_image.size(0), ), dtype=targets_per_image['labels'].dtype,
device=targets_per_image['labels'].device)
gt_classes_target[foreground_idxs_per_image] = \
targets_per_image['labels'][foreground_matched_idxs_per_image]
cls_targets.append(gt_classes_target)
bbox_loss = torch.stack(bbox_loss)
cls_targets = torch.stack(cls_targets)
# Calculate classification loss
num_classes = cls_logits.size(-1)
cls_loss = F.cross_entropy(
cls_logits.view(-1, num_classes),
cls_targets.view(-1),
reduction='none'
).view(cls_targets.size())
# Hard Negative Sampling
foreground_idxs = cls_targets > 0
num_negative = self.neg_to_pos_ratio * foreground_idxs.sum(1, keepdim=True)
# num_negative[num_negative < self.neg_to_pos_ratio] = self.neg_to_pos_ratio
negative_loss = cls_loss.clone()
negative_loss[foreground_idxs] = -float('inf') # use -inf to detect positive values that crept into the sample
values, idx = negative_loss.sort(1, descending=True)
# background_idxs = torch.logical_and(idx.sort(1)[1] < num_negative, torch.isfinite(values))
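# idx.sort(1)[1] recovers each anchor's rank in the descending-loss order, so ranks
# below num_negative select the hardest negatives of every image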
background_idxs = idx.sort(1)[1] < num_negative
N = max(1, num_foreground)
return {
'bbox_regression': bbox_loss.sum() / N,
'classification': (cls_loss[foreground_idxs].sum() + cls_loss[background_idxs].sum()) / N,
}
def forward(self, images: List[Tensor],
targets: Optional[List[Dict[str, Tensor]]] = None) -> Tuple[Dict[str, Tensor], List[Dict[str, Tensor]]]:
if self.training and targets is None:
raise ValueError("In training mode, targets should be passed")
if self.training:
assert targets is not None
for target in targets:
boxes = target["boxes"]
if isinstance(boxes, torch.Tensor):
if len(boxes.shape) != 2 or boxes.shape[-1] != 4:
raise ValueError("Expected target boxes to be a tensor"
"of shape [N, 4], got {:}.".format(
boxes.shape))
else:
raise ValueError("Expected target boxes to be of type "
"Tensor, got {:}.".format(type(boxes)))
# get the original image sizes
original_image_sizes: List[Tuple[int, int]] = []
for img in images:
val = img.shape[-2:]
assert len(val) == 2
original_image_sizes.append((val[0], val[1]))
# transform the input
images, targets = self.transform(images, targets)
# Check for degenerate boxes
if targets is not None:
for target_idx, target in enumerate(targets):
boxes = target["boxes"]
degenerate_boxes = boxes[:, 2:] <= boxes[:, :2]
if degenerate_boxes.any():
bb_idx = torch.where(degenerate_boxes.any(dim=1))[0][0]
degen_bb: List[float] = boxes[bb_idx].tolist()
raise ValueError("All bounding boxes should have positive height and width."
" Found invalid box {} for target at index {}."
.format(degen_bb, target_idx))
# get the features from the backbone
features = self.backbone(images.tensors)
if isinstance(features, torch.Tensor):
features = OrderedDict([('0', features)])
features = list(features.values())
# compute the ssd heads outputs using the features
head_outputs = self.head(features)
# create the set of anchors
anchors = self.anchor_generator(images, features)
losses = {}
detections: List[Dict[str, Tensor]] = []
if self.training:
assert targets is not None
matched_idxs = []
for anchors_per_image, targets_per_image in zip(anchors, targets):
if targets_per_image['boxes'].numel() == 0:
matched_idxs.append(torch.full((anchors_per_image.size(0),), -1, dtype=torch.int64,
device=anchors_per_image.device))
continue
match_quality_matrix = box_ops.box_iou(targets_per_image['boxes'], anchors_per_image)
matched_idxs.append(self.proposal_matcher(match_quality_matrix))
losses = self.compute_loss(targets, head_outputs, anchors, matched_idxs)
else:
detections = self.postprocess_detections(head_outputs, anchors, images.image_sizes)
detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
if torch.jit.is_scripting():
if not self._has_warned:
warnings.warn("SSD always returns a (Losses, Detections) tuple in scripting")
self._has_warned = True
return losses, detections
return self.eager_outputs(losses, detections)
def postprocess_detections(self, head_outputs: Dict[str, Tensor], image_anchors: List[Tensor],
image_shapes: List[Tuple[int, int]]) -> List[Dict[str, Tensor]]:
bbox_regression = head_outputs['bbox_regression']
pred_scores = F.softmax(head_outputs['cls_logits'], dim=-1)
num_classes = pred_scores.size(-1)
device = pred_scores.device
detections: List[Dict[str, Tensor]] = []
for boxes, scores, anchors, image_shape in zip(bbox_regression, pred_scores, image_anchors, image_shapes):
boxes = self.box_coder.decode_single(boxes, anchors)
boxes = box_ops.clip_boxes_to_image(boxes, image_shape)
image_boxes = []
image_scores = []
image_labels = []
for label in range(1, num_classes):
score = scores[:, label]
keep_idxs = score > self.score_thresh
score = score[keep_idxs]
box = boxes[keep_idxs]
# keep only topk scoring predictions
num_topk = min(self.topk_candidates, score.size(0))
score, idxs = score.topk(num_topk)
box = box[idxs]
image_boxes.append(box)
image_scores.append(score)
image_labels.append(torch.full_like(score, fill_value=label, dtype=torch.int64, device=device))
image_boxes = torch.cat(image_boxes, dim=0)
image_scores = torch.cat(image_scores, dim=0)
image_labels = torch.cat(image_labels, dim=0)
# non-maximum suppression
keep = box_ops.batched_nms(image_boxes, image_scores, image_labels, self.nms_thresh)
keep = keep[:self.detections_per_img]
detections.append({
'boxes': image_boxes[keep],
'scores': image_scores[keep],
'labels': image_labels[keep],
})
return detections
class SSDFeatureExtractorVGG(nn.Module):
def __init__(self, backbone: nn.Module, highres: bool, rescaling: bool):
super().__init__()
_, _, maxpool3_pos, maxpool4_pos, _ = (i for i, layer in enumerate(backbone) if isinstance(layer, nn.MaxPool2d))
# Patch ceil_mode for maxpool3 to get the same WxH output sizes as the paper
backbone[maxpool3_pos].ceil_mode = True
# parameters used for L2 normalization + rescaling
self.scale_weight = nn.Parameter(torch.ones(512) * 20)
# Multiple Feature maps - page 4, Fig 2 of SSD paper
self.features = nn.Sequential(
*backbone[:maxpool4_pos] # until conv4_3
)
# SSD300 case - page 4, Fig 2 of SSD paper
extra = nn.ModuleList([
nn.Sequential(
nn.Conv2d(1024, 256, kernel_size=1),
nn.ReLU(inplace=True),
nn.Conv2d(256, 512, kernel_size=3, padding=1, stride=2), # conv8_2
nn.ReLU(inplace=True),
),
nn.Sequential(
nn.Conv2d(512, 128, kernel_size=1),
nn.ReLU(inplace=True),
nn.Conv2d(128, 256, kernel_size=3, padding=1, stride=2), # conv9_2
nn.ReLU(inplace=True),
),
nn.Sequential(
nn.Conv2d(256, 128, kernel_size=1),
nn.ReLU(inplace=True),
nn.Conv2d(128, 256, kernel_size=3), # conv10_2
nn.ReLU(inplace=True),
),
nn.Sequential(
nn.Conv2d(256, 128, kernel_size=1),
nn.ReLU(inplace=True),
nn.Conv2d(128, 256, kernel_size=3), # conv11_2
nn.ReLU(inplace=True),
)
])
if highres:
# Additional layers for the SSD512 case. See page 11, footnote 5.
extra.append(nn.Sequential(
nn.Conv2d(256, 128, kernel_size=1),
nn.ReLU(inplace=True),
nn.Conv2d(128, 256, kernel_size=4), # conv12_2
nn.ReLU(inplace=True),
))
_xavier_init(extra)
fc = nn.Sequential(
nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=False), # add modified maxpool5
nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, padding=6, dilation=6), # FC6 with atrous
nn.ReLU(inplace=True),
nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=1), # FC7
nn.ReLU(inplace=True)
)
_xavier_init(fc)
extra.insert(0, nn.Sequential(
*backbone[maxpool4_pos:-1], # until conv5_3, skip maxpool5
fc,
))
self.extra = extra
self.rescaling = rescaling
def forward(self, x: Tensor) -> Dict[str, Tensor]:
# Undo the 0-1 scaling of ToTensor. Necessary for some backbones.
if self.rescaling:
x *= 255
# L2 normalization + rescaling of the 1st block's feature map
x = self.features(x)
rescaled = self.scale_weight.view(1, -1, 1, 1) * F.normalize(x)
output = [rescaled]
# Calculate the feature maps for the remaining blocks
for block in self.extra:
x = block(x)
output.append(x)
return OrderedDict([(str(i), v) for i, v in enumerate(output)])
def _vgg_extractor(backbone_name: str, highres: bool, progress: bool, pretrained: bool, trainable_layers: int,
rescaling: bool):
if backbone_name in backbone_urls:
# Use custom backbones more appropriate for SSD
arch = backbone_name.split('_')[0]
backbone = vgg.__dict__[arch](pretrained=False, progress=progress).features
if pretrained:
state_dict = load_state_dict_from_url(backbone_urls[backbone_name], progress=progress)
backbone.load_state_dict(state_dict)
else:
# Use standard backbones from TorchVision
backbone = vgg.__dict__[backbone_name](pretrained=pretrained, progress=progress).features
# Gather the indices of maxpools. These are the locations of output blocks.
stage_indices = [i for i, b in enumerate(backbone) if isinstance(b, nn.MaxPool2d)]
num_stages = len(stage_indices)
# find the index of the layer from which we won't freeze
assert 0 <= trainable_layers <= num_stages
freeze_before = num_stages if trainable_layers == 0 else stage_indices[num_stages - trainable_layers]
for b in backbone[:freeze_before]:
for parameter in b.parameters():
parameter.requires_grad_(False)
return SSDFeatureExtractorVGG(backbone, highres, rescaling)
def ssd300_vgg16(pretrained: bool = False, progress: bool = True, num_classes: int = 91,
pretrained_backbone: bool = True, trainable_backbone_layers: Optional[int] = None, **kwargs: Any):
"""
Constructs an SSD model with a VGG16 backbone. See `SSD` for more details.
Example:
>>> model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
Args:
pretrained (bool): If True, returns a model pre-trained on COCO train2017
progress (bool): If True, displays a progress bar of the download to stderr
num_classes (int): number of output classes of the model (including the background)
pretrained_backbone (bool): If True, returns a model with backbone pre-trained on Imagenet
trainable_backbone_layers (int): number of trainable (not frozen) backbone layers starting from the final block.
Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable.
"""
trainable_backbone_layers = _validate_trainable_layers(
pretrained or pretrained_backbone, trainable_backbone_layers, 5, 5)
if pretrained:
# no need to download the backbone if pretrained is set
pretrained_backbone = False
backbone = _vgg_extractor("vgg16_features", False, progress, pretrained_backbone, trainable_backbone_layers, True)
anchor_generator = DefaultBoxGenerator([[2], [2, 3], [2, 3], [2, 3], [2], [2]], steps=[8, 16, 32, 64, 100, 300])
model = SSD(backbone, anchor_generator, (300, 300), num_classes,
image_mean=[0.48235, 0.45882, 0.40784], image_std=[1., 1., 1.], **kwargs)
if pretrained:
weights_name = 'ssd300_vgg16_coco'
if model_urls.get(weights_name, None) is None:
raise ValueError("No checkpoint is available for model {}".format(weights_name))
state_dict = load_state_dict_from_url(model_urls[weights_name], progress=progress)
model.load_state_dict(state_dict)
return model
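For completeness, a minimal training-mode sketch of the contract described in the SSD docstring above (random stand-in data, no pre-trained weights downloaded):

```
import torch
from torchvision.models.detection import ssd300_vgg16

model = ssd300_vgg16(num_classes=2, pretrained_backbone=False)
model.train()

images = [torch.rand(3, 300, 300)]
targets = [{
    'boxes': torch.tensor([[30., 40., 120., 200.]]),  # xyxy, 0 <= x1 < x2 <= W
    'labels': torch.tensor([1]),
}]
loss_dict = model(images, targets)  # {'bbox_regression': ..., 'classification': ...}
```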
import math
import torch
from torch import nn, Tensor
from torch.nn import functional as F
import torchvision
from torch import nn, Tensor
from typing import List, Tuple, Dict, Optional
from .image_list import ImageList
......@@ -23,32 +23,40 @@ def _fake_cast_onnx(v):
return v
def _resize_image_and_masks(image, self_min_size, self_max_size, target):
# type: (Tensor, float, float, Optional[Dict[str, Tensor]]) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]
def _resize_image_and_masks(image: Tensor, self_min_size: float, self_max_size: float,
target: Optional[Dict[str, Tensor]],
fixed_size: Optional[Tuple[int, int]]) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if torchvision._is_tracing():
im_shape = _get_shape_onnx(image)
else:
im_shape = torch.tensor(image.shape[-2:])
min_size = torch.min(im_shape).to(dtype=torch.float32)
max_size = torch.max(im_shape).to(dtype=torch.float32)
scale = torch.min(self_min_size / min_size, self_max_size / max_size)
if torchvision._is_tracing():
scale_factor = _fake_cast_onnx(scale)
size: Optional[List[int]] = None
scale_factor: Optional[float] = None
recompute_scale_factor: Optional[bool] = None
if fixed_size is not None:
size = [fixed_size[1], fixed_size[0]]
else:
scale_factor = scale.item()
min_size = torch.min(im_shape).to(dtype=torch.float32)
max_size = torch.max(im_shape).to(dtype=torch.float32)
scale = torch.min(self_min_size / min_size, self_max_size / max_size)
if torchvision._is_tracing():
scale_factor = _fake_cast_onnx(scale)
else:
scale_factor = scale.item()
recompute_scale_factor = True
image = torch.nn.functional.interpolate(
image[None], scale_factor=scale_factor, mode='bilinear', recompute_scale_factor=True,
align_corners=False)[0]
image = torch.nn.functional.interpolate(image[None], size=size, scale_factor=scale_factor, mode='bilinear',
recompute_scale_factor=recompute_scale_factor, align_corners=False)[0]
if target is None:
return image, target
if "masks" in target:
mask = target["masks"]
mask = F.interpolate(mask[:, None].float(), scale_factor=scale_factor, recompute_scale_factor=True)[:, 0].byte()
mask = torch.nn.functional.interpolate(mask[:, None].float(), size=size, scale_factor=scale_factor,
recompute_scale_factor=recompute_scale_factor)[:, 0].byte()
target["masks"] = mask
return image, target
......@@ -65,7 +73,7 @@ class GeneralizedRCNNTransform(nn.Module):
It returns a ImageList for the inputs, and a List[Dict[Tensor]] for the targets
"""
def __init__(self, min_size, max_size, image_mean, image_std):
def __init__(self, min_size, max_size, image_mean, image_std, size_divisible=32, fixed_size=None):
super(GeneralizedRCNNTransform, self).__init__()
if not isinstance(min_size, (list, tuple)):
min_size = (min_size,)
......@@ -73,6 +81,8 @@ class GeneralizedRCNNTransform(nn.Module):
self.max_size = max_size
self.image_mean = image_mean
self.image_std = image_std
self.size_divisible = size_divisible
self.fixed_size = fixed_size
def forward(self,
images, # type: List[Tensor]
......@@ -106,7 +116,7 @@ class GeneralizedRCNNTransform(nn.Module):
targets[i] = target_index
image_sizes = [img.shape[-2:] for img in images]
images = self.batch_images(images)
images = self.batch_images(images, size_divisible=self.size_divisible)
image_sizes_list: List[Tuple[int, int]] = []
for image_size in image_sizes:
assert len(image_size) == 2
......@@ -144,7 +154,7 @@ class GeneralizedRCNNTransform(nn.Module):
else:
# FIXME assume for now that testing uses the largest scale
size = float(self.min_size[-1])
image, target = _resize_image_and_masks(image, size, float(self.max_size), target)
image, target = _resize_image_and_masks(image, size, float(self.max_size), target, self.fixed_size)
if target is None:
return image, target
......