Unverified Commit 02a9969e authored by pkulzc, committed by GitHub

Refactor object detection box predictors and fix some issues with model_main. (#4965)

* Merged commit includes the following changes:
206852642  by Zhichao Lu:

    Build the balanced_positive_negative_sampler in the model builder for FasterRCNN. Also adds an option to use the static implementation of the sampler.

--
206803260  by Zhichao Lu:

    Fixes a misplaced argument in resnet fpn feature extractor.

--
206682736  by Zhichao Lu:

    This CL modifies the SSD meta architecture to support both Slim-based and Keras-based box predictors, and begins preparation for Keras box predictor support in the other meta architectures.

    Concretely, this CL adds a new `KerasBoxPredictor` base class and makes the meta architectures appropriately call whichever box predictors they are using.

    We can switch the non-ssd meta architectures to fully support Keras box predictors once the Keras Convolutional Box Predictor CL is submitted.
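
    A minimal sketch of the resulting dispatch, assuming the new base class is
    exposed as box_predictor.KerasBoxPredictor and using a hypothetical helper
    name: Keras predictors are called like layers, while Slim predictors keep
    their predict() interface.

      from object_detection.core import box_predictor

      def run_box_predictor(predictor, feature_maps,
                            num_predictions_per_location_list):
        """Hypothetical helper: dispatch to whichever predictor API is in use."""
        if isinstance(predictor, box_predictor.KerasBoxPredictor):
          # Keras predictors are layers; calling them runs the prediction.
          return predictor(feature_maps)
        # Slim-based predictors keep the original predict() entry point.
        return predictor.predict(feature_maps,
                                 num_predictions_per_location_list,
                                 scope='BoxPredictor')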

--
206669634  by Zhichao Lu:

    Adds an alternate method for the balanced positive/negative sampler that uses static shapes.
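
    A rough sketch of the static-shape idea (illustrative, not the library's
    implementation): select a fixed number of positives and negatives with
    tf.nn.top_k over randomized scores, so every output shape is known at
    graph-construction time.

      import tensorflow as tf

      def static_balanced_sample(labels, indicators, sample_size,
                                 positive_fraction=0.5):
        """Sketch: fixed-size balanced sampling with static shapes.

        labels: bool [N] tensor, True for positive anchors.
        indicators: bool [N] tensor, True for anchors eligible for sampling.
        Assumes enough eligible positives/negatives exist; a real sampler
        must guard against the under-populated case.
        """
        num_pos = int(positive_fraction * sample_size)
        num_neg = sample_size - num_pos
        rand = tf.random_uniform(tf.shape(labels))
        # Ineligible entries get score -1 so top_k never prefers them.
        pos_scores = tf.where(labels & indicators, rand, -tf.ones_like(rand))
        neg_scores = tf.where(~labels & indicators, rand, -tf.ones_like(rand))
        _, pos_idx = tf.nn.top_k(pos_scores, k=num_pos)  # static output size
        _, neg_idx = tf.nn.top_k(neg_scores, k=num_neg)  # static output size
        idx = tf.concat([pos_idx, neg_idx], axis=0)
        mask = tf.scatter_nd(tf.expand_dims(idx, 1), tf.ones_like(idx),
                             shape=tf.shape(labels))
        return tf.cast(mask, tf.bool)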

--
206643278  by Zhichao Lu:

    This CL adds a Keras layer hyperparameter configuration object to the hyperparams_builder.

    It automatically converts from Slim layer hyperparameter configs to Keras layer hyperparameters. Namely, it:
    - Builds Keras initializers/regularizers instead of Slim ones
    - maps weights_regularizer/weights_initializer to kernel_regularizer/kernel_initializer
    - converts batchnorm decay to momentum
    - converts Slim l2 regularizer weights to the equivalent Keras l2 weights

    This will be used in the conversion of object detection feature extractors & box predictors to newer Tensorflow APIs.
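
    A sketch of the conversion, assuming the standard definitions of the two
    regularizers (Slim's l2_regularizer(scale) computes scale * sum(w**2) / 2,
    while Keras's l2(l) computes l * sum(w**2), hence the 0.5 factor); the
    function name and return format below are illustrative, not the
    hyperparams_builder API.

      import tensorflow as tf

      def slim_to_keras_hyperparams(slim_l2_weight, slim_batch_norm_decay,
                                    initializer_stddev=0.03):
        """Illustrative Slim -> Keras hyperparameter conversion."""
        return {
            # weights_regularizer -> kernel_regularizer; the Keras l2 weight
            # is half the Slim weight because Slim includes a 1/2 factor.
            'kernel_regularizer': tf.keras.regularizers.l2(
                0.5 * slim_l2_weight),
            # weights_initializer -> kernel_initializer.
            'kernel_initializer': tf.keras.initializers.TruncatedNormal(
                stddev=initializer_stddev),
            # Slim batch_norm `decay` becomes Keras BatchNormalization
            # `momentum`.
            'batch_norm_momentum': slim_batch_norm_decay,
        }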

--
206611681  by Zhichao Lu:

    Internal changes.

--
206591619  by Zhichao Lu:

    Clip tensors to the expected padded static shape when the inputs are larger than that shape.
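
    A minimal sketch of the behavior (helper name hypothetical): slice each
    oversized input down to the target static shape before the usual padding
    step runs.

      import tensorflow as tf

      def clip_to_static_shape(tensor, padded_shape):
        """Sketch: slice `tensor` so no dimension exceeds `padded_shape`."""
        clip_size = tf.minimum(tf.shape(tensor), padded_shape)
        return tf.slice(tensor, tf.zeros_like(clip_size), clip_size)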

--
206517644  by Zhichao Lu:

    Make MultiscaleGridAnchorGenerator more consistent with MultipleGridAnchorGenerator.

--
206415624  by Zhichao Lu:

    Make the hardcoded feature pyramid network (FPN) levels configurable for both SSD
    Resnet and SSD Mobilenet.

--
206398204  by Zhichao Lu:

    This CL modifies the SSD meta architecture to support both Slim-based and Keras-based feature extractors.

    This allows us to begin the conversion of object detection to newer Tensorflow APIs.

--
206213448  by Zhichao Lu:

    Adding a method to compute the expected classification loss by background/foreground weighting.
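
    A hedged sketch of the idea (the real method and its signature live in the
    losses code): take the expectation of the per-anchor classification loss
    over the foreground/background assignment.

      def expected_classification_loss(loss_as_foreground, loss_as_background,
                                       foreground_prob):
        """Sketch: all arguments are float tensors of shape
        [batch_size, num_anchors]; foreground_prob is the per-anchor
        probability of being foreground."""
        return (foreground_prob * loss_as_foreground +
                (1.0 - foreground_prob) * loss_as_background)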

--
206204232  by Zhichao Lu:

    Adding the keypoint head to the Mask RCNN pipeline.

--
206200352  by Zhichao Lu:

    - Create the Faster R-CNN target assigner in the model builder. This allows configuring matchers in the target assigner to use TPU-compatible ops (tf.gather in this case) without any change to the meta architecture.
    - As a positive side effect of the refactoring, we can now re-use a single target assigner for all of the second-stage heads in Faster R-CNN.

--
206178206  by Zhichao Lu:

    Force the SSD feature extractor builder to use keyword arguments so that values cannot be passed to the wrong arguments.

--
206168297  by Zhichao Lu:

    Updating exporter to use freeze_graph.freeze_graph_with_def_protos rather than a homegrown version.
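
    For reference, a sketch of the call (argument values are illustrative, and
    graph_def / checkpoint_path are assumed to come from the export pipeline):

      from tensorflow.python.tools import freeze_graph

      frozen_graph_def = freeze_graph.freeze_graph_with_def_protos(
          input_graph_def=graph_def,         # assumed: exported GraphDef
          input_saver_def=None,
          input_checkpoint=checkpoint_path,  # assumed: trained checkpoint
          output_node_names='detection_boxes,detection_scores,'
                            'detection_classes,num_detections',
          restore_op_name='save/restore_all',
          filename_tensor_name='save/Const:0',
          output_graph='',
          clear_devices=True,
          initializer_nodes='')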

--
206080748  by Zhichao Lu:

    Merge external contributions.

--
206074460  by Zhichao Lu:

    Update the preprocessor to apply temperature scaling and a softmax to the multiclass scores on read.
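
    The transform is a temperature-scaled softmax; a minimal sketch (function
    name assumed):

      import tensorflow as tf

      def convert_class_logits_to_softmax(multiclass_scores, temperature=1.0):
        """Sketch: divide raw multiclass scores by a temperature, then
        apply a softmax."""
        return tf.nn.softmax(multiclass_scores / temperature)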

--
205960802  by Zhichao Lu:

    Fixing a bug in hierarchical label expansion script.

--
205944686  by Zhichao Lu:

    Update exporter to support exporting quantized model.

--
205912529  by Zhichao Lu:

    Add a two-stage matcher to allow thresholding by one criterion and then argmaxing on the other.
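
    A sketch of the two-stage idea (names and tensor layout illustrative):
    rows that fail the threshold on the first criterion are masked out before
    the argmax over the second.

      import tensorflow as tf

      def two_stage_match(threshold_scores, argmax_scores, threshold):
        """Sketch: both score tensors are [num_rows, num_cols]; returns,
        per column, the row index of the best surviving candidate."""
        passes = threshold_scores >= threshold
        neg_inf = tf.fill(tf.shape(argmax_scores), float('-inf'))
        masked = tf.where(passes, argmax_scores, neg_inf)
        # Columns where nothing passes still return row 0; a real matcher
        # would mark those columns as unmatched.
        return tf.argmax(masked, axis=0)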

--
205909017  by Zhichao Lu:

    Add test for grayscale image_resizer

--
205892801  by Zhichao Lu:

    Add a flag to decide whether to apply batch norm to the conv layers of the weight-shared box predictor.

--
205824449  by Zhichao Lu:

    Make sure that, by default, the Mask R-CNN box predictor predicts 2 stages.

--
205730139  by Zhichao Lu:

    Updating warning message to be more explicit about variable size mismatch.

--
205696992  by Zhichao Lu:

    Remove utils/ops.py's dependency on core/box_list_ops.py. This will allow re-using TPU compatible ops from utils/ops.py in core/box_list_ops.py.

--
205696867  by Zhichao Lu:

    Refactor the Mask R-CNN predictor so that each head lives in a separate file.
    This CL lets us add new heads to Mask R-CNN more easily in the future.

--
205492073  by Zhichao Lu:

    Refactor R-FCN box predictor to be TPU compliant.

    - Change utils/ops.py:position_sensitive_crop_regions to operate on a single image and a set of boxes, without `box_ind`.
    - Add a batch version that operates on batches of images and batches of boxes (see the sketch below).
    - Refactor the R-FCN box predictor to use the batched version of position-sensitive crop regions.
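
    A sketch of the batched wrapper, under the assumption that the refactored
    single-image op is ops.position_sensitive_crop_regions(image, boxes,
    crop_size, num_spatial_bins, global_pool):

      import tensorflow as tf

      from object_detection.utils import ops

      def batch_position_sensitive_crop_regions(images, boxes, crop_size,
                                                num_spatial_bins, global_pool):
        """Sketch: map the single-image op over the batch; no box_ind needed.

        images: [batch, height, width, channels]; boxes: [batch, num_boxes, 4].
        """
        def per_image(args):
          image, image_boxes = args
          return ops.position_sensitive_crop_regions(
              image, image_boxes, crop_size, num_spatial_bins, global_pool)
        return tf.map_fn(per_image, (images, boxes), dtype=tf.float32)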

--
205453567  by Zhichao Lu:

    Fix a bug where the inference graph could not be exported when the write_inference_graph flag is True.

--
205316039  by Zhichao Lu:

    Changing input tensor name.

--
205256307  by Zhichao Lu:

    Fix model zoo links for quantized model.

--
205164432  by Zhichao Lu:

    Fixes eval error when label map contains non-ascii characters.

--
205129842  by Zhichao Lu:

    Adds an option in Faster R-CNN to clip the anchors to the window size without filtering the overlapping boxes.

--
205094863  by Zhichao Lu:

    Update label map util to allow optionally adding a background class and filling in gaps in the label map. This is useful when using multiclass scores, which require a complete label map with an explicit background label.
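
    A plain-Python sketch of the completion (illustrative, not the exact
    label_map_util API): insert an explicit background entry at id 0 and fill
    any missing ids so multiclass scores line up with a dense label map.

      def fill_gaps_and_add_background(label_map_dict, max_class_id):
        """Sketch: label_map_dict maps id -> display name here."""
        complete = {0: 'background'}
        for class_id in range(1, max_class_id + 1):
          complete[class_id] = label_map_dict.get(
              class_id, 'class_{}'.format(class_id))
        return complete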

--
204989032  by Zhichao Lu:

    Add tf.prof support to exporter.

--
204825267  by Zhichao Lu:

    Modify mask rcnn box predictor tests for TPU compatibility.

--
204778749  by Zhichao Lu:

    Remove score filtering from postprocessing.py and rely on filtering logic in tf.image.non_max_suppression

--
204775818  by Zhichao Lu:

    Python3 fixes for object_detection.

--
204745920  by Zhichao Lu:

    Object Detection Dataset visualization tool (documentation).

--
204686993  by Zhichao Lu:

    Internal changes.

--
204559667  by Zhichao Lu:

    Refactor box_predictor.py into multiple files.
    The abstract base class remains in object_detection/core; the other classes each move to a separate file in object_detection/predictors (see the import sketch below).
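
    After the refactor, callers import the concrete predictors from their new
    homes, as the updated tests below do:

      # The abstract base class (and output-key constants) stay in core:
      from object_detection.core import box_predictor
      # Concrete predictors move under object_detection/predictors:
      from object_detection.predictors import convolutional_box_predictor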

--
204552847  by Zhichao Lu:

    Update blog post link.

--
204508028  by Zhichao Lu:

    Bump down the batch size to 1024 to be a bit more tolerant to OOM and double the number of iterations. This job still converges to 20.5 mAP in 3 hours.

--

PiperOrigin-RevId: 206852642

* Add original post-processing back.
parent d135ed9c
......@@ -61,8 +61,15 @@ class SSDMobileNetV1FeatureExtractor(ssd_meta_arch.SSDFeatureExtractor):
`conv_hyperparams_fn`.
"""
super(SSDMobileNetV1FeatureExtractor, self).__init__(
is_training, depth_multiplier, min_depth, pad_to_multiple,
conv_hyperparams_fn, reuse_weights, use_explicit_padding, use_depthwise,
is_training=is_training,
depth_multiplier=depth_multiplier,
min_depth=min_depth,
pad_to_multiple=pad_to_multiple,
conv_hyperparams_fn=conv_hyperparams_fn,
reuse_weights=reuse_weights,
use_explicit_padding=use_explicit_padding,
use_depthwise=use_depthwise,
override_base_feature_extractor_hyperparams=
override_base_feature_extractor_hyperparams)
def preprocess(self, resized_inputs):
......
......@@ -30,6 +30,61 @@ slim = tf.contrib.slim
class SSDMobileNetV1FpnFeatureExtractor(ssd_meta_arch.SSDFeatureExtractor):
"""SSD Feature Extractor using MobilenetV1 FPN features."""
def __init__(self,
is_training,
depth_multiplier,
min_depth,
pad_to_multiple,
conv_hyperparams_fn,
fpn_min_level=3,
fpn_max_level=7,
reuse_weights=None,
use_explicit_padding=False,
use_depthwise=False,
override_base_feature_extractor_hyperparams=False):
"""SSD FPN feature extractor based on Mobilenet v1 architecture.
Args:
is_training: whether the network is in training mode.
depth_multiplier: float depth multiplier for feature extractor.
min_depth: minimum feature extractor depth.
pad_to_multiple: the nearest multiple to zero pad the input height and
width dimensions to.
conv_hyperparams_fn: A function to construct tf slim arg_scope for conv2d
and separable_conv2d ops in the layers that are added on top of the base
feature extractor.
fpn_min_level: the highest resolution feature map to use in FPN. The valid
  values are {2, 3, 4, 5}, which map to MobileNet v1 layers
  {Conv2d_3_pointwise, Conv2d_5_pointwise, Conv2d_11_pointwise,
  Conv2d_13_pointwise}, respectively.
fpn_max_level: the smallest resolution feature map to construct or use in
  FPN. FPN construction uses feature maps starting from fpn_min_level up
  to fpn_max_level. If there are not enough feature maps in the backbone
  network, additional feature maps are created by applying stride-2
  convolutions until the desired number of FPN levels is reached.
reuse_weights: whether to reuse variables. Default is None.
use_explicit_padding: Whether to use explicit padding when extracting
features. Default is False.
use_depthwise: Whether to use depthwise convolutions. Default is False.
override_base_feature_extractor_hyperparams: Whether to override
hyperparameters of the base feature extractor with the one from
`conv_hyperparams_fn`.
"""
super(SSDMobileNetV1FpnFeatureExtractor, self).__init__(
is_training=is_training,
depth_multiplier=depth_multiplier,
min_depth=min_depth,
pad_to_multiple=pad_to_multiple,
conv_hyperparams_fn=conv_hyperparams_fn,
reuse_weights=reuse_weights,
use_explicit_padding=use_explicit_padding,
use_depthwise=use_depthwise,
override_base_feature_extractor_hyperparams=
override_base_feature_extractor_hyperparams)
self._fpn_min_level = fpn_min_level
self._fpn_max_level = fpn_max_level
def preprocess(self, resized_inputs):
"""SSD preprocessing.
......@@ -78,24 +133,31 @@ class SSDMobileNetV1FpnFeatureExtractor(ssd_meta_arch.SSDFeatureExtractor):
depth_fn = lambda d: max(int(d * self._depth_multiplier), self._min_depth)
with slim.arg_scope(self._conv_hyperparams_fn()):
with tf.variable_scope('fpn', reuse=self._reuse_weights):
feature_blocks = [
'Conv2d_3_pointwise', 'Conv2d_5_pointwise', 'Conv2d_11_pointwise',
'Conv2d_13_pointwise'
]
base_fpn_max_level = min(self._fpn_max_level, 5)
feature_block_list = []
for level in range(self._fpn_min_level, base_fpn_max_level + 1):
feature_block_list.append(feature_blocks[level - 2])
fpn_features = feature_map_generators.fpn_top_down_feature_maps(
[(key, image_features[key])
for key in ['Conv2d_5_pointwise', 'Conv2d_11_pointwise',
'Conv2d_13_pointwise']],
[(key, image_features[key]) for key in feature_block_list],
depth=depth_fn(256))
last_feature_map = fpn_features['top_down_Conv2d_13_pointwise']
coarse_features = {}
for i in range(14, 16):
feature_maps = []
for level in range(self._fpn_min_level, base_fpn_max_level + 1):
feature_maps.append(fpn_features['top_down_{}'.format(
feature_blocks[level - 2])])
last_feature_map = fpn_features['top_down_{}'.format(
feature_blocks[base_fpn_max_level - 2])]
# Construct coarse features
for i in range(base_fpn_max_level + 1, self._fpn_max_level + 1):
last_feature_map = slim.conv2d(
last_feature_map,
num_outputs=depth_fn(256),
kernel_size=[3, 3],
stride=2,
padding='SAME',
scope='bottom_up_Conv2d_{}'.format(i))
coarse_features['bottom_up_Conv2d_{}'.format(i)] = last_feature_map
return [fpn_features['top_down_Conv2d_5_pointwise'],
fpn_features['top_down_Conv2d_11_pointwise'],
fpn_features['top_down_Conv2d_13_pointwise'],
coarse_features['bottom_up_Conv2d_14'],
coarse_features['bottom_up_Conv2d_15']]
scope='bottom_up_Conv2d_{}'.format(i - base_fpn_max_level + 13))
feature_maps.append(last_feature_map)
return feature_maps
......@@ -64,8 +64,15 @@ class SSDMobileNetV2FeatureExtractor(ssd_meta_arch.SSDFeatureExtractor):
`conv_hyperparams_fn`.
"""
super(SSDMobileNetV2FeatureExtractor, self).__init__(
is_training, depth_multiplier, min_depth, pad_to_multiple,
conv_hyperparams_fn, reuse_weights, use_explicit_padding, use_depthwise,
is_training=is_training,
depth_multiplier=depth_multiplier,
min_depth=min_depth,
pad_to_multiple=pad_to_multiple,
conv_hyperparams_fn=conv_hyperparams_fn,
reuse_weights=reuse_weights,
use_explicit_padding=use_explicit_padding,
use_depthwise=use_depthwise,
override_base_feature_extractor_hyperparams=
override_base_feature_extractor_hyperparams)
def preprocess(self, resized_inputs):
......
......@@ -41,6 +41,8 @@ class _SSDResnetV1FpnFeatureExtractor(ssd_meta_arch.SSDFeatureExtractor):
resnet_base_fn,
resnet_scope_name,
fpn_scope_name,
fpn_min_level=3,
fpn_max_level=7,
reuse_weights=None,
use_explicit_padding=False,
use_depthwise=False,
......@@ -61,6 +63,15 @@ class _SSDResnetV1FpnFeatureExtractor(ssd_meta_arch.SSDFeatureExtractor):
resnet_scope_name: scope name under which to construct resnet
fpn_scope_name: scope name under which to construct the feature pyramid
network.
fpn_min_level: the highest resolution feature map to use in FPN. The valid
  values are {2, 3, 4, 5}, which map to Resnet blocks {1, 2, 3, 4},
  respectively.
fpn_max_level: the smallest resolution feature map to construct or use in
  FPN. FPN construction uses feature maps starting from fpn_min_level up
  to fpn_max_level. If there are not enough feature maps in the backbone
  network, additional feature maps are created by applying stride-2
  convolutions until the desired number of FPN levels is reached.
reuse_weights: Whether to reuse variables. Default is None.
use_explicit_padding: Whether to use explicit padding when extracting
features. Default is False. UNUSED currently.
......@@ -73,8 +84,15 @@ class _SSDResnetV1FpnFeatureExtractor(ssd_meta_arch.SSDFeatureExtractor):
ValueError: On supplying invalid arguments for unused arguments.
"""
super(_SSDResnetV1FpnFeatureExtractor, self).__init__(
is_training, depth_multiplier, min_depth, pad_to_multiple,
conv_hyperparams_fn, reuse_weights, use_explicit_padding,
is_training=is_training,
depth_multiplier=depth_multiplier,
min_depth=min_depth,
pad_to_multiple=pad_to_multiple,
conv_hyperparams_fn=conv_hyperparams_fn,
reuse_weights=reuse_weights,
use_explicit_padding=use_explicit_padding,
use_depthwise=use_depthwise,
override_base_feature_extractor_hyperparams=
override_base_feature_extractor_hyperparams)
if self._depth_multiplier != 1.0:
raise ValueError('Only depth 1.0 is supported, found: {}'.
......@@ -84,6 +102,8 @@ class _SSDResnetV1FpnFeatureExtractor(ssd_meta_arch.SSDFeatureExtractor):
self._resnet_base_fn = resnet_base_fn
self._resnet_scope_name = resnet_scope_name
self._fpn_scope_name = fpn_scope_name
self._fpn_min_level = fpn_min_level
self._fpn_max_level = fpn_max_level
def preprocess(self, resized_inputs):
"""SSD preprocessing.
......@@ -108,7 +128,7 @@ class _SSDResnetV1FpnFeatureExtractor(ssd_meta_arch.SSDFeatureExtractor):
filtered_image_features = dict({})
for key, feature in image_features.items():
feature_name = key.split('/')[-1]
if feature_name in ['block2', 'block3', 'block4']:
if feature_name in ['block1', 'block2', 'block3', 'block4']:
filtered_image_features[feature_name] = feature
return filtered_image_features
......@@ -151,13 +171,21 @@ class _SSDResnetV1FpnFeatureExtractor(ssd_meta_arch.SSDFeatureExtractor):
with slim.arg_scope(self._conv_hyperparams_fn()):
with tf.variable_scope(self._fpn_scope_name,
reuse=self._reuse_weights):
base_fpn_max_level = min(self._fpn_max_level, 5)
feature_block_list = []
for level in range(self._fpn_min_level, base_fpn_max_level + 1):
feature_block_list.append('block{}'.format(level - 1))
fpn_features = feature_map_generators.fpn_top_down_feature_maps(
[(key, image_features[key])
for key in ['block2', 'block3', 'block4']],
[(key, image_features[key]) for key in feature_block_list],
depth=256)
last_feature_map = fpn_features['top_down_block4']
coarse_features = {}
for i in range(5, 7):
feature_maps = []
for level in range(self._fpn_min_level, base_fpn_max_level + 1):
feature_maps.append(
fpn_features['top_down_block{}'.format(level - 1)])
last_feature_map = fpn_features['top_down_block{}'.format(
base_fpn_max_level - 1)]
# Construct coarse features
for i in range(base_fpn_max_level, self._fpn_max_level):
last_feature_map = slim.conv2d(
last_feature_map,
num_outputs=256,
......@@ -165,15 +193,12 @@ class _SSDResnetV1FpnFeatureExtractor(ssd_meta_arch.SSDFeatureExtractor):
stride=2,
padding='SAME',
scope='bottom_up_block{}'.format(i))
coarse_features['bottom_up_block{}'.format(i)] = last_feature_map
return [fpn_features['top_down_block2'],
fpn_features['top_down_block3'],
fpn_features['top_down_block4'],
coarse_features['bottom_up_block5'],
coarse_features['bottom_up_block6']]
feature_maps.append(last_feature_map)
return feature_maps
class SSDResnet50V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
"""SSD Resnet50 V1 FPN feature extractor."""
def __init__(self,
is_training,
......@@ -181,6 +206,8 @@ class SSDResnet50V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
min_depth,
pad_to_multiple,
conv_hyperparams_fn,
fpn_min_level=3,
fpn_max_level=7,
reuse_weights=None,
use_explicit_padding=False,
use_depthwise=False,
......@@ -197,6 +224,8 @@ class SSDResnet50V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
conv_hyperparams_fn: A function to construct tf slim arg_scope for conv2d
and separable_conv2d ops in the layers that are added on top of the
base feature extractor.
fpn_min_level: the minimum level in feature pyramid networks.
fpn_max_level: the maximum level in feature pyramid networks.
reuse_weights: Whether to reuse variables. Default is None.
use_explicit_padding: Whether to use explicit padding when extracting
features. Default is False. UNUSED currently.
......@@ -206,13 +235,25 @@ class SSDResnet50V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
`conv_hyperparams_fn`.
"""
super(SSDResnet50V1FpnFeatureExtractor, self).__init__(
is_training, depth_multiplier, min_depth, pad_to_multiple,
conv_hyperparams_fn, resnet_v1.resnet_v1_50, 'resnet_v1_50', 'fpn',
reuse_weights, use_explicit_padding,
is_training,
depth_multiplier,
min_depth,
pad_to_multiple,
conv_hyperparams_fn,
resnet_v1.resnet_v1_50,
'resnet_v1_50',
'fpn',
fpn_min_level,
fpn_max_level,
reuse_weights=reuse_weights,
use_explicit_padding=use_explicit_padding,
use_depthwise=use_depthwise,
override_base_feature_extractor_hyperparams=
override_base_feature_extractor_hyperparams)
class SSDResnet101V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
"""SSD Resnet101 V1 FPN feature extractor."""
def __init__(self,
is_training,
......@@ -220,6 +261,8 @@ class SSDResnet101V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
min_depth,
pad_to_multiple,
conv_hyperparams_fn,
fpn_min_level=3,
fpn_max_level=7,
reuse_weights=None,
use_explicit_padding=False,
use_depthwise=False,
......@@ -236,6 +279,8 @@ class SSDResnet101V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
conv_hyperparams_fn: A function to construct tf slim arg_scope for conv2d
and separable_conv2d ops in the layers that are added on top of the
base feature extractor.
fpn_min_level: the minimum level in feature pyramid networks.
fpn_max_level: the maximum level in feature pyramid networks.
reuse_weights: Whether to reuse variables. Default is None.
use_explicit_padding: Whether to use explicit padding when extracting
features. Default is False. UNUSED currently.
......@@ -245,13 +290,25 @@ class SSDResnet101V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
`conv_hyperparams_fn`.
"""
super(SSDResnet101V1FpnFeatureExtractor, self).__init__(
is_training, depth_multiplier, min_depth, pad_to_multiple,
conv_hyperparams_fn, resnet_v1.resnet_v1_101, 'resnet_v1_101', 'fpn',
reuse_weights, use_explicit_padding,
is_training,
depth_multiplier,
min_depth,
pad_to_multiple,
conv_hyperparams_fn,
resnet_v1.resnet_v1_101,
'resnet_v1_101',
'fpn',
fpn_min_level,
fpn_max_level,
reuse_weights=reuse_weights,
use_explicit_padding=use_explicit_padding,
use_depthwise=use_depthwise,
override_base_feature_extractor_hyperparams=
override_base_feature_extractor_hyperparams)
class SSDResnet152V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
"""SSD Resnet152 V1 FPN feature extractor."""
def __init__(self,
is_training,
......@@ -259,6 +316,8 @@ class SSDResnet152V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
min_depth,
pad_to_multiple,
conv_hyperparams_fn,
fpn_min_level=3,
fpn_max_level=7,
reuse_weights=None,
use_explicit_padding=False,
use_depthwise=False,
......@@ -275,6 +334,8 @@ class SSDResnet152V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
conv_hyperparams_fn: A function to construct tf slim arg_scope for conv2d
and separable_conv2d ops in the layers that are added on top of the
base feature extractor.
fpn_min_level: the minimum level in feature pyramid networks.
fpn_max_level: the maximum level in feature pyramid networks.
reuse_weights: Whether to reuse variables. Default is None.
use_explicit_padding: Whether to use explicit padding when extracting
features. Default is False. UNUSED currently.
......@@ -284,7 +345,18 @@ class SSDResnet152V1FpnFeatureExtractor(_SSDResnetV1FpnFeatureExtractor):
`conv_hyperparams_fn`.
"""
super(SSDResnet152V1FpnFeatureExtractor, self).__init__(
is_training, depth_multiplier, min_depth, pad_to_multiple,
conv_hyperparams_fn, resnet_v1.resnet_v1_152, 'resnet_v1_152', 'fpn',
reuse_weights, use_explicit_padding,
is_training,
depth_multiplier,
min_depth,
pad_to_multiple,
conv_hyperparams_fn,
resnet_v1.resnet_v1_152,
'resnet_v1_152',
'fpn',
fpn_min_level,
fpn_max_level,
reuse_weights=reuse_weights,
use_explicit_padding=use_explicit_padding,
use_depthwise=use_depthwise,
override_base_feature_extractor_hyperparams=
override_base_feature_extractor_hyperparams)
......@@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"colab_type": "text",
"id": "V8-yl-s-WKMG"
},
"source": [
"# Object Detection Demo\n",
"Welcome to the object detection inference walkthrough! This notebook will walk you step by step through the process of using a pre-trained model to detect objects in an image. Make sure to follow the [installation instructions](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md) before you start."
......@@ -10,16 +13,26 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"colab_type": "text",
"id": "kFSqkTCdWKMI"
},
"source": [
"# Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 0,
"metadata": {
"scrolled": true
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "hV4P5gyTWKMI"
},
"outputs": [],
"source": [
......@@ -40,21 +53,33 @@
"sys.path.append(\"..\")\n",
"from object_detection.utils import ops as utils_ops\n",
"\n",
"if tf.__version__ < '1.4.0':\n",
"if tf.__version__ \u003c '1.4.0':\n",
" raise ImportError('Please upgrade your tensorflow installation to v1.4.* or later!')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"colab_type": "text",
"id": "Wy72mWwAWKMK"
},
"source": [
"## Env setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 0,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "v7m_NY_aWKMK"
},
"outputs": [],
"source": [
"# This is needed to display the images.\n",
......@@ -63,7 +88,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"colab_type": "text",
"id": "r5FNuiRPWKMN"
},
"source": [
"## Object detection imports\n",
"Here are the imports from the object detection module."
......@@ -71,8 +99,17 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 0,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "bm0_uNRnWKMN"
},
"outputs": [],
"source": [
"from utils import label_map_util\n",
......@@ -82,26 +119,41 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"colab_type": "text",
"id": "cfn_tRFOWKMO"
},
"source": [
"# Model preparation "
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"colab_type": "text",
"id": "X_sEBLpVWKMQ"
},
"source": [
"## Variables\n",
"\n",
"Any model exported using the `export_inference_graph.py` tool can be loaded here simply by changing `PATH_TO_CKPT` to point to a new .pb file. \n",
"Any model exported using the `export_inference_graph.py` tool can be loaded here simply by changing `PATH_TO_FROZEN_GRAPH` to point to a new .pb file. \n",
"\n",
"By default we use an \"SSD with Mobilenet\" model here. See the [detection model zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md) for a list of other models that can be run out-of-the-box with varying speeds and accuracies."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 0,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "VyPz_t8WWKMQ"
},
"outputs": [],
"source": [
"# What model to download.\n",
......@@ -110,7 +162,7 @@
"DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'\n",
"\n",
"# Path to frozen detection graph. This is the actual model that is used for the object detection.\n",
"PATH_TO_CKPT = MODEL_NAME + '/frozen_inference_graph.pb'\n",
"PATH_TO_FROZEN_GRAPH = MODEL_NAME + '/frozen_inference_graph.pb'\n",
"\n",
"# List of the strings that is used to add correct label for each box.\n",
"PATH_TO_LABELS = os.path.join('data', 'mscoco_label_map.pbtxt')\n",
......@@ -120,15 +172,27 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"colab_type": "text",
"id": "7ai8pLZZWKMS"
},
"source": [
"## Download Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 0,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "KILYnwR5WKMS"
},
"outputs": [],
"source": [
"opener = urllib.request.URLopener()\n",
......@@ -142,21 +206,33 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"colab_type": "text",
"id": "YBcB9QHLWKMU"
},
"source": [
"## Load a (frozen) Tensorflow model into memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 0,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "KezjCRVvWKMV"
},
"outputs": [],
"source": [
"detection_graph = tf.Graph()\n",
"with detection_graph.as_default():\n",
" od_graph_def = tf.GraphDef()\n",
" with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:\n",
" with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as fid:\n",
" serialized_graph = fid.read()\n",
" od_graph_def.ParseFromString(serialized_graph)\n",
" tf.import_graph_def(od_graph_def, name='')"
......@@ -164,7 +240,10 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"colab_type": "text",
"id": "_1MVVTcLWKMW"
},
"source": [
"## Loading label map\n",
"Label maps map indices to category names, so that when our convolution network predicts `5`, we know that this corresponds to `airplane`. Here we use internal utility functions, but anything that returns a dictionary mapping integers to appropriate string labels would be fine"
......@@ -172,8 +251,17 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 0,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "hDbpHkiWWKMX"
},
"outputs": [],
"source": [
"label_map = label_map_util.load_labelmap(PATH_TO_LABELS)\n",
......@@ -183,15 +271,27 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"colab_type": "text",
"id": "EFsoUHvbWKMZ"
},
"source": [
"## Helper code"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 0,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "aSlYc3JkWKMa"
},
"outputs": [],
"source": [
"def load_image_into_numpy_array(image):\n",
......@@ -202,15 +302,27 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"colab_type": "text",
"id": "H0_1AGhrWKMc"
},
"source": [
"# Detection"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 0,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "jG-zn5ykWKMd"
},
"outputs": [],
"source": [
"# For the sake of simplicity we will use only 2 images:\n",
......@@ -226,8 +338,17 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 0,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "92BHxzcNWKMf"
},
"outputs": [],
"source": [
"def run_inference_for_single_image(image, graph):\n",
......@@ -279,9 +400,16 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 0,
"metadata": {
"scrolled": true
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "3a5wMHN8WKMh"
},
"outputs": [],
"source": [
......@@ -310,34 +438,37 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 0,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "LQSEnEsPWKMj"
},
"outputs": [],
"source": []
"source": [
""
]
}
],
"metadata": {
"colab": {
"version": "0.3.2"
"default_view": {},
"name": "object_detection_tutorial.ipynb?workspaceId=ronnyvotel:python_inference::citc",
"provenance": [],
"version": "0.3.2",
"views": {}
},
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 0
}
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Convolutional Box Predictors with and without weight sharing."""
import tensorflow as tf
from object_detection.core import box_predictor
from object_detection.utils import shape_utils
from object_detection.utils import static_shape
slim = tf.contrib.slim
BOX_ENCODINGS = box_predictor.BOX_ENCODINGS
CLASS_PREDICTIONS_WITH_BACKGROUND = (
box_predictor.CLASS_PREDICTIONS_WITH_BACKGROUND)
MASK_PREDICTIONS = box_predictor.MASK_PREDICTIONS
class _NoopVariableScope(object):
"""A dummy class that does not push any scope."""
def __enter__(self):
return None
def __exit__(self, exc_type, exc_value, traceback):
return False
class ConvolutionalBoxPredictor(box_predictor.BoxPredictor):
"""Convolutional Box Predictor.
Optionally add an intermediate 1x1 convolutional layer after features and
predict in parallel branches box_encodings and
class_predictions_with_background.
Currently this box predictor assumes that predictions are "shared" across
classes; that is, each anchor makes box predictions which do not depend
on class.
"""
def __init__(self,
is_training,
num_classes,
conv_hyperparams_fn,
min_depth,
max_depth,
num_layers_before_predictor,
use_dropout,
dropout_keep_prob,
kernel_size,
box_code_size,
apply_sigmoid_to_scores=False,
class_prediction_bias_init=0.0,
use_depthwise=False):
"""Constructor.
Args:
is_training: Indicates whether the BoxPredictor is in training mode.
num_classes: number of classes. Note that num_classes *does not*
include the background category, so if groundtruth labels take values
in {0, 1, .., K-1}, num_classes=K (and not K+1, even though the
assigned classification targets can range from {0,... K}).
conv_hyperparams_fn: A function to generate tf-slim arg_scope with
hyperparameters for convolution ops.
min_depth: Minimum feature depth prior to predicting box encodings
and class predictions.
max_depth: Maximum feature depth prior to predicting box encodings
and class predictions. If max_depth is set to 0, no additional
feature map will be inserted before location and class predictions.
num_layers_before_predictor: Number of the additional conv layers before
the predictor.
use_dropout: Option to use dropout for class prediction or not.
dropout_keep_prob: Keep probability for dropout.
This is only used if use_dropout is True.
kernel_size: Size of final convolution kernel. If the
spatial resolution of the feature map is smaller than the kernel size,
then the kernel size is automatically set to be
min(feature_width, feature_height).
box_code_size: Size of encoding for each box.
apply_sigmoid_to_scores: if True, apply the sigmoid on the output
class_predictions.
class_prediction_bias_init: constant value to initialize bias of the last
conv2d layer before class prediction.
use_depthwise: Whether to use depthwise convolutions for prediction
steps. Default is False.
Raises:
ValueError: if min_depth > max_depth.
"""
super(ConvolutionalBoxPredictor, self).__init__(is_training, num_classes)
if min_depth > max_depth:
raise ValueError('min_depth should be less than or equal to max_depth')
self._conv_hyperparams_fn = conv_hyperparams_fn
self._min_depth = min_depth
self._max_depth = max_depth
self._num_layers_before_predictor = num_layers_before_predictor
self._use_dropout = use_dropout
self._kernel_size = kernel_size
self._box_code_size = box_code_size
self._dropout_keep_prob = dropout_keep_prob
self._apply_sigmoid_to_scores = apply_sigmoid_to_scores
self._class_prediction_bias_init = class_prediction_bias_init
self._use_depthwise = use_depthwise
def _predict(self, image_features, num_predictions_per_location_list):
"""Computes encoded object locations and corresponding confidences.
Args:
image_features: A list of float tensors of shape [batch_size, height_i,
width_i, channels_i] containing features for a batch of images.
num_predictions_per_location_list: A list of integers representing the
number of box predictions to be made per spatial location for each
feature map.
Returns:
box_encodings: A list of float tensors of shape
[batch_size, num_anchors_i, q, code_size] representing the location of
the objects, where q is 1 or the number of classes. Each entry in the
list corresponds to a feature map in the input `image_features` list.
class_predictions_with_background: A list of float tensors of shape
[batch_size, num_anchors_i, num_classes + 1] representing the class
predictions for the proposals. Each entry in the list corresponds to a
feature map in the input `image_features` list.
"""
box_encodings_list = []
class_predictions_list = []
# TODO(rathodv): Come up with a better way to generate scope names
# in box predictor once we have time to retrain all models in the zoo.
# The following lines create scope names to be backwards compatible with the
# existing checkpoints.
box_predictor_scopes = [_NoopVariableScope()]
if len(image_features) > 1:
box_predictor_scopes = [
tf.variable_scope('BoxPredictor_{}'.format(i))
for i in range(len(image_features))
]
for (image_feature,
num_predictions_per_location, box_predictor_scope) in zip(
image_features, num_predictions_per_location_list,
box_predictor_scopes):
with box_predictor_scope:
# Add a slot for the background class.
num_class_slots = self.num_classes + 1
net = image_feature
with slim.arg_scope(self._conv_hyperparams_fn()), \
slim.arg_scope([slim.dropout], is_training=self._is_training):
# Add additional conv layers before the class predictor.
features_depth = static_shape.get_depth(image_feature.get_shape())
depth = max(min(features_depth, self._max_depth), self._min_depth)
tf.logging.info('depth of additional conv before box predictor: {}'.
format(depth))
if depth > 0 and self._num_layers_before_predictor > 0:
for i in range(self._num_layers_before_predictor):
net = slim.conv2d(
net, depth, [1, 1], scope='Conv2d_%d_1x1_%d' % (i, depth))
with slim.arg_scope([slim.conv2d], activation_fn=None,
normalizer_fn=None, normalizer_params=None):
if self._use_depthwise:
box_encodings = slim.separable_conv2d(
net, None, [self._kernel_size, self._kernel_size],
padding='SAME', depth_multiplier=1, stride=1,
rate=1, scope='BoxEncodingPredictor_depthwise')
box_encodings = slim.conv2d(
box_encodings,
num_predictions_per_location * self._box_code_size, [1, 1],
scope='BoxEncodingPredictor')
else:
box_encodings = slim.conv2d(
net, num_predictions_per_location * self._box_code_size,
[self._kernel_size, self._kernel_size],
scope='BoxEncodingPredictor')
if self._use_dropout:
net = slim.dropout(net, keep_prob=self._dropout_keep_prob)
if self._use_depthwise:
class_predictions_with_background = slim.separable_conv2d(
net, None, [self._kernel_size, self._kernel_size],
padding='SAME', depth_multiplier=1, stride=1,
rate=1, scope='ClassPredictor_depthwise')
class_predictions_with_background = slim.conv2d(
class_predictions_with_background,
num_predictions_per_location * num_class_slots,
[1, 1], scope='ClassPredictor')
else:
class_predictions_with_background = slim.conv2d(
net, num_predictions_per_location * num_class_slots,
[self._kernel_size, self._kernel_size],
scope='ClassPredictor',
biases_initializer=tf.constant_initializer(
self._class_prediction_bias_init))
if self._apply_sigmoid_to_scores:
class_predictions_with_background = tf.sigmoid(
class_predictions_with_background)
combined_feature_map_shape = (shape_utils.
combined_static_and_dynamic_shape(
image_feature))
box_encodings = tf.reshape(
box_encodings, tf.stack([combined_feature_map_shape[0],
combined_feature_map_shape[1] *
combined_feature_map_shape[2] *
num_predictions_per_location,
1, self._box_code_size]))
box_encodings_list.append(box_encodings)
class_predictions_with_background = tf.reshape(
class_predictions_with_background,
tf.stack([combined_feature_map_shape[0],
combined_feature_map_shape[1] *
combined_feature_map_shape[2] *
num_predictions_per_location,
num_class_slots]))
class_predictions_list.append(class_predictions_with_background)
return {
BOX_ENCODINGS: box_encodings_list,
CLASS_PREDICTIONS_WITH_BACKGROUND: class_predictions_list
}
# TODO(rathodv): Replace with slim.arg_scope_func_key once its available
# externally.
def _arg_scope_func_key(op):
"""Returns a key that can be used to index arg_scope dictionary."""
return getattr(op, '_key_op', str(op))
# TODO(rathodv): Merge the implementation with ConvolutionalBoxPredictor above
# since they are very similar.
class WeightSharedConvolutionalBoxPredictor(box_predictor.BoxPredictor):
"""Convolutional Box Predictor with weight sharing.
Defines the box predictor as defined in
https://arxiv.org/abs/1708.02002. This class differs from
ConvolutionalBoxPredictor in that it shares weights and biases while
predicting from different feature maps. However, batch_norm parameters are not
shared because the statistics of the activations vary among the different
feature maps.
Also note that separate multi-layer towers are constructed for the box
encoding and class predictors respectively.
"""
def __init__(self,
is_training,
num_classes,
conv_hyperparams_fn,
depth,
num_layers_before_predictor,
box_code_size,
kernel_size=3,
class_prediction_bias_init=0.0,
use_dropout=False,
dropout_keep_prob=0.8,
share_prediction_tower=False,
apply_batch_norm=True):
"""Constructor.
Args:
is_training: Indicates whether the BoxPredictor is in training mode.
num_classes: number of classes. Note that num_classes *does not*
include the background category, so if groundtruth labels take values
in {0, 1, .., K-1}, num_classes=K (and not K+1, even though the
assigned classification targets can range from {0,... K}).
conv_hyperparams_fn: A function to generate tf-slim arg_scope with
hyperparameters for convolution ops.
depth: depth of conv layers.
num_layers_before_predictor: Number of the additional conv layers before
the predictor.
box_code_size: Size of encoding for each box.
kernel_size: Size of final convolution kernel.
class_prediction_bias_init: constant value to initialize bias of the last
conv2d layer before class prediction.
use_dropout: Whether to apply dropout to class prediction head.
dropout_keep_prob: Probability of keeping activations.
share_prediction_tower: Whether to share the multi-layer tower between box
prediction and class prediction heads.
apply_batch_norm: Whether to apply batch normalization to conv layers in
this predictor.
"""
super(WeightSharedConvolutionalBoxPredictor, self).__init__(is_training,
num_classes)
self._conv_hyperparams_fn = conv_hyperparams_fn
self._depth = depth
self._num_layers_before_predictor = num_layers_before_predictor
self._box_code_size = box_code_size
self._kernel_size = kernel_size
self._class_prediction_bias_init = class_prediction_bias_init
self._use_dropout = use_dropout
self._dropout_keep_prob = dropout_keep_prob
self._share_prediction_tower = share_prediction_tower
self._apply_batch_norm = apply_batch_norm
def _predict(self, image_features, num_predictions_per_location_list):
"""Computes encoded object locations and corresponding confidences.
Args:
image_features: A list of float tensors of shape [batch_size, height_i,
width_i, channels] containing features for a batch of images. Note that
when not all tensors in the list have the same number of channels, an
additional projection layer will be added on top of each such tensor to
generate a feature map whose channel count is consistent with the majority.
num_predictions_per_location_list: A list of integers representing the
number of box predictions to be made per spatial location for each
feature map. Note that all values must be the same since the weights are
shared.
Returns:
box_encodings: A list of float tensors of shape
[batch_size, num_anchors_i, code_size] representing the location of
the objects. Each entry in the list corresponds to a feature map in the
input `image_features` list.
class_predictions_with_background: A list of float tensors of shape
[batch_size, num_anchors_i, num_classes + 1] representing the class
predictions for the proposals. Each entry in the list corresponds to a
feature map in the input `image_features` list.
Raises:
ValueError: If the image feature maps do not have the same number of
channels or if the number of predictions per location differs between the
feature maps.
"""
if len(set(num_predictions_per_location_list)) > 1:
raise ValueError('num predictions per location must be the same for all '
                 'feature maps, found: {}'.format(
                     num_predictions_per_location_list))
feature_channels = [
image_feature.shape[3].value for image_feature in image_features
]
has_different_feature_channels = len(set(feature_channels)) > 1
if has_different_feature_channels:
inserted_layer_counter = 0
target_channel = max(set(feature_channels), key=feature_channels.count)
tf.logging.info('Not all feature maps have the same number of '
                'channels, found: {}; adding projection layers '
                'to bring all feature maps to a uniform channel '
                'count of {}'.format(feature_channels, target_channel))
box_encodings_list = []
class_predictions_list = []
num_class_slots = self.num_classes + 1
for feature_index, (image_feature,
num_predictions_per_location) in enumerate(
zip(image_features,
num_predictions_per_location_list)):
# Add a slot for the background class.
with tf.variable_scope('WeightSharedConvolutionalBoxPredictor',
reuse=tf.AUTO_REUSE):
with slim.arg_scope(self._conv_hyperparams_fn()):
# Insert an additional projection layer if necessary.
if (has_different_feature_channels and
image_feature.shape[3].value != target_channel):
image_feature = slim.conv2d(
image_feature,
target_channel, [1, 1],
stride=1,
padding='SAME',
activation_fn=None,
normalizer_fn=(tf.identity if self._apply_batch_norm else None),
scope='ProjectionLayer/conv2d_{}'.format(
inserted_layer_counter))
if self._apply_batch_norm:
image_feature = slim.batch_norm(
image_feature,
scope='ProjectionLayer/conv2d_{}/BatchNorm'.format(
inserted_layer_counter))
inserted_layer_counter += 1
box_encodings_net = image_feature
class_predictions_net = image_feature
for i in range(self._num_layers_before_predictor):
box_prediction_tower_prefix = (
'PredictionTower' if self._share_prediction_tower
else 'BoxPredictionTower')
box_encodings_net = slim.conv2d(
box_encodings_net,
self._depth, [self._kernel_size, self._kernel_size],
stride=1,
padding='SAME',
activation_fn=None,
normalizer_fn=(tf.identity if self._apply_batch_norm else None),
scope='{}/conv2d_{}'.format(box_prediction_tower_prefix, i))
if self._apply_batch_norm:
box_encodings_net = slim.batch_norm(
box_encodings_net,
scope='{}/conv2d_{}/BatchNorm/feature_{}'.
format(box_prediction_tower_prefix, i, feature_index))
box_encodings_net = tf.nn.relu6(box_encodings_net)
box_encodings = slim.conv2d(
box_encodings_net,
num_predictions_per_location * self._box_code_size,
[self._kernel_size, self._kernel_size],
activation_fn=None, stride=1, padding='SAME',
normalizer_fn=None,
scope='BoxPredictor')
if self._share_prediction_tower:
class_predictions_net = box_encodings_net
else:
for i in range(self._num_layers_before_predictor):
class_predictions_net = slim.conv2d(
class_predictions_net,
self._depth, [self._kernel_size, self._kernel_size],
stride=1,
padding='SAME',
activation_fn=None,
normalizer_fn=(tf.identity
if self._apply_batch_norm else None),
scope='ClassPredictionTower/conv2d_{}'.format(i))
if self._apply_batch_norm:
class_predictions_net = slim.batch_norm(
class_predictions_net,
scope='ClassPredictionTower/conv2d_{}/BatchNorm/feature_{}'
.format(i, feature_index))
class_predictions_net = tf.nn.relu6(class_predictions_net)
if self._use_dropout:
class_predictions_net = slim.dropout(
class_predictions_net, keep_prob=self._dropout_keep_prob)
class_predictions_with_background = slim.conv2d(
class_predictions_net,
num_predictions_per_location * num_class_slots,
[self._kernel_size, self._kernel_size],
activation_fn=None, stride=1, padding='SAME',
normalizer_fn=None,
biases_initializer=tf.constant_initializer(
self._class_prediction_bias_init),
scope='ClassPredictor')
combined_feature_map_shape = (shape_utils.
combined_static_and_dynamic_shape(
image_feature))
box_encodings = tf.reshape(
box_encodings, tf.stack([combined_feature_map_shape[0],
combined_feature_map_shape[1] *
combined_feature_map_shape[2] *
num_predictions_per_location,
self._box_code_size]))
box_encodings_list.append(box_encodings)
class_predictions_with_background = tf.reshape(
class_predictions_with_background,
tf.stack([combined_feature_map_shape[0],
combined_feature_map_shape[1] *
combined_feature_map_shape[2] *
num_predictions_per_location,
num_class_slots]))
class_predictions_list.append(class_predictions_with_background)
return {
BOX_ENCODINGS: box_encodings_list,
CLASS_PREDICTIONS_WITH_BACKGROUND: class_predictions_list
}
......@@ -13,202 +13,17 @@
# limitations under the License.
# ==============================================================================
"""Tests for object_detection.core.box_predictor."""
"""Tests for object_detection.predictors.convolutional_box_predictor."""
import numpy as np
import tensorflow as tf
from google.protobuf import text_format
from object_detection.builders import hyperparams_builder
from object_detection.core import box_predictor
from object_detection.predictors import convolutional_box_predictor as box_predictor
from object_detection.protos import hyperparams_pb2
from object_detection.utils import test_case
class MaskRCNNBoxPredictorTest(tf.test.TestCase):
def _build_arg_scope_with_hyperparams(self,
op_type=hyperparams_pb2.Hyperparams.FC):
hyperparams = hyperparams_pb2.Hyperparams()
hyperparams_text_proto = """
activation: NONE
regularizer {
l2_regularizer {
}
}
initializer {
truncated_normal_initializer {
}
}
"""
text_format.Merge(hyperparams_text_proto, hyperparams)
hyperparams.op = op_type
return hyperparams_builder.build(hyperparams, is_training=True)
def test_get_boxes_with_five_classes(self):
image_features = tf.random_uniform([2, 7, 7, 3], dtype=tf.float32)
mask_box_predictor = box_predictor.MaskRCNNBoxPredictor(
is_training=False,
num_classes=5,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=False,
dropout_keep_prob=0.5,
box_code_size=4,
)
box_predictions = mask_box_predictor.predict(
[image_features], num_predictions_per_location=[1],
scope='BoxPredictor')
box_encodings = box_predictions[box_predictor.BOX_ENCODINGS]
class_predictions_with_background = box_predictions[
box_predictor.CLASS_PREDICTIONS_WITH_BACKGROUND]
init_op = tf.global_variables_initializer()
with self.test_session() as sess:
sess.run(init_op)
(box_encodings_shape,
class_predictions_with_background_shape) = sess.run(
[tf.shape(box_encodings),
tf.shape(class_predictions_with_background)])
self.assertAllEqual(box_encodings_shape, [2, 1, 5, 4])
self.assertAllEqual(class_predictions_with_background_shape, [2, 1, 6])
def test_get_boxes_with_five_classes_share_box_across_classes(self):
image_features = tf.random_uniform([2, 7, 7, 3], dtype=tf.float32)
mask_box_predictor = box_predictor.MaskRCNNBoxPredictor(
is_training=False,
num_classes=5,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=False,
dropout_keep_prob=0.5,
box_code_size=4,
share_box_across_classes=True
)
box_predictions = mask_box_predictor.predict(
[image_features], num_predictions_per_location=[1],
scope='BoxPredictor')
box_encodings = box_predictions[box_predictor.BOX_ENCODINGS]
class_predictions_with_background = box_predictions[
box_predictor.CLASS_PREDICTIONS_WITH_BACKGROUND]
init_op = tf.global_variables_initializer()
with self.test_session() as sess:
sess.run(init_op)
(box_encodings_shape,
class_predictions_with_background_shape) = sess.run(
[tf.shape(box_encodings),
tf.shape(class_predictions_with_background)])
self.assertAllEqual(box_encodings_shape, [2, 1, 1, 4])
self.assertAllEqual(class_predictions_with_background_shape, [2, 1, 6])
def test_value_error_on_predict_instance_masks_with_no_conv_hyperparms(self):
with self.assertRaises(ValueError):
box_predictor.MaskRCNNBoxPredictor(
is_training=False,
num_classes=5,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=False,
dropout_keep_prob=0.5,
box_code_size=4,
predict_instance_masks=True)
def test_get_instance_masks(self):
image_features = tf.random_uniform([2, 7, 7, 3], dtype=tf.float32)
mask_box_predictor = box_predictor.MaskRCNNBoxPredictor(
is_training=False,
num_classes=5,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=False,
dropout_keep_prob=0.5,
box_code_size=4,
conv_hyperparams_fn=self._build_arg_scope_with_hyperparams(
op_type=hyperparams_pb2.Hyperparams.CONV),
predict_instance_masks=True)
box_predictions = mask_box_predictor.predict(
[image_features],
num_predictions_per_location=[1],
scope='BoxPredictor',
predict_boxes_and_classes=True,
predict_auxiliary_outputs=True)
mask_predictions = box_predictions[box_predictor.MASK_PREDICTIONS]
self.assertListEqual([2, 1, 5, 14, 14],
mask_predictions.get_shape().as_list())
def test_do_not_return_instance_masks_without_request(self):
image_features = tf.random_uniform([2, 7, 7, 3], dtype=tf.float32)
mask_box_predictor = box_predictor.MaskRCNNBoxPredictor(
is_training=False,
num_classes=5,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=False,
dropout_keep_prob=0.5,
box_code_size=4)
box_predictions = mask_box_predictor.predict(
[image_features], num_predictions_per_location=[1],
scope='BoxPredictor')
self.assertEqual(len(box_predictions), 2)
self.assertTrue(box_predictor.BOX_ENCODINGS in box_predictions)
self.assertTrue(box_predictor.CLASS_PREDICTIONS_WITH_BACKGROUND
in box_predictions)
def test_value_error_on_predict_keypoints(self):
with self.assertRaises(ValueError):
box_predictor.MaskRCNNBoxPredictor(
is_training=False,
num_classes=5,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=False,
dropout_keep_prob=0.5,
box_code_size=4,
predict_keypoints=True)
class RfcnBoxPredictorTest(tf.test.TestCase):
def _build_arg_scope_with_conv_hyperparams(self):
conv_hyperparams = hyperparams_pb2.Hyperparams()
conv_hyperparams_text_proto = """
regularizer {
l2_regularizer {
}
}
initializer {
truncated_normal_initializer {
}
}
"""
text_format.Merge(conv_hyperparams_text_proto, conv_hyperparams)
return hyperparams_builder.build(conv_hyperparams, is_training=True)
def test_get_correct_box_encoding_and_class_prediction_shapes(self):
image_features = tf.random_uniform([4, 8, 8, 64], dtype=tf.float32)
proposal_boxes = tf.random_normal([4, 2, 4], dtype=tf.float32)
rfcn_box_predictor = box_predictor.RfcnBoxPredictor(
is_training=False,
num_classes=2,
conv_hyperparams_fn=self._build_arg_scope_with_conv_hyperparams(),
num_spatial_bins=[3, 3],
depth=4,
crop_size=[12, 12],
box_code_size=4
)
box_predictions = rfcn_box_predictor.predict(
[image_features], num_predictions_per_location=[1],
scope='BoxPredictor',
proposal_boxes=proposal_boxes)
box_encodings = tf.concat(
box_predictions[box_predictor.BOX_ENCODINGS], axis=1)
class_predictions_with_background = tf.concat(
box_predictions[box_predictor.CLASS_PREDICTIONS_WITH_BACKGROUND],
axis=1)
init_op = tf.global_variables_initializer()
with self.test_session() as sess:
sess.run(init_op)
(box_encodings_shape,
class_predictions_shape) = sess.run(
[tf.shape(box_encodings),
tf.shape(class_predictions_with_background)])
self.assertAllEqual(box_encodings_shape, [8, 1, 2, 4])
self.assertAllEqual(class_predictions_shape, [8, 1, 3])
class ConvolutionalBoxPredictorTest(test_case.TestCase):
def _build_arg_scope_with_conv_hyperparams(self):
......@@ -597,7 +412,7 @@ class WeightSharedConvolutionalBoxPredictorTest(test_case.TestCase):
self.assertAllEqual(class_predictions_with_background.shape,
[4, 960, num_classes_without_background+1])
def test_predictions_from_multiple_feature_maps_share_weights_not_batchnorm(
def test_predictions_multiple_feature_maps_share_weights_separate_batchnorm(
self):
num_classes_without_background = 6
def graph_fn(image_features1, image_features2):
......@@ -663,6 +478,65 @@ class WeightSharedConvolutionalBoxPredictorTest(test_case.TestCase):
'ClassPredictor/biases')])
self.assertEqual(expected_variable_set, actual_variable_set)
def test_predictions_multiple_feature_maps_share_weights_without_batchnorm(
self):
num_classes_without_background = 6
def graph_fn(image_features1, image_features2):
conv_box_predictor = box_predictor.WeightSharedConvolutionalBoxPredictor(
is_training=False,
num_classes=num_classes_without_background,
conv_hyperparams_fn=self._build_arg_scope_with_conv_hyperparams(),
depth=32,
num_layers_before_predictor=2,
box_code_size=4,
apply_batch_norm=False)
box_predictions = conv_box_predictor.predict(
[image_features1, image_features2],
num_predictions_per_location=[5, 5],
scope='BoxPredictor')
box_encodings = tf.concat(
box_predictions[box_predictor.BOX_ENCODINGS], axis=1)
class_predictions_with_background = tf.concat(
box_predictions[box_predictor.CLASS_PREDICTIONS_WITH_BACKGROUND],
axis=1)
return (box_encodings, class_predictions_with_background)
with self.test_session(graph=tf.Graph()):
graph_fn(tf.random_uniform([4, 32, 32, 3], dtype=tf.float32),
tf.random_uniform([4, 16, 16, 3], dtype=tf.float32))
actual_variable_set = set(
[var.op.name for var in tf.trainable_variables()])
expected_variable_set = set([
# Box prediction tower
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'BoxPredictionTower/conv2d_0/weights'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'BoxPredictionTower/conv2d_0/biases'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'BoxPredictionTower/conv2d_1/weights'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'BoxPredictionTower/conv2d_1/biases'),
# Box prediction head
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'BoxPredictor/weights'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'BoxPredictor/biases'),
# Class prediction tower
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'ClassPredictionTower/conv2d_0/weights'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'ClassPredictionTower/conv2d_0/biases'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'ClassPredictionTower/conv2d_1/weights'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'ClassPredictionTower/conv2d_1/biases'),
# Class prediction head
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'ClassPredictor/weights'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'ClassPredictor/biases')])
self.assertEqual(expected_variable_set, actual_variable_set)
def test_no_batchnorm_params_when_batchnorm_is_not_configured(self):
num_classes_without_background = 6
def graph_fn(image_features1, image_features2):
@@ -672,7 +546,8 @@ class WeightSharedConvolutionalBoxPredictorTest(test_case.TestCase):
conv_hyperparams_fn=self._build_conv_arg_scope_no_batch_norm(),
depth=32,
num_layers_before_predictor=2,
box_code_size=4,
apply_batch_norm=False)
box_predictions = conv_box_predictor.predict(
[image_features1, image_features2],
num_predictions_per_location=[5, 5],
@@ -720,7 +595,7 @@ class WeightSharedConvolutionalBoxPredictorTest(test_case.TestCase):
'ClassPredictor/biases')])
self.assertEqual(expected_variable_set, actual_variable_set)
def test_predictions_share_weights_share_tower_separate_batchnorm(
self):
num_classes_without_background = 6
def graph_fn(image_features1, image_features2):
@@ -774,6 +649,57 @@ class WeightSharedConvolutionalBoxPredictorTest(test_case.TestCase):
'ClassPredictor/biases')])
self.assertEqual(expected_variable_set, actual_variable_set)
def test_predictions_share_weights_share_tower_without_batchnorm(
self):
num_classes_without_background = 6
def graph_fn(image_features1, image_features2):
conv_box_predictor = box_predictor.WeightSharedConvolutionalBoxPredictor(
is_training=False,
num_classes=num_classes_without_background,
conv_hyperparams_fn=self._build_arg_scope_with_conv_hyperparams(),
depth=32,
num_layers_before_predictor=2,
box_code_size=4,
share_prediction_tower=True,
apply_batch_norm=False)
box_predictions = conv_box_predictor.predict(
[image_features1, image_features2],
num_predictions_per_location=[5, 5],
scope='BoxPredictor')
box_encodings = tf.concat(
box_predictions[box_predictor.BOX_ENCODINGS], axis=1)
class_predictions_with_background = tf.concat(
box_predictions[box_predictor.CLASS_PREDICTIONS_WITH_BACKGROUND],
axis=1)
return (box_encodings, class_predictions_with_background)
with self.test_session(graph=tf.Graph()):
graph_fn(tf.random_uniform([4, 32, 32, 3], dtype=tf.float32),
tf.random_uniform([4, 16, 16, 3], dtype=tf.float32))
actual_variable_set = set(
[var.op.name for var in tf.trainable_variables()])
expected_variable_set = set([
# Shared prediction tower
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'PredictionTower/conv2d_0/weights'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'PredictionTower/conv2d_0/biases'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'PredictionTower/conv2d_1/weights'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'PredictionTower/conv2d_1/biases'),
# Box prediction head
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'BoxPredictor/weights'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'BoxPredictor/biases'),
# Class prediction head
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'ClassPredictor/weights'),
('BoxPredictor/WeightSharedConvolutionalBoxPredictor/'
'ClassPredictor/biases')])
self.assertEqual(expected_variable_set, actual_variable_set)
def test_get_predictions_with_feature_maps_of_dynamic_shape(
self):
image_features = tf.placeholder(dtype=tf.float32, shape=[4, None, None, 64])
......
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Mask R-CNN Box Predictor."""
import tensorflow as tf
from object_detection.core import box_predictor
slim = tf.contrib.slim
BOX_ENCODINGS = box_predictor.BOX_ENCODINGS
CLASS_PREDICTIONS_WITH_BACKGROUND = (
box_predictor.CLASS_PREDICTIONS_WITH_BACKGROUND)
MASK_PREDICTIONS = box_predictor.MASK_PREDICTIONS
class MaskRCNNBoxPredictor(box_predictor.BoxPredictor):
"""Mask R-CNN Box Predictor.
See Mask R-CNN: He, K., Gkioxari, G., Dollar, P., & Girshick, R. (2017).
Mask R-CNN. arXiv preprint arXiv:1703.06870.
This is used for the second stage of the Mask R-CNN detector where proposals
cropped from an image are arranged along the batch dimension of the input
image_features tensor. Notice that locations are *not* shared across classes;
for each anchor, a separate box prediction is made for each class.
In addition to predicting boxes and classes, this class can optionally
predict masks and/or keypoints inside the detection boxes.
"""
def __init__(self,
is_training,
num_classes,
box_prediction_head,
class_prediction_head,
third_stage_heads):
"""Constructor.
Args:
is_training: Indicates whether the BoxPredictor is in training mode.
num_classes: number of classes. Note that num_classes *does not*
include the background category, so if groundtruth labels take values
in {0, 1, .., K-1}, num_classes=K (and not K+1, even though the
assigned classification targets can range from {0,... K}).
box_prediction_head: The head that predicts the boxes in the second stage.
class_prediction_head: The head that predicts the classes in the second
  stage.
third_stage_heads: A dictionary mapping head names to Mask R-CNN head
  objects.
"""
super(MaskRCNNBoxPredictor, self).__init__(is_training, num_classes)
self._box_prediction_head = box_prediction_head
self._class_prediction_head = class_prediction_head
self._third_stage_heads = third_stage_heads
@property
def num_classes(self):
return self._num_classes
def get_second_stage_prediction_heads(self):
return BOX_ENCODINGS, CLASS_PREDICTIONS_WITH_BACKGROUND
def get_third_stage_prediction_heads(self):
return sorted(self._third_stage_heads.keys())
def _predict(self,
image_features,
num_predictions_per_location,
prediction_stage=2):
"""Optionally computes encoded object locations, confidences, and masks.
Predicts the heads belonging to the given prediction stage.
Args:
image_features: A list of float tensors of shape
[batch_size, height_i, width_i, channels_i] containing roi pooled
features for each image. The length of the list must be 1, otherwise a
ValueError is raised.
num_predictions_per_location: A list of integers representing the number
of box predictions to be made per spatial location for each feature map.
Currently, this must be set to [1], or an error will be raised.
prediction_stage: Prediction stage. Acceptable values are 2 and 3.
Returns:
A dictionary containing the predicted tensors produced by the configured
prediction heads. A subset of the following keys will exist in the
dictionary:
BOX_ENCODINGS: A float tensor of shape
[batch_size, 1, num_classes, code_size] representing the
location of the objects.
CLASS_PREDICTIONS_WITH_BACKGROUND: A float tensor of shape
[batch_size, 1, num_classes + 1] representing the class
predictions for the proposals.
MASK_PREDICTIONS: A float tensor of shape
  [batch_size, 1, num_classes, mask_height, mask_width].
Raises:
ValueError: If num_predictions_per_location is not 1 or if
len(image_features) is not 1.
ValueError: If prediction_stage is not 2 or 3.
"""
if (len(num_predictions_per_location) != 1 or
num_predictions_per_location[0] != 1):
raise ValueError('Currently MaskRCNNBoxPredictor only supports '
'predicting a single box per class per location.')
if len(image_features) != 1:
raise ValueError('length of `image_features` must be 1. Found {}'.format(
len(image_features)))
image_feature = image_features[0]
predictions_dict = {}
if prediction_stage == 2:
predictions_dict[BOX_ENCODINGS] = self._box_prediction_head.predict(
roi_pooled_features=image_feature)
predictions_dict[CLASS_PREDICTIONS_WITH_BACKGROUND] = (
self._class_prediction_head.predict(roi_pooled_features=image_feature)
)
elif prediction_stage == 3:
for prediction_head in self.get_third_stage_prediction_heads():
head_object = self._third_stage_heads[prediction_head]
predictions_dict[prediction_head] = head_object.predict(
roi_pooled_features=image_feature)
else:
raise ValueError('prediction_stage should be either 2 or 3.')
return predictions_dict
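# --- Usage sketch (illustrative only; mirrors the unit tests that follow).
# fc_fn/conv_fn stand in for hyperparameter arg_scope builders and
# roi_features for a [batch, height, width, channels] tensor of ROI pooled
# features; the head classes are the ones from
# object_detection.predictors.mask_rcnn_heads.
#
#   predictor = MaskRCNNBoxPredictor(
#       is_training=False,
#       num_classes=5,
#       box_prediction_head=box_head.BoxHead(
#           is_training=False, num_classes=5, fc_hyperparams_fn=fc_fn,
#           use_dropout=False, dropout_keep_prob=0.5, box_code_size=4),
#       class_prediction_head=class_head.ClassHead(
#           is_training=False, num_classes=5, fc_hyperparams_fn=fc_fn,
#           use_dropout=False, dropout_keep_prob=0.5),
#       third_stage_heads={
#           MASK_PREDICTIONS: mask_head.MaskHead(
#               num_classes=5, conv_hyperparams_fn=conv_fn)})
#   # Stage 2 yields BOX_ENCODINGS and CLASS_PREDICTIONS_WITH_BACKGROUND.
#   stage2 = predictor.predict([roi_features],
#                              num_predictions_per_location=[1],
#                              scope='BoxPredictor', prediction_stage=2)
#   # Stage 3 runs the configured third-stage heads (here, MASK_PREDICTIONS).
#   stage3 = predictor.predict([roi_features],
#                              num_predictions_per_location=[1],
#                              scope='BoxPredictor', prediction_stage=3)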
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for object_detection.predictors.mask_rcnn_box_predictor."""
import numpy as np
import tensorflow as tf
from google.protobuf import text_format
from object_detection.builders import hyperparams_builder
from object_detection.predictors import mask_rcnn_box_predictor as box_predictor
from object_detection.predictors.mask_rcnn_heads import box_head
from object_detection.predictors.mask_rcnn_heads import class_head
from object_detection.predictors.mask_rcnn_heads import mask_head
from object_detection.protos import hyperparams_pb2
from object_detection.utils import test_case
class MaskRCNNBoxPredictorTest(test_case.TestCase):
def _build_arg_scope_with_hyperparams(self,
op_type=hyperparams_pb2.Hyperparams.FC):
hyperparams = hyperparams_pb2.Hyperparams()
hyperparams_text_proto = """
activation: NONE
regularizer {
l2_regularizer {
}
}
initializer {
truncated_normal_initializer {
}
}
"""
text_format.Merge(hyperparams_text_proto, hyperparams)
hyperparams.op = op_type
return hyperparams_builder.build(hyperparams, is_training=True)
def _box_predictor_builder(self,
is_training,
num_classes,
fc_hyperparams_fn,
use_dropout,
dropout_keep_prob,
box_code_size,
share_box_across_classes=False,
conv_hyperparams_fn=None,
predict_instance_masks=False):
box_prediction_head = box_head.BoxHead(
is_training=is_training,
num_classes=num_classes,
fc_hyperparams_fn=fc_hyperparams_fn,
use_dropout=use_dropout,
dropout_keep_prob=dropout_keep_prob,
box_code_size=box_code_size,
share_box_across_classes=share_box_across_classes)
class_prediction_head = class_head.ClassHead(
is_training=is_training,
num_classes=num_classes,
fc_hyperparams_fn=fc_hyperparams_fn,
use_dropout=use_dropout,
dropout_keep_prob=dropout_keep_prob)
third_stage_heads = {}
if predict_instance_masks:
third_stage_heads[box_predictor.MASK_PREDICTIONS] = mask_head.MaskHead(
num_classes=num_classes,
conv_hyperparams_fn=conv_hyperparams_fn)
return box_predictor.MaskRCNNBoxPredictor(
is_training=is_training,
num_classes=num_classes,
box_prediction_head=box_prediction_head,
class_prediction_head=class_prediction_head,
third_stage_heads=third_stage_heads)
def test_get_boxes_with_five_classes(self):
def graph_fn(image_features):
mask_box_predictor = self._box_predictor_builder(
is_training=False,
num_classes=5,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=False,
dropout_keep_prob=0.5,
box_code_size=4,
)
box_predictions = mask_box_predictor.predict(
[image_features],
num_predictions_per_location=[1],
scope='BoxPredictor',
prediction_stage=2)
return (box_predictions[box_predictor.BOX_ENCODINGS],
box_predictions[box_predictor.CLASS_PREDICTIONS_WITH_BACKGROUND])
image_features = np.random.rand(2, 7, 7, 3).astype(np.float32)
(box_encodings,
class_predictions_with_background) = self.execute(graph_fn,
[image_features])
self.assertAllEqual(box_encodings.shape, [2, 1, 5, 4])
self.assertAllEqual(class_predictions_with_background.shape, [2, 1, 6])
def test_get_boxes_with_five_classes_share_box_across_classes(self):
def graph_fn(image_features):
mask_box_predictor = self._box_predictor_builder(
is_training=False,
num_classes=5,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=False,
dropout_keep_prob=0.5,
box_code_size=4,
share_box_across_classes=True
)
box_predictions = mask_box_predictor.predict(
[image_features],
num_predictions_per_location=[1],
scope='BoxPredictor',
prediction_stage=2)
return (box_predictions[box_predictor.BOX_ENCODINGS],
box_predictions[box_predictor.CLASS_PREDICTIONS_WITH_BACKGROUND])
image_features = np.random.rand(2, 7, 7, 3).astype(np.float32)
(box_encodings,
class_predictions_with_background) = self.execute(graph_fn,
[image_features])
self.assertAllEqual(box_encodings.shape, [2, 1, 1, 4])
self.assertAllEqual(class_predictions_with_background.shape, [2, 1, 6])
def test_value_error_on_predict_instance_masks_with_no_conv_hyperparams(self):
with self.assertRaises(ValueError):
self._box_predictor_builder(
is_training=False,
num_classes=5,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=False,
dropout_keep_prob=0.5,
box_code_size=4,
predict_instance_masks=True)
def test_get_instance_masks(self):
def graph_fn(image_features):
mask_box_predictor = self._box_predictor_builder(
is_training=False,
num_classes=5,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=False,
dropout_keep_prob=0.5,
box_code_size=4,
conv_hyperparams_fn=self._build_arg_scope_with_hyperparams(
op_type=hyperparams_pb2.Hyperparams.CONV),
predict_instance_masks=True)
box_predictions = mask_box_predictor.predict(
[image_features],
num_predictions_per_location=[1],
scope='BoxPredictor',
prediction_stage=3)
return (box_predictions[box_predictor.MASK_PREDICTIONS],)
image_features = np.random.rand(2, 7, 7, 3).astype(np.float32)
mask_predictions = self.execute(graph_fn, [image_features])
self.assertAllEqual(mask_predictions.shape, [2, 1, 5, 14, 14])
def test_do_not_return_instance_masks_without_request(self):
image_features = tf.random_uniform([2, 7, 7, 3], dtype=tf.float32)
mask_box_predictor = self._box_predictor_builder(
is_training=False,
num_classes=5,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=False,
dropout_keep_prob=0.5,
box_code_size=4)
box_predictions = mask_box_predictor.predict(
[image_features],
num_predictions_per_location=[1],
scope='BoxPredictor',
prediction_stage=2)
self.assertEqual(len(box_predictions), 2)
self.assertTrue(box_predictor.BOX_ENCODINGS in box_predictions)
self.assertTrue(box_predictor.CLASS_PREDICTIONS_WITH_BACKGROUND
in box_predictions)
if __name__ == '__main__':
tf.test.main()
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Mask R-CNN Box Head."""
import tensorflow as tf
from object_detection.predictors.mask_rcnn_heads import mask_rcnn_head
slim = tf.contrib.slim
class BoxHead(mask_rcnn_head.MaskRCNNHead):
"""Mask RCNN box prediction head."""
def __init__(self,
is_training,
num_classes,
fc_hyperparams_fn,
use_dropout,
dropout_keep_prob,
box_code_size,
share_box_across_classes=False):
"""Constructor.
Args:
is_training: Indicates whether the BoxPredictor is in training mode.
num_classes: number of classes. Note that num_classes *does not*
include the background category, so if groundtruth labels take values
in {0, 1, .., K-1}, num_classes=K (and not K+1, even though the
assigned classification targets can range from {0,... K}).
fc_hyperparams_fn: A function to generate tf-slim arg_scope with
hyperparameters for fully connected ops.
use_dropout: Option to use dropout or not. Note that dropout is applied
  to the flattened roi pooled features prior to the box prediction.
dropout_keep_prob: Keep probability for dropout.
This is only used if use_dropout is True.
box_code_size: Size of encoding for each box.
share_box_across_classes: Whether to share boxes across classes rather
than use a different box for each class.
"""
super(BoxHead, self).__init__()
self._is_training = is_training
self._num_classes = num_classes
self._fc_hyperparams_fn = fc_hyperparams_fn
self._use_dropout = use_dropout
self._dropout_keep_prob = dropout_keep_prob
self._box_code_size = box_code_size
self._share_box_across_classes = share_box_across_classes
def _predict(self, roi_pooled_features):
"""Predicts boxes.
Args:
roi_pooled_features: A float tensor of shape [batch_size, height, width,
channels] containing features for a batch of images.
Returns:
box_encodings: A float tensor of shape
[batch_size, 1, num_classes, code_size] representing the location of the
objects.
"""
spatial_averaged_roi_pooled_features = tf.reduce_mean(
roi_pooled_features, [1, 2], keep_dims=True, name='AvgPool')
flattened_roi_pooled_features = slim.flatten(
spatial_averaged_roi_pooled_features)
if self._use_dropout:
flattened_roi_pooled_features = slim.dropout(
flattened_roi_pooled_features,
keep_prob=self._dropout_keep_prob,
is_training=self._is_training)
number_of_boxes = 1
if not self._share_box_across_classes:
number_of_boxes = self._num_classes
with slim.arg_scope(self._fc_hyperparams_fn()):
box_encodings = slim.fully_connected(
flattened_roi_pooled_features,
number_of_boxes * self._box_code_size,
activation_fn=None,
scope='BoxEncodingPredictor')
box_encodings = tf.reshape(box_encodings,
[-1, 1, number_of_boxes, self._box_code_size])
return box_encodings
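# Shape walk-through (illustrative; matches the unit test that follows): for
# roi_pooled_features of shape [64, 7, 7, 1024] with num_classes=20 and
# box_code_size=4, the spatial average gives [64, 1, 1, 1024], flattening
# gives [64, 1024], the fully connected layer emits
# num_classes * box_code_size = 80 values per example, and the reshape
# yields box_encodings of shape [64, 1, 20, 4].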
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for object_detection.predictors.mask_rcnn_heads.box_head."""
import tensorflow as tf
from google.protobuf import text_format
from object_detection.builders import hyperparams_builder
from object_detection.predictors.mask_rcnn_heads import box_head
from object_detection.protos import hyperparams_pb2
from object_detection.utils import test_case
class BoxHeadTest(test_case.TestCase):
def _build_arg_scope_with_hyperparams(self,
op_type=hyperparams_pb2.Hyperparams.FC):
hyperparams = hyperparams_pb2.Hyperparams()
hyperparams_text_proto = """
activation: NONE
regularizer {
l2_regularizer {
}
}
initializer {
truncated_normal_initializer {
}
}
"""
text_format.Merge(hyperparams_text_proto, hyperparams)
hyperparams.op = op_type
return hyperparams_builder.build(hyperparams, is_training=True)
def test_prediction_size(self):
box_prediction_head = box_head.BoxHead(
is_training=False,
num_classes=20,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=True,
dropout_keep_prob=0.5,
box_code_size=4,
share_box_across_classes=False)
roi_pooled_features = tf.random_uniform(
[64, 7, 7, 1024], minval=-10.0, maxval=10.0, dtype=tf.float32)
prediction = box_prediction_head.predict(
roi_pooled_features=roi_pooled_features)
tf.logging.info(prediction.shape)
self.assertAllEqual([64, 1, 20, 4], prediction.get_shape().as_list())
if __name__ == '__main__':
tf.test.main()
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Mask R-CNN Class Head."""
import tensorflow as tf
from object_detection.predictors.mask_rcnn_heads import mask_rcnn_head
slim = tf.contrib.slim
class ClassHead(mask_rcnn_head.MaskRCNNHead):
"""Mask RCNN class prediction head."""
def __init__(self, is_training, num_classes, fc_hyperparams_fn,
use_dropout, dropout_keep_prob):
"""Constructor.
Args:
is_training: Indicates whether the BoxPredictor is in training mode.
num_classes: number of classes. Note that num_classes *does not*
include the background category, so if groundtruth labels take values
in {0, 1, .., K-1}, num_classes=K (and not K+1, even though the
assigned classification targets can range from {0,... K}).
fc_hyperparams_fn: A function to generate tf-slim arg_scope with
hyperparameters for fully connected ops.
use_dropout: Option to use dropout or not. Note that dropout is applied
  to the flattened roi pooled features prior to the class prediction.
dropout_keep_prob: Keep probability for dropout.
This is only used if use_dropout is True.
"""
super(ClassHead, self).__init__()
self._is_training = is_training
self._num_classes = num_classes
self._fc_hyperparams_fn = fc_hyperparams_fn
self._use_dropout = use_dropout
self._dropout_keep_prob = dropout_keep_prob
def _predict(self, roi_pooled_features):
"""Predicts boxes and class scores.
Args:
roi_pooled_features: A float tensor of shape [batch_size, height, width,
channels] containing features for a batch of images.
Returns:
class_predictions_with_background: A float tensor of shape
[batch_size, 1, num_classes + 1] representing the class predictions for
the proposals.
"""
spatial_averaged_roi_pooled_features = tf.reduce_mean(
roi_pooled_features, [1, 2], keep_dims=True, name='AvgPool')
flattened_roi_pooled_features = slim.flatten(
spatial_averaged_roi_pooled_features)
if self._use_dropout:
flattened_roi_pooled_features = slim.dropout(
flattened_roi_pooled_features,
keep_prob=self._dropout_keep_prob,
is_training=self._is_training)
with slim.arg_scope(self._fc_hyperparams_fn()):
class_predictions_with_background = slim.fully_connected(
flattened_roi_pooled_features,
self._num_classes + 1,
activation_fn=None,
scope='ClassPredictor')
class_predictions_with_background = tf.reshape(
class_predictions_with_background, [-1, 1, self._num_classes + 1])
return class_predictions_with_background
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for object_detection.predictors.mask_rcnn_heads.class_head."""
import tensorflow as tf
from google.protobuf import text_format
from object_detection.builders import hyperparams_builder
from object_detection.predictors.mask_rcnn_heads import class_head
from object_detection.protos import hyperparams_pb2
from object_detection.utils import test_case
class ClassHeadTest(test_case.TestCase):
def _build_arg_scope_with_hyperparams(self,
op_type=hyperparams_pb2.Hyperparams.FC):
hyperparams = hyperparams_pb2.Hyperparams()
hyperparams_text_proto = """
activation: NONE
regularizer {
l2_regularizer {
}
}
initializer {
truncated_normal_initializer {
}
}
"""
text_format.Merge(hyperparams_text_proto, hyperparams)
hyperparams.op = op_type
return hyperparams_builder.build(hyperparams, is_training=True)
def test_prediction_size(self):
class_prediction_head = class_head.ClassHead(
is_training=False,
num_classes=20,
fc_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
use_dropout=True,
dropout_keep_prob=0.5)
roi_pooled_features = tf.random_uniform(
[64, 7, 7, 1024], minval=-10.0, maxval=10.0, dtype=tf.float32)
prediction = class_prediction_head.predict(
roi_pooled_features=roi_pooled_features)
tf.logging.info(prediction.shape)
self.assertAllEqual([64, 1, 21], prediction.get_shape().as_list())
if __name__ == '__main__':
tf.test.main()
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Mask R-CNN Keypoint Head."""
import tensorflow as tf
from object_detection.predictors.mask_rcnn_heads import mask_rcnn_head
slim = tf.contrib.slim
class KeypointHead(mask_rcnn_head.MaskRCNNHead):
"""Mask RCNN keypoint prediction head."""
def __init__(self,
num_keypoints=17,
conv_hyperparams_fn=None,
keypoint_heatmap_height=56,
keypoint_heatmap_width=56,
keypoint_prediction_num_conv_layers=8,
keypoint_prediction_conv_depth=512):
"""Constructor.
Args:
num_keypoints: (int scalar) number of keypoints.
conv_hyperparams_fn: A function to generate tf-slim arg_scope with
hyperparameters for convolution ops.
keypoint_heatmap_height: Desired output heatmap height. The default value
  is 56.
keypoint_heatmap_width: Desired output heatmap width. The default value
  is 56.
keypoint_prediction_num_conv_layers: Number of convolution layers applied
  to the image_features in the keypoint prediction branch.
keypoint_prediction_conv_depth: The depth of the convolution layers
  applied to the image_features in the keypoint prediction branch.
"""
super(KeypointHead, self).__init__()
self._num_keypoints = num_keypoints
self._conv_hyperparams_fn = conv_hyperparams_fn
self._keypoint_heatmap_height = keypoint_heatmap_height
self._keypoint_heatmap_width = keypoint_heatmap_width
self._keypoint_prediction_num_conv_layers = (
keypoint_prediction_num_conv_layers)
self._keypoint_prediction_conv_depth = keypoint_prediction_conv_depth
def _predict(self, roi_pooled_features):
"""Performs keypoint prediction.
Args:
roi_pooled_features: A float tensor of shape [batch_size, height, width,
channels] containing features for a batch of images.
Returns:
keypoint_heatmaps: A float tensor of shape
  [batch_size, 1, num_keypoints, heatmap_height, heatmap_width].
"""
with slim.arg_scope(self._conv_hyperparams_fn()):
net = slim.conv2d(
roi_pooled_features,
self._keypoint_prediction_conv_depth, [3, 3],
scope='conv_1')
for i in range(1, self._keypoint_prediction_num_conv_layers):
net = slim.conv2d(
net,
self._keypoint_prediction_conv_depth, [3, 3],
scope='conv_%d' % (i + 1))
net = slim.conv2d_transpose(
net, self._num_keypoints, [2, 2], scope='deconv1')
heatmaps_mask = tf.image.resize_bilinear(
net, [self._keypoint_heatmap_height, self._keypoint_heatmap_width],
align_corners=True,
name='upsample')
return tf.expand_dims(
tf.transpose(heatmaps_mask, perm=[0, 3, 1, 2]),
axis=1,
name='KeypointPredictor')
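# Shape walk-through (illustrative; matches the unit test that follows): for
# roi_pooled_features of shape [64, 14, 14, 1024] with the defaults above,
# the eight 3x3 convolutions keep the spatial size and produce
# [64, 14, 14, 512]; the 2x2 conv2d_transpose (default stride 1) maps the
# channels to num_keypoints, giving [64, 14, 14, 17]; the bilinear resize
# upsamples to [64, 56, 56, 17]; and the final transpose + expand_dims
# yields [64, 1, 17, 56, 56].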
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for object_detection.predictors.mask_rcnn_heads.keypoint_head."""
import tensorflow as tf
from google.protobuf import text_format
from object_detection.builders import hyperparams_builder
from object_detection.predictors.mask_rcnn_heads import keypoint_head
from object_detection.protos import hyperparams_pb2
from object_detection.utils import test_case
class KeypointHeadTest(test_case.TestCase):
def _build_arg_scope_with_hyperparams(self,
op_type=hyperparams_pb2.Hyperparams.FC):
hyperparams = hyperparams_pb2.Hyperparams()
hyperparams_text_proto = """
activation: NONE
regularizer {
l2_regularizer {
}
}
initializer {
truncated_normal_initializer {
}
}
"""
text_format.Merge(hyperparams_text_proto, hyperparams)
hyperparams.op = op_type
return hyperparams_builder.build(hyperparams, is_training=True)
def test_prediction_size(self):
keypoint_prediction_head = keypoint_head.KeypointHead(
conv_hyperparams_fn=self._build_arg_scope_with_hyperparams())
roi_pooled_features = tf.random_uniform(
[64, 14, 14, 1024], minval=-2.0, maxval=2.0, dtype=tf.float32)
prediction = keypoint_prediction_head.predict(
roi_pooled_features=roi_pooled_features)
tf.logging.info(prediction.shape)
self.assertAllEqual([64, 1, 17, 56, 56], prediction.get_shape().as_list())
if __name__ == '__main__':
tf.test.main()
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Mask R-CNN Mask Head."""
import math
import tensorflow as tf
from object_detection.predictors.mask_rcnn_heads import mask_rcnn_head
slim = tf.contrib.slim
class MaskHead(mask_rcnn_head.MaskRCNNHead):
"""Mask RCNN mask prediction head."""
def __init__(self,
num_classes,
conv_hyperparams_fn=None,
mask_height=14,
mask_width=14,
mask_prediction_num_conv_layers=2,
mask_prediction_conv_depth=256,
masks_are_class_agnostic=False):
"""Constructor.
Args:
num_classes: number of classes. Note that num_classes *does not*
include the background category, so if groundtruth labels take values
in {0, 1, .., K-1}, num_classes=K (and not K+1, even though the
assigned classification targets can range from {0,... K}).
conv_hyperparams_fn: A function to generate tf-slim arg_scope with
hyperparameters for convolution ops.
mask_height: Desired output mask height. The default value is 14.
mask_width: Desired output mask width. The default value is 14.
mask_prediction_num_conv_layers: Number of convolution layers applied to
the image_features in mask prediction branch.
mask_prediction_conv_depth: The depth of the convolution layers applied
  to the image_features in the mask prediction branch. If set to 0, the
  depth of the convolution layers will be automatically chosen based on
  the number of object classes and the number of channels in the image
  features.
masks_are_class_agnostic: Boolean determining if the mask-head is
class-agnostic or not.
Raises:
ValueError: If conv_hyperparams_fn is None.
"""
super(MaskHead, self).__init__()
self._num_classes = num_classes
self._conv_hyperparams_fn = conv_hyperparams_fn
self._mask_height = mask_height
self._mask_width = mask_width
self._mask_prediction_num_conv_layers = mask_prediction_num_conv_layers
self._mask_prediction_conv_depth = mask_prediction_conv_depth
self._masks_are_class_agnostic = masks_are_class_agnostic
if conv_hyperparams_fn is None:
raise ValueError('conv_hyperparams_fn is None.')
def _get_mask_predictor_conv_depth(self,
num_feature_channels,
num_classes,
class_weight=3.0,
feature_weight=2.0):
"""Computes the depth of the mask predictor convolutions.
Computes the depth of the mask predictor convolutions given feature channels
and number of classes by performing a weighted average of the two in
log space to compute the number of convolution channels. The weights that
are used for computing the weighted average do not need to sum to 1.
Args:
num_feature_channels: An integer containing the number of feature
channels.
num_classes: An integer containing the number of classes.
class_weight: Class weight used in computing the weighted average.
feature_weight: Feature weight used in computing the weighted average.
Returns:
An integer containing the number of convolution channels used by mask
predictor.
"""
num_feature_channels_log = math.log(float(num_feature_channels), 2.0)
num_classes_log = math.log(float(num_classes), 2.0)
weighted_num_feature_channels_log = (
num_feature_channels_log * feature_weight)
weighted_num_classes_log = num_classes_log * class_weight
total_weight = feature_weight + class_weight
num_conv_channels_log = round(
(weighted_num_feature_channels_log + weighted_num_classes_log) /
total_weight)
return int(math.pow(2.0, num_conv_channels_log))
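# Worked example (illustrative, not part of the library): with
# num_feature_channels=1024 and num_classes=20, log2(1024) = 10.0 and
# log2(20) ~= 4.32, so the weighted average is
# (10.0 * 2.0 + 4.32 * 3.0) / (2.0 + 3.0) ~= 6.59, which rounds to 7 and
# yields 2**7 = 128 convolution channels.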
def _predict(self, roi_pooled_features):
"""Performs mask prediction.
Args:
roi_pooled_features: A float tensor of shape [batch_size, height, width,
channels] containing features for a batch of images.
Returns:
instance_masks: A float tensor of shape
[batch_size, 1, num_classes, mask_height, mask_width].
"""
num_conv_channels = self._mask_prediction_conv_depth
if num_conv_channels == 0:
num_feature_channels = roi_pooled_features.get_shape().as_list()[3]
num_conv_channels = self._get_mask_predictor_conv_depth(
num_feature_channels, self._num_classes)
with slim.arg_scope(self._conv_hyperparams_fn()):
upsampled_features = tf.image.resize_bilinear(
roi_pooled_features, [self._mask_height, self._mask_width],
align_corners=True)
for _ in range(self._mask_prediction_num_conv_layers - 1):
upsampled_features = slim.conv2d(
upsampled_features,
num_outputs=num_conv_channels,
kernel_size=[3, 3])
num_masks = 1 if self._masks_are_class_agnostic else self._num_classes
mask_predictions = slim.conv2d(
upsampled_features,
num_outputs=num_masks,
activation_fn=None,
kernel_size=[3, 3])
return tf.expand_dims(
tf.transpose(mask_predictions, perm=[0, 3, 1, 2]),
axis=1,
name='MaskPredictor')
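# Shape walk-through (illustrative; matches the unit test that follows): for
# roi_pooled_features of shape [64, 7, 7, 1024] with num_classes=20 and the
# defaults above, the features are first bilinearly resized to
# [64, 14, 14, 1024], the single intermediate 3x3 conv
# (num_conv_layers - 1 = 1) maps them to [64, 14, 14, 256], the final 3x3
# conv emits one channel per class giving [64, 14, 14, 20], and the
# transpose + expand_dims yields [64, 1, 20, 14, 14].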
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for object_detection.predictors.mask_rcnn_heads.mask_head."""
import tensorflow as tf
from google.protobuf import text_format
from object_detection.builders import hyperparams_builder
from object_detection.predictors.mask_rcnn_heads import mask_head
from object_detection.protos import hyperparams_pb2
from object_detection.utils import test_case
class MaskHeadTest(test_case.TestCase):
def _build_arg_scope_with_hyperparams(self,
op_type=hyperparams_pb2.Hyperparams.FC):
hyperparams = hyperparams_pb2.Hyperparams()
hyperparams_text_proto = """
activation: NONE
regularizer {
l2_regularizer {
}
}
initializer {
truncated_normal_initializer {
}
}
"""
text_format.Merge(hyperparams_text_proto, hyperparams)
hyperparams.op = op_type
return hyperparams_builder.build(hyperparams, is_training=True)
def test_prediction_size(self):
mask_prediction_head = mask_head.MaskHead(
num_classes=20,
conv_hyperparams_fn=self._build_arg_scope_with_hyperparams(),
mask_height=14,
mask_width=14,
mask_prediction_num_conv_layers=2,
mask_prediction_conv_depth=256,
masks_are_class_agnostic=False)
roi_pooled_features = tf.random_uniform(
[64, 7, 7, 1024], minval=-10.0, maxval=10.0, dtype=tf.float32)
prediction = mask_prediction_head.predict(
roi_pooled_features=roi_pooled_features)
tf.logging.info(prediction.shape)
self.assertAllEqual([64, 1, 20, 14, 14], prediction.get_shape().as_list())
if __name__ == '__main__':
tf.test.main()
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Base Mask RCNN head class."""
from abc import ABCMeta
from abc import abstractmethod
import six
@six.add_metaclass(ABCMeta)
class MaskRCNNHead(object):
"""Mask RCNN head base class."""
def __init__(self):
"""Constructor."""
def predict(self, roi_pooled_features):
"""Returns the head's predictions.
Args:
roi_pooled_features: A float tensor of shape
[batch_size, height, width, channels] containing ROI pooled features
from a batch of boxes.
"""
return self._predict(roi_pooled_features)
@abstractmethod
def _predict(self, roi_pooled_features):
"""The abstract internal prediction function that needs to be overloaded.
Args:
roi_pooled_features: A float tensor of shape
[batch_size, height, width, channels] containing ROI pooled features
from a batch of boxes.
"""