Unverified Commit 8e73530e authored by Jonathan Huang, committed by GitHub

Merge pull request #4161 from pkulzc/master

Internal changes to object detection
parents 18d05ad3 63054210
...@@ -90,6 +90,15 @@ reporting an issue.
## Release information
### April 30, 2018
We have released a Faster R-CNN detector with a ResNet-101 feature extractor trained on [AVA](https://research.google.com/ava/) v2.1.
Compared with other commonly used object detectors, it changes the action classification loss to a per-class sigmoid loss to handle boxes with multiple labels.
The model is trained on the training split of AVA v2.1 for 1.5M iterations and achieves a mean AP of 11.25% over 60 classes on the validation split of AVA v2.1.
For more details please refer to this [paper](https://arxiv.org/abs/1705.08421).
<b>Thanks to contributors</b>: Chen Sun, David Ross
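To make the multi-label point above concrete, here is a minimal sketch (with made-up logits and targets, not the pipeline's actual loss code) of how a per-class sigmoid cross-entropy scores a box that carries more than one action label, which a softmax cross-entropy cannot do without forcing the labels to compete:

```python
import tensorflow as tf

# Hypothetical values: three person boxes scored against four action classes.
# A box may carry several labels at once (e.g. both "sit" and "talk to"),
# so the targets are multi-hot rather than one-hot.
multi_hot_targets = tf.constant([[1., 0., 0., 1.],
                                 [0., 1., 0., 0.],
                                 [1., 1., 0., 0.]])
class_logits = tf.constant([[2.3, -1.1, 0.2, 1.7],
                            [-0.4, 1.9, -2.0, 0.3],
                            [1.2, 0.8, -0.5, -1.3]])

# Per-class sigmoid cross-entropy treats every class as an independent
# binary decision, so several classes can be positive for the same box.
per_class_loss = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=multi_hot_targets, logits=class_logits)
per_box_loss = tf.reduce_sum(per_class_loss, axis=-1)

with tf.Session() as sess:
  print(sess.run(per_box_loss))
```

In the released pipeline this corresponds to the `weighted_sigmoid` classification loss and `SIGMOID` score converter that appear in the config further down this page.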
### April 2, 2018
Supercharge your mobile phones with the next generation mobile object detector!
......
item {
name: "bend/bow (at the waist)"
id: 1
}
item {
name: "crouch/kneel"
id: 3
}
item {
name: "dance"
id: 4
}
item {
name: "fall down"
id: 5
}
item {
name: "get up"
id: 6
}
item {
name: "jump/leap"
id: 7
}
item {
name: "lie/sleep"
id: 8
}
item {
name: "martial art"
id: 9
}
item {
name: "run/jog"
id: 10
}
item {
name: "sit"
id: 11
}
item {
name: "stand"
id: 12
}
item {
name: "swim"
id: 13
}
item {
name: "walk"
id: 14
}
item {
name: "answer phone"
id: 15
}
item {
name: "carry/hold (an object)"
id: 17
}
item {
name: "climb (e.g., a mountain)"
id: 20
}
item {
name: "close (e.g., a door, a box)"
id: 22
}
item {
name: "cut"
id: 24
}
item {
name: "dress/put on clothing"
id: 26
}
item {
name: "drink"
id: 27
}
item {
name: "drive (e.g., a car, a truck)"
id: 28
}
item {
name: "eat"
id: 29
}
item {
name: "enter"
id: 30
}
item {
name: "hit (an object)"
id: 34
}
item {
name: "lift/pick up"
id: 36
}
item {
name: "listen (e.g., to music)"
id: 37
}
item {
name: "open (e.g., a window, a car door)"
id: 38
}
item {
name: "play musical instrument"
id: 41
}
item {
name: "point to (an object)"
id: 43
}
item {
name: "pull (an object)"
id: 45
}
item {
name: "push (an object)"
id: 46
}
item {
name: "put down"
id: 47
}
item {
name: "read"
id: 48
}
item {
name: "ride (e.g., a bike, a car, a horse)"
id: 49
}
item {
name: "sail boat"
id: 51
}
item {
name: "shoot"
id: 52
}
item {
name: "smoke"
id: 54
}
item {
name: "take a photo"
id: 56
}
item {
name: "text on/look at a cellphone"
id: 57
}
item {
name: "throw"
id: 58
}
item {
name: "touch (an object)"
id: 59
}
item {
name: "turn (e.g., a screwdriver)"
id: 60
}
item {
name: "watch (e.g., TV)"
id: 61
}
item {
name: "work on a computer"
id: 62
}
item {
name: "write"
id: 63
}
item {
name: "fight/hit (a person)"
id: 64
}
item {
name: "give/serve (an object) to (a person)"
id: 65
}
item {
name: "grab (a person)"
id: 66
}
item {
name: "hand clap"
id: 67
}
item {
name: "hand shake"
id: 68
}
item {
name: "hand wave"
id: 69
}
item {
name: "hug (a person)"
id: 70
}
item {
name: "kiss (a person)"
id: 72
}
item {
name: "lift (a person)"
id: 73
}
item {
name: "listen to (a person)"
id: 74
}
item {
name: "push (another person)"
id: 76
}
item {
name: "sing to (e.g., self, a person, a group)"
id: 77
}
item {
name: "take (an object) from (a person)"
id: 78
}
item {
name: "talk to (e.g., self, a person, a group)"
id: 79
}
item {
name: "watch (a person)"
id: 80
}
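Note that the ids above keep AVA's original class numbering and are not contiguous, so any lookup built from this file should be keyed by id rather than by position. Below is a small sketch using the Object Detection API's label map utilities; the local path is an assumption:

```python
from object_detection.utils import label_map_util

# Assumed local path to the label map shown above.
LABEL_MAP_PATH = 'object_detection/data/ava_label_map_v2.1.pbtxt'

# Parse the pbtxt and build an {id: {'id': ..., 'name': ...}} lookup for
# mapping detection class ids back to AVA action names. max_num_classes is
# the largest id (80), not the number of entries (60), because the ids skip
# unused classes.
label_map = label_map_util.load_labelmap(LABEL_MAP_PATH)
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=80, use_display_name=False)
category_index = label_map_util.create_category_index(categories)

print(category_index[80]['name'])  # 'watch (a person)'
```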
...@@ -91,7 +91,7 @@ Some remarks on frozen inference graphs:
## Kitti-trained models {#kitti-models}
-Model name | Speed (ms) | Pascal mAP@0.5 (ms) | Outputs
+Model name | Speed (ms) | Pascal mAP@0.5 | Outputs
----------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---: | :-------------: | :-----:
[faster_rcnn_resnet101_kitti](http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_kitti_2018_01_28.tar.gz) | 79 | 87 | Boxes
...@@ -103,6 +103,13 @@ Model name
[faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid](http://download.tensorflow.org/models/object_detection/faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid_2018_01_28.tar.gz) | 347 | | Boxes
## AVA v2.1 trained models {#ava-models}
Model name | Speed (ms) | Pascal mAP@0.5 | Outputs
----------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---: | :-------------: | :-----:
[faster_rcnn_resnet101_ava_v2.1](http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_ava_v2.1_2018_04_30.tar.gz) | 93 | 11 | Boxes
[^1]: See [MSCOCO evaluation protocol](http://cocodataset.org/#detections-eval).
[^2]: This is PASCAL mAP with a slightly different way of true positives computation: see [Open Images evaluation protocol](evaluation_protocols.md#open-images).
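A quick example of fetching the AVA model listed above; the URL is taken from the table, while the extraction directory and the remark about the tarball contents reflect the usual model zoo layout rather than anything specific to this release:

```python
import tarfile
import urllib.request

# URL copied from the AVA table above.
MODEL_URL = ('http://download.tensorflow.org/models/object_detection/'
             'faster_rcnn_resnet101_ava_v2.1_2018_04_30.tar.gz')
ARCHIVE = 'faster_rcnn_resnet101_ava_v2.1_2018_04_30.tar.gz'

urllib.request.urlretrieve(MODEL_URL, ARCHIVE)
with tarfile.open(ARCHIVE) as tar:
  # Model zoo tarballs typically unpack to a directory containing a frozen
  # inference graph and training checkpoint files.
  tar.extractall('.')
```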
...@@ -325,16 +325,16 @@ def create_model_fn(detection_model_fn, configs, hparams, use_tpu=False):
    }
    eval_metric_ops = None
-   if mode in (tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL):
+   if mode == tf.estimator.ModeKeys.EVAL:
      class_agnostic = (fields.DetectionResultFields.detection_classes
                        not in detections)
      groundtruth = _get_groundtruth_data(detection_model, class_agnostic)
      use_original_images = fields.InputDataFields.original_image in features
-     original_images = (
+     eval_images = (
          features[fields.InputDataFields.original_image] if use_original_images
          else features[fields.InputDataFields.image])
      eval_dict = eval_util.result_dict_for_single_example(
-         original_images[0:1],
+         eval_images[0:1],
          features[inputs.HASH_KEY][0],
          detections,
          groundtruth,
...@@ -355,7 +355,6 @@ def create_model_fn(detection_model_fn, configs, hparams, use_tpu=False):
        img_summary = tf.summary.image('Detections_Left_Groundtruth_Right',
                                       detection_and_groundtruth)
-     if mode == tf.estimator.ModeKeys.EVAL:
      # Eval metrics on a single example.
      eval_metrics = eval_config.metrics_set
      if not eval_metrics:
......
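For readers less familiar with `tf.estimator`, the change above moves all of the evaluation-only work under `ModeKeys.EVAL`. The following toy `model_fn`, unrelated to the detection code and written against assumed inputs (a `features['x']` tensor and integer `labels`), sketches the same pattern of building `eval_metric_ops` only in EVAL mode:

```python
import tensorflow as tf

def model_fn(features, labels, mode):
  """Toy model_fn: loss in TRAIN/EVAL, metric ops only in EVAL."""
  logits = tf.layers.dense(tf.cast(features['x'], tf.float32), 2)
  predictions = tf.argmax(logits, axis=-1)

  loss = None
  train_op = None
  eval_metric_ops = None
  if mode in (tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL):
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
  if mode == tf.estimator.ModeKeys.TRAIN:
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
  if mode == tf.estimator.ModeKeys.EVAL:
    # Building the metric ops only here avoids paying their cost during
    # training, which is what the change above does for detection metrics.
    eval_metric_ops = {
        'accuracy': tf.metrics.accuracy(labels=labels, predictions=predictions)
    }
  return tf.estimator.EstimatorSpec(
      mode, predictions=predictions, loss=loss, train_op=train_op,
      eval_metric_ops=eval_metric_ops)
```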
# Faster R-CNN with Resnet-101 (v1), configuration for AVA v2.1.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
model {
faster_rcnn {
num_classes: 80
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 600
max_dimension: 1024
}
}
feature_extractor {
type: 'faster_rcnn_resnet101'
first_stage_features_stride: 16
}
first_stage_anchor_generator {
grid_anchor_generator {
scales: [0.25, 0.5, 1.0, 2.0]
aspect_ratios: [0.5, 1.0, 2.0]
height_stride: 16
width_stride: 16
}
}
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
truncated_normal_initializer {
stddev: 0.01
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.7
first_stage_max_proposals: 300
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
second_stage_box_predictor {
mask_rcnn_box_predictor {
use_dropout: false
dropout_keep_probability: 1.0
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.6
max_detections_per_class: 100
max_total_detections: 300
}
score_converter: SIGMOID
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
second_stage_classification_loss {
weighted_sigmoid {
anchorwise_output: true
}
}
}
}
train_config: {
batch_size: 1
num_steps: 1500000
optimizer {
momentum_optimizer: {
learning_rate: {
manual_step_learning_rate {
initial_learning_rate: 0.0003
schedule {
step: 1200000
learning_rate: .00003
}
}
}
momentum_optimizer_value: 0.9
}
use_moving_average: false
}
gradient_clipping_by_norm: 10.0
merge_multiple_label_boxes: true
fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
data_augmentation_options {
random_horizontal_flip {
}
}
max_number_of_boxes: 100
}
train_input_reader: {
tf_record_input_reader {
input_path: "PATH_TO_BE_CONFIGURED/ava_train.record"
}
label_map_path: "PATH_TO_BE_CONFIGURED/ava_label_map_v2.1.pbtxt"
}
eval_config: {
metrics_set: "pascal_voc_detection_metrics"
use_moving_averages: false
num_examples: 57371
}
eval_input_reader: {
tf_record_input_reader {
input_path: "PATH_TO_BE_CONFIGURED/ava_val.record"
}
label_map_path: "PATH_TO_BE_CONFIGURED/ava_label_map_v2.1.pbtxt"
shuffle: false
num_readers: 1
}
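One way to fill in the `PATH_TO_BE_CONFIGURED` placeholders without editing the file by hand is to round-trip the config through the pipeline proto, which also catches malformed edits early. The file and data paths below are assumptions:

```python
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

# Hypothetical paths; adjust to your checkout and data locations.
CONFIG_PATH = 'faster_rcnn_resnet101_ava_v2.1.config'
DATA_DIR = '/data/ava'

# Substitute the placeholders, then parse the result into the pipeline proto
# so that typos fail loudly here rather than at training time.
with open(CONFIG_PATH) as f:
  config_text = f.read().replace('PATH_TO_BE_CONFIGURED', DATA_DIR)

pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
text_format.Merge(config_text, pipeline_config)

with open('ava_pipeline.config', 'w') as f:
  f.write(text_format.MessageToString(pipeline_config))

print(pipeline_config.model.faster_rcnn.num_classes)      # 80
print(pipeline_config.train_config.fine_tune_checkpoint)  # /data/ava/model.ckpt
```

The round-tripped file is then passed to the training binary's pipeline config flag (at the time of this commit, `object_detection/train.py --pipeline_config_path=... --train_dir=...`).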
...@@ -54,6 +54,7 @@ model {
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 3
+       use_depthwise: true
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
......
...@@ -774,8 +774,8 @@ def nearest_neighbor_upsampling(input_tensor, scale):
  Nearest neighbor upsampling function that maps input tensor with shape
  [batch_size, height, width, channels] to [batch_size, height * scale
-  , width * scale, channels]. This implementation only uses reshape and tile to
-  make it compatible with certain hardware.
+  , width * scale, channels]. This implementation only uses reshape and
+  broadcasting to make it TPU compatible.
  Args:
    input_tensor: A float32 tensor of size [batch, height_in, width_in,
...@@ -785,13 +785,14 @@ def nearest_neighbor_upsampling(input_tensor, scale):
    data_up: A float32 tensor of size
      [batch, height_in*scale, width_in*scale, channels].
  """
-  shape = shape_utils.combined_static_and_dynamic_shape(input_tensor)
-  shape_before_tile = [shape[0], shape[1], 1, shape[2], 1, shape[3]]
-  shape_after_tile = [shape[0], shape[1] * scale, shape[2] * scale, shape[3]]
-  data_reshaped = tf.reshape(input_tensor, shape_before_tile)
-  resized_tensor = tf.tile(data_reshaped, [1, 1, scale, 1, scale, 1])
-  resized_tensor = tf.reshape(resized_tensor, shape_after_tile)
-  return resized_tensor
+  with tf.name_scope('nearest_neighbor_upsampling'):
+    (batch_size, height, width,
+     channels) = shape_utils.combined_static_and_dynamic_shape(input_tensor)
+    output_tensor = tf.reshape(
+        input_tensor, [batch_size, height, 1, width, 1, channels]) * tf.ones(
+            [1, 1, scale, 1, scale, 1], dtype=input_tensor.dtype)
+    return tf.reshape(output_tensor,
+                      [batch_size, height * scale, width * scale, channels])
def matmul_gather_on_zeroth_axis(params, indices, scope=None):
......
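As a quick sanity check of the reshape-and-broadcast trick introduced above, the standalone sketch below applies the same operations to a toy tensor (it does not call the `ops.py` helper itself):

```python
import numpy as np
import tensorflow as tf

# Toy input: batch of 1, a 2x2 spatial grid, 1 channel, upsampled by 2.
data = np.arange(4, dtype=np.float32).reshape([1, 2, 2, 1])
scale = 2
batch, height, width, channels = 1, 2, 2, 1
x = tf.constant(data)

# Inserting singleton axes after the height and width dimensions and
# multiplying by a ones tensor of shape [1, 1, scale, 1, scale, 1]
# replicates every pixel scale x scale times without using tf.tile.
upsampled = tf.reshape(
    tf.reshape(x, [batch, height, 1, width, 1, channels]) *
    tf.ones([1, 1, scale, 1, scale, 1], dtype=x.dtype),
    [batch, height * scale, width * scale, channels])

with tf.Session() as sess:
  print(sess.run(upsampled)[0, :, :, 0])
  # [[0. 0. 1. 1.]
  #  [0. 0. 1. 1.]
  #  [2. 2. 3. 3.]
  #  [2. 2. 3. 3.]]
```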