Unverified Commit 8e73530e authored by Jonathan Huang, committed by GitHub

Merge pull request #4161 from pkulzc/master

Internal changes to object detection
parents 18d05ad3 63054210
...@@ -90,6 +90,15 @@ reporting an issue.
## Release information
### April 30, 2018
We have released a Faster R-CNN detector with a ResNet-101 feature extractor trained on [AVA](https://research.google.com/ava/) v2.1.
Compared with other commonly used object detectors, it changes the action classification loss to a per-class sigmoid loss to handle boxes with multiple labels.
The model is trained on the training split of AVA v2.1 for 1.5M iterations and achieves a mean AP of 11.25% over 60 classes on the validation split of AVA v2.1.
For more details please refer to this [paper](https://arxiv.org/abs/1705.08421).
<b>Thanks to contributors</b>: Chen Sun, David Ross
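To make the multi-label point above concrete, here is a minimal sketch (with made-up logits and targets, not the pipeline's actual loss code) of how a per-class sigmoid cross-entropy scores a box that carries more than one action label, which a softmax cross-entropy cannot do without forcing the labels to compete:

```python
import tensorflow as tf

# Hypothetical values: three person boxes scored against four action classes.
# A box may carry several labels at once (e.g. both "sit" and "talk to"),
# so the targets are multi-hot rather than one-hot.
multi_hot_targets = tf.constant([[1., 0., 0., 1.],
                                 [0., 1., 0., 0.],
                                 [1., 1., 0., 0.]])
class_logits = tf.constant([[2.3, -1.1, 0.2, 1.7],
                            [-0.4, 1.9, -2.0, 0.3],
                            [1.2, 0.8, -0.5, -1.3]])

# Per-class sigmoid cross-entropy treats every class as an independent
# binary decision, so several classes can be positive for the same box.
per_class_loss = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=multi_hot_targets, logits=class_logits)
per_box_loss = tf.reduce_sum(per_class_loss, axis=-1)

with tf.Session() as sess:
  print(sess.run(per_box_loss))
```

In the released pipeline this corresponds to the `weighted_sigmoid` classification loss and `SIGMOID` score converter that appear in the config further down this page.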
### April 2, 2018
Supercharge your mobile phones with the next generation mobile object detector!
......
item {
name: "bend/bow (at the waist)"
id: 1
}
item {
name: "crouch/kneel"
id: 3
}
item {
name: "dance"
id: 4
}
item {
name: "fall down"
id: 5
}
item {
name: "get up"
id: 6
}
item {
name: "jump/leap"
id: 7
}
item {
name: "lie/sleep"
id: 8
}
item {
name: "martial art"
id: 9
}
item {
name: "run/jog"
id: 10
}
item {
name: "sit"
id: 11
}
item {
name: "stand"
id: 12
}
item {
name: "swim"
id: 13
}
item {
name: "walk"
id: 14
}
item {
name: "answer phone"
id: 15
}
item {
name: "carry/hold (an object)"
id: 17
}
item {
name: "climb (e.g., a mountain)"
id: 20
}
item {
name: "close (e.g., a door, a box)"
id: 22
}
item {
name: "cut"
id: 24
}
item {
name: "dress/put on clothing"
id: 26
}
item {
name: "drink"
id: 27
}
item {
name: "drive (e.g., a car, a truck)"
id: 28
}
item {
name: "eat"
id: 29
}
item {
name: "enter"
id: 30
}
item {
name: "hit (an object)"
id: 34
}
item {
name: "lift/pick up"
id: 36
}
item {
name: "listen (e.g., to music)"
id: 37
}
item {
name: "open (e.g., a window, a car door)"
id: 38
}
item {
name: "play musical instrument"
id: 41
}
item {
name: "point to (an object)"
id: 43
}
item {
name: "pull (an object)"
id: 45
}
item {
name: "push (an object)"
id: 46
}
item {
name: "put down"
id: 47
}
item {
name: "read"
id: 48
}
item {
name: "ride (e.g., a bike, a car, a horse)"
id: 49
}
item {
name: "sail boat"
id: 51
}
item {
name: "shoot"
id: 52
}
item {
name: "smoke"
id: 54
}
item {
name: "take a photo"
id: 56
}
item {
name: "text on/look at a cellphone"
id: 57
}
item {
name: "throw"
id: 58
}
item {
name: "touch (an object)"
id: 59
}
item {
name: "turn (e.g., a screwdriver)"
id: 60
}
item {
name: "watch (e.g., TV)"
id: 61
}
item {
name: "work on a computer"
id: 62
}
item {
name: "write"
id: 63
}
item {
name: "fight/hit (a person)"
id: 64
}
item {
name: "give/serve (an object) to (a person)"
id: 65
}
item {
name: "grab (a person)"
id: 66
}
item {
name: "hand clap"
id: 67
}
item {
name: "hand shake"
id: 68
}
item {
name: "hand wave"
id: 69
}
item {
name: "hug (a person)"
id: 70
}
item {
name: "kiss (a person)"
id: 72
}
item {
name: "lift (a person)"
id: 73
}
item {
name: "listen to (a person)"
id: 74
}
item {
name: "push (another person)"
id: 76
}
item {
name: "sing to (e.g., self, a person, a group)"
id: 77
}
item {
name: "take (an object) from (a person)"
id: 78
}
item {
name: "talk to (e.g., self, a person, a group)"
id: 79
}
item {
name: "watch (a person)"
id: 80
}
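Note that the ids above keep AVA's original class numbering and are not contiguous, so any lookup built from this file should be keyed by id rather than by position. Below is a small sketch using the Object Detection API's label map utilities; the local path is an assumption:

```python
from object_detection.utils import label_map_util

# Assumed local path to the label map shown above.
LABEL_MAP_PATH = 'object_detection/data/ava_label_map_v2.1.pbtxt'

# Parse the pbtxt and build an {id: {'id': ..., 'name': ...}} lookup for
# mapping detection class ids back to AVA action names. max_num_classes is
# the largest id (80), not the number of entries (60), because the ids skip
# unused classes.
label_map = label_map_util.load_labelmap(LABEL_MAP_PATH)
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=80, use_display_name=False)
category_index = label_map_util.create_category_index(categories)

print(category_index[80]['name'])  # 'watch (a person)'
```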
...@@ -91,7 +91,7 @@ Some remarks on frozen inference graphs:
## Kitti-trained models {#kitti-models}
-Model name | Speed (ms) | Pascal mAP@0.5 (ms) | Outputs
+Model name | Speed (ms) | Pascal mAP@0.5 | Outputs
----------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---: | :-------------: | :-----:
[faster_rcnn_resnet101_kitti](http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_kitti_2018_01_28.tar.gz) | 79 | 87 | Boxes
...@@ -103,6 +103,13 @@ Model name
[faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid](http://download.tensorflow.org/models/object_detection/faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid_2018_01_28.tar.gz) | 347 | | Boxes
## AVA v2.1 trained models {#ava-models}
Model name | Speed (ms) | Pascal mAP@0.5 | Outputs
----------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---: | :-------------: | :-----:
[faster_rcnn_resnet101_ava_v2.1](http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_ava_v2.1_2018_04_30.tar.gz) | 93 | 11 | Boxes
[^1]: See [MSCOCO evaluation protocol](http://cocodataset.org/#detections-eval).
[^2]: This is PASCAL mAP with a slightly different way of true positives computation: see [Open Images evaluation protocol](evaluation_protocols.md#open-images).
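A quick example of fetching the AVA model listed above; the URL is taken from the table, while the extraction directory and the remark about the tarball contents reflect the usual model zoo layout rather than anything specific to this release:

```python
import tarfile
import urllib.request

# URL copied from the AVA table above.
MODEL_URL = ('http://download.tensorflow.org/models/object_detection/'
             'faster_rcnn_resnet101_ava_v2.1_2018_04_30.tar.gz')
ARCHIVE = 'faster_rcnn_resnet101_ava_v2.1_2018_04_30.tar.gz'

urllib.request.urlretrieve(MODEL_URL, ARCHIVE)
with tarfile.open(ARCHIVE) as tar:
  # Model zoo tarballs typically unpack to a directory containing a frozen
  # inference graph and training checkpoint files.
  tar.extractall('.')
```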
...@@ -325,16 +325,16 @@ def create_model_fn(detection_model_fn, configs, hparams, use_tpu=False):
    }
    eval_metric_ops = None
-   if mode in (tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL):
+   if mode == tf.estimator.ModeKeys.EVAL:
      class_agnostic = (fields.DetectionResultFields.detection_classes
                        not in detections)
      groundtruth = _get_groundtruth_data(detection_model, class_agnostic)
      use_original_images = fields.InputDataFields.original_image in features
-     original_images = (
+     eval_images = (
          features[fields.InputDataFields.original_image] if use_original_images
          else features[fields.InputDataFields.image])
      eval_dict = eval_util.result_dict_for_single_example(
-         original_images[0:1],
+         eval_images[0:1],
          features[inputs.HASH_KEY][0],
          detections,
          groundtruth,
...@@ -355,7 +355,6 @@ def create_model_fn(detection_model_fn, configs, hparams, use_tpu=False):
        img_summary = tf.summary.image('Detections_Left_Groundtruth_Right',
                                       detection_and_groundtruth)
-     if mode == tf.estimator.ModeKeys.EVAL:
      # Eval metrics on a single example.
      eval_metrics = eval_config.metrics_set
      if not eval_metrics:
......
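For readers less familiar with `tf.estimator`, the change above moves all of the evaluation-only work under `ModeKeys.EVAL`. The following toy `model_fn`, unrelated to the detection code and written against assumed inputs (a `features['x']` tensor and integer `labels`), sketches the same pattern of building `eval_metric_ops` only in EVAL mode:

```python
import tensorflow as tf

def model_fn(features, labels, mode):
  """Toy model_fn: loss in TRAIN/EVAL, metric ops only in EVAL."""
  logits = tf.layers.dense(tf.cast(features['x'], tf.float32), 2)
  predictions = tf.argmax(logits, axis=-1)

  loss = None
  train_op = None
  eval_metric_ops = None
  if mode in (tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL):
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
  if mode == tf.estimator.ModeKeys.TRAIN:
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
  if mode == tf.estimator.ModeKeys.EVAL:
    # Building the metric ops only here avoids paying their cost during
    # training, which is what the change above does for detection metrics.
    eval_metric_ops = {
        'accuracy': tf.metrics.accuracy(labels=labels, predictions=predictions)
    }
  return tf.estimator.EstimatorSpec(
      mode, predictions=predictions, loss=loss, train_op=train_op,
      eval_metric_ops=eval_metric_ops)
```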
# Faster R-CNN with Resnet-101 (v1), configuration for AVA v2.1.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
model {
faster_rcnn {
num_classes: 80
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 600
max_dimension: 1024
}
}
feature_extractor {
type: 'faster_rcnn_resnet101'
first_stage_features_stride: 16
}
first_stage_anchor_generator {
grid_anchor_generator {
scales: [0.25, 0.5, 1.0, 2.0]
aspect_ratios: [0.5, 1.0, 2.0]
height_stride: 16
width_stride: 16
}
}
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
truncated_normal_initializer {
stddev: 0.01
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.7
first_stage_max_proposals: 300
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
second_stage_box_predictor {
mask_rcnn_box_predictor {
use_dropout: false
dropout_keep_probability: 1.0
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.6
max_detections_per_class: 100
max_total_detections: 300
}
score_converter: SIGMOID
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
second_stage_classification_loss {
weighted_sigmoid {
anchorwise_output: true
}
}
}
}
train_config: {
batch_size: 1
num_steps: 1500000
optimizer {
momentum_optimizer: {
learning_rate: {
manual_step_learning_rate {
initial_learning_rate: 0.0003
schedule {
step: 1200000
learning_rate: .00003
}
}
}
momentum_optimizer_value: 0.9
}
use_moving_average: false
}
gradient_clipping_by_norm: 10.0
merge_multiple_label_boxes: true
fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
data_augmentation_options {
random_horizontal_flip {
}
}
max_number_of_boxes: 100
}
train_input_reader: {
tf_record_input_reader {
input_path: "PATH_TO_BE_CONFIGURED/ava_train.record"
}
label_map_path: "PATH_TO_BE_CONFIGURED/ava_label_map_v2.1.pbtxt"
}
eval_config: {
metrics_set: "pascal_voc_detection_metrics"
use_moving_averages: false
num_examples: 57371
}
eval_input_reader: {
tf_record_input_reader {
input_path: "PATH_TO_BE_CONFIGURED/ava_val.record"
}
label_map_path: "PATH_TO_BE_CONFIGURED/ava_label_map_v2.1.pbtxt"
shuffle: false
num_readers: 1
}
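One way to fill in the `PATH_TO_BE_CONFIGURED` placeholders without editing the file by hand is to round-trip the config through the pipeline proto, which also catches malformed edits early. The file and data paths below are assumptions:

```python
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

# Hypothetical paths; adjust to your checkout and data locations.
CONFIG_PATH = 'faster_rcnn_resnet101_ava_v2.1.config'
DATA_DIR = '/data/ava'

# Substitute the placeholders, then parse the result into the pipeline proto
# so that typos fail loudly here rather than at training time.
with open(CONFIG_PATH) as f:
  config_text = f.read().replace('PATH_TO_BE_CONFIGURED', DATA_DIR)

pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
text_format.Merge(config_text, pipeline_config)

with open('ava_pipeline.config', 'w') as f:
  f.write(text_format.MessageToString(pipeline_config))

print(pipeline_config.model.faster_rcnn.num_classes)      # 80
print(pipeline_config.train_config.fine_tune_checkpoint)  # /data/ava/model.ckpt
```

The round-tripped file is then passed to the training binary's pipeline config flag (at the time of this commit, `object_detection/train.py --pipeline_config_path=... --train_dir=...`).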
...@@ -54,6 +54,7 @@ model {
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 3
+       use_depthwise: true
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
......
...@@ -774,8 +774,8 @@ def nearest_neighbor_upsampling(input_tensor, scale):
  Nearest neighbor upsampling function that maps input tensor with shape
  [batch_size, height, width, channels] to [batch_size, height * scale
-  , width * scale, channels]. This implementation only uses reshape and tile to
-  make it compatible with certain hardware.
+  , width * scale, channels]. This implementation only uses reshape and
+  broadcasting to make it TPU compatible.
  Args:
    input_tensor: A float32 tensor of size [batch, height_in, width_in,
...@@ -785,13 +785,14 @@ def nearest_neighbor_upsampling(input_tensor, scale):
    data_up: A float32 tensor of size
      [batch, height_in*scale, width_in*scale, channels].
  """
-  shape = shape_utils.combined_static_and_dynamic_shape(input_tensor)
-  shape_before_tile = [shape[0], shape[1], 1, shape[2], 1, shape[3]]
-  shape_after_tile = [shape[0], shape[1] * scale, shape[2] * scale, shape[3]]
-  data_reshaped = tf.reshape(input_tensor, shape_before_tile)
-  resized_tensor = tf.tile(data_reshaped, [1, 1, scale, 1, scale, 1])
-  resized_tensor = tf.reshape(resized_tensor, shape_after_tile)
-  return resized_tensor
+  with tf.name_scope('nearest_neighbor_upsampling'):
+    (batch_size, height, width,
+     channels) = shape_utils.combined_static_and_dynamic_shape(input_tensor)
+    output_tensor = tf.reshape(
+        input_tensor, [batch_size, height, 1, width, 1, channels]) * tf.ones(
+            [1, 1, scale, 1, scale, 1], dtype=input_tensor.dtype)
+    return tf.reshape(output_tensor,
+                      [batch_size, height * scale, width * scale, channels])
def matmul_gather_on_zeroth_axis(params, indices, scope=None):
......
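As a quick sanity check of the reshape-and-broadcast trick introduced above, the standalone sketch below applies the same operations to a toy tensor (it does not call the `ops.py` helper itself):

```python
import numpy as np
import tensorflow as tf

# Toy input: batch of 1, a 2x2 spatial grid, 1 channel, upsampled by 2.
data = np.arange(4, dtype=np.float32).reshape([1, 2, 2, 1])
scale = 2
batch, height, width, channels = 1, 2, 2, 1
x = tf.constant(data)

# Inserting singleton axes after the height and width dimensions and
# multiplying by a ones tensor of shape [1, 1, scale, 1, scale, 1]
# replicates every pixel scale x scale times without using tf.tile.
upsampled = tf.reshape(
    tf.reshape(x, [batch, height, 1, width, 1, channels]) *
    tf.ones([1, 1, scale, 1, scale, 1], dtype=x.dtype),
    [batch, height * scale, width * scale, channels])

with tf.Session() as sess:
  print(sess.run(upsampled)[0, :, :, 0])
  # [[0. 0. 1. 1.]
  #  [0. 0. 1. 1.]
  #  [2. 2. 3. 3.]
  #  [2. 2. 3. 3.]]
```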