Commit e71d67c3 authored by Scott Main, committed by TF Object Detection Team

Add explanation for anchor box configuration

PiperOrigin-RevId: 383651497
parent 178fdeb2
@@ -24,22 +24,23 @@ A skeleton configuration file is shown below:

```
model {
  (... Add model config here...)
}

train_config: {
  (... Add train_config here...)
}

train_input_reader: {
  (... Add train_input configuration here...)
}

eval_config: {
  (... Add eval_config here...)
}

eval_input_reader: {
  (... Add eval_input configuration here...)
}
```
@@ -58,6 +59,106 @@ configuration files can be pasted into the `model` field of the skeleton
configuration. Users should note that the `num_classes` field should be changed
to a value suited to the dataset the user is training on.
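For example, if your dataset contains three classes, the beginning of an SSD-based `model` config would look like the following sketch (written in the skeleton's placeholder convention; the value `3` is purely illustrative):

```
model {
  ssd {
    num_classes: 3
    (... rest of the pasted model config ...)
  }
}
```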
### Anchor box parameters
Many object detection models use an anchor generator as a region-sampling
strategy, which generates a large number of anchor boxes in a range of shapes
and sizes, at many locations in the image. The detection algorithm then
incrementally offsets the anchor box closest to the ground truth until it
(closely) matches. You can specify the variety and position of these anchor
boxes in the `anchor_generator` config.
Usually, the anchor configs provided with pre-trained checkpoints are designed
for large, versatile datasets (COCO, ImageNet), where the goal is to improve
accuracy for a wide range of object sizes and positions. But in most real-world
applications, objects are confined to a limited range of sizes, so adjusting
the anchors to be specific to your dataset and environment can both improve
model accuracy and reduce training time.
The format for these anchor box parameters differs depending on your model
architecture. For details about all fields, see the [`anchor_generator`
definition](https://github.com/tensorflow/models/blob/master/research/object_detection/protos/anchor_generator.proto).
On this page, we'll focus on parameters
used in a traditional single shot detector (SSD) model and SSD models with a
feature pyramid network (FPN) head.
Regardless of the model architecture, you'll need to understand the following
anchor box concepts:
+ **Scale**: This defines the variety of anchor box sizes. Each box size is
defined as a proportion of the original image size (for SSD models) or as a
factor of the filter's stride length (for FPN). The number of different sizes
is defined using a range of "scales" (relative to image size) or "levels" (the
level on the feature pyramid). For example, to detect small objects with the
configurations below, `min_scale` and `min_level` are set to small values,
while `max_scale` and `max_level` determine the largest objects to detect.
+ **Aspect ratio**: This is the ratio between anchor box width and height. For
example, an `aspect_ratios` value of `1.0` creates a square, and `2.0` creates
a 1:2 (height:width) rectangle (landscape orientation). You can define as many
aspect ratios as you want, and each one is repeated at every anchor box scale.
Beware that increasing the total number of anchor boxes rapidly increases
computation costs, whereas generating fewer anchors that have a higher chance
of overlapping the ground truth both improves accuracy and reduces computation
costs.
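As a back-of-the-envelope check, you can estimate the anchor count by summing over the feature-map grids. Here is a rough Python sketch; the grid sizes below are assumptions for a 300x300 SSD input, not values taken from any config on this page:

```python
# Approximate total anchor count for an SSD-style detector.
# Grid sizes are illustrative for a 300x300 input; real values depend
# on the backbone and input resolution.
feature_maps = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
num_aspect_ratios = 3  # e.g. 1.0, 2.0, 0.5

total = sum(h * w * num_aspect_ratios for h, w in feature_maps)
print(total)  # 5820 boxes to classify and regress on every image
```

Every extra aspect ratio or scale multiplies this count, which is why trimming the anchor set to your data pays off.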
**Single Shot Detector (SSD) full model:**
Setting `num_layers` to 6 means the model generates each box aspect at 6
different sizes. The exact sizes are not specified, but they're evenly spaced
between the `min_scale` and `max_scale` values, which specify that the smallest
box size is 20% of the input image size and the largest is 95% of it.
```
model {
  ssd {
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
      }
    }
  }
}
```
For more details, see [`ssd_anchor_generator.proto`](https://github.com/tensorflow/models/blob/master/research/object_detection/protos/ssd_anchor_generator.proto).
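Although the library computes these sizes internally, the even spacing is easy to verify by hand. A minimal Python sketch of the interpolation (an assumption based on the SSD paper's formula, not the library's exact code):

```python
# Evenly spaced scales between min_scale and max_scale (SSD-paper style).
num_layers, min_scale, max_scale = 6, 0.2, 0.95

scales = [min_scale + (max_scale - min_scale) * k / (num_layers - 1)
          for k in range(num_layers)]
print([round(s, 2) for s in scales])  # [0.2, 0.35, 0.5, 0.65, 0.8, 0.95]
```

Each of the three aspect ratios above is then generated at all six of these scales.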
**SSD with Feature Pyramid Network (FPN) head:**
When using an FPN head, you must specify the anchor box size relative to the
convolutional filter's stride length at a given pyramid level, using
`anchor_scale`. So in this example, the base box size is 4.0 multiplied by the
layer's stride length. The number of sizes you get for each aspect ratio simply
depends on how many levels there are between `min_level` and `max_level`.
```
model {
  ssd {
    anchor_generator {
      multiscale_anchor_generator {
        anchor_scale: 4.0
        min_level: 3
        max_level: 7
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
      }
    }
  }
}
```
For more details, see [`multiscale_anchor_generator.proto`](https://github.com/tensorflow/models/blob/master/research/object_detection/protos/multiscale_anchor_generator.proto).
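To make the stride relationship concrete, here is a small Python sketch (it assumes the common FPN convention that the stride at level `l` is `2**l`):

```python
# Base anchor size per pyramid level: anchor_scale * stride.
anchor_scale, min_level, max_level = 4.0, 3, 7

for level in range(min_level, max_level + 1):
    stride = 2 ** level
    print(f"level {level}: stride {stride:3d} -> "
          f"base anchor size {anchor_scale * stride:.0f}")
# level 3: stride   8 -> base anchor size 32
# ...
# level 7: stride 128 -> base anchor size 512
```

So this configuration covers objects from roughly 32 to 512 pixels on a side, before the aspect ratios are applied.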
## Defining Inputs

The TensorFlow Object Detection API accepts inputs in the TFRecord file format.

@@ -66,20 +167,21 @@ Additionally, users should also specify a label map, which defines the mapping
between a class id and class name. The label map should be identical between
training and evaluation datasets.
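For reference, a label map is a small file in protobuf text format that lists one `item` per class. A minimal sketch with two illustrative class names (ids should start at 1, since 0 is reserved for the background class):

```
item {
  id: 1
  name: 'cat'
}
item {
  id: 2
  name: 'dog'
}
```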
An example training input configuration looks as follows:

```
train_input_reader: {
  tf_record_input_reader {
    input_path: "/usr/home/username/data/train.record-?????-of-00010"
  }
  label_map_path: "/usr/home/username/data/label_map.pbtxt"
}
```
The `eval_input_reader` follows the same format. Users should substitute the
`input_path` and `label_map_path` arguments. The `?????-of-00010` pattern in
`input_path` matches a dataset that was written as 10 sharded TFRecord files.
Note that the paths can also point to Google Cloud Storage buckets (e.g.,
"gs://project_bucket/train.record") to pull datasets hosted on Google Cloud.
## Configuring the Trainer

@@ -92,8 +194,9 @@ The `train_config` defines parts of the training process:

A sample `train_config` is below:

```
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
@@ -115,14 +218,15 @@ optimizer {
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "/usr/home/username/tmp/model.ckpt-#####"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true
  gradient_clipping_by_norm: 10.0
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
```