# TF Vision Model Garden

## Introduction
The TF Vision Model Garden provides a large collection of baselines and checkpoints for image classification, object detection, instance segmentation, and video classification.


## Image Classification
### ImageNet Baselines
#### ResNet models trained with vanilla settings:
* Models are trained from scratch with batch size 4096 and 1.6 initial learning rate.
* Linear warmup is applied for the first 5 epochs.
* Models are trained with L2 weight regularization and ReLU activation.
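
The warmup described above can be sketched as follows. This is a minimal illustration, not the training code itself: it assumes the learning rate ramps linearly from 0 to the 1.6 initial value over the first 5 epochs. The starting value of 0 and the function name `warmup_learning_rate` are assumptions made for the example, and the decay that follows warmup is not stated here, so the sketch simply holds the rate constant afterwards.

```python
def warmup_learning_rate(epoch, base_lr=1.6, warmup_epochs=5):
    """Return the learning rate at a (possibly fractional) epoch.

    Assumption: warmup ramps linearly from 0 to base_lr. Whatever decay
    follows the warmup is not modeled; base_lr is returned unchanged.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    return base_lr


if __name__ == "__main__":
    for epoch in (0, 1, 2.5, 5, 10):
        print(f"epoch {epoch}: lr = {warmup_learning_rate(epoch):.3f}")
```

In a real training loop the epoch would normally be derived from the global step and the number of steps per epoch, which depends on the dataset size and the batch size of 4096.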

| model        | resolution    | epochs  |  Top-1  |  Top-5  | download |
| ------------ |:-------------:|--------:|--------:|---------:|---------:|
| ResNet-50    | 224x224       |    90    | 76.1 | 92.9 | config |
| ResNet-50    | 224x224       |    200   | 77.1 | 93.5 | config |
| ResNet-101   | 224x224       |    200   | 78.3 | 94.2 | config |
| ResNet-152   | 224x224       |    200   | 78.7 | 94.3 | config |

#### ResNet-RS models trained with settings including:

*   ResNet-RS architectural changes and Swish activation.
*   Regularization methods including RandAugment, 4e-5 weight decay, stochastic depth, label smoothing and dropout.
*   New training methods including a 350-epoch schedule, cosine learning rate and
    EMA.
*   Configs are in this [directory](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/image_classification).
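
Two of the techniques listed above, label smoothing and EMA of the weights, are easy to illustrate in isolation. The sketch below shows one common formulation of each. The function names and the default values `smoothing=0.1` and `decay=0.9999` are hypothetical placeholders chosen for the example; the values actually used are defined in the linked configs, not here.

```python
def smooth_labels(label, num_classes, smoothing=0.1):
    """Return a smoothed one-hot target for an integer class label.

    Formulation: (1 - smoothing) * one_hot + smoothing / num_classes.
    The default smoothing value is a hypothetical placeholder.
    """
    off_value = smoothing / num_classes
    on_value = 1.0 - smoothing + off_value
    return [on_value if i == label else off_value for i in range(num_classes)]


def ema_update(shadow, weights, decay=0.9999):
    """Blend the current weights into their exponential moving average.

    The default decay value is a hypothetical placeholder.
    """
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


if __name__ == "__main__":
    print(smooth_labels(label=1, num_classes=4))
    print(ema_update(shadow=[1.0, 2.0], weights=[0.0, 0.0], decay=0.9))
```

The smoothed target still sums to 1, and the EMA copy of the weights is typically the one used for evaluation.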

| model         | resolution | params (M) | Top-1 | Top-5 | download |
| ------------- |:----------:| ----------:| -----:| -----:| --------:|
| ResNet-RS-50  | 160x160    | 35.7       | 79.1  | 94.5  |          |
| ResNet-RS-101 | 160x160    | 63.7       | 80.2  | 94.9  |          |
| ResNet-RS-101 | 192x192    | 63.7       | 81.3  | 95.6  |          |
| ResNet-RS-152 | 192x192    | 86.8       | 81.9  | 95.8  |          |
| ResNet-RS-152 | 224x224    | 86.8       | 82.5  | 96.1  |          |
| ResNet-RS-152 | 256x256    | 86.8       | 83.1  | 96.3  |          |
| ResNet-RS-200 | 256x256    | 93.4       | 83.5  | 96.6  |          |
| ResNet-RS-270 | 256x256    | 130.1      | 83.6  | 96.6  |          |
| ResNet-RS-350 | 256x256    | 164.3      | 83.7  | 96.7  |          |
| ResNet-RS-350 | 320x320    | 164.3      | 84.2  | 96.9  |          |


## Object Detection and Instance Segmentation
### Common Settings and Notes
* We provide models based on two detection frameworks, [RetinaNet](https://arxiv.org/abs/1708.02002) and [Mask R-CNN](https://arxiv.org/abs/1703.06870), and two backbones, [ResNet-FPN](https://arxiv.org/abs/1612.03144) and [SpineNet](https://arxiv.org/abs/1912.05027).
* Models are all trained on COCO train2017 and evaluated on COCO val2017.
* Training details:
  * Models fine-tuned from ImageNet-pretrained checkpoints adopt the 12- or 36-epoch schedule. Models trained from scratch adopt the 350-epoch schedule.
  * The default training data augmentation applies horizontal flipping and scale jittering with a random scale in the range [0.5, 2.0].
  * Unless noted, all models are trained with L2 weight regularization and ReLU activation.
  * We use a batch size of 256 and a stepwise learning rate that decays at the last 30 and 10 epochs.
  * We use square images as input, resizing the long side of each image to the target size and then padding the short side with zeros.
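
The square-input preprocessing in the last item can be sketched as plain arithmetic. The example below computes the resized shape and the amount of zero padding needed for a given image size. It assumes the aspect ratio is preserved and that sizes are rounded to the nearest pixel; the function name `square_resize_plan` and the rounding choice are assumptions for illustration, and where the padding is placed is left open.

```python
def square_resize_plan(height, width, target_size):
    """Compute the resized shape and zero padding for a square input.

    Returns (new_height, new_width, pad_height, pad_width). The long side is
    scaled to target_size, the short side keeps the aspect ratio, and the
    remainder is the padding. Placement of the padding (for example
    bottom/right) is left to the caller.
    """
    if min(height, width, target_size) <= 0:
        raise ValueError("all sizes must be positive")
    scale = target_size / max(height, width)
    new_height = max(1, min(target_size, round(height * scale)))
    new_width = max(1, min(target_size, round(width * scale)))
    return (new_height, new_width,
            target_size - new_height, target_size - new_width)


if __name__ == "__main__":
    # A 480x640 (height x width) image resized for a 640x640 input.
    print(square_resize_plan(480, 640, 640))  # (480, 640, 160, 0)
```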

### COCO Object Detection Baselines
#### RetinaNet (ImageNet pretrained)
| backbone        | resolution    | epochs  | FLOPs (B)     | params (M) |  box AP |   download |
| ------------ |:-------------:| ---------:|-----------:|--------:|--------:|-----------:|
| R50-FPN      | 640x640       |    12    | 97.0 | 34.0 | 34.3 | config|
| R50-FPN      | 640x640       |    36    | 97.0 | 34.0 | 37.3 | config|

#### RetinaNet (Trained from scratch) with training features including:
* Stochastic depth with drop rate 0.2.
* Swish activation.
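
For readers unfamiliar with these two features, here is a minimal pure-Python sketch of each. It is an illustration under stated assumptions rather than the library's implementation: the sketch applies a single drop rate of 0.2 to a residual branch (whether the rate is uniform across blocks or scaled with depth is not stated here) and rescales kept branches by `1 / (1 - drop_rate)`, which is one common convention. The function names are hypothetical.

```python
import math
import random


def swish(x):
    """Swish activation, x * sigmoid(x), written to avoid overflow."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def stochastic_depth(residual, drop_rate=0.2, training=True, rng=random):
    """Randomly drop a whole residual branch during training.

    Assumption: kept branches are rescaled by 1 / (1 - drop_rate) so the
    expected value matches inference, where the branch is always kept.
    """
    if not training or drop_rate == 0.0:
        return list(residual)
    if rng.random() < drop_rate:
        return [0.0] * len(residual)
    keep_prob = 1.0 - drop_rate
    return [r / keep_prob for r in residual]


if __name__ == "__main__":
    print([round(swish(v), 4) for v in (-2.0, 0.0, 2.0)])
    print(stochastic_depth([1.0, 2.0], rng=random.Random(0)))
```

The block output is then the shortcut plus the (possibly dropped) residual, so a dropped branch reduces the block to an identity mapping for that step.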

| backbone        | resolution    | epochs  | FLOPs (B)     | params (M) |  box AP |   download |
| ------------ |:-------------:| ---------:|-----------:|--------:|---------:|-----------:|
| SpineNet-49  | 640x640       |    500    | 85.4| 28.5 | 44.2 | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/retinanet/coco_spinenet49_tpu.yaml)|
| SpineNet-96  | 1024x1024     |    500    | 265.4 | 43.0 | 48.5 |  [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/retinanet/coco_spinenet96_tpu.yaml) |
| SpineNet-143 | 1280x1280     |    500    | 524.0 | 67.0 | 50.0 | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/retinanet/coco_spinenet143_tpu.yaml)|


### Instance Segmentation Baselines
#### Mask R-CNN (ImageNet pretrained)


#### Mask R-CNN (Trained from scratch)
| backbone        | resolution    | epochs  | FLOPs (B)  | params (M)  |  box AP |  mask AP  |   download |
| ------------ |:-------------:| ---------:|-----------:|--------:|--------:|-----------:|-----------:|
| SpineNet-49  | 640x640       |    350    | 215.7 | 40.8 | 42.6 | 37.9 | config |


## Video Classification
### Common Settings and Notes
* We provide models for video classification with two backbones: [SlowOnly](https://arxiv.org/abs/1812.03982) and 3D-ResNet (R3D) used in [Spatiotemporal Contrastive Video Representation Learning](https://arxiv.org/abs/2008.03800).
* Training and evaluation details:
  * All models are trained from scratch with vision modality (RGB) for 200 epochs.
  * We use a batch size of 1024 and cosine learning rate decay with linear warmup over the first 5 epochs.
  * We follow [SlowFast](https://arxiv.org/abs/1812.03982) to perform 30-view evaluation.
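
The shape of the learning rate schedule described above can be sketched as a multiplier on the peak learning rate, which is not given here. This is a minimal illustration under assumptions: warmup starts from 0, and the cosine decay spans the epochs after warmup and reaches 0 at epoch 200. The function name `lr_multiplier` is hypothetical.

```python
import math


def lr_multiplier(epoch, total_epochs=200, warmup_epochs=5):
    """Return the fraction of the peak learning rate at a given epoch.

    Assumptions: linear warmup from 0 over warmup_epochs, then cosine decay
    to 0 over the remaining epochs.
    """
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    if epoch < warmup_epochs:
        return epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 2.5, 5, 100, 200):
        print(f"epoch {epoch}: multiplier = {lr_multiplier(epoch):.3f}")
```

Multiplying the result by the peak learning rate from the relevant config gives the learning rate at that epoch.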

### Kinetics-400 Action Recognition Baselines
| model    | input (frame x stride) |  Top-1  |  Top-5  | download |
| -------- |:----------------------:|--------:|--------:|---------:|
| SlowOnly | 8 x 8                  |  74.1   |  91.4   | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/video_classification/k400_slowonly8x8_tpu.yaml) |
| SlowOnly | 16 x 4                 |  75.6   |  92.1   | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/video_classification/k400_slowonly16x4_tpu.yaml) |
| R3D-50   | 32 x 2                 |  77.0   |  93.0   | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/video_classification/k400_3d-resnet50_tpu.yaml) |

### Kinetics-600 Action Recognition Baselines
| model    | input (frame x stride) |  Top-1  |  Top-5  | download |
| -------- |:----------------------:|--------:|--------:|---------:|
| SlowOnly | 8 x 8                  |  77.3   |  93.6   | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/video_classification/k600_slowonly8x8_tpu.yaml) |
| R3D-50   | 32 x 2                 |  79.5   |  94.8   | [config](https://github.com/tensorflow/models/blob/master/official/vision/beta/configs/experiments/video_classification/k600_3d-resnet50_tpu.yaml) |