# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Image Classification

This folder contains TF 2.0 model examples for image classification:

* [MNIST](#mnist)
* [Classifier Trainer](#classifier-trainer), a framework that uses the Keras
compile/fit methods for image classification models, including:
  * ResNet
  * EfficientNet[^1]

[^1]: EfficientNet support is currently a work in progress; we cannot yet match the "AutoAugment (AA)" results reported by [the original version](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet).
For more information about other types of models, please refer to this
[README file](../../README.md).

## Before you begin
Please make sure that you have the latest version of TensorFlow
installed and
[add the models folder to your Python path](/official/#running-the-models).

### ImageNet preparation

#### Using TFDS
`classifier_trainer.py` supports ImageNet with
[TensorFlow Datasets (TFDS)](https://www.tensorflow.org/datasets/overview).

Please see the following [example snippet](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/scripts/download_and_prepare.py)
for more information on how to use TFDS to download and prepare datasets, and
specifically the [TFDS ImageNet readme](https://github.com/tensorflow/datasets/blob/master/docs/catalog/imagenet2012.md)
for manual download instructions.
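
For example, the linked script can be run from the command line roughly as
follows (a sketch; `$TFDS_DATA_DIR` and `$MANUAL_DIR` are placeholders, and
the ImageNet archives must first be downloaded manually per the readme above):

```bash
python3 -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=imagenet2012 \
  --data_dir=$TFDS_DATA_DIR \
  --manual_dir=$MANUAL_DIR
```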

#### Legacy TFRecords
Download the ImageNet dataset and convert it to TFRecord format.
The following [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py)
and [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy)
provide a few options.
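
For instance, a typical invocation looks roughly like this (an illustration;
flag values are placeholders, and the full flag list is described in the
README linked above):

```bash
python3 imagenet_to_gcs.py \
  --project=$PROJECT \
  --gcs_output_path=gs://$BUCKET/imagenet \
  --local_scratch_dir=/tmp/imagenet \
  --imagenet_username=$IMAGENET_USERNAME \
  --imagenet_access_key=$IMAGENET_ACCESS_KEY
```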

Note that the legacy ResNet runners, e.g.
[resnet/resnet_ctl_imagenet_main.py](resnet/resnet_ctl_imagenet_main.py),
require TFRecords, whereas `classifier_trainer.py` can use either format by
setting the builder to 'records' or 'tfds' in the configurations, as shown in
the snippet below.
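
For example, to read from TFRecord files instead of TFDS, the dataset sections
of a config can be pointed at the records (a sketch of the relevant keys;
paths are placeholders):

```YAML
train_dataset:
  builder: 'records'
  data_dir: 'gs://your-bucket/imagenet/train'
validation_dataset:
  builder: 'records'
  data_dir: 'gs://your-bucket/imagenet/validation'
```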

### Running on Cloud TPUs

Note: These models will **not** work with TPUs on Colab.

You can train image classification models on Cloud TPUs using
[tf.distribute.TPUStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/TPUStrategy?version=nightly).
If you are not familiar with Cloud TPUs, it is strongly recommended that you go
through the
[quickstart](https://cloud.google.com/tpu/docs/quickstart) to learn how to
create a TPU and GCE VM.
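
For reference, a v3-8 TPU and its VM can be created along these lines (an
illustration; the name, zone, and TensorFlow version are placeholders, and the
quickstart has the current commands):

```bash
gcloud compute tpus create $TPU_NAME \
  --zone=$ZONE \
  --accelerator-type=v3-8 \
  --version=2.4.0
```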

### Running on multiple GPU hosts

You can also train these models on multiple hosts, each with GPUs, using
[tf.distribute.experimental.MultiWorkerMirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy).

The easiest way to run multi-host benchmarks is to set the
[`TF_CONFIG`](https://www.tensorflow.org/guide/distributed_training#TF_CONFIG)
appropriately at each host.  e.g., to run using `MultiWorkerMirroredStrategy` on
2 hosts, the `cluster` in `TF_CONFIG` should have 2 `host:port` entries, and
host `i` should have the `task` in `TF_CONFIG` set to `{"type": "worker",
"index": i}`.  `MultiWorkerMirroredStrategy` will automatically use all the
available GPUs at each host.
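
For example, the first of two workers might be configured as follows
(hostnames and port are placeholders):

```bash
export TF_CONFIG='{
  "cluster": {"worker": ["host1:12345", "host2:12345"]},
  "task": {"type": "worker", "index": 0}
}'
```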

## MNIST

To download the data and run the MNIST sample model locally for the first time,
run the following command:

```bash
python3 mnist_main.py \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=one_device \
  --num_gpus=$NUM_GPUS \
  --download
```

To train the model on a Cloud TPU, run the following command:

```bash
python3 mnist_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=tpu \
  --download
```

Note: the `--download` flag is only required the first time you run the model.

## Classifier Trainer
The classifier trainer is a unified framework for running image classification
models using Keras's compile/fit methods. Experiments are defined as YAML
configuration files; see [configs/examples](./configs/examples) for example
configurations.

The provided configuration files use a per-replica batch size, which is scaled
by the number of devices. For instance, if `batch_size` = 64, then for 1 GPU
the global batch size would be 64 * 1 = 64. For 8 GPUs, the global batch size
would be 64 * 8 = 512. Similarly, for a v3-8 TPU, the global batch size would
be 64 * 8 = 512, and for a v3-32, the global batch size is 64 * 32 = 2048.

### ResNet50

#### On GPU:
```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/gpu.yaml \
  --params_override="runtime.num_gpus=$NUM_GPUS"
```

To train on multiple hosts, each with GPUs attached, using
[MultiWorkerMirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy),
update the `runtime` section in `gpu.yaml`
(or override it using `--params_override`) with:

```YAML
# gpu.yaml
runtime:
  distribution_strategy: 'multi_worker_mirrored'
  worker_hosts: '$HOST1:port,$HOST2:port'
  num_gpus: $NUM_GPUS
  task_index: 0
```
Set `task_index: 0` on the first host, `task_index: 1` on the second, and so
on. `$HOST1` and `$HOST2` are the IP addresses of the hosts, and `port` can be
any free port on the hosts. Only the first host will write TensorBoard
summaries and save checkpoints.

#### On TPU:
```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/resnet/imagenet/tpu.yaml
```

### EfficientNet
**Note: EfficientNet development is a work in progress.**
#### On GPU:
```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=efficientnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-gpu.yaml \
  --params_override="runtime.num_gpus=$NUM_GPUS"
```

#### On TPU:
```bash
python3 classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=efficientnet \
  --dataset=imagenet \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=configs/examples/efficientnet/imagenet/efficientnet-b0-tpu.yaml
```

Note that the number of GPU devices can be overridden on the command line using
`--params_override`, as in the examples above. The TPU does not need this
override, as the device is fixed by providing the TPU address or name with the
`--tpu` flag.
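
`--params_override` accepts a comma-separated list of `key=value` assignments,
for example (an illustrative combination using keys from the `runtime` section
shown earlier):

```bash
--params_override="runtime.distribution_strategy=mirrored,runtime.num_gpus=8"
```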