# Image Classification

This folder contains the TF 2.0 model examples for image classification:

* [ResNet](#resnet)
* [MNIST](#mnist)

For more information about other types of models, please refer to this
[README file](../../README.md).

## ResNet

Similar to the [estimator implementation](../../r1/resnet), the Keras
implementation has code for the ImageNet dataset. The ImageNet
version uses a ResNet50 model implemented in
[`resnet_model.py`](./resnet_model.py).

Please make sure that you have the latest version of TensorFlow
installed and
[add the models folder to your Python path](/official/#running-the-models).
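
For example, assuming the repository is checked out at `/path/to/models`
(the path is illustrative), a minimal setup might look like:

```bash
# Install TensorFlow (a nightly or stable TF 2 release both work here).
pip install tf-nightly

# Make the models folder importable.
export PYTHONPATH=$PYTHONPATH:/path/to/models
```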

### Pretrained Models

* [ResNet50 Checkpoints](https://storage.googleapis.com/cloud-tpu-checkpoints/resnet/resnet50.tar.gz)

* ResNet50 TFHub: [feature vector](https://tfhub.dev/tensorflow/resnet_50/feature_vector/1)
and [classification](https://tfhub.dev/tensorflow/resnet_50/classification/1)

### ImageNet Training

Download the ImageNet dataset and convert it to TFRecord format.
The following [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py)
and [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy)
provide a few options.
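
As a sketch, a typical invocation looks like the following (the project,
bucket, and credentials are illustrative; see the linked README for the
authoritative flag list):

```bash
python imagenet_to_gcs.py \
  --project=YOUR_PROJECT \
  --gcs_output_path=gs://YOUR_BUCKET/imagenet \
  --local_scratch_dir=./imagenet \
  --imagenet_username=YOUR_USERNAME \
  --imagenet_access_key=YOUR_ACCESS_KEY
```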

Once your dataset is ready, you can begin training the model as follows:

```bash
python resnet_imagenet_main.py
```

If you did not download the data to the default directory, specify the
location with the `--data_dir` flag:

```bash
python resnet_imagenet_main.py --data_dir=/path/to/imagenet
```

There are more flag options you can specify. Here are some examples:

- `--use_synthetic_data`: when set to true, synthetic data, rather than real
  data, are used;
- `--batch_size`: the batch size used for the model;
- `--model_dir`: the directory to save the model checkpoint;
- `--train_epochs`: the number of epochs to run for training the model;
- `--train_steps`: the number of steps to run for training the model. We
  currently only support a value smaller than the number of batches in an
  epoch;
- `--skip_eval`: when set to true, evaluation as well as validation during
  training is skipped.

For example, this is a typical command line to run with ImageNet data with
batch size 128 per GPU:

```bash
python -m resnet_imagenet_main \
    --model_dir=/tmp/model_dir/something \
    --num_gpus=2 \
    --batch_size=128 \
    --train_epochs=90 \
    --train_steps=10 \
    --use_synthetic_data=false
```

See [`common.py`](common.py) for the full list of options.

### Using multiple GPUs

You can train these models on multiple GPUs using the `tf.distribute.Strategy`
API. You can read more about it in this
[guide](https://www.tensorflow.org/guide/distribute_strategy).

In this example, we have made it easy to use with just a command line flag,
`--num_gpus`. By default this flag is 1 if TensorFlow is compiled with CUDA,
and 0 otherwise.

- `--num_gpus=0`: uses `tf.distribute.OneDeviceStrategy` with CPU as the device.
- `--num_gpus=1`: uses `tf.distribute.OneDeviceStrategy` with GPU as the device.
- `--num_gpus=2+`: uses `tf.distribute.MirroredStrategy` to run synchronous
  distributed training across the GPUs.

If you wish to run without `tf.distribute.Strategy`, you can do so by setting
`--distribution_strategy=off`.
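
For example, a minimal sketch of a single-device run with no distribution
strategy (the data path is illustrative):

```bash
python resnet_imagenet_main.py \
    --distribution_strategy=off \
    --data_dir=/path/to/imagenet
```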

### Running on multiple GPU hosts

You can also train these models on multiple hosts, each with GPUs, using
`tf.distribute.Strategy`.

The easiest way to run multi-host benchmarks is to set the
[`TF_CONFIG`](https://www.tensorflow.org/guide/distributed_training#TF_CONFIG)
appropriately at each host.  e.g., to run using `MultiWorkerMirroredStrategy` on
2 hosts, the `cluster` in `TF_CONFIG` should have 2 `host:port` entries, and
host `i` should have the `task` in `TF_CONFIG` set to `{"type": "worker",
"index": i}`.  `MultiWorkerMirroredStrategy` will automatically use all the
available GPUs at each host.
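
As a sketch, on the first of two hosts you might set (hostnames and ports are
illustrative):

```bash
# Host 0; on host 1, set "index": 1 instead.
export TF_CONFIG='{
  "cluster": {"worker": ["host1:2222", "host2:2222"]},
  "task": {"type": "worker", "index": 0}
}'
```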

### Running on Cloud TPUs

Note: This model will **not** work with TPUs on Colab.

You can train the ResNet CTL model on Cloud TPUs using
`tf.distribute.TPUStrategy`. If you are not familiar with Cloud TPUs, it is
strongly recommended that you go through the
[quickstart](https://cloud.google.com/tpu/docs/quickstart) to learn how to
create a TPU and GCE VM.

To run the ResNet model on a TPU, you must set `--distribution_strategy=tpu` and
`--tpu=$TPU_NAME`, where `$TPU_NAME` is the name of your TPU in the Cloud Console.
From a GCE VM, you can run the following command to train ResNet for one epoch
on a v2-8 or v3-8 TPU:

```bash
python resnet_ctl_imagenet_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --batch_size=1024 \
  --steps_per_loop=500 \
  --train_epochs=1 \
  --use_synthetic_data=false \
  --dtype=fp32 \
  --enable_eager=true \
  --enable_tensorboard=true \
  --distribution_strategy=tpu \
  --log_steps=50 \
  --single_l2_loss_op=true \
  --use_tf_function=true
```

To train the ResNet to convergence, run it for 90 epochs:

```bash
python resnet_ctl_imagenet_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --batch_size=1024 \
  --steps_per_loop=500 \
  --train_epochs=90 \
  --use_synthetic_data=false \
  --dtype=fp32 \
  --enable_eager=true \
  --enable_tensorboard=true \
  --distribution_strategy=tpu \
  --log_steps=50 \
  --single_l2_loss_op=true \
  --use_tf_function=true
```

Note: `$MODEL_DIR` and `$DATA_DIR` must be GCS paths.
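
For example (the bucket name is illustrative):

```bash
export MODEL_DIR=gs://your-bucket/resnet/model
export DATA_DIR=gs://your-bucket/imagenet
```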

## MNIST

To download the data and run the MNIST sample model locally for the first time,
run the following command:

```bash
python mnist_main.py \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=one_device \
  --num_gpus=$NUM_GPUS \
  --download
```

To train the model on a Cloud TPU, run the following command:

```bash
python mnist_main.py \
  --tpu=$TPU_NAME \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=tpu \
  --download
```

Note: the `--download` flag is only required the first time you run the model.
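
On subsequent runs the data is already present in `$DATA_DIR`, so you can drop
the flag, e.g.:

```bash
python mnist_main.py \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --train_epochs=10 \
  --distribution_strategy=one_device \
  --num_gpus=$NUM_GPUS
```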