![No Maintenance Intended](https://img.shields.io/badge/No%20Maintenance%20Intended-%E2%9C%95-red.svg)
![TensorFlow Requirement: 1.x](https://img.shields.io/badge/TensorFlow%20Requirement-1.x-brightgreen)
![TensorFlow 2 Not Supported](https://img.shields.io/badge/TensorFlow%202%20Not%20Supported-%E2%9C%95-red.svg)

**NOTE: For the most part, you will find a newer version of this code at [models/research/slim](https://github.com/tensorflow/models/tree/master/research/slim).** In particular:

*   `inception_train.py` and `imagenet_train.py` should no longer be used. The slim editions for running on multiple GPUs are the current best examples.
*   `inception_distributed_train.py` and `imagenet_distributed_train.py` are still valid examples of distributed training.

For performance benchmarking, please see https://www.tensorflow.org/performance/benchmarks.

---

# Inception in TensorFlow

[ImageNet](http://www.image-net.org/) is a common academic data set in machine
learning for training an image recognition system. Code in this directory
demonstrates how to use TensorFlow to train and evaluate a type of convolutional
neural network (CNN) on this academic data set. In particular, we demonstrate
how to train the Inception v3 architecture as specified in:

_Rethinking the Inception Architecture for Computer Vision_

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew
Wojna

http://arxiv.org/abs/1512.00567

This network achieves 21.2% top-1 and 5.6% top-5 error for single frame
evaluation with a computational cost of 5 billion multiply-adds per inference
and using fewer than 25 million parameters. Below is a visualization of the
model architecture.

![Inception-v3 Architecture](g3doc/inception_v3_architecture.png)

## Description of Code

The code base provides three core binaries for:

*   Training an Inception v3 network from scratch across multiple GPUs and/or
    multiple machines using the ImageNet 2012 Challenge training data set.
*   Evaluating an Inception v3 network using the ImageNet 2012 Challenge
    validation data set.
*   Retraining an Inception v3 network on a novel task and back-propagating the
    errors to fine tune the network weights.

The training procedure employs synchronous stochastic gradient descent across
multiple GPUs. The user may specify the number of GPUs they wish to harness. The
synchronous training performs *batch-splitting* by dividing a given batch across
multiple GPUs.

The training setup is nearly identical to the section [Training a Model Using
Multiple GPU Cards](https://www.tensorflow.org/tutorials/deep_cnn/index.html#launching_and_training_the_model_on_multiple_gpu_cards)
where we have substituted the CIFAR-10 model architecture with Inception v3. The
primary differences with that setup are:

*   Calculate and update the batch-norm statistics during training so that they
    may be substituted in during evaluation.
*   Specify the model architecture using a (still experimental) higher level
    language called TensorFlow-Slim.

For more details about TensorFlow-Slim, please see the [Slim README](inception/slim/README.md). Please note that this higher-level language is still
*experimental* and the API may change over time depending on usage and
subsequent research.

## Getting Started

Before you run the training script for the first time, you will need to download
and convert the ImageNet data to native TFRecord format. The TFRecord format
consists of a set of sharded files where each entry is a serialized `tf.Example`
proto. Each `tf.Example` proto contains the ImageNet image (JPEG encoded) as
well as metadata such as label and bounding box information. See
[`parse_example_proto`](inception/image_processing.py) for details.
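
Conceptually, each record can be pictured as a key/value structure like the
sketch below (plain Python; the exact feature keys are an assumption here, so
consult `parse_example_proto` for the authoritative set):

```python
# Illustrative sketch of one TFRecord entry's contents; the field names and
# values are hypothetical stand-ins, not read from a real data file.
example = {
    'image/encoded': b'\xff\xd8\xff\xe0',   # JPEG-encoded image bytes
    'image/class/label': 7,                 # integer class label
    'image/object/bbox/xmin': [0.1],        # normalized bounding-box coords
    'image/object/bbox/ymin': [0.2],
    'image/object/bbox/xmax': [0.9],
    'image/object/bbox/ymax': [0.8],
}
print(sorted(example)[0])  # image/class/label
```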

We provide a single [script](inception/data/download_and_preprocess_imagenet.sh) for
downloading and converting ImageNet data to TFRecord format. Downloading and
preprocessing the data may take several hours (up to half a day) depending on
your network and computer speed. Please be patient.

To begin, you will need to sign up for an account with [ImageNet](http://image-net.org) to gain access to the data. Look for the sign up page,
create an account and request an access key to download the data.

After you have `USERNAME` and `PASSWORD`, you are ready to run our script. Make
sure that your hard disk has at least 500 GB of free space for downloading and
storing the data. Here we select `DATA_DIR=$HOME/imagenet-data` as such a
location but feel free to edit accordingly.

When you run the below script, please enter *USERNAME* and *PASSWORD* when
prompted. This will occur at the very beginning. Once these values are entered,
you will not need to interact with the script again.

```shell
# location of where to place the ImageNet data
DATA_DIR=$HOME/imagenet-data

# build the preprocessing script.
cd tensorflow-models/inception
bazel build //inception:download_and_preprocess_imagenet

# run it
bazel-bin/inception/download_and_preprocess_imagenet "${DATA_DIR}"
```

The final line of the script's output should read:

```shell
2016-02-17 14:30:17.287989: Finished writing all 1281167 images in data set.
```

When the script finishes, you will find 1024 training files and 128 validation
files in the `DATA_DIR`. The files will match the patterns
`train-?????-of-01024` and `validation-?????-of-00128`, respectively.
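
As a quick sanity check, you can count the shards. The sketch below simulates
the layout with a handful of empty files; point `DATA_DIR` at your real
directory instead and expect counts of 1024 and 128:

```shell
# Simulated shard layout for illustration; replace mktemp with your DATA_DIR.
DATA_DIR=$(mktemp -d)
for i in $(seq 0 9); do touch "${DATA_DIR}/train-0000${i}-of-01024"; done
for i in $(seq 0 3); do touch "${DATA_DIR}/validation-0000${i}-of-00128"; done
echo "train shards: $(ls "${DATA_DIR}"/train-?????-of-01024 | wc -l)"
echo "validation shards: $(ls "${DATA_DIR}"/validation-?????-of-00128 | wc -l)"
```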

[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You are now
ready to train or evaluate with the ImageNet data set.

## How to Train from Scratch

**WARNING** Training an Inception v3 network from scratch is a computationally
intensive task and depending on your compute setup may take several days or even
weeks.

*Before proceeding* please read the [Convolutional Neural Networks](https://www.tensorflow.org/tutorials/deep_cnn/index.html) tutorial; in
particular, focus on [Training a Model Using Multiple GPU Cards](https://www.tensorflow.org/tutorials/deep_cnn/index.html#launching_and_training_the_model_on_multiple_gpu_cards). The model training method is nearly identical to that described in the
CIFAR-10 multi-GPU model training. Briefly, the model training

*   Places an individual model replica on each GPU.
*   Splits the batch across the GPUs.
*   Updates model parameters synchronously by waiting for all GPUs to finish
    processing a batch of data.

The training procedure is encapsulated by this diagram of how operations and
variables are placed on CPU and GPUs respectively.

<div style="width:40%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:100%" src="https://www.tensorflow.org/images/Parallelism.png">
</div>

Each tower computes the gradients for a portion of the batch and the gradients
are combined and averaged across the multiple towers in order to provide a
single update of the Variables stored on the CPU.
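
The combine-and-average step can be illustrated with a plain-Python sketch (no
TensorFlow; the gradient values are toy numbers):

```python
# Each tower computes gradients on its shard of the batch; the CPU averages
# them into one update applied to the shared variables.
def average_tower_gradients(tower_grads):
    """tower_grads: one list of per-variable gradients per GPU tower."""
    num_towers = len(tower_grads)
    return [sum(grads_for_var) / num_towers
            for grads_for_var in zip(*tower_grads)]

tower_grads = [
    [0.5, -1.0, 3.0],   # gradients from GPU 0's half of the batch
    [1.5, -3.0, 1.0],   # gradients from GPU 1's half of the batch
]
print(average_tower_gradients(tower_grads))  # [1.0, -2.0, 2.0]
```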

A crucial aspect of training a network of this size is *training speed* in terms
of wall-clock time. The training speed is dictated by many factors -- most
importantly the batch size and the learning rate schedule. Both of these
parameters are heavily coupled to the hardware setup.

Generally speaking, the batch size is a difficult parameter to tune as it
requires balancing the memory demands of the model, the memory available on the
GPU, and the speed of computation. Employing larger batch sizes leads to more
efficient computation and potentially more efficient training steps.

We have tested several hardware setups for training this model from scratch but
we emphasize that depending on your hardware setup, you may need to adapt the
batch size and learning rate schedule.

Please see the comments in `inception_train.py` for a few selected learning rate
plans tuned for particular hardware setups.

To train this model, you simply need to specify the following:

```shell
# Build the model. Note that TensorFlow must already be installed and working,
# as this command will not build TensorFlow itself.
cd tensorflow-models/inception
bazel build //inception:imagenet_train

# run it
bazel-bin/inception/imagenet_train --num_gpus=1 --batch_size=32 --train_dir=/tmp/imagenet_train --data_dir=/tmp/imagenet_data
```

The model reads in the ImageNet training data from `--data_dir`. If you followed
the instructions in [Getting Started](#getting-started), then set
`--data_dir="${DATA_DIR}"`. The script assumes that there exists a set of
sharded TFRecord files containing the ImageNet data. If you have not created
TFRecord files, please refer to [Getting Started](#getting-started).

Here is the output of the above command line when running on a Tesla K40c:

```shell
2016-03-07 12:24:59.922898: step 0, loss = 13.11 (5.3 examples/sec; 6.064 sec/batch)
2016-03-07 12:25:55.206783: step 10, loss = 13.71 (9.4 examples/sec; 3.394 sec/batch)
2016-03-07 12:26:28.905231: step 20, loss = 14.81 (9.5 examples/sec; 3.380 sec/batch)
2016-03-07 12:27:02.699719: step 30, loss = 14.45 (9.5 examples/sec; 3.378 sec/batch)
2016-03-07 12:27:36.515699: step 40, loss = 13.98 (9.5 examples/sec; 3.376 sec/batch)
2016-03-07 12:28:10.220956: step 50, loss = 13.92 (9.6 examples/sec; 3.327 sec/batch)
2016-03-07 12:28:43.658223: step 60, loss = 13.28 (9.6 examples/sec; 3.350 sec/batch)
...
```

In this example, a log entry is printed every 10 steps and the line includes the
total loss (starts around 13.0-14.0) and the speed of processing in terms of
throughput (examples / sec) and batch speed (sec/batch).
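
The two speed figures are tied together by the batch size; a quick check with
toy numbers (not measured values) for a batch size of 32:

```python
# examples/sec = batch_size / sec_per_batch (illustrative numbers only)
batch_size = 32
sec_per_batch = 3.2
examples_per_sec = batch_size / sec_per_batch
print(round(examples_per_sec, 1))  # 10.0
```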

The number of GPU devices is specified by `--num_gpus` (which defaults to 1).
Specifying `--num_gpus` greater than 1 splits the batch evenly across the GPU
cards.

```shell
# Build the model. Note that TensorFlow must already be installed and working,
# as this command will not build TensorFlow itself.
cd tensorflow-models/inception
bazel build //inception:imagenet_train

# run it
bazel-bin/inception/imagenet_train --num_gpus=2 --batch_size=64 --train_dir=/tmp/imagenet_train
```

This model splits the batch of 64 images across 2 GPUs and calculates the
average gradient by waiting for both GPUs to finish calculating the gradients
from their respective data (See diagram above). Generally speaking, using larger
numbers of GPUs leads to higher throughput as well as the opportunity to use
larger batch sizes. In turn, larger batch sizes imply better estimates of the
gradient, enabling the use of higher learning rates. In summary, using more
GPUs simply results in faster training.

Note that the batch size is a difficult parameter to tune as it requires
balancing the memory demands of the model, the memory available on the GPU, and
the speed of computation. Generally speaking, employing larger batch sizes leads
to more efficient computation and potentially more efficient training steps.

Note that there is considerable noise in the loss function on individual steps
in the previous log. Because of this noise, it is difficult to discern how well
a model is learning. The solution is to launch TensorBoard
pointing to the directory containing the events log.

```shell
tensorboard --logdir=/tmp/imagenet_train
```

TensorBoard has access to the many Summaries produced by the model that describe
multitudes of statistics tracking the model behavior and the quality of the
learned model. In particular, TensorBoard tracks an exponentially smoothed
version of the loss. In practice, it is far easier to judge how well a model
learns by monitoring the smoothed version of the loss.
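
A minimal sketch of such exponential smoothing (the decay factor and loss
values below are illustrative, not TensorBoard's actual settings):

```python
# Exponential moving average of a noisy loss curve.
def smooth(values, decay=0.9):
    out, avg = [], values[0]
    for v in values:
        avg = decay * avg + (1 - decay) * v  # blend old average with new value
        out.append(avg)
    return out

# Noisy per-step losses vs. their smoothed counterpart.
print(smooth([13.0, 14.0, 12.0, 13.5], decay=0.5))  # [13.0, 13.5, 12.75, 13.125]
```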

## How to Train from Scratch in a Distributed Setting

**NOTE** Distributed TensorFlow requires version 0.8 or later.

Distributed TensorFlow lets us use multiple machines to train a model faster.
This is quite different from training with multiple GPU towers on a single
machine where all parameters and gradient computations are in the same place. We
coordinate the computation across multiple machines by employing a centralized
repository for parameters that maintains a unified, single copy of model
parameters. Each individual machine sends gradient updates to the centralized
parameter repository which coordinates these updates and sends back updated
parameters to the individual machines running the model training.

We term each machine that runs a copy of the training a `worker` or `replica`.
We term each machine that maintains model parameters a `ps`, short for
`parameter server`. Note that we might have more than one machine acting as a
`ps` as the model parameters may be sharded across multiple machines.

Variables may be updated with synchronous or asynchronous gradient updates. One
may construct an [`Optimizer`](https://www.tensorflow.org/api_docs/python/train.html#optimizers) in TensorFlow
that constructs the necessary graph for either case diagrammed below from the
TensorFlow [Whitepaper](http://download.tensorflow.org/paper/whitepaper2015.pdf):

<div style="width:40%; margin:auto; margin-bottom:10px; margin-top:20px;">
  <img style="width:100%"
  src="https://www.tensorflow.org/images/tensorflow_figure7.png">
</div>

In [a recent paper](https://arxiv.org/abs/1604.00981), synchronous gradient
updates have been demonstrated to reach higher accuracy in a shorter amount of
time.
In this distributed Inception example we employ synchronous gradient updates.

Note that in this example each replica has a single tower that uses one GPU.

The command-line flags `worker_hosts` and `ps_hosts` specify available servers.
The same binary will be used for both the `worker` jobs and the `ps` jobs.
Command line flag `job_name` will be used to specify what role a task will be
playing and `task_id` will be used to identify which one of the jobs it is
running. Several things to note here:

*   The numbers of `ps` and `worker` tasks are inferred from the lists of hosts
    specified in the flags. The `task_id` should be within the range `[0,
    num_ps_tasks)` for `ps` tasks and `[0, num_worker_tasks)` for `worker`
    tasks.
*   `ps` and `worker` tasks can run on the same machine, as long as that machine
    has sufficient resources to handle both tasks. Note that the `ps` task does
    not benefit from a GPU, so it should not attempt to use one (see below).
*   Multiple `worker` tasks can run on the same machine with multiple GPUs so
    machine_A with 2 GPUs may have 2 workers while machine_B with 1 GPU just has
    1 worker.
*   The default learning rate schedule works well for a wide range of number of
    replicas [25, 50, 100] but feel free to tune it for even better results.
*   The command line of both `ps` and `worker` tasks should include the complete
    list of `ps_hosts` and `worker_hosts`.
*   There is a chief `worker` among all workers which defaults to `worker` 0.
    The chief will be in charge of initializing all the parameters, writing out
    the summaries and the checkpoint. The checkpoint and summary will be in the
    `train_dir` of the host for `worker` 0.
*   Each worker processes a `batch_size` number of examples but each gradient
    update is computed from all replicas. Hence, the effective batch size of
    this model is `batch_size * num_workers`.
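
The task counts and effective batch size follow directly from the flag values;
here is a small plain-Python sketch using the placeholder host names from the
example commands below:

```python
# Placeholder host lists mirroring the example --ps_hosts/--worker_hosts flags.
ps_hosts = 'ps0.example.com:2222'.split(',')
worker_hosts = 'worker0.example.com:2222,worker1.example.com:2222'.split(',')

num_ps_tasks = len(ps_hosts)          # valid ps task_ids: [0, num_ps_tasks)
num_worker_tasks = len(worker_hosts)  # valid worker task_ids: [0, num_worker_tasks)

# With synchronous updates, each step aggregates one batch per worker.
batch_size = 32
effective_batch_size = batch_size * num_worker_tasks
print(num_ps_tasks, num_worker_tasks, effective_batch_size)  # 1 2 64
```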

```shell
# Build the model. Note that TensorFlow must already be installed and working,
# as this command will not build TensorFlow itself.
cd tensorflow-models/inception
bazel build //inception:imagenet_distributed_train

# To start worker 0, go to the worker0 host and run the following (Note that
# task_id should be in the range [0, num_worker_tasks):
bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'

# To start worker 1, go to the worker1 host and run the following (Note that
# task_id should be in the range [0, num_worker_tasks):
bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=1 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'

# To start the parameter server (ps), go to the ps host and run the following (Note
# that task_id should be in the range [0, num_ps_tasks):
bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
```

If you have installed a GPU-compatible version of TensorFlow, the `ps` will also
try to allocate GPU memory even though it does not benefit from it. This could
potentially crash a worker on the same machine, as it would be left with little
to no GPU memory to allocate. To avoid this, you can prepend the command that
starts the `ps` with `CUDA_VISIBLE_DEVICES=''`:

```shell
CUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
```

If you have run everything correctly, you should see a log in each `worker` job
that looks like the following. Note that the training speed varies depending on
your
hardware and the first several steps could take much longer.

```shell
INFO:tensorflow:PS hosts are: ['ps0.example.com:2222', 'ps1.example.com:2222']
INFO:tensorflow:Worker hosts are: ['worker0.example.com:2222', 'worker1.example.com:2222']
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {ps0.example.com:2222, ps1.example.com:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {localhost:2222, worker1.example.com:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222
INFO:tensorflow:Created variable global_step:0 with shape () and init <function zeros_initializer at 0x7f6aa014b140>

...

INFO:tensorflow:Created variable logits/logits/biases:0 with shape (1001,) and init <function _initializer at 0x7f6a77f3cf50>
INFO:tensorflow:SyncReplicas enabled: replicas_to_aggregate=2; total_num_replicas=2
INFO:tensorflow:2016-04-13 01:56:26.405639 Supervisor
INFO:tensorflow:Started 2 queues for processing input data.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Worker 0: 2016-04-13 01:58:40.342404: step 0, loss = 12.97(0.0 examples/sec; 65.428  sec/batch)
INFO:tensorflow:global_step/sec: 0.0172907
...
```

and a log in each `ps` job that looks like the following:

```shell
INFO:tensorflow:PS hosts are: ['ps0.example.com:2222', 'ps1.example.com:2222']
INFO:tensorflow:Worker hosts are: ['worker0.example.com:2222', 'worker1.example.com:2222']
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {localhost:2222, ps1.example.com:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {worker0.example.com:2222, worker1.example.com:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222
```

If you compiled TensorFlow (from v1.1-rc3 onward) with VERBS support and you
have the required hardware and InfiniBand verbs software stack, you can specify
`--protocol='grpc+verbs'` to use Verbs RDMA for tensor passing between workers
and ps. The `--protocol` flag needs to be added to all tasks (ps and workers).
The default protocol is the TensorFlow default, grpc.


[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You are now
training Inception in a distributed manner.

## How to Evaluate

Evaluating an Inception v3 model on the ImageNet 2012 validation data set
requires running a separate binary.

The evaluation procedure is nearly identical to [Evaluating a Model](https://www.tensorflow.org/tutorials/deep_cnn/index.html#evaluating_a_model)
described in the [Convolutional Neural Network](https://www.tensorflow.org/tutorials/deep_cnn/index.html) tutorial.

**WARNING** Be careful not to run the evaluation and training binary on the same
GPU or else you might run out of memory. Consider running the evaluation on a
separate GPU if available or suspending the training binary while running the
evaluation on the same GPU.

Briefly, one can evaluate the model by running:

```shell
# Build the model. Note that TensorFlow must already be installed and working,
# as this command will not build TensorFlow itself.
cd tensorflow-models/inception
bazel build //inception:imagenet_eval

# run it
bazel-bin/inception/imagenet_eval --checkpoint_dir=/tmp/imagenet_train --eval_dir=/tmp/imagenet_eval
```

Note that we point `--checkpoint_dir` to the location of the checkpoints saved
by `inception_train.py` above. Running the above command results in the
following output:

```shell
2016-02-17 22:32:50.391206: precision @ 1 = 0.735
...
```

The script calculates the precision @ 1 over the entire validation data
periodically. The precision @ 1 measures how often the highest scoring
prediction from the model matched the ImageNet label -- in this case, 73.5%. If
you wish to run the eval just once and not periodically, append the `--run_once`
option.
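
As a toy illustration (the scores below are invented, not produced by the real
model), precision @ 1 can be computed like this:

```python
# Fraction of examples whose top-scoring class matches the label.
def precision_at_1(scores, labels):
    # index of the highest score per example
    top1 = [max(range(len(s)), key=s.__getitem__) for s in scores]
    return sum(p == l for p, l in zip(top1, labels)) / len(labels)

scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.2, 0.6],
          [0.9, 0.05, 0.05]]
labels = [1, 0, 2, 2]  # last example is misclassified
print(precision_at_1(scores, labels))  # 0.75
```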

Much like the training script, `imagenet_eval.py` also exports summaries that
may be visualized in TensorBoard. These summaries calculate additional
statistics on the predictions (e.g. recall @ 5) as well as monitor the
statistics of the model activations and weights during evaluation.

## How to Fine-Tune a Pre-Trained Model on a New Task

### Getting Started

Much like training the ImageNet model, we must first convert a new data set to
the sharded TFRecord format in which each entry is a serialized `tf.Example`
proto.

We have provided a script demonstrating how to do this for a small data set of
a few thousand flower images spread across 5 labels:

```shell
daisy, dandelion, roses, sunflowers, tulips
```

There is a single automated script that downloads the data set and converts it
to the TFRecord format. Much like the ImageNet data set, each record in the
TFRecord format is a serialized `tf.Example` proto whose entries include a
JPEG-encoded string and an integer label. Please see [`parse_example_proto`](inception/image_processing.py) for details.

The script takes just a few minutes to run, depending on your network connection
speed for downloading and processing the images. Your hard disk requires 200 MB
of free storage. Here we select `DATA_DIR=/tmp/flowers-data/` as such a location
but feel free to edit accordingly.

```shell
# location of where to place the flowers data
FLOWERS_DATA_DIR=/tmp/flowers-data/

# build the preprocessing script.
cd tensorflow-models/inception
bazel build //inception:download_and_preprocess_flowers

# run it
bazel-bin/inception/download_and_preprocess_flowers "${FLOWERS_DATA_DIR}"
```

If the script runs successfully, the final line of the terminal output should
look like:

```shell
2016-02-24 20:42:25.067551: Finished writing all 3170 images in data set.
```

When the script finishes you will find 2 shards for the training and validation
files in the `DATA_DIR`. The files will match the patterns `train-?????-of-00002`
and `validation-?????-of-00002`, respectively.

**NOTE** If you wish to prepare a custom image data set for transfer learning,
you will need to invoke [`build_image_data.py`](inception/data/build_image_data.py) on
your custom data set. Please see the associated options and assumptions behind
this script by reading the comments section of [`build_image_data.py`](inception/data/build_image_data.py). Also, if your custom data set has a different
number of examples or classes, you need to change the appropriate values in
[`imagenet_data.py`](inception/imagenet_data.py).
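
For orientation, scripts of this kind generally expect one sub-directory per
label. The sketch below simulates such a layout with empty placeholder files
(the directory and file names are hypothetical; check the comments in
`build_image_data.py` for the exact structure it expects):

```shell
# Hypothetical per-label directory layout, simulated with empty files.
CUSTOM_DATA_DIR=$(mktemp -d)
mkdir -p "${CUSTOM_DATA_DIR}/daisy" "${CUSTOM_DATA_DIR}/roses"
touch "${CUSTOM_DATA_DIR}/daisy/img001.jpg" "${CUSTOM_DATA_DIR}/roses/img002.jpg"
find "${CUSTOM_DATA_DIR}" -name '*.jpg' | sort
```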

The second piece you will need is a trained Inception v3 image model. You have
the option of either training one yourself (See [How to Train from Scratch](#how-to-train-from-scratch) for details) or you can download a pre-trained
model like so:

```shell
# location of where to place the Inception v3 model
INCEPTION_MODEL_DIR=$HOME/inception-v3-model
mkdir -p ${INCEPTION_MODEL_DIR}
cd ${INCEPTION_MODEL_DIR}

# download the Inception v3 model
curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
tar xzf inception-v3-2016-03-01.tar.gz

# this will create a directory called inception-v3 which contains the following files.
> ls inception-v3
README.txt
checkpoint
model.ckpt-157585
```

[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You are now
ready to fine-tune your pre-trained Inception v3 model with the flower data set.

### How to Retrain a Trained Model on the Flowers Data

We are now ready to fine-tune a pre-trained Inception-v3 model on the flowers
data set. This requires two distinct changes to our training procedure:

1.  Build the exact same model as before, except that we change the number of
    labels in the final classification layer.

2.  Restore all weights from the pre-trained Inception-v3 except for the final
    classification layer, which will be randomly initialized instead.

We can perform these two operations by specifying two flags:
`--pretrained_model_checkpoint_path` and `--fine_tune`. The first flag is a
string that points to the path of a pre-trained Inception-v3 model. If this flag
is specified, it will load the entire model from the checkpoint before the
script begins training.

The second flag `--fine_tune` is a boolean that indicates whether the last
classification layer should be randomly initialized or restored. You may set
this flag to false if you wish to continue training a pre-trained model from a
checkpoint. If you set this flag to true, you can train a new classification
layer from scratch.

In order to understand how `--fine_tune` works, please see the discussion on
`Variables` in the TensorFlow-Slim [`README.md`](inception/slim/README.md).

Putting this all together, you can retrain a pre-trained Inception-v3 model on
the flowers data set with the following command.

```shell
# Build the model. Note that TensorFlow must already be installed and working,
# as this command will not build TensorFlow itself.
cd tensorflow-models/inception
bazel build //inception:flowers_train

# Path to the downloaded Inception-v3 model.
MODEL_PATH="${INCEPTION_MODEL_DIR}/inception-v3/model.ckpt-157585"

# Directory where the flowers data resides.
FLOWERS_DATA_DIR=/tmp/flowers-data/

# Directory where to save the checkpoint and events files.
TRAIN_DIR=/tmp/flowers_train/

# Run the fine-tuning on the flowers data set starting from the pre-trained
# Inception-v3 model.
bazel-bin/inception/flowers_train \
  --train_dir="${TRAIN_DIR}" \
  --data_dir="${FLOWERS_DATA_DIR}" \
  --pretrained_model_checkpoint_path="${MODEL_PATH}" \
  --fine_tune=True \
  --initial_learning_rate=0.001 \
  --input_queue_memory_factor=1
```

We have added a few extra options to the training procedure.

*   Fine-tuning a model on a separate data set requires significantly lowering
    the initial learning rate; here we set it to 0.001.
*   The flowers data set is quite small so we shrink the size of the shuffling
    queue of examples. See [Adjusting Memory Demands](#adjusting-memory-demands)
    for more details.

The training script only reports the loss. To evaluate the quality of the
fine-tuned model, you will need to run `flowers_eval`:

```shell
# Build the model. Note that TensorFlow must already be installed and working;
# this command will not build TensorFlow itself.
cd tensorflow-models/inception
bazel build //inception:flowers_eval

# Directory where we saved the fine-tuned checkpoint and events files.
TRAIN_DIR=/tmp/flowers_train/

# Directory where the flowers data resides.
FLOWERS_DATA_DIR=/tmp/flowers-data/

# Directory where to save the evaluation events files.
EVAL_DIR=/tmp/flowers_eval/

# Evaluate the fine-tuned model on a hold-out of the flower data set.
bazel-bin/inception/flowers_eval \
  --eval_dir="${EVAL_DIR}" \
  --data_dir="${FLOWERS_DATA_DIR}" \
  --subset=validation \
  --num_examples=500 \
  --checkpoint_dir="${TRAIN_DIR}" \
  --input_queue_memory_factor=1 \
  --run_once
```

We find that the evaluation arrives at roughly 93.4% precision@1 after the model
has been running for 2000 steps.

```shell
Successfully loaded model from /tmp/flowers/model.ckpt-1999 at step=1999.
2016-03-01 16:52:51.761219: starting evaluation on (validation).
2016-03-01 16:53:05.450419: [20 batches out of 20] (36.5 examples/sec; 0.684sec/batch)
2016-03-01 16:53:05.450471: precision @ 1 = 0.9340 recall @ 5 = 0.9960 [500 examples]
```
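For reference, precision @ 1 is the fraction of validation examples whose top prediction matches the true label, and recall @ 5 counts a hit if the true label appears anywhere in the top five. A minimal sketch of how such a metric is computed on toy data (not the evaluation code itself):

```python
def precision_at_k(ranked_predictions, labels, k=1):
    # Fraction of examples whose true label appears in the top-k predictions.
    hits = sum(1 for preds, label in zip(ranked_predictions, labels)
               if label in preds[:k])
    return hits / len(labels)

# Toy example: three images, class predictions ranked best-first.
preds = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
labels = [2, 0, 0]
```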

## How to Construct a New Dataset for Retraining

One can use the existing scripts supplied with this model to build a new dataset
for training or fine-tuning. The main script to employ is
[`build_image_data.py`](inception/data/build_image_data.py). Briefly, this script takes a
structured directory of images and converts it to a sharded `TFRecord` that can
be read by the Inception model.

In particular, you will need to create a directory of training images that
reside within `$TRAIN_DIR` and `$VALIDATION_DIR`, arranged as follows:

```shell
  $TRAIN_DIR/dog/image0.jpeg
  $TRAIN_DIR/dog/image1.jpg
  $TRAIN_DIR/dog/image2.png
  ...
  $TRAIN_DIR/cat/weird-image.jpeg
  $TRAIN_DIR/cat/my-image.jpeg
  $TRAIN_DIR/cat/my-image.JPG
  ...
  $VALIDATION_DIR/dog/imageA.jpeg
  $VALIDATION_DIR/dog/imageB.jpg
  $VALIDATION_DIR/dog/imageC.png
  ...
  $VALIDATION_DIR/cat/weird-image.PNG
  $VALIDATION_DIR/cat/that-image.jpg
  $VALIDATION_DIR/cat/cat.JPG
  ...
```
**NOTE**: This script adds an extra background class at index 0, so your class
labels will range from 0 to num_labels. Using the example above, the
corresponding class labels generated from `build_image_data.py` will be as
follows:
```shell
0
1 dog
2 cat
```
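The numbering can be sketched with a small helper (hypothetical, not part of the script): index 0 is reserved for the background class, and the remaining classes are numbered from 1 in the order they appear.

```python
# Sketch of the label numbering: 0 is the background class, and classes
# are numbered from 1 in the order they appear in the labels file.
def label_indices(class_names):
    mapping = {"background": 0}
    for i, name in enumerate(class_names):
        mapping[name] = i + 1
    return mapping
```

`label_indices(["dog", "cat"])` reproduces the 0/1/2 assignment shown above.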

Each sub-directory in `$TRAIN_DIR` and `$VALIDATION_DIR` corresponds to a unique
label for the images that reside within that sub-directory. The images may be
JPEG or PNG; other image types are not currently supported.

Once the data is arranged in this directory structure, we can run
`build_image_data.py` on the data to generate the sharded `TFRecord` dataset.
Each entry of the `TFRecord` is a serialized `tf.Example` protocol buffer. A
complete list of information contained in the `tf.Example` is described in the
comments of `build_image_data.py`.
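As a rough illustration only (the authoritative field list lives in the comments of `build_image_data.py`; the names and values below are indicative, not the exact schema), each serialized example bundles the encoded image bytes with its label metadata:

```python
# Indicative sketch of the per-image features one would expect in each
# serialized tf.Example; see build_image_data.py for the real schema.
example_features = {
    "image/encoded": b"<jpeg bytes>",  # raw encoded image data
    "image/format": "JPEG",
    "image/class/label": 1,            # integer label; 0 is background
    "image/class/text": "dog",         # human-readable label
    "image/height": 480,
    "image/width": 640,
}
```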

To run `build_image_data.py`, use the following command line:

```shell
# location to where to save the TFRecord data.
OUTPUT_DIRECTORY=$HOME/my-custom-data/

# build the preprocessing script.
cd tensorflow-models/inception
bazel build //inception:build_image_data

# convert the data.
bazel-bin/inception/build_image_data \
  --train_directory="${TRAIN_DIR}" \
  --validation_directory="${VALIDATION_DIR}" \
  --output_directory="${OUTPUT_DIRECTORY}" \
  --labels_file="${LABELS_FILE}" \
  --train_shards=128 \
  --validation_shards=24 \
  --num_threads=8
```

where `$OUTPUT_DIRECTORY` is the location of the sharded `TFRecords` and
`$LABELS_FILE` is a text file, read by the script, that lists all of the
labels. For instance, for the flowers data set, `$LABELS_FILE` contained the
following data:

```shell
daisy
dandelion
roses
sunflowers
tulips
```

Note that each row of the labels file corresponds to an entry in the final
classifier in the model. That is, `daisy` corresponds to classifier entry `1`;
`dandelion` is entry `2`, etc. Label `0` is skipped as a background class.

Running this script produces files that look like the following:

```shell
  $TRAIN_DIR/train-00000-of-00128
  $TRAIN_DIR/train-00001-of-00128
  ...
  $TRAIN_DIR/train-00127-of-00128

and

  $VALIDATION_DIR/validation-00000-of-00024
  $VALIDATION_DIR/validation-00001-of-00024
  ...
  $VALIDATION_DIR/validation-00023-of-00024
```
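The shard names follow a simple `name-NNNNN-of-NNNNN` pattern, which a small helper (hypothetical, for illustration) can reproduce:

```python
def shard_filename(subset, shard_index, num_shards):
    # Zero-padded shard naming, e.g. ("train", 0, 128) -> "train-00000-of-00128".
    return "%s-%05d-of-%05d" % (subset, shard_index, num_shards)
```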

where 128 and 24 are the number of shards specified for each dataset,
respectively. Generally speaking, we aim to select the number of shards such
that roughly 1024 images reside in each shard. Once this data set is built, you
are ready to train or fine-tune an Inception model on it.
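The ~1024-images-per-shard rule of thumb translates into a one-line calculation (a sketch; pick whatever target suits your data):

```python
import math

def suggested_num_shards(num_images, images_per_shard=1024):
    # Aim for roughly 1024 images per shard.
    return max(1, math.ceil(num_images / images_per_shard))
```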

Note: if you are piggybacking on the flowers retraining scripts, be sure to
update `num_classes()` and `num_examples_per_epoch()` in `flowers_data.py`
to correspond to your data.

## Practical Considerations for Training a Model

The model architecture and training procedure are heavily dependent on the
hardware used to train the model. If you wish to train or fine-tune this model
on your machine, **you will need to adjust and empirically determine a good set
of training hyper-parameters for your setup**. What follows are some general
considerations for novices.

### Finding Good Hyperparameters

Roughly 5-10 hyper-parameters govern the speed at which a network is trained. In
addition to `--batch_size` and `--num_gpus`, there are several constants defined
in [inception_train.py](inception/inception_train.py) which dictate the learning
schedule.

```shell
RMSPROP_DECAY = 0.9                # Decay term for RMSProp.
MOMENTUM = 0.9                     # Momentum in RMSProp.
RMSPROP_EPSILON = 1.0              # Epsilon term for RMSProp.
INITIAL_LEARNING_RATE = 0.1        # Initial learning rate.
NUM_EPOCHS_PER_DECAY = 30.0        # Epochs after which learning rate decays.
LEARNING_RATE_DECAY_FACTOR = 0.16  # Learning rate decay factor.
```
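Together, the last three constants define a staircase exponential-decay schedule. A sketch of the resulting learning rate as a function of training epoch (assuming the staircase behavior these names suggest):

```python
INITIAL_LEARNING_RATE = 0.1
NUM_EPOCHS_PER_DECAY = 30.0
LEARNING_RATE_DECAY_FACTOR = 0.16

def learning_rate_at(epoch):
    # Staircase decay: multiply by the decay factor every 30 epochs.
    num_decays = int(epoch / NUM_EPOCHS_PER_DECAY)
    return INITIAL_LEARNING_RATE * LEARNING_RATE_DECAY_FACTOR ** num_decays
```

Under this schedule the rate starts at 0.1, drops to 0.016 after 30 epochs, and to 0.00256 after 60.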

There are many papers that discuss the various tricks and trade-offs associated
with training a model with stochastic gradient descent. For those new to the
field, some great references are:

*   Y Bengio, [Practical recommendations for gradient-based training of deep
    architectures](http://arxiv.org/abs/1206.5533)
*   I Goodfellow, Y Bengio and A Courville,
    [Deep Learning](http://www.deeplearningbook.org/)

What follows is a summary of some general advice for identifying appropriate
model hyper-parameters in the context of this particular model training setup.
Namely, this library provides *synchronous* updates to model parameters based on
splitting each batch across multiple GPUs.

*   Higher learning rates lead to faster training, but too high a learning rate
    leads to instability and will cause model parameters to diverge to infinity
    or NaN.

*   Larger batch sizes lead to higher quality estimates of the gradient and
    permit training the model with higher learning rates.

*   Often the GPU memory is a bottleneck that prevents employing larger batch
    sizes. Employing more GPUs allows one to use larger batch sizes because
    this model splits the batch across the GPUs.

**NOTE**: If you wish to train this model with *asynchronous* gradient updates,
you will need to substantially alter this model, and new considerations must be
factored into hyperparameter tuning. See [Large Scale Distributed Deep
Networks](http://research.google.com/archive/large_deep_networks_nips2012.html)
for a discussion in this domain.

### Adjusting Memory Demands

Training this model has large memory demands in terms of the CPU and GPU. Let's
discuss each item in turn.

GPU memory is relatively small compared to CPU memory. Two items dictate the
amount of GPU memory employed: model architecture and batch size. Assuming that
you keep the model architecture fixed, the sole parameter governing GPU demand
is the batch size. A good rule of thumb is to employ as large a batch size as
will fit on the GPU.

If you run out of GPU memory, either lower the `--batch_size` or employ more
GPUs on your desktop. The model performs batch-splitting across GPUs, thus N
GPUs can handle N times the batch size of 1 GPU.
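The batch-splitting itself is straightforward; a minimal sketch on plain lists, standing in for tensors:

```python
def split_batch(batch, num_gpus):
    # Give each GPU an equal slice of the global batch; in the real model,
    # gradients from the slices are then averaged synchronously.
    assert len(batch) % num_gpus == 0, "batch size must divide evenly"
    per_gpu = len(batch) // num_gpus
    return [batch[i * per_gpu:(i + 1) * per_gpu] for i in range(num_gpus)]
```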

The model requires a large amount of CPU memory as well. We have tuned the model
to employ about 20GB of CPU memory. Thus, having access to about 40GB of CPU
memory would be ideal.

If that is not possible, you can reduce the memory demands of the model by
lowering `--input_queue_memory_factor`. Images are preprocessed asynchronously
with respect to the main training across `--num_preprocess_threads` threads. The
preprocessed images are stored in a shuffling queue from which each GPU performs
a dequeue operation to receive a `batch_size` worth of images.

In order to guarantee good shuffling across the data, we maintain a large
shuffling queue of 1024 x `input_queue_memory_factor` images. For the current
model architecture, this corresponds to about 4GB of CPU memory. You may lower
`input_queue_memory_factor` in order to decrease the memory footprint. Keep in
mind though that lowering this value drastically may result in a model with
slightly lower predictive accuracy when training from scratch. Please see
comments in [`image_processing.py`](inception/image_processing.py) for more details.
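As a back-of-the-envelope check (assuming the queue buffers each image as a 299x299x3 uint8 tensor, which is an approximation rather than the exact implementation):

```python
def shuffle_queue_gigabytes(input_queue_memory_factor,
                            height=299, width=299, channels=3,
                            bytes_per_value=1):
    # The queue holds 1024 * factor images of height x width x channels.
    num_images = 1024 * input_queue_memory_factor
    return num_images * height * width * channels * bytes_per_value / 1e9
```

With a factor of 16 this comes to roughly 4.4GB, in the ballpark of the ~4GB quoted above; the factor of 1 used in the flowers fine-tuning example needs only ~0.3GB.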

## Troubleshooting

#### The model runs out of CPU memory.

In lieu of buying more CPU memory, an easy fix is to decrease
`--input_queue_memory_factor`. See [Adjusting Memory Demands](#adjusting-memory-demands).

#### The model runs out of GPU memory.

The data is not able to fit on the GPU card. The simplest solution is to
decrease the batch size of the model. Otherwise, you will need to think about a
more sophisticated method for specifying the training that cuts up the model
across multiple `session.run()` calls or partitions the model across multiple
GPUs. See [Using GPUs](https://www.tensorflow.org/how_tos/using_gpu/index.html)
and [Adjusting Memory Demands](#adjusting-memory-demands) for more information.

#### The model training results in NaNs.

The learning rate of the model is too high. Turn down your learning rate.

#### I wish to train a model with a different image size.

The simplest solution is to artificially resize your images to `299x299` pixels.
See the [Images](https://www.tensorflow.org/api_docs/python/image.html) section
for many resizing, cropping and padding methods. Note that the entire model
architecture is predicated on a `299x299` image; if you wish to change the
input image size, you may need to redesign the entire model architecture.
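One common recipe (a sketch, not the model's own preprocessing pipeline) is to take a central square crop and then resize it to 299x299 with your image library of choice; the crop box is easy to compute:

```python
def central_crop_box(width, height):
    # Largest centered square inside a width x height image,
    # returned as (left, top, right, bottom) pixel coordinates.
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)
```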

#### What hardware specification are these hyper-parameters targeted for?

We targeted a desktop with 128GB of CPU RAM connected to 8 NVIDIA Tesla K40 GPU
cards, but we have also run this on desktops with 32GB of CPU RAM and 1 NVIDIA
Tesla K40. You can get a sense of the various training configurations we tested
by reading the comments in
[`inception_train.py`](inception/inception_train.py).

#### How do I continue training from a checkpoint in a distributed setting?

You only need to make sure that the checkpoint is in a location that can be
reached by all of the `ps` tasks. By specifying the checkpoint location with
`--train_dir`, the `ps` servers will load the checkpoint before commencing
training.