This folder contains the Keras implementation of the ResNet models. For more 
information about the models, please refer to this [README file](../README.md).

Similar to the [estimator implementation](/official/resnet), the Keras 
implementation has code for both CIFAR-10 data and ImageNet data. The CIFAR-10
version uses a ResNet56 model implemented in 
[`resnet_cifar_model.py`](./resnet_cifar_model.py), and the ImageNet version 
uses a ResNet50 model implemented in [`resnet_model.py`](./resnet_model.py).

To use either dataset, make sure that you have the latest version of TensorFlow
installed and
[add the models folder to your Python path](/official/#running-the-models);
otherwise you may encounter an error like `ImportError: No module named
official.resnet`.
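For example, one way to put the models folder on your Python path is to export
`PYTHONPATH` in your shell. The path below is a placeholder; point it at
wherever you cloned the `tensorflow/models` repository:

```bash
# Placeholder path: replace with your local clone of tensorflow/models.
export PYTHONPATH="$PYTHONPATH:/path/to/models"
```

To make the setting persistent, you can add the line to your shell profile
(e.g. `~/.bashrc`).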

## CIFAR-10

Download and extract the CIFAR-10 data. You can use the following script:
```bash
python cifar10_download_and_extract.py
```

After downloading the data, you can run the program with:

```bash
python keras_cifar_main.py
```

If you did not use the default directory to download the data, specify the 
location with the `--data_dir` flag, like:

```bash
python keras_cifar_main.py --data_dir=/path/to/cifar
```

## ImageNet

Download the ImageNet dataset and convert it to TFRecord format. 
The following [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py)
and [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy)
provide a few options.

Once your dataset is ready, you can begin training the model as follows:

```bash
python keras_imagenet_main.py 
```

Again, if you did not download the data to the default directory, specify the
location with the `--data_dir` flag:

```bash
python keras_imagenet_main.py --data_dir=/path/to/imagenet
```

There are more flag options you can specify. Here are some examples:

- `--use_synthetic_data`: when set to true, synthetic data, rather than real
data, is used.
- `--batch_size`: the batch size used for the model.
- `--model_dir`: the directory in which to save model checkpoints.
- `--train_epochs`: the number of epochs to train the model.
- `--train_steps`: the number of training steps to run. Currently this must be
smaller than the number of batches in an epoch.
- `--skip_eval`: when set to true, evaluation, as well as validation during
training, is skipped.

For example, here is a typical command for training on ImageNet data with a
batch size of 128 per GPU:

```bash
python -m keras_imagenet_main \
--model_dir=/tmp/model_dir/something \
--num_gpus=2 \
--batch_size=128 \
--train_epochs=90 \
--train_steps=10 \
--use_synthetic_data=false
```

See [`keras_common.py`](./keras_common.py) for the full list of options.

## Using multiple GPUs

You can train these models on multiple GPUs using the `tf.distribute.Strategy`
API. You can read more about it in this
[guide](https://www.tensorflow.org/guide/distribute_strategy).

In this example, we have made multi-GPU training easy to use with a single
command-line flag, `--num_gpus`. By default this flag is 1 if TensorFlow was
compiled with CUDA, and 0 otherwise.

- `--num_gpus=0`: uses `tf.distribute.OneDeviceStrategy` with the CPU as the
device.
- `--num_gpus=1`: uses `tf.distribute.OneDeviceStrategy` with the GPU as the
device.
- `--num_gpus=2` or higher: uses `tf.distribute.MirroredStrategy` to run
synchronous distributed training across the GPUs.
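If you are unsure how many GPUs a machine has, one way to choose a value for
`--num_gpus` is to count them with `nvidia-smi`. This is a sketch that assumes
the NVIDIA driver tools are installed; it falls back to 0 when they are not:

```bash
# Count visible NVIDIA GPUs; yields 0 if nvidia-smi is unavailable.
num_gpus=$(nvidia-smi -L 2>/dev/null | wc -l)
echo "Detected ${num_gpus} GPU(s)"
```

You could then pass `--num_gpus=$num_gpus` on the training command line.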

If you wish to run without `tf.distribute.Strategy`, you can do so by setting 
`--distribution_strategy=off`.