CIFAR-10 is a common benchmark in machine learning for image recognition.
http://www.cs.toronto.edu/~kriz/cifar.html
Code in this directory focuses on how to use TensorFlow Estimators to train and evaluate a CIFAR-10 ResNet model on:
* A single host with one CPU;
* A single host with multiple GPUs;
* Multiple hosts with CPUs or multiple GPUs.
## Prerequisite
1. Install TensorFlow version 1.2.1 or later with GPU support.
You can see how to do it [here](https://www.tensorflow.org/install/).
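If you want to double-check the install, here is a minimal version and device check (nothing here is specific to this tutorial; `device_lib` is an internal-but-stable TensorFlow module commonly used to list local devices):

```python
from distutils.version import LooseVersion

import tensorflow as tf
from tensorflow.python.client import device_lib

# The tutorial requires TensorFlow 1.2.1 or later.
assert LooseVersion(tf.__version__) >= LooseVersion('1.2.1'), tf.__version__
# With GPU support installed, at least one device name should contain 'GPU'.
print([d.name for d in device_lib.list_local_devices()])
```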
There are more command-line flags to play with; check `cifar10_main.py` for details.
## How to run in distributed mode
### Set TF_CONFIG
Assuming you already have multiple hosts configured, all you need is a `TF_CONFIG` environment variable on each host. You can set up the hosts manually or check [tensorflow/ecosystem](https://github.com/tensorflow/ecosystem) for instructions on how to set up a cluster.

The `TF_CONFIG` is used by the `RunConfig` to discover the existing hosts and the task of each one: `master`, `ps` or `worker`.
Here's an example of a `TF_CONFIG`:
```python
import json
import os

cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'master', 'index': 0},
     'model_dir': 'gs://<bucket_path>/<dir_path>',
     'environment': 'cloud'})
```
*Cluster*

A cluster spec, which is basically a dictionary that describes all of the tasks in the cluster. More about it [here](https://www.tensorflow.org/deploy/distributed). In this cluster spec we are defining a cluster with 1 master, 1 ps and 1 worker:

* `ps`: stores the model parameters. All workers can read/write/update the parameters via the ps. Because some models are extremely large, the parameters can be shared among multiple ps tasks (each ps stores a subset).
* `worker`: does the training.
* `master`: basically a special worker; it does training, but also restores and saves checkpoints and does evaluation.
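For reference, the dictionary above is exactly what TensorFlow's `tf.train.ClusterSpec` encodes. A minimal sketch using the example addresses (illustrative only, not code from this tutorial):

```python
import tensorflow as tf

# One job per role; each job maps to a list of host:port addresses.
cluster_spec = tf.train.ClusterSpec({'master': ['master-ip:8000'],
                                     'ps': ['ps-ip:8000'],
                                     'worker': ['worker-ip:8000']})
print(cluster_spec.jobs)                 # e.g. ['master', 'ps', 'worker']
print(cluster_spec.job_tasks('worker'))  # ['worker-ip:8000']
```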
*Task*

The task defines the role of the current node; in this example the node is the master, at index 0 in the cluster spec. The task is different for each node. For example, the `TF_CONFIG` for a worker would be:
```python
import json
import os

cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'worker', 'index': 0},
     'model_dir': 'gs://<bucket_path>/<dir_path>',
     'environment': 'cloud'})
```
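Since only the `task` field changes from host to host, it can be convenient to generate every host's `TF_CONFIG` from a single cluster spec. A small sketch (the `make_tf_config` helper is hypothetical, not part of the tutorial code):

```python
import json

cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

def make_tf_config(task_type, index):
    # Everything except 'task' is identical on every host.
    return json.dumps({'cluster': cluster,
                       'task': {'type': task_type, 'index': index},
                       'model_dir': 'gs://<bucket_path>/<dir_path>',
                       'environment': 'cloud'})

for job, hosts in sorted(cluster.items()):
    for index in range(len(hosts)):
        print(job, index, make_tf_config(job, index))
```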
*Model_dir*

This is the path where the master will save the checkpoints, graph and TensorBoard files. For a multi-host environment you may want to use a distributed file system; Google Cloud Storage and HDFS are supported.
*Environment*

By default the environment is *local*; for a distributed setting we need to change it to *cloud*.
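Before launching, it is worth verifying that each host sees the `TF_CONFIG` you intended. A minimal check using only the standard library (`describe_tf_config` is a hypothetical helper):

```python
import json
import os

def describe_tf_config():
    # Fails loudly if TF_CONFIG is missing or malformed on this host.
    tf_config = json.loads(os.environ['TF_CONFIG'])
    task = tf_config['task']
    print('this host runs task %s #%d' % (task['type'], task['index']))
    print('model_dir: %s' % tf_config['model_dir'])
    print('environment: %s' % tf_config['environment'])
    for job, hosts in tf_config['cluster'].items():
        print('job %r -> %s' % (job, hosts))

describe_tf_config()
```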
### Running the script

Once you have a `TF_CONFIG` configured properly on each host, you're ready to run in a distributed setting.
#### Master

```shell
# Run this on master:
# Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for 40000 steps.
# It will run evaluation a couple of times during training.
# The num_workers argument is used only to update the learning rate correctly.
# Make sure the model_dir is the same as defined in the TF_CONFIG.
python cifar10_main.py --data-dir=gs://<bucket_path>/cifar-10-data \
                       --job-dir=gs://<bucket_path>/<dir_path> \
                       --num-gpus=4 \
                       --train-steps=40000 \
                       --sync
```

*Output:*

```shell
2017-07-31 22:54:58.928088: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> master-ip:8000}
2017-07-31 22:54:58.928153: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:8000}
2017-07-31 22:54:58.928160: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker-ip:8000}
2017-07-31 22:54:58.929873: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
```
## Visualizing results with TensorBoard
When using Estimators you can also visualize your data in TensorBoard with no changes to your code. You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data, like images, that pass through it.

You'll see something similar to this if you point TensorBoard to the `model_dir` you used to train or evaluate your model.
```shell
# Check TensorBoard during training or after it.
# Just point TensorBoard to the model_dir you chose on the previous step,
# i.e. the same model_dir defined in the TF_CONFIG.
tensorboard --logdir=gs://<bucket_path>/<dir_path>
```