There are more command line flags to play with; run `python cifar10_main.py --help` for details.
## How to run in distributed mode
### (Optional) Running on Google Cloud Machine Learning Engine
This example can be run on Google Cloud Machine Learning Engine (ML Engine), which will configure the environment and take care of running workers, parameter servers, and masters in a fault-tolerant way.
To install the command line tool, and set up a project and billing, see the quickstart [here](https://cloud.google.com/ml-engine/docs/quickstarts/command-line).
You'll also need a Google Cloud Storage bucket for the data. If you followed the instructions above, you can just run:
```
MY_BUCKET=gs://<my-bucket-name>
gsutil cp -r ${PWD}/cifar-10-data $MY_BUCKET/
```
Then run the following command from the `tutorials/image` directory of this repository (the parent directory of this README):
```
gcloud ml-engine jobs submit training cifarmultigpu \
...
...
```
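A sketch of what the full submit command might look like; the runtime version, the `cmle_config.yaml` config file, the package/module paths, and the trainer flags after `--` are assumptions, so adjust them to your checkout:

```
# Paths, runtime version, and trainer flags below are assumptions;
# adjust to your checkout and your bucket.
gcloud ml-engine jobs submit training cifarmultigpu \
    --runtime-version 1.2 \
    --job-dir=$MY_BUCKET/model_dirs/cifarmultigpu \
    --config cifar10_estimator/cmle_config.yaml \
    --package-path cifar10_estimator/ \
    --module-name cifar10_estimator.cifar10_main \
    -- \
    --data-dir=$MY_BUCKET/cifar-10-data \
    --num-gpus=4 \
    --train-steps=1000
```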
### Set TF_CONFIG
Assuming you already have multiple hosts configured, all you need is a `TF_CONFIG` environment variable on each host. You can set up the hosts manually or check [tensorflow/ecosystem](https://github.com/tensorflow/ecosystem) for instructions on how to set up a cluster.
The `TF_CONFIG` environment variable is used by the `RunConfig` to discover the existing hosts and the task of each one: `master`, `ps`, or `worker`.
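As a minimal sketch of how this is consumed, assuming the `tf.contrib.learn` `RunConfig` of this tutorial's TensorFlow version (the property names come from that era's `ClusterConfig` and may differ in later releases):

```python
import tensorflow as tf

# RunConfig parses the TF_CONFIG environment variable at construction
# time to learn the cluster spec and this node's role.
config = tf.contrib.learn.RunConfig()
print(config.task_type, config.task_id)  # e.g. 'master' 0
```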
Here's an example of `TF_CONFIG`.
...
...
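As a sketch of what such a `TF_CONFIG` might contain, with placeholder host addresses and bucket path (the fields mirror the *Cluster*, *Task*, *Model_dir*, and *Environment* entries described below):

```python
import json

# Placeholder host addresses: one entry per host in each role.
cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

TF_CONFIG = json.dumps(
    {'cluster': cluster,
     # This node is the master, index 0 in the cluster spec.
     'task': {'type': 'master', 'index': 0},
     # Where the master saves checkpoints, graph, and TensorBoard files.
     'model_dir': 'gs://<my-bucket-name>/model_dir',
     # 'cloud' rather than the default 'local' for distributed runs.
     'environment': 'cloud'})
```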
*Cluster*
A cluster spec is basically a dictionary that describes all of the tasks in the cluster. More about it [here](https://www.tensorflow.org/deploy/distributed).
In this cluster spec we are defining a cluster with 1 master, 1 ps and 1 worker.
* `ps`: stores the parameters that all workers share. All workers can read/write/update the parameters of the model via the ps. Since some models are extremely large, the parameters are sharded across the ps tasks (each ps stores a subset).
* `worker`: does the training.
* `master`: basically a special worker; it does training, but also saves and restores checkpoints and runs evaluation.
*Task*
The task defines the role of the current node; in this example the node is the master at index 0 in the cluster spec. The task will be different for each node. An example of the `TF_CONFIG` for a worker would be:
```python
cluster={'master':['master-ip:8000'],
...
...
```
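Completing that sketch under the same placeholder assumptions as the master example above, only the `task` entry differs:

```python
import json

cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

TF_CONFIG = json.dumps(
    {'cluster': cluster,
     # Same cluster spec as the master; this node is worker number 0.
     'task': {'type': 'worker', 'index': 0},
     'model_dir': 'gs://<my-bucket-name>/model_dir',
     'environment': 'cloud'})
```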
*Model_dir*
This is the path where the master will save the checkpoints, graph, and TensorBoard files. For a multi-host environment you may want to use a distributed file system; Google Cloud Storage and HDFS are supported. For example, `gs://<my-bucket-name>/model_dir` works as a shared `model_dir` across hosts.
*Environment*
By default the environment is *local*; for a distributed setting we need to change it to *cloud*.
### Running the script
Once you have a `TF_CONFIG` configured properly on each host, you're ready to run in a distributed setting.
#### Master
Run this on the master. It runs an Experiment in sync mode on 4 GPUs, using the CPU as parameter server, for 40000 steps, and it will run evaluation a couple of times during training. The `num_workers` argument is used only to update the learning rate correctly. Make sure the `model_dir` is the same as the one defined in the `TF_CONFIG`.
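As a rough sketch, the invocation might look like this; the exact flag names are assumptions, so verify them against `python cifar10_main.py --help`:

```
# Flag names below are assumed; check `python cifar10_main.py --help`.
python cifar10_main.py --data-dir=gs://<my-bucket-name>/cifar-10-data \
                       --job-dir=gs://<my-bucket-name>/model_dir \
                       --num-gpus=4 \
                       --train-steps=40000 \
                       --sync
```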
When using Estimators you can also visualize your data in TensorBoard with no changes to your code. You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data, like images, that pass through it.
You'll see something similar to this if you "point" TensorBoard to the `model_dir` you used to train or evaluate your model.
Check TensorBoard during training or after it. Just point TensorBoard to the `model_dir` you chose in the previous step.
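For example (the path is a placeholder; use the same `model_dir` as in your `TF_CONFIG`):

```
tensorboard --logdir=gs://<my-bucket-name>/model_dir
```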