CIFAR-10 is a common benchmark in machine learning for image recognition.
http://www.cs.toronto.edu/~kriz/cifar.html
Code in this directory focuses on how to use TensorFlow Estimators to train and evaluate a CIFAR-10 ResNet model on:
* A single host with one CPU;
* A single host with multiple GPUs;
* Multiple hosts with CPUs or multiple GPUs.
## Prerequisite
1. Install TensorFlow version 1.2.1 or later with GPU support.
You can see how to do it [here](https://www.tensorflow.org/install/).
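If you want to double-check the install, here is a minimal version and device check (nothing here is specific to this tutorial; `device_lib` is an internal-but-stable TensorFlow module commonly used to list local devices):

```python
from distutils.version import LooseVersion

import tensorflow as tf
from tensorflow.python.client import device_lib

# The tutorial requires TensorFlow 1.2.1 or later.
assert LooseVersion(tf.__version__) >= LooseVersion('1.2.1'), tf.__version__
# With GPU support installed, at least one device name should contain 'GPU'.
print([d.name for d in device_lib.list_local_devices()])
```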
There are more command-line flags to play with; check `cifar10_main.py` for details.
## How to run in distributed mode
### Set TF_CONFIG
Assuming you already have multiple hosts configured, all you need is a `TF_CONFIG` environment variable on each host. You can set up the hosts manually or check [tensorflow/ecosystem](https://github.com/tensorflow/ecosystem) for instructions on how to set up a cluster.

The `TF_CONFIG` is used by the `RunConfig` to discover the existing hosts and the task of each one: `master`, `ps` or `worker`.
Here's an example of a `TF_CONFIG`:
```python
import json
import os

cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'master', 'index': 0},
     'model_dir': 'gs://<bucket_path>/<dir_path>',
     'environment': 'cloud'})
```
*Cluster*

A cluster spec, which is basically a dictionary that describes all of the tasks in the cluster. More about it [here](https://www.tensorflow.org/deploy/distributed). In this cluster spec we are defining a cluster with 1 master, 1 ps and 1 worker:

* `ps`: stores the model parameters. All workers can read/write/update the parameters via the ps. Because some models are extremely large, the parameters can be shared among multiple ps tasks (each ps stores a subset).
* `worker`: does the training.
* `master`: basically a special worker; it does training, but also restores and saves checkpoints and does evaluation.
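For reference, the dictionary above is exactly what TensorFlow's `tf.train.ClusterSpec` encodes. A minimal sketch using the example addresses (illustrative only, not code from this tutorial):

```python
import tensorflow as tf

# One job per role; each job maps to a list of host:port addresses.
cluster_spec = tf.train.ClusterSpec({'master': ['master-ip:8000'],
                                     'ps': ['ps-ip:8000'],
                                     'worker': ['worker-ip:8000']})
print(cluster_spec.jobs)                 # e.g. ['master', 'ps', 'worker']
print(cluster_spec.job_tasks('worker'))  # ['worker-ip:8000']
```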
*Task*

The task defines the role of the current node; in this example the node is the master, at index 0 in the cluster spec. The task is different for each node. For example, the `TF_CONFIG` for a worker would be:
```python
import json
import os

cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'worker', 'index': 0},
     'model_dir': 'gs://<bucket_path>/<dir_path>',
     'environment': 'cloud'})
```
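Since only the `task` field changes from host to host, it can be convenient to generate every host's `TF_CONFIG` from a single cluster spec. A small sketch (the `make_tf_config` helper is hypothetical, not part of the tutorial code):

```python
import json

cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

def make_tf_config(task_type, index):
    # Everything except 'task' is identical on every host.
    return json.dumps({'cluster': cluster,
                       'task': {'type': task_type, 'index': index},
                       'model_dir': 'gs://<bucket_path>/<dir_path>',
                       'environment': 'cloud'})

for job, hosts in sorted(cluster.items()):
    for index in range(len(hosts)):
        print(job, index, make_tf_config(job, index))
```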
*Model_dir*

This is the path where the master will save the checkpoints, graph and TensorBoard files. For a multi-host environment you may want to use a distributed file system; Google Cloud Storage and HDFS are supported.
*Environment*

By default the environment is *local*; for a distributed setting we need to change it to *cloud*.
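Before launching, it is worth verifying that each host sees the `TF_CONFIG` you intended. A minimal check using only the standard library (`describe_tf_config` is a hypothetical helper):

```python
import json
import os

def describe_tf_config():
    # Fails loudly if TF_CONFIG is missing or malformed on this host.
    tf_config = json.loads(os.environ['TF_CONFIG'])
    task = tf_config['task']
    print('this host runs task %s #%d' % (task['type'], task['index']))
    print('model_dir: %s' % tf_config['model_dir'])
    print('environment: %s' % tf_config['environment'])
    for job, hosts in tf_config['cluster'].items():
        print('job %r -> %s' % (job, hosts))

describe_tf_config()
```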
### Running the script

Once you have a `TF_CONFIG` configured properly on each host, you're ready to run in a distributed setting.
#### Master

```shell
# Run this on master:
# Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for 40000 steps.
# It will run evaluation a couple of times during training.
# The num_workers argument is used only to update the learning rate correctly.
# Make sure the model_dir is the same as defined in the TF_CONFIG.
python cifar10_main.py --data-dir=gs://<bucket_path>/cifar-10-data \
                       --job-dir=gs://<bucket_path>/<dir_path> \
                       --num-gpus=4 \
                       --train-steps=40000 \
                       --sync
```

*Output:*

```shell
2017-07-31 22:54:58.928088: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> master-ip:8000}
2017-07-31 22:54:58.928153: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:8000}
2017-07-31 22:54:58.928160: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker-ip:8000}
2017-07-31 22:54:58.929873: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
```
## Visualizing results with TensorBoard
When using Estimators you can also visualize your data in TensorBoard with no changes to your code. You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data, like images, that pass through it.

You'll see something similar to this if you point TensorBoard to the `model_dir` you used to train or evaluate your model.
```shell
# Check TensorBoard during training or after it.
# Just point TensorBoard to the model_dir you chose on the previous step,
# i.e. the same model_dir defined in the TF_CONFIG.
tensorboard --logdir=gs://<bucket_path>/<dir_path>
```