Commit 84b58a60 authored by Jianmin Chen, committed by Derek Murray

Implement distributed inception (#44)

Implements a distributed trainer for Inception.
parent 9a1dfdf2
# Inception in TensorFlow
[TOC]
[ImageNet](http://www.image-net.org/) is a common academic data set in machine
learning for training an image recognition system. Code in this directory
demonstrates how to use TensorFlow to train and evaluate
a type of convolutional neural network (CNN) on this academic data set.
In particular, we demonstrate how to train the Inception v3 architecture
as specified in:
_Rethinking the Inception Architecture for Computer Vision_
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens,
Zbigniew Wojna
http://arxiv.org/abs/1512.00567
This network achieves 21.2% top-1 and 5.6% top-5 error for single frame
evaluation with a computational cost of 5 billion multiply-adds per inference
and using fewer than 25 million parameters. Below is a visualization of the
model architecture.
<center>
![Inception-v3 Architecture](g3doc/inception_v3_architecture.png)
The code base provides three core binaries for:
* Training an Inception v3 network from scratch across multiple GPUs and/or
multiple machines using the ImageNet 2012 Challenge training data set.
* Evaluating an Inception v3 network using the ImageNet 2012 Challenge
validation data set.
* Retraining an Inception v3 network on a novel task and back-propagating the
errors to fine tune the network weights.
The training procedure employs synchronous stochastic gradient descent across
multiple GPUs. The user may specify the number of GPUs they wish to harness.
The synchronous training performs *batch-splitting* by dividing a given batch
across multiple GPUs.
The training setup is nearly identical to the section [Training a Model
Using Multiple GPU Cards](https://www.tensorflow.org/tutorials/deep_cnn/index.html#training-a-model-using-multiple-gpu-cards)
where we have substituted the CIFAR-10 model architecture
with Inception v3. The primary differences with that setup are:
* Calculate and update the batch-norm statistics during training so that they
may be substituted in during evaluation.
* Specify the model architecture using a (still experimental) higher
level language called TensorFlow-Slim.
For more details about TensorFlow-Slim, please see the
[Slim README](inception/slim/README.md). Please
note that this higher-level language is still *experimental* and the API may
change over time depending on usage and subsequent research.
## Getting Started
**NOTE** Before doing anything, we first need to build TensorFlow from source,
and install it as a PIP package.
Please follow the instructions at
[Installing From Source](https://www.tensorflow.org/versions/r0.7/get_started/os_setup.html#create-the-pip-package-and-install).
Before you run the training script for the first time, you will need to
download and convert the ImageNet data to native TFRecord format. The TFRecord
format consists of a set of sharded files where each entry is a serialized
`tf.Example` proto. Each `tf.Example` proto contains the ImageNet image (JPEG
encoded) as well as metadata such as label and bounding box information. See
[`parse_example_proto`](inception/image_processing.py) for details.
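For orientation, the sketch below builds one toy `tf.Example` with the two fields
described above and parses it back. It is written against the TensorFlow 1.x-style
parsing API, and the feature keys are illustrative; see the repository's
`parse_example_proto` for the exact keys and the additional metadata (e.g.
bounding boxes).

```python
import tensorflow as tf

# A toy record: JPEG bytes plus an integer label.
example = tf.train.Example(features=tf.train.Features(feature={
    'image/encoded': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b'<jpeg bytes>'])),
    'image/class/label': tf.train.Feature(
        int64_list=tf.train.Int64List(value=[7])),
}))
serialized = example.SerializeToString()

# Declare the expected features and parse the serialized proto back.
features = tf.parse_single_example(serialized, {
    'image/encoded': tf.FixedLenFeature([], tf.string),
    'image/class/label': tf.FixedLenFeature([], tf.int64),
})
with tf.Session() as sess:
  print(sess.run(features['image/class/label']))  # -> 7
```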
We provide a single
[script](inception/data/download_and_preprocess_imagenet.sh)
for downloading and converting ImageNet data to TFRecord format. Downloading
and preprocessing the data may take several hours (up to half a day) depending
on your network and computer speed. Please be patient.
To begin, you will need to sign up for an account with
[ImageNet](http://image-net.org) to gain access to the data. Look for the
sign up page, create an account and request an access key to download the data.
After you have `USERNAME` and `PASSWORD`, you are ready to run our script.
Make sure that your hard disk has at least 500 GB of free space for downloading
and storing the data. Here we select `DATA_DIR=$HOME/imagenet-data` as such a
location but feel free to edit accordingly.
When you run the below script, please enter *USERNAME* and *PASSWORD*
when prompted. This will occur at the very beginning. Once these values are
entered, you will not need to interact with the script again.
```shell
# location of where to place the ImageNet data
DATA_DIR=$HOME/imagenet-data
# build the preprocessing script.
bazel build inception/download_and_preprocess_imagenet
# run it
bazel-bin/inception/download_and_preprocess_imagenet "${DATA_DIR}"
bazel-bin/inception/download_and_preprocess_imagenet "${DATA_DIR}$"
```
The final line of the output script should read:
```shell
...
```
When the script finishes you will find 1024 and 128 training and validation
files in the `DATA_DIR`. The files will match the patterns
`train-?????-of-01024` and `validation-?????-of-00128`, respectively.
[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0)
You are now ready to train or evaluate with the ImageNet data set.
## How to Train from Scratch
**WARNING** Training an Inception v3 network from scratch is a computationally
intensive task and depending on your compute setup may take several days or
even weeks.
*Before proceeding*, please read the [Convolutional Neural Networks](https://www.tensorflow.org/tutorials/deep_cnn/index.html)
tutorial; in particular, focus on [Training a Model Using Multiple GPU Cards](https://www.tensorflow.org/tutorials/deep_cnn/index.html#training-a-model-using-multiple-gpu-cards).
The model training method is nearly identical to that described in the CIFAR-10
multi-GPU model training. Briefly, the model training:
* Places an individual model replica on each GPU and splits the batch across
  the GPUs.
* Updates model parameters synchronously by waiting for all GPUs to finish
processing a batch of data.
The training procedure is encapsulated by this diagram of how operations and
variables are placed on CPU and GPUs respectively.
The most important hyper-parameters are the batch size and the learning rate
schedule. Both of these parameters are heavily coupled to the hardware setup.
Generally speaking, a batch size is a difficult parameter to tune as it requires
balancing memory demands of the model, memory available on the GPU and speed
of computation. In general, employing larger batch sizes leads to more
efficient computation and potentially more efficient training steps.
We have tested several hardware setups for training this model from scratch but
we emphasize that depending on your hardware setup, you may need to adapt the
batch size and learning rate schedule.
Please see the comments in [`inception_train.py`](inception/inception_train.py) for a few selected learning rate
plans based on some selected hardware setups.
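For orientation, the schedule boils down to exponential decay of the kind sketched
below. The specific numbers (a 0.045 initial rate decayed by 0.94 every two
epochs) are the defaults that appear in the distributed trainer added by this
commit; the value of `num_batches_per_epoch` here is purely illustrative.

```python
import tensorflow as tf

global_step = tf.Variable(0, trainable=False)
num_batches_per_epoch = 40000                     # examples per epoch / batch size (illustrative)
decay_steps = int(num_batches_per_epoch * 2.0)    # decay roughly every 2 epochs

lr = tf.train.exponential_decay(0.045,            # initial learning rate
                                global_step,
                                decay_steps,
                                0.94,              # decay factor
                                staircase=True)
```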
To train this model, you simply need to specify the following:
```shell
# Build the model. Note that we need to make sure TensorFlow is ready to use
# before this step, as this command will not build TensorFlow.
bazel build inception/imagenet_train
# run it
bazel-bin/inception/imagenet_train --num_gpus=1 --batch_size=32 --train_dir=/tmp/imagenet_train --data_dir=/tmp/imagenet_data
```
The model reads in the ImageNet training data from `--data_dir`.
Here is the output of the above command line when running on a Tesla K40c:

```shell
...
```
This example highlights several important points:
* A log entry is printed every 10 steps and the line includes the total loss
  (starts around 13.0-14.0) and the speed of processing in terms of throughput
  (examples / sec) and batch speed (sec/batch).
* The first step in training is always slow. The primary reason for the slow
  speed is that during the first step of training, the preprocessing queue must
  first fill up with several thousand example images in order to reach its
  minimum capacity before training starts.
The number of GPU devices is specified by `--num_gpus` (which defaults to 1).
Specifying `--num_gpus` greater than 1 splits the batch evenly across the GPU
cards.
```shell
# Build the model. Note that we need to make sure TensorFlow is ready to use
# before this step, as this command will not build TensorFlow.
bazel build inception/imagenet_train
# run it
bazel-bin/inception/imagenet_train --num_gpus=2 --batch_size=64 --train_dir=/tmp/imagenet_train
```
This model splits the batch of 64 images across 2 GPUs and calculates the
average gradient by waiting for both GPUs to finish calculating the gradients
from their respective data (see diagram above). Generally speaking, using larger
numbers of GPUs leads to higher throughput as well as the opportunity to use
larger batch sizes. In turn, larger batch sizes imply better estimates of the
gradient, enabling the use of higher learning rates. In summary, using more GPUs
simply results in faster training.
Note that selecting a batch size is a difficult parameter to tune as it requires
balancing memory demands of the model, memory available on the GPU and speed
of computation. Generally speaking, employing larger batch sizes leads to
more efficient computation and potentially more efficient training steps.
Note that there is considerable noise in the loss function on individual steps
in the previous log. Because of this noise, it is difficult to discern how well
the model is learning from the raw values alone. To monitor training, run
TensorBoard pointing to the directory containing the events log:

```shell
tensorboard --logdir=/tmp/imagenet_train
```
TensorBoard has access to the many Summaries produced by the model that
describe multitudes of statistics tracking the model behavior and the quality
of the learned model. In particular, TensorBoard tracks an exponentially smoothed
version of the loss. In practice, it is far easier to judge how well a model
learns by monitoring the smoothed version of the loss.
## How to Train from Scratch in a Distributed Setting
**NOTE** Distributed TensorFlow requires version 0.8 or later.
Distributed TensorFlow lets us use multiple machines to train a model faster.
This is quite different from training with multiple GPU towers on a single
machine, where all parameter and gradient computation happens in the same place. We
coordinate the computation across multiple machines by employing a centralized
repository for parameters that maintains a unified, single copy of model
parameters. Each individual machine sends gradient updates to the centralized
parameter repository which coordinates these updates and sends back updated
parameters to the individual machines running the model training.
We term each machine that runs a copy of the training a `worker` or `replica`.
We term each machine that maintains model parameters a `ps`, short for
`parameter server`. Note that we might have more than one machine acting as a
`ps` as the model parameters may be sharded across multiple machines.
Variables may be updated with synchronous or asynchronous gradient updates. One
may construct an [`Optimizer`](https://www.tensorflow.org/api_docs/python/train.html#optimizers)
in TensorFlow that constructs the necessary graph for either case, as diagrammed
below (from the TensorFlow [whitepaper](http://download.tensorflow.org/paper/whitepaper2015.pdf)):
<div style="width:40%; margin:auto; margin-bottom:10px; margin-top:20px;">
<img style="width:100%"
src="https://www.tensorflow.org/images/tensorflow_figure7.png">
</div>
In [a recent paper](https://arxiv.org/abs/1604.00981), synchronous gradient
updates have been demonstrated to reach higher accuracy in a shorter amount of time.
In this distributed Inception example we employ synchronous gradient updates.
Note that in this example each replica has a single tower that uses one GPU.
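As a rough sketch of what synchronous updates look like in code (written against
the TensorFlow 1.x-style API with illustrative values; the trainer added by this
commit, `inception_distributed_train.py` below, wires this up with additional
options):

```python
import tensorflow as tf

# Stand-ins for the Inception loss and parameters.
w = tf.Variable(1.0)
loss = tf.square(w)
global_step = tf.Variable(0, trainable=False)

# Plain RMSProp optimizer, then a wrapper that waits for gradients from
# `replicas_to_aggregate` replicas and applies one averaged update.
opt = tf.train.RMSPropOptimizer(0.045, decay=0.9, momentum=0.9, epsilon=1.0)
opt = tf.train.SyncReplicasOptimizer(opt,
                                     replicas_to_aggregate=2,
                                     total_num_replicas=2)
train_op = opt.minimize(loss, global_step=global_step)
# In the real trainer, the chief worker additionally runs the optimizer's
# initialization ops and queue runners before training starts.
```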
The command-line flags `worker_hosts` and `ps_hosts` specify available servers.
The same binary will be used for both the `worker` jobs and the `ps` jobs.
The command-line flag `job_name` specifies what role a task plays, and `task_id`
identifies which one of the jobs it is running. Several things to note here:
* The numbers of `ps` and `worker` tasks are inferred from the lists of hosts
specified in the flags. The `task_id` should be within the range `[0,
num_ps_tasks)` for `ps` tasks and `[0, num_worker_tasks)` for `worker`
tasks.
* `ps` and `worker` tasks can run on the same machine, as long as that machine
has sufficient resources to handle both tasks. Note that the `ps` task does
not benefit from a GPU, so it should not attempt to use one (see below).
* Multiple `worker` tasks can run on the same machine with multiple GPUs so
machine_A with 2 GPUs may have 2 workers while machine_B with 1 GPU just has
1 worker.
* The default learning rate schedule works well for a wide range of number of
replicas [25, 50, 100] but feel free to tune it for even better results.
* The command line of both `ps` and `worker` tasks should include the complete
list of `ps_hosts` and `worker_hosts`.
* There is a chief `worker` among all workers which defaults to `worker` 0.
The chief will be in charge of initializing all the parameters, writing out
the summaries and the checkpoint. The checkpoint and summary will be in the
`train_dir` of the host for `worker` 0.
* Each worker processes a batch_size number of examples but each gradient
update is computed from all replicas. Hence, the effective batch size of
this model is batch_size * num_workers.
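For reference, the host-list flags above map onto a cluster specification roughly
as sketched below (the host names are the example values used in the commands
that follow; each task then builds a `tf.train.Server` from this spec, exactly as
`imagenet_distributed_train.py` does later in this commit):

```python
import tensorflow as tf

# The flag strings, split into per-job host lists.
ps_hosts = 'ps0.example.com:2222'.split(',')
worker_hosts = 'worker0.example.com:2222,worker1.example.com:2222'.split(',')

cluster_spec = tf.train.ClusterSpec({'ps': ps_hosts, 'worker': worker_hosts})
print(cluster_spec.as_dict())
# {'ps': ['ps0.example.com:2222'],
#  'worker': ['worker0.example.com:2222', 'worker1.example.com:2222']}
```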
```shell
# Build the model. Note that we need to make sure TensorFlow is ready to use
# before this step, as this command will not build TensorFlow.
bazel build inception/imagenet_distributed_train
# To start worker 0, go to the worker0 host and run the following (Note that
# task_id should be in the range [0, num_worker_tasks):
bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
# To start worker 1, go to the worker1 host and run the following (Note that
# task_id should be in the range [0, num_worker_tasks):
bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=1 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
# To start the parameter server (ps), go to the ps host and run the following (Note
# that task_id should be in the range [0, num_ps_tasks):
bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
```
If you have installed a GPU-compatible version of TensorFlow, the `ps` will also
try to allocate GPU memory although it is not helpful. This could potentially
crash a worker running on the same machine, as it would be left with little or no
GPU memory to allocate. To avoid this, you can prepend the command that starts
the `ps` with `CUDA_VISIBLE_DEVICES=''`:
```shell
CUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
```
If you have run everything correctly, you should see a log in each `worker` job
that looks like the following. Note the training speed varies depending on your
hardware and the first several steps could take much longer.
```shell
INFO:tensorflow:PS hosts are: ['ps0.example.com:2222', 'ps1.example.com:2222']
INFO:tensorflow:Worker hosts are: ['worker0.example.com:2222', 'worker1.example.com:2222']
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {ps0.example.com:2222, ps1.example.com:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {localhost:2222, worker1.example.com:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222
INFO:tensorflow:Created variable global_step:0 with shape () and init <function zeros_initializer at 0x7f6aa014b140>
...
INFO:tensorflow:Created variable logits/logits/biases:0 with shape (1001,) and init <function _initializer at 0x7f6a77f3cf50>
INFO:tensorflow:SyncReplicas enabled: replicas_to_aggregate=2; total_num_replicas=2
INFO:tensorflow:2016-04-13 01:56:26.405639 Supervisor
INFO:tensorflow:Started 2 queues for processing input data.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Worker 0: 2016-04-13 01:58:40.342404: step 0, loss = 12.97(0.0 examples/sec; 65.428  sec/batch)
INFO:tensorflow:global_step/sec: 0.0172907
...
```
and a log in each `ps` job that looks like the following:
```shell
INFO:tensorflow:PS hosts are: ['ps0.example.com:2222', 'ps1.example.com:2222']
INFO:tensorflow:Worker hosts are: ['worker0.example.com:2222', 'worker1.example.com:2222']
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {localhost:2222, ps1.example.com:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {worker0.example.com:2222, worker1.example.com:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222
```
[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You are now
training Inception in a distributed manner.
## How to Evaluate
Evaluating an Inception v3 model on the ImageNet 2012 validation data set
requires running a separate binary.
The evaluation procedure is nearly identical to [Evaluating a Model](https://www.tensorflow.org/tutorials/deep_cnn/index.html#evaluating-a-model)
described in the [Convolutional Neural Network](https://www.tensorflow.org/tutorials/deep_cnn/index.html)
tutorial.
**WARNING** Be careful not to run the evaluation and training binary on the
same GPU or else you might run out of memory. Consider running the evaluation on
a separate GPU if available or suspending the training binary while running
the evaluation on the same GPU.
Briefly, one can evaluate the model by running:
```shell
# Build the model. Note that we need to make sure TensorFlow is ready to use
# before this step, as this command will not build TensorFlow.
bazel build inception/imagenet_eval
# run it
bazel-bin/inception/imagenet_eval --checkpoint_dir=/tmp/imagenet_train --eval_dir=/tmp/imagenet_eval
```
Note that we point `--checkpoint_dir` to the location of the checkpoints saved
by `inception_train.py` above. Running the above command results in the
following output:
```shell
...
```
The script calculates the precision @ 1 over the entire validation data
periodically. The precision @ 1 measures how often the highest scoring
prediction from the model matched the ImageNet label -- in this case, 73.5%.
If you wish to run the eval just once and not periodically, append the
`--run_once` option.
Much like the training script, [`imagenet_eval.py`](inception/imagenet_eval.py) also
exports summaries that may be visualized in TensorBoard. These summaries
calculate additional statistics on the predictions (e.g. recall @ 5) as well
as monitor the statistics of the model activations and weights during
evaluation.
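To make these metrics concrete, here is a small self-contained NumPy sketch of
how precision @ 1 and recall @ 5 are computed from a batch of scores (toy data,
not the evaluation script's actual code):

```python
import numpy as np

# Toy scores for 4 examples over 6 classes, plus the true labels.
logits = np.array([[0.1, 0.3, 2.0, 0.0, 0.5, 0.1],
                   [1.5, 0.2, 0.1, 0.0, 0.3, 0.2],
                   [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
                   [0.2, 0.65, 0.9, 0.8, 0.7, 0.6]])
labels = np.array([2, 0, 5, 1])

# precision @ 1: how often the single highest-scoring class is the label.
top1 = np.argmax(logits, axis=1)
precision_at_1 = np.mean(top1 == labels)

# recall @ 5: how often the label appears among the 5 highest-scoring classes.
top5 = np.argsort(-logits, axis=1)[:, :5]
recall_at_5 = np.mean([labels[i] in top5[i] for i in range(len(labels))])

print(precision_at_1, recall_at_5)   # 0.75 1.0
```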
## How to Fine-Tune a Pre-Trained Model on a New Task
### Getting Started
Much like training the ImageNet model we must first convert a new data set to
the sharded TFRecord format in which each entry is a serialized `tf.Example` proto.
We have provided a script demonstrating how to do this for a small data set of a
few thousand flower images spread across 5 labels:
```shell
daisy, dandelion, roses, sunflowers, tulips
```
There is a single automated script that downloads the data set and converts
it to the TFRecord format. Much like the ImageNet data set, each record in the
TFRecord format is a serialized `tf.Example` proto whose entries include
a JPEG-encoded string and an integer label. Please see
[`parse_example_proto`](inception/image_processing.py) for details.
The script just takes a few minutes to run, depending on your network connection
speed for downloading and processing the images. Your hard disk requires 200MB
of free storage. Here we select `DATA_DIR=$HOME/flowers-data` as such a
location but feel free to edit accordingly.
```shell
# location of where to place the flowers data
FLOWERS_DATA_DIR=$HOME/flowers-data
# build the preprocessing script.
bazel build inception/download_and_preprocess_flowers
# run it
bazel-bin/inception/download_and_preprocess_flowers "${FLOWERS_DATA_DIR}"
```
When the script finishes you will find 2 shards for the training and validation
files in the `DATA_DIR`. The files will match the patterns
`train-????-of-00001` and `validation-?????-of-00001`, respectively.
**NOTE** If you wish to prepare a custom image data set for transfer learning,
you will need to invoke [`build_image_data.py`](inception/data/build_image_data.py)
on your custom data set.
Please see the associated options and assumptions behind this script by reading
the comments section of [`build_image_data.py`](inception/data/build_image_data.py).
The second piece you will need is a trained Inception v3 image model. You have
the option of either training one yourself (See
[How to Train from Scratch](#how-to-train-from-scratch) for details) or you can
download a pre-trained model like so:
```shell
# location of where to place the Inception v3 model
...
checkpoint
model.ckpt-157585
```
[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0)
You are now ready to fine-tune your pre-trained Inception v3 model
with the flower data set.
### How to Retrain a Trained Model on the Flowers Data
We are now ready to fine-tune a pre-trained Inception-v3 model on
the flowers data set. This requires two distinct changes to our training
procedure:
1. Build the exact same model as previously except we change the number
of labels in the final classification layer.
2. Restore all weights from the pre-trained Inception-v3 except for the
final classification layer; this will get randomly initialized instead.
We can perform these two operations by specifying two flags:
`--pretrained_model_checkpoint_path` and `--fine_tune`.
The first flag is a string that points to the path of a pre-trained Inception-v3
model. If this flag is specified, it will load the entire model from the
checkpoint before the script begins training.
The second flag `--fine_tune` is a boolean that indicates whether the last
classification layer should be randomly initialized or restored.
You may set this flag to false
if you wish to continue training a pre-trained model from a checkpoint. If you
set this flag to true, you can train a new classification layer from scratch.
In order to understand how `--fine_tune` works, please see the discussion
on `Variables` in the TensorFlow-Slim [`README.md`](inception/slim/README.md).
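A rough sketch of the idea behind `--fine_tune`, written against the TensorFlow
1.x API with a plain `tf.train.Saver` rather than TensorFlow-Slim's variable
collections (the variable shapes, the scope name `logits`, and the checkpoint
path are illustrative assumptions):

```python
import tensorflow as tf

# Toy stand-ins for the network's variables; the final classification layer
# lives under a 'logits' scope (assumed name for illustration).
with tf.variable_scope('conv1'):
  conv_w = tf.get_variable('weights', [3, 3, 3, 32])
with tf.variable_scope('logits'):
  logits_w = tf.get_variable('weights', [2048, 5])

# Restore everything except the final classification layer; the logits layer
# keeps its fresh random initialization and is trained from scratch.
restore_vars = [v for v in tf.trainable_variables()
                if not v.op.name.startswith('logits')]
saver = tf.train.Saver(restore_vars)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  # saver.restore(sess, '/path/to/model.ckpt-157585')  # needs a real checkpoint
```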
Putting this all together you can retrain a pre-trained Inception-v3 model
on the flowers data set with the following command.
```shell
# Build the model. Note that we need to make sure TensorFlow is ready to use
# before this step, as this command will not build TensorFlow.
bazel build inception/flowers_train
# Path to the downloaded Inception-v3 model.
MODEL_PATH="${INCEPTION_MODEL_DIR}/model.ckpt-157585"
bazel-bin/inception/flowers_train ...
```
We have added a few extra options to the training procedure.
* Fine-tuning a model on a separate data set requires significantly lowering the
  initial learning rate. We set the initial learning rate to 0.001.
* The flowers data set is quite small so we shrink the size of the shuffling
queue of examples. See [Adjusting Memory Demands](#adjusting-memory-demands)
for more details.
The training script only reports the loss. To evaluate the quality of the
fine-tuned model, you will need to run `flowers_eval`:
```shell
# Build the model. Note that we need to make sure TensorFlow is ready to use
# before this step, as this command will not build TensorFlow.
bazel build inception/flowers_eval
# Directory where we saved the fine-tuned checkpoint and events files.
TRAIN_DIR=/tmp/flowers_train/
FLOWERS_DATA_DIR=/tmp/flowers-data/
EVAL_DIR=/tmp/flowers_eval/
# Evaluate the fine-tuned model on a hold-out of the flower data set.
bazel-bin/inception/flowers_eval \
--eval_dir="${EVAL_DIR}" \
--data_dir="${FLOWERS_DATA_DIR}" \
--subset=validation \
--num_examples=500 \
--checkpoint_dir="${TRAIN_DIR}" \
--input_queue_memory_factor=1 \
--run_once
```
We find that the evaluation arrives at roughly 93.4% precision@1 after the
model has been running for 2000 steps.
```shell
Succesfully loaded model from /tmp/flowers/model.ckpt-1999 at step=1999.
2016-03-01 16:53:05.450471: precision @ 1 = 0.9340 recall @ 5 = 0.9960 [500 examples]
```
## How to Construct a New Dataset for Retraining
One can use the existing scripts supplied with this model to build a new
dataset for training or fine-tuning. The main script to employ is
[`build_image_data.py`](inception/data/build_image_data.py). Briefly,
this script takes a structured
directory of images and converts it to a sharded `TFRecord` that can be read
by the Inception model.
In particular, you will need to create a directory of training images
that reside within `$TRAIN_DIR` and `$VALIDATION_DIR` arranged as such:
```shell
$TRAIN_DIR/dog/image0.jpeg
...
$VALIDATION_DIR/cat/cat.JPG
...
```
Each sub-directory in `$TRAIN_DIR` and `$VALIDATION_DIR` corresponds to a
unique label for the images that reside within that sub-directory. The images
may be JPEG or PNG images. We do not support other image types currently.
Once the data is arranged in this directory structure, we can run
[`build_image_data.py`](inception/data/build_image_data.py) on the data to generate the sharded `TFRecord` dataset.
Each entry of the `TFRecord` is a serialized `tf.Example` protocol buffer.
A complete list of information contained in the `tf.Example` is described
in the comments of [`build_image_data.py`](inception/data/build_image_data.py).
To run [`build_image_data.py`](inception/data/build_image_data.py), use the following command line:
```shell
# location to where to save the TFRecord data.
OUTPUT_DIRECTORY=$HOME/my-custom-data/
# build the preprocessing script.
bazel build inception/build_image_data
# convert the data.
bazel-bin/inception/build_image_data \
--validation_shards=24 \
--num_threads=8
```
where `$OUTPUT_DIRECTORY` is the location of the sharded `TFRecords`. The
`$LABELS_FILE` is a text file output by the script that provides a list of all
of the labels. For instance, in the case of the flowers data set, the
`$LABELS_FILE` contained the following data:
```shell
daisy
dandelion
roses
sunflowers
tulips
```
Note that each row of the labels file corresponds to an entry in the final
classifier in the model. That is, `daisy` corresponds to the classifier entry
`1`; `dandelion` is entry `2`, etc. We skip label `0` as a background class.
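A tiny illustration of that mapping (plain Python, using the flower labels from
above):

```python
# Contents of the $LABELS_FILE written by the script, one label per line.
labels = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']

# Classifier entry 0 is reserved for the background class, so output entry i
# (for i >= 1) corresponds to labels[i - 1].
index_to_label = {i + 1: name for i, name in enumerate(labels)}
print(index_to_label[1])   # daisy
print(index_to_label[2])   # dandelion
```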
Running this script produces files that look like the following:
```shell
$TRAIN_DIR/train-00000-of-00024
...
$TRAIN_DIR/train-00023-of-00024
```

and

```shell
$VALIDATION_DIR/validation-00000-of-00008
...
$VALIDATION_DIR/validation-00007-of-00008
```
where 24 and 8 are the number of shards specified for each dataset,
respectively. Generally speaking, we aim to select the number of shards such
that roughly 1024 images reside in each shard. Once this data set is built, you
are ready to train or fine-tune an Inception model on this data set.
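If you want to sanity-check the generated shards, a small sketch (TensorFlow 1.x
API; the shard file name is illustrative) reads one record back and lists the
feature keys stored by `build_image_data.py`:

```python
import tensorflow as tf

# Read the first record from one shard in $OUTPUT_DIRECTORY (name illustrative).
path = 'train-00000-of-00024'
record = next(tf.python_io.tf_record_iterator(path))

# Decode the serialized tf.Example and print which features it carries.
example = tf.train.Example()
example.ParseFromString(record)
print(sorted(example.features.feature.keys()))
```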
## Practical Considerations for Training a Model
The model architecture and training procedure is heavily dependent on the
hardware used to train the model. If you wish to train or fine-tune this
model on your machine **you will need to adjust and empirically determine
a good set of training hyper-parameters for your setup**. What follows are
some general considerations for novices.
### Finding Good Hyperparameters
Roughly 5-10 hyper-parameters govern the speed at which a network is trained.
In addition to `--batch_size` and `--num_gpus`, there are several constants
defined in [inception_train.py](inception/inception_train.py) which dictate the
learning schedule.
```shell
RMSPROP_DECAY = 0.9 # Decay term for RMSProp.
...
```

There are many papers that discuss the various tricks and trade-offs associated
with training a model with stochastic gradient descent. For those new to the
field, some great references are:
* Y Bengio, [Practical recommendations for gradient-based training of deep architectures](http://arxiv.org/abs/1206.5533)
* I Goodfellow, Y Bengio and A Courville, [Deep Learning](http://www.deeplearningbook.org/)
What follows is a summary of some general advice for identifying appropriate
model hyper-parameters in the context of this particular
model training setup. Namely,
this library provides *synchronous* updates to model parameters based on
batch-splitting the model across multiple GPUs.
* Higher learning rates lead to faster training. Too high a learning rate leads
  to instability and will cause model parameters to diverge to infinity or NaN.
* Larger batch sizes lead to higher quality estimates of the gradient and
  permit training the model with higher learning rates.
* Often the GPU memory is a bottleneck that prevents employing larger batch
  sizes. Employing more GPUs allows one to use larger batch sizes because
  this model splits the batch across the GPUs.
**NOTE** If one wishes to train this model with *asynchronous* gradient updates,
one will need to substantially alter this model and new considerations need to
be factored into hyperparameter tuning.
See [Large Scale Distributed Deep Networks](http://research.google.com/archive/large_deep_networks_nips2012.html)
for a discussion in this domain.
### Adjusting Memory Demands
Training this model makes demands on both CPU and GPU memory; we discuss each
item in turn.
GPU memory is relatively small compared to CPU memory. Two items dictate the
amount of GPU memory employed -- model architecture and batch size. Assuming
that you keep the model architecture fixed, the sole parameter governing the
GPU demand is the batch size. A good rule of thumb is to employ as large a
batch size as will fit on the GPU.
If you run out of GPU memory, either lower the `--batch_size` or employ more
GPUs on your desktop. The model performs batch-splitting across GPUs, thus N
GPUs can handle N times the batch size of 1 GPU.
The model requires a large amount of CPU memory as well. We have tuned the model
to employ about 20GB of CPU memory. Thus, having access to about 40GB of CPU
memory would be ideal.
If that is not possible, you can tune down the memory demands of the model by
lowering `--input_queue_memory_factor`. Images are preprocessed asynchronously
with respect to the main training across `--num_preprocess_threads` threads. The
preprocessed images are stored in a shuffling queue from which each GPU performs
a dequeue operation in order to receive a `batch_size` worth of images.
In order to guarantee good shuffling across the data, we maintain a large
shuffling queue of 1024 x `input_queue_memory_factor` images. For the current
model architecture, this corresponds to about 4GB of CPU memory. You may lower
`input_queue_memory_factor` in order to decrease the memory footprint. Keep in
mind though that lowering this value drastically may result in a model with
slightly lower predictive accuracy when training from scratch. Please see
comments in [`image_processing.py`](inception/image_processing.py) for more details.
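As a back-of-the-envelope check of the figure above (both numbers below are
assumptions for illustration: a default `--input_queue_memory_factor` of 16 and
images held as 8-bit RGB at 299x299):

```python
# Shuffling queue holds 1024 * input_queue_memory_factor images.
factor = 16
images_in_queue = 1024 * factor            # 16384 images
bytes_per_image = 299 * 299 * 3            # ~262 KB per uint8 RGB image
print(images_in_queue * bytes_per_image / 1e9)   # ~4.4 GB
```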
## Troubleshooting
#### The model runs out of CPU memory.
In lieu of buying more CPU memory, an easy fix is to
decrease `--input_queue_memory_factor`. See
[Adjusting Memory Demands](#adjusting-memory-demands).
#### The model runs out of GPU memory.
The data is not able to fit on the GPU card. The simplest solution is to
decrease the batch size of the model. Otherwise, you will need to think about
a more sophisticated method for specifying the training which cuts up the model
across multiple `session.run()` calls or partitions the model across multiple
GPUs. See [Using GPUs](https://www.tensorflow.org/versions/r0.7/how_tos/using_gpu/index.html)
and
[Adjusting Memory Demands](#adjusting-memory-demands)
for more information.
#### The model training results in NaN's.
The learning rate of the model is too high. Turn down your learning rate.
#### I wish to train a model with a different image size.
The simplest solution is to artificially resize your images to `299x299` pixels.
See the [Images](https://www.tensorflow.org/versions/r0.7/api_docs/python/image.html)
section for many resizing, cropping and padding methods.
Note that the entire model architecture is predicated on a `299x299` image,
thus if you wish to change the input image size, then you may need to redesign
the entire model architecture.
#### What hardware specification are these hyper-parameters targeted for?
We targeted a desktop with 128GB of CPU ram connected to 8 NVIDIA Tesla K40
GPU cards but we have run this on desktops with 32GB of CPU ram and 1 NVIDIA
Tesla K40. You can get a sense of the various training configurations we
tested by reading the comments in [`inception_train.py`](inception/inception_train.py).
#### How do I continue training from a checkpoint in a distributed setting?
You only need to make sure that the checkpoint is in a location that can be
reached by all of the `ps` tasks. By specifying the checkpoint location with
`--train_dir`, the `ps` servers will load the checkpoint before commencing
training.
The commit also adds the following targets to the `inception` package's Bazel
`BUILD` file for the new distributed trainer; the two new source files,
`imagenet_distributed_train.py` and `inception_distributed_train.py`, follow
below.
py_binary(
    name = "imagenet_distributed_train",
    srcs = [
        "imagenet_distributed_train.py",
    ],
    deps = [
        ":imagenet_data",
        ":inception_distributed_train",
    ],
)
py_library(
    name = "inception_distributed_train",
    srcs = [
        "inception_distributed_train.py",
    ],
    deps = [
        ":image_processing",
        ":inception",
    ],
)
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
# pylint: disable=line-too-long
"""A binary to train Inception in a distributed manner using multiple systems.
Please see accompanying README.md for details and instructions.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from inception import inception_distributed_train
from inception.imagenet_data import ImagenetData
FLAGS = tf.app.flags.FLAGS
def main(unused_args):
  assert FLAGS.job_name in ['ps', 'worker'], 'job_name must be ps or worker'

  # Extract all the hostnames for the ps and worker jobs to construct the
  # cluster spec.
  ps_hosts = FLAGS.ps_hosts.split(',')
  worker_hosts = FLAGS.worker_hosts.split(',')
  tf.logging.info('PS hosts are: %s' % ps_hosts)
  tf.logging.info('Worker hosts are: %s' % worker_hosts)

  cluster_spec = tf.train.ClusterSpec({'ps': ps_hosts,
                                       'worker': worker_hosts})
  server = tf.train.Server(
      {'ps': ps_hosts,
       'worker': worker_hosts},
      job_name=FLAGS.job_name,
      task_index=FLAGS.task_id)

  if FLAGS.job_name == 'ps':
    # `ps` jobs wait for incoming connections from the workers.
    server.join()
  else:
    # `worker` jobs will actually do the work.
    dataset = ImagenetData(subset=FLAGS.subset)
    assert dataset.data_files()
    # Only the chief checks for or creates train_dir.
    if FLAGS.task_id == 0:
      if not tf.gfile.Exists(FLAGS.train_dir):
        tf.gfile.MakeDirs(FLAGS.train_dir)
    inception_distributed_train.train(server.target, dataset, cluster_spec)


if __name__ == '__main__':
  tf.logging.set_verbosity(tf.logging.INFO)
  tf.app.run()
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A library to train Inception using multiple replicas with synchronous update.
Please see accompanying README.md for details and instructions.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from datetime import datetime
import os.path
import time
import numpy as np
import tensorflow as tf
from inception import image_processing
from inception import inception_model as inception
from inception.slim import slim
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('job_name', '', 'One of "ps", "worker"')
tf.app.flags.DEFINE_string('ps_hosts', '',
"""Comma-separated list of hostname:port for the """
"""parameter server jobs. e.g. """
"""'machine1:2222,machine2:1111,machine2:2222'""")
tf.app.flags.DEFINE_string('worker_hosts', '',
"""Comma-separated list of hostname:port for the """
"""worker jobs. e.g. """
"""'machine1:2222,machine2:1111,machine2:2222'""")
tf.app.flags.DEFINE_string('train_dir', '/tmp/imagenet_train',
"""Directory where to write event logs """
"""and checkpoint.""")
tf.app.flags.DEFINE_integer('max_steps', 1000000, 'Number of batches to run.')
tf.app.flags.DEFINE_string('subset', 'train', 'Either "train" or "validation".')
tf.app.flags.DEFINE_boolean('log_device_placement', False,
'Whether to log device placement.')
# Task ID is used to select the chief and also to access the local_step for
# each replica to check staleness of the gradients in sync_replicas_optimizer.
tf.app.flags.DEFINE_integer(
'task_id', 0, 'Task ID of the worker/replica running the training.')
# More details can be found in the sync_replicas_optimizer class:
# tensorflow/python/training/sync_replicas_optimizer.py
tf.app.flags.DEFINE_integer('num_replicas_to_aggregate', -1,
"""Number of gradients to collect before """
"""updating the parameters.""")
tf.app.flags.DEFINE_integer('save_interval_secs', 10 * 60,
'Save interval seconds.')
tf.app.flags.DEFINE_integer('save_summaries_secs', 180,
'Save summaries interval seconds.')
# **IMPORTANT**
# Please note that this learning rate schedule is heavily dependent on the
# hardware architecture, batch size and any changes to the model architecture
# specification. Selecting a finely tuned learning rate schedule is an
# empirical process that requires some experimentation. Please see README.md
# for more guidance and discussion.
#
# Learning rate decay factor selected from https://arxiv.org/abs/1604.00981
tf.app.flags.DEFINE_float('initial_learning_rate', 0.045,
'Initial learning rate.')
tf.app.flags.DEFINE_float('num_epochs_per_decay', 2.0,
'Epochs after which learning rate decays.')
tf.app.flags.DEFINE_float('learning_rate_decay_factor', 0.94,
'Learning rate decay factor.')
# Constants dictating the learning rate schedule.
RMSPROP_DECAY = 0.9 # Decay term for RMSProp.
RMSPROP_MOMENTUM = 0.9 # Momentum in RMSProp.
RMSPROP_EPSILON = 1.0 # Epsilon term for RMSProp.
def train(target, dataset, cluster_spec):
"""Train Inception on a dataset for a number of steps."""
# Number of workers and parameter servers are infered from the workers and ps
# hosts string.
num_workers = len(cluster_spec.as_dict()['worker'])
num_parameter_servers = len(cluster_spec.as_dict()['ps'])
  # If no value is given, num_replicas_to_aggregate defaults to the number of
  # workers.
if FLAGS.num_replicas_to_aggregate == -1:
num_replicas_to_aggregate = num_workers
else:
num_replicas_to_aggregate = FLAGS.num_replicas_to_aggregate
  # Both must be greater than 0 for distributed training.
  assert num_workers > 0 and num_parameter_servers > 0, (
      'num_workers and num_parameter_servers must be > 0.')
# Choose worker 0 as the chief. Note that any worker could be the chief
# but there should be only one chief.
is_chief = (FLAGS.task_id == 0)
  # Ops are assigned to the worker by default.
with tf.device('/job:worker/task:%d' % FLAGS.task_id):
    # Variables and their related init/assign ops are assigned to ps.
with slim.scopes.arg_scope(
[slim.variables.variable, slim.variables.global_step],
device=slim.variables.VariableDeviceChooser(num_parameter_servers)):
# Create a variable to count the number of train() calls. This equals the
# number of updates applied to the variables.
global_step = slim.variables.global_step()
# Calculate the learning rate schedule.
num_batches_per_epoch = (dataset.num_examples_per_epoch() /
FLAGS.batch_size)
# Decay steps need to be divided by the number of replicas to aggregate.
decay_steps = int(num_batches_per_epoch * FLAGS.num_epochs_per_decay /
num_replicas_to_aggregate)
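      # Illustrative arithmetic (assumed values): with batch_size=32 and
      # ImageNet's ~1,281,167 training images, num_batches_per_epoch is roughly
      # 40,036; with num_epochs_per_decay=2.0 and 8 aggregating replicas the
      # learning rate therefore decays every int(40036 * 2.0 / 8) = 10009
      # global steps.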
# Decay the learning rate exponentially based on the number of steps.
lr = tf.train.exponential_decay(FLAGS.initial_learning_rate,
global_step,
decay_steps,
FLAGS.learning_rate_decay_factor,
staircase=True)
# Add a summary to track the learning rate.
tf.scalar_summary('learning_rate', lr)
# Create an optimizer that performs gradient descent.
opt = tf.train.RMSPropOptimizer(lr,
RMSPROP_DECAY,
momentum=RMSPROP_MOMENTUM,
epsilon=RMSPROP_EPSILON)
images, labels = image_processing.distorted_inputs(
dataset,
batch_size=FLAGS.batch_size,
num_preprocess_threads=FLAGS.num_preprocess_threads)
# Number of classes in the Dataset label set plus 1.
# Label 0 is reserved for an (unused) background class.
num_classes = dataset.num_classes() + 1
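      # For the ImageNet 2012 data set this yields 1001 classes: labels 1-1000
      # for the object categories and label 0 for the unused background class.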
logits = inception.inference(images, num_classes, for_training=True)
# Add classification loss.
inception.loss(logits, labels)
# Gather all of the losses including regularization losses.
losses = tf.get_collection(slim.losses.LOSSES_COLLECTION)
losses += tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
total_loss = tf.add_n(losses, name='total_loss')
if is_chief:
# Compute the moving average of all individual losses and the
# total loss.
loss_averages = tf.train.ExponentialMovingAverage(0.9, name='avg')
loss_averages_op = loss_averages.apply(losses + [total_loss])
        # Attach a scalar summary to all individual losses and the total loss;
        # do the same for the averaged version of the losses.
for l in losses + [total_loss]:
loss_name = l.op.name
# Name each loss as '(raw)' and name the moving average version of the
# loss as the original loss name.
tf.scalar_summary(loss_name + ' (raw)', l)
tf.scalar_summary(loss_name, loss_averages.average(l))
# Add dependency to compute loss_averages.
with tf.control_dependencies([loss_averages_op]):
total_loss = tf.identity(total_loss)
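      # On the chief, the tf.identity above re-binds total_loss so that any op
      # consuming it (e.g. the gradient computation below) also triggers the
      # loss_averages_op update via the control dependency.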
# Track the moving averages of all trainable variables.
# Note that we maintain a 'double-average' of the BatchNormalization
# global statistics.
      # This is not needed when the number of replicas is small, but it is
      # important for synchronous distributed training with tens of
      # workers/replicas.
exp_moving_averager = tf.train.ExponentialMovingAverage(
inception.MOVING_AVERAGE_DECAY, global_step)
variables_to_average = (
tf.trainable_variables() + tf.moving_average_variables())
# Add histograms for model variables.
for var in variables_to_average:
tf.histogram_summary(var.op.name, var)
# Create synchronous replica optimizer.
opt = tf.train.SyncReplicasOptimizer(
opt,
replicas_to_aggregate=num_replicas_to_aggregate,
replica_id=FLAGS.task_id,
total_num_replicas=num_workers,
variable_averages=exp_moving_averager,
variables_to_average=variables_to_average)
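      # In broad strokes (see sync_replicas_optimizer.py for the authoritative
      # description): each replica pushes its gradients into shared queues on
      # the ps tasks; once num_replicas_to_aggregate gradient sets have been
      # collected they are averaged and applied as a single update to the
      # variables, and gradients computed from a stale local_step are dropped.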
batchnorm_updates = tf.get_collection(slim.ops.UPDATE_OPS_COLLECTION)
assert batchnorm_updates, 'Batchnorm updates are missing'
batchnorm_updates_op = tf.group(*batchnorm_updates)
# Add dependency to compute batchnorm_updates.
with tf.control_dependencies([batchnorm_updates_op]):
total_loss = tf.identity(total_loss)
# Compute gradients with respect to the loss.
grads = opt.compute_gradients(total_loss)
# Add histograms for gradients.
for grad, var in grads:
if grad is not None:
tf.histogram_summary(var.op.name + '/gradients', grad)
apply_gradients_op = opt.apply_gradients(grads, global_step=global_step)
with tf.control_dependencies([apply_gradients_op]):
train_op = tf.identity(total_loss, name='train_op')
      # Get the chief queue_runners, init_tokens and clean_up_op, which are
      # used to synchronize the replicas.
      # More details can be found in sync_replicas_optimizer.
chief_queue_runners = [opt.get_chief_queue_runner()]
init_tokens_op = opt.get_init_tokens_op()
clean_up_op = opt.get_clean_up_op()
# Create a saver.
saver = tf.train.Saver()
# Build the summary operation based on the TF collection of Summaries.
summary_op = tf.merge_all_summaries()
# Build an initialization operation to run below.
init_op = tf.initialize_all_variables()
      # We run the summaries in the same thread as the training operations by
      # passing None for summary_op, so that the Supervisor does not start a
      # separate summary thread. Running summaries and training operations in
      # parallel could run out of GPU memory.
sv = tf.train.Supervisor(is_chief=is_chief,
logdir=FLAGS.train_dir,
init_op=init_op,
summary_op=None,
global_step=global_step,
saver=saver,
save_model_secs=FLAGS.save_interval_secs)
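      # With is_chief=True the Supervisor initializes the variables and writes
      # checkpoints to train_dir every save_model_secs; non-chief workers
      # instead block in prepare_or_wait_for_session() below until the chief
      # has finished initialization.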
tf.logging.info('%s Supervisor' % datetime.now())
sess_config = tf.ConfigProto(
allow_soft_placement=True,
log_device_placement=FLAGS.log_device_placement)
# Get a session.
sess = sv.prepare_or_wait_for_session(target, config=sess_config)
# Start the queue runners.
queue_runners = tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS)
sv.start_queue_runners(sess, queue_runners)
tf.logging.info('Started %d queues for processing input data.',
len(queue_runners))
if is_chief:
sv.start_queue_runners(sess, chief_queue_runners)
sess.run(init_tokens_op)
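        # The init tokens seed the optimizer's sync token queue so that each
        # replica can dequeue a token for its first step; without them the
        # non-chief workers would block waiting for tokens.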
      # Train, checking for NaNs. Concurrently run the summary operation at a
      # specified interval. Note that the summary_op and train_op never run
      # simultaneously, in order to prevent running out of GPU memory.
next_summary_time = time.time() + FLAGS.save_summaries_secs
while not sv.should_stop():
try:
start_time = time.time()
loss_value, step = sess.run([train_op, global_step])
assert not np.isnan(loss_value), 'Model diverged with loss = NaN'
if step > FLAGS.max_steps:
break
duration = time.time() - start_time
if step % 30 == 0:
examples_per_sec = FLAGS.batch_size / float(duration)
            format_str = ('Worker %d: %s: step %d, loss = %.2f '
                          '(%.1f examples/sec; %.3f sec/batch)')
tf.logging.info(format_str %
(FLAGS.task_id, datetime.now(), step, loss_value,
examples_per_sec, duration))
# Determine if the summary_op should be run on the chief worker.
if is_chief and next_summary_time < time.time():
tf.logging.info('Running Summary operation on the chief.')
summary_str = sess.run(summary_op)
sv.summary_computed(sess, summary_str)
tf.logging.info('Finished running Summary operation.')
# Determine the next time for running the summary.
next_summary_time += FLAGS.save_summaries_secs
except:
if is_chief:
tf.logging.info('About to execute sync_clean_up_op!')
sess.run(clean_up_op)
raise
# Stop the supervisor. This also waits for service threads to finish.
sv.stop()
# Save after the training ends.
if is_chief:
saver.save(sess,
os.path.join(FLAGS.train_dir, 'model.ckpt'),
global_step=global_step)
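
# A minimal sketch of how train() is typically driven (illustrative only: the
# flag handling and entry point live in a separate binary, assumed here to be
# imagenet_distributed_train.py, and ImagenetData is assumed to be the dataset
# class from this package):
#
#   ps_hosts = FLAGS.ps_hosts.split(',')
#   worker_hosts = FLAGS.worker_hosts.split(',')
#   cluster_spec = tf.train.ClusterSpec({'ps': ps_hosts, 'worker': worker_hosts})
#   server = tf.train.Server(cluster_spec, job_name=FLAGS.job_name,
#                            task_index=FLAGS.task_id)
#   if FLAGS.job_name == 'ps':
#     server.join()  # Parameter servers only host variables and serve requests.
#   else:
#     dataset = ImagenetData(subset=FLAGS.subset)
#     train(server.target, dataset, cluster_spec)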
......@@ -26,7 +26,6 @@ from __future__ import print_function
import re
import tensorflow as tf
from inception.slim import slim
......@@ -79,15 +78,13 @@ def inference(images, num_classes, for_training=False, restore_logits=True,
stddev=0.1,
activation=tf.nn.relu,
batch_norm_params=batch_norm_params):
# Force all Variables to reside on the CPU.
with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
logits, endpoints = slim.inception.inception_v3(
images,
dropout_keep_prob=0.8,
num_classes=num_classes,
is_training=for_training,
restore_logits=restore_logits,
scope=scope)
logits, endpoints = slim.inception.inception_v3(
images,
dropout_keep_prob=0.8,
num_classes=num_classes,
is_training=for_training,
restore_logits=restore_logits,
scope=scope)
# Add summaries for viewing model statistics on TensorBoard.
_activation_summaries(endpoints)
......
......@@ -24,8 +24,6 @@ import os.path
import re
import time
import numpy as np
import tensorflow as tf
......@@ -215,7 +213,6 @@ def train(dataset):
num_preprocess_threads = FLAGS.num_preprocess_threads * FLAGS.num_gpus
images, labels = image_processing.distorted_inputs(
dataset,
batch_size=split_batch_size,
num_preprocess_threads=num_preprocess_threads)
input_summaries = copy.copy(tf.get_collection(tf.GraphKeys.SUMMARIES))
......@@ -229,10 +226,22 @@ def train(dataset):
for i in xrange(FLAGS.num_gpus):
with tf.device('/gpu:%d' % i):
with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
# Calculate the loss for one tower of the ImageNet model. This
# function constructs the entire ImageNet model but shares the
# variables across all towers.
loss = _tower_loss(images, labels, num_classes, scope)
# Split the batch of images and labels.
batch_start = split_batch_size * i
images_batch = tf.slice(images,
begin=[batch_start, 0, 0, 0],
size=[split_batch_size, -1, -1, -1])
labels_batch = tf.slice(labels,
begin=[batch_start],
size=[split_batch_size])
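          # Illustrative arithmetic: with batch_size=32 and num_gpus=4,
          # split_batch_size is 8 and tower i receives rows [8*i, 8*i + 8) of
          # the dequeued batch; the -1 entries in `size` keep the remaining
          # dimensions whole.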
# Force all Variables to reside on the CPU.
with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
# Calculate the loss for one tower of the ImageNet model. This
# function constructs the entire ImageNet model but shares the
# variables across all towers.
loss = _tower_loss(images_batch, labels_batch, num_classes, scope)
# Reuse variables for the next tower.
tf.get_variable_scope().reuse_variables()
......