"convert/git@developer.sourcefind.cn:OpenDAS/ollama.git" did not exist on "107f6959299d0cc18ef15df23cee5eaae8ffbf4e"
Commit 5d06cfcf authored by Toby Boyd

Merge branch 'cmlesupport' of https://github.com/elibixby/models

parents 7c460c90 f5697b94
...@@ -17,62 +17,74 @@ Before trying to run the model we highly encourage you to read all the README.

2. Download the CIFAR-10 dataset.

```shell
curl -o cifar-10-python.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar xzf cifar-10-python.tar.gz
```

After running the commands above, you should see the following files in the folder where the data was downloaded.

```shell
ls -R cifar-10-batches-py
```

The output should be:

```
batches.meta  data_batch_1  data_batch_2  data_batch_3
data_batch_4  data_batch_5  readme.html   test_batch
```

3. Generate TFRecord files.

This will generate TFRecord files for the training and test data available at the input_dir.
You can see more details in `generate_cifar10_tfrecords.py`.

```shell
python generate_cifar10_tfrecords.py --input-dir=${PWD}/cifar-10-batches-py \
                                     --output-dir=${PWD}/cifar-10-batches-py
```

After running the command above, you should see the following new files in the output_dir.

```shell
ls -R cifar-10-batches-py
```

```
train.tfrecords  validation.tfrecords  eval.tfrecords
```
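If you are curious what the conversion amounts to, here is a minimal sketch of the idea. It assumes TensorFlow 1.x and the standard CIFAR-10 pickle layout; the feature names and helper below are illustrative, not the actual schema used by `generate_cifar10_tfrecords.py`.

```python
# Illustrative sketch only -- not the actual contents of generate_cifar10_tfrecords.py.
import pickle

import tensorflow as tf

def convert_batch_to_tfrecords(pickle_path, output_path):
    """Serialize one CIFAR-10 pickle batch into a TFRecord file."""
    with open(pickle_path, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')  # keys are bytes, e.g. b'data'
    with tf.python_io.TFRecordWriter(output_path) as writer:
        for image, label in zip(batch[b'data'], batch[b'labels']):
            example = tf.train.Example(features=tf.train.Features(feature={
                'image': tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[image.tobytes()])),
                'label': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
            }))
            writer.write(example.SerializeToString())

convert_batch_to_tfrecords('cifar-10-batches-py/data_batch_1',
                           'cifar-10-batches-py/data_batch_1.tfrecords')
```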
## How to run in local mode

Run the model on CPU only. After training, it runs the evaluation.

```shell
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
                       --job-dir=/tmp/cifar10 \
                       --num-gpus=0 \
                       --train-steps=1000
```

Run the model on 2 GPUs using the CPU as parameter server. After training, it runs the evaluation.

```shell
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
                       --job-dir=/tmp/cifar10 \
                       --num-gpus=2 \
                       --train-steps=1000
```

Run the model on 2 GPUs using a GPU as parameter server.
It will run an experiment, which in a local setting basically means it will stop training
a couple of times to perform evaluation.

```shell
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
                       --job-dir=/tmp/cifar10 \
                       --variable-strategy GPU \
                       --num-gpus=2
```

There are more command line flags to play with; run `python cifar10_main.py --help` for details.
## How to run in distributed mode

...@@ -86,7 +98,7 @@ You'll also need a Google Cloud Storage bucket for the data. If you followed the

```shell
MY_BUCKET=gs://<my-bucket-name>
gsutil cp -r ${PWD}/cifar-10-batches-py $MY_BUCKET/
```

Then run the following command from the `tutorials/image` directory of this repository (the parent directory of this README):
...@@ -172,18 +184,19 @@ By the default environment is *local*, for a distributed setting we need to chan

Once you have a `TF_CONFIG` configured properly on each host, you're ready to run in a distributed setting.
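For orientation, here is a hedged example of what a `TF_CONFIG` might look like on the master of a small cluster; the host names, ports, and indices below are placeholders, not values from this repository:

```shell
# Placeholder cluster definition -- substitute your own hosts and ports.
export TF_CONFIG='{
  "cluster": {
    "master": ["master-host:2222"],
    "worker": ["worker-host-1:2222", "worker-host-2:2222"],
    "ps": ["ps-host:2222"]
  },
  "task": {"type": "master", "index": 0},
  "environment": "cloud"
}'
```

Each worker and ps host gets the same "cluster" block but its own "task" entry.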
#### Master

Run this on the master. It runs an Experiment in sync mode on 4 GPUs, using the CPU as parameter server, for 40000 steps, and will run evaluation a couple of times during training.
The num_workers argument is used only to update the learning rate correctly.
Make sure the model_dir is the same as the one defined in the TF_CONFIG.

```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
                       --job-dir=gs://path/model_dir/ \
                       --num-gpus=4 \
                       --train-steps=40000 \
                       --sync \
                       --num-workers=2
```
*Output:*

...@@ -313,16 +326,17 @@ INFO:tensorflow:Saving dict for global step 1: accuracy = 0.0994, global_step =
#### Worker

Run this on each worker. It runs an Experiment in sync mode on 4 GPUs, using the CPU as parameter server, for 40000 steps, and will run evaluation a couple of times during training.
Make sure the model_dir is the same as the one defined in the TF_CONFIG.

```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
                       --job-dir=gs://path/model_dir/ \
                       --num-gpus=4 \
                       --train-steps=40000 \
                       --sync
```
*Output:*

...@@ -428,12 +442,11 @@ INFO:tensorflow:loss = 27.8453, step = 179 (18.893 sec)
#### PS

Run this on the ps. The ps does not do any training, so most of the arguments won't affect the execution.

```shell
python cifar10_main.py --job-dir=gs://path/model_dir/
```
*Output:*

...@@ -460,11 +473,12 @@ When using Estimators you can also visualize your data in TensorBoard, with no c

You'll see something similar to this if you "point" TensorBoard to the `model_dir` you used to train or evaluate your model.
Check TensorBoard during training or after it.
Just point TensorBoard to the model_dir you chose in the previous step;
by default the model_dir is "sentiment_analysis_output".

```shell
tensorboard --logdir="sentiment_analysis_output"
```
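For the CIFAR-10 commands earlier in this README, the job-dir was `/tmp/cifar10`, so the corresponding invocation would be:

```shell
tensorboard --logdir=/tmp/cifar10
```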
## Warnings

...
...@@ -74,61 +74,50 @@ def get_model_fn(num_gpus, variable_strategy, num_workers):

    tower_gradvars = []
    tower_preds = []

    if num_gpus == 0:
      num_devices = 1
      device_type = 'cpu'
    else:
      num_devices = num_gpus
      device_type = 'gpu'

    for i in range(num_devices):
      worker_device = '/{}:{}'.format(device_type, i)
      if variable_strategy == 'CPU':
        device_setter = cifar10_utils.local_device_setter(
            worker_device=worker_device)
      elif variable_strategy == 'GPU':
        device_setter = cifar10_utils.local_device_setter(
            ps_device_type='gpu',
            worker_device=worker_device,
            ps_strategy=tf.contrib.training.GreedyLoadBalancingStrategy(
                num_gpus,
                tf.contrib.training.byte_size_load_fn))
      with tf.variable_scope('resnet', reuse=bool(i != 0)):
        with tf.name_scope('tower_%d' % i) as name_scope:
          with tf.device(device_setter):
            loss, gradvars, preds = _tower_fn(
                is_training,
                weight_decay,
                tower_features[i],
                tower_labels[i],
                (device_type == 'cpu'),
                params['num_layers'],
                params['batch_norm_decay'],
                params['batch_norm_epsilon'])
            tower_losses.append(loss)
            tower_gradvars.append(gradvars)
            tower_preds.append(preds)
            if i == 0:
              # Only trigger batch_norm moving mean and variance update from
              # the 1st tower. Ideally, we should grab the updates from all
              # towers but these stats accumulate extremely fast so we can
              # ignore the other stats from the other towers without
              # significant detriment.
              update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS,
                                             name_scope)

    # Now compute global loss and gradients.
    gradvars = []
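The hunk is cut off here. For orientation only (this is not part of the diff), the consolidation step that typically follows this kind of in-graph replication averages each variable's per-tower gradients before a single optimizer step; a rough sketch of that pattern:

```python
# Illustrative sketch, not part of this commit: merge per-tower (gradient, variable)
# lists into one list by averaging each variable's gradients across towers.
import itertools
from collections import defaultdict

import tensorflow as tf

def average_tower_gradients(tower_gradvars):
    """tower_gradvars: one [(grad, var), ...] list per tower."""
    grads_by_var = defaultdict(list)
    for grad, var in itertools.chain(*tower_gradvars):
        if grad is not None:
            grads_by_var[var].append(grad)
    gradvars = []
    for var, grads in grads_by_var.items():
        # Keep the averaged gradient on the same device as its variable.
        with tf.device(var.device):
            avg = grads[0] if len(grads) == 1 else tf.multiply(
                tf.add_n(grads), 1. / len(grads))
        gradvars.append((avg, var))
    return gradvars
```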
...@@ -420,7 +409,7 @@ if __name__ == '__main__':
      help='The directory where the model will be stored.'
  )
  parser.add_argument(
      '--variable-strategy',
      choices=['CPU', 'GPU'],
      type=str,
      default='CPU',
...@@ -520,13 +509,13 @@ if __name__ == '__main__':
      help='Whether to log device placement.'
  )
  parser.add_argument(
      '--batch-norm-decay',
      type=float,
      default=0.997,
      help='Decay for batch norm.'
  )
  parser.add_argument(
      '--batch-norm-epsilon',
      type=float,
      default=1e-5,
      help='Epsilon for batch norm.'
  )
...