Commit da62bb0b authored by Toby Boyd, committed by GitHub

Merge pull request #2284 from tfboyd/single_cmd_download

Download and create TFRecords with single command.
parents aae631cc df399534
CIFAR-10 is a common benchmark in machine learning for image recognition.
http://www.cs.toronto.edu/~kriz/cifar.html
Code in this directory focuses on how to use TensorFlow Estimators to train and
evaluate a CIFAR-10 ResNet model on:
* A single host with one CPU;
* A single host with multiple GPUs;
* Multiple hosts with CPU or multiple GPUs;
Before trying to run the model we highly encourage you to read all of this README.
## Prerequisites
1. [Install](https://www.tensorflow.org/install/) TensorFlow version 1.2.1 or
later.

2. Download the CIFAR-10 dataset and generate TFRecord files using the provided
script. The script and associated command below will download the CIFAR-10
dataset and then generate a TFRecord for the training, validation, and
evaluation datasets.
```shell
python generate_cifar10_tfrecords.py --data-dir=${PWD}/cifar-10-data
```
After running the command above, you should see the following files in the
--data-dir (`ls -R cifar-10-data`):

* train.tfrecords
* validation.tfrecords
* eval.tfrecords
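To sanity-check the generated data, a short sketch like the one below reads the
first record back with `tf.python_io.tf_record_iterator`; the `image` and
`label` feature names match what `generate_cifar10_tfrecords.py` writes, and the
path assumes the `cifar-10-data` directory used above.

```python
import tensorflow as tf

# Decode the first serialized tf.train.Example from the training split.
record = next(tf.python_io.tf_record_iterator('cifar-10-data/train.tfrecords'))
example = tf.train.Example.FromString(record)
label = example.features.feature['label'].int64_list.value[0]
image = example.features.feature['image'].bytes_list.value[0]
print('label: %d, image bytes: %d' % (label, len(image)))
```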
## Training on a single machine with GPUs or CPU
Run the training on CPU only. After training, it runs the evaluation.
```
python cifar10_main.py --data-dir=${PWD}/cifar-10-data \
--job-dir=/tmp/cifar10 \
--num-gpus=0 \
--train-steps=1000
```
Run the model on 2 GPUs using CPU as parameter server. After training, it runs
the evaluation.
```
python cifar10_main.py --data-dir=${PWD}/cifar-10-data \
--job-dir=/tmp/cifar10 \
--num-gpus=2 \
--train-steps=1000
```
Run the model on 2 GPUs using GPU as parameter server. It will run an
experiment, which in a local setting basically means it will stop training a
couple of times to perform evaluation.
```
python cifar10_main.py --data-dir=${PWD}/cifar-10-data \
--job-dir=/tmp/cifar10 \
--variable-strategy GPU \
--num-gpus=2
```
There are more command line flags to play with; run
`python cifar10_main.py --help` for details.
## Run distributed training
### (Optional) Running on Google Cloud Machine Learning Engine
This example can be run on Google Cloud Machine Learning Engine (ML Engine),
which will configure the environment and take care of running workers,
parameter servers, and masters in a fault-tolerant way.
To install the command line tool and set up a project and billing, see the
quickstart [here](https://cloud.google.com/ml-engine/docs/quickstarts/command-line).
You'll also need a Google Cloud Storage bucket for the data. If you followed the
instructions above, you can just run:
```
MY_BUCKET=gs://<my-bucket-name>
gsutil cp -r ${PWD}/cifar-10-data $MY_BUCKET/
```
Then run the following command from the `tutorials/image` directory of this
repository (the parent directory of this README):
```
gcloud ml-engine jobs submit training cifarmultigpu \
--package-path cifar10_estimator/ \
--module-name cifar10_estimator.cifar10_main \
-- \
--data-dir=$MY_BUCKET/cifar-10-data \
--num-gpus=4 \
--train-steps=1000
```
### Set TF_CONFIG
Considering that you already have multiple hosts configured, all you need is a
`TF_CONFIG` environment variable on each host. You can set up the hosts manually
or check [tensorflow/ecosystem](https://github.com/tensorflow/ecosystem) for
instructions about how to set up a cluster.
The `TF_CONFIG` will be used by the `RunConfig` to know the existing hosts and
their task: `master`, `ps` or `worker`.
Here's an example of `TF_CONFIG`.
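A minimal sketch of such a `TF_CONFIG` for the master, assuming one host per
role, placeholder `*-ip:8000` addresses, and a bucket path you substitute:

```python
import json

cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

# 'cluster' lists every task in the cluster; 'task' names this node's role.
TF_CONFIG = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'master', 'index': 0},
     'model_dir': 'gs://<bucket_path>/model_dir/',
     'environment': 'cloud'})
```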
*Cluster*
A cluster spec, which is basically a dictionary that describes all of the tasks
in the cluster. More about it [here](https://www.tensorflow.org/deploy/distributed).
In this cluster spec we are defining a cluster with 1 master, 1 ps and 1 worker.
* `ps`: saves the parameters among all workers. All workers can
read/write/update the parameters for the model via the ps. As some models are
extremely large, the parameters are shared among the ps (each ps stores a
subset).
* `worker`: does the training.
* `master`: basically a special worker, it does training, but also restores and
saves checkpoints and does evaluation.
*Task*
The Task defines the role of the current node; in this example the node is the
master at index 0 of the cluster spec. The task will be different for each node.
An example of the `TF_CONFIG` for a worker (same cluster spec as above,
placeholder addresses) would be:
```python
cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

TF_CONFIG = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'worker', 'index': 0},
     'model_dir': 'gs://<bucket_path>/model_dir/',
     'environment': 'cloud'})
```
*Model_dir*
This is the path where the master will save the checkpoints, graph and
TensorBoard files. For a multi-host environment you may want to use a
distributed file system; Google Storage and DFS are supported.
*Environment*
By default the environment is *local*; for a distributed setting we need to
change it to *cloud*.
### Running the script
Once you have a `TF_CONFIG` configured properly on each host you're ready to
run in a distributed setting.
#### Master
Run this on the master:

Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for
40000 steps. It will run evaluation a couple of times during training. The
num_workers argument is used only to update the learning rate correctly. Make
sure the model_dir is the same as defined in the TF_CONFIG.
```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-data \
--job-dir=gs://path/model_dir/ \
--num-gpus=4 \
--train-steps=40000 \
--sync
```
#### Worker
Run this on the worker:

Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for
40000 steps. It will run evaluation a couple of times during training. Make sure
the model_dir is the same as defined in the TF_CONFIG.
```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-data \
--job-dir=gs://path/model_dir/ \
--num-gpus=4 \
--train-steps=40000 \
--sync
```
## Visualizing results with TensorBoard
When using Estimators you can also visualize your data in TensorBoard, with no
changes in your code. You can use TensorBoard to visualize your TensorFlow
graph, plot quantitative metrics about the execution of your graph, and show
additional data like images that pass through it.
You'll see something similar to this if you "point" TensorBoard to the
`model_dir` you used to train or evaluate your model.
Check TensorBoard during training or after it. Just point TensorBoard to the
`model_dir` you chose in the previous step, for example the `--job-dir` used for
training above (`/tmp/cifar10`):
```shell
tensorboard --logdir=/tmp/cifar10
```
## Warnings
When running `cifar10_main.py` with the `--sync` argument you may see an error
similar to:
```python
File "cifar10_main.py", line 538, in <module>
    ...
```
generate_cifar10_tfrecords.py:
"""Read CIFAR-10 data from pickled numpy arrays and write TFExamples.
"""Read CIFAR-10 data from pickled numpy arrays and writes TFRecords.
Generates TFRecord files from the python version of the CIFAR-10 dataset
downloaded from https://www.cs.toronto.edu/~kriz/cifar.html.
Generates tf.train.Example protos and writes them to TFRecord files from the
python version of the CIFAR-10 dataset downloaded from
https://www.cs.toronto.edu/~kriz/cifar.html.
"""
from __future__ import absolute_import
import argparse
import cPickle
import os
import tarfile
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
CIFAR_FILENAME = 'cifar-10-python.tar.gz'
CIFAR_DOWNLOAD_URL = 'https://www.cs.toronto.edu/~kriz/' + CIFAR_FILENAME
CIFAR_LOCAL_FOLDER = 'cifar-10-batches-py'

def download_and_extract(data_dir):
  # Download CIFAR-10 if it is not already present in data_dir.
  tf.contrib.learn.datasets.base.maybe_download(CIFAR_FILENAME, data_dir,
                                                CIFAR_DOWNLOAD_URL)
  tarfile.open(os.path.join(data_dir, CIFAR_FILENAME),
               'r:gz').extractall(data_dir)

def _int64_feature(value):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
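
# Companion helper used by convert_to_tfrecord below; its definition is not
# shown in this excerpt, so this is a minimal sketch of the usual form.
def _bytes_feature(value):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))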
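
# The helpers below are called by main() and convert_to_tfrecord() but are not
# shown in this excerpt; these are minimal sketches of the usual forms.
# _get_file_names maps each output mode to the CIFAR-10 batch files it reads
# (train on batches 1-4, validate on batch 5, evaluate on the test batch).
def _get_file_names():
  file_names = {}
  file_names['train'] = ['data_batch_%d' % i for i in xrange(1, 5)]
  file_names['validation'] = ['data_batch_5']
  file_names['eval'] = ['test_batch']
  return file_names


def read_pickle_from_file(filename):
  # Assumes the Python 2 cPickle import above.
  with tf.gfile.Open(filename, 'r') as f:
    data_dict = cPickle.load(f)
  return data_dict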

def convert_to_tfrecord(input_files, output_file):
  """Converts a file to TFRecords."""
  print('Generating %s' % output_file)
  with tf.python_io.TFRecordWriter(output_file) as record_writer:
    for input_file in input_files:
      data_dict = read_pickle_from_file(input_file)
      data = data_dict['data']
      labels = data_dict['labels']
      num_entries_in_batch = len(labels)
      for i in range(num_entries_in_batch):
        example = tf.train.Example(features=tf.train.Features(
            feature={
                'image': _bytes_feature(data[i].tobytes()),
                'label': _int64_feature(labels[i])
            }))
        record_writer.write(example.SerializeToString())

def main(data_dir):
  print('Download from {} and extract.'.format(CIFAR_DOWNLOAD_URL))
  download_and_extract(data_dir)
  file_names = _get_file_names()
  input_dir = os.path.join(data_dir, CIFAR_LOCAL_FOLDER)
  for mode, files in file_names.items():
    input_files = [os.path.join(input_dir, f) for f in files]
    output_file = os.path.join(data_dir, mode + '.tfrecords')
    try:
      os.remove(output_file)
    except OSError:
      pass
    # Convert to tf.train.Example and write them to TFRecords.
    convert_to_tfrecord(input_files, output_file)
  print('Done!')

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--data-dir',
      type=str,
      default='',
      help='Directory to download and extract CIFAR-10 to.')
  args = parser.parse_args()
  main(args.data_dir)