Commit aae631cc authored by Toby Boyd, committed by GitHub

Merge pull request #2152 from elibixby/cmlesupport

Fix import paths and GPUOptions initialization
parents a380b4b3 e11010fc
...@@ -17,70 +17,106 @@ Before trying to run the model we highly encourage you to read all the README.
2. Download the CIFAR-10 dataset.
```shell
curl -o cifar-10-python.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar xzf cifar-10-python.tar.gz
```
After running the commands above, you should see the following files in the folder where the data was downloaded.
```shell
ls -R cifar-10-batches-py
```
The output should be:
```
batches.meta data_batch_1 data_batch_2 data_batch_3
data_batch_4 data_batch_5 readme.html test_batch
```
3. Generate TFRecord files.
This will generate a TFRecord file for the training and test data available at the input_dir.
You can see more details in `generate_cifar10_tfrecords.py`. A hedged sketch for spot-checking the generated records follows this list.
```shell
python generate_cifar10_tfrecords.py --input-dir=${PWD}/cifar-10-batches-py \
                                     --output-dir=${PWD}/cifar-10-batches-py
```
After running the command above, you should see the following new files in the output_dir.
```shell
ls -R cifar-10-batches-py
```
The output should include:
```
train.tfrecords validation.tfrecords eval.tfrecords
```
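Optionally, you can spot-check one of the generated files. The sketch below parses a single serialized record from `train.tfrecords`; only the `label` feature name is confirmed by the conversion script diffed later in this change, so treat any other feature keys it prints as whatever the script actually wrote.

```python
# Optional sanity check: decode one record from the generated train.tfrecords.
# Only the 'label' feature name is confirmed by generate_cifar10_tfrecords.py;
# the other feature keys printed here are whatever the script actually wrote.
import tensorflow as tf

path = 'cifar-10-batches-py/train.tfrecords'
serialized = next(tf.python_io.tf_record_iterator(path))

example = tf.train.Example()
example.ParseFromString(serialized)

print(sorted(example.features.feature.keys()))
print(example.features.feature['label'].int64_list.value)  # one class id in [0, 9]
```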
## How to run in local mode
Run the model on CPU only. After training, it runs the evaluation.
```
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
--job-dir=/tmp/cifar10 \
--num-gpus=0 \
--train-steps=1000
```
Run the model on 2 GPUs using CPU as parameter server. After training, it runs the evaluation.
```
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
--job-dir=/tmp/cifar10 \
--num-gpus=2 \
--train-steps=1000
```
Run the model on 2 GPUs using GPU as parameter server.
It will run an experiment, which in a local setting basically means it will stop training
a couple of times to perform evaluation.

```
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
                       --job-dir=/tmp/cifar10 \
                       --variable-strategy GPU \
                       --num-gpus=2
```
There are more command line flags to play with; run `python cifar10_main.py --help` for details.
## How to run in distributed mode
### (Optional) Running on Google Cloud Machine Learning Engine
This example can be run on Google Cloud Machine Learning Engine (ML Engine), which will configure the environment and take care of running workers, parameter servers, and masters in a fault tolerant way.
To install the command line tool and set up a project and billing, see the quickstart [here](https://cloud.google.com/ml-engine/docs/quickstarts/command-line).
You'll also need a Google Cloud Storage bucket for the data. If you followed the instructions above, you can just run:
```
MY_BUCKET=gs://<my-bucket-name>
gsutil cp -r ${PWD}/cifar-10-batches-py $MY_BUCKET/
```
Then run the following command from the `tutorials/image` directory of this repository (the parent directory of this README):
```
gcloud ml-engine jobs submit training cifarmultigpu \
--runtime-version 1.2 \
--job-dir=$MY_BUCKET/model_dirs/cifarmultigpu \
--config cifar10_estimator/cmle_config.yaml \
--package-path cifar10_estimator/ \
--module-name cifar10_estimator.cifar10_main \
-- \
--data-dir=$MY_BUCKET/cifar-10-batches-py \
--num-gpus=4 \
--train-steps=1000
```
### Set TF_CONFIG
Considering that you already have multiple hosts configured, all you need is a `TF_CONFIG`
...@@ -148,21 +184,19 @@ By the default environment is *local*, for a distributed setting we need to chan
Once you have a `TF_CONFIG` configured properly on each host you're ready to run on distributed settings.
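The actual `TF_CONFIG` examples are in the part of the README that this diff view collapses (the `@@` hunk above). Purely as an illustration of its shape, the configuration for the master host might be assembled like the sketch below; the host names, ports, and the `environment` value are placeholders, not values taken from this change.

```python
# Illustration only: the general shape of TF_CONFIG for the master host.
# Host names, ports, and the environment value are placeholders.
import json
import os

tf_config = {
    'cluster': {
        'master': ['master-host:2222'],
        'worker': ['worker-host:2222'],
        'ps': ['ps-host:2222'],
    },
    'task': {'type': 'master', 'index': 0},
    'environment': 'cloud',  # anything other than 'local' for a distributed run
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)
```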
#### Master
Run this on the master:
It runs an Experiment in sync mode on 4 GPUs using the CPU as parameter server for 40000 steps.
It will run evaluation a couple of times during training.
The num_workers argument is used only to update the learning rate correctly.
Make sure the model_dir is the same as the one defined in the TF_CONFIG.
```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
                       --job-dir=gs://path/model_dir/ \
                       --num-gpus=4 \
                       --train-steps=40000 \
                       --sync \
                       --num-workers=2
```
*Output:*
...@@ -292,19 +326,17 @@ INFO:tensorflow:Saving dict for global step 1: accuracy = 0.0994, global_step =
#### Worker
Run this on each worker:
It runs an Experiment in sync mode on 4 GPUs using the CPU as parameter server for 40000 steps.
It will run evaluation a couple of times during training.
Make sure the model_dir is the same as the one defined in the TF_CONFIG.
```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
                       --job-dir=gs://path/model_dir/ \
                       --num-gpus=4 \
                       --train-steps=40000 \
                       --sync
```
*Output:*
...@@ -410,12 +442,11 @@ INFO:tensorflow:loss = 27.8453, step = 179 (18.893 sec)
#### PS
Run this on the ps:
The ps will not do training, so most of the arguments won't affect the execution.

```shell
python cifar10_main.py --job-dir=gs://path/model_dir/
```
*Output:*
...@@ -442,16 +473,17 @@ When using Estimators you can also visualize your data in TensorBoard, with no c
You'll see something similar to this if you "point" TensorBoard to the `model_dir` you used to train or evaluate your model.
Check TensorBoard during training or after it.
Just point TensorBoard to the `job-dir` you chose in the previous steps, for example `/tmp/cifar10` for the local run.

```shell
tensorboard --logdir=/tmp/cifar10
```
## Warnings
When running `cifar10_main.py` with the `--sync` argument you may see an error similar to:
```python
File "cifar10_main.py", line 538, in <module>
...
...@@ -13,7 +13,6 @@
# limitations under the License.
# ==============================================================================
"""Model class for Cifar10 Dataset."""
from __future__ import division
from __future__ import print_function
...@@ -25,8 +24,18 @@ import model_base
class ResNetCifar10(model_base.ResNet):
  """Cifar10 model with ResNetV1 and basic residual block."""

  def __init__(self,
               num_layers,
               is_training,
               batch_norm_decay,
               batch_norm_epsilon,
               data_format='channels_first'):
    super(ResNetCifar10, self).__init__(
        is_training,
        data_format,
        batch_norm_decay,
        batch_norm_epsilon
    )
    self.n = (num_layers - 2) // 6
    # Add one in case label starts with 1. No impact if label starts with 0.
    self.num_classes = 10 + 1
...
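With this change the batch-norm hyperparameters are passed through the constructor instead of being read from global flags. A minimal sketch of the new call site follows; it assumes the class is importable from a module named `cifar10_model`, and reuses 0.997 and 1e-5 (the defaults of the flags removed from the base model below) plus an arbitrary 44-layer depth purely for illustration.

```python
# Illustrative only: construct the model with the new signature. The module
# name, depth, and hyperparameter values are assumptions, not from this diff.
import cifar10_model

model = cifar10_model.ResNetCifar10(
    num_layers=44,                  # any depth with (num_layers - 2) % 6 == 0
    is_training=True,
    batch_norm_decay=0.997,         # old flag default in the base model
    batch_norm_epsilon=1e-5,        # old flag default in the base model
    data_format='channels_first')
```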
import collections
import six
import tensorflow as tf
from tensorflow.python.platform import tf_logging as logging
from tensorflow.core.framework import node_def_pb2
from tensorflow.python.framework import device as pydev
from tensorflow.python.training import basic_session_run_hooks
from tensorflow.python.training import session_run_hook
from tensorflow.python.training import training_util
from tensorflow.python.training import device_setter
from tensorflow.contrib.learn.python.learn import run_config
# TODO(b/64848083) Remove once uid bug is fixed
class RunConfig(tf.contrib.learn.RunConfig):
def uid(self, whitelist=None):
"""Generates a 'Unique Identifier' based on all internal fields.
Caller should use the uid string to check `RunConfig` instance integrity
in one session use, but should not rely on the implementation details, which
are subject to change.
Args:
whitelist: A list of the string names of the properties uid should not
include. If `None`, defaults to `_DEFAULT_UID_WHITE_LIST`, which
includes most properties a user is allowed to change.
Returns:
A uid string.
"""
if whitelist is None:
whitelist = run_config._DEFAULT_UID_WHITE_LIST
state = {k: v for k, v in self.__dict__.items() if not k.startswith('__')}
# Pop out the keys in whitelist.
for k in whitelist:
state.pop('_' + k, None)
ordered_state = collections.OrderedDict(
sorted(state.items(), key=lambda t: t[0]))
# For class instance without __repr__, some special cares are required.
# Otherwise, the object address will be used.
if '_cluster_spec' in ordered_state:
ordered_state['_cluster_spec'] = collections.OrderedDict(
sorted(ordered_state['_cluster_spec'].as_dict().items(),
key=lambda t: t[0])
)
return ', '.join(
'%s=%r' % (k, v) for (k, v) in six.iteritems(ordered_state))
class ExamplesPerSecondHook(session_run_hook.SessionRunHook):
"""Hook to print out examples per second.
Total time is tracked and then divided by the total number of steps
to get the average step time and then batch_size is used to determine
the running average of examples per second. The examples per second for the
most recent interval is also logged.
"""
def __init__(
self,
batch_size,
every_n_steps=100,
every_n_secs=None,):
"""Initializer for ExamplesPerSecondHook.
Args:
batch_size: Total batch size used to calculate examples/second from
global time.
every_n_steps: Log stats every n steps.
every_n_secs: Log stats every n seconds.
"""
if (every_n_steps is None) == (every_n_secs is None):
raise ValueError('exactly one of every_n_steps'
' and every_n_secs should be provided.')
self._timer = basic_session_run_hooks.SecondOrStepTimer(
every_steps=every_n_steps, every_secs=every_n_secs)
self._step_train_time = 0
self._total_steps = 0
self._batch_size = batch_size
def begin(self):
self._global_step_tensor = training_util.get_global_step()
if self._global_step_tensor is None:
raise RuntimeError(
'Global step should be created to use StepCounterHook.')
def before_run(self, run_context): # pylint: disable=unused-argument
return basic_session_run_hooks.SessionRunArgs(self._global_step_tensor)
def after_run(self, run_context, run_values):
_ = run_context
global_step = run_values.results
if self._timer.should_trigger_for_step(global_step):
elapsed_time, elapsed_steps = self._timer.update_last_triggered_step(
global_step)
if elapsed_time is not None:
steps_per_sec = elapsed_steps / elapsed_time
self._step_train_time += elapsed_time
self._total_steps += elapsed_steps
average_examples_per_sec = self._batch_size * (
self._total_steps / self._step_train_time)
current_examples_per_sec = steps_per_sec * self._batch_size
# Average examples/sec followed by current examples/sec
logging.info('%s: %g (%g), step = %g', 'Average examples/sec',
average_examples_per_sec, current_examples_per_sec,
self._total_steps)
def local_device_setter(num_devices=1,
ps_device_type='cpu',
worker_device='/cpu:0',
ps_ops=None,
ps_strategy=None):
if ps_ops is None:
ps_ops = ['Variable', 'VariableV2', 'VarHandleOp']
if ps_strategy is None:
ps_strategy = device_setter._RoundRobinStrategy(num_devices)
if not six.callable(ps_strategy):
raise TypeError("ps_strategy must be callable")
def _local_device_chooser(op):
current_device = pydev.DeviceSpec.from_string(op.device or "")
node_def = op if isinstance(op, node_def_pb2.NodeDef) else op.node_def
if node_def.op in ps_ops:
ps_device_spec = pydev.DeviceSpec.from_string(
'/{}:{}'.format(ps_device_type, ps_strategy(op)))
ps_device_spec.merge_from(current_device)
return ps_device_spec.to_string()
else:
worker_device_spec = pydev.DeviceSpec.from_string(worker_device or "")
worker_device_spec.merge_from(current_device)
return worker_device_spec.to_string()
return _local_device_chooser
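The helpers above are only defined here, not exercised, so the following is a rough sketch of how they might be wired together; the module name `cifar10_utils`, the batch size, and the GPU count are assumptions for illustration and are not taken from cifar10_main.py.

```python
# Hedged usage sketch for the helpers above. Module name and numbers are
# illustrative assumptions, not taken from cifar10_main.py.
import tensorflow as tf

import cifar10_utils

# Spread newly created variables round-robin across 2 GPUs while the
# remaining ops stay on the chosen worker device.
device_setter = cifar10_utils.local_device_setter(
    num_devices=2, ps_device_type='gpu', worker_device='/gpu:0')

with tf.device(device_setter):
  # Variables created by this layer land on /gpu:0 and /gpu:1 in turn.
  logits = tf.layers.dense(tf.zeros([128, 32]), units=10)

# Log the running-average and most-recent examples/sec every 100 steps,
# assuming a global batch size of 128.
examples_sec_hook = cifar10_utils.ExamplesPerSecondHook(
    batch_size=128, every_n_steps=100)

# The RunConfig subclass above only patches uid(); construct it like the
# stock tf.contrib.learn.RunConfig and pass it wherever a RunConfig is needed.
run_config = cifar10_utils.RunConfig()
```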
trainingInput:
scaleTier: CUSTOM
masterType: complex_model_m_gpu
workerType: complex_model_m_gpu
parameterServerType: complex_model_m
workerCount: 1
...@@ -22,19 +22,11 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import cPickle
import os

import tensorflow as tf
def _int64_feature(value):
...@@ -55,7 +47,7 @@ def _get_file_names():
def read_pickle_from_file(filename):
  with tf.gfile.Open(filename, 'r') as f:
    data_dict = cPickle.load(f)
  return data_dict
...@@ -63,8 +55,7 @@ def read_pickle_from_file(filename):
def convert_to_tfrecord(input_files, output_file):
  """Converts a file to tfrecords."""
  print('Generating %s' % output_file)
  with tf.python_io.TFRecordWriter(output_file) as record_writer:
    for input_file in input_files:
      data_dict = read_pickle_from_file(input_file)
      data = data_dict['data']
...@@ -78,19 +69,35 @@ def convert_to_tfrecord(input_files, output_file):
              'label': _int64_feature(labels[i])
          }))
      record_writer.write(example.SerializeToString())
def main(input_dir, output_dir):
  file_names = _get_file_names()
  for mode, files in file_names.items():
    input_files = [
        os.path.join(input_dir, f) for f in files]
    output_file = os.path.join(output_dir, mode + '.tfrecords')
    # Convert to Examples and write the result to TFRecords.
    convert_to_tfrecord(input_files, output_file)
  print('Done!')


if __name__ == '__main__':
  parser = argparse.ArgumentParser()
parser.add_argument(
'--input-dir',
type=str,
default='',
help='Directory where CIFAR10 data is located.'
)
parser.add_argument(
'--output-dir',
type=str,
default='',
help="""\
Directory where TFRecords will be saved. The TFRecords will have the same
name as the CIFAR10 inputs + .tfrecords.\
"""
)
args = parser.parse_args()
main(args.input_dir, args.output_dir)
...@@ -25,16 +25,11 @@ from __future__ import print_function

import tensorflow as tf


class ResNet(object):
  """ResNet model."""

  def __init__(self, is_training, data_format, batch_norm_decay, batch_norm_epsilon):
    """ResNet constructor.

    Args:
...@@ -42,6 +37,8 @@ class ResNet(object):
      data_format: the data_format used during computation.
        one of 'channels_first' or 'channels_last'.
    """
    self._batch_norm_decay = batch_norm_decay
    self._batch_norm_epsilon = batch_norm_epsilon
    self._is_training = is_training
    assert data_format in ('channels_first', 'channels_last')
    self._data_format = data_format
...@@ -185,10 +182,10 @@ class ResNet(object):
      data_format = 'NHWC'
    return tf.contrib.layers.batch_norm(
        x,
        decay=self._batch_norm_decay,
        center=True,
        scale=True,
        epsilon=self._batch_norm_epsilon,
        is_training=self._is_training,
        fused=True,
        data_format=data_format)
...