Commit aae631cc authored by Toby Boyd, committed by GitHub

Merge pull request #2152 from elibixby/cmlesupport

Fix import paths and GPUOptions initialization
parents a380b4b3 e11010fc
Before trying to run the model, we highly encourage you to read all of the README.
2. Download the CIFAR-10 dataset.
```shell
curl -o cifar-10-python.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar xzf cifar-10-python.tar.gz
```
After running the commands above, you should see the following files in the folder where the data was downloaded.
```shell
ls -R cifar-10-batches-py
```
The output should be:
```
batches.meta data_batch_1 data_batch_2 data_batch_3
data_batch_4 data_batch_5 readme.html test_batch
```
3. Generate TFRecord files.
This will generate TFRecord files for the training, validation, and evaluation data available in the input directory.
You can see more details in `generate_cifar10_tfrecords.py`.
```shell
python generate_cifar10_tfrecords.py --input-dir=${PWD}/cifar-10-batches-py \
--output-dir=${PWD}/cifar-10-batches-py
```
After running the command above, you should see the following new files in the output_dir.
```shell
ls -R cifar-10-batches-py
```
```
train.tfrecords validation.tfrecords eval.tfrecords
```
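To sanity-check the conversion, here is a minimal counting sketch (an illustration, not part of the tutorial code; it assumes TensorFlow 1.x and that the TFRecord files live in `./cifar-10-batches-py`, as in the commands above):
```python
# Count the records in each generated TFRecord file.
import tensorflow as tf

for split in ('train', 'validation', 'eval'):
    path = 'cifar-10-batches-py/%s.tfrecords' % split
    num_records = sum(1 for _ in tf.python_io.tf_record_iterator(path))
    print('%s: %d records' % (split, num_records))
```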
## How to run in local mode
Run the model on CPU only. After training, it runs the evaluation.
```
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
--job-dir=/tmp/cifar10 \
--num-gpus=0 \
--train-steps=1000
```
Run the model on 2 GPUs using CPU as parameter server. After training, it runs the evaluation.
```
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
--job-dir=/tmp/cifar10 \
--num-gpus=2 \
--train-steps=1000
```
Run the model on 2 GPUs using the GPU as parameter server.
It will run an experiment, which in a local setting basically means it will stop training
a couple of times to run evaluation.
```
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
--job-dir=/tmp/cifar10 \
--variable-strategy GPU \
--num-gpus=2
```
There are more command line flags to play with; run `python cifar10_main.py --help` for details.
## How to run in distributed mode
### (Optional) Running on Google Cloud Machine Learning Engine
This example can be run on Google Cloud Machine Learning Engine (ML Engine), which will configure the environment and take care of running workers, parameter servers, and masters in a fault tolerant way.
To install the command line tool, and set up a project and billing, see the quickstart [here](https://cloud.google.com/ml-engine/docs/quickstarts/command-line).
You'll also need a Google Cloud Storage bucket for the data. If you followed the instructions above, you can just run:
```
MY_BUCKET=gs://<my-bucket-name>
gsutil cp -r ${PWD}/cifar-10-batches-py $MY_BUCKET/
```
Then run the following command from the `tutorials/image` directory of this repository (the parent directory of this README):
```
gcloud ml-engine jobs submit training cifarmultigpu \
--runtime-version 1.2 \
--job-dir=$MY_BUCKET/model_dirs/cifarmultigpu \
--config cifar10_estimator/cmle_config.yaml \
--package-path cifar10_estimator/ \
--module-name cifar10_estimator.cifar10_main \
-- \
--data-dir=$MY_BUCKET/cifar-10-batches-py \
--num-gpus=4 \
--train-steps=1000
```
### Set TF_CONFIG
Considering that you already have multiple hosts configured, all you need is a `TF_CONFIG` environment variable on each host.
By default the environment is *local*; for a distributed setting we need to change it to *cloud*.
Once you have `TF_CONFIG` configured properly on each host, you're ready to run in a distributed setting.
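For illustration, a minimal sketch of what setting `TF_CONFIG` on the master host can look like; the hostnames, ports, and task assignment below are placeholders, and the exact cluster layout is described in the portion of this README omitted above:
```python
# Illustrative TF_CONFIG for the master host; hostnames and ports are placeholders.
import json
import os

os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'master': ['master-host:2222'],
        'worker': ['worker-host:2222'],
        'ps': ['ps-host:2222'],
    },
    'task': {'type': 'master', 'index': 0},
    'environment': 'cloud',
})
```
Each host gets the same `cluster` description but its own `task` entry.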
#### Master
Run this on the master.
It runs an Experiment in sync mode on 4 GPUs, using the CPU as parameter server, for 40000 steps.
It will run evaluation a couple of times during training.
The `--num-workers` argument is used only to update the learning rate correctly.
Make sure the model directory (`--job-dir`) is the same as the one defined in the `TF_CONFIG`.
```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
--job-dir=gs://path/model_dir/ \
--num-gpus=4 \
--train-steps=40000 \
--sync \
--num-workers=2
```
*Output:*
```
INFO:tensorflow:Saving dict for global step 1: accuracy = 0.0994, global_step = ...
```
#### Worker
Run this on the worker.
It runs an Experiment in sync mode on 4 GPUs, using the CPU as parameter server, for 40000 steps.
It will run evaluation a couple of times during training.
Make sure the model directory (`--job-dir`) is the same as the one defined in the `TF_CONFIG`.
```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
--job-dir=gs://path/model_dir/ \
--num-gpus=4 \
--train-steps=40000 \
--sync
```
*Output:*
```
INFO:tensorflow:loss = 27.8453, step = 179 (18.893 sec)
```
#### PS
Run this on the ps.
The ps will not do training, so most of the arguments won't affect the execution.
```shell
python cifar10_main.py --job-dir=gs://path/model_dir/
```
*Output:*
When using Estimators you can also visualize your data in TensorBoard, with no changes in your code.
You'll see something similar to this if you "point" TensorBoard to the `model_dir` you used to train or evaluate your model.
Check TensorBoard during training or after it.
Just point TensorBoard to the `model_dir` you chose in the previous step;
by default the `model_dir` is "sentiment_analysis_output".
```shell
tensorboard --logdir="sentiment_analysis_output"
```
## Warnings
When running `cifar10_main.py` with the `--sync` argument you may see an error similar to:
```python
File "cifar10_main.py", line 538, in <module>
  ...
```
In `cifar10_model.py`:
# limitations under the License.
# ==============================================================================
"""Model class for Cifar10 Dataset."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import model_base
class ResNetCifar10(model_base.ResNet):
"""Cifar10 model with ResNetV1 and basic residual block."""
def __init__(self,
num_layers,
is_training,
batch_norm_decay,
batch_norm_epsilon,
data_format='channels_first'):
super(ResNetCifar10, self).__init__(
is_training,
data_format,
batch_norm_decay,
batch_norm_epsilon
)
self.n = (num_layers - 2) // 6
# Add one in case label starts with 1. No impact if label starts with 0.
self.num_classes = 10 + 1
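For orientation, a minimal construction sketch of the new signature (illustrative only: `num_layers=44` is an assumed depth, while the decay and epsilon values match the defaults of the removed command-line flags):
```python
# Illustrative construction of the CIFAR-10 ResNet with the new constructor arguments.
import cifar10_model

model = cifar10_model.ResNetCifar10(
    num_layers=44,              # assumed depth; any 6n + 2 works with this block layout
    is_training=True,
    batch_norm_decay=0.997,     # former default of the batch_norm_decay flag
    batch_norm_epsilon=1e-5,    # former default of the batch_norm_epsilon flag
    data_format='channels_first')
```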
In `cifar10_utils.py`:
import collections
import six
import tensorflow as tf
from tensorflow.python.platform import tf_logging as logging
from tensorflow.core.framework import node_def_pb2
from tensorflow.python.framework import device as pydev
from tensorflow.python.training import basic_session_run_hooks
from tensorflow.python.training import session_run_hook
from tensorflow.python.training import training_util
from tensorflow.python.training import device_setter
from tensorflow.contrib.learn.python.learn import run_config
# TODO(b/64848083) Remove once uid bug is fixed
class RunConfig(tf.contrib.learn.RunConfig):
def uid(self, whitelist=None):
"""Generates a 'Unique Identifier' based on all internal fields.
Caller should use the uid string to check `RunConfig` instance integrity
in one session use, but should not rely on the implementation details, which
is subject to change.
Args:
whitelist: A list of the string names of the properties uid should not
include. If `None`, defaults to `_DEFAULT_UID_WHITE_LIST`, which
includes most of the properties the user is allowed to change.
Returns:
A uid string.
"""
if whitelist is None:
whitelist = run_config._DEFAULT_UID_WHITE_LIST
state = {k: v for k, v in self.__dict__.items() if not k.startswith('__')}
# Pop out the keys in whitelist.
for k in whitelist:
state.pop('_' + k, None)
ordered_state = collections.OrderedDict(
sorted(state.items(), key=lambda t: t[0]))
# For class instances without __repr__, special care is required.
# Otherwise, the object address will be used.
if '_cluster_spec' in ordered_state:
ordered_state['_cluster_spec'] = collections.OrderedDict(
sorted(ordered_state['_cluster_spec'].as_dict().items(),
key=lambda t: t[0])
)
return ', '.join(
'%s=%r' % (k, v) for (k, v) in six.iteritems(ordered_state))
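A small usage sketch of the workaround (illustrative; the real wiring lives in `cifar10_main.py`): the uid string fingerprints the non-whitelisted `RunConfig` fields, which is what the Experiment integrity check relies on.
```python
# Illustrative: fingerprint a RunConfig, optionally ignoring a whitelisted property.
import cifar10_utils

config = cifar10_utils.RunConfig()
print(config.uid())                              # full fingerprint
print(config.uid(whitelist=['tf_random_seed']))  # ignore tf_random_seed
```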
class ExamplesPerSecondHook(session_run_hook.SessionRunHook):
"""Hook to print out examples per second.
Total time is tracked and then divided by the total number of steps
to get the average step time and then batch_size is used to determine
the running average of examples per second. The examples per second for the
most recent interval is also logged.
"""
def __init__(
self,
batch_size,
every_n_steps=100,
every_n_secs=None,):
"""Initializer for ExamplesPerSecondHook.
Args:
batch_size: Total batch size used to calculate examples/second from
global time.
every_n_steps: Log stats every n steps.
every_n_secs: Log stats every n seconds.
"""
if (every_n_steps is None) == (every_n_secs is None):
raise ValueError('exactly one of every_n_steps'
' and every_n_secs should be provided.')
self._timer = basic_session_run_hooks.SecondOrStepTimer(
every_steps=every_n_steps, every_secs=every_n_secs)
self._step_train_time = 0
self._total_steps = 0
self._batch_size = batch_size
def begin(self):
self._global_step_tensor = training_util.get_global_step()
if self._global_step_tensor is None:
raise RuntimeError(
'Global step should be created to use ExamplesPerSecondHook.')
def before_run(self, run_context): # pylint: disable=unused-argument
return basic_session_run_hooks.SessionRunArgs(self._global_step_tensor)
def after_run(self, run_context, run_values):
_ = run_context
global_step = run_values.results
if self._timer.should_trigger_for_step(global_step):
elapsed_time, elapsed_steps = self._timer.update_last_triggered_step(
global_step)
if elapsed_time is not None:
steps_per_sec = elapsed_steps / elapsed_time
self._step_train_time += elapsed_time
self._total_steps += elapsed_steps
average_examples_per_sec = self._batch_size * (
self._total_steps / self._step_train_time)
current_examples_per_sec = steps_per_sec * self._batch_size
# Average examples/sec followed by current examples/sec
logging.info('%s: %g (%g), step = %g', 'Average examples/sec',
average_examples_per_sec, current_examples_per_sec,
self._total_steps)
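A usage sketch for the hook (illustrative; the estimator and batch size are placeholders for whatever `cifar10_main.py` builds):
```python
# Illustrative: log examples/sec every 100 steps during training.
import cifar10_utils

examples_sec_hook = cifar10_utils.ExamplesPerSecondHook(
    batch_size=128, every_n_steps=100)
# estimator.train(input_fn=train_input_fn, hooks=[examples_sec_hook])  # placeholder estimator
```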
def local_device_setter(num_devices=1,
ps_device_type='cpu',
worker_device='/cpu:0',
ps_ops=None,
ps_strategy=None):
if ps_ops is None:
ps_ops = ['Variable', 'VariableV2', 'VarHandleOp']
if ps_strategy is None:
ps_strategy = device_setter._RoundRobinStrategy(num_devices)
if not six.callable(ps_strategy):
raise TypeError("ps_strategy must be callable")
def _local_device_chooser(op):
current_device = pydev.DeviceSpec.from_string(op.device or "")
node_def = op if isinstance(op, node_def_pb2.NodeDef) else op.node_def
if node_def.op in ps_ops:
ps_device_spec = pydev.DeviceSpec.from_string(
'/{}:{}'.format(ps_device_type, ps_strategy(op)))
ps_device_spec.merge_from(current_device)
return ps_device_spec.to_string()
else:
worker_device_spec = pydev.DeviceSpec.from_string(worker_device or "")
worker_device_spec.merge_from(current_device)
return worker_device_spec.to_string()
return _local_device_chooser
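A sketch of how this device setter is typically applied (illustrative; the device strings and ops below are placeholders):
```python
# Illustrative: pin variables to the CPU while other ops run on a GPU tower.
import tensorflow as tf
import cifar10_utils

device_setter = cifar10_utils.local_device_setter(
    num_devices=1, ps_device_type='cpu', worker_device='/gpu:0')
with tf.device(device_setter):
    w = tf.get_variable('w', shape=[10])  # variable ops land on /cpu:0
    y = tf.reduce_sum(w)                  # everything else lands on /gpu:0
```
The ML Engine configuration referenced by the `gcloud` command above, `cmle_config.yaml`, requests the machine types for this setup: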
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
  workerType: complex_model_m_gpu
  parameterServerType: complex_model_m
  workerCount: 1
In `generate_cifar10_tfrecords.py`:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import cPickle
import os
import tensorflow as tf
def _int64_feature(value):
def _get_file_names():
def read_pickle_from_file(filename):
with tf.gfile.Open(filename, 'r') as f:
data_dict = cPickle.load(f)
return data_dict
def convert_to_tfrecord(input_files, output_file):
"""Converts a file to tfrecords."""
print('Generating %s' % output_file)
with tf.python_io.TFRecordWriter(output_file) as record_writer:
for input_file in input_files:
data_dict = read_pickle_from_file(input_file)
data = data_dict['data']
labels = data_dict['labels']
num_entries_in_batch = len(labels)
for i in range(num_entries_in_batch):
example = tf.train.Example(
features=tf.train.Features(feature={
'image': _bytes_feature(data[i].tobytes()),
'label': _int64_feature(labels[i])
}))
record_writer.write(example.SerializeToString())
def main(input_dir, output_dir):
file_names = _get_file_names()
for mode, files in file_names.items():
input_files = [
os.path.join(input_dir, f) for f in files]
output_file = os.path.join(output_dir, mode + '.tfrecords')
# Convert to Examples and write the result to TFRecords.
convert_to_tfrecord(input_files, output_file)
print('Done!')
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--input-dir',
type=str,
default='',
help='Directory where CIFAR10 data is located.'
)
parser.add_argument(
'--output-dir',
type=str,
default='',
help="""\
Directory where TFRecords will be saved. The TFRecords will have the same
name as the CIFAR10 inputs + .tfrecords.\
"""
)
args = parser.parse_args()
main(args.input_dir, args.output_dir)
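For reference, a sketch of reading one record back with the `image`/`label` schema written above (illustrative; TensorFlow 1.x API, file path as produced by the README commands):
```python
# Illustrative: decode one Example from the generated train split.
import numpy as np
import tensorflow as tf

record = next(tf.python_io.tf_record_iterator('cifar-10-batches-py/train.tfrecords'))
example = tf.train.Example.FromString(record)
features = example.features.feature
label = features['label'].int64_list.value[0]
image = np.frombuffer(features['image'].bytes_list.value[0], dtype=np.uint8)
print(label, image.shape)  # each CIFAR-10 image is 3 * 32 * 32 = 3072 bytes
```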
In `model_base.py`:

from __future__ import print_function
import tensorflow as tf
class ResNet(object):
"""ResNet model."""
def __init__(self, is_training, data_format, batch_norm_decay, batch_norm_epsilon):
"""ResNet constructor.
Args:
data_format: the data_format used during computation.
one of 'channels_first' or 'channels_last'.
"""
self._batch_norm_decay = batch_norm_decay
self._batch_norm_epsilon = batch_norm_epsilon
self._is_training = is_training
assert data_format in ('channels_first', 'channels_last')
self._data_format = data_format
data_format = 'NHWC'
return tf.contrib.layers.batch_norm(
x,
decay=self._batch_norm_decay,
center=True,
scale=True,
epsilon=self._batch_norm_epsilon,
is_training=self._is_training,
fused=True,
data_format=data_format)