"convert/git@developer.sourcefind.cn:OpenDAS/ollama.git" did not exist on "107f6959299d0cc18ef15df23cee5eaae8ffbf4e"
Commit 5d06cfcf authored by Toby Boyd

Merge branch 'cmlesupport' of https://github.com/elibixby/models

parents 7c460c90 f5697b94
...@@ -17,62 +17,74 @@ Before trying to run the model we highly encourage you to read all the README.

2. Download the CIFAR-10 dataset.

```shell
curl -o cifar-10-python.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar xzf cifar-10-python.tar.gz
```

After running the commands above, you should see the following files in the folder where the data was downloaded.

```shell
ls -R cifar-10-batches-py
```

The output should be:

```
batches.meta  data_batch_1  data_batch_2  data_batch_3
data_batch_4  data_batch_5  readme.html   test_batch
```

3. Generate TFRecord files.

This will generate TFRecord files for the training and test data available at the input_dir.
You can see more details in `generate_cifar10_tfrecords.py`.

```shell
python generate_cifar10_tfrecords.py --input-dir=${PWD}/cifar-10-batches-py \
                                     --output-dir=${PWD}/cifar-10-batches-py
```

After running the command above, you should see the following new files in the output_dir.

```shell
ls -R cifar-10-batches-py
```

```
train.tfrecords  validation.tfrecords  eval.tfrecords
```
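If you are curious what the conversion amounts to, here is a minimal sketch of the idea. It assumes TensorFlow 1.x and the standard CIFAR-10 pickle layout; the feature names and helper below are illustrative, not the actual schema used by `generate_cifar10_tfrecords.py`.

```python
# Illustrative sketch only -- not the actual contents of generate_cifar10_tfrecords.py.
import pickle

import tensorflow as tf

def convert_batch_to_tfrecords(pickle_path, output_path):
    """Serialize one CIFAR-10 pickle batch into a TFRecord file."""
    with open(pickle_path, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')  # keys are bytes, e.g. b'data'
    with tf.python_io.TFRecordWriter(output_path) as writer:
        for image, label in zip(batch[b'data'], batch[b'labels']):
            example = tf.train.Example(features=tf.train.Features(feature={
                'image': tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[image.tobytes()])),
                'label': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
            }))
            writer.write(example.SerializeToString())

convert_batch_to_tfrecords('cifar-10-batches-py/data_batch_1',
                           'cifar-10-batches-py/data_batch_1.tfrecords')
```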
## How to run in local mode

Run the model on CPU only. After training, it runs the evaluation.

```shell
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
                       --job-dir=/tmp/cifar10 \
                       --num-gpus=0 \
                       --train-steps=1000
```

Run the model on 2 GPUs using the CPU as parameter server. After training, it runs the evaluation.

```shell
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
                       --job-dir=/tmp/cifar10 \
                       --num-gpus=2 \
                       --train-steps=1000
```

Run the model on 2 GPUs using a GPU as parameter server.
It will run an experiment, which in a local setting basically means it will stop training
a couple of times to perform evaluation.

```shell
python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
                       --job-dir=/tmp/cifar10 \
                       --variable-strategy GPU \
                       --num-gpus=2
```

There are more command line flags to play with; run `python cifar10_main.py --help` for details.
## How to run in distributed mode

...@@ -86,7 +98,7 @@ You'll also need a Google Cloud Storage bucket for the data. If you followed the

```shell
MY_BUCKET=gs://<my-bucket-name>
gsutil cp -r ${PWD}/cifar-10-batches-py $MY_BUCKET/
```

Then run the following command from the `tutorials/image` directory of this repository (the parent directory of this README):
...@@ -172,18 +184,19 @@ By the default environment is *local*, for a distributed setting we need to chan

Once you have a `TF_CONFIG` configured properly on each host, you're ready to run in a distributed setting.
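For orientation, here is a hedged example of what a `TF_CONFIG` might look like on the master of a small cluster; the host names, ports, and indices below are placeholders, not values from this repository:

```shell
# Placeholder cluster definition -- substitute your own hosts and ports.
export TF_CONFIG='{
  "cluster": {
    "master": ["master-host:2222"],
    "worker": ["worker-host-1:2222", "worker-host-2:2222"],
    "ps": ["ps-host:2222"]
  },
  "task": {"type": "master", "index": 0},
  "environment": "cloud"
}'
```

Each worker and ps host gets the same "cluster" block but its own "task" entry.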
#### Master

Run this on the master. It runs an Experiment in sync mode on 4 GPUs, using the CPU as parameter server, for 40000 steps, and will run evaluation a couple of times during training.
The num_workers argument is used only to update the learning rate correctly.
Make sure the model_dir is the same as the one defined in the TF_CONFIG.

```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
                       --job-dir=gs://path/model_dir/ \
                       --num-gpus=4 \
                       --train-steps=40000 \
                       --sync \
                       --num-workers=2
```
*Output:*

...@@ -313,16 +326,17 @@ INFO:tensorflow:Saving dict for global step 1: accuracy = 0.0994, global_step =
#### Worker

Run this on each worker. It runs an Experiment in sync mode on 4 GPUs, using the CPU as parameter server, for 40000 steps, and will run evaluation a couple of times during training.
Make sure the model_dir is the same as the one defined in the TF_CONFIG.

```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
                       --job-dir=gs://path/model_dir/ \
                       --num-gpus=4 \
                       --train-steps=40000 \
                       --sync
```
*Output:*

...@@ -428,12 +442,11 @@ INFO:tensorflow:loss = 27.8453, step = 179 (18.893 sec)
#### PS

Run this on the ps. The ps does not do any training, so most of the arguments won't affect the execution.

```shell
python cifar10_main.py --job-dir=gs://path/model_dir/
```
*Output:*

...@@ -460,11 +473,12 @@ When using Estimators you can also visualize your data in TensorBoard, with no c

You'll see something similar to this if you "point" TensorBoard to the `model_dir` you used to train or evaluate your model.
Check TensorBoard during training or after it.
Just point TensorBoard to the model_dir you chose in the previous step;
by default the model_dir is "sentiment_analysis_output".

```shell
tensorboard --logdir="sentiment_analysis_output"
```
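For the CIFAR-10 commands earlier in this README, the job-dir was `/tmp/cifar10`, so the corresponding invocation would be:

```shell
tensorboard --logdir=/tmp/cifar10
```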
## Warnings

...
...@@ -74,61 +74,50 @@ def get_model_fn(num_gpus, variable_strategy, num_workers):

    tower_gradvars = []
    tower_preds = []

    if num_gpus == 0:
      num_devices = 1
      device_type = 'cpu'
    else:
      num_devices = num_gpus
      device_type = 'gpu'

    for i in range(num_devices):
      worker_device = '/{}:{}'.format(device_type, i)
      if variable_strategy == 'CPU':
        device_setter = cifar10_utils.local_device_setter(
            worker_device=worker_device)
      elif variable_strategy == 'GPU':
        device_setter = cifar10_utils.local_device_setter(
            ps_device_type='gpu',
            worker_device=worker_device,
            ps_strategy=tf.contrib.training.GreedyLoadBalancingStrategy(
                num_gpus,
                tf.contrib.training.byte_size_load_fn))
      with tf.variable_scope('resnet', reuse=bool(i != 0)):
        with tf.name_scope('tower_%d' % i) as name_scope:
          with tf.device(device_setter):
            loss, gradvars, preds = _tower_fn(
                is_training,
                weight_decay,
                tower_features[i],
                tower_labels[i],
                (device_type == 'cpu'),
                params['num_layers'],
                params['batch_norm_decay'],
                params['batch_norm_epsilon'])
            tower_losses.append(loss)
            tower_gradvars.append(gradvars)
            tower_preds.append(preds)
            if i == 0:
              # Only trigger batch_norm moving mean and variance update from
              # the 1st tower. Ideally, we should grab the updates from all
              # towers but these stats accumulate extremely fast so we can
              # ignore the other stats from the other towers without
              # significant detriment.
              update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS,
                                             name_scope)

    # Now compute global loss and gradients.
    gradvars = []
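The hunk is cut off here. For orientation only (this is not part of the diff), the consolidation step that typically follows this kind of in-graph replication averages each variable's per-tower gradients before a single optimizer step; a rough sketch of that pattern:

```python
# Illustrative sketch, not part of this commit: merge per-tower (gradient, variable)
# lists into one list by averaging each variable's gradients across towers.
import itertools
from collections import defaultdict

import tensorflow as tf

def average_tower_gradients(tower_gradvars):
    """tower_gradvars: one [(grad, var), ...] list per tower."""
    grads_by_var = defaultdict(list)
    for grad, var in itertools.chain(*tower_gradvars):
        if grad is not None:
            grads_by_var[var].append(grad)
    gradvars = []
    for var, grads in grads_by_var.items():
        # Keep the averaged gradient on the same device as its variable.
        with tf.device(var.device):
            avg = grads[0] if len(grads) == 1 else tf.multiply(
                tf.add_n(grads), 1. / len(grads))
        gradvars.append((avg, var))
    return gradvars
```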
...@@ -420,7 +409,7 @@ if __name__ == '__main__':
      help='The directory where the model will be stored.'
  )
  parser.add_argument(
      '--variable-strategy',
      choices=['CPU', 'GPU'],
      type=str,
      default='CPU',
...@@ -520,13 +509,13 @@ if __name__ == '__main__':
      help='Whether to log device placement.'
  )
  parser.add_argument(
      '--batch-norm-decay',
      type=float,
      default=0.997,
      help='Decay for batch norm.'
  )
  parser.add_argument(
      '--batch-norm-epsilon',
      type=float,
      default=1e-5,
      help='Epsilon for batch norm.'
  )
...