ModelZoo / ResNet50_tensorflow / Commits

Commit 5d06cfcf, authored Aug 18, 2017 by Toby Boyd

Merge branch 'cmlesupport' of https://github.com/elibixby/models

Parents: 7c460c90, f5697b94
Showing 2 changed files with 118 additions and 115 deletions.
tutorials/image/cifar10_estimator/README.md (+72, -58)
tutorials/image/cifar10_estimator/cifar10_main.py (+46, -57)
tutorials/image/cifar10_estimator/README.md @ 5d06cfcf
@@ -17,63 +17,75 @@ Before trying to run the model we highly encourage you to read all the README.
 2. Download the CIFAR-10 dataset.
 
 ```shell
-$ curl -o cifar-10-python.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
-$ tar xzf cifar-10-python.tar.gz
+curl -o cifar-10-python.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
+tar xzf cifar-10-python.tar.gz
 ```
 
 After running the commands above, you should see the following files in the folder where the data was downloaded.
 
 ```shell
-$ ls -R cifar-10-batches-py
+ls -R cifar-10-batches-py
 ```
 
 The output should be:
 
 ```
 batches.meta data_batch_1 data_batch_2 data_batch_3
 data_batch_4 data_batch_5 readme.html test_batch
 ```
 
 3. Generate TFRecord files.
 
+This will generate a tf record for the training and test data available at the input_dir.
+You can see more details in `generate_cifar10_tfrecords.py`.
 ```shell
-# This will generate a tf record for the training and test data available at the input_dir.
-# You can see more details in generate_cifar10_tfrecords.py
-$ python generate_cifar10_tfrecords.py --input-dir=/prefix/to/downloaded/data/cifar-10-batches-py \
-  --output-dir=/prefix/to/downloaded/data/cifar-10-batches-py
+python generate_cifar10_tfrecords.py --input-dir=${PWD}/cifar-10-batches-py \
+  --output-dir=${PWD}/cifar-10-batches-py
 ```
 
 After running the command above, you should see the following new files in the output_dir.
 
 ```shell
-$ ls -R cifar-10-batches-py
+ls -R cifar-10-batches-py
 ```
 
 ```
 train.tfrecords validation.tfrecords eval.tfrecords
 ```
 
 ## How to run on local mode
 
+Run the model on CPU only. After training, it runs the evaluation.
 ```
-# Run the model on CPU only. After training, it runs the evaluation.
-$ python cifar10_main.py --data-dir=/prefix/to/downloaded/data/cifar-10-batches-py \
+python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
   --job-dir=/tmp/cifar10 \
   --num-gpus=0 \
   --train-steps=1000
 ```
 
-# Run the model on 2 GPUs using CPU as parameter server. After training, it runs the evaluation.
-$ python cifar10_main.py --data-dir=/prefix/to/downloaded/data/cifar-10-batches-py \
+Run the model on 2 GPUs using CPU as parameter server. After training, it runs the evaluation.
+```
+python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-py \
   --job-dir=/tmp/cifar10 \
   --num-gpus=2 \
   --train-steps=1000
 ```
 
-# Run the model on 2 GPUs using GPU as parameter server.
-# It will run an experiment, which for local setting basically means it will run stop training
-# a couple of times to perform evaluation.
-$ python cifar10_main.py --data-dir=/prefix/to/downloaded/data/cifar-10-batches-bin \
+Run the model on 2 GPUs using GPU as parameter server.
+It will run an experiment, which for a local setting basically means it will stop training
+a couple of times to perform evaluation.
+```
+python cifar10_main.py --data-dir=${PWD}/cifar-10-batches-bin \
   --job-dir=/tmp/cifar10 \
   --variable-strategy GPU \
   --num-gpus=2 \
-# There are more command line flags to play with; check cifar10_main.py for details.
 ```
+There are more command line flags to play with; run `python cifar10_main.py --help` for details.
 
 ## How to run on distributed mode
 
 ### (Optional) Running on Google Cloud Machine Learning Engine
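An aside on step 3 above: `generate_cifar10_tfrecords.py` follows the standard TFRecord recipe, packing each image and label into a serialized `tf.train.Example`. Below is a minimal sketch of that pattern under the TF 1.x API; the feature names and the `write_tfrecord` helper are illustrative, not the script's exact code.

```python
import numpy as np
import tensorflow as tf


def write_tfrecord(images, labels, filename):
  # Illustrative sketch of the TFRecord-writing pattern; see
  # generate_cifar10_tfrecords.py for the authoritative version.
  with tf.python_io.TFRecordWriter(filename) as writer:
    for image, label in zip(images, labels):
      example = tf.train.Example(features=tf.train.Features(feature={
          'image': tf.train.Feature(
              bytes_list=tf.train.BytesList(value=[image.tobytes()])),
          'label': tf.train.Feature(
              int64_list=tf.train.Int64List(value=[int(label)])),
      }))
      writer.write(example.SerializeToString())


# Toy usage: two random 32x32 RGB images with labels 0 and 1.
write_tfrecord(np.random.randint(0, 255, (2, 32, 32, 3), dtype=np.uint8),
               [0, 1], '/tmp/example.tfrecords')
```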
@@ -86,7 +98,7 @@ You'll also need a Google Cloud Storage bucket for the data. If you followed the
 ```
 MY_BUCKET=gs://<my-bucket-name>
-gsutil cp -r cifar-10-batches-py $MY_BUCKET/
+gsutil cp -r ${PWD}/cifar-10-batches-py $MY_BUCKET/
 ```
 
 Then run the following command from the `tutorials/image` directory of this repository (the parent directory of this README):
@@ -172,13 +184,14 @@ By the default environment is *local*, for a distributed setting we need to chan
 Once you have a `TF_CONFIG` configured properly on each host you're ready to run on distributed settings.
 
 #### Master
+Run this on master:
+Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for 40000 steps.
+It will run evaluation a couple of times during training.
+The num_workers argument is used only to update the learning rate correctly.
+Make sure the model_dir is the same as defined on the TF_CONFIG.
 ```shell
-# Run this on master:
-# Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for 40000 steps.
-# It will run evaluation a couple of times during training.
-# The num_workers arugument is used only to update the learning rate correctly.
-# Make sure the model_dir is the same as defined on the TF_CONFIG.
-$ python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
+python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
   --job-dir=gs://path/model_dir/ \
   --num-gpus=4 \
   --train-steps=40000 \
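For reference, the `TF_CONFIG` mentioned above is a JSON-encoded environment variable describing the cluster and the role of the current host. Here is a minimal sketch for the master host; the hostnames and ports are placeholders, so substitute your own cluster layout.

```python
import json
import os

# Hypothetical cluster layout; replace the host:port entries with your own.
tf_config = {
    'cluster': {
        'master': ['master-host:2222'],
        'worker': ['worker-host:2222'],
        'ps': ['ps-host:2222'],
    },
    # Each host sets its own role here ('master', 'worker', or 'ps').
    'task': {'type': 'master', 'index': 0},
    # The default environment is local; distributed runs change it.
    'environment': 'cloud',
}
os.environ['TF_CONFIG'] = json.dumps(tf_config)
```

On a worker or ps host only the 'task' entry changes; the 'cluster' map must be identical everywhere, just as the job-dir must match across hosts.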
@@ -313,12 +326,13 @@ INFO:tensorflow:Saving dict for global step 1: accuracy = 0.0994, global_step =
 #### Worker
+Run this on worker:
+Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for 40000 steps.
+It will run evaluation a couple of times during training.
+Make sure the model_dir is the same as defined on the TF_CONFIG.
 ```shell
-# Run this on worker:
-# Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for 40000 steps.
-# It will run evaluation a couple of times during training.
-# Make sure the model_dir is the same as defined on the TF_CONFIG.
-$ python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
+python cifar10_main.py --data-dir=gs://path/cifar-10-batches-py \
   --job-dir=gs://path/model_dir/ \
   --num-gpus=4 \
   --train-steps=40000 \
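On the num_workers note in the master command above: in sync mode the gradients of all workers are aggregated, so the effective batch grows with the number of workers and the learning rate is adjusted to match. A common form of that adjustment is linear scaling; the sketch below is illustrative only, not necessarily the exact scheme used in cifar10_main.py.

```python
def scaled_learning_rate(base_lr, num_workers):
  # With N synchronized workers the effective batch size grows by N,
  # so the step size is scaled accordingly (illustrative only).
  return base_lr * num_workers


print(scaled_learning_rate(0.1, 4))  # 0.4
```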
@@ -428,12 +442,11 @@ INFO:tensorflow:loss = 27.8453, step = 179 (18.893 sec)
 #### PS
-```shell
-# Run this on ps:
-# The ps will not do training so most of the arguments won't affect the execution
-# There are more command line flags to play with; check cifar10_main.py for details.
-$ python cifar10_main.py --job-dir=gs://path/model_dir/
+Run this on ps:
+The ps will not do training, so most of the arguments won't affect the execution.
+```shell
+python cifar10_main.py --job-dir=gs://path/model_dir/
 ```
 
 *Output:*
@@ -460,11 +473,12 @@ When using Estimators you can also visualize your data in TensorBoard, with no c
 You'll see something similar to this if you "point" TensorBoard to the `model_dir` you used to train or evaluate your model.
 
+Check TensorBoard during training or after it.
+Just point TensorBoard to the model_dir you chose on the previous step;
+by default the model_dir is "sentiment_analysis_output".
 ```shell
-# Check TensorBoard during training or after it.
-# Just point TensorBoard to the model_dir you chose on the previous step
-# by default the model_dir is "sentiment_analysis_output"
-$ tensorboard --log-dir="sentiment_analysis_output"
+tensorboard --log-dir="sentiment_analysis_output"
 ```
 
 ## Warnings
tutorials/image/cifar10_estimator/cifar10_main.py @ 5d06cfcf
@@ -74,9 +74,15 @@ def get_model_fn(num_gpus, variable_strategy, num_workers):
     tower_gradvars = []
     tower_preds = []
-    if num_gpus != 0:
-      for i in range(num_gpus):
-        worker_device = '/gpu:{}'.format(i)
+    if num_gpus == 0:
+      num_devices = 1
+      device_type = 'cpu'
+    else:
+      num_devices = num_gpus
+      device_type = 'gpu'
+
+    for i in range(num_devices):
+      worker_device = '/{}:{}'.format(device_type, i)
       if variable_strategy == 'CPU':
         device_setter = cifar10_utils.local_device_setter(
             worker_device=worker_device)
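The refactor above collapses the separate CPU and GPU code paths into one loop by parameterizing the device string. A standalone sketch of the idea (toy code, not the file's exact logic):

```python
def tower_devices(num_gpus):
  # One tower on the CPU when no GPUs are requested, otherwise one per GPU.
  if num_gpus == 0:
    num_devices, device_type = 1, 'cpu'
  else:
    num_devices, device_type = num_gpus, 'gpu'
  return ['/{}:{}'.format(device_type, i) for i in range(num_devices)]


print(tower_devices(0))  # ['/cpu:0']
print(tower_devices(2))  # ['/gpu:0', '/gpu:1']
```

This is also why the hardcoded `False` in the `_tower_fn` call below becomes `(device_type == 'cpu')`: the tower itself now needs to know which kind of device it was built for.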
@@ -97,7 +103,7 @@ def get_model_fn(num_gpus, variable_strategy, num_workers):
           weight_decay, tower_features[i], tower_labels[i],
-          False, params['num_layers'], params['batch_norm_decay'],
+          (device_type == 'cpu'), params['num_layers'], params['batch_norm_decay'],
           params['batch_norm_epsilon'])
@@ -112,23 +118,6 @@ def get_model_fn(num_gpus, variable_strategy, num_workers):
       # significant detriment.
       update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, name_scope)
-    else:
-      with tf.variable_scope('resnet'), tf.device('/cpu:0'):
-        with tf.name_scope('tower_cpu') as name_scope:
-          loss, gradvars, preds = _tower_fn(
-              is_training, weight_decay, tower_features[0], tower_labels[0],
-              True, params['num_layers'], params['batch_norm_decay'],
-              params['batch_norm_epsilon'])
-          tower_losses.append(loss)
-          tower_gradvars.append(gradvars)
-          tower_preds.append(preds)
-          update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, name_scope)
 
   # Now compute global loss and gradients.
   gradvars = []
@@ -420,7 +409,7 @@ if __name__ == '__main__':
       help='The directory where the model will be stored.')
   parser.add_argument(
-      '--variable_strategy',
+      '--variable-strategy',
       choices=['CPU', 'GPU'],
       type=str,
       default='CPU',
@@ -520,13 +509,13 @@ if __name__ == '__main__':
       help='Whether to log device placement.')
   parser.add_argument(
-      '--batch_norm_decay',
+      '--batch-norm-decay',
       type=float,
       default=0.997,
       help='Decay for batch norm.')
   parser.add_argument(
-      '--batch_norm_epsilon',
+      '--batch-norm-epsilon',
       type=float,
       default=1e-5,
       help='Epsilon for batch norm.'
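The flag renames in the two hunks above are transparent to the Python code that reads them: argparse converts hyphens in a long option name to underscores when it derives the attribute name, so `--variable-strategy` still surfaces as `args.variable_strategy`. A quick self-contained check:

```python
import argparse

parser = argparse.ArgumentParser()
# Hyphenated long options map to underscored attributes on the namespace.
parser.add_argument('--variable-strategy', choices=['CPU', 'GPU'], default='CPU')
parser.add_argument('--batch-norm-decay', type=float, default=0.997)

args = parser.parse_args(['--variable-strategy', 'GPU'])
print(args.variable_strategy)  # GPU
print(args.batch_norm_decay)   # 0.997
```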