Commit a730f38f, authored Mar 14, 2019 by Michael Carilli

Adding simple distributed example for #200

Parent: 7f0d8c87
Showing 4 changed files with 82 additions and 2 deletions (+82, -2):

- examples/imagenet/main_amp.py (+2, -2)
- examples/simple/distributed/README.md (+13, -0)
- examples/simple/distributed/distributed_data_parallel.py (+65, -0)
- examples/simple/distributed/run.sh (+2, -0)
**examples/imagenet/main_amp.py**

```diff
@@ -117,7 +117,7 @@ def main():
     args.world_size = 1

     if args.distributed:
-        args.gpu = args.local_rank % torch.cuda.device_count()
+        args.gpu = args.local_rank
         torch.cuda.set_device(args.gpu)
         torch.distributed.init_process_group(backend='nccl',
                                              init_method='env://')
@@ -334,7 +334,7 @@ def train(train_loader, model, criterion, optimizer, epoch):
        if args.prof: torch.cuda.nvtx.range_pop()

        if i % args.print_freq == 0:
-            # Every print_freq iterations, check the loss accuracy and speed.
+            # Every print_freq iterations, check the loss, accuracy, and speed.
            # For best performance, it doesn't make sense to print these metrics every
            # iteration, since they incur an allreduce and some host<->device syncs.
```
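The device-selection change above relies on torch.distributed.launch starting one process per GPU, so that `--local_rank` already lies in `[0, torch.cuda.device_count())` on each node and can index the device directly. A self-contained sketch of that mapping (the `assert` is illustrative, not part of the commit):

```python
import argparse
import torch

parser = argparse.ArgumentParser()
# --local_rank is supplied automatically by torch.distributed.launch.
parser.add_argument("--local_rank", default=0, type=int)
args = parser.parse_args()

# With one launched process per GPU, local_rank is a valid device index on
# this node, so the old modulo by device_count() was redundant here.
assert args.local_rank < torch.cuda.device_count()
torch.cuda.set_device(args.local_rank)
```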
**examples/simple/distributed/README.md** (new file, mode 100644)
**distributed_data_parallel.py** and **run.sh** show an example using Amp with [apex.parallel.DistributedDataParallel](https://nvidia.github.io/apex/parallel.html) or [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#distributeddataparallel) and the PyTorch multiprocess launcher script, [torch.distributed.launch](https://pytorch.org/docs/master/distributed.html#launch-utility). Using `Amp` with distributed training requires no changes from ordinary single-process use. The only gotcha is that wrapping your model with `DistributedDataParallel` must come after the call to `amp.initialize`. Test via

```bash
bash run.sh
```
**This is intended purely as an instructional example, not a performance showcase.**
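A minimal sketch of the ordering gotcha, condensed from the full script in this commit (the `Linear` model and `SGD` optimizer are just placeholders):

```python
import torch
from apex import amp
from apex.parallel import DistributedDataParallel

model = torch.nn.Linear(1024, 16).cuda()                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # placeholder optimizer

# Correct order: amp.initialize must see the raw model and optimizer first...
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# ...and only then is the model wrapped for distributed gradient averaging.
model = DistributedDataParallel(model)
```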
**examples/simple/distributed/distributed_data_parallel.py** (new file, mode 100644)
```python
import torch
import argparse
import os
from apex import amp
# FOR DISTRIBUTED: (can also use torch.nn.parallel.DistributedDataParallel instead)
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
# FOR DISTRIBUTED: Parse for the local_rank argument, which will be supplied
# automatically by torch.distributed.launch.
parser.add_argument("--local_rank", default=0, type=int)
args = parser.parse_args()

# FOR DISTRIBUTED: If we are running under torch.distributed.launch,
# the 'WORLD_SIZE' environment variable will also be set automatically.
args.distributed = False
if 'WORLD_SIZE' in os.environ:
    args.distributed = int(os.environ['WORLD_SIZE']) > 1

if args.distributed:
    # FOR DISTRIBUTED: Set the device according to local_rank.
    torch.cuda.set_device(args.local_rank)

    # FOR DISTRIBUTED: Initialize the backend. torch.distributed.launch will provide
    # environment variables, and requires that you use init_method=`env://`.
    torch.distributed.init_process_group(backend='nccl',
                                         init_method='env://')

torch.backends.cudnn.benchmark = True

N, D_in, D_out = 64, 1024, 16

# Each process receives its own batch of "fake input data" and "fake target data."
# The "training loop" in each process just uses this fake batch over and over.
# https://github.com/NVIDIA/apex/tree/master/examples/imagenet provides a more realistic
# example of distributed data sampling for both training and validation.
x = torch.randn(N, D_in, device='cuda')
y = torch.randn(N, D_out, device='cuda')

model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

if args.distributed:
    # FOR DISTRIBUTED: After amp.initialize, wrap the model with
    # apex.parallel.DistributedDataParallel.
    model = DistributedDataParallel(model)
    # torch.nn.parallel.DistributedDataParallel is also fine, with some added args:
    # model = torch.nn.parallel.DistributedDataParallel(model,
    #                                                   device_ids=[args.local_rank],
    #                                                   output_device=args.local_rank)

loss_fn = torch.nn.MSELoss()

for t in range(500):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

if args.local_rank == 0:
    print("final loss = ", loss)
```
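One hedged way to sanity-check a run (not part of the commit): since `DistributedDataParallel` averages gradients across processes, every rank should end with the same weights, so the final losses can be averaged with a collective and compared against any single rank's value. `torch.distributed.all_reduce` sums in place by default:

```python
# Illustrative check, appended after the training loop (assumes the process
# group was initialized above, i.e. args.distributed is True):
if args.distributed:
    avg_loss = loss.detach().clone()
    torch.distributed.all_reduce(avg_loss)          # in-place sum over ranks
    avg_loss /= torch.distributed.get_world_size()  # -> mean across ranks
    if args.local_rank == 0:
        print("mean final loss across ranks = ", avg_loss.item())
```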
**examples/simple/distributed/run.sh** (new file, mode 100644)
```bash
#!/bin/bash
python -m torch.distributed.launch --nproc_per_node=2 distributed_data_parallel.py
```
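`--nproc_per_node=2` assumes two GPUs on the node; in general it should match the number of GPUs you want each node to use. A hedged variant that derives the count from `nvidia-smi` (a common shell idiom, not something this commit uses):

```bash
# Launch one process per visible GPU on this node (illustrative only).
python -m torch.distributed.launch \
    --nproc_per_node="$(nvidia-smi -L | wc -l)" \
    distributed_data_parallel.py
```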