Commit 8e1f01c5 authored by Michael Carilli

Adding minimal examples with Apex and Pytorch distributed data parallel

parent 83acda92
# Simple examples of FP16_Optimizer functionality
`minimal.py` shows the basic usage of `FP16_Optimizer`.
`closure.py` shows how to use `FP16_Optimizer` with a closure.
`save_load.py` shows that `FP16_Optimizer` uses the same checkpointing syntax as ordinary Pytorch
optimizers. A combined sketch of these three single-process patterns is shown below.
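For orientation, here is a hedged sketch that combines the three single-process patterns. It reuses
the API calls that appear in the distributed scripts further down (`FP16_Optimizer(optimizer)`,
`optimizer.backward(loss)`, `optimizer.step()`); the closure and `state_dict` calls assume
`FP16_Optimizer` mirrors the standard `torch.optim` interface as described above, and the tensor
sizes and the `"checkpoint.pt"` file name are only illustrative.
```python
import torch
from torch.autograd import Variable
from apex.fp16_utils import FP16_Optimizer

N, D_in, D_out = 64, 1024, 16
x = Variable(torch.cuda.FloatTensor(N, D_in).normal_()).half()
y = Variable(torch.cuda.FloatTensor(N, D_out).normal_()).half()

model = torch.nn.Linear(D_in, D_out).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# Wrap the ordinary optimizer; FP16_Optimizer maintains FP32 master weights internally.
optimizer = FP16_Optimizer(optimizer)
loss_fn = torch.nn.MSELoss()

# minimal.py-style loop: only the backward call changes.
for t in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    optimizer.backward(loss)   # replaces loss.backward()
    optimizer.step()

# closure.py-style step: the closure performs the forward/backward pass
# (assumes step() accepts a closure like ordinary Pytorch optimizers).
def closure():
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    optimizer.backward(loss)
    return loss

optimizer.step(closure)

# save_load.py-style checkpointing: same syntax as an ordinary Pytorch optimizer.
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()}, "checkpoint.pt")
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
```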
`distributed_pytorch` shows an example using `FP16_Optimizer` with Pytorch DistributedDataParallel.
Usage of `FP16_Optimizer` in the distributed case does not need to change from ordinary
single-process usage. Run via
```bash
cd distributed_pytorch
bash run.sh
```
`distributed_apex` shows an example using `FP16_Optimizer` with Apex DistributedDataParallel.
Again, usage of `FP16_Optimizer` in the distributed case does not need to change from ordinary
single-process usage. Run via
```bash
cd distributed_apex
bash run.sh
```
# distributed_apex/distributed_data_parallel.py: FP16_Optimizer with apex.parallel.DistributedDataParallel
import torch
from torch.autograd import Variable
import argparse
from apex.parallel import DistributedDataParallel as DDP
from apex.fp16_utils import FP16_Optimizer

parser = argparse.ArgumentParser()
parser.add_argument('--dist-url', default='tcp://224.66.41.62:23456', type=str,
                    help='url used to set up distributed training')
parser.add_argument('--world-size', default=2, type=int,
                    help='Number of distributed processes.')
parser.add_argument("--rank", type=int,
                    help='Rank of this process')
args = parser.parse_args()

torch.cuda.set_device(args.rank)
torch.distributed.init_process_group(backend='nccl',
                                     init_method=args.dist_url,
                                     world_size=args.world_size,
                                     rank=args.rank)
torch.backends.cudnn.benchmark = True

N, D_in, D_out = 64, 1024, 16

x = Variable(torch.cuda.FloatTensor(N, D_in ).normal_()).half()
y = Variable(torch.cuda.FloatTensor(N, D_out).normal_()).half()

model = torch.nn.Linear(D_in, D_out).cuda().half()
model = DDP(model)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

### CONSTRUCT FP16_Optimizer ###
optimizer = FP16_Optimizer(optimizer)
###

loss_fn = torch.nn.MSELoss()

for t in range(500):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    ### CHANGE loss.backward() TO: ###
    optimizer.backward(loss)
    ###
    optimizer.step()

print("final loss = ", loss)
#!/bin/bash
# distributed_apex/run.sh
# By default, apex.parallel.multiproc will attempt to use all available GPUs on the system.
# The number of GPUs to use can be limited by setting CUDA_VISIBLE_DEVICES:
export CUDA_VISIBLE_DEVICES=0,1
python -m apex.parallel.multiproc distributed_data_parallel.py
# distributed_pytorch/distributed_data_parallel.py: FP16_Optimizer with torch.nn.parallel.DistributedDataParallel
import torch
from torch.autograd import Variable
import argparse
from apex.fp16_utils import FP16_Optimizer

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl',
                                     init_method='env://')
torch.backends.cudnn.benchmark = True

N, D_in, D_out = 64, 1024, 16

x = Variable(torch.cuda.FloatTensor(N, D_in ).normal_()).half()
y = Variable(torch.cuda.FloatTensor(N, D_out).normal_()).half()

model = torch.nn.Linear(D_in, D_out).cuda().half()
model = torch.nn.parallel.DistributedDataParallel(model,
                                                  device_ids=[args.local_rank],
                                                  output_device=args.local_rank)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

### CONSTRUCT FP16_Optimizer ###
optimizer = FP16_Optimizer(optimizer)
###

loss_fn = torch.nn.MSELoss()

for t in range(500):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    ### CHANGE loss.backward() TO: ###
    optimizer.backward(loss)
    ###
    optimizer.step()

print("final loss = ", loss)
#!/bin/bash
# distributed_pytorch/run.sh
python -m torch.distributed.launch --nproc_per_node=2 distributed_data_parallel.py