"src/vscode:/vscode.git/clone" did not exist on "9dccc7dc42dc7aca1ee5af806b38dea6632ad965"
Unverified Commit 05d9d496 authored by Mufei Li's avatar Mufei Li Committed by GitHub
Browse files
parent 05aca98d
...@@ -2,59 +2,36 @@ ...@@ -2,59 +2,36 @@
Chapter 8: Mixed Precision Training
===================================
DGL is compatible with the `PyTorch Automatic Mixed Precision (AMP) package
<https://pytorch.org/docs/stable/amp.html>`_
for mixed precision training, thus saving both training time and GPU memory
consumption. This feature requires DGL 0.9+.
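Before turning the feature on, it can help to confirm that the environment meets this
requirement. The short check below is a minimal sketch that only uses the standard version
attributes of PyTorch and DGL and the usual CUDA availability check.

.. code::

    import dgl
    import torch

    # Mixed precision training needs DGL 0.9+ and a CUDA-capable GPU.
    print('DGL version:', dgl.__version__)
    print('PyTorch version:', torch.__version__)
    print('CUDA available:', torch.cuda.is_available())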
Message-Passing with Half Precision
-----------------------------------
DGL allows message-passing on ``float16`` (fp16) features for both
UDFs (User Defined Functions) and built-in functions (e.g., ``dgl.function.sum``,
``dgl.function.copy_u``).
The following example shows how to use DGL's message-passing APIs on half-precision
features:
>>> import torch
>>> import dgl
>>> import dgl.function as fn
>>> dev = torch.device('cuda')
>>> g = dgl.rand_graph(30, 100).to(dev)  # Create a graph on GPU w/ 30 nodes and 100 edges.
>>> g.ndata['h'] = torch.rand(30, 16).to(dev).half()  # Create fp16 node features.
>>> g.edata['w'] = torch.rand(100, 1).to(dev).half()  # Create fp16 edge features.
>>> # Use DGL's built-in functions for message passing on fp16 features.
>>> g.update_all(fn.u_mul_e('h', 'w', 'm'), fn.sum('m', 'x'))
>>> g.ndata['x'].dtype
torch.float16
>>> g.apply_edges(fn.u_dot_v('h', 'x', 'hx'))
>>> g.edata['hx'].dtype
torch.float16
>>> # Use UDFs for message passing on fp16 features.
>>> def message(edges):
...     return {'m': edges.src['h'] * edges.data['w']}
...
>>> def reduce(nodes):
...     return {'y': torch.sum(nodes.mailbox['m'], 1)}
...
>>> def dot(edges):
...     return {'hy': (edges.src['h'] * edges.dst['y']).sum(-1, keepdims=True)}
...
>>> g.update_all(message, reduce)
>>> g.ndata['y'].dtype
torch.float16
>>> g.apply_edges(dot)
>>> g.edata['hy'].dtype
torch.float16
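If a graph already carries ``float32`` features, they can be cast to half precision with
standard PyTorch tensor casting before message passing. The snippet below is a minimal
sketch; the ``feat_fp32`` name is only for illustration.

.. code::

    import torch
    import dgl
    import dgl.function as fn

    dev = torch.device('cuda')
    g = dgl.rand_graph(30, 100).to(dev)
    feat_fp32 = torch.rand(30, 16, device=dev)    # regular float32 node features
    g.ndata['h'] = feat_fp32.half()               # cast to float16 before message passing
    g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'x'))
    assert g.ndata['x'].dtype == torch.float16    # the aggregated result stays in half precision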
End-to-End Mixed Precision Training
-----------------------------------
DGL relies on PyTorch's AMP package for mixed precision training,
and the user experience is exactly
the same as `PyTorch's <https://pytorch.org/docs/stable/notes/amp_examples.html>`_.

By wrapping the forward pass (including loss computation) with ``torch.cuda.amp.autocast()``,
PyTorch automatically selects the appropriate datatype for each op and tensor. Half precision
tensors are memory efficient, and most operators on them are faster because they leverage GPU
tensor cores.
.. code::

    import torch.nn.functional as F
    from torch.cuda.amp import autocast

    def forward(g, feat, label, mask, model, use_fp16):
        with autocast(enabled=use_fp16):
            logit = model(g, feat)
            loss = F.cross_entropy(logit[mask], label[mask])
        return loss
Small gradients in ``float16`` format have underflow problems (they flush to zero).
PyTorch provides a ``GradScaler`` module to address this issue. It multiplies
the loss by a factor and invokes the backward pass on the scaled loss to prevent
underflow. It then unscales the computed gradients before the optimizer
updates the parameters. The scale factor is determined automatically.
.. code::

    from torch.cuda.amp import GradScaler

    scaler = GradScaler()

    def backward(scaler, loss, optimizer):
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
The following example trains a 3-layer GAT on the Reddit dataset (w/ 114 million edges).
Pay attention to the differences in the code when ``use_fp16`` is activated or not.
.. code::

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.cuda.amp import autocast, GradScaler
    import dgl
    from dgl.data import RedditDataset
    from dgl.nn import GATConv
    from dgl.transforms import AddSelfLoop

    use_fp16 = True

    class GAT(nn.Module):
        def __init__(self,
                     in_feats,
                     n_hidden,
                     n_classes,
                     heads):
            super().__init__()
            # Three GATConv layers; hidden layers use ELU activation.
            self.layers = nn.ModuleList()
            self.layers.append(GATConv(in_feats, n_hidden, heads[0], activation=F.elu))
            self.layers.append(GATConv(n_hidden * heads[0], n_hidden, heads[1], activation=F.elu))
            self.layers.append(GATConv(n_hidden * heads[1], n_classes, heads[2]))

        def forward(self, g, h):
            for i, layer in enumerate(self.layers):
                h = layer(g, h)
                if i != len(self.layers) - 1:
                    h = h.flatten(1)   # concatenate attention heads
                else:
                    h = h.mean(1)      # average attention heads in the output layer
            return h
    # Data loading
    transform = AddSelfLoop()
    data = RedditDataset(transform)
    dev = torch.device('cuda')
    g = data[0]
    g = g.int().to(dev)
    train_mask = g.ndata['train_mask']
    feat = g.ndata['feat']
    label = g.ndata['label']

    in_feats = feat.shape[1]
    n_hidden = 256
    n_classes = data.num_classes
    heads = [1, 1, 1]
    model = GAT(in_feats, n_hidden, n_classes, heads)
    model = model.to(dev)
    model.train()

    # Create optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
    # Create gradient scaler
    scaler = GradScaler()
    for epoch in range(100):
        optimizer.zero_grad()
        loss = forward(g, feat, label, train_mask, model, use_fp16)
        if use_fp16:
            # Backprop w/ gradient scaling
            backward(scaler, loss, optimizer)
        else:
            loss.backward()
            optimizer.step()
        print('Epoch {} | Loss {}'.format(epoch, loss.item()))
On an NVIDIA V100 (16GB) machine, training this model without fp16 consumes
15.2GB of GPU memory; with fp16 turned on, training consumes 12.8GB of
GPU memory. The loss converges to similar values in both settings.
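To reproduce such a comparison on your own hardware, peak GPU memory can be read with
PyTorch's standard CUDA memory utilities. The snippet below is a minimal sketch; the exact
numbers will vary with the GPU and library versions.

.. code::

    import torch

    torch.cuda.reset_peak_memory_stats()
    # ... run one or more training epochs here ...
    peak_gb = torch.cuda.max_memory_allocated() / (1024 ** 3)
    print('Peak GPU memory allocated: {:.1f} GB'.format(peak_gb))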