Update (#4277)

Co-authored-by: Ubuntu <ubuntu@ip-172-31-53-142.us-west-2.compute.internal>

05d9d496 · Mufei Li · GitHub · 05aca98d · 05d9d496
Unverified Commit 05d9d496 authored Jul 21, 2022 by Mufei Li Committed by GitHub Jul 21, 2022

Showing with 66 additions and 81 deletions

docs/source/guide/mixed_precision.rst +66 -81

--- a/docs/source/guide/mixed_precision.rst
+++ b/docs/source/guide/mixed_precision.rst
@@ -2,59 +2,36 @@

 Chapter 8: Mixed Precision Training
 ===================================
-DGL is compatible with `PyTorch's automatic mixed precision package
+DGL is compatible with the `PyTorch Automatic Mixed Precision (AMP) package
 <https://pytorch.org/docs/stable/amp.html>`_
 for mixed precision training, thus saving both training time and GPU memory
-consumption. To enable this feature, users need to install PyTorch 1.6+ with python 3.7+ and
-build DGL from source file to support ``float16`` data type (this feature is
-still in its beta stage and we do not provide official pre-built pip wheels).
-
-Installation
------------
-First download DGL's source code from GitHub and build the shared library
-with flag ``USE_FP16=ON``.
-
-.. code:: bash
-
-   git clone --recurse-submodules https://github.com/dmlc/dgl.git
-   cd dgl
-   mkdir build
-   cd build
-   cmake -DUSE_CUDA=ON -DUSE_FP16=ON ..
-   make -j
-
-Then install the Python binding.
-
-.. code:: bash
-
-   cd ../python
-   python setup.py install
+consumption. This feature requires DGL 0.9+.

 Message-Passing with Half Precision
 -----------------------------------
-DGL with fp16 support allows message-passing on ``float16`` features for both
-UDF(User Defined Function)s and built-in functions (e.g. ``dgl.function.sum``,
+DGL allows message-passing on ``float16`` (fp16) features for both
+UDFs (User Defined Functions) and built-in functions (e.g., ``dgl.function.sum``,
 ``dgl.function.copy_u``).

-The following examples shows how to use DGL's message-passing API on half-precision
+The following example shows how to use DGL's message-passing APIs on half-precision
 features:

    >>> import torch
    >>> import dgl
    >>> import dgl.function as fn
-    >>> g = dgl.rand_graph(30, 100).to(0)  # Create a graph on GPU w/ 30 nodes and 100 edges.
-    >>> g.ndata['h'] = torch.rand(30, 16).to(0).half()  # Create fp16 node features.
-    >>> g.edata['w'] = torch.rand(100, 1).to(0).half()  # Create fp16 edge features.
+    >>> dev = torch.device('cuda')
+    >>> g = dgl.rand_graph(30, 100).to(dev)  # Create a graph on GPU w/ 30 nodes and 100 edges.
+    >>> g.ndata['h'] = torch.rand(30, 16).to(dev).half()  # Create fp16 node features.
+    >>> g.edata['w'] = torch.rand(100, 1).to(dev).half()  # Create fp16 edge features.
    >>> # Use DGL's built-in functions for message passing on fp16 features.
    >>> g.update_all(fn.u_mul_e('h', 'w', 'm'), fn.sum('m', 'x'))
-    >>> g.ndata['x'][0]
-    tensor([0.3391, 0.2208, 0.7163, 0.6655, 0.7031, 0.5854, 0.9404, 0.7720, 0.6562,
-            0.4028, 0.6943, 0.5908, 0.9307, 0.5962, 0.7827, 0.5034],
-           device='cuda:0', dtype=torch.float16)
+    >>> g.ndata['x'].dtype
+    torch.float16
    >>> g.apply_edges(fn.u_dot_v('h', 'x', 'hx'))
-    >>> g.edata['hx'][0]
-    tensor([5.4570], device='cuda:0', dtype=torch.float16)
-    >>> # Use UDF(User Defined Functions) for message passing on fp16 features.
+    >>> g.edata['hx'].dtype
+    torch.float16
+
+    >>> # Use UDFs for message passing on fp16 features.
    >>> def message(edges):
    ...     return {'m': edges.src['h'] * edges.data['w']}
    ...
@@ -65,14 +42,11 @@ features:
    ...     return {'hy': (edges.src['h'] * edges.dst['y']).sum(-1, keepdims=True)}
    ...
    >>> g.update_all(message, reduce)
-    >>> g.ndata['y'][0]
-    tensor([0.3394, 0.2209, 0.7168, 0.6655, 0.7026, 0.5854, 0.9404, 0.7720, 0.6562,
-            0.4028, 0.6943, 0.5908, 0.9307, 0.5967, 0.7827, 0.5039],
-           device='cuda:0', dtype=torch.float16)
+    >>> g.ndata['y'].dtype
+    torch.float16
    >>> g.apply_edges(dot)
-    >>> g.edata['hy'][0]
-    tensor([5.4609], device='cuda:0', dtype=torch.float16)
-
+    >>> g.edata['hy'].dtype
+    torch.float16

 End-to-End Mixed Precision Training
 -----------------------------------
@@ -80,33 +54,52 @@ DGL relies on PyTorch's AMP package for mixed precision training,
 and the user experience is exactly
 the same as `PyTorch's <https://pytorch.org/docs/stable/notes/amp_examples.html>`_.

-By wrapping the forward pass (including loss computation) of your GNN model with
-``torch.cuda.amp.autocast()``, PyTorch automatically selects the appropriate datatype
-for each op and tensor. Half precision tensors are memory efficient, most operators
-on half precision tensors are faster as they leverage GPU's tensorcores.
+When the forward pass is wrapped with ``torch.cuda.amp.autocast()``, PyTorch automatically
+selects the appropriate datatype for each op and tensor. Half precision tensors are memory
+efficient, and most operators on them are faster because they leverage GPU Tensor Cores.
+
+.. code::
+
+    import torch.nn.functional as F
+    from torch.cuda.amp import autocast
+
+    def forward(g, feat, label, mask, model, use_fp16):
+        with autocast(enabled=use_fp16):
+            logit = model(g, feat)
+            loss = F.cross_entropy(logit[mask], label[mask])
+            return loss
+
+Small gradients in ``float16`` format can underflow (flush to zero).
+PyTorch provides a ``GradScaler`` module to address this issue. It multiplies
+the loss by a factor and invokes backward pass on the scaled loss to prevent
+the underflow problem. It then unscales the computed gradients before the optimizer
+updates the parameters. The scale factor is determined automatically.
+
+.. code::
+
+    from torch.cuda.amp import GradScaler

-Small Gradients in ``float16`` format have underflow problems (flush to zero), and
-PyTorch provides a ``GradScaler`` module to address this issue. ``GradScaler`` multiplies
-loss by a factor and invokes backward pass on scaled loss, and unscales graidents before
-optimizers update the parameters, thus preventing the underflow problem.
-The scale factor is determined automatically.
+    scaler = GradScaler()
+
+    def backward(scaler, loss, optimizer):
+        scaler.scale(loss).backward()
+        scaler.step(optimizer)
+        scaler.update()

-Following is the training script of 3-layer GAT on Reddit dataset (w/ 114 million edges),
-note the difference in codes when ``use_fp16`` is activated/not activated:
+The following example trains a 3-layer GAT on the Reddit dataset (w/ 114 million edges).
+Pay attention to the differences in the code when ``use_fp16`` is activated or not.

 .. code::

    import torch
    import torch.nn as nn
-    import torch.nn.functional as F
-    from torch.cuda.amp import autocast, GradScaler
    import dgl
    from dgl.data import RedditDataset
    from dgl.nn import GATConv
+    from dgl.transforms import AddSelfLoop

    use_fp16 = True

-
    class GAT(nn.Module):
        def __init__(self,
                     in_feats,
@@ -129,48 +122,40 @@ note the difference in codes when ``use_fp16`` is activated/not activated:
            return h

    # Data loading
-    data = RedditDataset()
-    device = torch.device(0)
+    transform = AddSelfLoop()
+    data = RedditDataset(transform)
+    dev = torch.device('cuda')
+
    g = data[0]
-    g = dgl.add_self_loop(g)
-    g = g.int().to(device)
+    g = g.int().to(dev)
    train_mask = g.ndata['train_mask']
-    features = g.ndata['feat']
-    labels = g.ndata['label']
-    in_feats = features.shape[1]
+    feat = g.ndata['feat']
+    label = g.ndata['label']
+
+    in_feats = feat.shape[1]
    n_hidden = 256
    n_classes = data.num_classes
-    n_edges = g.number_of_edges()
    heads = [1, 1, 1]
    model = GAT(in_feats, n_hidden, n_classes, heads)
-    model = model.to(device)
+    model = model.to(dev)
+    model.train()

    # Create optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
-    # Create gradient scaler
-    scaler = GradScaler()

    for epoch in range(100):
-        model.train()
        optimizer.zero_grad()
-
-        # Wrap forward pass with autocast
-        with autocast(enabled=use_fp16):
-            logits = model(g, features)
-            loss = F.cross_entropy(logits[train_mask], labels[train_mask])
+        loss = forward(g, feat, label, train_mask, model, use_fp16)

        if use_fp16:
            # Backprop w/ gradient scaling
-            scaler.scale(loss).backward()
-            scaler.step(optimizer)
-            scaler.update()
+            backward(scaler, loss, optimizer)
        else:
            loss.backward()
            optimizer.step()

        print('Epoch {} | Loss {}'.format(epoch, loss.item()))

-
 On an NVIDIA V100 (16GB) machine, training this model without fp16 consumes
 15.2GB GPU memory; with fp16 turned on, the training consumes 12.8GB of
 GPU memory. The loss converges to similar values in both settings.
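
The underflow problem that the guide text in this diff attributes to small ``float16`` gradients can be reproduced without a GPU or PyTorch. The sketch below is only an illustration and is not part of the commit: it uses the standard-library ``struct`` module to round values to half precision, and the names ``to_fp16``, ``grad`` and ``scale`` as well as the fixed factor 2**16 are assumptions chosen for the example. ``GradScaler`` scales the loss rather than an individual gradient and picks its factor automatically, so this only mimics the arithmetic idea.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


grad = 1e-8       # hypothetical tiny gradient value
scale = 65536.0   # hypothetical fixed scale factor (2**16)

# Stored directly in fp16, the value is below the representable range
# and is flushed to zero.
unscaled_fp16 = to_fp16(grad)

# Scaled first, the value lands in fp16's representable range.
scaled_fp16 = to_fp16(grad * scale)

# Unscaling in higher precision recovers a value close to the original.
recovered = scaled_fp16 / scale

print(unscaled_fp16)
print(recovered)
```

The first printed value is zero, while the second stays within half-precision rounding error of the original gradient. This is why the unscaling step has to happen before the optimizer updates the parameters.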