Unverified Commit e9b624fe authored by Minjie Wang, committed by GitHub

Merge branch 'master' into dist_part

parents 8086d1ed a88e7f7e
# Accuracy across 10 runs: 0.7788 ± 0.002227 # Accuracy across 10 runs: 0.7788 ± 0.002227
version: 0.0.1 version: 0.0.2
pipeline_name: nodepred pipeline_name: nodepred
pipeline_mode: train pipeline_mode: train
device: cuda:0 device: cuda:0
......
# Accuracy across 10 runs: 0.7826 ± 0.004317 # Accuracy across 10 runs: 0.7826 ± 0.004317
version: 0.0.1 version: 0.0.2
pipeline_name: nodepred pipeline_name: nodepred
pipeline_mode: train pipeline_mode: train
device: cuda:0 device: cuda:0
......
# Accuracy across 10 runs: 0.7819 ± 0.003176 # Accuracy across 10 runs: 0.7819 ± 0.003176
version: 0.0.1 version: 0.0.2
pipeline_name: nodepred pipeline_name: nodepred
pipeline_mode: train pipeline_mode: train
device: cuda:0 device: cuda:0
......
...@@ -4,7 +4,7 @@ from setuptools import find_packages ...@@ -4,7 +4,7 @@ from setuptools import find_packages
from distutils.core import setup from distutils.core import setup
setup(name='dglgo', setup(name='dglgo',
version='0.0.1', version='0.0.2',
description='DGL', description='DGL',
author='DGL Team', author='DGL Team',
author_email='wmjlyjemaine@gmail.com', author_email='wmjlyjemaine@gmail.com',
......
version: 0.0.1 version: 0.0.2
pipeline_name: nodepred pipeline_name: nodepred
pipeline_mode: train pipeline_mode: train
device: cpu device: cpu
......
...@@ -132,6 +132,7 @@ under the ``dgl`` namespace. ...@@ -132,6 +132,7 @@ under the ``dgl`` namespace.
DGLGraph.add_self_loop DGLGraph.add_self_loop
DGLGraph.remove_self_loop DGLGraph.remove_self_loop
DGLGraph.to_simple DGLGraph.to_simple
DGLGraph.to_cugraph
DGLGraph.reorder_graph DGLGraph.reorder_graph
Adjacency and incidence matrix Adjacency and incidence matrix
......
...@@ -18,6 +18,7 @@ Operators for constructing :class:`DGLGraph` from raw data formats. ...@@ -18,6 +18,7 @@ Operators for constructing :class:`DGLGraph` from raw data formats.
graph graph
heterograph heterograph
from_cugraph
from_scipy from_scipy
from_networkx from_networkx
bipartite_from_scipy bipartite_from_scipy
...@@ -93,6 +94,7 @@ Operators for generating new graphs by manipulating the structure of the existin ...@@ -93,6 +94,7 @@ Operators for generating new graphs by manipulating the structure of the existin
to_bidirected to_bidirected
to_bidirected_stale to_bidirected_stale
to_block to_block
to_cugraph
to_double to_double
to_float to_float
to_half to_half
......
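The newly listed ``to_cugraph`` / ``from_cugraph`` converters bridge DGL and cuGraph. Below is a minimal round-trip sketch, assuming a CUDA build of DGL with cuGraph installed (the toy graph and device are placeholders, and only the graph structure is expected to carry over):

.. code:: python

    import torch
    import dgl

    # Build a small directed graph and move it to the GPU (cuGraph is GPU-only).
    g = dgl.graph((torch.tensor([0, 1, 2]), torch.tensor([1, 2, 0]))).to('cuda')

    cug = g.to_cugraph()        # DGLGraph -> cugraph.Graph
    g2 = dgl.from_cugraph(cug)  # cugraph.Graph -> DGLGraph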
...@@ -24,11 +24,13 @@ API that is exposed to python is only a few lines of codes: ...@@ -24,11 +24,13 @@ API that is exposed to python is only a few lines of codes:
#include <dgl/runtime/packed_func.h> #include <dgl/runtime/packed_func.h>
#include <dgl/runtime/registry.h> #include <dgl/runtime/registry.h>
using namespace dgl::runtime;
DGL_REGISTER_GLOBAL("calculator.MyAdd") DGL_REGISTER_GLOBAL("calculator.MyAdd")
.set_body([] (DGLArgs args, DGLRetValue* rv) { .set_body([] (DGLArgs args, DGLRetValue* rv) {
int a = args[0]; int a = args[0];
int b = args[1]; int b = args[1];
*rv = a * b; *rv = a + b;
}); });
Compile and build the library. On the python side, create a Compile and build the library. On the python side, create a
......
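For illustration only: once the library is rebuilt, the registered ``calculator.MyAdd`` function can be looked up from Python through DGL's TVM-style FFI. The helper used below (``get_global_func``) is an assumption based on DGL's FFI layout and is not part of this diff; verify the import path against your DGL version.

.. code:: python

    # Assumed helper from DGL's TVM-derived FFI; not shown in this diff.
    from dgl._ffi.function import get_global_func

    my_add = get_global_func("calculator.MyAdd")
    print(my_add(3, 4))  # expected: 7, matching the corrected `a + b` body above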
...@@ -60,7 +60,7 @@ Using CUDA UVA-based neighborhood sampling in DGL data loaders ...@@ -60,7 +60,7 @@ Using CUDA UVA-based neighborhood sampling in DGL data loaders
For the case where the graph is too large to fit onto the GPU memory, we introduce the For the case where the graph is too large to fit onto the GPU memory, we introduce the
CUDA UVA (Unified Virtual Addressing)-based sampling, in which GPUs perform the sampling CUDA UVA (Unified Virtual Addressing)-based sampling, in which GPUs perform the sampling
on the graph pinned on CPU memory via zero-copy access. on the graph pinned in CPU memory via zero-copy access.
You can enable UVA-based neighborhood sampling in DGL data loaders via: You can enable UVA-based neighborhood sampling in DGL data loaders via:
* Put the ``train_nid`` onto GPU. * Put the ``train_nid`` onto GPU.
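A minimal sketch of a UVA-enabled data loader, assuming DGL 0.8+ where ``dgl.dataloading.DataLoader`` accepts a ``use_uva`` flag (the random graph, fanouts, and batch size below are placeholders):

.. code:: python

    import torch
    import dgl

    g = dgl.rand_graph(10000, 200000)           # graph stays in (pinned) CPU memory
    train_nid = torch.arange(1000).to('cuda')   # seed nodes live on the GPU
    sampler = dgl.dataloading.NeighborSampler([10, 10])
    dataloader = dgl.dataloading.DataLoader(
        g, train_nid, sampler,
        device='cuda',      # sampled blocks are produced on the GPU
        use_uva=True,       # GPU samples via zero-copy access to pinned CPU memory
        batch_size=1024, shuffle=True, drop_last=False)
    for input_nodes, output_nodes, blocks in dataloader:
        pass  # training step goes here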
...@@ -99,6 +99,38 @@ especially for multi-GPU training. ...@@ -99,6 +99,38 @@ especially for multi-GPU training.
Refer to our `GraphSAGE example <https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/multi_gpu_node_classification.py>`_ for more details. Refer to our `GraphSAGE example <https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/multi_gpu_node_classification.py>`_ for more details.
UVA and GPU support for PinSAGESampler/RandomWalkNeighborSampler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PinSAGESampler and RandomWalkNeighborSampler support UVA and GPU sampling.
You can enable them via:
* Pin the graph (for UVA sampling) or put the graph onto GPU (for GPU sampling).
* Put the ``train_nid`` onto GPU.
.. code:: python
g = dgl.heterograph({
('item', 'bought-by', 'user'): ([0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 2, 3, 2, 3]),
('user', 'bought', 'item'): ([0, 1, 0, 1, 2, 3, 2, 3], [0, 0, 1, 1, 2, 2, 3, 3])})
# UVA setup
# g.create_formats_()
# g.pin_memory_()
# GPU setup
device = torch.device('cuda:0')
g = g.to(device)
sampler1 = dgl.sampling.PinSAGESampler(g, 'item', 'user', 4, 0.5, 3, 2)
sampler2 = dgl.sampling.RandomWalkNeighborSampler(g, 4, 0.5, 3, 2, ['bought-by', 'bought'])
train_nid = torch.tensor([0, 2], dtype=g.idtype, device=device)
sampler1(train_nid)
sampler2(train_nid)
Using GPU-based neighbor sampling with DGL functions Using GPU-based neighbor sampling with DGL functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...@@ -106,8 +138,7 @@ You can build your own GPU sampling pipelines with the following functions that ...@@ -106,8 +138,7 @@ You can build your own GPU sampling pipelines with the following functions that
operate on GPU: operate on GPU:
* :func:`dgl.sampling.sample_neighbors` * :func:`dgl.sampling.sample_neighbors`
* :func:`dgl.sampling.random_walk`
* Only has support for uniform sampling; non-uniform sampling can only run on CPU.
Subgraph extraction ops: Subgraph extraction ops:
......
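As a quick illustration of the ops listed above, ``dgl.sampling.sample_neighbors`` can be called directly on a GPU graph; the toy random graph below is an assumption for the sketch.

.. code:: python

    import torch
    import dgl

    g = dgl.rand_graph(100, 1000).to('cuda')   # graph resides on the GPU
    seeds = torch.arange(10, device='cuda')
    # Uniformly sample up to 5 in-neighbors per seed node; sampling runs on the GPU.
    sg = dgl.sampling.sample_neighbors(g, seeds, 5)
    print(sg.num_edges())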
...@@ -2,59 +2,36 @@ ...@@ -2,59 +2,36 @@
Chapter 8: Mixed Precision Training Chapter 8: Mixed Precision Training
=================================== ===================================
DGL is compatible with `PyTorch's automatic mixed precision package DGL is compatible with the `PyTorch Automatic Mixed Precision (AMP) package
<https://pytorch.org/docs/stable/amp.html>`_ <https://pytorch.org/docs/stable/amp.html>`_
for mixed precision training, thus saving both training time and GPU memory for mixed precision training, thus saving both training time and GPU memory
consumption. To enable this feature, users need to install PyTorch 1.6+ with python 3.7+ and consumption. This feature requires DGL 0.9+.
build DGL from source file to support ``float16`` data type (this feature is
still in its beta stage and we do not provide official pre-built pip wheels).
Installation
------------
First download DGL's source code from GitHub and build the shared library
with flag ``USE_FP16=ON``.
.. code:: bash
git clone --recurse-submodules https://github.com/dmlc/dgl.git
cd dgl
mkdir build
cd build
cmake -DUSE_CUDA=ON -DUSE_FP16=ON ..
make -j
Then install the Python binding.
.. code:: bash
cd ../python
python setup.py install
Message-Passing with Half Precision Message-Passing with Half Precision
----------------------------------- -----------------------------------
DGL with fp16 support allows message-passing on ``float16`` features for both DGL allows message-passing on ``float16 (fp16)`` features for both
UDF(User Defined Function)s and built-in functions (e.g. ``dgl.function.sum``, UDFs (User Defined Functions) and built-in functions (e.g., ``dgl.function.sum``,
``dgl.function.copy_u``). ``dgl.function.copy_u``).
The following examples shows how to use DGL's message-passing API on half-precision The following example shows how to use DGL's message-passing APIs on half-precision
features: features:
>>> import torch >>> import torch
>>> import dgl >>> import dgl
>>> import dgl.function as fn >>> import dgl.function as fn
>>> g = dgl.rand_graph(30, 100).to(0) # Create a graph on GPU w/ 30 nodes and 100 edges. >>> dev = torch.device('cuda')
>>> g.ndata['h'] = torch.rand(30, 16).to(0).half() # Create fp16 node features. >>> g = dgl.rand_graph(30, 100).to(dev) # Create a graph on GPU w/ 30 nodes and 100 edges.
>>> g.edata['w'] = torch.rand(100, 1).to(0).half() # Create fp16 edge features. >>> g.ndata['h'] = torch.rand(30, 16).to(dev).half() # Create fp16 node features.
>>> g.edata['w'] = torch.rand(100, 1).to(dev).half() # Create fp16 edge features.
>>> # Use DGL's built-in functions for message passing on fp16 features. >>> # Use DGL's built-in functions for message passing on fp16 features.
>>> g.update_all(fn.u_mul_e('h', 'w', 'm'), fn.sum('m', 'x')) >>> g.update_all(fn.u_mul_e('h', 'w', 'm'), fn.sum('m', 'x'))
>>> g.ndata['x'][0] >>> g.ndata['x'].dtype
tensor([0.3391, 0.2208, 0.7163, 0.6655, 0.7031, 0.5854, 0.9404, 0.7720, 0.6562, torch.float16
0.4028, 0.6943, 0.5908, 0.9307, 0.5962, 0.7827, 0.5034],
device='cuda:0', dtype=torch.float16)
>>> g.apply_edges(fn.u_dot_v('h', 'x', 'hx')) >>> g.apply_edges(fn.u_dot_v('h', 'x', 'hx'))
>>> g.edata['hx'][0] >>> g.edata['hx'].dtype
tensor([5.4570], device='cuda:0', dtype=torch.float16) torch.float16
>>> # Use UDF(User Defined Functions) for message passing on fp16 features.
>>> # Use UDFs for message passing on fp16 features.
>>> def message(edges): >>> def message(edges):
... return {'m': edges.src['h'] * edges.data['w']} ... return {'m': edges.src['h'] * edges.data['w']}
... ...
...@@ -65,14 +42,11 @@ features: ...@@ -65,14 +42,11 @@ features:
... return {'hy': (edges.src['h'] * edges.dst['y']).sum(-1, keepdims=True)} ... return {'hy': (edges.src['h'] * edges.dst['y']).sum(-1, keepdims=True)}
... ...
>>> g.update_all(message, reduce) >>> g.update_all(message, reduce)
>>> g.ndata['y'][0] >>> g.ndata['y'].dtype
tensor([0.3394, 0.2209, 0.7168, 0.6655, 0.7026, 0.5854, 0.9404, 0.7720, 0.6562, torch.float16
0.4028, 0.6943, 0.5908, 0.9307, 0.5967, 0.7827, 0.5039],
device='cuda:0', dtype=torch.float16)
>>> g.apply_edges(dot) >>> g.apply_edges(dot)
>>> g.edata['hy'][0] >>> g.edata['hy'].dtype
tensor([5.4609], device='cuda:0', dtype=torch.float16) torch.float16
End-to-End Mixed Precision Training End-to-End Mixed Precision Training
----------------------------------- -----------------------------------
...@@ -80,33 +54,52 @@ DGL relies on PyTorch's AMP package for mixed precision training, ...@@ -80,33 +54,52 @@ DGL relies on PyTorch's AMP package for mixed precision training,
and the user experience is exactly and the user experience is exactly
the same as `PyTorch's <https://pytorch.org/docs/stable/notes/amp_examples.html>`_. the same as `PyTorch's <https://pytorch.org/docs/stable/notes/amp_examples.html>`_.
By wrapping the forward pass (including loss computation) of your GNN model with By wrapping the forward pass with ``torch.cuda.amp.autocast()``, PyTorch automatically
``torch.cuda.amp.autocast()``, PyTorch automatically selects the appropriate datatype selects the appropriate datatype for each op and tensor. Half precision tensors are memory
for each op and tensor. Half precision tensors are memory efficient, most operators efficient, and most operators on half precision tensors are faster as they leverage GPU tensor cores.
on half precision tensors are faster as they leverage GPU's tensorcores.
.. code::
import torch.nn.functional as F
from torch.cuda.amp import autocast
def forward(g, feat, label, mask, model, use_fp16):
with autocast(enabled=use_fp16):
logit = model(g, feat)
loss = F.cross_entropy(logit[mask], label[mask])
return loss
Small Gradients in ``float16`` format have underflow problems (flush to zero).
PyTorch provides a ``GradScaler`` module to address this issue. It multiplies
the loss by a factor and invokes backward pass on the scaled loss to prevent
the underflow problem. It then unscales the computed gradients before the optimizer
updates the parameters. The scale factor is determined automatically.
.. code::
from torch.cuda.amp import GradScaler
Small Gradients in ``float16`` format have underflow problems (flush to zero), and scaler = GradScaler()
PyTorch provides a ``GradScaler`` module to address this issue. ``GradScaler`` multiplies
loss by a factor and invokes backward pass on scaled loss, and unscales gradients before def backward(scaler, loss, optimizer):
optimizers update the parameters, thus preventing the underflow problem. scaler.scale(loss).backward()
The scale factor is determined automatically. scaler.step(optimizer)
scaler.update()
Following is the training script of 3-layer GAT on Reddit dataset (w/ 114 million edges), The following example trains a 3-layer GAT on the Reddit dataset (w/ 114 million edges).
note the difference in codes when ``use_fp16`` is activated/not activated: Pay attention to the differences in the code when ``use_fp16`` is activated or not.
.. code:: .. code::
import torch import torch
import torch.nn as nn import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler
import dgl import dgl
from dgl.data import RedditDataset from dgl.data import RedditDataset
from dgl.nn import GATConv from dgl.nn import GATConv
from dgl.transforms import AddSelfLoop
use_fp16 = True use_fp16 = True
class GAT(nn.Module): class GAT(nn.Module):
def __init__(self, def __init__(self,
in_feats, in_feats,
...@@ -129,48 +122,40 @@ note the difference in codes when ``use_fp16`` is activated/not activated: ...@@ -129,48 +122,40 @@ note the difference in codes when ``use_fp16`` is activated/not activated:
return h return h
# Data loading # Data loading
data = RedditDataset() transform = AddSelfLoop()
device = torch.device(0) data = RedditDataset(transform)
dev = torch.device('cuda')
g = data[0] g = data[0]
g = dgl.add_self_loop(g) g = g.int().to(dev)
g = g.int().to(device)
train_mask = g.ndata['train_mask'] train_mask = g.ndata['train_mask']
features = g.ndata['feat'] feat = g.ndata['feat']
labels = g.ndata['label'] label = g.ndata['label']
in_feats = features.shape[1]
in_feats = feat.shape[1]
n_hidden = 256 n_hidden = 256
n_classes = data.num_classes n_classes = data.num_classes
n_edges = g.number_of_edges()
heads = [1, 1, 1] heads = [1, 1, 1]
model = GAT(in_feats, n_hidden, n_classes, heads) model = GAT(in_feats, n_hidden, n_classes, heads)
model = model.to(device) model = model.to(dev)
model.train()
# Create optimizer # Create optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4) optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
# Create gradient scaler
scaler = GradScaler()
for epoch in range(100): for epoch in range(100):
model.train()
optimizer.zero_grad() optimizer.zero_grad()
loss = forward(g, feat, label, train_mask, model, use_fp16)
# Wrap forward pass with autocast
with autocast(enabled=use_fp16):
logits = model(g, features)
loss = F.cross_entropy(logits[train_mask], labels[train_mask])
if use_fp16: if use_fp16:
# Backprop w/ gradient scaling # Backprop w/ gradient scaling
scaler.scale(loss).backward() backward(scaler, loss, optimizer)
scaler.step(optimizer)
scaler.update()
else: else:
loss.backward() loss.backward()
optimizer.step() optimizer.step()
print('Epoch {} | Loss {}'.format(epoch, loss.item())) print('Epoch {} | Loss {}'.format(epoch, loss.item()))
On an NVIDIA V100 (16GB) machine, training this model without fp16 consumes On an NVIDIA V100 (16GB) machine, training this model without fp16 consumes
15.2GB GPU memory; with fp16 turned on, the training consumes 12.8GB 15.2GB GPU memory; with fp16 turned on, the training consumes 12.8GB
GPU memory, and the loss converges to similar values in both settings. GPU memory, and the loss converges to similar values in both settings.
......
...@@ -249,7 +249,7 @@ To quickly locate the examples of your interest, search for the tagged keywords ...@@ -249,7 +249,7 @@ To quickly locate the examples of your interest, search for the tagged keywords
- Tags: matrix completion, recommender system, link prediction, bipartite graphs - Tags: matrix completion, recommender system, link prediction, bipartite graphs
- <a name="graphsage"></a> Hamilton et al. Inductive Representation Learning on Large Graphs. [Paper link](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf). - <a name="graphsage"></a> Hamilton et al. Inductive Representation Learning on Large Graphs. [Paper link](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
- Example code: [PyTorch](../examples/pytorch/graphsage), [PyTorch on ogbn-products](../examples/pytorch/ogb/ogbn-products), [PyTorch on ogbl-ppa](https://github.com/awslabs/dgl-lifesci/tree/master/examples/link_prediction/ogbl-ppa), [MXNet](../examples/mxnet/graphsage) - Example code: [PyTorch](../examples/pytorch/graphsage), [PyTorch on ogbn-products](../examples/pytorch/ogb/ogbn-products), [PyTorch on ogbn-mag](../examples/pytorch/ogb/ogbn-mag), [PyTorch on ogbl-ppa](https://github.com/awslabs/dgl-lifesci/tree/master/examples/link_prediction/ogbl-ppa), [MXNet](../examples/mxnet/graphsage)
- Tags: node classification, sampling, unsupervised learning, link prediction, OGB - Tags: node classification, sampling, unsupervised learning, link prediction, OGB
- <a name="metapath2vec"></a> Dong et al. metapath2vec: Scalable Representation Learning for Heterogeneous Networks. [Paper link](https://dl.acm.org/doi/10.1145/3097983.3098036). - <a name="metapath2vec"></a> Dong et al. metapath2vec: Scalable Representation Learning for Heterogeneous Networks. [Paper link](https://dl.acm.org/doi/10.1145/3097983.3098036).
......
...@@ -64,7 +64,7 @@ class ARMAConv(nn.Module): ...@@ -64,7 +64,7 @@ class ARMAConv(nn.Module):
# assume that the graphs are undirected and graph.in_degrees() is the same as graph.out_degrees() # assume that the graphs are undirected and graph.in_degrees() is the same as graph.out_degrees()
degs = g.in_degrees().float().clamp(min=1) degs = g.in_degrees().float().clamp(min=1)
norm = torch.pow(degs, -0.5).to(feats.device).unsqueeze(1) norm = torch.pow(degs, -0.5).to(feats.device).unsqueeze(1)
output = None output = []
for k in range(self.K): for k in range(self.K):
feats = init_feats feats = init_feats
...@@ -88,13 +88,9 @@ class ARMAConv(nn.Module): ...@@ -88,13 +88,9 @@ class ARMAConv(nn.Module):
if self.activation is not None: if self.activation is not None:
feats = self.activation(feats) feats = self.activation(feats)
output.append(feats)
if output is None: return torch.stack(output).mean(dim=0)
output = feats
else:
output += feats
return output / self.K
class ARMA4NC(nn.Module): class ARMA4NC(nn.Module):
def __init__(self, def __init__(self,
......
...@@ -92,9 +92,10 @@ def main(args): ...@@ -92,9 +92,10 @@ def main(args):
graph.ndata['nd'] = th.tanh(model.layers[i].MLP(layers_feat[i])) graph.ndata['nd'] = th.tanh(model.layers[i].MLP(layers_feat[i]))
for etype in graph.canonical_etypes: for etype in graph.canonical_etypes:
graph.apply_edges(_l1_dist, etype=etype) graph.apply_edges(_l1_dist, etype=etype)
dist[etype] = graph.edges[etype].data['ed'] dist[etype] = graph.edges[etype].data.pop('ed').detach().cpu()
dists.append(dist) dists.append(dist)
p.append(model.layers[i].p) p.append(model.layers[i].p)
graph.ndata.pop('nd')
sampler = CARESampler(p, dists, args.num_layers) sampler = CARESampler(p, dists, args.num_layers)
# train # train
...@@ -103,14 +104,9 @@ def main(args): ...@@ -103,14 +104,9 @@ def main(args):
tr_recall = 0 tr_recall = 0
tr_auc = 0 tr_auc = 0
tr_blk = 0 tr_blk = 0
train_dataloader = dgl.dataloading.DataLoader(graph, train_dataloader = dgl.dataloading.DataLoader(
train_idx, graph, train_idx, sampler, batch_size=args.batch_size,
sampler, shuffle=True, drop_last=False, num_workers=args.num_workers)
batch_size=args.batch_size,
shuffle=True,
drop_last=False,
num_workers=args.num_workers
)
for input_nodes, output_nodes, blocks in train_dataloader: for input_nodes, output_nodes, blocks in train_dataloader:
blocks = [b.to(device) for b in blocks] blocks = [b.to(device) for b in blocks]
...@@ -135,14 +131,9 @@ def main(args): ...@@ -135,14 +131,9 @@ def main(args):
# validation # validation
model.eval() model.eval()
val_dataloader = dgl.dataloading.DataLoader(graph, val_dataloader = dgl.dataloading.DataLoader(
val_idx, graph, val_idx, sampler, batch_size=args.batch_size,
sampler, shuffle=True, drop_last=False, num_workers=args.num_workers)
batch_size=args.batch_size,
shuffle=True,
drop_last=False,
num_workers=args.num_workers
)
val_recall, val_auc, val_loss = evaluate(model, loss_fn, val_dataloader, device) val_recall, val_auc, val_loss = evaluate(model, loss_fn, val_dataloader, device)
...@@ -159,14 +150,9 @@ def main(args): ...@@ -159,14 +150,9 @@ def main(args):
model.eval() model.eval()
if args.early_stop: if args.early_stop:
model.load_state_dict(th.load('es_checkpoint.pt')) model.load_state_dict(th.load('es_checkpoint.pt'))
test_dataloader = dgl.dataloading.DataLoader(graph, test_dataloader = dgl.dataloading.DataLoader(
test_idx, graph, test_idx, sampler, batch_size=args.batch_size,
sampler, shuffle=True, drop_last=False, num_workers=args.num_workers)
batch_size=args.batch_size,
shuffle=True,
drop_last=False,
num_workers=args.num_workers
)
test_recall, test_auc, test_loss = evaluate(model, loss_fn, test_dataloader, device) test_recall, test_auc, test_loss = evaluate(model, loss_fn, test_dataloader, device)
......
...@@ -13,9 +13,10 @@ def _l1_dist(edges): ...@@ -13,9 +13,10 @@ def _l1_dist(edges):
class CARESampler(dgl.dataloading.BlockSampler): class CARESampler(dgl.dataloading.BlockSampler):
def __init__(self, p, dists, num_layers): def __init__(self, p, dists, num_layers):
super().__init__(num_layers) super().__init__()
self.p = p self.p = p
self.dists = dists self.dists = dists
self.num_layers = num_layers
def sample_frontier(self, block_id, g, seed_nodes, *args, **kwargs): def sample_frontier(self, block_id, g, seed_nodes, *args, **kwargs):
with g.local_scope(): with g.local_scope():
...@@ -28,7 +29,7 @@ class CARESampler(dgl.dataloading.BlockSampler): ...@@ -28,7 +29,7 @@ class CARESampler(dgl.dataloading.BlockSampler):
num_neigh = th.ceil(g.in_degrees(node, etype=etype) * self.p[block_id][etype]).int().item() num_neigh = th.ceil(g.in_degrees(node, etype=etype) * self.p[block_id][etype]).int().item()
neigh_dist = self.dists[block_id][etype][edges] neigh_dist = self.dists[block_id][etype][edges]
if neigh_dist.shape[0] > num_neigh: if neigh_dist.shape[0] > num_neigh:
neigh_index = np.argpartition(neigh_dist.cpu().detach(), num_neigh)[:num_neigh] neigh_index = np.argpartition(neigh_dist, num_neigh)[:num_neigh]
else: else:
neigh_index = np.arange(num_neigh) neigh_index = np.arange(num_neigh)
edge_mask[edges[neigh_index]] = 1 edge_mask[edges[neigh_index]] = 1
...@@ -36,6 +37,19 @@ class CARESampler(dgl.dataloading.BlockSampler): ...@@ -36,6 +37,19 @@ class CARESampler(dgl.dataloading.BlockSampler):
return dgl.edge_subgraph(g, new_edges_masks, relabel_nodes=False) return dgl.edge_subgraph(g, new_edges_masks, relabel_nodes=False)
def sample_blocks(self, g, seed_nodes, exclude_eids=None):
output_nodes = seed_nodes
blocks = []
for block_id in reversed(range(self.num_layers)):
frontier = self.sample_frontier(block_id, g, seed_nodes)
eid = frontier.edata[dgl.EID]
block = dgl.to_block(frontier, seed_nodes)
block.edata[dgl.EID] = eid
seed_nodes = block.srcdata[dgl.NID]
blocks.insert(0, block)
return seed_nodes, output_nodes, blocks
def __len__(self): def __len__(self):
return self.num_layers return self.num_layers
......
...@@ -17,7 +17,9 @@ class BesselBasisLayer(nn.Module): ...@@ -17,7 +17,9 @@ class BesselBasisLayer(nn.Module):
self.reset_params() self.reset_params()
def reset_params(self): def reset_params(self):
with torch.no_grad():
torch.arange(1, self.frequencies.numel() + 1, out=self.frequencies).mul_(np.pi) torch.arange(1, self.frequencies.numel() + 1, out=self.frequencies).mul_(np.pi)
self.frequencies.requires_grad_()
def forward(self, g): def forward(self, g):
d_scaled = g.edata['d'] / self.cutoff d_scaled = g.edata['d'] / self.cutoff
......
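The patch above wraps the in-place ``out=`` initialization of the frequency parameter in ``torch.no_grad()`` and then re-enables gradients. A standalone sketch of the same pattern, using a hypothetical ``freq`` parameter:

.. code:: python

    import numpy as np
    import torch
    import torch.nn as nn

    freq = nn.Parameter(torch.empty(8))
    with torch.no_grad():
        # In-place writes into a leaf parameter are only legal inside no_grad().
        torch.arange(1, freq.numel() + 1, out=freq).mul_(np.pi)  # freq_k = k * pi
    freq.requires_grad_()  # keep the parameter trainable afterwards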
# DGL & Pytorch implementation of Enhanced Graph Embedding with Side information (EGES) # DGL & Pytorch implementation of Enhanced Graph Embedding with Side information (EGES)
Paper link: https://arxiv.org/pdf/1803.02349.pdf
## Version Reference code repo: (https://github.com/wangzhegeek/EGES.git)
dgl==0.6.1, torch==1.9.0
## Paper
Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba:
https://arxiv.org/pdf/1803.02349.pdf
https://arxiv.org/abs/1803.02349
## How to run ## How to run
Create folder named `data`. Download two csv files from [here](https://github.com/Wang-Yu-Qing/dgl_data/tree/master/eges_data) into the `data` folder.
Run command: `python main.py` with default configuration, and the following message will show up: - Create a folder named `data`.
`mkdir data`
- Download csv data
`wget https://raw.githubusercontent.com/Wang-Yu-Qing/dgl_data/master/eges_data/action_head.csv -P data/`
`wget https://raw.githubusercontent.com/Wang-Yu-Qing/dgl_data/master/eges_data/jdata_product.csv -P data/`
- Run with the following command (with default configuration)
`python main.py`
## Result
``` ```
Using backend: pytorch
Num skus: 33344, num brands: 3662, num shops: 4785, num cates: 79
Epoch 00000 | Step 00000 | Step Loss 0.9117 | Epoch Avg Loss: 0.9117
Epoch 00000 | Step 00100 | Step Loss 0.8736 | Epoch Avg Loss: 0.8801
Epoch 00000 | Step 00200 | Step Loss 0.8975 | Epoch Avg Loss: 0.8785
Evaluate link prediction AUC: 0.6864
Epoch 00001 | Step 00000 | Step Loss 0.8695 | Epoch Avg Loss: 0.8695
Epoch 00001 | Step 00100 | Step Loss 0.8290 | Epoch Avg Loss: 0.8643
Epoch 00001 | Step 00200 | Step Loss 0.8012 | Epoch Avg Loss: 0.8604
Evaluate link prediction AUC: 0.6875
...
Epoch 00029 | Step 00000 | Step Loss 0.7095 | Epoch Avg Loss: 0.7095
Epoch 00029 | Step 00100 | Step Loss 0.7248 | Epoch Avg Loss: 0.7139
Epoch 00029 | Step 00200 | Step Loss 0.7123 | Epoch Avg Loss: 0.7134
Evaluate link prediction AUC: 0.7084 Evaluate link prediction AUC: 0.7084
``` ```
The link-prediction AUC on the test graph is computed after each epoch.
## Reference
https://github.com/nonva/eges
https://github.com/wangzhegeek/EGES.git
...@@ -2,54 +2,29 @@ Graph Attention Networks (GAT) ...@@ -2,54 +2,29 @@ Graph Attention Networks (GAT)
============ ============
- Paper link: [https://arxiv.org/abs/1710.10903](https://arxiv.org/abs/1710.10903) - Paper link: [https://arxiv.org/abs/1710.10903](https://arxiv.org/abs/1710.10903)
- Author's code repo (in Tensorflow): - Author's code repo (tensorflow implementation):
[https://github.com/PetarV-/GAT](https://github.com/PetarV-/GAT). [https://github.com/PetarV-/GAT](https://github.com/PetarV-/GAT).
- Popular pytorch implementation: - Popular pytorch implementation:
[https://github.com/Diego999/pyGAT](https://github.com/Diego999/pyGAT). [https://github.com/Diego999/pyGAT](https://github.com/Diego999/pyGAT).
Dependencies
------------
- torch v1.0: the autograd support for sparse mm is only available in v1.0.
- requests
- sklearn
```bash
pip install torch==1.0.0 requests
```
How to run How to run
---------- -------
Run with following:
```bash
python3 train.py --dataset=cora --gpu=0
```
Run with the following for multiclass node classification (available datasets: "cora", "citeseer", "pubmed")
```bash ```bash
python3 train.py --dataset=citeseer --gpu=0 --early-stop python3 train.py --dataset cora
``` ```
Run with the following for multilabel classification with PPI dataset
```bash ```bash
python3 train.py --dataset=pubmed --gpu=0 --num-out-heads=8 --weight-decay=0.001 --early-stop python3 train_ppi.py
``` ```
```bash > **_NOTE:_** Users may occasionally run into a low-accuracy issue (e.g., test accuracy < 0.8) due to overfitting. This can be resolved by adding early stopping or reducing the maximum number of training epochs.
python3 train_ppi.py --gpu=0
```
Results Summary
------- -------
* cora: ~0.821
| Dataset | Test Accuracy | Time(s) | Baseline#1 times(s) | Baseline#2 times(s) | * citeseer: ~0.710
| -------- | ------------- | ------- | ------------------- | ------------------- | * pubmed: ~0.780
| Cora | 84.02(0.40) | 0.0113 | 0.0982 (**8.7x**) | 0.0424 (**3.8x**) | * ppi: ~0.9744
| Citeseer | 70.91(0.79) | 0.0111 | n/a | n/a |
| Pubmed | 78.57(0.75) | 0.0115 | n/a | n/a |
| PPI | 0.9836 | n/a | n/a | n/a |
* All the accuracy numbers are obtained after 300 epochs.
* The time measures how long it takes to train one epoch.
* All time is measured on EC2 p3.2xlarge instance w/ V100 GPU.
* Baseline#1: [https://github.com/PetarV-/GAT](https://github.com/PetarV-/GAT).
* Baseline#2: [https://github.com/Diego999/pyGAT](https://github.com/Diego999/pyGAT).
"""
Graph Attention Networks in DGL using SPMV optimization.
References
----------
Paper: https://arxiv.org/abs/1710.10903
Author's code: https://github.com/PetarV-/GAT
Pytorch implementation: https://github.com/Diego999/pyGAT
"""
import torch
import torch.nn as nn
import dgl.function as fn
from dgl.nn import GATConv
class GAT(nn.Module):
def __init__(self,
g,
num_layers,
in_dim,
num_hidden,
num_classes,
heads,
activation,
feat_drop,
attn_drop,
negative_slope,
residual):
super(GAT, self).__init__()
self.g = g
self.num_layers = num_layers
self.gat_layers = nn.ModuleList()
self.activation = activation
if num_layers > 1:
# input projection (no residual)
self.gat_layers.append(GATConv(
in_dim, num_hidden, heads[0],
feat_drop, attn_drop, negative_slope, False, self.activation))
# hidden layers
for l in range(1, num_layers-1):
# due to multi-head, the in_dim = num_hidden * num_heads
self.gat_layers.append(GATConv(
num_hidden * heads[l-1], num_hidden, heads[l],
feat_drop, attn_drop, negative_slope, residual, self.activation))
# output projection
self.gat_layers.append(GATConv(
num_hidden * heads[-2], num_classes, heads[-1],
feat_drop, attn_drop, negative_slope, residual, None))
else:
self.gat_layers.append(GATConv(
in_dim, num_classes, heads[0],
feat_drop, attn_drop, negative_slope, residual, None))
def forward(self, inputs):
h = inputs
for l in range(self.num_layers):
h = self.gat_layers[l](self.g, h)
h = h.flatten(1) if l != self.num_layers - 1 else h.mean(1)
return h
"""
Graph Attention Networks in DGL using SPMV optimization.
Multiple heads are also batched together for faster training.
References
----------
Paper: https://arxiv.org/abs/1710.10903
Author's code: https://github.com/PetarV-/GAT
Pytorch implementation: https://github.com/Diego999/pyGAT
"""
import argparse
import numpy as np
import networkx as nx
import time
import torch import torch
import torch.nn as nn
import torch.nn.functional as F import torch.nn.functional as F
import dgl import dgl.nn as dglnn
from dgl.data import register_data_args
from dgl.data import CoraGraphDataset, CiteseerGraphDataset, PubmedGraphDataset from dgl.data import CoraGraphDataset, CiteseerGraphDataset, PubmedGraphDataset
from dgl import AddSelfLoop
import argparse
from gat import GAT class GAT(nn.Module):
from utils import EarlyStopping def __init__(self,in_size, hid_size, out_size, heads):
super().__init__()
self.gat_layers = nn.ModuleList()
def accuracy(logits, labels): # two-layer GAT
self.gat_layers.append(dglnn.GATConv(in_size, hid_size, heads[0], feat_drop=0.6, attn_drop=0.6, activation=F.elu))
self.gat_layers.append(dglnn.GATConv(hid_size*heads[0], out_size, heads[1], feat_drop=0.6, attn_drop=0.6, activation=None))
def forward(self, g, inputs):
h = inputs
for i, layer in enumerate(self.gat_layers):
h = layer(g, h)
if i == 1: # last layer
h = h.mean(1)
else: # other layer(s)
h = h.flatten(1)
return h
def evaluate(g, features, labels, mask, model):
model.eval()
with torch.no_grad():
logits = model(g, features)
logits = logits[mask]
labels = labels[mask]
_, indices = torch.max(logits, dim=1) _, indices = torch.max(logits, dim=1)
correct = torch.sum(indices == labels) correct = torch.sum(indices == labels)
return correct.item() * 1.0 / len(labels) return correct.item() * 1.0 / len(labels)
def train(g, features, labels, masks, model):
# define train/val samples, loss function and optimizer
train_mask = masks[0]
val_mask = masks[1]
loss_fcn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3, weight_decay=5e-4)
def evaluate(model, features, labels, mask): #training loop
model.eval() for epoch in range(200):
with torch.no_grad(): model.train()
logits = model(features) logits = model(g, features)
logits = logits[mask] loss = loss_fcn(logits[train_mask], labels[train_mask])
labels = labels[mask] optimizer.zero_grad()
return accuracy(logits, labels) loss.backward()
optimizer.step()
acc = evaluate(g, features, labels, val_mask, model)
print("Epoch {:05d} | Loss {:.4f} | Accuracy {:.4f} "
. format(epoch, loss.item(), acc))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("--dataset", type=str, default="cora",
help="Dataset name ('cora', 'citeseer', 'pubmed').")
args = parser.parse_args()
print(f'Training with DGL built-in GATConv module.')
def main(args):
# load and preprocess dataset # load and preprocess dataset
transform = AddSelfLoop() # by default, it will first remove self-loops to prevent duplication
if args.dataset == 'cora': if args.dataset == 'cora':
data = CoraGraphDataset() data = CoraGraphDataset(transform=transform)
elif args.dataset == 'citeseer': elif args.dataset == 'citeseer':
data = CiteseerGraphDataset() data = CiteseerGraphDataset(transform=transform)
elif args.dataset == 'pubmed': elif args.dataset == 'pubmed':
data = PubmedGraphDataset() data = PubmedGraphDataset(transform=transform)
else: else:
raise ValueError('Unknown dataset: {}'.format(args.dataset)) raise ValueError('Unknown dataset: {}'.format(args.dataset))
g = data[0] g = data[0]
if args.gpu < 0: device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
cuda = False g = g.int().to(device)
else:
cuda = True
g = g.int().to(args.gpu)
features = g.ndata['feat'] features = g.ndata['feat']
labels = g.ndata['label'] labels = g.ndata['label']
train_mask = g.ndata['train_mask'] masks = g.ndata['train_mask'], g.ndata['val_mask'], g.ndata['test_mask']
val_mask = g.ndata['val_mask']
test_mask = g.ndata['test_mask']
num_feats = features.shape[1]
n_classes = data.num_labels
n_edges = g.number_of_edges()
print("""----Data statistics------'
#Edges %d
#Classes %d
#Train samples %d
#Val samples %d
#Test samples %d""" %
(n_edges, n_classes,
train_mask.int().sum().item(),
val_mask.int().sum().item(),
test_mask.int().sum().item()))
# add self loop
g = dgl.remove_self_loop(g)
g = dgl.add_self_loop(g)
n_edges = g.number_of_edges()
# create model
heads = ([args.num_heads] * (args.num_layers-1)) + [args.num_out_heads]
model = GAT(g,
args.num_layers,
num_feats,
args.num_hidden,
n_classes,
heads,
F.elu,
args.in_drop,
args.attn_drop,
args.negative_slope,
args.residual)
print(model)
if args.early_stop:
stopper = EarlyStopping(patience=100)
if cuda:
model.cuda()
loss_fcn = torch.nn.CrossEntropyLoss()
# use optimizer
optimizer = torch.optim.Adam(
model.parameters(), lr=args.lr, weight_decay=args.weight_decay)
# initialize graph
dur = []
for epoch in range(args.epochs):
model.train()
if epoch >= 3:
if cuda:
torch.cuda.synchronize()
t0 = time.time()
# forward
logits = model(features)
loss = loss_fcn(logits[train_mask], labels[train_mask])
optimizer.zero_grad()
loss.backward()
optimizer.step()
if epoch >= 3:
if cuda:
torch.cuda.synchronize()
dur.append(time.time() - t0)
train_acc = accuracy(logits[train_mask], labels[train_mask])
if args.fastmode:
val_acc = accuracy(logits[val_mask], labels[val_mask])
else:
val_acc = evaluate(model, features, labels, val_mask)
if args.early_stop:
if stopper.step(val_acc, model):
break
print("Epoch {:05d} | Time(s) {:.4f} | Loss {:.4f} | TrainAcc {:.4f} |" # create GAT model
" ValAcc {:.4f} | ETputs(KTEPS) {:.2f}". in_size = features.shape[1]
format(epoch, np.mean(dur), loss.item(), train_acc, out_size = data.num_classes
val_acc, n_edges / np.mean(dur) / 1000)) model = GAT(in_size, 8, out_size, heads=[8,1]).to(device)
print() # model training
if args.early_stop: print('Training...')
model.load_state_dict(torch.load('es_checkpoint.pt')) train(g, features, labels, masks, model)
acc = evaluate(model, features, labels, test_mask)
print("Test Accuracy {:.4f}".format(acc))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GAT')
register_data_args(parser)
parser.add_argument("--gpu", type=int, default=-1,
help="which GPU to use. Set -1 to use CPU.")
parser.add_argument("--epochs", type=int, default=200,
help="number of training epochs")
parser.add_argument("--num-heads", type=int, default=8,
help="number of hidden attention heads")
parser.add_argument("--num-out-heads", type=int, default=1,
help="number of output attention heads")
parser.add_argument("--num-layers", type=int, default=2,
help="number of hidden layers")
parser.add_argument("--num-hidden", type=int, default=8,
help="number of hidden units")
parser.add_argument("--residual", action="store_true", default=False,
help="use residual connection")
parser.add_argument("--in-drop", type=float, default=.6,
help="input feature dropout")
parser.add_argument("--attn-drop", type=float, default=.6,
help="attention dropout")
parser.add_argument("--lr", type=float, default=0.005,
help="learning rate")
parser.add_argument('--weight-decay', type=float, default=5e-4,
help="weight decay")
parser.add_argument('--negative-slope', type=float, default=0.2,
help="the negative slope of leaky relu")
parser.add_argument('--early-stop', action='store_true', default=False,
help="indicates whether to use early stop or not")
parser.add_argument('--fastmode', action="store_true", default=False,
help="skip re-evaluate the validation set")
args = parser.parse_args()
print(args)
main(args) # test the model
print('Testing...')
acc = evaluate(g, features, labels, masks[2], model)
print("Test accuracy {:.4f}".format(acc))
"""
Graph Attention Networks (PPI Dataset) in DGL using SPMV optimization.
Multiple heads are also batched together for faster training.
Compared with the original paper, this code implements
early stopping.
References
----------
Paper: https://arxiv.org/abs/1710.10903
Author's code: https://github.com/PetarV-/GAT
Pytorch implementation: https://github.com/Diego999/pyGAT
"""
import numpy as np import numpy as np
import torch import torch
import dgl import torch.nn as nn
import torch.nn.functional as F import torch.nn.functional as F
import argparse import dgl.nn as dglnn
from sklearn.metrics import f1_score
from gat import GAT
from dgl.data.ppi import PPIDataset from dgl.data.ppi import PPIDataset
from dgl.dataloading import GraphDataLoader from dgl.dataloading import GraphDataLoader
from sklearn.metrics import f1_score
def evaluate(feats, model, subgraph, labels, loss_fcn): class GAT(nn.Module):
with torch.no_grad(): def __init__(self, in_size, hid_size, out_size, heads):
super().__init__()
self.gat_layers = nn.ModuleList()
# three-layer GAT
self.gat_layers.append(dglnn.GATConv(in_size, hid_size, heads[0], activation=F.elu))
self.gat_layers.append(dglnn.GATConv(hid_size*heads[0], hid_size, heads[1], residual=True, activation=F.elu))
self.gat_layers.append(dglnn.GATConv(hid_size*heads[1], out_size, heads[2], residual=True, activation=None))
def forward(self, g, inputs):
h = inputs
for i, layer in enumerate(self.gat_layers):
h = layer(g, h)
if i == 2: # last layer
h = h.mean(1)
else: # other layer(s)
h = h.flatten(1)
return h
def evaluate(g, features, labels, model):
model.eval() model.eval()
model.g = subgraph with torch.no_grad():
for layer in model.gat_layers: output = model(g, features)
layer.g = subgraph pred = np.where(output.data.cpu().numpy() >= 0, 1, 0)
output = model(feats.float()) score = f1_score(labels.data.cpu().numpy(), pred, average='micro')
loss_data = loss_fcn(output, labels.float()) return score
predict = np.where(output.data.cpu().numpy() >= 0., 1, 0)
score = f1_score(labels.data.cpu().numpy(),
predict, average='micro')
return score, loss_data.item()
def main(args): def evaluate_in_batches(dataloader, device, model):
if args.gpu<0: total_score = 0
device = torch.device("cpu") for batch_id, batched_graph in enumerate(dataloader):
else: batched_graph = batched_graph.to(device)
device = torch.device("cuda:" + str(args.gpu)) features = batched_graph.ndata['feat']
labels = batched_graph.ndata['label']
score = evaluate(batched_graph, features, labels, model)
total_score += score
return total_score / (batch_id + 1) # return average score
batch_size = args.batch_size def train(train_dataloader, val_dataloader, device, model):
cur_step = 0 # define loss function and optimizer
patience = args.patience loss_fcn = nn.BCEWithLogitsLoss()
best_score = -1 optimizer = torch.optim.Adam(model.parameters(), lr=5e-3, weight_decay=0)
best_loss = 10000
# define loss function # training loop
loss_fcn = torch.nn.BCEWithLogitsLoss() for epoch in range(400):
# create the dataset
train_dataset = PPIDataset(mode='train')
valid_dataset = PPIDataset(mode='valid')
test_dataset = PPIDataset(mode='test')
train_dataloader = GraphDataLoader(train_dataset, batch_size=batch_size)
valid_dataloader = GraphDataLoader(valid_dataset, batch_size=batch_size)
test_dataloader = GraphDataLoader(test_dataset, batch_size=batch_size)
g = train_dataset[0]
n_classes = train_dataset.num_labels
num_feats = g.ndata['feat'].shape[1]
g = g.int().to(device)
heads = ([args.num_heads] * (args.num_layers-1)) + [args.num_out_heads]
# define the model
model = GAT(g,
args.num_layers,
num_feats,
args.num_hidden,
n_classes,
heads,
F.elu,
args.in_drop,
args.attn_drop,
args.alpha,
args.residual)
# define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)
model = model.to(device)
for epoch in range(args.epochs):
model.train() model.train()
loss_list = [] logits = []
for batch, subgraph in enumerate(train_dataloader): total_loss = 0
subgraph = subgraph.to(device) # mini-batch loop
model.g = subgraph for batch_id, batched_graph in enumerate(train_dataloader):
for layer in model.gat_layers: batched_graph = batched_graph.to(device)
layer.g = subgraph features = batched_graph.ndata['feat'].float()
logits = model(subgraph.ndata['feat'].float()) labels = batched_graph.ndata['label'].float()
loss = loss_fcn(logits, subgraph.ndata['label']) logits = model(batched_graph, features)
loss = loss_fcn(logits, labels)
optimizer.zero_grad() optimizer.zero_grad()
loss.backward() loss.backward()
optimizer.step() optimizer.step()
loss_list.append(loss.item()) total_loss += loss.item()
loss_data = np.array(loss_list).mean() print("Epoch {:05d} | Loss {:.4f} |". format(epoch, total_loss / (batch_id + 1) ))
print("Epoch {:05d} | Loss: {:.4f}".format(epoch + 1, loss_data))
if epoch % 5 == 0: if (epoch + 1) % 5 == 0:
score_list = [] avg_score = evaluate_in_batches(val_dataloader, device, model) # evaluate F1-score instead of loss
val_loss_list = [] print(" Acc. (F1-score) {:.4f} ". format(avg_score))
for batch, subgraph in enumerate(valid_dataloader):
subgraph = subgraph.to(device)
score, val_loss = evaluate(subgraph.ndata['feat'], model, subgraph, subgraph.ndata['label'], loss_fcn)
score_list.append(score)
val_loss_list.append(val_loss)
mean_score = np.array(score_list).mean()
mean_val_loss = np.array(val_loss_list).mean()
print("Val F1-Score: {:.4f} ".format(mean_score))
# early stop
if mean_score > best_score or best_loss > mean_val_loss:
if mean_score > best_score and best_loss > mean_val_loss:
val_early_loss = mean_val_loss
val_early_score = mean_score
best_score = np.max((mean_score, best_score))
best_loss = np.min((best_loss, mean_val_loss))
cur_step = 0
else:
cur_step += 1
if cur_step == patience:
break
test_score_list = []
for batch, subgraph in enumerate(test_dataloader):
subgraph = subgraph.to(device)
score, test_loss = evaluate(subgraph.ndata['feat'], model, subgraph, subgraph.ndata['label'], loss_fcn)
test_score_list.append(score)
print("Test F1-Score: {:.4f}".format(np.array(test_score_list).mean()))
if __name__ == '__main__': if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GAT') print(f'Training PPI Dataset with DGL built-in GATConv module.')
parser.add_argument("--gpu", type=int, default=-1, device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
help="which GPU to use. Set -1 to use CPU.")
parser.add_argument("--epochs", type=int, default=400, # load and preprocess datasets
help="number of training epochs") train_dataset = PPIDataset(mode='train')
parser.add_argument("--num-heads", type=int, default=4, val_dataset = PPIDataset(mode='valid')
help="number of hidden attention heads") test_dataset = PPIDataset(mode='test')
parser.add_argument("--num-out-heads", type=int, default=6, features = train_dataset[0].ndata['feat']
help="number of output attention heads")
parser.add_argument("--num-layers", type=int, default=3, # create GAT model
help="number of hidden layers") in_size = features.shape[1]
parser.add_argument("--num-hidden", type=int, default=256, out_size = train_dataset.num_labels
help="number of hidden units") model = GAT(in_size, 256, out_size, heads=[4,4,6]).to(device)
parser.add_argument("--residual", action="store_true", default=True,
help="use residual connection") # model training
parser.add_argument("--in-drop", type=float, default=0, print('Training...')
help="input feature dropout") train_dataloader = GraphDataLoader(train_dataset, batch_size=2)
parser.add_argument("--attn-drop", type=float, default=0, val_dataloader = GraphDataLoader(val_dataset, batch_size=2)
help="attention dropout") train(train_dataloader, val_dataloader, device, model)
parser.add_argument("--lr", type=float, default=0.005,
help="learning rate")
parser.add_argument('--weight-decay', type=float, default=0,
help="weight decay")
parser.add_argument('--alpha', type=float, default=0.2,
help="the negative slop of leaky relu")
parser.add_argument('--batch-size', type=int, default=2,
help="batch size used for training, validation and test")
parser.add_argument('--patience', type=int, default=10,
help="used for early stop")
args = parser.parse_args()
print(args)
main(args) # test the model
print('Testing...')
test_dataloader = GraphDataLoader(test_dataset, batch_size=2)
avg_score = evaluate_in_batches(test_dataloader, device, model)
print("Test Accuracy (F1-score) {:.4f}".format(avg_score))