Unverified Commit e9c3c0e8 authored by Mufei Li, committed by GitHub

[Model] Simplify RGCN



Co-authored-by: Ubuntu <ubuntu@ip-172-31-6-240.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-57-123.us-west-2.compute.internal>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
parent 22272de6
@@ -162,3 +162,5 @@ config.cmake
.ycm_extra_conf.py
**.png
# model file
*.pth
# Relational-GCN
* Paper: [https://arxiv.org/abs/1703.06103](https://arxiv.org/abs/1703.06103)
* Paper: [Modeling Relational Data with Graph Convolutional Networks](https://arxiv.org/abs/1703.06103)
* Author's code for entity classification: [https://github.com/tkipf/relational-gcn](https://github.com/tkipf/relational-gcn)
* Author's code for link prediction: [https://github.com/MichSchli/RelationPrediction](https://github.com/MichSchli/RelationPrediction)
### Dependencies
* PyTorch 0.4.1+
* requests
* PyTorch 1.10
* rdflib
* pandas
* tqdm
* TorchMetrics
```
pip install requests torch rdflib pandas
pip install rdflib pandas
```
Example code was tested with rdflib 4.2.2 and pandas 0.23.4
@@ -19,66 +20,55 @@ Example code was tested with rdflib 4.2.2 and pandas 0.23.4
### Entity Classification
AIFB: accuracy 96.29% (3 runs, DGL), 95.83% (paper)
```
python3 entity_classify.py -d aifb --testing --gpu 0
python entity.py -d aifb --l2norm 0 --gpu 0
```
MUTAG: accuracy 70.59% (3 runs, DGL), 73.23% (paper)
MUTAG: accuracy 72.55% (3 runs, DGL), 73.23% (paper)
```
python3 entity_classify.py -d mutag --l2norm 5e-4 --n-bases 30 --testing --gpu 0
python entity.py -d mutag --n-bases 30 --gpu 0
```
BGS: accuracy 93.10% (3 runs, DGL), 83.10% (paper)
BGS: accuracy 89.70% (3 runs, DGL), 83.10% (paper)
```
python3 entity_classify.py -d bgs --l2norm 5e-4 --n-bases 40 --testing --gpu 0
python entity.py -d bgs --n-bases 40 --gpu 0
```
AM: accuracy 89.22% (3 runs, DGL), 89.29% (paper)
AM: accuracy 89.56% (3 runs, DGL), 89.29% (paper)
```
python3 entity_classify.py -d am --n-bases=40 --n-hidden=10 --l2norm=5e-4 --testing
python entity.py -d am --n-bases 40 --n-hidden 10
```
### Entity Classification with minibatch
AIFB: accuracy avg(5 runs) 90.00%, best 94.44% (DGL)
```
python3 entity_classify_mp.py -d aifb --testing --gpu 0 --fanout='20,20' --batch-size 128
```
MUTAG: accuracy avg(10 runs) 62.94%, best 72.06% (DGL)
```
python3 entity_classify_mp.py -d mutag --l2norm 5e-4 --n-bases 30 --testing --gpu 0 --batch-size 64 --fanout "-1, -1" --use-self-loop --dgl-sparse --n-epochs 20 --sparse-lr 0.01 --dropout 0.5
```
BGS: accuracy avg(5 runs) 78.62%, best 86.21% (DGL)
AIFB: accuracy avg(5 runs) 91.10%, best 97.22% (DGL)
```
python3 entity_classify_mp.py -d bgs --l2norm 5e-4 --n-bases 40 --testing --gpu 0 --fanout "-1, -1" --n-epochs=16 --batch-size=16 --dgl-sparse --lr 0.01 --sparse-lr 0.05 --dropout 0.3
python entity_sample.py -d aifb --l2norm 0 --gpu 0 --fanout='20,20' --batch-size 128
```
AM: accuracy avg(5 runs) 87.37%, best 89.9% (DGL)
MUTAG: accuracy avg(10 runs) 66.47%, best 72.06% (DGL)
```
python3 entity_classify_mp.py -d am --l2norm 5e-4 --n-bases 40 --testing --gpu 0 --fanout '35,35' --batch-size 64 --n-hidden 16 --use-self-loop --n-epochs=20 --dgl-sparse --lr 0.01 --sparse-lr 0.02 --dropout 0.7
python entity_sample.py -d mutag --n-bases 30 --gpu 0 --batch-size 64 --fanout "-1, -1" --use-self-loop --n-epochs 20 --sparse-lr 0.01 --dropout 0.5
```
### Entity Classification on OGBN-MAG
Test-bd: P3-8xlarge
OGBN-MAG accuracy 45.5 (3 runs)
BGS: accuracy avg(5 runs) 84.83%, best 89.66% (DGL)
```
python3 entity_classify_mp.py -d ogbn-mag --testing --fanout='30,30' --batch-size 1024 --n-hidden 128 --lr 0.01 --num-worker 4 --eval-batch-size 8 --low-mem --gpu 0,1,2,3 --dropout 0.7 --use-self-loop --n-bases 2 --n-epochs 3 --node-feats --dgl-sparse --sparse-lr 0.08
python entity_sample.py -d bgs --n-bases 40 --gpu 0 --fanout "-1, -1" --n-epochs=16 --batch-size=16 --sparse-lr 0.05 --dropout 0.3
```
OGBN-MAG without node-feats 42.79
AM: accuracy avg(5 runs) 88.58%, best 89.90% (DGL)
```
python3 entity_classify_mp.py -d ogbn-mag --testing --fanout='30,30' --batch-size 1024 --n-hidden 128 --lr 0.01 --num-worker 4 --eval-batch-size 8 --low-mem --gpu 0,1,2,3 --dropout 0.7 --use-self-loop --n-bases 2 --n-epochs 3 --dgl-sparse --sparse-lr 0.08
python entity_sample.py -d am --n-bases 40 --gpu 0 --fanout '35,35' --batch-size 64 --n-hidden 16 --use-self-loop --n-epochs=20 --sparse-lr 0.02 --dropout 0.7
```
Test-bd: P2-8xlarge
To use multiple GPUs, replace `entity_sample.py` with `entity_sample_multi_gpu.py` and specify
multiple GPU IDs separated by comma, e.g., `--gpu 0,1`.
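For example, the AM run above would become the following (the flags are assumed to carry over unchanged from `entity_sample.py`; `entity_sample_multi_gpu.py` accepts the same options):
```
python entity_sample_multi_gpu.py -d am --n-bases 40 --gpu 0,1 --fanout '35,35' --batch-size 64 --n-hidden 16 --use-self-loop --n-epochs=20 --sparse-lr 0.02 --dropout 0.7
```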
### Link Prediction
FB15k-237: MRR 0.151 (DGL), 0.158 (paper)
FB15k-237: MRR 0.163 (DGL), 0.158 (paper)
```
python3 link_predict.py -d FB15k-237 --gpu 0 --eval-protocol raw
python link.py --gpu 0 --eval-protocol raw
```
FB15k-237: Filtered-MRR 0.2044
FB15k-237: Filtered-MRR 0.247
```
python3 link_predict.py -d FB15k-237 --gpu 0 --eval-protocol filtered
python link.py --gpu 0 --eval-protocol filtered
```
"""
Differences compared to tkipf/relational-gcn
* l2norm applied to all weights
* remove nodes that won't be touched
"""
import argparse
import torch as th
import torch.nn.functional as F
from torchmetrics.functional import accuracy
from entity_utils import load_data
from model import RGCN
def main(args):
g, num_rels, num_classes, labels, train_idx, test_idx, target_idx = load_data(
args.dataset, get_norm=True)
num_nodes = g.num_nodes()
# Since the nodes are featureless, learn node embeddings from scratch
# This requires passing the node IDs to the model.
feats = th.arange(num_nodes)
model = RGCN(num_nodes,
args.n_hidden,
num_classes,
num_rels,
num_bases=args.n_bases)
if args.gpu >= 0 and th.cuda.is_available():
device = th.device(args.gpu)
else:
device = th.device('cpu')
feats = feats.to(device)
labels = labels.to(device)
model = model.to(device)
g = g.to(device)
optimizer = th.optim.Adam(model.parameters(), lr=1e-2, weight_decay=args.l2norm)
model.train()
for epoch in range(50):
logits = model(g, feats)
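# target_idx holds the homogeneous-graph IDs of the target-category nodes,
# so this keeps only the predictions we have labels for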
logits = logits[target_idx]
loss = F.cross_entropy(logits[train_idx], labels[train_idx])
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_acc = accuracy(logits[train_idx].argmax(dim=1), labels[train_idx]).item()
print("Epoch {:05d} | Train Accuracy: {:.4f} | Train Loss: {:.4f}".format(
epoch, train_acc, loss.item()))
print()
model.eval()
with th.no_grad():
logits = model(g, feats)
logits = logits[target_idx]
test_acc = accuracy(logits[test_idx].argmax(dim=1), labels[test_idx]).item()
print("Test Accuracy: {:.4f}".format(test_acc))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN for entity classification')
parser.add_argument("--n-hidden", type=int, default=16,
help="number of hidden units")
parser.add_argument("--gpu", type=int, default=-1,
help="gpu")
parser.add_argument("--n-bases", type=int, default=-1,
help="number of filter weight matrices, default: -1 [use all]")
parser.add_argument("-d", "--dataset", type=str, required=True,
choices=['aifb', 'mutag', 'bgs', 'am'],
help="dataset to use")
parser.add_argument("--l2norm", type=float, default=5e-4,
help="l2 norm coef")
args = parser.parse_args()
print(args)
main(args)
"""
Modeling Relational Data with Graph Convolutional Networks
Paper: https://arxiv.org/abs/1703.06103
Code: https://github.com/tkipf/relational-gcn
Differences compared to tkipf/relational-gcn
* l2norm applied to all weights
* remove nodes that won't be touched
"""
import argparse
import numpy as np
import time
import torch
import torch.nn.functional as F
import dgl
from dgl.nn.pytorch import RelGraphConv
from functools import partial
from dgl.data.rdf import AIFBDataset, MUTAGDataset, BGSDataset, AMDataset
from model import BaseRGCN
class EntityClassify(BaseRGCN):
def create_features(self):
features = torch.arange(self.num_nodes)
if self.use_cuda:
features = features.cuda()
return features
def build_input_layer(self):
return RelGraphConv(self.num_nodes, self.h_dim, self.num_rels, "basis",
self.num_bases, activation=F.relu, self_loop=self.use_self_loop,
dropout=self.dropout)
def build_hidden_layer(self, idx):
return RelGraphConv(self.h_dim, self.h_dim, self.num_rels, "basis",
self.num_bases, activation=F.relu, self_loop=self.use_self_loop,
dropout=self.dropout)
def build_output_layer(self):
return RelGraphConv(self.h_dim, self.out_dim, self.num_rels, "basis",
self.num_bases, activation=None,
self_loop=self.use_self_loop)
def main(args):
# load graph data
if args.dataset == 'aifb':
dataset = AIFBDataset()
elif args.dataset == 'mutag':
dataset = MUTAGDataset()
elif args.dataset == 'bgs':
dataset = BGSDataset()
elif args.dataset == 'am':
dataset = AMDataset()
else:
raise ValueError()
# Load from hetero-graph
hg = dataset[0]
num_rels = len(hg.canonical_etypes)
category = dataset.predict_category
num_classes = dataset.num_classes
train_mask = hg.nodes[category].data.pop('train_mask')
test_mask = hg.nodes[category].data.pop('test_mask')
train_idx = torch.nonzero(train_mask, as_tuple=False).squeeze()
test_idx = torch.nonzero(test_mask, as_tuple=False).squeeze()
labels = hg.nodes[category].data.pop('labels')
# split dataset into train, validate, test
if args.validation:
val_idx = train_idx[:len(train_idx) // 5]
train_idx = train_idx[len(train_idx) // 5:]
else:
val_idx = train_idx
# calculate norm for each edge type and store in edge
for canonical_etype in hg.canonical_etypes:
u, v, eid = hg.all_edges(form='all', etype=canonical_etype)
_, inverse_index, count = torch.unique(v, return_inverse=True, return_counts=True)
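# count[inverse_index] expands the per-destination in-degree back to one entry per edge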
degrees = count[inverse_index]
norm = torch.ones(eid.shape[0]).float() / degrees.float()
norm = norm.unsqueeze(1)
hg.edges[canonical_etype].data['norm'] = norm
# get target category id
category_id = len(hg.ntypes)
for i, ntype in enumerate(hg.ntypes):
if ntype == category:
category_id = i
g = dgl.to_homogeneous(hg, edata=['norm'])
num_nodes = g.number_of_nodes()
node_ids = torch.arange(num_nodes)
edge_norm = g.edata['norm']
edge_type = g.edata[dgl.ETYPE].long()
# find out the target node ids in g
node_tids = g.ndata[dgl.NTYPE]
loc = (node_tids == category_id)
target_idx = node_ids[loc]
# since the nodes are featureless, the input feature is then the node id.
feats = torch.arange(num_nodes)
# check cuda
use_cuda = args.gpu >= 0 and torch.cuda.is_available()
if use_cuda:
torch.cuda.set_device(args.gpu)
feats = feats.cuda()
edge_type = edge_type.cuda()
edge_norm = edge_norm.cuda()
labels = labels.cuda()
# create model
model = EntityClassify(num_nodes,
args.n_hidden,
num_classes,
num_rels,
num_bases=args.n_bases,
num_hidden_layers=args.n_layers - 2,
dropout=args.dropout,
use_self_loop=args.use_self_loop,
use_cuda=use_cuda)
if use_cuda:
model.cuda()
g = g.to('cuda:%d' % args.gpu)
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.l2norm)
# training loop
print("start training...")
forward_time = []
backward_time = []
model.train()
for epoch in range(args.n_epochs):
optimizer.zero_grad()
t0 = time.time()
logits = model(g, feats, edge_type, edge_norm)
logits = logits[target_idx]
loss = F.cross_entropy(logits[train_idx], labels[train_idx])
t1 = time.time()
loss.backward()
optimizer.step()
t2 = time.time()
forward_time.append(t1 - t0)
backward_time.append(t2 - t1)
print("Epoch {:05d} | Train Forward Time(s) {:.4f} | Backward Time(s) {:.4f}".
format(epoch, forward_time[-1], backward_time[-1]))
train_acc = torch.sum(logits[train_idx].argmax(dim=1) == labels[train_idx]).item() / len(train_idx)
val_loss = F.cross_entropy(logits[val_idx], labels[val_idx])
val_acc = torch.sum(logits[val_idx].argmax(dim=1) == labels[val_idx]).item() / len(val_idx)
print("Train Accuracy: {:.4f} | Train Loss: {:.4f} | Validation Accuracy: {:.4f} | Validation loss: {:.4f}".
format(train_acc, loss.item(), val_acc, val_loss.item()))
print()
model.eval()
logits = model.forward(g, feats, edge_type, edge_norm)
logits = logits[target_idx]
test_loss = F.cross_entropy(logits[test_idx], labels[test_idx])
test_acc = torch.sum(logits[test_idx].argmax(dim=1) == labels[test_idx]).item() / len(test_idx)
print("Test Accuracy: {:.4f} | Test loss: {:.4f}".format(test_acc, test_loss.item()))
print()
print("Mean forward time: {:4f}".format(np.mean(forward_time[len(forward_time) // 4:])))
print("Mean backward time: {:4f}".format(np.mean(backward_time[len(backward_time) // 4:])))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN')
parser.add_argument("--dropout", type=float, default=0,
help="dropout probability")
parser.add_argument("--n-hidden", type=int, default=16,
help="number of hidden units")
parser.add_argument("--gpu", type=int, default=-1,
help="gpu")
parser.add_argument("--lr", type=float, default=1e-2,
help="learning rate")
parser.add_argument("--n-bases", type=int, default=-1,
help="number of filter weight matrices, default: -1 [use all]")
parser.add_argument("--n-layers", type=int, default=2,
help="number of propagation rounds")
parser.add_argument("-e", "--n-epochs", type=int, default=50,
help="number of training epochs")
parser.add_argument("-d", "--dataset", type=str, required=True,
help="dataset to use")
parser.add_argument("--l2norm", type=float, default=0,
help="l2 norm coef")
parser.add_argument("--use-self-loop", default=False, action='store_true',
help="include self feature as a special relation")
fp = parser.add_mutually_exclusive_group(required=False)
fp.add_argument('--validation', dest='validation', action='store_true')
fp.add_argument('--testing', dest='validation', action='store_false')
parser.set_defaults(validation=True)
args = parser.parse_args()
print(args)
main(args)
"""
Differences compared to tkipf/relational-gcn
* l2norm applied to all weights
* remove nodes that won't be touched
"""
import argparse
import torch as th
import torch.nn.functional as F
import dgl
from dgl.dataloading import MultiLayerNeighborSampler, NodeDataLoader
from torchmetrics.functional import accuracy
from tqdm import tqdm
from entity_utils import load_data
from model import RelGraphEmbedLayer, RGCN
def init_dataloaders(args, g, train_idx, test_idx, target_idx, device, use_ddp=False):
fanouts = [int(fanout) for fanout in args.fanout.split(',')]
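# e.g. --fanout '20,20' samples 20 neighbors per node at each of the two layers;
# a fanout of -1 keeps all neighbors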
sampler = MultiLayerNeighborSampler(fanouts)
train_loader = NodeDataLoader(
g,
target_idx[train_idx],
sampler,
use_ddp=use_ddp,
device=device,
batch_size=args.batch_size,
shuffle=True,
drop_last=False)
# The datasets do not have a validation subset, use the train subset
val_loader = NodeDataLoader(
g,
target_idx[train_idx],
sampler,
use_ddp=use_ddp,
device=device,
batch_size=args.batch_size,
shuffle=False,
drop_last=False)
# -1 for sampling all neighbors
test_sampler = MultiLayerNeighborSampler([-1] * len(fanouts))
test_loader = NodeDataLoader(
g,
target_idx[test_idx],
test_sampler,
use_ddp=use_ddp,
device=device,
batch_size=32,
shuffle=False,
drop_last=False)
return train_loader, val_loader, test_loader
def init_models(args, device, num_nodes, num_classes, num_rels):
embed_layer = RelGraphEmbedLayer(device,
num_nodes,
args.n_hidden)
model = RGCN(args.n_hidden,
args.n_hidden,
num_classes,
num_rels,
num_bases=args.n_bases,
dropout=args.dropout,
self_loop=args.use_self_loop)
return embed_layer, model
def process_batch(inv_target, batch):
_, seeds, blocks = batch
# map the seed nodes back to their type-specific ids,
# in order to get the target node labels
seeds = inv_target[seeds]
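# compute the edge normalization (1 / in-degree of the destination node)
# on the fly for every sampled block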
for blc in blocks:
blc.edata['norm'] = dgl.norm_by_dst(blc).unsqueeze(1)
return seeds, blocks
def train(model, embed_layer, train_loader, inv_target,
labels, emb_optimizer, optimizer):
model.train()
embed_layer.train()
for sample_data in train_loader:
seeds, blocks = process_batch(inv_target, sample_data)
feats = embed_layer(blocks[0].srcdata[dgl.NID].cpu())
logits = model(blocks, feats)
loss = F.cross_entropy(logits, labels[seeds])
emb_optimizer.zero_grad()
optimizer.zero_grad()
loss.backward()
emb_optimizer.step()
optimizer.step()
train_acc = accuracy(logits.argmax(dim=1), labels[seeds]).item()
return train_acc, loss.item()
def evaluate(model, embed_layer, eval_loader, inv_target):
model.eval()
embed_layer.eval()
eval_logits = []
eval_seeds = []
with th.no_grad():
for sample_data in tqdm(eval_loader):
seeds, blocks = process_batch(inv_target, sample_data)
feats = embed_layer(blocks[0].srcdata[dgl.NID].cpu())
logits = model(blocks, feats)
eval_logits.append(logits.cpu().detach())
eval_seeds.append(seeds.cpu().detach())
eval_logits = th.cat(eval_logits)
eval_seeds = th.cat(eval_seeds)
return eval_logits, eval_seeds
def main(args):
g, num_rels, num_classes, labels, train_idx, test_idx, target_idx, inv_target = load_data(
args.dataset, inv_target=True)
if args.gpu >= 0 and th.cuda.is_available():
device = th.device(args.gpu)
else:
device = th.device('cpu')
train_loader, val_loader, test_loader = init_dataloaders(
args, g, train_idx, test_idx, target_idx, args.gpu)
embed_layer, model = init_models(args, device, g.num_nodes(), num_classes, num_rels)
labels = labels.to(device)
model = model.to(device)
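# the node embeddings are trained with a sparse optimizer; the RGCN weights
# use Adam, whose weight decay provides the L2 regularization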
emb_optimizer = th.optim.SparseAdam(embed_layer.parameters(), lr=args.sparse_lr)
optimizer = th.optim.Adam(model.parameters(), lr=1e-2, weight_decay=args.l2norm)
for epoch in range(args.n_epochs):
train_acc, loss = train(model, embed_layer, train_loader, inv_target,
labels, emb_optimizer, optimizer)
print("Epoch {:05d}/{:05d} | Train Accuracy: {:.4f} | Train Loss: {:.4f}".format(
epoch, args.n_epochs, train_acc, loss))
val_logits, val_seeds = evaluate(model, embed_layer, val_loader, inv_target)
val_acc = accuracy(val_logits.argmax(dim=1), labels[val_seeds].cpu()).item()
print("Validation Accuracy: {:.4f}".format(val_acc))
test_logits, test_seeds = evaluate(model, embed_layer,
test_loader, inv_target)
test_acc = accuracy(test_logits.argmax(dim=1), labels[test_seeds].cpu()).item()
print("Final Test Accuracy: {:.4f}".format(test_acc))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN for entity classification with sampling')
parser.add_argument("--dropout", type=float, default=0,
help="dropout probability")
parser.add_argument("--n-hidden", type=int, default=16,
help="number of hidden units")
parser.add_argument("--gpu", type=int, default=0,
help="gpu")
parser.add_argument("--sparse-lr", type=float, default=2e-2,
help="sparse embedding learning rate")
parser.add_argument("--n-bases", type=int, default=-1,
help="number of filter weight matrices, default: -1 [use all]")
parser.add_argument("--n-epochs", type=int, default=50,
help="number of training epochs")
parser.add_argument("-d", "--dataset", type=str, required=True,
choices=['aifb', 'mutag', 'bgs', 'am'],
help="dataset to use")
parser.add_argument("--l2norm", type=float, default=5e-4,
help="l2 norm coef")
parser.add_argument("--fanout", type=str, default="4, 4",
help="Fan-out of neighbor sampling")
parser.add_argument("--use-self-loop", default=False, action='store_true',
help="include self feature as a special relation")
parser.add_argument("--batch-size", type=int, default=100,
help="Mini-batch size")
args = parser.parse_args()
print(args)
main(args)
"""
Differences compared to tkipf/relational-gcn
* l2norm applied to all weights
* remove nodes that won't be touched
"""
import argparse
import gc
import torch as th
import torch.nn.functional as F
import dgl.multiprocessing as mp
import dgl
from torchmetrics.functional import accuracy
from torch.nn.parallel import DistributedDataParallel
from entity_utils import load_data
from entity_sample import init_dataloaders, init_models, train, evaluate
def collect_eval(n_gpus, queue, labels):
eval_logits = []
eval_seeds = []
for _ in range(n_gpus):
eval_l, eval_s = queue.get()
eval_logits.append(eval_l)
eval_seeds.append(eval_s)
eval_logits = th.cat(eval_logits)
eval_seeds = th.cat(eval_seeds)
eval_acc = accuracy(eval_logits.argmax(dim=1), labels[eval_seeds].cpu()).item()
return eval_acc
def run(proc_id, n_gpus, n_cpus, args, devices, dataset, queue=None):
dev_id = devices[proc_id]
g, num_classes, num_rels, target_idx, inv_target, train_idx,\
test_idx, labels = dataset
dist_init_method = 'tcp://{master_ip}:{master_port}'.format(
master_ip='127.0.0.1', master_port='12345')
backend = 'gloo'
if proc_id == 0:
print("backend using {}".format(backend))
th.distributed.init_process_group(backend=backend,
init_method=dist_init_method,
world_size=n_gpus,
rank=proc_id)
device = th.device(dev_id)
use_ddp = True if n_gpus > 1 else False
train_loader, val_loader, test_loader = init_dataloaders(
args, g, train_idx, test_idx, target_idx, dev_id, use_ddp=use_ddp)
embed_layer, model = init_models(args, device, g.num_nodes(), num_classes, num_rels)
labels = labels.to(device)
model = model.to(device)
model = DistributedDataParallel(model, device_ids=[dev_id], output_device=dev_id)
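# the node embedding table stays on CPU, so its DDP wrapper is created without device ids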
embed_layer = DistributedDataParallel(embed_layer, device_ids=None, output_device=None)
emb_optimizer = th.optim.SparseAdam(embed_layer.module.parameters(), lr=args.sparse_lr)
optimizer = th.optim.Adam(model.parameters(), lr=1e-2, weight_decay=args.l2norm)
th.set_num_threads(n_cpus)
for epoch in range(args.n_epochs):
train_loader.set_epoch(epoch)
train_acc, loss = train(model, embed_layer, train_loader, inv_target,
labels, emb_optimizer, optimizer)
if proc_id == 0:
print("Epoch {:05d}/{:05d} | Train Accuracy: {:.4f} | Train Loss: {:.4f}".format(
epoch, args.n_epochs, train_acc, loss))
# garbage collection that empties the queue
gc.collect()
val_logits, val_seeds = evaluate(model, embed_layer, val_loader, inv_target)
queue.put((val_logits, val_seeds))
# gather evaluation result from multiple processes
if proc_id == 0:
val_acc = collect_eval(n_gpus, queue, labels)
print("Validation Accuracy: {:.4f}".format(val_acc))
# garbage collection that empties the queue
gc.collect()
test_logits, test_seeds = evaluate(model, embed_layer, test_loader, inv_target)
queue.put((test_logits, test_seeds))
if proc_id == 0:
test_acc = collect_eval(n_gpus, queue, labels)
print("Final Test Accuracy: {:.4f}".format(test_acc))
th.distributed.barrier()
def main(args, devices):
g, num_rels, num_classes, labels, train_idx, test_idx, target_idx, inv_target = load_data(
args.dataset, inv_target=True)
# Create csr/coo/csc formats before launching training processes.
# This avoids creating certain formats in each sub-process, which saves memory and CPU.
g.create_formats_()
n_gpus = len(devices)
n_cpus = mp.cpu_count()
queue = mp.Queue(n_gpus)
procs = []
for proc_id in range(n_gpus):
# We use distributed data parallel dataloader to handle the data splitting
p = mp.Process(target=run, args=(proc_id, n_gpus, n_cpus // n_gpus, args, devices,
(g, num_classes, num_rels, target_idx,
inv_target, train_idx, test_idx, labels),
queue))
p.start()
procs.append(p)
for p in procs:
p.join()
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN for entity classification with sampling and multiple gpus')
parser.add_argument("--dropout", type=float, default=0,
help="dropout probability")
parser.add_argument("--n-hidden", type=int, default=16,
help="number of hidden units")
parser.add_argument("--gpu", type=str, default='0',
help="gpu")
parser.add_argument("--sparse-lr", type=float, default=2e-2,
help="sparse embedding learning rate")
parser.add_argument("--n-bases", type=int, default=-1,
help="number of filter weight matrices, default: -1 [use all]")
parser.add_argument("--n-epochs", type=int, default=50,
help="number of training epochs")
parser.add_argument("-d", "--dataset", type=str, required=True,
choices=['aifb', 'mutag', 'bgs', 'am'],
help="dataset to use")
parser.add_argument("--l2norm", type=float, default=5e-4,
help="l2 norm coef")
parser.add_argument("--fanout", type=str, default="4, 4",
help="Fan-out of neighbor sampling")
parser.add_argument("--use-self-loop", default=False, action='store_true',
help="include self feature as a special relation")
parser.add_argument("--batch-size", type=int, default=100,
help="Mini-batch size. ")
args = parser.parse_args()
devices = list(map(int, args.gpu.split(',')))
print(args)
main(args, devices)
import dgl
import torch as th
from dgl.data.rdf import AIFBDataset, MUTAGDataset, BGSDataset, AMDataset
def load_data(data_name, get_norm=False, inv_target=False):
if data_name == 'aifb':
dataset = AIFBDataset()
elif data_name == 'mutag':
dataset = MUTAGDataset()
elif data_name == 'bgs':
dataset = BGSDataset()
else:
dataset = AMDataset()
# Load hetero-graph
hg = dataset[0]
num_rels = len(hg.canonical_etypes)
category = dataset.predict_category
num_classes = dataset.num_classes
labels = hg.nodes[category].data.pop('labels')
train_mask = hg.nodes[category].data.pop('train_mask')
test_mask = hg.nodes[category].data.pop('test_mask')
train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
if get_norm:
# Calculate normalization weight for each edge,
# 1. / d, d is the degree of the destination node
for cetype in hg.canonical_etypes:
hg.edges[cetype].data['norm'] = dgl.norm_by_dst(hg, cetype).unsqueeze(1)
edata = ['norm']
else:
edata = None
# get target category id
category_id = hg.ntypes.index(category)
g = dgl.to_homogeneous(hg, edata=edata)
# Rename the fields as they can be changed by for example NodeDataLoader
g.ndata['ntype'] = g.ndata.pop(dgl.NTYPE)
g.ndata['type_id'] = g.ndata.pop(dgl.NID)
node_ids = th.arange(g.num_nodes())
# find out the target node ids in g
loc = (g.ndata['ntype'] == category_id)
target_idx = node_ids[loc]
if inv_target:
# Map global node IDs to type-specific node IDs. This is required for
# looking up type-specific labels in a minibatch
inv_target = th.empty((g.num_nodes(),), dtype=th.int64)
inv_target[target_idx] = th.arange(0, target_idx.shape[0],
dtype=inv_target.dtype)
return g, num_rels, num_classes, labels, train_idx, test_idx, target_idx, inv_target
else:
return g, num_rels, num_classes, labels, train_idx, test_idx, target_idx
"""
Differences compared to MichSchli/RelationPrediction
* Report raw metrics instead of filtered metrics.
* By default, we use uniform edge sampling instead of neighbor-based edge
sampling used in the author's code. In practice, we find it achieves a similar MRR.
"""
import argparse
import torch as th
import torch.nn as nn
import torch.nn.functional as F
from dgl.data.knowledge_graph import FB15k237Dataset
from dgl.dataloading import GraphDataLoader
from link_utils import preprocess, SubgraphIterator, calc_mrr
from model import RGCN
class LinkPredict(nn.Module):
def __init__(self, in_dim, num_rels, h_dim=500, num_bases=100, dropout=0.2, reg_param=0.01):
super(LinkPredict, self).__init__()
self.rgcn = RGCN(in_dim, h_dim, h_dim, num_rels * 2, regularizer="bdd",
num_bases=num_bases, dropout=dropout, self_loop=True, link_pred=True)
self.reg_param = reg_param
self.w_relation = nn.Parameter(th.Tensor(num_rels, h_dim))
nn.init.xavier_uniform_(self.w_relation,
gain=nn.init.calculate_gain('relu'))
def calc_score(self, embedding, triplets):
# DistMult
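# score(s, r, o) = sum_k s_k * r_k * o_k, i.e. a bilinear form with a diagonal relation matrix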
s = embedding[triplets[:,0]]
r = self.w_relation[triplets[:,1]]
o = embedding[triplets[:,2]]
score = th.sum(s * r * o, dim=1)
return score
def forward(self, g, h):
return self.rgcn(g, h)
def regularization_loss(self, embedding):
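# squared L2 penalty on entity embeddings and relation vectors, weighted by reg_param in get_loss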
return th.mean(embedding.pow(2)) + th.mean(self.w_relation.pow(2))
def get_loss(self, embed, triplets, labels):
# each row in the triplets is a 3-tuple of (source, relation, destination)
score = self.calc_score(embed, triplets)
predict_loss = F.binary_cross_entropy_with_logits(score, labels)
reg_loss = self.regularization_loss(embed)
return predict_loss + self.reg_param * reg_loss
def main(args):
data = FB15k237Dataset(reverse=False)
graph = data[0]
num_nodes = graph.num_nodes()
num_rels = data.num_rels
train_g, test_g = preprocess(graph, num_rels)
test_node_id = th.arange(0, num_nodes).view(-1, 1)
test_mask = graph.edata['test_mask']
subg_iter = SubgraphIterator(train_g, num_rels, args.edge_sampler)
dataloader = GraphDataLoader(subg_iter, batch_size=1, collate_fn=lambda x: x[0])
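# each iteration yields one sampled training subgraph together with its
# positive/negative triplets and labels; batch_size=1 with an identity collate keeps them unbatched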
# Prepare data for metric computation
src, dst = graph.edges()
triplets = th.stack([src, graph.edata['etype'], dst], dim=1)
model = LinkPredict(num_nodes, num_rels)
optimizer = th.optim.Adam(model.parameters(), lr=1e-2)
if args.gpu >= 0 and th.cuda.is_available():
device = th.device(args.gpu)
else:
device = th.device('cpu')
model = model.to(device)
best_mrr = 0
model_state_file = 'model_state.pth'
for epoch, batch_data in enumerate(dataloader):
model.train()
g, node_id, data, labels = batch_data
g = g.to(device)
node_id = node_id.to(device)
data = data.to(device)
labels = labels.to(device)
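# node_id holds the original entity IDs of the sampled nodes;
# the model's embedding layer maps them to input features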
embed = model(g, node_id)
loss = model.get_loss(embed, data, labels)
optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) # clip gradients
optimizer.step()
print("Epoch {:04d} | Loss {:.4f} | Best MRR {:.4f}".format(epoch, loss.item(), best_mrr))
if (epoch + 1) % 500 == 0:
# perform validation on CPU because full graph is too large
model = model.cpu()
model.eval()
print("start eval")
embed = model(test_g, test_node_id)
mrr = calc_mrr(embed, model.w_relation, test_mask, triplets,
batch_size=500, eval_p=args.eval_protocol)
# save best model
if best_mrr < mrr:
best_mrr = mrr
th.save({'state_dict': model.state_dict(), 'epoch': epoch}, model_state_file)
model = model.to(device)
print("Start testing:")
# use best model checkpoint
checkpoint = th.load(model_state_file)
model = model.cpu() # test on CPU
model.eval()
model.load_state_dict(checkpoint['state_dict'])
print("Using best epoch: {}".format(checkpoint['epoch']))
embed = model(test_g, test_node_id)
calc_mrr(embed, model.w_relation, test_mask, triplets,
batch_size=500, eval_p=args.eval_protocol)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN for link prediction')
parser.add_argument("--gpu", type=int, default=-1,
help="gpu")
parser.add_argument("--eval-protocol", type=str, default='filtered',
choices=['filtered', 'raw'],
help="Whether to use 'filtered' or 'raw' MRR for evaluation")
parser.add_argument("--edge-sampler", type=str, default='uniform',
choices=['uniform', 'neighbor'],
help="Type of edge sampler: 'uniform' or 'neighbor'"
"The original implementation uses neighbor sampler.")
args = parser.parse_args()
print(args)
main(args)
"""
Modeling Relational Data with Graph Convolutional Networks
Paper: https://arxiv.org/abs/1703.06103
Code: https://github.com/MichSchli/RelationPrediction
Differences compared to MichSchli/RelationPrediction
* Report raw metrics instead of filtered metrics.
* By default, we use uniform edge sampling instead of the neighbor-based edge
sampling used in the author's code. In practice, we find it achieves a similar MRR.
Users can specify "--edge-sampler=neighbor" to switch to neighbor-based edge sampling.
"""
import argparse
import numpy as np
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
from dgl.data.knowledge_graph import load_data
from dgl.nn.pytorch import RelGraphConv
from model import BaseRGCN
import utils
class EmbeddingLayer(nn.Module):
def __init__(self, num_nodes, h_dim):
super(EmbeddingLayer, self).__init__()
self.embedding = torch.nn.Embedding(num_nodes, h_dim)
def forward(self, g, h, r, norm):
return self.embedding(h.squeeze())
class RGCN(BaseRGCN):
def build_input_layer(self):
return EmbeddingLayer(self.num_nodes, self.h_dim)
def build_hidden_layer(self, idx):
act = F.relu if idx < self.num_hidden_layers - 1 else None
return RelGraphConv(self.h_dim, self.h_dim, self.num_rels, "bdd",
self.num_bases, activation=act, self_loop=True,
dropout=self.dropout)
class LinkPredict(nn.Module):
def __init__(self, in_dim, h_dim, num_rels, num_bases=-1,
num_hidden_layers=1, dropout=0, use_cuda=False, reg_param=0):
super(LinkPredict, self).__init__()
self.rgcn = RGCN(in_dim, h_dim, h_dim, num_rels * 2, num_bases,
num_hidden_layers, dropout, use_cuda)
self.reg_param = reg_param
self.w_relation = nn.Parameter(torch.Tensor(num_rels, h_dim))
nn.init.xavier_uniform_(self.w_relation,
gain=nn.init.calculate_gain('relu'))
def calc_score(self, embedding, triplets):
# DistMult
s = embedding[triplets[:,0]]
r = self.w_relation[triplets[:,1]]
o = embedding[triplets[:,2]]
score = torch.sum(s * r * o, dim=1)
return score
def forward(self, g, h, r, norm):
return self.rgcn.forward(g, h, r, norm)
def regularization_loss(self, embedding):
return torch.mean(embedding.pow(2)) + torch.mean(self.w_relation.pow(2))
def get_loss(self, g, embed, triplets, labels):
# triplets is a list of data samples (positive and negative)
# each row in the triplets is a 3-tuple of (source, relation, destination)
score = self.calc_score(embed, triplets)
predict_loss = F.binary_cross_entropy_with_logits(score, labels)
reg_loss = self.regularization_loss(embed)
return predict_loss + self.reg_param * reg_loss
def node_norm_to_edge_norm(g, node_norm):
g = g.local_var()
# convert to edge norm
g.ndata['norm'] = node_norm
g.apply_edges(lambda edges : {'norm' : edges.dst['norm']})
return g.edata['norm']
def main(args):
# load graph data
data = load_data(args.dataset)
num_nodes = data.num_nodes
train_data = data.train
valid_data = data.valid
test_data = data.test
num_rels = data.num_rels
# check cuda
use_cuda = args.gpu >= 0 and torch.cuda.is_available()
if use_cuda:
torch.cuda.set_device(args.gpu)
# create model
model = LinkPredict(num_nodes,
args.n_hidden,
num_rels,
num_bases=args.n_bases,
num_hidden_layers=args.n_layers,
dropout=args.dropout,
use_cuda=use_cuda,
reg_param=args.regularization)
# validation and testing triplets
valid_data = torch.LongTensor(valid_data)
test_data = torch.LongTensor(test_data)
# build test graph
test_graph, test_rel, test_norm = utils.build_test_graph(
num_nodes, num_rels, train_data)
test_deg = test_graph.in_degrees(
range(test_graph.number_of_nodes())).float().view(-1,1)
test_node_id = torch.arange(0, num_nodes, dtype=torch.long).view(-1, 1)
test_rel = torch.from_numpy(test_rel)
test_norm = node_norm_to_edge_norm(test_graph, torch.from_numpy(test_norm).view(-1, 1))
if use_cuda:
model.cuda()
# build adj list and calculate degrees for sampling
adj_list, degrees = utils.get_adj_and_degrees(num_nodes, train_data)
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)
model_state_file = 'model_state.pth'
forward_time = []
backward_time = []
# training loop
print("start training...")
epoch = 0
best_mrr = 0
while True:
model.train()
epoch += 1
# perform edge neighborhood sampling to generate training graph and data
g, node_id, edge_type, node_norm, data, labels = \
utils.generate_sampled_graph_and_labels(
train_data, args.graph_batch_size, args.graph_split_size,
num_rels, adj_list, degrees, args.negative_sample,
args.edge_sampler)
print("Done edge sampling")
# set node/edge feature
node_id = torch.from_numpy(node_id).view(-1, 1).long()
edge_type = torch.from_numpy(edge_type)
edge_norm = node_norm_to_edge_norm(g, torch.from_numpy(node_norm).view(-1, 1))
data, labels = torch.from_numpy(data), torch.from_numpy(labels)
deg = g.in_degrees(range(g.number_of_nodes())).float().view(-1, 1)
if use_cuda:
node_id, deg = node_id.cuda(), deg.cuda()
edge_type, edge_norm = edge_type.cuda(), edge_norm.cuda()
data, labels = data.cuda(), labels.cuda()
g = g.to(args.gpu)
t0 = time.time()
embed = model(g, node_id, edge_type, edge_norm)
loss = model.get_loss(g, embed, data, labels)
t1 = time.time()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_norm) # clip gradients
optimizer.step()
t2 = time.time()
forward_time.append(t1 - t0)
backward_time.append(t2 - t1)
print("Epoch {:04d} | Loss {:.4f} | Best MRR {:.4f} | Forward {:.4f}s | Backward {:.4f}s".
format(epoch, loss.item(), best_mrr, forward_time[-1], backward_time[-1]))
optimizer.zero_grad()
# validation
if epoch % args.evaluate_every == 0:
# perform validation on CPU because full graph is too large
if use_cuda:
model.cpu()
model.eval()
print("start eval")
embed = model(test_graph, test_node_id, test_rel, test_norm)
mrr = utils.calc_mrr(embed, model.w_relation, torch.LongTensor(train_data),
valid_data, test_data, hits=[1, 3, 10], eval_bz=args.eval_batch_size,
eval_p=args.eval_protocol)
# save best model
if best_mrr < mrr:
best_mrr = mrr
torch.save({'state_dict': model.state_dict(), 'epoch': epoch}, model_state_file)
if epoch >= args.n_epochs:
break
if use_cuda:
model.cuda()
print("training done")
print("Mean forward time: {:4f}s".format(np.mean(forward_time)))
print("Mean Backward time: {:4f}s".format(np.mean(backward_time)))
print("\nstart testing:")
# use best model checkpoint
checkpoint = torch.load(model_state_file)
if use_cuda:
model.cpu() # test on CPU
model.eval()
model.load_state_dict(checkpoint['state_dict'])
print("Using best epoch: {}".format(checkpoint['epoch']))
embed = model(test_graph, test_node_id, test_rel, test_norm)
utils.calc_mrr(embed, model.w_relation, torch.LongTensor(train_data), valid_data,
test_data, hits=[1, 3, 10], eval_bz=args.eval_batch_size, eval_p=args.eval_protocol)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN')
parser.add_argument("--dropout", type=float, default=0.2,
help="dropout probability")
parser.add_argument("--n-hidden", type=int, default=500,
help="number of hidden units")
parser.add_argument("--gpu", type=int, default=-1,
help="gpu")
parser.add_argument("--lr", type=float, default=1e-2,
help="learning rate")
parser.add_argument("--n-bases", type=int, default=100,
help="number of weight blocks for each relation")
parser.add_argument("--n-layers", type=int, default=2,
help="number of propagation rounds")
parser.add_argument("--n-epochs", type=int, default=6000,
help="number of minimum training epochs")
parser.add_argument("-d", "--dataset", type=str, required=True,
help="dataset to use")
parser.add_argument("--eval-batch-size", type=int, default=500,
help="batch size when evaluating")
parser.add_argument("--eval-protocol", type=str, default="filtered",
help="type of evaluation protocol: 'raw' or 'filtered' mrr")
parser.add_argument("--regularization", type=float, default=0.01,
help="regularization weight")
parser.add_argument("--grad-norm", type=float, default=1.0,
help="norm to clip gradient to")
parser.add_argument("--graph-batch-size", type=int, default=30000,
help="number of edges to sample in each iteration")
parser.add_argument("--graph-split-size", type=float, default=0.5,
help="portion of edges used as positive sample")
parser.add_argument("--negative-sample", type=int, default=10,
help="number of negative samples per positive sample")
parser.add_argument("--evaluate-every", type=int, default=500,
help="perform evaluation every n epochs")
parser.add_argument("--edge-sampler", type=str, default="uniform",
help="type of edge sampler: 'uniform' or 'neighbor'")
args = parser.parse_args()
print(args)
main(args)
"""
Utility functions for link prediction
Most code is adapted from authors' implementation of RGCN link prediction:
https://github.com/MichSchli/RelationPrediction
"""
import numpy as np
import torch as th
import dgl
# Utility function for building training and testing graphs
def get_subset_g(g, mask, num_rels, bidirected=False):
src, dst = g.edges()
sub_src = src[mask]
sub_dst = dst[mask]
sub_rel = g.edata['etype'][mask]
if bidirected:
sub_src, sub_dst = th.cat([sub_src, sub_dst]), th.cat([sub_dst, sub_src])
sub_rel = th.cat([sub_rel, sub_rel + num_rels])
sub_g = dgl.graph((sub_src, sub_dst), num_nodes=g.num_nodes())
sub_g.edata[dgl.ETYPE] = sub_rel
return sub_g
def preprocess(g, num_rels):
# Get train graph
train_g = get_subset_g(g, g.edata['train_mask'], num_rels)
# Get test graph
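# The evaluation graph also contains only training edges (made bidirectional);
# test triplets are used solely for scoring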
test_g = get_subset_g(g, g.edata['train_mask'], num_rels, bidirected=True)
test_g.edata['norm'] = dgl.norm_by_dst(test_g).unsqueeze(-1)
return train_g, test_g
class GlobalUniform:
def __init__(self, g, sample_size):
self.sample_size = sample_size
self.eids = np.arange(g.num_edges())
def sample(self):
return th.from_numpy(np.random.choice(self.eids, self.sample_size))
class NeighborExpand:
"""Sample a connected component by neighborhood expansion"""
def __init__(self, g, sample_size):
self.g = g
self.nids = np.arange(g.num_nodes())
self.sample_size = sample_size
def sample(self):
edges = th.zeros((self.sample_size), dtype=th.int64)
neighbor_counts = (self.g.in_degrees() + self.g.out_degrees()).numpy()
seen_edge = np.array([False] * self.g.num_edges())
seen_node = np.array([False] * self.g.num_nodes())
for i in range(self.sample_size):
if np.sum(seen_node) == 0:
node_weights = np.ones_like(neighbor_counts)
node_weights[np.where(neighbor_counts == 0)] = 0
else:
# Sample a visited node if applicable.
# This guarantees a connected component.
node_weights = neighbor_counts * seen_node
node_probs = node_weights / np.sum(node_weights)
chosen_node = np.random.choice(self.nids, p=node_probs)
# Sample a neighbor of the sampled node
u1, v1, eid1 = self.g.in_edges(chosen_node, form='all')
u2, v2, eid2 = self.g.out_edges(chosen_node, form='all')
u = th.cat([u1, u2])
v = th.cat([v1, v2])
eid = th.cat([eid1, eid2])
to_pick = True
while to_pick:
random_id = th.randint(high=eid.shape[0], size=(1,))
chosen_eid = eid[random_id]
to_pick = seen_edge[chosen_eid]
chosen_u = u[random_id]
chosen_v = v[random_id]
edges[i] = chosen_eid
seen_node[chosen_u] = True
seen_node[chosen_v] = True
seen_edge[chosen_eid] = True
neighbor_counts[chosen_u] -= 1
neighbor_counts[chosen_v] -= 1
return edges
class NegativeSampler:
def __init__(self, k=10):
self.k = k
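# for every positive triplet, generate k negatives by corrupting either the head or the tail entity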
def sample(self, pos_samples, num_nodes):
batch_size = len(pos_samples)
neg_batch_size = batch_size * self.k
neg_samples = np.tile(pos_samples, (self.k, 1))
values = np.random.randint(num_nodes, size=neg_batch_size)
choices = np.random.uniform(size=neg_batch_size)
subj = choices > 0.5
obj = choices <= 0.5
neg_samples[subj, 0] = values[subj]
neg_samples[obj, 2] = values[obj]
samples = np.concatenate((pos_samples, neg_samples))
# binary labels indicating positive and negative samples
labels = np.zeros(batch_size * (self.k + 1), dtype=np.float32)
labels[:batch_size] = 1
return th.from_numpy(samples), th.from_numpy(labels)
class SubgraphIterator:
def __init__(self, g, num_rels, pos_sampler, sample_size=30000, num_epochs=6000):
self.g = g
self.num_rels = num_rels
self.sample_size = sample_size
self.num_epochs = num_epochs
if pos_sampler == 'neighbor':
self.pos_sampler = NeighborExpand(g, sample_size)
else:
self.pos_sampler = GlobalUniform(g, sample_size)
self.neg_sampler = NegativeSampler()
def __len__(self):
return self.num_epochs
def __getitem__(self, i):
eids = self.pos_sampler.sample()
src, dst = self.g.find_edges(eids)
src, dst = src.numpy(), dst.numpy()
rel = self.g.edata[dgl.ETYPE][eids].numpy()
# relabel nodes to have consecutive node IDs
uniq_v, edges = np.unique((src, dst), return_inverse=True)
num_nodes = len(uniq_v)
# edges is the concatenation of src, dst with relabeled ID
src, dst = np.reshape(edges, (2, -1))
relabeled_data = np.stack((src, rel, dst)).transpose()
samples, labels = self.neg_sampler.sample(relabeled_data, num_nodes)
# Use only half of the sampled edges to build the training graph;
# all sampled edges (plus the negatives) still serve as supervision triplets
chosen_ids = np.random.choice(np.arange(self.sample_size),
size=int(self.sample_size / 2),
replace=False)
src = src[chosen_ids]
dst = dst[chosen_ids]
rel = rel[chosen_ids]
src, dst = np.concatenate((src, dst)), np.concatenate((dst, src))
rel = np.concatenate((rel, rel + self.num_rels))
sub_g = dgl.graph((src, dst), num_nodes=num_nodes)
sub_g.edata[dgl.ETYPE] = th.from_numpy(rel)
sub_g.edata['norm'] = dgl.norm_by_dst(sub_g).unsqueeze(-1)
uniq_v = th.from_numpy(uniq_v).view(-1, 1).long()
return sub_g, uniq_v, samples, labels
# Utility functions for evaluations (raw)
def perturb_and_get_raw_rank(emb, w, a, r, b, test_size, batch_size=100):
""" Perturb one element in the triplets"""
n_batch = (test_size + batch_size - 1) // batch_size
ranks = []
emb = emb.transpose(0, 1) # size D x V
w = w.transpose(0, 1) # size D x R
for idx in range(n_batch):
print("batch {} / {}".format(idx, n_batch))
batch_start = idx * batch_size
batch_end = (idx + 1) * batch_size
batch_a = a[batch_start: batch_end]
batch_r = r[batch_start: batch_end]
emb_ar = emb[:,batch_a] * w[:,batch_r] # size D x E
emb_ar = emb_ar.unsqueeze(2) # size D x E x 1
emb_c = emb.unsqueeze(1) # size D x 1 x V
# out-prod and reduce sum
out_prod = th.bmm(emb_ar, emb_c) # size D x E x V
score = th.sum(out_prod, dim=0).sigmoid() # size E x V
target = b[batch_start: batch_end]
_, indices = th.sort(score, dim=1, descending=True)
indices = th.nonzero(indices == target.view(-1, 1), as_tuple=False)
ranks.append(indices[:, 1].view(-1))
return th.cat(ranks)
# Utility functions for evaluations (filtered)
def filter(triplets_to_filter, target_s, target_r, target_o, num_nodes, filter_o=True):
"""Get candidate heads or tails to score"""
target_s, target_r, target_o = int(target_s), int(target_r), int(target_o)
# Add the ground truth node first
if filter_o:
candidate_nodes = [target_o]
else:
candidate_nodes = [target_s]
for e in range(num_nodes):
triplet = (target_s, target_r, e) if filter_o else (e, target_r, target_o)
# Do not consider a node if it leads to a real triplet
if triplet not in triplets_to_filter:
candidate_nodes.append(e)
return th.LongTensor(candidate_nodes)
def perturb_and_get_filtered_rank(emb, w, s, r, o, test_size, triplets_to_filter, filter_o=True):
"""Perturb subject or object in the triplets"""
num_nodes = emb.shape[0]
ranks = []
for idx in range(test_size):
if idx % 100 == 0:
print("test triplet {} / {}".format(idx, test_size))
target_s = s[idx]
target_r = r[idx]
target_o = o[idx]
candidate_nodes = filter(triplets_to_filter, target_s, target_r,
target_o, num_nodes, filter_o=filter_o)
if filter_o:
emb_s = emb[target_s]
emb_o = emb[candidate_nodes]
else:
emb_s = emb[candidate_nodes]
emb_o = emb[target_o]
target_idx = 0
emb_r = w[target_r]
emb_triplet = emb_s * emb_r * emb_o
scores = th.sigmoid(th.sum(emb_triplet, dim=1))
_, indices = th.sort(scores, descending=True)
rank = int((indices == target_idx).nonzero())
ranks.append(rank)
return th.LongTensor(ranks)
def _calc_mrr(emb, w, test_mask, triplets_to_filter, batch_size, filter=False):
with th.no_grad():
test_triplets = triplets_to_filter[test_mask]
s, r, o = test_triplets[:,0], test_triplets[:,1], test_triplets[:,2]
test_size = len(s)
if filter:
metric_name = 'MRR (filtered)'
triplets_to_filter = {tuple(triplet) for triplet in triplets_to_filter.tolist()}
ranks_s = perturb_and_get_filtered_rank(emb, w, s, r, o, test_size,
triplets_to_filter, filter_o=False)
ranks_o = perturb_and_get_filtered_rank(emb, w, s, r, o,
test_size, triplets_to_filter)
else:
metric_name = 'MRR (raw)'
ranks_s = perturb_and_get_raw_rank(emb, w, o, r, s, test_size, batch_size)
ranks_o = perturb_and_get_raw_rank(emb, w, s, r, o, test_size, batch_size)
ranks = th.cat([ranks_s, ranks_o])
ranks += 1 # change to 1-indexed
mrr = th.mean(1.0 / ranks.float()).item()
print("{}: {:.6f}".format(metric_name, mrr))
return mrr
# Main evaluation function
def calc_mrr(emb, w, test_mask, triplets, batch_size=100, eval_p="filtered"):
if eval_p == "filtered":
mrr = _calc_mrr(emb, w, test_mask, triplets, batch_size, filter=True)
else:
mrr = _calc_mrr(emb, w, test_mask, triplets, batch_size)
return mrr
from dgl import DGLGraph
import torch as th
import torch.nn as nn
import torch.nn.functional as F
import dgl
class BaseRGCN(nn.Module):
def __init__(self, num_nodes, h_dim, out_dim, num_rels, num_bases,
num_hidden_layers=1, dropout=0,
use_self_loop=False, use_cuda=False):
super(BaseRGCN, self).__init__()
self.num_nodes = num_nodes
self.h_dim = h_dim
self.out_dim = out_dim
self.num_rels = num_rels
self.num_bases = None if num_bases < 0 else num_bases
self.num_hidden_layers = num_hidden_layers
self.dropout = dropout
self.use_self_loop = use_self_loop
self.use_cuda = use_cuda
from dgl.nn.pytorch import RelGraphConv
# create rgcn layers
self.build_model()
class RGCN(nn.Module):
def __init__(self, in_dim, h_dim, out_dim, num_rels,
regularizer="basis", num_bases=-1, dropout=0.,
self_loop=False, link_pred=False):
super(RGCN, self).__init__()
def build_model(self):
self.layers = nn.ModuleList()
# i2h
i2h = self.build_input_layer()
if i2h is not None:
self.layers.append(i2h)
# h2h
for idx in range(self.num_hidden_layers):
h2h = self.build_hidden_layer(idx)
self.layers.append(h2h)
# h2o
h2o = self.build_output_layer()
if h2o is not None:
self.layers.append(h2o)
def build_input_layer(self):
return None
def build_hidden_layer(self, idx):
raise NotImplementedError
if link_pred:
self.emb = nn.Embedding(in_dim, h_dim)
in_dim = h_dim
else:
self.emb = None
self.layers.append(RelGraphConv(in_dim, h_dim, num_rels, regularizer,
num_bases, activation=F.relu, self_loop=self_loop,
dropout=dropout))
# For entity classification, dropout should not be applied to the output layer
if not link_pred:
dropout = 0.
self.layers.append(RelGraphConv(h_dim, out_dim, num_rels, regularizer,
num_bases, self_loop=self_loop, dropout=dropout))
def forward(self, g, h):
if isinstance(g, DGLGraph):
blocks = [g] * len(self.layers)
else:
blocks = g
def build_output_layer(self):
return None
if self.emb is not None:
h = self.emb(h.squeeze())
def forward(self, g, h, r, norm):
for layer in self.layers:
h = layer(g, h, r, norm)
for layer, block in zip(self.layers, blocks):
h = layer(block, h, block.edata[dgl.ETYPE], block.edata['norm'])
return h
def initializer(emb):
@@ -55,112 +46,42 @@ def initializer(emb):
return emb
class RelGraphEmbedLayer(nn.Module):
r"""Embedding layer for featureless heterograph.
"""Embedding layer for featureless heterograph.
Parameters
----------
storage_dev_id : int
The device to store the weights of the layer.
out_dev_id : int
Device to return the output embeddings on.
out_dev
Device to store the output embeddings
num_nodes : int
Number of nodes.
node_tides : tensor
Storing the node type id for each node starting from 0
num_of_ntype : int
Number of node types
input_size : list of int
A list of input feature size for each node type. If None, we then
treat certain input feature as an one-hot encoding feature.
Number of nodes in the graph.
embed_size : int
Output embed size
dgl_sparse : bool, optional
If true, use dgl.nn.NodeEmbedding otherwise use torch.nn.Embedding
"""
def __init__(self,
storage_dev_id,
out_dev_id,
out_dev,
num_nodes,
node_tids,
num_of_ntype,
input_size,
embed_size,
dgl_sparse=False):
embed_size):
super(RelGraphEmbedLayer, self).__init__()
self.storage_dev_id = th.device( \
storage_dev_id if storage_dev_id >= 0 else 'cpu')
self.out_dev_id = th.device(out_dev_id if out_dev_id >= 0 else 'cpu')
self.out_dev = out_dev
self.embed_size = embed_size
self.num_nodes = num_nodes
self.dgl_sparse = dgl_sparse
# create weight embeddings for each node for each relation
self.embeds = nn.ParameterDict()
self.node_embeds = {} if dgl_sparse else nn.ModuleDict()
self.num_of_ntype = num_of_ntype
for ntype in range(num_of_ntype):
if isinstance(input_size[ntype], int):
if dgl_sparse:
self.node_embeds[str(ntype)] = dgl.nn.NodeEmbedding(input_size[ntype], embed_size, name=str(ntype),
init_func=initializer, device=self.storage_dev_id)
else:
sparse_emb = th.nn.Embedding(input_size[ntype], embed_size, sparse=True)
sparse_emb.cuda(self.storage_dev_id)
nn.init.uniform_(sparse_emb.weight, -1.0, 1.0)
self.node_embeds[str(ntype)] = sparse_emb
else:
input_emb_size = input_size[ntype].shape[1]
embed = nn.Parameter(th.empty([input_emb_size, self.embed_size],
device=self.storage_dev_id))
nn.init.xavier_uniform_(embed)
self.embeds[str(ntype)] = embed
@property
def dgl_emb(self):
"""
"""
if self.dgl_sparse:
embs = [emb for emb in self.node_embeds.values()]
return embs
else:
return []
# create embeddings for all nodes
self.node_embed = nn.Embedding(num_nodes, embed_size, sparse=True)
nn.init.uniform_(self.node_embed.weight, -1.0, 1.0)
def forward(self, node_ids, node_tids, type_ids, features):
def forward(self, node_ids):
"""Forward computation
Parameters
----------
node_ids : tensor
node ids to generate embedding for.
node_ids : tensor
node type ids
features : list of features
list of initial features for nodes belong to different node type.
If None, the corresponding features is an one-hot encoding feature,
else use the features directly as input feature and matmul a
projection matrix.
Raw node IDs.
Returns
-------
tensor
embeddings as the input of the next layer
"""
embeds = th.empty(node_ids.shape[0], self.embed_size, device=self.out_dev_id)
# transfer input to the correct device
type_ids = type_ids.to(self.storage_dev_id)
node_tids = node_tids.to(self.storage_dev_id)
# build locs first
locs = [None for i in range(self.num_of_ntype)]
for ntype in range(self.num_of_ntype):
locs[ntype] = (node_tids == ntype).nonzero().squeeze(-1)
for ntype in range(self.num_of_ntype):
loc = locs[ntype]
if isinstance(features[ntype], int):
if self.dgl_sparse:
embeds[loc] = self.node_embeds[str(ntype)](type_ids[loc], self.out_dev_id)
else:
embeds[loc] = self.node_embeds[str(ntype)](type_ids[loc]).to(self.out_dev_id)
else:
embeds[loc] = features[ntype][type_ids[loc]].to(self.out_dev_id) @ self.embeds[str(ntype)].to(self.out_dev_id)
embeds = self.node_embed(node_ids).to(self.out_dev)
return embeds
"""
Utility functions for link prediction
Most code is adapted from authors' implementation of RGCN link prediction:
https://github.com/MichSchli/RelationPrediction
"""
import numpy as np
import torch
from torch.multiprocessing import Queue
import dgl
#######################################################################
#
# Utility function for building training and testing graphs
#
#######################################################################
def get_adj_and_degrees(num_nodes, triplets):
""" Get adjacency list and degrees of the graph
"""
adj_list = [[] for _ in range(num_nodes)]
for i,triplet in enumerate(triplets):
adj_list[triplet[0]].append([i, triplet[2]])
adj_list[triplet[2]].append([i, triplet[0]])
degrees = np.array([len(a) for a in adj_list])
adj_list = [np.array(a) for a in adj_list]
return adj_list, degrees
def sample_edge_neighborhood(adj_list, degrees, n_triplets, sample_size):
"""Sample edges by neighborhool expansion.
This guarantees that the sampled edges form a connected graph, which
may help deeper GNNs that require information from more than one hop.
"""
edges = np.zeros((sample_size), dtype=np.int32)
#initialize
sample_counts = np.array([d for d in degrees])
picked = np.array([False for _ in range(n_triplets)])
seen = np.array([False for _ in degrees])
for i in range(0, sample_size):
weights = sample_counts * seen
if np.sum(weights) == 0:
weights = np.ones_like(weights)
weights[np.where(sample_counts == 0)] = 0
probabilities = (weights) / np.sum(weights)
chosen_vertex = np.random.choice(np.arange(degrees.shape[0]),
p=probabilities)
chosen_adj_list = adj_list[chosen_vertex]
seen[chosen_vertex] = True
chosen_edge = np.random.choice(np.arange(chosen_adj_list.shape[0]))
chosen_edge = chosen_adj_list[chosen_edge]
edge_number = chosen_edge[0]
while picked[edge_number]:
chosen_edge = np.random.choice(np.arange(chosen_adj_list.shape[0]))
chosen_edge = chosen_adj_list[chosen_edge]
edge_number = chosen_edge[0]
edges[i] = edge_number
other_vertex = chosen_edge[1]
picked[edge_number] = True
sample_counts[chosen_vertex] -= 1
sample_counts[other_vertex] -= 1
seen[other_vertex] = True
return edges
def sample_edge_uniform(adj_list, degrees, n_triplets, sample_size):
"""Sample edges uniformly from all the edges."""
all_edges = np.arange(n_triplets)
return np.random.choice(all_edges, sample_size, replace=False)
def generate_sampled_graph_and_labels(triplets, sample_size, split_size,
num_rels, adj_list, degrees,
negative_rate, sampler="uniform"):
"""Get training graph and signals
First perform edge neighborhood sampling on graph, then perform negative
sampling to generate negative samples
"""
# perform edge neighbor sampling
if sampler == "uniform":
edges = sample_edge_uniform(adj_list, degrees, len(triplets), sample_size)
elif sampler == "neighbor":
edges = sample_edge_neighborhood(adj_list, degrees, len(triplets), sample_size)
else:
raise ValueError("Sampler type must be either 'uniform' or 'neighbor'.")
# relabel nodes to have consecutive node ids
edges = triplets[edges]
src, rel, dst = edges.transpose()
uniq_v, edges = np.unique((src, dst), return_inverse=True)
src, dst = np.reshape(edges, (2, -1))
relabeled_edges = np.stack((src, rel, dst)).transpose()
# negative sampling
samples, labels = negative_sampling(relabeled_edges, len(uniq_v),
negative_rate)
# further split the sampled edges: only a fraction (split_size) is used as
# graph structure, while the remaining edges serve as positive samples unseen by the GNN
split_size = int(sample_size * split_size)
graph_split_ids = np.random.choice(np.arange(sample_size),
size=split_size, replace=False)
src = src[graph_split_ids]
dst = dst[graph_split_ids]
rel = rel[graph_split_ids]
# build DGL graph
print("# sampled nodes: {}".format(len(uniq_v)))
print("# sampled edges: {}".format(len(src) * 2))
g, rel, norm = build_graph_from_triplets(len(uniq_v), num_rels,
(src, rel, dst))
return g, uniq_v, rel, norm, samples, labels
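# Illustrative sketch (not part of the original module): a typical training-loop call,
# assuming `train_data` is an (N, 3) int array of (src, rel, dst) triplets and
# `num_nodes` / `num_rels` are the entity and relation counts of the dataset.
#
# >>> adj_list, degrees = get_adj_and_degrees(num_nodes, train_data)
# >>> g, uniq_v, rel, norm, samples, labels = generate_sampled_graph_and_labels(
# ...     train_data, sample_size=30000, split_size=0.5, num_rels=num_rels,
# ...     adj_list=adj_list, degrees=degrees, negative_rate=10, sampler="uniform")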
def comp_deg_norm(g):
g = g.local_var()
in_deg = g.in_degrees(range(g.number_of_nodes())).float().numpy()
norm = 1.0 / in_deg
norm[np.isinf(norm)] = 0
return norm
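# Illustrative sketch (not part of the original module): for a graph with edges
# 0->1, 1->1, 1->2 the in-degrees are [0, 2, 1], so the per-node norm is roughly
# [0.0, 0.5, 1.0]; isolated nodes get 0 instead of inf.
#
# >>> g = dgl.graph(([0, 1, 1], [1, 1, 2]))
# >>> comp_deg_norm(g)
# array([0. , 0.5, 1. ], dtype=float32)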
def build_graph_from_triplets(num_nodes, num_rels, triplets):
""" Create a DGL graph. The graph is bidirectional because RGCN authors
use reversed relations.
This function also generates edge type and normalization factor
(reciprocal of node incoming degree)
"""
g = dgl.graph(([], []))
g.add_nodes(num_nodes)
src, rel, dst = triplets
src, dst = np.concatenate((src, dst)), np.concatenate((dst, src))
rel = np.concatenate((rel, rel + num_rels))
edges = sorted(zip(dst, src, rel))
dst, src, rel = np.array(edges).transpose()
g.add_edges(src, dst)
norm = comp_deg_norm(g)
print("# nodes: {}, # edges: {}".format(num_nodes, len(src)))
return g, rel.astype('int64'), norm.astype('float32')
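# Illustrative sketch (not part of the original module): three triplets over
# num_rels=2 relations yield 6 edges, since every (s, r, o) is also added as a
# reversed edge (o, r + num_rels, s).
#
# >>> triplets = (np.array([0, 1, 2]), np.array([0, 1, 0]), np.array([1, 2, 3]))
# >>> g, rel, norm = build_graph_from_triplets(4, 2, triplets)
# # nodes: 4, # edges: 6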
def build_test_graph(num_nodes, num_rels, edges):
src, rel, dst = edges.transpose()
print("Test graph:")
return build_graph_from_triplets(num_nodes, num_rels, (src, rel, dst))
def negative_sampling(pos_samples, num_entity, negative_rate):
size_of_batch = len(pos_samples)
num_to_generate = size_of_batch * negative_rate
neg_samples = np.tile(pos_samples, (negative_rate, 1))
labels = np.zeros(size_of_batch * (negative_rate + 1), dtype=np.float32)
labels[: size_of_batch] = 1
values = np.random.randint(num_entity, size=num_to_generate)
choices = np.random.uniform(size=num_to_generate)
subj = choices > 0.5
obj = choices <= 0.5
neg_samples[subj, 0] = values[subj]
neg_samples[obj, 2] = values[obj]
return np.concatenate((pos_samples, neg_samples)), labels
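# Illustrative sketch (not part of the original module): with negative_rate=10,
# each positive triplet yields 10 corrupted copies in which either the subject
# or the object (chosen with probability 0.5) is replaced by a random entity.
#
# >>> pos = relabeled_edges                                 # shape (E, 3)
# >>> samples, labels = negative_sampling(pos, num_entity=len(uniq_v), negative_rate=10)
# >>> samples.shape, labels.shape                           # ((11 * E, 3), (11 * E,))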
#######################################################################
#
# Utility functions for evaluations (raw)
#
#######################################################################
def sort_and_rank(score, target):
_, indices = torch.sort(score, dim=1, descending=True)
indices = torch.nonzero(indices == target.view(-1, 1), as_tuple=False)
indices = indices[:, 1].view(-1)
return indices
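# Illustrative sketch (not part of the original module): for scores
# [[0.1, 0.9, 0.3]] and target entity 2, the descending order is [1, 2, 0],
# so the 0-indexed rank of the target is 1.
#
# >>> sort_and_rank(torch.tensor([[0.1, 0.9, 0.3]]), torch.tensor([2]))
# tensor([1])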
def perturb_and_get_raw_rank(embedding, w, a, r, b, test_size, batch_size=100):
""" Perturb one element in the triplets
"""
n_batch = (test_size + batch_size - 1) // batch_size
ranks = []
for idx in range(n_batch):
print("batch {} / {}".format(idx, n_batch))
batch_start = idx * batch_size
batch_end = min(test_size, (idx + 1) * batch_size)
batch_a = a[batch_start: batch_end]
batch_r = r[batch_start: batch_end]
emb_ar = embedding[batch_a] * w[batch_r]
emb_ar = emb_ar.transpose(0, 1).unsqueeze(2) # size: D x E x 1
emb_c = embedding.transpose(0, 1).unsqueeze(1) # size: D x 1 x V
# out-prod and reduce sum
out_prod = torch.bmm(emb_ar, emb_c) # size D x E x V
score = torch.sum(out_prod, dim=0) # size E x V
score = torch.sigmoid(score)
target = b[batch_start: batch_end]
ranks.append(sort_and_rank(score, target))
return torch.cat(ranks)
# return MRR (raw), and Hits @ (1, 3, 10)
def calc_raw_mrr(embedding, w, test_triplets, hits=[], eval_bz=100):
with torch.no_grad():
s = test_triplets[:, 0]
r = test_triplets[:, 1]
o = test_triplets[:, 2]
test_size = test_triplets.shape[0]
# perturb subject
ranks_s = perturb_and_get_raw_rank(embedding, w, o, r, s, test_size, eval_bz)
# perturb object
ranks_o = perturb_and_get_raw_rank(embedding, w, s, r, o, test_size, eval_bz)
ranks = torch.cat([ranks_s, ranks_o])
ranks += 1 # change to 1-indexed
mrr = torch.mean(1.0 / ranks.float())
print("MRR (raw): {:.6f}".format(mrr.item()))
for hit in hits:
avg_count = torch.mean((ranks <= hit).float())
print("Hits (raw) @ {}: {:.6f}".format(hit, avg_count.item()))
return mrr.item()
#######################################################################
#
# Utility functions for evaluations (filtered)
#
#######################################################################
def filter_o(triplets_to_filter, target_s, target_r, target_o, num_entities):
target_s, target_r, target_o = int(target_s), int(target_r), int(target_o)
filtered_o = []
# Do not filter out the test triplet, since we want to predict on it
if (target_s, target_r, target_o) in triplets_to_filter:
triplets_to_filter.remove((target_s, target_r, target_o))
# Do not consider an object if it is part of a triplet to filter
for o in range(num_entities):
if (target_s, target_r, o) not in triplets_to_filter:
filtered_o.append(o)
return torch.LongTensor(filtered_o)
def filter_s(triplets_to_filter, target_s, target_r, target_o, num_entities):
target_s, target_r, target_o = int(target_s), int(target_r), int(target_o)
filtered_s = []
# Do not filter out the test triplet, since we want to predict on it
if (target_s, target_r, target_o) in triplets_to_filter:
triplets_to_filter.remove((target_s, target_r, target_o))
# Do not consider a subject if it is part of a triplet to filter
for s in range(num_entities):
if (s, target_r, target_o) not in triplets_to_filter:
filtered_s.append(s)
return torch.LongTensor(filtered_s)
def perturb_o_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter):
""" Perturb object in the triplets
"""
num_entities = embedding.shape[0]
ranks = []
for idx in range(test_size):
if idx % 100 == 0:
print("test triplet {} / {}".format(idx, test_size))
target_s = s[idx]
target_r = r[idx]
target_o = o[idx]
filtered_o = filter_o(triplets_to_filter, target_s, target_r, target_o, num_entities)
target_o_idx = int((filtered_o == target_o).nonzero())
emb_s = embedding[target_s]
emb_r = w[target_r]
emb_o = embedding[filtered_o]
emb_triplet = emb_s * emb_r * emb_o
scores = torch.sigmoid(torch.sum(emb_triplet, dim=1))
_, indices = torch.sort(scores, descending=True)
rank = int((indices == target_o_idx).nonzero())
ranks.append(rank)
return torch.LongTensor(ranks)
def perturb_s_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter):
""" Perturb subject in the triplets
"""
num_entities = embedding.shape[0]
ranks = []
for idx in range(test_size):
if idx % 100 == 0:
print("test triplet {} / {}".format(idx, test_size))
target_s = s[idx]
target_r = r[idx]
target_o = o[idx]
filtered_s = filter_s(triplets_to_filter, target_s, target_r, target_o, num_entities)
target_s_idx = int((filtered_s == target_s).nonzero())
emb_s = embedding[filtered_s]
emb_r = w[target_r]
emb_o = embedding[target_o]
emb_triplet = emb_s * emb_r * emb_o
scores = torch.sigmoid(torch.sum(emb_triplet, dim=1))
_, indices = torch.sort(scores, descending=True)
rank = int((indices == target_s_idx).nonzero())
ranks.append(rank)
return torch.LongTensor(ranks)
def calc_filtered_mrr(embedding, w, train_triplets, valid_triplets, test_triplets, hits=[]):
with torch.no_grad():
s = test_triplets[:, 0]
r = test_triplets[:, 1]
o = test_triplets[:, 2]
test_size = test_triplets.shape[0]
triplets_to_filter = torch.cat([train_triplets, valid_triplets, test_triplets]).tolist()
triplets_to_filter = {tuple(triplet) for triplet in triplets_to_filter}
print('Perturbing subject...')
ranks_s = perturb_s_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter)
print('Perturbing object...')
ranks_o = perturb_o_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter)
ranks = torch.cat([ranks_s, ranks_o])
ranks += 1 # change to 1-indexed
mrr = torch.mean(1.0 / ranks.float())
print("MRR (filtered): {:.6f}".format(mrr.item()))
for hit in hits:
avg_count = torch.mean((ranks <= hit).float())
print("Hits (filtered) @ {}: {:.6f}".format(hit, avg_count.item()))
return mrr.item()
#######################################################################
#
# Main evaluation function
#
#######################################################################
def calc_mrr(embedding, w, train_triplets, valid_triplets, test_triplets, hits=[], eval_bz=100, eval_p="filtered"):
if eval_p == "filtered":
mrr = calc_filtered_mrr(embedding, w, train_triplets, valid_triplets, test_triplets, hits)
else:
mrr = calc_raw_mrr(embedding, w, test_triplets, hits, eval_bz)
return mrr
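# Illustrative sketch (not part of the original module): evaluating a trained model,
# assuming `embed` holds the final entity embeddings and `w` the DistMult relation
# weights (both torch tensors), with the triplet splits given as (N, 3) LongTensors.
#
# >>> mrr = calc_mrr(embed, w, train_triplets, valid_triplets, test_triplets,
# ...                hits=[1, 3, 10], eval_bz=500, eval_p="filtered")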
......@@ -1178,7 +1178,7 @@ def count_nonzero(input):
# DGL should contain all the operations on index, so this set of operators
# should be gradually removed.
def unique(input, return_inverse=False):
def unique(input, return_inverse=False, return_counts=False):
"""Returns the unique scalar elements in a tensor.
Parameters
......@@ -1188,13 +1188,19 @@ def unique(input, return_inverse=False):
return_inverse : bool, optional
Whether to also return the indices for where elements in the original
input ended up in the returned unique list.
return_counts : bool, optional
Whether to also return the counts for each unique element.
Returns
-------
Tensor
A 1-D tensor containing unique elements.
Tensor
Tensor, optional
A 1-D tensor containing the new positions of the elements in the input.
It is returned if return_inverse is True.
Tensor, optional
A 1-D tensor containing the number of occurrences of each unique element.
It is returned if return_counts is True.
"""
pass
......
......@@ -353,14 +353,20 @@ def count_nonzero(input):
tmp = input.asnumpy()
return np.count_nonzero(tmp)
def unique(input, return_inverse=False):
def unique(input, return_inverse=False, return_counts=False):
# TODO: fallback to numpy is unfortunate
tmp = input.asnumpy()
if return_inverse:
tmp, inv = np.unique(tmp, return_inverse=True)
if return_inverse and return_counts:
tmp, inv, count = np.unique(tmp, return_inverse=True, return_counts=True)
tmp = nd.array(tmp, ctx=input.context, dtype=input.dtype)
inv = nd.array(inv, ctx=input.context)
return tmp, inv
count = nd.array(count, ctx=input.context)
return tmp, inv, count
elif return_inverse or return_counts:
tmp, tmp2 = np.unique(tmp, return_inverse=return_inverse, return_counts=return_counts)
tmp = nd.array(tmp, ctx=input.context, dtype=input.dtype)
tmp2 = nd.array(tmp2, ctx=input.context)
return tmp, tmp2
else:
tmp = np.unique(tmp)
return nd.array(tmp, ctx=input.context, dtype=input.dtype)
......
......@@ -301,10 +301,10 @@ def count_nonzero(input):
# TODO: fallback to numpy for backward compatibility
return np.count_nonzero(input)
def unique(input, return_inverse=False):
def unique(input, return_inverse=False, return_counts=False):
if input.dtype == th.bool:
input = input.type(th.int8)
return th.unique(input, return_inverse=return_inverse)
return th.unique(input, return_inverse=return_inverse, return_counts=return_counts)
def full_1d(length, fill_value, dtype, ctx):
return th.full((length,), fill_value, dtype=dtype, device=ctx)
......
......@@ -425,8 +425,13 @@ def count_nonzero(input):
return int(tf.math.count_nonzero(input))
def unique(input, return_inverse=False):
if return_inverse:
def unique(input, return_inverse=False, return_counts=False):
if return_inverse and return_counts:
return tf.unique_with_counts(input)
elif return_counts:
result = tf.unique_with_counts(input)
return result.y, result.count
elif return_inverse:
return tf.unique(input)
else:
return tf.unique(input).y
......
......@@ -352,7 +352,7 @@ class FB15k237Dataset(KnowledgeGraphDataset):
>>> graph = dataset[0]
>>> train_mask = graph.edata['train_mask']
>>> train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
>>> src, dst = graph.edges(train_idx)
>>> src, dst = graph.find_edges(train_idx)
>>> rel = graph.edata['etype'][train_idx]
- ``valid`` is deprecated, it is replaced by:
......@@ -361,7 +361,7 @@ class FB15k237Dataset(KnowledgeGraphDataset):
>>> graph = dataset[0]
>>> val_mask = graph.edata['val_mask']
>>> val_idx = th.nonzero(val_mask, as_tuple=False).squeeze()
>>> src, dst = graph.edges(val_idx)
>>> src, dst = graph.find_edges(val_idx)
>>> rel = graph.edata['etype'][val_idx]
- ``test`` is deprecated, it is replaced by:
......@@ -370,7 +370,7 @@ class FB15k237Dataset(KnowledgeGraphDataset):
>>> graph = dataset[0]
>>> test_mask = graph.edata['test_mask']
>>> test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
>>> src, dst = graph.edges(test_idx)
>>> src, dst = graph.find_edges(test_idx)
>>> rel = graph.edata['etype'][test_idx]
FB15k-237 is a subset of FB15k where inverse
......
......@@ -69,7 +69,8 @@ __all__ = [
'as_heterograph',
'adj_product_graph',
'adj_sum_graph',
'reorder_graph'
'reorder_graph',
'norm_by_dst'
]
......@@ -3213,4 +3214,42 @@ def rcmk_perm(g):
perm = sparse.csgraph.reverse_cuthill_mckee(csr_adj)
return perm.copy()
def norm_by_dst(g, etype=None):
r"""Calculate normalization coefficient per edge based on destination node degree.
Parameters
----------
g : DGLGraph
The input graph.
etype : str or (str, str, str), optional
The type of the edges to calculate. The allowed edge type formats are:
* ``(str, str, str)`` for source node type, edge type and destination node type.
* or one ``str`` edge type name if the name can uniquely identify a
triplet format in the graph.
It can be omitted if the graph has a single edge type.
Returns
-------
1D Tensor
The normalization coefficient of the edges.
Examples
--------
>>> import dgl
>>> g = dgl.graph(([0, 1, 1], [1, 1, 2]))
>>> print(dgl.norm_by_dst(g))
tensor([0.5000, 0.5000, 1.0000])
"""
_, v, _ = g.edges(form='all', etype=etype)
_, inv_index, count = F.unique(v, return_inverse=True, return_counts=True)
deg = F.astype(count[inv_index], F.float32)
norm = 1. / deg
norm = F.replace_inf_with_zero(norm)
return norm
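# Illustrative sketch (not part of this patch): on a heterogeneous graph the edge
# type can be passed explicitly; the node and edge type names below are made up.
#
# >>> hg = dgl.heterograph({('user', 'follows', 'user'): ([0, 1, 1], [1, 1, 2])})
# >>> dgl.norm_by_dst(hg, etype='follows')
# tensor([0.5000, 0.5000, 1.0000])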
_init_api("dgl.transform", __name__)