Unverified Commit e9c3c0e8 authored by Mufei Li, committed by GitHub

[Model] Simplify RGCN



Co-authored-by: Ubuntu <ubuntu@ip-172-31-6-240.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-57-123.us-west-2.compute.internal>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
parent 22272de6
@@ -162,3 +162,5 @@ config.cmake
.ycm_extra_conf.py
**.png
# model file
*.pth
# Relational-GCN
* Paper: [https://arxiv.org/abs/1703.06103](https://arxiv.org/abs/1703.06103)
* Paper: [Modeling Relational Data with Graph Convolutional Networks](https://arxiv.org/abs/1703.06103)
* Author's code for entity classification: [https://github.com/tkipf/relational-gcn](https://github.com/tkipf/relational-gcn)
* Author's code for link prediction: [https://github.com/MichSchli/RelationPrediction](https://github.com/MichSchli/RelationPrediction)
### Dependencies
* PyTorch 0.4.1+
* requests
* PyTorch 1.10
* rdflib
* pandas
* tqdm
* TorchMetrics
```
pip install requests torch rdflib pandas
pip install rdflib pandas
```
Example code was tested with rdflib 4.2.2 and pandas 0.23.4
@@ -19,66 +20,55 @@ Example code was tested with rdflib 4.2.2 and pandas 0.23.4
### Entity Classification
AIFB: accuracy 96.29% (3 runs, DGL), 95.83% (paper)
```
python3 entity_classify.py -d aifb --testing --gpu 0
python entity.py -d aifb --l2norm 0 --gpu 0
```
MUTAG: accuracy 70.59% (3 runs, DGL), 73.23% (paper)
MUTAG: accuracy 72.55% (3 runs, DGL), 73.23% (paper)
```
python3 entity_classify.py -d mutag --l2norm 5e-4 --n-bases 30 --testing --gpu 0
python entity.py -d mutag --n-bases 30 --gpu 0
```
BGS: accuracy 93.10% (3 runs, DGL), 83.10% (paper)
BGS: accuracy 89.70% (3 runs, DGL), 83.10% (paper)
```
python3 entity_classify.py -d bgs --l2norm 5e-4 --n-bases 40 --testing --gpu 0
python entity.py -d bgs --n-bases 40 --gpu 0
```
AM: accuracy 89.22% (3 runs, DGL), 89.29% (paper)
AM: accuracy 89.56% (3 runs, DGL), 89.29% (paper)
```
python3 entity_classify.py -d am --n-bases=40 --n-hidden=10 --l2norm=5e-4 --testing
python entity.py -d am --n-bases 40 --n-hidden 10
```
### Entity Classification with minibatch
AIFB: accuracy avg(5 runs) 90.00%, best 94.44% (DGL)
```
python3 entity_classify_mp.py -d aifb --testing --gpu 0 --fanout='20,20' --batch-size 128
```
MUTAG: accuracy avg(10 runs) 62.94%, best 72.06% (DGL)
```
python3 entity_classify_mp.py -d mutag --l2norm 5e-4 --n-bases 30 --testing --gpu 0 --batch-size 64 --fanout "-1, -1" --use-self-loop --dgl-sparse --n-epochs 20 --sparse-lr 0.01 --dropout 0.5
```
BGS: accuracy avg(5 runs) 78.62%, best 86.21% (DGL)
AIFB: accuracy avg(5 runs) 91.10%, best 97.22% (DGL)
```
python3 entity_classify_mp.py -d bgs --l2norm 5e-4 --n-bases 40 --testing --gpu 0 --fanout "-1, -1" --n-epochs=16 --batch-size=16 --dgl-sparse --lr 0.01 --sparse-lr 0.05 --dropout 0.3
python entity_sample.py -d aifb --l2norm 0 --gpu 0 --fanout='20,20' --batch-size 128
```
AM: accuracy avg(5 runs) 87.37%, best 89.9% (DGL)
MUTAG: accuracy avg(10 runs) 66.47%, best 72.06% (DGL)
```
python3 entity_classify_mp.py -d am --l2norm 5e-4 --n-bases 40 --testing --gpu 0 --fanout '35,35' --batch-size 64 --n-hidden 16 --use-self-loop --n-epochs=20 --dgl-sparse --lr 0.01 --sparse-lr 0.02 --dropout 0.7
python entity_sample.py -d mutag --n-bases 30 --gpu 0 --batch-size 64 --fanout "-1, -1" --use-self-loop --n-epochs 20 --sparse-lr 0.01 --dropout 0.5
```
### Entity Classification on OGBN-MAG
Test-bd: P3-8xlarge
OGBN-MAG accuracy 45.5 (3 runs)
BGS: accuracy avg(5 runs) 84.83%, best 89.66% (DGL)
```
python3 entity_classify_mp.py -d ogbn-mag --testing --fanout='30,30' --batch-size 1024 --n-hidden 128 --lr 0.01 --num-worker 4 --eval-batch-size 8 --low-mem --gpu 0,1,2,3 --dropout 0.7 --use-self-loop --n-bases 2 --n-epochs 3 --node-feats --dgl-sparse --sparse-lr 0.08
python entity_sample.py -d bgs --n-bases 40 --gpu 0 --fanout "-1, -1" --n-epochs=16 --batch-size=16 --sparse-lr 0.05 --dropout 0.3
```
OGBN-MAG without node-feats 42.79
AM: accuracy avg(5 runs) 88.58%, best 89.90% (DGL)
```
python3 entity_classify_mp.py -d ogbn-mag --testing --fanout='30,30' --batch-size 1024 --n-hidden 128 --lr 0.01 --num-worker 4 --eval-batch-size 8 --low-mem --gpu 0,1,2,3 --dropout 0.7 --use-self-loop --n-bases 2 --n-epochs 3 --dgl-sparse --sparse-lr 0.08
python entity_sample.py -d am --n-bases 40 --gpu 0 --fanout '35,35' --batch-size 64 --n-hidden 16 --use-self-loop --n-epochs=20 --sparse-lr 0.02 --dropout 0.7
```
Test-bd: P2-8xlarge
To use multiple GPUs, replace `entity_sample.py` with `entity_sample_multi_gpu.py` and specify
multiple GPU IDs separated by comma, e.g., `--gpu 0,1`.
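For example, the AM run above would become the following (the flags are assumed to carry over unchanged from `entity_sample.py`; `entity_sample_multi_gpu.py` accepts the same options):
```
python entity_sample_multi_gpu.py -d am --n-bases 40 --gpu 0,1 --fanout '35,35' --batch-size 64 --n-hidden 16 --use-self-loop --n-epochs=20 --sparse-lr 0.02 --dropout 0.7
```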
### Link Prediction
FB15k-237: MRR 0.151 (DGL), 0.158 (paper)
FB15k-237: MRR 0.163 (DGL), 0.158 (paper)
```
python3 link_predict.py -d FB15k-237 --gpu 0 --eval-protocol raw
python link.py --gpu 0 --eval-protocol raw
```
FB15k-237: Filtered-MRR 0.2044
FB15k-237: Filtered-MRR 0.247
```
python3 link_predict.py -d FB15k-237 --gpu 0 --eval-protocol filtered
python link.py --gpu 0 --eval-protocol filtered
```
"""
Differences compared to tkipf/relational-gcn
* l2norm applied to all weights
* remove nodes that won't be touched
"""
import argparse
import torch as th
import torch.nn.functional as F
from torchmetrics.functional import accuracy
from entity_utils import load_data
from model import RGCN
def main(args):
g, num_rels, num_classes, labels, train_idx, test_idx, target_idx = load_data(
args.dataset, get_norm=True)
num_nodes = g.num_nodes()
# Since the nodes are featureless, learn node embeddings from scratch
# This requires passing the node IDs to the model.
feats = th.arange(num_nodes)
model = RGCN(num_nodes,
args.n_hidden,
num_classes,
num_rels,
num_bases=args.n_bases)
if args.gpu >= 0 and th.cuda.is_available():
device = th.device(args.gpu)
else:
device = th.device('cpu')
feats = feats.to(device)
labels = labels.to(device)
model = model.to(device)
g = g.to(device)
optimizer = th.optim.Adam(model.parameters(), lr=1e-2, weight_decay=args.l2norm)
model.train()
for epoch in range(50):
logits = model(g, feats)
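# target_idx holds the homogeneous-graph IDs of the target-category nodes,
# so this keeps only the predictions we have labels for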
logits = logits[target_idx]
loss = F.cross_entropy(logits[train_idx], labels[train_idx])
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_acc = accuracy(logits[train_idx].argmax(dim=1), labels[train_idx]).item()
print("Epoch {:05d} | Train Accuracy: {:.4f} | Train Loss: {:.4f}".format(
epoch, train_acc, loss.item()))
print()
model.eval()
with th.no_grad():
logits = model(g, feats)
logits = logits[target_idx]
test_acc = accuracy(logits[test_idx].argmax(dim=1), labels[test_idx]).item()
print("Test Accuracy: {:.4f}".format(test_acc))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN for entity classification')
parser.add_argument("--n-hidden", type=int, default=16,
help="number of hidden units")
parser.add_argument("--gpu", type=int, default=-1,
help="gpu")
parser.add_argument("--n-bases", type=int, default=-1,
help="number of filter weight matrices, default: -1 [use all]")
parser.add_argument("-d", "--dataset", type=str, required=True,
choices=['aifb', 'mutag', 'bgs', 'am'],
help="dataset to use")
parser.add_argument("--l2norm", type=float, default=5e-4,
help="l2 norm coef")
args = parser.parse_args()
print(args)
main(args)
"""
Modeling Relational Data with Graph Convolutional Networks
Paper: https://arxiv.org/abs/1703.06103
Code: https://github.com/tkipf/relational-gcn
Differences compared to tkipf/relational-gcn
* l2norm applied to all weights
* remove nodes that won't be touched
"""
import argparse
import numpy as np
import time
import torch
import torch.nn.functional as F
import dgl
from dgl.nn.pytorch import RelGraphConv
from functools import partial
from dgl.data.rdf import AIFBDataset, MUTAGDataset, BGSDataset, AMDataset
from model import BaseRGCN
class EntityClassify(BaseRGCN):
def create_features(self):
features = torch.arange(self.num_nodes)
if self.use_cuda:
features = features.cuda()
return features
def build_input_layer(self):
return RelGraphConv(self.num_nodes, self.h_dim, self.num_rels, "basis",
self.num_bases, activation=F.relu, self_loop=self.use_self_loop,
dropout=self.dropout)
def build_hidden_layer(self, idx):
return RelGraphConv(self.h_dim, self.h_dim, self.num_rels, "basis",
self.num_bases, activation=F.relu, self_loop=self.use_self_loop,
dropout=self.dropout)
def build_output_layer(self):
return RelGraphConv(self.h_dim, self.out_dim, self.num_rels, "basis",
self.num_bases, activation=None,
self_loop=self.use_self_loop)
def main(args):
# load graph data
if args.dataset == 'aifb':
dataset = AIFBDataset()
elif args.dataset == 'mutag':
dataset = MUTAGDataset()
elif args.dataset == 'bgs':
dataset = BGSDataset()
elif args.dataset == 'am':
dataset = AMDataset()
else:
raise ValueError()
# Load from hetero-graph
hg = dataset[0]
num_rels = len(hg.canonical_etypes)
category = dataset.predict_category
num_classes = dataset.num_classes
train_mask = hg.nodes[category].data.pop('train_mask')
test_mask = hg.nodes[category].data.pop('test_mask')
train_idx = torch.nonzero(train_mask, as_tuple=False).squeeze()
test_idx = torch.nonzero(test_mask, as_tuple=False).squeeze()
labels = hg.nodes[category].data.pop('labels')
# split dataset into train, validate, test
if args.validation:
val_idx = train_idx[:len(train_idx) // 5]
train_idx = train_idx[len(train_idx) // 5:]
else:
val_idx = train_idx
# calculate norm for each edge type and store in edge
for canonical_etype in hg.canonical_etypes:
u, v, eid = hg.all_edges(form='all', etype=canonical_etype)
_, inverse_index, count = torch.unique(v, return_inverse=True, return_counts=True)
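# count[inverse_index] expands the per-destination in-degree back to one entry per edge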
degrees = count[inverse_index]
norm = torch.ones(eid.shape[0]).float() / degrees.float()
norm = norm.unsqueeze(1)
hg.edges[canonical_etype].data['norm'] = norm
# get target category id
category_id = len(hg.ntypes)
for i, ntype in enumerate(hg.ntypes):
if ntype == category:
category_id = i
g = dgl.to_homogeneous(hg, edata=['norm'])
num_nodes = g.number_of_nodes()
node_ids = torch.arange(num_nodes)
edge_norm = g.edata['norm']
edge_type = g.edata[dgl.ETYPE].long()
# find out the target node ids in g
node_tids = g.ndata[dgl.NTYPE]
loc = (node_tids == category_id)
target_idx = node_ids[loc]
# since the nodes are featureless, the input feature is then the node id.
feats = torch.arange(num_nodes)
# check cuda
use_cuda = args.gpu >= 0 and torch.cuda.is_available()
if use_cuda:
torch.cuda.set_device(args.gpu)
feats = feats.cuda()
edge_type = edge_type.cuda()
edge_norm = edge_norm.cuda()
labels = labels.cuda()
# create model
model = EntityClassify(num_nodes,
args.n_hidden,
num_classes,
num_rels,
num_bases=args.n_bases,
num_hidden_layers=args.n_layers - 2,
dropout=args.dropout,
use_self_loop=args.use_self_loop,
use_cuda=use_cuda)
if use_cuda:
model.cuda()
g = g.to('cuda:%d' % args.gpu)
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.l2norm)
# training loop
print("start training...")
forward_time = []
backward_time = []
model.train()
for epoch in range(args.n_epochs):
optimizer.zero_grad()
t0 = time.time()
logits = model(g, feats, edge_type, edge_norm)
logits = logits[target_idx]
loss = F.cross_entropy(logits[train_idx], labels[train_idx])
t1 = time.time()
loss.backward()
optimizer.step()
t2 = time.time()
forward_time.append(t1 - t0)
backward_time.append(t2 - t1)
print("Epoch {:05d} | Train Forward Time(s) {:.4f} | Backward Time(s) {:.4f}".
format(epoch, forward_time[-1], backward_time[-1]))
train_acc = torch.sum(logits[train_idx].argmax(dim=1) == labels[train_idx]).item() / len(train_idx)
val_loss = F.cross_entropy(logits[val_idx], labels[val_idx])
val_acc = torch.sum(logits[val_idx].argmax(dim=1) == labels[val_idx]).item() / len(val_idx)
print("Train Accuracy: {:.4f} | Train Loss: {:.4f} | Validation Accuracy: {:.4f} | Validation loss: {:.4f}".
format(train_acc, loss.item(), val_acc, val_loss.item()))
print()
model.eval()
logits = model.forward(g, feats, edge_type, edge_norm)
logits = logits[target_idx]
test_loss = F.cross_entropy(logits[test_idx], labels[test_idx])
test_acc = torch.sum(logits[test_idx].argmax(dim=1) == labels[test_idx]).item() / len(test_idx)
print("Test Accuracy: {:.4f} | Test loss: {:.4f}".format(test_acc, test_loss.item()))
print()
print("Mean forward time: {:4f}".format(np.mean(forward_time[len(forward_time) // 4:])))
print("Mean backward time: {:4f}".format(np.mean(backward_time[len(backward_time) // 4:])))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN')
parser.add_argument("--dropout", type=float, default=0,
help="dropout probability")
parser.add_argument("--n-hidden", type=int, default=16,
help="number of hidden units")
parser.add_argument("--gpu", type=int, default=-1,
help="gpu")
parser.add_argument("--lr", type=float, default=1e-2,
help="learning rate")
parser.add_argument("--n-bases", type=int, default=-1,
help="number of filter weight matrices, default: -1 [use all]")
parser.add_argument("--n-layers", type=int, default=2,
help="number of propagation rounds")
parser.add_argument("-e", "--n-epochs", type=int, default=50,
help="number of training epochs")
parser.add_argument("-d", "--dataset", type=str, required=True,
help="dataset to use")
parser.add_argument("--l2norm", type=float, default=0,
help="l2 norm coef")
parser.add_argument("--use-self-loop", default=False, action='store_true',
help="include self feature as a special relation")
fp = parser.add_mutually_exclusive_group(required=False)
fp.add_argument('--validation', dest='validation', action='store_true')
fp.add_argument('--testing', dest='validation', action='store_false')
parser.set_defaults(validation=True)
args = parser.parse_args()
print(args)
main(args)
"""
Differences compared to tkipf/relational-gcn
* l2norm applied to all weights
* remove nodes that won't be touched
"""
import argparse
import torch as th
import torch.nn.functional as F
import dgl
from dgl.dataloading import MultiLayerNeighborSampler, NodeDataLoader
from torchmetrics.functional import accuracy
from tqdm import tqdm
from entity_utils import load_data
from model import RelGraphEmbedLayer, RGCN
def init_dataloaders(args, g, train_idx, test_idx, target_idx, device, use_ddp=False):
fanouts = [int(fanout) for fanout in args.fanout.split(',')]
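# e.g. --fanout '20,20' samples 20 neighbors per node at each of the two layers;
# a fanout of -1 keeps all neighbors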
sampler = MultiLayerNeighborSampler(fanouts)
train_loader = NodeDataLoader(
g,
target_idx[train_idx],
sampler,
use_ddp=use_ddp,
device=device,
batch_size=args.batch_size,
shuffle=True,
drop_last=False)
# The datasets do not have a validation subset, use the train subset
val_loader = NodeDataLoader(
g,
target_idx[train_idx],
sampler,
use_ddp=use_ddp,
device=device,
batch_size=args.batch_size,
shuffle=False,
drop_last=False)
# -1 for sampling all neighbors
test_sampler = MultiLayerNeighborSampler([-1] * len(fanouts))
test_loader = NodeDataLoader(
g,
target_idx[test_idx],
test_sampler,
use_ddp=use_ddp,
device=device,
batch_size=32,
shuffle=False,
drop_last=False)
return train_loader, val_loader, test_loader
def init_models(args, device, num_nodes, num_classes, num_rels):
embed_layer = RelGraphEmbedLayer(device,
num_nodes,
args.n_hidden)
model = RGCN(args.n_hidden,
args.n_hidden,
num_classes,
num_rels,
num_bases=args.n_bases,
dropout=args.dropout,
self_loop=args.use_self_loop)
return embed_layer, model
def process_batch(inv_target, batch):
_, seeds, blocks = batch
# map the seed nodes back to their type-specific ids,
# in order to get the target node labels
seeds = inv_target[seeds]
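# compute the edge normalization (1 / in-degree of the destination node)
# on the fly for every sampled block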
for blc in blocks:
blc.edata['norm'] = dgl.norm_by_dst(blc).unsqueeze(1)
return seeds, blocks
def train(model, embed_layer, train_loader, inv_target,
labels, emb_optimizer, optimizer):
model.train()
embed_layer.train()
for sample_data in train_loader:
seeds, blocks = process_batch(inv_target, sample_data)
feats = embed_layer(blocks[0].srcdata[dgl.NID].cpu())
logits = model(blocks, feats)
loss = F.cross_entropy(logits, labels[seeds])
emb_optimizer.zero_grad()
optimizer.zero_grad()
loss.backward()
emb_optimizer.step()
optimizer.step()
train_acc = accuracy(logits.argmax(dim=1), labels[seeds]).item()
return train_acc, loss.item()
def evaluate(model, embed_layer, eval_loader, inv_target):
model.eval()
embed_layer.eval()
eval_logits = []
eval_seeds = []
with th.no_grad():
for sample_data in tqdm(eval_loader):
seeds, blocks = process_batch(inv_target, sample_data)
feats = embed_layer(blocks[0].srcdata[dgl.NID].cpu())
logits = model(blocks, feats)
eval_logits.append(logits.cpu().detach())
eval_seeds.append(seeds.cpu().detach())
eval_logits = th.cat(eval_logits)
eval_seeds = th.cat(eval_seeds)
return eval_logits, eval_seeds
def main(args):
g, num_rels, num_classes, labels, train_idx, test_idx, target_idx, inv_target = load_data(
args.dataset, inv_target=True)
if args.gpu >= 0 and th.cuda.is_available():
device = th.device(args.gpu)
else:
device = th.device('cpu')
train_loader, val_loader, test_loader = init_dataloaders(
args, g, train_idx, test_idx, target_idx, args.gpu)
embed_layer, model = init_models(args, device, g.num_nodes(), num_classes, num_rels)
labels = labels.to(device)
model = model.to(device)
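# the node embeddings are trained with a sparse optimizer; the RGCN weights
# use Adam, whose weight decay provides the L2 regularization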
emb_optimizer = th.optim.SparseAdam(embed_layer.parameters(), lr=args.sparse_lr)
optimizer = th.optim.Adam(model.parameters(), lr=1e-2, weight_decay=args.l2norm)
for epoch in range(args.n_epochs):
train_acc, loss = train(model, embed_layer, train_loader, inv_target,
labels, emb_optimizer, optimizer)
print("Epoch {:05d}/{:05d} | Train Accuracy: {:.4f} | Train Loss: {:.4f}".format(
epoch, args.n_epochs, train_acc, loss))
val_logits, val_seeds = evaluate(model, embed_layer, val_loader, inv_target)
val_acc = accuracy(val_logits.argmax(dim=1), labels[val_seeds].cpu()).item()
print("Validation Accuracy: {:.4f}".format(val_acc))
test_logits, test_seeds = evaluate(model, embed_layer,
test_loader, inv_target)
test_acc = accuracy(test_logits.argmax(dim=1), labels[test_seeds].cpu()).item()
print("Final Test Accuracy: {:.4f}".format(test_acc))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN for entity classification with sampling')
parser.add_argument("--dropout", type=float, default=0,
help="dropout probability")
parser.add_argument("--n-hidden", type=int, default=16,
help="number of hidden units")
parser.add_argument("--gpu", type=int, default=0,
help="gpu")
parser.add_argument("--sparse-lr", type=float, default=2e-2,
help="sparse embedding learning rate")
parser.add_argument("--n-bases", type=int, default=-1,
help="number of filter weight matrices, default: -1 [use all]")
parser.add_argument("--n-epochs", type=int, default=50,
help="number of training epochs")
parser.add_argument("-d", "--dataset", type=str, required=True,
choices=['aifb', 'mutag', 'bgs', 'am'],
help="dataset to use")
parser.add_argument("--l2norm", type=float, default=5e-4,
help="l2 norm coef")
parser.add_argument("--fanout", type=str, default="4, 4",
help="Fan-out of neighbor sampling")
parser.add_argument("--use-self-loop", default=False, action='store_true',
help="include self feature as a special relation")
parser.add_argument("--batch-size", type=int, default=100,
help="Mini-batch size")
args = parser.parse_args()
print(args)
main(args)
"""
Differences compared to tkipf/relational-gcn
* l2norm applied to all weights
* remove nodes that won't be touched
"""
import argparse
import gc
import torch as th
import torch.nn.functional as F
import dgl.multiprocessing as mp
import dgl
from torchmetrics.functional import accuracy
from torch.nn.parallel import DistributedDataParallel
from entity_utils import load_data
from entity_sample import init_dataloaders, init_models, train, evaluate
def collect_eval(n_gpus, queue, labels):
eval_logits = []
eval_seeds = []
for _ in range(n_gpus):
eval_l, eval_s = queue.get()
eval_logits.append(eval_l)
eval_seeds.append(eval_s)
eval_logits = th.cat(eval_logits)
eval_seeds = th.cat(eval_seeds)
eval_acc = accuracy(eval_logits.argmax(dim=1), labels[eval_seeds].cpu()).item()
return eval_acc
def run(proc_id, n_gpus, n_cpus, args, devices, dataset, queue=None):
dev_id = devices[proc_id]
g, num_classes, num_rels, target_idx, inv_target, train_idx,\
test_idx, labels = dataset
dist_init_method = 'tcp://{master_ip}:{master_port}'.format(
master_ip='127.0.0.1', master_port='12345')
backend = 'gloo'
if proc_id == 0:
print("backend using {}".format(backend))
th.distributed.init_process_group(backend=backend,
init_method=dist_init_method,
world_size=n_gpus,
rank=proc_id)
device = th.device(dev_id)
use_ddp = True if n_gpus > 1 else False
train_loader, val_loader, test_loader = init_dataloaders(
args, g, train_idx, test_idx, target_idx, dev_id, use_ddp=use_ddp)
embed_layer, model = init_models(args, device, g.num_nodes(), num_classes, num_rels)
labels = labels.to(device)
model = model.to(device)
model = DistributedDataParallel(model, device_ids=[dev_id], output_device=dev_id)
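# the node embedding table stays on CPU, so its DDP wrapper is created without device ids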
embed_layer = DistributedDataParallel(embed_layer, device_ids=None, output_device=None)
emb_optimizer = th.optim.SparseAdam(embed_layer.module.parameters(), lr=args.sparse_lr)
optimizer = th.optim.Adam(model.parameters(), lr=1e-2, weight_decay=args.l2norm)
th.set_num_threads(n_cpus)
for epoch in range(args.n_epochs):
train_loader.set_epoch(epoch)
train_acc, loss = train(model, embed_layer, train_loader, inv_target,
labels, emb_optimizer, optimizer)
if proc_id == 0:
print("Epoch {:05d}/{:05d} | Train Accuracy: {:.4f} | Train Loss: {:.4f}".format(
epoch, args.n_epochs, train_acc, loss))
# garbage collection that empties the queue
gc.collect()
val_logits, val_seeds = evaluate(model, embed_layer, val_loader, inv_target)
queue.put((val_logits, val_seeds))
# gather evaluation result from multiple processes
if proc_id == 0:
val_acc = collect_eval(n_gpus, queue, labels)
print("Validation Accuracy: {:.4f}".format(val_acc))
# garbage collection that empties the queue
gc.collect()
test_logits, test_seeds = evaluate(model, embed_layer, test_loader, inv_target)
queue.put((test_logits, test_seeds))
if proc_id == 0:
test_acc = collect_eval(n_gpus, queue, labels)
print("Final Test Accuracy: {:.4f}".format(test_acc))
th.distributed.barrier()
def main(args, devices):
g, num_rels, num_classes, labels, train_idx, test_idx, target_idx, inv_target = load_data(
args.dataset, inv_target=True)
# Create csr/coo/csc formats before launching training processes.
# This avoids creating certain formats in each sub-process, which saves memory and CPU.
g.create_formats_()
n_gpus = len(devices)
n_cpus = mp.cpu_count()
queue = mp.Queue(n_gpus)
procs = []
for proc_id in range(n_gpus):
# We use distributed data parallel dataloader to handle the data splitting
p = mp.Process(target=run, args=(proc_id, n_gpus, n_cpus // n_gpus, args, devices,
(g, num_classes, num_rels, target_idx,
inv_target, train_idx, test_idx, labels),
queue))
p.start()
procs.append(p)
for p in procs:
p.join()
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN for entity classification with sampling and multiple gpus')
parser.add_argument("--dropout", type=float, default=0,
help="dropout probability")
parser.add_argument("--n-hidden", type=int, default=16,
help="number of hidden units")
parser.add_argument("--gpu", type=str, default='0',
help="gpu")
parser.add_argument("--sparse-lr", type=float, default=2e-2,
help="sparse embedding learning rate")
parser.add_argument("--n-bases", type=int, default=-1,
help="number of filter weight matrices, default: -1 [use all]")
parser.add_argument("--n-epochs", type=int, default=50,
help="number of training epochs")
parser.add_argument("-d", "--dataset", type=str, required=True,
choices=['aifb', 'mutag', 'bgs', 'am'],
help="dataset to use")
parser.add_argument("--l2norm", type=float, default=5e-4,
help="l2 norm coef")
parser.add_argument("--fanout", type=str, default="4, 4",
help="Fan-out of neighbor sampling")
parser.add_argument("--use-self-loop", default=False, action='store_true',
help="include self feature as a special relation")
parser.add_argument("--batch-size", type=int, default=100,
help="Mini-batch size. ")
args = parser.parse_args()
devices = list(map(int, args.gpu.split(',')))
print(args)
main(args, devices)
import dgl
import torch as th
from dgl.data.rdf import AIFBDataset, MUTAGDataset, BGSDataset, AMDataset
def load_data(data_name, get_norm=False, inv_target=False):
if data_name == 'aifb':
dataset = AIFBDataset()
elif data_name == 'mutag':
dataset = MUTAGDataset()
elif data_name == 'bgs':
dataset = BGSDataset()
else:
dataset = AMDataset()
# Load hetero-graph
hg = dataset[0]
num_rels = len(hg.canonical_etypes)
category = dataset.predict_category
num_classes = dataset.num_classes
labels = hg.nodes[category].data.pop('labels')
train_mask = hg.nodes[category].data.pop('train_mask')
test_mask = hg.nodes[category].data.pop('test_mask')
train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
if get_norm:
# Calculate normalization weight for each edge,
# 1. / d, d is the degree of the destination node
for cetype in hg.canonical_etypes:
hg.edges[cetype].data['norm'] = dgl.norm_by_dst(hg, cetype).unsqueeze(1)
edata = ['norm']
else:
edata = None
# get target category id
category_id = hg.ntypes.index(category)
g = dgl.to_homogeneous(hg, edata=edata)
# Rename the fields as they can be changed by for example NodeDataLoader
g.ndata['ntype'] = g.ndata.pop(dgl.NTYPE)
g.ndata['type_id'] = g.ndata.pop(dgl.NID)
node_ids = th.arange(g.num_nodes())
# find out the target node ids in g
loc = (g.ndata['ntype'] == category_id)
target_idx = node_ids[loc]
if inv_target:
# Map global node IDs to type-specific node IDs. This is required for
# looking up type-specific labels in a minibatch
inv_target = th.empty((g.num_nodes(),), dtype=th.int64)
inv_target[target_idx] = th.arange(0, target_idx.shape[0],
dtype=inv_target.dtype)
return g, num_rels, num_classes, labels, train_idx, test_idx, target_idx, inv_target
else:
return g, num_rels, num_classes, labels, train_idx, test_idx, target_idx
"""
Differences compared to MichSchli/RelationPrediction
* Report raw metrics instead of filtered metrics.
* By default, we use uniform edge sampling instead of neighbor-based edge
sampling used in the author's code. In practice, we find it achieves a similar MRR.
"""
import argparse
import torch as th
import torch.nn as nn
import torch.nn.functional as F
from dgl.data.knowledge_graph import FB15k237Dataset
from dgl.dataloading import GraphDataLoader
from link_utils import preprocess, SubgraphIterator, calc_mrr
from model import RGCN
class LinkPredict(nn.Module):
def __init__(self, in_dim, num_rels, h_dim=500, num_bases=100, dropout=0.2, reg_param=0.01):
super(LinkPredict, self).__init__()
self.rgcn = RGCN(in_dim, h_dim, h_dim, num_rels * 2, regularizer="bdd",
num_bases=num_bases, dropout=dropout, self_loop=True, link_pred=True)
self.reg_param = reg_param
self.w_relation = nn.Parameter(th.Tensor(num_rels, h_dim))
nn.init.xavier_uniform_(self.w_relation,
gain=nn.init.calculate_gain('relu'))
def calc_score(self, embedding, triplets):
# DistMult
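# score(s, r, o) = sum_k s_k * r_k * o_k, i.e. a bilinear form with a diagonal relation matrix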
s = embedding[triplets[:,0]]
r = self.w_relation[triplets[:,1]]
o = embedding[triplets[:,2]]
score = th.sum(s * r * o, dim=1)
return score
def forward(self, g, h):
return self.rgcn(g, h)
def regularization_loss(self, embedding):
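# squared L2 penalty on entity embeddings and relation vectors, weighted by reg_param in get_loss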
return th.mean(embedding.pow(2)) + th.mean(self.w_relation.pow(2))
def get_loss(self, embed, triplets, labels):
# each row in the triplets is a 3-tuple of (source, relation, destination)
score = self.calc_score(embed, triplets)
predict_loss = F.binary_cross_entropy_with_logits(score, labels)
reg_loss = self.regularization_loss(embed)
return predict_loss + self.reg_param * reg_loss
def main(args):
data = FB15k237Dataset(reverse=False)
graph = data[0]
num_nodes = graph.num_nodes()
num_rels = data.num_rels
train_g, test_g = preprocess(graph, num_rels)
test_node_id = th.arange(0, num_nodes).view(-1, 1)
test_mask = graph.edata['test_mask']
subg_iter = SubgraphIterator(train_g, num_rels, args.edge_sampler)
dataloader = GraphDataLoader(subg_iter, batch_size=1, collate_fn=lambda x: x[0])
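# each iteration yields one sampled training subgraph together with its
# positive/negative triplets and labels; batch_size=1 with an identity collate keeps them unbatched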
# Prepare data for metric computation
src, dst = graph.edges()
triplets = th.stack([src, graph.edata['etype'], dst], dim=1)
model = LinkPredict(num_nodes, num_rels)
optimizer = th.optim.Adam(model.parameters(), lr=1e-2)
if args.gpu >= 0 and th.cuda.is_available():
device = th.device(args.gpu)
else:
device = th.device('cpu')
model = model.to(device)
best_mrr = 0
model_state_file = 'model_state.pth'
for epoch, batch_data in enumerate(dataloader):
model.train()
g, node_id, data, labels = batch_data
g = g.to(device)
node_id = node_id.to(device)
data = data.to(device)
labels = labels.to(device)
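# node_id holds the original entity IDs of the sampled nodes;
# the model's embedding layer maps them to input features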
embed = model(g, node_id)
loss = model.get_loss(embed, data, labels)
optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) # clip gradients
optimizer.step()
print("Epoch {:04d} | Loss {:.4f} | Best MRR {:.4f}".format(epoch, loss.item(), best_mrr))
if (epoch + 1) % 500 == 0:
# perform validation on CPU because full graph is too large
model = model.cpu()
model.eval()
print("start eval")
embed = model(test_g, test_node_id)
mrr = calc_mrr(embed, model.w_relation, test_mask, triplets,
batch_size=500, eval_p=args.eval_protocol)
# save best model
if best_mrr < mrr:
best_mrr = mrr
th.save({'state_dict': model.state_dict(), 'epoch': epoch}, model_state_file)
model = model.to(device)
print("Start testing:")
# use best model checkpoint
checkpoint = th.load(model_state_file)
model = model.cpu() # test on CPU
model.eval()
model.load_state_dict(checkpoint['state_dict'])
print("Using best epoch: {}".format(checkpoint['epoch']))
embed = model(test_g, test_node_id)
calc_mrr(embed, model.w_relation, test_mask, triplets,
batch_size=500, eval_p=args.eval_protocol)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN for link prediction')
parser.add_argument("--gpu", type=int, default=-1,
help="gpu")
parser.add_argument("--eval-protocol", type=str, default='filtered',
choices=['filtered', 'raw'],
help="Whether to use 'filtered' or 'raw' MRR for evaluation")
parser.add_argument("--edge-sampler", type=str, default='uniform',
choices=['uniform', 'neighbor'],
help="Type of edge sampler: 'uniform' or 'neighbor'"
"The original implementation uses neighbor sampler.")
args = parser.parse_args()
print(args)
main(args)
"""
Modeling Relational Data with Graph Convolutional Networks
Paper: https://arxiv.org/abs/1703.06103
Code: https://github.com/MichSchli/RelationPrediction
Differences compared to MichSchli/RelationPrediction
* Report raw metrics instead of filtered metrics.
* By default, we use uniform edge sampling instead of the neighbor-based edge
sampling used in the author's code. In practice, we find it achieves a similar MRR.
Users can specify "--edge-sampler=neighbor" to switch to neighbor-based edge sampling.
"""
import argparse
import numpy as np
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
from dgl.data.knowledge_graph import load_data
from dgl.nn.pytorch import RelGraphConv
from model import BaseRGCN
import utils
class EmbeddingLayer(nn.Module):
def __init__(self, num_nodes, h_dim):
super(EmbeddingLayer, self).__init__()
self.embedding = torch.nn.Embedding(num_nodes, h_dim)
def forward(self, g, h, r, norm):
return self.embedding(h.squeeze())
class RGCN(BaseRGCN):
def build_input_layer(self):
return EmbeddingLayer(self.num_nodes, self.h_dim)
def build_hidden_layer(self, idx):
act = F.relu if idx < self.num_hidden_layers - 1 else None
return RelGraphConv(self.h_dim, self.h_dim, self.num_rels, "bdd",
self.num_bases, activation=act, self_loop=True,
dropout=self.dropout)
class LinkPredict(nn.Module):
def __init__(self, in_dim, h_dim, num_rels, num_bases=-1,
num_hidden_layers=1, dropout=0, use_cuda=False, reg_param=0):
super(LinkPredict, self).__init__()
self.rgcn = RGCN(in_dim, h_dim, h_dim, num_rels * 2, num_bases,
num_hidden_layers, dropout, use_cuda)
self.reg_param = reg_param
self.w_relation = nn.Parameter(torch.Tensor(num_rels, h_dim))
nn.init.xavier_uniform_(self.w_relation,
gain=nn.init.calculate_gain('relu'))
def calc_score(self, embedding, triplets):
# DistMult
s = embedding[triplets[:,0]]
r = self.w_relation[triplets[:,1]]
o = embedding[triplets[:,2]]
score = torch.sum(s * r * o, dim=1)
return score
def forward(self, g, h, r, norm):
return self.rgcn.forward(g, h, r, norm)
def regularization_loss(self, embedding):
return torch.mean(embedding.pow(2)) + torch.mean(self.w_relation.pow(2))
def get_loss(self, g, embed, triplets, labels):
# triplets is a list of data samples (positive and negative)
# each row in the triplets is a 3-tuple of (source, relation, destination)
score = self.calc_score(embed, triplets)
predict_loss = F.binary_cross_entropy_with_logits(score, labels)
reg_loss = self.regularization_loss(embed)
return predict_loss + self.reg_param * reg_loss
def node_norm_to_edge_norm(g, node_norm):
g = g.local_var()
# convert to edge norm
g.ndata['norm'] = node_norm
g.apply_edges(lambda edges : {'norm' : edges.dst['norm']})
return g.edata['norm']
def main(args):
# load graph data
data = load_data(args.dataset)
num_nodes = data.num_nodes
train_data = data.train
valid_data = data.valid
test_data = data.test
num_rels = data.num_rels
# check cuda
use_cuda = args.gpu >= 0 and torch.cuda.is_available()
if use_cuda:
torch.cuda.set_device(args.gpu)
# create model
model = LinkPredict(num_nodes,
args.n_hidden,
num_rels,
num_bases=args.n_bases,
num_hidden_layers=args.n_layers,
dropout=args.dropout,
use_cuda=use_cuda,
reg_param=args.regularization)
# validation and testing triplets
valid_data = torch.LongTensor(valid_data)
test_data = torch.LongTensor(test_data)
# build test graph
test_graph, test_rel, test_norm = utils.build_test_graph(
num_nodes, num_rels, train_data)
test_deg = test_graph.in_degrees(
range(test_graph.number_of_nodes())).float().view(-1,1)
test_node_id = torch.arange(0, num_nodes, dtype=torch.long).view(-1, 1)
test_rel = torch.from_numpy(test_rel)
test_norm = node_norm_to_edge_norm(test_graph, torch.from_numpy(test_norm).view(-1, 1))
if use_cuda:
model.cuda()
# build adj list and calculate degrees for sampling
adj_list, degrees = utils.get_adj_and_degrees(num_nodes, train_data)
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)
model_state_file = 'model_state.pth'
forward_time = []
backward_time = []
# training loop
print("start training...")
epoch = 0
best_mrr = 0
while True:
model.train()
epoch += 1
# perform edge neighborhood sampling to generate training graph and data
g, node_id, edge_type, node_norm, data, labels = \
utils.generate_sampled_graph_and_labels(
train_data, args.graph_batch_size, args.graph_split_size,
num_rels, adj_list, degrees, args.negative_sample,
args.edge_sampler)
print("Done edge sampling")
# set node/edge feature
node_id = torch.from_numpy(node_id).view(-1, 1).long()
edge_type = torch.from_numpy(edge_type)
edge_norm = node_norm_to_edge_norm(g, torch.from_numpy(node_norm).view(-1, 1))
data, labels = torch.from_numpy(data), torch.from_numpy(labels)
deg = g.in_degrees(range(g.number_of_nodes())).float().view(-1, 1)
if use_cuda:
node_id, deg = node_id.cuda(), deg.cuda()
edge_type, edge_norm = edge_type.cuda(), edge_norm.cuda()
data, labels = data.cuda(), labels.cuda()
g = g.to(args.gpu)
t0 = time.time()
embed = model(g, node_id, edge_type, edge_norm)
loss = model.get_loss(g, embed, data, labels)
t1 = time.time()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_norm) # clip gradients
optimizer.step()
t2 = time.time()
forward_time.append(t1 - t0)
backward_time.append(t2 - t1)
print("Epoch {:04d} | Loss {:.4f} | Best MRR {:.4f} | Forward {:.4f}s | Backward {:.4f}s".
format(epoch, loss.item(), best_mrr, forward_time[-1], backward_time[-1]))
optimizer.zero_grad()
# validation
if epoch % args.evaluate_every == 0:
# perform validation on CPU because full graph is too large
if use_cuda:
model.cpu()
model.eval()
print("start eval")
embed = model(test_graph, test_node_id, test_rel, test_norm)
mrr = utils.calc_mrr(embed, model.w_relation, torch.LongTensor(train_data),
valid_data, test_data, hits=[1, 3, 10], eval_bz=args.eval_batch_size,
eval_p=args.eval_protocol)
# save best model
if best_mrr < mrr:
best_mrr = mrr
torch.save({'state_dict': model.state_dict(), 'epoch': epoch}, model_state_file)
if epoch >= args.n_epochs:
break
if use_cuda:
model.cuda()
print("training done")
print("Mean forward time: {:4f}s".format(np.mean(forward_time)))
print("Mean Backward time: {:4f}s".format(np.mean(backward_time)))
print("\nstart testing:")
# use best model checkpoint
checkpoint = torch.load(model_state_file)
if use_cuda:
model.cpu() # test on CPU
model.eval()
model.load_state_dict(checkpoint['state_dict'])
print("Using best epoch: {}".format(checkpoint['epoch']))
embed = model(test_graph, test_node_id, test_rel, test_norm)
utils.calc_mrr(embed, model.w_relation, torch.LongTensor(train_data), valid_data,
test_data, hits=[1, 3, 10], eval_bz=args.eval_batch_size, eval_p=args.eval_protocol)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='RGCN')
parser.add_argument("--dropout", type=float, default=0.2,
help="dropout probability")
parser.add_argument("--n-hidden", type=int, default=500,
help="number of hidden units")
parser.add_argument("--gpu", type=int, default=-1,
help="gpu")
parser.add_argument("--lr", type=float, default=1e-2,
help="learning rate")
parser.add_argument("--n-bases", type=int, default=100,
help="number of weight blocks for each relation")
parser.add_argument("--n-layers", type=int, default=2,
help="number of propagation rounds")
parser.add_argument("--n-epochs", type=int, default=6000,
help="number of minimum training epochs")
parser.add_argument("-d", "--dataset", type=str, required=True,
help="dataset to use")
parser.add_argument("--eval-batch-size", type=int, default=500,
help="batch size when evaluating")
parser.add_argument("--eval-protocol", type=str, default="filtered",
help="type of evaluation protocol: 'raw' or 'filtered' mrr")
parser.add_argument("--regularization", type=float, default=0.01,
help="regularization weight")
parser.add_argument("--grad-norm", type=float, default=1.0,
help="norm to clip gradient to")
parser.add_argument("--graph-batch-size", type=int, default=30000,
help="number of edges to sample in each iteration")
parser.add_argument("--graph-split-size", type=float, default=0.5,
help="portion of edges used as positive sample")
parser.add_argument("--negative-sample", type=int, default=10,
help="number of negative samples per positive sample")
parser.add_argument("--evaluate-every", type=int, default=500,
help="perform evaluation every n epochs")
parser.add_argument("--edge-sampler", type=str, default="uniform",
help="type of edge sampler: 'uniform' or 'neighbor'")
args = parser.parse_args()
print(args)
main(args)
"""
Utility functions for link prediction
Most code is adapted from authors' implementation of RGCN link prediction:
https://github.com/MichSchli/RelationPrediction
"""
import numpy as np
import torch as th
import dgl
# Utility function for building training and testing graphs
def get_subset_g(g, mask, num_rels, bidirected=False):
src, dst = g.edges()
sub_src = src[mask]
sub_dst = dst[mask]
sub_rel = g.edata['etype'][mask]
if bidirected:
sub_src, sub_dst = th.cat([sub_src, sub_dst]), th.cat([sub_dst, sub_src])
sub_rel = th.cat([sub_rel, sub_rel + num_rels])
sub_g = dgl.graph((sub_src, sub_dst), num_nodes=g.num_nodes())
sub_g.edata[dgl.ETYPE] = sub_rel
return sub_g
def preprocess(g, num_rels):
# Get train graph
train_g = get_subset_g(g, g.edata['train_mask'], num_rels)
# Get test graph
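# The evaluation graph also contains only training edges (made bidirectional);
# test triplets are used solely for scoring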
test_g = get_subset_g(g, g.edata['train_mask'], num_rels, bidirected=True)
test_g.edata['norm'] = dgl.norm_by_dst(test_g).unsqueeze(-1)
return train_g, test_g
class GlobalUniform:
def __init__(self, g, sample_size):
self.sample_size = sample_size
self.eids = np.arange(g.num_edges())
def sample(self):
return th.from_numpy(np.random.choice(self.eids, self.sample_size))
class NeighborExpand:
"""Sample a connected component by neighborhood expansion"""
def __init__(self, g, sample_size):
self.g = g
self.nids = np.arange(g.num_nodes())
self.sample_size = sample_size
def sample(self):
edges = th.zeros((self.sample_size), dtype=th.int64)
neighbor_counts = (self.g.in_degrees() + self.g.out_degrees()).numpy()
seen_edge = np.array([False] * self.g.num_edges())
seen_node = np.array([False] * self.g.num_nodes())
for i in range(self.sample_size):
if np.sum(seen_node) == 0:
node_weights = np.ones_like(neighbor_counts)
node_weights[np.where(neighbor_counts == 0)] = 0
else:
# Sample a visited node if applicable.
# This guarantees a connected component.
node_weights = neighbor_counts * seen_node
node_probs = node_weights / np.sum(node_weights)
chosen_node = np.random.choice(self.nids, p=node_probs)
# Sample a neighbor of the sampled node
u1, v1, eid1 = self.g.in_edges(chosen_node, form='all')
u2, v2, eid2 = self.g.out_edges(chosen_node, form='all')
u = th.cat([u1, u2])
v = th.cat([v1, v2])
eid = th.cat([eid1, eid2])
to_pick = True
while to_pick:
random_id = th.randint(high=eid.shape[0], size=(1,))
chosen_eid = eid[random_id]
to_pick = seen_edge[chosen_eid]
chosen_u = u[random_id]
chosen_v = v[random_id]
edges[i] = chosen_eid
seen_node[chosen_u] = True
seen_node[chosen_v] = True
seen_edge[chosen_eid] = True
neighbor_counts[chosen_u] -= 1
neighbor_counts[chosen_v] -= 1
return edges
class NegativeSampler:
def __init__(self, k=10):
self.k = k
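# for every positive triplet, generate k negatives by corrupting either the head or the tail entity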
def sample(self, pos_samples, num_nodes):
batch_size = len(pos_samples)
neg_batch_size = batch_size * self.k
neg_samples = np.tile(pos_samples, (self.k, 1))
values = np.random.randint(num_nodes, size=neg_batch_size)
choices = np.random.uniform(size=neg_batch_size)
subj = choices > 0.5
obj = choices <= 0.5
neg_samples[subj, 0] = values[subj]
neg_samples[obj, 2] = values[obj]
samples = np.concatenate((pos_samples, neg_samples))
# binary labels indicating positive and negative samples
labels = np.zeros(batch_size * (self.k + 1), dtype=np.float32)
labels[:batch_size] = 1
return th.from_numpy(samples), th.from_numpy(labels)
class SubgraphIterator:
def __init__(self, g, num_rels, pos_sampler, sample_size=30000, num_epochs=6000):
self.g = g
self.num_rels = num_rels
self.sample_size = sample_size
self.num_epochs = num_epochs
if pos_sampler == 'neighbor':
self.pos_sampler = NeighborExpand(g, sample_size)
else:
self.pos_sampler = GlobalUniform(g, sample_size)
self.neg_sampler = NegativeSampler()
def __len__(self):
return self.num_epochs
def __getitem__(self, i):
eids = self.pos_sampler.sample()
src, dst = self.g.find_edges(eids)
src, dst = src.numpy(), dst.numpy()
rel = self.g.edata[dgl.ETYPE][eids].numpy()
# relabel nodes to have consecutive node IDs
uniq_v, edges = np.unique((src, dst), return_inverse=True)
num_nodes = len(uniq_v)
# edges is the concatenation of src, dst with relabeled ID
src, dst = np.reshape(edges, (2, -1))
relabeled_data = np.stack((src, rel, dst)).transpose()
samples, labels = self.neg_sampler.sample(relabeled_data, num_nodes)
# Use only half of the sampled edges to build the training graph;
# all sampled edges (plus the negatives) still serve as supervision triplets
chosen_ids = np.random.choice(np.arange(self.sample_size),
size=int(self.sample_size / 2),
replace=False)
src = src[chosen_ids]
dst = dst[chosen_ids]
rel = rel[chosen_ids]
src, dst = np.concatenate((src, dst)), np.concatenate((dst, src))
rel = np.concatenate((rel, rel + self.num_rels))
sub_g = dgl.graph((src, dst), num_nodes=num_nodes)
sub_g.edata[dgl.ETYPE] = th.from_numpy(rel)
sub_g.edata['norm'] = dgl.norm_by_dst(sub_g).unsqueeze(-1)
uniq_v = th.from_numpy(uniq_v).view(-1, 1).long()
return sub_g, uniq_v, samples, labels
# Utility functions for evaluations (raw)
def perturb_and_get_raw_rank(emb, w, a, r, b, test_size, batch_size=100):
""" Perturb one element in the triplets"""
n_batch = (test_size + batch_size - 1) // batch_size
ranks = []
emb = emb.transpose(0, 1) # size D x V
w = w.transpose(0, 1) # size D x R
for idx in range(n_batch):
print("batch {} / {}".format(idx, n_batch))
batch_start = idx * batch_size
batch_end = (idx + 1) * batch_size
batch_a = a[batch_start: batch_end]
batch_r = r[batch_start: batch_end]
emb_ar = emb[:,batch_a] * w[:,batch_r] # size D x E
emb_ar = emb_ar.unsqueeze(2) # size D x E x 1
emb_c = emb.unsqueeze(1) # size D x 1 x V
# out-prod and reduce sum
out_prod = th.bmm(emb_ar, emb_c) # size D x E x V
score = th.sum(out_prod, dim=0).sigmoid() # size E x V
target = b[batch_start: batch_end]
_, indices = th.sort(score, dim=1, descending=True)
indices = th.nonzero(indices == target.view(-1, 1), as_tuple=False)
ranks.append(indices[:, 1].view(-1))
return th.cat(ranks)
# Utility functions for evaluations (filtered)
def filter(triplets_to_filter, target_s, target_r, target_o, num_nodes, filter_o=True):
"""Get candidate heads or tails to score"""
target_s, target_r, target_o = int(target_s), int(target_r), int(target_o)
# Add the ground truth node first
if filter_o:
candidate_nodes = [target_o]
else:
candidate_nodes = [target_s]
for e in range(num_nodes):
triplet = (target_s, target_r, e) if filter_o else (e, target_r, target_o)
# Do not consider a node if it leads to a real triplet
if triplet not in triplets_to_filter:
candidate_nodes.append(e)
return th.LongTensor(candidate_nodes)
def perturb_and_get_filtered_rank(emb, w, s, r, o, test_size, triplets_to_filter, filter_o=True):
"""Perturb subject or object in the triplets"""
num_nodes = emb.shape[0]
ranks = []
for idx in range(test_size):
if idx % 100 == 0:
print("test triplet {} / {}".format(idx, test_size))
target_s = s[idx]
target_r = r[idx]
target_o = o[idx]
candidate_nodes = filter(triplets_to_filter, target_s, target_r,
target_o, num_nodes, filter_o=filter_o)
if filter_o:
emb_s = emb[target_s]
emb_o = emb[candidate_nodes]
else:
emb_s = emb[candidate_nodes]
emb_o = emb[target_o]
target_idx = 0
emb_r = w[target_r]
emb_triplet = emb_s * emb_r * emb_o
scores = th.sigmoid(th.sum(emb_triplet, dim=1))
_, indices = th.sort(scores, descending=True)
rank = int((indices == target_idx).nonzero())
ranks.append(rank)
return th.LongTensor(ranks)
def _calc_mrr(emb, w, test_mask, triplets_to_filter, batch_size, filter=False):
with th.no_grad():
test_triplets = triplets_to_filter[test_mask]
s, r, o = test_triplets[:,0], test_triplets[:,1], test_triplets[:,2]
test_size = len(s)
if filter:
metric_name = 'MRR (filtered)'
triplets_to_filter = {tuple(triplet) for triplet in triplets_to_filter.tolist()}
ranks_s = perturb_and_get_filtered_rank(emb, w, s, r, o, test_size,
triplets_to_filter, filter_o=False)
ranks_o = perturb_and_get_filtered_rank(emb, w, s, r, o,
test_size, triplets_to_filter)
else:
metric_name = 'MRR (raw)'
ranks_s = perturb_and_get_raw_rank(emb, w, o, r, s, test_size, batch_size)
ranks_o = perturb_and_get_raw_rank(emb, w, s, r, o, test_size, batch_size)
ranks = th.cat([ranks_s, ranks_o])
ranks += 1 # change to 1-indexed
mrr = th.mean(1.0 / ranks.float()).item()
print("{}: {:.6f}".format(metric_name, mrr))
return mrr
# Main evaluation function
def calc_mrr(emb, w, test_mask, triplets, batch_size=100, eval_p="filtered"):
if eval_p == "filtered":
mrr = _calc_mrr(emb, w, test_mask, triplets, batch_size, filter=True)
else:
mrr = _calc_mrr(emb, w, test_mask, triplets, batch_size)
return mrr
from dgl import DGLGraph
import torch as th
import torch.nn as nn
import torch.nn.functional as F
import dgl
class BaseRGCN(nn.Module):
def __init__(self, num_nodes, h_dim, out_dim, num_rels, num_bases,
num_hidden_layers=1, dropout=0,
use_self_loop=False, use_cuda=False):
super(BaseRGCN, self).__init__()
self.num_nodes = num_nodes
self.h_dim = h_dim
self.out_dim = out_dim
self.num_rels = num_rels
self.num_bases = None if num_bases < 0 else num_bases
self.num_hidden_layers = num_hidden_layers
self.dropout = dropout
self.use_self_loop = use_self_loop
self.use_cuda = use_cuda
from dgl.nn.pytorch import RelGraphConv
# create rgcn layers
self.build_model()
class RGCN(nn.Module):
def __init__(self, in_dim, h_dim, out_dim, num_rels,
regularizer="basis", num_bases=-1, dropout=0.,
self_loop=False, link_pred=False):
super(RGCN, self).__init__()
def build_model(self):
self.layers = nn.ModuleList()
# i2h
i2h = self.build_input_layer()
if i2h is not None:
self.layers.append(i2h)
# h2h
for idx in range(self.num_hidden_layers):
h2h = self.build_hidden_layer(idx)
self.layers.append(h2h)
# h2o
h2o = self.build_output_layer()
if h2o is not None:
self.layers.append(h2o)
def build_input_layer(self):
return None
def build_hidden_layer(self, idx):
raise NotImplementedError
if link_pred:
self.emb = nn.Embedding(in_dim, h_dim)
in_dim = h_dim
else:
self.emb = None
self.layers.append(RelGraphConv(in_dim, h_dim, num_rels, regularizer,
num_bases, activation=F.relu, self_loop=self_loop,
dropout=dropout))
# For entity classification, dropout should not be applied to the output layer
if not link_pred:
dropout = 0.
self.layers.append(RelGraphConv(h_dim, out_dim, num_rels, regularizer,
num_bases, self_loop=self_loop, dropout=dropout))
def forward(self, g, h):
if isinstance(g, DGLGraph):
blocks = [g] * len(self.layers)
else:
blocks = g
def build_output_layer(self):
return None
if self.emb is not None:
h = self.emb(h.squeeze())
def forward(self, g, h, r, norm):
for layer in self.layers:
h = layer(g, h, r, norm)
for layer, block in zip(self.layers, blocks):
h = layer(block, h, block.edata[dgl.ETYPE], block.edata['norm'])
return h
def initializer(emb):
@@ -55,112 +46,42 @@ def initializer(emb):
return emb
class RelGraphEmbedLayer(nn.Module):
r"""Embedding layer for featureless heterograph.
"""Embedding layer for featureless heterograph.
Parameters
----------
storage_dev_id : int
The device to store the weights of the layer.
out_dev_id : int
Device to return the output embeddings on.
out_dev
Device to store the output embeddings
num_nodes : int
Number of nodes.
node_tides : tensor
Storing the node type id for each node starting from 0
num_of_ntype : int
Number of node types
input_size : list of int
A list of input feature size for each node type. If None, we then
treat certain input feature as an one-hot encoding feature.
Number of nodes in the graph.
embed_size : int
Output embed size
dgl_sparse : bool, optional
If true, use dgl.nn.NodeEmbedding otherwise use torch.nn.Embedding
"""
def __init__(self,
storage_dev_id,
out_dev_id,
out_dev,
num_nodes,
node_tids,
num_of_ntype,
input_size,
embed_size,
dgl_sparse=False):
embed_size):
super(RelGraphEmbedLayer, self).__init__()
self.storage_dev_id = th.device( \
storage_dev_id if storage_dev_id >= 0 else 'cpu')
self.out_dev_id = th.device(out_dev_id if out_dev_id >= 0 else 'cpu')
self.out_dev = out_dev
self.embed_size = embed_size
self.num_nodes = num_nodes
self.dgl_sparse = dgl_sparse
# create weight embeddings for each node for each relation
self.embeds = nn.ParameterDict()
self.node_embeds = {} if dgl_sparse else nn.ModuleDict()
self.num_of_ntype = num_of_ntype
for ntype in range(num_of_ntype):
if isinstance(input_size[ntype], int):
if dgl_sparse:
self.node_embeds[str(ntype)] = dgl.nn.NodeEmbedding(input_size[ntype], embed_size, name=str(ntype),
init_func=initializer, device=self.storage_dev_id)
else:
sparse_emb = th.nn.Embedding(input_size[ntype], embed_size, sparse=True)
sparse_emb.cuda(self.storage_dev_id)
nn.init.uniform_(sparse_emb.weight, -1.0, 1.0)
self.node_embeds[str(ntype)] = sparse_emb
else:
input_emb_size = input_size[ntype].shape[1]
embed = nn.Parameter(th.empty([input_emb_size, self.embed_size],
device=self.storage_dev_id))
nn.init.xavier_uniform_(embed)
self.embeds[str(ntype)] = embed
@property
def dgl_emb(self):
"""
"""
if self.dgl_sparse:
embs = [emb for emb in self.node_embeds.values()]
return embs
else:
return []
# create embeddings for all nodes
self.node_embed = nn.Embedding(num_nodes, embed_size, sparse=True)
nn.init.uniform_(self.node_embed.weight, -1.0, 1.0)
def forward(self, node_ids, node_tids, type_ids, features):
def forward(self, node_ids):
"""Forward computation
Parameters
----------
node_ids : tensor
node ids to generate embedding for.
node_ids : tensor
node type ids
features : list of features
list of initial features for nodes belong to different node type.
If None, the corresponding features is an one-hot encoding feature,
else use the features directly as input feature and matmul a
projection matrix.
Raw node IDs.
Returns
-------
tensor
embeddings as the input of the next layer
"""
embeds = th.empty(node_ids.shape[0], self.embed_size, device=self.out_dev_id)
# transfer input to the correct device
type_ids = type_ids.to(self.storage_dev_id)
node_tids = node_tids.to(self.storage_dev_id)
# build locs first
locs = [None for i in range(self.num_of_ntype)]
for ntype in range(self.num_of_ntype):
locs[ntype] = (node_tids == ntype).nonzero().squeeze(-1)
for ntype in range(self.num_of_ntype):
loc = locs[ntype]
if isinstance(features[ntype], int):
if self.dgl_sparse:
embeds[loc] = self.node_embeds[str(ntype)](type_ids[loc], self.out_dev_id)
else:
embeds[loc] = self.node_embeds[str(ntype)](type_ids[loc]).to(self.out_dev_id)
else:
embeds[loc] = features[ntype][type_ids[loc]].to(self.out_dev_id) @ self.embeds[str(ntype)].to(self.out_dev_id)
embeds = self.node_embed(node_ids).to(self.out_dev)
return embeds
"""
Utility functions for link prediction
Most code is adapted from authors' implementation of RGCN link prediction:
https://github.com/MichSchli/RelationPrediction
"""
import numpy as np
import torch
from torch.multiprocessing import Queue
import dgl
#######################################################################
#
# Utility function for building training and testing graphs
#
#######################################################################
def get_adj_and_degrees(num_nodes, triplets):
""" Get adjacency list and degrees of the graph
"""
adj_list = [[] for _ in range(num_nodes)]
for i,triplet in enumerate(triplets):
adj_list[triplet[0]].append([i, triplet[2]])
adj_list[triplet[2]].append([i, triplet[0]])
degrees = np.array([len(a) for a in adj_list])
adj_list = [np.array(a) for a in adj_list]
return adj_list, degrees
def sample_edge_neighborhood(adj_list, degrees, n_triplets, sample_size):
"""Sample edges by neighborhool expansion.
This guarantees that the sampled edges form a connected graph, which
may help deeper GNNs that require information from more than one hop.
"""
edges = np.zeros((sample_size), dtype=np.int32)
#initialize
sample_counts = np.array([d for d in degrees])
picked = np.array([False for _ in range(n_triplets)])
seen = np.array([False for _ in degrees])
for i in range(0, sample_size):
weights = sample_counts * seen
if np.sum(weights) == 0:
weights = np.ones_like(weights)
weights[np.where(sample_counts == 0)] = 0
probabilities = (weights) / np.sum(weights)
chosen_vertex = np.random.choice(np.arange(degrees.shape[0]),
p=probabilities)
chosen_adj_list = adj_list[chosen_vertex]
seen[chosen_vertex] = True
chosen_edge = np.random.choice(np.arange(chosen_adj_list.shape[0]))
chosen_edge = chosen_adj_list[chosen_edge]
edge_number = chosen_edge[0]
while picked[edge_number]:
chosen_edge = np.random.choice(np.arange(chosen_adj_list.shape[0]))
chosen_edge = chosen_adj_list[chosen_edge]
edge_number = chosen_edge[0]
edges[i] = edge_number
other_vertex = chosen_edge[1]
picked[edge_number] = True
sample_counts[chosen_vertex] -= 1
sample_counts[other_vertex] -= 1
seen[other_vertex] = True
return edges
def sample_edge_uniform(adj_list, degrees, n_triplets, sample_size):
"""Sample edges uniformly from all the edges."""
all_edges = np.arange(n_triplets)
return np.random.choice(all_edges, sample_size, replace=False)
def generate_sampled_graph_and_labels(triplets, sample_size, split_size,
num_rels, adj_list, degrees,
negative_rate, sampler="uniform"):
"""Get training graph and signals
First perform edge neighborhood sampling on graph, then perform negative
sampling to generate negative samples
"""
# perform edge neighbor sampling
if sampler == "uniform":
edges = sample_edge_uniform(adj_list, degrees, len(triplets), sample_size)
elif sampler == "neighbor":
edges = sample_edge_neighborhood(adj_list, degrees, len(triplets), sample_size)
else:
raise ValueError("Sampler type must be either 'uniform' or 'neighbor'.")
# relabel nodes to have consecutive node ids
edges = triplets[edges]
src, rel, dst = edges.transpose()
uniq_v, edges = np.unique((src, dst), return_inverse=True)
src, dst = np.reshape(edges, (2, -1))
relabeled_edges = np.stack((src, rel, dst)).transpose()
# negative sampling
samples, labels = negative_sampling(relabeled_edges, len(uniq_v),
negative_rate)
# further split the sampled edges: only a fraction (split_size) is used as
# graph structure, while the remaining edges serve as positive samples unseen by the GNN
split_size = int(sample_size * split_size)
graph_split_ids = np.random.choice(np.arange(sample_size),
size=split_size, replace=False)
src = src[graph_split_ids]
dst = dst[graph_split_ids]
rel = rel[graph_split_ids]
# build DGL graph
print("# sampled nodes: {}".format(len(uniq_v)))
print("# sampled edges: {}".format(len(src) * 2))
g, rel, norm = build_graph_from_triplets(len(uniq_v), num_rels,
(src, rel, dst))
return g, uniq_v, rel, norm, samples, labels
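# Illustrative sketch (not part of the original module): a typical training-loop call,
# assuming `train_data` is an (N, 3) int array of (src, rel, dst) triplets and
# `num_nodes` / `num_rels` are the entity and relation counts of the dataset.
#
# >>> adj_list, degrees = get_adj_and_degrees(num_nodes, train_data)
# >>> g, uniq_v, rel, norm, samples, labels = generate_sampled_graph_and_labels(
# ...     train_data, sample_size=30000, split_size=0.5, num_rels=num_rels,
# ...     adj_list=adj_list, degrees=degrees, negative_rate=10, sampler="uniform")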
def comp_deg_norm(g):
g = g.local_var()
in_deg = g.in_degrees(range(g.number_of_nodes())).float().numpy()
norm = 1.0 / in_deg
norm[np.isinf(norm)] = 0
return norm
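# Illustrative sketch (not part of the original module): for a graph with edges
# 0->1, 1->1, 1->2 the in-degrees are [0, 2, 1], so the per-node norm is roughly
# [0.0, 0.5, 1.0]; isolated nodes get 0 instead of inf.
#
# >>> g = dgl.graph(([0, 1, 1], [1, 1, 2]))
# >>> comp_deg_norm(g)
# array([0. , 0.5, 1. ], dtype=float32)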
def build_graph_from_triplets(num_nodes, num_rels, triplets):
""" Create a DGL graph. The graph is bidirectional because RGCN authors
use reversed relations.
This function also generates edge type and normalization factor
(reciprocal of node incoming degree)
"""
g = dgl.graph(([], []))
g.add_nodes(num_nodes)
src, rel, dst = triplets
src, dst = np.concatenate((src, dst)), np.concatenate((dst, src))
rel = np.concatenate((rel, rel + num_rels))
edges = sorted(zip(dst, src, rel))
dst, src, rel = np.array(edges).transpose()
g.add_edges(src, dst)
norm = comp_deg_norm(g)
print("# nodes: {}, # edges: {}".format(num_nodes, len(src)))
return g, rel.astype('int64'), norm.astype('float32')
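# Illustrative sketch (not part of the original module): three triplets over
# num_rels=2 relations yield 6 edges, since every (s, r, o) is also added as a
# reversed edge (o, r + num_rels, s).
#
# >>> triplets = (np.array([0, 1, 2]), np.array([0, 1, 0]), np.array([1, 2, 3]))
# >>> g, rel, norm = build_graph_from_triplets(4, 2, triplets)
# # nodes: 4, # edges: 6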
def build_test_graph(num_nodes, num_rels, edges):
src, rel, dst = edges.transpose()
print("Test graph:")
return build_graph_from_triplets(num_nodes, num_rels, (src, rel, dst))
def negative_sampling(pos_samples, num_entity, negative_rate):
size_of_batch = len(pos_samples)
num_to_generate = size_of_batch * negative_rate
neg_samples = np.tile(pos_samples, (negative_rate, 1))
labels = np.zeros(size_of_batch * (negative_rate + 1), dtype=np.float32)
labels[: size_of_batch] = 1
values = np.random.randint(num_entity, size=num_to_generate)
choices = np.random.uniform(size=num_to_generate)
subj = choices > 0.5
obj = choices <= 0.5
neg_samples[subj, 0] = values[subj]
neg_samples[obj, 2] = values[obj]
return np.concatenate((pos_samples, neg_samples)), labels
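# Illustrative sketch (not part of the original module): with negative_rate=10,
# each positive triplet yields 10 corrupted copies in which either the subject
# or the object (chosen with probability 0.5) is replaced by a random entity.
#
# >>> pos = relabeled_edges                                 # shape (E, 3)
# >>> samples, labels = negative_sampling(pos, num_entity=len(uniq_v), negative_rate=10)
# >>> samples.shape, labels.shape                           # ((11 * E, 3), (11 * E,))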
#######################################################################
#
# Utility functions for evaluations (raw)
#
#######################################################################
def sort_and_rank(score, target):
_, indices = torch.sort(score, dim=1, descending=True)
indices = torch.nonzero(indices == target.view(-1, 1), as_tuple=False)
indices = indices[:, 1].view(-1)
return indices
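# Illustrative sketch (not part of the original module): for scores
# [[0.1, 0.9, 0.3]] and target entity 2, the descending order is [1, 2, 0],
# so the 0-indexed rank of the target is 1.
#
# >>> sort_and_rank(torch.tensor([[0.1, 0.9, 0.3]]), torch.tensor([2]))
# tensor([1])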
def perturb_and_get_raw_rank(embedding, w, a, r, b, test_size, batch_size=100):
""" Perturb one element in the triplets
"""
n_batch = (test_size + batch_size - 1) // batch_size
ranks = []
for idx in range(n_batch):
print("batch {} / {}".format(idx, n_batch))
batch_start = idx * batch_size
batch_end = min(test_size, (idx + 1) * batch_size)
batch_a = a[batch_start: batch_end]
batch_r = r[batch_start: batch_end]
emb_ar = embedding[batch_a] * w[batch_r]
emb_ar = emb_ar.transpose(0, 1).unsqueeze(2) # size: D x E x 1
emb_c = embedding.transpose(0, 1).unsqueeze(1) # size: D x 1 x V
# out-prod and reduce sum
out_prod = torch.bmm(emb_ar, emb_c) # size D x E x V
score = torch.sum(out_prod, dim=0) # size E x V
score = torch.sigmoid(score)
target = b[batch_start: batch_end]
ranks.append(sort_and_rank(score, target))
return torch.cat(ranks)
# return MRR (raw), and Hits @ (1, 3, 10)
def calc_raw_mrr(embedding, w, test_triplets, hits=[], eval_bz=100):
with torch.no_grad():
s = test_triplets[:, 0]
r = test_triplets[:, 1]
o = test_triplets[:, 2]
test_size = test_triplets.shape[0]
# perturb subject
ranks_s = perturb_and_get_raw_rank(embedding, w, o, r, s, test_size, eval_bz)
# perturb object
ranks_o = perturb_and_get_raw_rank(embedding, w, s, r, o, test_size, eval_bz)
ranks = torch.cat([ranks_s, ranks_o])
ranks += 1 # change to 1-indexed
mrr = torch.mean(1.0 / ranks.float())
print("MRR (raw): {:.6f}".format(mrr.item()))
for hit in hits:
avg_count = torch.mean((ranks <= hit).float())
print("Hits (raw) @ {}: {:.6f}".format(hit, avg_count.item()))
return mrr.item()
#######################################################################
#
# Utility functions for evaluations (filtered)
#
#######################################################################
def filter_o(triplets_to_filter, target_s, target_r, target_o, num_entities):
target_s, target_r, target_o = int(target_s), int(target_r), int(target_o)
filtered_o = []
# Do not filter out the test triplet, since we want to predict on it
if (target_s, target_r, target_o) in triplets_to_filter:
triplets_to_filter.remove((target_s, target_r, target_o))
# Do not consider an object if it is part of a triplet to filter
for o in range(num_entities):
if (target_s, target_r, o) not in triplets_to_filter:
filtered_o.append(o)
return torch.LongTensor(filtered_o)
def filter_s(triplets_to_filter, target_s, target_r, target_o, num_entities):
target_s, target_r, target_o = int(target_s), int(target_r), int(target_o)
filtered_s = []
# Do not filter out the test triplet, since we want to predict on it
if (target_s, target_r, target_o) in triplets_to_filter:
triplets_to_filter.remove((target_s, target_r, target_o))
# Do not consider a subject if it is part of a triplet to filter
for s in range(num_entities):
if (s, target_r, target_o) not in triplets_to_filter:
filtered_s.append(s)
return torch.LongTensor(filtered_s)
def perturb_o_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter):
""" Perturb object in the triplets
"""
num_entities = embedding.shape[0]
ranks = []
for idx in range(test_size):
if idx % 100 == 0:
print("test triplet {} / {}".format(idx, test_size))
target_s = s[idx]
target_r = r[idx]
target_o = o[idx]
filtered_o = filter_o(triplets_to_filter, target_s, target_r, target_o, num_entities)
target_o_idx = int((filtered_o == target_o).nonzero())
emb_s = embedding[target_s]
emb_r = w[target_r]
emb_o = embedding[filtered_o]
emb_triplet = emb_s * emb_r * emb_o
scores = torch.sigmoid(torch.sum(emb_triplet, dim=1))
_, indices = torch.sort(scores, descending=True)
rank = int((indices == target_o_idx).nonzero())
ranks.append(rank)
return torch.LongTensor(ranks)
def perturb_s_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter):
""" Perturb subject in the triplets
"""
num_entities = embedding.shape[0]
ranks = []
for idx in range(test_size):
if idx % 100 == 0:
print("test triplet {} / {}".format(idx, test_size))
target_s = s[idx]
target_r = r[idx]
target_o = o[idx]
filtered_s = filter_s(triplets_to_filter, target_s, target_r, target_o, num_entities)
target_s_idx = int((filtered_s == target_s).nonzero())
emb_s = embedding[filtered_s]
emb_r = w[target_r]
emb_o = embedding[target_o]
emb_triplet = emb_s * emb_r * emb_o
scores = torch.sigmoid(torch.sum(emb_triplet, dim=1))
_, indices = torch.sort(scores, descending=True)
rank = int((indices == target_s_idx).nonzero())
ranks.append(rank)
return torch.LongTensor(ranks)
def calc_filtered_mrr(embedding, w, train_triplets, valid_triplets, test_triplets, hits=[]):
with torch.no_grad():
s = test_triplets[:, 0]
r = test_triplets[:, 1]
o = test_triplets[:, 2]
test_size = test_triplets.shape[0]
triplets_to_filter = torch.cat([train_triplets, valid_triplets, test_triplets]).tolist()
triplets_to_filter = {tuple(triplet) for triplet in triplets_to_filter}
print('Perturbing subject...')
ranks_s = perturb_s_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter)
print('Perturbing object...')
ranks_o = perturb_o_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter)
ranks = torch.cat([ranks_s, ranks_o])
ranks += 1 # change to 1-indexed
mrr = torch.mean(1.0 / ranks.float())
print("MRR (filtered): {:.6f}".format(mrr.item()))
for hit in hits:
avg_count = torch.mean((ranks <= hit).float())
print("Hits (filtered) @ {}: {:.6f}".format(hit, avg_count.item()))
return mrr.item()
#######################################################################
#
# Main evaluation function
#
#######################################################################
def calc_mrr(embedding, w, train_triplets, valid_triplets, test_triplets, hits=[], eval_bz=100, eval_p="filtered"):
if eval_p == "filtered":
mrr = calc_filtered_mrr(embedding, w, train_triplets, valid_triplets, test_triplets, hits)
else:
mrr = calc_raw_mrr(embedding, w, test_triplets, hits, eval_bz)
return mrr
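# Illustrative sketch (not part of the original module): evaluating a trained model,
# assuming `embed` holds the final entity embeddings and `w` the DistMult relation
# weights (both torch tensors), with the triplet splits given as (N, 3) LongTensors.
#
# >>> mrr = calc_mrr(embed, w, train_triplets, valid_triplets, test_triplets,
# ...                hits=[1, 3, 10], eval_bz=500, eval_p="filtered")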
......@@ -1178,7 +1178,7 @@ def count_nonzero(input):
# DGL should contain all the operations on index, so this set of operators
# should be gradually removed.
def unique(input, return_inverse=False):
def unique(input, return_inverse=False, return_counts=False):
"""Returns the unique scalar elements in a tensor.
Parameters
......@@ -1188,13 +1188,19 @@ def unique(input, return_inverse=False):
return_inverse : bool, optional
Whether to also return the indices for where elements in the original
input ended up in the returned unique list.
return_counts : bool, optional
Whether to also return the counts for each unique element.
Returns
-------
Tensor
A 1-D tensor containing unique elements.
Tensor
Tensor, optional
A 1-D tensor containing the new positions of the elements in the input.
It is returned if return_inverse is True.
Tensor, optional
A 1-D tensor containing the number of occurrences of each unique element.
It is returned if return_counts is True.
"""
pass
......
......@@ -353,14 +353,20 @@ def count_nonzero(input):
tmp = input.asnumpy()
return np.count_nonzero(tmp)
def unique(input, return_inverse=False):
def unique(input, return_inverse=False, return_counts=False):
# TODO: fallback to numpy is unfortunate
tmp = input.asnumpy()
if return_inverse:
tmp, inv = np.unique(tmp, return_inverse=True)
if return_inverse and return_counts:
tmp, inv, count = np.unique(tmp, return_inverse=True, return_counts=True)
tmp = nd.array(tmp, ctx=input.context, dtype=input.dtype)
inv = nd.array(inv, ctx=input.context)
return tmp, inv
count = nd.array(count, ctx=input.context)
return tmp, inv, count
elif return_inverse or return_counts:
tmp, tmp2 = np.unique(tmp, return_inverse=return_inverse, return_counts=return_counts)
tmp = nd.array(tmp, ctx=input.context, dtype=input.dtype)
tmp2 = nd.array(tmp2, ctx=input.context)
return tmp, tmp2
else:
tmp = np.unique(tmp)
return nd.array(tmp, ctx=input.context, dtype=input.dtype)
......
......@@ -301,10 +301,10 @@ def count_nonzero(input):
# TODO: fallback to numpy for backward compatibility
return np.count_nonzero(input)
def unique(input, return_inverse=False):
def unique(input, return_inverse=False, return_counts=False):
if input.dtype == th.bool:
input = input.type(th.int8)
return th.unique(input, return_inverse=return_inverse)
return th.unique(input, return_inverse=return_inverse, return_counts=return_counts)
def full_1d(length, fill_value, dtype, ctx):
return th.full((length,), fill_value, dtype=dtype, device=ctx)
......
......@@ -425,8 +425,13 @@ def count_nonzero(input):
return int(tf.math.count_nonzero(input))
def unique(input, return_inverse=False):
if return_inverse:
def unique(input, return_inverse=False, return_counts=False):
if return_inverse and return_counts:
return tf.unique_with_counts(input)
elif return_counts:
result = tf.unique_with_counts(input)
return result.y, result.count
elif return_inverse:
return tf.unique(input)
else:
return tf.unique(input).y
......
......@@ -352,7 +352,7 @@ class FB15k237Dataset(KnowledgeGraphDataset):
>>> graph = dataset[0]
>>> train_mask = graph.edata['train_mask']
>>> train_idx = th.nonzero(train_mask, as_tuple=False).squeeze()
>>> src, dst = graph.edges(train_idx)
>>> src, dst = graph.find_edges(train_idx)
>>> rel = graph.edata['etype'][train_idx]
- ``valid`` is deprecated, it is replaced by:
......@@ -361,7 +361,7 @@ class FB15k237Dataset(KnowledgeGraphDataset):
>>> graph = dataset[0]
>>> val_mask = graph.edata['val_mask']
>>> val_idx = th.nonzero(val_mask, as_tuple=False).squeeze()
>>> src, dst = graph.edges(val_idx)
>>> src, dst = graph.find_edges(val_idx)
>>> rel = graph.edata['etype'][val_idx]
- ``test`` is deprecated, it is replaced by:
......@@ -370,7 +370,7 @@ class FB15k237Dataset(KnowledgeGraphDataset):
>>> graph = dataset[0]
>>> test_mask = graph.edata['test_mask']
>>> test_idx = th.nonzero(test_mask, as_tuple=False).squeeze()
>>> src, dst = graph.edges(test_idx)
>>> src, dst = graph.find_edges(test_idx)
>>> rel = graph.edata['etype'][test_idx]
FB15k-237 is a subset of FB15k where inverse
......
......@@ -69,7 +69,8 @@ __all__ = [
'as_heterograph',
'adj_product_graph',
'adj_sum_graph',
'reorder_graph'
'reorder_graph',
'norm_by_dst'
]
......@@ -3213,4 +3214,42 @@ def rcmk_perm(g):
perm = sparse.csgraph.reverse_cuthill_mckee(csr_adj)
return perm.copy()
def norm_by_dst(g, etype=None):
r"""Calculate normalization coefficient per edge based on destination node degree.
Parameters
----------
g : DGLGraph
The input graph.
etype : str or (str, str, str), optional
The type of the edges to calculate. The allowed edge type formats are:
* ``(str, str, str)`` for source node type, edge type and destination node type.
* or one ``str`` edge type name if the name can uniquely identify a
triplet format in the graph.
It can be omitted if the graph has a single edge type.
Returns
-------
1D Tensor
The normalization coefficient of the edges.
Examples
--------
>>> import dgl
>>> g = dgl.graph(([0, 1, 1], [1, 1, 2]))
>>> print(dgl.norm_by_dst(g))
tensor([0.5000, 0.5000, 1.0000])
"""
_, v, _ = g.edges(form='all', etype=etype)
_, inv_index, count = F.unique(v, return_inverse=True, return_counts=True)
deg = F.astype(count[inv_index], F.float32)
norm = 1. / deg
norm = F.replace_inf_with_zero(norm)
return norm
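# Illustrative sketch (not part of this patch): on a heterogeneous graph the edge
# type can be passed explicitly; the node and edge type names below are made up.
#
# >>> hg = dgl.heterograph({('user', 'follows', 'user'): ([0, 1, 1], [1, 1, 2])})
# >>> dgl.norm_by_dst(hg, etype='follows')
# tensor([0.5000, 0.5000, 1.0000])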
_init_api("dgl.transform", __name__)