"...git@developer.sourcefind.cn:renzhc/diffusers_dcu.git" did not exist on "1926331eaf59dae54aeb97cde19dae16c2fdaa48"
Unverified Commit 15b951d4 authored by Da Zheng, committed by GitHub

[KG][Model] Knowledge graph embeddings (#888)

* upd

* fix edgebatch edges

* add test

* trigger

* Update README.md for pytorch PinSage example.

Add a note that the PinSage model example under
example/pytorch/recommendation only works with Python 3.6+,
as its dataset loader depends on the stanfordnlp package,
which works only with Python 3.6+.

* Provide a framework-agnostic API to test nn modules on both CPU and CUDA.

1. make dgl.nn.xxx framework agnostic
2. make test.backend include dgl.nn modules
3. modify test_edge_softmax of test/mxnet/test_nn.py and
    test/pytorch/test_nn.py to work on both CPU and GPU

* Fix style

* Delete unused code

* Make agnostic test only related to tests/backend

1. clear all agnostic related code in dgl.nn
2. make test_graph_conv agnostic to cpu/gpu

* Fix code style

* fix

* doc

* Make all test code under tests.mxnet/pytorch.test_nn.py
work on both CPU and GPU.

* Fix syntax

* Remove rand

* Add TAGCN nn.module and example

* Now tagcn can run on CPU.

* Add unit test for TGConv

* Fix style

* For pubmed dataset, using --lr=0.005 can achieve better acc

* Fix style

* Fix some descriptions

* trigger

* Fix doc

* Add nn.TGConv and example

* Fix bug

* Update data in mxnet.tagcn test acc.

* Fix some comments and code

* delete useless code

* Fix naming

* Fix bug

* Fix bug

* Add test for mxnet TAGConv

* Add test code for mxnet TAGConv

* Update some docs

* Fix some code

* Update docs dgl.nn.mxnet

* Update weight init

* Fix

* init version.

* change default value of regularization.

* avoid specifying adversarial_temperature

* use default eval_interval.

* remove original model.

* remove optimizer.

* set default value of num_proc

* set default value of log_interval.

* don't need to set neg_sample_size_valid.

* remove unused code.

* use uni_weight by default.

* unify model.

* rename model.

* remove unnecessary data sampler.

* remove the code for checkpoint.

* fix eval.

* raise exception in invalid arguments.

* remove RowAdagrad.

* remove unsupported score function for now.

* Fix bugs of kg
Update README

* Update Readme for mxnet distmult

* Update README.md

* Update README.md

* revert changes on dmlc

* add tests.

* update CI.

* add tests script.

* reorder tests in CI.

* measure performance.

* add results on wn18

* remove some code.

* rename the training script.

* new results on TransE.

* remove --train.

* add format.

* fix.

* use EdgeSubgraph.

* create PBGNegEdgeSubgraph to simplify the code.

* fix test

* fix CI.

* run nose for unit tests.

* remove unused code in dataset.

* change argument to save embeddings.

* test training and eval scripts in CI.

* check Pytorch version.

* fix a minor problem in config.

* fix a minor bug.

* fix readme.

* Update README.md

* Update README.md

* Update README.md
parent 1c00f3a8
@@ -69,6 +69,14 @@ def unit_test_win64(backend, dev) {
    }
}

def kg_test_linux(backend, dev) {
    init_git()
    unpack_lib("dgl-${dev}-linux", dgl_linux_libs)
    timeout(time: 20, unit: 'MINUTES') {
        sh "bash tests/scripts/task_kg_test.sh ${backend} ${dev}"
    }
}

def example_test_linux(backend, dev) {
    init_git()
    unpack_lib("dgl-${dev}-linux", dgl_linux_libs)
@@ -196,6 +204,11 @@ pipeline {
                tutorial_test_linux("pytorch")
            }
        }
        stage("Knowledge Graph test") {
            steps {
                kg_test_linux("pytorch", "cpu")
            }
        }
    }
    post {
        always {
@@ -257,6 +270,11 @@ pipeline {
                unit_test_linux("mxnet", "cpu")
            }
        }
        stage("Knowledge Graph test") {
            steps {
                kg_test_linux("mxnet", "cpu")
            }
        }
        //stage("Tutorial test") {
        //    steps {
        //        tutorial_test_linux("mxnet")
...
# DGL - Knowledge Graph Embedding
## Introduction
DGL-KE aims to compute knowledge graph embeddings efficiently on very large knowledge graphs.
It can train embeddings on knowledge graphs such as FB15k and wn18 within a few minutes, and on
Freebase, which has hundreds of millions of edges, within a couple of hours.
For now, it supports the following knowledge graph embedding models:
- TransE
- DistMult
- ComplEx
More models will be supported in the near future.
DGL-KE supports multiple training modes:
- CPU & GPU training
- Mixed CPU & GPU training: node embeddings are stored on CPU and mini-batches are trained on GPU. This is designed for training KGE models on large knowledge graphs.
- Multiprocessing training on CPUs: this is designed to train KGE models on large knowledge graphs with many CPU cores.
We will support multi-GPU training and distributed training in the near future.
## Requirements
The package runs with both PyTorch and MXNet. For PyTorch, it requires PyTorch v1.2 or newer.
For MXNet, it requires MXNet 1.5 or newer.
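The backend is selected with the `DGLBACKEND` environment variable. A minimal sketch (the flags here are illustrative; see the Usage section below for complete commands):

```bash
# Run the training script with the MXNet backend (illustrative flags only).
DGLBACKEND=mxnet python3 train.py --model DistMult --dataset FB15k --gpu 0
# Run with the PyTorch backend, which the examples below use.
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset FB15k --gpu 0
```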
## Datasets
DGL-KE provides five knowledge graphs:
- FB15k
- FB15k-237
- wn18
- wn18rr
- Freebase
Users can specify one of the datasets with `--dataset` in `train.py` and `eval.py`.
## Performance
The speed is measured on an EC2 P3 instance with an NVIDIA V100 GPU.

The speed on FB15k:

| Models   | TransE | DistMult | ComplEx |
|----------|--------|----------|---------|
| MAX_STEPS| 20000  | 100000   | 100000  |
| TIME     | 411s   | 690s     | 806s    |

The accuracy on FB15k:

| Models   | MR    | MRR   | HITS@1 | HITS@3 | HITS@10 |
|----------|-------|-------|--------|--------|---------|
| TransE   | 69.12 | 0.656 | 0.567  | 0.718  | 0.802   |
| DistMult | 43.35 | 0.783 | 0.713  | 0.837  | 0.897   |
| ComplEx  | 51.99 | 0.785 | 0.720  | 0.832  | 0.889   |

The speed on wn18:

| Models   | TransE | DistMult | ComplEx |
|----------|--------|----------|---------|
| MAX_STEPS| 40000  | 10000    | 20000   |
| TIME     | 719s   | 126s     | 266s    |

The accuracy on wn18:

| Models   | MR     | MRR   | HITS@1 | HITS@3 | HITS@10 |
|----------|--------|-------|--------|--------|---------|
| TransE   | 321.35 | 0.760 | 0.652  | 0.850  | 0.940   |
| DistMult | 271.09 | 0.769 | 0.639  | 0.892  | 0.949   |
| ComplEx  | 276.37 | 0.935 | 0.916  | 0.950  | 0.960   |
## Usage
The package supports two data formats for a knowledge graph; an illustrative example of the files follows the two lists below.
Format 1:
- entities.dict maps entity Id to entity name.
- relations.dict maps relation Id to relation name.
- train.txt stores the triples (head, rel, tail) in the training set.
- valid.txt stores the triples (head, rel, tail) in the validation set.
- test.txt stores the triples (head, rel, tail) in the test set.
Format 2:
- entity2id.txt maps entity name to entity Id.
- relation2id.txt maps relation name to relation Id.
- train.txt stores the triples (head, tail, rel) in the training set.
- valid.txt stores the triples (head, tail, rel) in the validation set.
- test.txt stores the triples (head, tail, rel) in the test set.
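For illustration, format 1 files might look like the following (the entity and relation names below are made up; fields are tab-separated). In format 2 the mapping files store `name\tid` pairs instead, with the first line of `entity2id.txt`/`relation2id.txt` giving the total count, and the triple files store integer Ids in the order (head, tail, rel).

```
# entities.dict: one "id<TAB>entity_name" pair per line
0	/m/example_entity_a
1	/m/example_entity_b

# relations.dict: one "id<TAB>relation_name" pair per line
0	/example/located_in

# train.txt, valid.txt, test.txt: one "head<TAB>relation<TAB>tail" triple per line
/m/example_entity_a	/example/located_in	/m/example_entity_b
```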
Here are some examples of using the training script.
Train KGE models with GPU.
```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 \
--batch_size_eval 16 --gpu 0 --valid --test -adv
```
Train KGE models with mixed CPUs and GPUs.
```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 \
--batch_size_eval 16 --gpu 0 --valid --test -adv --mix_cpu_gpu
```
Train embeddings and evaluate them later.
```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 \
--batch_size_eval 16 --gpu 0 --valid -adv --save_emb DistMult_FB15k_emb
python3 eval.py --model_name DistMult --dataset FB15k --hidden_dim 2000 \
--gamma 500.0 --batch_size 16 --gpu 0 --model_path DistMult_FB15k_emb/
```
Train embeddings with multi-processing. This currently doesn't work in MXNet.
```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.07 --max_step 3000 \
--batch_size_eval 16 --regularization_coef 0.000001 --valid --test -adv --num_proc 8
```
## Freebase
Train embeddings on Freebase with multiprocessing on an X1 instance.
```bash
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 500.0 \
--lr 0.1 --max_step 50000 --batch_size_eval 128 --test -adv --eval_interval 300000 \
--neg_sample_size_test 10000 --eval_percent 0.2 --num_proc 64

# Sample output:
Test average MR at [0/50000]: 754.5566055566055
Test average MRR at [0/50000]: 0.7333319016877765
Test average HITS@1 at [0/50000]: 0.7182952182952183
Test average HITS@3 at [0/50000]: 0.7409752409752409
Test average HITS@10 at [0/50000]: 0.7587412587412588
```
# To reproduce the reported results in this README, run the models with the following commands:
# for FB15k
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 \
--batch_size_eval 16 --gpu 0 --valid --test -adv
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.2 --max_step 100000 \
--batch_size_eval 16 --gpu 1 --valid --test -adv
DGLBACKEND=pytorch python3 train.py --model TransE --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 2000 --gamma 24.0 --lr 0.01 --max_step 20000 \
--batch_size_eval 16 --gpu 0 --valid --test -adv
# for wn18
DGLBACKEND=pytorch python3 train.py --model TransE --dataset wn18 --batch_size 1024 \
--neg_sample_size 512 --hidden_dim 500 --gamma 12.0 --adversarial_temperature 0.5 \
--lr 0.01 --max_step 40000 --batch_size_eval 16 --gpu 0 --valid --test -adv \
--regularization_coef 0.00001
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset wn18 --batch_size 1024 \
--neg_sample_size 1024 --hidden_dim 1000 --gamma 200.0 --lr 0.1 --max_step 10000 \
--batch_size_eval 16 --gpu 0 --valid --test -adv --regularization_coef 0.00001
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset wn18 --batch_size 1024 \
--neg_sample_size 1024 --hidden_dim 500 --gamma 200.0 --lr 0.1 --max_step 20000 \
--batch_size_eval 16 --gpu 0 --valid --test -adv --regularization_coef 0.00001
import os
def _download_and_extract(url, path, filename):
import shutil, zipfile
from tqdm import tqdm
import requests
fn = os.path.join(path, filename)
while True:
try:
with zipfile.ZipFile(fn) as zf:
zf.extractall(path)
print('Unzip finished.')
break
except Exception:
os.makedirs(path, exist_ok=True)
f_remote = requests.get(url, stream=True)
sz = f_remote.headers.get('content-length')
assert f_remote.status_code == 200, 'fail to open {}'.format(url)
with open(fn, 'wb') as writer:
for chunk in tqdm(f_remote.iter_content(chunk_size=1024*1024)):
writer.write(chunk)
print('Download finished. Unzipping the file...')
class KGDataset1:
'''Load a knowledge graph with format 1
In this format, the folder with a knowledge graph has five files:
* entities.dict stores the mapping between entity Id and entity name.
* relations.dict stores the mapping between relation Id and relation name.
* train.txt stores the triples in the training set.
* valid.txt stores the triples in the validation set.
* test.txt stores the triples in the test set.
The mapping between entity (relation) Id and entity (relation) name is stored as 'id\tname'.
The triples are stored as 'head_name\trelation_name\ttail_name'.
'''
def __init__(self, path, name):
url = 'https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/{}.zip'.format(name)
if not os.path.exists(os.path.join(path, name)):
print('File not found. Downloading from', url)
_download_and_extract(url, path, name + '.zip')
path = os.path.join(path, name)
with open(os.path.join(path, 'entities.dict')) as f:
entity2id = {}
for line in f:
eid, entity = line.strip().split('\t')
entity2id[entity] = int(eid)
self.entity2id = entity2id
with open(os.path.join(path, 'relations.dict')) as f:
relation2id = {}
for line in f:
rid, relation = line.strip().split('\t')
relation2id[relation] = int(rid)
self.relation2id = relation2id
# TODO: deal with the countries dataset.
self.n_entities = len(self.entity2id)
self.n_relations = len(self.relation2id)
self.train = self.read_triple(path, 'train')
self.valid = self.read_triple(path, 'valid')
self.test = self.read_triple(path, 'test')
def read_triple(self, path, mode):
# mode: train/valid/test
triples = []
with open(os.path.join(path, '{}.txt'.format(mode))) as f:
for line in f:
h, r, t = line.strip().split('\t')
triples.append((self.entity2id[h], self.relation2id[r], self.entity2id[t]))
return triples
class KGDataset2:
'''Load a knowledge graph with format 2
In this format, the folder with a knowledge graph has five files:
* entity2id.txt stores the mapping between entity name and entity Id.
* relation2id.txt stores the mapping between relation name and relation Id.
* train.txt stores the triples in the training set.
* valid.txt stores the triples in the validation set.
* test.txt stores the triples in the test set.
The mapping between entity (relation) name and entity (relation) Id is stored as 'name\tid'.
The triples are stored as 'head_nid\ttail_nid\trelation_id'.
'''
def __init__(self, path, name):
url = 'https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/{}.zip'.format(name)
if not os.path.exists(os.path.join(path, name)):
print('File not found. Downloading from', url)
_download_and_extract(url, path, '{}.zip'.format(name))
self.path = os.path.join(path, name)
f_ent2id = os.path.join(self.path, 'entity2id.txt')
f_rel2id = os.path.join(self.path, 'relation2id.txt')
with open(f_ent2id) as f_ent:
self.n_entities = int(f_ent.readline()[:-1])
with open(f_rel2id) as f_rel:
self.n_relations = int(f_rel.readline()[:-1])
self.train = self.read_triple(self.path, 'train')
self.valid = self.read_triple(self.path, 'valid')
self.test = self.read_triple(self.path, 'test')
def read_triple(self, path, mode, skip_first_line=False):
triples = []
print('Reading {} triples....'.format(mode))
with open(os.path.join(path, '{}.txt'.format(mode))) as f:
if skip_first_line:
_ = f.readline()
for line in f:
h, t, r = line.strip().split('\t')
triples.append((int(h), int(r), int(t)))
print('Finished. Read {} {} triples.'.format(len(triples), mode))
return triples
def get_dataset(data_path, data_name, format_str):
if data_name == 'Freebase':
dataset = KGDataset2(data_path, data_name)
elif format_str == '1':
dataset = KGDataset1(data_path, data_name)
else:
dataset = KGDataset2(data_path, data_name)
return dataset
from .KGDataset import *
from .sampler import *
import math
import numpy as np
import scipy as sp
import scipy.sparse
import dgl.backend as F
import dgl
import os
import pickle
import time
# This partitions a list of edges based on relations to make sure
# each partition has roughly the same number of edges and relations.
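# Greedy balancing: relations are sorted by frequency (most frequent first) and each
# relation is assigned to the partition that currently holds the fewest edges, so all
# edges of a relation land in the same partition while edge counts stay roughly even.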
def RelationPartition(edges, n):
print('relation partition {} edges into {} parts'.format(len(edges), n))
rel = np.array([r for h, r, t in edges])
uniq, cnts = np.unique(rel, return_counts=True)
idx = np.flip(np.argsort(cnts))
cnts = cnts[idx]
uniq = uniq[idx]
assert cnts[0] > cnts[-1]
edge_cnts = np.zeros(shape=(n,), dtype=np.int64)
rel_cnts = np.zeros(shape=(n,), dtype=np.int64)
rel_dict = {}
for i in range(len(cnts)):
cnt = cnts[i]
r = uniq[i]
idx = np.argmin(edge_cnts)
rel_dict[r] = idx
edge_cnts[idx] += cnt
rel_cnts[idx] += 1
for i, edge_cnt in enumerate(edge_cnts):
print('part {} has {} edges and {} relations'.format(i, edge_cnt, rel_cnts[i]))
parts = []
for _ in range(n):
parts.append([])
for h, r, t in edges:
idx = rel_dict[r]
parts[idx].append((h, r, t))
return parts
def RandomPartition(edges, n):
print('random partition {} edges into {} parts'.format(len(edges), n))
idx = np.random.permutation(len(edges))
part_size = int(math.ceil(len(idx) / n))
parts = []
for i in range(n):
start = part_size * i
end = min(part_size * (i + 1), len(idx))
parts.append([edges[i] for i in idx[start:end]])
return parts
def ConstructGraph(edges, n_entities, i, args):
pickle_name = 'graph_train_{}.pickle'.format(i)
if args.pickle_graph and os.path.exists(os.path.join(args.data_path, args.dataset, pickle_name)):
with open(os.path.join(args.data_path, args.dataset, pickle_name), 'rb') as graph_file:
g = pickle.load(graph_file)
print('Load pickled graph.')
else:
src = [t[0] for t in edges]
etype_id = [t[1] for t in edges]
dst = [t[2] for t in edges]
coo = sp.sparse.coo_matrix((np.ones(len(src)), (src, dst)), shape=[n_entities, n_entities])
g = dgl.DGLGraph(coo, readonly=True, sort_csr=True)
g.ndata['id'] = F.arange(0, g.number_of_nodes())
g.edata['id'] = F.tensor(etype_id, F.int64)
if args.pickle_graph:
with open(os.path.join(args.data_path, args.dataset, pickle_name), 'wb') as graph_file:
pickle.dump(g, graph_file)
return g
class TrainDataset(object):
def __init__(self, dataset, args, weighting=False, ranks=64):
triples = dataset.train
print('|Train|:', len(triples))
if ranks > 1 and args.rel_part:
triples_list = RelationPartition(triples, ranks)
elif ranks > 1:
triples_list = RandomPartition(triples, ranks)
else:
triples_list = [triples]
self.graphs = []
for i, triples in enumerate(triples_list):
g = ConstructGraph(triples, dataset.n_entities, i, args)
if weighting:
# TODO: weight to be added
count = self.count_freq(triples)
subsampling_weight = np.vectorize(
lambda h, r, t: np.sqrt(1 / (count[(h, r)] + count[(t, -r - 1)]))
)
# derive head/relation/tail lists for this partition's triples
src = [t[0] for t in triples]
etype_id = [t[1] for t in triples]
dst = [t[2] for t in triples]
weight = subsampling_weight(src, etype_id, dst)
g.edata['weight'] = F.zerocopy_from_numpy(weight)
# to be added
self.graphs.append(g)
def count_freq(self, triples, start=4):
count = {}
for head, rel, tail in triples:
if (head, rel) not in count:
count[(head, rel)] = start
else:
count[(head, rel)] += 1
if (tail, -rel - 1) not in count:
count[(tail, -rel - 1)] = start
else:
count[(tail, -rel - 1)] += 1
return count
def create_sampler(self, batch_size, neg_sample_size=2, mode='head', num_workers=5,
shuffle=True, exclude_positive=False, rank=0):
EdgeSampler = getattr(dgl.contrib.sampling, 'EdgeSampler')
return EdgeSampler(self.graphs[rank],
batch_size=batch_size,
neg_sample_size=neg_sample_size,
negative_mode=mode,
num_workers=num_workers,
shuffle=shuffle,
exclude_positive=exclude_positive,
return_false_neg=False)
class PBGNegEdgeSubgraph(dgl.subgraph.DGLSubGraph):
def __init__(self, subg, num_chunks, chunk_size,
neg_sample_size, neg_head):
super(PBGNegEdgeSubgraph, self).__init__(subg._parent, subg.sgi)
self.subg = subg
self.num_chunks = num_chunks
self.chunk_size = chunk_size
self.neg_sample_size = neg_sample_size
self.neg_head = neg_head
@property
def head_nid(self):
return self.subg.head_nid
@property
def tail_nid(self):
return self.subg.tail_nid
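# Chunked ("PBG-style") negative sampling: the positive edges are grouped into num_chunks
# chunks of chunk_size edges each, and every positive edge in a chunk is scored against
# the same neg_sample_size candidate head (or tail) nodes stored in the negative graph.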
def create_neg_subgraph(pos_g, neg_g, is_pbg, neg_head, num_nodes):
assert neg_g.number_of_edges() % pos_g.number_of_edges() == 0
neg_sample_size = int(neg_g.number_of_edges() / pos_g.number_of_edges())
# We use all nodes to create negative edges. Regardless of the sampling algorithm,
# we can always view the subgraph with one chunk.
if (neg_head and len(neg_g.head_nid) == num_nodes) \
or (not neg_head and len(neg_g.tail_nid) == num_nodes):
num_chunks = 1
chunk_size = pos_g.number_of_edges()
elif is_pbg:
if pos_g.number_of_edges() < neg_sample_size:
num_chunks = 1
chunk_size = pos_g.number_of_edges()
else:
# This is probably the last batch. Let's ignore it.
if pos_g.number_of_edges() % neg_sample_size > 0:
return None
num_chunks = int(pos_g.number_of_edges()/ neg_sample_size)
chunk_size = neg_sample_size
else:
num_chunks = pos_g.number_of_edges()
chunk_size = 1
return PBGNegEdgeSubgraph(neg_g, num_chunks, chunk_size,
neg_sample_size, neg_head)
class EvalSampler(object):
def __init__(self, g, edges, batch_size, neg_sample_size, mode, num_workers):
EdgeSampler = getattr(dgl.contrib.sampling, 'EdgeSampler')
self.sampler = EdgeSampler(g,
batch_size=batch_size,
seed_edges=edges,
neg_sample_size=neg_sample_size,
negative_mode=mode,
num_workers=num_workers,
shuffle=False,
exclude_positive=False,
relations=g.edata['id'],
return_false_neg=True)
self.sampler_iter = iter(self.sampler)
self.mode = mode
self.neg_head = 'head' in mode
self.g = g
def __iter__(self):
return self
def __next__(self):
while True:
pos_g, neg_g = next(self.sampler_iter)
neg_positive = neg_g.edata['false_neg']
neg_g = create_neg_subgraph(pos_g, neg_g, 'PBG' in self.mode,
self.neg_head, self.g.number_of_nodes())
if neg_g is not None:
break
pos_g.copy_from_parent()
neg_g.copy_from_parent()
neg_g.edata['bias'] = F.astype(-neg_positive, F.float32)
return pos_g, neg_g
def reset(self):
self.sampler_iter = iter(self.sampler)
return self
class EvalDataset(object):
def __init__(self, dataset, args):
triples = dataset.train + dataset.valid + dataset.test
pickle_name = 'graph_all.pickle'
if args.pickle_graph and os.path.exists(os.path.join(args.data_path, args.dataset, pickle_name)):
with open(os.path.join(args.data_path, args.dataset, pickle_name), 'rb') as graph_file:
g = pickle.load(graph_file)
print('Load pickled graph.')
else:
src = [t[0] for t in triples]
etype_id = [t[1] for t in triples]
dst = [t[2] for t in triples]
coo = sp.sparse.coo_matrix((np.ones(len(src)), (src, dst)), shape=[dataset.n_entities, dataset.n_entities])
g = dgl.DGLGraph(coo, readonly=True, sort_csr=True)
g.ndata['id'] = F.arange(0, g.number_of_nodes())
g.edata['id'] = F.tensor(etype_id, F.int64)
if args.pickle_graph:
with open(os.path.join(args.data_path, args.dataset, pickle_name), 'wb') as graph_file:
pickle.dump(g, graph_file)
self.g = g
self.num_train = len(dataset.train)
self.num_valid = len(dataset.valid)
self.num_test = len(dataset.test)
if args.eval_percent < 1:
self.valid = np.random.randint(0, self.num_valid,
size=(int(self.num_valid * args.eval_percent),)) + self.num_train
else:
self.valid = np.arange(self.num_train, self.num_train + self.num_valid)
print('|valid|:', len(self.valid))
if args.eval_percent < 1:
self.test = np.random.randint(0, self.num_test,
size=(int(self.num_test * args.eval_percent),))
self.test += self.num_train + self.num_valid
else:
self.test = np.arange(self.num_train + self.num_valid, self.g.number_of_edges())
print('|test|:', len(self.test))
self.num_valid = len(self.valid)
self.num_test = len(self.test)
def get_edges(self, eval_type):
if eval_type == 'valid':
return self.valid
elif eval_type == 'test':
return self.test
else:
raise Exception('get invalid type: ' + eval_type)
def check(self, eval_type):
edges = self.get_edges(eval_type)
subg = self.g.edge_subgraph(edges)
if eval_type == 'valid':
data = self.valid
elif eval_type == 'test':
data = self.test
subg.copy_from_parent()
src, dst, eid = subg.all_edges('all', order='eid')
src_id = subg.ndata['id'][src]
dst_id = subg.ndata['id'][dst]
etype = subg.edata['id'][eid]
orig_src = np.array([t[0] for t in data])
orig_etype = np.array([t[1] for t in data])
orig_dst = np.array([t[2] for t in data])
np.testing.assert_equal(F.asnumpy(src_id), orig_src)
np.testing.assert_equal(F.asnumpy(dst_id), orig_dst)
np.testing.assert_equal(F.asnumpy(etype), orig_etype)
def create_sampler(self, eval_type, batch_size, neg_sample_size, mode='head',
num_workers=5, rank=0, ranks=1):
edges = self.get_edges(eval_type)
beg = edges.shape[0] * rank // ranks
end = min(edges.shape[0] * (rank + 1) // ranks, edges.shape[0])
edges = edges[beg: end]
print("eval on {} edges".format(len(edges)))
return EvalSampler(self.g, edges, batch_size, neg_sample_size, mode, num_workers)
class NewBidirectionalOneShotIterator:
def __init__(self, dataloader_head, dataloader_tail, is_pbg, num_nodes):
self.sampler_head = dataloader_head
self.sampler_tail = dataloader_tail
self.iterator_head = self.one_shot_iterator(dataloader_head, is_pbg,
True, num_nodes)
self.iterator_tail = self.one_shot_iterator(dataloader_tail, is_pbg,
False, num_nodes)
self.step = 0
def __next__(self):
self.step += 1
if self.step % 2 == 0:
pos_g, neg_g = next(self.iterator_head)
else:
pos_g, neg_g = next(self.iterator_tail)
return pos_g, neg_g
@staticmethod
def one_shot_iterator(dataloader, is_pbg, neg_head, num_nodes):
while True:
for pos_g, neg_g in dataloader:
neg_g = create_neg_subgraph(pos_g, neg_g, is_pbg, neg_head, num_nodes)
if neg_g is None:
continue
pos_g.copy_from_parent()
neg_g.copy_from_parent()
yield pos_g, neg_g
from dataloader import EvalDataset, TrainDataset
from dataloader import get_dataset
import argparse
import torch.multiprocessing as mp
import os
import logging
import time
import pickle
import line_profiler
backend = os.environ.get('DGLBACKEND', 'pytorch')
if backend.lower() == 'mxnet':
from train_mxnet import load_model_from_checkpoint
from train_mxnet import test
else:
from train_pytorch import load_model_from_checkpoint
from train_pytorch import test
class ArgParser(argparse.ArgumentParser):
def __init__(self):
super(ArgParser, self).__init__()
self.add_argument('--model_name', default='TransE',
choices=['TransE', 'TransH', 'TransR', 'TransD',
'RESCAL', 'DistMult', 'ComplEx', 'RotatE', 'pRotatE'],
help='model to use')
self.add_argument('--data_path', type=str, default='data',
help='root path of all dataset')
self.add_argument('--dataset', type=str, default='FB15k',
help='dataset name, under data_path')
self.add_argument('--format', type=str, default='1',
help='the format of the dataset.')
self.add_argument('--model_path', type=str, default='ckpts',
help='the place where models are saved')
self.add_argument('--batch_size', type=int, default=8,
help='batch size used for eval and test')
self.add_argument('--neg_sample_size', type=int, default=-1,
help='negative sampling size for testing')
self.add_argument('--hidden_dim', type=int, default=256,
help='hidden dim used by relation and entity')
self.add_argument('-g', '--gamma', type=float, default=12.0,
help='margin value')
self.add_argument('--eval_percent', type=float, default=1,
help='sample some percentage for evaluation.')
self.add_argument('--gpu', type=int, default=-1,
help='GPU device id to use (-1 means CPU)')
self.add_argument('--mix_cpu_gpu', action='store_true',
help='mix CPU and GPU training')
self.add_argument('-de', '--double_ent', action='store_true',
help='double entity dim for complex number')
self.add_argument('-dr', '--double_rel', action='store_true',
help='double relation dim for complex number')
self.add_argument('--seed', type=int, default=0,
help='set random seed for reproducibility')
self.add_argument('--num_worker', type=int, default=16,
help='number of workers used for loading data')
self.add_argument('--num_proc', type=int, default=1,
help='number of process used')
def parse_args(self):
args = super().parse_args()
return args
def get_logger(args):
if not os.path.exists(args.model_path):
raise Exception('No existing model_path: ' + args.model_path)
log_file = os.path.join(args.model_path, 'eval.log')
logging.basicConfig(
format='%(asctime)s %(levelname)-8s %(message)s',
level=logging.INFO,
datefmt='%Y-%m-%d %H:%M:%S',
filename=log_file,
filemode='w'
)
logger = logging.getLogger(__name__)
print("Logs are being recorded at: {}".format(log_file))
return logger
def main(args):
# load dataset and samplers
dataset = get_dataset(args.data_path, args.dataset, args.format)
args.pickle_graph = False
args.train = False
args.valid = False
args.test = True
args.batch_size_eval = args.batch_size
logger = get_logger(args)
# Here we want to use the regular negative sampler because we need to ensure that
# all positive edges are excluded.
eval_dataset = EvalDataset(dataset, args)
args.neg_sample_size_test = args.neg_sample_size
if args.neg_sample_size < 0:
args.neg_sample_size_test = args.neg_sample_size = eval_dataset.g.number_of_nodes()
if args.num_proc > 1:
test_sampler_tails = []
test_sampler_heads = []
for i in range(args.num_proc):
test_sampler_head = eval_dataset.create_sampler('test', args.batch_size,
args.neg_sample_size,
mode='head',
num_workers=args.num_worker,
rank=i, ranks=args.num_proc)
test_sampler_tail = eval_dataset.create_sampler('test', args.batch_size,
args.neg_sample_size,
mode='tail',
num_workers=args.num_worker,
rank=i, ranks=args.num_proc)
test_sampler_heads.append(test_sampler_head)
test_sampler_tails.append(test_sampler_tail)
else:
test_sampler_head = eval_dataset.create_sampler('test', args.batch_size,
args.neg_sample_size,
mode='head',
num_workers=args.num_worker,
rank=0, ranks=1)
test_sampler_tail = eval_dataset.create_sampler('test', args.batch_size,
args.neg_sample_size,
mode='tail',
num_workers=args.num_worker,
rank=0, ranks=1)
# load model
n_entities = dataset.n_entities
n_relations = dataset.n_relations
ckpt_path = args.model_path
model = load_model_from_checkpoint(logger, args, n_entities, n_relations, ckpt_path)
if args.num_proc > 1:
model.share_memory()
# test
args.step = 0
args.max_step = 0
if args.num_proc > 1:
procs = []
for i in range(args.num_proc):
proc = mp.Process(target=test, args=(args, model, [test_sampler_heads[i], test_sampler_tails[i]]))
procs.append(proc)
proc.start()
for proc in procs:
proc.join()
else:
test(args, model, [test_sampler_head, test_sampler_tail])
if __name__ == '__main__':
args = ArgParser().parse_args()
main(args)
from .general_models import KEModel
import os
import numpy as np
import dgl.backend as F
backend = os.environ.get('DGLBACKEND', 'pytorch')
if backend.lower() == 'mxnet':
from .mxnet.tensor_models import logsigmoid
from .mxnet.tensor_models import get_device
from .mxnet.tensor_models import norm
from .mxnet.tensor_models import get_scalar
from .mxnet.tensor_models import reshape
from .mxnet.tensor_models import cuda
from .mxnet.tensor_models import ExternalEmbedding
from .mxnet.score_fun import *
else:
from .pytorch.tensor_models import logsigmoid
from .pytorch.tensor_models import get_device
from .pytorch.tensor_models import norm
from .pytorch.tensor_models import get_scalar
from .pytorch.tensor_models import reshape
from .pytorch.tensor_models import cuda
from .pytorch.tensor_models import ExternalEmbedding
from .pytorch.score_fun import *
class KEModel(object):
def __init__(self, args, model_name, n_entities, n_relations, hidden_dim, gamma,
double_entity_emb=False, double_relation_emb=False):
super(KEModel, self).__init__()
self.args = args
self.n_entities = n_entities
self.model_name = model_name
self.hidden_dim = hidden_dim
self.eps = 2.0
self.emb_init = (gamma + self.eps) / hidden_dim
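# Embeddings are initialized uniformly in [-(gamma + eps) / dim, (gamma + eps) / dim]
# (the scheme used in the RotatE reference implementation), so that initial scores
# fall roughly within the margin gamma.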
entity_dim = 2 * hidden_dim if double_entity_emb else hidden_dim
relation_dim = 2 * hidden_dim if double_relation_emb else hidden_dim
device = get_device(args)
self.entity_emb = ExternalEmbedding(args, n_entities, entity_dim,
F.cpu() if args.mix_cpu_gpu else device)
# For RESCAL, relation_emb = relation_dim * entity_dim
if model_name == 'RESCAL':
rel_dim = relation_dim * entity_dim
else:
rel_dim = relation_dim
self.relation_emb = ExternalEmbedding(args, n_relations, rel_dim, device)
if model_name == 'TransE':
self.score_func = TransEScore(gamma)
elif model_name == 'DistMult':
self.score_func = DistMultScore()
elif model_name == 'ComplEx':
self.score_func = ComplExScore()
self.head_neg_score = self.score_func.create_neg(True)
self.tail_neg_score = self.score_func.create_neg(False)
self.reset_parameters()
def share_memory(self):
# TODO(zhengda) we should make it work for parameters in score func
self.entity_emb.share_memory()
self.relation_emb.share_memory()
def save_emb(self, path, dataset):
self.entity_emb.save(path, dataset+'_'+self.model_name+'_entity')
self.relation_emb.save(path, dataset+'_'+self.model_name+'_relation')
self.score_func.save(path, dataset)
def load_emb(self, path, dataset):
self.entity_emb.load(path, dataset+'_'+self.model_name+'_entity')
self.relation_emb.load(path, dataset+'_'+self.model_name+'_relation')
self.score_func.load(path, dataset)
def reset_parameters(self):
self.entity_emb.init(self.emb_init)
self.relation_emb.init(self.emb_init)
self.score_func.reset_parameters()
def predict_score(self, g):
self.score_func(g)
return g.edata['score']
def predict_neg_score(self, pos_g, neg_g, to_device=None, gpu_id=-1, trace=False):
num_chunks = neg_g.num_chunks
chunk_size = neg_g.chunk_size
neg_sample_size = neg_g.neg_sample_size
if neg_g.neg_head:
neg_head_ids = neg_g.ndata['id'][neg_g.head_nid]
neg_head = self.entity_emb(neg_head_ids, gpu_id, trace)
_, tail_ids = pos_g.all_edges(order='eid')
if to_device is not None and gpu_id >= 0:
tail_ids = to_device(tail_ids, gpu_id)
tail = pos_g.ndata['emb'][tail_ids]
rel = pos_g.edata['emb']
neg_score = self.head_neg_score(neg_head, rel, tail,
num_chunks, chunk_size, neg_sample_size)
else:
neg_tail_ids = neg_g.ndata['id'][neg_g.tail_nid]
neg_tail = self.entity_emb(neg_tail_ids, gpu_id, trace)
head_ids, _ = pos_g.all_edges(order='eid')
if to_device is not None and gpu_id >= 0:
head_ids = to_device(head_ids, gpu_id)
head = pos_g.ndata['emb'][head_ids]
rel = pos_g.edata['emb']
neg_score = self.tail_neg_score(head, rel, neg_tail,
num_chunks, chunk_size, neg_sample_size)
return neg_score
def forward_test(self, pos_g, neg_g, logs, gpu_id=-1):
pos_g.ndata['emb'] = self.entity_emb(pos_g.ndata['id'], gpu_id, False)
pos_g.edata['emb'] = self.relation_emb(pos_g.edata['id'], gpu_id, False)
batch_size = pos_g.number_of_edges()
pos_scores = self.predict_score(pos_g)
pos_scores = reshape(logsigmoid(pos_scores), batch_size, -1)
neg_scores = self.predict_neg_score(pos_g, neg_g, to_device=cuda,
gpu_id=gpu_id, trace=False)
neg_scores = reshape(logsigmoid(neg_scores), batch_size, -1)
# We need to filter the positive edges in the negative graph.
filter_bias = reshape(neg_g.edata['bias'], batch_size, -1)
if self.args.gpu >= 0:
filter_bias = cuda(filter_bias, self.args.gpu)
neg_scores += filter_bias
# To compute the rank of a positive edge among all negative edges,
# we need to know how many negative edges have higher scores than
# the positive edge.
rankings = F.sum(neg_scores > pos_scores, dim=1) + 1
rankings = F.asnumpy(rankings)
for i in range(batch_size):
ranking = rankings[i]
logs.append({
'MRR': 1.0 / ranking,
'MR': float(ranking),
'HITS@1': 1.0 if ranking <= 1 else 0.0,
'HITS@3': 1.0 if ranking <= 3 else 0.0,
'HITS@10': 1.0 if ranking <= 10 else 0.0
})
# @profile
def forward(self, pos_g, neg_g, gpu_id=-1):
pos_g.ndata['emb'] = self.entity_emb(pos_g.ndata['id'], gpu_id, True)
pos_g.edata['emb'] = self.relation_emb(pos_g.edata['id'], gpu_id, True)
pos_score = self.predict_score(pos_g)
pos_score = logsigmoid(pos_score)
if gpu_id >= 0:
neg_score = self.predict_neg_score(pos_g, neg_g, to_device=cuda,
gpu_id=gpu_id, trace=True)
else:
neg_score = self.predict_neg_score(pos_g, neg_g, trace=True)
neg_score = reshape(neg_score, -1, neg_g.neg_sample_size)
# Adversarial sampling
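# Self-adversarial negative sampling: negatives are weighted by a softmax over their
# scores scaled by adversarial_temperature; detach() keeps gradients from flowing
# through the weights themselves.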
if self.args.neg_adversarial_sampling:
neg_score = F.sum(F.softmax(neg_score * self.args.adversarial_temperature, dim=1).detach()
* logsigmoid(-neg_score), dim=1)
else:
neg_score = F.mean(logsigmoid(-neg_score), dim=1)
# subsampling weight
# TODO: add subsampling to new sampler
if self.args.non_uni_weight:
subsampling_weight = pos_g.edata['weight']
pos_score = (pos_score * subsampling_weight).sum() / subsampling_weight.sum()
neg_score = (neg_score * subsampling_weight).sum() / subsampling_weight.sum()
else:
pos_score = pos_score.mean()
neg_score = neg_score.mean()
# compute loss
loss = -(pos_score + neg_score) / 2
log = {'pos_loss': - get_scalar(pos_score),
'neg_loss': - get_scalar(neg_score),
'loss': get_scalar(loss)}
# regularization: TODO(zihao)
#TODO: only reg ent&rel embeddings. other params to be added.
if self.args.regularization_coef > 0.0 and self.args.regularization_norm > 0:
coef, nm = self.args.regularization_coef, self.args.regularization_norm
reg = coef * (norm(self.entity_emb.curr_emb(), nm) + norm(self.relation_emb.curr_emb(), nm))
log['regularization'] = get_scalar(reg)
loss = loss + reg
return loss, log
def update(self):
self.entity_emb.update()
self.relation_emb.update()
import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn
from mxnet import ndarray as nd
class TransEScore(nn.Block):
def __init__(self, gamma):
super(TransEScore, self).__init__()
self.gamma = gamma
def edge_func(self, edges):
head = edges.src['emb']
tail = edges.dst['emb']
rel = edges.data['emb']
score = head + rel - tail
return {'score': self.gamma - nd.norm(score, ord=1, axis=-1)}
def reset_parameters(self):
pass
def save(self, path, name):
pass
def load(self, path, name):
pass
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
def create_neg(self, neg_head):
gamma = self.gamma
if neg_head:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
heads = heads.reshape(num_chunks, 1, neg_sample_size, hidden_dim)
tails = tails - relations
tails = tails.reshape(num_chunks,chunk_size, 1, hidden_dim)
return gamma - nd.norm(heads - tails, ord=1, axis=-1)
return fn
else:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
heads = heads + relations
heads = heads.reshape(num_chunks, chunk_size, 1, hidden_dim)
tails = tails.reshape(num_chunks, 1, neg_sample_size, hidden_dim)
return gamma - nd.norm(heads - tails, ord=1, axis=-1)
return fn
class DistMultScore(nn.Block):
def __init__(self):
super(DistMultScore, self).__init__()
def edge_func(self, edges):
head = edges.src['emb']
tail = edges.dst['emb']
rel = edges.data['emb']
score = head * rel * tail
# TODO: check if there exists minus sign and if gamma should be used here(jin)
return {'score': nd.sum(score, axis=-1)}
def reset_parameters(self):
pass
def save(self, path, name):
pass
def load(self, path, name):
pass
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
def create_neg(self, neg_head):
if neg_head:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
heads = heads.reshape(num_chunks, neg_sample_size, hidden_dim)
heads = nd.transpose(heads, axes=(0, 2, 1))
tmp = (tails * relations).reshape(num_chunks, chunk_size, hidden_dim)
return nd.linalg_gemm2(tmp, heads)
return fn
else:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
tails = tails.reshape(num_chunks, neg_sample_size, hidden_dim)
tails = nd.transpose(tails, axes=(0, 2, 1))
tmp = (heads * relations).reshape(num_chunks, chunk_size, hidden_dim)
return nd.linalg_gemm2(tmp, tails)
return fn
import os
import numpy as np
import mxnet as mx
from mxnet import gluon
from mxnet import ndarray as nd
from .score_fun import *
from .. import *
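# Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)): the max-trick below avoids
# overflow in exp() for large negative inputs.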
def logsigmoid(val):
max_elem = nd.maximum(0., -val)
z = nd.exp(-max_elem) + nd.exp(-val - max_elem)
return -(max_elem + nd.log(z))
get_device = lambda args : mx.gpu(args.gpu) if args.gpu >= 0 else mx.cpu()
norm = lambda x, p: nd.sum(nd.abs(x) ** p)
get_scalar = lambda x: x.detach().asscalar()
reshape = lambda arr, x, y: arr.reshape(x, y)
cuda = lambda arr, gpu: arr.as_in_context(mx.gpu(gpu))
class ExternalEmbedding:
def __init__(self, args, num, dim, ctx):
self.gpu = args.gpu
self.args = args
self.trace = []
self.emb = nd.empty((num, dim), dtype=np.float32, ctx=ctx)
self.state_sum = nd.zeros((self.emb.shape[0]), dtype=np.float32, ctx=ctx)
self.state_step = 0
def init(self, emb_init):
nd.random.uniform(-emb_init, emb_init,
shape=self.emb.shape, dtype=self.emb.dtype,
ctx=self.emb.context, out=self.emb)
def share_memory(self):
# TODO(zhengda) fix this later
pass
def __call__(self, idx, gpu_id=-1, trace=True):
if self.emb.context != idx.context:
idx = idx.as_in_context(self.emb.context)
data = nd.take(self.emb, idx)
if self.gpu >= 0:
data = data.as_in_context(mx.gpu(self.gpu))
data.attach_grad()
if trace:
self.trace.append((idx, data))
return data
def update(self):
self.state_step += 1
for idx, data in self.trace:
grad = data.grad
clr = self.args.lr
#clr = self.args.lr / (1 + (self.state_step - 1) * group['lr_decay'])
# the update is non-linear so indices must be unique
grad_indices = idx
grad_values = grad
grad_sum = (grad_values * grad_values).mean(1)
ctx = self.state_sum.context
if ctx != grad_indices.context:
grad_indices = grad_indices.as_in_context(ctx)
if ctx != grad_sum.context:
grad_sum = grad_sum.as_in_context(ctx)
self.state_sum[grad_indices] += grad_sum
std = self.state_sum[grad_indices] # _sparse_mask
std_values = nd.expand_dims(nd.sqrt(std) + 1e-10, 1)
if self.gpu >= 0:
std_values = std_values.as_in_context(mx.gpu(self.args.gpu))
tmp = (-clr * grad_values / std_values)
if tmp.context != ctx:
tmp = tmp.as_in_context(ctx)
# TODO(zhengda) the overhead is here.
self.emb[grad_indices] = mx.nd.take(self.emb, grad_indices) + tmp
self.trace = []
def curr_emb(self):
data = [data for _, data in self.trace]
return nd.concat(*data, dim=0)
def save(self, path, name):
emb_fname = os.path.join(path, name+'.emb')
nd.save(emb_fname, self.emb)
def load(self, path, name):
emb_fname = os.path.join(path, name+'.emb')
self.emb = nd.load(emb_fname)[0]
import torch as th
import torch.nn as nn
import torch.nn.functional as functional
import torch.nn.init as INIT
class TransEScore(nn.Module):
def __init__(self, gamma):
super(TransEScore, self).__init__()
self.gamma = gamma
def edge_func(self, edges):
head = edges.src['emb']
tail = edges.dst['emb']
rel = edges.data['emb']
score = head + rel - tail
return {'score': self.gamma - th.norm(score, p=1, dim=-1)}
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
def reset_parameters(self):
pass
def save(self, path, name):
pass
def load(self, path, name):
pass
def create_neg(self, neg_head):
gamma = self.gamma
if neg_head:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
heads = heads.reshape(num_chunks, neg_sample_size, hidden_dim)
tails = tails - relations
tails = tails.reshape(num_chunks, chunk_size, hidden_dim)
return gamma - th.cdist(tails, heads, p=1)
return fn
else:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
heads = heads + relations
heads = heads.reshape(num_chunks, chunk_size, hidden_dim)
tails = tails.reshape(num_chunks, neg_sample_size, hidden_dim)
return gamma - th.cdist(heads, tails, p=1)
return fn
class DistMultScore(nn.Module):
def __init__(self):
super(DistMultScore, self).__init__()
def edge_func(self, edges):
head = edges.src['emb']
tail = edges.dst['emb']
rel = edges.data['emb']
score = head * rel * tail
# TODO: check if there exists minus sign and if gamma should be used here(jin)
return {'score': th.sum(score, dim=-1)}
def reset_parameters(self):
pass
def save(self, path, name):
pass
def load(self, path, name):
pass
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
def create_neg(self, neg_head):
if neg_head:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
heads = heads.reshape(num_chunks, neg_sample_size, hidden_dim)
heads = th.transpose(heads, 1, 2)
tmp = (tails * relations).reshape(num_chunks, chunk_size, hidden_dim)
return th.bmm(tmp, heads)
return fn
else:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = tails.shape[1]
tails = tails.reshape(num_chunks, neg_sample_size, hidden_dim)
tails = th.transpose(tails, 1, 2)
tmp = (heads * relations).reshape(num_chunks, chunk_size, hidden_dim)
return th.bmm(tmp, tails)
return fn
class ComplExScore(nn.Module):
def __init__(self):
super(ComplExScore, self).__init__()
def edge_func(self, edges):
real_head, img_head = th.chunk(edges.src['emb'], 2, dim=-1)
real_tail, img_tail = th.chunk(edges.dst['emb'], 2, dim=-1)
real_rel, img_rel = th.chunk(edges.data['emb'], 2, dim=-1)
score = real_head * real_tail * real_rel \
+ img_head * img_tail * real_rel \
+ real_head * img_tail * img_rel \
- img_head * real_tail * img_rel
# TODO: check if there exists minus sign and if gamma should be used here(jin)
return {'score': th.sum(score, -1)}
def reset_parameters(self):
pass
def save(self, path, name):
pass
def load(self, path, name):
pass
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
def create_neg(self, neg_head):
if neg_head:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
emb_real = tails[..., :hidden_dim // 2]
emb_imag = tails[..., hidden_dim // 2:]
rel_real = relations[..., :hidden_dim // 2]
rel_imag = relations[..., hidden_dim // 2:]
real = emb_real * rel_real + emb_imag * rel_imag
imag = -emb_real * rel_imag + emb_imag * rel_real
emb_complex = th.cat((real, imag), dim=-1)
tmp = emb_complex.reshape(num_chunks, chunk_size, hidden_dim)
heads = heads.reshape(num_chunks, neg_sample_size, hidden_dim)
heads = th.transpose(heads, 1, 2)
return th.bmm(tmp, heads)
return fn
else:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
emb_real = heads[..., :hidden_dim // 2]
emb_imag = heads[..., hidden_dim // 2:]
rel_real = relations[..., :hidden_dim // 2]
rel_imag = relations[..., hidden_dim // 2:]
real = emb_real * rel_real - emb_imag * rel_imag
imag = emb_real * rel_imag + emb_imag * rel_real
emb_complex = th.cat((real, imag), dim=-1)
tmp = emb_complex.reshape(num_chunks, chunk_size, hidden_dim)
tails = tails.reshape(num_chunks, neg_sample_size, hidden_dim)
tails = th.transpose(tails, 1, 2)
return th.bmm(tmp, tails)
return fn
class RESCALScore(nn.Module):
def __init__(self, relation_dim, entity_dim):
super(RESCALScore, self).__init__()
self.relation_dim = relation_dim
self.entity_dim = entity_dim
def edge_func(self, edges):
head = edges.src['emb']
tail = edges.dst['emb'].unsqueeze(-1)
rel = edges.data['emb']
rel = rel.view(-1, self.relation_dim, self.entity_dim)
score = head * th.matmul(rel, tail).squeeze(-1)
# TODO: check if use self.gamma
return {'score': th.sum(score, dim=-1)}
# return {'score': self.gamma - th.norm(score, p=1, dim=-1)}
def reset_parameters(self):
pass
def save(self, path, name):
pass
def load(self, path, name):
pass
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
"""
Knowledge Graph Embedding Models.
1. TransE
2. DistMult
3. ComplEx
4. RotatE
5. pRotatE
6. TransH
7. TransR
8. TransD
9. RESCAL
"""
import os
import numpy as np
import torch as th
import torch.nn as nn
import torch.nn.functional as functional
import torch.nn.init as INIT
from .. import *
logsigmoid = functional.logsigmoid
def get_device(args):
return th.device('cpu') if args.gpu < 0 else th.device('cuda:' + str(args.gpu))
norm = lambda x, p: x.norm(p=p)**p
get_scalar = lambda x: x.detach().item()
reshape = lambda arr, x, y: arr.view(x, y)
cuda = lambda arr, gpu: arr.cuda(gpu)
class ExternalEmbedding:
def __init__(self, args, num, dim, device):
self.gpu = args.gpu
self.args = args
self.trace = []
self.emb = th.empty(num, dim, dtype=th.float32, device=device)
self.state_sum = self.emb.new().resize_(self.emb.size(0)).zero_()
self.state_step = 0
def init(self, emb_init):
INIT.uniform_(self.emb, -emb_init, emb_init)
INIT.zeros_(self.state_sum)
def share_memory(self):
self.emb.share_memory_()
self.state_sum.share_memory_()
def __call__(self, idx, gpu_id=-1, trace=True):
s = self.emb[idx]
if self.gpu >= 0:
s = s.cuda(self.gpu)
data = s.clone().detach().requires_grad_(True)
if trace:
self.trace.append((idx, data))
return data
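# update() performs a row-sparse Adagrad step: only the embedding rows touched in the
# last forward pass (recorded in self.trace) are updated, using per-row accumulated
# squared-gradient statistics kept in self.state_sum.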
def update(self):
self.state_step += 1
with th.no_grad():
for idx, data in self.trace:
grad = data.grad.data
clr = self.args.lr
#clr = self.args.lr / (1 + (self.state_step - 1) * group['lr_decay'])
# the update is non-linear so indices must be unique
grad_indices = idx
grad_values = grad
grad_sum = (grad_values * grad_values).mean(1)
device = self.state_sum.device
if device != grad_indices.device:
grad_indices = grad_indices.to(device)
if device != grad_sum.device:
grad_sum = grad_sum.to(device)
self.state_sum.index_add_(0, grad_indices, grad_sum)
std = self.state_sum[grad_indices] # _sparse_mask
std_values = std.sqrt_().add_(1e-10).unsqueeze(1)
if self.gpu >= 0:
std_values = std_values.cuda(self.args.gpu)
tmp = (-clr * grad_values / std_values)
if tmp.device != device:
tmp = tmp.to(device)
# TODO(zhengda) the overhead is here.
self.emb.index_add_(0, grad_indices, tmp)
self.trace = []
def curr_emb(self):
data = [data for _, data in self.trace]
return th.cat(data, 0)
def save(self, path, name):
file_name = os.path.join(path, name)
np.save(file_name, self.emb.cpu().detach().numpy())
def load(self, path, name):
file_name = os.path.join(path, name+'.npy')
self.emb = th.Tensor(np.load(file_name))
import os
import scipy as sp
import scipy.sparse
import numpy as np
import dgl
import dgl.backend as F
backend = os.environ.get('DGLBACKEND', 'pytorch')
if backend.lower() == 'mxnet':
from models.mxnet.score_fun import *
else:
from models.pytorch.score_fun import *
from models.general_models import KEModel
from dataloader.sampler import create_neg_subgraph
def generate_rand_graph(n):
arr = (sp.sparse.random(n, n, density=0.1, format='coo') != 0).astype(np.int64)
g = dgl.DGLGraph(arr, readonly=True)
num_rels = 10
entity_emb = F.uniform((g.number_of_nodes(), 10), F.float32, F.cpu(), 0, 1)
rel_emb = F.uniform((num_rels, 10), F.float32, F.cpu(), 0, 1)
g.ndata['id'] = F.arange(0, g.number_of_nodes())
rel_ids = np.random.randint(0, num_rels, g.number_of_edges(), dtype=np.int64)
g.edata['id'] = F.tensor(rel_ids, F.int64)
return g, entity_emb, rel_emb
ke_score_funcs = {'TransE': TransEScore(12.0),
'DistMult': DistMultScore()}
class BaseKEModel:
def __init__(self, score_func, entity_emb, rel_emb):
self.score_func = score_func
self.head_neg_score = self.score_func.create_neg(True)
self.tail_neg_score = self.score_func.create_neg(False)
self.entity_emb = entity_emb
self.rel_emb = rel_emb
def predict_score(self, g):
g.ndata['emb'] = self.entity_emb[g.ndata['id']]
g.edata['emb'] = self.rel_emb[g.edata['id']]
self.score_func(g)
return g.edata['score']
def predict_neg_score(self, pos_g, neg_g):
pos_g.ndata['emb'] = self.entity_emb[pos_g.ndata['id']]
pos_g.edata['emb'] = self.rel_emb[pos_g.edata['id']]
neg_g.ndata['emb'] = self.entity_emb[neg_g.ndata['id']]
neg_g.edata['emb'] = self.rel_emb[neg_g.edata['id']]
num_chunks = neg_g.num_chunks
chunk_size = neg_g.chunk_size
neg_sample_size = neg_g.neg_sample_size
if neg_g.neg_head:
neg_head_ids = neg_g.ndata['id'][neg_g.head_nid]
neg_head = self.entity_emb[neg_head_ids]
_, tail_ids = pos_g.all_edges(order='eid')
tail = pos_g.ndata['emb'][tail_ids]
rel = pos_g.edata['emb']
neg_score = self.head_neg_score(neg_head, rel, tail,
num_chunks, chunk_size, neg_sample_size)
else:
neg_tail_ids = neg_g.ndata['id'][neg_g.tail_nid]
neg_tail = self.entity_emb[neg_tail_ids]
head_ids, _ = pos_g.all_edges(order='eid')
head = pos_g.ndata['emb'][head_ids]
rel = pos_g.edata['emb']
neg_score = self.tail_neg_score(head, rel, neg_tail,
num_chunks, chunk_size, neg_sample_size)
return neg_score
def check_score_func(func_name):
batch_size = 10
neg_sample_size = 10
g, entity_emb, rel_emb = generate_rand_graph(100)
hidden_dim = entity_emb.shape[1]
ke_score_func = ke_score_funcs[func_name]
model = BaseKEModel(ke_score_func, entity_emb, rel_emb)
EdgeSampler = getattr(dgl.contrib.sampling, 'EdgeSampler')
sampler = EdgeSampler(g, batch_size=batch_size,
neg_sample_size=neg_sample_size,
negative_mode='PBG-head',
num_workers=1,
shuffle=False,
exclude_positive=False,
return_false_neg=False)
for pos_g, neg_g in sampler:
neg_g = create_neg_subgraph(pos_g, neg_g, True, True, g.number_of_nodes())
pos_g.copy_from_parent()
neg_g.copy_from_parent()
score1 = F.reshape(model.predict_score(neg_g), (batch_size, -1))
score2 = model.predict_neg_score(pos_g, neg_g)
score2 = F.reshape(score2, (batch_size, -1))
np.testing.assert_allclose(F.asnumpy(score1), F.asnumpy(score2),
rtol=1e-5, atol=1e-5)
def test_score_func():
for key in ke_score_funcs:
check_score_func(key)
if __name__ == '__main__':
test_score_func()
from dataloader import EvalDataset, TrainDataset, NewBidirectionalOneShotIterator
from dataloader import get_dataset
import torch.multiprocessing as mp
import argparse
import os
import logging
import time
backend = os.environ.get('DGLBACKEND', 'pytorch')
if backend.lower() == 'mxnet':
from train_mxnet import load_model
from train_mxnet import train
from train_mxnet import test
else:
from train_pytorch import load_model
from train_pytorch import train
from train_pytorch import test
class ArgParser(argparse.ArgumentParser):
def __init__(self):
super(ArgParser, self).__init__()
self.add_argument('--model_name', default='TransE',
choices=['TransE', 'TransH', 'TransR', 'TransD',
'RESCAL', 'DistMult', 'ComplEx', 'RotatE', 'pRotatE'],
help='model to use')
self.add_argument('--data_path', type=str, default='data',
help='root path of all dataset')
self.add_argument('--dataset', type=str, default='FB15k',
help='dataset name, under data_path')
self.add_argument('--format', type=str, default='1',
help='the format of the dataset.')
self.add_argument('--save_path', type=str, default='ckpts',
help='place to save models and logs')
self.add_argument('--save_emb', type=str, default=None,
help='save the embeddings in the specific location.')
self.add_argument('--max_step', type=int, default=80000,
help='number of training steps')
self.add_argument('--warm_up_step', type=int, default=None,
help='for learning rate decay')
self.add_argument('--batch_size', type=int, default=1024,
help='batch size')
self.add_argument('--batch_size_eval', type=int, default=8,
help='batch size used for eval and test')
self.add_argument('--neg_sample_size', type=int, default=128,
help='negative sampling size')
self.add_argument('--neg_sample_size_valid', type=int, default=1000,
help='negative sampling size for validation')
self.add_argument('--neg_sample_size_test', type=int, default=-1,
help='negative sampling size for testing')
self.add_argument('--hidden_dim', type=int, default=256,
help='hidden dim used by relation and entity')
self.add_argument('--lr', type=float, default=0.0001,
help='learning rate')
self.add_argument('-g', '--gamma', type=float, default=12.0,
help='margin value')
self.add_argument('--eval_percent', type=float, default=1,
help='sample some percentage for evaluation.')
self.add_argument('--gpu', type=int, default=-1,
help='GPU device id to use (-1 means CPU)')
self.add_argument('--mix_cpu_gpu', action='store_true',
help='mix CPU and GPU training')
self.add_argument('-de', '--double_ent', action='store_true',
help='double entity dim for complex number')
self.add_argument('-dr', '--double_rel', action='store_true',
help='double relation dim for complex number')
self.add_argument('--seed', type=int, default=0,
help='set random seed for reproducibility')
self.add_argument('-log', '--log_interval', type=int, default=1000,
help='print training logs every x steps')
self.add_argument('--eval_interval', type=int, default=10000,
help='do evaluation after every x steps')
self.add_argument('-adv', '--neg_adversarial_sampling', action='store_true',
help='if use negative adversarial sampling')
self.add_argument('-a', '--adversarial_temperature', default=1.0, type=float)
self.add_argument('--valid', action='store_true',
help='if valid a model')
self.add_argument('--test', action='store_true',
help='if test a model')
self.add_argument('-rc', '--regularization_coef', type=float, default=0.000002,
help='set value > 0.0 if regularization is used')
self.add_argument('-rn', '--regularization_norm', type=int, default=3,
help='norm used in regularization')
self.add_argument('--num_worker', type=int, default=16,
help='number of workers used for loading data')
self.add_argument('--non_uni_weight', action='store_true',
help='use non-uniform (subsampling) weight when computing loss')
self.add_argument('--init_step', type=int, default=0,
help='DONT SET MANUALLY, used for resume')
self.add_argument('--step', type=int, default=0,
help='DONT SET MANUALLY, track current step')
self.add_argument('--pickle_graph', action='store_true',
help='pickle built graph, building a huge graph is slow.')
self.add_argument('--num_proc', type=int, default=1,
help='number of process used')
self.add_argument('--rel_part', action='store_true',
help='enable relation partitioning')
def get_logger(args):
if not os.path.exists(args.save_path):
os.mkdir(args.save_path)
folder = '{}_{}_'.format(args.model_name, args.dataset)
n = len([x for x in os.listdir(args.save_path) if x.startswith(folder)])
folder += str(n)
args.save_path = os.path.join(args.save_path, folder)
if not os.path.exists(args.save_path):
os.makedirs(args.save_path)
log_file = os.path.join(args.save_path, 'train.log')
logging.basicConfig(
format='%(asctime)s %(levelname)-8s %(message)s',
level=logging.INFO,
datefmt='%Y-%m-%d %H:%M:%S',
filename=log_file,
filemode='w'
)
logger = logging.getLogger(__name__)
print("Logs are being recorded at: {}".format(log_file))
return logger
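# Example of the resulting layout (names are placeholders): with model_name
# 'TransE', dataset 'FB15k' and save_path 'ckpts', the first run logs to
# ckpts/TransE_FB15k_0/train.log; the trailing index counts existing folders with
# the same prefix, so later runs get TransE_FB15k_1, TransE_FB15k_2, and so on.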
def run(args, logger):
# load dataset and samplers
dataset = get_dataset(args.data_path, args.dataset, args.format)
n_entities = dataset.n_entities
n_relations = dataset.n_relations
if args.neg_sample_size_test < 0:
args.neg_sample_size_test = n_entities
train_data = TrainDataset(dataset, args, ranks=args.num_proc)
if args.num_proc > 1:
train_samplers = []
for i in range(args.num_proc):
train_sampler_head = train_data.create_sampler(args.batch_size, args.neg_sample_size,
mode='PBG-head',
num_workers=args.num_worker,
shuffle=True,
exclude_positive=True,
rank=i)
train_sampler_tail = train_data.create_sampler(args.batch_size, args.neg_sample_size,
mode='PBG-tail',
num_workers=args.num_worker,
shuffle=True,
exclude_positive=True,
rank=i)
train_samplers.append(NewBidirectionalOneShotIterator(train_sampler_head, train_sampler_tail,
True, n_entities))
else:
train_sampler_head = train_data.create_sampler(args.batch_size, args.neg_sample_size,
mode='PBG-head',
num_workers=args.num_worker,
shuffle=True,
exclude_positive=True)
train_sampler_tail = train_data.create_sampler(args.batch_size, args.neg_sample_size,
mode='PBG-tail',
num_workers=args.num_worker,
shuffle=True,
exclude_positive=True)
train_sampler = NewBidirectionalOneShotIterator(train_sampler_head, train_sampler_tail,
True, n_entities)
if args.valid or args.test:
eval_dataset = EvalDataset(dataset, args)
if args.valid:
# Here we want to use the regular negative sampler because we need to ensure that
# all positive edges are excluded.
if args.num_proc > 1:
valid_sampler_heads = []
valid_sampler_tails = []
for i in range(args.num_proc):
valid_sampler_head = eval_dataset.create_sampler('valid', args.batch_size_eval,
args.neg_sample_size_valid,
mode='PBG-head',
num_workers=args.num_worker,
rank=i, ranks=args.num_proc)
valid_sampler_tail = eval_dataset.create_sampler('valid', args.batch_size_eval,
args.neg_sample_size_valid,
mode='PBG-tail',
num_workers=args.num_worker,
rank=i, ranks=args.num_proc)
valid_sampler_heads.append(valid_sampler_head)
valid_sampler_tails.append(valid_sampler_tail)
else:
valid_sampler_head = eval_dataset.create_sampler('valid', args.batch_size_eval,
args.neg_sample_size_valid,
mode='PBG-head',
num_workers=args.num_worker,
rank=0, ranks=1)
valid_sampler_tail = eval_dataset.create_sampler('valid', args.batch_size_eval,
args.neg_sample_size_valid,
mode='PBG-tail',
num_workers=args.num_worker,
rank=0, ranks=1)
if args.test:
# Here we want to use the regular negative sampler because we need to ensure that
# all positive edges are excluded.
if args.num_proc > 1:
test_sampler_tails = []
test_sampler_heads = []
for i in range(args.num_proc):
test_sampler_head = eval_dataset.create_sampler('test', args.batch_size_eval,
args.neg_sample_size_test,
mode='head',
num_workers=args.num_worker,
rank=i, ranks=args.num_proc)
test_sampler_tail = eval_dataset.create_sampler('test', args.batch_size_eval,
args.neg_sample_size_test,
mode='tail',
num_workers=args.num_worker,
rank=i, ranks=args.num_proc)
test_sampler_heads.append(test_sampler_head)
test_sampler_tails.append(test_sampler_tail)
else:
test_sampler_head = eval_dataset.create_sampler('test', args.batch_size_eval,
args.neg_sample_size_test,
mode='head',
num_workers=args.num_worker,
rank=0, ranks=1)
test_sampler_tail = eval_dataset.create_sampler('test', args.batch_size_eval,
args.neg_sample_size_test,
mode='tail',
num_workers=args.num_worker,
rank=0, ranks=1)
# We need to free all memory referenced by dataset.
eval_dataset = None
dataset = None
# load model
model = load_model(logger, args, n_entities, n_relations)
if args.num_proc > 1:
model.share_memory()
# train
start = time.time()
if args.num_proc > 1:
procs = []
for i in range(args.num_proc):
valid_samplers = [valid_sampler_heads[i], valid_sampler_tails[i]] if args.valid else None
proc = mp.Process(target=train, args=(args, model, train_samplers[i], valid_samplers))
procs.append(proc)
proc.start()
for proc in procs:
proc.join()
else:
valid_samplers = [valid_sampler_head, valid_sampler_tail] if args.valid else None
train(args, model, train_sampler, valid_samplers)
print('training takes {} seconds'.format(time.time() - start))
if args.save_emb is not None:
if not os.path.exists(args.save_emb):
os.mkdir(args.save_emb)
model.save_emb(args.save_emb, args.dataset)
# test
if args.test:
if args.num_proc > 1:
procs = []
for i in range(args.num_proc):
proc = mp.Process(target=test, args=(args, model, [test_sampler_heads[i], test_sampler_tails[i]]))
procs.append(proc)
proc.start()
for proc in procs:
proc.join()
else:
test(args, model, [test_sampler_head, test_sampler_tail])
if __name__ == '__main__':
args = ArgParser().parse_args()
logger = get_logger(args)
run(args, logger)
from models import KEModel
import mxnet as mx
from mxnet import gluon
from mxnet import ndarray as nd
import os
import logging
import time
import json
def load_model(logger, args, n_entities, n_relations, ckpt=None):
model = KEModel(args, args.model_name, n_entities, n_relations,
args.hidden_dim, args.gamma,
double_entity_emb=args.double_ent, double_relation_emb=args.double_rel)
if ckpt is not None:
# TODO: loading model embeddings only works for the general Embedding, not for ExternalEmbedding
if args.gpu >= 0:
model.load_parameters(ckpt, ctx=mx.gpu(args.gpu))
else:
model.load_parameters(ckpt, ctx=mx.cpu())
logger.info('Load model {}'.format(args.model_name))
return model
def load_model_from_checkpoint(logger, args, n_entities, n_relations, ckpt_path):
model = load_model(logger, args, n_entities, n_relations)
model.load_emb(ckpt_path, args.dataset)
return model
def train(args, model, train_sampler, valid_samplers=None):
if args.num_proc > 1:
os.environ['OMP_NUM_THREADS'] = '1'
logs = []
for arg in vars(args):
logging.info('{:20}:{}'.format(arg, getattr(args, arg)))
start = time.time()
for step in range(args.init_step, args.max_step):
pos_g, neg_g = next(train_sampler)
args.step = step
with mx.autograd.record():
loss, log = model.forward(pos_g, neg_g, args.gpu)
loss.backward()
logs.append(log)
model.update()
if step % args.log_interval == 0:
for k in logs[0].keys():
v = sum(l[k] for l in logs) / len(logs)
print('[Train]({}/{}) average {}: {}'.format(step, args.max_step, k, v))
logs = []
print(time.time() - start)
start = time.time()
if args.valid and step % args.eval_interval == 0 and step > 1 and valid_samplers is not None:
start = time.time()
test(args, model, valid_samplers, mode='Valid')
print('test:', time.time() - start)
# clear cache
logs = []
def test(args, model, test_samplers, mode='Test'):
logs = []
for sampler in test_samplers:
#print('Number of tests: {}'.format(len(sampler)))
count = 0
for pos_g, neg_g in sampler:
model.forward_test(pos_g, neg_g, logs, args.gpu)
metrics = {}
if len(logs) > 0:
for metric in logs[0].keys():
metrics[metric] = sum([log[metric] for log in logs]) / len(logs)
for k, v in metrics.items():
print('{} average {} at [{}/{}]: {}'.format(mode, k, args.step, args.max_step, v))
for i in range(len(test_samplers)):
test_samplers[i] = test_samplers[i].reset()
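# Usage sketch, mirroring how run() calls this function: evaluation always passes
# one head-corruption sampler and one tail-corruption sampler, e.g.
#   test(args, model, [test_sampler_head, test_sampler_tail])
# and each printed metric is the average over the logs collected from both samplers.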
from models import KEModel
from torch.utils.data import DataLoader
import torch.optim as optim
import torch as th
import torch.multiprocessing as mp
from distutils.version import LooseVersion
TH_VERSION = LooseVersion(th.__version__)
if TH_VERSION < LooseVersion("1.2"):
    raise Exception("DGL-KE requires PyTorch version >= 1.2")
import os
import logging
import time
def load_model(logger, args, n_entities, n_relations, ckpt=None):
model = KEModel(args, args.model_name, n_entities, n_relations,
args.hidden_dim, args.gamma,
double_entity_emb=args.double_ent, double_relation_emb=args.double_rel)
if ckpt is not None:
# TODO: loading model embeddings only works for the general Embedding, not for ExternalEmbedding
model.load_state_dict(ckpt['model_state_dict'])
return model
def load_model_from_checkpoint(logger, args, n_entities, n_relations, ckpt_path):
model = load_model(logger, args, n_entities, n_relations)
model.load_emb(ckpt_path, args.dataset)
return model
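# Usage sketch only (the checkpoint path is a placeholder): restoring embeddings
# that were previously written by model.save_emb() in run(), e.g.
#   model = load_model_from_checkpoint(logger, args, n_entities, n_relations,
#                                      'ckpts/TransE_FB15k_0')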
def train(args, model, train_sampler, valid_samplers=None):
if args.num_proc > 1:
th.set_num_threads(1)
logs = []
for arg in vars(args):
logging.info('{:20}:{}'.format(arg, getattr(args, arg)))
start = time.time()
update_time = 0
forward_time = 0
backward_time = 0
for step in range(args.init_step, args.max_step):
pos_g, neg_g = next(train_sampler)
args.step = step
start1 = time.time()
loss, log = model.forward(pos_g, neg_g)
forward_time += time.time() - start1
start1 = time.time()
loss.backward()
backward_time += time.time() - start1
start1 = time.time()
model.update()
update_time += time.time() - start1
logs.append(log)
if step % args.log_interval == 0:
for k in logs[0].keys():
v = sum(l[k] for l in logs) / len(logs)
print('[Train]({}/{}) average {}: {}'.format(step, args.max_step, k, v))
logs = []
print('[Train] {} steps take {:.3f} seconds'.format(args.log_interval,
time.time() - start))
print('forward: {:.3f}, backward: {:.3f}, update: {:.3f}'.format(forward_time,
backward_time,
update_time))
update_time = 0
forward_time = 0
backward_time = 0
start = time.time()
if args.valid and step % args.eval_interval == 0 and step > 1 and valid_samplers is not None:
start = time.time()
test(args, model, valid_samplers, mode='Valid')
print('test:', time.time() - start)
def test(args, model, test_samplers, mode='Test'):
if args.num_proc > 1:
th.set_num_threads(1)
start = time.time()
with th.no_grad():
logs = []
for sampler in test_samplers:
count = 0
for pos_g, neg_g in sampler:
with th.no_grad():
model.forward_test(pos_g, neg_g, logs, args.gpu)
metrics = {}
if len(logs) > 0:
for metric in logs[0].keys():
metrics[metric] = sum([log[metric] for log in logs]) / len(logs)
for k, v in metrics.items():
print('{} average {} at [{}/{}]: {}'.format(mode, k, args.step, args.max_step, v))
print('test:', time.time() - start)
test_samplers[0] = test_samplers[0].reset()
test_samplers[1] = test_samplers[1].reset()
...@@ -39,7 +39,7 @@ TAGConv
:show-inheritance:
Global Pooling Layers
----------------------------------------
.. automodule:: dgl.nn.mxnet.glob