# DGL - Knowledge Graph Embedding


## Introduction

DGL-KE is a DGL-based package for computing node embeddings and relation embeddings of
knowledge graphs efficiently. This package is adapted from
[KnowledgeGraphEmbedding](https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding).
We enable fast and scalable training of knowledge graph embeddings,
while still keeping the package as extensible as
[KnowledgeGraphEmbedding](https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding).
On a single machine,
training takes only a few minutes on medium-sized knowledge graphs, such as FB15k and wn18,
and a couple of hours on Freebase, which has hundreds of millions of edges.

DGL-KE includes the following knowledge graph embedding models:

- TransE (TransE_l1 with L1 distance and TransE_l2 with L2 distance)
- DistMult
- ComplEx
- RESCAL
- TransR
- RotatE

We will add other popular models in the future.

DGL-KE supports multiple training modes:

- CPU training
- GPU training
- Joint CPU & GPU training
- Multiprocessing training on CPUs

For joint CPU & GPU training, node embeddings are stored on CPU and mini-batches are trained on GPU. This is designed for training KGE models on large knowledge graphs whose node embeddings do not fit in GPU memory.

For multiprocessing training, each process trains mini-batches independently and uses shared memory for communication between the processes. This is designed to train KGE models on large knowledge graphs with many CPU cores.
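
The mode is selected with command-line flags, as sketched below. Every flag comes from the Usage examples later in this README; the `BASE_ARGS` shorthand is ours for brevity, and the hyperparameters are illustrative, not prescriptive:

```bash
# Shorthand for the model/hyperparameter flags used in the Usage section below.
BASE_ARGS="--model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 \
  --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 -adv"

python3 train.py $BASE_ARGS                        # CPU training (assuming CPU is the default device)
python3 train.py $BASE_ARGS --gpu 0                # GPU training
python3 train.py $BASE_ARGS --gpu 0 --mix_cpu_gpu  # joint CPU & GPU training
python3 train.py $BASE_ARGS --num_proc 8           # multiprocessing training on CPUs
```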

We will support multi-GPU training and distributed training in the near future.

## Requirements

The package can run with both PyTorch and MXNet. For PyTorch, it works with PyTorch v1.2 or newer.
For MXNet, it works with MXNet 1.5 or newer.

## Datasets

DGL-KE provides five knowledge graphs:

| Dataset | #nodes | #edges | #relations |
|---------|--------|--------|------------|
| [FB15k](https://data.dgl.ai/dataset/FB15k.zip) | 14951 | 592213 | 1345 |
| [FB15k-237](https://data.dgl.ai/dataset/FB15k-237.zip) | 14541 | 310116 | 237 |
| [wn18](https://data.dgl.ai/dataset/wn18.zip) | 40943 | 151442 | 18 |
| [wn18rr](https://data.dgl.ai/dataset/wn18rr.zip) | 40943 | 93003 | 11 |
| [Freebase](https://data.dgl.ai/dataset/Freebase.zip) | 86054151 | 338586276 | 14824 |

Users can specify one of the datasets with `--dataset` in `train.py` and `eval.py`.

## Performance

The speed is measured with 16 CPU cores and one NVIDIA V100 GPU.

The speed on FB15k

|  Models | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|---------|-----------|-----------|----------|---------|--------|--------|--------|
|MAX_STEPS| 20000     | 30000     |100000    | 100000  | 30000  | 100000 | 100000 |
|TIME     | 411s      | 329s      |690s      | 806s    | 1800s  | 7627s  | 4327s  |

The accuracy on FB15k (MR is the mean rank of the correct entity, lower is better; MRR and HITS@k are higher-is-better metrics)

|  Models   |  MR   |  MRR  | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 69.12 | 0.656 | 0.567  | 0.718  | 0.802   |
| TransE_l2 | 35.86 | 0.570 | 0.400  | 0.708  | 0.834   |
| DistMult  | 43.35 | 0.783 | 0.713  | 0.837  | 0.897   |
| ComplEx   | 51.99 | 0.785 | 0.720  | 0.832  | 0.889   |
| RESCAL    | 130.89| 0.668 | 0.597  | 0.720  | 0.800   |
| TransR    | 138.7 | 0.501 | 0.274  | 0.704  | 0.801   |
| RotatE    | 39.6  | 0.725 | 0.628  | 0.802  | 0.875   |

In comparison, GraphVite uses 4 GPUs and takes 14 minutes to train TransE on FB15k. Thus, DGL-KE trains TransE on FB15k about twice as fast while using far fewer resources. More performance information on GraphVite can be found [here](https://github.com/DeepGraphLearning/graphvite).

The speed on wn18

|  Models | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|---------|-----------|-----------|----------|---------|--------|--------|--------|
|MAX_STEPS| 40000     | 20000     | 10000    | 20000   | 20000  | 20000  | 20000  |
|TIME     | 719s      | 254s      | 126s     | 266s    | 333s   | 1547s  | 786s   |

The accuracy on wn18

|  Models   |  MR    |  MRR  | HITS@1 | HITS@3 | HITS@10 |
|-----------|--------|-------|--------|--------|---------|
| TransE_l1 | 321.35 | 0.760 | 0.652  | 0.850  | 0.940   |
| TransE_l2 | 181.57 | 0.570 | 0.322  | 0.802  | 0.944   |
| DistMult  | 271.09 | 0.769 | 0.639  | 0.892  | 0.949   |
| ComplEx   | 276.37 | 0.935 | 0.916  | 0.950  | 0.960   |
| RESCAL    | 579.54 | 0.846 | 0.791  | 0.898  | 0.931   |
| TransR    | 615.56 | 0.606 | 0.378  | 0.826  | 0.890   |
| RotatE    | 367.64 | 0.931 | 0.924  | 0.935  | 0.944   | 

The speed on Freebase

|  Models | DistMult | ComplEx |
|---------|----------|---------|
|MAX_STEPS| 3200000  | 3200000 |
|TIME     | 2.44h    | 2.94h   |

The accuracy on Freebase (evaluated with 100,000 negative edges sampled for each positive edge)

|  Models  |  MR    |  MRR  | HITS@1 | HITS@3 | HITS@10 |
|----------|--------|-------|--------|--------|---------|
| DistMult | 6159.1 | 0.716 | 0.690  | 0.729  | 0.760   |
| ComplEx  | 6888.8 | 0.716 | 0.697  | 0.728  | 0.760   |

The configuration for reproducing the performance results can be found [here](https://github.com/dmlc/dgl/blob/master/apps/kg/config/best_config.sh).

## Usage

DGL-KE doesn't require installation. The package contains two scripts, `train.py` and `eval.py`.

* `train.py` trains knowledge graph embeddings and outputs the trained node embeddings
and relation embeddings.

* `eval.py` reads the pre-trained node embeddings and relation embeddings and evaluates
how accurately they predict the tail node given (head, rel, ?) and the head node
given (?, rel, tail).

### Input formats:

DGL-KE supports two knowledge graph input formats. A knowledge graph is stored
using five files.

Format 1:

- entities.dict contains pairs of (entity Id, entity name). The number of rows is the number of entities (nodes).
- relations.dict contains pairs of (relation Id, relation name). The number of rows is the number of relations.
- train.txt stores edges in the training set. They are stored as triples of (head, rel, tail).
- valid.txt stores edges in the validation set. They are stored as triples of (head, rel, tail).
- test.txt stores edges in the test set. They are stored as triples of (head, rel, tail).
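
For illustration, here is a hypothetical Format 1 snippet (the entity and relation names are made up, and the columns are typically tab-separated):

```
# entities.dict
0   Paris
1   France

# relations.dict
0   capital_of

# train.txt: (head, rel, tail)
Paris   capital_of   France
```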

Format 2:

- entity2id.txt contains pairs of (entity name, entity Id). The number of rows is the number of entities (nodes).
- relation2id.txt contains pairs of (relation name, relation Id). The number of rows is the number of relations.
- train.txt stores edges in the training set. They are stored as triples of (head, tail, rel).
- valid.txt stores edges in the validation set. They are stored as triples of (head, tail, rel).
- test.txt stores edges in the test set. They are stored as triples of (head, tail, rel).
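
The same hypothetical snippet in Format 2; note that the column order of the Id mappings is reversed and the relation moves to the last column of each triple:

```
# entity2id.txt
Paris   0
France   1

# relation2id.txt
capital_of   0

# train.txt: (head, tail, rel)
Paris   France   capital_of
```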

### Output formats:

To save the trained embeddings, users have to provide the path with `--save_emb` when running
`train.py`. The saved embeddings are stored as numpy ndarrays.

* The node embedding is saved as `XXX_YYY_entity.npy`.

* The relation embedding is saved as `XXX_YYY_relation.npy`.

`XXX` is the dataset name and `YYY` is the model name.
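
For example, after running `train.py` with `--save_emb DistMult_FB15k_emb` on FB15k with DistMult (as in the Usage section below), the embeddings can be loaded with numpy. This is a minimal sketch assuming the naming scheme above; adjust the paths to your dataset and model:

```bash
python3 -c "
import numpy as np

# rows are entities/relations, columns are embedding dimensions
entity_emb = np.load('DistMult_FB15k_emb/FB15k_DistMult_entity.npy')
relation_emb = np.load('DistMult_FB15k_emb/FB15k_DistMult_relation.npy')
print(entity_emb.shape, relation_emb.shape)
"
```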

### Command line parameters

Here are some examples of using the training script.

Train KGE models with a GPU.

```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
    --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 \
    --batch_size_eval 16 --gpu 0 --valid --test -adv
```

Train KGE models with mixed CPUs and GPUs.

```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
    --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 \
    --batch_size_eval 16 --gpu 0 --valid --test -adv --mix_cpu_gpu
```

Train embeddings and evaluate them later.

```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
    --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 \
    --batch_size_eval 16 --gpu 0 --valid -adv --save_emb DistMult_FB15k_emb

python3 eval.py --model_name DistMult --dataset FB15k --hidden_dim 2000 \
    --gamma 500.0 --batch_size 16 --gpu 0 --model_path DistMult_FB15k_emb/
```

Train embeddings with multiprocessing. This currently doesn't work with MXNet.

```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
    --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.07 --max_step 3000 \
    --batch_size_eval 16 --regularization_coef 0.000001 --valid --test -adv --num_proc 8
```