Unverified Commit 56ffb650 authored by peizhou001's avatar peizhou001 Committed by GitHub

[API Deprecation]Deprecate contrib module (#5114)

parent 436de3d1
# DGL - Knowledge Graph Embedding
**Note: DGL-KE is moved to [here](https://github.com/awslabs/dgl-ke). DGL-KE in this folder is deprecated.**
## Introduction
DGL-KE is a DGL-based package for computing node embeddings and relation embeddings of
knowledge graphs efficiently. This package is adapted from
[KnowledgeGraphEmbedding](https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding).
We enable fast and scalable training of knowledge graph embedding,
while still keeping the package as extensible as
[KnowledgeGraphEmbedding](https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding).
On a single machine, training takes only a few minutes for medium-sized knowledge graphs,
such as FB15k and wn18, and a couple of hours on Freebase, which has hundreds of millions of edges.
DGL-KE includes the following knowledge graph embedding models:
- TransE (TransE_l1 with L1 distance and TransE_l2 with L2 distance)
- DistMult
- ComplEx
- RESCAL
- TransR
- RotatE
More popular models will be added in the future.
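To make the model list concrete, the sketch below (an illustration only, not DGL-KE's internal API) shows how two of these models score a triple (head, rel, tail) from its embeddings; `gamma` is the margin hyper-parameter and the embeddings are random placeholders.
```python
import numpy as np

def transe_l2_score(h, r, t, gamma=12.0):
    # TransE_l2: gamma - ||h + r - t||_2; higher means more plausible.
    return gamma - np.linalg.norm(h + r - t, ord=2)

def distmult_score(h, r, t):
    # DistMult: sum_i h_i * r_i * t_i
    return np.sum(h * r * t)

dim = 4
h, r, t = np.random.rand(dim), np.random.rand(dim), np.random.rand(dim)
print(transe_l2_score(h, r, t), distmult_score(h, r, t))
```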
DGL-KE supports multiple training modes:
- CPU training
- GPU training
- Joint CPU & GPU training
- Multiprocessing training on CPUs
For joint CPU & GPU training, node embeddings are stored on the CPU and mini-batches are trained on the GPU (see the sketch below). This mode is designed for training KGE models on large knowledge graphs.
For multiprocessing training, each process trains mini-batches independently and uses shared memory for communication between processes. This mode is designed for training KGE models on large knowledge graphs with many CPU cores.
We will support multi-GPU training and distributed training in the near future.
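The sketch below illustrates the joint CPU & GPU idea in plain PyTorch (a hypothetical toy example, not DGL-KE's implementation): the embedding table stays in CPU memory, each mini-batch gathers only the rows it needs onto the GPU, and the updated rows are written back.
```python
import torch

num_entities, dim, lr = 100000, 400, 0.1       # toy sizes, not tuned values
emb = torch.randn(num_entities, dim)           # full embedding table lives on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

def train_step(head_ids, tail_ids):
    # Gather only the rows touched by this mini-batch and move them to the GPU.
    h = emb[head_ids].to(device).requires_grad_()
    t = emb[tail_ids].to(device).requires_grad_()
    loss = ((h - t) ** 2).sum()                # placeholder loss, not a real KGE score
    loss.backward()
    with torch.no_grad():                      # write the updated rows back to the CPU table
        emb[head_ids] -= lr * h.grad.cpu()
        emb[tail_ids] -= lr * t.grad.cpu()

train_step(torch.randint(0, num_entities, (1024,)),
           torch.randint(0, num_entities, (1024,)))
```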
## Requirements
The package runs with both PyTorch and MXNet. For PyTorch, it requires v1.2 or newer.
For MXNet, it requires v1.5 or newer.
## Built-in Datasets
DGL-KE provides five built-in knowledge graphs:
| Dataset | #nodes | #edges | #relations |
|---------|--------|--------|------------|
| [FB15k](https://data.dgl.ai/dataset/FB15k.zip) | 14951 | 592213 | 1345 |
| [FB15k-237](https://data.dgl.ai/dataset/FB15k-237.zip) | 14541 | 310116 | 237 |
| [wn18](https://data.dgl.ai/dataset/wn18.zip) | 40943 | 151442 | 18 |
| [wn18rr](https://data.dgl.ai/dataset/wn18rr.zip) | 40943 | 93003 | 11 |
| [Freebase](https://data.dgl.ai/dataset/Freebase.zip) | 86054151 | 338586276 | 14824 |
Users can specify one of the datasets with `--dataset` in `train.py` and `eval.py`.
## Performance
The 1-GPU speed is measured with 8 CPU cores and one Nvidia V100 GPU (AWS P3.2xlarge).
The 8-GPU speed is measured with 64 CPU cores and eight Nvidia V100 GPUs (AWS P3.16xlarge).
The speed on FB15k (1 GPU)
| Models | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|---------|-----------|-----------|----------|---------|--------|--------|--------|
|MAX_STEPS| 48000 | 32000 | 40000 | 100000 | 32000 | 32000 | 20000 |
|TIME | 370s | 270s | 312s | 282s | 2095s | 1556s | 1861s |
The accuracy on FB15k
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 44.18 | 0.675 | 0.551 | 0.774 | 0.861 |
| TransE_l2 | 46.71 | 0.665 | 0.551 | 0.804 | 0.846 |
| DistMult | 61.04 | 0.725 | 0.625 | 0.837 | 0.883 |
| ComplEx | 64.59 | 0.785 | 0.718 | 0.835 | 0.889 |
| RESCAL | 122.3 | 0.669 | 0.598 | 0.711 | 0.793 |
| TransR | 59.86 | 0.676 | 0.591 | 0.735 | 0.814 |
| RotatE | 43.66 | 0.728 | 0.632 | 0.801 | 0.874 |
The speed on FB15k (8 GPU)
| Models | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|---------|-----------|-----------|----------|---------|--------|--------|--------|
|MAX_STEPS| 6000 | 4000 | 5000 | 4000 | 4000 | 4000 | 2500 |
|TIME | 88.93s | 62.99s | 72.74s | 68.37s | 245.9s | 203.9s | 126.7s |
The accuracy on FB15k
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 44.25 | 0.672 | 0.547 | 0.774 | 0.860 |
| TransE_l2 | 46.13 | 0.658 | 0.539 | 0.748 | 0.845 |
| DistMult | 61.72 | 0.723 | 0.626 | 0.798 | 0.881 |
| ComplEx | 65.84 | 0.754 | 0.676 | 0.813 | 0.880 |
| RESCAL | 135.6 | 0.652 | 0.580 | 0.693 | 0.779 |
| TransR | 65.27 | 0.676 | 0.591 | 0.736 | 0.811 |
| RotatE | 49.59 | 0.683 | 0.581 | 0.759 | 0.848 |
In comparison, GraphVite uses 4 GPUs and takes 14 minutes. Thus, DGL-KE with 8 GPUs trains TransE on FB15k roughly 9.5x faster than GraphVite. More performance information on GraphVite can be found [here](https://github.com/DeepGraphLearning/graphvite).
The speed on wn18 (1 GPU)
| Models | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|---------|-----------|-----------|----------|---------|--------|--------|--------|
|MAX_STEPS| 32000 | 32000 | 20000 | 20000 | 20000 | 30000 | 24000 |
|TIME | 531.5s | 406.6s | 284.1s | 282.3s | 443.6s | 766.2s | 829.4s |
The accuracy on wn18
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 318.4 | 0.764 | 0.602 | 0.929 | 0.949 |
| TransE_l2 | 206.2 | 0.561 | 0.306 | 0.800 | 0.944 |
| DistMult | 486.0 | 0.818 | 0.711 | 0.921 | 0.948 |
| ComplEx | 268.6 | 0.933 | 0.916 | 0.949 | 0.961 |
| RESCAL | 536.6 | 0.848 | 0.790 | 0.900 | 0.927 |
| TransR | 452.4 | 0.620 | 0.461 | 0.758 | 0.856 |
| RotatE | 487.9 | 0.944 | 0.940 | 0.947 | 0.952 |
The speed on wn18 (8 GPU)
| Models | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|---------|-----------|-----------|----------|---------|--------|--------|--------|
|MAX_STEPS| 4000 | 4000 | 2500 | 2500 | 2500 | 2500 | 3000 |
|TIME | 119.3s | 81.1s | 76.0s | 58.0s | 594.1s | 1168s | 139.8s |
The accuracy on wn18
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 360.3 | 0.745 | 0.562 | 0.930 | 0.951 |
| TransE_l2 | 193.8 | 0.557 | 0.301 | 0.799 | 0.942 |
| DistMult | 499.9 | 0.807 | 0.692 | 0.917 | 0.945 |
| ComplEx | 476.7 | 0.935 | 0.926 | 0.943 | 0.949 |
| RESCAL | 618.8 | 0.848 | 0.791 | 0.897 | 0.927 |
| TransR | 513.1 | 0.659 | 0.491 | 0.821 | 0.871 |
| RotatE | 466.2 | 0.944 | 0.940 | 0.945 | 0.951 |
The speed on Freebase (8 GPU)
| Models | TransE_l2 | DistMult | ComplEx | TransR | RotatE |
|---------|-----------|----------|---------|--------|--------|
|MAX_STEPS| 320000 | 300000 | 360000 | 300000 | 300000 |
|TIME | 7908s | 7425s | 8946s | 16816s | 12817s |
The accuracy on Freebase (tested with 1000 negative edges sampled for each positive edge).
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|--------|-------|--------|--------|---------|
| TransE_l2 | 22.4 | 0.756 | 0.688 | 0.800 | 0.882 |
| DistMult | 45.4 | 0.833 | 0.812 | 0.843 | 0.872 |
| ComplEx | 48.0 | 0.830 | 0.812 | 0.838 | 0.864 |
| TransR | 51.2 | 0.697 | 0.656 | 0.716 | 0.771 |
| RotatE | 93.3 | 0.770 | 0.749 | 0.780 | 0.805 |
The speed on Freebase (48 CPU)
This is measured with 48 CPU cores on an AWS r5dn.24xlarge instance.
| Models | TransE_l2 | DistMult | ComplEx |
|---------|-----------|----------|---------|
|MAX_STEPS| 50000 | 50000 | 50000 |
|TIME | 7002s | 6340s | 8133s |
The accuracy on Freebase (tested with 1000 negative edges sampled for each positive edge).
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|--------|-------|--------|--------|---------|
| TransE_l2 | 30.8 | 0.814 | 0.764 | 0.848 | 0.902 |
| DistMult | 45.1 | 0.834 | 0.815 | 0.843 | 0.871 |
| ComplEx | 44.9 | 0.837 | 0.819 | 0.845 | 0.870 |
The configuration for reproducing the performance results can be found [here](https://github.com/dmlc/dgl/blob/master/apps/kg/config/best_config.sh).
## Usage
DGL-KE doesn't require installation. The package contains two scripts `train.py` and `eval.py`.
* `train.py` trains knowledge graph embeddings and outputs the trained node embeddings
and relation embeddings.
* `eval.py` reads the pre-trained node embeddings and relation embeddings and evaluates
how accurately the model predicts the tail node given (head, rel, ?) and the head node
given (?, rel, tail).
### Input formats:
DGL-KE supports two knowledge graph input formats for user-defined datasets.
Format 1:
- raw_udd_[h|r|t], raw user-defined dataset. In this format, users only need to provide the triples and let the dataloader generate and manage the id mappings (a small sketch of this mapping step follows the format list below). The dataloader generates two files: entities.tsv for the entity id mapping and relations.tsv for the relation id mapping. The order of head, relation and tail entities is described by [h|r|t]; for example, raw_udd_trh means the triples are stored in the order of tail, relation and head. The dataset should contain three files:
  - *train* stores the triples in the training set. Each line is a triple, e.g., [src_name, rel_name, dst_name], following the order specified in [h|r|t].
  - *valid* stores the triples in the validation set. Each line is a triple, e.g., [src_name, rel_name, dst_name], following the order specified in [h|r|t].
  - *test* stores the triples in the test set. Each line is a triple, e.g., [src_name, rel_name, dst_name], following the order specified in [h|r|t].
Format 2:
- udd_[h|r|t], user-defined dataset. In this format, users should provide the id mappings for entities and relations. The order of head, relation and tail entities is described by [h|r|t]; for example, udd_trh means the triples are stored in the order of tail, relation and head. The dataset should contain five files:
  - *entities* stores the mapping between entity name and entity Id.
  - *relations* stores the mapping between relation name and relation Id.
  - *train* stores the triples in the training set. Each line is a triple, e.g., [src_id, rel_id, dst_id], following the order specified in [h|r|t].
  - *valid* stores the triples in the validation set. Each line is a triple, e.g., [src_id, rel_id, dst_id], following the order specified in [h|r|t].
  - *test* stores the triples in the test set. Each line is a triple, e.g., [src_id, rel_id, dst_id], following the order specified in [h|r|t].
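For the raw_udd format, the training file is just one tab-separated name triple per line; the sketch below (with made-up triples, ids assigned in order of first appearance) mimics how the dataloader turns such a file into entities.tsv and relations.tsv.
```python
# Hypothetical raw_udd_hrt triples: head<TAB>relation<TAB>tail, one per line.
raw_triples = [("Paris", "capital_of", "France"),
               ("Berlin", "capital_of", "Germany")]

entity2id, relation2id = {}, {}
def get_id(mapping, key):
    # Assign a new id the first time a name is seen.
    return mapping.setdefault(key, len(mapping))

for h, r, t in raw_triples:
    get_id(entity2id, h); get_id(relation2id, r); get_id(entity2id, t)

# The dataloader writes the generated mappings as "name<TAB>id" lines.
with open("entities.tsv", "w") as f:
    f.writelines("{}\t{}\n".format(name, i) for name, i in entity2id.items())
with open("relations.tsv", "w") as f:
    f.writelines("{}\t{}\n".format(name, i) for name, i in relation2id.items())
```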
### Output formats:
To save the trained embeddings, users have to provide the path with `--save_emb` when running
`train.py`. The saved embeddings are stored as numpy ndarrays.
* The node embedding is saved as `XXX_YYY_entity.npy`.
* The relation embedding is saved as `XXX_YYY_relation.npy`.
`XXX` is the dataset name and `YYY` is the model name.
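For example, if training saved embeddings under `--save_emb DistMult_FB15k_emb`, they can be loaded back with NumPy for quick link prediction. The file names below are an assumption based on the naming pattern described here.
```python
import numpy as np

entity_emb = np.load('DistMult_FB15k_emb/FB15k_DistMult_entity.npy')      # (n_entities, dim)
relation_emb = np.load('DistMult_FB15k_emb/FB15k_DistMult_relation.npy')  # (n_relations, dim)

def top_k_tails(head_id, rel_id, k=10):
    # DistMult scores (head, rel, t) for every candidate tail t in one matrix product.
    scores = entity_emb @ (entity_emb[head_id] * relation_emb[rel_id])
    return np.argsort(-scores)[:k]   # entity ids of the k highest-scoring tails

print(top_k_tails(head_id=0, rel_id=0))
```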
### Command line parameters
Here are some examples of using the training script.
Train KGE models with GPU.
```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 \
--hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 --valid --test -adv \
--gpu 0 --max_step 40000
```
Train KGE models with mixed multiple GPUs.
```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 \
--hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 --valid --test -adv \
--max_step 5000 --mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--soft_rel_part --force_sync_interval 1000
```
Train embeddings and verify it later.
```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 \
--hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 --valid --test -adv \
--gpu 0 --max_step 40000 --save_emb DistMult_FB15k_emb
python3 eval.py --model_name DistMult --dataset FB15k --hidden_dim 400 \
--gamma 143.0 --batch_size 16 --gpu 0 --model_path DistMult_FB15k_emb/
```
Train embeddings with multi-processing. This currently doesn't work in MXNet.
```bash
python3 train.py --model TransE_l2 --dataset Freebase --batch_size 1000 \
--neg_sample_size 200 --hidden_dim 400 --gamma 10 --lr 0.1 --max_step 50000 \
--log_interval 100 --batch_size_eval 1000 --neg_sample_size_eval 1000 --test \
-adv --regularization_coef 1e-9 --num_thread 1 --num_proc 48
```
# To reproduce the results reported in the README, run the models with the following commands:
# for FB15k
# DistMult 1GPU
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 \
--valid --test -adv --gpu 0 --max_step 40000
# DistMult 8GPU
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 \
--valid --test -adv --max_step 5000 --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --async_update --soft_rel_part --force_sync_interval 1000
# ComplEx 1GPU
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset FB15k --batch_size 1024 \
--neg_sample_size 1024 --hidden_dim 400 --gamma 143.0 --lr 0.1 \
--regularization_coef 2.00E-06 --batch_size_eval 16 --valid --test -adv --gpu 0 \
--max_step 32000
# ComplEx 8GPU
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset FB15k --batch_size 1024 \
--neg_sample_size 1024 --hidden_dim 400 --gamma 143.0 --lr 0.1 \
--regularization_coef 2.00E-06 --batch_size_eval 16 --valid --test -adv \
--max_step 4000 --mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--soft_rel_part --force_sync_interval 1000
# TransE_l1 1GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l1 --dataset FB15k --batch_size 1024 \
--neg_sample_size 64 --regularization_coef 1e-07 --hidden_dim 400 --gamma 16.0 \
--lr 0.01 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 48000
# TransE_l1 8GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l1 --dataset FB15k --batch_size 1024 \
--neg_sample_size 64 --regularization_coef 1e-07 --hidden_dim 400 --gamma 16.0 \
--lr 0.01 --batch_size_eval 16 --valid --test -adv --max_step 6000 --mix_cpu_gpu \
--num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update --soft_rel_part \
--force_sync_interval 1000
# TransE_l2 1GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --regularization_coef=1e-9 --hidden_dim 400 --gamma 19.9 \
--lr 0.25 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 32000
# TransE_l2 8GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --regularization_coef=1e-9 --hidden_dim 400 --gamma 19.9 \
--lr 0.25 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 4000 \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update --soft_rel_part \
--force_sync_interval 1000
# RESCAL 1GPU
DGLBACKEND=pytorch python3 train.py --model RESCAL --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 500 --gamma 24.0 --lr 0.03 --batch_size_eval 16 \
--gpu 0 --valid --test -adv --max_step 30000
# RESCAL 8GPU
DGLBACKEND=pytorch python3 train.py --model RESCAL --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 500 --gamma 24.0 --lr 0.03 --batch_size_eval 16 \
--valid --test -adv --max_step 4000 --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --async_update --soft_rel_part --force_sync_interval 1000
# TransR 1GPU
DGLBACKEND=pytorch python3 train.py --model TransR --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 5e-8 --hidden_dim 200 --gamma 8.0 \
--lr 0.015 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 32000
# TransR 8GPU
DGLBACKEND=pytorch python3 train.py --model TransR --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 5e-8 --hidden_dim 200 --gamma 8.0 \
--lr 0.015 --batch_size_eval 16 --valid --test -adv --max_step 4000 --mix_cpu_gpu \
--num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update --soft_rel_part \
--force_sync_interval 1000
# RotatE 1GPU
DGLBACKEND=pytorch python3 train.py --model RotatE --dataset FB15k --batch_size 2048 \
--neg_sample_size 256 --regularization_coef 1e-07 --hidden_dim 200 --gamma 12.0 \
--lr 0.009 --batch_size_eval 16 --valid --test -adv -de --max_step 20000 \
--neg_deg_sample --gpu 0
# RotatE 8GPU
DGLBACKEND=pytorch python3 train.py --model RotatE --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 1e-07 --hidden_dim 200 --gamma 12.0 \
--lr 0.009 --batch_size_eval 16 --valid --test -adv -de --max_step 2500 \
--neg_deg_sample --mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--soft_rel_part --force_sync_interval 1000
# for wn18
# DistMult 1GPU
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset wn18 --batch_size 2048 \
--neg_sample_size 128 --regularization_coef 1e-06 --hidden_dim 512 --gamma 20.0 \
--lr 0.14 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 20000
# DistMult 8GPU
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset wn18 --batch_size 2048 \
--neg_sample_size 128 --regularization_coef 1e-06 --hidden_dim 512 --gamma 20.0 \
--lr 0.14 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 2500 \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--force_sync_interval 1000
# ComplEx 1GPU
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset wn18 --batch_size 1024 \
--neg_sample_size 1024 --regularization_coef 0.00001 --hidden_dim 512 --gamma 200.0 \
--lr 0.1 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 20000
# ComplEx 8GPU
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset wn18 --batch_size 1024 \
--neg_sample_size 1024 --regularization_coef 0.00001 --hidden_dim 512 --gamma 200.0 \
--lr 0.1 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 2500 \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--force_sync_interval 1000
# TransE_l1 1GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l1 --dataset wn18 --batch_size 2048 \
--neg_sample_size 128 --regularization_coef 2e-07 --hidden_dim 512 --gamma 12.0 \
--lr 0.007 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 32000
# TransE_l1 8GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l1 --dataset wn18 --batch_size 2048 \
--neg_sample_size 128 --regularization_coef 2e-07 --hidden_dim 512 --gamma 12.0 \
--lr 0.007 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 4000 \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--force_sync_interval 1000
# TransE_l2 1GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 0.0000001 --hidden_dim 512 --gamma 6.0 \
--lr 0.1 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 32000
# TransE_l2 8GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 0.0000001 --hidden_dim 512 --gamma 6.0 \
--lr 0.1 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 4000 \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--force_sync_interval 1000
# RESCAL 1GPU
DGLBACKEND=pytorch python3 train.py --model RESCAL --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 250 --gamma 24.0 --lr 0.03 --batch_size_eval 16 \
--valid --test -adv --gpu 0 --max_step 20000
# RESCAL 8GPU
DGLBACKEND=pytorch python3 train.py --model RESCAL --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 250 --gamma 24.0 --lr 0.03 --batch_size_eval 16 \
--valid --test -adv --gpu 0 --max_step 2500 --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --async_update --force_sync_interval 1000 --soft_rel_part
# TransR 1GPU
DGLBACKEND=pytorch python3 train.py --model TransR --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 250 --gamma 16.0 --lr 0.1 --batch_size_eval 16 \
--valid --test -adv --gpu 0 --max_step 30000
# TransR 8GPU
DGLBACKEND=pytorch python3 train.py --model TransR --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 250 --gamma 16.0 --lr 0.1 --batch_size_eval 16 \
--valid --test -adv --max_step 2500 --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --async_update --force_sync_interval 1000 --soft_rel_part
# RotatE 1GPU
DGLBACKEND=pytorch python3 train.py --model RotatE --dataset wn18 --batch_size 2048 \
--neg_sample_size 64 --regularization_coef 2e-07 --hidden_dim 256 --gamma 9.0 \
--lr 0.0025 -de --batch_size_eval 16 --neg_deg_sample --valid --test -adv --gpu 0 \
--max_step 24000
# RotatE 8GPU
DGLBACKEND=pytorch python3 train.py --model RotatE --dataset wn18 --batch_size 2048 \
--neg_sample_size 64 --regularization_coef 2e-07 --hidden_dim 256 --gamma 9.0 \
--lr 0.0025 -de --batch_size_eval 16 --neg_deg_sample --valid --test -adv \
--max_step 3000 --mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--force_sync_interval 1000
# for Freebase multi-process-cpu
# TransE_l2
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset Freebase --batch_size 1000 \
--neg_sample_size 200 --hidden_dim 400 --gamma 10 --lr 0.1 --max_step 50000 \
--log_interval 100 --batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv \
--regularization_coef 1e-9 --num_thread 1 --num_proc 48
# DistMult
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.08 --max_step 50000 \
--log_interval 100 --batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv \
--num_thread 1 --num_proc 48
# ComplEx
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.1 --max_step 50000 \
--log_interval 100 --batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv \
--num_thread 1 --num_proc 48
# Freebase multi-gpu
# TransE_l2 8GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset Freebase --batch_size 1000 \
--neg_sample_size 200 --hidden_dim 400 --gamma 10 --lr 0.1 --regularization_coef 1e-9 \
--batch_size_eval 1000 --valid --test -adv --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --max_step 320000 --neg_sample_size_eval 1000 --eval_interval \
100000 --log_interval 10000 --async_update --soft_rel_part --force_sync_interval 10000
# DistMult 8GPU
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 1000 \
--valid --test -adv --mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --max_step 300000 \
--neg_sample_size_eval 1000 --eval_interval 100000 --log_interval 10000 --async_update \
--soft_rel_part --force_sync_interval 10000
# ComplEx 8GPU
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143 --lr 0.1 \
--regularization_coef 2.00E-06 --batch_size_eval 1000 --valid --test -adv \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --max_step 360000 \
--neg_sample_size_eval 1000 --eval_interval 100000 --log_interval 10000 \
--async_update --soft_rel_part --force_sync_interval 10000
# TransR 8GPU
DGLBACKEND=pytorch python3 train.py --model TransR --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 5e-8 --hidden_dim 200 --gamma 8.0 \
--lr 0.015 --batch_size_eval 1000 --valid --test -adv --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --max_step 300000 --neg_sample_size_eval 1000 \
--eval_interval 100000 --log_interval 10000 --async_update --soft_rel_part \
--force_sync_interval 10000
# RotatE 8GPU
DGLBACKEND=pytorch python3 train.py --model RotatE --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 -de --hidden_dim 200 --gamma 12.0 --lr 0.01 \
--regularization_coef 1e-7 --batch_size_eval 1000 --valid --test -adv --mix_cpu_gpu \
--num_proc 8 --gpu 0 1 2 3 4 5 6 7 --max_step 300000 --neg_sample_size_eval 1000 \
--eval_interval 100000 --log_interval 10000 --async_update --soft_rel_part \
--force_sync_interval 10000
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
import numpy as np
def _download_and_extract(url, path, filename):
import shutil, zipfile
import requests
fn = os.path.join(path, filename)
while True:
try:
with zipfile.ZipFile(fn) as zf:
zf.extractall(path)
print('Unzip finished.')
break
except Exception:
os.makedirs(path, exist_ok=True)
f_remote = requests.get(url, stream=True)
sz = f_remote.headers.get('content-length')
assert f_remote.status_code == 200, 'fail to open {}'.format(url)
with open(fn, 'wb') as writer:
for chunk in f_remote.iter_content(chunk_size=1024*1024):
writer.write(chunk)
print('Download finished. Unzipping the file...')
def _get_id(dict, key):
id = dict.get(key, None)
if id is None:
id = len(dict)
dict[key] = id
return id
def _parse_srd_format(format):
if format == "hrt":
return [0, 1, 2]
if format == "htr":
return [0, 2, 1]
if format == "rht":
return [1, 0, 2]
if format == "rth":
return [2, 0, 1]
if format == "thr":
return [1, 2, 0]
if format == "trh":
return [2, 1, 0]
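# For example, for the 'trh' column order the head name is in column 2, the relation in
# column 1 and the tail in column 0 of each line, so _parse_srd_format('trh') returns
# [2, 1, 0]. Unknown format strings fall through and return None.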
def _file_line(path):
with open(path) as f:
for i, l in enumerate(f):
pass
return i + 1
class KGDataset:
'''Load a knowledge graph
The folder with a knowledge graph has five files:
* entities stores the mapping between entity Id and entity name.
* relations stores the mapping between relation Id and relation name.
* train stores the triples in the training set.
* valid stores the triples in the validation set.
* test stores the triples in the test set.
The mapping between entity (relation) Id and entity (relation) name is stored as 'id\tname'.
The triples are stored as 'head_name\trelation_name\ttail_name'.
'''
def __init__(self, entity_path, relation_path, train_path,
valid_path=None, test_path=None, format=[0,1,2], skip_first_line=False):
self.entity2id, self.n_entities = self.read_entity(entity_path)
self.relation2id, self.n_relations = self.read_relation(relation_path)
self.train = self.read_triple(train_path, "train", skip_first_line, format)
if valid_path is not None:
self.valid = self.read_triple(valid_path, "valid", skip_first_line, format)
if test_path is not None:
self.test = self.read_triple(test_path, "test", skip_first_line, format)
def read_entity(self, entity_path):
with open(entity_path) as f:
entity2id = {}
for line in f:
eid, entity = line.strip().split('\t')
entity2id[entity] = int(eid)
return entity2id, len(entity2id)
def read_relation(self, relation_path):
with open(relation_path) as f:
relation2id = {}
for line in f:
rid, relation = line.strip().split('\t')
relation2id[relation] = int(rid)
return relation2id, len(relation2id)
def read_triple(self, path, mode, skip_first_line=False, format=[0,1,2]):
# mode: train/valid/test
if path is None:
return None
print('Reading {} triples....'.format(mode))
heads = []
tails = []
rels = []
with open(path) as f:
if skip_first_line:
_ = f.readline()
for line in f:
triple = line.strip().split('\t')
h, r, t = triple[format[0]], triple[format[1]], triple[format[2]]
heads.append(self.entity2id[h])
rels.append(self.relation2id[r])
tails.append(self.entity2id[t])
heads = np.array(heads, dtype=np.int64)
tails = np.array(tails, dtype=np.int64)
rels = np.array(rels, dtype=np.int64)
print('Finished. Read {} {} triples.'.format(len(heads), mode))
return (heads, rels, tails)
class PartitionKGDataset():
'''Load a partitioned knowledge graph
The folder with a partitioned knowledge graph has four files:
* relations stores the mapping between relation Id and relation name.
* train stores the triples in the training set.
* local_to_global stores the mapping of local id and global id
* partition_book stores the machine id of each entity
The triples are stored as 'head_id\trelation_id\ttail_id'.
'''
def __init__(self, relation_path, train_path, local2global_path,
read_triple=True, skip_first_line=False):
self.n_entities = _file_line(local2global_path)
if skip_first_line == False:
self.n_relations = _file_line(relation_path)
else:
self.n_relations = _file_line(relation_path) - 1
if read_triple == True:
self.train = self.read_triple(train_path, "train")
def read_triple(self, path, mode):
heads = []
tails = []
rels = []
print('Reading {} triples....'.format(mode))
with open(path) as f:
for line in f:
h, r, t = line.strip().split('\t')
heads.append(int(h))
rels.append(int(r))
tails.append(int(t))
heads = np.array(heads, dtype=np.int64)
tails = np.array(tails, dtype=np.int64)
rels = np.array(rels, dtype=np.int64)
print('Finished. Read {} {} triples.'.format(len(heads), mode))
return (heads, rels, tails)
class KGDatasetFB15k(KGDataset):
'''Load a knowledge graph FB15k
The FB15k dataset has five files:
* entities.dict stores the mapping between entity Id and entity name.
* relations.dict stores the mapping between relation Id and relation name.
* train.txt stores the triples in the training set.
* valid.txt stores the triples in the validation set.
* test.txt stores the triples in the test set.
The mapping between entity (relation) name and entity (relation) Id is stored as 'name\tid'.
The triples are stored as 'head_nid\trelation_id\ttail_nid'.
'''
def __init__(self, path, name='FB15k'):
self.name = name
url = 'https://data.dgl.ai/dataset/{}.zip'.format(name)
if not os.path.exists(os.path.join(path, name)):
print('File not found. Downloading from', url)
_download_and_extract(url, path, name + '.zip')
self.path = os.path.join(path, name)
super(KGDatasetFB15k, self).__init__(os.path.join(self.path, 'entities.dict'),
os.path.join(self.path, 'relations.dict'),
os.path.join(self.path, 'train.txt'),
os.path.join(self.path, 'valid.txt'),
os.path.join(self.path, 'test.txt'))
class KGDatasetFB15k237(KGDataset):
'''Load a knowledge graph FB15k-237
The FB15k-237 dataset has five files:
* entities.dict stores the mapping between entity Id and entity name.
* relations.dict stores the mapping between relation Id and relation name.
* train.txt stores the triples in the training set.
* valid.txt stores the triples in the validation set.
* test.txt stores the triples in the test set.
The mapping between entity (relation) name and entity (relation) Id is stored as 'name\tid'.
The triples are stored as 'head_nid\trelation_id\ttail_nid'.
'''
def __init__(self, path, name='FB15k-237'):
self.name = name
url = 'https://data.dgl.ai/dataset/{}.zip'.format(name)
if not os.path.exists(os.path.join(path, name)):
print('File not found. Downloading from', url)
_download_and_extract(url, path, name + '.zip')
self.path = os.path.join(path, name)
super(KGDatasetFB15k237, self).__init__(os.path.join(self.path, 'entities.dict'),
os.path.join(self.path, 'relations.dict'),
os.path.join(self.path, 'train.txt'),
os.path.join(self.path, 'valid.txt'),
os.path.join(self.path, 'test.txt'))
class KGDatasetWN18(KGDataset):
'''Load a knowledge graph wn18
The wn18 dataset has five files:
* entities.dict stores the mapping between entity Id and entity name.
* relations.dict stores the mapping between relation Id and relation name.
* train.txt stores the triples in the training set.
* valid.txt stores the triples in the validation set.
* test.txt stores the triples in the test set.
The mapping between entity (relation) name and entity (relation) Id is stored as 'name\tid'.
The triples are stored as 'head_nid\trelation_id\ttail_nid'.
'''
def __init__(self, path, name='wn18'):
self.name = name
url = 'https://data.dgl.ai/dataset/{}.zip'.format(name)
if not os.path.exists(os.path.join(path, name)):
print('File not found. Downloading from', url)
_download_and_extract(url, path, name + '.zip')
self.path = os.path.join(path, name)
super(KGDatasetWN18, self).__init__(os.path.join(self.path, 'entities.dict'),
os.path.join(self.path, 'relations.dict'),
os.path.join(self.path, 'train.txt'),
os.path.join(self.path, 'valid.txt'),
os.path.join(self.path, 'test.txt'))
class KGDatasetWN18rr(KGDataset):
'''Load a knowledge graph wn18rr
The wn18rr dataset has five files:
* entities.dict stores the mapping between entity Id and entity name.
* relations.dict stores the mapping between relation Id and relation name.
* train.txt stores the triples in the training set.
* valid.txt stores the triples in the validation set.
* test.txt stores the triples in the test set.
The mapping between entity (relation) name and entity (relation) Id is stored as 'name\tid'.
The triples are stored as 'head_nid\trelation_id\ttail_nid'.
'''
def __init__(self, path, name='wn18rr'):
self.name = name
url = 'https://data.dgl.ai/dataset/{}.zip'.format(name)
if not os.path.exists(os.path.join(path, name)):
print('File not found. Downloading from', url)
_download_and_extract(url, path, name + '.zip')
self.path = os.path.join(path, name)
super(KGDatasetWN18rr, self).__init__(os.path.join(self.path, 'entities.dict'),
os.path.join(self.path, 'relations.dict'),
os.path.join(self.path, 'train.txt'),
os.path.join(self.path, 'valid.txt'),
os.path.join(self.path, 'test.txt'))
class KGDatasetFreebase(KGDataset):
'''Load a knowledge graph Full Freebase
The Freebase dataset has five files:
* entity2id.txt stores the mapping between entity name and entity Id.
* relation2id.txt stores the mapping between relation name relation Id.
* train.txt stores the triples in the training set.
* valid.txt stores the triples in the validation set.
* test.txt stores the triples in the test set.
The mapping between entity (relation) name and entity (relation) Id is stored as 'name\tid'.
The triples are stored as 'head_nid\trelation_id\ttail_nid'.
'''
def __init__(self, path, name='Freebase'):
self.name = name
url = 'https://data.dgl.ai/dataset/{}.zip'.format(name)
if not os.path.exists(os.path.join(path, name)):
print('File not found. Downloading from', url)
_download_and_extract(url, path, '{}.zip'.format(name))
self.path = os.path.join(path, name)
super(KGDatasetFreebase, self).__init__(os.path.join(self.path, 'entity2id.txt'),
os.path.join(self.path, 'relation2id.txt'),
os.path.join(self.path, 'train.txt'),
os.path.join(self.path, 'valid.txt'),
os.path.join(self.path, 'test.txt'))
def read_entity(self, entity_path):
with open(entity_path) as f_ent:
n_entities = int(f_ent.readline()[:-1])
return None, n_entities
def read_relation(self, relation_path):
with open(relation_path) as f_rel:
n_relations = int(f_rel.readline()[:-1])
return None, n_relations
def read_triple(self, path, mode, skip_first_line=False, format=None):
heads = []
tails = []
rels = []
print('Reading {} triples....'.format(mode))
with open(path) as f:
if skip_first_line:
_ = f.readline()
for line in f:
h, t, r = line.strip().split('\t')
heads.append(int(h))
tails.append(int(t))
rels.append(int(r))
heads = np.array(heads, dtype=np.int64)
tails = np.array(tails, dtype=np.int64)
rels = np.array(rels, dtype=np.int64)
print('Finished. Read {} {} triples.'.format(len(heads), mode))
return (heads, rels, tails)
class KGDatasetUDDRaw(KGDataset):
'''Load a knowledge graph user defined dataset
The user defined dataset has five files:
* entities stores the mapping between entity name and entity Id.
* relations stores the mapping between relation name relation Id.
* train stores the triples in the training set. In format [src_name, rel_name, dst_name]
* valid stores the triples in the validation set. In format [src_name, rel_name, dst_name]
* test stores the triples in the test set. In format [src_name, rel_name, dst_name]
The mapping between entity (relation) name and entity (relation) Id is stored as 'name\tid'.
The triples are stored as 'head_nid\trelation_id\ttail_nid'.
'''
def __init__(self, path, name, files, format):
self.name = name
for f in files:
assert os.path.exists(os.path.join(path, f)), \
'File {} does not exist in {}'.format(f, path)
assert len(format) == 3
format = _parse_srd_format(format)
self.load_entity_relation(path, files, format)
# Only train set is provided
if len(files) == 1:
super(KGDatasetUDDRaw, self).__init__("entities.tsv",
"relation.tsv",
os.path.join(path, files[0]),
format=format)
# Train, validation and test set are provided
if len(files) == 3:
super(KGDatasetUDDRaw, self).__init__("entities.tsv",
"relation.tsv",
os.path.join(path, files[0]),
os.path.join(path, files[1]),
os.path.join(path, files[2]),
format=format)
def load_entity_relation(self, path, files, format):
entity_map = {}
rel_map = {}
for fi in files:
with open(os.path.join(path, fi)) as f:
for line in f:
triple = line.strip().split('\t')
src, rel, dst = triple[format[0]], triple[format[1]], triple[format[2]]
src_id = _get_id(entity_map, src)
dst_id = _get_id(entity_map, dst)
rel_id = _get_id(rel_map, rel)
entities = ["{}\t{}\n".format(key, val) for key, val in entity_map.items()]
with open(os.path.join(path, "entities.tsv"), "w+") as f:
f.writelines(entities)
self.entity2id = entity_map
self.n_entities = len(entities)
relations = ["{}\t{}\n".format(key, val) for key, val in rel_map.items()]
with open(os.path.join(path, "relations.tsv"), "w+") as f:
f.writelines(relations)
self.relation2id = rel_map
self.n_relations = len(relations)
def read_entity(self, entity_path):
return self.entity2id, self.n_entities
def read_relation(self, relation_path):
return self.relation2id, self.n_relations
class KGDatasetUDD(KGDataset):
'''Load a knowledge graph user defined dataset
The user defined dataset has five files:
* entities stores the mapping between entity name and entity Id.
* relations stores the mapping between relation name relation Id.
* train stores the triples in the training set. In format [src_id, rel_id, dst_id]
* valid stores the triples in the validation set. In format [src_id, rel_id, dst_id]
* test stores the triples in the test set. In format [src_id, rel_id, dst_id]
The mapping between entity (relation) name and entity (relation) Id is stored as 'name\tid'.
The triples are stored as 'head_nid\trelation_id\ttail_nid'.
'''
def __init__(self, path, name, files, format):
self.name = name
for f in files:
assert os.path.exists(os.path.join(path, f)), \
'File {} does not exist in {}'.format(f, path)
format = _parse_srd_format(format)
if len(files) == 3:
super(KGDatasetUDD, self).__init__(os.path.join(path, files[0]),
os.path.join(path, files[1]),
os.path.join(path, files[2]),
None, None,
format=format)
if len(files) == 5:
super(KGDatasetUDD, self).__init__(os.path.join(path, files[0]),
os.path.join(path, files[1]),
os.path.join(path, files[2]),
os.path.join(path, files[3]),
os.path.join(path, files[4]),
format=format)
def read_entity(self, entity_path):
n_entities = 0
with open(entity_path) as f_ent:
for line in f_ent:
n_entities += 1
return None, n_entities
def read_relation(self, relation_path):
n_relations = 0
with open(relation_path) as f_rel:
for line in f_rel:
n_relations += 1
return None, n_relations
def read_triple(self, path, mode, skip_first_line=False, format=[0,1,2]):
heads = []
tails = []
rels = []
print('Reading {} triples....'.format(mode))
with open(path) as f:
if skip_first_line:
_ = f.readline()
for line in f:
triple = line.strip().split('\t')
h, r, t = triple[format[0]], triple[format[1]], triple[format[2]]
heads.append(int(h))
tails.append(int(t))
rels.append(int(r))
heads = np.array(heads, dtype=np.int64)
tails = np.array(tails, dtype=np.int64)
rels = np.array(rels, dtype=np.int64)
print('Finished. Read {} {} triples.'.format(len(heads), mode))
return (heads, rels, tails)
def get_dataset(data_path, data_name, format_str, files=None):
if format_str == 'built_in':
if data_name == 'Freebase':
dataset = KGDatasetFreebase(data_path)
elif data_name == 'FB15k':
dataset = KGDatasetFB15k(data_path)
elif data_name == 'FB15k-237':
dataset = KGDatasetFB15k237(data_path)
elif data_name == 'wn18':
dataset = KGDatasetWN18(data_path)
elif data_name == 'wn18rr':
dataset = KGDatasetWN18rr(data_path)
else:
assert False, "Unknown dataset {}".format(data_name)
elif format_str.startswith('raw_udd'):
# user defined dataset
format = format_str[8:]
dataset = KGDatasetUDDRaw(data_path, data_name, files, format)
elif format_str.startswith('udd'):
# user defined dataset
format = format_str[4:]
dataset = KGDatasetUDD(data_path, data_name, files, format)
else:
assert False, "Unknown format {}".format(format_str)
return dataset
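# Example usage (hypothetical paths and file names):
#   get_dataset('./data', 'FB15k', 'built_in') downloads FB15k into ./data if it is not
#   there yet, while get_dataset('./data', 'mykg', 'raw_udd_hrt', files=['train.tsv'])
#   builds a user-defined dataset from the raw name triples in ./data/train.tsv.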
def get_partition_dataset(data_path, data_name, part_id):
part_name = os.path.join(data_name, 'partition_'+str(part_id))
path = os.path.join(data_path, part_name)
if not os.path.exists(path):
print('Partition file not found.')
exit()
train_path = os.path.join(path, 'train.txt')
local2global_path = os.path.join(path, 'local_to_global.txt')
partition_book_path = os.path.join(path, 'partition_book.txt')
if data_name == 'Freebase':
relation_path = os.path.join(path, 'relation2id.txt')
skip_first_line = True
elif data_name in ['FB15k', 'FB15k-237', 'wn18', 'wn18rr']:
relation_path = os.path.join(path, 'relations.dict')
skip_first_line = False
else:
relation_path = os.path.join(path, 'relation.tsv')
skip_first_line = False
dataset = PartitionKGDataset(relation_path,
train_path,
local2global_path,
read_triple=True,
skip_first_line=skip_first_line)
partition_book = []
with open(partition_book_path) as f:
for line in f:
partition_book.append(int(line))
local_to_global = []
with open(local2global_path) as f:
for line in f:
local_to_global.append(int(line))
return dataset, partition_book, local_to_global
def get_server_partition_dataset(data_path, data_name, part_id):
part_name = os.path.join(data_name, 'partition_'+str(part_id))
path = os.path.join(data_path, part_name)
if not os.path.exists(path):
print('Partition file not found.')
exit()
train_path = os.path.join(path, 'train.txt')
local2global_path = os.path.join(path, 'local_to_global.txt')
if data_name == 'Freebase':
relation_path = os.path.join(path, 'relation2id.txt')
skip_first_line = True
elif data_name in ['FB15k', 'FB15k-237', 'wn18', 'wn18rr']:
relation_path = os.path.join(path, 'relations.dict')
skip_first_line = False
else:
relation_path = os.path.join(path, 'relation.tsv')
skip_first_line = False
dataset = PartitionKGDataset(relation_path,
train_path,
local2global_path,
read_triple=False,
skip_first_line=skip_first_line)
n_entities = _file_line(os.path.join(path, 'partition_book.txt'))
local_to_global = []
with open(local2global_path) as f:
for line in f:
local_to_global.append(int(line))
global_to_local = [0] * n_entities
for i in range(len(local_to_global)):
global_id = local_to_global[i]
global_to_local[global_id] = i
local_to_global = None
return global_to_local, dataset
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from .KGDataset import *
from .sampler import *
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import math
import numpy as np
import scipy as sp
import dgl.backend as F
import dgl
import os
import sys
import pickle
import time
from dgl.base import NID, EID
def SoftRelationPartition(edges, n, threshold=0.05):
"""This partitions a list of edges to n partitions according to their
relation types. For any relation with number of edges larger than the
threshold, its edges will be evenly distributed into all partitions.
For any relation with number of edges smaller than the threshold, its
edges will be put into one single partition.
Algo:
For r in relations:
if r.size() > threshold
Evenly divide the edges of r into n parts and put one part into each partition.
else
Find partition with fewest edges, and put edges of r into
this partition.
Parameters
----------
edges : (heads, rels, tails) triple
Edge list to partition
n : int
Number of partitions
threshold : float
The threshold of whether a relation is LARGE or SMALL
Default: 5%
Returns
-------
List of np.array
Edges of each partition
List of np.array
Edge types of each partition
bool
Whether some relations belong to multiple partitions
"""
heads, rels, tails = edges
print('relation partition {} edges into {} parts'.format(len(heads), n))
uniq, cnts = np.unique(rels, return_counts=True)
idx = np.flip(np.argsort(cnts))
cnts = cnts[idx]
uniq = uniq[idx]
assert cnts[0] > cnts[-1]
edge_cnts = np.zeros(shape=(n,), dtype=np.int64)
rel_cnts = np.zeros(shape=(n,), dtype=np.int64)
rel_dict = {}
rel_parts = []
cross_rel_part = []
for _ in range(n):
rel_parts.append([])
large_threshold = int(len(rels) * threshold)
capacity_per_partition = int(len(rels) / n)
# ensure any relation larger than the partition capacity will be split
large_threshold = capacity_per_partition if capacity_per_partition < large_threshold \
else large_threshold
num_cross_part = 0
for i in range(len(cnts)):
cnt = cnts[i]
r = uniq[i]
r_parts = []
if cnt > large_threshold:
avg_part_cnt = (cnt // n) + 1
num_cross_part += 1
for j in range(n):
part_cnt = avg_part_cnt if cnt > avg_part_cnt else cnt
r_parts.append([j, part_cnt])
rel_parts[j].append(r)
edge_cnts[j] += part_cnt
rel_cnts[j] += 1
cnt -= part_cnt
cross_rel_part.append(r)
else:
idx = np.argmin(edge_cnts)
r_parts.append([idx, cnt])
rel_parts[idx].append(r)
edge_cnts[idx] += cnt
rel_cnts[idx] += 1
rel_dict[r] = r_parts
for i, edge_cnt in enumerate(edge_cnts):
print('part {} has {} edges and {} relations'.format(i, edge_cnt, rel_cnts[i]))
print('{}/{} duplicated relation across partitions'.format(num_cross_part, len(cnts)))
parts = []
for i in range(n):
parts.append([])
rel_parts[i] = np.array(rel_parts[i])
for i, r in enumerate(rels):
r_part = rel_dict[r][0]
part_idx = r_part[0]
cnt = r_part[1]
parts[part_idx].append(i)
cnt -= 1
if cnt == 0:
rel_dict[r].pop(0)
else:
rel_dict[r][0][1] = cnt
for i, part in enumerate(parts):
parts[i] = np.array(part, dtype=np.int64)
shuffle_idx = np.concatenate(parts)
heads[:] = heads[shuffle_idx]
rels[:] = rels[shuffle_idx]
tails[:] = tails[shuffle_idx]
off = 0
for i, part in enumerate(parts):
parts[i] = np.arange(off, off + len(part))
off += len(part)
cross_rel_part = np.array(cross_rel_part)
return parts, rel_parts, num_cross_part > 0, cross_rel_part
def BalancedRelationPartition(edges, n):
"""This partitions a list of edges based on relations to make sure
each partition has roughly the same number of edges and relations.
Algo:
For r in relations:
Find partition with fewest edges
if r.size() > num_of empty_slot
put edges of r into this partition to fill the partition,
find next partition with fewest edges to put r in.
else
put edges of r into this partition.
Parameters
----------
edges : (heads, rels, tails) triple
Edge list to partition
n : int
number of partitions
Returns
-------
List of np.array
Edges of each partition
List of np.array
Edge types of each partition
bool
Whether some relations belong to multiple partitions
"""
heads, rels, tails = edges
print('relation partition {} edges into {} parts'.format(len(heads), n))
uniq, cnts = np.unique(rels, return_counts=True)
idx = np.flip(np.argsort(cnts))
cnts = cnts[idx]
uniq = uniq[idx]
assert cnts[0] > cnts[-1]
edge_cnts = np.zeros(shape=(n,), dtype=np.int64)
rel_cnts = np.zeros(shape=(n,), dtype=np.int64)
rel_dict = {}
rel_parts = []
for _ in range(n):
rel_parts.append([])
max_edges = (len(rels) // n) + 1
num_cross_part = 0
for i in range(len(cnts)):
cnt = cnts[i]
r = uniq[i]
r_parts = []
while cnt > 0:
idx = np.argmin(edge_cnts)
if edge_cnts[idx] + cnt <= max_edges:
r_parts.append([idx, cnt])
rel_parts[idx].append(r)
edge_cnts[idx] += cnt
rel_cnts[idx] += 1
cnt = 0
else:
cur_cnt = max_edges - edge_cnts[idx]
r_parts.append([idx, cur_cnt])
rel_parts[idx].append(r)
edge_cnts[idx] += cur_cnt
rel_cnts[idx] += 1
num_cross_part += 1
cnt -= cur_cnt
rel_dict[r] = r_parts
for i, edge_cnt in enumerate(edge_cnts):
print('part {} has {} edges and {} relations'.format(i, edge_cnt, rel_cnts[i]))
print('{}/{} duplicated relation across partitions'.format(num_cross_part, len(cnts)))
parts = []
for i in range(n):
parts.append([])
rel_parts[i] = np.array(rel_parts[i])
for i, r in enumerate(rels):
r_part = rel_dict[r][0]
part_idx = r_part[0]
cnt = r_part[1]
parts[part_idx].append(i)
cnt -= 1
if cnt == 0:
rel_dict[r].pop(0)
else:
rel_dict[r][0][1] = cnt
for i, part in enumerate(parts):
parts[i] = np.array(part, dtype=np.int64)
shuffle_idx = np.concatenate(parts)
heads[:] = heads[shuffle_idx]
rels[:] = rels[shuffle_idx]
tails[:] = tails[shuffle_idx]
off = 0
for i, part in enumerate(parts):
parts[i] = np.arange(off, off + len(part))
off += len(part)
return parts, rel_parts, num_cross_part > 0
def RandomPartition(edges, n):
"""This partitions a list of edges randomly across n partitions
Parameters
----------
edges : (heads, rels, tails) triple
Edge list to partition
n : int
number of partitions
Returns
-------
List of np.array
Edges of each partition
"""
heads, rels, tails = edges
print('random partition {} edges into {} parts'.format(len(heads), n))
idx = np.random.permutation(len(heads))
heads[:] = heads[idx]
rels[:] = rels[idx]
tails[:] = tails[idx]
part_size = int(math.ceil(len(idx) / n))
parts = []
for i in range(n):
start = part_size * i
end = min(part_size * (i + 1), len(idx))
parts.append(idx[start:end])
print('part {} has {} edges'.format(i, len(parts[-1])))
return parts
def ConstructGraph(edges, n_entities, args):
"""Construct Graph for training
Parameters
----------
edges : (heads, rels, tails) triple
Edge list
n_entities : int
number of entities
args :
Global configs.
"""
pickle_name = 'graph_train.pickle'
if args.pickle_graph and os.path.exists(os.path.join(args.data_path, args.dataset, pickle_name)):
with open(os.path.join(args.data_path, args.dataset, pickle_name), 'rb') as graph_file:
g = pickle.load(graph_file)
print('Load pickled graph.')
else:
src, etype_id, dst = edges
coo = sp.sparse.coo_matrix((np.ones(len(src)), (src, dst)), shape=[n_entities, n_entities])
g = dgl.DGLGraph(coo, readonly=True, multigraph=True, sort_csr=True)
g.edata['tid'] = F.tensor(etype_id, F.int64)
if args.pickle_graph:
with open(os.path.join(args.data_path, args.dataset, pickle_name), 'wb') as graph_file:
pickle.dump(g, graph_file)
return g
class TrainDataset(object):
"""Dataset for training
Parameters
----------
dataset : KGDataset
Original dataset.
args :
Global configs.
ranks:
Number of partitions.
"""
def __init__(self, dataset, args, ranks=64):
triples = dataset.train
num_train = len(triples[0])
print('|Train|:', num_train)
if ranks > 1 and args.soft_rel_part:
self.edge_parts, self.rel_parts, self.cross_part, self.cross_rels = \
SoftRelationPartition(triples, ranks)
elif ranks > 1 and args.rel_part:
self.edge_parts, self.rel_parts, self.cross_part = \
BalancedRelationPartition(triples, ranks)
elif ranks > 1:
self.edge_parts = RandomPartition(triples, ranks)
self.cross_part = True
else:
self.edge_parts = [np.arange(num_train)]
self.rel_parts = [np.arange(dataset.n_relations)]
self.cross_part = False
self.g = ConstructGraph(triples, dataset.n_entities, args)
def create_sampler(self, batch_size, neg_sample_size=2, neg_chunk_size=None, mode='head', num_workers=32,
shuffle=True, exclude_positive=False, rank=0):
"""Create sampler for training
Parameters
----------
batch_size : int
Batch size of each mini batch.
neg_sample_size : int
How many negative edges sampled for each node.
neg_chunk_size : int
How many edges in one chunk. We split one batch into chunks.
mode : str
Sampling mode.
num_workers : int
Number of workers used in parallel for this sampler
shuffle : bool
If True, shuffle the seed edges.
If False, do not shuffle the seed edges.
Default: True
exclude_positive : bool
If True, exclude true positive edges from the sampled negative edges.
If False, return all sampled negative edges even if there are positive edges among them.
Default: False
rank : int
Which partition to sample.
Returns
-------
dgl.contrib.sampling.EdgeSampler
Edge sampler
"""
EdgeSampler = getattr(dgl.contrib.sampling, 'EdgeSampler')
assert batch_size % neg_sample_size == 0, 'batch_size should be divisible by B'
return EdgeSampler(self.g,
seed_edges=F.tensor(self.edge_parts[rank]),
batch_size=batch_size,
neg_sample_size=int(neg_sample_size/neg_chunk_size),
chunk_size=neg_chunk_size,
negative_mode=mode,
num_workers=num_workers,
shuffle=shuffle,
exclude_positive=exclude_positive,
return_false_neg=False)
class ChunkNegEdgeSubgraph(dgl.DGLGraph):
"""Wrapper for negative graph
Parameters
----------
neg_g : DGLGraph
Graph holding negative edges.
num_chunks : int
Number of chunks in sampled graph.
chunk_size : int
Info of chunk_size.
neg_sample_size : int
Info of neg_sample_size.
neg_head : bool
If True, negative_mode is 'head'
If False, negative_mode is 'tail'
"""
def __init__(self, subg, num_chunks, chunk_size,
neg_sample_size, neg_head):
super(ChunkNegEdgeSubgraph, self).__init__(graph_data=subg.sgi.graph,
readonly=True,
parent=subg._parent)
self.ndata[NID] = subg.sgi.induced_nodes.tousertensor()
self.edata[EID] = subg.sgi.induced_edges.tousertensor()
self.subg = subg
self.num_chunks = num_chunks
self.chunk_size = chunk_size
self.neg_sample_size = neg_sample_size
self.neg_head = neg_head
@property
def head_nid(self):
return self.subg.head_nid
@property
def tail_nid(self):
return self.subg.tail_nid
def create_neg_subgraph(pos_g, neg_g, chunk_size, neg_sample_size, is_chunked,
neg_head, num_nodes):
"""KG models need to know the number of chunks, the chunk size and negative sample size
of a negative subgraph to perform the computation more efficiently.
This function tries to infer all of this information for the negative subgraph
and create a wrapper class that contains all of the information.
Parameters
----------
pos_g : DGLGraph
Graph holding positive edges.
neg_g : DGLGraph
Graph holding negative edges.
chunk_size : int
Chunk size of the negative subgraph.
neg_sample_size : int
Negative sample size of the negative subgraph.
is_chunked : bool
If True, the sampled batch is chunked.
neg_head : bool
If True, negative_mode is 'head'
If False, negative_mode is 'tail'
num_nodes: int
Total number of nodes in the whole graph.
Returns
-------
ChunkNegEdgeSubgraph
Negative graph wrapper
"""
assert neg_g.number_of_edges() % pos_g.number_of_edges() == 0
# We use all nodes to create negative edges. Regardless of the sampling algorithm,
# we can always view the subgraph with one chunk.
if (neg_head and len(neg_g.head_nid) == num_nodes) \
or (not neg_head and len(neg_g.tail_nid) == num_nodes):
num_chunks = 1
chunk_size = pos_g.number_of_edges()
elif is_chunked:
# This is probably for evaluation.
if pos_g.number_of_edges() < chunk_size \
and neg_g.number_of_edges() % neg_sample_size == 0:
num_chunks = 1
chunk_size = pos_g.number_of_edges()
# This is probably the last batch in the training. Let's ignore it.
elif pos_g.number_of_edges() % chunk_size > 0:
return None
else:
num_chunks = int(pos_g.number_of_edges() / chunk_size)
assert num_chunks * chunk_size == pos_g.number_of_edges()
else:
num_chunks = pos_g.number_of_edges()
chunk_size = 1
return ChunkNegEdgeSubgraph(neg_g, num_chunks, chunk_size,
neg_sample_size, neg_head)
class EvalSampler(object):
"""Sampler for validation and testing
Parameters
----------
g : DGLGraph
Graph containing KG graph
edges : tensor
Seed edges
batch_size : int
Batch size of each mini batch.
neg_sample_size : int
How many negative edges sampled for each node.
neg_chunk_size : int
How many edges in one chunk. We split one batch into chunks.
mode : str
Sampling mode.
num_workers : int
Number of workers used in parallel for this sampler
filter_false_neg : bool
If True, exclude true positive edges from the sampled negative edges.
If False, return all sampled negative edges even if there are positive edges among them.
Default: True
"""
def __init__(self, g, edges, batch_size, neg_sample_size, neg_chunk_size, mode, num_workers=32,
filter_false_neg=True):
EdgeSampler = getattr(dgl.contrib.sampling, 'EdgeSampler')
self.sampler = EdgeSampler(g,
batch_size=batch_size,
seed_edges=edges,
neg_sample_size=neg_sample_size,
chunk_size=neg_chunk_size,
negative_mode=mode,
num_workers=num_workers,
shuffle=False,
exclude_positive=False,
relations=g.edata['tid'],
return_false_neg=filter_false_neg)
self.sampler_iter = iter(self.sampler)
self.mode = mode
self.neg_head = 'head' in mode
self.g = g
self.filter_false_neg = filter_false_neg
self.neg_chunk_size = neg_chunk_size
self.neg_sample_size = neg_sample_size
def __iter__(self):
return self
def __next__(self):
"""Get next batch
Returns
-------
DGLGraph
Sampled positive graph
ChunkNegEdgeSubgraph
Negative graph wrapper
"""
while True:
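            # Keep sampling until create_neg_subgraph returns a usable negative graph;
            # it returns None for batches whose size is not a multiple of the chunk size.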
pos_g, neg_g = next(self.sampler_iter)
if self.filter_false_neg:
neg_positive = neg_g.edata['false_neg']
neg_g = create_neg_subgraph(pos_g, neg_g,
self.neg_chunk_size,
self.neg_sample_size,
'chunk' in self.mode,
self.neg_head,
self.g.number_of_nodes())
if neg_g is not None:
break
pos_g.ndata['id'] = pos_g.parent_nid
neg_g.ndata['id'] = neg_g.parent_nid
pos_g.edata['id'] = pos_g._parent.edata['tid'][pos_g.parent_eid]
if self.filter_false_neg:
neg_g.edata['bias'] = F.astype(-neg_positive, F.float32)
return pos_g, neg_g
def reset(self):
"""Reset the sampler
"""
self.sampler_iter = iter(self.sampler)
return self
class EvalDataset(object):
"""Dataset for validation or testing
Parameters
----------
dataset : KGDataset
Original dataset.
args :
Global configs.
"""
def __init__(self, dataset, args):
pickle_name = 'graph_all.pickle'
if args.pickle_graph and os.path.exists(os.path.join(args.data_path, args.dataset, pickle_name)):
with open(os.path.join(args.data_path, args.dataset, pickle_name), 'rb') as graph_file:
g = pickle.load(graph_file)
print('Load pickled graph.')
else:
src = np.concatenate((dataset.train[0], dataset.valid[0], dataset.test[0]))
etype_id = np.concatenate((dataset.train[1], dataset.valid[1], dataset.test[1]))
dst = np.concatenate((dataset.train[2], dataset.valid[2], dataset.test[2]))
coo = sp.sparse.coo_matrix((np.ones(len(src)), (src, dst)),
shape=[dataset.n_entities, dataset.n_entities])
g = dgl.DGLGraph(coo, readonly=True, multigraph=True, sort_csr=True)
g.edata['tid'] = F.tensor(etype_id, F.int64)
if args.pickle_graph:
with open(os.path.join(args.data_path, args.dataset, pickle_name), 'wb') as graph_file:
pickle.dump(g, graph_file)
self.g = g
self.num_train = len(dataset.train[0])
self.num_valid = len(dataset.valid[0])
self.num_test = len(dataset.test[0])
if args.eval_percent < 1:
self.valid = np.random.randint(0, self.num_valid,
size=(int(self.num_valid * args.eval_percent),)) + self.num_train
else:
self.valid = np.arange(self.num_train, self.num_train + self.num_valid)
print('|valid|:', len(self.valid))
if args.eval_percent < 1:
self.test = np.random.randint(0, self.num_test,
                                           size=(int(self.num_test * args.eval_percent),))
self.test += self.num_train + self.num_valid
else:
self.test = np.arange(self.num_train + self.num_valid, self.g.number_of_edges())
print('|test|:', len(self.test))
def get_edges(self, eval_type):
""" Get all edges in this dataset
Parameters
----------
eval_type : str
Sampling type, 'valid' for validation and 'test' for testing
Returns
-------
np.array
Edges
"""
if eval_type == 'valid':
return self.valid
elif eval_type == 'test':
return self.test
else:
            raise Exception('Got invalid eval type: ' + eval_type)
def create_sampler(self, eval_type, batch_size, neg_sample_size, neg_chunk_size,
filter_false_neg, mode='head', num_workers=32, rank=0, ranks=1):
"""Create sampler for validation or testing
Parameters
----------
eval_type : str
Sampling type, 'valid' for validation and 'test' for testing
batch_size : int
Batch size of each mini batch.
neg_sample_size : int
            How many negative edges are sampled for each node.
        neg_chunk_size : int
            How many edges are in one chunk. We split one batch into chunks.
filter_false_neg : bool
            If True, exclude true positive edges from the sampled negative edges.
            If False, return all sampled negative edges even if some of them are positive edges.
mode : str
Sampling mode.
        num_workers : int
Number of workers used in parallel for this sampler
rank : int
Which partition to sample.
ranks : int
Total number of partitions.
Returns
-------
dgl.contrib.sampling.EdgeSampler
Edge sampler
"""
edges = self.get_edges(eval_type)
beg = edges.shape[0] * rank // ranks
end = min(edges.shape[0] * (rank + 1) // ranks, edges.shape[0])
edges = edges[beg: end]
return EvalSampler(self.g, edges, batch_size, neg_sample_size, neg_chunk_size,
mode, num_workers, filter_false_neg)
class NewBidirectionalOneShotIterator:
"""Grouped samper iterator
Parameters
----------
dataloader_head : dgl.contrib.sampling.EdgeSampler
EdgeSampler in head mode
dataloader_tail : dgl.contrib.sampling.EdgeSampler
EdgeSampler in tail mode
neg_chunk_size : int
        How many edges are in one chunk. We split one batch into chunks.
    neg_sample_size : int
        How many negative edges are sampled for each node.
is_chunked : bool
If True, the sampled batch is chunked.
num_nodes : int
Total number of nodes in the whole graph.
"""
def __init__(self, dataloader_head, dataloader_tail, neg_chunk_size, neg_sample_size,
is_chunked, num_nodes):
self.sampler_head = dataloader_head
self.sampler_tail = dataloader_tail
self.iterator_head = self.one_shot_iterator(dataloader_head, neg_chunk_size,
neg_sample_size, is_chunked,
True, num_nodes)
self.iterator_tail = self.one_shot_iterator(dataloader_tail, neg_chunk_size,
neg_sample_size, is_chunked,
False, num_nodes)
self.step = 0
def __next__(self):
self.step += 1
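        # Alternate between head-corrupted and tail-corrupted negative batches.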
if self.step % 2 == 0:
pos_g, neg_g = next(self.iterator_head)
else:
pos_g, neg_g = next(self.iterator_tail)
return pos_g, neg_g
@staticmethod
def one_shot_iterator(dataloader, neg_chunk_size, neg_sample_size, is_chunked,
neg_head, num_nodes):
while True:
for pos_g, neg_g in dataloader:
neg_g = create_neg_subgraph(pos_g, neg_g, neg_chunk_size, neg_sample_size,
is_chunked, neg_head, num_nodes)
if neg_g is None:
continue
pos_g.ndata['id'] = pos_g.parent_nid
neg_g.ndata['id'] = neg_g.parent_nid
pos_g.edata['id'] = pos_g._parent.edata['tid'][pos_g.parent_eid]
yield pos_g, neg_g
## Training Scripts for distributed training
1. Partition data
Partition FB15k:
```bash
./partition.sh ../data FB15k 4
```
Partition Freebase:
```bash
./partition.sh ../data Freebase 4
```
2. Modify `ip_config.txt` for your cluster (one line per machine; see the example config after step 3) and copy dgl-ke to all the machines
3. Run
```bash
./launch.sh \
~/dgl/apps/kg/distributed \
./fb15k_transe_l2.sh \
ubuntu ~/mykey.pem
```
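For step 2 above, each line of `ip_config.txt` describes one machine in the form `<ip> <port> <server_count>`: `launch.sh` reads the server count from the third column of the first line and takes the number of lines as the machine count. A sketch for a four-machine cluster (the addresses below are placeholders):
```
172.31.0.1 30050 8
172.31.0.2 30050 8
172.31.0.3 30050 8
172.31.0.4 30050 8
```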
#!/bin/bash
##################################################################################
# This script runs the TransE_l2 model on the FB15k dataset in a distributed setting.
# You can change the hyper-parameters in this file, but DO NOT run this script manually.
##################################################################################
machine_id=$1
server_count=$2
machine_count=$3
# Delete the temp file
rm *-shape
##################################################################################
# Start kvserver
##################################################################################
SERVER_ID_LOW=$((machine_id*server_count))
SERVER_ID_HIGH=$(((machine_id+1)*server_count))
while [ $SERVER_ID_LOW -lt $SERVER_ID_HIGH ]
do
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvserver.py --model TransE_l2 --dataset FB15k \
--hidden_dim 400 --gamma 19.9 --lr 0.25 --total_client 64 --server_id $SERVER_ID_LOW &
let SERVER_ID_LOW+=1
done
##################################################################################
# Start kvclient
##################################################################################
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvclient.py --model TransE_l2 --dataset FB15k \
--batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 500 --log_interval 100 --num_thread 1 \
--batch_size_eval 16 --test -adv --regularization_coef 1e-9 --total_machine $machine_count --num_client 16
#!/bin/bash
##################################################################################
# This script runs the ComplEx model on the Freebase dataset in a distributed setting.
# You can change the hyper-parameters in this file, but DO NOT run this script manually.
##################################################################################
machine_id=$1
server_count=$2
machine_count=$3
# Delete the temp file
rm *-shape
##################################################################################
# Start kvserver
##################################################################################
SERVER_ID_LOW=$((machine_id*server_count))
SERVER_ID_HIGH=$(((machine_id+1)*server_count))
while [ $SERVER_ID_LOW -lt $SERVER_ID_HIGH ]
do
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvserver.py --model ComplEx --dataset Freebase \
--hidden_dim 400 --gamma 143.0 --lr 0.1 --total_client 160 --server_id $SERVER_ID_LOW &
let SERVER_ID_LOW+=1
done
##################################################################################
# Start kvclient
##################################################################################
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvclient.py --model ComplEx --dataset Freebase \
--batch_size 1024 --neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.1 --max_step 12500 --log_interval 100 \
--batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv --total_machine $machine_count --num_thread 1 --num_client 40
#!/bin/bash
##################################################################################
# This script runs the DistMult model on the Freebase dataset in a distributed setting.
# You can change the hyper-parameters in this file, but DO NOT run this script manually.
##################################################################################
machine_id=$1
server_count=$2
machine_count=$3
# Delete the temp file
rm *-shape
##################################################################################
# Start kvserver
##################################################################################
SERVER_ID_LOW=$((machine_id*server_count))
SERVER_ID_HIGH=$(((machine_id+1)*server_count))
while [ $SERVER_ID_LOW -lt $SERVER_ID_HIGH ]
do
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvserver.py --model DistMult --dataset Freebase \
--hidden_dim 400 --gamma 143.0 --lr 0.08 --total_client 160 --server_id $SERVER_ID_LOW &
let SERVER_ID_LOW+=1
done
##################################################################################
# Start kvclient
##################################################################################
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvclient.py --model DistMult --dataset Freebase \
--batch_size 1024 --neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.08 --max_step 12500 --log_interval 100 \
--batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv --total_machine $machine_count --num_thread 1 --num_client 40
#!/bin/bash
##################################################################################
# This script runs the TransE_l2 model on the Freebase dataset in a distributed setting.
# You can change the hyper-parameters in this file, but DO NOT run this script manually.
##################################################################################
machine_id=$1
server_count=$2
machine_count=$3
# Delete the temp file
rm *-shape
##################################################################################
# Start kvserver
##################################################################################
SERVER_ID_LOW=$((machine_id*server_count))
SERVER_ID_HIGH=$(((machine_id+1)*server_count))
while [ $SERVER_ID_LOW -lt $SERVER_ID_HIGH ]
do
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvserver.py --model TransE_l2 --dataset Freebase \
--hidden_dim 400 --gamma 10 --lr 0.1 --total_client 160 --server_id $SERVER_ID_LOW &
let SERVER_ID_LOW+=1
done
##################################################################################
# Start kvclient
##################################################################################
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvclient.py --model TransE_l2 --dataset Freebase \
--batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 10 --lr 0.1 --max_step 12500 --log_interval 100 --num_thread 1 \
--batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv --regularization_coef 1e-9 --total_machine $machine_count --num_client 40
127.0.0.1 30050 8
127.0.0.1 30050 8
127.0.0.1 30050 8
127.0.0.1 30050 8
#!/bin/bash
##################################################################################
# User runs this script to launch distributed jobs on a cluster
##################################################################################
script_path=$1
script_file=$2
user_name=$3
ssh_key=$4
server_count=$(awk 'NR==1 {print $3}' ip_config.txt)
machine_count=$(awk 'END{print NR}' ip_config.txt)
# run command on remote machine
LINE_LOW=2
LINE_HIGH=$(awk 'END{print NR}' ip_config.txt)
let LINE_HIGH+=1
s_id=0
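# Remote machines (lines 2 onward of ip_config.txt) receive machine ids 1..N-1 via ssh;
# the local machine runs the same script at the end with machine id 0.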
while [ $LINE_LOW -lt $LINE_HIGH ]
do
ip=$(awk 'NR=='$LINE_LOW' {print $1}' ip_config.txt)
let LINE_LOW+=1
let s_id+=1
if test -z "$ssh_key"
then
ssh $user_name@$ip 'cd '$script_path'; '$script_file' '$s_id' '$server_count' '$machine_count'' &
else
ssh -i $ssh_key $user_name@$ip 'cd '$script_path'; '$script_file' '$s_id' '$server_count' '$machine_count'' &
fi
done
# run command on local machine
$script_file 0 $server_count $machine_count
#!/bin/bash
##################################################################################
# User runs this script to partition a graph using METIS
##################################################################################
DATA_PATH=$1
DATA_SET=$2
K=$3
# partition graph
python3 ../partition.py --dataset $DATA_SET -k $K --data_path $DATA_PATH
# copy related file to partition
PART_ID=0
while [ $PART_ID -lt $K ]
do
cp $DATA_PATH/$DATA_SET/relation* $DATA_PATH/$DATA_SET/partition_$PART_ID
let PART_ID+=1
done
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from dataloader import EvalDataset, TrainDataset
from dataloader import get_dataset
import argparse
import os
import logging
import time
import pickle
from utils import get_compatible_batch_size
backend = os.environ.get('DGLBACKEND', 'pytorch')
if backend.lower() == 'mxnet':
import multiprocessing as mp
from train_mxnet import load_model_from_checkpoint
from train_mxnet import test
else:
import torch.multiprocessing as mp
from train_pytorch import load_model_from_checkpoint
from train_pytorch import test, test_mp
class ArgParser(argparse.ArgumentParser):
def __init__(self):
super(ArgParser, self).__init__()
self.add_argument('--model_name', default='TransE',
choices=['TransE', 'TransE_l1', 'TransE_l2', 'TransR',
'RESCAL', 'DistMult', 'ComplEx', 'RotatE'],
help='model to use')
self.add_argument('--data_path', type=str, default='data',
help='root path of all dataset')
self.add_argument('--dataset', type=str, default='FB15k',
help='dataset name, under data_path')
self.add_argument('--format', type=str, default='built_in',
help='the format of the dataset, it can be built_in,'\
'raw_udd_{htr} and udd_{htr}')
self.add_argument('--data_files', type=str, default=None, nargs='+',
help='a list of data files, e.g. entity relation train valid test')
self.add_argument('--model_path', type=str, default='ckpts',
help='the place where models are saved')
self.add_argument('--batch_size_eval', type=int, default=8,
help='batch size used for eval and test')
self.add_argument('--neg_sample_size_eval', type=int, default=-1,
help='negative sampling size for testing')
self.add_argument('--neg_deg_sample_eval', action='store_true',
help='negative sampling proportional to vertex degree for testing')
self.add_argument('--hidden_dim', type=int, default=256,
help='hidden dim used by relation and entity')
self.add_argument('-g', '--gamma', type=float, default=12.0,
help='margin value')
self.add_argument('--eval_percent', type=float, default=1,
help='sample some percentage for evaluation.')
self.add_argument('--no_eval_filter', action='store_true',
help='do not filter positive edges among negative edges for evaluation')
self.add_argument('--gpu', type=int, default=[-1], nargs='+',
help='a list of active gpu ids, e.g. 0')
self.add_argument('--mix_cpu_gpu', action='store_true',
help='mix CPU and GPU training')
self.add_argument('-de', '--double_ent', action='store_true',
                          help='double entity dim for complex number')
self.add_argument('-dr', '--double_rel', action='store_true',
help='double relation dim for complex number')
self.add_argument('--num_proc', type=int, default=1,
help='number of process used')
self.add_argument('--num_thread', type=int, default=1,
help='number of thread used')
def parse_args(self):
args = super().parse_args()
return args
def get_logger(args):
if not os.path.exists(args.model_path):
raise Exception('No existing model_path: ' + args.model_path)
log_file = os.path.join(args.model_path, 'eval.log')
logging.basicConfig(
format='%(asctime)s %(levelname)-8s %(message)s',
level=logging.INFO,
datefmt='%Y-%m-%d %H:%M:%S',
filename=log_file,
filemode='w'
)
logger = logging.getLogger(__name__)
print("Logs are being recorded at: {}".format(log_file))
return logger
def main(args):
args.eval_filter = not args.no_eval_filter
if args.neg_deg_sample_eval:
assert not args.eval_filter, "if negative sampling based on degree, we can't filter positive edges."
# load dataset and samplers
dataset = get_dataset(args.data_path, args.dataset, args.format, args.data_files)
args.pickle_graph = False
args.train = False
args.valid = False
args.test = True
args.strict_rel_part = False
args.soft_rel_part = False
args.async_update = False
logger = get_logger(args)
    # Here we want to use the regular negative sampler because we need to ensure that
    # all positive edges are excluded.
eval_dataset = EvalDataset(dataset, args)
if args.neg_sample_size_eval < 0:
args.neg_sample_size_eval = args.neg_sample_size = eval_dataset.g.number_of_nodes()
args.batch_size_eval = get_compatible_batch_size(args.batch_size_eval, args.neg_sample_size_eval)
args.num_workers = 8 # fix num_workers to 8
if args.num_proc > 1:
test_sampler_tails = []
test_sampler_heads = []
for i in range(args.num_proc):
test_sampler_head = eval_dataset.create_sampler('test', args.batch_size_eval,
args.neg_sample_size_eval,
args.neg_sample_size_eval,
args.eval_filter,
mode='chunk-head',
num_workers=args.num_workers,
rank=i, ranks=args.num_proc)
test_sampler_tail = eval_dataset.create_sampler('test', args.batch_size_eval,
args.neg_sample_size_eval,
args.neg_sample_size_eval,
args.eval_filter,
mode='chunk-tail',
num_workers=args.num_workers,
rank=i, ranks=args.num_proc)
test_sampler_heads.append(test_sampler_head)
test_sampler_tails.append(test_sampler_tail)
else:
test_sampler_head = eval_dataset.create_sampler('test', args.batch_size_eval,
args.neg_sample_size_eval,
args.neg_sample_size_eval,
args.eval_filter,
mode='chunk-head',
num_workers=args.num_workers,
rank=0, ranks=1)
test_sampler_tail = eval_dataset.create_sampler('test', args.batch_size_eval,
args.neg_sample_size_eval,
args.neg_sample_size_eval,
args.eval_filter,
mode='chunk-tail',
num_workers=args.num_workers,
rank=0, ranks=1)
# load model
n_entities = dataset.n_entities
n_relations = dataset.n_relations
ckpt_path = args.model_path
model = load_model_from_checkpoint(logger, args, n_entities, n_relations, ckpt_path)
if args.num_proc > 1:
model.share_memory()
# test
args.step = 0
args.max_step = 0
start = time.time()
if args.num_proc > 1:
queue = mp.Queue(args.num_proc)
procs = []
for i in range(args.num_proc):
proc = mp.Process(target=test_mp, args=(args,
model,
[test_sampler_heads[i], test_sampler_tails[i]],
i,
'Test',
queue))
procs.append(proc)
proc.start()
total_metrics = {}
metrics = {}
logs = []
for i in range(args.num_proc):
log = queue.get()
logs = logs + log
for metric in logs[0].keys():
metrics[metric] = sum([log[metric] for log in logs]) / len(logs)
for k, v in metrics.items():
print('Test average {} at [{}/{}]: {}'.format(k, args.step, args.max_step, v))
for proc in procs:
proc.join()
else:
test(args, model, [test_sampler_head, test_sampler_tail])
print('Test takes {:.3f} seconds'.format(time.time() - start))
if __name__ == '__main__':
args = ArgParser().parse_args()
main(args)
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
import argparse
import time
import logging
import socket
if os.name != 'nt':
import fcntl
import struct
import torch.multiprocessing as mp
from train_pytorch import load_model, dist_train_test
from utils import get_compatible_batch_size
from train import get_logger
from dataloader import TrainDataset, NewBidirectionalOneShotIterator
from dataloader import get_dataset, get_partition_dataset
import dgl
import dgl.backend as F
WAIT_TIME = 10
class ArgParser(argparse.ArgumentParser):
def __init__(self):
super(ArgParser, self).__init__()
self.add_argument('--model_name', default='TransE',
choices=['TransE', 'TransE_l1', 'TransE_l2', 'TransR',
'RESCAL', 'DistMult', 'ComplEx', 'RotatE'],
help='model to use')
self.add_argument('--data_path', type=str, default='../data',
help='root path of all dataset')
self.add_argument('--dataset', type=str, default='FB15k',
help='dataset name, under data_path')
self.add_argument('--format', type=str, default='built_in',
help='the format of the dataset, it can be built_in,'\
'raw_udd_{htr} and udd_{htr}')
self.add_argument('--save_path', type=str, default='../ckpts',
help='place to save models and logs')
self.add_argument('--save_emb', type=str, default=None,
help='save the embeddings in the specific location.')
self.add_argument('--max_step', type=int, default=80000,
                          help='maximum number of training steps')
self.add_argument('--batch_size', type=int, default=1024,
help='batch size')
self.add_argument('--batch_size_eval', type=int, default=8,
help='batch size used for eval and test')
self.add_argument('--neg_sample_size', type=int, default=128,
help='negative sampling size')
self.add_argument('--neg_deg_sample', action='store_true',
help='negative sample proportional to vertex degree in the training')
self.add_argument('--neg_deg_sample_eval', action='store_true',
help='negative sampling proportional to vertex degree in the evaluation')
self.add_argument('--neg_sample_size_eval', type=int, default=-1,
help='negative sampling size for evaluation')
self.add_argument('--hidden_dim', type=int, default=256,
help='hidden dim used by relation and entity')
self.add_argument('--lr', type=float, default=0.0001,
help='learning rate')
self.add_argument('-g', '--gamma', type=float, default=12.0,
help='margin value')
self.add_argument('--no_eval_filter', action='store_true',
help='do not filter positive edges among negative edges for evaluation')
self.add_argument('--gpu', type=int, default=[-1], nargs='+',
help='a list of active gpu ids, e.g. 0 1 2 4')
self.add_argument('--mix_cpu_gpu', action='store_true',
help='mix CPU and GPU training')
self.add_argument('-de', '--double_ent', action='store_true',
                          help='double entity dim for complex number')
self.add_argument('-dr', '--double_rel', action='store_true',
help='double relation dim for complex number')
self.add_argument('-log', '--log_interval', type=int, default=1000,
                          help='print training log after every x steps')
self.add_argument('--eval_interval', type=int, default=10000,
help='do evaluation after every x steps')
self.add_argument('--eval_percent', type=float, default=1,
help='sample some percentage for evaluation.')
self.add_argument('-adv', '--neg_adversarial_sampling', action='store_true',
help='if use negative adversarial sampling')
self.add_argument('-a', '--adversarial_temperature', default=1.0, type=float,
help='adversarial_temperature')
self.add_argument('--valid', action='store_true',
                          help='whether to evaluate on the validation set')
        self.add_argument('--test', action='store_true',
                          help='whether to evaluate on the test set')
self.add_argument('-rc', '--regularization_coef', type=float, default=0.000002,
help='set value > 0.0 if regularization is used')
self.add_argument('-rn', '--regularization_norm', type=int, default=3,
help='norm used in regularization')
self.add_argument('--non_uni_weight', action='store_true',
                          help='use non-uniform (subsampling) weight when computing loss')
self.add_argument('--pickle_graph', action='store_true',
help='pickle built graph, building a huge graph is slow.')
self.add_argument('--num_proc', type=int, default=1,
help='number of process used')
self.add_argument('--num_thread', type=int, default=1,
help='number of thread used')
self.add_argument('--rel_part', action='store_true',
help='enable relation partitioning')
self.add_argument('--soft_rel_part', action='store_true',
help='enable soft relation partition')
self.add_argument('--async_update', action='store_true',
help='allow async_update on node embedding')
self.add_argument('--force_sync_interval', type=int, default=-1,
help='We force a synchronization between processes every x steps')
self.add_argument('--machine_id', type=int, default=0,
help='Unique ID of current machine.')
self.add_argument('--total_machine', type=int, default=1,
help='Total number of machine.')
self.add_argument('--ip_config', type=str, default='ip_config.txt',
help='IP configuration file of kvstore')
self.add_argument('--num_client', type=int, default=1,
help='Number of client on each machine.')
def get_long_tail_partition(n_relations, n_machine):
"""Relation types has a long tail distribution for many dataset.
So we need to average shuffle the data before we partition it.
"""
assert n_relations > 0, 'n_relations must be a positive number.'
assert n_machine > 0, 'n_machine must be a positive number.'
partition_book = [0] * n_relations
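    # Assign relation types to machines in round-robin order to spread them evenly.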
part_id = 0
for i in range(n_relations):
partition_book[i] = part_id
part_id += 1
if part_id == n_machine:
part_id = 0
return partition_book
def local_ip4_addr_list():
"""Return a set of IPv4 address
"""
nic = set()
for ix in socket.if_nameindex():
name = ix[1]
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ip = socket.inet_ntoa(fcntl.ioctl(
s.fileno(),
0x8915, # SIOCGIFADDR
struct.pack('256s', name[:15].encode("UTF-8")))[20:24])
nic.add(ip)
return nic
def get_local_machine_id(server_namebook):
"""Get machine ID via server_namebook
"""
assert len(server_namebook) > 0, 'server_namebook cannot be empty.'
res = 0
for ID, data in server_namebook.items():
machine_id = data[0]
ip = data[1]
if ip in local_ip4_addr_list():
res = machine_id
break
return res
def start_worker(args, logger):
"""Start kvclient for training
"""
init_time_start = time.time()
time.sleep(WAIT_TIME) # wait for launch script
server_namebook = dgl.contrib.read_ip_config(filename=args.ip_config)
args.machine_id = get_local_machine_id(server_namebook)
dataset, entity_partition_book, local2global = get_partition_dataset(
args.data_path,
args.dataset,
args.machine_id)
n_entities = dataset.n_entities
n_relations = dataset.n_relations
print('Partition %d n_entities: %d' % (args.machine_id, n_entities))
print("Partition %d n_relations: %d" % (args.machine_id, n_relations))
entity_partition_book = F.tensor(entity_partition_book)
relation_partition_book = get_long_tail_partition(dataset.n_relations, args.total_machine)
relation_partition_book = F.tensor(relation_partition_book)
local2global = F.tensor(local2global)
relation_partition_book.share_memory_()
entity_partition_book.share_memory_()
local2global.share_memory_()
train_data = TrainDataset(dataset, args, ranks=args.num_client)
    # if there is no cross-partition relation, we fall back to strict_rel_part
args.strict_rel_part = args.mix_cpu_gpu and (train_data.cross_part == False)
args.soft_rel_part = args.mix_cpu_gpu and args.soft_rel_part and train_data.cross_part
if args.neg_sample_size_eval < 0:
args.neg_sample_size_eval = dataset.n_entities
args.batch_size = get_compatible_batch_size(args.batch_size, args.neg_sample_size)
args.batch_size_eval = get_compatible_batch_size(args.batch_size_eval, args.neg_sample_size_eval)
args.num_workers = 8 # fix num_workers to 8
train_samplers = []
for i in range(args.num_client):
train_sampler_head = train_data.create_sampler(args.batch_size,
args.neg_sample_size,
args.neg_sample_size,
mode='head',
num_workers=args.num_workers,
shuffle=True,
exclude_positive=False,
rank=i)
train_sampler_tail = train_data.create_sampler(args.batch_size,
args.neg_sample_size,
args.neg_sample_size,
mode='tail',
num_workers=args.num_workers,
shuffle=True,
exclude_positive=False,
rank=i)
train_samplers.append(NewBidirectionalOneShotIterator(train_sampler_head, train_sampler_tail,
args.neg_sample_size, args.neg_sample_size,
True, n_entities))
dataset = None
model = load_model(logger, args, n_entities, n_relations)
model.share_memory()
print('Total initialize time {:.3f} seconds'.format(time.time() - init_time_start))
rel_parts = train_data.rel_parts if args.strict_rel_part or args.soft_rel_part else None
cross_rels = train_data.cross_rels if args.soft_rel_part else None
procs = []
barrier = mp.Barrier(args.num_client)
for i in range(args.num_client):
proc = mp.Process(target=dist_train_test, args=(args,
model,
train_samplers[i],
entity_partition_book,
relation_partition_book,
local2global,
i,
rel_parts,
cross_rels,
barrier))
procs.append(proc)
proc.start()
for proc in procs:
proc.join()
if __name__ == '__main__':
args = ArgParser().parse_args()
logger = get_logger(args)
start_worker(args, logger)
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
import argparse
import time
import dgl
from dgl.contrib import KVServer
import torch as th
from train_pytorch import load_model
from dataloader import get_server_partition_dataset
NUM_THREAD = 1 # Fix the number of threads to 1 on kvstore
class KGEServer(KVServer):
"""User-defined kvstore for DGL-KGE
"""
def _push_handler(self, name, ID, data, target):
"""Row-Sparse Adagrad updater
"""
original_name = name[0:-6]
state_sum = target[original_name+'_state-data-']
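        # Adagrad: accumulate the row-wise mean of squared gradients into the state,
        # then scale the step by -lr / (sqrt(state) + 1e-10) for the touched rows only.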
grad_sum = (data * data).mean(1)
state_sum.index_add_(0, ID, grad_sum)
std = state_sum[ID] # _sparse_mask
std_values = std.sqrt_().add_(1e-10).unsqueeze(1)
tmp = (-self.clr * data / std_values)
target[name].index_add_(0, ID, tmp)
def set_clr(self, learning_rate):
"""Set learning rate for Row-Sparse Adagrad updater
"""
self.clr = learning_rate
# Note: Most of the args are unnecessary for KVStore, will remove them later
class ArgParser(argparse.ArgumentParser):
def __init__(self):
super(ArgParser, self).__init__()
self.add_argument('--model_name', default='TransE',
choices=['TransE', 'TransE_l1', 'TransE_l2', 'TransR',
'RESCAL', 'DistMult', 'ComplEx', 'RotatE'],
help='model to use')
self.add_argument('--data_path', type=str, default='../data',
help='root path of all dataset')
self.add_argument('--dataset', type=str, default='FB15k',
help='dataset name, under data_path')
self.add_argument('--format', type=str, default='built_in',
help='the format of the dataset, it can be built_in,'\
'raw_udd_{htr} and udd_{htr}')
self.add_argument('--hidden_dim', type=int, default=256,
help='hidden dim used by relation and entity')
self.add_argument('--lr', type=float, default=0.0001,
help='learning rate')
self.add_argument('-g', '--gamma', type=float, default=12.0,
help='margin value')
self.add_argument('--gpu', type=int, default=[-1], nargs='+',
help='a list of active gpu ids, e.g. 0')
self.add_argument('--mix_cpu_gpu', action='store_true',
help='mix CPU and GPU training')
self.add_argument('-de', '--double_ent', action='store_true',
                          help='double entity dim for complex number')
self.add_argument('-dr', '--double_rel', action='store_true',
help='double relation dim for complex number')
self.add_argument('--num_thread', type=int, default=1,
help='number of thread used')
self.add_argument('--server_id', type=int, default=0,
help='Unique ID of each server')
self.add_argument('--ip_config', type=str, default='ip_config.txt',
help='IP configuration file of kvstore')
self.add_argument('--total_client', type=int, default=1,
help='Total number of client worker nodes')
def get_server_data(args, machine_id):
"""Get data from data_path/dataset/part_machine_id
    Return: global2local,
            entity_emb,
            entity_emb_state,
            relation_emb,
            relation_emb_state
"""
g2l, dataset = get_server_partition_dataset(
args.data_path,
args.dataset,
machine_id)
    # Note that the dataset doesn't contain the triples
print('n_entities: ' + str(dataset.n_entities))
print('n_relations: ' + str(dataset.n_relations))
args.soft_rel_part = False
args.strict_rel_part = False
model = load_model(None, args, dataset.n_entities, dataset.n_relations)
return g2l, model.entity_emb.emb, model.entity_emb.state_sum, model.relation_emb.emb, model.relation_emb.state_sum
def start_server(args):
"""Start kvstore service
"""
th.set_num_threads(NUM_THREAD)
server_namebook = dgl.contrib.read_ip_config(filename=args.ip_config)
my_server = KGEServer(server_id=args.server_id,
server_namebook=server_namebook,
num_client=args.total_client)
my_server.set_clr(args.lr)
if my_server.get_id() % my_server.get_group_count() == 0: # master server
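        # Only the master server on each machine loads the partitioned embedding data;
        # backup servers register the same tensor names without providing data.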
g2l, entity_emb, entity_emb_state, relation_emb, relation_emb_state = get_server_data(args, my_server.get_machine_id())
my_server.set_global2local(name='entity_emb', global2local=g2l)
my_server.init_data(name='relation_emb', data_tensor=relation_emb)
my_server.init_data(name='relation_emb_state', data_tensor=relation_emb_state)
my_server.init_data(name='entity_emb', data_tensor=entity_emb)
my_server.init_data(name='entity_emb_state', data_tensor=entity_emb_state)
else: # backup server
my_server.set_global2local(name='entity_emb')
my_server.init_data(name='relation_emb')
my_server.init_data(name='relation_emb_state')
my_server.init_data(name='entity_emb')
my_server.init_data(name='entity_emb_state')
print('KVServer %d listen for requests ...' % my_server.get_id())
my_server.start()
if __name__ == '__main__':
args = ArgParser().parse_args()
start_server(args)
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from .general_models import KEModel
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""
Graph Embedding Model
1. TransE
2. TransR
3. RESCAL
4. DistMult
5. ComplEx
6. RotatE
"""
import os
import numpy as np
import dgl.backend as F
backend = os.environ.get('DGLBACKEND', 'pytorch')
if backend.lower() == 'mxnet':
from .mxnet.tensor_models import logsigmoid
from .mxnet.tensor_models import get_device
from .mxnet.tensor_models import norm
from .mxnet.tensor_models import get_scalar
from .mxnet.tensor_models import reshape
from .mxnet.tensor_models import cuda
from .mxnet.tensor_models import ExternalEmbedding
from .mxnet.score_fun import *
else:
from .pytorch.tensor_models import logsigmoid
from .pytorch.tensor_models import get_device
from .pytorch.tensor_models import norm
from .pytorch.tensor_models import get_scalar
from .pytorch.tensor_models import reshape
from .pytorch.tensor_models import cuda
from .pytorch.tensor_models import ExternalEmbedding
from .pytorch.score_fun import *
class KEModel(object):
""" DGL Knowledge Embedding Model.
Parameters
----------
args:
Global configs.
model_name : str
Which KG model to use, including 'TransE_l1', 'TransE_l2', 'TransR',
'RESCAL', 'DistMult', 'ComplEx', 'RotatE'
n_entities : int
Num of entities.
n_relations : int
Num of relations.
hidden_dim : int
        Dimension size of the embedding.
gamma : float
Gamma for score function.
double_entity_emb : bool
If True, entity embedding size will be 2 * hidden_dim.
Default: False
double_relation_emb : bool
If True, relation embedding size will be 2 * hidden_dim.
Default: False
"""
def __init__(self, args, model_name, n_entities, n_relations, hidden_dim, gamma,
double_entity_emb=False, double_relation_emb=False):
super(KEModel, self).__init__()
self.args = args
self.n_entities = n_entities
self.n_relations = n_relations
self.model_name = model_name
self.hidden_dim = hidden_dim
self.eps = 2.0
self.emb_init = (gamma + self.eps) / hidden_dim
entity_dim = 2 * hidden_dim if double_entity_emb else hidden_dim
relation_dim = 2 * hidden_dim if double_relation_emb else hidden_dim
device = get_device(args)
self.entity_emb = ExternalEmbedding(args, n_entities, entity_dim,
F.cpu() if args.mix_cpu_gpu else device)
# For RESCAL, relation_emb = relation_dim * entity_dim
if model_name == 'RESCAL':
rel_dim = relation_dim * entity_dim
else:
rel_dim = relation_dim
self.rel_dim = rel_dim
self.entity_dim = entity_dim
self.strict_rel_part = args.strict_rel_part
self.soft_rel_part = args.soft_rel_part
if not self.strict_rel_part and not self.soft_rel_part:
self.relation_emb = ExternalEmbedding(args, n_relations, rel_dim,
F.cpu() if args.mix_cpu_gpu else device)
else:
self.global_relation_emb = ExternalEmbedding(args, n_relations, rel_dim, F.cpu())
if model_name == 'TransE' or model_name == 'TransE_l2':
self.score_func = TransEScore(gamma, 'l2')
elif model_name == 'TransE_l1':
self.score_func = TransEScore(gamma, 'l1')
elif model_name == 'TransR':
projection_emb = ExternalEmbedding(args,
n_relations,
entity_dim * relation_dim,
F.cpu() if args.mix_cpu_gpu else device)
self.score_func = TransRScore(gamma, projection_emb, relation_dim, entity_dim)
elif model_name == 'DistMult':
self.score_func = DistMultScore()
elif model_name == 'ComplEx':
self.score_func = ComplExScore()
elif model_name == 'RESCAL':
self.score_func = RESCALScore(relation_dim, entity_dim)
elif model_name == 'RotatE':
self.score_func = RotatEScore(gamma, self.emb_init)
self.model_name = model_name
self.head_neg_score = self.score_func.create_neg(True)
self.tail_neg_score = self.score_func.create_neg(False)
self.head_neg_prepare = self.score_func.create_neg_prepare(True)
self.tail_neg_prepare = self.score_func.create_neg_prepare(False)
self.reset_parameters()
def share_memory(self):
"""Use torch.tensor.share_memory_() to allow cross process embeddings access.
"""
self.entity_emb.share_memory()
if self.strict_rel_part or self.soft_rel_part:
self.global_relation_emb.share_memory()
else:
self.relation_emb.share_memory()
if self.model_name == 'TransR':
self.score_func.share_memory()
def save_emb(self, path, dataset):
"""Save the model.
Parameters
----------
path : str
Directory to save the model.
dataset : str
Dataset name as prefix to the saved embeddings.
"""
self.entity_emb.save(path, dataset+'_'+self.model_name+'_entity')
if self.strict_rel_part or self.soft_rel_part:
self.global_relation_emb.save(path, dataset+'_'+self.model_name+'_relation')
else:
self.relation_emb.save(path, dataset+'_'+self.model_name+'_relation')
self.score_func.save(path, dataset+'_'+self.model_name)
def load_emb(self, path, dataset):
"""Load the model.
Parameters
----------
path : str
Directory to load the model.
dataset : str
Dataset name as prefix to the saved embeddings.
"""
self.entity_emb.load(path, dataset+'_'+self.model_name+'_entity')
self.relation_emb.load(path, dataset+'_'+self.model_name+'_relation')
self.score_func.load(path, dataset+'_'+self.model_name)
def reset_parameters(self):
"""Re-initialize the model.
"""
self.entity_emb.init(self.emb_init)
self.score_func.reset_parameters()
if (not self.strict_rel_part) and (not self.soft_rel_part):
self.relation_emb.init(self.emb_init)
else:
self.global_relation_emb.init(self.emb_init)
def predict_score(self, g):
"""Predict the positive score.
Parameters
----------
g : DGLGraph
Graph holding positive edges.
Returns
-------
tensor
The positive score
"""
self.score_func(g)
return g.edata['score']
def predict_neg_score(self, pos_g, neg_g, to_device=None, gpu_id=-1, trace=False,
neg_deg_sample=False):
"""Calculate the negative score.
Parameters
----------
pos_g : DGLGraph
Graph holding positive edges.
neg_g : DGLGraph
Graph holding negative edges.
to_device : func
Function to move data into device.
gpu_id : int
Which gpu to move data to.
trace : bool
If True, trace the computation. This is required in training.
If False, do not trace the computation.
Default: False
neg_deg_sample : bool
If True, we use the head and tail nodes of the positive edges to
construct negative edges.
Default: False
Returns
-------
tensor
The negative score
"""
num_chunks = neg_g.num_chunks
chunk_size = neg_g.chunk_size
neg_sample_size = neg_g.neg_sample_size
mask = F.ones((num_chunks, chunk_size * (neg_sample_size + chunk_size)),
dtype=F.float32, ctx=F.context(pos_g.ndata['emb']))
if neg_g.neg_head:
neg_head_ids = neg_g.ndata['id'][neg_g.head_nid]
neg_head = self.entity_emb(neg_head_ids, gpu_id, trace)
head_ids, tail_ids = pos_g.all_edges(order='eid')
if to_device is not None and gpu_id >= 0:
tail_ids = to_device(tail_ids, gpu_id)
tail = pos_g.ndata['emb'][tail_ids]
rel = pos_g.edata['emb']
# When we train a batch, we could use the head nodes of the positive edges to
# construct negative edges. We construct a negative edge between a positive head
# node and every positive tail node.
# When we construct negative edges like this, we know there is one positive
# edge for a positive head node among the negative edges. We need to mask
# them.
if neg_deg_sample:
head = pos_g.ndata['emb'][head_ids]
head = head.reshape(num_chunks, chunk_size, -1)
neg_head = neg_head.reshape(num_chunks, neg_sample_size, -1)
neg_head = F.cat([head, neg_head], 1)
neg_sample_size = chunk_size + neg_sample_size
mask[:,0::(neg_sample_size + 1)] = 0
neg_head = neg_head.reshape(num_chunks * neg_sample_size, -1)
neg_head, tail = self.head_neg_prepare(pos_g.edata['id'], num_chunks, neg_head, tail, gpu_id, trace)
neg_score = self.head_neg_score(neg_head, rel, tail,
num_chunks, chunk_size, neg_sample_size)
else:
neg_tail_ids = neg_g.ndata['id'][neg_g.tail_nid]
neg_tail = self.entity_emb(neg_tail_ids, gpu_id, trace)
head_ids, tail_ids = pos_g.all_edges(order='eid')
if to_device is not None and gpu_id >= 0:
head_ids = to_device(head_ids, gpu_id)
head = pos_g.ndata['emb'][head_ids]
rel = pos_g.edata['emb']
# This is negative edge construction similar to the above.
if neg_deg_sample:
tail = pos_g.ndata['emb'][tail_ids]
tail = tail.reshape(num_chunks, chunk_size, -1)
neg_tail = neg_tail.reshape(num_chunks, neg_sample_size, -1)
neg_tail = F.cat([tail, neg_tail], 1)
neg_sample_size = chunk_size + neg_sample_size
mask[:,0::(neg_sample_size + 1)] = 0
neg_tail = neg_tail.reshape(num_chunks * neg_sample_size, -1)
head, neg_tail = self.tail_neg_prepare(pos_g.edata['id'], num_chunks, head, neg_tail, gpu_id, trace)
neg_score = self.tail_neg_score(head, rel, neg_tail,
num_chunks, chunk_size, neg_sample_size)
if neg_deg_sample:
neg_g.neg_sample_size = neg_sample_size
mask = mask.reshape(num_chunks, chunk_size, neg_sample_size)
return neg_score * mask
else:
return neg_score
def forward_test(self, pos_g, neg_g, logs, gpu_id=-1):
"""Do the forward and generate ranking results.
Parameters
----------
pos_g : DGLGraph
Graph holding positive edges.
neg_g : DGLGraph
Graph holding negative edges.
logs : List
            List in which the ranking results are stored.
gpu_id : int
Which gpu to accelerate the calculation. if -1 is provided, cpu is used.
"""
pos_g.ndata['emb'] = self.entity_emb(pos_g.ndata['id'], gpu_id, False)
pos_g.edata['emb'] = self.relation_emb(pos_g.edata['id'], gpu_id, False)
self.score_func.prepare(pos_g, gpu_id, False)
batch_size = pos_g.number_of_edges()
pos_scores = self.predict_score(pos_g)
pos_scores = reshape(logsigmoid(pos_scores), batch_size, -1)
neg_scores = self.predict_neg_score(pos_g, neg_g, to_device=cuda,
gpu_id=gpu_id, trace=False,
neg_deg_sample=self.args.neg_deg_sample_eval)
neg_scores = reshape(logsigmoid(neg_scores), batch_size, -1)
# We need to filter the positive edges in the negative graph.
if self.args.eval_filter:
filter_bias = reshape(neg_g.edata['bias'], batch_size, -1)
if gpu_id >= 0:
filter_bias = cuda(filter_bias, gpu_id)
neg_scores += filter_bias
# To compute the rank of a positive edge among all negative edges,
# we need to know how many negative edges have higher scores than
# the positive edge.
rankings = F.sum(neg_scores >= pos_scores, dim=1) + 1
rankings = F.asnumpy(rankings)
for i in range(batch_size):
ranking = rankings[i]
logs.append({
'MRR': 1.0 / ranking,
'MR': float(ranking),
'HITS@1': 1.0 if ranking <= 1 else 0.0,
'HITS@3': 1.0 if ranking <= 3 else 0.0,
'HITS@10': 1.0 if ranking <= 10 else 0.0
})
# @profile
def forward(self, pos_g, neg_g, gpu_id=-1):
"""Do the forward.
Parameters
----------
pos_g : DGLGraph
Graph holding positive edges.
neg_g : DGLGraph
Graph holding negative edges.
gpu_id : int
Which gpu to accelerate the calculation. if -1 is provided, cpu is used.
Returns
-------
tensor
loss value
dict
loss info
"""
pos_g.ndata['emb'] = self.entity_emb(pos_g.ndata['id'], gpu_id, True)
pos_g.edata['emb'] = self.relation_emb(pos_g.edata['id'], gpu_id, True)
self.score_func.prepare(pos_g, gpu_id, True)
pos_score = self.predict_score(pos_g)
pos_score = logsigmoid(pos_score)
if gpu_id >= 0:
neg_score = self.predict_neg_score(pos_g, neg_g, to_device=cuda,
gpu_id=gpu_id, trace=True,
neg_deg_sample=self.args.neg_deg_sample)
else:
neg_score = self.predict_neg_score(pos_g, neg_g, trace=True,
neg_deg_sample=self.args.neg_deg_sample)
neg_score = reshape(neg_score, -1, neg_g.neg_sample_size)
# Adversarial sampling
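        # Self-adversarial weighting: each negative score is weighted by the softmax of
        # the negative scores (detached, so the weights themselves receive no gradient).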
if self.args.neg_adversarial_sampling:
neg_score = F.sum(F.softmax(neg_score * self.args.adversarial_temperature, dim=1).detach()
* logsigmoid(-neg_score), dim=1)
else:
neg_score = F.mean(logsigmoid(-neg_score), dim=1)
# subsampling weight
# TODO: add subsampling to new sampler
if self.args.non_uni_weight:
subsampling_weight = pos_g.edata['weight']
pos_score = (pos_score * subsampling_weight).sum() / subsampling_weight.sum()
neg_score = (neg_score * subsampling_weight).sum() / subsampling_weight.sum()
else:
pos_score = pos_score.mean()
neg_score = neg_score.mean()
# compute loss
loss = -(pos_score + neg_score) / 2
log = {'pos_loss': - get_scalar(pos_score),
'neg_loss': - get_scalar(neg_score),
'loss': get_scalar(loss)}
# regularization: TODO(zihao)
#TODO: only reg ent&rel embeddings. other params to be added.
if self.args.regularization_coef > 0.0 and self.args.regularization_norm > 0:
coef, nm = self.args.regularization_coef, self.args.regularization_norm
reg = coef * (norm(self.entity_emb.curr_emb(), nm) + norm(self.relation_emb.curr_emb(), nm))
log['regularization'] = get_scalar(reg)
loss = loss + reg
return loss, log
def update(self, gpu_id=-1):
""" Update the embeddings in the model
gpu_id : int
Which gpu to accelerate the calculation. if -1 is provided, cpu is used.
"""
self.entity_emb.update(gpu_id)
self.relation_emb.update(gpu_id)
self.score_func.update(gpu_id)
def prepare_relation(self, device=None):
""" Prepare relation embeddings in multi-process multi-gpu training model.
device : th.device
Which device (GPU) to put relation embeddings in.
"""
self.relation_emb = ExternalEmbedding(self.args, self.n_relations, self.rel_dim, device)
self.relation_emb.init(self.emb_init)
if self.model_name == 'TransR':
local_projection_emb = ExternalEmbedding(self.args, self.n_relations,
self.entity_dim * self.rel_dim, device)
self.score_func.prepare_local_emb(local_projection_emb)
self.score_func.reset_parameters()
def prepare_cross_rels(self, cross_rels):
self.relation_emb.setup_cross_rels(cross_rels, self.global_relation_emb)
if self.model_name == 'TransR':
self.score_func.prepare_cross_rels(cross_rels)
def writeback_relation(self, rank=0, rel_parts=None):
""" Writeback relation embeddings in a specific process to global relation embedding.
Used in multi-process multi-gpu training model.
rank : int
Process id.
rel_parts : List of tensor
            List of tensors storing the edge types of each partition.
"""
idx = rel_parts[rank]
if self.soft_rel_part:
idx = self.relation_emb.get_noncross_idx(idx)
self.global_relation_emb.emb[idx] = F.copy_to(self.relation_emb.emb, F.cpu())[idx]
if self.model_name == 'TransR':
self.score_func.writeback_local_emb(idx)
def load_relation(self, device=None):
""" Sync global relation embeddings into local relation embeddings.
        Used in multi-process multi-gpu training mode.
device : th.device
Which device (GPU) to put relation embeddings in.
"""
self.relation_emb = ExternalEmbedding(self.args, self.n_relations, self.rel_dim, device)
self.relation_emb.emb = F.copy_to(self.global_relation_emb.emb, device)
if self.model_name == 'TransR':
local_projection_emb = ExternalEmbedding(self.args, self.n_relations,
self.entity_dim * self.rel_dim, device)
self.score_func.load_local_emb(local_projection_emb)
def create_async_update(self):
"""Set up the async update for entity embedding.
"""
self.entity_emb.create_async_update()
def finish_async_update(self):
"""Terminate the async update for entity embedding.
"""
self.entity_emb.finish_async_update()
def pull_model(self, client, pos_g, neg_g):
with th.no_grad():
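            # Pull only the embedding rows touched by this mini-batch from the
            # distributed kvstore into the local embedding buffers.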
entity_id = F.cat(seq=[pos_g.ndata['id'], neg_g.ndata['id']], dim=0)
relation_id = pos_g.edata['id']
entity_id = F.tensor(np.unique(F.asnumpy(entity_id)))
relation_id = F.tensor(np.unique(F.asnumpy(relation_id)))
l2g = client.get_local2global()
global_entity_id = l2g[entity_id]
entity_data = client.pull(name='entity_emb', id_tensor=global_entity_id)
relation_data = client.pull(name='relation_emb', id_tensor=relation_id)
self.entity_emb.emb[entity_id] = entity_data
self.relation_emb.emb[relation_id] = relation_data
def push_gradient(self, client):
with th.no_grad():
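            # Push the sparse gradients of the rows touched in this mini-batch back to
            # the kvstore; entity ids are mapped to global ids first.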
l2g = client.get_local2global()
for entity_id, entity_data in self.entity_emb.trace:
grad = entity_data.grad.data
global_entity_id =l2g[entity_id]
client.push(name='entity_emb', id_tensor=global_entity_id, data_tensor=grad)
for relation_id, relation_data in self.relation_emb.trace:
grad = relation_data.grad.data
client.push(name='relation_emb', id_tensor=relation_id, data_tensor=grad)
self.entity_emb.trace = []
self.relation_emb.trace = []
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import numpy as np
import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn
from mxnet import ndarray as nd
def batched_l2_dist(a, b):
a_squared = nd.power(nd.norm(a, axis=-1), 2)
b_squared = nd.power(nd.norm(b, axis=-1), 2)
squared_res = nd.add(nd.linalg_gemm(
a, nd.transpose(b, axes=(0, 2, 1)), nd.broadcast_axes(nd.expand_dims(b_squared, axis=-2), axis=1, size=a.shape[1]), alpha=-2
), nd.expand_dims(a_squared, axis=-1))
res = nd.sqrt(nd.clip(squared_res, 1e-30, np.finfo(np.float32).max))
return res
def batched_l1_dist(a, b):
a = nd.expand_dims(a, axis=-2)
b = nd.expand_dims(b, axis=-3)
res = nd.norm(a - b, ord=1, axis=-1)
return res
class TransEScore(nn.Block):
""" TransE score function
Paper link: https://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data
"""
def __init__(self, gamma, dist_func='l2'):
super(TransEScore, self).__init__()
self.gamma = gamma
if dist_func == 'l1':
self.neg_dist_func = batched_l1_dist
self.dist_ord = 1
else: # default use l2
self.neg_dist_func = batched_l2_dist
self.dist_ord = 2
def edge_func(self, edges):
head = edges.src['emb']
tail = edges.dst['emb']
rel = edges.data['emb']
score = head + rel - tail
return {'score': self.gamma - nd.norm(score, ord=self.dist_ord, axis=-1)}
def prepare(self, g, gpu_id, trace=False):
pass
def create_neg_prepare(self, neg_head):
def fn(rel_id, num_chunks, head, tail, gpu_id, trace=False):
return head, tail
return fn
def update(self, gpu_id=-1):
pass
def reset_parameters(self):
pass
def save(self, path, name):
pass
def load(self, path, name):
pass
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
def create_neg(self, neg_head):
gamma = self.gamma
if neg_head:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
heads = heads.reshape(num_chunks, neg_sample_size, hidden_dim)
tails = tails - relations
tails = tails.reshape(num_chunks, chunk_size, hidden_dim)
return gamma - self.neg_dist_func(tails, heads)
return fn
else:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
heads = heads + relations
heads = heads.reshape(num_chunks, chunk_size, hidden_dim)
tails = tails.reshape(num_chunks, neg_sample_size, hidden_dim)
return gamma - self.neg_dist_func(heads, tails)
return fn
class TransRScore(nn.Block):
"""TransR score function
Paper link: https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/download/9571/9523
"""
def __init__(self, gamma, projection_emb, relation_dim, entity_dim):
super(TransRScore, self).__init__()
self.gamma = gamma
self.projection_emb = projection_emb
self.relation_dim = relation_dim
self.entity_dim = entity_dim
def edge_func(self, edges):
head = edges.data['head_emb']
tail = edges.data['tail_emb']
rel = edges.data['emb']
score = head + rel - tail
return {'score': self.gamma - nd.norm(score, ord=1, axis=-1)}
def prepare(self, g, gpu_id, trace=False):
head_ids, tail_ids = g.all_edges(order='eid')
projection = self.projection_emb(g.edata['id'], gpu_id, trace)
projection = projection.reshape(-1, self.entity_dim, self.relation_dim)
head_emb = g.ndata['emb'][head_ids.as_in_context(g.ndata['emb'].context)].expand_dims(axis=-2)
tail_emb = g.ndata['emb'][tail_ids.as_in_context(g.ndata['emb'].context)].expand_dims(axis=-2)
g.edata['head_emb'] = nd.batch_dot(head_emb, projection).squeeze()
g.edata['tail_emb'] = nd.batch_dot(tail_emb, projection).squeeze()
def create_neg_prepare(self, neg_head):
if neg_head:
def fn(rel_id, num_chunks, head, tail, gpu_id, trace=False):
                # positive nodes: project each onto its own relation space
projection = self.projection_emb(rel_id, gpu_id, trace)
projection = projection.reshape(-1, self.entity_dim, self.relation_dim)
tail = tail.reshape(-1, 1, self.entity_dim)
tail = nd.batch_dot(tail, projection)
tail = tail.reshape(num_chunks, -1, self.relation_dim)
                # negative nodes: project each onto every relation space in the chunk
projection = projection.reshape(num_chunks, -1, self.entity_dim, self.relation_dim)
head = head.reshape(num_chunks, -1, 1, self.entity_dim)
num_rels = projection.shape[1]
num_nnodes = head.shape[1]
heads = []
for i in range(num_chunks):
head_negs = []
for j in range(num_nnodes):
head_neg = head[i][j]
head_neg = head_neg.reshape(1, 1, self.entity_dim)
head_neg = nd.broadcast_axis(head_neg, axis=0, size=num_rels)
head_neg = nd.batch_dot(head_neg, projection[i])
head_neg = head_neg.squeeze(axis=1)
head_negs.append(head_neg)
head_negs = nd.stack(*head_negs, axis=1)
heads.append(head_negs)
head = nd.stack(*heads)
return head, tail
return fn
else:
def fn(rel_id, num_chunks, head, tail, gpu_id, trace=False):
                # positive nodes: project each onto its own relation space
projection = self.projection_emb(rel_id, gpu_id, trace)
projection = projection.reshape(-1, self.entity_dim, self.relation_dim)
head = head.reshape(-1, 1, self.entity_dim)
head = nd.batch_dot(head, projection).squeeze()
head = head.reshape(num_chunks, -1, self.relation_dim)
projection = projection.reshape(num_chunks, -1, self.entity_dim, self.relation_dim)
tail = tail.reshape(num_chunks, -1, 1, self.entity_dim)
num_rels = projection.shape[1]
num_nnodes = tail.shape[1]
tails = []
for i in range(num_chunks):
tail_negs = []
for j in range(num_nnodes):
tail_neg = tail[i][j]
tail_neg = tail_neg.reshape(1, 1, self.entity_dim)
tail_neg = nd.broadcast_axis(tail_neg, axis=0, size=num_rels)
tail_neg = nd.batch_dot(tail_neg, projection[i])
tail_neg = tail_neg.squeeze(axis=1)
tail_negs.append(tail_neg)
tail_negs = nd.stack(*tail_negs, axis=1)
tails.append(tail_negs)
tail = nd.stack(*tails)
return head, tail
return fn
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
def reset_parameters(self):
self.projection_emb.init(1.0)
def update(self, gpu_id=-1):
self.projection_emb.update(gpu_id)
def save(self, path, name):
self.projection_emb.save(path, name+'projection')
def load(self, path, name):
self.projection_emb.load(path, name+'projection')
def prepare_local_emb(self, projection_emb):
self.global_projection_emb = self.projection_emb
self.projection_emb = projection_emb
def writeback_local_emb(self, idx):
self.global_projection_emb.emb[idx] = self.projection_emb.emb.as_in_context(mx.cpu())[idx]
def load_local_emb(self, projection_emb):
context = projection_emb.emb.context
projection_emb.emb = self.projection_emb.emb.as_in_context(context)
self.projection_emb = projection_emb
def create_neg(self, neg_head):
gamma = self.gamma
if neg_head:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
relations = relations.reshape(num_chunks, -1, self.relation_dim)
tails = tails - relations
tails = tails.reshape(num_chunks, -1, 1, self.relation_dim)
score = heads - tails
return gamma - nd.norm(score, ord=1, axis=-1)
return fn
else:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
relations = relations.reshape(num_chunks, -1, self.relation_dim)
heads = heads - relations
heads = heads.reshape(num_chunks, -1, 1, self.relation_dim)
score = heads - tails
return gamma - nd.norm(score, ord=1, axis=-1)
return fn
class DistMultScore(nn.Block):
"""DistMult score function
Paper link: https://arxiv.org/abs/1412.6575
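
    The score is the tri-linear product sum(head * rel * tail) taken over the
    hidden dimension.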
"""
def __init__(self):
super(DistMultScore, self).__init__()
def edge_func(self, edges):
head = edges.src['emb']
tail = edges.dst['emb']
rel = edges.data['emb']
score = head * rel * tail
        # TODO(jin): check whether a minus sign is needed and whether gamma should be applied here
return {'score': nd.sum(score, axis=-1)}
def prepare(self, g, gpu_id, trace=False):
pass
def create_neg_prepare(self, neg_head):
def fn(rel_id, num_chunks, head, tail, gpu_id, trace=False):
return head, tail
return fn
def update(self, gpu_id=-1):
pass
def reset_parameters(self):
pass
def save(self, path, name):
pass
def load(self, path, name):
pass
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
def create_neg(self, neg_head):
if neg_head:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
heads = heads.reshape(num_chunks, neg_sample_size, hidden_dim)
heads = nd.transpose(heads, axes=(0, 2, 1))
tmp = (tails * relations).reshape(num_chunks, chunk_size, hidden_dim)
return nd.linalg_gemm2(tmp, heads)
return fn
else:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
tails = tails.reshape(num_chunks, neg_sample_size, hidden_dim)
tails = nd.transpose(tails, axes=(0, 2, 1))
tmp = (heads * relations).reshape(num_chunks, chunk_size, hidden_dim)
return nd.linalg_gemm2(tmp, tails)
return fn
class ComplExScore(nn.Block):
"""ComplEx score function
Paper link: https://arxiv.org/abs/1606.06357
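
    Embeddings are split into real and imaginary halves; the score is
    Re(< head, rel, conj(tail) >), i.e. the real part of the complex
    tri-linear product.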
"""
def __init__(self):
super(ComplExScore, self).__init__()
def edge_func(self, edges):
real_head, img_head = nd.split(edges.src['emb'], num_outputs=2, axis=-1)
real_tail, img_tail = nd.split(edges.dst['emb'], num_outputs=2, axis=-1)
real_rel, img_rel = nd.split(edges.data['emb'], num_outputs=2, axis=-1)
score = real_head * real_tail * real_rel \
+ img_head * img_tail * real_rel \
+ real_head * img_tail * img_rel \
- img_head * real_tail * img_rel
        # TODO(jin): check whether a minus sign is needed and whether gamma should be applied here
return {'score': nd.sum(score, -1)}
def prepare(self, g, gpu_id, trace=False):
pass
def create_neg_prepare(self, neg_head):
def fn(rel_id, num_chunks, head, tail, gpu_id, trace=False):
return head, tail
return fn
def update(self, gpu_id=-1):
pass
def reset_parameters(self):
pass
def save(self, path, name):
pass
def load(self, path, name):
pass
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
def create_neg(self, neg_head):
if neg_head:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
emb_real, emb_img = nd.split(tails, num_outputs=2, axis=-1)
rel_real, rel_img = nd.split(relations, num_outputs=2, axis=-1)
real = emb_real * rel_real + emb_img * rel_img
img = -emb_real * rel_img + emb_img * rel_real
emb_complex = nd.concat(real, img, dim=-1)
tmp = emb_complex.reshape(num_chunks, chunk_size, hidden_dim)
heads = heads.reshape(num_chunks, neg_sample_size, hidden_dim)
heads = nd.transpose(heads, axes=(0, 2, 1))
return nd.linalg_gemm2(tmp, heads)
return fn
else:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
emb_real, emb_img = nd.split(heads, num_outputs=2, axis=-1)
rel_real, rel_img = nd.split(relations, num_outputs=2, axis=-1)
real = emb_real * rel_real - emb_img * rel_img
img = emb_real * rel_img + emb_img * rel_real
emb_complex = nd.concat(real, img, dim=-1)
tmp = emb_complex.reshape(num_chunks, chunk_size, hidden_dim)
tails = tails.reshape(num_chunks, neg_sample_size, hidden_dim)
tails = nd.transpose(tails, axes=(0, 2, 1))
return nd.linalg_gemm2(tmp, tails)
return fn
class RESCALScore(nn.Block):
"""RESCAL score function
Paper link: http://www.icml-2011.org/papers/438_icmlpaper.pdf
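
    The relation embedding is reshaped into a matrix M_r of shape
    (relation_dim, entity_dim); the score is head^T (M_r tail).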
"""
def __init__(self, relation_dim, entity_dim):
super(RESCALScore, self).__init__()
self.relation_dim = relation_dim
self.entity_dim = entity_dim
def edge_func(self, edges):
head = edges.src['emb']
tail = edges.dst['emb'].expand_dims(2)
rel = edges.data['emb']
rel = rel.reshape(-1, self.relation_dim, self.entity_dim)
score = head * mx.nd.batch_dot(rel, tail).squeeze()
        # TODO: check whether self.gamma should be applied here
        return {'score': mx.nd.sum(score, -1)}
        # Margin-based alternative (PyTorch-style), left for reference:
        # return {'score': self.gamma - th.norm(score, p=1, dim=-1)}
def prepare(self, g, gpu_id, trace=False):
pass
def create_neg_prepare(self, neg_head):
def fn(rel_id, num_chunks, head, tail, gpu_id, trace=False):
return head, tail
return fn
def update(self, gpu_id=-1):
pass
def reset_parameters(self):
pass
def save(self, path, name):
pass
def load(self, path, name):
pass
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
def create_neg(self, neg_head):
if neg_head:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
heads = heads.reshape(num_chunks, neg_sample_size, hidden_dim)
heads = mx.nd.transpose(heads, axes=(0,2,1))
tails = tails.expand_dims(2)
relations = relations.reshape(-1, self.relation_dim, self.entity_dim)
tmp = mx.nd.batch_dot(relations, tails).squeeze()
tmp = tmp.reshape(num_chunks, chunk_size, hidden_dim)
return nd.linalg_gemm2(tmp, heads)
return fn
else:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
tails = tails.reshape(num_chunks, neg_sample_size, hidden_dim)
tails = mx.nd.transpose(tails, axes=(0,2,1))
heads = heads.expand_dims(2)
relations = relations.reshape(-1, self.relation_dim, self.entity_dim)
tmp = mx.nd.batch_dot(relations, heads).squeeze()
tmp = tmp.reshape(num_chunks, chunk_size, hidden_dim)
return nd.linalg_gemm2(tmp, tails)
return fn
class RotatEScore(nn.Block):
"""RotatE score function
Paper link: https://arxiv.org/abs/1902.10197
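
    Relation embeddings store rotation phases (the raw embedding divided by
    emb_init / pi); each complex dimension of the head is rotated by
    exp(i * phase) and the score is gamma - sum_k |rot(head)_k - tail_k|.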
"""
def __init__(self, gamma, emb_init, eps=1e-10):
super(RotatEScore, self).__init__()
self.gamma = gamma
self.emb_init = emb_init
self.eps = eps
def edge_func(self, edges):
real_head, img_head = nd.split(edges.src['emb'], num_outputs=2, axis=-1)
real_tail, img_tail = nd.split(edges.dst['emb'], num_outputs=2, axis=-1)
phase_rel = edges.data['emb'] / (self.emb_init / np.pi)
re_rel, im_rel = nd.cos(phase_rel), nd.sin(phase_rel)
real_score = real_head * re_rel - img_head * im_rel
img_score = real_head * im_rel + img_head * re_rel
real_score = real_score - real_tail
img_score = img_score - img_tail
        # per-element sqrt(x*x + eps), then summed over the hidden dimension
score = mx.nd.sqrt(real_score * real_score + img_score * img_score + self.eps).sum(-1)
return {'score': self.gamma - score}
def prepare(self, g, gpu_id, trace=False):
pass
def create_neg_prepare(self, neg_head):
def fn(rel_id, num_chunks, head, tail, gpu_id, trace=False):
return head, tail
return fn
def update(self, gpu_id=-1):
pass
def reset_parameters(self):
pass
def save(self, path, name):
pass
def load(self, path, name):
pass
def forward(self, g):
g.apply_edges(lambda edges: self.edge_func(edges))
def create_neg(self, neg_head):
gamma = self.gamma
emb_init = self.emb_init
eps = self.eps
if neg_head:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
emb_real, emb_img = nd.split(tails, num_outputs=2, axis=-1)
phase_rel = relations / (emb_init / np.pi)
rel_real, rel_img = nd.cos(phase_rel), nd.sin(phase_rel)
real = emb_real * rel_real + emb_img * rel_img
img = -emb_real * rel_img + emb_img * rel_real
emb_complex = nd.concat(real, img, dim=-1)
tmp = emb_complex.reshape(num_chunks, chunk_size, 1, hidden_dim)
heads = heads.reshape(num_chunks, 1, neg_sample_size, hidden_dim)
score = tmp - heads
score_real, score_img = nd.split(score, num_outputs=2, axis=-1)
score = mx.nd.sqrt(score_real * score_real + score_img * score_img + self.eps).sum(-1)
return gamma - score
return fn
else:
def fn(heads, relations, tails, num_chunks, chunk_size, neg_sample_size):
hidden_dim = heads.shape[1]
emb_real, emb_img = nd.split(heads, num_outputs=2, axis=-1)
phase_rel = relations / (emb_init / np.pi)
rel_real, rel_img = nd.cos(phase_rel), nd.sin(phase_rel)
real = emb_real * rel_real - emb_img * rel_img
img = emb_real * rel_img + emb_img * rel_real
emb_complex = nd.concat(real, img, dim=-1)
tmp = emb_complex.reshape(num_chunks, chunk_size, 1, hidden_dim)
tails = tails.reshape(num_chunks, 1, neg_sample_size, hidden_dim)
score = tmp - tails
score_real, score_img = nd.split(score, num_outputs=2, axis=-1)
score = mx.nd.sqrt(score_real * score_real + score_img * score_img + self.eps).sum(-1)
return gamma - score
return fn
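

# Minimal self-check sketch: compare the gemm-based batched_l2_dist against a
# naive broadcast computation on small random inputs (the shapes below are
# illustrative assumptions only).
if __name__ == '__main__':
    a = nd.random.normal(shape=(2, 3, 4))   # (num_chunks, chunk_size, hidden_dim)
    b = nd.random.normal(shape=(2, 5, 4))   # (num_chunks, neg_sample_size, hidden_dim)
    fast = batched_l2_dist(a, b)
    naive = nd.norm(nd.expand_dims(a, axis=-2) - nd.expand_dims(b, axis=-3), axis=-1)
    assert np.allclose(fast.asnumpy(), naive.asnumpy(), rtol=1e-4, atol=1e-4)
    print('batched_l2_dist OK, output shape:', fast.shape)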