"...res/git@developer.sourcefind.cn:OpenDAS/llama.cpp.git" did not exist on "4cc1a6143387f41e2466536abcd6a2620b63a35b"
Unverified Commit 56ffb650 authored by peizhou001's avatar peizhou001 Committed by GitHub
Browse files

[API Deprecation] Deprecate contrib module (#5114)

# DGL - Knowledge Graph Embedding
**Note: DGL-KE is moved to [here](https://github.com/awslabs/dgl-ke). DGL-KE in this folder is deprecated.**
## Introduction
DGL-KE is a DGL-based package for computing node embeddings and relation embeddings of
knowledge graphs efficiently. This package is adapted from
[KnowledgeGraphEmbedding](https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding).
We enable fast and scalable training of knowledge graph embeddings,
while still keeping the package as extensible as
[KnowledgeGraphEmbedding](https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding).
On a single machine, training takes only a few minutes on medium-size knowledge graphs
such as FB15k and wn18, and a couple of hours on Freebase, which has hundreds of millions of edges.
DGL-KE includes the following knowledge graph embedding models:
- TransE (TransE_l1 with L1 distance and TransE_l2 with L2 distance)
- DistMult
- ComplEx
- RESCAL
- TransR
- RotatE
Other popular models will be added in the future. For reference, the score functions of the built-in models are summarized below.
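These follow the original papers (M_r denotes a relation-specific projection matrix, diag(r) a diagonal matrix built from r, and ∘ element-wise complex multiplication); the exact margin and loss handling in this package may differ:
```
TransE_l1/l2: f(h, r, t) = -||h + r - t||_{1 or 2}
DistMult:     f(h, r, t) = h^T diag(r) t
ComplEx:      f(h, r, t) = Re(sum_i h_i r_i conj(t_i))
RESCAL:       f(h, r, t) = h^T M_r t
TransR:       f(h, r, t) = -||M_r h + r - M_r t||_2^2
RotatE:       f(h, r, t) = -||h ∘ r - t||, with |r_i| = 1
```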
DGL-KE supports multiple training modes:
- CPU training
- GPU training
- Joint CPU & GPU training
- Multiprocessing training on CPUs
For joint CPU & GPU training, node embeddings are stored on CPU and mini-batches are trained on GPU. This mode is designed for training KGE models on large knowledge graphs.
For multiprocessing training, each process trains mini-batches independently and uses shared memory for communication between processes. This mode is designed for training KGE models on large knowledge graphs with many CPU cores.
We will support multi-GPU training and distributed training in the near future.
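To illustrate the joint CPU & GPU mode, below is a minimal PyTorch sketch of the general pattern, not DGL-KE's actual implementation (`compute_loss` stands in for a model-specific KGE loss):
```python
import torch

# Sketch of joint CPU & GPU training: the full embedding table lives in CPU
# memory and only the rows touched by a mini-batch travel to the GPU.
entity_emb = torch.randn(1_000_000, 400)  # entity embeddings kept on CPU
device = torch.device('cuda:0')

def train_step(row_ids, compute_loss, lr=0.08):
    # row_ids: 1-D LongTensor of the entity ids used by this mini-batch.
    batch = entity_emb[row_ids].to(device).requires_grad_()  # gather + upload
    loss = compute_loss(batch)  # model-specific loss computed on the GPU
    loss.backward()
    # Write the sparse update back into the CPU-resident embedding table.
    entity_emb.index_add_(0, row_ids, -lr * batch.grad.cpu())
```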
## Requirements
The package can run with both PyTorch and MXNet. For PyTorch, it works with PyTorch v1.2 or newer.
For MXNet, it works with MXNet 1.5 or newer.
## Built-in Datasets
DGL-KE provides five built-in knowledge graphs:
| Dataset | #nodes | #edges | #relations |
|---------|--------|--------|------------|
| [FB15k](https://data.dgl.ai/dataset/FB15k.zip) | 14951 | 592213 | 1345 |
| [FB15k-237](https://data.dgl.ai/dataset/FB15k-237.zip) | 14541 | 310116 | 237 |
| [wn18](https://data.dgl.ai/dataset/wn18.zip) | 40943 | 151442 | 18 |
| [wn18rr](https://data.dgl.ai/dataset/wn18rr.zip) | 40943 | 93003 | 11 |
| [Freebase](https://data.dgl.ai/dataset/Freebase.zip) | 86054151 | 338586276 | 14824 |
Users can specify one of the datasets with `--dataset` in `train.py` and `eval.py`.
## Performance
The 1-GPU speed is measured with 8 CPU cores and one Nvidia V100 GPU (AWS p3.2xlarge).
The 8-GPU speed is measured with 64 CPU cores and eight Nvidia V100 GPUs (AWS p3.16xlarge).
The speed on FB15k (1 GPU)
| Models | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|---------|-----------|-----------|----------|---------|--------|--------|--------|
|MAX_STEPS| 48000 | 32000 | 40000 | 100000 | 32000 | 32000 | 20000 |
|TIME | 370s | 270s | 312s | 282s | 2095s | 1556s | 1861s |
The accuracy on FB15k
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 44.18 | 0.675 | 0.551 | 0.774 | 0.861 |
| TransE_l2 | 46.71 | 0.665 | 0.551 | 0.804 | 0.846 |
| DistMult | 61.04 | 0.725 | 0.625 | 0.837 | 0.883 |
| ComplEx | 64.59 | 0.785 | 0.718 | 0.835 | 0.889 |
| RESCAL | 122.3 | 0.669 | 0.598 | 0.711 | 0.793 |
| TransR | 59.86 | 0.676 | 0.591 | 0.735 | 0.814 |
| RotatE | 43.66 | 0.728 | 0.632 | 0.801 | 0.874 |
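For reference, MR, MRR, and HITS@k in these tables are standard ranking metrics derived from the rank of the true entity among the scored candidates. A minimal sketch of how they are computed from a list of 1-based ranks (an illustration, not the package's evaluation code):
```python
import numpy as np

def summarize_ranks(ranks):
    """Compute MR, MRR and HITS@k from the 1-based ranks of the true entities."""
    ranks = np.asarray(ranks, dtype=np.float64)
    metrics = {
        'MR': ranks.mean(),           # mean rank, lower is better
        'MRR': (1.0 / ranks).mean(),  # mean reciprocal rank, higher is better
    }
    for k in (1, 3, 10):
        metrics['HITS@{}'.format(k)] = float((ranks <= k).mean())
    return metrics

print(summarize_ranks([1, 2, 5, 40]))
```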
The speed on FB15k (8 GPU)
| Models | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|---------|-----------|-----------|----------|---------|--------|--------|--------|
|MAX_STEPS| 6000 | 4000 | 5000 | 4000 | 4000 | 4000 | 2500 |
|TIME | 88.93s | 62.99s | 72.74s | 68.37s | 245.9s | 203.9s | 126.7s |
The accuracy on FB15k
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 44.25 | 0.672 | 0.547 | 0.774 | 0.860 |
| TransE_l2 | 46.13 | 0.658 | 0.539 | 0.748 | 0.845 |
| DistMult | 61.72 | 0.723 | 0.626 | 0.798 | 0.881 |
| ComplEx | 65.84 | 0.754 | 0.676 | 0.813 | 0.880 |
| RESCAL | 135.6 | 0.652 | 0.580 | 0.693 | 0.779 |
| TransR | 65.27 | 0.676 | 0.591 | 0.736 | 0.811 |
| RotatE | 49.59 | 0.683 | 0.581 | 0.759 | 0.848 |
In comparison, GraphVite takes 14 minutes (840 seconds) with 4 GPUs to train TransE on FB15k, while DGL-KE with 8 GPUs finishes in about 89 seconds, i.e., roughly 9.5X as fast. More performance information on GraphVite can be found [here](https://github.com/DeepGraphLearning/graphvite).
The speed on wn18 (1 GPU)
| Models | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|---------|-----------|-----------|----------|---------|--------|--------|--------|
|MAX_STEPS| 32000 | 32000 | 20000 | 20000 | 20000 | 30000 | 24000 |
|TIME | 531.5s | 406.6s | 284.1s | 282.3s | 443.6s | 766.2s | 829.4s |
The accuracy on wn18
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 318.4 | 0.764 | 0.602 | 0.929 | 0.949 |
| TransE_l2 | 206.2 | 0.561 | 0.306 | 0.800 | 0.944 |
| DistMult | 486.0 | 0.818 | 0.711 | 0.921 | 0.948 |
| ComplEx | 268.6 | 0.933 | 0.916 | 0.949 | 0.961 |
| RESCAL | 536.6 | 0.848 | 0.790 | 0.900 | 0.927 |
| TransR | 452.4 | 0.620 | 0.461 | 0.758 | 0.856 |
| RotatE | 487.9 | 0.944 | 0.940 | 0.947 | 0.952 |
The speed on wn18 (8 GPU)
| Models | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|---------|-----------|-----------|----------|---------|--------|--------|--------|
|MAX_STEPS| 4000 | 4000 | 2500 | 2500 | 2500 | 2500 | 3000 |
|TIME | 119.3s | 81.1s | 76.0s | 58.0s | 594.1s | 1168s | 139.8s |
The accuracy on wn18
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 360.3 | 0.745 | 0.562 | 0.930 | 0.951 |
| TransE_l2 | 193.8 | 0.557 | 0.301 | 0.799 | 0.942 |
| DistMult | 499.9 | 0.807 | 0.692 | 0.917 | 0.945 |
| ComplEx | 476.7 | 0.935 | 0.926 | 0.943 | 0.949 |
| RESCAL | 618.8 | 0.848 | 0.791 | 0.897 | 0.927 |
| TransR | 513.1 | 0.659 | 0.491 | 0.821 | 0.871 |
| RotatE | 466.2 | 0.944 | 0.940 | 0.945 | 0.951 |
The speed on Freebase (8 GPU)
| Models | TransE_l2 | DistMult | ComplEx | TransR | RotatE |
|---------|-----------|----------|---------|--------|--------|
|MAX_STEPS| 320000 | 300000 | 360000 | 300000 | 300000 |
|TIME | 7908s | 7425s | 8946s | 16816s | 12817s |
The accuracy on Freebase (evaluated with 1000 negative edges sampled for each positive edge).
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|--------|-------|--------|--------|---------|
| TransE_l2 | 22.4 | 0.756 | 0.688 | 0.800 | 0.882 |
| DistMult | 45.4 | 0.833 | 0.812 | 0.843 | 0.872 |
| ComplEx | 48.0 | 0.830 | 0.812 | 0.838 | 0.864 |
| TransR | 51.2 | 0.697 | 0.656 | 0.716 | 0.771 |
| RotatE | 93.3 | 0.770 | 0.749 | 0.780 | 0.805 |
The speed on Freebase (48 CPU)
This is measured with 48 CPU cores on an AWS r5dn.24xlarge instance.
| Models | TransE_l2 | DistMult | ComplEx |
|---------|-----------|----------|---------|
|MAX_STEPS| 50000 | 50000 | 50000 |
|TIME | 7002s | 6340s | 8133s |
The accuracy on Freebase (evaluated with 1000 negative edges sampled for each positive edge).
| Models | MR | MRR | HITS@1 | HITS@3 | HITS@10 |
|-----------|--------|-------|--------|--------|---------|
| TransE_l2 | 30.8 | 0.814 | 0.764 | 0.848 | 0.902 |
| DistMult | 45.1 | 0.834 | 0.815 | 0.843 | 0.871 |
| ComplEx | 44.9 | 0.837 | 0.819 | 0.845 | 0.870 |
The configuration for reproducing the performance results can be found [here](https://github.com/dmlc/dgl/blob/master/apps/kg/config/best_config.sh).
## Usage
DGL-KE doesn't require installation. The package contains two scripts `train.py` and `eval.py`.
* `train.py` trains knowledge graph embeddings and outputs the trained node embeddings
and relation embeddings.
* `eval.py` reads the pre-trained node embeddings and relation embeddings and evaluates
how accurately the model predicts the tail node given (head, rel, ?) and the head node
given (?, rel, tail).
### Input formats:
DGL-KE supports two input formats for user-defined knowledge graph datasets (an illustrative example follows the two format descriptions):
- raw_udd_[h|r|t]: raw user-defined dataset. In this format, users only need to provide the triples; the dataloader generates and manages the id mappings itself, writing two files: entities.tsv for the entity id mapping and relations.tsv for the relation id mapping. The order of head, relation and tail in a triple is described by [h|r|t]; for example, raw_udd_trh means the triples are stored in the order tail, relation, head. The dataset should contain three files:
  - *train* stores the triples of the training set. Each triple, e.g., [src_name, rel_name, dst_name], follows the order specified in [h|r|t].
  - *valid* stores the triples of the validation set. Each triple, e.g., [src_name, rel_name, dst_name], follows the order specified in [h|r|t].
  - *test* stores the triples of the test set. Each triple, e.g., [src_name, rel_name, dst_name], follows the order specified in [h|r|t].
- udd_[h|r|t]: user-defined dataset. In this format, users provide the id mappings for entities and relations themselves. The order of head, relation and tail in a triple is described by [h|r|t]; for example, udd_trh means the triples are stored in the order tail, relation, head. The dataset should contain five files:
  - *entities* stores the mapping between entity name and entity Id.
  - *relations* stores the mapping between relation name and relation Id.
  - *train* stores the triples of the training set. Each triple, e.g., [src_id, rel_id, dst_id], follows the order specified in [h|r|t].
  - *valid* stores the triples of the validation set. Each triple, e.g., [src_id, rel_id, dst_id], follows the order specified in [h|r|t].
  - *test* stores the triples of the test set. Each triple, e.g., [src_id, rel_id, dst_id], follows the order specified in [h|r|t].
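As an illustration, a hypothetical raw_udd_hrt dataset could ship a tab-separated *train* file like the one below (the names are invented for this example, and the exact column layout of the generated mapping files is an assumption here):
```
# train: head <tab> relation <tab> tail
Anna    lives_in    Berlin
Berlin  part_of     Germany

# entities.tsv, generated by the dataloader: id <tab> entity name
0   Anna
1   Berlin
2   Germany
```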
### Output formats:
To save the trained embeddings, users have to provide the path with `--save_emb` when running
`train.py`. The saved embeddings are stored as numpy ndarrays.
* The node embedding is saved as `XXX_YYY_entity.npy`.
* The relation embedding is saved as `XXX_YYY_relation.npy`.
`XXX` is the dataset name and `YYY` is the model name.
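For instance, the saved arrays can be loaded back with NumPy and used to score triples directly. The sketch below assumes a model trained with `--model DistMult --dataset FB15k --save_emb DistMult_FB15k_emb`; the DistMult score is the standard sum of element-wise products, and this is an illustration, not the package's evaluation code:
```python
import numpy as np

# File names follow the XXX_YYY_{entity,relation}.npy convention described above.
entity_emb = np.load('DistMult_FB15k_emb/FB15k_DistMult_entity.npy')
relation_emb = np.load('DistMult_FB15k_emb/FB15k_DistMult_relation.npy')

def distmult_score(head_id, rel_id, tail_id):
    """DistMult score sum_i h_i * r_i * t_i; higher means more plausible."""
    h, r, t = entity_emb[head_id], relation_emb[rel_id], entity_emb[tail_id]
    return float(np.sum(h * r * t))

print(distmult_score(0, 0, 1))
```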
### Command line parameters
Here are some examples of using the training script.
Train KGE models with GPU.
```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 \
--hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 --valid --test -adv \
--gpu 0 --max_step 40000
```
Train KGE models with mixed CPU & GPU training on multiple GPUs.
```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 \
--hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 --valid --test -adv \
--max_step 5000 --mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--soft_rel_part --force_sync_interval 1000
```
Train embeddings and evaluate them later.
```bash
python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 \
--hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 --valid --test -adv \
--gpu 0 --max_step 40000 --save_emb DistMult_FB15k_emb
python3 eval.py --model_name DistMult --dataset FB15k --hidden_dim 400 \
--gamma 143.0 --batch_size 16 --gpu 0 --model_path DistMult_FB15k_emb/
```
Train embeddings with multiprocessing. This currently doesn't work with MXNet.
```bash
python3 train.py --model TransE_l2 --dataset Freebase --batch_size 1000 \
--neg_sample_size 200 --hidden_dim 400 --gamma 10 --lr 0.1 --max_step 50000 \
--log_interval 100 --batch_size_eval 1000 --neg_sample_size_eval 1000 --test \
-adv --regularization_coef 1e-9 --num_thread 1 --num_proc 48
```
```bash
# To reproduce the results reported in this README, run the models with the following commands:
# for FB15k
# DistMult 1GPU
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 \
--valid --test -adv --gpu 0 --max_step 40000
# DistMult 8GPU
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 \
--valid --test -adv --max_step 5000 --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --async_update --soft_rel_part --force_sync_interval 1000
# ComplEx 1GPU
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset FB15k --batch_size 1024 \
--neg_sample_size 1024 --hidden_dim 400 --gamma 143.0 --lr 0.1 \
--regularization_coef 2.00E-06 --batch_size_eval 16 --valid --test -adv --gpu 0 \
--max_step 32000
# ComplEx 8GPU
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset FB15k --batch_size 1024 \
--neg_sample_size 1024 --hidden_dim 400 --gamma 143.0 --lr 0.1 \
--regularization_coef 2.00E-06 --batch_size_eval 16 --valid --test -adv \
--max_step 4000 --mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--soft_rel_part --force_sync_interval 1000
# TransE_l1 1GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l1 --dataset FB15k --batch_size 1024 \
--neg_sample_size 64 --regularization_coef 1e-07 --hidden_dim 400 --gamma 16.0 \
--lr 0.01 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 48000
# TransE_l1 8GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l1 --dataset FB15k --batch_size 1024 \
--neg_sample_size 64 --regularization_coef 1e-07 --hidden_dim 400 --gamma 16.0 \
--lr 0.01 --batch_size_eval 16 --valid --test -adv --max_step 6000 --mix_cpu_gpu \
--num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update --soft_rel_part \
--force_sync_interval 1000
# TransE_l2 1GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --regularization_coef=1e-9 --hidden_dim 400 --gamma 19.9 \
--lr 0.25 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 32000
# TransE_l2 8GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --regularization_coef=1e-9 --hidden_dim 400 --gamma 19.9 \
--lr 0.25 --batch_size_eval 16 --valid --test -adv --max_step 4000 \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update --soft_rel_part \
--force_sync_interval 1000
# RESCAL 1GPU
DGLBACKEND=pytorch python3 train.py --model RESCAL --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 500 --gamma 24.0 --lr 0.03 --batch_size_eval 16 \
--gpu 0 --valid --test -adv --max_step 30000
# RESCAL 8GPU
DGLBACKEND=pytorch python3 train.py --model RESCAL --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 500 --gamma 24.0 --lr 0.03 --batch_size_eval 16 \
--valid --test -adv --max_step 4000 --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --async_update --soft_rel_part --force_sync_interval 1000
# TransR 1GPU
DGLBACKEND=pytorch python3 train.py --model TransR --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 5e-8 --hidden_dim 200 --gamma 8.0 \
--lr 0.015 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 32000
# TransR 8GPU
DGLBACKEND=pytorch python3 train.py --model TransR --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 5e-8 --hidden_dim 200 --gamma 8.0 \
--lr 0.015 --batch_size_eval 16 --valid --test -adv --max_step 4000 --mix_cpu_gpu \
--num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update --soft_rel_part \
--force_sync_interval 1000
# RotatE 1GPU
DGLBACKEND=pytorch python3 train.py --model RotatE --dataset FB15k --batch_size 2048 \
--neg_sample_size 256 --regularization_coef 1e-07 --hidden_dim 200 --gamma 12.0 \
--lr 0.009 --batch_size_eval 16 --valid --test -adv -de --max_step 20000 \
--neg_deg_sample --gpu 0
# RotatE 8GPU
DGLBACKEND=pytorch python3 train.py --model RotatE --dataset FB15k --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 1e-07 --hidden_dim 200 --gamma 12.0 \
--lr 0.009 --batch_size_eval 16 --valid --test -adv -de --max_step 2500 \
--neg_deg_sample --mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--soft_rel_part --force_sync_interval 1000
# for wn18
# DistMult 1GPU
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset wn18 --batch_size 2048 \
--neg_sample_size 128 --regularization_coef 1e-06 --hidden_dim 512 --gamma 20.0 \
--lr 0.14 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 20000
# DistMult 8GPU
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset wn18 --batch_size 2048 \
--neg_sample_size 128 --regularization_coef 1e-06 --hidden_dim 512 --gamma 20.0 \
--lr 0.14 --batch_size_eval 16 --valid --test -adv --max_step 2500 \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--force_sync_interval 1000
# ComplEx 1GPU
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset wn18 --batch_size 1024 \
--neg_sample_size 1024 --regularization_coef 0.00001 --hidden_dim 512 --gamma 200.0 \
--lr 0.1 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 20000
# ComplEx 8GPU
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset wn18 --batch_size 1024 \
--neg_sample_size 1024 --regularization_coef 0.00001 --hidden_dim 512 --gamma 200.0 \
--lr 0.1 --batch_size_eval 16 --valid --test -adv --max_step 2500 \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--force_sync_interval 1000
# TransE_l1 1GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l1 --dataset wn18 --batch_size 2048 \
--neg_sample_size 128 --regularization_coef 2e-07 --hidden_dim 512 --gamma 12.0 \
--lr 0.007 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 32000
# TransE_l1 8GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l1 --dataset wn18 --batch_size 2048 \
--neg_sample_size 128 --regularization_coef 2e-07 --hidden_dim 512 --gamma 12.0 \
--lr 0.007 --batch_size_eval 16 --valid --test -adv --max_step 4000 \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--force_sync_interval 1000
# TransE_l2 1GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 0.0000001 --hidden_dim 512 --gamma 6.0 \
--lr 0.1 --batch_size_eval 16 --valid --test -adv --gpu 0 --max_step 32000
# TransE_l2 8GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 0.0000001 --hidden_dim 512 --gamma 6.0 \
--lr 0.1 --batch_size_eval 16 --valid --test -adv --max_step 4000 \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--force_sync_interval 1000
# RESCAL 1GPU
DGLBACKEND=pytorch python3 train.py --model RESCAL --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 250 --gamma 24.0 --lr 0.03 --batch_size_eval 16 \
--valid --test -adv --gpu 0 --max_step 20000
# RESCAL 8GPU
DGLBACKEND=pytorch python3 train.py --model RESCAL --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 250 --gamma 24.0 --lr 0.03 --batch_size_eval 16 \
--valid --test -adv --max_step 2500 --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --async_update --force_sync_interval 1000 --soft_rel_part
# TransR 1GPU
DGLBACKEND=pytorch python3 train.py --model TransR --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 250 --gamma 16.0 --lr 0.1 --batch_size_eval 16 \
--valid --test -adv --gpu 0 --max_step 30000
# TransR 8GPU
DGLBACKEND=pytorch python3 train.py --model TransR --dataset wn18 --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 250 --gamma 16.0 --lr 0.1 --batch_size_eval 16 \
--valid --test -adv --max_step 2500 --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --async_update --force_sync_interval 1000 --soft_rel_part
# RotatE 1GPU
DGLBACKEND=pytorch python3 train.py --model RotatE --dataset wn18 --batch_size 2048 \
--neg_sample_size 64 --regularization_coef 2e-07 --hidden_dim 256 --gamma 9.0 \
--lr 0.0025 -de --batch_size_eval 16 --neg_deg_sample --valid --test -adv --gpu 0 \
--max_step 24000
# RotatE 8GPU
DGLBACKEND=pytorch python3 train.py --model RotatE --dataset wn18 --batch_size 2048 \
--neg_sample_size 64 --regularization_coef 2e-07 --hidden_dim 256 --gamma 9.0 \
--lr 0.0025 -de --batch_size_eval 16 --neg_deg_sample --valid --test -adv \
--max_step 3000 --mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
--force_sync_interval 1000
# for Freebase multi-process-cpu
# TransE_l2
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset Freebase --batch_size 1000 \
--neg_sample_size 200 --hidden_dim 400 --gamma 10 --lr 0.1 --max_step 50000 \
--log_interval 100 --batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv \
--regularization_coef 1e-9 --num_thread 1 --num_proc 48
# DistMult
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.08 --max_step 50000 \
--log_interval 100 --batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv \
--num_thread 1 --num_proc 48
# ComplEx
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.1 --max_step 50000 \
--log_interval 100 --batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv \
--num_thread 1 --num_proc 48
# Freebase multi-gpu
# TransE_l2 8GPU
DGLBACKEND=pytorch python3 train.py --model TransE_l2 --dataset Freebase --batch_size 1000 \
--neg_sample_size 200 --hidden_dim 400 --gamma 10 --lr 0.1 --regularization_coef 1e-9 \
--batch_size_eval 1000 --valid --test -adv --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --max_step 320000 --neg_sample_size_eval 1000 \
--eval_interval 100000 --log_interval 10000 --async_update --soft_rel_part \
--force_sync_interval 10000
# DistMult 8GPU
DGLBACKEND=pytorch python3 train.py --model DistMult --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 1000 \
--valid --test -adv --mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --max_step 300000 \
--neg_sample_size_eval 1000 --eval_interval 100000 --log_interval 10000 --async_update \
--soft_rel_part --force_sync_interval 10000
# ComplEx 8GPU
DGLBACKEND=pytorch python3 train.py --model ComplEx --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 --hidden_dim 400 --gamma 143 --lr 0.1 \
--regularization_coef 2.00E-06 --batch_size_eval 1000 --valid --test -adv \
--mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --max_step 360000 \
--neg_sample_size_eval 1000 --eval_interval 100000 --log_interval 10000 \
--async_update --soft_rel_part --force_sync_interval 10000
# TransR 8GPU
DGLBACKEND=pytorch python3 train.py --model TransR --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 --regularization_coef 5e-8 --hidden_dim 200 --gamma 8.0 \
--lr 0.015 --batch_size_eval 1000 --valid --test -adv --mix_cpu_gpu --num_proc 8 \
--gpu 0 1 2 3 4 5 6 7 --max_step 300000 --neg_sample_size_eval 1000 \
--eval_interval 100000 --log_interval 10000 --async_update --soft_rel_part \
--force_sync_interval 10000
# RotatE 8GPU
DGLBACKEND=pytorch python3 train.py --model RotatE --dataset Freebase --batch_size 1024 \
--neg_sample_size 256 -de --hidden_dim 200 --gamma 12.0 --lr 0.01 \
--regularization_coef 1e-7 --batch_size_eval 1000 --valid --test -adv --mix_cpu_gpu \
--num_proc 8 --gpu 0 1 2 3 4 5 6 7 --max_step 300000 --neg_sample_size_eval 1000 \
--eval_interval 100000 --log_interval 10000 --async_update --soft_rel_part \
--force_sync_interval 10000
```
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from .KGDataset import *
from .sampler import *
## Training Scripts for distributed training
1. Partition data
Partition FB15k:
```bash
./partition.sh ../data FB15k 4
```
Partition Freebase:
```bash
./partition.sh ../data Freebase 4
```
2. Modify `ip_config.txt` and copy dgl-ke to all the machines
3. Run
```bash
./launch.sh \
~/dgl/apps/kg/distributed \
./fb15k_transe_l2.sh \
ubuntu ~/mykey.pem
```
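Judging from the sample `ip_config.txt` in this directory, each line describes one machine as `<machine_ip> <base_port> <servers_per_machine>`, e.g.:
```
127.0.0.1 30050 8
```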
#!/bin/bash
##################################################################################
# This script runs the TransE_l2 model on the FB15k dataset in a distributed setting.
# You can change the hyper-parameters in this file, but DO NOT run this script manually.
##################################################################################
machine_id=$1
server_count=$2
machine_count=$3
# Delete the temp file
rm *-shape
##################################################################################
# Start kvserver
##################################################################################
SERVER_ID_LOW=$((machine_id*server_count))
SERVER_ID_HIGH=$(((machine_id+1)*server_count))
while [ $SERVER_ID_LOW -lt $SERVER_ID_HIGH ]
do
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvserver.py --model TransE_l2 --dataset FB15k \
--hidden_dim 400 --gamma 19.9 --lr 0.25 --total_client 64 --server_id $SERVER_ID_LOW &
let SERVER_ID_LOW+=1
done
##################################################################################
# Start kvclient
##################################################################################
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvclient.py --model TransE_l2 --dataset FB15k \
--batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 500 --log_interval 100 --num_thread 1 \
--batch_size_eval 16 --test -adv --regularization_coef 1e-9 --total_machine $machine_count --num_client 16
#!/bin/bash
##################################################################################
# This script runs the ComplEx model on the Freebase dataset in a distributed setting.
# You can change the hyper-parameters in this file, but DO NOT run this script manually.
##################################################################################
machine_id=$1
server_count=$2
machine_count=$3
# Delete the temp file
rm *-shape
##################################################################################
# Start kvserver
##################################################################################
SERVER_ID_LOW=$((machine_id*server_count))
SERVER_ID_HIGH=$(((machine_id+1)*server_count))
while [ $SERVER_ID_LOW -lt $SERVER_ID_HIGH ]
do
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvserver.py --model ComplEx --dataset Freebase \
--hidden_dim 400 --gamma 143.0 --lr 0.1 --total_client 160 --server_id $SERVER_ID_LOW &
let SERVER_ID_LOW+=1
done
##################################################################################
# Start kvclient
##################################################################################
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvclient.py --model ComplEx --dataset Freebase \
--batch_size 1024 --neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.1 --max_step 12500 --log_interval 100 \
--batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv --total_machine $machine_count --num_thread 1 --num_client 40
#!/bin/bash
##################################################################################
# This script runs the DistMult model on the Freebase dataset in a distributed setting.
# You can change the hyper-parameters in this file, but DO NOT run this script manually.
##################################################################################
machine_id=$1
server_count=$2
machine_count=$3
# Delete the temp file
rm *-shape
##################################################################################
# Start kvserver
##################################################################################
SERVER_ID_LOW=$((machine_id*server_count))
SERVER_ID_HIGH=$(((machine_id+1)*server_count))
while [ $SERVER_ID_LOW -lt $SERVER_ID_HIGH ]
do
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvserver.py --model DistMult --dataset Freebase \
--hidden_dim 400 --gamma 143.0 --lr 0.08 --total_client 160 --server_id $SERVER_ID_LOW &
let SERVER_ID_LOW+=1
done
##################################################################################
# Start kvclient
##################################################################################
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvclient.py --model DistMult --dataset Freebase \
--batch_size 1024 --neg_sample_size 256 --hidden_dim 400 --gamma 143.0 --lr 0.08 --max_step 12500 --log_interval 100 \
--batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv --total_machine $machine_count --num_thread 1 --num_client 40
#!/bin/bash
##################################################################################
# This script runs the TransE_l2 model on the Freebase dataset in a distributed setting.
# You can change the hyper-parameters in this file, but DO NOT run this script manually.
##################################################################################
machine_id=$1
server_count=$2
machine_count=$3
# Delete the temp file
rm *-shape
##################################################################################
# Start kvserver
##################################################################################
SERVER_ID_LOW=$((machine_id*server_count))
SERVER_ID_HIGH=$(((machine_id+1)*server_count))
while [ $SERVER_ID_LOW -lt $SERVER_ID_HIGH ]
do
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvserver.py --model TransE_l2 --dataset Freebase \
--hidden_dim 400 --gamma 10 --lr 0.1 --total_client 160 --server_id $SERVER_ID_LOW &
let SERVER_ID_LOW+=1
done
##################################################################################
# Start kvclient
##################################################################################
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 DGLBACKEND=pytorch python3 ../kvclient.py --model TransE_l2 --dataset Freebase \
--batch_size 1000 --neg_sample_size 200 --hidden_dim 400 --gamma 10 --lr 0.1 --max_step 12500 --log_interval 100 --num_thread 1 \
--batch_size_eval 1000 --neg_sample_size_eval 1000 --test -adv --regularization_coef 1e-9 --total_machine $machine_count --num_client 40
127.0.0.1 30050 8
127.0.0.1 30050 8
127.0.0.1 30050 8
127.0.0.1 30050 8
#!/bin/bash
##################################################################################
# User runs this script to launch distributed jobs on the cluster
##################################################################################
script_path=$1
script_file=$2
user_name=$3
ssh_key=$4
server_count=$(awk 'NR==1 {print $3}' ip_config.txt)
machine_count=$(awk 'END{print NR}' ip_config.txt)
# run command on remote machine
LINE_LOW=2
LINE_HIGH=$(awk 'END{print NR}' ip_config.txt)
let LINE_HIGH+=1
s_id=0
while [ $LINE_LOW -lt $LINE_HIGH ]
do
ip=$(awk 'NR=='$LINE_LOW' {print $1}' ip_config.txt)
let LINE_LOW+=1
let s_id+=1
if test -z "$ssh_key"
then
ssh $user_name@$ip 'cd '$script_path'; '$script_file' '$s_id' '$server_count' '$machine_count'' &
else
ssh -i $ssh_key $user_name@$ip 'cd '$script_path'; '$script_file' '$s_id' '$server_count' '$machine_count'' &
fi
done
# run command on local machine
$script_file 0 $server_count $machine_count
#!/bin/bash
##################################################################################
# User runs this script to partition a graph using METIS
##################################################################################
DATA_PATH=$1
DATA_SET=$2
K=$3
# partition graph
python3 ../partition.py --dataset $DATA_SET -k $K --data_path $DATA_PATH
# copy related file to partition
PART_ID=0
while [ $PART_ID -lt $K ]
do
cp $DATA_PATH/$DATA_SET/relation* $DATA_PATH/$DATA_SET/partition_$PART_ID
let PART_ID+=1
done
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from dataloader import EvalDataset, TrainDataset
from dataloader import get_dataset

import argparse
import os
import logging
import time
import pickle

from utils import get_compatible_batch_size

backend = os.environ.get('DGLBACKEND', 'pytorch')
if backend.lower() == 'mxnet':
    import multiprocessing as mp
    from train_mxnet import load_model_from_checkpoint
    from train_mxnet import test
else:
    import torch.multiprocessing as mp
    from train_pytorch import load_model_from_checkpoint
    from train_pytorch import test, test_mp

class ArgParser(argparse.ArgumentParser):
    def __init__(self):
        super(ArgParser, self).__init__()

        self.add_argument('--model_name', default='TransE',
                          choices=['TransE', 'TransE_l1', 'TransE_l2', 'TransR',
                                   'RESCAL', 'DistMult', 'ComplEx', 'RotatE'],
                          help='model to use')
        self.add_argument('--data_path', type=str, default='data',
                          help='root path of all dataset')
        self.add_argument('--dataset', type=str, default='FB15k',
                          help='dataset name, under data_path')
        self.add_argument('--format', type=str, default='built_in',
                          help='the format of the dataset, it can be built_in, '\
                               'raw_udd_{htr} and udd_{htr}')
        self.add_argument('--data_files', type=str, default=None, nargs='+',
                          help='a list of data files, e.g. entity relation train valid test')
        self.add_argument('--model_path', type=str, default='ckpts',
                          help='the place where models are saved')
        self.add_argument('--batch_size_eval', type=int, default=8,
                          help='batch size used for eval and test')
        self.add_argument('--neg_sample_size_eval', type=int, default=-1,
                          help='negative sampling size for testing')
        self.add_argument('--neg_deg_sample_eval', action='store_true',
                          help='negative sampling proportional to vertex degree for testing')
        self.add_argument('--hidden_dim', type=int, default=256,
                          help='hidden dim used by relation and entity')
        self.add_argument('-g', '--gamma', type=float, default=12.0,
                          help='margin value')
        self.add_argument('--eval_percent', type=float, default=1,
                          help='sample some percentage for evaluation.')
        self.add_argument('--no_eval_filter', action='store_true',
                          help='do not filter positive edges among negative edges for evaluation')
        self.add_argument('--gpu', type=int, default=[-1], nargs='+',
                          help='a list of active gpu ids, e.g. 0')
        self.add_argument('--mix_cpu_gpu', action='store_true',
                          help='mix CPU and GPU training')
        self.add_argument('-de', '--double_ent', action='store_true',
                          help='double entity dim for complex number')
        self.add_argument('-dr', '--double_rel', action='store_true',
                          help='double relation dim for complex number')
        self.add_argument('--num_proc', type=int, default=1,
                          help='number of processes used')
        self.add_argument('--num_thread', type=int, default=1,
                          help='number of threads used')

    def parse_args(self):
        args = super().parse_args()
        return args

def get_logger(args):
    if not os.path.exists(args.model_path):
        raise Exception('No existing model_path: ' + args.model_path)

    log_file = os.path.join(args.model_path, 'eval.log')

    logging.basicConfig(
        format='%(asctime)s %(levelname)-8s %(message)s',
        level=logging.INFO,
        datefmt='%Y-%m-%d %H:%M:%S',
        filename=log_file,
        filemode='w'
    )

    logger = logging.getLogger(__name__)
    print("Logs are being recorded at: {}".format(log_file))
    return logger

def main(args):
    args.eval_filter = not args.no_eval_filter
    if args.neg_deg_sample_eval:
        assert not args.eval_filter, "if negative sampling based on degree, we can't filter positive edges."

    # load dataset and samplers
    dataset = get_dataset(args.data_path, args.dataset, args.format, args.data_files)
    args.pickle_graph = False
    args.train = False
    args.valid = False
    args.test = True
    args.strict_rel_part = False
    args.soft_rel_part = False
    args.async_update = False

    logger = get_logger(args)
    # Here we want to use the regular negative sampler because we need to ensure that
    # all positive edges are excluded.
    eval_dataset = EvalDataset(dataset, args)

    if args.neg_sample_size_eval < 0:
        args.neg_sample_size_eval = args.neg_sample_size = eval_dataset.g.number_of_nodes()
    args.batch_size_eval = get_compatible_batch_size(args.batch_size_eval, args.neg_sample_size_eval)

    args.num_workers = 8 # fix num_workers to 8
    if args.num_proc > 1:
        test_sampler_tails = []
        test_sampler_heads = []
        for i in range(args.num_proc):
            test_sampler_head = eval_dataset.create_sampler('test', args.batch_size_eval,
                                                            args.neg_sample_size_eval,
                                                            args.neg_sample_size_eval,
                                                            args.eval_filter,
                                                            mode='chunk-head',
                                                            num_workers=args.num_workers,
                                                            rank=i, ranks=args.num_proc)
            test_sampler_tail = eval_dataset.create_sampler('test', args.batch_size_eval,
                                                            args.neg_sample_size_eval,
                                                            args.neg_sample_size_eval,
                                                            args.eval_filter,
                                                            mode='chunk-tail',
                                                            num_workers=args.num_workers,
                                                            rank=i, ranks=args.num_proc)
            test_sampler_heads.append(test_sampler_head)
            test_sampler_tails.append(test_sampler_tail)
    else:
        test_sampler_head = eval_dataset.create_sampler('test', args.batch_size_eval,
                                                        args.neg_sample_size_eval,
                                                        args.neg_sample_size_eval,
                                                        args.eval_filter,
                                                        mode='chunk-head',
                                                        num_workers=args.num_workers,
                                                        rank=0, ranks=1)
        test_sampler_tail = eval_dataset.create_sampler('test', args.batch_size_eval,
                                                        args.neg_sample_size_eval,
                                                        args.neg_sample_size_eval,
                                                        args.eval_filter,
                                                        mode='chunk-tail',
                                                        num_workers=args.num_workers,
                                                        rank=0, ranks=1)

    # load model
    n_entities = dataset.n_entities
    n_relations = dataset.n_relations
    ckpt_path = args.model_path
    model = load_model_from_checkpoint(logger, args, n_entities, n_relations, ckpt_path)

    if args.num_proc > 1:
        model.share_memory()

    # test
    args.step = 0
    args.max_step = 0
    start = time.time()
    if args.num_proc > 1:
        queue = mp.Queue(args.num_proc)
        procs = []
        for i in range(args.num_proc):
            proc = mp.Process(target=test_mp, args=(args,
                                                    model,
                                                    [test_sampler_heads[i], test_sampler_tails[i]],
                                                    i,
                                                    'Test',
                                                    queue))
            procs.append(proc)
            proc.start()

        total_metrics = {}
        metrics = {}
        logs = []
        for i in range(args.num_proc):
            log = queue.get()
            logs = logs + log

        for metric in logs[0].keys():
            metrics[metric] = sum([log[metric] for log in logs]) / len(logs)
        for k, v in metrics.items():
            print('Test average {} at [{}/{}]: {}'.format(k, args.step, args.max_step, v))

        for proc in procs:
            proc.join()
    else:
        test(args, model, [test_sampler_head, test_sampler_tail])
    print('Test takes {:.3f} seconds'.format(time.time() - start))

if __name__ == '__main__':
    args = ArgParser().parse_args()
    main(args)
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
import argparse
import time
import logging

import socket
if os.name != 'nt':
    import fcntl
    import struct

import torch.multiprocessing as mp
from train_pytorch import load_model, dist_train_test
from utils import get_compatible_batch_size
from train import get_logger
from dataloader import TrainDataset, NewBidirectionalOneShotIterator
from dataloader import get_dataset, get_partition_dataset

import dgl
import dgl.backend as F

WAIT_TIME = 10

class ArgParser(argparse.ArgumentParser):
    def __init__(self):
        super(ArgParser, self).__init__()

        self.add_argument('--model_name', default='TransE',
                          choices=['TransE', 'TransE_l1', 'TransE_l2', 'TransR',
                                   'RESCAL', 'DistMult', 'ComplEx', 'RotatE'],
                          help='model to use')
        self.add_argument('--data_path', type=str, default='../data',
                          help='root path of all dataset')
        self.add_argument('--dataset', type=str, default='FB15k',
                          help='dataset name, under data_path')
        self.add_argument('--format', type=str, default='built_in',
                          help='the format of the dataset, it can be built_in, '\
                               'raw_udd_{htr} and udd_{htr}')
        self.add_argument('--save_path', type=str, default='../ckpts',
                          help='place to save models and logs')
        self.add_argument('--save_emb', type=str, default=None,
                          help='save the embeddings in the specific location.')
        self.add_argument('--max_step', type=int, default=80000,
                          help='train xx steps')
        self.add_argument('--batch_size', type=int, default=1024,
                          help='batch size')
        self.add_argument('--batch_size_eval', type=int, default=8,
                          help='batch size used for eval and test')
        self.add_argument('--neg_sample_size', type=int, default=128,
                          help='negative sampling size')
        self.add_argument('--neg_deg_sample', action='store_true',
                          help='negative sampling proportional to vertex degree in the training')
        self.add_argument('--neg_deg_sample_eval', action='store_true',
                          help='negative sampling proportional to vertex degree in the evaluation')
        self.add_argument('--neg_sample_size_eval', type=int, default=-1,
                          help='negative sampling size for evaluation')
        self.add_argument('--hidden_dim', type=int, default=256,
                          help='hidden dim used by relation and entity')
        self.add_argument('--lr', type=float, default=0.0001,
                          help='learning rate')
        self.add_argument('-g', '--gamma', type=float, default=12.0,
                          help='margin value')
        self.add_argument('--no_eval_filter', action='store_true',
                          help='do not filter positive edges among negative edges for evaluation')
        self.add_argument('--gpu', type=int, default=[-1], nargs='+',
                          help='a list of active gpu ids, e.g. 0 1 2 4')
        self.add_argument('--mix_cpu_gpu', action='store_true',
                          help='mix CPU and GPU training')
        self.add_argument('-de', '--double_ent', action='store_true',
                          help='double entity dim for complex number')
        self.add_argument('-dr', '--double_rel', action='store_true',
                          help='double relation dim for complex number')
        self.add_argument('-log', '--log_interval', type=int, default=1000,
                          help='print training log after every x steps')
        self.add_argument('--eval_interval', type=int, default=10000,
                          help='do evaluation after every x steps')
        self.add_argument('--eval_percent', type=float, default=1,
                          help='sample some percentage for evaluation.')
        self.add_argument('-adv', '--neg_adversarial_sampling', action='store_true',
                          help='if use negative adversarial sampling')
        self.add_argument('-a', '--adversarial_temperature', default=1.0, type=float,
                          help='adversarial_temperature')
        self.add_argument('--valid', action='store_true',
                          help='if valid a model')
        self.add_argument('--test', action='store_true',
                          help='if test a model')
        self.add_argument('-rc', '--regularization_coef', type=float, default=0.000002,
                          help='set value > 0.0 if regularization is used')
        self.add_argument('-rn', '--regularization_norm', type=int, default=3,
                          help='norm used in regularization')
        self.add_argument('--non_uni_weight', action='store_true',
                          help='if use non-uniform weight when computing loss')
        self.add_argument('--pickle_graph', action='store_true',
                          help='pickle built graph, building a huge graph is slow.')
        self.add_argument('--num_proc', type=int, default=1,
                          help='number of processes used')
        self.add_argument('--num_thread', type=int, default=1,
                          help='number of threads used')
        self.add_argument('--rel_part', action='store_true',
                          help='enable relation partitioning')
        self.add_argument('--soft_rel_part', action='store_true',
                          help='enable soft relation partition')
        self.add_argument('--async_update', action='store_true',
                          help='allow async_update on node embedding')
        self.add_argument('--force_sync_interval', type=int, default=-1,
                          help='force a synchronization between processes every x steps')
        self.add_argument('--machine_id', type=int, default=0,
                          help='Unique ID of the current machine.')
        self.add_argument('--total_machine', type=int, default=1,
                          help='Total number of machines.')
        self.add_argument('--ip_config', type=str, default='ip_config.txt',
                          help='IP configuration file of kvstore')
        self.add_argument('--num_client', type=int, default=1,
                          help='Number of clients on each machine.')

def get_long_tail_partition(n_relations, n_machine):
    """Relation types have a long-tail distribution in many datasets,
    so we shuffle the relation types evenly across machines before partitioning.
    """
    assert n_relations > 0, 'n_relations must be a positive number.'
    assert n_machine > 0, 'n_machine must be a positive number.'

    partition_book = [0] * n_relations
    part_id = 0
    # Assign relation types to machines in round-robin order.
    for i in range(n_relations):
        partition_book[i] = part_id
        part_id += 1
        if part_id == n_machine:
            part_id = 0

    return partition_book
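# Example (hypothetical call): get_long_tail_partition(5, 2) assigns relation
# types round-robin and returns [0, 1, 0, 1, 0].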
def local_ip4_addr_list():
    """Return a set of IPv4 addresses of this machine.
    """
    nic = set()
    for ix in socket.if_nameindex():
        name = ix[1]
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        ip = socket.inet_ntoa(fcntl.ioctl(
            s.fileno(),
            0x8915,  # SIOCGIFADDR
            struct.pack('256s', name[:15].encode("UTF-8")))[20:24])
        nic.add(ip)
    return nic

def get_local_machine_id(server_namebook):
    """Get machine ID via server_namebook
    """
    assert len(server_namebook) > 0, 'server_namebook cannot be empty.'

    res = 0
    for ID, data in server_namebook.items():
        machine_id = data[0]
        ip = data[1]
        if ip in local_ip4_addr_list():
            res = machine_id
            break
    return res

def start_worker(args, logger):
    """Start kvclient for training
    """
    init_time_start = time.time()
    time.sleep(WAIT_TIME) # wait for launch script

    server_namebook = dgl.contrib.read_ip_config(filename=args.ip_config)
    args.machine_id = get_local_machine_id(server_namebook)

    dataset, entity_partition_book, local2global = get_partition_dataset(
        args.data_path,
        args.dataset,
        args.machine_id)

    n_entities = dataset.n_entities
    n_relations = dataset.n_relations

    print('Partition %d n_entities: %d' % (args.machine_id, n_entities))
    print("Partition %d n_relations: %d" % (args.machine_id, n_relations))

    entity_partition_book = F.tensor(entity_partition_book)
    relation_partition_book = get_long_tail_partition(dataset.n_relations, args.total_machine)
    relation_partition_book = F.tensor(relation_partition_book)
    local2global = F.tensor(local2global)

    relation_partition_book.share_memory_()
    entity_partition_book.share_memory_()
    local2global.share_memory_()

    train_data = TrainDataset(dataset, args, ranks=args.num_client)
    # if there is no cross-partition relation, we fall back to strict_rel_part
    args.strict_rel_part = args.mix_cpu_gpu and (train_data.cross_part == False)
    args.soft_rel_part = args.mix_cpu_gpu and args.soft_rel_part and train_data.cross_part

    if args.neg_sample_size_eval < 0:
        args.neg_sample_size_eval = dataset.n_entities
    args.batch_size = get_compatible_batch_size(args.batch_size, args.neg_sample_size)
    args.batch_size_eval = get_compatible_batch_size(args.batch_size_eval, args.neg_sample_size_eval)

    args.num_workers = 8 # fix num_workers to 8
    train_samplers = []
    for i in range(args.num_client):
        train_sampler_head = train_data.create_sampler(args.batch_size,
                                                       args.neg_sample_size,
                                                       args.neg_sample_size,
                                                       mode='head',
                                                       num_workers=args.num_workers,
                                                       shuffle=True,
                                                       exclude_positive=False,
                                                       rank=i)
        train_sampler_tail = train_data.create_sampler(args.batch_size,
                                                       args.neg_sample_size,
                                                       args.neg_sample_size,
                                                       mode='tail',
                                                       num_workers=args.num_workers,
                                                       shuffle=True,
                                                       exclude_positive=False,
                                                       rank=i)
        train_samplers.append(NewBidirectionalOneShotIterator(train_sampler_head, train_sampler_tail,
                                                              args.neg_sample_size, args.neg_sample_size,
                                                              True, n_entities))

    dataset = None

    model = load_model(logger, args, n_entities, n_relations)
    model.share_memory()

    print('Total initialize time {:.3f} seconds'.format(time.time() - init_time_start))

    rel_parts = train_data.rel_parts if args.strict_rel_part or args.soft_rel_part else None
    cross_rels = train_data.cross_rels if args.soft_rel_part else None

    procs = []
    barrier = mp.Barrier(args.num_client)
    for i in range(args.num_client):
        proc = mp.Process(target=dist_train_test, args=(args,
                                                        model,
                                                        train_samplers[i],
                                                        entity_partition_book,
                                                        relation_partition_book,
                                                        local2global,
                                                        i,
                                                        rel_parts,
                                                        cross_rels,
                                                        barrier))
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()

if __name__ == '__main__':
    args = ArgParser().parse_args()
    logger = get_logger(args)
    start_worker(args, logger)
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
import argparse
import time

import dgl
from dgl.contrib import KVServer
import torch as th

from train_pytorch import load_model
from dataloader import get_server_partition_dataset

NUM_THREAD = 1 # Fix the number of threads to 1 on kvstore

class KGEServer(KVServer):
    """User-defined kvstore for DGL-KGE
    """
    def _push_handler(self, name, ID, data, target):
        """Row-Sparse Adagrad updater
        """
        # The pushed tensor name ends with '-data-'; strip the suffix to get
        # the embedding name.
        original_name = name[0:-6]
        state_sum = target[original_name+'_state-data-']
        # Accumulate the mean squared gradient of each updated row.
        grad_sum = (data * data).mean(1)
        state_sum.index_add_(0, ID, grad_sum)
        std = state_sum[ID] # _sparse_mask
        std_values = std.sqrt_().add_(1e-10).unsqueeze(1)
        # Scale the gradient by the inverse accumulated std and apply the update.
        tmp = (-self.clr * data / std_values)
        target[name].index_add_(0, ID, tmp)

    def set_clr(self, learning_rate):
        """Set learning rate for Row-Sparse Adagrad updater
        """
        self.clr = learning_rate
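# For reference, the push handler above implements the row-sparse Adagrad
# update (eps = 1e-10 as in the code):
#   G[i] += mean(g[i]^2)
#   w[i] -= clr * g[i] / (sqrt(G[i]) + eps)
# where g is the pushed gradient and G the per-row accumulated state.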
# Note: Most of the args are unnecessary for KVStore, will remove them later
class ArgParser(argparse.ArgumentParser):
    def __init__(self):
        super(ArgParser, self).__init__()

        self.add_argument('--model_name', default='TransE',
                          choices=['TransE', 'TransE_l1', 'TransE_l2', 'TransR',
                                   'RESCAL', 'DistMult', 'ComplEx', 'RotatE'],
                          help='model to use')
        self.add_argument('--data_path', type=str, default='../data',
                          help='root path of all dataset')
        self.add_argument('--dataset', type=str, default='FB15k',
                          help='dataset name, under data_path')
        self.add_argument('--format', type=str, default='built_in',
                          help='the format of the dataset, it can be built_in, '\
                               'raw_udd_{htr} and udd_{htr}')
        self.add_argument('--hidden_dim', type=int, default=256,
                          help='hidden dim used by relation and entity')
        self.add_argument('--lr', type=float, default=0.0001,
                          help='learning rate')
        self.add_argument('-g', '--gamma', type=float, default=12.0,
                          help='margin value')
        self.add_argument('--gpu', type=int, default=[-1], nargs='+',
                          help='a list of active gpu ids, e.g. 0')
        self.add_argument('--mix_cpu_gpu', action='store_true',
                          help='mix CPU and GPU training')
        self.add_argument('-de', '--double_ent', action='store_true',
                          help='double entity dim for complex number')
        self.add_argument('-dr', '--double_rel', action='store_true',
                          help='double relation dim for complex number')
        self.add_argument('--num_thread', type=int, default=1,
                          help='number of threads used')
        self.add_argument('--server_id', type=int, default=0,
                          help='Unique ID of each server')
        self.add_argument('--ip_config', type=str, default='ip_config.txt',
                          help='IP configuration file of kvstore')
        self.add_argument('--total_client', type=int, default=1,
                          help='Total number of client worker nodes')

def get_server_data(args, machine_id):
    """Get data from data_path/dataset/part_machine_id

    Return: global2local,
            entity_emb,
            entity_state,
            relation_emb,
            relation_emb_state
    """
    g2l, dataset = get_server_partition_dataset(
        args.data_path,
        args.dataset,
        machine_id)

    # Note that the dataset doesn't contain the triples
    print('n_entities: ' + str(dataset.n_entities))
    print('n_relations: ' + str(dataset.n_relations))

    args.soft_rel_part = False
    args.strict_rel_part = False
    model = load_model(None, args, dataset.n_entities, dataset.n_relations)

    return g2l, model.entity_emb.emb, model.entity_emb.state_sum, model.relation_emb.emb, model.relation_emb.state_sum

def start_server(args):
    """Start kvstore service
    """
    th.set_num_threads(NUM_THREAD)

    server_namebook = dgl.contrib.read_ip_config(filename=args.ip_config)

    my_server = KGEServer(server_id=args.server_id,
                          server_namebook=server_namebook,
                          num_client=args.total_client)
    my_server.set_clr(args.lr)

    if my_server.get_id() % my_server.get_group_count() == 0: # master server
        g2l, entity_emb, entity_emb_state, relation_emb, relation_emb_state = get_server_data(args, my_server.get_machine_id())
        my_server.set_global2local(name='entity_emb', global2local=g2l)
        my_server.init_data(name='relation_emb', data_tensor=relation_emb)
        my_server.init_data(name='relation_emb_state', data_tensor=relation_emb_state)
        my_server.init_data(name='entity_emb', data_tensor=entity_emb)
        my_server.init_data(name='entity_emb_state', data_tensor=entity_emb_state)
    else: # backup server
        my_server.set_global2local(name='entity_emb')
        my_server.init_data(name='relation_emb')
        my_server.init_data(name='relation_emb_state')
        my_server.init_data(name='entity_emb')
        my_server.init_data(name='entity_emb_state')

    print('KVServer %d is listening for requests ...' % my_server.get_id())

    my_server.start()

if __name__ == '__main__':
    args = ArgParser().parse_args()
    start_server(args)
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from .general_models import KEModel
# -*- coding: utf-8 -*-
#
# setup.py
#
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#