Unverified Commit 6568d0aa authored by Ligeng Zhu, committed by GitHub

[Example] Recommend NFS over copy files (#2689)



* Update README to use NFS, rather than copying files

* Update README.md

* Update README.md

* Update README.md

* Add ParMetis

* Update README.md

* fix directory.

* fix

* redo.

* fix.

* update.

* add more descriptions
Co-authored-by: Da Zheng <zhengda1936@gmail.com>
Co-authored-by: Quan Gan <coin2028@hotmail.com>
parent 99751d49
@@ -9,9 +9,74 @@ sudo pip3 install pyinstrument
To train GraphSage, follow these five steps:
### Step 0: Set up a Distributed File System
* You may skip this step if your cluster already has folder(s) synchronized across machines.
To perform distributed training, files and code need to be accessible across multiple machines. A distributed file system (e.g., NFS, Ceph) handles this well.
#### Server side setup
Here is an example of how to set up NFS. First, install the essential packages on the storage server:
```bash
sudo apt-get install nfs-kernel-server
```
Below we assume the user account is `ubuntu` and we create a `workspace` directory in the home directory.
```bash
mkdir -p /home/ubuntu/workspace
```
We assume that all servers are in a subnet with IP range `192.168.0.0` to `192.168.255.255`. The exports configuration needs to be modified as follows:
```bash
sudo vim /etc/exports
# add the following line
/home/ubuntu/workspace 192.168.0.0/16(rw,sync,no_subtree_check)
```
The server's internal IP can be checked via `ifconfig` or `ip`. If the IP does not begin with `192.168`, then you may use
```bash
# for ip range 10.0.0.0 – 10.255.255.255
/home/ubuntu/workspace 10.0.0.0/8(rw,sync,no_subtree_check)
# for ip range 172.16.0.0 – 172.31.255.255
/home/ubuntu/workspace 172.16.0.0/12(rw,sync,no_subtree_check)
```
Then restart NFS; the setup on the server side is finished.
```
sudo systemctl restart nfs-kernel-server
```
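If `showmount` is available (it ships with the standard NFS utilities on most distributions), you can quickly check that the export is visible:
```bash
# List the directories this server currently exports;
# /home/ubuntu/workspace should appear in the output.
showmount -e localhost
```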
For configuration details, please refer to [NFS ArchWiki](https://wiki.archlinux.org/index.php/NFS).
#### Client side setup
To use NFS, clients also need to install the essential packages
```
sudo apt-get install nfs-common
```
You can either mount the NFS share manually
```
mkdir -p /home/ubuntu/workspace
sudo mount -t nfs <nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace
```
or edit `/etc/fstab` so the folder is mounted automatically
```
# vim /etc/fstab
## append the following line to the file
<nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace nfs defaults 0 0
```
Then run `mount -a`.
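It is worth confirming that the directory is really served over NFS rather than sitting on the local disk. A minimal check, assuming the paths used above:
```bash
# The filesystem type column should report nfs (or nfs4)
df -hT /home/ubuntu/workspace
```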
Now go to `/home/ubuntu/workspace` and clone the DGL GitHub repository.
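For example, assuming the shared `workspace` directory set up above:
```bash
cd /home/ubuntu/workspace
git clone --recursive https://github.com/dmlc/dgl.git
```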
### Step 1: set IP configuration file.
Users need to set up their own IP configuration file `ip_config.txt` before training. For example, if we have four machines in the cluster, the IP configuration
could look like this:
```bash
@@ -23,7 +88,7 @@ could like this:
Users need to make sure that the master node (node-0) has the right permission to ssh to all the other nodes.
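For instance, passwordless SSH from the master to the workers can be set up roughly as follows (a sketch only; the `ubuntu` account and the `172.31.0.2` address are placeholders for your own user and the IPs listed in `ip_config.txt`):
```bash
# On the master node: generate a key pair once (skip if you already have one)
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""
# Copy the public key to every other machine in ip_config.txt
ssh-copy-id ubuntu@172.31.0.2
# Verify that login now works without a password prompt
ssh ubuntu@172.31.0.2 hostname
```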
### Step 2: partition the graph.
The example provides a script to partition some built-in graphs such as Reddit and the OGB product graph.
If we want to train GraphSage on 4 machines, we need to partition the graph into 4 parts.
@@ -41,29 +106,6 @@ python3 partition_graph.py --dataset ogb-product --num_parts 4 --balance_train -
This script generates partitioned graphs and stores them in the directory called `data`.
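As a quick sanity check, the output directory typically looks like the listing below (the exact file layout may differ across DGL versions, so treat this as an illustration):
```bash
ls data
# ogb-product.json  part0  part1  part2  part3
# ogb-product.json is the partition configuration passed to --part_config;
# each partX directory holds the graph structure and features of one partition.
```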
### Step 2: copy the partitioned data and files to the cluster
DGL provides a script for copying partitioned data and files to the cluster. Before that, copy the training script to a local folder:
```bash
mkdir ~/dgl_code
cp ~/dgl/examples/pytorch/graphsage/experimental/train_dist.py ~/dgl_code
cp ~/dgl/examples/pytorch/graphsage/experimental/train_dist_unsupervised.py ~/dgl_code
```
The command below copies the partitioned data, the IP config file, and the training scripts to the machines in the cluster. The configuration of the cluster is defined by `ip_config.txt`. The data is copied to `~/graphsage/ogb-product` on each of the remote machines. The training script is copied to `~/graphsage/dgl_code` on each of the remote machines. `--part_config` specifies the location of the partitioned data on the local machine (a user only needs to specify
the location of the partition configuration file).
```bash
python3 ~/dgl/tools/copy_files.py \
--ip_config ip_config.txt \
--workspace ~/graphsage \
--rel_data_path ogb-product \
--part_config data/ogb-product.json \
--script_folder ~/dgl_code
```
After running this command, users can find a folder called ``graphsage`` on each machine. The folder contains ``ip_config.txt``, ``dgl_code``, and ``ogb-product`` inside.
### Step 3: Launch distributed jobs
@@ -75,42 +117,43 @@ The command below launches one training process on each machine and each trainin
Please set the number of sampling processes to 0 if you are using Python 3.8.
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
--num_samplers 4 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 4"
```
To run unsupervised training:
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
--num_samplers 4 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist_unsupervised.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 3 --batch_size 1000 --num_workers 4"
```
By default, this code runs on CPU. If you have GPU support, you can simply add a `--num_gpus` argument to the user command:
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
--num_samplers 4 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 4 --num_gpus 4"
```
**Note:** if you are using conda or other virtual environments on the remote machines, you need to replace `python3` in the command string (i.e. the last argument) with the path to the Python interpreter in that environment.
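For example, if the remote machines use a conda environment named `dgl` (a hypothetical name and path; adjust both to your own setup), the command string would start with that environment's interpreter:
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
--num_samplers 4 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"/home/ubuntu/miniconda3/envs/dgl/bin/python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 4"
```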
## Distributed code runs in the standalone mode
......
@@ -2,27 +2,85 @@
This is an example of training RGCN node classification in a distributed fashion. Currently, the example trains RGCN on graphs with input node features. The current implementation follows ../rgcn/entity_classify_mp.py.
Before training, install the required Python libs with pip:
```bash
pip3 install ogb pyinstrument pyarrow
```
To train RGCN, follow these four steps:
### Step 0: Set up a Distributed File System
* You may skip this step if your cluster already has folder(s) synchronized across machines.
To perform distributed training, files and code need to be accessible across multiple machines. A distributed file system (e.g., NFS, Ceph) handles this well.
#### Server side setup
Here is an example of how to set up NFS. First, install the essential packages on the storage server:
```bash
sudo apt-get install nfs-kernel-server
```
Below we assume the user account is `ubuntu` and we create a `workspace` directory in the home directory.
```bash
mkdir -p /home/ubuntu/workspace
```
We assume that all servers are in a subnet with IP range `192.168.0.0` to `192.168.255.255`. The exports configuration needs to be modified as follows:
```bash
sudo vim /etc/exports
# add the following line
/home/ubuntu/workspace 192.168.0.0/16(rw,sync,no_subtree_check)
```
The server's internal IP can be checked via `ifconfig` or `ip`. If the IP does not begin with `192.168`, then you may use
```bash
# for ip range 10.0.0.0 – 10.255.255.255
/home/ubuntu/workspace 10.0.0.0/8(rw,sync,no_subtree_check)
# for ip range 172.16.0.0 – 172.31.255.255
/home/ubuntu/workspace 172.16.0.0/12(rw,sync,no_subtree_check)
```
Then restart NFS; the setup on the server side is finished.
```
sudo systemctl restart nfs-kernel-server
```
For configuration details, please refer to [NFS ArchWiki](https://wiki.archlinux.org/index.php/NFS).
#### Client side setup
To use NFS, clients also need to install the essential packages
```
sudo apt-get install nfs-common
```
You can either mount the NFS share manually
```
mkdir -p /home/ubuntu/workspace
sudo mount -t nfs <nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace
```
or edit `/etc/fstab` so the folder is mounted automatically
```
# vim /etc/fstab
## append the following line to the file
<nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace nfs defaults 0 0
```
Then run `mount -a`.
Now go to `/home/ubuntu/workspace` and clone the DGL GitHub repository.
### Step 1: set IP configuration file.
Users need to set up their own IP configuration file `ip_config.txt` before training. For example, if we have four machines in the cluster, the IP configuration could look like this:
```bash
172.31.0.1
@@ -33,7 +91,7 @@ could like this:
Users need to make sure that the master node (node-0) has the right permission to ssh to all the other nodes.
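A quick way to confirm this from the master node (a sketch that assumes one IP per line in `ip_config.txt` and the `ubuntu` account on every machine):
```bash
# Each machine should print its hostname without asking for a password
for ip in $(cat ip_config.txt); do ssh ubuntu@"$ip" hostname; done
```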
### Step 2: partition the graph.
The example provides a script to partition some built-in graphs such as the ogbn-mag graph.
If we want to train RGCN on 4 machines, we need to partition the graph into 4 parts.
@@ -44,37 +102,6 @@ the number of nodes, the number of edges and the number of labelled nodes.
python3 partition_graph.py --dataset ogbn-mag --num_parts 4 --balance_train --balance_edges
```
This script generates partitioned graphs and stores them in the directory called `data`.
### Step 2: copy the partitioned data to the cluster
DGL provides a script for copying partitioned data to the cluster. Before that, copy the training script to a local folder:
```bash
mkdir ~/dgl_code
cp ~/dgl/examples/pytorch/rgcn/experimental/entity_classify_dist.py ~/dgl_code
```
The command below copies the partitioned data, the IP config file, and the training scripts to the machines in the cluster.
The configuration of the cluster is defined by `ip_config.txt`.
The data is copied to `~/rgcn/ogbn-mag` on each of the remote machines.
`--rel_data_path` specifies the relative path in the workspace where the partitioned data will be stored.
`--part_config` specifies the location of the partitioned data in the local machine (a user only needs to specify
the location of the partition configuration file). `--script_folder` specifies the location of the training scripts.
```bash
python ~/dgl/tools/copy_files.py --ip_config ip_config.txt \
--workspace ~/rgcn \
--rel_data_path data \
--part_config data/ogbn-mag.json \
--script_folder ~/dgl_code
```
**Note**: users need to make sure that the master node has the right permission to ssh to all the other nodes.
Users need to copy the training script to the workspace directory on remote machines as well.
### Step 3: Launch distributed jobs
DGL provides a script to launch the training job in the cluster. `part_config` and `ip_config`
@@ -85,14 +112,14 @@ The command below launches one training process on each machine and each trainin
Please set the number of sampling processes to 0 if you are using Python 3.8.
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/rgcn/experimental/ \
--num_trainers 1 \
--num_servers 1 \
--num_samplers 4 \
--part_config data/ogbn-mag.json \
--ip_config ip_config.txt \
"python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 16 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num-workers 4 --num-servers 1 --sparse-embedding --sparse-lr 0.06 --node-feats"
```
We can get the performance score at the second epoch:
@@ -100,6 +127,8 @@ We can get the performance score at the second epoch:
Val Acc 0.4323, Test Acc 0.4255, time: 128.0379
```
**Note:** if you are using conda or other virtual environments on the remote machines, you need to replace `python3` in the command string (i.e. the last argument) with the path to the Python interpreter in that environment.
## Partition a graph with ParMETIS
There are four steps to partition a graph with ParMETIS for DGL's distributed training.
@@ -144,7 +173,7 @@ DGL provides a tool called `convert_partition.py` to load one partition at a tim
and save it into a file.
```bash
python3 ~/workspace/dgl/tools/convert_partition.py --input-dir . --graph-name mag --schema mag.json --num-parts 2 --num-node-weights 4 --output outputs
```
### Step 4: Read node data and edge data for each partition
......
@@ -386,10 +386,10 @@ def run(args, device, data):
        emb_optimizer = dgl.distributed.SparseAdagrad(list(embed_layer.module.node_embeds.values()), lr=args.sparse_lr)
        print('optimize DGL sparse embedding:', embed_layer.module.node_embeds.keys())
    elif args.standalone:
        emb_optimizer = th.optim.SparseAdam(list(embed_layer.node_embeds.parameters()), lr=args.sparse_lr)
        print('optimize Pytorch sparse embedding:', embed_layer.node_embeds)
    else:
        emb_optimizer = th.optim.SparseAdam(list(embed_layer.module.node_embeds.parameters()), lr=args.sparse_lr)
        print('optimize Pytorch sparse embedding:', embed_layer.module.node_embeds)
    dense_params = list(model.parameters())
    if args.node_feats:
......