Unverified Commit 6568d0aa authored by Ligeng Zhu, committed by GitHub

[Example] Recommend NFS over copy files (#2689)



* Update README to use NFS, rather than copying files

* Update README.md

* Update README.md

* Update README.md

* Add ParMetis

* Update README.md

* fix directory.

* fix

* redo.

* fix.

* update.

* add more descriptions
Co-authored-by: Da Zheng <zhengda1936@gmail.com>
Co-authored-by: Quan Gan <coin2028@hotmail.com>
parent 99751d49
@@ -9,9 +9,74 @@ sudo pip3 install pyinstrument
To train GraphSage, follow these five steps:
### Step 0: Set up a Distributed File System
* You may skip this step if your cluster already has folder(s) synchronized across machines.
To perform distributed training, files and code need to be accessible across multiple machines. A distributed file system (e.g., NFS, Ceph) handles this well.
#### Server side setup
Here is an example of how to set up NFS. First, install the essential packages on the storage server:
```bash
sudo apt-get install nfs-kernel-server
```
Below we assume the user account is `ubuntu` and we create a `workspace` directory in the home directory.
```bash
mkdir -p /home/ubuntu/workspace
```
We assume that all servers are in a subnet with IP range `192.168.0.0` to `192.168.255.255`. The exports configuration needs to be modified as follows:
```bash
sudo vim /etc/exports
# add the following line
/home/ubuntu/workspace 192.168.0.0/16(rw,sync,no_subtree_check)
```
The server's internal IP can be checked via `ifconfig` or `ip`. If the IP does not begin with `192.168`, then you may use
```bash
# for ip range 10.0.0.0 – 10.255.255.255
/home/ubuntu/workspace 10.0.0.0/8(rw,sync,no_subtree_check)
# for ip range 172.16.0.0 – 172.31.255.255
/home/ubuntu/workspace 172.16.0.0/12(rw,sync,no_subtree_check)
```
Then restart NFS; the setup on the server side is finished.
```
sudo systemctl restart nfs-kernel-server
```
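If `showmount` is available (it ships with the standard NFS utilities on most distributions), you can quickly check that the export is visible:
```bash
# List the directories this server currently exports;
# /home/ubuntu/workspace should appear in the output.
showmount -e localhost
```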
For configuration details, please refer to [NFS ArchWiki](https://wiki.archlinux.org/index.php/NFS).
#### Client side setup
To use NFS, clients also need to install the essential packages
```
sudo apt-get install nfs-common
```
You can either mount the NFS share manually
```
mkdir -p /home/ubuntu/workspace
sudo mount -t nfs <nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace
```
or edit `/etc/fstab` so the folder is mounted automatically
```
# vim /etc/fstab
## append the following line to the file
<nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace nfs defaults 0 0
```
Then run `mount -a`.
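It is worth confirming that the directory is really served over NFS rather than sitting on the local disk. A minimal check, assuming the paths used above:
```bash
# The filesystem type column should report nfs (or nfs4)
df -hT /home/ubuntu/workspace
```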
Now go to `/home/ubuntu/workspace` and clone the DGL GitHub repository.
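For example, assuming the shared `workspace` directory set up above:
```bash
cd /home/ubuntu/workspace
git clone --recursive https://github.com/dmlc/dgl.git
```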
### Step 1: set IP configuration file.
Users need to set up their own IP configuration file `ip_config.txt` before training. For example, if we have four machines in the cluster, the IP configuration
could look like this:
```bash
@@ -23,7 +88,7 @@ could like this:
Users need to make sure that the master node (node-0) has the right permission to ssh to all the other nodes.
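For instance, passwordless SSH from the master to the workers can be set up roughly as follows (a sketch only; the `ubuntu` account and the `172.31.0.2` address are placeholders for your own user and the IPs listed in `ip_config.txt`):
```bash
# On the master node: generate a key pair once (skip if you already have one)
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""
# Copy the public key to every other machine in ip_config.txt
ssh-copy-id ubuntu@172.31.0.2
# Verify that login now works without a password prompt
ssh ubuntu@172.31.0.2 hostname
```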
### Step 2: partition the graph.
The example provides a script to partition some built-in graphs such as Reddit and the OGB product graph.
If we want to train GraphSage on 4 machines, we need to partition the graph into 4 parts.
@@ -41,29 +106,6 @@ python3 partition_graph.py --dataset ogb-product --num_parts 4 --balance_train -
This script generates partitioned graphs and stores them in the directory called `data`.
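As a quick sanity check, the output directory typically looks like the listing below (the exact file layout may differ across DGL versions, so treat this as an illustration):
```bash
ls data
# ogb-product.json  part0  part1  part2  part3
# ogb-product.json is the partition configuration passed to --part_config;
# each partX directory holds the graph structure and features of one partition.
```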
### Step 2: copy the partitioned data and files to the cluster
DGL provides a script for copying partitioned data and files to the cluster. Before that, copy the training script to a local folder:
```bash
mkdir ~/dgl_code
cp ~/dgl/examples/pytorch/graphsage/experimental/train_dist.py ~/dgl_code
cp ~/dgl/examples/pytorch/graphsage/experimental/train_dist_unsupervised.py ~/dgl_code
```
The command below copies the partitioned data, the IP config file, and the training scripts to the machines in the cluster. The configuration of the cluster is defined by `ip_config.txt`. The data is copied to `~/graphsage/ogb-product` on each of the remote machines. The training script is copied to `~/graphsage/dgl_code` on each of the remote machines. `--part_config` specifies the location of the partitioned data on the local machine (a user only needs to specify
the location of the partition configuration file).
```bash
python3 ~/dgl/tools/copy_files.py \
--ip_config ip_config.txt \
--workspace ~/graphsage \
--rel_data_path ogb-product \
--part_config data/ogb-product.json \
--script_folder ~/dgl_code
```
After running this command, users can find a folder called ``graphsage`` on each machine. The folder contains ``ip_config.txt``, ``dgl_code``, and ``ogb-product`` inside.
### Step 3: Launch distributed jobs
@@ -75,42 +117,43 @@ The command below launches one training process on each machine and each trainin
Please set the number of sampling processes to 0 if you are using Python 3.8.
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
--num_samplers 4 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 4"
```
To run unsupervised training:
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
--num_samplers 4 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist_unsupervised.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 3 --batch_size 1000 --num_workers 4"
```
By default, this code runs on CPU. If you have GPU support, you can simply add a `--num_gpus` argument to the user command:
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
--num_samplers 4 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 4 --num_gpus 4"
```
**Note:** if you are using conda or other virtual environments on the remote machines, you need to replace `python3` in the command string (i.e. the last argument) with the path to the Python interpreter in that environment.
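For example, if the remote machines use a conda environment named `dgl` (a hypothetical name and path; adjust both to your own setup), the command string would start with that environment's interpreter:
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
--num_samplers 4 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"/home/ubuntu/miniconda3/envs/dgl/bin/python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 4"
```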
## Distributed code runs in the standalone mode
......
@@ -2,27 +2,85 @@
This is an example of training RGCN node classification in a distributed fashion. Currently, the example trains RGCN on graphs with input node features. The current implementation follows ../rgcn/entity_classify_mp.py.
Before training, install the required Python libs with pip:
```bash
pip3 install ogb pyinstrument pyarrow
```
To train RGCN, follow these four steps:
### Step 0: Set up a Distributed File System
* You may skip this step if your cluster already has folder(s) synchronized across machines.
To perform distributed training, files and code need to be accessible across multiple machines. A distributed file system (e.g., NFS, Ceph) handles this well.
#### Server side setup
Here is an example of how to set up NFS. First, install the essential packages on the storage server:
```bash
sudo apt-get install nfs-kernel-server
```
Below we assume the user account is `ubuntu` and we create a `workspace` directory in the home directory.
```bash
mkdir -p /home/ubuntu/workspace
```
We assume that all servers are in a subnet with IP range `192.168.0.0` to `192.168.255.255`. The exports configuration needs to be modified as follows:
```bash
sudo vim /etc/exports
# add the following line
/home/ubuntu/workspace 192.168.0.0/16(rw,sync,no_subtree_check)
```
The server's internal IP can be checked via `ifconfig` or `ip`. If the IP does not begin with `192.168`, then you may use
```bash
# for ip range 10.0.0.0 – 10.255.255.255
/home/ubuntu/workspace 10.0.0.0/8(rw,sync,no_subtree_check)
# for ip range 172.16.0.0 – 172.31.255.255
/home/ubuntu/workspace 172.16.0.0/12(rw,sync,no_subtree_check)
```
Then restart NFS; the setup on the server side is finished.
```
sudo systemctl restart nfs-kernel-server
```
For configuration details, please refer to [NFS ArchWiki](https://wiki.archlinux.org/index.php/NFS).
#### Client side setup
To use NFS, clients also need to install the essential packages
```
sudo apt-get install nfs-common
```
You can either mount the NFS share manually
```
mkdir -p /home/ubuntu/workspace
sudo mount -t nfs <nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace
```
or edit `/etc/fstab` so the folder is mounted automatically
```
# vim /etc/fstab
## append the following line to the file
<nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace nfs defaults 0 0
```
Then run `mount -a`.
Now go to `/home/ubuntu/workspace` and clone the DGL GitHub repository.
### Step 1: set IP configuration file.
Users need to set up their own IP configuration file `ip_config.txt` before training. For example, if we have four machines in the cluster, the IP configuration could look like this:
```bash
172.31.0.1
@@ -33,7 +91,7 @@ could like this:
Users need to make sure that the master node (node-0) has the right permission to ssh to all the other nodes.
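A quick way to confirm this from the master node (a sketch that assumes one IP per line in `ip_config.txt` and the `ubuntu` account on every machine):
```bash
# Each machine should print its hostname without asking for a password
for ip in $(cat ip_config.txt); do ssh ubuntu@"$ip" hostname; done
```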
### Step 2: partition the graph.
The example provides a script to partition some built-in graphs such as the ogbn-mag graph.
If we want to train RGCN on 4 machines, we need to partition the graph into 4 parts.
@@ -44,37 +102,6 @@ the number of nodes, the number of edges and the number of labelled nodes.
python3 partition_graph.py --dataset ogbn-mag --num_parts 4 --balance_train --balance_edges
```
This script generates partitioned graphs and stores them in the directory called `data`.
### Step 2: copy the partitioned data to the cluster
DGL provides a script for copying partitioned data to the cluster. Before that, copy the training script to a local folder:
```bash
mkdir ~/dgl_code
cp ~/dgl/examples/pytorch/rgcn/experimental/entity_classify_dist.py ~/dgl_code
```
The command below copies the partitioned data, the IP config file, and the training scripts to the machines in the cluster.
The configuration of the cluster is defined by `ip_config.txt`.
The data is copied to `~/rgcn/ogbn-mag` on each of the remote machines.
`--rel_data_path` specifies the relative path in the workspace where the partitioned data will be stored.
`--part_config` specifies the location of the partitioned data in the local machine (a user only needs to specify
the location of the partition configuration file). `--script_folder` specifies the location of the training scripts.
```bash
python ~/dgl/tools/copy_files.py --ip_config ip_config.txt \
--workspace ~/rgcn \
--rel_data_path data \
--part_config data/ogbn-mag.json \
--script_folder ~/dgl_code
```
**Note**: users need to make sure that the master node has the right permission to ssh to all the other nodes.
Users need to copy the training script to the workspace directory on remote machines as well.
### Step 3: Launch distributed jobs
DGL provides a script to launch the training job in the cluster. `part_config` and `ip_config`
@@ -85,14 +112,14 @@ The command below launches one training process on each machine and each trainin
Please set the number of sampling processes to 0 if you are using Python 3.8.
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/rgcn/experimental/ \
--num_trainers 1 \
--num_servers 1 \
--num_samplers 4 \
--part_config data/ogbn-mag.json \
--ip_config ip_config.txt \
"python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 16 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num-workers 4 --num-servers 1 --sparse-embedding --sparse-lr 0.06 --node-feats"
```
We can get the performance score at the second epoch:
@@ -100,6 +127,8 @@ We can get the performance score at the second epoch:
Val Acc 0.4323, Test Acc 0.4255, time: 128.0379
```
**Note:** if you are using conda or other virtual environments on the remote machines, you need to replace `python3` in the command string (i.e. the last argument) with the path to the Python interpreter in that environment.
## Partition a graph with ParMETIS
There are four steps to partition a graph with ParMETIS for DGL's distributed training.
@@ -144,7 +173,7 @@ DGL provides a tool called `convert_partition.py` to load one partition at a tim
and save it into a file.
```bash
python3 ~/workspace/dgl/tools/convert_partition.py --input-dir . --graph-name mag --schema mag.json --num-parts 2 --num-node-weights 4 --output outputs
```
### Step 4: Read node data and edge data for each partition
......
@@ -386,10 +386,10 @@ def run(args, device, data):
        emb_optimizer = dgl.distributed.SparseAdagrad(list(embed_layer.module.node_embeds.values()), lr=args.sparse_lr)
        print('optimize DGL sparse embedding:', embed_layer.module.node_embeds.keys())
    elif args.standalone:
        emb_optimizer = th.optim.SparseAdam(list(embed_layer.node_embeds.parameters()), lr=args.sparse_lr)
        print('optimize Pytorch sparse embedding:', embed_layer.node_embeds)
    else:
        emb_optimizer = th.optim.SparseAdam(list(embed_layer.module.node_embeds.parameters()), lr=args.sparse_lr)
        print('optimize Pytorch sparse embedding:', embed_layer.module.node_embeds)
    dense_params = list(model.parameters())
    if args.node_feats:
......