[Distributed] Copy training scripts in copy_partitions.py (#2010)

* update * update * update * update * update * update * update * update * update * update * update * update * update * update

[Distributed] Copy training scripts in copy_partitions.py (#2010)
* update * update * update * update * update * update * update * update * update * update * update * update * update * update
eb9c067b · Chao Ma · GitHub · cd484352 · eb9c067b · eb9c067b
Unverified Commit eb9c067b authored Aug 13, 2020 by Chao Ma Committed by GitHub Aug 13, 2020
4 changed files
--- a/examples/pytorch/graphsage/experimental/README.md
+++ b/examples/pytorch/graphsage/experimental/README.md
 ## Distributed training

-This is an example of training GraphSage in a distributed fashion. To train GraphSage, it has four steps:
+This is an example of training GraphSage in a distributed fashion. To train GraphSage, it has five steps:
+
+### Step 0: set IP configuration file.
+
+User need to set their own IP configuration file before training. For example, if we have four machines in current cluster, the IP configuration
+could like this:
+
+```bash
+172.31.19.1
+172.31.23.205
+172.31.29.175
+172.31.16.98
+```
+
+Users need to make sure that the master node (node-0) has right permission to ssh to all the other nodes.

 ### Step 1: partition the graph.

@@ -12,30 +26,35 @@ We need to load some function from the parent directory.
 export PYTHONPATH=$PYTHONPATH:..
 ```

-In this example, we partition the OGB product graph into 4 parts with Metis. The partitions are balanced with respect to
+In this example, we partition the OGB product graph into 4 parts with Metis on node-0. The partitions are balanced with respect to
 the number of nodes, the number of edges and the number of labelled nodes.
 ```bash
 python3 partition_graph.py --dataset ogb-product --num_parts 4 --balance_train --balance_edges
 ```

-### Step 2: copy the partitioned data to the cluster
+### Step 2: copy the partitioned data and files to the cluster

-DGL provides a script for copying partitioned data to the cluster. The command below copies partition data
-to the machines in the cluster. The configuration of the cluster is defined by `ip_config.txt`,
-The data is copied to `~/graphsage/ogb-product` on each of the remote machines. `--part_config`
-specifies the location of the partitioned data in the local machine (a user only needs to specify
+DGL provides a script for copying partitioned data and files to the cluster. Before that, copy the training script to a local folder:
+
+```bash
+mkdir ~/dgl_code
+cp ~/dgl/examples/pytorch/graphsage/experimental/train_dist.py ~/dgl_code
+cp ~/dgl/examples/pytorch/graphsage/experimental/train_dist_unsupervised.py ~/dgl_code
+```
+
+The command below copies partition data, ip config file, as well as training scripts to the machines in the cluster. The configuration of the cluster is defined by `ip_config.txt`, The data is copied to `~/graphsage/ogb-product` on each of the remote machines. The training script is copied to `~/graphsage/dgl_code` on each of the remote machines. `--part_config` specifies the location of the partitioned data in the local machine (a user only needs to specify
 the location of the partition configuration file).
+
 ```bash
-python3 ~/dgl/tools/copy_partitions.py \
+python3 ~/dgl/tools/copy_files.py \
 --ip_config ip_config.txt \
 --workspace ~/graphsage \
 --rel_data_path ogb-product \
--part_config data/ogb-product.json 
+--part_config data/ogb-product.json \
+--script_folder ~/dgl_code
 ```

-**Note**: users need to make sure that the master node has right permission to ssh to all the other nodes.
-
-Users need to copy the training script to the workspace directory on remote machines as well.
+After runing this command, user can find a folder called ``graphsage`` on each machine. The folder contains ``ip_config.txt``, ``dgl_code``, and ``ogb-product`` inside.

 ### Step 3: Launch distributed jobs

@@ -50,7 +69,7 @@ python3 ~/dgl/tools/launch.py \
 --num_servers 1 \
 --part_config ogb-product/ogb-product.json \
 --ip_config ip_config.txt \
-"python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 4"
+"python3 dgl_code/train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 4"
 ```

 To run unsupervised training:
@@ -62,7 +81,7 @@ python3 ~/dgl/tools/launch.py \
 --num_servers 1 \
 --part_config ogb-product/ogb-product.json \
 --ip_config ip_config.txt \
-"python3 train_dist_unsupervised.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 3 --batch_size 1000"
+"python3 dgl_code/train_dist_unsupervised.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 3 --batch_size 1000"
 ```

 ## Distributed code runs in the standalone mode

--- a/examples/pytorch/graphsage/experimental/ip_config.txt
+++ b/examples/pytorch/graphsage/experimental/ip_config.txt
-172.31.19.1 5555
-172.31.23.205 5555
-172.31.29.175 5555
-172.31.16.98 5555
\ No newline at end of file
+172.31.19.1
+172.31.23.205
+172.31.29.175
+172.31.16.98
\ No newline at end of file
--- a/examples/pytorch/graphsage/experimental/partition_graph.py
+++ b/examples/pytorch/graphsage/experimental/partition_graph.py
@@ -20,6 +20,8 @@ if __name__ == '__main__':
                           help='turn the graph into an undirected graph.')
    argparser.add_argument('--balance_edges', action='store_true',
                           help='balance the number of edges in each partition.')
+    argparser.add_argument('--output', type=str, default='data',
+                           help='Output path of partitioned graph.')
    args = argparser.parse_args()

    start = time.time()
@@ -45,7 +47,7 @@ if __name__ == '__main__':
            sym_g.ndata[key] = g.ndata[key]
        g = sym_g

-    dgl.distributed.partition_graph(g, args.dataset, args.num_parts, 'data',
+    dgl.distributed.partition_graph(g, args.dataset, args.num_parts, args.output,
                                    part_method=args.part_method,
                                    balance_ntypes=balance_ntypes,
                                    balance_edges=args.balance_edges)
--- a/tools/copy_partitions.py
+++ b/tools/copy_partitions.py
@@ -28,6 +28,8 @@ def main():
                        help='Relative path in workspace to store the partition data.')
    parser.add_argument('--part_config', type=str, required=True,
                        help='The partition config file. The path is on the local machine.')
+    parser.add_argument('--script_folder', type=str, required=True,
+                        help='The folder contains all the user code scripts.')
    parser.add_argument('--ip_config', type=str, required=True,
                        help='The file of IP configuration for servers. \
                        The path is on the local machine.')
@@ -89,6 +91,8 @@ def main():
        copy_file(part_files['node_feats'], ip, remote_path)
        copy_file(part_files['edge_feats'], ip, remote_path)
        copy_file(part_files['part_graph'], ip, remote_path)
+        # copy script folder
+        copy_file(args.script_folder, ip, args.workspace)


 def signal_handler(signal, frame):