Unverified commit eb9c067b authored by Chao Ma, committed by GitHub

[Distributed] Copy training scripts in copy_partitions.py (#2010)

parent cd484352
## Distributed training
This is an example of training GraphSage in a distributed fashion. Training GraphSage takes five steps:
### Step 0: set up the IP configuration file.
Users need to set up their own IP configuration file before training. For example, if we have four machines in the current cluster, the IP configuration file
could look like this:
```bash
172.31.19.1
172.31.23.205
172.31.29.175
172.31.16.98
```
Users need to make sure that the master node (node-0) has the right permission to ssh to all the other nodes.
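A quick way to check this is to attempt a non-interactive ssh from the master node to every address in the file. The loop below is a minimal sketch, assuming the file is named `ip_config.txt` and key-based ssh is already configured:
```bash
# Run on the master node (node-0); each command should print the remote hostname
# without prompting for a password. A failure means ssh access still needs setup.
while read ip; do
    ssh -o BatchMode=yes "$ip" hostname
done < ip_config.txt
```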
### Step 1: partition the graph.
...@@ -12,30 +26,35 @@ We need to load some functions from the parent directory.
export PYTHONPATH=$PYTHONPATH:..
```
In this example, we partition the OGB product graph into 4 parts with Metis on node-0. The partitions are balanced with respect to
the number of nodes, the number of edges and the number of labelled nodes.
```bash
python3 partition_graph.py --dataset ogb-product --num_parts 4 --balance_train --balance_edges
```
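The partitions and the partition metadata land in the `data` folder by default (which is why Step 2 below points `--part_config` at `data/ogb-product.json`). An optional sanity check:
```bash
# Optional sanity check: the partition metadata file that Step 2 refers to should exist.
ls data/
cat data/ogb-product.json
```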
### Step 2: copy the partitioned data and files to the cluster
DGL provides a script for copying partitioned data and files to the cluster. Before that, copy the training scripts to a local folder:
```bash
mkdir ~/dgl_code
cp ~/dgl/examples/pytorch/graphsage/experimental/train_dist.py ~/dgl_code
cp ~/dgl/examples/pytorch/graphsage/experimental/train_dist_unsupervised.py ~/dgl_code
```
The command below copies the partition data, the IP configuration file, as well as the training scripts to the machines in the cluster. The configuration of the cluster is defined by `ip_config.txt`. The data is copied to `~/graphsage/ogb-product` on each of the remote machines, and the training scripts are copied to `~/graphsage/dgl_code`. `--part_config` specifies the location of the partitioned data on the local machine (a user only needs to specify
the location of the partition configuration file).
```bash
python3 ~/dgl/tools/copy_files.py \
--ip_config ip_config.txt \
--workspace ~/graphsage \
--rel_data_path ogb-product \
--part_config data/ogb-product.json \
--script_folder ~/dgl_code
```
After running this command, users can find a folder called `graphsage` on each machine. The folder contains `ip_config.txt`, `dgl_code`, and `ogb-product` inside.
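To double-check, one can ssh into any node listed in `ip_config.txt` and inspect the workspace; the address below is just one of the example IPs from Step 0:
```bash
# Hypothetical check on one of the example addresses from Step 0: the workspace
# should contain the IP config, the training scripts and the partition data.
ssh 172.31.23.205 "ls ~/graphsage ~/graphsage/dgl_code ~/graphsage/ogb-product"
```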
### Step 3: Launch distributed jobs
...@@ -50,7 +69,7 @@ python3 ~/dgl/tools/launch.py \
--num_servers 1 \
--part_config ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"python3 dgl_code/train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 4"
```
To run unsupervised training:
...@@ -62,7 +81,7 @@ python3 ~/dgl/tools/launch.py \
--num_servers 1 \
--part_config ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"python3 dgl_code/train_dist_unsupervised.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 3 --batch_size 1000"
```
## Distributed code runs in the standalone mode
......
172.31.19.1
172.31.23.205
172.31.29.175
172.31.16.98
\ No newline at end of file
...@@ -20,6 +20,8 @@ if __name__ == '__main__':
help='turn the graph into an undirected graph.')
argparser.add_argument('--balance_edges', action='store_true',
help='balance the number of edges in each partition.')
argparser.add_argument('--output', type=str, default='data',
help='Output path of partitioned graph.')
args = argparser.parse_args()
start = time.time()
...@@ -45,7 +47,7 @@ if __name__ == '__main__':
sym_g.ndata[key] = g.ndata[key]
g = sym_g
dgl.distributed.partition_graph(g, args.dataset, args.num_parts, args.output,
part_method=args.part_method,
balance_ntypes=balance_ntypes,
balance_edges=args.balance_edges)
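The new `--output` flag above makes the partition directory configurable. As a hedged usage sketch (not part of this commit), the Step 1 command could be rerun with a custom output folder:
```bash
# Hypothetical invocation: same as Step 1, but writing the partitions to a custom
# folder; Step 2's --part_config would then point at 4part_data/ogb-product.json.
python3 partition_graph.py --dataset ogb-product --num_parts 4 \
    --balance_train --balance_edges --output 4part_data
```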
...@@ -28,6 +28,8 @@ def main():
help='Relative path in workspace to store the partition data.')
parser.add_argument('--part_config', type=str, required=True,
help='The partition config file. The path is on the local machine.')
parser.add_argument('--script_folder', type=str, required=True,
help='The folder that contains all the user code scripts.')
parser.add_argument('--ip_config', type=str, required=True,
help='The file of IP configuration for servers. \
The path is on the local machine.')
...@@ -89,6 +91,8 @@ def main():
copy_file(part_files['node_feats'], ip, remote_path)
copy_file(part_files['edge_feats'], ip, remote_path)
copy_file(part_files['part_graph'], ip, remote_path)
# copy script folder
copy_file(args.script_folder, ip, args.workspace)
def signal_handler(signal, frame):
......