README.md 4.88 KB
Newer Older
1
2
## Distributed training

Chao Ma's avatar
Chao Ma committed
3
4
5
6
7
8
9
10
This is an example of training GraphSage in a distributed fashion. Before training, please install some python libs by pip:

```bash
sudo pip3 install ogb
sudo pip3 install pyinstrument
```

To train GraphSage, it has five steps:
11
12
13
14
15
16
17
18
19
20
21
22
23
24

### Step 0: set IP configuration file.

User need to set their own IP configuration file before training. For example, if we have four machines in current cluster, the IP configuration
could like this:

```bash
172.31.19.1
172.31.23.205
172.31.29.175
172.31.16.98
```

Users need to make sure that the master node (node-0) has right permission to ssh to all the other nodes.
25
26
27
28
29
30
31
32
33
34
35

### Step 1: partition the graph.

The example provides a script to partition some builtin graphs such as Reddit and OGB product graph.
If we want to train GraphSage on 4 machines, we need to partition the graph into 4 parts.

We need to load some function from the parent directory.
```bash
export PYTHONPATH=$PYTHONPATH:..
```

36
In this example, we partition the OGB product graph into 4 parts with Metis on node-0. The partitions are balanced with respect to
37
38
39
40
41
the number of nodes, the number of edges and the number of labelled nodes.
```bash
python3 partition_graph.py --dataset ogb-product --num_parts 4 --balance_train --balance_edges
```

42
43
This script generates partitioned graphs and store them in the directory called `data`.

44
### Step 2: copy the partitioned data and files to the cluster
45

46
47
48
49
50
51
52
53
54
DGL provides a script for copying partitioned data and files to the cluster. Before that, copy the training script to a local folder:

```bash
mkdir ~/dgl_code
cp ~/dgl/examples/pytorch/graphsage/experimental/train_dist.py ~/dgl_code
cp ~/dgl/examples/pytorch/graphsage/experimental/train_dist_unsupervised.py ~/dgl_code
```

The command below copies partition data, ip config file, as well as training scripts to the machines in the cluster. The configuration of the cluster is defined by `ip_config.txt`, The data is copied to `~/graphsage/ogb-product` on each of the remote machines. The training script is copied to `~/graphsage/dgl_code` on each of the remote machines. `--part_config` specifies the location of the partitioned data in the local machine (a user only needs to specify
55
the location of the partition configuration file).
56

57
```bash
58
python3 ~/dgl/tools/copy_files.py \
59
60
61
--ip_config ip_config.txt \
--workspace ~/graphsage \
--rel_data_path ogb-product \
62
63
--part_config data/ogb-product.json \
--script_folder ~/dgl_code
64
```
65

66
After runing this command, user can find a folder called ``graphsage`` on each machine. The folder contains ``ip_config.txt``, ``dgl_code``, and ``ogb-product`` inside.
67
68

### Step 3: Launch distributed jobs
69

70
71
DGL provides a script to launch the training job in the cluster. `part_config` and `ip_config`
specify relative paths to the path of the workspace.
72
73

```bash
74
python3 ~/dgl/tools/launch.py \
75
--workspace ~/graphsage/ \
76
77
--num_trainers 1 \
--num_samplers 4 \
78
--num_servers 1 \
79
--part_config ogb-product/ogb-product.json \
80
--ip_config ip_config.txt \
81
"python3 dgl_code/train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 4"
82
```
83

84
85
86
87
To run unsupervised training:

```bash
python3 ~/dgl/tools/launch.py \
88
--workspace ~/graphsage/ \
89
--num_trainers 1 \
90
91
92
93
94
95
96
97
98
99
100
101
102
103
--num_samplers 4 \
--num_servers 1 \
--part_config ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"python3 dgl_code/train_dist_unsupervised.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 3 --batch_size 1000 --num_workers 4"
```

By default, this code will run on CPU. If you have GPU support, you can just add a `--num_gpus` argument in user command:

```bash
python3 ~/dgl/tools/launch.py \
--workspace ~/graphsage/ \
--num_trainers 4 \
--num_samplers 4 \
104
--num_servers 1 \
105
--part_config ogb-product/ogb-product.json \
106
--ip_config ip_config.txt \
107
"python3 dgl_code/train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 4 --num_gpus 4"
108
109
```

110

111
112
113
114
115
116
117
118
119
120
121
122
123
## Distributed code runs in the standalone mode

The standalone mode is mainly used for development and testing. The procedure to run the code is much simpler.

### Step 1: graph construction.

When testing the standalone mode of the training script, we should construct a graph with one partition.
```bash
python3 partition_graph.py --dataset ogb-product --num_parts 1
```

### Step 2: run the training script

124
125
To run supervised training:

126
```bash
127
python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --part_config data/ogb-product.json --standalone
128
129
```

130
131
132
To run unsupervised training:

```bash
133
python3 train_dist_unsupervised.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --part_config data/ogb-product.json --standalone
134
135
```

136
Note: please ensure that all environment variables shown above are unset if they were set for testing distributed training.