## Distributed training

This is an example of training GraphSage in a distributed fashion. Before training, install the required Python libraries with pip:

```bash
sudo pip3 install ogb
```

**Requires PyTorch 1.10.0+ to work.**
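
A quick way to verify the installed version (assuming `python3` is the interpreter you will train with):

```bash
# Print the PyTorch version; it should be 1.10.0 or newer.
python3 -c "import torch; print(torch.__version__)"
```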

Training GraphSage in this distributed setting involves the following steps:

### Step 0: Set up a distributed file system
* You may skip this step if your cluster already has folder(s) synchronized across machines.

To perform distributed training, files and code need to be accessible across multiple machines. A distributed file system (e.g., NFS or Ceph) handles this job well.

#### Server-side setup
Here is an example of how to set up NFS. First, install the NFS server package on the storage server:
```bash
sudo apt-get install nfs-kernel-server
```

Below we assume the user account is `ubuntu`, and we create a directory named `workspace` in its home directory.
```bash
mkdir -p /home/ubuntu/workspace
```

We assume that all machines are on a subnet with the IP range `192.168.0.0` to `192.168.255.255`. The exports configuration needs to be modified to:

```bash
sudo vim /etc/exports
# add the following line
/home/ubuntu/workspace  192.168.0.0/16(rw,sync,no_subtree_check)
```

The server's internal IP can be checked via `ifconfig` or `ip`. If the IP does not begin with `192.168`, use the entry that matches your private address range instead:
```bash
/home/ubuntu/workspace  10.0.0.0/8(rw,sync,no_subtree_check)
/home/ubuntu/workspace  172.16.0.0/12(rw,sync,no_subtree_check)
```

Then restart NFS; the server-side setup is finished.

```bash
sudo systemctl restart nfs-kernel-server
```
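
You can verify that the export is active (a quick sanity check; `exportfs` is part of `nfs-kernel-server`):

```bash
# List active exports; /home/ubuntu/workspace should appear in the output.
sudo exportfs -v
```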

For configuration details, please refer to [NFS ArchWiki](https://wiki.archlinux.org/index.php/NFS).

#### Client-side setup

To use NFS, clients also need to install the client package:

```bash
sudo apt-get install nfs-common
```

You can either mount the NFS share manually

```bash
mkdir -p /home/ubuntu/workspace
sudo mount -t nfs <nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace
```

or edit `/etc/fstab` so the folder is mounted automatically:

```bash
# vim /etc/fstab
## append the following line to the file
<nfs-server-ip>:/home/ubuntu/workspace   /home/ubuntu/workspace   nfs   defaults	0 0
```

Then run `mount -a`.
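
You can confirm that the share is mounted (a quick sanity check):

```bash
# The filesystem column should show <nfs-server-ip>:/home/ubuntu/workspace.
df -h /home/ubuntu/workspace
```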

Now go to `/home/ubuntu/workspace` and clone the DGL GitHub repository.
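
For example (`--recursive` also fetches DGL's third-party submodules):

```bash
cd /home/ubuntu/workspace
git clone --recursive https://github.com/dmlc/dgl.git
```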

### Step 1: Set the IP configuration file

Users need to set up their own IP configuration file `ip_config.txt` before training. For example, if the cluster has four machines, the IP configuration could look like this:

```bash
172.31.19.1
172.31.23.205
172.31.29.175
172.31.16.98
```

Users need to make sure that the master node (node-0) can SSH to all the other nodes without password authentication.
[This link](https://linuxize.com/post/how-to-setup-passwordless-ssh-login/) provides instructions for setting up passwordless SSH login.
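
A minimal sketch of that setup, assuming the `ubuntu` account from Step 0 and that `ip_config.txt` is already present on node-0:

```bash
# Generate a key pair on node-0 if one does not already exist.
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""

# Copy the public key to every machine in the cluster
# (you will be prompted for each machine's password once).
for ip in $(cat ip_config.txt); do
    ssh-copy-id -i ~/.ssh/id_rsa.pub ubuntu@$ip
done
```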

### Step 2: Partition the graph

The example provides a script to partition some built-in graphs, such as Reddit and the OGB product graph.
If we want to train GraphSage on 4 machines, we need to partition the graph into 4 parts.

We need to load some functions from the parent directory, so add it to `PYTHONPATH`:
```bash
export PYTHONPATH=$PYTHONPATH:..
```

In this example, we partition the OGB product graph into 4 parts with Metis on node-0. The partitions are balanced with respect to the number of nodes, the number of edges, and the number of labelled nodes.
```bash
python3 partition_graph.py --dataset ogb-product --num_parts 4 --balance_train --balance_edges
```

This script generates the partitions and stores them in a directory called `data`.
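
After it finishes, the `data` directory should look roughly like this, with one `partN` folder per partition plus the metadata file that `--part_config` points to:

```bash
ls data
# ogb-product.json  part0  part1  part2  part3
```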

### Step 3: Launch the distributed jobs

DGL provides a script to launch the training job on the cluster. `part_config` and `ip_config` are paths relative to the workspace.

The command below launches one process per machine for both sampling and training.

```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
```

To run unsupervised training:

```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
--graph_format csc,coo \
"python3 train_dist_unsupervised.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000"
```

By default, this code runs on CPU. If your machines have GPUs, add a `--num_gpus` argument to the user command; note that `--num_trainers` is set to match the number of GPUs per machine:

```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 30 --batch_size 1000 --num_gpus 4"
```

To run supervised training in the transductive setting (nodes are initialized with node embeddings):
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist_transductive.py --graph_name ogb-product --ip_config ip_config.txt --batch_size 1000 --num_gpus 4 --eval_every 5"
```

To run supervised training in the transductive setting using DGL's distributed `DistEmbedding`:
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist_transductive.py --graph_name ogb-product --ip_config ip_config.txt --batch_size 1000 --num_gpus 4 --eval_every 5 --dgl_sparse"
```

To run unsupervised training in the transductive setting (nodes are initialized with node embeddings):
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
--graph_format csc,coo \
"python3 train_dist_unsupervised_transductive.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --num_gpus 4"
```

To run unsupervised training in the transductive setting using DGL's distributed `DistEmbedding`:
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
--graph_format csc,coo \
"python3 train_dist_unsupervised_transductive.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --num_gpus 4 --dgl_sparse"
```

**Note:** If you are using conda or other virtual environments on the remote machines, you need to replace `python3` in the command string (i.e. the last argument) with the path to the Python interpreter in that environment.
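
For example (the interpreter path below is hypothetical; substitute the one from your own environment):

```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"/home/ubuntu/miniconda3/envs/dgl/bin/python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
```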

## Running the distributed code in standalone mode

The standalone mode is mainly used for development and testing. The procedure to run the code is much simpler.

### Step 1: Graph construction

When testing the standalone mode of the training script, we should construct a graph with one partition.
```bash
python3 partition_graph.py --dataset ogb-product --num_parts 1
```

### Step 2: Run the training script

To run supervised training:

```bash
python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --part_config data/ogb-product.json --standalone
```

To run unsupervised training:

```bash
python3 train_dist_unsupervised.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --part_config data/ogb-product.json --standalone
```

**Note:** Please ensure that all environment variables shown above are unset if they were set for testing distributed training.