## DistGNN vertex-cut based graph partitioning (using Libra)

### How to run graph partitioning
```bash
python ../../../../python/dgl/distgnn/partition/main_Libra.py <dataset> <#partitions>
```

Example: the following command creates 4 partitions of the Pubmed graph:
```bash
python ../../../../python/dgl/distgnn/partition/main_Libra.py pubmed 4
```
 
The output partitions are written to the `Libra_result_<dataset>/` folder in the current directory.  
The *upcoming DistGNN* application can directly use these partitions for distributed training.  

### How Libra partitioning works
Libra is a vertex-cut based graph partitioning method. It applies greedy heuristics to uniquely distribute the input graph edges among the partitions, producing each partition as a list of edges. After Libra finishes, the script `main_Libra.py` converts the Libra output to the DGL/DistGNN input format.


Note: the current Libra implementation is sequential. Converting the partitioned graph to the DGL/DistGNN format adds extra overhead.


### Expected partitioning timings
* Cora, Pubmed, Citeseer: < 10 sec (< 10 GB)
* Reddit: 1500 sec (~25 GB)
* OGBN-Products: ~2000 sec (~30 GB)
* Proteins: 18000 sec (~100 GB; format conversion from the public data takes extra time)
* OGBN-Paper100M: 25000 sec (~200 GB)


### Settings
Tested with:
* CentOS 7.6
* gcc 8.3.0
* PyTorch 1.7.1
* Python 3.7.10



## Distributed training

This is an example of training GraphSage in a distributed fashion. Before training, please install the required Python packages with pip:

```bash
sudo pip3 install ogb
```

Training GraphSage in this distributed setup consists of the following steps:

### Step 0: Set up a Distributed File System
* You may skip this step if your cluster already has folder(s) synchronized across machines.

To perform distributed training, files and code need to be accessible across multiple machines. A distributed file system (e.g., NFS, Ceph) handles this well.

#### Server-side setup
Here is an example of how to set up NFS. First, install the required packages on the storage server:
```bash
sudo apt-get install nfs-kernel-server
```

Below we assume the user account is `ubuntu` and we create a `workspace` directory in the home directory.
```bash
mkdir -p /home/ubuntu/workspace
```

We assume that all servers are in a subnet with the IP range `192.168.0.0` to `192.168.255.255`. The exports configuration needs to be modified as follows:

```bash
sudo vim /etc/exports
# add the following line
/home/ubuntu/workspace  192.168.0.0/16(rw,sync,no_subtree_check)
```

The server's internal IP address can be checked via `ifconfig` or `ip`. If the IP does not begin with `192.168`, you may use:
```bash
/home/ubuntu/workspace  10.0.0.0/8(rw,sync,no_subtree_check)
/home/ubuntu/workspace  172.16.0.0/12(rw,sync,no_subtree_check)
```

Then restart NFS; the server-side setup is finished.

```bash
sudo systemctl restart nfs-kernel-server
```
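
As an optional sanity check (not required for the rest of this guide), you can list the active exports on the server:

```bash
# Show the directories currently exported by this NFS server and their options
sudo exportfs -v
```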

For configuration details, please refer to the [NFS ArchWiki](https://wiki.archlinux.org/index.php/NFS).

#### Client-side setup

To use NFS, clients also need to install the required packages:

```bash
sudo apt-get install nfs-common
```

You can either mount the NFS share manually:

```bash
mkdir -p /home/ubuntu/workspace
sudo mount -t nfs <nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace
```

or edit `/etc/fstab` so the folder is mounted automatically:

```bash
# vim /etc/fstab
## append the following line to the file
<nfs-server-ip>:/home/ubuntu/workspace   /home/ubuntu/workspace   nfs   defaults	0 0
```

Then run `sudo mount -a`.
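
As an optional check, you can confirm on the client that the folder is indeed served over NFS:

```bash
# The filesystem type of the workspace should show up as nfs/nfs4
df -hT /home/ubuntu/workspace
```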

Now go to `/home/ubuntu/workspace` and clone the DGL GitHub repository.

### Step 1: Set the IP configuration file

Users need to set their own IP configuration file `ip_config.txt` before training. For example, if the current cluster has four machines, the IP configuration could look like this:

```bash
172.31.19.1
172.31.23.205
172.31.29.175
172.31.16.98
```

Users need to make sure that the master node (node-0) can SSH to all the other nodes without password authentication.
[This link](https://linuxize.com/post/how-to-setup-passwordless-ssh-login/) provides instructions for setting up passwordless SSH login.
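
For reference, a minimal passwordless SSH setup on node-0 might look like the sketch below, assuming the `ubuntu` account and the example IPs above (adjust the key type, account, and addresses to your cluster):

```bash
# Generate a key pair on node-0 (skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
# Copy the public key to every other node listed in ip_config.txt
for ip in 172.31.23.205 172.31.29.175 172.31.16.98; do
    ssh-copy-id ubuntu@$ip
done
```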

### Step 2: Partition the graph

The example provides a script to partition some built-in graphs, such as the Reddit and OGB product graphs.
If we want to train GraphSage on 4 machines, we need to partition the graph into 4 parts.

We need to load some functions from the parent directory:
```bash
export PYTHONPATH=$PYTHONPATH:..
```

In this example, we partition the OGB product graph into 4 parts with METIS on node-0. The partitions are balanced with respect to the number of nodes, the number of edges, and the number of labeled nodes.
```bash
python3 partition_graph.py --dataset ogb-product --num_parts 4 --balance_train --balance_edges
```

This script generates the partitions and stores them in a directory called `data`.
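
The resulting layout typically looks like the listing below (one `partN` folder per partition plus the JSON metadata file passed later via `--part_config`; the exact contents may vary across DGL versions):

```bash
ls data/
# ogb-product.json  part0  part1  part2  part3
```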


### Step 3: Launch distributed jobs

DGL provides a script to launch the training job in the cluster. `part_config` and `ip_config` specify paths relative to the workspace.

The command below launches one process per machine for both sampling and training.

```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
```

To run unsupervised training:

```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
--graph_format csc,coo \
"python3 train_dist_unsupervised.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000"
```

By default, this code runs on CPU. If you have GPU support, you can add the `--num_gpus` argument to the user command:

```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 30 --batch_size 1000 --num_gpus 4"
```

To run supervised training in the transductive setting (nodes are initialized with node embeddings):
```bash
python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist_transductive.py --graph_name ogb-product --ip_config ip_config.txt --batch_size 1000 --num_gpu 4 --eval_every 5"
```

To run supervised training in the transductive setting using DGL's distributed DistEmbedding:
```bash
python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist_transductive.py --graph_name ogb-product --ip_config ip_config.txt --batch_size 1000 --num_gpu 4 --eval_every 5 --dgl_sparse"
```

To run unsupervised training in the transductive setting (nodes are initialized with node embeddings):
```bash
python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
--graph_format csc,coo \
"python3 train_dist_unsupervised_transductive.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --num_gpus 4"
```

To run unsupervised training in the transductive setting using DGL's distributed DistEmbedding:
```bash
python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
--graph_format csc,coo \
"python3 train_dist_unsupervised_transductive.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --num_gpus 4 --dgl_sparse"
```

**Note:** if you are using conda or other virtual environments on the remote machines, you need to replace `python3` in the command string (i.e. the last argument) with the path to the Python interpreter in that environment.
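
For example, with a hypothetical conda environment located at `/home/ubuntu/miniconda3/envs/dgl-env`, the first supervised launch command above would become (the interpreter path is purely illustrative; use the output of `which python3` from inside your own environment):

```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"/home/ubuntu/miniconda3/envs/dgl-env/bin/python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
```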

## Running the distributed code in standalone mode

The standalone mode is mainly used for development and testing. The procedure to run the code is much simpler.

### Step 1: Graph construction

When testing the standalone mode of the training script, we should construct a graph with one partition.
```bash
python3 partition_graph.py --dataset ogb-product --num_parts 1
```

### Step 2: Run the training script

To run supervised training:

```bash
python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --part_config data/ogb-product.json --standalone
```

To run unsupervised training:

```bash
python3 train_dist_unsupervised.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --part_config data/ogb-product.json --standalone
```

Note: please ensure that all environment variables shown above are unset if they were set for testing distributed training.