"...git@developer.sourcefind.cn:renzhc/diffusers_dcu.git" did not exist on "417927f554e748c99fa9f7d6d637934ac331ea40"
Unverified Commit 89e49439 authored by Rhett Ying's avatar Rhett Ying Committed by GitHub
Browse files

[DistGB] Enable GraphBolt for node classification on heterograph (#7198)

parent 1ec0092e
## Distributed training
This is an example of training RGCN for node classification in a distributed fashion. Currently, the example trains RGCN on graphs with input node features.
Before training, install the required Python libraries with pip:
```bash
pip3 install ogb pyarrow
```
Training RGCN consists of four steps:
### Step 0: Setup a Distributed File System
* You may skip this step if your cluster already has folder(s) synchronized across machines.
To perform distributed training, files and code need to be accessible across multiple machines. A distributed file system (e.g., NFS, Ceph) handles this well.
#### Server side setup
Here is an example of how to set up NFS. First, install the NFS server package on the storage server:
```bash
sudo apt-get install nfs-kernel-server
```
Below we assume the user account is `ubuntu`, and we create a `workspace` directory in the home directory.
```bash
mkdir -p /home/ubuntu/workspace
```
We assume that all servers are in a subnet with IP range `192.168.0.0` to `192.168.255.255`. The exports configuration needs to be modified as follows:
```bash
sudo vim /etc/exports
# add the following line
/home/ubuntu/workspace 192.168.0.0/16(rw,sync,no_subtree_check)
```
The server's internal IP can be checked via `ifconfig` or `ip`. If the IP does not begin with `192.168`, you may instead use
```bash
# for ip range 10.0.0.0 - 10.255.255.255
/home/ubuntu/workspace 10.0.0.0/8(rw,sync,no_subtree_check)
# for ip range 172.16.0.0 - 172.31.255.255
/home/ubuntu/workspace 172.16.0.0/12(rw,sync,no_subtree_check)
```
Then restart NFS; the server-side setup is finished.
```bash
sudo systemctl restart nfs-kernel-server
```
For configuration details, please refer to [NFS ArchWiki](https://wiki.archlinux.org/index.php/NFS).
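After restarting, you can optionally double-check that the directory is actually exported. A quick sanity check (assuming the setup above):
```bash
# Re-export everything listed in /etc/exports without restarting the service.
sudo exportfs -ra
# List the directories currently exported by this server;
# /home/ubuntu/workspace should appear in the output.
showmount -e localhost
```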
#### Client side setup
To use NFS, clients also need to install the NFS client package
```bash
sudo apt-get install nfs-common
```
You can either mount the NFS manually
```bash
mkdir -p /home/ubuntu/workspace
sudo mount -t nfs <nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace
```
or edit the fstab so the folder will be mounted automatically
```bash
# vim /etc/fstab
## append the following line to the file
<nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace nfs defaults 0 0
```
Then run `sudo mount -a`.
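To confirm that the share is mounted correctly on the client, a quick check such as the following should report the NFS server as the source of the mount point:
```bash
# Show the filesystem type and source backing the workspace directory.
df -hT /home/ubuntu/workspace
# Alternatively, query the mount table directly.
findmnt /home/ubuntu/workspace
```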
Now go to `/home/ubuntu/workspace` and clone the DGL GitHub repository.
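For example (the URL below is the public DGL repository; clone whichever fork or branch you actually use):
```bash
cd /home/ubuntu/workspace
# DGL uses git submodules, so clone recursively.
git clone --recursive https://github.com/dmlc/dgl.git
```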
### Step 1: Set the IP configuration file.
Users need to set up their own IP configuration file `ip_config.txt` before training. For example, if there are two machines in the cluster, the IP configuration file could look like this:
```bash
172.31.0.1
172.31.0.2
```
Users need to make sure that the master node (node-0) can SSH to all the other nodes without password authentication.
[This link](https://linuxize.com/post/how-to-setup-passwordless-ssh-login/) provides instructions for setting up passwordless SSH login.
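A minimal sketch of the setup from node-0, assuming the login user is `ubuntu` and the node IPs match `ip_config.txt`:
```bash
# On node-0: generate a key pair if one does not already exist (accept the defaults).
ssh-keygen -t rsa
# Copy the public key to every other node listed in ip_config.txt.
ssh-copy-id ubuntu@172.31.0.2
# Verify that login now works without a password prompt.
ssh ubuntu@172.31.0.2 hostname
```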
### Step 2: partition the graph.
The example provides a script to partition some builtin graphs, such as the ogbn-mag graph.
If we want to train RGCN on 2 machines, we need to partition the graph into 2 parts.
In this example, we partition the ogbn-mag graph into 2 parts with METIS. The partitions are balanced with respect to the number of nodes, the number of edges, and the number of labelled nodes.
```bash
python3 partition_graph.py --dataset ogbn-mag --num_parts 2 --balance_train --balance_edges
```
If we want to train RGCN with `GraphBolt`, we need to append `--use_graphbolt` to generate partitions in `GraphBolt` format.
```bash
python3 partition_graph.py --dataset ogbn-mag --num_parts 2 --balance_train --balance_edges --use_graphbolt
```
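The partitions are written to the `--output` directory (`data/` by default). Roughly, the layout looks like the sketch below; exact file names can vary across DGL versions, but `ogbn-mag.json` is the metadata file passed to `--part_config` in the next step:
```bash
tree data/
# data/
# ├── ogbn-mag.json                # partition metadata, passed as --part_config
# ├── part0/
# │   ├── graph.dgl                # fused_csc_sampling_graph.pt when --use_graphbolt is set
# │   ├── node_feat.dgl
# │   └── edge_feat.dgl
# └── part1/
#     └── ...
```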
### Step 3: Launch distributed jobs
DGL provides a script to launch the training job in the cluster. `part_config` and `ip_config`
are specified as paths relative to the workspace.
The command below launches 4 training processes on each machine, since we'd like to utilize 4 GPUs for training.
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/rgcn/experimental/ \
--num_trainers 4 \
--num_servers 2 \
--num_samplers 0 \
--part_config data/ogbn-mag.json \
--ip_config ip_config.txt \
"python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 1024 --n-hidden 64 --lr 0.01 --eval-batch-size 1024 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num_gpus 4"
```
If we want to train RGCN with `GraphBolt`, we need to append `--use_graphbolt`.
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/rgcn/experimental/ \
--num_trainers 4 \
--num_servers 2 \
--num_samplers 0 \
--part_config data/ogbn-mag.json \
--ip_config ip_config.txt \
"python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 1024 --n-hidden 64 --lr 0.01 --eval-batch-size 1024 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num_gpus 4 --use_graphbolt"
```
**Note:** If you are using conda or another virtual environment on the remote machines, replace `python3` in the command string (i.e., the last argument) with the path to the Python interpreter in that environment.
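For example, you can find the interpreter path on a remote machine and substitute it for `python3` inside the quoted command (the path below is only a placeholder):
```bash
# On a remote machine, with the target environment activated:
which python3
# e.g. /home/ubuntu/miniconda3/envs/dgl/bin/python3  (placeholder path)
```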
## Comparison between `DGL` and `GraphBolt`
### Partition sizes
Compared to `DGL`, `GraphBolt` partition sizes are reduced to **19%** of the original for `ogbn-mag`.
`ogbn-mag`
| Data Formats | File Name | Part 0 | Part 1 |
| ------------ | ---------------------------- | ------ | ------ |
| DGL | graph.dgl | 714MB | 716MB |
| GraphBolt | fused_csc_sampling_graph.pt | 137MB | 136MB |
### Performance
Compared to `DGL`, `GraphBolt`'s sampler runs faster (sampling time is reduced to **16%** on `ogbn-mag`). `Min` and `Max` are statistics across all trainers on all nodes (machines).
As for RAM usage, the shared memory usage (measured by the **shared** field of the `free` command) decreases due to the smaller graph partitions in `GraphBolt`. The peak memory used by processes (measured by the **used** field of the `free` command) decreases as well.
`ogbn-mag`
| Data Formats | Sample Time Per Epoch (CPU) | Test Accuracy (3 epochs) | shared | used (peak) |
| ------------ | --------------------------- | ------------------------- | ----- | ---- |
| DGL | Min: 48.2s, Max: 91.4s | 42.76% | 1.3GB | 9.2GB|
| GraphBolt | Min: 9.2s, Max: 11.9s | 42.46% | 742MB | 5.9GB|
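The memory numbers above can be reproduced by sampling `free` on each machine while training runs; a minimal sketch:
```bash
# Log the output of `free` (including the "shared" and "used" columns) once per second.
while true; do
    free -m >> free_log.txt
    sleep 1
done
```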
Below is the `partition_graph.py` script referenced in Step 2:
```python
import argparse
import time

import dgl
import numpy as np
import torch as th
from ogb.nodeproppred import DglNodePropPredDataset


def load_ogb(dataset):
    if dataset == "ogbn-mag":
        dataset = DglNodePropPredDataset(name=dataset)
        split_idx = dataset.get_idx_split()
        train_idx = split_idx["train"]["paper"]
        val_idx = split_idx["valid"]["paper"]
        test_idx = split_idx["test"]["paper"]
        hg_orig, labels = dataset[0]

        # Add a reverse edge type for every canonical edge type so that
        # messages can flow in both directions.
        subgs = {}
        for etype in hg_orig.canonical_etypes:
            u, v = hg_orig.all_edges(etype=etype)
            subgs[etype] = (u, v)
            subgs[(etype[2], "rev-" + etype[1], etype[0])] = (v, u)
        hg = dgl.heterograph(subgs)
        hg.nodes["paper"].data["feat"] = hg_orig.nodes["paper"].data["feat"]
        paper_labels = labels["paper"].squeeze()

        num_rels = len(hg.canonical_etypes)
        num_of_ntype = len(hg.ntypes)
        num_classes = dataset.num_classes
        category = "paper"
        print("Number of relations: {}".format(num_rels))
        print("Number of classes: {}".format(num_classes))
        print("Number of train: {}".format(len(train_idx)))
        print("Number of valid: {}".format(len(val_idx)))
        print("Number of test: {}".format(len(test_idx)))

        # get target category id
        category_id = len(hg.ntypes)
        for i, ntype in enumerate(hg.ntypes):
            if ntype == category:
                category_id = i

        # Store the train/val/test splits as boolean node masks on "paper" nodes.
        train_mask = th.zeros((hg.num_nodes("paper"),), dtype=th.bool)
        train_mask[train_idx] = True
        val_mask = th.zeros((hg.num_nodes("paper"),), dtype=th.bool)
        val_mask[val_idx] = True
        test_mask = th.zeros((hg.num_nodes("paper"),), dtype=th.bool)
        test_mask[test_idx] = True
        hg.nodes["paper"].data["train_mask"] = train_mask
        hg.nodes["paper"].data["val_mask"] = val_mask
        hg.nodes["paper"].data["test_mask"] = test_mask
        hg.nodes["paper"].data["labels"] = paper_labels
        return hg
    else:
        raise RuntimeError("Do not support other ogbn datasets.")


if __name__ == "__main__":
    argparser = argparse.ArgumentParser("Partition builtin graphs")
    argparser.add_argument(
        "--dataset", type=str, default="ogbn-mag", help="datasets: ogbn-mag"
    )
    argparser.add_argument(
        "--num_parts", type=int, default=4, help="number of partitions"
    )
    argparser.add_argument(
        "--part_method", type=str, default="metis", help="the partition method"
    )
    argparser.add_argument(
        "--balance_train",
        action="store_true",
        help="balance the training size in each partition.",
    )
    argparser.add_argument(
        "--undirected",
        action="store_true",
        help="turn the graph into an undirected graph.",
    )
    argparser.add_argument(
        "--balance_edges",
        action="store_true",
        help="balance the number of edges in each partition.",
    )
    argparser.add_argument(
        "--num_trainers_per_machine",
        type=int,
        default=1,
        help="the number of trainers per machine. The trainer ids are stored "
        "in the node feature 'trainer_id'",
    )
    argparser.add_argument(
        "--output",
        type=str,
        default="data",
        help="Output path of partitioned graph.",
    )
    argparser.add_argument(
        "--use_graphbolt",
        action="store_true",
        help="Use GraphBolt for distributed train.",
    )
    args = argparser.parse_args()

    start = time.time()
    g = load_ogb(args.dataset)
    print(
        "load {} takes {:.3f} seconds".format(args.dataset, time.time() - start)
    )
    print("|V|={}, |E|={}".format(g.num_nodes(), g.num_edges()))
    print(
        "train: {}, valid: {}, test: {}".format(
            th.sum(g.nodes["paper"].data["train_mask"]),
            th.sum(g.nodes["paper"].data["val_mask"]),
            th.sum(g.nodes["paper"].data["test_mask"]),
        )
    )

    # Balance the number of labelled "paper" nodes across partitions if requested.
    if args.balance_train:
        balance_ntypes = {"paper": g.nodes["paper"].data["train_mask"]}
    else:
        balance_ntypes = None

    dgl.distributed.partition_graph(
        g,
        args.dataset,
        args.num_parts,
        args.output,
        part_method=args.part_method,
        balance_ntypes=balance_ntypes,
        balance_edges=args.balance_edges,
        num_trainers_per_machine=args.num_trainers_per_machine,
        use_graphbolt=args.use_graphbolt,
    )
```