"...git@developer.sourcefind.cn:renzhc/diffusers_dcu.git" did not exist on "417927f554e748c99fa9f7d6d637934ac331ea40"
Unverified Commit 89e49439 authored by Rhett Ying's avatar Rhett Ying Committed by GitHub
Browse files

[DistGB] Enable GraphBolt for node classification on heterograph (#7198)

parent 1ec0092e
## Distributed training
This is an example of training RGCN for node classification in a distributed fashion. Currently, the example trains RGCN on graphs with input node features.
Before training, install the required Python libraries with pip:
```bash
pip3 install ogb pyarrow
```
Training RGCN consists of four steps:
### Step 0: Setup a Distributed File System
* You may skip this step if your cluster already has folder(s) synchronized across machines.
To perform distributed training, files and code need to be accessible across multiple machines. A distributed file system (e.g., NFS, Ceph) handles this well.
#### Server side setup
Here is an example of how to set up NFS. First, install the NFS server package on the storage server:
```bash
sudo apt-get install nfs-kernel-server
```
Below we assume the user account is `ubuntu`, and we create a `workspace` directory in the home directory.
```bash
mkdir -p /home/ubuntu/workspace
```
We assume that all servers are in a subnet with IP range `192.168.0.0` to `192.168.255.255`. The exports configuration needs to be modified as follows:
```bash
sudo vim /etc/exports
# add the following line
/home/ubuntu/workspace 192.168.0.0/16(rw,sync,no_subtree_check)
```
The server's internal IP can be checked via `ifconfig` or `ip`. If the IP does not begin with `192.168`, you may instead use
```bash
# for ip range 10.0.0.0 - 10.255.255.255
/home/ubuntu/workspace 10.0.0.0/8(rw,sync,no_subtree_check)
# for ip range 172.16.0.0 - 172.31.255.255
/home/ubuntu/workspace 172.16.0.0/12(rw,sync,no_subtree_check)
```
Then restart NFS; the server-side setup is finished.
```bash
sudo systemctl restart nfs-kernel-server
```
For configuration details, please refer to [NFS ArchWiki](https://wiki.archlinux.org/index.php/NFS).
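After restarting, you can optionally double-check that the directory is actually exported. A quick sanity check (assuming the setup above):
```bash
# Re-export everything listed in /etc/exports without restarting the service.
sudo exportfs -ra
# List the directories currently exported by this server;
# /home/ubuntu/workspace should appear in the output.
showmount -e localhost
```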
#### Client side setup
To use NFS, clients also need to install the NFS client package
```bash
sudo apt-get install nfs-common
```
You can either mount the NFS manually
```bash
mkdir -p /home/ubuntu/workspace
sudo mount -t nfs <nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace
```
or edit the fstab so the folder will be mounted automatically
```bash
# vim /etc/fstab
## append the following line to the file
<nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace nfs defaults 0 0
```
Then run `sudo mount -a`.
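To confirm that the share is mounted correctly on the client, a quick check such as the following should report the NFS server as the source of the mount point:
```bash
# Show the filesystem type and source backing the workspace directory.
df -hT /home/ubuntu/workspace
# Alternatively, query the mount table directly.
findmnt /home/ubuntu/workspace
```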
Now go to `/home/ubuntu/workspace` and clone the DGL GitHub repository.
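For example (the URL below is the public DGL repository; clone whichever fork or branch you actually use):
```bash
cd /home/ubuntu/workspace
# DGL uses git submodules, so clone recursively.
git clone --recursive https://github.com/dmlc/dgl.git
```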
### Step 1: Set the IP configuration file.
Users need to set up their own IP configuration file `ip_config.txt` before training. For example, if there are two machines in the cluster, the IP configuration file could look like this:
```bash
172.31.0.1
172.31.0.2
```
Users need to make sure that the master node (node-0) can SSH to all the other nodes without password authentication.
[This link](https://linuxize.com/post/how-to-setup-passwordless-ssh-login/) provides instructions for setting up passwordless SSH login.
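A minimal sketch of the setup from node-0, assuming the login user is `ubuntu` and the node IPs match `ip_config.txt`:
```bash
# On node-0: generate a key pair if one does not already exist (accept the defaults).
ssh-keygen -t rsa
# Copy the public key to every other node listed in ip_config.txt.
ssh-copy-id ubuntu@172.31.0.2
# Verify that login now works without a password prompt.
ssh ubuntu@172.31.0.2 hostname
```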
### Step 2: partition the graph.
The example provides a script to partition some builtin graphs, such as the ogbn-mag graph.
If we want to train RGCN on 2 machines, we need to partition the graph into 2 parts.
In this example, we partition the ogbn-mag graph into 2 parts with METIS. The partitions are balanced with respect to the number of nodes, the number of edges, and the number of labelled nodes.
```bash
python3 partition_graph.py --dataset ogbn-mag --num_parts 2 --balance_train --balance_edges
```
If we want to train RGCN with `GraphBolt`, we need to append `--use_graphbolt` to generate partitions in `GraphBolt` format.
```bash
python3 partition_graph.py --dataset ogbn-mag --num_parts 2 --balance_train --balance_edges --use_graphbolt
```
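The partitions are written to the `--output` directory (`data/` by default). Roughly, the layout looks like the sketch below; exact file names can vary across DGL versions, but `ogbn-mag.json` is the metadata file passed to `--part_config` in the next step:
```bash
tree data/
# data/
# ├── ogbn-mag.json                # partition metadata, passed as --part_config
# ├── part0/
# │   ├── graph.dgl                # fused_csc_sampling_graph.pt when --use_graphbolt is set
# │   ├── node_feat.dgl
# │   └── edge_feat.dgl
# └── part1/
#     └── ...
```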
### Step 3: Launch distributed jobs
DGL provides a script to launch the training job in the cluster. `part_config` and `ip_config`
are specified as paths relative to the workspace.
The command below launches 4 training processes on each machine, since we'd like to utilize 4 GPUs for training.
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/rgcn/experimental/ \
--num_trainers 4 \
--num_servers 2 \
--num_samplers 0 \
--part_config data/ogbn-mag.json \
--ip_config ip_config.txt \
"python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 1024 --n-hidden 64 --lr 0.01 --eval-batch-size 1024 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num_gpus 4"
```
If we want to train RGCN with `GraphBolt`, we need to append `--use_graphbolt`.
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/rgcn/experimental/ \
--num_trainers 4 \
--num_servers 2 \
--num_samplers 0 \
--part_config data/ogbn-mag.json \
--ip_config ip_config.txt \
"python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 1024 --n-hidden 64 --lr 0.01 --eval-batch-size 1024 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num_gpus 4 --use_graphbolt"
```
**Note:** If you are using conda or another virtual environment on the remote machines, replace `python3` in the command string (i.e., the last argument) with the path to the Python interpreter in that environment.
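For example, you can find the interpreter path on a remote machine and substitute it for `python3` inside the quoted command (the path below is only a placeholder):
```bash
# On a remote machine, with the target environment activated:
which python3
# e.g. /home/ubuntu/miniconda3/envs/dgl/bin/python3  (placeholder path)
```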
## Comparison between `DGL` and `GraphBolt`
### Partition sizes
Compared to `DGL`, `GraphBolt` partition sizes are reduced to **19%** of the original for `ogbn-mag`.
`ogbn-mag`
| Data Formats | File Name | Part 0 | Part 1 |
| ------------ | ---------------------------- | ------ | ------ |
| DGL | graph.dgl | 714MB | 716MB |
| GraphBolt | fused_csc_sampling_graph.pt | 137MB | 136MB |
### Performance
Compared to `DGL`, `GraphBolt`'s sampler runs faster (sampling time is reduced to **16%** on `ogbn-mag`). `Min` and `Max` are statistics across all trainers on all nodes (machines).
As for RAM usage, the shared memory usage (measured by the **shared** field of the `free` command) decreases due to the smaller graph partitions in `GraphBolt`. The peak memory used by processes (measured by the **used** field of the `free` command) decreases as well.
`ogbn-mag`
| Data Formats | Sample Time Per Epoch (CPU) | Test Accuracy (3 epochs) | shared | used (peak) |
| ------------ | --------------------------- | ------------------------- | ----- | ---- |
| DGL | Min: 48.2s, Max: 91.4s | 42.76% | 1.3GB | 9.2GB|
| GraphBolt | Min: 9.2s, Max: 11.9s | 42.46% | 742MB | 5.9GB|
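The memory numbers above can be reproduced by sampling `free` on each machine while training runs; a minimal sketch:
```bash
# Log the output of `free` (including the "shared" and "used" columns) once per second.
while true; do
    free -m >> free_log.txt
    sleep 1
done
```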
Below is the `partition_graph.py` script referenced in Step 2:
```python
import argparse
import time

import dgl
import numpy as np
import torch as th
from ogb.nodeproppred import DglNodePropPredDataset


def load_ogb(dataset):
    if dataset == "ogbn-mag":
        dataset = DglNodePropPredDataset(name=dataset)
        split_idx = dataset.get_idx_split()
        train_idx = split_idx["train"]["paper"]
        val_idx = split_idx["valid"]["paper"]
        test_idx = split_idx["test"]["paper"]
        hg_orig, labels = dataset[0]

        # Add a reverse edge type for every canonical edge type so that
        # messages can flow in both directions.
        subgs = {}
        for etype in hg_orig.canonical_etypes:
            u, v = hg_orig.all_edges(etype=etype)
            subgs[etype] = (u, v)
            subgs[(etype[2], "rev-" + etype[1], etype[0])] = (v, u)
        hg = dgl.heterograph(subgs)
        hg.nodes["paper"].data["feat"] = hg_orig.nodes["paper"].data["feat"]
        paper_labels = labels["paper"].squeeze()

        num_rels = len(hg.canonical_etypes)
        num_of_ntype = len(hg.ntypes)
        num_classes = dataset.num_classes
        category = "paper"
        print("Number of relations: {}".format(num_rels))
        print("Number of classes: {}".format(num_classes))
        print("Number of train: {}".format(len(train_idx)))
        print("Number of valid: {}".format(len(val_idx)))
        print("Number of test: {}".format(len(test_idx)))

        # get target category id
        category_id = len(hg.ntypes)
        for i, ntype in enumerate(hg.ntypes):
            if ntype == category:
                category_id = i

        # Store the train/val/test splits as boolean node masks on "paper" nodes.
        train_mask = th.zeros((hg.num_nodes("paper"),), dtype=th.bool)
        train_mask[train_idx] = True
        val_mask = th.zeros((hg.num_nodes("paper"),), dtype=th.bool)
        val_mask[val_idx] = True
        test_mask = th.zeros((hg.num_nodes("paper"),), dtype=th.bool)
        test_mask[test_idx] = True
        hg.nodes["paper"].data["train_mask"] = train_mask
        hg.nodes["paper"].data["val_mask"] = val_mask
        hg.nodes["paper"].data["test_mask"] = test_mask
        hg.nodes["paper"].data["labels"] = paper_labels
        return hg
    else:
        raise RuntimeError("Do not support other ogbn datasets.")


if __name__ == "__main__":
    argparser = argparse.ArgumentParser("Partition builtin graphs")
    argparser.add_argument(
        "--dataset", type=str, default="ogbn-mag", help="datasets: ogbn-mag"
    )
    argparser.add_argument(
        "--num_parts", type=int, default=4, help="number of partitions"
    )
    argparser.add_argument(
        "--part_method", type=str, default="metis", help="the partition method"
    )
    argparser.add_argument(
        "--balance_train",
        action="store_true",
        help="balance the training size in each partition.",
    )
    argparser.add_argument(
        "--undirected",
        action="store_true",
        help="turn the graph into an undirected graph.",
    )
    argparser.add_argument(
        "--balance_edges",
        action="store_true",
        help="balance the number of edges in each partition.",
    )
    argparser.add_argument(
        "--num_trainers_per_machine",
        type=int,
        default=1,
        help="the number of trainers per machine. The trainer ids are stored "
        "in the node feature 'trainer_id'",
    )
    argparser.add_argument(
        "--output",
        type=str,
        default="data",
        help="Output path of partitioned graph.",
    )
    argparser.add_argument(
        "--use_graphbolt",
        action="store_true",
        help="Use GraphBolt for distributed train.",
    )
    args = argparser.parse_args()

    start = time.time()
    g = load_ogb(args.dataset)
    print(
        "load {} takes {:.3f} seconds".format(args.dataset, time.time() - start)
    )
    print("|V|={}, |E|={}".format(g.num_nodes(), g.num_edges()))
    print(
        "train: {}, valid: {}, test: {}".format(
            th.sum(g.nodes["paper"].data["train_mask"]),
            th.sum(g.nodes["paper"].data["val_mask"]),
            th.sum(g.nodes["paper"].data["test_mask"]),
        )
    )

    # Balance the number of labelled "paper" nodes across partitions if requested.
    if args.balance_train:
        balance_ntypes = {"paper": g.nodes["paper"].data["train_mask"]}
    else:
        balance_ntypes = None

    dgl.distributed.partition_graph(
        g,
        args.dataset,
        args.num_parts,
        args.output,
        part_method=args.part_method,
        balance_ntypes=balance_ntypes,
        balance_edges=args.balance_edges,
        num_trainers_per_machine=args.num_trainers_per_machine,
        use_graphbolt=args.use_graphbolt,
    )
```