[Example] Rgcn distributed training (#1999)

* add entity_classify_dist
* upd
* update
* Fix
* Fix
* upd
* upd
* upd
* upd
* global eval
* Fix
* Fix
* Fix
* Fix
* FIx
* upd
* upd
* update
* support pytorch sparse embedding
* Fix
* Fix
* update Readme
* update with new API
* Fix
* update Readme
* add fanout for validation neighbor sampling

Co-authored-by: Ubuntu <ubuntu@ip-172-31-51-214.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-24-210.ec2.internal>
Co-authored-by: Chao Ma <mctt90@gmail.com>
Co-authored-by: Da Zheng <zhengda1936@gmail.com>

Unverified Commit 0052f121 authored Aug 19, 2020 by xiang song(charlie.song) Committed by GitHub Aug 18, 2020
3 changed files
--- a/examples/pytorch/rgcn/experimental/README.md
+++ b/examples/pytorch/rgcn/experimental/README.md
+## Distributed training
+This is an example of training RGCN node classification in a distributed fashion. Currently, the example only supports training RGCN on graphs with no input features. The current implementation follows ../rgcn/entity_classify_mp.py.
+Training RGCN takes four steps:
+### Step 0: set up the IP configuration file.
+Users need to set up their own IP configuration file before training. For example, if the cluster has four machines, the IP configuration file
+could look like this:
+```bash
+172.31.0.1
+172.31.0.2
+172.31.0.3
+172.31.0.4
+```
+Users need to make sure that the master node (node-0) has the right permissions to ssh to all the other nodes.
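Reviewer sketch (not part of the commit): a malformed IP configuration file is easy to miss until the job is launched, so a quick local check can help. The helper below, `parse_ip_config`, is a hypothetical name. It assumes the one-IPv4-address-per-line layout shown in Step 0 and uses only the Python standard library; a file that carries extra fields per line would need a different parser.

```python
import ipaddress


def parse_ip_config(text):
    """Return the IP addresses found in an ip_config-style string.

    Assumes one address per line; blank lines are skipped.
    Raises ValueError (with the line number) on a malformed address.
    """
    addresses = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        entry = line.strip()
        if not entry:
            continue
        try:
            ip = ipaddress.ip_address(entry)
        except ValueError as err:
            raise ValueError("line {}: {}".format(lineno, err)) from None
        addresses.append(str(ip))
    return addresses


# The four-machine example from Step 0.
example = "172.31.0.1\n172.31.0.2\n172.31.0.3\n172.31.0.4\n"
machines = parse_ip_config(example)
print(len(machines), "machines:", machines)
```

The length of the returned list should match the value later passed to `--num_parts` when the graph is partitioned with one part per machine, as this example does.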
+### Step 1: partition the graph.
+The example provides a script to partition some built-in graphs such as the ogbn-mag graph.
+If we want to train RGCN on 4 machines, we need to partition the graph into 4 parts.
+In this example, we partition the ogbn-mag graph into 4 parts with Metis. The partitions are balanced with respect to
+the number of nodes, the number of edges and the number of labelled nodes.
+```bash
+python3 partition_graph.py --dataset ogbn-mag --num_parts 4 --balance_train --balance_edges
+```
+### Step 2: copy the partitioned data to the cluster
+DGL provides a script for copying partitioned data to the cluster. Before that, copy the training script to a local folder:
+```bash
+mkdir ~/dgl_code
+cp /home/ubuntu/dgl/examples/pytorch/rgcn/experimental/entity_classify_dist.py ~/dgl_code
+```
+The command below copies the partition data, the IP config file, and the training scripts to the machines in the cluster.
+The configuration of the cluster is defined by `ip_config.txt`.
+The data is copied to `~/rgcn/ogbn-mag` on each of the remote machines.
+`--rel_data_path` specifies the relative path in the workspace where the partitioned data will be stored.
+`--part_config` specifies the location of the partitioned data on the local machine (the user only needs to specify
+the location of the partition configuration file). `--script_folder` specifies the location of the training scripts.
+```bash
+python ~/dgl/tools/copy_files.py --ip_config ip_config.txt \
+                                 --workspace ~/rgcn \
+                                 --rel_data_path data \
+                                 --part_config data/ogbn-mag.json \
+                                 --script_folder ~/dgl_code
+```
+**Note**: users need to make sure that the master node has the right permissions to ssh to all the other nodes.
+Users also need to copy the training script to the workspace directory on the remote machines.
+### Step 3: Launch distributed jobs
+DGL provides a script to launch the training job in the cluster. The paths given to `part_config` and `ip_config`
+are relative to the workspace.
+```bash
+python3 ~/dgl/tools/launch.py \
+--workspace ~/rgcn/ \
+--num_trainers 1 \
+--num_servers 1 \
+--num_samplers 4 \
+--part_config data/ogbn-mag.json \
+--ip_config ip_config.txt \
+"python3 dgl_code/entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512  --n-hidden 64 --lr 0.01 --eval-batch-size 16  --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt  --num-workers 4 --num-servers 1 --sparse-embedding  --sparse-lr 0.06"
+```
+We can get the performance score at the second epoch:
+```
+Val Acc 0.4323, Test Acc 0.4255, time: 128.0379
+```
+## Running the distributed code in standalone mode
+The standalone mode is mainly used for development and testing. The procedure to run the code is much simpler.
+### Step 1: graph construction.
+When testing the standalone mode of the training script, we should construct a graph with one partition.
+```bash
+python3 partition_graph.py --dataset ogbn-mag --num_parts 1
+```
+### Step 2: run the training script
+```bash
+python3 entity_classify_dist.py --graph-name ogbn-mag  --dataset ogbn-mag --fanout='25,25' --batch-size 256 --n-hidden 64 --lr 0.01 --eval-batch-size 8 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --conf-path 'data/ogbn-mag.json' --standalone
+```
--- a/examples/pytorch/rgcn/experimental/entity_classify_dist.py
+++ b/examples/pytorch/rgcn/experimental/entity_classify_dist.py
--- a/examples/pytorch/rgcn/experimental/partition_graph.py
+++ b/examples/pytorch/rgcn/experimental/partition_graph.py
+import dgl
+import numpy as np
+import torch as th
+import argparse
+import time
+from ogb.nodeproppred import DglNodePropPredDataset
+def load_ogb(dataset, global_norm):
+    if dataset == 'ogbn-mag':
+        dataset = DglNodePropPredDataset(name=dataset)
+        split_idx = dataset.get_idx_split()
+        train_idx = split_idx["train"]['paper']
+        val_idx = split_idx["valid"]['paper']
+        test_idx = split_idx["test"]['paper']
+        hg_orig, labels = dataset[0]
+        subgs = {}
+        for etype in hg_orig.canonical_etypes:
+            u, v = hg_orig.all_edges(etype=etype)
+            subgs[etype] = (u, v)
+            subgs[(etype[2], 'rev-'+etype[1], etype[0])] = (v, u)
+        hg = dgl.heterograph(subgs)
+        hg.nodes['paper'].data['feat'] = hg_orig.nodes['paper'].data['feat']
+        paper_labels = labels['paper'].squeeze()
+        num_rels = len(hg.canonical_etypes)
+        num_of_ntype = len(hg.ntypes)
+        num_classes = dataset.num_classes
+        category = 'paper'
+        print('Number of relations: {}'.format(num_rels))
+        print('Number of class: {}'.format(num_classes))
+        print('Number of train: {}'.format(len(train_idx)))
+        print('Number of valid: {}'.format(len(val_idx)))
+        print('Number of test: {}'.format(len(test_idx)))
+        # currently we do not support node feature in mag dataset.
+        # calculate norm for each edge type and store in edge
+        if not global_norm:
+            for canonical_etype in hg.canonical_etypes:
+                u, v, eid = hg.all_edges(form='all', etype=canonical_etype)
+                _, inverse_index, count = th.unique(v, return_inverse=True, return_counts=True)
+                degrees = count[inverse_index]
+                norm = th.ones(eid.shape[0]) / degrees
+                norm = norm.unsqueeze(1)
+                hg.edges[canonical_etype].data['norm'] = norm
+        # get target category id
+        category_id = len(hg.ntypes)
+        for i, ntype in enumerate(hg.ntypes):
+            if ntype == category:
+                category_id = i
+        g = dgl.to_homo(hg)
+        if global_norm:
+            u, v, eid = g.all_edges(form='all')
+            _, inverse_index, count = th.unique(v, return_inverse=True, return_counts=True)
+            degrees = count[inverse_index]
+            norm = th.ones(eid.shape[0]) / degrees
+            norm = norm.unsqueeze(1)
+            g.edata['norm'] = norm
+        node_ids = th.arange(g.number_of_nodes())
+        # find out the target node ids
+        node_tids = g.ndata[dgl.NTYPE]
+        loc = (node_tids == category_id)
+        target_idx = node_ids[loc]
+        train_idx = target_idx[train_idx]
+        val_idx = target_idx[val_idx]
+        test_idx = target_idx[test_idx]
+        train_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
+        train_mask[train_idx] = True
+        val_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
+        val_mask[val_idx] = True
+        test_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
+        test_mask[test_idx] = True
+        g.ndata['train_mask'] = train_mask
+        g.ndata['val_mask'] = val_mask
+        g.ndata['test_mask'] = test_mask
+        labels = th.full((g.number_of_nodes(),), -1, dtype=paper_labels.dtype)
+        labels[target_idx] = paper_labels
+        g.ndata['labels'] = labels
+        return g
+    else:
+        raise ValueError("Do not support other ogbn datasets.")
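Reviewer sketch (not part of the commit): the normalisation in `load_ogb` gives each edge the weight 1 / (in-degree of its destination node). By default the in-degree is counted within each canonical edge type; with `--global-norm` it is counted over the whole homogeneous graph. The pure-Python sketch below shows that arithmetic on a toy edge list, with `collections.Counter` standing in for the `th.unique(..., return_counts=True)` call. The function name `edge_norm` and the toy data are illustrative only.

```python
from collections import Counter


def edge_norm(dst):
    """Return 1 / in-degree(dst[i]) for every edge i.

    `dst` holds the destination node ID of each edge; the in-degree is
    counted over exactly the edges passed in.
    """
    in_degree = Counter(dst)
    return [1.0 / in_degree[v] for v in dst]


# Toy edge list: node 0 receives three edges, node 1 one edge, node 2 two.
dst = [0, 0, 1, 2, 0, 2]
norm = edge_norm(dst)
print(norm)
```

Passing only the edges of one relation reproduces the per-edge-type case, and passing all edges reproduces the global case. In both, the weights of the edges entering any single node sum to 1.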
+if __name__ == '__main__':
+    argparser = argparse.ArgumentParser("Partition builtin graphs")
+    argparser.add_argument('--dataset', type=str, default='ogbn-mag',
+                           help='datasets: ogbn-mag')
+    argparser.add_argument('--num_parts', type=int, default=4,
+                           help='number of partitions')
+    argparser.add_argument('--part_method', type=str, default='metis',
+                           help='the partition method')
+    argparser.add_argument('--balance_train', action='store_true',
+                           help='balance the training size in each partition.')
+    argparser.add_argument('--undirected', action='store_true',
+                           help='turn the graph into an undirected graph.')
+    argparser.add_argument('--balance_edges', action='store_true',
+                           help='balance the number of edges in each partition.')
+    argparser.add_argument('--global-norm', default=False, action='store_true',
+                           help='Use global norm instead of per node type norm')
+    args = argparser.parse_args()
+    start = time.time()
+    g = load_ogb(args.dataset, args.global_norm)
+    print('load {} takes {:.3f} seconds'.format(args.dataset, time.time() - start))
+    print('|V|={}, |E|={}'.format(g.number_of_nodes(), g.number_of_edges()))
+    print('train: {}, valid: {}, test: {}'.format(th.sum(g.ndata['train_mask']),
+                                                  th.sum(g.ndata['val_mask']),
+                                                  th.sum(g.ndata['test_mask'])))
+    if args.balance_train:
+        balance_ntypes = g.ndata['train_mask']
+    else:
+        balance_ntypes = None
+    dgl.distributed.partition_graph(g, args.dataset, args.num_parts, 'data',
+                                    part_method=args.part_method,
+                                    balance_ntypes=balance_ntypes,
+                                    balance_edges=args.balance_edges)
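Reviewer sketch (not part of the commit): the split indices returned by OGB are positions within the `paper` node type, while the graph that gets partitioned is homogeneous. `load_ogb` bridges the two by collecting the homogeneous IDs whose type equals `category_id` and indexing that list with each split. The pure-Python sketch below mirrors that bookkeeping; the node-type array and split values are made up for illustration, and the helper names are hypothetical.

```python
def to_global_ids(node_type_ids, category_id, local_idx):
    """Map type-local indices to node IDs in the homogeneous graph."""
    # IDs of all nodes of the target type, in homogeneous-graph order.
    target_idx = [nid for nid, t in enumerate(node_type_ids) if t == category_id]
    return [target_idx[i] for i in local_idx]


def make_mask(num_nodes, ids):
    """Boolean mask over all nodes, True at the given IDs."""
    mask = [False] * num_nodes
    for nid in ids:
        mask[nid] = True
    return mask


# Made-up homogeneous graph: type 1 is the target type, 0 and 2 are others.
node_type_ids = [0, 0, 1, 1, 1, 2, 2]
train_local = [0, 2]  # first and third node of type 1
train_global = to_global_ids(node_type_ids, category_id=1, local_idx=train_local)
train_mask = make_mask(len(node_type_ids), train_global)
print(train_global, train_mask)
```

Like the original code, this relies on nodes of one type keeping their relative order after the conversion to a homogeneous graph.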