This is an example of training RGCN for node classification in a distributed fashion. Currently, the example trains RGCN on graphs with input node features.
Before training, install the required Python libraries with pip:
```bash
pip3 install ogb pyarrow
```
Training RGCN takes four steps:
### Step 0: Setup a Distributed File System
* You may skip this step if your cluster already has folder(s) synchronized across machines.
To perform distributed training, files and code need to be accessible across multiple machines. A distributed file system (e.g., NFS, Ceph) handles this job well.
#### Server side setup
Here is an example of how to set up NFS. First, install the essential packages on the storage server:
```bash
sudo apt-get install nfs-kernel-server
```
Below we assume the user account is `ubuntu` and we create a `workspace` directory in the home directory.
```bash
mkdir -p /home/ubuntu/workspace
```
We assume that all machines are under a subnet with IP range `192.168.0.0` to `192.168.255.255`. The exports configuration in `/etc/exports` needs to be modified to share the workspace with that subnet, as sketched below.
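A minimal sketch of the `/etc/exports` entry, assuming the workspace path and subnet above:
```bash
# /etc/exports: share the workspace with the cluster subnet
/home/ubuntu/workspace  192.168.0.0/16(rw,sync,no_subtree_check)
```
Then restart the NFS server so the export takes effect:
```bash
sudo systemctl restart nfs-kernel-server
```
#### Client side setup
On each of the other machines, install the NFS client and mount the shared folder (here `<nfs-server-ip>` is a placeholder for the storage server's address):
```bash
sudo apt-get install nfs-common
mkdir -p /home/ubuntu/workspace
sudo mount -t nfs <nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace
```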
Now go to `/home/ubuntu/workspace` and clone the DGL GitHub repository.
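For example (the URL is DGL's official GitHub repository):
```bash
cd /home/ubuntu/workspace
git clone https://github.com/dmlc/dgl.git
```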
### Step 1: Set the IP configuration file.
Users need to set their own IP configuration file `ip_config.txt` before training. For example, if we have two machines in the current cluster, the IP configuration could look like this:
```bash
172.31.0.1
172.31.0.2
```
Users need to make sure that the master node (node-0) can SSH to all the other nodes without password authentication.
[This link](https://linuxize.com/post/how-to-setup-passwordless-ssh-login/) provides instructions for setting up passwordless SSH login.
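A minimal sketch, run on the master node (assuming the `ubuntu` account and the second machine's IP from `ip_config.txt`):
```bash
# Generate an SSH key pair on the master node (skip if one already exists).
ssh-keygen -t rsa
# Copy the public key to every other node listed in ip_config.txt.
ssh-copy-id ubuntu@172.31.0.2
```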
### Step 2: Partition the graph.
The example provides a script to partition some built-in graphs such as the ogbn-mag graph.
If we want to train RGCN on 2 machines, we need to partition the graph into 2 parts.
In this example, we partition the ogbn-mag graph into 2 parts with METIS; the partitions are balanced with respect to the number of nodes, the number of edges, and the number of labelled nodes. A sketch of the command is shown below.
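A minimal sketch of the partitioning command, assuming the example ships a `partition_graph.py` script with these flags (modeled on DGL's other distributed examples; the exact script name and flags are assumptions):
```bash
# Partition ogbn-mag into 2 parts with METIS, balancing labelled training nodes and edges.
python3 partition_graph.py --dataset ogbn-mag --num_parts 2 --balance_train --balance_edges
```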
### Step 3: Launch the distributed training job.
With the partitions ready, the training job is launched from the master node with DGL's launch tool; a sketch follows the note below.
**Note:** if you are using conda or other virtual environments on the remote machines, you need to replace `python3` in the command string (i.e. the last argument) with the path to the Python interpreter in that environment.
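A minimal sketch of the launch command, run from the master node. DGL's `tools/launch.py` utility and the options shown here exist in the repository; the workspace path, the partition config path, the training script name `node_classification.py`, and its flags are assumptions about this example and may differ:
```bash
# Launch one trainer and one server per machine listed in ip_config.txt.
python3 ~/workspace/dgl/tools/launch.py \
  --workspace ~/workspace/dgl/examples/distributed/rgcn/ \
  --num_trainers 1 \
  --num_servers 1 \
  --num_samplers 0 \
  --part_config data/ogbn-mag.json \
  --ip_config ip_config.txt \
  "python3 node_classification.py --graph-name ogbn-mag --ip-config ip_config.txt"
```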
## Comparison between `DGL` and `GraphBolt`
### Partition sizes
Compared to `DGL`, `GraphBolt` partitions are reduced to **19%** of the original size for `ogbn-mag`.
### Sampling speed and memory usage
Compared to `DGL`, `GraphBolt`'s sampler works faster (sample time reduced to **16%** for `ogbn-mag`). `Min` and `Max` are statistics across all trainers on all nodes (machines).
As for RAM usage, the shared memory usage (measured by the **shared** field of the `free` command) decreases thanks to the smaller graph partitions in `GraphBolt`. The peak memory used by processes (measured by the **used** field of the `free` command) decreases as well.
`ogbn-mag`
| Data Formats | Sample Time Per Epoch (CPU) | Test Accuracy (3 epochs) | shared | used (peak) |