Partition a graph
-----------------
In this tutorial, we will use the `OGBN products graph <https://ogb.stanford.edu/docs/nodeprop/#ogbn-products>`_
as an example to illustrate graph partitioning. Let's first load the graph into a DGL graph,
storing the node labels as node data.

.. code-block:: python

    import dgl
    import torch as th
    from ogb.nodeproppred import DglNodePropPredDataset

    data = DglNodePropPredDataset(name='ogbn-products')
    graph, labels = data[0]
    # Labels come as a 2D tensor of shape (num_nodes, 1); flatten to 1D.
    labels = labels[:, 0]
    graph.ndata['labels'] = labels

We need to split the data into training/validation/test sets during graph partitioning.
Because this is a node classification task, these sets contain node IDs.
We recommend converting them to boolean arrays, in which True indicates that
the node ID belongs to the set. This way, they can be stored as node data, and after
partitioning the boolean arrays will be stored with the graph partitions.
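The conversion described above can be sketched as follows. This is a minimal example with a tiny
synthetic ID set for illustration; for `ogbn-products`, the actual node IDs would come from the
dataset's split (e.g. `data.get_idx_split()`).

.. code-block:: python

    import torch as th

    num_nodes = 8                     # toy graph size for illustration
    train_ids = th.tensor([0, 2, 3])  # node IDs belonging to the training set

    # Build a boolean mask: True marks membership in the set.
    train_mask = th.zeros(num_nodes, dtype=th.bool)
    train_mask[train_ids] = True

The mask has one entry per node, so it can be stored directly as node data
(e.g. `graph.ndata['train_mask'] = train_mask`) and is partitioned along with the graph.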
Now go to `/home/ubuntu/workspace` and save the training script and the partitioned data in the folder.

SSH Access
^^^^^^^^^^

The launch script accesses the machines in the cluster via SSH. Users should follow the instructions
in `this document <https://linuxize.com/post/how-to-setup-passwordless-ssh-login/>`_ to set up
passwordless SSH login on every machine in the cluster. After setting up passwordless SSH,
users need to authenticate the connection to each machine and add its key fingerprint to `~/.ssh/known_hosts`.
This happens automatically the first time we SSH to a machine.
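A rough sketch of this setup is below. The `ubuntu` user name and `ip_addr1` are placeholders;
adapt them to your cluster and repeat the last two commands for every machine listed in `ip_config.txt`.

.. code-block:: shell

    # Generate a key pair on the machine that will run the launch script
    # (skip this if ~/.ssh/id_rsa already exists).
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

    # Install the public key on a remote machine to enable passwordless login.
    ssh-copy-id ubuntu@ip_addr1

    # Connect once; this records the machine's fingerprint in ~/.ssh/known_hosts.
    ssh ubuntu@ip_addr1 true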

Launch the distributed training job
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After everything is ready, we can use the launch script provided by DGL to launch the distributed
training job in the cluster. The script can be run from any machine in the cluster.

.. code-block:: shell

    python3 ~/workspace/dgl/tools/launch.py \
        --workspace ~/workspace/ \
        --num_trainers 1 \
        --num_samplers 0 \
        --num_servers 1 \
        --part_config 4part_data/ogbn-products.json \
        --ip_config ip_config.txt \
        "python3 train_dist.py"

If we split the graph into four partitions as demonstrated at the beginning of the tutorial, the cluster has to have four machines. The command above launches one trainer and one server on each machine in the cluster. `ip_config.txt` lists the IP addresses of all machines in the cluster as follows:

.. code-block:: shell

    ip_addr1
    ip_addr2
    ip_addr3
    ip_addr4