#Set up distributed training environment
#---------------------------------------
#
#After partitioning a graph and preparing the training script, we now need to set up
#the distributed training environment and launch the training job. Basically, we need to
#create a cluster of machines and upload both the training script and the partitioned data
#to each machine in the cluster. A recommended way to share the training script and
#the partitioned data across the cluster is NFS (Network File System).
#For users who are not familiar with NFS, below is a short tutorial on setting up NFS
#in an existing cluster.
#
#NFS server side setup (ubuntu only)
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
#First, install the NFS server package on the storage server.
#
#.. code-block:: shell
#
#    sudo apt-get install nfs-kernel-server
#
#Below we assume the user account is `ubuntu` and we create a `workspace` directory in the home directory.
#
#.. code-block:: shell
#
#    mkdir -p /home/ubuntu/workspace
#
#We assume that all machines in the cluster are in a subnet with IP range 192.168.0.0 to 192.168.255.255,
#and we need to add an export entry for the workspace directory to `/etc/exports`.
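#
#An entry for this setup would typically look like the following (the specific export options shown
#here are an assumption rather than part of the original instructions; adjust them to your needs):
#
#.. code-block:: shell
#
#    /home/ubuntu/workspace  192.168.0.0/16(rw,sync,no_subtree_check)
#
#After editing `/etc/exports`, the export needs to be applied on the server, and the shared folder
#needs to be mounted on every other machine in the cluster. A minimal sketch, assuming Ubuntu on all
#machines and `<nfs_server_ip>` as a placeholder for the storage server's address:
#
#.. code-block:: shell
#
#    # On the NFS server: apply the new export.
#    sudo systemctl restart nfs-kernel-server
#
#    # On every other machine: install the NFS client and mount the shared workspace.
#    sudo apt-get install nfs-common
#    mkdir -p /home/ubuntu/workspace
#    sudo mount -t nfs <nfs_server_ip>:/home/ubuntu/workspace /home/ubuntu/workspace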
#
#Now go to `/home/ubuntu/workspace` and save the training script and the partitioned data in the folder.
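#
#For example, assuming the partitioned data lives in a local `4part_data/` folder and the training
#script is `train_dist.py` (the names used by the launch command below), copying them into the
#shared workspace could look like this:
#
#.. code-block:: shell
#
#    cp -r 4part_data /home/ubuntu/workspace/
#    cp train_dist.py /home/ubuntu/workspace/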
#
#SSH Access
#^^^^^^^^^^
#
#The launch script accesses the machines in the cluster via SSH. Users should follow the instructions
#in `this document <https://linuxize.com/post/how-to-setup-passwordless-ssh-login/>`_ to set up
#passwordless SSH login on every machine in the cluster. After setting up passwordless SSH,
#users need to authenticate the connection to each machine and add its key fingerprint to `~/.ssh/known_hosts`.
#This is done automatically the first time we SSH to a machine.
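#
#A minimal sketch of this setup, assuming the `ubuntu` account and the standard OpenSSH tools
#(`ip_addr1` below is a placeholder for one of the cluster machines):
#
#.. code-block:: shell
#
#    # Generate a key pair on the machine that will run the launch script (skip if one already exists).
#    ssh-keygen -t rsa
#
#    # Copy the public key to every machine in the cluster; repeat for each address in ip_config.txt.
#    ssh-copy-id ubuntu@ip_addr1
#
#    # Connecting once records the machine's fingerprint in ~/.ssh/known_hosts.
#    ssh ubuntu@ip_addr1 exit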
#
#Launch the distributed training job
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
#After everything is ready, we can now use the launch script provided by DGL to launch the distributed
#training job in the cluster. We can run the launch script on any machine in the cluster.
#
#.. code-block:: shell
#
#    python3 ~/workspace/dgl/tools/launch.py \
#      --workspace ~/workspace/ \
#      --num_trainers 1 \
#      --num_samplers 0 \
#      --num_servers 1 \
#      --part_config 4part_data/ogbn-products.json \
#      --ip_config ip_config.txt \
#      "python3 train_dist.py"
#
#If we split the graph into four partitions as demonstrated at the beginning of the tutorial,
#the cluster has to have four machines. The command above launches one trainer and one server
#on each machine in the cluster. `ip_config.txt` lists the IP addresses of all machines in the
#cluster as follows:
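#
#.. code-block:: shell
#
#    ip_addr1
#    ip_addr2
#    ip_addr3
#    ip_addr4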