Now go to `/home/ubuntu/workspace` and save the training script and the partitioned data in that folder.
SSH Access
^^^^^^^^^^
The launch script accesses the machines in the cluster via SSH. Users should follow the instructions
in `this document <https://linuxize.com/post/how-to-setup-passwordless-ssh-login/>`_ to set up
the passwordless SSH login on every machine in the cluster. After setting up the passwordless SSH,
users need to authenticate the connection to each machine and add their key fingerprints to `~/.ssh/known_hosts`.
This happens automatically the first time users SSH to each machine.
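
As a rough sketch, the setup on the machine that will run the launch script might look like the following. The IP addresses and the `ubuntu` user name are placeholders; substitute the actual addresses and user of your cluster.

.. code-block:: shell

   # Generate a key pair once (skip if ~/.ssh/id_rsa already exists).
   ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

   # Copy the public key to every machine in the cluster (placeholder IPs).
   for ip in 172.31.19.1 172.31.23.205; do
       ssh-copy-id ubuntu@$ip
   done

   # Add each machine's host key to ~/.ssh/known_hosts non-interactively,
   # so the launch script is not blocked by an authenticity prompt.
   for ip in 172.31.19.1 172.31.23.205; do
       ssh-keyscan $ip >> ~/.ssh/known_hosts
   done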
Launch the distributed training job
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Once everything is ready, we can use the launch script provided by DGL to launch the distributed
training job in the cluster. We can run the launch script on any machine in the cluster.
.. code-block:: shell

   python3 ~/workspace/dgl/tools/launch.py \
       --workspace ~/workspace/ \
       --num_trainers 1 \
       --num_samplers 0 \
       --num_servers 1 \
       --part_config 4part_data/ogbn-products.json \
       --ip_config ip_config.txt \
       "python3 train_dist.py"
If we split the graph into four partitions as demonstrated at the beginning of the tutorial, the cluster must have four machines. The command above launches one trainer and one server on each machine in the cluster. `ip_config.txt` lists the IP addresses of all machines in the cluster as follows: