Then restart NFS; with that, the setup on the server side is finished.

.. code-block:: shell

  sudo systemctl restart nfs-kernel-server

For configuration details, please refer to the NFS ArchWiki page (https://wiki.archlinux.org/index.php/NFS).

NFS client-side setup (Ubuntu only)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use NFS, the client machines also need to install the essential packages:

.. code-block:: shell

  sudo apt-get install nfs-common

You can either mount the NFS folder manually

.. code-block:: shell

  mkdir -p /home/ubuntu/workspace
  sudo mount -t nfs <nfs-server-ip>:/home/ubuntu/workspace /home/ubuntu/workspace

or add a line like the following to `/etc/fstab` so the folder is mounted automatically (a standard NFS entry mirroring the manual mount above):
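
.. code-block:: shell

  # NFS share, mounted with default options at boot; no dump, no fsck.
  <nfs-server-ip>:/home/ubuntu/workspace  /home/ubuntu/workspace  nfs  defaults  0 0
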
Now go to `/home/ubuntu/workspace` and save the training script and the partitioned data in the folder.
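
As a concrete sketch, assuming the training script is `train_dist.py` and the partitioned data sits in a local `4part_data/` folder (both names taken from the launch command below; the source paths are placeholders for wherever you produced the files), the copy could look like:

.. code-block:: shell

  cd /home/ubuntu/workspace
  # Copy the training script and the partition output into the shared folder.
  cp /path/to/train_dist.py .
  cp -r /path/to/4part_data .
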
SSH Access
^^^^^^^^^^

The launch script accesses the machines in the cluster via SSH. Users should follow the instructions
in `this document <https://linuxize.com/post/how-to-setup-passwordless-ssh-login/>`_ to set up
passwordless SSH login on every machine in the cluster. After setting up passwordless SSH,
users need to authenticate the connection to each machine and add its key fingerprint to
`~/.ssh/known_hosts`; this happens automatically the first time we SSH to a machine.
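
As a minimal sketch (assuming OpenSSH and the `ubuntu` user used elsewhere in this tutorial), the setup on the machine that will run the launch script could look like:

.. code-block:: shell

  # Generate a key pair (accept the defaults, leave the passphrase empty).
  ssh-keygen -t rsa

  # Copy the public key to every machine in the cluster, ourselves included.
  # The first connection also records each machine's fingerprint in
  # ~/.ssh/known_hosts.
  ssh-copy-id ubuntu@ip_addr1
  ssh-copy-id ubuntu@ip_addr2
  ssh-copy-id ubuntu@ip_addr3
  ssh-copy-id ubuntu@ip_addr4
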
Launch the distributed training job
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After everything is ready, we can use the launch script provided by DGL to start the distributed
training job. The launch script can be run from any machine in the cluster.

.. code-block:: shell

  python3 ~/workspace/dgl/tools/launch.py \
    --workspace ~/workspace/ \
    --num_trainers 1 \
    --num_samplers 0 \
    --num_servers 1 \
    --part_config 4part_data/ogbn-products.json \
    --ip_config ip_config.txt \
    "python3 train_dist.py"

If we split the graph into four partitions as demonstrated at the beginning of the tutorial, the cluster has to have four machines. The command above launches one trainer and one server on each machine in the cluster. `ip_config.txt` lists the IP addresses of all machines in the cluster as follows:
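
.. code-block:: shell

  ip_addr1
  ip_addr2
  ip_addr3
  ip_addr4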