@@ -39,7 +41,13 @@ Note: To use *Distributed Training*\ , you will need to run one training script
.. code-block:: bash
-   python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=$THIS_MACHINE_INDEX --master_addr="192.168.1.1" --master_port=1234 run_bert_classifier.py (--arg1 --arg2 --arg3 and all other arguments of the run_classifier script)
+   python -m torch.distributed.launch \
+       --nproc_per_node=4 \
+       --nnodes=2 \
+       --node_rank=$THIS_MACHINE_INDEX \
+       --master_addr="192.168.1.1" \
+       --master_port=1234 run_bert_classifier.py \
+       (--arg1 --arg2 --arg3 and all other arguments of the run_classifier script)
Where ``$THIS_MACHINE_INDEX`` is a sequential index assigned to each of your machines (0, 1, 2...), and the machine with rank 0 has the IP address ``192.168.1.1`` and an open port ``1234``.
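For example, with two machines of four GPUs each, machine 0 (reachable at ``192.168.1.1:1234``) and machine 1 would each run the same command, differing only in ``--node_rank``; the trailing script arguments below are placeholders:

.. code-block:: bash

   # on machine 0 (rank 0, which hosts the master address and port)
   python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 \
       --master_addr="192.168.1.1" --master_port=1234 run_bert_classifier.py --arg1 --arg2 --arg3

   # on machine 1 (rank 1)
   python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 \
       --master_addr="192.168.1.1" --master_port=1234 run_bert_classifier.py --arg1 --arg2 --arg3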
...
@@ -186,7 +194,19 @@ Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word