"src/vscode:/vscode.git/clone" did not exist on "0df83c79e4247e6b58c4c0aacfcb40b74db8d96e"
Unverified Commit 101186bc authored by Stas Bekman's avatar Stas Bekman Committed by GitHub
Browse files

[docs] [testing] distributed training (#7993)

@@ -451,6 +451,24 @@ Inside tests:

Distributed training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``pytest`` can't deal with distributed training directly. If this is attempted, the sub-processes don't do the right thing and end up thinking they are ``pytest`` and start running the test suite in loops. It works, however, if one spawns a normal process that then spawns off multiple workers and manages the IO pipes.

This is still under development, but you can study 2 different tests that perform this successfully:

* `test_seq2seq_examples_multi_gpu.py <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_seq2seq_examples_multi_gpu.py>`__ - a ``pytorch-lightning``-running test (this test had to use PL's ``ddp`` spawning method, which is its default)
* `test_finetune_trainer.py <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_finetune_trainer.py>`__ - a normal (non-PL) test

To jump right into the execution point, search for the ``execute_async_std`` function in those tests.
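
Stripped down to its essentials, the pattern looks roughly like the sketch below. The script name, its arguments, and the launcher invocation are illustrative only (the real tests drive the actual ``examples/seq2seq`` scripts through their own helper); the point is that the test itself is an ordinary ``pytest`` test that launches a separate process, and that process is the one that spawns the distributed workers:

.. code-block:: python

    # A minimal sketch, not the helper used by the tests above: launch a
    # separate Python process that owns the distributed worker spawning,
    # and let the test only manage its IO and check the outcome.
    import subprocess
    import sys


    def test_distributed_training_smoke(tmp_path):
        cmd = [
            sys.executable,
            "-m",
            "torch.distributed.launch",  # standard PyTorch launcher
            "--nproc_per_node=2",
            "finetune_trainer.py",       # hypothetical script/args, for illustration only
            "--output_dir",
            str(tmp_path),
        ]
        # Capture stdout/stderr so the workers don't pollute pytest's own IO,
        # then assert on the sub-process outcome.
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
        assert result.returncode == 0, result.stderr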

You will need at least 2 GPUs to see these tests in action:

.. code-block:: bash

    CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW=1 pytest -sv examples/seq2seq/test_finetune_trainer.py \
    examples/seq2seq/test_seq2seq_examples_multi_gpu.py
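
If you want such a test to be skipped automatically on machines with fewer than 2 GPUs, rather than relying on ``CUDA_VISIBLE_DEVICES`` being set by hand, a plain ``pytest`` marker built on ``torch.cuda.device_count()`` is enough. This is only a sketch, not the decorator the test suite itself uses:

.. code-block:: python

    # A minimal sketch, not the project's own helper: skip a test unless at
    # least 2 CUDA devices are visible to the current process.
    import pytest
    import torch

    require_multi_gpu = pytest.mark.skipif(
        torch.cuda.device_count() < 2, reason="test requires at least 2 GPUs"
    )


    @require_multi_gpu
    def test_trainer_ddp():
        ...  # the subprocess-launching body sketched above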

Output capture