Unverified Commit d64372fd authored by Stas Bekman, committed by GitHub

[docs] outline sharded ddp doc (#9208)



* outline sharded ddp doc

* fix link

* add example

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* narrow the command and remove non-essentials
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@@ -278,6 +278,46 @@ pass it to the trainer.
Finally, you can view the results, including any calculated metrics, by launching tensorboard in your specified
``logging_dir`` directory.

Trainer Integrations
-----------------------------------------------------------------------------------------------------------------------

The trainer is being extended to support experimental libraries that may dramatically improve your training time and
fit bigger models.

The main part that is being integrated at the moment is based on the paper `ZeRO: Memory Optimizations Toward Training
Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
<https://arxiv.org/abs/1910.02054>`__.
You can already deploy the following features from this paper:

* Optimizer State Sharding
* Gradient Sharding

To enable them, pass the ``--sharded_ddp`` argument to the trainer. This is implemented via `fairscale
<https://github.com/facebookresearch/fairscale/>`__, so you will have to install this library first.

This feature requires distributed training (so multiple GPUs) and is not implemented for TPUs.

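To give a sense of what these two kinds of sharding do, here is a small standalone sketch that uses fairscale directly,
outside of the trainer. It is an illustration only, not a description of how the trainer wires things up internally; it
assumes fairscale's documented ``OSS`` optimizer wrapper and ``ShardedDataParallel`` module (import paths may change
between versions) and is meant to be launched with ``torch.distributed.launch`` on at least two GPUs.

.. code-block:: python

    # Illustration only: optimizer state sharding + gradient sharding with fairscale.
    # Assumes fairscale's OSS and ShardedDataParallel (import paths may differ by version).
    import torch
    import torch.distributed as dist
    from fairscale.nn.data_parallel import ShardedDataParallel
    from fairscale.optim.oss import OSS


    def main():
        # torch.distributed.launch sets the env vars needed by the default "env://" init method
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        torch.cuda.set_device(rank)

        model = torch.nn.Linear(512, 512).cuda()

        # Optimizer State Sharding: each rank keeps only its own shard of the Adam state
        optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)

        # Gradient Sharding: gradients are reduced only to the rank that owns the matching shard
        model = ShardedDataParallel(model, optimizer)

        loss = model(torch.randn(8, 512).cuda()).sum()
        loss.backward()
        optimizer.step()


    if __name__ == "__main__":
        main()
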
For example, here is how you could use it with ``finetune_trainer.py``:

.. code-block:: bash

    cd examples/seq2seq
    python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
        --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
        --output_dir output_dir --overwrite_output_dir \
        --do_train --n_train 500 --num_train_epochs 1 \
        --per_device_train_batch_size 1 --freeze_embeds \
        --src_lang en_XX --tgt_lang ro_RO --task translation \
        --fp16 --sharded_ddp

Note that it works with ``--fp16`` too, to make things even faster.

One of the main benefits of enabling ``--sharded_ddp`` is that it uses a lot less GPU memory, so you should be able to
fit significantly larger batch sizes on the same hardware (e.g. 3x or bigger).

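If you build the trainer in your own script rather than through the example launcher, the same option is normally
exposed on ``TrainingArguments``. The snippet below is a minimal sketch under the assumption that the command-line flag
maps to a boolean ``sharded_ddp`` field (the exact name and type may differ between versions); the script still has to
be launched with ``torch.distributed.launch`` on multiple GPUs.

.. code-block:: python

    # A minimal sketch, not a full training script: the ``--sharded_ddp`` command-line flag
    # is assumed to map to a ``sharded_ddp`` field of TrainingArguments (name/type may vary
    # between versions).
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="output_dir",
        do_train=True,
        num_train_epochs=1,
        per_device_train_batch_size=1,
        fp16=True,          # mixed precision combines well with sharded DDP
        sharded_ddp=True,   # shard optimizer state and gradients via fairscale
    )

    # ``training_args`` is then passed to ``Trainer`` as usual, together with your model and
    # datasets, and the script is launched with ``python -m torch.distributed.launch``.
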
Eventually, more parts will be supported by integrating `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__.

.. _additional-resources: