Unverified Commit d64372fd authored by Stas Bekman, committed by GitHub

[docs] outline sharded ddp doc (#9208)



* outline sharded ddp doc

* fix link

* add example

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* narrow the command and remove non-essentials
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@@ -278,6 +278,46 @@ pass it to the trainer.
Finally, you can view the results, including any calculated metrics, by launching tensorboard in your specified
``logging_dir`` directory.

Trainer Integrations
-----------------------------------------------------------------------------------------------------------------------

The trainer is being extended to support experimental libraries that may dramatically improve your training time and
fit bigger models.

The main part that is being integrated at the moment is based on the paper `ZeRO: Memory Optimizations Toward Training
Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
<https://arxiv.org/abs/1910.02054>`__.
You can already deploy the following features from this paper:

* Optimizer State Sharding
* Gradient Sharding

To enable them, pass the ``--sharded_ddp`` argument to the trainer. This is implemented via `fairscale
<https://github.com/facebookresearch/fairscale/>`__, so you will have to install this library first.

This feature requires distributed training (so multiple GPUs) and is not implemented for TPUs.

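To give a sense of what these two kinds of sharding do, here is a small standalone sketch that uses fairscale directly,
outside of the trainer. It is an illustration only, not a description of how the trainer wires things up internally; it
assumes fairscale's documented ``OSS`` optimizer wrapper and ``ShardedDataParallel`` module (import paths may change
between versions) and is meant to be launched with ``torch.distributed.launch`` on at least two GPUs.

.. code-block:: python

    # Illustration only: optimizer state sharding + gradient sharding with fairscale.
    # Assumes fairscale's OSS and ShardedDataParallel (import paths may differ by version).
    import torch
    import torch.distributed as dist
    from fairscale.nn.data_parallel import ShardedDataParallel
    from fairscale.optim.oss import OSS


    def main():
        # torch.distributed.launch sets the env vars needed by the default "env://" init method
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        torch.cuda.set_device(rank)

        model = torch.nn.Linear(512, 512).cuda()

        # Optimizer State Sharding: each rank keeps only its own shard of the Adam state
        optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)

        # Gradient Sharding: gradients are reduced only to the rank that owns the matching shard
        model = ShardedDataParallel(model, optimizer)

        loss = model(torch.randn(8, 512).cuda()).sum()
        loss.backward()
        optimizer.step()


    if __name__ == "__main__":
        main()
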
For example, here is how you could use it with ``finetune_trainer.py``:

.. code-block:: bash

    cd examples/seq2seq
    python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
        --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
        --output_dir output_dir --overwrite_output_dir \
        --do_train --n_train 500 --num_train_epochs 1 \
        --per_device_train_batch_size 1 --freeze_embeds \
        --src_lang en_XX --tgt_lang ro_RO --task translation \
        --fp16 --sharded_ddp

Note that it works with ``--fp16`` too, to make things even faster.

One of the main benefits of enabling ``--sharded_ddp`` is that it uses a lot less GPU memory, so you should be able to
fit significantly larger batch sizes on the same hardware (e.g. 3x or bigger).

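If you build the trainer in your own script rather than through the example launcher, the same option is normally
exposed on ``TrainingArguments``. The snippet below is a minimal sketch under the assumption that the command-line flag
maps to a boolean ``sharded_ddp`` field (the exact name and type may differ between versions); the script still has to
be launched with ``torch.distributed.launch`` on multiple GPUs.

.. code-block:: python

    # A minimal sketch, not a full training script: the ``--sharded_ddp`` command-line flag
    # is assumed to map to a ``sharded_ddp`` field of TrainingArguments (name/type may vary
    # between versions).
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="output_dir",
        do_train=True,
        num_train_epochs=1,
        per_device_train_batch_size=1,
        fp16=True,          # mixed precision combines well with sharded DDP
        sharded_ddp=True,   # shard optimizer state and gradients via fairscale
    )

    # ``training_args`` is then passed to ``Trainer`` as usual, together with your model and
    # datasets, and the script is launched with ``python -m torch.distributed.launch``.
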
Eventually, more parts will be supported by integrating `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__.

.. _additional-resources: