Using OSS and ShardedDDP changes the communication pattern compared to DDP, and depending on the training hardware, a couple of changes can be beneficial:
* If using multiple nodes, make sure that the reduce buckets are activated: this mitigates some of the communication latency cost.
* If using Torch AMP, the forward and backward passes are mostly computed in fp16, but by default the communications are still done in fp32. Both wrappers can compress these communications (see the sketch after this list):
  * ShardedDDP can compress the gradients back to fp16, using the `reduce_fp16` option.
  * OSS can compress the model shards to fp16 when broadcasting them, using the `broadcast_fp16` option. This can have a major effect on performance.
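Below is a minimal sketch of how these options could be wired in, assuming `torch.distributed` is already initialized (one process per GPU). The model, learning rate, and bucket size are placeholders, and `reduce_buffer_size` is assumed to be the argument controlling the reduce buckets; double-check the constructor signatures against the fairscale release you are using.

```python
import torch

from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim.oss import OSS

# Assumes torch.distributed.init_process_group() has already been called
# (e.g. via torchrun) and that this process owns a single GPU.
model = torch.nn.Linear(1024, 1024).cuda()

# OSS shards the optimizer state across ranks; broadcast_fp16 compresses
# the model shards to fp16 when they are broadcast after each step.
optimizer = OSS(
    params=model.parameters(),
    optim=torch.optim.AdamW,
    lr=1e-4,                 # placeholder, forwarded to the wrapped optimizer
    broadcast_fp16=True,
)

# ShardedDDP handles the gradient reduction. A non-zero reduce_buffer_size
# activates the reduce buckets (useful across nodes), and reduce_fp16
# compresses the gradients before they are reduced (relevant with Torch AMP).
model = ShardedDataParallel(
    model,
    optimizer,
    reduce_buffer_size=2 ** 23,  # placeholder bucket size; 0 would disable bucketing
    reduce_fp16=True,
)
```

From there, the training loop itself is unchanged with respect to a regular DDP setup.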