Unverified Commit a2b11de4 authored by Benjamin Lefaudeux, committed by GitHub

[doc] Adding some more ShardedDDP documentation (#547)

parent ece0cbf9
@@ -4,3 +4,14 @@ ShardedDataParallel
.. autoclass:: fairscale.nn.ShardedDataParallel
    :members:
    :undoc-members:

Performance tips
====================

Using OSS and ShardedDDP changes the communication pattern compared to DDP, and depending on the training hardware a couple of adjustments can be beneficial, as sketched in the example after this list.

* If using multiple nodes, make sure that the reduce buckets are activated. This mitigates some of the communication latency costs.

* If using Torch AMP, the forward and backward passes are mostly computed in fp16, but by default the communications will still be fp32.

  * ShardedDDP can compress the gradients back to fp16, using the `reduce_fp16` option.

  * OSS can compress the model shards to fp16 when broadcasting, using the `broadcast_fp16` option. This could have a major effect on performance.
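
A minimal sketch of how these options could be combined is shown below. It assumes that `torch.distributed` has already been initialized with one rank per python process; the toy model is a placeholder, and the `reduce_buffer_size` argument name used to activate the reduce buckets is an assumption on top of the `reduce_fp16` and `broadcast_fp16` options mentioned above::

    import torch
    from fairscale.nn import ShardedDataParallel as ShardedDDP
    from fairscale.optim import OSS

    # assumes torch.distributed.init_process_group(...) has been called,
    # with one rank per python process and a GPU available per rank
    model = torch.nn.Linear(16, 16).cuda()

    optimizer = OSS(
        params=model.parameters(),
        optim=torch.optim.SGD,
        lr=1e-3,
        broadcast_fp16=True,  # compress the model shards to fp16 when broadcasting
    )

    model = ShardedDDP(
        model,
        optimizer,
        reduce_buffer_size=2 ** 23,  # assumed argument name: size of the reduce buckets, in elements
        reduce_fp16=True,            # compress the gradients back to fp16 before reduction
    )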
@@ -111,8 +111,10 @@ the only assumption being that each of the ranks lives in its own python process
to see it in action, you can test it with the following script `here <../../../examples/tutorial_oss.py>`_.
-Using PyTorch Automatic Mixed Precision is possible, but it requires a shard-aware GradScaler, which is available in
-`fairscale.optim.grad_scaler`. Autocast can be used as is, and the loss will be scaled and handled in the same way.
+Using PyTorch Automatic Mixed Precision is possible, and its actual usage will depend on whether OSS is used with DDP or with ShardedDDP.
+If OSS is used with DDP, then the normal PyTorch GradScaler can be used, and nothing needs to be changed. If OSS is used with ShardedDDP (to
+get the gradient sharding), then a very similar flow can be used, but it requires a shard-aware GradScaler, which is available in
+`fairscale.optim.grad_scaler`. In both cases Autocast can be used as is, and the loss will be scaled and handled in the same way.
See `the original documentation <https://pytorch.org/docs/stable/notes/amp_examples.html?highlight=automatic%20mixed%20precision>`_
for more information.
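
As an illustration of the ShardedDDP case, here is a minimal AMP training-loop sketch. The toy model and data are placeholders, it assumes an already initialized process group, and the shard-aware scaler is assumed to be the `ShardedGradScaler` class from `fairscale.optim.grad_scaler`::

    import torch
    from fairscale.nn import ShardedDataParallel as ShardedDDP
    from fairscale.optim import OSS
    from fairscale.optim.grad_scaler import ShardedGradScaler

    # assumes torch.distributed.init_process_group(...) has been called,
    # with one rank per python process and a GPU available per rank
    model = torch.nn.Linear(16, 16).cuda()
    dataloader = [(torch.randn(8, 16).cuda(), torch.randn(8, 16).cuda()) for _ in range(10)]
    loss_fn = torch.nn.MSELoss()

    optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=1e-3)
    model = ShardedDDP(model, optimizer)
    scaler = ShardedGradScaler()  # shard-aware replacement for torch.cuda.amp.GradScaler

    for inputs, targets in dataloader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(inputs), targets)
        # scale the loss before backward, then unscale and step through the shard-aware scaler
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()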
...