Unverified Commit 5220f89b authored by Benjamin Lefaudeux's avatar Benjamin Lefaudeux Committed by GitHub

[minor] OSS doc fix - add the DDP wrap (#131)

* wrapping the model in DDP in the tutorial

* typo
parent bfd88cad
Optimizer state sharding
========================
Using torch.nn.parallel.DistributedDataParallel in combination with OSS leads to some wasted communications, but it is possible, and makes OSS a drop-in solution in your existing torch distributed code.
Let's suppose that your trainer looks like:
.. code-block:: python

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train(
        rank: int,
        ...
    ):
        # Problem statement
        model = myAwesomeModel().to(rank)
        model = DDP(model, device_ids=[rank])

        dataloader = mySuperFastDataloader()
        loss_ln = myVeryRelevantLoss()
Then sharding the optimizer state is merely a matter of wrapping your optimizer in OSS:

.. code-block:: python

    import torch
    from fairscale.optim.oss import OSS
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train(
        rank: int,
        ...
    ):
        # Problem statement
        model = myAwesomeModel().to(rank)
        model = DDP(model, device_ids=[rank])

        dataloader = mySuperFastDataloader()
        loss_ln = myVeryRelevantLoss()
        ...