Unverified commit fb8d9137 authored by Benjamin Lefaudeux, committed by GitHub

[fix] Dead code removal for OSS (#276)

* Remove a call that has been dead since ShardedDDP landed; small speedup
* Unrelated: fill in the changelog
* Another nit
parent 7abaa2be
@@ -13,6 +13,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Fixed
- AdaScale: smoothing factor value fixed when using gradient accumulation (#235)
- Pipe: documentation on balancing functions (#243)
+- ShardedDDP: handle typical NLP models
+- ShardedDDP: better partitioning when finetuning
## [0.1.1] - 2020-12-01
### Fixed
...
@@ -171,8 +171,7 @@ def train(
else:
    final_loss = optimizer.step(closure)
prof.export_chrome_trace(f"{optim_type}_trace_rank_{rank}.json")
need_profiling = False  # only profile once
else:
...
@@ -208,9 +208,6 @@ class OSS(Optimizer):
else:
    loss = self.optim.step(**kwargs)
-# Depending on the DDP engine used, gradients specific to other ranks may still be loaded
-self._free_other_grads()
# Sync all the updated shards in between the ranks
self._broadcast_params()
@@ -507,17 +504,6 @@ class OSS(Optimizer):
# Discard this tensor/rank, broadcast necessary for syncing
broadcast_object(empty_buffer, src_rank=global_rank, group=self.group, dist_device=self._device)
-def _free_other_grads(self) -> None:
-    """Free all the gradients only useful for the other ranks"""
-    for rank, partition in enumerate(self.partition_parameters()):
-        if rank == self.rank:
-            continue
-        for p in partition:
-            for t in p["params"]:
-                t.grad = None
def _broadcast_params(self) -> None:
    """Helper function to broadcast all the parameters from a given device"""
...
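For context on why the removed loop is dead code: with ShardedDDP driving the backward pass, gradients are only reduced onto the rank that owns each shard, so the non-owned parameters never receive a `.grad` tensor and there is nothing to free. A minimal single-process sketch of that situation (plain PyTorch with hypothetical variable names, not the fairscale API):

```python
import torch

# Two toy "rank partitions" of a model's parameters (hypothetical example):
# rank 0 owns the first half, rank 1 owns the rest.
params = [torch.nn.Parameter(torch.randn(4)) for _ in range(4)]
partitions = [params[:2], params[2:]]
my_rank = 0

# With a ShardedDDP-style reducer, gradients are only materialized for the
# parameters this rank owns; the other shards never get a .grad tensor.
for p in partitions[my_rank]:
    p.grad = torch.randn_like(p)

# The removed helper looped over every other rank's partition and set
# t.grad = None. When those shards never had gradients in the first place,
# the loop is pure per-step bookkeeping overhead.
freed = 0
for rank, partition in enumerate(partitions):
    if rank == my_rank:
        continue
    for t in partition:
        if t.grad is not None:
            freed += 1
        t.grad = None

print(f"gradients actually freed on non-owned shards: {freed}")  # prints 0
```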
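The step that stays, `self._broadcast_params()`, is the shard synchronization: each rank updates only the parameter slice it owns, then broadcasts that slice so every replica ends up with the full updated parameter set. A standalone sketch of this pattern (hypothetical script using `torch.distributed` with the gloo backend, not the fairscale implementation):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int) -> None:
    # Minimal process-group setup for a local CPU demo.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Every rank holds the full parameter list, but only "owns" one slice.
    params = [torch.zeros(2) for _ in range(world_size)]

    # Local update on the owned shard only (stand-in for self.optim.step()).
    params[rank] += rank + 1.0

    # Sync: each shard is broadcast from its owner to all other ranks,
    # which is the role _broadcast_params plays after the sharded step.
    for owner, shard in enumerate(params):
        dist.broadcast(shard, src=owner)

    print(f"rank {rank}: {[p.tolist() for p in params]}")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```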