Unverified commit 6802ad49 authored by Min Xu, committed by GitHub

[fix] fixing adascale all_reduce (#155)

- Aurick noticed this bug and I ran into it yesterday.
- After the fix, our CIFAR training shows the same gain values across
  different replicas (a minimal repro sketch follows below the log):
```
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3512124098087777
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000600 fwd 0:00:00.003678 loss 0:00:00.000086 bwd 0:00:00.314158 update 0:00:00.002132 rest 0:00:00.000399
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000643 fwd 0:00:00.003460 loss 0:00:00.000084 bwd 0:00:00.314678 update 0:00:00.002001 rest 0:00:00.000408
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.3514997779980324
20-Oct-20 16:00:19 - DEBUG - rank1 - timing: data 0:00:00.000732 fwd 0:00:00.003689 loss 0:00:00.000086 bwd 0:00:00.314176 update 0:00:00.002146 rest 0:00:00.000397
20-Oct-20 16:00:19 - DEBUG - rank0 - timing: data 0:00:00.000646 fwd 0:00:00.003542 loss 0:00:00.000089 bwd 0:00:00.314549 update 0:00:00.001956 rest 0:00:00.000392
20-Oct-20 16:00:19 - DEBUG - rank1 - scale 2, gain ratio 1.352149646693932
20-Oct-20 16:00:19 - DEBUG - rank0 - scale 2, gain ratio 1.352149646693932
```
parent 6f8a8652
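
The underlying issue: `torch.distributed.all_reduce` works in place, so passing the temporary `self._local_grad_sqr / self._world_size` sums a throwaway tensor and each rank keeps only its own local value. Below is a minimal sketch contrasting the buggy and fixed patterns; it is my own illustration, not code from this commit, and the gloo backend, the TCP port 29501, and the two CPU processes are arbitrary choices.

```python
# Minimal sketch, not from the repo: shows why reducing a temporary loses the
# result while in-place all_reduce + div_ keeps it. Assumes CPU-only gloo and a
# free local TCP port (29501); runs two worker processes.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int) -> None:
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29501", rank=rank, world_size=world_size
    )

    # Buggy pattern: `grad_sqr / world_size` builds a new tensor; all_reduce
    # sums *that* temporary in place, so grad_sqr still holds the local value.
    grad_sqr = torch.tensor([float(rank + 1)])
    dist.all_reduce(grad_sqr / world_size)
    print(f"rank{rank} buggy: {grad_sqr.item()}")  # 1.0 on rank0, 2.0 on rank1

    # Fixed pattern: all_reduce the tensor itself (default op is SUM), then
    # divide in place -- every rank ends up with the same averaged value.
    grad_sqr = torch.tensor([float(rank + 1)])
    dist.all_reduce(grad_sqr)
    grad_sqr.div_(world_size)
    print(f"rank{rank} fixed: {grad_sqr.item()}")  # 1.5 on both ranks

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```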
```diff
@@ -185,7 +185,7 @@ class ShardedDataParallel(nn.Module):
                     i_bucketed += 1
             if i_bucketed > 0:
-                buffer.div_(world_size)  # type: ignore
+                buffer.div_(world_size)
                 bucket_requests.append(
                     (
                         dist.reduce(tensor=buffer, dst=global_rank, group=group, async_op=True),  # type: ignore
@@ -199,7 +199,7 @@ class ShardedDataParallel(nn.Module):
             if p.grad.requires_grad:
                 raise RuntimeError("DistributedDataParallel only works with gradients that don't require grad")
-            p.grad.div_(world_size)  # type: ignore
+            p.grad.div_(world_size)
             requests.append(dist.reduce(tensor=p.grad, dst=global_rank, group=group, async_op=True))  # type: ignore
             # Unroll the initial packed small gradients, as soon as possible
```
```diff
@@ -199,7 +199,11 @@ class AdaScale(object):
         # gradients have been synchronized between each worker.
         self._final_callback_queued = False
         assert isinstance(self._local_grad_sqr, torch.Tensor)
-        torch.distributed.all_reduce(self._local_grad_sqr / self._world_size)
+        # self._local_grad_sqr is FP32, sum then div shouldn't overflow.
+        torch.distributed.all_reduce(self._local_grad_sqr)  # SUM
+        self._local_grad_sqr.div_(self._world_size)
         local_grad_sqr = self._local_grad_sqr.cpu().numpy()
         total_grad_sqr = np.array(
             [sum(param.grad.pow(2).sum().item() for param in group["params"]) for group in self._optimizer.param_groups]
@@ -243,3 +247,7 @@ class AdaScale(object):
             return self.step(*args, **kwargs)

         setattr(self._optimizer, "step", wrapper)
+
+    def zero_grad(self) -> None:
+        """Proxy function to optimizer"""
+        self._optimizer.zero_grad()
```
```diff
@@ -348,6 +348,8 @@ class Tensor:
     def digamma_(self) -> Tensor: ...
     def dim(self) -> _int: ...
     def dist(self, other: Tensor, p: Number=2) -> Tensor: ...
+    def div(self, denominator: Number) -> Tensor: ...
+    def div_(self, denominator: Number) -> Tensor: ...
     def dot(self, tensor: Tensor) -> Tensor: ...
     def double(self) -> Tensor: ...
     def eig(self, eigenvectors: _bool=False) -> Tuple[Tensor, Tensor]: ...
```
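
The stub additions above are also what allow the two `# type: ignore` comments to be dropped in `ShardedDataParallel`: once `div`/`div_` are declared on `Tensor`, mypy resolves the in-place division on its own. A tiny hypothetical example (not part of this commit) of a call that now type-checks cleanly:

```python
# Hypothetical snippet, not from the repo: with div_ declared in the Tensor
# stub, mypy accepts the in-place division without a "# type: ignore".
import torch


def average_inplace(buffer: torch.Tensor, world_size: int) -> torch.Tensor:
    # Divides `buffer` by world_size in place and returns it.
    return buffer.div_(world_size)


print(average_inplace(torch.ones(3), 4))  # tensor([0.2500, 0.2500, 0.2500])
```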