Unverified Commit 5f895f0b authored by anj-s, committed by GitHub

[feature] Skip creating the CPU grad tensor when not training (#821)

* Skip creating CPU grads and pinning memory

* Add an additional comment

* Pin docutils to fix CircleCI
parent 5da5c0eb
@@ -1038,10 +1038,11 @@ class FullyShardedDataParallel(nn.Module):
             )
             free_storage_(p._full_param_padded)
-        if self.move_grads_to_cpu:
+        if self.move_grads_to_cpu and self.training:
             # We can optionally move the grad shard to CPU during the backward
             # pass. In this case, it's important to pre-allocate the CPU grad
             # shard in pinned memory so that we can do a non-blocking transfer.
+            # This is only needed during training and not evaluation.
             p._cpu_grad = torch.zeros_like(p.data, device="cpu").pin_memory()

     def _set_is_root(self) -> None:
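For reviewers, a minimal self-contained sketch of why the pinned pre-allocation matters (plain PyTorch, not the fairscale API; grad_shard, cpu_grad, and demo are illustrative names): copy_(..., non_blocking=True) can only overlap the device-to-host transfer with GPU compute when the destination tensor is page-locked.

import torch

def demo() -> None:
    # Requires a CUDA device; the pinned-memory transfer path is CUDA-specific.
    assert torch.cuda.is_available()

    grad_shard = torch.randn(1024, 1024, device="cuda")  # stand-in for a grad shard

    # Allocate the CPU buffer once, up front, in pinned memory
    # (this mirrors p._cpu_grad in the diff above).
    cpu_grad = torch.zeros_like(grad_shard, device="cpu").pin_memory()

    # Asynchronous device-to-host copy. With a pageable (non-pinned)
    # destination, non_blocking=True silently degrades to a blocking copy.
    cpu_grad.copy_(grad_shard, non_blocking=True)

    # Synchronize before the host reads cpu_grad, otherwise the values
    # may not have landed yet.
    torch.cuda.current_stream().synchronize()
    print(cpu_grad.abs().sum())

if __name__ == "__main__":
    demo()

This also suggests why the buffer is pre-allocated once rather than per backward pass: pinning memory is itself a comparatively expensive, synchronous operation.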