Unverified commit 5f895f0b authored by anj-s, committed by GitHub

[feature] Skip creating the CPU grad tensor when training (#821)

* skip creating cpu grads and pinning memory

* added additional comment

* pin docutils to fix circleCI
parent 5da5c0eb
...
@@ -1038,10 +1038,11 @@ class FullyShardedDataParallel(nn.Module):
             )
             free_storage_(p._full_param_padded)
-        if self.move_grads_to_cpu:
+        if self.move_grads_to_cpu and self.training:
             # We can optionally move the grad shard to CPU during the backward
             # pass. In this case, it's important to pre-allocate the CPU grad
             # shard in pinned memory so that we can do a non-blocking transfer.
+            # This is only needed during training and not evaluation.
             p._cpu_grad = torch.zeros_like(p.data, device="cpu").pin_memory()

     def _set_is_root(self) -> None:
...
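For context, here is a minimal, self-contained sketch of the transfer pattern this pre-allocation enables. It is not part of the fairscale codebase; the tensor names (`grad_shard`, `cpu_grad`) are illustrative, and it assumes a CUDA device is available. Pinned (page-locked) host memory is what allows `copy_(..., non_blocking=True)` to overlap the device-to-host gradient copy with other GPU work during the backward pass; an ordinary pageable CPU tensor would force a synchronous copy.

```python
import torch

# Hypothetical stand-ins for a sharded gradient and its CPU destination;
# the names are illustrative, not fairscale API.
if torch.cuda.is_available():
    grad_shard = torch.randn(1024, device="cuda")

    # Pre-allocate the CPU buffer once, in pinned memory, mirroring the
    # patched code path: torch.zeros_like(..., device="cpu").pin_memory().
    cpu_grad = torch.zeros_like(grad_shard, device="cpu").pin_memory()

    # Because cpu_grad is pinned, this device-to-host copy can be issued
    # asynchronously and overlap with subsequent GPU computation.
    cpu_grad.copy_(grad_shard, non_blocking=True)

    # Synchronize before reading cpu_grad on the host.
    torch.cuda.synchronize()
```

Skipping this allocation in eval mode (the `and self.training` guard in the diff) avoids both the host-memory cost of the zero-filled buffer and the pinning work, neither of which is needed when no gradients are produced.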