Unverified Commit 5f895f0b authored by anj-s, committed by GitHub

[feature] Skip creating the CPU grad tensor when not training (#821)

* Skip creating CPU grads and pinning memory

* Add an additional comment

* Pin docutils to fix CircleCI
parent 5da5c0eb
@@ -1038,10 +1038,11 @@ class FullyShardedDataParallel(nn.Module):
             )
             free_storage_(p._full_param_padded)
-        if self.move_grads_to_cpu:
+        if self.move_grads_to_cpu and self.training:
             # We can optionally move the grad shard to CPU during the backward
             # pass. In this case, it's important to pre-allocate the CPU grad
             # shard in pinned memory so that we can do a non-blocking transfer.
+            # This is only needed during training and not evaluation.
             p._cpu_grad = torch.zeros_like(p.data, device="cpu").pin_memory()

     def _set_is_root(self) -> None:
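For reviewers, a minimal self-contained sketch of why the pinned pre-allocation matters (plain PyTorch, not the fairscale API; grad_shard, cpu_grad, and demo are illustrative names): copy_(..., non_blocking=True) can only overlap the device-to-host transfer with GPU compute when the destination tensor is page-locked.

import torch

def demo() -> None:
    # Requires a CUDA device; the pinned-memory transfer path is CUDA-specific.
    assert torch.cuda.is_available()

    grad_shard = torch.randn(1024, 1024, device="cuda")  # stand-in for a grad shard

    # Allocate the CPU buffer once, up front, in pinned memory
    # (this mirrors p._cpu_grad in the diff above).
    cpu_grad = torch.zeros_like(grad_shard, device="cpu").pin_memory()

    # Asynchronous device-to-host copy. With a pageable (non-pinned)
    # destination, non_blocking=True silently degrades to a blocking copy.
    cpu_grad.copy_(grad_shard, non_blocking=True)

    # Synchronize before the host reads cpu_grad, otherwise the values
    # may not have landed yet.
    torch.cuda.current_stream().synchronize()
    print(cpu_grad.abs().sum())

if __name__ == "__main__":
    demo()

This also suggests why the buffer is pre-allocated once rather than per backward pass: pinning memory is itself a comparatively expensive, synchronous operation.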