Unverified Commit 34c83a5a authored by Leo Gao, committed by GitHub

Fix ZeRO 2 + Pipelining (#677)

* Fix ZeRO 2 + Pipelining
parent e59ba12d
@@ -206,6 +206,16 @@ class PipelineEngine(DeepSpeedEngine):
         self.set_dataloader(pipe_dataloader)
 
     def _exec_reduce_tied_grads(self):
+        # We need to run this first to write to self.averaged_gradients;
+        # since this class turns `enable_backward_allreduce` off,
+        # `self.overlapping_partition_gradients_reduce_epilogue()` defined in the DeepSpeedEngine
+        # never actually runs. I suspect this is because of efficiency problems; get_flat_partition in
+        # stage2.py might do something expensive; someone will have to look into that later. But
+        # in the meantime, this fixes ZeRO2 + Pipelining enough to run a demo. Further profiling is
+        # needed to decide if it actually breaks anything.
+        # (see https://github.com/EleutherAI/gpt-neox/issues/62#issuecomment-761471944)
+        if self.zero_optimization_partition_gradients():
+            self.optimizer.overlapping_partition_gradients_reduce_epilogue()
         self.module.allreduce_tied_weight_gradients()
 
     def _exec_reduce_grads(self):
...
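
For context, below is a minimal sketch of the setup this commit unblocks: ZeRO stage 2 gradient partitioning combined with DeepSpeed pipeline parallelism, as supported by the DeeperSpeed fork. Everything beyond the ZeRO and pipeline settings (layer sizes, optimizer hyperparameters, the commented-out train_loader) is an illustrative assumption, and the exact deepspeed.initialize keywords vary across releases; this is not part of the commit.

# Minimal sketch (illustrative; not part of this commit) of ZeRO stage 2 plus
# pipeline parallelism. Layer sizes, optimizer settings, and the data loader are
# assumptions; run under the `deepspeed` launcher on at least 2 GPUs.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()  # PipelineModule needs torch.distributed initialized

# Toy layer stack; in practice the last stage must produce the loss, or a
# loss_fn is passed to PipelineModule.
layers = [nn.Linear(512, 512) for _ in range(8)]
model = PipelineModule(layers=layers, num_stages=2)

ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},  # gradient partitioning (ZeRO 2)
}

# deepspeed.initialize returns a PipelineEngine for a PipelineModule. With stage 2
# enabled, _exec_reduce_tied_grads now runs the partition-gradient epilogue before
# all-reducing tied weight gradients. (Older releases take config_params= or read
# the config path from args instead of config=.)
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config,
)

# loss = engine.train_batch(data_iter=iter(train_loader))  # train_loader is hypothetical

The effect of the patch is not visible in this sketch directly: per the comment above, without the epilogue call, self.averaged_gradients would never be populated in this configuration because the pipeline engine turns enable_backward_allreduce off.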