Commit 23e9dc2e authored by Halil Akin, committed by Facebook Github Bot

Fix another distributed syncing issue

Summary:
This is another failure caused by the distributed GPUs getting out of sync.
We currently decide when to run save_and_eval (which makes the inter-GPU communication
calls) by looking at the number of updates. But the number of updates counts weight
updates: whenever something goes wrong in training and the weights can't be updated,
the nodes fall out of sync and start failing. We should check the number of iterations instead.
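
For illustration, a minimal sketch of the failure mode and the fix, assuming a
hypothetical training loop; train_loop, batches, save_interval, and the
save_and_eval callback below are stand-ins, not the actual fbtranslate code:

def train_loop(trainer, batches, save_interval, save_and_eval):
    for batch in batches:
        trainer.train_step(batch)  # the weight update may be skipped (e.g. gradient overflow)

        # Gating on trainer.get_num_updates() is fragile: if a step fails to
        # update the weights, the counter stalls, the ranks stop agreeing on
        # when to enter save_and_eval, and its collective communication hangs.
        #
        # Gating on trainer.get_num_iterations() is safe: every rank increments
        # it on every step, so the condition evaluates identically everywhere.
        if trainer.get_num_iterations() % save_interval == 0:
            save_and_eval(trainer)  # stand-in for the inter-GPU communication calls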

I am, again, making a small change to save the day, but we should decouple and
refactor the save_and_eval logic from the training loop to have fewer headaches
in the future. I'm planning to work on that later; for now, this should solve
some of the issues.

Reviewed By: jhcross

Differential Revision: D10478427

fbshipit-source-id: b9deacfea252b2fb66b81c799fa78e2439fa514c
parent 8441cbf3
@@ -45,6 +45,7 @@ class Trainer(object):
         self._model = model.cuda()
         self._dummy_batch = dummy_batch
+        self._num_iterations = 0
         self._num_updates = 0
         self._optim_history = None
         self._optimizer = None
@@ -223,6 +224,7 @@
                 ).format(self.task.__class__.__name__))
         try:
+            self._num_iterations += 1
             # normalize grads by sample size
             self.optimizer.multiply_grads(self.args.distributed_world_size / float(sample_size))
@@ -355,6 +357,10 @@
         """Get the number of parameters updates."""
         return self._num_updates
 
+    def get_num_iterations(self):
+        """Get the number of iterations."""
+        return self._num_iterations
+
     def _prepare_sample(self, sample):
         if sample is None or len(sample) == 0:
             return None
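
Not part of this diff, but for context, a rough sketch of how the two counters
differ, using a simplified stand-in class (ToyTrainer and the overflow flag are
illustrative assumptions, not the real Trainer):

class ToyTrainer:
    """Illustrative stand-in showing the counter semantics, not the real Trainer."""

    def __init__(self):
        self._num_iterations = 0  # advances on every train_step call
        self._num_updates = 0     # advances only when the weights actually change

    def train_step(self, sample, overflow=False):
        self._num_iterations += 1
        if not overflow:  # e.g. an FP16 gradient overflow would skip the weight update
            self._num_updates += 1

    def get_num_updates(self):
        return self._num_updates

    def get_num_iterations(self):
        return self._num_iterations

Since get_num_iterations() advances on every step while get_num_updates() can
stall, gating collective calls like save_and_eval on the former keeps all ranks
in lockstep.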