Commit 23e9dc2e authored by Halil Akin, committed by Facebook Github Bot

Fix another distributed syncing issue

Summary:
This is another failure caused by the distributed GPUs getting out of sync.
We currently decide when to run save_and_eval (which makes the inter-GPU communication
calls) by looking at the number of updates. But the number of updates counts weight
updates: whenever something goes wrong in training and the weights can't be updated,
the nodes fall out of sync and start failing. We should check the number of iterations instead.
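
For illustration, a minimal sketch of the failure mode and the fix, assuming a
hypothetical training loop; train_loop, batches, save_interval, and the
save_and_eval callback below are stand-ins, not the actual fbtranslate code:

def train_loop(trainer, batches, save_interval, save_and_eval):
    for batch in batches:
        trainer.train_step(batch)  # the weight update may be skipped (e.g. gradient overflow)

        # Gating on trainer.get_num_updates() is fragile: if a step fails to
        # update the weights, the counter stalls, the ranks stop agreeing on
        # when to enter save_and_eval, and its collective communication hangs.
        #
        # Gating on trainer.get_num_iterations() is safe: every rank increments
        # it on every step, so the condition evaluates identically everywhere.
        if trainer.get_num_iterations() % save_interval == 0:
            save_and_eval(trainer)  # stand-in for the inter-GPU communication calls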

I am, again, making a small change to save the day, but we should decouple and
refactor the save_and_eval logic from the training loop to have fewer headaches
in the future. I'm planning to work on that later; for now, this should solve
some of the issues.

Reviewed By: jhcross

Differential Revision: D10478427

fbshipit-source-id: b9deacfea252b2fb66b81c799fa78e2439fa514c
parent 8441cbf3
@@ -45,6 +45,7 @@ class Trainer(object):
         self._model = model.cuda()
         self._dummy_batch = dummy_batch
+        self._num_iterations = 0
         self._num_updates = 0
         self._optim_history = None
         self._optimizer = None
@@ -223,6 +224,7 @@
                 ).format(self.task.__class__.__name__))
         try:
+            self._num_iterations += 1
             # normalize grads by sample size
             self.optimizer.multiply_grads(self.args.distributed_world_size / float(sample_size))
@@ -355,6 +357,10 @@
         """Get the number of parameters updates."""
         return self._num_updates
 
+    def get_num_iterations(self):
+        """Get the number of iterations."""
+        return self._num_iterations
+
     def _prepare_sample(self, sample):
         if sample is None or len(sample) == 0:
             return None
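
Not part of this diff, but for context, a rough sketch of how the two counters
differ, using a simplified stand-in class (ToyTrainer and the overflow flag are
illustrative assumptions, not the real Trainer):

class ToyTrainer:
    """Illustrative stand-in showing the counter semantics, not the real Trainer."""

    def __init__(self):
        self._num_iterations = 0  # advances on every train_step call
        self._num_updates = 0     # advances only when the weights actually change

    def train_step(self, sample, overflow=False):
        self._num_iterations += 1
        if not overflow:  # e.g. an FP16 gradient overflow would skip the weight update
            self._num_updates += 1

    def get_num_updates(self):
        return self._num_updates

    def get_num_iterations(self):
        return self._num_iterations

Since get_num_iterations() advances on every step while get_num_updates() can
stall, gating collective calls like save_and_eval on the former keeps all ranks
in lockstep.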