fix a bug when resuming training from the last epoch (#1275)

Summary: If training stopped in the middle of the last epoch and was then resumed from a checkpoint, it would not continue, because the loop condition `epoch_itr.epoch < max_epoch` was no longer satisfied. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/fairseq/pull/1275
Differential Revision: D18483945
Pulled By: myleott
fbshipit-source-id: 80df6f73fa17606a79a28e8328bb4c577f504683

Commit 0d03aa88 authored Nov 14, 2019 by freewym Committed by Facebook Github Bot Nov 14, 2019

Showing with 6 additions and 1 deletion

train.py +6 -1

--- a/train.py
+++ b/train.py
@@ -76,7 +76,12 @@ def main(args, init_distributed=False):
    train_meter = StopwatchMeter()
    train_meter.start()
    valid_subsets = args.valid_subset.split(',')
-    while lr > args.min_lr and epoch_itr.epoch < max_epoch and trainer.get_num_updates() < max_update:
+    while (
+        lr > args.min_lr
+        and (epoch_itr.epoch < max_epoch or (epoch_itr.epoch == max_epoch
+            and epoch_itr._next_epoch_itr is not None))
+        and trainer.get_num_updates() < max_update
+    ):
        # train for one epoch
        train(args, trainer, task, epoch_itr)
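
For reviewers, the change to the loop condition can be sketched as a pair of standalone predicates. This is only an illustration, not code from `train.py`: the function names and the `has_pending_epoch` flag are hypothetical, and the flag stands in for the `epoch_itr._next_epoch_itr is not None` check in the diff, which we read as "a partially consumed epoch was restored from the checkpoint".

```python
def old_should_continue(lr, min_lr, epoch, max_epoch, num_updates, max_update):
    """Loop condition before the patch (hypothetical standalone form)."""
    return lr > min_lr and epoch < max_epoch and num_updates < max_update


def new_should_continue(lr, min_lr, epoch, max_epoch, has_pending_epoch,
                        num_updates, max_update):
    """Loop condition after the patch (hypothetical standalone form).

    has_pending_epoch is an assumed stand-in for
    `epoch_itr._next_epoch_itr is not None` in the diff.
    """
    epochs_remaining = epoch < max_epoch or (
        epoch == max_epoch and has_pending_epoch
    )
    return lr > min_lr and epochs_remaining and num_updates < max_update


# Resumed in the middle of the last epoch (epoch == max_epoch, work pending):
resumed_old = old_should_continue(0.1, 0.0, 10, 10, 500, 1000)
resumed_new = new_should_continue(0.1, 0.0, 10, 10, True, 500, 1000)

# Last epoch fully finished (nothing pending): both versions stop.
finished_new = new_should_continue(0.1, 0.0, 10, 10, False, 500, 1000)
```

Under these assumed values, the old predicate stops immediately on resume while the new one lets the interrupted epoch finish. The learning-rate and update-count limits are unchanged and still end training in either version.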