Commit 0d03aa88 authored by freewym, committed by Facebook Github Bot

fix a bug when resuming training from the last epoch (#1275)

Summary:
If training stopped in the middle of the last epoch and was then resumed from a checkpoint, it would not continue, because `epoch_itr.epoch < max_epoch` was no longer satisfied. This PR fixes the issue by also entering the training loop when `epoch_itr.epoch == max_epoch` and `epoch_itr._next_epoch_itr is not None`.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1275

Differential Revision: D18483945

Pulled By: myleott

fbshipit-source-id: 80df6f73fa17606a79a28e8328bb4c577f504683
parent d9836217
@@ -76,7 +76,12 @@ def main(args, init_distributed=False):
     train_meter = StopwatchMeter()
     train_meter.start()
     valid_subsets = args.valid_subset.split(',')
-    while lr > args.min_lr and epoch_itr.epoch < max_epoch and trainer.get_num_updates() < max_update:
+    while (
+        lr > args.min_lr
+        and (epoch_itr.epoch < max_epoch or (epoch_itr.epoch == max_epoch
+            and epoch_itr._next_epoch_itr is not None))
+        and trainer.get_num_updates() < max_update
+    ):
         # train for one epoch
         train(args, trainer, task, epoch_itr)
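
The effect of the changed loop condition can be sketched in isolation. This is a minimal illustration and not fairseq code: `StubEpochIterator`, `should_continue_old`, and `should_continue_new` are hypothetical names, and the sketch assumes that a non-`None` `_next_epoch_itr` means a partially consumed epoch is waiting to be resumed.

```python
import math
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class StubEpochIterator:
    """Hypothetical stand-in exposing only the two attributes the loop reads."""
    epoch: int
    # Assumption: non-None means a partially consumed epoch is pending.
    _next_epoch_itr: Optional[Iterator] = None


def should_continue_old(lr, min_lr, epoch_itr, max_epoch, num_updates, max_update):
    # Condition before the change: strictly fewer epochs than max_epoch.
    return lr > min_lr and epoch_itr.epoch < max_epoch and num_updates < max_update


def should_continue_new(lr, min_lr, epoch_itr, max_epoch, num_updates, max_update):
    # Condition after the change: also continue when the last epoch is unfinished.
    epoch_ok = epoch_itr.epoch < max_epoch or (
        epoch_itr.epoch == max_epoch and epoch_itr._next_epoch_itr is not None
    )
    return lr > min_lr and epoch_ok and num_updates < max_update


# Resumed in the middle of the last epoch (epoch == max_epoch, iterator pending).
resumed = StubEpochIterator(epoch=10, _next_epoch_itr=iter([1, 2, 3]))
# Last epoch fully completed (nothing pending).
finished = StubEpochIterator(epoch=10)

common = dict(lr=0.1, min_lr=1e-5, max_epoch=10, num_updates=500, max_update=math.inf)

print(should_continue_old(epoch_itr=resumed, **common))   # False: training would not resume
print(should_continue_new(epoch_itr=resumed, **common))   # True: the last epoch is finished off
print(should_continue_new(epoch_itr=finished, **common))  # False: nothing left to train
```

Under these assumptions, the new condition differs from the old one only when `epoch_itr.epoch == max_epoch` and an iterator is pending, so runs that are not resumed mid-epoch behave as before.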