"vscode:/vscode.git/clone" did not exist on "e9fca7bdb422b19a2d629960a0654edd65a290d4"
avoid "divided by zero error" in logging_outputs when --use-bmuf is e… (#812)
Summary:
When doing multi-GPU training with --use-bmuf turned on and --global-sync-iter > 1, each replica may not sync with the other replicas at every iteration, so logging_outputs only contains each replica's own stats. In addition, logging_outputs may be empty at the end of an epoch after a "dummy iteration", because the number of replicas does not evenly divide the number of batches in the training data. When this happens, sample_size and ntokens are 0 for some replica, which causes a "divided by zero" error. This fix sets *loss to 0 if sample_size/ntokens is 0.

Pull Request resolved: https://github.com/pytorch/fairseq/pull/812

Reviewed By: myleott, yqwangustc

Differential Revision: D15908614

Pulled By: nayansinghal

fbshipit-source-id: c92e8e095f012bdb4ef753a3c627fd215afa215d
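For illustration, here is a minimal sketch of the kind of guard described above, assuming a criterion that averages loss over sample_size and nll_loss over ntokens. The helper name `aggregate_losses` is hypothetical; this is not the actual fairseq diff, only an example of reporting 0 when a replica has no data.

```python
# Minimal sketch (not the actual fairseq change) of guarding per-replica loss
# aggregation against empty logging_outputs.

def aggregate_losses(logging_outputs):
    """Return averaged losses, falling back to 0 when a replica has no data."""
    sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)
    ntokens = sum(log.get("ntokens", 0) for log in logging_outputs)
    loss_sum = sum(log.get("loss", 0.0) for log in logging_outputs)
    nll_loss_sum = sum(log.get("nll_loss", 0.0) for log in logging_outputs)

    # With --use-bmuf and --global-sync-iter > 1, a replica can hit a dummy
    # iteration at the end of an epoch and see sample_size == 0 or
    # ntokens == 0; dividing would raise ZeroDivisionError, so report 0 instead.
    loss = loss_sum / sample_size if sample_size > 0 else 0.0
    nll_loss = nll_loss_sum / ntokens if ntokens > 0 else 0.0
    return {"loss": loss, "nll_loss": nll_loss, "sample_size": sample_size}
```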