Commit 78328839 authored by Francisc Bungiu, committed by Facebook GitHub Bot

Fix key error 0 in multinode training

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/579

The existing code assumed that training runs on a single node, so that every node hosts global rank 0. This assumption fails in multinode training: nodes other than the first have no global rank 0, so the unconditional lookup of outputs[0] raised a KeyError for key 0.

Reviewed By: crassirostris

Differential Revision: D46841286

fbshipit-source-id: d57919239fa5042de795d74c9c2013b07c9a0a48
parent 1a8e1283
@@ -157,6 +157,7 @@ def run_with_cmdline_args(args):
             return_save_file=None,
             shared_context=shared_context,
         )
+        outputs = {0: result}
     else:
         outputs = launch(
             main_func,
@@ -172,11 +173,10 @@ def run_with_cmdline_args(args):
                 "resume": args.resume,
             },
         )
-        result = outputs[0]
 
     # Only save results from global rank 0 for consistency.
     if args.save_return_file is not None and args.machine_rank == 0:
-        save_binary_outputs(args.save_return_file, result)
+        save_binary_outputs(args.save_return_file, outputs[0])
 
 
 def cli(args=None):
......
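
The failure mode described in the summary can be reproduced with a small standalone sketch. It is not the d2go code: fake_launch is a hypothetical stand-in for launch(), and it assumes that the launcher returns a dict keyed by global rank containing only the processes that ran on the local machine. The two pick_result_* helpers are likewise illustrative names that mirror the lookup before and after this change.

```python
from typing import Any, Dict, Optional


def fake_launch(machine_rank: int, procs_per_machine: int) -> Dict[int, Any]:
    """Hypothetical stand-in for launch().

    Assumption: the launcher returns results keyed by global rank, and only
    for the processes that ran on this machine.
    """
    first_rank = machine_rank * procs_per_machine
    return {
        rank: f"result-from-rank-{rank}"
        for rank in range(first_rank, first_rank + procs_per_machine)
    }


def pick_result_before_fix(outputs: Dict[int, Any], machine_rank: int) -> Optional[Any]:
    # Unconditional lookup: raises KeyError on any machine without global rank 0.
    result = outputs[0]
    return result if machine_rank == 0 else None


def pick_result_after_fix(outputs: Dict[int, Any], machine_rank: int) -> Optional[Any]:
    # Key 0 is read only on the machine that hosts global rank 0.
    return outputs[0] if machine_rank == 0 else None


if __name__ == "__main__":
    # Second machine of a two-process-per-machine job: holds ranks 2 and 3.
    outputs = fake_launch(machine_rank=1, procs_per_machine=2)
    try:
        pick_result_before_fix(outputs, machine_rank=1)
    except KeyError as err:
        print("before fix:", repr(err))
    print("after fix:", pick_result_after_fix(outputs, machine_rank=1))
```

Under these assumptions, the pre-fix helper fails on every machine except the first, while the post-fix helper touches key 0 only where machine_rank is 0, which is the same guard the diff applies before calling save_binary_outputs.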