Commit ec6f8ef9 authored by Myle Ott's avatar Myle Ott Committed by Facebook Github Bot
Browse files

Do distributed init after data loading

Summary:
FACEBOOK

This switches back to torch.multiprocessing.spawn, instead of directly calling fb_train.par using a subprocess.Process. This has the advantage that exceptions are propagated properly. It also moves the distributed_init part to happen after data loading, which gets around the timeout issue.

The downside of this approach is that it's not so easy to pipe stdout to multiple places, which was nice when using the sweep.py scripts. I'm still working on a fix for that.

Reviewed By: rutyrinott, ngoyal2707

Differential Revision: D13873224

fbshipit-source-id: 08d593233b8d23590c01c723363630a79804a8b0
parent 3dce7c9f
......@@ -24,7 +24,7 @@ from fairseq.meters import AverageMeter, StopwatchMeter
from fairseq.utils import import_user_module
def main(args):
def main(args, init_distributed=False):
import_user_module(args)
if args.max_tokens is None:
......@@ -41,6 +41,12 @@ def main(args):
# Load dataset splits
load_dataset_splits(task, ['train', 'valid'])
# Initialize distributed training (after data loading)
if init_distributed:
import socket
args.distributed_rank = distributed_utils.distributed_init(args)
print('| initialized host {} as rank {}'.format(socket.gethostname(), args.distributed_rank))
# Build model and criterion
model = task.build_model(args)
criterion = task.build_criterion(args)
......@@ -368,13 +374,10 @@ def load_dataset_splits(task, splits):
def distributed_main(i, args):
    """Entry point for a single distributed worker spawned by
    torch.multiprocessing.spawn.

    Args:
        i: worker index assigned by spawn (0-based); used as the device id
            and, when no explicit rank was configured, as the distributed rank.
        args: parsed argparse namespace; mutated in place (``device_id`` and
            possibly ``distributed_rank`` are set before training starts).
    """
    args.device_id = i
    # torch.multiprocessing.spawn passes only the worker index, so derive the
    # rank from it unless the launcher already assigned one explicitly.
    if args.distributed_rank is None:
        args.distributed_rank = i
    # Defer distributed_init until after data loading inside main(), which
    # avoids process-group timeout while slow dataset loading happens.
    main(args, init_distributed=True)
if __name__ == '__main__':
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or sign in to comment