support specifying concurrency level for interleave

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/481
X-link: https://github.com/facebookresearch/mobile-vision/pull/139

Also support specifying a concurrency limit (the number of ranks allowed to run at the same time) for interleaving.

Reviewed By: mattcyu1
Differential Revision: D43522445
fbshipit-source-id: 790a8527c6b42c9098ef82c4fc01ec1a528e2418

Commit 4e4a865c authored Feb 23, 2023 by Yanghan Wang Committed by Facebook GitHub Bot Feb 23, 2023

Showing with 4 additions and 2 deletions

d2go/checkpoint/fsdp_checkpoint.py +4 -2

--- a/d2go/checkpoint/fsdp_checkpoint.py
+++ b/d2go/checkpoint/fsdp_checkpoint.py
@@ -149,13 +149,15 @@ class FSDPCheckpointer(QATCheckpointer):
                self.tag_last_checkpoint(basename)
    def _save_file(self, data, filename):
-        with interleave_by_rank():
+        # allow 8 GPUs to write to manifold at the same time
+        with interleave_by_rank(concurrency_limit=8):
            self.logger.info("Saving checkpoint to {}".format(filename))
            with self.path_manager.open(filename, "wb") as f:
                torch.save(data, cast(IO[bytes], f))
    def _load_file(self, f: str):
-        with interleave_by_rank():
+        # allow 8 GPUs to read from manifold at the same time
+        with interleave_by_rank(concurrency_limit=8):
            return super()._load_file(f)
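
The diff only shows the call sites; the implementation of `interleave_by_rank` lives in mobile-vision and is not part of this change. As a rough illustration of what a `concurrency_limit` could mean, here is a minimal sketch that lets ranks run the guarded block in waves of at most `concurrency_limit` ranks, with a barrier between waves. The names `interleave_by_rank_sketch` and `simulate`, the explicit `rank`/`world_size`/`barrier` arguments, and the use of threads in place of distributed processes are all assumptions made for this example, not the library's API.

```python
import contextlib
import threading
import time


@contextlib.contextmanager
def interleave_by_rank_sketch(rank, world_size, barrier, concurrency_limit=1):
    """Hypothetical sketch: run the body in waves of `concurrency_limit` ranks.

    `barrier` is any callable that blocks until all ranks have called it.
    """
    if concurrency_limit < 1:
        raise ValueError("concurrency_limit must be >= 1")
    num_waves = -(-world_size // concurrency_limit)  # ceiling division
    my_wave = rank // concurrency_limit
    # Wait for every earlier wave to finish its body.
    for _ in range(my_wave):
        barrier()
    try:
        yield
    finally:
        # Signal completion and keep the barrier count equal on all ranks.
        for _ in range(num_waves - my_wave):
            barrier()


def simulate(world_size, concurrency_limit):
    """Simulate ranks with threads; return the peak number of ranks in the body."""
    sync = threading.Barrier(world_size)
    lock = threading.Lock()
    active = 0
    peak = 0

    def worker(rank):
        nonlocal active, peak
        with interleave_by_rank_sketch(rank, world_size, sync.wait, concurrency_limit):
            with lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)  # stand-in for the checkpoint read or write
            with lock:
                active -= 1

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    print(simulate(world_size=16, concurrency_limit=8))
```

Under this scheme every rank calls the barrier the same number of times, so no rank is left waiting, and a limit of 1 reduces to fully serialized access. Whether the real helper uses waves, a sliding window, or something else is not visible from this diff.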