add option to load checkpoints to GPU
Summary:
X-link: https://github.com/facebookresearch/detectron2/pull/4667
X-link: https://github.com/fairinternal/detectron2/pull/578
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/411

Add config option `cfg.LOAD_CKPT_TO_GPU` to load checkpoints to the worker's current GPU.

Previously, D2Go (https://github.com/facebookresearch/d2go/commit/87374efb134e539090e0b5c476809dc35bf6aedb) mapped checkpoints to CPU before loading them into the model. In large-scale distributed training, many GPU processes may be used to train a single model, so each process loads the model checkpoint into CPU memory separately, and the same checkpoint ends up resident many times over. This can cause a CPU OOM when the checkpoint is large.

There are two solutions to this problem: load checkpoints directly to GPU, or share one copy of the checkpoint between GPU processes via shared memory. This diff implements the first solution, which supports cases where the model size plus the checkpoint size fits within GPU memory. The second solution may be revisited for large models that need to offload checkpoints to CPU.

Reference diff: D40789062

Reviewed By: mcimpoi

Differential Revision: D41063306

fbshipit-source-id: edcfd390a25582fffb2f1a6a7fc22917874ee2fc
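The sketch below illustrates the idea behind the option, not the actual D2Go checkpointer code: the helper name `load_checkpoint` and its `load_to_gpu` flag are hypothetical, standing in for whatever code path `cfg.LOAD_CKPT_TO_GPU` toggles. The key point is the `map_location` passed to `torch.load`, which decides whether checkpoint tensors land in host RAM or on the worker's current CUDA device.

```python
# Minimal sketch, assuming a hypothetical helper; not the actual D2Go implementation.
import torch


def load_checkpoint(path: str, load_to_gpu: bool):
    """Load a checkpoint either onto the worker's current GPU or onto CPU.

    `load_to_gpu` mirrors the behavior toggled by cfg.LOAD_CKPT_TO_GPU (assumed):
    when True, each rank maps checkpoint tensors directly to its own CUDA device,
    avoiding N copies in host RAM when N GPU workers read the same file.
    """
    if load_to_gpu and torch.cuda.is_available():
        # Map to this process's current CUDA device
        # (typically set earlier via torch.cuda.set_device(local_rank)).
        map_location = torch.device("cuda", torch.cuda.current_device())
    else:
        # Previous default behavior: stage the checkpoint in CPU memory first.
        map_location = torch.device("cpu")
    return torch.load(path, map_location=map_location)
```

With `load_to_gpu=True`, this trades host RAM for GPU memory, which is why the commit notes it only covers cases where model plus checkpoint fit in GPU memory.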