Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440

Move FSDP wrapping into `runner.build_model` by rewriting it as a modeling hook.

**Motivation**

When a model is too large to run inference on a single GPU, it requires FSDP with local checkpointing mode to reduce peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP, which can cause OOM errors for the reason above. It is therefore better practice to wrap the model with FSDP during `runner.build_model(cfg)`, so evaluation runs in the same FSDP setting as training. This diff moves FSDP wrapping into `runner.build_model(cfg)` by rewriting it as a modeling hook.

**API changes**

* Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP (see the config sketch after this message).
* `FSDP.ALGORITHM` can only be `full` or `grad_optim`.

**Note**

It is not possible to unwrap an FSDP model back into the original model, so `FSDPModelingHook.unapply()` cannot be implemented.

Reviewed By: wat3rBro

Differential Revision: D41416917

fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06
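For reference, a minimal sketch of the new usage from Python. The runner class and the `get_default_cfg()` call are illustrative assumptions; only `MODEL.MODELING_HOOKS`, `"FSDPModelingHook"`, `FSDP.ALGORITHM` with values `full`/`grad_optim`, and `runner.build_model(cfg)` come from the summary above.

```python
# Hypothetical sketch, not verbatim from this diff: the runner class and
# get_default_cfg() are assumptions about a typical d2go setup.
from d2go.runner import GeneralizedRCNNRunner

runner = GeneralizedRCNNRunner()
cfg = runner.get_default_cfg()

# Enable FSDP via the new modeling hook; build_model will then return an
# FSDP-wrapped model for both training and eval-only (eval_pytorch) runs.
cfg.MODEL.MODELING_HOOKS = list(cfg.MODEL.MODELING_HOOKS) + ["FSDPModelingHook"]

# ALGORITHM can only be "full" or "grad_optim".
cfg.FSDP.ALGORITHM = "grad_optim"

model = runner.build_model(cfg)  # model comes back already wrapped by FSDP
```

Because the model is wrapped inside `runner.build_model(cfg)`, no separate wrapping step is needed in the training loop, and evaluation sees the same FSDP-wrapped module as training.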
dc6fac12