    Rewrite FSDP wrapping as modeling hook · dc6fac12
    Anthony Chen authored
    Summary:
    Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440
    
    Move FSDP wrapping to runner.build_model by rewriting it as a modeling hook
    
    **Motivation**
    When a model is too large to run inference on a single GPU, it must be wrapped with FSDP in local checkpointing mode to reduce peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP, which can cause OOM errors for the reason above. It is therefore better practice to wrap the model with FSDP inside `runner.build_model(cfg)`, so that evaluation runs in the same FSDP setting as training.
    
    This diff moves FSDP wrapping to `runner.build_model(cfg)` by rewriting it as a modeling hook.
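
    A minimal sketch of what such a hook could look like, assuming d2go's modeling-hook convention of `apply()`/`unapply()` methods; the FSDP constructor arguments and the mapping from `FSDP.ALGORITHM` to PyTorch sharding strategies are illustrative assumptions, not the exact code in this diff:

    ```python
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import ShardingStrategy

    # Assumed mapping from the config values to PyTorch FSDP strategies.
    _ALGORITHM_TO_STRATEGY = {
        # "full" shards parameters, gradients, and optimizer state.
        "full": ShardingStrategy.FULL_SHARD,
        # "grad_optim" shards only gradients and optimizer state.
        "grad_optim": ShardingStrategy.SHARD_GRAD_OP,
    }

    class FSDPModelingHook:
        """Wraps the model with FSDP when runner.build_model(cfg) runs."""

        def __init__(self, cfg):
            self.cfg = cfg

        def apply(self, model: nn.Module) -> nn.Module:
            strategy = _ALGORITHM_TO_STRATEGY[self.cfg.FSDP.ALGORITHM]
            # Wrapping at build time means eval-only runs see the same
            # sharded model as training.
            return FSDP(model, sharding_strategy=strategy)

        def unapply(self, model: nn.Module) -> nn.Module:
            # FSDP flattens and shards parameters in place, so the original
            # module cannot be recovered (see the Note below).
            raise NotImplementedError("FSDP wrapping cannot be undone")
    ```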
    
    **API changes**
    * Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP.
    * `FSDP.ALGORITHM` can only be `full` or `grad_optim` (see the usage sketch after this list).
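
    For illustration, enabling the hook could look like the following; `runner.build_model(cfg)` is the entry point this diff changes, while the surrounding config setup is an assumption:

    ```python
    # Hypothetical usage: enable the FSDP modeling hook via config, then build.
    cfg.MODEL.MODELING_HOOKS = list(cfg.MODEL.MODELING_HOOKS) + ["FSDPModelingHook"]
    cfg.FSDP.ALGORITHM = "grad_optim"  # or "full"

    # build_model now returns the FSDP-wrapped model, so eval-only runs use
    # the same sharding as training.
    model = runner.build_model(cfg)
    ```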
    
    **Note**
    It's not possible to unwrap an FSDP model back to the original model, so `FSDPModelingHook.unapply()` can't be implemented.
    
    Reviewed By: wat3rBro
    
    Differential Revision: D41416917
    
    fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06