1. 13 Jan, 2023 1 commit
    • Rewrite FSDP wrapping as modeling hook · dc6fac12
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440
      
      Move FSDP wrapping to runner.build_model by rewriting it as a modeling hook
      
      **Motivation**
      When a model is too large to run inference on a single GPU, it requires FSDP with local checkpointing mode to reduce peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP, which may cause OOM errors for the reason above. It is therefore better practice to wrap the model with FSDP during `runner.build_model(cfg)`, so that evaluation runs in the same FSDP setting as training.
      
      This diff moves FSDP wrapping to `runner.build_model(cfg)` by rewriting it as a modeling hook.
      
      **API changes**
      * Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP.
      * `FSDP.ALGORITHM` can only be `full` or `grad_optim` (see the config sketch below).
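
      A minimal sketch of enabling the hook through the config, assuming d2go's `create_runner`/`get_default_cfg` helpers and that the FSDP config keys are registered in the runner's default config (illustrative only, not the exact production setup):

      ```python
      from d2go.runner import create_runner  # assumed entry point for this sketch

      runner = create_runner("d2go.runner.GeneralizedRCNNRunner")
      cfg = runner.get_default_cfg()
      # Enable FSDP wrapping inside runner.build_model(cfg).
      cfg.MODEL.MODELING_HOOKS = list(cfg.MODEL.MODELING_HOOKS) + ["FSDPModelingHook"]
      cfg.FSDP.ALGORITHM = "grad_optim"  # or "full"
      model = runner.build_model(cfg)  # comes back already FSDP-wrapped
      ```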
      
      **Note**
      It's not possible to unwrap an FSDP model back into the original model, so `FSDPModelingHook.unapply()` can't be implemented.
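
      For illustration only, a minimal sketch of such a hook, assuming a modeling-hook interface with `apply`/`unapply` methods (the class body below is a simplification, not the actual d2go implementation):

      ```python
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


      class FSDPModelingHookSketch:
          """Simplified stand-in for the FSDPModelingHook described above."""

          def apply(self, model: nn.Module) -> nn.Module:
              # Wrap the freshly built model so training and evaluation both
              # run under the same FSDP setting.
              return FSDP(model)

          def unapply(self, model: nn.Module) -> nn.Module:
              # An FSDP-wrapped model cannot be unwrapped back into the
              # original module, so this is intentionally unsupported.
              raise NotImplementedError("FSDP wrapping cannot be undone")
      ```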
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41416917
      
      fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06
  2. 19 Dec, 2022 1 commit
  3. 17 Nov, 2022 1 commit
    • Integrate PyTorch Fully Sharded Data Parallel (FSDP) · 02625ff8
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/396
      
      Integrate PyTorch FSDP, which supports two sharding modes: 1. gradient + optimizer state sharding; 2. full model sharding (parameters + gradients + optimizer state). This feature is enabled in the train_net.py code path.
      
      Sources
      * Integration follows this tutorial: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
      
      API changes
      * Add new config keys to support the new feature. Refer to mobile-vision/d2go/d2go/trainer/fsdp.py for the full list of config options
      * Add `FSDPCheckpointer` as a subclass of `QATCheckpointer` to support the special loading/saving logic for FSDP models.
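
      The two modes map naturally onto PyTorch's built-in sharding strategies; a rough sketch of the wrapping step, assuming an already-initialized process group (the helper name `wrap_with_fsdp` and the config-to-strategy mapping are illustrative, not the actual d2go code):

      ```python
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

      # Illustrative mapping from the config value to a PyTorch sharding strategy.
      _STRATEGIES = {
          "grad_optim": ShardingStrategy.SHARD_GRAD_OP,  # shard gradients + optimizer state
          "full": ShardingStrategy.FULL_SHARD,           # also shard parameters
      }


      def wrap_with_fsdp(model: nn.Module, algorithm: str) -> nn.Module:
          # Requires torch.distributed to be initialized, as train_net.py does.
          return FSDP(model, sharding_strategy=_STRATEGIES[algorithm])
      ```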
      
      Reviewed By: wat3rBro
      
      Differential Revision: D39228316
      
      fbshipit-source-id: 342ecb3bcbce748453c3fba2d6e1b7b7e478473c
  4. 14 Nov, 2022 1 commit
  5. 23 Oct, 2022 1 commit
  6. 09 Aug, 2022 2 commits
  7. 28 Jul, 2022 1 commit
  8. 27 Jul, 2022 1 commit
  9. 25 Jul, 2022 1 commit
  10. 22 Jul, 2022 1 commit
  11. 30 Jun, 2022 1 commit
  12. 24 Jun, 2022 2 commits
    • Only save results to file from rank 0 · f0297b81
      Mik Vyatskov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/309
      
      Right now multiple machines can try to write to the same output file,
      since they all receive the same argument. Additionally, several outputs can be
      saved on the same machine, which requires unnecessary unpacking. This change makes
      train_net write output only from the rank 0 trainer.
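
      A minimal sketch of the rank-0 guard, using detectron2's `comm` utilities (the actual call site and output format in train_net are not shown here):

      ```python
      import json

      from detectron2.utils import comm


      def save_results(results: dict, output_path: str) -> None:
          # Only the global rank-0 trainer writes the results file, so multiple
          # machines no longer race to write the same path.
          if comm.is_main_process():
              with open(output_path, "w") as f:
                  json.dump(results, f)
      ```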
      
      Reviewed By: wat3rBro
      
      Differential Revision: D37310084
      
      fbshipit-source-id: 9d5352a274e8fb1d2043393b12896d402333c17b
    • use runner class instead of instance outside of main · 8051775c
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/312
      
      As discussed, we decided not to use the runner instance outside of `main`. Previous diffs already solved the prerequisites; this diff mainly does the renaming.
      - Use the runner name (str) in the FBLearner ML pipeline.
      - Use the runner name (str) in the FBL operator, MAST, and binary operator.
      - Use the runner class as the interface of `main`; it can be either the name of the class (str) or the actual class. The main usage should be the `str` form, so that importing the class happens inside `main`. But it's also a common use case to import the runner class and call `main` directly for things like ad-hoc scripts or tests; supporting the actual class makes it easier to modify code for those cases (e.g. some local test class doesn't have an importable name, so using the runner name isn't feasible). See the sketch below.
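
      A rough sketch of how `main` might accept either form, assuming a dotted-path runner name (the helper `resolve_runner_class` is hypothetical, not the actual d2go code):

      ```python
      import importlib
      from typing import Union


      def resolve_runner_class(runner: Union[str, type]) -> type:
          # Accept either the dotted class name (preferred, so the import happens
          # here inside main) or an already-imported class (handy for ad-hoc
          # scripts and tests whose classes have no importable name).
          if isinstance(runner, str):
              module_name, _, class_name = runner.rpartition(".")
              return getattr(importlib.import_module(module_name), class_name)
          return runner
      ```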
      
      Reviewed By: newstzpz
      
      Differential Revision: D37060338
      
      fbshipit-source-id: 879852d41902b87d6db6cb9d7b3e8dc55dc4b976
  13. 18 Jun, 2022 2 commits
  14. 16 Jun, 2022 1 commit
  15. 15 Jun, 2022 1 commit
  16. 15 May, 2022 1 commit
    • apply import merging for fbcode (7 of 11) · b3a9204c
      John Reese authored
      Summary:
      Applies new import merging and sorting from µsort v1.0.
      
      When merging imports, µsort makes a best effort to move associated
      comments to match the merged elements, but there are known limitations due to
      the dynamic nature of Python and developer tooling. These changes should
      not produce any dangerous runtime changes, but may require touch-ups to
      satisfy linters and other tooling.
      
      Note that µsort uses case-insensitive, lexicographical sorting, which
      results in a different ordering compared to isort. This provides a more
      consistent sorting order, matching the case-insensitive order used when
      sorting import statements by module name, and ensures that "frog", "FROG",
      and "Frog" always sort next to each other.
      
      For details on µsort's sorting and merging semantics, see the user guide:
      https://usort.readthedocs.io/en/stable/guide.html#sorting
      
      Reviewed By: lisroach
      
      Differential Revision: D36402205
      
      fbshipit-source-id: a4efc688d02da80c6e96685aa8eb00411615a366
  17. 05 Mar, 2022 1 commit
  18. 03 Mar, 2022 1 commit
  19. 22 May, 2021 1 commit
    • support FP16 gradient compression · 57809b0f
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/70
      
      DDP supports an `fp16_compress_hook` communication hook, which compresses gradients to FP16 before communication. This can result in a significant speedup.
      
      Add one config option, `_C.MODEL.DDP_FP16_GRAD_COMPRESS`, to enable it.
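
      Registering the hook on a DDP-wrapped model uses PyTorch's built-in default hooks; a minimal sketch (the config plumbing around `_C.MODEL.DDP_FP16_GRAD_COMPRESS` is not shown, and the helper name is illustrative):

      ```python
      from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
      from torch.nn.parallel import DistributedDataParallel as DDP


      def maybe_enable_fp16_grad_compression(ddp_model: DDP, enabled: bool) -> None:
          # Compress gradients to FP16 before the allreduce and decompress after,
          # trading a little precision for reduced communication volume.
          if enabled:
              ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
      ```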
      
      Reviewed By: zhanghang1989
      
      Differential Revision: D28467701
      
      fbshipit-source-id: 3c80865222f48eb8fe6947ea972448c445ee3ef3
  20. 03 Mar, 2021 1 commit