1. 05 Apr, 2023 1 commit
    • Setup root logger once & on import time · abdeafb0
      Mik Vyatskov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/523
      
      To avoid setting it up multiple times, add a run_once() decorator.
      
      Additionally, make sure logging is configured for dataloading workers, which have a different entry point, by moving the logging setup to import time. Currently, when a dataloader worker is created using the spawn method of the multiprocessing module, a new Python interpreter is started, all modules are imported anew, and the entry point is set to the specified method. This means the training framework's entry point is skipped, together with the logging setup.
      
      With this change, logging is configured at import time. When a dataloading process is created, the training main is not invoked, but train_net is still imported in the child process, so logging gets configured anyway.
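      A minimal sketch of the mechanism, assuming a decorator-factory form of run_once(); the setup_logging helper, its formatting, and the decorator internals are illustrative assumptions, not the actual d2go code:

      ```python
      import functools
      import logging
      import sys


      def run_once():
          """Decorator factory: the wrapped function becomes a no-op after its first call."""

          def decorator(fn):
              has_run = False

              @functools.wraps(fn)
              def wrapper(*args, **kwargs):
                  nonlocal has_run
                  if not has_run:
                      has_run = True
                      return fn(*args, **kwargs)

              return wrapper

          return decorator


      @run_once()
      def setup_logging():
          """Configure the root logger; repeated calls are no-ops thanks to run_once()."""
          handler = logging.StreamHandler(sys.stdout)
          handler.setFormatter(
              logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
          )
          logging.root.addHandler(handler)
          logging.root.setLevel(logging.INFO)


      # Calling this at import time means a spawned dataloader worker, which
      # re-imports the module but never runs the training entry point, still
      # ends up with logging configured.
      setup_logging()
      ```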
      
      Reviewed By: miqueljubert
      
      Differential Revision: D44641142
      
      fbshipit-source-id: 06ea85363d965b31d7f9ade3c2615ed9db67470b
  2. 16 Feb, 2023 1 commit
    • Add reply files to d2go training processes · f0f55cdc
      Sudarshan Raghunathan authored
      Summary:
      This diff contains a minimal set of changes to support returning reply files to MAST.
      
      There are three parts:
      1. First, we have a try..except in the main function to catch all the "catchable" Python exceptions. Exceptions from C++ code or segfaults will not be handled here.
      2. Each exception is then written to a per-process JSON reply file.
      3. At the end, all per-process files are stat-ed and the earliest one is copied to a location specified by MAST (a minimal sketch of this flow follows below).
      
      # Limitations
      1. This only works when local processes are launched using multiprocessing (which is the default)
      2. If any error happens in C++ code, it will likely not be caught in Python and the reply file might not have the correct logs.
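      A minimal sketch of the per-process write and the final "earliest file wins" copy; the helper names, the JSON payload fields, and the file naming scheme are assumptions for illustration, not the actual diff:

      ```python
      import glob
      import json
      import os
      import shutil
      import traceback


      def write_reply_file(exc: BaseException, reply_dir: str, rank: int) -> None:
          """Serialize a caught exception to a per-process JSON reply file."""
          payload = {
              "rank": rank,
              "error_type": type(exc).__name__,
              "message": str(exc),
              "traceback": traceback.format_exc(),
          }
          with open(os.path.join(reply_dir, f"reply_rank{rank}.json"), "w") as f:
              json.dump(payload, f)


      def copy_earliest_reply_file(reply_dir: str, dest_path: str) -> None:
          """Stat all per-process reply files and copy the earliest one to the final location."""
          files = glob.glob(os.path.join(reply_dir, "reply_rank*.json"))
          if files:
              earliest = min(files, key=lambda p: os.stat(p).st_mtime)
              shutil.copyfile(earliest, dest_path)
      ```

      In the main function, write_reply_file would be called from the except branch before re-raising, and copy_earliest_reply_file from the parent process after all local workers exit.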
      
      Differential Revision: D43097683
      
      fbshipit-source-id: 0eaf4f19f6199a9c77f2ce4c7d2bbc2a2078be99
  3. 13 Jan, 2023 1 commit
    • Rewrite FSDP wrapping as modeling hook · dc6fac12
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440
      
      Move FSDP wrapping to runner.build_model by rewriting it as a modeling hook
      
      **Motivation**
      When a model is too large to run inference on a single GPU, it requires using FSDP with local checkpointing mode to save peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP, which may cause OOM errors for the reasons above. Thus, it is better practice to wrap the model with FSDP during `runner.build_model(cfg)`, so evaluation can run in the same FSDP setting as training.
      
      This diff moves FSDP wrapping to `runner.build_model(cfg)` by rewriting it as a modeling hook.
      
      **API changes**
      * Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP.
      * `FSDP.ALGORITHM` can only be `full` or `grad_optim`.
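      A minimal sketch of what enabling the hook might look like, assuming `cfg` is the usual yacs-style CfgNode produced by the d2go runner; only the two keys named above come from this summary, everything else is illustrative:

      ```python
      def enable_fsdp_hook(cfg):
          """Turn on FSDP through the modeling-hook API described above (sketch)."""
          hooks = list(cfg.MODEL.MODELING_HOOKS)
          if "FSDPModelingHook" not in hooks:
              hooks.append("FSDPModelingHook")
          cfg.MODEL.MODELING_HOOKS = hooks

          # "full" shards params + gradients + optimizer state;
          # "grad_optim" shards only gradients and optimizer state.
          cfg.FSDP.ALGORITHM = "grad_optim"
          return cfg
      ```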
      
      **Note**
      It is not possible to unwrap an FSDP model back to the original model, so `FSDPModelingHook.unapply()` can't be implemented.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41416917
      
      fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06
  4. 19 Dec, 2022 1 commit
  5. 17 Nov, 2022 1 commit
    • Integrate PyTorch Fully Sharded Data Parallel (FSDP) · 02625ff8
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/396
      
      Integrate PyTorch FSDP, which supports two sharding modes: 1. gradient + optimizer sharding; 2. full model sharding (params + gradient + optimizer). This feature is enabled in the train_net.py code path.
      
      Sources
      * Integration follows this tutorial: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
      
      API changes
      * Add new config keys to support the new feature. Refer to mobile-vision/d2go/d2go/trainer/fsdp.py for the full list of config options.
      * Add `FSDPCheckpointer` as a subclass of `QATCheckpointer` to support the special loading/saving logic for FSDP models.
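      A minimal sketch of the two sharding modes using the public PyTorch FSDP API from the linked tutorial; the helper name and the mapping of the config strings to ShardingStrategy values are assumptions, not the d2go implementation:

      ```python
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
      from torch.distributed.fsdp import ShardingStrategy


      def wrap_with_fsdp(model: nn.Module, algorithm: str = "grad_optim") -> FSDP:
          """Wrap a model with FSDP; assumes torch.distributed is already initialized."""
          strategy = {
              # params + gradients + optimizer state are sharded
              "full": ShardingStrategy.FULL_SHARD,
              # only gradients + optimizer state are sharded
              "grad_optim": ShardingStrategy.SHARD_GRAD_OP,
          }[algorithm]
          return FSDP(model, sharding_strategy=strategy)
      ```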
      
      Reviewed By: wat3rBro
      
      Differential Revision: D39228316
      
      fbshipit-source-id: 342ecb3bcbce748453c3fba2d6e1b7b7e478473c
  6. 14 Nov, 2022 1 commit
  7. 23 Oct, 2022 1 commit
  8. 09 Aug, 2022 2 commits
  9. 28 Jul, 2022 1 commit
  10. 27 Jul, 2022 1 commit
  11. 25 Jul, 2022 1 commit
  12. 22 Jul, 2022 1 commit
  13. 30 Jun, 2022 1 commit
  14. 24 Jun, 2022 2 commits
    • Only save results to file from rank 0 · f0297b81
      Mik Vyatskov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/309
      
      Right now, multiple machines can try to write to the same output file, since they all
      receive the same argument. Additionally, several outputs can be saved on the same
      machine, which requires unnecessary unpacking. This change makes train_net write
      output only from the rank 0 trainer.
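      The gist of the change, sketched with detectron2's distributed helpers; the function name and JSON output are illustrative assumptions:

      ```python
      import json

      from detectron2.utils import comm


      def maybe_save_results(results: dict, output_path: str) -> None:
          """Write evaluation results to disk only from the rank 0 (main) process."""
          if comm.is_main_process():
              with open(output_path, "w") as f:
                  json.dump(results, f)
      ```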
      
      Reviewed By: wat3rBro
      
      Differential Revision: D37310084
      
      fbshipit-source-id: 9d5352a274e8fb1d2043393b12896d402333c17b
    • use runner class instead of instance outside of main · 8051775c
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/312
      
      As discussed, we decided not to use a runner instance outside of `main`. Previous diffs already addressed the prerequisites; this diff mainly does the renaming.
      - Use runner name (str) in the fblearner, ML pipeline.
      - Use runner name (str) in FBL operator, MAST and binary operator.
      - Use the runner class as the interface of `main`; it can be either the name of the class (str) or the actual class. The main usage should be the `str` form, so that the class import happens inside `main`. But it is also a common use case to import the runner class and call `main` directly for things like ad-hoc scripts or tests; supporting the actual class makes it easier to modify code for those cases (e.g. a local test class doesn't have an importable name, so a runner name can't be used). A minimal resolution sketch follows below.
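      A hypothetical helper showing how `main` could accept either form; this is not d2go's actual code, just an illustration of the str-or-class interface described above:

      ```python
      import importlib
      from typing import Type, Union


      def resolve_runner_class(runner: Union[str, Type]) -> Type:
          """Accept either a fully-qualified runner name or the class itself."""
          if isinstance(runner, str):
              # Keep the import inside main: only resolve the module when needed.
              module_name, _, class_name = runner.rpartition(".")
              return getattr(importlib.import_module(module_name), class_name)
          # Actual class passed directly, e.g. by an ad-hoc script or test.
          return runner
      ```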
      
      Reviewed By: newstzpz
      
      Differential Revision: D37060338
      
      fbshipit-source-id: 879852d41902b87d6db6cb9d7b3e8dc55dc4b976
  15. 18 Jun, 2022 2 commits
  16. 16 Jun, 2022 1 commit
  17. 15 Jun, 2022 1 commit
  18. 15 May, 2022 1 commit
    • apply import merging for fbcode (7 of 11) · b3a9204c
      John Reese authored
      Summary:
      Applies new import merging and sorting from µsort v1.0.
      
      When merging imports, µsort makes a best effort to move associated
      comments to match merged elements, but there are known limitations due to
      the dynamic nature of Python and developer tooling. These changes should
      not produce any dangerous runtime changes, but may require touch-ups to
      satisfy linters and other tooling.
      
      Note that µsort uses case-insensitive, lexicographical sorting, which
      results in a different ordering compared to isort. This provides a more
      consistent sorting order, matching the case-insensitive order used when
      sorting import statements by module name, and ensures that "frog", "FROG",
      and "Frog" always sort next to each other.
      
      For details on µsort's sorting and merging semantics, see the user guide:
      https://usort.readthedocs.io/en/stable/guide.html#sorting
      
      Reviewed By: lisroach
      
      Differential Revision: D36402205
      
      fbshipit-source-id: a4efc688d02da80c6e96685aa8eb00411615a366
  19. 05 Mar, 2022 1 commit
  20. 03 Mar, 2022 1 commit
  21. 22 May, 2021 1 commit
    • support FP16 gradient compression · 57809b0f
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/70
      
      DDP supports an fp16_compress_hook that compresses gradients to FP16 before communication, which can result in a significant speedup.
      
      Add one argument `_C.MODEL.DDP_FP16_GRAD_COMPRESS` to trigger it.
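      A minimal sketch of how the flag could be wired up using PyTorch's built-in communication hook; the helper name and call site are assumptions, only the hook and the config key come from this summary:

      ```python
      from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
      from torch.nn.parallel import DistributedDataParallel as DDP


      def maybe_enable_fp16_grad_compression(ddp_model: DDP, cfg) -> None:
          """Register the built-in FP16 gradient compression hook when the flag is set."""
          if cfg.MODEL.DDP_FP16_GRAD_COMPRESS:
              # Gradients are cast to FP16 before all-reduce and restored to the
              # original dtype afterwards, roughly halving communication volume.
              ddp_model.register_comm_hook(
                  state=None, hook=default_hooks.fp16_compress_hook
              )
      ```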
      
      Reviewed By: zhanghang1989
      
      Differential Revision: D28467701
      
      fbshipit-source-id: 3c80865222f48eb8fe6947ea972448c445ee3ef3
  22. 03 Mar, 2021 1 commit