1. 03 Mar, 2024 1 commit
    • apply Black 2024 style in fbcode (7/16) · 2256bdb7
      Amethyst Reese authored
      Summary:
      Formats the covered files with pyfmt.
      
      paintitblack
      
      Reviewed By: aleivag
      
      Differential Revision: D54447732
      
      fbshipit-source-id: e21fbbe27882c8af183d021f4ac27029cbe93e8e
  2. 08 Jan, 2024 1 commit
  3. 07 Dec, 2023 1 commit
  4. 05 Nov, 2023 1 commit
    • allow to skip loading model weights in build_model() · f2a0c52c
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/630
      
      Currently, in the runner's **build_model()** method, when **eval_only=True**, we always try to load model weights.
      This is quite restrictive in some cases. For example, we may just want to build a model in eval mode to profile its efficiency, before the model has been trained or its weights saved to a checkpoint file.
      
      Thus, this diff adds a **skip_model_weights** argument that allows users to skip loading model weights.
      Note: this diff is entirely backward-compatible and is NOT expected to break existing implementations.
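      
      A minimal sketch of the intended control flow (illustrative only, not the actual d2go code; `build_meta_arch` stands in for whatever the runner actually constructs):
      
      ```python
      from detectron2.checkpoint import DetectionCheckpointer
      from detectron2.modeling import build_model as build_meta_arch
      
      # Sketch: skip_model_weights lets callers build an untrained eval-mode
      # model, e.g. to profile efficiency without a checkpoint file.
      def build_model(cfg, eval_only=False, skip_model_weights=False):
          model = build_meta_arch(cfg)
          if eval_only:
              if not skip_model_weights:
                  # Previous behavior: weights were always loaded here.
                  DetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS)
              model.eval()
          return model
      ```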
      
      Reviewed By: navsud, wat3rBro
      
      Differential Revision: D50623772
      
      fbshipit-source-id: 282dc6f19e17a4dd9eb0048e068c5299bb3d47c2
  5. 27 Sep, 2023 1 commit
  6. 24 Aug, 2023 1 commit
  7. 19 Jul, 2023 1 commit
  8. 12 Jul, 2023 1 commit
    • Extend reply files to all binaries · e4fa6d63
      Francisc Bungiu authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/591
      
      We previously added reply files for train_net, but not for the other relevant binaries with MAST support: evaluator and lightning.
      This diff adds support by extracting the common bits into a separate module and wrapping the entry-point functions to reuse the functionality, as sketched below.
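      
      Conceptually, the shared module might expose a wrapper like this (names are made up for illustration, not the actual module's API):
      
      ```python
      import functools
      import json
      import os
      import traceback
      
      def with_reply_file(main_fn):
          """Hypothetical wrapper: run a binary's main() and record any
          Python exception to a per-process JSON reply file."""
          @functools.wraps(main_fn)
          def wrapped(*args, **kwargs):
              try:
                  return main_fn(*args, **kwargs)
              except Exception as e:
                  reply = {
                      "error": type(e).__name__,
                      "message": str(e),
                      "traceback": traceback.format_exc(),
                  }
                  with open(f"/tmp/reply_{os.getpid()}.json", "w") as f:
                      json.dump(reply, f)
                  raise
          return wrapped
      
      # The same wrapper can now cover train_net, evaluator, and lightning:
      #   main = with_reply_file(main)
      ```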
      
      Differential Revision: D47293689
      
      fbshipit-source-id: 70630a471c0cf037d180c9edfb57a4db4fdf7bde
  9. 22 Jun, 2023 1 commit
  10. 19 Jun, 2023 1 commit
  11. 02 Jun, 2023 1 commit
  12. 11 Apr, 2023 1 commit
  13. 05 Apr, 2023 1 commit
    • Setup root logger once & on import time · abdeafb0
      Mik Vyatskov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/523
      
      To avoid setting it up multiple times, add a run_once() decorator.
      
      Additionally, make sure logging is configured for dataloading workers, which have a different entry point, by moving the logging setup to import time. Currently, when a dataloader worker is created using the spawn method from the multiprocessing module, a new Python interpreter is created, with all the modules imported anew and with the entry point set to the specified method. This means the entry point of the training framework is skipped, together with the logging setup.
      
      With this change, logging is configured at import time: even though a dataloading process never invokes the training main, train_net is still imported in the child process, so logging still gets configured.
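      
      A minimal sketch of the idea, assuming this shape for the decorator (the actual d2go implementation may differ):
      
      ```python
      import functools
      import logging
      
      def run_once():
          """Make the decorated function execute on its first call only;
          subsequent calls become no-ops."""
          def decorator(fn):
              @functools.wraps(fn)
              def wrapped(*args, **kwargs):
                  if not wrapped._has_run:
                      wrapped._has_run = True
                      return fn(*args, **kwargs)
              wrapped._has_run = False
              return wrapped
          return decorator
      
      @run_once()
      def setup_root_logger():
          logging.basicConfig(level=logging.INFO)
      
      # Runs at import time, so a spawned dataloader worker that merely
      # imports this module still gets logging configured.
      setup_root_logger()
      ```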
      
      Reviewed By: miqueljubert
      
      Differential Revision: D44641142
      
      fbshipit-source-id: 06ea85363d965b31d7f9ade3c2615ed9db67470b
  14. 16 Feb, 2023 1 commit
    • Add reply files to d2go training processes · f0f55cdc
      Sudarshan Raghunathan authored
      Summary:
      This diff contains a minimal set of changes to support returning reply files to MAST.
      
      There are three parts:
      1. First, we have a try/except in the main function to catch all the "catchable" Python exceptions. Exceptions from C++ code or segfaults will not be handled here.
      2. Each exception is then written to a per-process JSON reply file.
      3. At the end, all per-process files are stat-ed and the earliest one is copied to a location specified by MAST (see the sketch after the limitations below).
      
      # Limitations
      1. This only works when local processes are launched using multiprocessing (which is the default).
      2. If any error happens in C++ code, it will likely not be caught in Python and the reply file may not contain the correct logs.
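      
      A sketch of part 3, with illustrative names (the real paths and file layout come from MAST):
      
      ```python
      import glob
      import os
      import shutil
      
      def copy_earliest_reply(reply_dir: str, mast_reply_path: str) -> None:
          """Among all per-process reply files, copy the earliest-written
          one to the location MAST expects."""
          files = glob.glob(os.path.join(reply_dir, "reply_*.json"))
          if not files:
              return  # no process recorded an exception
          earliest = min(files, key=lambda p: os.stat(p).st_mtime)
          shutil.copyfile(earliest, mast_reply_path)
      ```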
      
      Differential Revision: D43097683
      
      fbshipit-source-id: 0eaf4f19f6199a9c77f2ce4c7d2bbc2a2078be99
  15. 01 Feb, 2023 1 commit
    • Allow specifying extra lightning trainer params via `_DEFAULTS_` in yaml · 6940fa9c
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/461
      
      There is a need for trainer parameters that are not in (or conflict with) the base d2go config. This diff adds a way to inject those configs without touching the base d2go config (see the sketch after this list):
      - In `get_trainer_params`, it simply checks `LIGHTNING_TRAINER` and uses whatever configs are under it.
      - Adds `GeneralizedRCNNTaskNoDefaultConfig`, which allows specifying the default config via a yaml file for `GeneralizedRCNNTask` (also makes some prerequisite changes).
      - (next diff) Users can add their own config updater by registering it in `CONFIG_UPDATER_REGISTRY`.
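      
      The injection could look roughly like this (placeholder keys; the params derived from the base config are more involved in the real code):
      
      ```python
      def get_trainer_params(cfg):
          # Params derived from the base d2go config (placeholders here).
          params = {"max_epochs": -1, "logger": True}
          # Everything under LIGHTNING_TRAINER is applied on top, allowing
          # trainer options that don't exist in (or would conflict with)
          # the base config, without touching the base config itself.
          extra = getattr(cfg, "LIGHTNING_TRAINER", None)
          if extra is not None:
              params.update(dict(extra))
          return params
      ```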
      
      Differential Revision: D42928992
      
      fbshipit-source-id: f2a1d8a3f2bec9908bb1af03928611d963b92c0e
  16. 13 Jan, 2023 1 commit
    • Rewrite FSDP wrapping as modeling hook · dc6fac12
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440
      
      Move FSDP wrapping to runner.build_model by rewriting it as a modeling hook
      
      **Motivation**
      When a model is too large to run inference on a single GPU, it requires using FSDP with local checkpointing mode to reduce peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP, which may cause OOM errors for the reason above. Thus, it is better practice to wrap the model with FSDP during `runner.build_model(cfg)`, so that evaluation runs in the same FSDP setting as training.
      
      This diff moves FSDP wrapping to `runner.build_model(cfg)` by rewriting it as a modeling hook.
      
      **API changes**
      * Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP.
      * `FSDP.ALGORITHM` can only be `full` or `grad_optim`.
      
      **Note**
      It's not possible to unwrap an FSDP model back to the normal model, so FSDPModelingHook.unapply() can't be implemented.
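      
      Sketched as a hook (interface and names are paraphrased, not the exact d2go code; FSDP construction assumes torch.distributed is already initialized):
      
      ```python
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
      
      class FSDPModelingHookSketch:
          """Illustrative only: apply() wraps the model with FSDP inside
          runner.build_model(cfg); unapply() cannot exist because an
          FSDP-wrapped model can't be unwrapped back."""
      
          def apply(self, model: nn.Module) -> nn.Module:
              return FSDP(model)
      
          def unapply(self, model: nn.Module) -> nn.Module:
              raise NotImplementedError("FSDP wrapping is not reversible")
      ```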
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41416917
      
      fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06
  17. 19 Dec, 2022 1 commit
  18. 17 Nov, 2022 1 commit
    • Integrate PyTorch Fully Sharded Data Parallel (FSDP) · 02625ff8
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/396
      
      Integrate PyTorch FSDP, which supports two sharding modes: 1. gradient + optimizer sharding; 2. full model sharding (params + gradient + optimizer). This feature is enabled in the train_net.py code path (a sketch of the two modes follows the API changes below).
      
      Sources
      * Integration follows this tutorial: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
      
      API changes
      * Add new config keys to support the new feature; refer to mobile-vision/d2go/d2go/trainer/fsdp.py for the full list of config options.
      * Add `FSDPCheckpointer`, a subclass of `QATCheckpointer`, to support the special loading/saving logic for FSDP models.
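      
      A sketch of the two modes in plain PyTorch terms (assumes torch.distributed is initialized; the real, config-driven wrapping lives in d2go/trainer/fsdp.py):
      
      ```python
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
      from torch.distributed.fsdp import ShardingStrategy
      
      def wrap_with_fsdp(model: nn.Module, algorithm: str) -> nn.Module:
          strategy = {
              # gradient + optimizer sharding
              "grad_optim": ShardingStrategy.SHARD_GRAD_OP,
              # full model sharding (params + gradient + optimizer)
              "full": ShardingStrategy.FULL_SHARD,
          }[algorithm]
          return FSDP(model, sharding_strategy=strategy)
      ```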
      
      Reviewed By: wat3rBro
      
      Differential Revision: D39228316
      
      fbshipit-source-id: 342ecb3bcbce748453c3fba2d6e1b7b7e478473c
  19. 14 Nov, 2022 1 commit
  20. 11 Nov, 2022 1 commit
  21. 27 Oct, 2022 1 commit
  22. 23 Oct, 2022 1 commit
  23. 05 Oct, 2022 1 commit
  24. 28 Sep, 2022 1 commit
  25. 10 Sep, 2022 1 commit
  26. 09 Aug, 2022 2 commits
  27. 28 Jul, 2022 1 commit
  28. 27 Jul, 2022 1 commit
  29. 25 Jul, 2022 1 commit
  30. 22 Jul, 2022 1 commit
  31. 30 Jun, 2022 2 commits
  32. 29 Jun, 2022 1 commit
  33. 24 Jun, 2022 2 commits
    • Only save results to file from rank 0 · f0297b81
      Mik Vyatskov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/309
      
      Right now multiple machines can try to write to the same output file, since they all receive the same argument. Additionally, on the same machine, several outputs can be saved, which requires unnecessary unpacking. This change makes train_net write output only from the rank 0 trainer.
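      
      The guard might look like this (sketch only; d2go's own comm utilities could be used instead of raw torch.distributed):
      
      ```python
      import torch.distributed as dist
      
      def save_results(results, output_path: str) -> None:
          """Only the global rank 0 process writes the shared output file;
          every other rank returns without writing."""
          if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
              return
          with open(output_path, "w") as f:
              f.write(repr(results))  # placeholder for the real serialization
      ```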
      
      Reviewed By: wat3rBro
      
      Differential Revision: D37310084
      
      fbshipit-source-id: 9d5352a274e8fb1d2043393b12896d402333c17b
    • use runner class instead of instance outside of main · 8051775c
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/312
      
      As discussed, we decided not to use a runner instance outside of `main`. Previous diffs already solved the prerequisites; this diff mainly does the renaming (see the sketch after this list):
      - Use the runner name (str) in fblearner, ML pipeline.
      - Use the runner name (str) in the FBL operator, MAST and binary operator.
      - Use the runner class as the interface of `main`; it can be either the name of the class (str) or the actual class. The main usage should be a `str`, so that importing the class happens inside `main`. But it's also a common use case to import a runner class and call `main` for things like ad-hoc scripts or tests; supporting the actual class makes it easier to modify code for those cases (e.g. some local test class doesn't have a name, so it's not feasible to use a runner name).
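      
      The str-or-class interface could be resolved roughly like this (illustrative helper, not the exact d2go code):
      
      ```python
      import importlib
      
      def resolve_runner_class(runner):
          """Accept either a fully-qualified runner name (str), so the
          import happens inside main, or an actual class, for ad-hoc
          scripts and tests."""
          if isinstance(runner, str):
              module_name, _, class_name = runner.rpartition(".")
              return getattr(importlib.import_module(module_name), class_name)
          return runner  # already a class
      
      # e.g. resolve_runner_class("d2go.runner.GeneralizedRCNNRunner")
      ```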
      
      Reviewed By: newstzpz
      
      Differential Revision: D37060338
      
      fbshipit-source-id: 879852d41902b87d6db6cb9d7b3e8dc55dc4b976
  34. 18 Jun, 2022 2 commits
  35. 16 Jun, 2022 2 commits