1. 16 Feb, 2023 1 commit
    • Add reply files to d2go training processes · f0f55cdc
      Sudarshan Raghunathan authored
      Summary:
      This diff contains a minimal set of changes to support returning reply files to MAST.
      
      There are three parts:
      1. First, we have a try..except in the main function to catch all the "catchable" Python exceptions. Exceptions from C++ code or segfaults will not be handled here.
      2. Each exception is then written to a per-process JSON reply file.
      3. At the end, all per-process files are stat-ed and the earliest file is copied to a location specified by MAST.
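
      The three parts above can be sketched as follows. This is a minimal illustration, not the code from the diff: the function names, the `reply_<rank>.json` naming, the JSON fields, and the use of modification time to pick the earliest file are all assumptions made for the example.

```python
import json
import os
import shutil
import time
import traceback


def write_reply_file(reply_dir, rank, exc):
    """Write one exception to a per-process JSON reply file (hypothetical layout)."""
    path = os.path.join(reply_dir, f"reply_{rank}.json")
    payload = {
        "rank": rank,
        "type": type(exc).__name__,
        "message": str(exc),
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "timestamp": time.time(),
    }
    with open(path, "w") as f:
        json.dump(payload, f)
    return path


def run_with_reply_file(main_fn, reply_dir, rank):
    """Run main_fn; record any catchable Python exception, then re-raise it."""
    try:
        return main_fn()
    except Exception as exc:  # segfaults and C++ aborts never reach this handler
        write_reply_file(reply_dir, rank, exc)
        raise


def copy_earliest_reply(reply_dir, destination):
    """Stat all per-process files and copy the oldest one to destination."""
    paths = [
        os.path.join(reply_dir, name)
        for name in os.listdir(reply_dir)
        if name.endswith(".json")
    ]
    if not paths:
        return None
    # Assumption: "earliest" means smallest modification time.
    earliest = min(paths, key=lambda p: os.stat(p).st_mtime)
    shutil.copyfile(earliest, destination)
    return earliest
```

      Picking the earliest file is a heuristic for surfacing the first failure, since later errors in other processes are often just consequences of it.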
      
      # Limitations
      1. This only works when local processes are launched using multiprocessing (which is the default).
      2. If an error happens in C++ code, it will likely not be caught in Python, and the reply file might not contain the correct logs.
      
      Differential Revision: D43097683
      
      fbshipit-source-id: 0eaf4f19f6199a9c77f2ce4c7d2bbc2a2078be99
  2. 01 Feb, 2023 1 commit
    • Allow specifying extra lightning trainer params via `_DEFAULTS_` in yaml · 6940fa9c
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/461
      
      There is a need to extend trainer parameters that are not in (or that conflict with) the base d2go config; this diff adds a way to inject those configs without touching the base d2go config.
      - In `get_trainer_params`, it simply checks `LIGHTNING_TRAINER` and uses whatever configs are under it.
      - Adds `GeneralizedRCNNTaskNoDefaultConfig`, which allows specifying the default config for `GeneralizedRCNNTask` via a yaml file (also makes some prerequisite changes).
      - (next diff) Users can add their own config updater by registering it in `CONFIG_UPDATER_REGISTRY`.
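
      As a rough illustration of the first bullet, the lookup might behave like the sketch below. It uses a plain dict in place of the real config object, and both the base parameter shown and the rule that entries under `LIGHTNING_TRAINER` override the base values are assumptions for the example, not the actual d2go implementation.

```python
def get_trainer_params(cfg):
    """Build trainer keyword arguments from a dict-like config (illustrative only)."""
    # Hypothetical base parameter derived from the regular config.
    params = {"max_steps": cfg["SOLVER"]["MAX_ITER"]}
    # Anything under LIGHTNING_TRAINER is passed through as-is; in this sketch
    # it takes precedence over the base values when keys collide.
    params.update(cfg.get("LIGHTNING_TRAINER", {}))
    return params


base_cfg = {"SOLVER": {"MAX_ITER": 100}}
extended_cfg = {
    "SOLVER": {"MAX_ITER": 100},
    "LIGHTNING_TRAINER": {"precision": 16},
}
print(get_trainer_params(base_cfg))      # {'max_steps': 100}
print(get_trainer_params(extended_cfg))  # {'max_steps': 100, 'precision': 16}
```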
      
      Differential Revision: D42928992
      
      fbshipit-source-id: f2a1d8a3f2bec9908bb1af03928611d963b92c0e
  3. 13 Jan, 2023 1 commit
    • Rewrite FSDP wrapping as modeling hook · dc6fac12
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440
      
      Move FSDP wrapping to runner.build_model by rewriting it as a modeling hook
      
      **Motivation**
      When a model is too large to run inference on a single GPU, it requires FSDP with local checkpointing mode to reduce peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP, which may cause OOM errors for the same reason. Thus, it may be better practice to wrap the model with FSDP during `runner.build_model(cfg)`, so evaluation can run in the same FSDP setting as training.
      
      This diff moves FSDP wrapping to `runner.build_model(cfg)` by rewriting it as a modeling hook.
      
      **API changes**
      * Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP.
      * `FSDP.ALGORITHM` can only be `full` or `grad_optim`
      
      **Note**
      It's not possible to unwrap an FSDP model back to the normal model, so `FSDPModelingHook.unapply()` can't be implemented.
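
      The hook pattern described here can be illustrated without PyTorch. In the sketch below, `ShardedWrapper`, `WrappingHook`, `build_model`, and the registry dict are hypothetical stand-ins; only the `apply`/`unapply` shape and the idea of listing hook names in the config come from the description above.

```python
class ShardedWrapper:
    """Stand-in for an FSDP-wrapped module; performs no real sharding."""

    def __init__(self, module):
        self.module = module


class WrappingHook:
    """Hypothetical modeling hook that wraps the model when applied."""

    def apply(self, model):
        return ShardedWrapper(model)

    def unapply(self, model):
        # Mirrors the note above: wrapping is treated as one-way.
        raise NotImplementedError("a wrapped model cannot be unwrapped")


def build_model(base_model, hook_names, registry):
    """Apply each named hook in order, as a build step might do."""
    model = base_model
    for name in hook_names:
        model = registry[name]().apply(model)
    return model


registry = {"FSDPModelingHook": WrappingHook}
model = build_model(object(), ["FSDPModelingHook"], registry)
print(type(model).__name__)  # ShardedWrapper
```

      Because wrapping happens inside the build step, training and eval-only runs that share the same hook list end up with the same wrapped model.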
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41416917
      
      fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06
  4. 19 Dec, 2022 1 commit
  5. 17 Nov, 2022 1 commit
    • Integrate PyTorch Fully Sharded Data Parallel (FSDP) · 02625ff8
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/396
      
      Integrate PyTorch FSDP, which supports two sharding modes: 1. gradient + optimizer sharding; 2. full model sharding (params + gradient + optimizer). This feature is enabled in the train_net.py code path.
      
      Sources
      * Integration follows this tutorial: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
      
      API changes
      * Add new config keys to support the new feature. Refer to mobile-vision/d2go/d2go/trainer/fsdp.py for the full list of config options
      * Add `FSDPCheckpointer` as a subclass of `QATCheckpointer` to support special loading/saving logic for FSDP models
      
      Reviewed By: wat3rBro
      
      Differential Revision: D39228316
      
      fbshipit-source-id: 342ecb3bcbce748453c3fba2d6e1b7b7e478473c
  6. 14 Nov, 2022 1 commit
  7. 11 Nov, 2022 1 commit
  8. 27 Oct, 2022 1 commit
  9. 23 Oct, 2022 1 commit
  10. 05 Oct, 2022 1 commit
  11. 28 Sep, 2022 1 commit
  12. 10 Sep, 2022 1 commit
  13. 09 Aug, 2022 2 commits
  14. 28 Jul, 2022 1 commit
  15. 27 Jul, 2022 1 commit
  16. 25 Jul, 2022 1 commit
  17. 22 Jul, 2022 1 commit
  18. 30 Jun, 2022 2 commits
  19. 29 Jun, 2022 1 commit
  20. 24 Jun, 2022 2 commits
    • Only save results to file from rank 0 · f0297b81
      Mik Vyatskov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/309
      
      Right now multiple machines can try to write to the same output file,
      since they get the same argument. Additionally, on the same machine, several
      outputs can be saved, which requires unnecessary unpacking. This change makes
      train_net only write the output of the rank 0 trainer.
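
      A minimal sketch of the guard, assuming the caller already knows its global rank; the function name and the JSON output format are illustrative and not taken from train_net.

```python
import json


def save_results(results, output_file, rank):
    """Write results only from the rank 0 process; other ranks do nothing."""
    if rank != 0:
        return False
    with open(output_file, "w") as f:
        json.dump(results, f)
    return True
```

      Every rank can call the function with the same `output_file` argument, but only one process ever touches the file.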
      
      Reviewed By: wat3rBro
      
      Differential Revision: D37310084
      
      fbshipit-source-id: 9d5352a274e8fb1d2043393b12896d402333c17b
    • use runner class instead of instance outside of main · 8051775c
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/312
      
      As discussed, we decided not to use a runner instance outside of `main`. Previous diffs already solved the prerequisites; this diff mainly does the renaming.
      - Use runner name (str) in the fblearner, ML pipeline.
      - Use runner name (str) in FBL operator, MAST and binary operator.
      - Use runner class as the interface of main, it can be either the name of class (str) or actual class. The main usage should be using `str`, so that the importing of class happens inside `main`. But it's also a common use case to import runner class and call `main` for things like ad-hoc scripts or tests, supporting actual class makes it easier to modify code for those cases (e.g. some local test class doesn't have a name, so it's not feasible to use runner name).
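
      One way to accept either form is to resolve the class lazily inside `main`. The helper below is a sketch under the assumption that a runner name is a fully qualified `module.ClassName` string; the helper name and the simplified `main` signature are hypothetical.

```python
import importlib


def resolve_runner_class(runner):
    """Return a class from either a dotted name (str) or the class itself."""
    if isinstance(runner, str):
        module_name, _, class_name = runner.rpartition(".")
        if not module_name:
            raise ValueError(f"expected 'module.ClassName', got {runner!r}")
        # The import happens here, i.e. only once main is actually running.
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    return runner


def main(runner_class):
    """Instantiate the runner inside main, never before it."""
    runner = resolve_runner_class(runner_class)()
    return runner
```

      Launchers can then pass a plain string across process boundaries, while tests can pass a locally defined class directly.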
      
      Reviewed By: newstzpz
      
      Differential Revision: D37060338
      
      fbshipit-source-id: 879852d41902b87d6db6cb9d7b3e8dc55dc4b976
  21. 18 Jun, 2022 2 commits
  22. 16 Jun, 2022 2 commits
  23. 15 Jun, 2022 1 commit
  24. 14 Jun, 2022 1 commit
  25. 09 Jun, 2022 1 commit
  26. 02 Jun, 2022 1 commit
    • Separate into API and Exporter · 24da990f
      Miquel Jubert Hermoso authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/238
      
      *This diff is part of a stack which has the goal of "buckifying" D2Go core and enabling autodeps and other tooling. The last diff in the stack introduces the TARGETS. The diffs earlier in the stack are resolving circular dependencies and other issues which prevent the buckification from occurring.*
      
      Following the comments in an abandoned diff, split the export code into two files, each with its own dependencies: exporter and api. api.py contains the components which have few dependencies, so it can be imported basically anywhere without circular dependencies.
      
      exporter.py contains the utilities which are used for export operations, for example in the exporter binary.
      
      Reviewed By: mcimpoi
      
      Differential Revision: D36166603
      
      fbshipit-source-id: 25ded0b3925464c05be4048472a4c2ddcdb17ecf
  27. 15 May, 2022 1 commit
    • apply import merging for fbcode (7 of 11) · b3a9204c
      John Reese authored
      Summary:
      Applies new import merging and sorting from µsort v1.0.
      
      When merging imports, µsort will make a best-effort to move associated
      comments to match merged elements, but there are known limitations due to
      the dynamic nature of Python and developer tooling. These changes should
      not produce any dangerous runtime changes, but may require touch-ups to
      satisfy linters and other tooling.
      
      Note that µsort uses case-insensitive, lexicographical sorting, which
      results in a different ordering compared to isort. This provides a more
      consistent sorting order, matching the case-insensitive order used when
      sorting import statements by module name, and ensures that "frog", "FROG",
      and "Frog" always sort next to each other.
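
      The difference between case-sensitive and case-insensitive ordering can be seen with plain `sorted`. This only illustrates the ordering idea; it is not µsort's implementation, and the names are made up.

```python
names = ["frog", "Zebra", "FROG", "apple", "Frog"]

# Default ordering compares code points, so uppercase sorts before lowercase.
print(sorted(names))
# ['FROG', 'Frog', 'Zebra', 'apple', 'frog']

# Case-insensitive ordering keeps the three spellings of "frog" adjacent.
print(sorted(names, key=str.casefold))
# ['apple', 'frog', 'FROG', 'Frog', 'Zebra']
```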
      
      For details on µsort's sorting and merging semantics, see the user guide:
      https://usort.readthedocs.io/en/stable/guide.html#sorting
      
      Reviewed By: lisroach
      
      Differential Revision: D36402205
      
      fbshipit-source-id: a4efc688d02da80c6e96685aa8eb00411615a366
  28. 14 May, 2022 1 commit
  29. 24 Mar, 2022 1 commit
    • refactor exporter and eval command line tools · 744d72d7
      Tsahi Glik authored
      Summary: Tweak exporter and evaluator cli entry point func to support calling it as a module with args from custom launching code.
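
      A common way to make an entry point callable both from the shell and from custom launching code is to let it accept an optional argument list. The sketch below assumes that shape; the flag names and the `run` function are hypothetical and not the actual d2go tools.

```python
import argparse


def get_parser():
    """Build the argument parser (flag names are illustrative)."""
    parser = argparse.ArgumentParser(description="example exporter entry point")
    parser.add_argument("--config-file", required=True)
    parser.add_argument("--output-dir", default="./output")
    return parser


def run(config_file, output_dir):
    """Placeholder for the real work; returns what it was asked to do."""
    return {"config_file": config_file, "output_dir": output_dir}


def cli(args=None):
    """Parse args (defaults to sys.argv[1:] when None) and dispatch to run."""
    parsed = get_parser().parse_args(args)
    return run(parsed.config_file, parsed.output_dir)


if __name__ == "__main__":
    cli()
```

      Launcher code can then import the module and call `cli(["--config-file", "a.yaml"])` without touching `sys.argv`.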
      
      Reviewed By: sstsai-adl
      
      Differential Revision: D35035813
      
      fbshipit-source-id: c8b24099e94ccc58c184f8aac95b2a24a137e86a
  30. 10 Mar, 2022 1 commit
  31. 05 Mar, 2022 1 commit
  32. 03 Mar, 2022 1 commit
  33. 14 Feb, 2022 1 commit
    • D2Go Fail Fast: Move exception coming from not implemented "compare accuracy" feature to the top. · eee4dfc1
      Tugrul Savran authored
      Summary:
      Currently, the exporter method takes in a compare_accuracy parameter, which after all the compute (exporting etc.) raises an exception if it is set to True.
      
      This looks like an antipattern, and causes a waste of compute.
      
      Therefore, I am proposing to raise the exception at the very beginning of method call to let the client know in advance that this argument's functionality isn't implemented yet.
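
      The fail-fast idea can be sketched as below. The function name, its parameters, and the `do_export` callable are hypothetical; only the `compare_accuracy` flag and the decision to raise before any work is done come from the description above.

```python
def export_model(do_export, compare_accuracy=False):
    """Validate unsupported options first, then run the expensive export."""
    if compare_accuracy:
        # Raised before any compute is spent on exporting.
        raise NotImplementedError("compare_accuracy is not implemented yet")
    return do_export()
```

      With the check at the top, a caller passing `compare_accuracy=True` gets the error immediately instead of after the export has finished.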
      
      NOTE: We might also choose to get rid of the parameter entirely. I am open to suggestions.
      
      Differential Revision: D34186578
      
      fbshipit-source-id: d7fbe7589dfe2d2f688b870885ca61e6829c9329
  34. 08 Jan, 2022 1 commit
    • Add deprecation path for renamed training type plugins (#11227) · fcd51171
      Binh Tang authored
      Summary:
      ### New commit log messages
        4eede7c30 Add deprecation path for renamed training type plugins (#11227)
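
      A deprecation path for a renamed class is often implemented by keeping the old name as a thin subclass that warns on use. The sketch below shows that general pattern with stand-in classes named after the rename in this log; it is not the upstream implementation.

```python
import warnings


class DDPStrategy:
    """Stand-in for the class under its new name."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class DDPPlugin(DDPStrategy):
    """Old name kept as an alias that warns when instantiated."""

    def __init__(self, **kwargs):
        warnings.warn(
            "DDPPlugin is deprecated; use DDPStrategy instead",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(**kwargs)
```

      Existing code that still constructs the old name keeps working and still passes `isinstance` checks against the new class, while the warning points callers at the rename.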
      
      Reviewed By: edward-io, daniellepintz
      
      Differential Revision: D33409991
      
      fbshipit-source-id: 373e48767e992d67db3c85e436648481ad16c9d0
  35. 06 Jan, 2022 1 commit
    • Rename `DDPPlugin` to `DDPStrategy` (#11142) · aeb15613
      Binh Tang authored
      Summary:
      ### New commit log messages
        b64dea9dc Rename `DDPPlugin` to `DDPStrategy` (#11142)
      
      Reviewed By: jjenniferdai
      
      Differential Revision: D33259306
      
      fbshipit-source-id: b4608c6b96b4a7977eaa4ed3f03c4b824882aef0