1. 07 Aug, 2023 1 commit
  2. 04 Aug, 2023 1 commit
    • only select pth files with prefix "model" as model checkpoint file · 94c7f647
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/605
      
      The D2GO workflow's async validation monitors the model checkpoint files (*.pth) in the **e2e_train** folder (such as **model_0004999.pth** and **model_final.pth**) and launches the async val operator as needed.
      All model checkpoint files have the prefix **"model"**, but in some cases the folder also contains non-checkpoint files with the pth extension.
      To exclude them, add a filter that checks whether the file name starts with "model".
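
The filter described above can be sketched as follows. This is a minimal illustration, not the code from the pull request: the helper name `is_model_checkpoint` and the non-checkpoint file names in the sample list are hypothetical.

```python
import os


def is_model_checkpoint(path: str) -> bool:
    """Return True for files such as model_0004999.pth or model_final.pth."""
    name = os.path.basename(path)
    # Require both the "model" prefix and the .pth extension.
    return name.startswith("model") and name.endswith(".pth")


# Sample folder listing; the last two entries are made-up non-checkpoint files.
files = [
    "e2e_train/model_0004999.pth",
    "e2e_train/model_final.pth",
    "e2e_train/optimizer_state.pth",
    "e2e_train/metrics.json",
]
checkpoints = [f for f in files if is_model_checkpoint(f)]
print(checkpoints)
```

Checking the base name rather than the full path keeps a directory that happens to start with "model" from matching by accident.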
      
      Reviewed By: ayushidalmia
      
      Differential Revision: D48021972
      
      fbshipit-source-id: 54d9c14117192809ea76d812ebd4240b44166637
  3. 25 Jul, 2023 2 commits
  4. 21 Jul, 2023 2 commits
  5. 19 Jul, 2023 2 commits
  6. 18 Jul, 2023 1 commit
  7. 14 Jul, 2023 1 commit
  8. 12 Jul, 2023 1 commit
    • Extend reply files to all binaries · e4fa6d63
      Francisc Bungiu authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/591
      
      We previously added reply files for train_net, but not the other relevant binaries with MAST support: evaluator and lightning.
      Adding support here by extracting the common bits into a separate module and wrapping the functions to reuse the functionality.
      
      Differential Revision: D47293689
      
      fbshipit-source-id: 70630a471c0cf037d180c9edfb57a4db4fdf7bde
  9. 05 Jul, 2023 1 commit
  10. 28 Jun, 2023 2 commits
    • enable autodeps for tests · a2b9a523
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/588
      
      Enable autodeps for the d2go tests to unblock the next diff.
      
      In the future, we could break it into smaller pieces to make the tests build and run faster.
      
      Reviewed By: ajinkya-deogade
      
      Differential Revision: D47080563
      
      fbshipit-source-id: 9d8ee2a13f91a34c79aa13f2b8165c615643b87d
    • Remove profiling of evaluation · b1e24e81
      Francisc Bungiu authored
      Summary: Deprecate prepare_fb_model_for_eval().
      
      Reviewed By: miqueljubert
      
      Differential Revision: D47085783
      
      fbshipit-source-id: 34b7e822e9baa1f9f77a11d3497df7fb0463c955
  11. 26 Jun, 2023 1 commit
  12. 23 Jun, 2023 4 commits
    • disable FSDP mixed precision for model buffers · b0abd7aa
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/585
      
      Disable FSDP mixed precision for model buffers. Buffers are usually small, so enabling mixed precision for them yields very limited performance gain. In addition, use cases such as BatchNorm layers and diffusion models are very sensitive to buffer precision. We therefore keep buffers in full precision under FSDP.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D46951673
      
      fbshipit-source-id: 12bb1a47fbd8b3dd85c7f781bab707206044af15
    • update INJECTED_COCO_DATASETS_LUT when registering AdhocCOCODataset · be8a6324
      Zhicheng Yan authored
      Summary:
      When registering AdhocCOCODataset, INJECTED_COCO_DATASETS_LUT needs to be updated as well.
      For example, if a dataset uses a custom registration function, it can only be retrieved from INJECTED_COCO_DATASETS_LUT.
      Otherwise, the default registration function is used, as in the `register_dataset_split` branch.
      
      Reviewed By: antonrigner
      
      Differential Revision: D46826507
      
      fbshipit-source-id: 9170c5b57f3935875b899ab7f93c3c57e77eb28c
    • remove AC prefix from EMA to make it compatible with loading · 5c23bee8
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/578
      
      # Problem:
      d2go EMA uses `named_parameters()` to traverse model states and save EMA checkpoints, while using `state_dict()` to save model checkpoints. This is brittle because `named_parameters()` and `state_dict()` go through two different sets of Python APIs and can return different results.
      In the case of Activation Checkpointing (AC), we don't want the AC wrapper to affect checkpoint names. Thus, `state_dict()` is overridden by PyTorch to remove the prefix "_checkpoint_wrapped_module" from the FQN. However, `named_parameters()` does not have that support, so the prefix remains. If we change the AC wrapping strategy (very common for optimization), we will not be able to load the previous EMA state back into the model. The same problem also occurs with FSDP.
      
      # Short-term hack:
      This diff adds a short term hack to manually remove the AC prefix in EMA. We can expand `IGNORED_FQN_PREFIX` to support more use cases.
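
The prefix removal can be illustrated with plain string handling. This is only a sketch under stated assumptions: `WRAPPER_PREFIXES` and `strip_wrapper_prefixes` are hypothetical names (the actual contents of `IGNORED_FQN_PREFIX` are not shown in this commit message), and the parameter names are made up.

```python
# Hypothetical stand-in for the prefixes to ignore. The trailing dot is an
# assumption so that the separator is removed together with the wrapper name.
WRAPPER_PREFIXES = ("_checkpoint_wrapped_module.",)


def strip_wrapper_prefixes(fqn: str) -> str:
    """Remove wrapper-introduced components from a fully qualified name."""
    for prefix in WRAPPER_PREFIXES:
        fqn = fqn.replace(prefix, "")
    return fqn


# Names as they might come out of named_parameters() on a wrapped model.
wrapped_names = [
    "backbone._checkpoint_wrapped_module.conv1.weight",
    "head.fc.bias",
]
clean_names = [strip_wrapper_prefixes(n) for n in wrapped_names]
print(clean_names)
```

With the wrapper component stripped, EMA keys no longer depend on which submodules happen to be wrapped, so a saved EMA state stays loadable after the wrapping strategy changes.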
      
      Reviewed By: wat3rBro
      
      Differential Revision: D46815031
      
      fbshipit-source-id: 29b6ea444ed2ef90b8741fccdcb2b62625933e7f
    • disable memory profiler by default + remove force disable + add logging · c0a84df5
      Anthony Chen authored
      Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/581
      
      Reviewed By: wat3rBro
      
      Differential Revision: D46913792
      
      fbshipit-source-id: cf3c3812c455091fbf63842443644d2571976017
  13. 22 Jun, 2023 3 commits
    • expose use_orig_params to d2go config · 7f17bbf0
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/582
      
      Expose use_orig_params for FSDP constructor to d2go config. Read more about it in the docstring of torch.distributed.fsdp.fully_sharded_data_parallel.
      
      use_orig_params=False (default) uses FlatParameters to store flattened parameters, which saves memory by avoiding fragmentation. However, use_orig_params=True is essential for models that are partly frozen, because FlatParameters can only accept a uniform requires_grad across the whole model.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D46917757
      
      fbshipit-source-id: 12ebe83e6de456e37d89eaf8b257f23925a6786d
    • Add MAST support for eval · 60b6995d
      Francisc Bungiu authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/583
      
      Extend support to MAST for evaluator binary.
      
      Reviewed By: miqueljubert
      
      Differential Revision: D46762473
      
      fbshipit-source-id: 62ac68f195c89924abf71c9b6a9715d60ffcbf9b
    • clean up all __init__.py · 955e53f6
      Yanghan Wang authored
      Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/580
      
      Reviewed By: ajinkya-deogade
      
      Differential Revision: D46875151
      
      fbshipit-source-id: e19d9ac79c0a4ad1b1ab49112e36f80c55062ea4
  14. 21 Jun, 2023 1 commit
  15. 19 Jun, 2023 1 commit
  16. 16 Jun, 2023 2 commits
  17. 14 Jun, 2023 1 commit
  18. 13 Jun, 2023 2 commits
    • delete loaded ckpt after use to save memory · 3fce52cf
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/574
      
      Currently, the d2go runner doesn't delete the checkpoint after loading it. This is fine when running with `resume=True`, because all the model/optimizer/EMA state in the checkpoint is loaded into the corresponding training components. However, with `resume=False`, only the model state is loaded, and the optimizer/EMA state stays in memory until the end of training. This can cause OOM if the checkpoint is large.
      
      This diff deletes the loaded checkpoint after use to save memory and avoid potential OOM issues.
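
The effect of dropping the checkpoint reference can be shown with a small pure-Python sketch. It does not use the d2go runner: `load_checkpoint`, `train`, and the `State` class are hypothetical stand-ins, and a weak reference is used only to observe whether the unused optimizer state is still alive (behavior shown is for CPython).

```python
import gc
import weakref


class State(dict):
    """dict subclass, used so that a weak reference can point at it."""


def load_checkpoint():
    # Stand-in for reading a checkpoint file from disk.
    return {
        "model": State(w=1.0),
        "optimizer": State(momentum=[0.0] * 4),
        "ema_state": State(w=0.9),
    }


def train():
    """Simulates the resume=False path: only the model state is consumed."""
    model = {}
    checkpoint = load_checkpoint()
    optimizer_ref = weakref.ref(checkpoint["optimizer"])

    model.update(checkpoint["model"])  # copy the weights we actually need

    # Drop the only reference to the checkpoint before the training loop, so
    # the unused optimizer/EMA state can be reclaimed right away.
    del checkpoint
    gc.collect()
    freed_before_loop = optimizer_ref() is None

    # ... the training loop would run here ...
    return model, freed_before_loop


model, freed_before_loop = train()
print(model, freed_before_loop)
```

Without the `del`, the local variable would keep the whole checkpoint alive for as long as the function runs, which in a real runner means the full duration of training.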
      
      Reviewed By: tglik
      
      Differential Revision: D46674618
      
      fbshipit-source-id: 2b70a8e46c7f2a309f83cc4deefe5d7a14783734
    • move detectron2 related .autodeps.toml to detectron2 · a879c1b4
      Yanghan Wang authored
      Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/572
      
      Reviewed By: ajinkya-deogade
      
      Differential Revision: D46664313
      
      fbshipit-source-id: acb1876c92c3907eb185dd144782495bda593d23
  19. 12 Jun, 2023 1 commit
    • fix d2go.config · bcad53f6
      Yanghan Wang authored
      Summary:
      I think the main issue is that we import `reroute_config_path` from `d2go.config.config` in `__init__.py`, but it's actually in `d2go.config.utils`. After fixing this, the namespace forward also works, see `scripts/wangyanghan/autodeps_testbed/d2go_config/TARGETS`
      
      Update all TARGETS:
      ```
      fbgs -l "d2go/config:" | xargs printf -- '/data/sandcastle/boxes/%s\n' | xargs arc lint -a
      ```
      
      For reviewers, only `.autodeps.toml` and files in `d2go/d2go/config/` and `scripts/wangyanghan/autodeps_testbed/d2go_config/` are manually changed, other files are auto modified.
      
      Reviewed By: ajinkya-deogade
      
      Differential Revision: D46582416
      
      fbshipit-source-id: 0be0bebedd1aad5b67a746c75db3c6b81bcfecee
  20. 08 Jun, 2023 1 commit
  21. 07 Jun, 2023 1 commit
  22. 06 Jun, 2023 1 commit
  23. 03 Jun, 2023 1 commit
  24. 02 Jun, 2023 1 commit
  25. 01 Jun, 2023 1 commit
  26. 29 May, 2023 2 commits
  27. 27 May, 2023 2 commits