Commits · e82635eb896f444ce8d86d47f442e6a8dbd01470 · OpenDAS / d2go

07 Aug, 2023 1 commit

Add ODS logging to all runners · e82635eb

Francisc Bungiu authored Aug 07, 2023

Summary:
X-link: https://github.com/facebookresearch/detectron2/pull/5050

Pull Request resolved: https://github.com/facebookresearch/d2go/pull/606

Allow attaching a monitoring service to the training loop.

Reviewed By: miqueljubert

Differential Revision: D47595332

fbshipit-source-id: 49d770207aeea56113c008fcd29ad7b545cec849

e82635eb

04 Aug, 2023 1 commit

only select pth files with prefix "model" as model checkpoint file · 94c7f647

Zhicheng Yan authored Aug 03, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/605

The D2GO workflow's async validation monitors the model checkpoint files *.pth in the **e2e_train** folder (such as **model_0004999.pth**, **model_final.pth**) and launches the async val operator as needed.
All model checkpoint files have the prefix **"model"**, but in some cases the folder also contains non-checkpoint files with the .pth extension.
To exclude them, add a filter that checks whether the file name starts with "model".
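
The filtering rule described above can be sketched as follows. This is a minimal illustration, not the code from the diff: the function name `select_model_checkpoints` and the non-checkpoint file names are made up for the example.

```python
from pathlib import Path
from typing import Iterable, List


def select_model_checkpoints(paths: Iterable[str], prefix: str = "model") -> List[str]:
    """Keep only *.pth files whose file name starts with `prefix`."""
    selected = []
    for p in paths:
        name = Path(p).name
        # Both conditions must hold: .pth extension and the "model" prefix.
        if name.endswith(".pth") and name.startswith(prefix):
            selected.append(p)
    return sorted(selected)


files = [
    "e2e_train/model_0004999.pth",
    "e2e_train/model_final.pth",
    "e2e_train/optimizer_state.pth",  # hypothetical non-checkpoint .pth file
    "e2e_train/metrics.json",         # hypothetical unrelated file
]
checkpoints = select_model_checkpoints(files)
print(checkpoints)
```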

Reviewed By: ayushidalmia

Differential Revision: D48021972

fbshipit-source-id: 54d9c14117192809ea76d812ebd4240b44166637

94c7f647

25 Jul, 2023 2 commits

add warm up stage for d2go ema (for fsdp) · 1c9e0e83

Ji Hou authored Jul 25, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/602

per title

Reviewed By: wat3rBro

Differential Revision: D47740831

fbshipit-source-id: ecbe48a1085232a5cfb696e7f8e537d7e58e534a

1c9e0e83

Move predictor type check into a separate function · 0940b814

Ivan Malin authored Jul 25, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/600

To be able to reuse this logic

Reviewed By: wat3rBro

Differential Revision: D47722117

fbshipit-source-id: 4df1083317eb29fce45ecc4d8c0fdffa417b70d4

0940b814

21 Jul, 2023 2 commits

allow setting limit_all_gather in fsdp · d8734049

Xiaoliang Dai authored Jul 21, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/598

Allow setting limit_all_gather in FSDP. This enables faster training, as discussed in S351092.

Reviewed By: Sekunde

Differential Revision: D47603555

fbshipit-source-id: 48d672fd5cce1763da91d8b801a8cb81630bfcdc

d8734049

Integrate the Genie optimization engine to d2go (reapply D47502855) · 361c5457

Fei Sun authored Jul 20, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/599

The Genie optimization engine assumes that every training iteration that starts also finishes, and that the after_step hook is then called. This assumption does not hold in d2go:
https://www.internalfb.com/code/fbsource/[1537eddbd235e3f599709a493c1a80c7d016b3f8]/fbcode/vision/fair/detectron2/detectron2/engine/train_loop.py?lines=151-165

When an exception is triggered, the last iteration's after_step hook is not called.

In this diff, we patch up the hook integration to ensure that the Genie after_step hook is always called.

Everything else remains the same as D47502855.
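
One common way to guarantee that a hook's `after_step` runs even when the step raises is a `try`/`finally` block. The sketch below only illustrates that idea; it is not the actual patch, and `RecordingHook`, `run_iteration`, and `failing_step` are hypothetical names.

```python
class RecordingHook:
    """Hypothetical hook that records which callbacks were invoked."""

    def __init__(self):
        self.calls = []

    def before_step(self):
        self.calls.append("before_step")

    def after_step(self):
        self.calls.append("after_step")


def run_iteration(hook, step_fn):
    """Run one iteration; after_step is called even if step_fn raises."""
    hook.before_step()
    try:
        step_fn()
    finally:
        # Runs on both success and failure, then any exception propagates.
        hook.after_step()


def failing_step():
    raise RuntimeError("simulated failure inside the training step")


hook = RecordingHook()
try:
    run_iteration(hook, failing_step)
except RuntimeError:
    pass  # the exception still reaches the caller
print(hook.calls)
```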

Reviewed By: XiaoliangDai

Differential Revision: D47611143

fbshipit-source-id: b8b1ae2f304a40cf74340bbaf35647332a9a1524

361c5457

19 Jul, 2023 2 commits

Backout D47502855 · 21f96aa8

Kapil Krishnakumar authored Jul 19, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/597

Report of items being broken on D47580212

Reviewed By: crassirostris

Differential Revision: D47580502

fbshipit-source-id: 899221774cc92aef7fd4f37354171932b09494b6

21f96aa8

minor update of result gathering logic · 95e429a1

Yanghan Wang authored Jul 18, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/596

`outputs = {0: result}` feels a bit hacky; technically it should be `outputs = {worker_rank: result}` to match the `outputs` semantics in the else-branch.

Reviewed By: frabu6

Differential Revision: D47442322

fbshipit-source-id: f4d24f7022971b4f919b4fb4a563164c7f71cd2b

95e429a1

18 Jul, 2023 1 commit

Optimization integration to d2go · bbfdc182

Fei Sun authored Jul 17, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/595

Integrate the Genie optimization module to d2go. Currently only GC is added. Once the integration is successful, more optimizations may be added.

Reviewed By: XiaoliangDai

Differential Revision: D47502855

fbshipit-source-id: ec4bf60bb047463a2c310c7510d66620d801dd29

bbfdc182

14 Jul, 2023 1 commit

Fix instrument_checkpoint swallow exception · 461b6a80

Jack Zhang authored Jul 14, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/593

ContextDecorator won't raise the exception in `__exit__`, so we have to re-raise it manually. Otherwise, the exception is silently discarded.
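
For background, the sketch below shows how an exception can get lost in a `contextlib.ContextDecorator`: a truthy return value from `__exit__` suppresses it, while a falsy one lets it propagate. The exact cause in `instrument_checkpoint` is not shown in this commit message, so the class and function names here are purely illustrative.

```python
from contextlib import ContextDecorator


class Swallowing(ContextDecorator):
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return True  # truthy return value: the exception is suppressed


class Propagating(ContextDecorator):
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # falsy return value: the exception propagates


@Swallowing()
def save_swallowed():
    raise IOError("simulated checkpoint failure")


@Propagating()
def save_propagated():
    raise IOError("simulated checkpoint failure")


# The failure is silently discarded and the call just returns None.
swallowed_result = save_swallowed()

try:
    save_propagated()
    propagated = False
except IOError:
    propagated = True

print(swallowed_result, propagated)
```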

Reviewed By: wat3rBro

Differential Revision: D47454999

fbshipit-source-id: 44b1884543206202036f588eebe23cf61974982b

461b6a80

12 Jul, 2023 1 commit

Extend reply files to all binaries · e4fa6d63

Francisc Bungiu authored Jul 12, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/591

We previously added reply files for train_net, but not the other relevant binaries with MAST support: evaluator and lightning.
Adding support here by extracting the common bits into a separate module and wrapping the functions to reuse the functionality.

Differential Revision: D47293689

fbshipit-source-id: 70630a471c0cf037d180c9edfb57a4db4fdf7bde

e4fa6d63

05 Jul, 2023 1 commit

Add profiler to d2go lightning · 53748d9d

Francisc Bungiu authored Jul 05, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/589

Allow attaching GPU profiler to lightning d2go tasks.

Reviewed By: miqueljubert

Differential Revision: D47190798

fbshipit-source-id: b10269d25de6b5f977633796e77b0d6d912a873a

53748d9d

28 Jun, 2023 2 commits

enable autodeps for tests · a2b9a523

Yanghan Wang authored Jun 28, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/588

enable autodeps for d2go test to unblock next diff.

maybe in future we can break it into smaller pieces to make tests build and run faster.

Reviewed By: ajinkya-deogade

Differential Revision: D47080563

fbshipit-source-id: 9d8ee2a13f91a34c79aa13f2b8165c615643b87d

a2b9a523

Remove profiling of evaluation · b1e24e81

Francisc Bungiu authored Jun 28, 2023

Summary: Deprecate prepare_fb_model_for_eval().

Reviewed By: miqueljubert

Differential Revision: D47085783

fbshipit-source-id: 34b7e822e9baa1f9f77a11d3497df7fb0463c955

b1e24e81

26 Jun, 2023 1 commit

Exposing adding additional parameters for observers. · 2de6546e

Ayushi Dalmia authored Jun 26, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/586

Adding additional parameters for observers

Reviewed By: navsud

Differential Revision: D46136523

fbshipit-source-id: ce44d4cdfcd4ef8524f85eb148ee789137fa8abf

2de6546e

23 Jun, 2023 4 commits

disable FSDP mixed precision for model buffers · b0abd7aa

Anthony Chen authored Jun 22, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/585

Disable FSDP mixed precision for model buffers. Buffers are usually small in size so there's very limited performance gain for enabling mixed precision. Plus, applications like BatchNorm layers and diffusion models are very sensitive to the precision of buffers. Thus, we stick to full precision for buffers in FSDP.

Reviewed By: wat3rBro

Differential Revision: D46951673

fbshipit-source-id: 12bb1a47fbd8b3dd85c7f781bab707206044af15

b0abd7aa

update INJECTED_COCO_DATASETS_LUT when registering AdhocCOCODataset · be8a6324

Zhicheng Yan authored Jun 22, 2023

Summary:
When registering AdhocCOCODataset, INJECTED_COCO_DATASETS_LUT needs to be updated as well.
For example, if a dataset uses a custom registering function, it can only be retrieved from INJECTED_COCO_DATASETS_LUT.
Otherwise, it uses the default registering function, as in the `register_dataset_split` branch.

Reviewed By: antonrigner

Differential Revision: D46826507

fbshipit-source-id: 9170c5b57f3935875b899ab7f93c3c57e77eb28c

be8a6324

remove AC prefix from EMA to make it compatible with loading · 5c23bee8

Anthony Chen authored Jun 22, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/578

# Problem:
d2go EMA uses `named_parameters()` to traverse model states and save EMA checkpoints, while using `state_dict()` to save model checkpoints. This is a brittle practice because `named_parameters()` and `state_dict()` are two different Python APIs and can return different things.
In the case of Activation Checkpointing (AC), we don't want the AC wrapper to affect checkpoint names. Thus, `state_dict()` is overridden by PyTorch to remove the prefix "_checkpoint_wrapped_module" from the FQN. However, `named_parameters()` does not have that support, so the prefix still exists. If we change the AC wrapping strategy (very common for optimization), we will not be able to load the previous EMA state back into the model. The same problem also happens with FSDP.

# Short-term hack:
This diff adds a short-term hack to manually remove the AC prefix in EMA. We can expand `IGNORED_FQN_PREFIX` to support more use cases.
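
A minimal sketch of the prefix stripping, assuming the wrapper shows up as a dotted component inside the fully qualified name. `IGNORED_FQN_PREFIX` is named in the summary, but its contents here (including the trailing dot) and the helper `clean_fqn` are assumptions made for illustration.

```python
from typing import Tuple

# Assumption: wrapper components to drop from fully qualified names (FQNs).
IGNORED_FQN_PREFIX: Tuple[str, ...] = ("_checkpoint_wrapped_module.",)


def clean_fqn(name: str, prefixes: Tuple[str, ...] = IGNORED_FQN_PREFIX) -> str:
    """Remove every ignored wrapper component from a parameter name."""
    for prefix in prefixes:
        name = name.replace(prefix, "")
    return name


# Name as named_parameters() might report it with an AC wrapper in place.
wrapped = "backbone._checkpoint_wrapped_module.conv1.weight"
print(clean_fqn(wrapped))
```

With the names normalized this way, EMA state saved under one wrapping strategy can be matched against a model that uses a different one.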

Reviewed By: wat3rBro

Differential Revision: D46815031

fbshipit-source-id: 29b6ea444ed2ef90b8741fccdcb2b62625933e7f

5c23bee8

disable memory profiler by default + remove force disable + add logging · c0a84df5

Anthony Chen authored Jun 22, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/581

Reviewed By: wat3rBro

Differential Revision: D46913792

fbshipit-source-id: cf3c3812c455091fbf63842443644d2571976017

c0a84df5

22 Jun, 2023 3 commits

expose use_orig_params to d2go config · 7f17bbf0

Anthony Chen authored Jun 22, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/582

Expose use_orig_params for FSDP constructor to d2go config. Read more about it in the docstring of torch.distributed.fsdp.fully_sharded_data_parallel.

use_orig_params=False (default) uses FlatParameters to store flattened parameters, which saves memory by avoiding fragmentation. However, use_orig_params=True is essential for models that are partly frozen, because FlatParameters can only accept a uniform requires_grad across the whole model.

Reviewed By: wat3rBro

Differential Revision: D46917757

fbshipit-source-id: 12ebe83e6de456e37d89eaf8b257f23925a6786d

7f17bbf0

Add MAST support for eval · 60b6995d

Francisc Bungiu authored Jun 22, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/583

Extend support to MAST for evaluator binary.

Reviewed By: miqueljubert

Differential Revision: D46762473

fbshipit-source-id: 62ac68f195c89924abf71c9b6a9715d60ffcbf9b

60b6995d

clean up all __init__.py · 955e53f6

Yanghan Wang authored Jun 21, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/580

Reviewed By: ajinkya-deogade

Differential Revision: D46875151

fbshipit-source-id: e19d9ac79c0a4ad1b1ab49112e36f80c55062ea4

955e53f6

21 Jun, 2023 1 commit

Enable Class Balancing for Model Train Sampler · 94b027bb

Devin Zhou authored Jun 21, 2023

Summary:
This diff enables both category and dataset weight balancing at the same time by declaring "WeightedCategoryTrainingSampler" under "SAMPLER_TRAIN" in the config file.

X-link: https://github.com/facebookresearch/detectron2/pull/4995

Pull Request resolved: https://github.com/facebookresearch/d2go/pull/570

Reviewed By: jiaxuzhu92, shiyud

Differential Revision: D46377371

fbshipit-source-id: 4e8bdf6a7e5d40b04072cb99637d13d85b2e0fce

94b027bb

19 Jun, 2023 1 commit

Fix key error 0 in multinode training · 78328839

Francisc Bungiu authored Jun 19, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/579

The current code assumed that training runs on a single node, so that there is always a global rank 0 on each node. This assumption fails in multinode training, resulting in a key error on key 0.
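
The failure mode can be reproduced with plain dictionaries. The sketch below assumes each node collects results keyed by global rank; the function names and the "lowest rank present" fallback are illustrative assumptions, not the fix applied in this diff.

```python
def pick_result_fragile(outputs):
    """Assumes global rank 0 lives on this node."""
    return outputs[0]


def pick_result(outputs):
    """Use the lowest global rank that is actually present on this node."""
    return outputs[min(outputs)]


# Hypothetical per-node results for 2 nodes with 2 workers each.
node0_outputs = {0: "result-from-rank-0", 1: "result-from-rank-1"}
node1_outputs = {2: "result-from-rank-2", 3: "result-from-rank-3"}

try:
    pick_result_fragile(node1_outputs)
    fragile_failed = False
except KeyError:
    fragile_failed = True  # node 1 has no global rank 0

print(fragile_failed, pick_result(node1_outputs))
```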

Reviewed By: crassirostris

Differential Revision: D46841286

fbshipit-source-id: d57919239fa5042de795d74c9c2013b07c9a0a48

78328839

16 Jun, 2023 2 commits

Force disable oom monitor · 1a8e1283

Miquel Jubert Hermoso authored Jun 16, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/577

Reviewed By: seijiyamamoto

Differential Revision: D46798443

fbshipit-source-id: 21e66cc26d98e866d34c92fa86b26b977c02925d

1a8e1283

fix quantization import · 62613829

Yanghan Wang authored Jun 15, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/575

ez

Reviewed By: ajinkya-deogade

Differential Revision: D46773836

fbshipit-source-id: 8cbfbfac6a60cab26ee1975ce0b876738711c160

62613829

14 Jun, 2023 1 commit

Enable activation checkpointing · 0389f4ee

Anthony Chen authored Jun 14, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/573

Enable Activation Checkpointing from PyTorch Distributed in d2go.

Reviewed By: rohan-varma

Differential Revision: D45681009

fbshipit-source-id: c03f27af61e0374b9e5991d82070edbe41edde6d

0389f4ee

13 Jun, 2023 2 commits

delete loaded ckpt after use to save memory · 3fce52cf

Anthony Chen authored Jun 13, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/574

Currently, the d2go runner doesn't delete the checkpoint after loading it. This is fine if we run with `resume=True`, because all the model/optimizer/EMA state in the checkpoint is loaded into the corresponding training components. However, with `resume=False`, only the model state is loaded, and the optimizer/EMA state is left in memory until the end of training. This could cause OOM if the checkpoint is large.

This diff deletes the loaded ckpt after use to save memory and avoid potential OOM issues.
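
The effect of dropping the checkpoint reference can be sketched without PyTorch. In this illustration `Blob` stands in for large tensor state, and the checkpoint keys are hypothetical; the point is only that once the checkpoint dict is deleted, the unused optimizer state becomes collectable while the model state that is still referenced survives.

```python
import gc
import weakref


class Blob:
    """Stand-in for a large chunk of tensor state."""

    def __init__(self, name):
        self.name = name


def load_checkpoint():
    # Hypothetical checkpoint layout.
    return {"model": Blob("model"), "optimizer": Blob("optimizer"), "ema_state": Blob("ema")}


checkpoint = load_checkpoint()
optimizer_ref = weakref.ref(checkpoint["optimizer"])

# resume=False: only the model state is consumed.
model_state = checkpoint["model"]

# Without this, optimizer/EMA state would stay alive as long as `checkpoint` does.
del checkpoint
gc.collect()

optimizer_freed = optimizer_ref() is None
print(optimizer_freed, model_state.name)
```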

Reviewed By: tglik

Differential Revision: D46674618

fbshipit-source-id: 2b70a8e46c7f2a309f83cc4deefe5d7a14783734

3fce52cf

move detectron2 related .autodeps.toml to detectron2 · a879c1b4

Yanghan Wang authored Jun 12, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/572

Reviewed By: ajinkya-deogade

Differential Revision: D46664313

fbshipit-source-id: acb1876c92c3907eb185dd144782495bda593d23

a879c1b4

12 Jun, 2023 1 commit

fix d2go.config · bcad53f6

Yanghan Wang authored Jun 12, 2023

Summary:
I think the main issue is that we import `reroute_config_path` from `d2go.config.config` in `__init__.py`, but it's actually in `d2go.config.utils`. After fixing this, the namespace forward also works, see `scripts/wangyanghan/autodeps_testbed/d2go_config/TARGETS`

Update all TARGETS:
```
fbgs -l "d2go/config:" | xargs printf -- '/data/sandcastle/boxes/%s\n' | xargs arc lint -a
```

For reviewers, only `.autodeps.toml` and files in `d2go/d2go/config/` and `scripts/wangyanghan/autodeps_testbed/d2go_config/` are manually changed, other files are auto modified.

Reviewed By: ajinkya-deogade

Differential Revision: D46582416

fbshipit-source-id: 0be0bebedd1aad5b67a746c75db3c6b81bcfecee

bcad53f6

08 Jun, 2023 1 commit

Enable preemption checkpointing for d2go FSDPCheckpointer · 61f72a8c

Anthony Chen authored Jun 07, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/567

As title.

Reviewed By: tglik

Differential Revision: D46383823

fbshipit-source-id: b5f80f55eb37ddc4e0918a349840b451f2b4b094

61f72a8c

07 Jun, 2023 1 commit

Convert GPU to CPU if CUDA not available · 3ecf8806

Jessica Zhong authored Jun 06, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/569

Reviewed By: wat3rBro

Differential Revision: D46498855

fbshipit-source-id: 99888f6a36a0f69155c3447cc080392ae9886539

3ecf8806

06 Jun, 2023 1 commit

added logging and command line flag --use_elastic to enable torch elastic · f6afd9a9

Jessica Zhong authored Jun 06, 2023

Reviewed By: wat3rBro

Differential Revision: D46460305

fbshipit-source-id: e91d9312c5d81ef1ba64ab169380329c8ad05f7c

f6afd9a9

03 Jun, 2023 1 commit

use `get_convert_fx_fn` for eager mode convert · 3ba489fa

Jiaxu Zhu authored Jun 02, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/564

As title, as we need `ai_factory.quantization.convert.convert_eager` for Stinson models. This diff renames `get_convert_fx_fn` to `get_convert_fn` and includes eager mode convert functions as well.

Reviewed By: wat3rBro

Differential Revision: D46368438

fbshipit-source-id: 5ebea1f05b43b476a14ab1091f6ce39bffe614d3

3ba489fa

02 Jun, 2023 1 commit

Enable Torch Elastic Launch on Mast in D2go · 7d35bae7

Jessica Zhong authored Jun 02, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/566

Reviewed By: wat3rBro

Differential Revision: D45829249

fbshipit-source-id: 4e70bed0e85179b49b4e2358be3d937cfbf474d4

7d35bae7

01 Jun, 2023 1 commit

print parameter names in individual param groups · 87956d50

Zhicheng Yan authored May 31, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/539

Print out the parameter names in each parameter group to a separate file (instead of writing them to the main log file).
This is useful for checking which param group a specific parameter is assigned to.
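
As a rough sketch of the idea, the snippet below writes one section per parameter group to its own file. The `param_names` key, the output format, and the file name are assumptions for illustration; the actual diff may store and format the names differently.

```python
import os
import tempfile
from typing import Dict, List


def dump_param_group_names(param_groups: List[Dict], path: str) -> None:
    """Write the parameter names of each group to a dedicated file."""
    with open(path, "w") as f:
        for idx, group in enumerate(param_groups):
            f.write(f"param group {idx}: lr={group['lr']}\n")
            for name in group["param_names"]:
                f.write(f"  {name}\n")


# Hypothetical groups; "param_names" is not a standard optimizer key.
groups = [
    {"lr": 0.01, "param_names": ["backbone.conv1.weight", "backbone.conv1.bias"]},
    {"lr": 0.1, "param_names": ["head.fc.weight"]},
]
out_path = os.path.join(tempfile.mkdtemp(), "param_groups.txt")
dump_param_group_names(groups, out_path)

with open(out_path) as f:
    dumped = f.read()
print(dumped)
```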

Reviewed By: wat3rBro

Differential Revision: D45855436

fbshipit-source-id: 1e1db4cf079802fc20fe3e3d0a931d8c44721d6c

87956d50

29 May, 2023 2 commits

Put back typing for Base Runner create_shared_context · 17672daa

Ajinkya Deogade authored May 28, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/562

Reverting the changes introduced in the diff D46096375 to restore the state before modularization.

Reviewed By: tglik

Differential Revision: D46145093

fbshipit-source-id: 9897640ec00331fc6ea2817fa46b2272fc33cb8d

17672daa

Trainer part 2: Create a separate TARGET for lightning trainer · d06a8fb1

Ajinkya Deogade authored May 28, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/561

This is the continuation from the part 1 D45912069 where we had not defined the TARGETS for the lightning trainer.
As the circular deps have been resolved, we can define the targets for `d2go/trainer/lightning` and move the other TARGETS inside `d2go/trainer`.

Reviewed By: tglik

Differential Revision: D46096373

fbshipit-source-id: 6efc13eb9ab343d11028fb238e6e3f0c64a03e09

d06a8fb1

27 May, 2023 2 commits

Utils part 2: create a separate buck target · 0cde431c

Ajinkya Deogade authored May 27, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/560

This is the continuation from the part 1 D45912077.
As the dependencies have been resolved, we can define the targets inside the dir `d2go/utils`

Reviewed By: wat3rBro

Differential Revision: D46096376

fbshipit-source-id: ab674d382162a4d7e5ee944b2a649e23278ca79f

0cde431c

Runner: create a separate buck target · 00208026

Ajinkya Deogade authored May 26, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/559

Create modular TARGETS for files inside `runner`.

Reviewed By: wat3rBro

Differential Revision: D45854271

fbshipit-source-id: a15ef475f72685ae8c3c73e0a83cf136a7285d3e

00208026