Commits · 94dc481abce37490fbbf16a11fca23824f9328e2 · OpenDAS / d2go

09 Jun, 2022 1 commit

unify DDP launcher for elastic and non-elastic (support elastic launch correctly) · 94dc481a

Yanghan Wang authored Jun 08, 2022

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/274

X-link: https://github.com/facebookresearch/mobile-vision/pull/76

TLDR: this diff consolidates the `distributed_helper` of `mobile_cv`; it (together with `mobile_cv`'s `comm` module) should be the go-to library for dealing with DDP. D2Go's `distributed` is now built on top of `mobile_cv`'s `distributed_helper`.
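
As a rough illustration of what a unified launcher has to decide, the sketch below picks rank information from the environment when the process was started by an elastic launcher, and falls back to explicit arguments otherwise. It relies on the assumption that the elastic launcher exports `TORCHELASTIC_RUN_ID`, `RANK`, `WORLD_SIZE` and `LOCAL_RANK`; `DistParams` and `resolve_dist_params` are hypothetical names and are not the `distributed_helper` API.

```python
from typing import Mapping, NamedTuple


class DistParams(NamedTuple):
    """Hypothetical container for the values a launcher needs."""
    rank: int
    world_size: int
    local_rank: int
    elastic: bool


def resolve_dist_params(
    env: Mapping[str, str], num_processes: int = 1, rank: int = 0
) -> DistParams:
    """Return rank info from the environment (elastic) or from arguments.

    Assumption: an elastic launch is recognised by the presence of the
    TORCHELASTIC_RUN_ID environment variable.
    """
    if "TORCHELASTIC_RUN_ID" in env:
        return DistParams(
            rank=int(env["RANK"]),
            world_size=int(env["WORLD_SIZE"]),
            local_rank=int(env["LOCAL_RANK"]),
            elastic=True,
        )
    # Non-elastic, single-machine case: the caller spawns the workers itself,
    # so the local rank equals the global rank.
    return DistParams(
        rank=rank, world_size=num_processes, local_rank=rank, elastic=False
    )
```

A caller would typically pass `os.environ` as `env`; the single-machine fallback is a simplification made for this sketch.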

Reviewed By: newstzpz

Differential Revision: D36787336

fbshipit-source-id: 640c9dcff5eec534e7894c75cfdf0a12d21c297e


15 May, 2022 1 commit

apply import merging for fbcode (7 of 11) · b3a9204c

John Reese authored May 15, 2022

Summary:
Applies new import merging and sorting from µsort v1.0.

When merging imports, µsort will make a best-effort to move associated
comments to match merged elements, but there are known limitations due to
the dynamic nature of Python and developer tooling. These changes should
not produce any dangerous runtime changes, but may require touch-ups to
satisfy linters and other tooling.

Note that µsort uses case-insensitive, lexicographical sorting, which
results in a different ordering compared to isort. This provides a more
consistent sorting order, matching the case-insensitive order used when
sorting import statements by module name, and ensures that "frog", "FROG",
and "Frog" always sort next to each other.
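
The difference between case-sensitive and case-insensitive ordering can be seen with plain Python sorting. This is only an illustration of the ordering idea, not µsort's implementation; the module names are made up, and ties between names that differ only in case are broken here by input order.

```python
# Hypothetical module names, chosen to mirror the "frog" example above.
names = ["frog", "Zebra", "FROG", "apple", "Frog"]

# Case-sensitive: uppercase ASCII letters sort before lowercase ones.
case_sensitive = sorted(names)

# Case-insensitive: compare on the case-folded form of each name.
case_insensitive = sorted(names, key=str.casefold)

print(case_sensitive)    # ['FROG', 'Frog', 'Zebra', 'apple', 'frog']
print(case_insensitive)  # ['apple', 'frog', 'FROG', 'Frog', 'Zebra']
```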

For details on µsort's sorting and merging semantics, see the user guide:
https://usort.readthedocs.io/en/stable/guide.html#sorting

Reviewed By: lisroach

Differential Revision: D36402205

fbshipit-source-id: a4efc688d02da80c6e96685aa8eb00411615a366


14 May, 2022 1 commit

refactor setup for lightning_train_net · 6e8e4256

Yanghan Wang authored May 13, 2022

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/242

Reviewed By: newstzpz

Differential Revision: D36297282

fbshipit-source-id: 8efb19b3186f6978283f4e17e0628b55c2ec816e


24 Mar, 2022 1 commit

fix ddp init twice in oss test · 8eb45690

Yanghan Wang authored Mar 23, 2022

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/192

Lightning now initializes the process group when using the ddp strategy. Because `TestLightningTrainNet` trains with the ddp strategy (https://fburl.com/code/a9yp0kzy), the process group is still initialized after the test has run. Other tests also set up ddp and therefore expect an uninitialized process group. This is not a problem on sandcastle, where the tests run separately, but in the OSS environment the tests run together, so the error occurs (e.g. https://github.com/facebookresearch/d2go/runs/5668912203?check_suite_focus=true).

This diff adds a cleanup step to `TestLightningTrainNet`.
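
A minimal sketch of the cleanup pattern follows. It uses a stand-in class (`FakeProcessGroup`, hypothetical) instead of the real process group so that it runs without PyTorch; the test class names are also made up and this is not the code from the diff.

```python
import unittest


class FakeProcessGroup:
    """Stand-in for global process-group state; NOT torch.distributed."""

    _initialized = False

    @classmethod
    def init(cls):
        # Mimics the failure mode described above: a second init is an error.
        if cls._initialized:
            raise RuntimeError("process group already initialized")
        cls._initialized = True

    @classmethod
    def is_initialized(cls):
        return cls._initialized

    @classmethod
    def destroy(cls):
        cls._initialized = False


class TestTrainsWithDDP(unittest.TestCase):
    def tearDown(self):
        # The cleanup step: leave no global state behind for other tests.
        if FakeProcessGroup.is_initialized():
            FakeProcessGroup.destroy()

    def test_train(self):
        FakeProcessGroup.init()  # stands in for training with the ddp strategy
        self.assertTrue(FakeProcessGroup.is_initialized())


class TestOtherDDPUser(unittest.TestCase):
    def test_expects_clean_state(self):
        FakeProcessGroup.init()  # would raise without the cleanup above
        FakeProcessGroup.destroy()


def run_suite():
    """Run both test cases in one process, as happens in the OSS environment."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite([
        loader.loadTestsFromTestCase(TestTrainsWithDDP),
        loader.loadTestsFromTestCase(TestOtherDDPUser),
    ])
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

With real PyTorch, the analogous calls would be `torch.distributed.is_initialized()` and `torch.distributed.destroy_process_group()`.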

Reviewed By: tglik

Differential Revision: D35099944

fbshipit-source-id: f5b42b2a87d4efd9aa0ed97e6bd2140d80ab9522


25 May, 2021 1 commit

Read number of processes from dist_config · 29b57165

Kai Zhang authored May 24, 2021

Summary: Currently, when launching a training flow, we read the number of processes from resources.num_gpus. To be backward compatible with existing D2Go training configs, this diff reads it from dist_config.num_processes_per_machine instead.
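
The change can be pictured with a small stand-in config. Only the two field paths (`resources.num_gpus` and `dist_config.num_processes_per_machine`) come from the summary above; the dataclasses and the `num_processes` helper are hypothetical and do not reflect the real config types.

```python
from dataclasses import dataclass, field


@dataclass
class Resources:
    num_gpus: int = 0


@dataclass
class DistConfig:
    num_processes_per_machine: int = 1


@dataclass
class FlowConfig:
    """Hypothetical container holding both sections of a training config."""
    resources: Resources = field(default_factory=Resources)
    dist_config: DistConfig = field(default_factory=DistConfig)


def num_processes(config: FlowConfig) -> int:
    # The previous behaviour would have been: return config.resources.num_gpus
    return config.dist_config.num_processes_per_machine


cfg = FlowConfig(
    resources=Resources(num_gpus=8),
    dist_config=DistConfig(num_processes_per_machine=2),
)
print(num_processes(cfg))  # 2
```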

Reviewed By: wat3rBro

Differential Revision: D28630334

fbshipit-source-id: 3c684cd56e5d2e247c7b82e1d1eeff0f39e59ee4


09 Apr, 2021 1 commit

Make checkpointing tests slightly less restrictive · fc5616c8

Ananth Subramaniam authored Apr 09, 2021

Summary:
Before: this test assumed that only 2 checkpoints were stored: `last.ckpt` and `FINAL_MODEL_CKPT`.
Now: this test asserts that at least these 2 checkpoints are stored. Previously, if the config specified `save_top_k=-1`, for instance, more checkpoints were saved, causing this test to fail.

Since this test only loads the last and the final outputs, I'm changing it to assert that these two checkpoints are saved and to ignore any other checkpoint files that may be generated.
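
The relaxed assertion amounts to a subset check instead of an equality check. A minimal sketch, assuming a hypothetical value for `FINAL_MODEL_CKPT` and a hypothetical helper name:

```python
import os
import tempfile

# Hypothetical value; the real constant is defined elsewhere in the code base.
FINAL_MODEL_CKPT = "model_final.ckpt"
REQUIRED = {"last.ckpt", FINAL_MODEL_CKPT}


def missing_checkpoints(ckpt_dir: str) -> set:
    """Return the required checkpoint names that are absent from ckpt_dir."""
    found = {name for name in os.listdir(ckpt_dir) if name.endswith(".ckpt")}
    return REQUIRED - found


with tempfile.TemporaryDirectory() as ckpt_dir:
    # An extra checkpoint, as might be written with save_top_k=-1.
    for name in ["last.ckpt", FINAL_MODEL_CKPT, "epoch=0-step=9.ckpt"]:
        open(os.path.join(ckpt_dir, name), "w").close()

    # Strict check (old behaviour) would fail: three files, not two.
    strict_ok = set(os.listdir(ckpt_dir)) == REQUIRED
    # Relaxed check (new behaviour) passes: nothing required is missing.
    relaxed_ok = not missing_checkpoints(ckpt_dir)

print(strict_ok, relaxed_ok)  # False True
```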

Reviewed By: kazhang

Differential Revision: D27671284

fbshipit-source-id: 0419fb46856d048e7b6eba3ff1dc65b7280a9a90


30 Mar, 2021 1 commit

reorganize unit tests · a0658c4a

Sam Tsai authored Mar 30, 2021

Summary: Separate unit tests into individual folders based on functionality.

Reviewed By: wat3rBro

Differential Revision: D27132567

fbshipit-source-id: 9a8200be530ca14c7ef42191d59795b05b9800cc


24 Mar, 2021 1 commit

Support evaluate predictor · 6aec097e

Kai Zhang authored Mar 24, 2021

Summary:
Evaluate the predictor generated by the previous step.
This diff modifies lightning_train_net to reuse the evaluation logic by adding a `predictor_path` param.
This diff also makes the Lightning training backend depend on `cfg.MODEL.DEVICE`, so that in the evaluate_predictor step the user can set the backend by changing the model device. This is useful for evaluating int8 quantized models.
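
One way to picture the device-to-backend mapping is the sketch below. The function name and the returned backend labels (`"gpu"`, `"cpu"`) are assumptions made for illustration; the actual Lightning arguments chosen by the diff are not shown in this summary.

```python
def backend_for_device(device: str) -> str:
    """Map a device string such as cfg.MODEL.DEVICE to a backend label.

    Hypothetical helper: accepts values like "cpu", "cuda" or "cuda:0".
    """
    kind = device.split(":", 1)[0].lower()
    if kind == "cuda":
        return "gpu"
    if kind == "cpu":
        return "cpu"
    raise ValueError(f"unsupported device: {device!r}")


# An int8 quantized model would be evaluated with the device set to "cpu".
print(backend_for_device("cpu"))     # cpu
print(backend_for_device("cuda:0"))  # gpu
```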

Reviewed By: newstzpz

Differential Revision: D27150609

fbshipit-source-id: fb72da3e81db932c0fa479350150720143e09a3e


20 Mar, 2021 1 commit

move test utils to core library · 9d238344

Yanghan Wang authored Mar 20, 2021

Summary: d2go.tests is not a library in OSS, so move the utils code to d2go.utils.testing.

Reviewed By: zhanghang1989

Differential Revision: D26706933

fbshipit-source-id: 85767b66bbb6c67db05e11823beb4840220b2aa3


11 Mar, 2021 1 commit

update Lightning module test for OSS · 2b5a3176

Kai Zhang authored Mar 11, 2021

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/17

Use PyTorch Lightning checkpoint in the test.

Reviewed By: zhanghang1989

Differential Revision: D26962697

fbshipit-source-id: abe635e374c3ada130243f0eaadff34204f04fa1


03 Mar, 2021 1 commit

Split lightning_train_net into OSS and internal · 857195d8

Kai Zhang authored Mar 03, 2021

Summary:
As titled. The OSS version only uses PyTorch Lightning, while the internal version leverages some additional features (e.g. Manifold integration, every_n_step checkpointing).
This diff splits train_net.main into smaller functions so that they can be shared across the OSS and internal versions.

Reviewed By: zhanghang1989

Differential Revision: D26752701

fbshipit-source-id: 7f68e2a81e78193e117517a0ff668ab14b76ea65
