Commits · c2256758202eb51ae8f21200f58dcbb70ca96690 · OpenDAS / d2go

15 Dec, 2023 2 commits

allow to ignore state dict keys in QAT model · c2256758

Zhicheng Yan authored Dec 15, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/642

When we build a QAT model using the FX graph mode APIs **prepare_qat_fx** and **convert_fx**, they run symbolic tracing following **module.forward()**.

In certain cases, such as when a module takes a constant tensor as input, symbolic tracing adds new tensor attributes with the name prefix **_tensor_constant** (https://fburl.com/code/msc4ch4o), which become new keys in the QAT model state dict.

The current implementation of **_setup_non_qat_to_qat_state_dict_map** asserts that the state dicts of the original model and the QAT model have the same number of keys.

Thus, we extend the **qat_state_dict_keys_to_ignore** method by adding an argument that allows specified state dict keys in the QAT model to be ignored.
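
A minimal sketch of the idea, using plain lists of key names rather than real models. The helper `filter_state_dict_keys` and the example key names are hypothetical and are not d2go's actual implementation; only the `_tensor_constant` prefix comes from the summary above.

```python
from typing import Iterable, List, Sequence


def filter_state_dict_keys(keys: Iterable[str], ignore_substrings: Sequence[str]) -> List[str]:
    """Return the keys that contain none of the ignored substrings (hypothetical helper)."""
    return [k for k in keys if not any(s in k for s in ignore_substrings)]


# Keys of the original (non-QAT) model; the names are made up for illustration.
original_keys = ["backbone.conv.weight", "backbone.conv.bias"]

# Keys of the FX-traced QAT model: tracing added a constant-tensor attribute.
qat_keys = ["backbone.conv.weight", "backbone.conv.bias", "backbone._tensor_constant0"]

kept_qat_keys = filter_state_dict_keys(qat_keys, ignore_substrings=["_tensor_constant"])

# With the traced constants ignored, the key counts match again.
assert len(kept_qat_keys) == len(original_keys)
```

Filtering before the count check keeps the assertion meaningful for real parameters while tolerating attributes that only exist because of tracing.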

Reviewed By: wat3rBro

Differential Revision: D52152706

fbshipit-source-id: 92219feae43bf8841b0a3a71adfbfcb84d8e8f95

c2256758

do not fuse model again for a QAT model · 8f130231

Zhicheng Yan authored Dec 15, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/643

A QAT model contains observers. After QAT training, those observers already hold updated statistics, such as min_val and max_val.

When we export the FP32 QAT model for a sanity check, calling **fuse_utils.fuse_model()** again (it is often already called when we build the QAT model before QAT training) removes the statistics in the observers.

Reviewed By: wat3rBro

Differential Revision: D52152688

fbshipit-source-id: 08aa16f2aa72b3809e0ba2d346f1b806c0e6ede7

8f130231

07 Dec, 2023 2 commits

add API reset optimization engine · da53aa10

Yanghan Wang authored Dec 07, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/640

Reviewed By: tglik

Differential Revision: D51908239

fbshipit-source-id: 7bcbad1fc7065b736cf4e38d155eed5d734758f7

da53aa10

Enable preemption checkpointing · 409cd213

Francisc Bungiu authored Dec 07, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/639

Expose ability to add a preemption checkpointing hook running in a separate process group.

Reviewed By: wat3rBro, ynonaolga

Differential Revision: D51115437

fbshipit-source-id: c843802bc59da9f57c09c8d9a20f3d72d5b98edf

409cd213

30 Nov, 2023 1 commit

add callbacks for inference_on_dataset · d0e16684

Yanghan Wang authored Nov 30, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/637

Reviewed By: tglik

Differential Revision: D51540498

fbshipit-source-id: f246559963c5187140db7b8113765f66a964ae1b

d0e16684

17 Nov, 2023 1 commit

Use the consolidated snapshot API in Unitrace to support Zoomer · 87649f4f

Wei Sun authored Nov 17, 2023

Summary: Similar to D48210543. Update the training_hooks to use the Unitrace memory snapshot APIs. This allows us to maintain a single path for memory snapshot APIs, and also to collect important details such as the snapshot location for Zoomer.

Pulled By: HugeEngine

Pull Request resolved: https://github.com/facebookresearch/d2go/pull/636

Reviewed By: frabu6, aaronenyeshi, jackiexu1992, mengluy0125

Differential Revision: D48368150

fbshipit-source-id: b279adfa29d390e615d2c32a7ab9e05d95b4f164

87649f4f

10 Nov, 2023 1 commit

add print during _populate_registries · 8d072ebf

Yanghan Wang authored Nov 10, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/634

Reviewed By: yzhao30

Differential Revision: D51208655

fbshipit-source-id: 3280bde8807b623ec56841cc6d0ffc87a1e02e83

8d072ebf

09 Nov, 2023 1 commit

Migrate transformer_auto_wrap_policy to ModuleWrapPolicy · 40e78153

Anthony Chen authored Nov 08, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/633

transformer_auto_wrap_policy is buggy and causes issues when wrapping an already wrapped module. Migrate to ModuleWrapPolicy.

Reviewed By: tglik

Differential Revision: D51124721

fbshipit-source-id: 61c4f5f810ead3c3776a7310926b2181121162ac

40e78153

05 Nov, 2023 1 commit

allow to skip loading model weights in build_model() · f2a0c52c

Zhicheng Yan authored Nov 05, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/630

Currently, in the runner's **build_model()** method, when **eval_only=True**, we always try to load model weights.
This is too restrictive in some cases. For example, we may just want to build a model in eval mode to profile its efficiency before the model has been trained or its weights have been saved to a checkpoint file.

Thus, this diff adds an argument **skip_model_weights** to allow users to skip the loading of model weights.
Note that this diff is fully backward compatible and is NOT expected to break existing implementations.

Reviewed By: navsud, wat3rBro

Differential Revision: D50623772

fbshipit-source-id: 282dc6f19e17a4dd9eb0048e068c5299bb3d47c2

f2a0c52c

01 Nov, 2023 1 commit

resolve CPU OOM with FSDP checkpointer · 2d4d2f29

Yanghan Wang authored Nov 01, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/632

Reviewed By: yzhao30

Differential Revision: D50663689

fbshipit-source-id: 5c4c1dd2e5d2087be5aec268672bb5e7fc329df9

2d4d2f29

23 Oct, 2023 1 commit

Adding search for all torch multi-tensor optimizers · 7ace1ef0

Matteo Presutto authored Oct 23, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/629

This diff adds all of the torch multi-tensor optimizers to d2go, which in its current form supports only AdamW, Adam and SGD.

Reviewed By: mlopezantequera

Differential Revision: D50498623

fbshipit-source-id: 5a38509354e565dd22256261bf1a688bcdc94951

7ace1ef0

20 Oct, 2023 1 commit

remove logging using logger at exit · b18c078a

Zhicheng Yan authored Oct 20, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/628

At exit, the logger's file descriptor for info-level logging has already been closed, so calling **logger.info()** raises an exception. Thus, we remove the call.

Reviewed By: ayushidalmia, wat3rBro

Differential Revision: D50488097

fbshipit-source-id: 42b568e2e29d837424c3b2e42a5a33c067651ec3

b18c078a

12 Oct, 2023 1 commit

Enable training for fraction of total steps; enable early stopping from trial 0 · 3c724416

Igor Fedorov authored Oct 12, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/627

Enable training for fraction of total steps: when doing HPO, users may want to train for a fraction of the number of training steps of a regular (baseline) training run. In this case, it is not enough to just change SOLVER.MAX_ITER because that also changes the learning rate schedule. We introduce a multiplier to be used on top of SOLVER.MAX_ITER when deciding how many steps to train for. This multiplier does not scale the number of steps over which the learning rate schedule is defined.
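
The decoupling described above can be sketched as follows. The cosine schedule, the helper names and the `fraction` variable are assumptions made for illustration; the log does not name the multiplier or the schedule d2go uses.

```python
import math


def cosine_lr(step: int, base_lr: float, schedule_steps: int) -> float:
    """Cosine decay defined over `schedule_steps`, independent of when training stops."""
    progress = min(step, schedule_steps) / schedule_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def steps_to_train(max_iter: int, fraction: float) -> int:
    """Number of steps actually run; `fraction` is a hypothetical multiplier name."""
    return max(1, int(max_iter * fraction))


max_iter = 1000   # plays the role of SOLVER.MAX_ITER
fraction = 0.25   # hypothetical multiplier for an HPO trial
num_steps = steps_to_train(max_iter, fraction)

# The schedule is still defined over max_iter, so the trial sees the same
# learning rates as the first quarter of the baseline run.
lrs = [cosine_lr(step, base_lr=0.1, schedule_steps=max_iter) for step in range(num_steps)]
```

Had `max_iter` itself been reduced instead, the cosine curve would have been compressed into the shorter run and the trial would no longer be comparable to the baseline.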

Reviewed By: raghuramank100

Differential Revision: D48699087

fbshipit-source-id: 903f7c957ee471f36365c1449e9cd6a919fd260a

3c724416

11 Oct, 2023 1 commit

only let local master to download fsdp full checkpoint · 54d9d91b

Yanghan Wang authored Oct 10, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/626

Reviewed By: YanjunChen329

Differential Revision: D50135150

fbshipit-source-id: 6c85d4e966bb9e399c0fc17046fd1318bfbb1546

54d9d91b

05 Oct, 2023 1 commit

disable_fake_quant on 0 step · b375c290

Olga Gerasimova authored Oct 05, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/623

If we load a model (d2go/runner/default_runner.py?lines=567) that had enable_fake_quant applied, then we need to disable it in on_begin_train.

Reviewed By: jiaxuzhu92

Differential Revision: D49911356

fbshipit-source-id: f51b2a043c0c3f754d5698eb4b5d968a28d601d1

b375c290

03 Oct, 2023 1 commit

Add proper barriers around FSDP checkpointing · 27918553

SK Bong authored Oct 02, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/621

There should be barriers around FSDP checkpointing to ensure other ranks do not continue training while rank 0 is still checkpointing.

Also add a log message after the checkpoint finishes.

Reviewed By: wat3rBro

Differential Revision: D49541229

fbshipit-source-id: ac8c086eb0d65611be0b258e3006d9e14b7387ad

27918553

27 Sep, 2023 2 commits

Make EMA checkpointing with FSDP more robust · 477629d0

Anthony Chen authored Sep 27, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/615

The previous FSDP EMA checkpointing logic handles `EMAState` directly: it manually calls `FSDP.summon_full_params()` to gather the full model params, and reconstructs/loads an `EMAState` for checkpointing. This logic has two drawbacks:

1. `FSDP.summon_full_params()` gathers all model weights at the same time, which could cause OOM issues if the model can't fit into a single GPU. This is quite common for FSDP workloads.
2. Directly saving and loading `EMAState` is error-prone. The EMA state dict has different semantics and behaviors than `model.state_dict()`. However, users often expect it to function seamlessly like the model state dict.

This diff modifies the save/load logic of EMA to directly use `model.state_dict()` to solve the above two pain points.

Reviewed By: wat3rBro

Differential Revision: D48813697

fbshipit-source-id: be53c2677d2e493ba923508bbd82d9d295397941

477629d0

add damit uri support in train_net local run · c668ed4e

Min Xu authored Sep 27, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/622

as title

Reviewed By: jiaxuzhu92, ywwwer

Differential Revision: D49672980

fbshipit-source-id: f34ffe944c25c948fe1abd492ea0b96e47dc5b06

c668ed4e

25 Sep, 2023 1 commit

Propagate include_frozen/buffers to EMAState in FSDP FULL_STATE_DICT checkpoints · 206a05c6

Ed Pizzi authored Sep 25, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/620

EMA can be configured to exclude frozen (`requires_grad=False`) parameters and buffers, reducing memory use and checkpoint size.

However `FULL_STATE_DICT` FSDP + EMA checkpoints construct an inner `EMAState` after unsharding FSDP parameters. This inner `EMAState` uses default `include_frozen` and `include_buffers` settings, resulting in checkpoints containing frozen parameters and buffers regardless of settings.

Propagate `include_frozen` and `include_buffers` settings to the inner `EMAState` when gathering `FULL_STATE_DICT` FSDP EMA state.

This change only affects frozen parameters with a parallel fix to PyTorch FSDP to propagate `requires_grad` across parameter sharding/unsharding: https://github.com/pytorch/pytorch/pull/109892.

Reviewed By: daveboat

Differential Revision: D49517178

fbshipit-source-id: 0fe159dcec9ec1f2c456ae2ee7798681e7536249

206a05c6

21 Sep, 2023 1 commit

Fix cpu().detach() into detach().cpu() · 93037c4e

Yang Liu authored Sep 21, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/619

For visualisation, tensor variables should be detached from the computational graph. The .cpu() function call should be after the detach().

Reviewed By: frabu6, wat3rBro

Differential Revision: D48737228

fbshipit-source-id: b7308c852bdbae89fddba088f5188f61a9a216a8

93037c4e

11 Sep, 2023 1 commit

adding training deterministic setups · d49077dd

Hongye Yang authored Sep 11, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/617

Reviewed By: tglik

Differential Revision: D49065205

Privacy Context Container: L1181999

fbshipit-source-id: b8e8b994a2bd32967dbb9afbc0d8fcfa7ef59667

d49077dd

06 Sep, 2023 1 commit

Add check for empty bboxes · 66f626dd

Karla Brkic authored Sep 06, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/616

The check for valid bboxes doesn't verify that the bbox list has exactly 4 elements, and crashes the training instead of marking empty bboxes as invalid (see f472700454).
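
A minimal sketch of a length-checking validity test for XYWH boxes. The function name, signature and the exact bounds conditions are assumptions for illustration, not d2go's actual check.

```python
from typing import Sequence


def is_valid_bbox_xywh(bbox: Sequence[float], image_width: float, image_height: float) -> bool:
    """Hypothetical validity check for an XYWH box; not d2go's actual function."""
    # Reject empty or malformed boxes first instead of failing on unpacking.
    if len(bbox) != 4:
        return False
    x, y, w, h = bbox
    if w <= 0 or h <= 0:
        return False
    # The box must lie inside the image.
    return x >= 0 and y >= 0 and x + w <= image_width and y + h <= image_height


# An empty bbox is reported as invalid rather than raising during unpacking.
empty_is_valid = is_valid_bbox_xywh([], image_width=100, image_height=100)
```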

Reviewed By: tglik

Differential Revision: D48653084

fbshipit-source-id: 2d47fb267c5e51ab27798662ae739014f3d310e4

66f626dd

24 Aug, 2023 1 commit

Add Optimizer FSDP and AC on 3xUnet/5xUnet · 7ad54f57

Jessica Zhong authored Aug 24, 2023

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/614

Reviewed By: wat3rBro, YanjunChen329

Differential Revision: D48544742

fbshipit-source-id: 9e49f13aa50e065c30e5551a636a83afd2d11acd

7ad54f57

22 Aug, 2023 2 commits

Revert D48533397: disable recording memory snapshots after dumping · c3169c1e

Reza Barazesh authored Aug 22, 2023

Differential Revision: D48533397

Original commit changeset: cbf260823172

Original Phabricator Diff: D48533397

fbshipit-source-id: 6ef669973058fc9dc20f3b2839f4d931c3a58c3d

c3169c1e

disable recording memory snapshots after dumping · 61485c81

Anthony Chen authored Aug 22, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/611

Disable recording memory snapshots after dumping to files. Otherwise the process won't have a clean shutdown.

Reviewed By: ertrue, wat3rBro

Differential Revision: D48533397

fbshipit-source-id: cbf260823172222b8015008eaffa3d0361fa6233

61485c81

19 Aug, 2023 1 commit

print sampling probability for WeightedTrainingSampler · 49ffc846

Wei Ye authored Aug 18, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/610

As titled

Reviewed By: wat3rBro

Differential Revision: D48461077

fbshipit-source-id: f0bfd0dc9b8615b958a68d35c3df25a6c52859c0

49ffc846

12 Aug, 2023 1 commit

Summary: · f7e1b47e

Yichao Lu authored Aug 11, 2023

Pull Request resolved: https://github.com/facebookresearch/d2go/pull/609

In the previous code, the valid_bbox function was designed only for XYWH horizontal bboxes. This caused XYWHA rotated bboxes to be marked invalid when they are large or close to the right edge of the image. This diff therefore adds a separate valid_bbox_rotated for XYWHA-format bboxes.
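
To illustrate why a separate check is needed, here is a sketch that assumes the XYWHA layout (center x, center y, width, height, angle). The function name and the conditions it applies are hypothetical and may differ from the real valid_bbox_rotated.

```python
from typing import Sequence


def is_valid_bbox_xywha(bbox: Sequence[float], image_width: float, image_height: float) -> bool:
    """Hypothetical validity check for a rotated XYWHA box; not d2go's actual function."""
    if len(bbox) != 5:
        return False
    cx, cy, w, h, _angle = bbox
    if w <= 0 or h <= 0:
        return False
    # For a rotated box the first two values are the center, so a test such as
    # `x + w <= image_width` (which treats x as the left edge) would be wrong here.
    return 0 <= cx <= image_width and 0 <= cy <= image_height


# Center at x=90 with width 40 in a 100-pixel-wide image: an XYWH-style test
# (90 + 40 > 100) would reject this box, while the center-based test accepts it.
rotated_is_valid = is_valid_bbox_xywha([90, 50, 40, 20, 30], image_width=100, image_height=100)
```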

Reviewed By: debowin

Differential Revision: D48138234

fbshipit-source-id: d09d209afde9843624169af04f2e1692180bca0d

f7e1b47e

08 Aug, 2023 1 commit

Enable memory profiling for D2Go trainer of Genie · 9e40d710

Menglu Yu authored Aug 07, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/607

Titled

Reviewed By: tglik

Differential Revision: D47535500

fbshipit-source-id: 93635f36b7164472bac6560d9f6626262096d14e

9e40d710

07 Aug, 2023 2 commits

Fix empty context manager · f59dbb04

Francisc Bungiu authored Aug 07, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/608

In its current form, the unit test fails with (https://fburl.com/ssrymti4):
```
 with get_monitoring_service():
E       AttributeError: __enter__
```
Return a nullcontext to address this.
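
A minimal sketch of the fix, using `contextlib.nullcontext` from the standard library. The body and the `service` parameter of `get_monitoring_service` are hypothetical stand-ins; only the function name appears in the traceback above.

```python
from contextlib import nullcontext


def get_monitoring_service(service=None):
    """Hypothetical stand-in: return the service if configured, else a no-op context."""
    # Returning None here would make `with get_monitoring_service():` fail,
    # because None does not implement the context manager protocol.
    return service if service is not None else nullcontext()


with get_monitoring_service():
    result = "training step ran"
```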

Reviewed By: ynonaolga

Differential Revision: D48113440

fbshipit-source-id: 241d649e49c65ad778d999f7c25515dd72953bca

f59dbb04

Add ODS logging to all runners · e82635eb

Francisc Bungiu authored Aug 07, 2023

Summary:
X-link: https://github.com/facebookresearch/detectron2/pull/5050

Pull Request resolved: https://github.com/facebookresearch/d2go/pull/606

Allow attaching a monitoring service to the training loop.

Reviewed By: miqueljubert

Differential Revision: D47595332

fbshipit-source-id: 49d770207aeea56113c008fcd29ad7b545cec849

e82635eb

04 Aug, 2023 1 commit

only select pth files with prefix "model" as model checkpoint file · 94c7f647

Zhicheng Yan authored Aug 03, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/605

The D2GO workflow's async validation monitors the model checkpoint files (*.pth) in the **e2e_train** folder (such as **model_0004999.pth**, **model_final.pth**) and launches the async val operator as needed.
All model files have the prefix **"model"**. In some cases, there are also non-model-checkpoint files with the pth file extension.
To exclude them, add a filter that checks whether the file prefix is "model".
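
The filter can be sketched as below. The helper name and the non-checkpoint file names are made up for illustration; the two `model_*.pth` names come from the summary above.

```python
from pathlib import PurePath
from typing import Iterable, List


def select_model_checkpoints(file_names: Iterable[str]) -> List[str]:
    """Keep only *.pth files whose name starts with "model" (hypothetical helper)."""
    return sorted(
        name
        for name in file_names
        if PurePath(name).suffix == ".pth" and PurePath(name).name.startswith("model")
    )


files = ["model_0004999.pth", "model_final.pth", "optimizer_state.pth", "metrics.json"]
checkpoints = select_model_checkpoints(files)
```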

Reviewed By: ayushidalmia

Differential Revision: D48021972

fbshipit-source-id: 54d9c14117192809ea76d812ebd4240b44166637

94c7f647

25 Jul, 2023 2 commits

add warm up stage for d2go ema (for fsdp) · 1c9e0e83

Ji Hou authored Jul 25, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/602

per title

Reviewed By: wat3rBro

Differential Revision: D47740831

fbshipit-source-id: ecbe48a1085232a5cfb696e7f8e537d7e58e534a

1c9e0e83

Move predictor type check into a separate function · 0940b814

Ivan Malin authored Jul 25, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/600

To be able to reuse this logic

Reviewed By: wat3rBro

Differential Revision: D47722117

fbshipit-source-id: 4df1083317eb29fce45ecc4d8c0fdffa417b70d4

0940b814

21 Jul, 2023 2 commits

allow setting limit_all_gather in fsdp · d8734049

Xiaoliang Dai authored Jul 21, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/598

Allow setting limit_all_gather in FSDP. This enables faster training, as discussed in S351092.

Reviewed By: Sekunde

Differential Revision: D47603555

fbshipit-source-id: 48d672fd5cce1763da91d8b801a8cb81630bfcdc

d8734049

Integrate the Genie optimization engine to d2go (reapply D47502855) · 361c5457

Fei Sun authored Jul 20, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/599

The Genie optimization engine assumes that when a training iteration is started, it is also finished and the after_step hook is called. This assumption is not valid in d2go.
https://www.internalfb.com/code/fbsource/[1537eddbd235e3f599709a493c1a80c7d016b3f8]/fbcode/vision/fair/detectron2/detectron2/engine/train_loop.py?lines=151-165

When an exception is triggered, the last iteration's after_step hook is not called.

In this diff, we patch up the hook integration to ensure that the Genie after_step hook is always called.
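
One way to guarantee this is a `try`/`finally` around the step. The sketch below uses a hypothetical hook class and loop function; the actual patch to the hook integration may be structured differently.

```python
class RecordingHook:
    """Hypothetical hook that records which callbacks were invoked."""

    def __init__(self):
        self.calls = []

    def before_step(self):
        self.calls.append("before_step")

    def after_step(self):
        self.calls.append("after_step")


def run_iteration(hook, step_fn):
    """Run one iteration so that after_step is called even if the step fails."""
    hook.before_step()
    try:
        step_fn()
    finally:
        # Runs even when step_fn raises, so every started iteration is finished.
        hook.after_step()


def failing_step():
    raise RuntimeError("simulated failure inside the training step")


hook = RecordingHook()
try:
    run_iteration(hook, failing_step)
except RuntimeError:
    pass  # the exception still propagates to the caller
```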

Everything else remains the same as D47502855.

Reviewed By: XiaoliangDai

Differential Revision: D47611143

fbshipit-source-id: b8b1ae2f304a40cf74340bbaf35647332a9a1524

361c5457

19 Jul, 2023 2 commits

Backout D47502855 · 21f96aa8

Kapil Krishnakumar authored Jul 19, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/597

Report of items being broken on D47580212

Reviewed By: crassirostris

Differential Revision: D47580502

fbshipit-source-id: 899221774cc92aef7fd4f37354171932b09494b6

21f96aa8

minor update of result gathering logic · 95e429a1

Yanghan Wang authored Jul 18, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/596

`outputs = {0: result}` feels a bit hacky; technically it should be `outputs = {worker_rank: result}` in order to match the `outputs` semantics in the else-branch.

Reviewed By: frabu6

Differential Revision: D47442322

fbshipit-source-id: f4d24f7022971b4f919b4fb4a563164c7f71cd2b

95e429a1

18 Jul, 2023 1 commit

Optimization integration to d2go · bbfdc182

Fei Sun authored Jul 17, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/595

Integrate the Genie optimization module to d2go. Currently only GC is added. Once the integration is successful, more optimizations may be added.

Reviewed By: XiaoliangDai

Differential Revision: D47502855

fbshipit-source-id: ec4bf60bb047463a2c310c7510d66620d801dd29

bbfdc182

14 Jul, 2023 1 commit

Fix instrument_checkpoint swallow exception · 461b6a80

Jack Zhang authored Jul 14, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/593

ContextDecorator won't raise exception in `__exit__`. We have to manually re-raise it. Otherwise, the exception will be silently discarded.
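
For background, Python suppresses an exception raised inside a `with` block only when `__exit__` returns a truthy value. The sketch below shows a `ContextDecorator` whose `__exit__` records the failure and lets the exception through. The class and function names are hypothetical, and the real fix (which re-raises manually) is not shown in this log.

```python
from contextlib import ContextDecorator


class instrument(ContextDecorator):
    """Hypothetical instrumentation wrapper; not d2go's instrument_checkpoint."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("start")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("failed" if exc_type is not None else "done")
        # A falsy return value lets the exception propagate to the caller;
        # returning True here would silently discard it.
        return False


tracker = instrument()


@tracker
def save_checkpoint():
    raise IOError("simulated checkpoint failure")


try:
    save_checkpoint()
    propagated = False
except IOError:
    propagated = True
```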

Reviewed By: wat3rBro

Differential Revision: D47454999

fbshipit-source-id: 44b1884543206202036f588eebe23cf61974982b

461b6a80

12 Jul, 2023 1 commit

Extend reply files to all binaries · e4fa6d63

Francisc Bungiu authored Jul 12, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/591

We previously added reply files for train_net, but not the other relevant binaries with MAST support: evaluator and lightning.
Adding support here by extracting the common bits into a separate module and wrapping the functions to reuse the functionality.

Differential Revision: D47293689

fbshipit-source-id: 70630a471c0cf037d180c9edfb57a4db4fdf7bde

e4fa6d63