- 15 Dec, 2023 2 commits
-
-
Zhicheng Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/642 When we build a QAT model using the FX graph mode APIs **prepare_qat_fx** and **convert_fx**, they run symbolic tracing over **module.forward()**. In certain cases, such as when a module takes a constant tensor input, symbolic tracing adds new tensor attributes with the name prefix **_tensor_constant** (https://fburl.com/code/msc4ch4o), which become new keys in the QAT model state dict. The current implementation of **_setup_non_qat_to_qat_state_dict_map** asserts that the state dicts of the original and QAT models have the same number of keys. Thus, we extend the **qat_state_dict_keys_to_ignore** method with an argument that allows ignoring specified state dict keys in the QAT model. Reviewed By: wat3rBro Differential Revision: D52152706 fbshipit-source-id: 92219feae43bf8841b0a3a71adfbfcb84d8e8f95
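A minimal sketch of the idea, assuming illustrative helper names and a prefix-based filter (the actual d2go signatures may differ):
```python
# Illustrative only: drop tracing-generated keys before aligning the two state dicts.
def qat_state_dict_keys_to_ignore(qat_state_dict, extra_prefixes=("_tensor_constant",)):
    """Return QAT state-dict keys that have no counterpart in the original model."""
    return {k for k in qat_state_dict if any(p in k for p in extra_prefixes)}

def build_non_qat_to_qat_key_map(orig_state_dict, qat_state_dict):
    ignored = qat_state_dict_keys_to_ignore(qat_state_dict)
    qat_keys = [k for k in qat_state_dict if k not in ignored]
    # The original assertion holds again once the extra keys are filtered out.
    assert len(orig_state_dict) == len(qat_keys)
    return dict(zip(orig_state_dict.keys(), qat_keys))
```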
-
Zhicheng Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/643 A QAT model contains observers, and after QAT training those observers hold updated statistics such as min_val and max_val. When we want to export the FP32 QAT model for a sanity check, calling **fuse_utils.fuse_model()** again (it has often already been called when building the QAT model before QAT training) removes the statistics in the observers. Reviewed By: wat3rBro Differential Revision: D52152688 fbshipit-source-id: 08aa16f2aa72b3809e0ba2d346f1b806c0e6ede7
-
- 07 Dec, 2023 2 commits
-
-
Yanghan Wang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/640 Reviewed By: tglik Differential Revision: D51908239 fbshipit-source-id: 7bcbad1fc7065b736cf4e38d155eed5d734758f7
-
Francisc Bungiu authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/639 Expose ability to add a preemption checkpointing hook running in a separate process group. Reviewed By: wat3rBro, ynonaolga Differential Revision: D51115437 fbshipit-source-id: c843802bc59da9f57c09c8d9a20f3d72d5b98edf
-
- 30 Nov, 2023 1 commit
-
-
Yanghan Wang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/637 Reviewed By: tglik Differential Revision: D51540498 fbshipit-source-id: f246559963c5187140db7b8113765f66a964ae1b
-
- 17 Nov, 2023 1 commit
-
-
Wei Sun authored
Summary: Similar to D48210543. Update the training_hooks to use the Unitrace memory snapshot APIs. This allows us to maintain a single path for the memory snapshot APIs, and also to collect important details such as the snapshot location for Zoomer. Pulled By: HugeEngine Pull Request resolved: https://github.com/facebookresearch/d2go/pull/636 Reviewed By: frabu6, aaronenyeshi, jackiexu1992, mengluy0125 Differential Revision: D48368150 fbshipit-source-id: b279adfa29d390e615d2c32a7ab9e05d95b4f164
-
- 10 Nov, 2023 1 commit
-
-
Yanghan Wang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/634 Reviewed By: yzhao30 Differential Revision: D51208655 fbshipit-source-id: 3280bde8807b623ec56841cc6d0ffc87a1e02e83
-
- 09 Nov, 2023 1 commit
-
-
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/633 transformer_auto_wrap_policy is buggy and causes issues when wrapping an already-wrapped module. Migrate to ModuleWrapPolicy. Reviewed By: tglik Differential Revision: D51124721 fbshipit-source-id: 61c4f5f810ead3c3776a7310926b2181121162ac
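A minimal sketch of the migration using the public PyTorch FSDP API; the module class to wrap (nn.TransformerEncoderLayer) is a placeholder for whatever block type the model actually uses:
```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

def shard_model(model: nn.Module) -> FSDP:
    # ModuleWrapPolicy takes the set of module classes directly, so there is no
    # functools.partial closure to misbehave when wrapping an already-wrapped module.
    policy = ModuleWrapPolicy({nn.TransformerEncoderLayer})
    return FSDP(model, auto_wrap_policy=policy)
```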
-
- 05 Nov, 2023 1 commit
-
-
Zhicheng Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/630 Currently, in the runner's **build_model()** method, when **eval_only=True** we always try to load model weights. This is quite restrictive in some cases. For example, we may just want to build a model in eval mode to profile its efficiency before the model has been trained or any weights have been written to a checkpoint file. Thus, this diff adds an argument **skip_model_weights** that allows users to skip loading the model weights. Note that this diff is fully backward-compatible and is NOT expected to break existing implementations. Reviewed By: navsud, wat3rBro Differential Revision: D50623772 fbshipit-source-id: 282dc6f19e17a4dd9eb0048e068c5299bb3d47c2
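A hypothetical usage sketch, assuming the new keyword is exposed on an already-constructed runner (the exact d2go signature may differ):
```python
# Build an eval-mode model for profiling without requiring a trained checkpoint.
cfg = runner.get_default_cfg()
model = runner.build_model(cfg, eval_only=True, skip_model_weights=True)
model.eval()
# ... time forward passes here; no checkpoint file is ever read ...
```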
-
- 01 Nov, 2023 1 commit
-
-
Yanghan Wang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/632 Reviewed By: yzhao30 Differential Revision: D50663689 fbshipit-source-id: 5c4c1dd2e5d2087be5aec268672bb5e7fc329df9
-
- 23 Oct, 2023 1 commit
-
-
Matteo Presutto authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/629 This diff adds all of torch's multi-tensor optimizers to d2go, which in its current form supports only AdamW, Adam, and SGD. Reviewed By: mlopezantequera Differential Revision: D50498623 fbshipit-source-id: 5a38509354e565dd22256261bf1a688bcdc94951
-
- 20 Oct, 2023 1 commit
-
-
Zhicheng Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/628 At exit, the file descriptor used by the logger for info-level logging has already been closed, so calling **logger.info()** raises an exception. Thus, we remove the call. Reviewed By: ayushidalmia, wat3rBro Differential Revision: D50488097 fbshipit-source-id: 42b568e2e29d837424c3b2e42a5a33c067651ec3
-
- 12 Oct, 2023 1 commit
-
-
Igor Fedorov authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/627 Enable training for a fraction of the total steps: when doing HPO, users may want to train for a fraction of the number of training steps of a regular (baseline) training run. In this case, it is not enough to just change SOLVER.MAX_ITER, because that also changes the learning rate schedule. We introduce a multiplier to be applied on top of SOLVER.MAX_ITER when deciding how many steps to train for. This multiplier does not scale the number of steps over which the learning rate schedule is defined. Reviewed By: raghuramank100 Differential Revision: D48699087 fbshipit-source-id: 903f7c957ee471f36365c1449e9cd6a919fd260a
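A minimal sketch of the intended behavior, using a hypothetical config key (SOLVER.TRAIN_FRACTION) and illustrative helpers; the actual option name added here may differ:
```python
from detectron2.solver import build_lr_scheduler  # schedule defined over the full run

lr_schedule_iters = cfg.SOLVER.MAX_ITER                              # unchanged baseline length
train_iters = int(cfg.SOLVER.MAX_ITER * cfg.SOLVER.TRAIN_FRACTION)   # e.g. 0.1 for an HPO trial

scheduler = build_lr_scheduler(cfg, optimizer)   # still spans lr_schedule_iters
for it in range(train_iters):                    # but training stops early
    run_step(it)                                 # illustrative step function
    scheduler.step()
```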
-
- 11 Oct, 2023 1 commit
-
-
Yanghan Wang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/626 Reviewed By: YanjunChen329 Differential Revision: D50135150 fbshipit-source-id: 6c85d4e966bb9e399c0fc17046fd1318bfbb1546
-
- 05 Oct, 2023 1 commit
-
-
Olga Gerasimova authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/623 If we load a model (d2go/runner/default_runner.py?lines=567) that had enable_fake_quant applied, then on_begin_train we need to disable it. Reviewed By: jiaxuzhu92 Differential Revision: D49911356 fbshipit-source-id: f51b2a043c0c3f754d5698eb4b5d968a28d601d1
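A minimal sketch of what disabling could look like with the stock PyTorch QAT toggles; the hook name is taken from the summary and the surrounding wiring is illustrative:
```python
from torch.ao.quantization.fake_quantize import disable_fake_quant

def on_begin_train(model):
    # If the loaded checkpoint left fake-quant enabled, turn it off here so the
    # QAT schedule can re-enable it at the configured iteration.
    model.apply(disable_fake_quant)
```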
-
- 03 Oct, 2023 1 commit
-
-
SK Bong authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/621 There should be barriers around FSDP checkpointing to ensure other ranks do not continue training while rank 0 is still checkpointing. Also add a log line after checkpointing finishes. Reviewed By: wat3rBro Differential Revision: D49541229 fbshipit-source-id: ac8c086eb0d65611be0b258e3006d9e14b7387ad
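An illustrative pattern for the fencing described above (helper names are hypothetical; the actual FSDP state-dict gathering is itself collective and handled elsewhere):
```python
import logging
import torch.distributed as dist

logger = logging.getLogger(__name__)

def save_checkpoint(checkpointer, name: str) -> None:
    dist.barrier()                       # all ranks reach the checkpoint point together
    if dist.get_rank() == 0:
        checkpointer.save(name)          # hypothetical checkpointer object
    dist.barrier()                       # nobody resumes training until the save is done
    logger.info("Finished writing checkpoint %s", name)
```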
-
- 27 Sep, 2023 2 commits
-
-
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/615 The previous FSDP EMA checkpointing logic directly handles `EMAState`: it manually calls `FSDP.summon_full_params()` to gather the full model params and reconstructs/loads an `EMAState` for checkpointing. This logic has two drawbacks: 1. `FSDP.summon_full_params()` gathers all model weights at the same time, which can cause OOM issues if the model can't fit into a single GPU; this is quite common for FSDP workloads. 2. Directly saving and loading `EMAState` is error-prone: the EMA state dict has different semantics and behaviors than `model.state_dict()`, yet users often expect it to function seamlessly like the model state dict. This diff modifies the save/load logic of EMA to directly use `model.state_dict()`, solving the above two pain points. Reviewed By: wat3rBro Differential Revision: D48813697 fbshipit-source-id: be53c2677d2e493ba923508bbd82d9d295397941
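A minimal sketch of the state-dict-based route using the public FSDP API, assuming the EMA weights live in an FSDP-wrapped module (`ema_model` is an illustrative name):
```python
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

# Gather a full, CPU-offloaded state dict instead of summoning every parameter onto
# one GPU and rebuilding an EMAState by hand.
sd_config = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(ema_model, StateDictType.FULL_STATE_DICT, sd_config):
    ema_checkpoint = ema_model.state_dict()   # saved/loaded like a regular model state dict
```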
-
Min Xu authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/622 as title Reviewed By: jiaxuzhu92, ywwwer Differential Revision: D49672980 fbshipit-source-id: f34ffe944c25c948fe1abd492ea0b96e47dc5b06
-
- 25 Sep, 2023 1 commit
-
-
Ed Pizzi authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/620 EMA can be configured to exclude frozen (`requires_grad=False`) parameters and buffers, reducing memory use and checkpoint size. However `FULL_STATE_DICT` FSDP + EMA checkpoints construct an inner `EMAState` after unsharding FSDP parameters. This inner `EMAState` uses default `include_frozen` and `include_buffers` settings, resulting in checkpoints containing frozen parameters and buffers regardless of settings. Propagate `include_frozen` and `include_buffers` settings to the inner `EMAState` when gathering `FULL_STATE_DICT` FSDP EMA state. This change only affects frozen parameters with a parallel fix to PyTorch FSDP to propagate `requires_grad` across parameter sharding/unsharding: https://github.com/pytorch/pytorch/pull/109892. Reviewed By: daveboat Differential Revision: D49517178 fbshipit-source-id: 0fe159dcec9ec1f2c456ae2ee7798681e7536249
-
- 21 Sep, 2023 1 commit
-
-
Yang Liu authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/619 For visualisation, tensor variables should be detached from the computational graph. The .cpu() function call should be after the detach(). Reviewed By: frabu6, wat3rBro Differential Revision: D48737228 fbshipit-source-id: b7308c852bdbae89fddba088f5188f61a9a216a8
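A one-line illustration of the ordering (the tensor name is hypothetical):
```python
# Detach first so visualization never keeps the autograd graph alive, then move to CPU.
mask_np = pred_mask.detach().cpu().numpy()
```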
-
- 11 Sep, 2023 1 commit
-
-
Hongye Yang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/617 Reviewed By: tglik Differential Revision: D49065205 Privacy Context Container: L1181999 fbshipit-source-id: b8e8b994a2bd32967dbb9afbc0d8fcfa7ef59667
-
- 06 Sep, 2023 1 commit
-
-
Karla Brkic authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/616 The check for valid bboxes doesn't verify that the bbox list has exactly 4 elements, and crashes the training instead of marking empty bboxes as invalid (see f472700454). Reviewed By: tglik Differential Revision: D48653084 fbshipit-source-id: 2d47fb267c5e51ab27798662ae739014f3d310e4
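An illustrative version of the intended check (not the actual d2go helper): a malformed or empty bbox should be reported as invalid instead of crashing training:
```python
def valid_bbox_xywh(bbox) -> bool:
    if bbox is None or len(bbox) != 4:
        return False
    x, y, w, h = bbox
    return w > 0 and h > 0
```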
-
- 24 Aug, 2023 1 commit
-
-
Jessica Zhong authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/614 Reviewed By: wat3rBro, YanjunChen329 Differential Revision: D48544742 fbshipit-source-id: 9e49f13aa50e065c30e5551a636a83afd2d11acd
-
- 22 Aug, 2023 2 commits
-
-
Reza Barazesh authored
Differential Revision: D48533397 Original commit changeset: cbf260823172 Original Phabricator Diff: D48533397 fbshipit-source-id: 6ef669973058fc9dc20f3b2839f4d931c3a58c3d
-
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/611 Disable recording memory snapshots after dumping to files. Otherwise the process won't have a clean shutdown. Reviewed By: ertrue, wat3rBro Differential Revision: D48533397 fbshipit-source-id: cbf260823172222b8015008eaffa3d0361fa6233
-
- 19 Aug, 2023 1 commit
-
-
Wei Ye authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/610 As titled Reviewed By: wat3rBro Differential Revision: D48461077 fbshipit-source-id: f0bfd0dc9b8615b958a68d35c3df25a6c52859c0
-
- 12 Aug, 2023 1 commit
-
-
Yichao Lu authored
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/609 In the previous code, the valid_bbox function was designed only for XYWH horizontal bboxes; this caused XYWHA rotated bboxes to be marked invalid when they are large or close to the right edge of the image. So we write a separate valid_bbox_rotated for XYWHA-format bboxes. Reviewed By: debowin Differential Revision: D48138234 fbshipit-source-id: d09d209afde9843624169af04f2e1692180bca0d
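An illustrative sketch of a separate check for XYWHA (center-x, center-y, width, height, angle) boxes; the real d2go function may differ. The key point is that a rotated box is not rejected just because cx + w/2 extends past the image border:
```python
def valid_bbox_rotated(bbox) -> bool:
    if bbox is None or len(bbox) != 5:
        return False
    cx, cy, w, h, angle = bbox
    return w > 0 and h > 0
```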
-
- 08 Aug, 2023 1 commit
-
-
Menglu Yu authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/607 Titled Reviewed By: tglik Differential Revision: D47535500 fbshipit-source-id: 93635f36b7164472bac6560d9f6626262096d14e
-
- 07 Aug, 2023 2 commits
-
-
Francisc Bungiu authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/608 In its current form, the unit test fails with (https://fburl.com/ssrymti4) ``` with get_monitoring_service(): E AttributeError: __enter__ ``` Return a nullcontext to address this. Reviewed By: ynonaolga Differential Revision: D48113440 fbshipit-source-id: 241d649e49c65ad778d999f7c25515dd72953bca
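A minimal sketch of the fix, assuming a hypothetical internal helper for building the real service:
```python
from contextlib import nullcontext

def get_monitoring_service():
    service = _build_monitoring_service()   # hypothetical helper; may return None
    # When no monitoring backend is configured, return a do-nothing context manager
    # so `with get_monitoring_service():` keeps working.
    return service if service is not None else nullcontext()
```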
-
Francisc Bungiu authored
Summary: X-link: https://github.com/facebookresearch/detectron2/pull/5050 Pull Request resolved: https://github.com/facebookresearch/d2go/pull/606 Allow attaching a monitoring service to the training loop. Reviewed By: miqueljubert Differential Revision: D47595332 fbshipit-source-id: 49d770207aeea56113c008fcd29ad7b545cec849
-
- 04 Aug, 2023 1 commit
-
-
Zhicheng Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/605 The D2GO workflow async validation monitors the model checkpoint files *.pth in the **e2e_train** folder (such as **model_0004999.pth**, **model_final.pth**) and launches the async val operator as needed. All model files have the prefix **"model"**. In some cases there are non-model-checkpoint files that also have the .pth extension. To exclude them, add a filter that checks whether the file prefix is "model". Reviewed By: ayushidalmia Differential Revision: D48021972 fbshipit-source-id: 54d9c14117192809ea76d812ebd4240b44166637
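An illustrative version of the filter (function name is hypothetical):
```python
import os

def is_model_checkpoint(path: str) -> bool:
    # Only files like "model_0004999.pth" or "model_final.pth" should trigger async val.
    name = os.path.basename(path)
    return name.endswith(".pth") and name.startswith("model")
```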
-
- 25 Jul, 2023 2 commits
-
-
Ji Hou authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/602 per title Reviewed By: wat3rBro Differential Revision: D47740831 fbshipit-source-id: ecbe48a1085232a5cfb696e7f8e537d7e58e534a
-
Ivan Malin authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/600 To be able to reuse this logic Reviewed By: wat3rBro Differential Revision: D47722117 fbshipit-source-id: 4df1083317eb29fce45ecc4d8c0fdffa417b70d4
-
- 21 Jul, 2023 2 commits
-
-
Xiaoliang Dai authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/598 Allow setting limit_all_gather in FSDP. This enables faster training, as discussed in S351092. Reviewed By: Sekunde Differential Revision: D47603555 fbshipit-source-id: 48d672fd5cce1763da91d8b801a8cb81630bfcdc
-
Fei Sun authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/599 The Genie optimization engine assumes that once a training iteration is started it is also finished and the after_step hook is called. This assumption is not valid in d2go. https://www.internalfb.com/code/fbsource/[1537eddbd235e3f599709a493c1a80c7d016b3f8]/fbcode/vision/fair/detectron2/detectron2/engine/train_loop.py?lines=151-165 When an exception is triggered, the last iteration's after_step hook is not called. In this diff, we patch up the hook integration to ensure that the Genie after_step hook is always called. Everything else remains the same as D47502855. Reviewed By: XiaoliangDai Differential Revision: D47611143 fbshipit-source-id: b8b1ae2f304a40cf74340bbaf35647332a9a1524
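An illustrative pattern for guaranteeing the hook fires (a sketch of the idea, not the verbatim detectron2 train loop or the actual patch):
```python
def run_iteration(trainer) -> None:
    trainer.before_step()
    try:
        trainer.run_step()
    finally:
        # Runs even when run_step raises, matching Genie's assumption that every
        # started iteration also reports an after_step.
        trainer.after_step()
```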
-
- 19 Jul, 2023 2 commits
-
-
Kapil Krishnakumar authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/597 Report of items being broken on D47580212 Reviewed By: crassirostris Differential Revision: D47580502 fbshipit-source-id: 899221774cc92aef7fd4f37354171932b09494b6
-
Yanghan Wang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/596 `outputs = {0: result}` feels a bit hacky; technically it should be `outputs = {worker_rank: result}` to match the `outputs` semantics in the else branch. Reviewed By: frabu6 Differential Revision: D47442322 fbshipit-source-id: f4d24f7022971b4f919b4fb4a563164c7f71cd2b
-
- 18 Jul, 2023 1 commit
-
-
Fei Sun authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/595 Integrate the Genie optimization module to d2go. Currently only GC is added. Once the integration is successful, more optimizations may be added. Reviewed By: XiaoliangDai Differential Revision: D47502855 fbshipit-source-id: ec4bf60bb047463a2c310c7510d66620d801dd29
-
- 14 Jul, 2023 1 commit
-
-
Jack Zhang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/593 ContextDecorator won't raise the exception in `__exit__`; we have to manually re-raise it, otherwise the exception will be silently discarded. Reviewed By: wat3rBro Differential Revision: D47454999 fbshipit-source-id: 44b1884543206202036f588eebe23cf61974982b
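An illustrative sketch of the re-raise pattern (class and helper names are hypothetical):
```python
from contextlib import ContextDecorator

class profiled_section(ContextDecorator):
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        finalize_and_log()            # hypothetical cleanup helper
        if exc_value is not None:
            raise exc_value           # re-raise so the failure is not silently discarded
        return False
```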
-
- 12 Jul, 2023 1 commit
-
-
Francisc Bungiu authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/591 We previously added reply files for train_net, but not the other relevant binaries with MAST support: evaluator and lightning. Adding support here by extracting the common bits into a separate module and wrapping the functions to reuse the functionality. Differential Revision: D47293689 fbshipit-source-id: 70630a471c0cf037d180c9edfb57a4db4fdf7bde
-