1. 04 Oct, 2024 2 commits
  2. 26 Sep, 2024 1 commit
    • Deterministic D2GO Trainer Params · 5b856252
      Victor Bourgin authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/677
      
      Previously, cfg.SOLVER.DETERMINISTIC was not taken into account for the lightning `Trainer` in d2go:
      - Nested checks like `hasattr(cfg, "SOLVER.DETERMINISTIC")` do not work as expected, since `hasattr` does not resolve dotted attribute paths (see the sketch below)
      - When SOLVER.DETERMINISTIC does exist, we should check that it is set to `True`
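      A minimal sketch of the first point, using a `SimpleNamespace` as a hypothetical stand-in for the d2go config object:

      ```
      from types import SimpleNamespace

      # Hypothetical config layout standing in for the d2go CfgNode.
      cfg = SimpleNamespace(SOLVER=SimpleNamespace(DETERMINISTIC=True))

      # hasattr() treats "SOLVER.DETERMINISTIC" as one literal attribute name,
      # so this check is always False regardless of the config contents:
      print(hasattr(cfg, "SOLVER.DETERMINISTIC"))  # False

      # Checking each level, and the value itself, behaves as intended:
      deterministic = hasattr(cfg, "SOLVER") and getattr(cfg.SOLVER, "DETERMINISTIC", False)
      print(deterministic)  # True
      ```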
      
      Reviewed By: ayushidalmia, rbasch
      
      Differential Revision: D63426319
      
      fbshipit-source-id: 8caf0af53e7b97a49392df09153e26ee3628231f
  3. 13 Aug, 2024 1 commit
    • Hipify various dependencies to enable AMD Face Enhancer · 7739077a
      Josh Fromm authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/675
      
      This diff extends several targets to be hip compatible and fixes a few silly hipification issues with those targets.
      
      After these changes, all dependencies needed for the face enhancer can compile with AMD.
      
      A few silly issues I had to hack around; maybe we could improve hipification to avoid similar issues in the future:
      * Some of the dependencies used sources in `src/cuda/**.cu`. Hipification tried to rename "cuda" to "hip" and broke the paths. I'm not sure where that rename happens, so I just changed the directory from "cuda" to "gpu" to avoid the issue.
      * One header import, `THCAtomics.cuh`, was incorrectly being renamed to `THHAtomics.cuh`, which doesn't exist. Fortunately, an equivalent header without the naming issue was available.
      
      We also might want to consider graduating the cpp_library_hip bazel helper out of fbgemm since it seems pretty generally useful.
      
      For some of the targets, we needed to build a Python cpp extension, which as far as I can tell we didn't have good hipification support for yet. I added a new buck rule, very similar to our standard cpp_library_hip rule, that creates an extension instead. It's a little copy-pasted, so let me know if there are cleaner ways to work around this requirement.
      
      Reviewed By: houseroad
      
      Differential Revision: D61080247
      
      fbshipit-source-id: dc6f101eb3eadfd43ef5610c651b1639e4c78ae6
  4. 30 Jul, 2024 2 commits
  5. 11 Jul, 2024 1 commit
  6. 01 Jul, 2024 1 commit
  7. 22 Jun, 2024 1 commit
    • adhere to lazy import rules · 040a7167
      Ahmed Gheith authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/668
      
      Lazy Imports change `Python` import semantics, specifically when it comes to the initialization of packages/modules: https://www.internalfb.com/intern/wiki/Python/Cinder/Onboarding/Tutorial/Lazy_Imports/Troubleshooting/
      
      For example, this pattern is not guaranteed to work:
      
      ```
      import torch.optim
      ...
      torch.optim._multi_tensor.Adam   # may fail to resolve _multi_tensor
      ```
      
      And this is guaranteed to work:
      
      ```
      import torch.optim._multi_tensor
      ...
      torch.optim._multi_tensor.Adam   # will always work
      ```
      
      A recent change to `PyTorch` changed module initialization logic in a way that exposed this issue.
      
      But the code has been working for years? That is the nature of undefined behavior: any change in the environment (in this case, the `PyTorch` code base) can make it fail.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D58876582
      
      fbshipit-source-id: c8f3f53605822517d646e57ddbf4359af54dba0d
  8. 19 Jun, 2024 1 commit
  9. 11 Jun, 2024 1 commit
  10. 08 May, 2024 1 commit
  11. 02 May, 2024 1 commit
  12. 24 Apr, 2024 1 commit
  13. 03 Apr, 2024 1 commit
  14. 02 Apr, 2024 1 commit
  15. 27 Mar, 2024 1 commit
  16. 19 Mar, 2024 1 commit
    • distributed FSDP model initialization · abdad994
      Geet Sethi authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/656
      
      Enable distributed FSDP model initialization. This iteratively moves and shards the model onto GPUs, allowing training of models that exceed single-GPU HBM capacity and that cannot be instantiated multiple times on a single host.
      
      The flow is as follows:
      1. Rank 0 will init the whole model on CPU using existing code paths, while all other ranks init an 'empty' model using fake tensors.
      2. Once this is complete and initialization moves to FSDP, distributed init traverses the model 'bottom-up', transferring all params/buffers from rank 0 to all other ranks while simultaneously wrapping modules in FSDP whenever possible (based on the specified config). Thus modules are sharded (and memory usage distributed) at the earliest possible point using the existing FSDP API/implementation (see the sketch below).
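      A simplified sketch of the rank-0 / empty-model pattern, using the meta device and FSDP's built-in `sync_module_states` rather than the custom bottom-up transfer described above (function names are hypothetical):

      ```
      import torch
      import torch.distributed as dist
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def build_fsdp_model(build_fn):
          # build_fn is a hypothetical callable that constructs the unwrapped model;
          # assumes the default process group is already initialized.
          if dist.get_rank() == 0:
              model = build_fn()  # real weights on CPU, via the existing code path
          else:
              with torch.device("meta"):  # weightless skeleton, no memory allocated
                  model = build_fn()
          return FSDP(
              model,
              device_id=torch.cuda.current_device(),
              # materialize meta params as empty GPU tensors on non-zero ranks
              param_init_fn=lambda m: m.to_empty(
                  device=torch.cuda.current_device(), recurse=False
              ),
              # broadcast rank 0's params/buffers so every rank ends up with real weights
              sync_module_states=True,
          )
      ```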
      
      Reviewed By: XiaoliangDai
      
      Differential Revision: D54287718
      
      fbshipit-source-id: 16d63d78065d1fca0c6baf7a385f666a4e1b2a5f
  17. 14 Mar, 2024 1 commit
  18. 10 Mar, 2024 1 commit
    • ensure metadata thing_classes consistency with multiple datasets and category filtering · 1216c225
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/653
      
      # Changes
      In Mask2Former RC4 training, we need to use a particular weighted category training sampler where `DATALOADER.SAMPLER_TRAIN = "WeightedCategoryTrainingSampler"`.
      
      Also, multiple datasets are used, and their category sets are not identical: some datasets have more categories (e.g. Exo-body) than other datasets that do not have exo-body annotations.

      We also use category filtering by setting `D2GO_DATA.DATASETS.TRAIN_CATEGORIES` to a subset of the full categories.
      
      In this setup, D2GO currently complains that metadata.thing_classes is NOT consistent across datasets (https://fburl.com/code/k8xbvyfd).
      
      The reason is that when category filtering is used, D2GO writes a temporary dataset json file (https://fburl.com/code/slb5z6mc).
      This tmp json file is then loaded when we get the dataset dicts from DatasetCatalog (https://fburl.com/code/5k4ynyhc). Meanwhile, the metadata in MetadataCatalog for the category-filtered dataset is also updated based on the categories stored in this tmp file.

      Therefore, we must ensure that the categories stored in the tmp files are consistent across multiple category-filtered datasets.

      In this diff, we update the logic for writing such tmp dataset json files (see the sketch below).
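      A hypothetical sketch of the consistency requirement: when each dataset's COCO-style json is filtered to TRAIN_CATEGORIES, the surviving categories must be written in the same shared order so that the thing_classes derived from each tmp file agree (helper name and layout are assumptions, not the actual d2go code):

      ```
      def filter_categories(coco_dict, keep_names):
          """Restrict a COCO-style dataset dict to `keep_names`, ordering the
          surviving categories by the shared `keep_names` list rather than by
          each dataset's own (possibly different) category order."""
          name_to_cat = {c["name"]: c for c in coco_dict["categories"]}
          kept = [name_to_cat[n] for n in keep_names if n in name_to_cat]
          kept_ids = {c["id"] for c in kept}
          return {
              **coco_dict,
              "categories": kept,
              "annotations": [
                  a for a in coco_dict["annotations"] if a["category_id"] in kept_ids
              ],
          }
      ```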
      
      # Github CI test
      Note that **CI / python-unittest-cpu** is shown as failed with the error below. I do not think it is related to this diff, since the error concerns an observer during QAT model training, while the changes here only touch dataset preparation.
      
      ```
      Traceback (most recent call last):
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 155, in train
          self.run_step()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 310, in run_step
          loss_dict = self.model(data)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
          return forward_call(*args, **kwargs)
        File "/home/runner/work/d2go/d2go/tests/runner/test_runner_default_runner.py", line 44, in forward
          ret = self.conv(images.tensor)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1590, in _call_impl
          hook_result = hook(self, args, result)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/quantize.py", line 131, in _observer_forward_hook
          return self.activation_post_process(output)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
          return forward_call(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/fake_quantize.py", line 199, in forward
          _scale, _zero_point = self.calculate_qparams()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/fake_quantize.py", line 194, in calculate_qparams
          return self.activation_post_process.calculate_qparams()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/observer.py", line 529, in calculate_qparams
          return self._calculate_qparams(self.min_val, self.max_val)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/observer.py", line 328, in _calculate_qparams
          if not check_min_max_valid(min_val, max_val):
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/utils.py", line 346, in check_min_max_valid
          assert min_val <= max_val, f"min {min_val} should be less than max {max_val}"
      AssertionError: min 3.8139522075653076e-05 should be less than max -3.8139522075653076e-05
      ```
      
      Reviewed By: ayushidalmia
      
      Differential Revision: D54665936
      
      Privacy Context Container: L1243674
      
      fbshipit-source-id: 322ab4a84a710b03fa39b39fa81117752d369ba5
  19. 03 Mar, 2024 1 commit
    • apply Black 2024 style in fbcode (7/16) · 2256bdb7
      Amethyst Reese authored
      Summary:
      Formats the covered files with pyfmt.
      
      paintitblack
      
      Reviewed By: aleivag
      
      Differential Revision: D54447732
      
      fbshipit-source-id: e21fbbe27882c8af183d021f4ac27029cbe93e8e
  20. 23 Feb, 2024 1 commit
    • pt2e quantization support in D2Go · 09bd2869
      Naveen Suda authored
      Summary: Add pt2e quantization support in D2Go.
      
      Reviewed By: chakriu
      
      Differential Revision: D54132092
      
      fbshipit-source-id: 34a9ba79a5eb49ed27a3f33454078b0df37cf2f0
  21. 17 Feb, 2024 1 commit
  22. 08 Feb, 2024 1 commit
  23. 04 Feb, 2024 1 commit
  24. 17 Jan, 2024 1 commit
    • expose example_input argument in setup_qat_model() · 3c6f71b4
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/647
      
      Major changes
      - The **example_input** argument in **prepare_fake_quant_model()** is useful in certain cases. For example, in the Argos model's **custom_prepare_fx()** method under the FX graph + QAT setup (D52760682), it is used to prepare example inputs for individual sub-modules by running one forward pass and recording the inputs to each sub-module. Therefore, we expose the **example_input** argument in the **setup_qat_model()** function.
      - For a QAT model, we currently assert that the number of state dict keys (excluding observers) equals the number of state dict keys in the original model. However, when the assertion fails, it does not log useful information for debugging. We change it to report which keys are unique to each state dict (see the sketch below).
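      A minimal sketch of the improved check from the second point (the helper name and arguments are assumptions, not the actual d2go code):

      ```
      from typing import Dict, Set

      import torch

      def check_state_dict_keys(
          orig_sd: Dict[str, torch.Tensor],
          qat_sd: Dict[str, torch.Tensor],
          observer_keys: Set[str],
      ) -> None:
          """Assert key parity between the original and QAT state dicts and, on
          failure, report the keys that are unique to each side."""
          qat_keys = set(qat_sd) - observer_keys
          orig_keys = set(orig_sd)
          assert qat_keys == orig_keys, (
              f"only in QAT model: {sorted(qat_keys - orig_keys)}; "
              f"only in original model: {sorted(orig_keys - qat_keys)}"
          )
      ```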
      
      Reviewed By: navsud
      
      Differential Revision: D52760688
      
      fbshipit-source-id: 27535a0324ebe6513f198acb839918a0346720d0
  25. 16 Jan, 2024 1 commit
  26. 12 Jan, 2024 1 commit
    • consolidate deterministic settings · 573bd454
      Kapil Krishnakumar authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/644
      
      This diff consolidates the deterministic settings in D2Go. In `default_runner.py`, `torch.set_float32_matmul_precision("highest")` is added to use the highest available precision for float32 matrix multiplications. In `setup.py`, `torch.backends.cudnn.deterministic` is set to `True` and `torch.backends.cudnn.allow_tf32` is set to `False` to avoid nondeterministic PyTorch and CUDA algorithms during training. `torch.backends.cuda.matmul.allow_tf32` is also set to `False` to avoid nondeterministic matrix multiplication algorithms. Additionally, the `seed` function is used to set the seed for reproducibility.
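      Gathered in one place, the settings above amount to roughly the following (a sketch only; the actual code splits them between `default_runner.py` and `setup.py`, and seeding goes through D2Go's own helper):

      ```
      import random

      import numpy as np
      import torch

      def setup_deterministic(seed: int) -> None:
          torch.set_float32_matmul_precision("highest")  # full-precision fp32 matmuls
          torch.backends.cudnn.deterministic = True      # deterministic cuDNN kernels
          torch.backends.cudnn.allow_tf32 = False        # no TF32 in cuDNN convolutions
          torch.backends.cuda.matmul.allow_tf32 = False  # no TF32 in CUDA matmuls
          # seed the Python, NumPy, and PyTorch RNGs for reproducibility
          random.seed(seed)
          np.random.seed(seed)
          torch.manual_seed(seed)
      ```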
      
      Reviewed By: wat3rBro
      
      Differential Revision: D51796739
      
      fbshipit-source-id: 50e44ea50b0311b56a885db9f633491ac3002bd4
  27. 08 Jan, 2024 1 commit
  28. 04 Jan, 2024 1 commit
  29. 15 Dec, 2023 2 commits
    • allow to ignore state dict keys in QAT model · c2256758
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/642
      
      When we build a QAT model using the FX graph mode APIs **prepare_qat_fx** and **convert_fx**, they run symbolic tracing over **module.forward()**.

      In certain cases, such as when a module takes a constant tensor input, symbolic tracing adds new tensor attributes with the name prefix **_tensor_constant** (https://fburl.com/code/msc4ch4o), which become new keys in the QAT model state dict.
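      A minimal sketch of how such keys appear, using plain `torch.fx.symbolic_trace` (the quantization APIs trace in a similar way):

      ```
      import torch
      import torch.fx

      class AddConstant(torch.nn.Module):
          def forward(self, x):
              # a constant tensor created inside forward(); symbolic tracing
              # lifts it onto the traced module as a `_tensor_constant*` attribute
              return x + torch.tensor([1.0, 2.0])

      traced = torch.fx.symbolic_trace(AddConstant())
      print([k for k in traced.state_dict() if k.startswith("_tensor_constant")])
      # expected: ['_tensor_constant0']
      ```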
      
      The current implementation of **_setup_non_qat_to_qat_state_dict_map** asserts that the number of keys in the state dict of the original model and in that of the QAT model are the same.

      Thus, we extend the **qat_state_dict_keys_to_ignore** method with an additional argument that allows specified state dict keys in the QAT model to be ignored.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D52152706
      
      fbshipit-source-id: 92219feae43bf8841b0a3a71adfbfcb84d8e8f95
    • do not fuse model again for a QAT model · 8f130231
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/643
      
      A QAT model contains observers. After QAT training, those observers already hold updated statistics, such as min_val and max_val.

      When we want to export the FP32 QAT model for a sanity check, calling **fuse_utils.fuse_model()** again (it is often already called when we build the QAT model before QAT training) would wipe out the statistics in the observers.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D52152688
      
      fbshipit-source-id: 08aa16f2aa72b3809e0ba2d346f1b806c0e6ede7
  30. 07 Dec, 2023 2 commits
  31. 30 Nov, 2023 1 commit
  32. 17 Nov, 2023 1 commit
    • Use the consolidated snapshot API in Unitrace to support Zoomer · 87649f4f
      Wei Sun authored
      Summary: Similar to D48210543. Update the training_hooks to use the Unitrace memory snapshot APIs. This allows us to maintain a single path for the memory snapshot APIs, and also to collect important details, such as the snapshot location, for Zoomer.
      
      Pulled By: HugeEngine
      
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/636
      
      Reviewed By: frabu6, aaronenyeshi, jackiexu1992, mengluy0125
      
      Differential Revision: D48368150
      
      fbshipit-source-id: b279adfa29d390e615d2c32a7ab9e05d95b4f164
  33. 10 Nov, 2023 1 commit
  34. 09 Nov, 2023 1 commit
  35. 05 Nov, 2023 1 commit
    • allow to skip loading model weights in build_model() · f2a0c52c
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/630
      
      Currently, in the runner's **build_model()** method, when **eval_only=True**, we always try to load model weights.
      This is quite restrictive in some cases. For example, we may just want to build a model in eval mode to profile its efficiency before we have trained the model or generated model weights in a checkpoint file.

      Thus, this diff adds a **skip_model_weights** argument that allows users to skip loading the model weights (see the usage sketch below).
      Note that this diff is fully backward-compatible and is NOT expected to break existing implementations.
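      Hypothetical usage, assuming the default runner class and an already-populated cfg (the flag name comes from this diff; the surrounding calls are only an illustration):

      ```
      from d2go.runner import create_runner

      # assumed runner class; any runner exposing build_model() would do
      runner = create_runner("d2go.runner.GeneralizedRCNNRunner")
      cfg = runner.get_default_cfg()

      # Build an eval-mode model for profiling without loading checkpoint weights.
      model = runner.build_model(cfg, eval_only=True, skip_model_weights=True)
      ```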
      
      Reviewed By: navsud, wat3rBro
      
      Differential Revision: D50623772
      
      fbshipit-source-id: 282dc6f19e17a4dd9eb0048e068c5299bb3d47c2
  36. 01 Nov, 2023 1 commit