Commits · 5ecbb174015ebbde7bfaf3e129b7f4e7daface61 · OpenDAS / d2go

02 May, 2023 1 commit

Use FSDP.STATE_DICT_TYPE = SHARDED_STATE_DICT by default · 5ecbb174

Anthony Chen authored May 02, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/535

Use `FSDP.STATE_DICT_TYPE = SHARDED_STATE_DICT` for FSDP checkpointing by default. `FSDP.USE_LOCAL_STATE_DICT` will be deprecated in the future.

# Note
After this change, `FSDP.USE_LOCAL_STATE_DICT` set in a config is no longer picked up by the code; the default value of `FSDP.STATE_DICT_TYPE` takes precedence instead.
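
A minimal sketch of the precedence described in this note. The helper `resolve_state_dict_type` and the plain-dict config are hypothetical and only illustrate the behavior; they are not d2go's actual implementation.

```python
# Hypothetical illustration: the FSDP config is modeled as a plain dict.
DEFAULT_STATE_DICT_TYPE = "SHARDED_STATE_DICT"


def resolve_state_dict_type(fsdp_cfg: dict) -> str:
    """Pick the checkpoint state dict type from an FSDP config section.

    USE_LOCAL_STATE_DICT is deliberately never consulted, mirroring the
    note above: only STATE_DICT_TYPE (or its default) decides the mode.
    """
    return fsdp_cfg.get("STATE_DICT_TYPE", DEFAULT_STATE_DICT_TYPE)


# An old-style config that still sets USE_LOCAL_STATE_DICT gets the default.
legacy_cfg = {"USE_LOCAL_STATE_DICT": True}
print(resolve_state_dict_type(legacy_cfg))
```

Under this assumption, a config that relied on `FSDP.USE_LOCAL_STATE_DICT` would need to set `FSDP.STATE_DICT_TYPE` explicitly to keep its old behavior.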

Reviewed By: tglik

Differential Revision: D45413143

fbshipit-source-id: e7bc2d5dc04ac09004cb89353333be020a9c80b5

13 Jan, 2023 1 commit

Support local state dict checkpointing for FSDP · eea6339f

Anthony Chen authored Jan 12, 2023

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/457

## Context:

The PyTorch FSDP (Fully Sharded Data Parallel) backend supports two checkpointing modes. The first is full_state_dict mode, where each FSDP worker summons parameters from the other workers to produce a global state dict that can be loaded by non-FSDP models. This is the preferred mode for checkpointing because the checkpoint structure and key names follow the default convention. It is already supported in D39228316 (https://github.com/facebookresearch/d2go/commit/02625ff83207b836df349eadc4a61eb3d4a5810c).

However, when the model is too large to fit into a single GPU's memory, this approach fails because a worker's GPU can't hold all the summoned parameters during checkpoint saving. The remedy is to use the second checkpointing mode: local_state_dict. This mode saves each GPU process's sharded parameters locally. The resulting checkpoint can only be loaded by FSDP-wrapped models with the same distributed training settings (i.e. number of processes), but it reduces the need for summoning parameters and greatly lowers peak GPU memory during training.
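
As a rough back-of-the-envelope illustration of the memory argument above, the sketch below compares the parameter memory one worker must hold in each mode. The function name and the simplifications are assumptions made for illustration: it assumes evenly sharded parameters and ignores gradients, optimizer state, activations, padding, and any CPU offloading.

```python
def param_bytes_per_worker(num_params: int, world_size: int, bytes_per_param: int = 4) -> dict:
    """Approximate parameter bytes one worker holds while saving a checkpoint.

    full_state_dict: the worker summons every parameter of the model.
    local_state_dict: the worker only touches its own shard.
    """
    shard_params = -(-num_params // world_size)  # ceiling division
    return {
        "full_state_dict": num_params * bytes_per_param,
        "local_state_dict": shard_params * bytes_per_param,
    }


# Hypothetical example: 10B fp32 parameters sharded over 8 workers.
estimate = param_bytes_per_worker(10_000_000_000, world_size=8)
for mode, num_bytes in estimate.items():
    print(f"{mode}: {num_bytes / 1e9:.1f} GB")
```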

This diff enables local state dict checkpointing in d2go.

## API:

This diff supports both **saving** local state and **loading** a state dict that is locally sharded. Whether to save local state is controlled by `FSDP.USE_LOCAL_STATE_DICT`. If `FSDP.USE_LOCAL_STATE_DICT=True` and we want to save `output/model_0000001.pth` as in the old pattern, the local checkpoints will be saved as:
```
- output
  - model_0000001
    - rank0.pth
    - rank1.pth
    - rank2.pth
    - rank3.pth
```
Whether to load local state, on the other hand, is controlled by the path of the checkpoint to load. If the path is a file, e.g. `output/model_final.pth`, the file will be loaded as a full state dict by all GPU processes, as before. If the path is a directory, e.g. `output/model_final`, the checkpointer will attempt to load `output/model_final/rankX.pth` for rank X.

This API design supports every combination of loading local/full states and saving local/full states.
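
The path conventions described above can be sketched as follows. The helper names `local_checkpoint_path` and `resolve_load_path` are hypothetical and only mirror the description; they are not the d2go checkpointer's actual functions.

```python
import os
from typing import Tuple


def local_checkpoint_path(save_path: str, rank: int) -> str:
    """Map a full-checkpoint path to the per-rank file used for local state.

    For example, 'output/model_0000001.pth' becomes
    'output/model_0000001/rank0.pth' for rank 0.
    """
    directory, _ext = os.path.splitext(save_path)
    return os.path.join(directory, f"rank{rank}.pth")


def resolve_load_path(path: str, rank: int) -> Tuple[str, bool]:
    """Return (file_to_load, is_local_state) for the given rank.

    A directory is treated as a local (sharded) checkpoint; anything else is
    treated as a single full state dict loaded by every rank.
    """
    if os.path.isdir(path):
        return os.path.join(path, f"rank{rank}.pth"), True
    return path, False


print(local_checkpoint_path("output/model_0000001.pth", rank=2))
```

Deciding the load mode from the path alone means no extra config flag is needed at load time, which is what allows saving and loading modes to be mixed freely.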

## Conversion to full state dict [Temporary]

Conversion from a local state dict to a full state dict is needed during an e2e workflow. This will be implemented in a separate diff.

Reviewed By: wat3rBro

Differential Revision: D41861308

fbshipit-source-id: 2e01b601683d06b46f0c5517c6cff30bbcffa8f7

05 Jan, 2022 1 commit

Try LSJ on Faster RCNN with FBNet · 21ae9538

Hang Zhang authored Jan 05, 2022

Summary: Try LSJ with Faster RCNN with FBNet backbone

Reviewed By: newstzpz

Differential Revision: D32054932

fbshipit-source-id: 4fdb30e7b1258d6f167f2c2fd331209aad1b599a

07 Oct, 2021 1 commit

remove SOLVER.STEPS from configs · 79ea94d5

Yuxin Wu authored Oct 06, 2021

Summary:
The LR scheduler is cosine, so this config has no effect.
Remove it to avoid confusion.
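
To see why step milestones are irrelevant here, consider a minimal sketch of standard cosine annealing. The function name is hypothetical and warmup is omitted, so this is not necessarily the exact scheduler used by these configs.

```python
import math


def cosine_lr(base_lr: float, iteration: int, max_iter: int) -> float:
    """Cosine-annealed learning rate at a given iteration.

    The value depends only on the base LR, the current iteration, and the
    total number of iterations; no step milestones appear anywhere.
    """
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * iteration / max_iter))


# The LR decays smoothly from base_lr toward zero over training.
for it in (0, 25, 50, 75, 100):
    print(it, cosine_lr(0.1, it, 100))
```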

Reviewed By: sstsai-adl

Differential Revision: D31444047

fbshipit-source-id: b40e0d7d923c3b55dfe23353050ea0238b3afd16

03 Aug, 2021 1 commit

fix model_zoo links & retrain V3G mask rcnn · 30e798a6

Hang Zhang authored Aug 02, 2021

Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/102

- fix model_zoo model urls (missed in D27992340 (https://github.com/facebookresearch/d2go/commit/477ab964e2165cb586b5c00425f6e463d7edeadd))
- update mask rcnn fbnet V3G config
- update v3g retrained weights

Reviewed By: ppwwyyxx, wat3rBro

Differential Revision: D29627615

fbshipit-source-id: 0694772e47b9c58965e47492177a5d6de53364cb

04 May, 2021 2 commits

OSS build mask head using fbnet builder · 477ab964

Hang Zhang authored May 04, 2021

Summary:
[WIP] Will add pretrained weights and update model url & scores

build mask head using fbnet builder and retrain weights

Reviewed By: wat3rBro

Differential Revision: D27992340

fbshipit-source-id: a216a99954eb3784438d595cd09cbb19e70ec3c3

move some of `test_meta_arch_rcnn.py` to oss · e84d3414

Yanghan Wang authored May 04, 2021

Reviewed By: newstzpz

Differential Revision: D27747996

fbshipit-source-id: 6ae3b89c3944098828e246e5a4a89209b8e171a1

30 Mar, 2021 1 commit

Change SyncBN to BN for qat_faster_rcnn_fbnetv3a_C4.yaml · d29f93e7

Hang Zhang authored Mar 29, 2021

Summary:
fixes https://github.com/facebookresearch/d2go/issues/27

Pull Request resolved: https://github.com/facebookresearch/d2go/pull/28

Reviewed By: newstzpz

Differential Revision: D27214440

Pulled By: zhanghang1989

fbshipit-source-id: da538ad1e29faa9c36065db89138b1cc97045a28

04 Mar, 2021 1 commit

Add Demo and Quick Start Instructions · 82a8e0a0

Hang Zhang authored Mar 03, 2021

Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/5

Reviewed By: wat3rBro

Differential Revision: D26780956

Pulled By: zhanghang1989

fbshipit-source-id: 26af80bbdf6bcb6af4a8b5d27e655826b34db26a

03 Mar, 2021 1 commit

Initial commit · f23248c0

facebook-github-bot authored Mar 02, 2021

fbshipit-source-id: f4a8ba78691d8cf46e003ef0bd2e95f170932778
