"vscode:/vscode.git/clone" did not exist on "a9dd42f74f541649b577c212f9caeea1f18b8cde"
  1. 27 Mar, 2024 1 commit
  2. 19 Mar, 2024 1 commit
      distributed FSDP model initialization · abdad994
      Geet Sethi authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/656
      
      Enable distributed FSDP model initialization. This iteratively moves and shards the model onto GPUs, allowing the training of models that exceed a single GPU's HBM capacity and that cannot be instantiated multiple times on a single host.
      
      The flow is as follows:
      1. Rank 0 will init the whole model on CPU using existing code paths, while all other ranks init an 'empty' model using fake tensors.
      2. Once this is complete and initialization moves to FSDP, distributed init traverses the model 'bottom-up', transferring all params/buffers from rank 0 to all other ranks while simultaneously wrapping modules in FSDP whenever possible (based on the specified config). Modules are thus sharded (and memory usage distributed) at the earliest opportunity using the existing FSDP API/implementation, as sketched below.
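
      The following is a minimal sketch of this bottom-up idea, not the d2go implementation; `build_model` and `init_and_shard` are hypothetical helpers, and the tiny two-layer model stands in for a real one.
      ```
      import torch
      import torch.distributed as dist
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def build_model(rank: int) -> nn.Sequential:
          # Rank 0 materializes real weights on CPU via the usual code path;
          # all other ranks build on the meta device so no memory is allocated.
          with torch.device("cpu" if rank == 0 else "meta"):
              return nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

      def init_and_shard(rank: int) -> FSDP:
          model = build_model(rank)
          for idx, child in enumerate(model):
              # Materialize/move the submodule on the local GPU, then broadcast
              # rank 0's weights so every rank holds the same values.
              child = child.cuda() if rank == 0 else child.to_empty(device="cuda")
              for tensor in list(child.parameters()) + list(child.buffers()):
                  dist.broadcast(tensor.data, src=0)
              # Wrap as early as possible so memory is sharded immediately.
              model[idx] = FSDP(child, device_id=torch.cuda.current_device())
          return FSDP(model, device_id=torch.cuda.current_device())
      ```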
      
      Reviewed By: XiaoliangDai
      
      Differential Revision: D54287718
      
      fbshipit-source-id: 16d63d78065d1fca0c6baf7a385f666a4e1b2a5f
  3. 21 Jul, 2023 1 commit
  4. 23 Jun, 2023 1 commit
      disable FSDP mixed precision for model buffers · b0abd7aa
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/585
      
      Disable FSDP mixed precision for model buffers. Buffers are usually small, so there is very limited performance gain from running them in mixed precision. Moreover, applications like BatchNorm layers and diffusion models are very sensitive to the precision of buffers. Thus, we stick to full precision for buffers in FSDP.
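
      A minimal sketch of this policy using the standard FSDP `MixedPrecision` dataclass; the specific dtypes chosen here are illustrative.
      ```
      import torch
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

      mp_policy = MixedPrecision(
          param_dtype=torch.float16,   # compute/communicate parameters in half precision
          reduce_dtype=torch.float16,  # reduce gradients in half precision
          buffer_dtype=torch.float32,  # keep buffers (e.g. BatchNorm stats) in full precision
      )

      def wrap_with_fp32_buffers(model):
          return FSDP(model, mixed_precision=mp_policy)
      ```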
      
      Reviewed By: wat3rBro
      
      Differential Revision: D46951673
      
      fbshipit-source-id: 12bb1a47fbd8b3dd85c7f781bab707206044af15
  5. 22 Jun, 2023 1 commit
      expose use_orig_params to d2go config · 7f17bbf0
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/582
      
      Expose use_orig_params of the FSDP constructor to the d2go config. Read more about it in the docstring of torch.distributed.fsdp.fully_sharded_data_parallel.
      
      use_orig_params=False (default) uses FlatParameters to store flattened parameters, which saves memory by avoiding fragmentation. However, use_orig_params=True is essential for models that are partly frozen, because FlatParameters can only accept a uniform requires_grad across the parameters they flatten.
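
      A minimal sketch of the partially-frozen case; the tiny model is illustrative, and a process group is assumed to be initialized already.
      ```
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))
      for p in model[0].parameters():
          p.requires_grad = False  # freeze only the first layer

      # Mixing frozen and trainable params in one FlatParameter is not allowed,
      # so keep the original parameters visible to FSDP.
      fsdp_model = FSDP(model, use_orig_params=True)
      ```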
      
      Reviewed By: wat3rBro
      
      Differential Revision: D46917757
      
      fbshipit-source-id: 12ebe83e6de456e37d89eaf8b257f23925a6786d
  6. 14 Jun, 2023 1 commit
  7. 27 May, 2023 1 commit
  8. 02 May, 2023 1 commit
      Use FSDP.STATE_DICT_TYPE = SHARDED_STATE_DICT by default · 5ecbb174
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/535
      
      Use `FSDP.STATE_DICT_TYPE = SHARDED_STATE_DICT` for FSDP checkpointing by default. `FSDP.USE_LOCAL_STATE_DICT` will be deprecated in the future.
      
      # Note
      After this change, setting `FSDP.USE_LOCAL_STATE_DICT` in the config will no longer be picked up by the code; it is superseded by the default value of `FSDP.STATE_DICT_TYPE` instead.
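
      A minimal sketch of saving a sharded state dict with the standard FSDP context manager; the per-rank output path and helper name are illustrative (in practice the checkpointer handles storage).
      ```
      import torch
      import torch.distributed as dist
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

      def save_sharded(fsdp_model: FSDP, out_dir: str) -> None:
          with FSDP.state_dict_type(fsdp_model, StateDictType.SHARDED_STATE_DICT):
              shard = fsdp_model.state_dict()  # each rank only holds its own shards
          torch.save(shard, f"{out_dir}/rank{dist.get_rank()}.pth")
      ```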
      
      Reviewed By: tglik
      
      Differential Revision: D45413143
      
      fbshipit-source-id: e7bc2d5dc04ac09004cb89353333be020a9c80b5
  9. 05 Apr, 2023 1 commit
      change default FSDP strategy to grad_optim (ZERO2) · 35affd74
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/522
      
      Change d2go's default FSDP sharding strategy to grad_optim, which corresponds to ShardingStrategy.SHARD_GRAD_OP in FSDP API, or ZERO2 in literature. grad_optim is shown to have the best tradeoff between memory utilization and training speed for mid-sized models.
      
      `FSDP.ALGORITHM = ""` was part of the previous design, where it indicated that no FSDP is used; it is no longer supported.
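
      A minimal sketch of how the config value maps onto the FSDP API; the mapping helper is hypothetical, and the "grad_optim"/"full" names follow this log.
      ```
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

      _STRATEGIES = {
          "grad_optim": ShardingStrategy.SHARD_GRAD_OP,  # ZeRO-2: shard grads + optimizer state
          "full": ShardingStrategy.FULL_SHARD,           # ZeRO-3: shard parameters as well
      }

      def wrap_with_strategy(model, algorithm: str = "grad_optim") -> FSDP:
          return FSDP(model, sharding_strategy=_STRATEGIES[algorithm])
      ```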
      
      Reviewed By: tglik
      
      Differential Revision: D44657184
      
      fbshipit-source-id: 3888eea5f2b5042269e69453f3cdd8db7cf1581c
  10. 24 Mar, 2023 2 commits
      Add tests for sharded_state_dict and fix compatibility problems · 46606a02
      David Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/511
      
      Add tests for sharded_state_dict integration in AIF Checkpointer
      
      Fix compatibility problems including:
      1. small API errors of flatten_sharded_optim_state_dict
      2. deprecate model.use_local_state_dict and model.load_local_state_dict
      3. fix auto conversion for local_state_dict
      4. fix T148056077: add metadata to differentiate between local_state_dict and sharded_state_dict when loading a directory with FSDPCheckpointer
      
      Reviewed By: YanjunChen329
      
      Differential Revision: D44160045
      
      fbshipit-source-id: f607b7076d0e49b9407f9adfbc8ecfe439c3b0c9
      Add support for FSDP SHARDED_STATE_DICT in D2Go · fbc1c2e8
      David Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/512
      
      Currently, when saving and loading checkpoints for FSDP-wrapped modules, we are saving and loading using `StateDictType.LOCAL_STATE_DICT`, where the state_dict becomes essentially a single flat tensor under the `_flat_param` key (or some other layer-specific key for flat weights). This means that
      1. It's impossible to load weights directly from checkpoints, for example in notebooks
      2. Converting from a local to a global checkpoint requires running a special workflow (https://fburl.com/code/6yqa4ldb) that occupies the same number of GPUs as was used during training
      
      This diff adds an option, `FSDP.STATE_DICT_TYPE`, which allows selecting the type of state dict to save (local, sharded, full). In sharded mode with AIF checkpointing, state dicts can be loaded locally in minutes with any number of GPUs, in notebooks and elsewhere.
      
      Note: for backwards compatibility, `CFG.FSDP.use_local_state_dict` and `CFG.FSDP.load_local_state_dict` still need to work when the new config parameter (`CFG.FSDP.state_dict_type`) is not set. These flags are also used to signify that local/sharded state dicts need to be converted to a full state dict when loading. This functionality can be deprecated once everyone migrates to AIF checkpointing with sharded dicts.
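
      A minimal sketch of resolving the new config value to the FSDP enum; the key name follows the summary above, and the helper is hypothetical.
      ```
      from torch.distributed.fsdp import StateDictType

      _STATE_DICT_TYPES = {
          "local": StateDictType.LOCAL_STATE_DICT,
          "sharded": StateDictType.SHARDED_STATE_DICT,
          "full": StateDictType.FULL_STATE_DICT,
      }

      def resolve_state_dict_type(cfg_value: str) -> StateDictType:
          return _STATE_DICT_TYPES[cfg_value.lower()]
      ```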
      
      Reviewed By: YanjunChen329
      
      Differential Revision: D43840887
      
      fbshipit-source-id: d112f7b7ad97ba82fd5bf1da986b95ad7fc61c42
  11. 05 Mar, 2023 1 commit
      Prefetch forward · 5f1ef548
      Fei Sun authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/492
      
      Enable prefetching of the FSDP all-gathers. Forward prefetch may or may not improve performance; its effectiveness depends on other FSDP options, such as ZeRO-2/ZeRO-3 and HSDP/FSDP, so an HPO sweep is needed to find the best configuration.
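
      A minimal sketch of turning prefetching on; both arguments are standard FSDP constructor options, and the combination shown here is illustrative.
      ```
      from torch.distributed.fsdp import BackwardPrefetch, FullyShardedDataParallel as FSDP

      def wrap_with_prefetch(model):
          return FSDP(
              model,
              forward_prefetch=True,  # issue the next all-gather early during forward
              backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # prefetch in backward too
          )
      ```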
      
      Reviewed By: wat3rBro
      
      Differential Revision: D43027253
      
      fbshipit-source-id: cbf1b4bcf5b0b8301b5b9547e3c22b1f0ffc7590
  12. 14 Feb, 2023 1 commit
      Ignore modules · 7ef9d897
      Fei Sun authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/470
      
      Enable ignoring modules in FSDP. Ignored modules will not be wrapped in FSDP. This is useful in the diffusion model, where the CLIP model is not trained, so it is fine to keep a separate copy on each GPU. It reduces the CLIP execution time from 63ms to 48ms (a 15ms reduction), mostly because CLIP is a CPU-bound module and FSDP injects extra code into each wrapped block. It also reduces the FSDP all-gather time before the CLIP execution from 56ms to 7ms (a 49ms reduction).
      
      In total, this change may reduce the CLIP runtime from 119ms to 55ms (a 64ms reduction)
      
      This feature is controlled by this flag:
          IGNORED_MODULES: ["clip_model"]
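
      A minimal sketch of excluding a submodule from FSDP; `ignored_modules` is a standard FSDP argument, and the `clip_model` attribute name follows the flag above.
      ```
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def wrap_ignoring_clip(model):
          # The ignored submodule keeps a full replica on every GPU and skips
          # FSDP's all-gather/reshard logic entirely.
          return FSDP(model, ignored_modules=[model.clip_model])
      ```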
      
      Reviewed By: newstzpz
      
      Differential Revision: D42910383
      
      fbshipit-source-id: dc4c12254d45ac45d88329feb63a26ec4ae04aef
  13. 03 Feb, 2023 1 commit
  14. 13 Jan, 2023 3 commits
      Make AMP compatible with FSDP · abf0ca0c
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/458
      
      Make AMP compatible with FSDP. FSDP does not depend on the torch AMP module and instead implements its own MixedPrecision module, which directly saves an additional copy of the weights in lower precision and runs these tensors in mixed precision training. This is very different from AMP, which automatically casts tensors to lower precision upon tensor operations.
      
      This diff solves some compatibility bugs between AMP and FSDP with 2 changes:
      1. Use "never_wrap_policy" as the default dummy autowrap policy.
      FSDP mixed precision doesn't work with BatchNorm layers, because FSDP and other resources like NVIDIA Apex strongly discourage running BatchNorm in lower precision: https://github.com/pytorch/pytorch/issues/75478. We need to use some autowrap policy so that FSDP can skip BatchNorm layers when constructing mixed precision.
      2. Wrap FSDPWrapper.forward() with autocast()
      FSDP mixed precision uses lower-precision tensors in computation, which can raise type mismatch errors when amp.autocast() is not enabled, e.g. in eval. Thus, we wrap FSDP forward() with autocast(), as sketched below.
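
      A minimal sketch of both fixes; `never_wrap_policy` and the autocast-wrapping subclass are illustrative stand-ins, not the d2go implementation.
      ```
      import torch
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def never_wrap_policy(*args, **kwargs) -> bool:
          # Dummy auto-wrap policy that never wraps submodules automatically;
          # per the summary above, supplying a policy lets FSDP skip BatchNorm
          # layers when constructing mixed precision.
          return False

      class AutocastFSDP(FSDP):
          def forward(self, *args, **kwargs):
              # Run forward under autocast so the low-precision parameters kept
              # by FSDP MixedPrecision do not cause dtype mismatches outside
              # AMP, e.g. during eval.
              with torch.autocast(device_type="cuda", dtype=torch.float16):
                  return super().forward(*args, **kwargs)
      ```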
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41328834
      
      fbshipit-source-id: 18cf94c4ad8d9422ffd3bb335873cd29ac987ae9
      Support local state dict checkpointing for FSDP · eea6339f
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/457
      
      ## Context:
      
      The PyTorch FSDP (Fully Sharded Data Parallel) backend supports two checkpointing modes. The first is full_state_dict mode, where each FSDP worker summons parameters from other workers to produce a global state dict that can be loaded by non-FSDP models. This is the desired mode for checkpointing because checkpoint structures and key names follow the default convention. It's already supported in D39228316 (https://github.com/facebookresearch/d2go/commit/02625ff83207b836df349eadc4a61eb3d4a5810c)
      
      However, when the model is too large to fit into a single GPU's memory, this approach fails because a worker's GPU can't hold all the summoned parameters during checkpoint saving. The remedy is the second checkpointing mode: local_state_dict. This mode saves each GPU process's sharded parameters locally. It can only be loaded by FSDP-wrapped models with the same distributed training settings (i.e. number of processes), but it avoids summoning parameters and greatly reduces peak GPU memory during training.
      
      This diff enables local state dict checkpointing in d2go.
      
      ## API:
      
      This diff supports both **saving** local state and **loading** state dict that is locally sharded. Whether to save local state is controlled by `FSDP.USE_LOCAL_STATE`. If `FSDP.USE_LOCAL_STATE=True` and we want to save `output/model_0000001.pth` as in the old pattern, the local checkpoints will be saved as:
      ```
      - output
          - model_0000001
              - rank0.pth
              - rank1.pth
              - rank2.pth
              - rank3.pth
      ```
      Whether to load local state, on the other hand, is controlled by the path of the checkpoint to load. If the path is a file, i.e. `output/model_final.pth`, the file will be loaded as a full state dict by all GPU processes like before. If the path is a directory, i.e. `output/model_final`, the checkpointer will attempt to load `output/model_final/rankX.pth` for rank X.
      
      This API design enables the full combinations of loading local/full states and saving local/full states.
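
      A minimal sketch of the per-rank save/load layout described above; the `LOCAL_STATE_DICT` context manager is standard FSDP, while the paths and helper names are illustrative.
      ```
      import os
      import torch
      import torch.distributed as dist
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

      def save_local(fsdp_model: FSDP, out_dir: str) -> None:
          os.makedirs(out_dir, exist_ok=True)
          with FSDP.state_dict_type(fsdp_model, StateDictType.LOCAL_STATE_DICT):
              shard = fsdp_model.state_dict()  # only this rank's flat shards
          torch.save(shard, os.path.join(out_dir, f"rank{dist.get_rank()}.pth"))

      def load_local(fsdp_model: FSDP, ckpt_dir: str) -> None:
          path = os.path.join(ckpt_dir, f"rank{dist.get_rank()}.pth")
          shard = torch.load(path, map_location="cpu")
          with FSDP.state_dict_type(fsdp_model, StateDictType.LOCAL_STATE_DICT):
              fsdp_model.load_state_dict(shard)
      ```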
      
      ## Conversion to full state dict [Temporary]
      
      Conversion from local state dict to full state dict is needed during an e2e workflow. This will be implemented in another diff
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41861308
      
      fbshipit-source-id: 2e01b601683d06b46f0c5517c6cff30bbcffa8f7
      Rewrite FSDP wrapping as modeling hook · dc6fac12
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440
      
      Move FSDP wrapping to runner.build_model by rewriting it as a modeling hook
      
      **Motivation**
      When a model is too large to run inference on a single GPU, it requires using FSDP with local checkpointing mode to save peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP, which may cause OOM errors for the reasons above. Thus, it is better practice to wrap the model with FSDP during `runner.build_model(cfg)`, so evaluation runs in the same FSDP setting as training.
      
      This diff moves FSDP wrapping to `runner.build_model(cfg)` by rewriting it as a modeling hook.
      
      **API changes**
      * Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP.
      * `FSDP.ALGORITHM` can only be `full` or `grad_optim`
      
      **Note**
      It's not possible to unwrap an FSDP model back to the normal model, so FSDPModelingHook.unapply() can't be implemented
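
      A minimal sketch of the modeling-hook shape described above; the class body and its registration are assumptions rather than the real d2go interface.
      ```
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      class FSDPModelingHook:  # hypothetical stand-in for d2go's modeling hook base class
          def __init__(self, cfg):
              self.cfg = cfg

          def apply(self, model):
              # Wrap the built model so training and eval run under the same
              # FSDP setting, which is the motivation above.
              return FSDP(model)

          def unapply(self, model):
              # An FSDP model cannot be unwrapped back into the plain model.
              raise NotImplementedError("FSDP wrapping cannot be undone")
      ```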
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41416917
      
      fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06
  15. 17 Nov, 2022 1 commit
      Integrate PyTorch Fully Sharded Data Parallel (FSDP) · 02625ff8
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/396
      
      Integrate PyTorch FSDP, which supports two sharding modes: 1. gradient + optimizer sharding; 2. full model sharding (params + gradient + optimizer). This feature is enabled in the train_net.py code path.
      
      Sources
      * Integration follows this tutorial: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
      
      API changes
      * Add new config keys to support the new feature. Refer to mobile-vision/d2go/d2go/trainer/fsdp.py for the full list of config options
      * Add `FSDPCheckpointer` as a subclass of `QATCheckpointer` to support the special loading/saving logic for FSDP models
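
      A minimal sketch of the basic wrapping pattern from the linked tutorial; the model, process-group setup, and optimizer are illustrative, and torchrun-style environment variables are assumed.
      ```
      import torch
      import torch.distributed as dist
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      dist.init_process_group("nccl")
      torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

      model = nn.Linear(1024, 1024).cuda()
      fsdp_model = FSDP(model)  # default FULL_SHARD: params, grads, and optimizer state are sharded
      optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
      ```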
      
      Reviewed By: wat3rBro
      
      Differential Revision: D39228316
      
      fbshipit-source-id: 342ecb3bcbce748453c3fba2d6e1b7b7e478473c