"vscode:/vscode.git/clone" did not exist on "a9dd42f74f541649b577c212f9caeea1f18b8cde"
- 27 Mar, 2024 1 commit
Fanyi Xiao authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/657
Without `requires_grad` properly set on params and buffers, FSDP training hangs. This becomes an issue e.g. when training with LoRA.
Reviewed By: wat3rBro
Differential Revision: D55220828
fbshipit-source-id: 1e33aa540c84c4de62a3a37c48a322aa26c98292
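As an illustration of the failure mode this fixes (a minimal sketch, not the d2go code; the `LoRALinear` module and the `lora_` naming convention are hypothetical): when parts of a model are frozen, every rank must apply the same `requires_grad` pattern before FSDP wrapping, otherwise the collectives can hang.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy linear layer: frozen base weight plus trainable low-rank factors."""

    def __init__(self, dim: int, rank: int = 4) -> None:
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.lora_A = nn.Parameter(torch.zeros(rank, dim))
        self.lora_B = nn.Parameter(torch.zeros(dim, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.lora_A.t() @ self.lora_B.t()

def freeze_all_but_lora(model: nn.Module) -> None:
    # Run the same rule on every rank *before* wrapping the model in FSDP,
    # so all ranks agree on which parameters require gradients.
    for name, param in model.named_parameters():
        param.requires_grad = "lora_" in name

model = LoRALinear(16)
freeze_all_but_lora(model)
print([n for n, p in model.named_parameters() if p.requires_grad])  # ['lora_A', 'lora_B']
```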
- 19 Mar, 2024 1 commit
Geet Sethi authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/656
Enable distributed FSDP model initialization. This iteratively moves and shards the model on GPU to allow for the training of models greater than single-GPU HBM capacity and which cannot be instantiated multiple times on a single host. The flow is as follows:
1. Rank 0 will init the whole model on CPU using existing code paths, while all other ranks init an 'empty' model using fake tensors.
2. Once this is complete and initialization moves to FSDP, distributed init will traverse the model 'bottom-up', transferring all params/buffers from rank 0 to all other ranks, while simultaneously wrapping modules in FSDP whenever possible (based on the specified config). Thus modules are sharded (and memory usage distributed) at the first possible instance using the existing FSDP API/implementation.
Reviewed By: XiaoliangDai
Differential Revision: D54287718
fbshipit-source-id: 16d63d78065d1fca0c6baf7a385f666a4e1b2a5f
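A rough conceptual sketch of the rank-0-only materialization described above, using the `meta` device as a stand-in for fake tensors (PyTorch 2.x; the actual bottom-up transfer-and-wrap logic in the diff is more involved):

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def build_model_distributed(build_fn) -> nn.Module:
    """Rank 0 builds real weights on CPU; other ranks build a storage-less shell.

    Assumes the default process group is already initialized (e.g. via torchrun).
    In the actual diff, parameters/buffers are then streamed from rank 0 to the
    other ranks while modules are wrapped in FSDP bottom-up, so no non-zero rank
    ever holds the full unsharded model.
    """
    if dist.get_rank() == 0:
        return build_fn()              # real parameters on CPU
    with torch.device("meta"):         # shapes/dtypes only, no storage
        return build_fn()

# Hypothetical usage:
# model = build_model_distributed(lambda: nn.Sequential(nn.Linear(8, 8), nn.ReLU()))
```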
- 21 Jul, 2023 1 commit
Xiaoliang Dai authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/598
Allow setting `limit_all_gather` in FSDP. This enables faster training, as discussed in S351092.
Reviewed By: Sekunde
Differential Revision: D47603555
fbshipit-source-id: 48d672fd5cce1763da91d8b801a8cb81630bfcdc
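In the underlying PyTorch API this maps to the `limit_all_gathers` flag on the FSDP constructor; a minimal sketch (assumes a process group is already initialized, e.g. under torchrun, and the toy `nn.Linear` model stands in for a real one):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes dist.init_process_group(...) has already run and CUDA is available.
model = nn.Linear(32, 32).cuda()
fsdp_model = FSDP(
    model,
    # Toggles FSDP's all-gather rate limiter; the best setting is workload-dependent.
    limit_all_gathers=True,
)
```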
- 23 Jun, 2023 1 commit
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/585
Disable FSDP mixed precision for model buffers. Buffers are usually small in size, so there is very limited performance gain from enabling mixed precision for them. Moreover, applications like BatchNorm layers and diffusion models are very sensitive to the precision of buffers. Thus, we stick to full precision for buffers in FSDP.
Reviewed By: wat3rBro
Differential Revision: D46951673
fbshipit-source-id: 12bb1a47fbd8b3dd85c7f781bab707206044af15
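In terms of the PyTorch FSDP API, this corresponds to pinning `buffer_dtype` to fp32 in the `MixedPrecision` policy; a sketch, not the d2go code:

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# Parameters and gradient reduction run in fp16, but buffers (e.g. BatchNorm
# running stats, diffusion schedules) stay in fp32.
mp_policy = MixedPrecision(
    param_dtype=torch.float16,
    reduce_dtype=torch.float16,
    buffer_dtype=torch.float32,
)
# Later handed to the wrapper as FSDP(model, mixed_precision=mp_policy).
```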
- 22 Jun, 2023 1 commit
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/582
Expose `use_orig_params` of the FSDP constructor in the d2go config. Read more about it in the docstring of torch.distributed.fsdp.fully_sharded_data_parallel. use_orig_params=False (default) uses FlatParameters to store flattened parameters, which saves memory by avoiding fragmentation. However, use_orig_params=True is essential for models that are partly frozen, because a FlatParameter can only accept a uniform requires_grad across the parameters it flattens.
Reviewed By: wat3rBro
Differential Revision: D46917757
fbshipit-source-id: 12ebe83e6de456e37d89eaf8b257f23925a6786d
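A minimal sketch of what the flag changes (assumes an initialized process group and a CUDA device per rank; the toy model is illustrative):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 8)).cuda()
model[0].weight.requires_grad = False   # partly frozen model

# With the default use_orig_params=False, parameters are flattened into
# FlatParameters, which need a uniform requires_grad; use_orig_params=True
# keeps the original parameters visible, so mixed frozen/trainable is allowed.
fsdp_model = FSDP(model, use_orig_params=True)
```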
- 14 Jun, 2023 1 commit
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/573
Enable activation checkpointing from PyTorch Distributed in d2go.
Reviewed By: rohan-varma
Differential Revision: D45681009
fbshipit-source-id: c03f27af61e0374b9e5991d82070edbe41edde6d
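The PyTorch Distributed utility referenced here is along the lines of the sketch below (module path and names as of PyTorch 2.x; the policy of wrapping every `nn.Linear` is only for illustration):

```python
import functools
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

# Recompute the wrapped blocks' forward during backward instead of storing
# their activations, trading compute for memory.
wrapper = functools.partial(checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT)
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=wrapper,
    check_fn=lambda m: isinstance(m, nn.Linear),
)
```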
- 27 May, 2023 1 commit
Ajinkya Deogade authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/555
The TARGETS for the files inside the `trainer` directory are tackled in two parts:
1. This diff creates TARGETS for the files inside `trainer`, i.e. everything except `trainer/lightning`.
2. The diff D46096373 creates TARGETS for the files inside `trainer/lightning`.
Reviewed By: tglik, wat3rBro
Differential Revision: D45912069
fbshipit-source-id: 3026250a49978f1b8e7a48aeebe1914d8a0a692b
- 02 May, 2023 1 commit
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/535
Use `FSDP.STATE_DICT_TYPE = SHARDED_STATE_DICT` for FSDP checkpointing by default. `FSDP.USE_LOCAL_STATE_DICT` will be deprecated in the future.
# Note
After this change, config usage of `FSDP.USE_LOCAL_STATE_DICT` will no longer be picked up by the code: it is superseded by the default value of `FSDP.STATE_DICT_TYPE` instead.
Reviewed By: tglik
Differential Revision: D45413143
fbshipit-source-id: e7bc2d5dc04ac09004cb89353333be020a9c80b5
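Config-wise, the change looks roughly like the sketch below; the key names are the ones mentioned in this diff, while the value string and the use of a bare yacs `CfgNode` are illustrative (the accepted values live in d2go/trainer/fsdp.py):

```python
from yacs.config import CfgNode as CN

cfg = CN()
cfg.FSDP = CN()
cfg.FSDP.STATE_DICT_TYPE = "SHARDED_STATE_DICT"  # new default checkpoint format
cfg.FSDP.USE_LOCAL_STATE_DICT = False            # superseded; slated for deprecation
```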
- 05 Apr, 2023 1 commit
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/522
Change d2go's default FSDP sharding strategy to grad_optim, which corresponds to ShardingStrategy.SHARD_GRAD_OP in the FSDP API, or ZeRO-2 in the literature. grad_optim is shown to have the best tradeoff between memory utilization and training speed for mid-sized models.
`FSDP.ALGORITHM = ""` was part of the previous design, where it indicated that no FSDP is used. It no longer works.
Reviewed By: tglik
Differential Revision: D44657184
fbshipit-source-id: 3888eea5f2b5042269e69453f3cdd8db7cf1581c
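For reference, a sketch of how the algorithm names map onto the PyTorch enum (assumes an initialized process group; the `"full"` mapping follows the FSDP docs rather than this diff, and the toy model is illustrative):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# "grad_optim" -> SHARD_GRAD_OP (ZeRO-2): shard gradients + optimizer state,
#                 keep parameters unsharded between forward and backward.
# "full"       -> FULL_SHARD (ZeRO-3): additionally shard the parameters.
model = nn.Linear(32, 32).cuda()
fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
```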
- 24 Mar, 2023 2 commits
David Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/511
Add tests for sharded_state_dict integration in the AIF Checkpointer. Fix compatibility problems, including:
1. small API errors around flatten_sharded_optim_state_dict
2. deprecate model.use_local_state_dict and model.load_local_state_dict
3. fix auto conversion for local_state_dict
4. fix T148056077: add metadata to differentiate between local_state_dict and sharded_state_dict when loading a directory with FSDPCheckpointer
Reviewed By: YanjunChen329
Differential Revision: D44160045
fbshipit-source-id: f607b7076d0e49b9407f9adfbc8ecfe439c3b0c9
David Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/512
Currently, when saving and loading checkpoints for FSDP-wrapped modules, we are saving and loading using `StateDictType.LOCAL_STATE_DICT`, where the state_dict becomes essentially a single flat tensor under the `_flat_param` key (or some other layer-specific key for flat weights). This means that:
1. It's impossible to load weights directly from checkpoints, for example in notebooks.
2. Converting from a local to a global checkpoint requires running a special workflow (https://fburl.com/code/6yqa4ldb) that occupies the same number of GPUs as was used during training.
This diff adds an option, `FSDP.STATE_DICT_TYPE`, which allows selection of the type of state dict to save (local, sharded, full). In sharded mode, with AIF checkpointing, we get the benefit of loading state dicts locally in minutes, with any number of GPUs, in notebooks and elsewhere.
Note: for backwards compatibility, `CFG.FSDP.use_local_state_dict` and `CFG.FSDP.load_local_state_dict` still need to work when the new config parameter (`CFG.FSDP.state_dict_type`) is not set. They are also used to signify that local/sharded state dicts need to be converted to a full state dict when loading. This functionality can be deprecated once everyone migrates to AIF checkpointing with sharded dicts.
Reviewed By: YanjunChen329
Differential Revision: D43840887
fbshipit-source-id: d112f7b7ad97ba82fd5bf1da986b95ad7fc61c42
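At the PyTorch level the three options correspond to `StateDictType`; a sketch of producing a sharded state dict (assumes `fsdp_model` is already FSDP-wrapped and a process group is initialized; the per-rank file name is illustrative):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardedStateDictConfig,
    StateDictType,
)

with FSDP.state_dict_type(
    fsdp_model,
    StateDictType.SHARDED_STATE_DICT,
    ShardedStateDictConfig(offload_to_cpu=True),
):
    # Each rank only materializes its own shards, so the full model never has
    # to fit on a single GPU, and a sharded-aware loader can reload the result
    # on a different number of GPUs.
    sharded_sd = fsdp_model.state_dict()
torch.save(sharded_sd, f"model_sharded.rank{dist.get_rank()}.pth")
```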
- 05 Mar, 2023 1 commit
Fei Sun authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/492
Enable prefetching of the FSDP all-gathers. Forward prefetch may or may not improve performance; its effectiveness is determined by other FSDP options, such as zero2/zero3 and HSDP/FSDP. An HPO sweep is needed to figure out the best configuration.
Reviewed By: wat3rBro
Differential Revision: D43027253
fbshipit-source-id: cbf1b4bcf5b0b8301b5b9547e3c22b1f0ffc7590
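The corresponding constructor knobs, as a sketch (assumes an initialized process group; as noted above, the best combination depends on the other FSDP options, and the toy model is illustrative):

```python
import torch.nn as nn
from torch.distributed.fsdp import BackwardPrefetch, FullyShardedDataParallel as FSDP

model = nn.Linear(32, 32).cuda()
fsdp_model = FSDP(
    model,
    forward_prefetch=True,                            # issue the next all-gather early in forward
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # and likewise during backward
)
```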
- 14 Feb, 2023 1 commit
Fei Sun authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/470
Enable ignoring modules in FSDP. Ignored modules will not be put in FSDP. This is useful in the diffusion model, where the CLIP model is not used in training, so it is OK to have a separate copy on each GPU. It reduces the CLIP execution time from 63ms to 48ms (a 15ms reduction), mostly because CLIP is a CPU-bound module and FSDP injects some code into each wrapped block. In addition, it also reduces the FSDP all-gather time before the CLIP execution from 56ms to 7ms (a 49ms reduction). In total, this change may reduce the CLIP runtime from 119ms to 55ms (a 64ms reduction).
This feature is controlled by this flag: IGNORED_MODULES: ["clip_model"]
Reviewed By: newstzpz
Differential Revision: D42910383
fbshipit-source-id: dc4c12254d45ac45d88329feb63a26ec4ae04aef
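At the constructor level this corresponds to `ignored_modules`; a sketch in which the toy `clip_model` attribute stands in for the frozen CLIP encoder named in the config above (assumes an initialized process group):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class ToyDiffusionModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.clip_model = nn.Linear(64, 64)   # stand-in for the frozen CLIP encoder
        self.unet = nn.Linear(64, 64)         # stand-in for the trained part

model = ToyDiffusionModel().cuda()
# clip_model is not trained, so keep a full replica on every GPU instead of
# sharding it and paying the per-block hook and all-gather overhead.
fsdp_model = FSDP(model, ignored_modules=[model.clip_model])
```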
- 03 Feb, 2023 1 commit
Fei Sun authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/463
Enable HSDP when training models.
Reviewed By: wat3rBro
Differential Revision: D42658128
fbshipit-source-id: 3c37c3b6c4abaa54d677447ee704f2e18c9d3b26
- 13 Jan, 2023 3 commits
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/458
Make AMP compatible with FSDP. FSDP does not depend on the torch AMP module and implements its own MixedPrecision module. This MixedPrecision module directly saves an additional copy of the weights in lower precision and runs these tensors in mixed precision training. This is very different from AMP, which automatically casts tensors to lower precision upon tensor operations. This diff solves some compatibility bugs between AMP and FSDP with 2 changes:
1. Use "never_wrap_policy" as the default dummy autowrap policy. FSDP Mixed Precision doesn't work with BatchNorm layers, because FSDP and other resources like NVIDIA Apex highly discourage running BatchNorm in lower precision: https://github.com/pytorch/pytorch/issues/75478. We need to use some autowrap policy so that FSDP can skip BatchNorm layers when constructing mixed precision.
2. Wrap FSDPWrapper.forward() with autocast(). FSDP Mixed Precision uses lower-precision tensors in computation, which could raise type mismatch errors when amp.autocast() is not enabled, e.g. in eval. Thus, we wrap FSDP forward() with autocast().
Reviewed By: wat3rBro
Differential Revision: D41328834
fbshipit-source-id: 18cf94c4ad8d9422ffd3bb335873cd29ac987ae9
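The second change amounts to something like this sketch (illustrative, not the actual FSDPWrapper code; the class name and dtype argument are assumptions):

```python
import torch
import torch.nn as nn

class AutocastForwardWrapper(nn.Module):
    """Run the wrapped FSDP module under autocast so the low-precision tensors
    produced by FSDP MixedPrecision don't trigger dtype-mismatch errors when
    AMP is not otherwise enabled (e.g. during eval)."""

    def __init__(self, fsdp_module: nn.Module, amp_dtype: torch.dtype = torch.float16) -> None:
        super().__init__()
        self.fsdp_module = fsdp_module
        self.amp_dtype = amp_dtype

    def forward(self, *args, **kwargs):
        with torch.autocast(device_type="cuda", dtype=self.amp_dtype):
            return self.fsdp_module(*args, **kwargs)
```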
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/457
## Context
The PyTorch FSDP (Fully Sharded Data Parallel) backend supports two checkpointing modes. The first one is full_state_dict mode, where each FSDP worker summons parameters from other workers to produce a global state dict that can be loaded by non-FSDP models. This is the desired mode for checkpointing because checkpoint structures and key names follow the default convention. It's already supported in D39228316 (https://github.com/facebookresearch/d2go/commit/02625ff83207b836df349eadc4a61eb3d4a5810c). However, when the model is too large to fit into a single GPU's memory, this approach fails because a worker's GPU can't hold all the summoned parameters during checkpoint saving. The rescue is the second checkpointing mode: local_state_dict. This mode saves the sharded parameters of each GPU process locally. It can only be loaded by FSDP-wrapped models with the same distributed training settings (i.e. number of processes), but it reduces the need for summoning parameters and greatly lowers peak GPU memory during training. This diff enables local state dict checkpointing in d2go.
## API
This diff supports both **saving** local state and **loading** state dicts that are locally sharded. Whether to save local state is controlled by `FSDP.USE_LOCAL_STATE`. If `FSDP.USE_LOCAL_STATE=True` and we want to save `output/model_0000001.pth` as in the old pattern, the local checkpoints will be saved as:
```
- output
  - model_0000001
    - rank0.pth
    - rank1.pth
    - rank2.pth
    - rank3.pth
```
Whether to load local state, on the other hand, is controlled by the path of the checkpoint to load. If the path is a file, i.e. `output/model_final.pth`, the file will be loaded as a full state dict by all GPU processes like before. If the path is a directory, i.e. `output/model_final`, the checkpointer will attempt to load `output/model_final/rankX.pth` for rank X. This API design enables the full combinations of loading local/full states and saving local/full states.
## Conversion to full state dict [Temporary]
Conversion from local state dict to full state dict is needed in an e2e workflow. This will be implemented in another diff.
Reviewed By: wat3rBro
Differential Revision: D41861308
fbshipit-source-id: 2e01b601683d06b46f0c5517c6cff30bbcffa8f7
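The load-path convention described above, as a small sketch (not the FSDPCheckpointer implementation; assumes an initialized process group):

```python
import os
import torch
import torch.distributed as dist

def load_state_dict_sketch(path: str):
    """File path -> one full state dict shared by all ranks;
    directory path -> this rank's local shard at <dir>/rank<R>.pth."""
    if os.path.isdir(path):
        path = os.path.join(path, f"rank{dist.get_rank()}.pth")
    return torch.load(path, map_location="cpu")
```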
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440
Move FSDP wrapping to runner.build_model by rewriting it as a modeling hook.
**Motivation**
When a model is too large to run inference on a single GPU, it requires using FSDP with local checkpointing mode to save peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP. This may cause OOM errors for the reasons above. Thus, it may be a better practice to wrap the model with FSDP during `runner.build_model(cfg)`, so evaluation can also be run in the same FSDP setting as in training. This diff moves FSDP wrapping to `runner.build_model(cfg)` by rewriting it as a modeling hook.
**API changes**
* Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP.
* `FSDP.ALGORITHM` can only be `full` or `grad_optim`.
**Note**
It's not possible to unwrap an FSDP model back to the normal model, so FSDPModelingHook.unapply() can't be implemented.
Reviewed By: wat3rBro
Differential Revision: D41416917
fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06
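Enabling FSDP through the hook then looks roughly like the config sketch below; the key names and the two ALGORITHM values come from this diff, while the bare yacs `CfgNode` stands in for a real d2go config:

```python
from yacs.config import CfgNode as CN

cfg = CN()
cfg.MODEL = CN()
cfg.MODEL.MODELING_HOOKS = ["FSDPModelingHook"]  # wrap the model inside runner.build_model(cfg)
cfg.FSDP = CN()
cfg.FSDP.ALGORITHM = "grad_optim"                # or "full"
```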
- 17 Nov, 2022 1 commit
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/396
Integrate PyTorch FSDP, which supports two sharding modes: 1. gradient + optimizer sharding; 2. full model sharding (params + gradient + optimizer). This feature is enabled in the train_net.py code path.
Sources
* Integration follows this tutorial: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
API changes
* Add new config keys to support the new feature. Refer to mobile-vision/d2go/d2go/trainer/fsdp.py for the full list of config options.
* Add `FSDPCheckpointer` as a subclass of `QATCheckpointer` to support the special loading/saving logic for FSDP models.
Reviewed By: wat3rBro
Differential Revision: D39228316
fbshipit-source-id: 342ecb3bcbce748453c3fba2d6e1b7b7e478473c