"vscode:/vscode.git/clone" did not exist on "a9dd42f74f541649b577c212f9caeea1f18b8cde"
  1. 27 Mar, 2024 1 commit
  2. 19 Mar, 2024 1 commit
      distributed FSDP model initialization · abdad994
      Geet Sethi authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/656
      
      Enable distributed FSDP model initialization. This iteratively moves and shards the model onto GPUs, allowing the training of models that exceed a single GPU's HBM capacity and that cannot be instantiated multiple times on a single host.
      
      The flow is as follows:
      1. Rank 0 will init the whole model on CPU using existing code paths, while all other ranks init an 'empty' model using fake tensors.
      2. Once this is complete and initialization moves to FSDP, distributed init traverses the model 'bottom-up', transferring all params/buffers from rank 0 to all other ranks while simultaneously wrapping modules in FSDP whenever possible (based on the specified config). Modules are thus sharded (and memory usage distributed) at the earliest opportunity using the existing FSDP API/implementation, as sketched below.
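
      The following is a minimal sketch of this bottom-up idea, not the d2go implementation; `build_model` and `init_and_shard` are hypothetical helpers, and the tiny two-layer model stands in for a real one.
      ```
      import torch
      import torch.distributed as dist
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def build_model(rank: int) -> nn.Sequential:
          # Rank 0 materializes real weights on CPU via the usual code path;
          # all other ranks build on the meta device so no memory is allocated.
          with torch.device("cpu" if rank == 0 else "meta"):
              return nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

      def init_and_shard(rank: int) -> FSDP:
          model = build_model(rank)
          for idx, child in enumerate(model):
              # Materialize/move the submodule on the local GPU, then broadcast
              # rank 0's weights so every rank holds the same values.
              child = child.cuda() if rank == 0 else child.to_empty(device="cuda")
              for tensor in list(child.parameters()) + list(child.buffers()):
                  dist.broadcast(tensor.data, src=0)
              # Wrap as early as possible so memory is sharded immediately.
              model[idx] = FSDP(child, device_id=torch.cuda.current_device())
          return FSDP(model, device_id=torch.cuda.current_device())
      ```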
      
      Reviewed By: XiaoliangDai
      
      Differential Revision: D54287718
      
      fbshipit-source-id: 16d63d78065d1fca0c6baf7a385f666a4e1b2a5f
  3. 21 Jul, 2023 1 commit
  4. 23 Jun, 2023 1 commit
      disable FSDP mixed precision for model buffers · b0abd7aa
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/585
      
      Disable FSDP mixed precision for model buffers. Buffers are usually small, so there is very limited performance gain from running them in mixed precision. Moreover, applications like BatchNorm layers and diffusion models are very sensitive to the precision of buffers. Thus, we stick to full precision for buffers in FSDP.
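
      A minimal sketch of this policy using the standard FSDP `MixedPrecision` dataclass; the specific dtypes chosen here are illustrative.
      ```
      import torch
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

      mp_policy = MixedPrecision(
          param_dtype=torch.float16,   # compute/communicate parameters in half precision
          reduce_dtype=torch.float16,  # reduce gradients in half precision
          buffer_dtype=torch.float32,  # keep buffers (e.g. BatchNorm stats) in full precision
      )

      def wrap_with_fp32_buffers(model):
          return FSDP(model, mixed_precision=mp_policy)
      ```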
      
      Reviewed By: wat3rBro
      
      Differential Revision: D46951673
      
      fbshipit-source-id: 12bb1a47fbd8b3dd85c7f781bab707206044af15
  5. 22 Jun, 2023 1 commit
      expose use_orig_params to d2go config · 7f17bbf0
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/582
      
      Expose use_orig_params of the FSDP constructor to the d2go config. Read more about it in the docstring of torch.distributed.fsdp.fully_sharded_data_parallel.
      
      use_orig_params=False (default) uses FlatParameters to store flattened parameters, which saves memory by avoiding fragmentation. However, use_orig_params=True is essential for models that are partly frozen, because FlatParameters can only accept a uniform requires_grad across the parameters they flatten.
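
      A minimal sketch of the partially-frozen case; the tiny model is illustrative, and a process group is assumed to be initialized already.
      ```
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))
      for p in model[0].parameters():
          p.requires_grad = False  # freeze only the first layer

      # Mixing frozen and trainable params in one FlatParameter is not allowed,
      # so keep the original parameters visible to FSDP.
      fsdp_model = FSDP(model, use_orig_params=True)
      ```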
      
      Reviewed By: wat3rBro
      
      Differential Revision: D46917757
      
      fbshipit-source-id: 12ebe83e6de456e37d89eaf8b257f23925a6786d
  6. 14 Jun, 2023 1 commit
  7. 27 May, 2023 1 commit
  8. 02 May, 2023 1 commit
      Use FSDP.STATE_DICT_TYPE = SHARDED_STATE_DICT by default · 5ecbb174
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/535
      
      Use `FSDP.STATE_DICT_TYPE = SHARDED_STATE_DICT` for FSDP checkpointing by default. `FSDP.USE_LOCAL_STATE_DICT` will be deprecated in the future.
      
      # Note
      After this change, setting `FSDP.USE_LOCAL_STATE_DICT` in the config will no longer be picked up by the code; it is superseded by the default value of `FSDP.STATE_DICT_TYPE` instead.
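
      A minimal sketch of saving a sharded state dict with the standard FSDP context manager; the per-rank output path and helper name are illustrative (in practice the checkpointer handles storage).
      ```
      import torch
      import torch.distributed as dist
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

      def save_sharded(fsdp_model: FSDP, out_dir: str) -> None:
          with FSDP.state_dict_type(fsdp_model, StateDictType.SHARDED_STATE_DICT):
              shard = fsdp_model.state_dict()  # each rank only holds its own shards
          torch.save(shard, f"{out_dir}/rank{dist.get_rank()}.pth")
      ```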
      
      Reviewed By: tglik
      
      Differential Revision: D45413143
      
      fbshipit-source-id: e7bc2d5dc04ac09004cb89353333be020a9c80b5
  9. 05 Apr, 2023 1 commit
      change default FSDP strategy to grad_optim (ZERO2) · 35affd74
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/522
      
      Change d2go's default FSDP sharding strategy to grad_optim, which corresponds to ShardingStrategy.SHARD_GRAD_OP in FSDP API, or ZERO2 in literature. grad_optim is shown to have the best tradeoff between memory utilization and training speed for mid-sized models.
      
      `FSDP.ALGORITHM = ""` was part of the previous design, where it indicated that no FSDP is used; it is no longer supported.
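
      A minimal sketch of how the config value maps onto the FSDP API; the mapping helper is hypothetical, and the "grad_optim"/"full" names follow this log.
      ```
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

      _STRATEGIES = {
          "grad_optim": ShardingStrategy.SHARD_GRAD_OP,  # ZeRO-2: shard grads + optimizer state
          "full": ShardingStrategy.FULL_SHARD,           # ZeRO-3: shard parameters as well
      }

      def wrap_with_strategy(model, algorithm: str = "grad_optim") -> FSDP:
          return FSDP(model, sharding_strategy=_STRATEGIES[algorithm])
      ```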
      
      Reviewed By: tglik
      
      Differential Revision: D44657184
      
      fbshipit-source-id: 3888eea5f2b5042269e69453f3cdd8db7cf1581c
  10. 24 Mar, 2023 2 commits
      Add tests for sharded_state_dict and fix compatibility problems · 46606a02
      David Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/511
      
      Add tests for sharded_state_dict integration in AIF Checkpointer
      
      Fix compatibility problems including:
      1. small API errors of flatten_sharded_optim_state_dict
      2. deprecate model.use_local_state_dict and model.load_local_state_dict
      3. fix auto conversion for local_state_dict
      4. fix T148056077: add metadata to differentiate between local_state_dict and sharded_state_dict when loading a directory with FSDPCheckpointer
      
      Reviewed By: YanjunChen329
      
      Differential Revision: D44160045
      
      fbshipit-source-id: f607b7076d0e49b9407f9adfbc8ecfe439c3b0c9
      Add support for FSDP SHARDED_STATE_DICT in D2Go · fbc1c2e8
      David Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/512
      
      Currently, when saving and loading checkpoints for FSDP-wrapped modules, we are saving and loading using `StateDictType.LOCAL_STATE_DICT`, where the state_dict becomes essentially a single flat tensor under the `_flat_param` key (or some other layer-specific key for flat weights). This means that
      1. It's impossible to load weights directly from checkpoints, for example in notebooks
      2. Converting from a local to a global checkpoint requires running a special workflow (https://fburl.com/code/6yqa4ldb) that occupies the same number of GPUs as was used during training
      
      This diff adds an option, `FSDP.STATE_DICT_TYPE`, which allows selecting the type of state dict to save (local, sharded, full). In sharded mode with AIF checkpointing, state dicts can be loaded locally in minutes with any number of GPUs, in notebooks and elsewhere.
      
      Note: for backwards compatibility, `CFG.FSDP.use_local_state_dict` and `CFG.FSDP.load_local_state_dict` still need to work when the new config parameter (`CFG.FSDP.state_dict_type`) is not set. These flags are also used to signify that local/sharded state dicts need to be converted to a full state dict when loading. This functionality can be deprecated once everyone migrates to AIF checkpointing with sharded dicts.
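
      A minimal sketch of resolving the new config value to the FSDP enum; the key name follows the summary above, and the helper is hypothetical.
      ```
      from torch.distributed.fsdp import StateDictType

      _STATE_DICT_TYPES = {
          "local": StateDictType.LOCAL_STATE_DICT,
          "sharded": StateDictType.SHARDED_STATE_DICT,
          "full": StateDictType.FULL_STATE_DICT,
      }

      def resolve_state_dict_type(cfg_value: str) -> StateDictType:
          return _STATE_DICT_TYPES[cfg_value.lower()]
      ```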
      
      Reviewed By: YanjunChen329
      
      Differential Revision: D43840887
      
      fbshipit-source-id: d112f7b7ad97ba82fd5bf1da986b95ad7fc61c42
  11. 05 Mar, 2023 1 commit
      Prefetch forward · 5f1ef548
      Fei Sun authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/492
      
      Enable prefetching of the FSDP all-gathers. Forward prefetch may or may not improve performance; its effectiveness depends on other FSDP options, such as ZeRO-2/ZeRO-3 and HSDP/FSDP, so an HPO sweep is needed to find the best configuration.
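
      A minimal sketch of turning prefetching on; both arguments are standard FSDP constructor options, and the combination shown here is illustrative.
      ```
      from torch.distributed.fsdp import BackwardPrefetch, FullyShardedDataParallel as FSDP

      def wrap_with_prefetch(model):
          return FSDP(
              model,
              forward_prefetch=True,  # issue the next all-gather early during forward
              backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # prefetch in backward too
          )
      ```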
      
      Reviewed By: wat3rBro
      
      Differential Revision: D43027253
      
      fbshipit-source-id: cbf1b4bcf5b0b8301b5b9547e3c22b1f0ffc7590
  12. 14 Feb, 2023 1 commit
      Ignore modules · 7ef9d897
      Fei Sun authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/470
      
      Enable ignoring modules in FSDP. Ignored modules will not be wrapped in FSDP. This is useful in the diffusion model, where the CLIP model is not trained, so it is fine to keep a separate copy on each GPU. It reduces the CLIP execution time from 63ms to 48ms (a 15ms reduction), mostly because CLIP is a CPU-bound module and FSDP injects extra code into each wrapped block. It also reduces the FSDP all-gather time before the CLIP execution from 56ms to 7ms (a 49ms reduction).
      
      In total, this change may reduce the CLIP runtime from 119ms to 55ms (a 64ms reduction)
      
      This feature is controlled by this flag:
          IGNORED_MODULES: ["clip_model"]
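
      A minimal sketch of excluding a submodule from FSDP; `ignored_modules` is a standard FSDP argument, and the `clip_model` attribute name follows the flag above.
      ```
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def wrap_ignoring_clip(model):
          # The ignored submodule keeps a full replica on every GPU and skips
          # FSDP's all-gather/reshard logic entirely.
          return FSDP(model, ignored_modules=[model.clip_model])
      ```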
      
      Reviewed By: newstzpz
      
      Differential Revision: D42910383
      
      fbshipit-source-id: dc4c12254d45ac45d88329feb63a26ec4ae04aef
  13. 03 Feb, 2023 1 commit
  14. 13 Jan, 2023 3 commits
      Make AMP compatible with FSDP · abf0ca0c
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/458
      
      Make AMP compatible with FSDP. FSDP does not depend on the torch AMP module and instead implements its own MixedPrecision module, which directly saves an additional copy of the weights in lower precision and runs these tensors in mixed precision training. This is very different from AMP, which automatically casts tensors to lower precision upon tensor operations.
      
      This diff solves some compatibility bugs between AMP and FSDP with 2 changes:
      1. Use "never_wrap_policy" as the default dummy autowrap policy.
      FSDP mixed precision doesn't work with BatchNorm layers, because FSDP and other resources like NVIDIA Apex strongly discourage running BatchNorm in lower precision: https://github.com/pytorch/pytorch/issues/75478. We need to use some autowrap policy so that FSDP can skip BatchNorm layers when constructing mixed precision.
      2. Wrap FSDPWrapper.forward() with autocast()
      FSDP mixed precision uses lower-precision tensors in computation, which can raise type mismatch errors when amp.autocast() is not enabled, e.g. in eval. Thus, we wrap FSDP forward() with autocast(), as sketched below.
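
      A minimal sketch of both fixes; `never_wrap_policy` and the autocast-wrapping subclass are illustrative stand-ins, not the d2go implementation.
      ```
      import torch
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def never_wrap_policy(*args, **kwargs) -> bool:
          # Dummy auto-wrap policy that never wraps submodules automatically;
          # per the summary above, supplying a policy lets FSDP skip BatchNorm
          # layers when constructing mixed precision.
          return False

      class AutocastFSDP(FSDP):
          def forward(self, *args, **kwargs):
              # Run forward under autocast so the low-precision parameters kept
              # by FSDP MixedPrecision do not cause dtype mismatches outside
              # AMP, e.g. during eval.
              with torch.autocast(device_type="cuda", dtype=torch.float16):
                  return super().forward(*args, **kwargs)
      ```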
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41328834
      
      fbshipit-source-id: 18cf94c4ad8d9422ffd3bb335873cd29ac987ae9
      Support local state dict checkpointing for FSDP · eea6339f
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/457
      
      ## Context:
      
      The PyTorch FSDP (Fully Sharded Data Parallel) backend supports two checkpointing modes. The first is full_state_dict mode, where each FSDP worker summons parameters from other workers to produce a global state dict that can be loaded by non-FSDP models. This is the desired mode for checkpointing because checkpoint structures and key names follow the default convention. It's already supported in D39228316 (https://github.com/facebookresearch/d2go/commit/02625ff83207b836df349eadc4a61eb3d4a5810c)
      
      However, when the model is too large to fit into a single GPU's memory, this approach fails because a worker's GPU can't hold all the summoned parameters during checkpoint saving. The remedy is the second checkpointing mode: local_state_dict. This mode saves each GPU process's sharded parameters locally. It can only be loaded by FSDP-wrapped models with the same distributed training settings (i.e. number of processes), but it avoids summoning parameters and greatly reduces peak GPU memory during training.
      
      This diff enables local state dict checkpointing in d2go.
      
      ## API:
      
      This diff supports both **saving** local state and **loading** state dict that is locally sharded. Whether to save local state is controlled by `FSDP.USE_LOCAL_STATE`. If `FSDP.USE_LOCAL_STATE=True` and we want to save `output/model_0000001.pth` as in the old pattern, the local checkpoints will be saved as:
      ```
      - output
          - model_0000001
              - rank0.pth
              - rank1.pth
              - rank2.pth
              - rank3.pth
      ```
      Whether to load local state, on the other hand, is controlled by the path of the checkpoint to load. If the path is a file, i.e. `output/model_final.pth`, the file will be loaded as a full state dict by all GPU processes like before. If the path is a directory, i.e. `output/model_final`, the checkpointer will attempt to load `output/model_final/rankX.pth` for rank X.
      
      This API design enables the full combinations of loading local/full states and saving local/full states.
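
      A minimal sketch of the per-rank save/load layout described above; the `LOCAL_STATE_DICT` context manager is standard FSDP, while the paths and helper names are illustrative.
      ```
      import os
      import torch
      import torch.distributed as dist
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

      def save_local(fsdp_model: FSDP, out_dir: str) -> None:
          os.makedirs(out_dir, exist_ok=True)
          with FSDP.state_dict_type(fsdp_model, StateDictType.LOCAL_STATE_DICT):
              shard = fsdp_model.state_dict()  # only this rank's flat shards
          torch.save(shard, os.path.join(out_dir, f"rank{dist.get_rank()}.pth"))

      def load_local(fsdp_model: FSDP, ckpt_dir: str) -> None:
          path = os.path.join(ckpt_dir, f"rank{dist.get_rank()}.pth")
          shard = torch.load(path, map_location="cpu")
          with FSDP.state_dict_type(fsdp_model, StateDictType.LOCAL_STATE_DICT):
              fsdp_model.load_state_dict(shard)
      ```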
      
      ## Conversion to full state dict [Temporary]
      
      Conversion from local state dict to full state dict is needed during an e2e workflow. This will be implemented in another diff
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41861308
      
      fbshipit-source-id: 2e01b601683d06b46f0c5517c6cff30bbcffa8f7
      Rewrite FSDP wrapping as modeling hook · dc6fac12
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440
      
      Move FSDP wrapping to runner.build_model by rewriting it as a modeling hook
      
      **Motivation**
      When a model is too large to run inference on a single GPU, it requires using FSDP with local checkpointing mode to save peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP, which may cause OOM errors for the reasons above. Thus, it is better practice to wrap the model with FSDP during `runner.build_model(cfg)`, so evaluation runs in the same FSDP setting as training.
      
      This diff moves FSDP wrapping to `runner.build_model(cfg)` by rewriting it as a modeling hook.
      
      **API changes**
      * Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP.
      * `FSDP.ALGORITHM` can only be `full` or `grad_optim`
      
      **Note**
      It's not possible to unwrap an FSDP model back to the normal model, so FSDPModelingHook.unapply() can't be implemented
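
      A minimal sketch of the modeling-hook shape described above; the class body and its registration are assumptions rather than the real d2go interface.
      ```
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      class FSDPModelingHook:  # hypothetical stand-in for d2go's modeling hook base class
          def __init__(self, cfg):
              self.cfg = cfg

          def apply(self, model):
              # Wrap the built model so training and eval run under the same
              # FSDP setting, which is the motivation above.
              return FSDP(model)

          def unapply(self, model):
              # An FSDP model cannot be unwrapped back into the plain model.
              raise NotImplementedError("FSDP wrapping cannot be undone")
      ```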
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41416917
      
      fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06
  15. 17 Nov, 2022 1 commit
      Integrate PyTorch Fully Sharded Data Parallel (FSDP) · 02625ff8
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/396
      
      Integrate PyTorch FSDP, which supports two sharding modes: 1. gradient + optimizer sharding; 2. full model sharding (params + gradient + optimizer). This feature is enabled in the train_net.py code path.
      
      Sources
      * Integration follows this tutorial: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
      
      API changes
      * Add new config keys to support the new feature. Refer to mobile-vision/d2go/d2go/trainer/fsdp.py for the full list of config options
      * Add `FSDPCheckpointer` as a subclass of `QATCheckpointer` to support the special loading/saving logic for FSDP models
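
      A minimal sketch of the basic wrapping pattern from the linked tutorial; the model, process-group setup, and optimizer are illustrative, and torchrun-style environment variables are assumed.
      ```
      import torch
      import torch.distributed as dist
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      dist.init_process_group("nccl")
      torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

      model = nn.Linear(1024, 1024).cuda()
      fsdp_model = FSDP(model)  # default FULL_SHARD: params, grads, and optimizer state are sharded
      optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
      ```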
      
      Reviewed By: wat3rBro
      
      Differential Revision: D39228316
      
      fbshipit-source-id: 342ecb3bcbce748453c3fba2d6e1b7b7e478473c