1. 06 Jan, 2022 1 commit
    • FullyShardedDataParallel: only return full state dict on rank 0 (#885) · d3417ceb
      four4fish authored
      * FullyShardedDataParallel: only return full state dict on rank 0
      
      * Add a flag to make the rank-0-only behavior optional
      
      * Add tests
      
      * Add docs
      
      * address comments
      
      * update comments
      
      * update torch nightly version
      
      * update torchvision version for the torch nightly dependency
      
      * add changelog
      
      * Update CHANGELOG.md
      
      * Update CHANGELOG.md
      d3417ceb
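      A hedged usage sketch of the rank-0-only full state dict described above. The flag name `state_dict_on_rank_0_only`, the model class, and the behavior that non-zero ranks receive an empty dict are assumptions drawn from this commit's description; verify against the fairscale release you use.

      ```python
      import torch
      import torch.distributed as dist
      from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

      # Assumes a distributed process group is already initialized (e.g. via torchrun).
      # `state_dict_on_rank_0_only` is the option this PR appears to add (assumption);
      # `MyModel` is a placeholder.
      model = FSDP(MyModel().cuda(), state_dict_on_rank_0_only=True)

      state = model.state_dict()  # full, unflattened state dict gathered on rank 0 only
      if dist.get_rank() == 0:
          torch.save(state, "checkpoint.pt")
      ```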
  2. 15 Nov, 2021 1 commit
    • Allow sharded grad scaler to cpu offload with FSDP (#831) · ba5785f7
      Anupam Bhatnagar authored
      * first commit
      
      * sharded scaler hitting nan assertions
      
      * adding test for sharded grad scaler without cpu offload
      
      * ddp grad scaler and fsdp sharded grad scaler test failing
      
      * removing test_output
      
      * fix no cpu offload test
      
      * changing optimizer from OSS to SGD
      
      * all tests passing, code cleanup pending
      
      * code cleanup
      
      * fix pyproject.toml
      
      * removing .isort.cfg
      
      * running isort linter
      
      * resolving isort issues
      
      * resolving black linter issue
      
      * resolving mypy issues
      
      * fix import statement
      
      * fix mypy error
      
      * modifying import statement
      
      * adding pytorch version requirement
      
      * fixing pytest skip test decorator
      
      * apply version guard for ShardedGradScaler
      
      * removing test_fsdp_grad_scaler
      
      * increasing num_epochs for ShardedGradScaler so that updates are not skipped
      
      * adding support for torch 1.8
      
      * minor edit
      
      * [skip ci] more torch 1.8 changes
      
      * parametrizing the tests
      
      * cleanup code with linters
      
      * [skip ci] update doc string
      
      * [skip ci] addressing some more comments
      ba5785f7
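      A minimal sketch of the combination this commit enables: fairscale's `ShardedGradScaler` together with an FSDP model that offloads parameters to CPU. The import paths reflect fairscale around this time and should be verified; `MyModel` and `loader` are placeholders and the process group is assumed to be initialized.

      ```python
      import torch
      from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
      from fairscale.optim.grad_scaler import ShardedGradScaler

      model = FSDP(MyModel(), mixed_precision=True, move_params_to_cpu=True)
      optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
      scaler = ShardedGradScaler()

      for inputs, targets in loader:
          optimizer.zero_grad()
          with torch.cuda.amp.autocast():
              loss = torch.nn.functional.cross_entropy(model(inputs), targets)
          scaler.scale(loss).backward()   # scaled backward; gradients stay sharded
          scaler.step(optimizer)          # unscales and skips the step on inf/nan grads
          scaler.update()
      ```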
  3. 12 Nov, 2021 1 commit
    • Setup pre-commit github action and apply pre-commit to all files (#849) · 7d7edf6d
      Anupam Bhatnagar authored
      * adding pre-commit files
      
      * applying pre-commit to all files
      
      * adding no-strict-optional argument to mypy in circle ci config
      
      * fix typo
      
      * updating python versions
      
      * [skip ci] remove extra args
      
      * adding python 3.9
      
      * [skip ci] set pre-commit version in requirements-dev.txt
      
      * set CACHE_VERSION
      
      * move linters from circleci to github actions
      
      * update python version
      
      * update python version in benchmarks_2
      
      * moving to python 3.9.7
      7d7edf6d
  4. 27 Oct, 2021 1 commit
    • [fix] Decouple `move_params_to_cpu` from the `mixed_precision`. (#822) · ed7ca766
      anj-s authored
      * remove offload dependency on fp16
      
      * update python version for cpu tests
      
      * run CPU tests with updated PyTorch version
      
      * split changes
      
      * revert tests config
      
      * fix lint errors
      
      * update nightly and test PyTorch versions
      
      * skip failing multiprocess pipe test
      
      * always skip test
      
      * always skip test
      
      * always skip test
      
      * lint error
      
      * skip unsupported versions
      
      * improve skip message
      
      * lint errors
      
      * modify docs
      
      * add tests
      
      * fix test failures
      
      * modify comments
      
      * fix lint errors
      
      * fix lint errors
      ed7ca766
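      The decoupling above means CPU offload of parameters no longer requires FP16 training. A hedged sketch, with flag names as they appear in fairscale's FSDP constructor of that era and `MyModel` as a placeholder:

      ```python
      from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

      # Before this change, move_params_to_cpu effectively required mixed_precision=True.
      # After it, full-precision training can still offload parameters to CPU:
      model = FSDP(MyModel(), move_params_to_cpu=True, mixed_precision=False)
      ```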
  5. 06 Sep, 2021 1 commit
    • [cleanup] CI test updates; mypy cleanup; partial broadcast_object cleanup; pre-commit documentation (#744) · 3ecf76f4
      Min Xu authored
      
      * changelog; mypy; oss cleanup
      
      * more broadcast_object cleanup in FSDP
      
      * one more mypy fix
      
      * retire pytorch 1.6 from circleci, add new nightly, add 1.8 LTS and 1.9 stable release
      
      * update torch version for LTS
      
      * minor fixes
      
      * update cache key
      
      * trying newer gpu VMs
      
      * bump the cache
      
      * update to gpu.medium, which should be 2 GPUs
      
      * update nightly version
      
      * add pre-commit instruction
      
      * fixed CHANGELOG after merging
      
      * updated to newer nightly
      
      * retained the older broadcast function for older GPUs for oss.py
      
      * fixed a bug
      
      * added a comment
      
      * fixing a test for pytorch 1.10
      
      * testing a fix
      
      * Update fairscale/optim/oss.py
      
      * Update CONTRIBUTING.md
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      3ecf76f4
  6. 12 Aug, 2021 1 commit
    • [FSDP][feature] Support returning the original parameter names after a model has been wrapped with FSDP (#755) · a825348d
      anj-s authored
      
      * checkpoint work
      
      * fix lint issues
      
      * remove debug statement
      
      * remove print
      
      * fix lint errors
      
      * fix lint errors
      
      * fix lint errors
      
      * add comments and fix lint errors
      
      * modified comments and tests
      a825348d
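      A rough illustration of the problem this feature addresses: after wrapping, the wrapper's own `named_parameters()` exposes flattened, prefixed names, while the consolidated `state_dict()` keys map back to the original module names. The exact helper added by #755 may differ; this is only a sketch and assumes an initialized process group.

      ```python
      import torch.nn as nn
      from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

      model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4))
      wrapped = FSDP(model)  # requires torch.distributed to be initialized

      # Wrapped names are prefixed/flattened by the wrapper...
      print([name for name, _ in wrapped.named_parameters()])
      # ...while the full state dict keys carry the original names,
      # e.g. "0.weight", "0.bias", "1.weight", "1.bias".
      print(list(wrapped.state_dict().keys()))
      ```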
  7. 30 Jul, 2021 1 commit
    • [FSDP] Move final backward callback queueing to pre-backward hook of root instance (#753) · ba7df621
      Yanli Zhao authored
      Move final backward callback to pre-backward hook of root FSDP instance
      
      Summary:

      Move the final backward callback to the pre-backward hook of the root FSDP instance,
      so that it is always attached to the outermost backward call and fires
      after all backward calls are completed.

      Also added flags to check that the final backward callback is fired whenever it is required.

      If the root FSDP instance is checkpointed and called multiple times in the forward pass,
      a checkpoint counter is used to make sure the final backward callback is also queued inside
      the last inner backward call.

      Test Plan: unit tests
      
      * reformat

      * nits and unit tests

      * address some comments

      * replace m with self

      * reformat

      * nits

      * remove the fired flag

      * assert state on root only

      * comments

      * comments
      ba7df621
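      For readers unfamiliar with the mechanism: below is a simplified sketch, not fairscale's actual code, of queueing a one-shot "final backward" callback from a pre-backward hook on the root module's output. It leans on the private PyTorch API `torch.autograd.Variable._execution_engine.queue_callback`, which runs the callback only after the whole backward pass has drained; all names here are illustrative.

      ```python
      import torch

      def attach_final_backward_callback(output: torch.Tensor, final_cb) -> torch.Tensor:
          """Queue `final_cb` once, from the pre-backward hook of `output` (sketch only)."""
          callback_queued = False

          def pre_backward_hook(grad):
              nonlocal callback_queued
              if not callback_queued:
                  # Private autograd API: runs final_cb after the engine finishes all work.
                  torch.autograd.Variable._execution_engine.queue_callback(final_cb)
                  callback_queued = True
              return grad

          output.register_hook(pre_backward_hook)
          return output
      ```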
  8. 26 Jun, 2021 1 commit
  9. 17 May, 2021 1 commit
    • [feat] Save FSDP metadata for offline unflattening + Consolidate checkpoints (#683) · 81c20f72
      Quentin Duval authored
      
      
      * Save FSDP metadata for offline unflattening
      
      * Complete the meta-data saving method with all the information needed to reconstruct a checkpoint offline, and implement the method that reconstructs a consolidated checkpoint from a sharded checkpoint

      * Complete the meta-data saving method with all the information needed to reconstruct a checkpoint offline, and implement the method that reconstructs a consolidated checkpoint from a sharded checkpoint
      
      * Add a unit test to show how to use the function
      
      * Code review + improvement of the unit tests
      
      * Code review: extract clean_path
      
      * Make meta data and consolidation of checkpoint work for flatten_parameter=False
      
      * Add new unit test file in CI
      
      * Complete changelog and fix mypy issues
      
      * Add support for module buffers in the consolidation of sharded checkpoints
      
      * Better support for module buffers: save them in the meta data
      
      * Refactoring: use a data format for the metadata that is simpler to understand (move from object-of-arrays to array-of-objects format)
      
      * Renaming to make code clearer
      
      * Code review: in_temporary_directory rework and typo correction
      
      * Renaming
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
      Co-authored-by: QuentinDuval <QuentinDuval@users.noreply.github.com>
      81c20f72
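      A hedged sketch of the offline consolidation flow this commit describes. The method names `local_state_dict`, `local_metadata_dict`, and `consolidate_shard_weights` are my reading of the API added around #683; `rank` and `world_size` are placeholders, so verify the exact signatures against your fairscale version.

      ```python
      import torch
      from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

      # On each rank, after training (assumes `model` is an FSDP instance):
      torch.save(model.local_state_dict(), f"shard_{rank}.pt")
      torch.save(model.local_metadata_dict(), f"metadata_{rank}.pt")

      # Later, offline and without a process group: load every shard and rebuild
      # the consolidated, unflattened state dict.
      shard_weights = [torch.load(f"shard_{r}.pt") for r in range(world_size)]
      shard_metadata = [torch.load(f"metadata_{r}.pt") for r in range(world_size)]
      full_state_dict = FSDP.consolidate_shard_weights(shard_weights, shard_metadata)
      ```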
  10. 12 May, 2021 1 commit
    • [chore] Rename and move checkpoint_activations from misc folder. (#654) · 72c6bab2
      anj-s authored
      * rename files
      
      * add newly renamed file
      
      * rename and move checkpoint activations related files
      
      * add test files to ci list
      
      * fix lint errors
      
      * modify docs
      
      * add changelog
      
      * retain old path for now
      
      * fix lint errors
      
      * add another import test case
      
      * fix merge conflict
      
      * add missing test file
      72c6bab2
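      The rename above changes the import location of the activation checkpointing helpers; per the "retain old path for now" commit, the old import keeps working for a while. A hedged sketch, with paths as I understand the move (verify in your version):

      ```python
      import torch.nn as nn

      # New location after this change (assumed):
      from fairscale.nn.checkpoint import checkpoint_wrapper
      # Old location, retained temporarily for backward compatibility (assumed):
      # from fairscale.nn.misc import checkpoint_wrapper

      block = nn.Sequential(nn.Linear(32, 32), nn.ReLU())
      block = checkpoint_wrapper(block, offload_to_cpu=False)  # recompute activations in backward
      ```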
  11. 13 Apr, 2021 1 commit
  12. 08 Apr, 2021 1 commit
  13. 07 Apr, 2021 1 commit
  14. 04 Apr, 2021 1 commit
  15. 20 Mar, 2021 1 commit
  16. 09 Mar, 2021 1 commit
  17. 06 Mar, 2021 1 commit
  18. 04 Mar, 2021 1 commit
  19. 03 Mar, 2021 1 commit
  20. 02 Mar, 2021 1 commit
  21. 01 Mar, 2021 1 commit
    • [chores]: make CI more efficient and update py39 env a bit (#447) · 5eb6b8c7
      Min Xu authored
      * [chores]: CI py39 on GPU and more efficiency
      
      * add test list files
      
      * fix
      
      * add test list files
      
      * split benchmark run into 2 runs
      
      * fix 1.8 version and balance benchmarks
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * recording tests
      
      * py39 install fix
      
      * test again
      
      * move tests
      
      * reorg tests
      
      * skip tests for torch 1.8 due to an upstream bug
      
      * removed __init__.py from tests since it confuses pytest
      
      * Revert "removed __init__.py from tests since it confuses pytest"
      
      This reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0.
      
      * don't include __init__ in file list
      
      * notes on __init__.py and added missing ones
      
      * fixed mypy in a test file
      
      * balance test runtime
      
      * better pip install
      
      * balance more
      
      * pip fix
      
      * balance
      
      * balance more, all tests should finish within 20m now
      
      * minor license update
      
      * trying cu102
      
      * more doc and addressed Ben's comments
      
      * debugging
      
      * debugging...
      5eb6b8c7
  22. 27 Feb, 2021 1 commit
  23. 26 Feb, 2021 2 commits
  24. 24 Feb, 2021 1 commit
  25. 23 Feb, 2021 1 commit
    • Add FullyShardedDataParallel (FSDP) (#413) · 15512d9e
      Myle Ott authored
      Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper.
      
      Compared to PyTorch DDP:
      * FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
      * FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
      * FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
          * all-gather parameters at start of forward pass and start of backward pass
          * reduce-scatter grads at end of backward pass
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
      15512d9e
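      A minimal usage sketch of the wrapper described above, assuming a distributed process group has already been initialized (e.g. with torchrun); the model and hyper-parameters are placeholders.

      ```python
      import torch
      import torch.nn as nn
      from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

      model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()

      # Drop-in replacement for DDP. reshard_after_forward=True frees the gathered
      # parameters after each forward (ZeRO-3-like: ~50% more communication, lower memory);
      # reshard_after_forward=False keeps DDP-like communication volume (ZeRO-2-like).
      model = FSDP(model, reshard_after_forward=True)

      optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
      x = torch.randn(8, 1024, device="cuda")
      loss = model(x).sum()
      loss.backward()
      optimizer.step()
      ```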