1. 08 Aug, 2022 1 commit
  2. 12 Jun, 2022 1 commit
  3. 26 May, 2022 1 commit
  4. 02 May, 2022 1 commit
    • [FSDP] ssd_offload fixing backward path (grad_fn) for SsdFlatParameter and... · 51b53ddb
      Paul Johnson authored
      [FSDP] ssd_offload fixing backward path (grad_fn) for SsdFlatParameter and SsdFlatParameterView (#974)
      
      * [FSDP] fixing backward path for SsdFlatParameter and SsdFlatParameterView when overriding .data
      
      * Get ssd_offload unit tests passing
      
      * [FSDP] get all test_fsdp_offload tests passing w/ ssd_offload on
      
      * Update changelog
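      For context, the failure mode is easiest to see in plain PyTorch. A minimal sketch (illustrative, not fairscale's code) of why overriding .data leaves the backward path needing explicit repair:

        import torch

        # Assigning to .data swaps the underlying storage without touching the
        # autograd graph. A view created before the swap still reads the old
        # storage, yet its grad_fn chain back to the parameter survives, which
        # is why a flat-parameter view needs explicit handling here.
        p = torch.nn.Parameter(torch.zeros(4))
        view = p[:2]              # non-leaf view; grad_fn chains back to p
        p.data = torch.ones(4)    # storage replaced; autograd graph untouched
        view.sum().backward()
        print(p.grad)             # tensor([1., 1., 0., 0.])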
  5. 26 Apr, 2022 1 commit
  6. 06 Apr, 2022 1 commit
  7. 30 Mar, 2022 1 commit
  8. 03 Mar, 2022 1 commit
  9. 23 Feb, 2022 2 commits
  10. 22 Feb, 2022 1 commit
    • [benchmarks] Add benchmarks for FSDP (#765) · f9a125db
      anj-s authored
      * add benchmarks for fsdp
      
      * fix lint errors
      
      * clean up
      
      * clean up unused flags
      
      * add the benchmarks
      
      * remove unused args
      
      * fix lint errors
      
      * fix lint errors
      
      * update command line
      
      * add support for multiple devices
      
      * try full fp16 mode
      
      * try full fp16 mode
      
      * lint errors
      
      * merge main
      
      * lint errors
      
      * lint errors
      
      * lint error
      
      * update intersphinx mapping for numpy
      
      * update intersphinx mapping for numpy
      
      * skip test
      
      * added golden configs
      
      * use synthetic benchmarks
      
      * fix fn name
      
      * fix cuda device id
      
      * fix verify
      
      * lint fix
  11. 14 Feb, 2022 1 commit
    • [chore] [cleanup]: pytest, pytorch new versions, fix tests (#933) · fae29959
      Min Xu authored
      
      
      * update pytest versions
      
      * [test] test related changes
      
      - upgrade to newer pytorch versions
      - added a function to make tests more deterministic on A100 and TF32
      - fixed some tests so that they are correctly skipped on a single GPU system
      
      * more fixes
      
      * formatting overly long lines
      
      * format
      
      * better test without triggering a warning
      
      * fix an optim state bug with newer pytorch (see the sketch at the end of this entry)
      
      - the adam optimizer seems to return "step" as a singleton tensor now in the
      nightly build
      - this fixes it, assuming a non-tensor value can still be loaded back by
      the optimizer
      
      * improve oss.py
      
      - using min_loss for regression checking is a bit more reliable
      - also increased the num epochs from 10 to 12
      
      * small oss.py fix
      
      * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
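      A minimal sketch of the optim state workaround described above (the helper name is ours, not fairscale's):

        import torch

        # Newer nightlies return Adam's "step" state as a singleton tensor;
        # convert it to a plain Python number before saving, on the assumption
        # that the optimizer can still load non-tensor values back.
        def normalize_adam_step(osd: dict) -> dict:
            for param_state in osd["state"].values():
                step = param_state.get("step")
                if isinstance(step, torch.Tensor):
                    param_state["step"] = step.item()
            return osd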
  12. 11 Feb, 2022 1 commit
  13. 08 Feb, 2022 1 commit
  14. 25 Jan, 2022 1 commit
  15. 13 Jan, 2022 1 commit
    • [Fix][FSDP] fixed padding size of input tensor for reduce scatter (#907) · fb4eca19
      tmarkstrum authored
      
      
      * fixed padding size of input tensor for reduce scatter, and fixed an error that assigned the wrong group (see the sketch at the end of this entry)
      
      * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
      
      * added changelog
      
      * fixed some comments.
      
      * added a unit test to ensure the reduce_scatter process group size is correct in default cases, and fall back to the default process group when the reduce_scatter process group has the wrong size.
      
      * throw an error instead of rolling back to use default process group for reduce_scatter_process_group
      
      * Revert "throw an error instead of rolling back to use default process group for reduce_scatter_process_group"
      
      This reverts commit eab5620da3b726ea55d3088ae4ca10d94dcdf4d9.
      
      * added a check for None to avoid a unit test failure
      
      * fixed an error to avoid unit test failures
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
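      A hedged sketch of the padding rule this PR fixes (the function name is ours): the flat input to reduce-scatter must be padded to a multiple of the world size of the process group actually used for the collective, not the default group's size:

        import torch
        import torch.distributed as dist
        import torch.nn.functional as F

        # Pad a flat 1-D tensor so its length divides evenly across the ranks
        # of the group that will run the reduce-scatter.
        def pad_for_reduce_scatter(flat: torch.Tensor, group) -> torch.Tensor:
            world_size = dist.get_world_size(group)
            remainder = flat.numel() % world_size
            if remainder != 0:
                flat = F.pad(flat, [0, world_size - remainder])
            return flat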
  16. 07 Jan, 2022 1 commit
    • [FSDP] Enable FSDP reduce scatter overlap (#897) · 0a526bcb
      tmarkstrum authored
      * enable reduce scatter overlap with other operations
      
      * fixed unit tests and added docstrings for the new parameters for fsdp
      
      * fixed more unit tests
      
      * fixed unit tests
      
      * avoided the pickle error on process_group_reduce_scatter
      
      * removed an unnecessary parameter in unit tests
      
      * remove unnecessary prints
      
      * fixed the docstring
      
      * skipped the test_offload unit test because this unit test failed in the main branch
      
      * removed the enable_reduce_scatter_overlap API parameter
      
      * added doc string for the default value of the process_group_reduce_scatter parameter
      
      * fixed a syntax bug
      
      * fixed a bug which caused a unit test failure
      
      * removed the all_gather in the ProcessGroupName enum
      
      * added more comments
      
      * changed the default value of process_group_reduce_scatter from None to ProcessGroupName.reduce_scatter
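      A hedged sketch of the idea behind the overlap (not fairscale's API, and it assumes torch.distributed is already initialized): collectives issued on distinct process groups use separate communicators, so a dedicated reduce-scatter group can run concurrently with all-gathers on the default group.

        import torch.distributed as dist

        # Build a group with the same membership as the default group but a
        # distinct communicator, reserved for reduce-scatter operations.
        def make_reduce_scatter_group():
            ranks = list(range(dist.get_world_size()))
            return dist.new_group(ranks=ranks)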
  17. 06 Jan, 2022 1 commit
    • FullyShardedDataParallel: only return full state dict on rank 0 (#885) · d3417ceb
      four4fish authored
      * FullyShardedDataParallel: only return full state dict on rank 0
      
      * Add flag and make rank 0 only optional
      
      * Add tests
      
      * Add docs
      
      * address comments
      
      * update comments
      
      * update torch nightly version
      
      * update torchvision version for the torch nightly dependency
      
      * add changelog
      
      * Update CHANGELOG.md
      
      * Update CHANGELOG.md
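      A minimal sketch of the rank-0-only behavior (names are ours, not the actual flag this PR adds): every rank must still join the collective that gathers the full state dict, but only rank 0 keeps and returns the result.

        import torch.distributed as dist

        def full_state_dict_rank0_only(model):
            state = model.state_dict()  # collective; all ranks participate
            if dist.get_rank() != 0:
                return {}               # non-zero ranks return an empty dict
            return state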
  18. 05 Jan, 2022 1 commit
    • Enabling ssd_offload training basic tests. (#887) · c5e471bc
      Paul Johnson authored
      * Enabling ssd_offload training and test via tests/nn/data_parallel/test_fsdp_offload.py.
      * Removed unused classes: SsdBuffer, SsdTensorHandleView, SsdParameter, SsdTensor
      * Enhance test coverage of test_ssd_offloading_train_flatten_params_wrapper
      * Modifications from PR #887 review comments.
      * Update Changelog
  19. 06 Dec, 2021 1 commit
    • Fix for Key Error that can happen in certain FSDP wrapping scenarios of... · e6acdcc3
      Freddy Snijder authored
      Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876) (#881)
      
      * Fix for Key Error that can happen in certain FSDP wrapping scenarios of Huggingface model sub-modules (issue #876)
      
      * Styling fixes
      
      * Updated the test to be independent of the Huggingface transformers package
      
      * Added test for issue #876
      
      * Small error message fix
      
      * Skip test when CUDA is not available
      
      * Fixed naming of model
  20. 18 Nov, 2021 1 commit
  21. 17 Nov, 2021 1 commit
  22. 15 Nov, 2021 1 commit
    • Allow sharded grad scaler to cpu offload with FSDP (#831) · ba5785f7
      Anupam Bhatnagar authored
      * first commit
      
      * sharded scaler hitting nan assertions
      
      * adding test for sharded grad scaler without cpu offload
      
      * ddp grad scaler and fsdp sharded grad scaler test failing
      
      * removing test_output
      
      * fix no cpu offload test
      
      * changing optimizer from OSS to SGD
      
      * all tests passing, code cleanup pending
      
      * code cleanup
      
      * fix pyproject.toml
      
      * removing .isort.cfg
      
      * running isort linter
      
      * resolving isort issues
      
      * resolving black linter issue
      
      * resolving mypy issues
      
      * fix import statement
      
      * fix mypy error
      
      * modifying import statement
      
      * adding pytorch version requirement
      
      * fixing pytest skip test decorator
      
      * apply version guard for ShardedGradScaler
      
      * removing test_fsdp_grad_scaler
      
      * increasing num_epochs for ShardedGradScaler so that updates are not skipped
      
      * adding support for torch 1.8
      
      * minor edit
      
      * [skip ci] more torch 1.8 changes
      
      * parametrizing the tests
      
      * cleanup code with linters
      
      * [skip ci] update doc string
      
      * [skip ci] addressing some more comments
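      A hedged usage sketch of the feature under test (model, optimizer, loader, and targets are assumed to exist): ShardedGradScaler mirrors torch.cuda.amp.GradScaler's API but coordinates its inf/nan checks across ranks, which is what lets it work with FSDP, including the CPU-offload case exercised here.

        import torch
        from fairscale.optim.grad_scaler import ShardedGradScaler

        scaler = ShardedGradScaler()
        for inputs, target in loader:
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():
                loss = torch.nn.functional.cross_entropy(model(inputs), target)
            scaler.scale(loss).backward()  # scaled backward, as with GradScaler
            scaler.step(optimizer)         # skips the step on inf/nan, rank-aware
            scaler.update()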
  23. 12 Nov, 2021 1 commit
    • Setup pre-commit github action and apply pre-commit to all files (#849) · 7d7edf6d
      Anupam Bhatnagar authored
      * adding pre-commit files
      
      * applying pre-commit to all files
      
      * adding no-strict-optional argument to mypy in circle ci config
      
      * fix typo
      
      * updating python versions
      
      * [skip ci] remove extra args
      
      * adding python 3.9
      
      * [skip ci] set pre-commit version in requirements-dev.txt
      
      * set CACHE_VERSION
      
      * move linters from circleci to github actions
      
      * update python version
      
      * update python version in benchmarks_2
      
      * moving to python 3.9.7
  24. 09 Nov, 2021 1 commit
  25. 08 Nov, 2021 1 commit
  26. 05 Nov, 2021 1 commit
    • [feat] experimental MEVO layer (#840) · 8347c1a2
      Min Xu authored
      
      
      * [feat] MEVO kernel
      
      - initial import from min/softmax and min/testing branches
      - need to rename and further cleanup
      
      * only test with newer pytorch
      
      * renamed and added comments and code cleanup
      
      * rename and reduce test memory
      
      * testing
      
      * minor fixing
      
      * fixing
      
      * more fix
      
      * changelog
      
      * more 1.7 and 1.8 paper cuts
      
      * remove dead code
      
      * addressed Benjamin's comments
      
      * addressed more comments
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
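      A rough sketch of the kind of memory saving a fused vocab-output layer targets (illustrative, not the MEVO kernel itself): avoid materializing the full [tokens, vocab] logits by projecting and reducing one chunk of tokens at a time.

        import torch
        import torch.nn.functional as F

        def chunked_vocab_loss(hidden, weight, target, chunk=1024):
            total = hidden.new_zeros(())
            for h, t in zip(hidden.split(chunk), target.split(chunk)):
                logits = h @ weight.t()  # only a [chunk, vocab] slice lives at once
                total = total + F.cross_entropy(logits, t, reduction="sum")
            return total / target.numel()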
  27. 01 Nov, 2021 1 commit
    • [feat] [FSDP]: add experimental support for shared weights (#836) · f2af4c66
      Min Xu authored
      
      
      * added a new test, passing without shared weights
      
      * tested weight sharing
      
      * added the test to test list file
      
      * extended to world_size = 2
      
      * fixed test
      
      * [feat]: add limited and experimental support for shared parameters
      
      * fixed tests
      
      * simplified to work with layers that have at least one non-shared param, and added code to pick up the linked_param field for sharding the shared param
      
      * fixed the case where linked param is not in separate FSDP
      
      * changelog and remove old code
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
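      A minimal example of the kind of weight sharing this change targets (tied input/output embeddings): one parameter is referenced by two modules, so FSDP must flatten and shard it exactly once.

        import torch.nn as nn

        class TiedLM(nn.Module):
            def __init__(self, vocab: int = 100, dim: int = 16):
                super().__init__()
                self.embed = nn.Embedding(vocab, dim)
                self.out = nn.Linear(dim, vocab, bias=False)
                self.out.weight = self.embed.weight  # shared parameter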
  28. 28 Oct, 2021 1 commit
  29. 27 Oct, 2021 3 commits
  30. 12 Sep, 2021 1 commit
    • [fix] FSDP intra-backwards gradient accumulation. (#784) · 4fa2ab9b
      Darryl Barnhart authored
      * [fix] FSDP intra-backwards gradient accumulation.
      
      Ensure gradient reduction accumulates into the unsharded gradient tensor
      within a backwards pass. This matters when an FSDP module is called
      multiple times within a forward pass, and reduction is _not_ deferred
      using activation checkpoint forward counters, bucketing, or some other
      mechanism (see the sketch at the end of this entry).
      
      Closes #780
      
      * [refactor] Remove forward counters. Comments.
      
      Removed forward counters from the activation checkpointing utility, now
      that FSDP does not require them for correct operation. Added a more
      detailed comment about memory usage behaviour with gradient reduction.
      
      * [refactor] Delete deprecated forward counter usage.
      
      * [refactor] Add state assertion at end of pre-backward hook.
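      Illustrative pseudocode of the invariant the fix enforces (names are hypothetical, not fairscale internals): within one backward pass, a later reduction for the same parameter must accumulate into the gradient produced by an earlier call, never overwrite it.

        import torch

        def on_reduced_grad(param: torch.nn.Parameter, reduced: torch.Tensor) -> None:
            if param.grad is None:
                param.grad = reduced.detach().clone()
            else:
                param.grad.add_(reduced)  # accumulate within the backward pass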
  31. 06 Sep, 2021 1 commit
    • [cleanup] CI test updates; mypy cleanup; partial broadcast_object cleanup;... · 3ecf76f4
      Min Xu authored
      
      [cleanup] CI test updates; mypy cleanup; partial broadcast_object cleanup; pre-commit documentation (#744)
      
      * changelog; mypy; oss cleanup
      
      * more broadcast_object cleanup in FSDP
      
      * one more mypy fix
      
      * retire pytorch 1.6 from circleci, add new nightly, add 1.8 LTS and 1.9 stable release
      
      * update torch version for LTS
      
      * minor fixes
      
      * update cache key
      
      * trying newer gpu VMs
      
      * bump the cache
      
      * update to gpu.medium, which should be 2 GPUs
      
      * update nightly version
      
      * add pre-commit instruction
      
      * fixed CHANGELOG after merging
      
      * updated to newer nightly
      
      * retained the older broadcast function for older GPUs for oss.py
      
      * fixed a bug
      
      * added a comment
      
      * fixing a test for pytorch 1.10
      
      * testing a fix
      
      * Update fairscale/optim/oss.py
      
      * Update CONTRIBUTING.md
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
  32. 12 Aug, 2021 2 commits
  33. 31 Jul, 2021 1 commit
  34. 30 Jul, 2021 1 commit
    • [FSDP] Move final backward callback queueing to pre-backward hook of root instance (#753) · ba7df621
      Yanli Zhao authored
      Move final backward callback to pre-backward hook of root FSDP instance
      
      Summary:
      
      Move the final backward callback to the pre-backward hook of the root FSDP
      instance, so that it is always attached to the outermost backward call and
      fires after all backward calls are completed.
      
      Also added flags to check that the final backward callback has fired
      whenever it is required.
      
      If the root FSDP instance is checkpointed and called multiple times in
      forward, the checkpoint counter is used to make sure the final backward
      callback is queued inside the last inner backward call as well (sketched
      below).
      
      Test Plan: unit tests
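      A hedged sketch of the queueing mechanism (class and method names are ours; fairscale's actual hooks differ in detail):

        from torch.autograd import Variable

        class RootLike:
            def __init__(self) -> None:
                self._callback_queued = False

            def _pre_backward_hook(self, *unused) -> None:
                # Queue the final callback once per backward; the engine runs
                # it after the outermost backward call has completed.
                if not self._callback_queued:
                    Variable._execution_engine.queue_callback(self._final_backward)
                    self._callback_queued = True

            def _final_backward(self) -> None:
                self._callback_queued = False  # re-arm for the next iteration
                # ... finish gradient reduction, free full-param buffers, etc.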
      
      * reformat
      
      * nits and unit tests
      
      * address some comments
      
      * replace m with self
      
      * reformat
      
      * nits
      
      * remove the fired flag
      
      * assert state on root only
      
      * comments
      
      * comments
  35. 26 Jul, 2021 1 commit
    • [feat]: prepare FSDP to handle multiple flatten params and fixed metadata saving for MoE (#746) · 83b0b49e
      Min Xu authored
      
      
      * [feat] FSDP: supporting multiple flatten parameter groups
      
      - step 3: make FSDP use FlattenParamModule unconditionally
      
      * fixing the auto_wrap tests
      
      * minor
      
      * rewrite local_metadata_dict
      
      - updated FPW so that custom flat param name is also supported
      
      * bug fix
      
      * mypy
      
      * rewrote consolidate_shard_weights
      
      - test_consolidate passes
      
      * comments
      
      * fixing pickling
      
      * Fix shared params and MoE logic (#749)
      
      * add strict kwarg to support fairseq:gshard MoE saving logic
      
      * Test fairseq style shard
      
      * style
      
      * formatting and address comments
      
      * added changelog
      
      * fixing a test after padding renaming
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
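      A minimal sketch of the flatten-params idea this PR builds on (simplified, no sharding, names ours): concatenate a group of parameters into one flat tensor and hand views of it back to the owning modules, so FSDP manages one tensor per flatten group.

        import torch
        import torch.nn as nn

        def flatten_params(params):
            flat = nn.Parameter(torch.cat([p.detach().reshape(-1) for p in params]))
            views, offset = [], 0
            for p in params:
                views.append(flat[offset : offset + p.numel()].view(p.shape))
                offset += p.numel()
            return flat, views  # modules would re-register the views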
  36. 19 Jul, 2021 1 commit