1. 28 Jan, 2022 1 commit
  2. 14 Jan, 2022 1 commit
  3. 13 Jan, 2022 2 commits
    • [feature] [experimental] Layerwise Gradient Scaler (#879) · 52d066a2
      Anupam Bhatnagar authored
      * [skip ci] first commit
      
      * [skip ci] gradient scaler example
      
      * [skip ci] adding feed forward toy example
      
      * [skip ci] adding types
      
      * [skip ci] adding backward hook
      
      * [skip ci] update
      
      * [skip ci] working feed forward example
      
      * [skip ci] working feed forward example
      
      * [skip ci] use named_modules instead of named_children
      
      * [skip ci] adding new file
      
      * [skip ci] clean up
      
      * [skip ci] implement unscale function
      
      * [skip ci] implement unscale function
      
      * [skip ci] removing old file
      
      * [skip ci] removing some more old files
      
      * [skip ci] making unscale function generic
      
      * [skip ci] adding test for vision model
      
      * [skip ci] adding identity layer
      
      * [skip ci] cleanup files
      
      * [skip ci] refactoring
      
      * [skip ci] more refactoring
      
      * [skip ci] added functionality to update scale
      
      * [skip ci] data loader clean up
      
      * [skip ci] implemented inf checks and update scale functions
      
      * [skip ci] code clean up; added test with autocast (does not work atm)
      
      * adding documentation
      
      * adding dependency in requirements-dev.txt
      
      * updating pytorch nightly version
      
      * updating changelog
      
      * adding is_cuda_available to test_vision_model
      
      * set same timeout on cpu and gpu
      
      * reverting cpu timeout, skip vision test on cpu
      
      * addressing comments, fixing vision test
      
      * unscale uses in-place matmul
      
      * some more cleanup
      52d066a2
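      A minimal sketch of the layerwise-scaling idea behind this feature, in plain PyTorch autograd (the classes and names below are illustrative, not the fairscale API): each layer gets its own scale factor that amplifies gradients inside the layer to avoid fp16 underflow, and that layer's parameter gradients are unscaled before the optimizer step.

        import torch

        class _ScaleGrad(torch.autograd.Function):
            # Identity in forward; multiplies the gradient by a constant in backward.
            @staticmethod
            def forward(ctx, x, scale):
                ctx.scale = scale
                return x.view_as(x)

            @staticmethod
            def backward(ctx, grad_output):
                return grad_output * ctx.scale, None

        class ScaledLayer(torch.nn.Module):
            # Wraps a layer so gradients *inside* it are computed at `scale`.
            def __init__(self, layer, scale=2.0 ** 10):
                super().__init__()
                self.layer, self.scale = layer, scale

            def forward(self, x):
                x = _ScaleGrad.apply(x, 1.0 / self.scale)  # backward: undo scale for upstream layers
                x = self.layer(x)
                return _ScaleGrad.apply(x, self.scale)     # backward: amplify grads entering this layer

            def unscale_(self):
                # Call before optimizer.step(): param grads were computed at `scale`.
                for p in self.layer.parameters():
                    if p.grad is not None:
                        p.grad.div_(self.scale)

      Because the two identity hooks bracket the layer, the scaling stays local: upstream layers see unmodified gradients, and only the wrapped layer's parameter gradients need the final unscale.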
    • [Fix][FSDP] fixed padding size of input tensor for reduce scatter (#907) · fb4eca19
      tmarkstrum authored
      
      
      * fixed padding size of input tensor for reduce scatter, and fixed an error that assigned the wrong group
      
      * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
      
      * added changelog
      
      * fixed some comments.
      
      * added a unit test to ensure the reduce_scatter process group size is correct in default cases, and fall back to the default process group when the reduce_scatter process group has the wrong size.
      
      * throw an error instead of rolling back to use default process group for reduce_scatter_process_group
      
      * Revert "throw an error instead of rolling back to use default process group for reduce_scatter_process_group"
      
      This reverts commit eab5620da3b726ea55d3088ae4ca10d94dcdf4d9.
      
      * added check for None to avoid unit test failure
      
      * fixed an error to avoid unit test failures
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
      fb4eca19
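      For context on the padding fix above: torch.distributed.reduce_scatter needs the flattened input to split evenly across the ranks of the group it runs on. A hedged sketch of that requirement (assumes an initialized process group; not the FSDP code itself):

        import torch
        import torch.distributed as dist
        import torch.nn.functional as F

        def padded_reduce_scatter(flat: torch.Tensor, group=None) -> torch.Tensor:
            # Pad to a multiple of *this group's* world size; sizing the pad
            # against the wrong group is exactly the bug class fixed here.
            world_size = dist.get_world_size(group)
            pad = -flat.numel() % world_size
            if pad:
                flat = F.pad(flat, [0, pad])
            out = flat.new_empty(flat.numel() // world_size)
            dist.reduce_scatter(out, list(flat.chunk(world_size)), group=group)
            return out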
  4. 12 Jan, 2022 1 commit
  5. 06 Jan, 2022 1 commit
    • FullyShardedDataParallel: only return full state dict on rank 0 (#885) · d3417ceb
      four4fish authored
      * FullyShardedDataParallel: only return full state dict on rank 0
      
      * Add flag and make rank 0 only optional
      
      * Add tests
      
      * Add docs
      
      * address comments
      
      * update comments
      
      * update torch nightly version
      
      * update torchvision number for torch nightly dependency
      
      * add changelog
      
      * Update CHANGELOG.md
      
      * Update CHANGELOG.md
      d3417ceb
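      A hedged usage sketch of this change (assumes an initialized process group and some module `net`; the flag name follows this PR's description, so verify it against the FSDP docs): with the flag set, state_dict() materializes the full unsharded weights on rank 0 only, sparing every other rank that memory cost during checkpointing.

        import torch
        import torch.distributed as dist
        from fairscale.nn import FullyShardedDataParallel as FSDP

        model = FSDP(net, state_dict_on_rank_0_only=True)
        state = model.state_dict()   # full weights on rank 0, empty dict elsewhere
        if dist.get_rank() == 0:
            torch.save(state, "checkpoint.pt")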
  6. 05 Jan, 2022 1 commit
    • Enabling ssd_offload training basic tests. (#887) · c5e471bc
      Paul Johnson authored
      * Enabling ssd_offload training and test via tests/nn/data_parallel/test_fsdp_offload.py.
      * Removed unused classes: SsdBuffer, SsdTensorHandleView, SsdParameter, SsdTensor
      * Enhance test coverage of test_ssd_offloading_train_flatten_params_wrapper
      * Modifications from PR #887 review comments.
      * Update Changelog
      c5e471bc
  7. 21 Dec, 2021 3 commits
  8. 02 Dec, 2021 1 commit
    • [fix] [FSDP] Do not lose original reshard_after_forward (#880) · 7c2c3e00
      Min Xu authored
      * [fix] [FSDP] Do not lose original reshard_after_forward
      
      - In a corner case we can lose this value
      - Saving it and using it in the reset function fixes the issue
      - A trivial case, probably not worth a dedicated test for now
      
      * added changelog
      7c2c3e00
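      For context, the flag this fix preserves (an illustrative snippet; `block` is a placeholder module): reshard_after_forward=True frees the full parameters right after each forward pass, trading an extra all-gather in backward for lower peak memory. The bug was that a reset could silently drop the value originally passed here.

        from fairscale.nn import FullyShardedDataParallel as FSDP

        wrapped = FSDP(block, reshard_after_forward=True)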
  9. 18 Nov, 2021 2 commits
  10. 17 Nov, 2021 2 commits
  11. 12 Nov, 2021 1 commit
    • Setup pre-commit github action and apply pre-commit to all files (#849) · 7d7edf6d
      Anupam Bhatnagar authored
      * adding pre-commit files
      
      * applying pre-commit to all files
      
      * adding no-strict-optional argument to mypy in circle ci config
      
      * fix typo
      
      * updating python versions
      
      * [skip ci] remove extra args
      
      * adding python 3.9
      
      * [skip ci] set pre-commit version in requirements-dev.txt
      
      * set CACHE_VERSION
      
      * move linters from circleci to github actions
      
      * update python version
      
      * update python version in benchmarks_2
      
      * moving to python 3.9.7
      7d7edf6d
  12. 08 Nov, 2021 3 commits
  13. 05 Nov, 2021 1 commit
    • [feat] experimental MEVO layer (#840) · 8347c1a2
      Min Xu authored
      
      
      * [feat] MEVO kernel
      
      - initial import from min/softmax and min/testing branches
      - needs renaming and further cleanup
      
      * only test with newer pytorch
      
      * renamed and added comments and code cleanup
      
      * rename and reduce test memory
      
      * testing
      
      * minor fixing
      
      * fixing
      
      * more fix
      
      * changelog
      
      * more 1.7 and 1.8 paper cuts
      
      * remove dead code
      
      * addressed Benjamin's comments
      
      * addressed more comments
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      8347c1a2
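      MEVO targets the memory spike of the final vocab projection plus softmax in large-vocabulary models. A simplified sketch of the general idea it builds on, not the MEVO kernel itself (names below are illustrative): process tokens in chunks under activation checkpointing so the full [tokens, vocab] logits tensor never exists at once.

        import torch
        import torch.nn.functional as F
        from torch.utils.checkpoint import checkpoint

        def _chunk_loss(h, w, t):
            return F.cross_entropy(h @ w.t(), t, reduction="sum")

        def chunked_vocab_loss(hidden, weight, targets, chunk=1024):
            # Checkpointing each chunk drops its [chunk, vocab] logits after use
            # and recomputes them in backward, capping peak activation memory.
            total = hidden.new_zeros(())
            for h, t in zip(hidden.split(chunk), targets.split(chunk)):
                total = total + checkpoint(_chunk_loss, h, weight, t)
            return total / targets.numel()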
  14. 01 Nov, 2021 1 commit
    • [feat] [FSDP]: add experimental support for shared weights (#836) · f2af4c66
      Min Xu authored
      
      
      * added a new test, passing without shared weights
      
      * tested weight sharing
      
      * added the test to test list file
      
      * extended to world_size = 2
      
      * fixed test
      
      * [feat]: add limited and experimental support for shared parameter
      
      * fixed tests
      
      * simplified to work with layers that have at least one non-shared param, and added code to pick up the linked_param field for sharding the shared param
      
      * fixed the case where linked param is not in separate FSDP
      
      * changelog and remove old code
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      f2af4c66
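      The kind of weight sharing this experimental support targets, in toy form (an illustrative module, not one from the test suite): a tied input embedding and output projection, i.e. one parameter tensor reachable from two modules, which FSDP must flatten and shard exactly once.

        import torch.nn as nn

        class TiedModel(nn.Module):
            def __init__(self, vocab: int = 100, d_model: int = 16):
                super().__init__()
                self.embed = nn.Embedding(vocab, d_model)
                self.proj = nn.Linear(d_model, vocab, bias=False)
                self.proj.weight = self.embed.weight  # shared parameter

            def forward(self, tokens):
                return self.proj(self.embed(tokens))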
  15. 27 Oct, 2021 1 commit
  16. 20 Oct, 2021 1 commit
  17. 20 Sep, 2021 1 commit
  18. 13 Sep, 2021 1 commit
  19. 12 Sep, 2021 1 commit
  20. 05 Sep, 2021 1 commit
  21. 12 Aug, 2021 2 commits
  22. 01 Aug, 2021 1 commit
  23. 31 Jul, 2021 1 commit
  24. 27 Jul, 2021 2 commits
  25. 26 Jul, 2021 1 commit
    • [feat]: prepare FSDP to handle multiple flatten params and fix metadata saving for MoE (#746) · 83b0b49e
      Min Xu authored
      
      
      * [feat] FSDP: supporting multiple flatten parameter groups
      
      - step 3: make FSDP use FlattenParamModule unconditionally
      
      * fixing the auto_wrap tests
      
      * minor
      
      * rewrite local_metadata_dict
      
      - updated FPW so that custom flat param name is also supported
      
      * bug fix
      
      * mypy
      
      * rewrote consolidate_shard_weights
      
      - test_consolidate passes
      
      * comments
      
      * fixing pickling
      
      * Fix shared params and MoE logic (#749)
      
      * add strict kwarg to support fairseq:gshard MoE saving logic
      
      * Test fairseq style shard
      
      * style
      
      * formatting and address comments
      
      * added changelog
      
      * fixing a test after padding renaming
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
      83b0b49e
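      A conceptual sketch of what "flatten parameters" means in these PRs (plain tensors, not the FlattenParamsWrapper API): individual parameters become views into one contiguous buffer, which is the unit FSDP shards and all-gathers.

        import torch

        params = [torch.randn(3, 4), torch.randn(5)]       # stand-ins for module params
        numels = [p.numel() for p in params]
        flat = torch.cat([p.reshape(-1) for p in params])  # one contiguous buffer
        views = [v.view(p.shape) for v, p in zip(flat.split(numels), params)]
        assert all(torch.equal(v, p) for v, p in zip(views, params))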
  26. 12 Jul, 2021 1 commit
  27. 21 Jun, 2021 1 commit
    • [feat] FSDP: supporting multiple flatten parameter groups (#711) · ab71efb3
      Min Xu authored
      
      
      * [feat] FSDP: supporting multiple flatten parameter groups
      
      - step 2: extending FPW to support multiple flat param groups
      - FSDP still only uses one group
      - unit tests exercise the new code paths
      - updated the changelog
      
      * first cut, mypy passed
      
      * test_flatten_params_wrapper.py::TestFlattenParams tests pass
      
      * added two more test cases and fixed a case in the code
      
      * fixed one bug with param_path_infos
      
      * fixed two more tests with hardcoded flat_param names
      
      * Update CHANGELOG.md
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      ab71efb3
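      Step 2's "multiple flat param groups" in miniature (a hypothetical bucketing helper, here keyed by dtype; FPW's real grouping and naming logic differs): each bucket gets its own flat buffer plus the metadata needed to restore per-parameter views later.

        import torch
        from collections import defaultdict

        def flatten_by_group(named_params):
            buckets = defaultdict(list)
            for name, p in named_params:
                buckets[p.dtype].append((name, p))  # one group per dtype
            return {
                key: (torch.cat([p.detach().reshape(-1) for _, p in items]),
                      [(name, p.shape) for name, p in items])
                for key, items in buckets.items()
            }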
  28. 11 Jun, 2021 1 commit
    • Use original forward pass directly when in eval mode from within checkpoint wrapper (#709) · 370b8483
      Pete authored
      * add failing test
      
      * add fix
      
      * use 'torch.is_grad_enabled()' instead of 'module.training'
      
      * Revert "add failing test"
      
      This reverts commit 1c34242208f9b2c5fa6c8f181434c2be6d7cdbc0.
      
      * add simple test
      
      * improve test
      
      * add check for fwd_counter
      
      * revert typing/format changes
      
      * move to new test file
      
      * CHANGELOG
      
      * remove old test
      
      * fix import order
      
      * fix test to be compat with torch 1.6.0
      
      * clean up
      
      * comments
      
      * isort 🤦
      370b8483
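      The dispatch this PR implements, reduced to its essence (a simplified sketch; the real wrapper in fairscale.nn.checkpoint tracks more state, such as the fwd_counter mentioned above): activation checkpointing only pays off when a backward pass will follow, so with gradients disabled the original forward runs directly.

        import torch
        from torch.utils.checkpoint import checkpoint

        def maybe_checkpointed_forward(module: torch.nn.Module, *args):
            if not torch.is_grad_enabled():   # eval / torch.no_grad(): no backward coming
                return module(*args)          # use the original forward directly
            return checkpoint(module, *args)  # recompute activations in backward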
  29. 01 Jun, 2021 1 commit
  30. 28 May, 2021 1 commit
  31. 18 May, 2021 1 commit