1. 26 Jun, 2021 1 commit
  2. 11 Jun, 2021 1 commit
    • [Offload][feature] Add auto shard functionality to remove requirement of nn.Sequential models. (#695) · cbeda830
      anj-s authored
      
      * auto wrap functionality
      
      * lint and doc strings
      
      * fix lint errors
      
      * lint errors and version skips
      
      * remove mypy checking and add conditional import
      
      * another math.prod instance
      
      * another import fix
      
      * address comments
      
      * lint errors
      
      * address comments
      
      * fix lint errors
      
      * add placeholder nodes to tracker list
  3. 17 May, 2021 1 commit
    • [feat] Save FSDP metadata for offline unflattening + Consolidate checkpoints (#683) · 81c20f72
      Quentin Duval authored
      * Save FSDP metadata for offline unflattening
      
      * Complete the meta-data saving method with all the information needed to reconstruct a checkpoint offline, and implement the method that reconstructs a consolidated checkpoint from a sharded checkpoint
      
      * Add a unit test to show how to use the function
      
      * Code review + improvement of the unit tests
      
      * Code review: extract clean_path
      
      * Make meta data and consolidation of checkpoint work for flatten_parameter=False
      
      * Add new unit test file in CI
      
      * Complete changelog and fix mypy issues
      
      * Add support for module buffers in the consolidation of sharded checkpoints
      
      * Better support for module buffers: save them in the meta data
      
      * Refactoring: use a data format for the meta data that is simpler to understand (move from an object-of-arrays to an array-of-objects format)
      
      * Renaming to make code clearer
      
      * Code review: in_temporary_directory rework and typo correction
      
      * Renaming
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
      Co-authored-by: QuentinDuval <QuentinDuval@users.noreply.github.com>
  4. 12 May, 2021 1 commit
    • [chore] Rename and move checkpoint_activations from misc folder. (#654) · 72c6bab2
      anj-s authored
      * rename files
      
      * add newly renamed file
      
      * rename and move checkpoint activations related files
      
      * add test files to ci list
      
      * fix lint errors
      
      * modify docs
      
      * add changelog
      
      * retain old path for now
      
      * fix lint errors
      
      * add another import test case
      
      * fix merge conflict
      
      * add missing test file
  5. 11 May, 2021 1 commit
    • [fix] FSDP forward pass overlap between compute and all-gather (#671) · 8a42a8e3
      Min Xu authored
      * [fix] FSDP forward pass overlap between compute and all-gather
      
      - many thanks to @cyanguwa for the report and @QuentinDuval for debugging it
      - a new unit test is added to check for this and ensure we detect
        issues with overlapping and CPU/GPU blocking wait calls
      
      * fix
      
      * fix
      
      * fix
      
      * better assertion outputs
      
      * fix format and tune all_gather mb for CI
      
      * more tuning with non_flatten
      
      * undo an accidental change
      
      * tuning all gather mb and del model
      
      * Update + fix overlapping test to use patched all_gather w/ delay (#672)
      
      * fixing get_cycles_per_ms
      
      * add get_smi_memory
      
      * update the docstring
      Co-authored-by: Min Xu <min.xu@acm.org>
      Co-authored-by: Myle Ott <myleott@fb.com>
  6. 07 May, 2021 1 commit
  7. 04 May, 2021 1 commit
  8. 07 Apr, 2021 1 commit
  9. 31 Mar, 2021 1 commit
    • [fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556) · a0458b98
      Min Xu authored
      
      * [fix] disable single rank process group for auto_wrap_bn
      
      - beefed up unit test with regnet-like model
      - found that the single-rank process group was causing problems
      - disabled it to enable convergence tests on the vissl side
      - use `raise e from None` to get a better assertion output
        in testing.py.
      
      * [test] fix regnet test for ddp+mixed_precision
      
      - need AMP context in FSDP
      - work around the difference between DDP and FSDP when bias=True
      - fixed a bug in input data generation that caused different ranks to have
        the same data with the wrong iteration count.
      - added a TODO noting the need for a better loss and grad_scaler, and reduced
        iters so there is no NaN.
      - added (disabled) debugging code
      
      * lint
      
      * lint
      
      * add scaler
      
      * lint
      
      * scaler
      
      * add a real loss
      
      * seeding in the ranks
      
      * balance tests
      
      * run AMP DDP==FSDP test only on cuda version 11 and up
      
      * add relu inplace and comment
      
      * make wrap_bn cover more cases in full precision mode
  10. 20 Mar, 2021 1 commit
  11. 19 Mar, 2021 2 commits
  12. 18 Mar, 2021 1 commit
  13. 04 Mar, 2021 1 commit
    • [feat]: checkpoint and normalization (#457) · 5e64d6a7
      Min Xu authored
      * [feat]: checkpoint and normalization
      
      - added special handling of BN for track_running_stats and checkpointing
      - we test BN/LN and checkpointing
      - we test them with mixed precision
  14. 01 Mar, 2021 1 commit
    • [chores]: make CI more efficient and update py39 env a bit (#447) · 5eb6b8c7
      Min Xu authored
      * [chores]: CI py39 on GPU and more efficiency
      
      * add test list files
      
      * fix
      
      * add test list files
      
      * split benchmark run into 2 runs
      
      * fix 1.8 version and balance benchmarks
      
      * fix
      
      * fix
      
      * fix
      
      * fix
      
      * recording tests
      
      * py39 install fix
      
      * test again
      
      * move tests
      
      * reorg tests
      
      * skip tests for torch 1.8 due to an upstream bug
      
      * removed __init__.py from tests since it confuses pytest
      
      * Revert "removed __init__.py from tests since it confuses pytest"
      
      This reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0.
      
      * don't include __init__ in file list
      
      * notes on __init__.py and added missing ones
      
      * fixed mypy in a test file
      
      * balance test runtime
      
      * better pip install
      
      * balance more
      
      * pip fix
      
      * balance
      
      * balance more, all tests should finish within 20m now
      
      * minor license update
      
      * trying cu102
      
      * more doc and addressed Ben's comments
      
      * debugging
      
      * debugging...