Commits · 51e43b6143969ea6570d5873cdcf3e430ac9b73e · OpenDAS / fairscale

26 Jun, 2021 1 commit
- Fix pytorch version check (#716) · bc1e60e0
  Pavel Belevich authored Jun 25, 2021
  
  bc1e60e0
11 Jun, 2021 1 commit

[Offload][feature] Add auto shard functionality to remove requirement of... · cbeda830

anj-s authored Jun 10, 2021

[Offload][feature] Add auto shard functionality to remove requirement of nn.Sequential models. (#695)

* auto wrap functionality

* lint and doc strings

* fix lint errors

* lint errors and version skips

* remove mypy checking and add conditional import

* another math.prod instance

* another import fix

* address comments

* lint errors

* address comments

* fix lint errors

* add placeholder nodes to tracker list

cbeda830

17 May, 2021 1 commit

[feat] Save FSDP metadata for offline unflattening + Consolidate checkpoints (#683) · 81c20f72

Quentin Duval authored May 17, 2021



* Save FSDP metadata for offline unflattening

* Complete the metadata-saving method with all the information needed to reconstruct a checkpoint offline, and implement the method that reconstructs a consolidated checkpoint from a sharded checkpoint

* Add a unit test to show how to use the function

* Code review + improvement of the unit tests

* Code review: extract clean_path

* Make meta data and consolidation of checkpoint work for flatten_parameter=False

* Add new unit test file in CI

* Complete changelog and fix mypy issues

* Add support for module buffers in the consolidation of sharded checkpoints

* Better support for module buffers: save them in the meta data

* Refactoring: use a data format for the metadata that is simpler to understand (move from an object-of-arrays to an array-of-objects format)

* Renaming to make code clearer

* Code review: in_temporary_directory rework and typo correction

* Renaming
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: QuentinDuval <QuentinDuval@users.noreply.github.com>

81c20f72
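
The object-of-arrays to array-of-objects refactoring mentioned in the commit above can be illustrated with a minimal sketch. The field names (`name`, `shape`, `numel`) and the helper `to_array_of_objects` are hypothetical and chosen only for illustration; they are not taken from the actual FSDP metadata format.

```python
from typing import Any, Dict, List


def to_array_of_objects(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn {"name": [...], "shape": [...]} into [{"name": ..., "shape": ...}, ...]."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    keys = list(columns)
    # zip(*columns.values()) walks the parallel lists one row at a time.
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


# Hypothetical metadata in "object of arrays" form: parallel lists that the
# reader has to keep aligned by index.
object_of_arrays = {
    "name": ["layer.weight", "layer.bias"],
    "shape": [(4, 4), (4,)],
    "numel": [16, 4],
}

# "Array of objects" form: one self-contained record per parameter.
array_of_objects = to_array_of_objects(object_of_arrays)
```

With one record per parameter, each entry can be read on its own without cross-referencing indices in several lists, which is presumably what makes the format simpler to understand.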

12 May, 2021 1 commit

[chore] Rename and move checkpoint_activations from misc folder. (#654) · 72c6bab2

anj-s authored May 12, 2021

* rename files

* add newly renamed file

* rename and move checkpoint activations related files

* add test files to ci list

* fix lint errors

* modify docs

* add changelog

* retain old path for now

* fix lint errors

* add another import test case

* fix merge conflict

* add missing test file

72c6bab2

11 May, 2021 1 commit

[fix] FSDP forward pass overlap between compute and all-gather (#671) · 8a42a8e3

Min Xu authored May 10, 2021



* [fix] FSDP forward pass overlap between compute and all-gather

- many thanks to @cyanguwa for the report and @QuentinDuval for debugging it
- a new unit test is added to check for this and ensure we detect
  issues with overlapping and cpu/gpu blocking wait calls

* fix

* fix

* fix

* better assertion outputs

* fix format and tune all_gather mb for CI

* more tuning with non_flatten

* undo an accidental change

* tuning all gather mb and del model

* Update + fix overlapping test to use patched all_gather w/ delay (#672)

* fixing get_cycles_per_ms

* add get_smi_memory

* update the docstring
Co-authored-by: Min Xu <min.xu@acm.org>
Co-authored-by: Myle Ott <myleott@fb.com>

8a42a8e3

07 May, 2021 1 commit

[fix]: support pytorch SyncBatchNorm under AMP & checkpointing with FSDP (#659) · 6db68518

Min Xu authored May 07, 2021



* [test]: add a more general test case

- also rebalance the tests a bit

* added missing arg

* balance

* better checking

* balance

* make test smaller and faster

* make ddp results cached and enable sync_bn

* clean up

* fix tests

* changelog

* balance

* fix

* addressing comments
Co-authored-by: Min Xu <min.xu@acm.org>

6db68518

04 May, 2021 1 commit

[feat]Adding DynamicLossScaler class for supporting optimizer updates on the CPU (#635) · 14d1f78c

tmarkstrum authored May 03, 2021

* dynamic loss scaler

* isort

* black

* flake8

* comments

* added the test to ci file, added a line to catch the overflow error, fixed some formatting errors

* adding type annotation

* added todo for adding more test cases for handling NaN gradients

* fix some doc strings and comments, add more todos

* fix two doc strings

14d1f78c
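
For readers unfamiliar with dynamic loss scaling, the general technique can be sketched as follows. This is a toy illustration under stated assumptions, not the `DynamicLossScaler` added in the commit above: the class name `ToyLossScaler`, its parameters, and its defaults are all hypothetical, and gradients are plain Python floats rather than tensors.

```python
import math
from typing import Iterable


class ToyLossScaler:
    """Hypothetical, simplified dynamic loss scaler (not the fairscale class)."""

    def __init__(self, init_scale: float = 2.0 ** 15, scale_factor: float = 2.0,
                 scale_window: int = 2000, min_scale: float = 1.0) -> None:
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.min_scale = min_scale
        self._good_steps = 0  # consecutive steps without overflow

    @staticmethod
    def has_overflow(grads: Iterable[float]) -> bool:
        # An inf or NaN gradient means the scaled loss overflowed.
        return any(not math.isfinite(g) for g in grads)

    def update(self, overflow: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should run."""
        if overflow:
            # Back off and skip this step.
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps % self.scale_window == 0:
            # A full window of clean steps: try a larger scale again.
            self.scale *= self.scale_factor
        return True


scaler = ToyLossScaler(init_scale=8.0, scale_window=2)
skipped = scaler.update(scaler.has_overflow([0.5, float("inf")]))
scale_after_overflow = scaler.scale
applied = [scaler.update(scaler.has_overflow([0.1, -0.2])) for _ in range(2)]
scale_after_recovery = scaler.scale
```

In a real training loop the loss would be multiplied by `scaler.scale` before the backward pass and the gradients divided by it before the optimizer step; those parts are omitted here.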

07 Apr, 2021 1 commit
- [FSDP] [feat] Add state_dict_device option (#579) · 14abed6e
  Myle Ott authored Apr 07, 2021
  
  14abed6e
31 Mar, 2021 1 commit

[fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed... · a0458b98

Min Xu authored Mar 31, 2021

[fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556)

* [fix] disable single rank process group for auto_wrap_bn

- beefed up unit test with regnet-like model
- found that the single-rank process group was causing problems
- disabled it to enable convergence tests on the vissl side
- use `raise e from None` to get a better assertion output
  in testing.py.

* [test] fix regnet test for ddp+mixed_precision

- need AMP context in FSDP
- work around the difference between ddp & fsdp when bias=True
- fixed a bug in input data generation that caused different ranks to have
  the same data with the wrong iteration count.
- added a TODO about needing a better loss and grad_scaler, and reduced
  iters so there is no nan.
- added (disabled) debugging code

* lint

* lint

* add scaler

* lint

* scaler

* add a real loss

* seeding in the ranks

* balance tests

* run AMP DDP==FSDP test only on cuda version 11 and up

* add relu inplace and comment

* make wrap_bn cover more cases in full precision mode

a0458b98
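
The `raise e from None` idiom mentioned in the commit above suppresses Python's implicit exception chaining, so the traceback shows only the final error. The sketch below demonstrates the language feature in isolation; the function names are hypothetical and the code is not taken from `testing.py`.

```python
from typing import Callable, Optional


def check_with_context() -> None:
    try:
        {}["missing"]
    except KeyError:
        # Implicit chaining: the traceback also prints the KeyError,
        # preceded by "During handling of the above exception...".
        raise AssertionError("lookup failed")


def check_without_context() -> None:
    try:
        {}["missing"]
    except KeyError:
        # "from None" hides the original KeyError from the traceback.
        raise AssertionError("lookup failed") from None


def capture(fn: Callable[[], None]) -> Optional[AssertionError]:
    """Run fn and return the AssertionError it raises, if any."""
    try:
        fn()
    except AssertionError as err:
        return err
    return None


chained = capture(check_with_context)
clean = capture(check_without_context)
```

Here `chained.__suppress_context__` is `False`, while `clean.__suppress_context__` is `True` and `clean.__cause__` is `None`, which is what keeps the printed assertion output short.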

20 Mar, 2021 1 commit

[fix][FSDP] fix weight init when using apply() (fixes #490 and #444) (#543) · fa1b85fb

Myle Ott authored Mar 20, 2021

* Add new test for weight init (fails)
* Set FSDP.compute_device so summon_full_params works before module moves to CUDA
* Override FSDP.apply to enable custom weight init

fa1b85fb

19 Mar, 2021 2 commits
- [feat][refactor][OSS] Param buckets + fp16 broadcasts (#540) · e3865549
  Benjamin Lefaudeux authored Mar 19, 2021
  * param buckets
  * unifying the buckets
  e3865549
- [feat] experimental.nn.multiprocess_pipe: re-implemented using rpc (#519) · 84e0de84
  msbaines authored Mar 18, 2021
  
  84e0de84
18 Mar, 2021 1 commit
- [refactor][fix][SDP] Extract the grad buckets in a dedicated class, fix the resize_ bug (#532) · a1bdc7d3
  Benjamin Lefaudeux authored Mar 18, 2021
  * extracting the buckets in a dedicated class, fixing the resize_ bug
  * adding a unit test
  * copyright
  a1bdc7d3
04 Mar, 2021 1 commit

[feat]: checkpoint and normalization (#457) · 5e64d6a7

Min Xu authored Mar 04, 2021

* [feat]: checkpoint and normalization

- added special handling of BN for track_running_stats and checkpointing
- we test BN/LN and checkpointing
- we test them with mixed precision

5e64d6a7

01 Mar, 2021 1 commit

[chores]: make CI more efficient and update py39 env a bit (#447) · 5eb6b8c7

Min Xu authored Mar 01, 2021

* [chores]: CI py39 on GPU and more efficiency

* add test list files

* fix

* add test list files

* split benchmark run into 2 runs

* fix 1.8 version and balance benchmarks

* fix

* fix

* fix

* fix

* recording tests

* py39 install fix

* test again

* move tests

* reorg tests

* skip tests for torch 1.8 due to an upstream bug

* removed __init__.py from tests since it confuses pytest

* Revert "removed __init__.py from tests since it confuses pytest"

This reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0.

* don't include __init__ in file list

* notes on __init__.py and added missing ones

* fixed mypy in a test file

* balance test runtime

* better pip install

* balance more

* pip fix

* balance

* balance more, all tests should finish within 20m now

* minor license update

* trying cu102

* more doc and addressed Ben's comments

* debugging

* debugging...

5eb6b8c7