1. 12 Jul, 2021 2 commits
  2. 07 Jul, 2021 1 commit
  3. 28 Jun, 2021 4 commits
  4. 26 Jun, 2021 2 commits
  5. 25 Jun, 2021 3 commits
  6. 23 Jun, 2021 1 commit
  7. 22 Jun, 2021 1 commit
    • Update torch to 1.9.0 release (#717) · 1cc4c837
      Pavel Belevich authored
      * Update torch to 1.9.0.dev20210614+cu102
      
      * Update config.yml
      
      * Update config.yml
      
      * Update setup.py
      
      * Update config.yml
      
      * Update config.yml
      
      * Update config.yml
      
      * Update config.yml
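      The entry above is mostly a series of dependency and CI bumps. For context, here is a minimal, hypothetical sketch of what pinning the torch release in setup.py can look like; the package name and exact specifier are illustrative, not the project's actual metadata.

      ```python
      # Hypothetical sketch: moving a torch dependency from a nightly pre-release
      # to the 1.9.0 release in setup.py (names and versions are illustrative).
      from setuptools import find_packages, setup

      setup(
          name="example_package",
          version="0.0.1",
          packages=find_packages(),
          install_requires=[
              "torch >= 1.9.0",  # replaces a 1.9.0.dev nightly pin
          ],
      )
      ```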
  8. 21 Jun, 2021 1 commit
    • [feat] FSDP: supporting multiple flatten parameter groups (#711) · ab71efb3
      Min Xu authored
      
      
      * [feat] FSDP: supporting multiple flatten parameter groups
      
      - step 2: extending FPW to support multiple flat param groups
      - FSDP itself still uses only one group
      - unit tests cover the new code paths
      - updated the changelog
      
      * first cut, mypy passed
      
      * test_flatten_params_wrapper.py::TestFlattenParams tests pass
      
      * added two more test cases and fixed a case in the code
      
      * fixed one bug with param_path_infos
      
      * fixed two more tests with hardcoded flat_param names
      
      * Update CHANGELOG.md
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
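      The entry above extends FSDP's FlattenParamsWrapper (FPW) so that parameters can be flattened in more than one group. The snippet below is only a sketch of the underlying idea, not fairscale's FPW API: each group of parameters is concatenated into one flat tensor, and shaped views into that tensor stand in for the originals (the real wrapper also rewires the owning modules).

      ```python
      import torch
      import torch.nn as nn

      def flatten_param_groups(groups):
          """Sketch: flatten each group of parameters into one flat nn.Parameter.

          `groups` is a list of lists of nn.Parameter. Returns the flat parameters
          and, for each group, views into the flat tensor shaped like the originals.
          """
          flat_params, all_views = [], []
          for group in groups:
              flat = nn.Parameter(torch.cat([p.detach().reshape(-1) for p in group]))
              views, offset = [], 0
              for p in group:
                  n = p.numel()
                  views.append(flat[offset : offset + n].view(p.shape))
                  offset += n
              flat_params.append(flat)
              all_views.append(views)
          return flat_params, all_views

      # Usage: two groups flattened independently, e.g. to give each group its own options.
      m = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
      flats, views = flatten_param_groups([list(m[0].parameters()), list(m[1].parameters())])
      ```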
  9. 14 Jun, 2021 1 commit
  10. 11 Jun, 2021 3 commits
    • [Offload][feature] Add auto shard functionality to remove requirement of... · cbeda830
      anj-s authored
      [Offload][feature] Add auto shard functionality to remove requirement of nn.Sequential models. (#695)
      
      * auto wrap functionality
      
      * lint and doc strings
      
      * fix lint errors
      
      * lint errors and version skips
      
      * remove mypy checking and add conditional import
      
      * another math.prod instance
      
      * another import fix
      
      * address comments
      
      * lint errors
      
      * address comments
      
      * fix lint errors
      
      * add placeholder nodes to tracker list
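      The feature above lifts the nn.Sequential requirement for offload by auto-sharding arbitrary models. A rough sketch of the core idea, assuming torch.fx is available (hence the conditional import mentioned above): trace the model into a graph and bucket its nodes into segments. The helper below is hypothetical and only groups nodes; the real feature also has to turn each segment back into a runnable submodule.

      ```python
      import torch
      import torch.nn as nn

      try:
          import torch.fx  # torch >= 1.8; hence the conditional import in the commit above
          HAS_FX = True
      except ImportError:
          HAS_FX = False

      def group_nodes_into_segments(model: nn.Module, num_segments: int):
          """Sketch: trace `model` with torch.fx and split its graph nodes into
          roughly equal segments; placeholder nodes (the graph inputs) are tracked
          separately, echoing the last bullet above."""
          assert HAS_FX, "torch.fx is required for auto-sharding"
          traced = torch.fx.symbolic_trace(model)
          placeholders = [n for n in traced.graph.nodes if n.op == "placeholder"]
          body = [n for n in traced.graph.nodes if n.op not in ("placeholder", "output")]
          per_segment = max(1, len(body) // num_segments)
          segments = [body[i : i + per_segment] for i in range(0, len(body), per_segment)]
          return traced, placeholders, segments

      # Usage: a non-Sequential model can now be traced and partitioned.
      class TwoBranch(nn.Module):
          def __init__(self):
              super().__init__()
              self.a, self.b = nn.Linear(4, 4), nn.Linear(4, 4)

          def forward(self, x):
              return self.a(x) + self.b(x)

      traced, inputs, segments = group_nodes_into_segments(TwoBranch(), num_segments=2)
      ```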
    • remove examples dir (#712) · 7bdb9a7f
      anj-s authored
    • Use original forward pass directly when in eval mode from within checkpoint wrapper (#709) · 370b8483
      Pete authored
      * add failing test
      
      * add fix
      
      * use 'torch.is_grad_enabled()' instead of 'module.training'
      
      * Revert "add failing test"
      
      This reverts commit 1c34242208f9b2c5fa6c8f181434c2be6d7cdbc0.
      
      * add simple test
      
      * improve test
      
      * add check for fwd_counter
      
      * revert typing/format changes
      
      * move to new test file
      
      * CHANGELOG
      
      * remove old test
      
      * fix import order
      
      * fix test to be compat with torch 1.6.0
      
      * clean up
      
      * comments
      
      * isort 🤦
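      The fix above makes the checkpoint wrapper fall back to the wrapped module's original forward when gradients are not needed, keyed on torch.is_grad_enabled() rather than module.training. Below is a deliberately simplified, hypothetical wrapper illustrating that decision; it is not fairscale's checkpoint_wrapper, which also handles kwargs, RNG state, and other details.

      ```python
      import torch
      import torch.nn as nn
      from torch.utils.checkpoint import checkpoint

      class SimpleCheckpointWrapper(nn.Module):
          """Sketch: only route through activation checkpointing when it can help."""

          def __init__(self, module: nn.Module):
              super().__init__()
              self.module = module

          def forward(self, *args):
              # With gradients disabled (e.g. under torch.no_grad() at eval time)
              # there is no backward pass, so recomputing activations would be
              # pure overhead: call the original forward directly.
              if not torch.is_grad_enabled():
                  return self.module(*args)
              return checkpoint(self.module, *args)

      # Usage sketch: inference bypasses checkpointing entirely.
      layer = SimpleCheckpointWrapper(nn.Linear(8, 8))
      with torch.no_grad():
          out = layer(torch.randn(2, 8))
      ```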
  11. 08 Jun, 2021 1 commit
  12. 01 Jun, 2021 3 commits
  13. 28 May, 2021 2 commits
  14. 27 May, 2021 3 commits
  15. 26 May, 2021 2 commits
  16. 21 May, 2021 1 commit
    • [refactor] ShardedGradScaler init and super call (#691) · 945b9666
      Nicholas Cilfone authored
      Make ShardedGradScaler __init__ mirror GradScaler so that the parameters can be forwarded to super. Without this, one cannot configure a ShardedGradScaler object the way one can with the PyTorch-native GradScaler object.
      Updated with the black linter.
      Added a stub for GradScaler __init__, which resolves the mypy issues and removes the need for the ignore comment.
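      The refactor above gives ShardedGradScaler the same constructor arguments as torch.cuda.amp.GradScaler and forwards them to super(). A minimal sketch of the shape of that change follows (illustrative only, not the exact fairscale class, which also carries sharding-specific state):

      ```python
      from torch.cuda.amp import GradScaler

      class ShardedGradScalerSketch(GradScaler):
          """Sketch: mirror GradScaler.__init__ so every knob can be forwarded."""

          def __init__(
              self,
              init_scale: float = 2.0 ** 16,
              growth_factor: float = 2.0,
              backoff_factor: float = 0.5,
              growth_interval: int = 2000,
              enabled: bool = True,
          ) -> None:
              super().__init__(
                  init_scale=init_scale,
                  growth_factor=growth_factor,
                  backoff_factor=backoff_factor,
                  growth_interval=growth_interval,
                  enabled=enabled,
              )

      # The sharded scaler can now be configured just like the native GradScaler:
      scaler = ShardedGradScalerSketch(init_scale=2.0 ** 10, growth_interval=500)
      ```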
  17. 18 May, 2021 2 commits
  18. 17 May, 2021 2 commits
    • [fix] auto_wrap: support wrapping based on wrapper_config (#685) · 9d2bbcf2
      Min Xu authored
      
      
      * [fix] auto_wrap: support wrapping based on wrapper_config
      
      - users can use this to avoid the assert when auto_wrap is applied multiple times to a module
      - users can traverse the modules multiple times, assign a wrapper_config
        to each module, and then call auto_wrap once to wrap them
      
      fix #649
      fix #585
      
      * added changelog
      
      * fix tests
      
      * fix a test
      
      * added an optional assert for collision based on discussions with Quentin
      
      * added config_auto_wrap_policy
      
      * lint
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
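      The fix above lets auto_wrap act on modules that carry a wrapper_config, so users can tag modules over as many traversals as they like and then wrap everything with a single auto_wrap call. The helper below is a hypothetical illustration of that tagging idea, not fairscale's auto_wrap or config_auto_wrap_policy:

      ```python
      import torch.nn as nn

      def wrap_tagged_modules(model: nn.Module, wrapper_cls) -> nn.Module:
          """Sketch: wrap every child that was tagged with a `wrapper_config` dict.

          Users may traverse the model any number of times, attaching
          `child.wrapper_config = {...}` wherever a wrapper is wanted, then call
          this helper once.
          """
          for name, child in list(model.named_children()):
              wrap_tagged_modules(child, wrapper_cls)  # wrap leaves before parents
              cfg = getattr(child, "wrapper_config", None)
              if cfg is not None:
                  setattr(model, name, wrapper_cls(child, **cfg))
          return model

      # Usage sketch with a trivial wrapper class standing in for e.g. FSDP:
      class TagWrapper(nn.Module):
          def __init__(self, module: nn.Module, tag: str = ""):
              super().__init__()
              self.module, self.tag = module, tag

          def forward(self, x):
              return self.module(x)

      model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
      model[1].wrapper_config = {"tag": "last layer"}
      wrap_tagged_modules(model, TagWrapper)
      ```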
    • [feat] Save FSDP metadata for offline unflattening + Consolidate checkpoints (#683) · 81c20f72
      Quentin Duval authored
      
      
      * Save FSDP metadata for offline unflattening
      
      * Complete the meta-data saving method with all the information needed to reconstruct a checkpoint offline, and implement the method that reconstructs a consolidated checkpoint from a sharded checkpoint
      
      * Complete the meta-data saving method with all the information needed to reconstruct a checkpoint offline, and implement the method that reconstructs a consolidated checkpoint from a sharded checkpoint
      
      * Add a unit test to show how to use the function
      
      * Code review + improvement of the unit tests
      
      * Code review: extract clean_path
      
      * Make the meta data and the consolidation of checkpoints work for flatten_parameter=False
      
      * Add new unit test file in CI
      
      * Complete changelog and fix mypy issues
      
      * Add support for module buffers in the consolidation of sharded checkpoints
      
      * Better support for module buffers: save them in the meta data
      
      * Refactoring: use a data format for the meta data that is simpler to understand (move from an object-of-arrays to an array-of-objects format)
      
      * Renaming to make code clearer
      
      * Code review: in_temporary_directory rework and typo correction
      
      * Renaming
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
      Co-authored-by: QuentinDuval <QuentinDuval@users.noreply.github.com>
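      The feature above stores enough metadata alongside each sharded FSDP checkpoint to rebuild a consolidated state dict offline. Below is a heavily simplified sketch of the consolidation step, under the assumption that every rank saves a 1-D, possibly padded shard of each parameter and that the metadata records the original shapes; the names and layout are hypothetical, not fairscale's actual checkpoint format.

      ```python
      from typing import Dict, List, Sequence

      import torch

      def consolidate_shards(
          shards: List[Dict[str, torch.Tensor]],  # one shard dict per rank, in rank order
          shapes: Dict[str, Sequence[int]],       # metadata: original (unflattened) shapes
      ) -> Dict[str, torch.Tensor]:
          """Sketch: concatenate each parameter's shards, drop any padding, and
          restore the original shape recorded in the metadata."""
          full_state = {}
          for name, shape in shapes.items():
              flat = torch.cat([shard[name].reshape(-1) for shard in shards])
              numel = int(torch.Size(shape).numel())
              full_state[name] = flat[:numel].reshape(tuple(shape))
          return full_state

      # Usage sketch: two fake ranks holding halves of a padded 3x3 weight.
      w = torch.arange(9.0)
      shards = [{"weight": w[:5]}, {"weight": torch.cat([w[5:], torch.zeros(1)])}]
      print(consolidate_shards(shards, {"weight": (3, 3)})["weight"])
      ```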
  19. 14 May, 2021 4 commits
  20. 13 May, 2021 1 commit
    • [fix] add and use get_process_group_cached (#678) · bde4bac5
      Min Xu authored
      * [fix] add and use get_process_group_cached
      
      - This commit makes FSDP avoid creating too many process groups by default
      - Extra process groups are bad for GPU memory and init time
      
      * add changelog
      
      * lint
      
      * note on speed
      
      * add better assert output
      
      test seems to be flaky:
      https://app.circleci.com/pipelines/github/facebookresearch/fairscale/2957/workflows/383c9f9f-f1a5-461c-8c41-e2e28ece037b/jobs/26783/steps
      
      * update test reference memory values
      
      - With cached process groups, the memory reported by pytorch is reduced as
      well (due to the bucket buffer memory used for the reduction buffer)
      - The effect on memory is actually larger on the SMI memory, which is not
      reported by pytorch but is checked by this test.
      
      * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
      
      * Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
      
      * Update CHANGELOG.md
      
      * Update fairscale/utils/parallel.py
      
      * Update fairscale/utils/parallel.py
      
      * Update fairscale/utils/parallel.py
      
      * Update fairscale/utils/parallel.py
      
      * improved changelog
      
      * better handling of underscores in the md file
      Co-authored-by: Min Xu <min.xu@acm.org>
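      The fix above adds a cached process-group helper so that FSDP instances sharing the same set of ranks reuse one group instead of creating a new one each time, saving GPU memory and init time. Below is a minimal sketch of such a cache, assuming torch.distributed has already been initialized; the signature and cache key are illustrative, not fairscale's exact get_process_group_cached API.

      ```python
      from typing import Dict, FrozenSet, Optional, Sequence

      import torch.distributed as dist

      # Hypothetical module-level cache, keyed by the set of member ranks.
      _GROUP_CACHE: Dict[FrozenSet[int], object] = {}

      def get_process_group_cached(ranks: Optional[Sequence[int]] = None):
          """Sketch: return an existing group for `ranks` if one was already made,
          otherwise create it once via dist.new_group(). Assumes that
          dist.init_process_group() has been called."""
          if ranks is None:
              return dist.group.WORLD
          key = frozenset(ranks)
          if key not in _GROUP_CACHE:
              # dist.new_group() is a collective and relatively expensive call,
              # so do it at most once per distinct rank set.
              _GROUP_CACHE[key] = dist.new_group(ranks=sorted(key))
          return _GROUP_CACHE[key]
      ```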