Commits · 3932a1f68e26e2cb6d5ccb2ba23a16a9ed8a5874 · OpenDAS / fairscale

04 Jan, 2021 1 commit

[feat] sync adascale from internal repo, support add_param_group (#266) · 3932a1f6

Min Xu authored Jan 04, 2021

* [feat] sync adascale from internal repo

- tbd

testing: tbd

* Update argument document of __init__

* update documentation around set_num_gradients_to_accumulate

* added checking code for proper API calling places

* rename internal APIs to make them internal

* updated changelog

* added support for add_param_group and its unit test

* added unit test for set_num_gradients_to_accumulate

* added debias_ewma unit test

* fixed test_set_num_gradients_to_accumulate (need zero_grad() call)

* added missing zero_grad() to test_lr_scheduler

* fixed test_add_param_group with respect to optim.zero_grad()

* added test_gradient_value

* added test_scale_not_equal_default for scale != world_size * grad_accum

* added test_unhook()

* removed print statements

* fixed a typo

* addressed Ben's comment

3932a1f6

29 Dec, 2020 1 commit
- [feature] OSS: add unit test for distributed checkpointing (#273) · 60c8de4a
  Joshua Meier authored Dec 28, 2020
```
author: Joshua Meier
```
  60c8de4a
22 Dec, 2020 1 commit

[OSS] Balance the trainable params only (#262) · c386e937

Benjamin Lefaudeux authored Dec 21, 2020

* fix, one liner

* adjust so that frozen trunks get spread still, even if this should have little consequences

* removing dead code, hopeful unit test fix

* now with some linting..

* adding a proper unit test case

c386e937

16 Dec, 2020 1 commit

[feat]: AdaScale work with lr_scheduler and tests, examples (#229) · d65cd838

Min Xu authored Dec 15, 2020

* [doc]: AdaScale example and notes

* formatted notes correctly as suggested by Benjamin

* added feature and unit test to make sure lr_scheduler works

* update the example with lr_scheduler

* fixed doc with "make html"

* addressed Mike's suggestions

d65cd838

14 Dec, 2020 1 commit

[fix] more adascale gradient accumulation tests and smoothing factor fix (#235) · f74afebb

Min Xu authored Dec 14, 2020

* better ddp adascale tests

* make sure the single node test use the same test cases and expected gains

* added unit test that covers smoothing factor

- tested by re-introducing the bug and see the test fail as expected.

f74afebb

06 Dec, 2020 1 commit
- [fix] skipping NCCL tests on 2-GPU systems (#233) · bb468670
  Min Xu authored Dec 05, 2020
  
  bb468670
03 Dec, 2020 1 commit

[feat] AdaScale: Gradient Accumulation and Add PyTest unit tests (#202) · ce5860ea

Min Xu authored Dec 03, 2020

* added AdaScale to README

* [adascale] added gradient accumulation

- added gradient accumulation
- tested with cifar full trainings with different value of accumulation
and verified the full accuracy is obtained
- also removed the patch optimize flag until we need it

* [adascale] adding pytest

- added basic and ddp tests and grad_accum
- closes #195

* added changelog

* added ddp grad_accum test

* moved ddp and non-ddp tests into separate files

* added checkpoint test

* more doc

* addressed Mike's comments

ce5860ea

16 Nov, 2020 1 commit
- [feat] OSS-aware clip grads, bridge sharded states (#167) · ade312c4
  Benjamin Lefaudeux authored Nov 16, 2020
```
add a clip gradients util, equivalent to torch's but aware of the sharded states. Add a corresponding unit test
```
  ade312c4
06 Nov, 2020 1 commit
- [fix] OSS tests - remove concurrent dist inits (#177) · 543d5693
  Benjamin Lefaudeux authored Nov 06, 2020
  
  543d5693
28 Oct, 2020 1 commit
- [chore] update isort to 5.6.4 (#170) · ea9876e3
  msbaines authored Oct 27, 2020
  
  ea9876e3
14 Oct, 2020 2 commits
- [bugfix] OSS + Apex (#136) · 37c686e7
  Benjamin Lefaudeux authored Oct 14, 2020
```
* fixing the issue wrt Apex, validated with Latte, Classy would need another pass
```
  37c686e7
- [feat] moe: add all_to_all support (#134) · 6d802f5a
  msbaines authored Oct 13, 2020
  
  6d802f5a
08 Oct, 2020 1 commit
- [fix] OSS unit test to check data group (#129) · 81ac5b28
  Benjamin Lefaudeux authored Oct 08, 2020
```
* new unit test to catch rank issues in OSS
```
  81ac5b28
15 Sep, 2020 2 commits
- [feat] Gracefully handle local/global state dict queries (#89) · d16e9f61
  Benjamin Lefaudeux authored Sep 15, 2020
```
Return either the local or global state when queried, depending on a prior consolidation
```
  d16e9f61
- [feat ] OSS : optional closure argument for the optimizer (#86) · 3d7f524a
  Benjamin Lefaudeux authored Sep 15, 2020
```
Make OSS compatible with optimizers which do not support the closure argument
```
  3d7f524a
09 Sep, 2020 1 commit

[feat] OSS flatten state dict (#65) · 4f597233

Benjamin Lefaudeux authored Sep 09, 2020

Changes the structure of the returned state dict with respect to the param_groups to make it closer to what a vanilla optimizer would return (un-shard them). Shard again when loading

4f597233

08 Sep, 2020 1 commit

[feat] OSS: Sync all attributes (#67) · 5a268b25

Benjamin Lefaudeux authored Sep 08, 2020

Make sure that all attributes (not just LR) are in sync in between the OSS.param_groups and the actual wrapped optimizer. Some frameworks make it possible to alter any attribute on a scheduled basis, which proves useful depending on the optimizer, so the keys need to be generically supported (not just "lr"). Not syncing these attributes is a worst case scenario, since these adjustments are silently not propagated, fixing that.

5a268b25

03 Sep, 2020 2 commits

Add grad scaler (#48) · b6a5e634

Jun Ru Anderson authored Sep 03, 2020



Add GradScaler to Fairscale, subclassing PyTorch's GradScaler. Use GradScaler in the pipe benchmark; though it is not needed in this case, it is a good example of how to use gradient scaling for larger models that do require gradient scaling in order to converge.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

b6a5e634

[fix] OSS pytorch-compliant state dict (#61) · 1d1d15ea

Benjamin Lefaudeux authored Sep 03, 2020

* Aligning the optimizer state dict with what PyTorch expects

* Adding a check on the dict keys, ensure that `state` and `param_groups` are there

* after installing the specific isort, black and all, one liner to please the linter..

1d1d15ea

28 Aug, 2020 1 commit

[fix] optim/oss: work correctly with LRScheduler (#58) · ab32cb7d

msbaines authored Aug 28, 2020

* [fix] optim/oss: work correctly with LRScheduler

Sync lr before every step and before consolidate.

ab32cb7d

27 Aug, 2020 3 commits
- [refactor] optim/oss: save memory and time by avoiding duplicate copy of parameters (#57) · e4a0804c
  msbaines authored Aug 27, 2020
  
  e4a0804c
- [fix] optim/oss: PyTorch already handles putting state on proper device (#54) · 220ee323
  msbaines authored Aug 27, 2020
  
  220ee323
- [fix] optim/oss: support optimizers with additional step kwargs (#53) · 09028a0d
  msbaines authored Aug 26, 2020
```
* [fix] optim/oss: support optimizers with additional step kwargs

Some of the optimizers in apex support additional kwargs to step
such as scale.
```
  09028a0d
22 Aug, 2020 1 commit

[feat] optimizer state scaling (#44) · 5251a69a

Jun Ru Anderson authored Aug 21, 2020



Implement scaling of optimizer state when using pure-fp16 training to avoid underflow. Update benchmark to use pure-fp16. Modify state_dict methods to store and load the optimizer state scale.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

5251a69a

21 Aug, 2020 1 commit

[test] set torch seed for Adam tests (#49) · 0e8c2a96

Jun Ru Anderson authored Aug 21, 2020



Set the torch seed for tests. xfail mixed precision and memory-efficient mixed-precision state_dict tests due to their states being cast to FP16 and back to FP32 during load_state_dict.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

0e8c2a96

20 Aug, 2020 1 commit
- [fix] OSS restore state to proper device (#46) · c2d6f4b6
  Benjamin Lefaudeux authored Aug 20, 2020
```
* move the restored param groups to the original device

* adding a corresponding test
```
  c2d6f4b6
19 Aug, 2020 1 commit

[fix] fix tests and state_dict; refactor tests (#45) · 9d6c7b6a

Jun Ru Anderson authored Aug 19, 2020



Refactor tests to remove duplicated code. Fix the state_dict test to instantiate the second optimizer with the correct precision. Fix Adam.load_state_dict to make optimizer state the right type.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

9d6c7b6a

18 Aug, 2020 1 commit

[feat] allow fp16 optimizer state with Adam (#41) · 8ee5a8ff

Jun Ru Anderson authored Aug 18, 2020



Allow training with optimizer state in fp16. Use an enum to select from full-precision, mixed precision, memory efficient mixed precision and pure fp16. Improve clarity of testing code
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

8ee5a8ff

14 Aug, 2020 2 commits

[feat] add mixed precision Adam (#40) · e2d8f573

Jun Ru Anderson authored Aug 14, 2020



Add support for mixed-precision (half precision params, full precision gradients) and memory-efficient (half precision params and half precision gradients) training with Adam
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

e2d8f573

[fix] Properly restore a sharded optim state (#39) · 585f177b

Benjamin Lefaudeux authored Aug 14, 2020



* hotfix a half-cooked optimizer state restoration, the global shared state also needs to be restored

* [cleanup] get 100% coverage on oss.py (#38)
authored-by: Mandeep Singh Baines <msb@fb.com>

* better unit testing, check that the .param_groups attribute is properly in sync with the loaded state
Co-authored-by: msbaines <35972327+msbaines@users.noreply.github.com>

585f177b

13 Aug, 2020 1 commit

Aligning OSS state dict with... · 57079b08

Benjamin Lefaudeux authored Aug 12, 2020

Aligning OSS state dict with `https://pytorch.org/docs/stable/_modules/torch/optim/optimizer.html#Optimizer` (#31)

57079b08

08 Aug, 2020 1 commit
- [fix] fix test_oss.py when host have 2 GPUs (#26) · d9e6ceaa
  Min Xu authored Aug 07, 2020
```
Co-authored-by: Min Xu <m1n@fb.com>
```
  d9e6ceaa
31 Jul, 2020 2 commits
- [feat] Implement OSS save and load of the sharded state from a single replica (#16) · 8e363567
  Benjamin Lefaudeux authored Jul 31, 2020
  
  8e363567
- [feat] add FusedAdam (#10) · bfba68d8
  Jun Ru Anderson authored Jul 30, 2020
```
Add FusedAdam, update benchmark and add tests.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
```
  bfba68d8
08 Jul, 2020 1 commit
- Initial commit · 0cd65242
  Mandeep Singh Baines authored Jul 07, 2020
  
  0cd65242