Commits · f5e727ccc7542ab1268e4fe26b227f01490ad843 · OpenDAS / fairscale

05 Oct, 2022 1 commit

Fix gradient accumulation (#1086) · f5e727cc

Changyu Gao authored Oct 05, 2022

* Fix gradient accumulation

- Add `is_scaled_loss` flag to support both scaled / unscaled loss
- Add a method `scale_grad_by_num_grads_to_accum` to handle gradient accumulation using unscaled loss more explicitly
- Fix `test_grad_accum` and `test_set_num_gradients_to_accumulate`
- Add tests for gradient
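
The difference between scaled and unscaled loss can be illustrated with a minimal pure-Python sketch. It assumes that "scaled" means each micro-batch loss is divided by the number of accumulation steps before the backward pass, and that "unscaled" means raw gradients are summed and divided once afterward. The helpers `grad` and `accumulate` are hypothetical and are not fairscale's API; only the `is_scaled_loss` name comes from the commit message.

```python
# Hypothetical sketch: gradient accumulation with scaled vs. unscaled loss.
# Model: loss(w) = (w * x - y) ** 2, so d(loss)/dw = 2 * (w * x - y) * x.

def grad(w, x, y):
    """Analytic gradient of the squared error for one micro-batch."""
    return 2.0 * (w * x - y) * x


def accumulate(w, batches, is_scaled_loss):
    """Return the averaged gradient over all micro-batches.

    is_scaled_loss=True: each gradient is divided by the number of
    micro-batches before being summed (as if the loss had been scaled).
    is_scaled_loss=False: raw gradients are summed, then divided once.
    """
    n = len(batches)
    total = 0.0
    for x, y in batches:
        g = grad(w, x, y)
        total += g / n if is_scaled_loss else g
    if not is_scaled_loss:
        # Assumed role of a method like scale_grad_by_num_grads_to_accum.
        total /= n
    return total


batches = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
scaled = accumulate(0.5, batches, is_scaled_loss=True)
unscaled = accumulate(0.5, batches, is_scaled_loss=False)
```

Both paths produce the same averaged gradient; they differ only in where the division by the number of accumulation steps happens.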

f5e727cc

24 Sep, 2022 1 commit

[chore] move fair_dev into fairscale (#1078) · 8f8f8ef9

Min Xu authored Sep 23, 2022

Co-authored-by: Min Xu <min.xu.public@gmail.com>

8f8f8ef9

15 Jun, 2022 1 commit

Fix CI (#1010) · 32b0b98e

Crutcher Dunnavant authored Jun 14, 2022

* Fix CI

* ci pythonpath

32b0b98e

12 Jun, 2022 1 commit

Move f/utils => f/internal; move testing libs to fair_dev/testing (#1004) · 2350968e

Crutcher Dunnavant authored Jun 12, 2022

2350968e

30 Mar, 2022 1 commit

Remove seed_isort_config and related dependencies. (#969) · 72f373c1

Paul Johnson authored Mar 30, 2022

This is no longer needed since isort's version is 5.10

Also pin black to version 22.3.0 to fix an issue with the click
dependency.

Update files that now fail with the new version of black: {a = 2 ** 4} ->
{a = 2**4}
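
The formatting change is purely cosmetic. As a small sketch, the standard-library `ast` module can confirm that both spellings parse to the same syntax tree, so reformatting of this kind should not change behavior.

```python
import ast

old_style = "a = 2 ** 4"   # spelling before the black upgrade
new_style = "a = 2**4"     # spelling after the black upgrade

# ast.dump omits line/column attributes by default, so whitespace-only
# differences do not show up in the comparison.
same_tree = ast.dump(ast.parse(old_style)) == ast.dump(ast.parse(new_style))

# Executing both statements yields the same binding for `a`.
namespace_old, namespace_new = {}, {}
exec(old_style, namespace_old)
exec(new_style, namespace_new)
```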

72f373c1

14 Feb, 2022 1 commit

[chore] [cleanup]: pytest, pytorch new versions, fix tests (#933) · fae29959

Min Xu authored Feb 14, 2022



* update pytest versions

* [test] test related changes

- upgrade to newer pytorch versions
- added a function to make tests more deterministic on A100 and TF32
- fixed some tests so that they are correctly skipped on a single GPU system

* more fixes

* formatting overly long lines

* format

* better test without triggering a warning

* fix an optim state bug with newer pytorch

- the Adam optimizer seems to return "step" as a singleton tensor now in the
nightly build
- this fixes it, assuming a non-tensor value can still be loaded back by
the optimizer

* improve oss.py

- using min_loss for regression checking is a bit more reliable
- also increased the num epochs from 10 to 12

* small oss.py fix

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <min.xu.public@gmail.com>
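
The optimizer-state item in the message above can be sketched as follows. This is an assumption about the general approach, not the code from the commit: a hypothetical helper `to_plain_number` converts a single-element tensor-like value into a Python number by duck-typing on `.item()` and leaves plain numbers untouched. `FakeScalarTensor` is a stand-in so the sketch runs without PyTorch installed.

```python
# Hypothetical sketch: normalize an optimizer's "step" entry so that it is
# a plain Python number whether the optimizer stored a number or a
# single-element tensor.

class FakeScalarTensor:
    """Stand-in for a single-element tensor; only .item() is modeled."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


def to_plain_number(value):
    """Return value.item() for tensor-like objects, else value unchanged."""
    item = getattr(value, "item", None)
    return item() if callable(item) else value


def normalize_state(state):
    """Copy a per-parameter state dict, converting only the "step" entry."""
    out = dict(state)
    if "step" in out:
        out["step"] = to_plain_number(out["step"])
    return out


old_format = normalize_state({"step": 3, "exp_avg": [0.1]})
new_format = normalize_state({"step": FakeScalarTensor(3.0), "exp_avg": [0.1]})
```

Either input shape ends up with a plain number under `"step"`, while the other state entries are passed through unchanged.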

fae29959

22 Feb, 2021 1 commit

[fix][OSS] adding an assert for empty shards + corresponding unit test (#406) · 279b8024

Benjamin Lefaudeux authored Feb 22, 2021

* adding an assert + corresponding unit test
* updated changelog
* adjusting the adascale tests

279b8024

19 Feb, 2021 1 commit

[bug]: fix a bug on custom smoothing factor (#401) · 4396ef4a

Min Xu authored Feb 18, 2021

4396ef4a

29 Jan, 2021 1 commit

[test]: test with py39 + torch 1.8 nightly (#339) · e348806b

Min Xu authored Jan 29, 2021

* [test]: test with py39 + torch 1.8 nightly

* version fix

* more fix

* fix version function for nightly version

* fix torch_pg build

* invalidate cache

* separate benchmark requirements

* comment

* fixed mypy

* fixed a test
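
Nightly builds carry version strings with extra suffixes, which is presumably what the version-function fix listed above addresses. The sketch below is a hypothetical helper, not fairscale's actual function: it extracts the leading numeric `major.minor.patch` triple and ignores anything after it. The example version strings are illustrative.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Return (major, minor, patch) from the start of a version string.

    Anything after the third numeric component (pre-release, dev, or
    local-build suffixes) is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


release = parse_version("1.7.1")
nightly = parse_version("1.8.0.dev20210129+cu110")
alpha = parse_version("1.8.0a0+abc1234")
```

Returning a tuple of integers lets callers compare versions directly, for example `parse_version(v) >= (1, 8, 0)`, instead of comparing strings.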

e348806b

28 Jan, 2021 1 commit

[test]: test adascale with oss (#328) · fa11d338

Min Xu authored Jan 28, 2021

* [test]: test adascale with oss

* minor fix

* add a small comment

* refactor: moved find_tensor_by_shape

* refactor: move test golden data into its own module

* refactor: simplified the train function

* refactor: added comments as suggested

fa11d338

05 Jan, 2021 1 commit

[fix] Flaky tests (#283) · 79365ee6

Benjamin Lefaudeux authored Jan 04, 2021

* adding the pytest timeout plugin to properly root out hanging tests
* removing redundant code, slightly more reasonable timeout, works on single cuda
* finding the root bug for some of the cpu hangs, rpc init
* propagating all the rpc init test changes to the pipe and model parallel tests

79365ee6

04 Jan, 2021 1 commit

[feat] sync adascale from internal repo, support add_param_group (#266) · 3932a1f6

Min Xu authored Jan 04, 2021

* [feat] sync adascale from internal repo

- tbd

testing: tbd

* Update argument document of __init__

* update documentation around set_num_gradients_to_accumulate

* added checking code for proper API calling places

* rename internal APIs to mark them as internal

* updated changelog

* added support for add_param_group and its unit test

* added unit test for set_num_gradients_to_accumulate

* added debias_ewma unit test

* fixed test_set_num_gradients_to_accumulate (need zero_grad() call)

* added missing zero_grad() to test_lr_scheduler

* fixed test_add_param_group with respect to optim.zero_grad()

* added test_gradient_value

* added test_scale_not_equal_default for scale != world_size * grad_accum

* added test_unhook()

* removed print statements

* fixed a typo

* addressed Ben's comment
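
As background for the `debias_ewma` unit test mentioned above, here is a minimal sketch of a bias-corrected exponential moving average in the style popularized by Adam. It is an assumption that AdaScale's debiasing follows this general form; the class name `DebiasedEwma` and the smoothing value are hypothetical.

```python
class DebiasedEwma:
    """Exponential moving average with optional correction for the bias
    introduced by initializing the average at zero."""

    def __init__(self, smoothing, debias=True):
        if not 0.0 <= smoothing < 1.0:
            raise ValueError("smoothing must be in [0, 1)")
        self.smoothing = smoothing
        self.debias = debias
        self._avg = 0.0
        self._steps = 0

    def update(self, value):
        """Fold one observation into the average and return the estimate."""
        self._steps += 1
        self._avg = self.smoothing * self._avg + (1.0 - self.smoothing) * value
        return self.value

    @property
    def value(self):
        if not self.debias or self._steps == 0:
            return self._avg
        # Divide by (1 - smoothing**t) to undo the pull toward zero.
        return self._avg / (1.0 - self.smoothing ** self._steps)


raw = DebiasedEwma(0.9, debias=False)
fixed = DebiasedEwma(0.9, debias=True)
for _ in range(5):
    raw_estimate = raw.update(2.0)
    fixed_estimate = fixed.update(2.0)
```

With a constant input, the uncorrected average starts near zero and approaches the input only gradually, while the corrected estimate recovers the input from the first step.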

3932a1f6

16 Dec, 2020 1 commit

[feat]: AdaScale work with lr_scheduler and tests, examples (#229) · d65cd838

Min Xu authored Dec 15, 2020

* [doc]: AdaScale example and notes

* formatted notes correctly as suggested by Benjamin

* added feature and unit test to make sure lr_scheduler works

* update the example with lr_scheduler

* fixed doc with "make html"

* addressed Mike's suggestions

d65cd838

14 Dec, 2020 1 commit

[fix] more adascale gradient accumulation tests and smoothing factor fix (#235) · f74afebb

Min Xu authored Dec 14, 2020

* better ddp adascale tests

* make sure the single node test uses the same test cases and expected gains

* added unit test that covers smoothing factor

- tested by re-introducing the bug and seeing the test fail as expected.

f74afebb

03 Dec, 2020 1 commit

[feat] AdaScale: Gradient Accumulation and Add PyTest unit tests (#202) · ce5860ea

Min Xu authored Dec 03, 2020

* added AdaScale to README

* [adascale] added gradient accumulation

- added gradient accumulation
- tested with full CIFAR training runs with different accumulation values
and verified that full accuracy is obtained
- also removed the patch optimize flag until we need it

* [adascale] adding pytest

- added basic and ddp tests and grad_accum
- closes #195

* added changelog

* added ddp grad_accum test

* moved ddp and non-ddp tests into separate files

* added checkpoint test

* more doc

* addressed Mike's comments

ce5860ea