- 05 Oct, 2022 1 commit
Changyu Gao authored
* Fix gradient accumulation
  - Add ``is_scaled_loss`` flag to support both scaled and unscaled loss
  - Add a ``scale_grad_by_num_grads_to_accum`` method to handle gradient accumulation with unscaled loss more explicitly
  - Fix ``test_grad_accum`` and ``test_set_num_gradients_to_accumulate``
  - Add tests for gradient accumulation
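For illustration, here is a minimal PyTorch sketch of the two conventions such a flag distinguishes: dividing the loss by the accumulation count before each backward (scaled loss) versus accumulating unscaled-loss gradients and dividing the gradients once before the optimizer step. The function and argument names below are invented for the example and are not this repo's API.

```python
import torch

def accumulate_and_step(model, optimizer, batches, num_grads_to_accum, is_scaled_loss=True):
    """Run one optimizer step over several accumulated micro-batches."""
    for batch in batches:
        loss = model(batch).sum()
        if is_scaled_loss:
            # Scaled loss: each backward adds grad / N, so the accumulated grad is already averaged.
            (loss / num_grads_to_accum).backward()
        else:
            # Unscaled loss: the accumulated grad is N times the average.
            loss.backward()
    if not is_scaled_loss:
        # Rescale the accumulated gradients once, right before stepping.
        for p in model.parameters():
            if p.grad is not None:
                p.grad.div_(num_grads_to_accum)
    optimizer.step()
    optimizer.zero_grad()

# Toy usage:
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [torch.randn(8, 4) for _ in range(4)]
accumulate_and_step(model, opt, batches, num_grads_to_accum=len(batches), is_scaled_loss=False)
```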
- 24 Sep, 2022 1 commit
Min Xu authored
Co-authored-by: Min Xu <min.xu.public@gmail.com>
- 12 Jun, 2022 1 commit
Crutcher Dunnavant authored
- 04 Mar, 2021 1 commit
Min Xu authored
- cover them in terms of code path only
- numerically, AdaScale behaves differently on SDP/FSDP than on DDP, mainly because each rank only has a partial view of the gradients
- this doesn't mean it is definitely not useful, but it has yet to be validated
- not going to spend too much time on it until we have a real use case
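For context, AdaScale estimates a gain from gradient statistics; the formula below is a paraphrase of the gain ratio from the AdaScale SGD paper (my recollection, not code from this repo), included only to show why a partial, sharded view of the gradients can change the estimate. Here S is the scale, \mu_t^2 the squared norm of the full gradient, and \sigma_t^2 the per-replica gradient variance.

$$ r_t \;=\; \frac{\sigma_t^2 + \mu_t^2}{\tfrac{1}{S}\,\sigma_t^2 + \mu_t^2}, \qquad 1 \le r_t \le S $$

Under SDP/FSDP each rank measures these quantities on its own shard of the gradients, so the resulting gain need not match the DDP value.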
- 22 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* adding an assert + corresponding unit test
* updated changelog
* adjusting the adascale tests
- 28 Jan, 2021 1 commit
Min Xu authored
* [test]: test adascale with oss
* minor fix
* add a small comment
* refactor: moved find_tensor_by_shape
* refactor: move test golden data into its own module
* refactor: simplified the train function
* refactor: added comments as suggested
- 05 Jan, 2021 1 commit
Benjamin Lefaudeux authored
* adding the pytest timeout plugin to properly root out hanging tests
* removing redundant code, slightly more reasonable timeout, works on single cuda
* finding the root bug for some of the cpu hangs, rpc init
* propagating all the rpc init test changes to the pipe and model parallel tests
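A small sketch of how the pytest-timeout plugin mentioned here is typically used; the test name is invented for illustration.

```python
import pytest  # requires the pytest-timeout plugin: pip install pytest-timeout

@pytest.mark.timeout(60)  # fail the test after 60 s instead of hanging CI forever
def test_rpc_init_does_not_hang():
    assert True
```

The same limit can also be applied globally, e.g. with `pytest --timeout=60` or a `timeout` entry in the pytest configuration file.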
- 14 Dec, 2020 1 commit
Min Xu authored
* better ddp adascale tests
* make sure the single-node test uses the same test cases and expected gains
* added a unit test that covers the smoothing factor - verified by re-introducing the bug and seeing the test fail as expected
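As a reminder of what a smoothing factor like this controls, here is a generic exponential moving average update (a sketch only, not the exact code in this repo):

```python
def ema_update(avg, new, smoothing=0.999):
    # Larger smoothing -> the running statistic reacts more slowly to new samples.
    return smoothing * avg + (1.0 - smoothing) * new

# e.g. smoothing a running estimate of a squared gradient norm:
grad_sqr_avg = 1.0
for sample in (0.8, 1.2, 0.9):
    grad_sqr_avg = ema_update(grad_sqr_avg, sample, smoothing=0.9)
```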
- 03 Dec, 2020 1 commit
Min Xu authored
* added AdaScale to README
* [adascale] added gradient accumulation
  - added gradient accumulation
  - tested with full CIFAR trainings at different values of accumulation and verified that full accuracy is obtained
  - also removed the patch optimize flag until we need it
* [adascale] adding pytest
  - added basic and ddp tests and grad_accum
  - closes #195
* added changelog
* added ddp grad_accum test
* moved ddp and non-ddp tests into separate files
* added checkpoint test
* more doc
* addressed Mike's comments
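A rough sketch of what gradient accumulation with AdaScale looks like from the user side, assuming fairscale's ``AdaScale`` wrapper. The ``num_gradients_to_accumulate`` argument name is inferred from the tests named in this log and should be treated as an assumption, as should the single-process "gloo" setup.

```python
import os
import torch
import torch.distributed as dist
from fairscale.optim import AdaScale  # assumed import path

ACCUM = 4

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
# AdaScale wraps a regular optimizer and is told how many backwards make up one step.
optim = AdaScale(torch.optim.SGD(model.parameters(), lr=0.1),
                 num_gradients_to_accumulate=ACCUM)  # assumed kwarg name

for _ in range(2):
    for _ in range(ACCUM):
        model(torch.randn(8, 4)).sum().backward()
    optim.step()      # one step per ACCUM accumulated backwards
    optim.zero_grad()

dist.destroy_process_group()
```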