1. 16 Dec, 2020 1 commit
    • [feat]: AdaScale work with lr_scheduler and tests, examples (#229) · d65cd838
      Min Xu authored
      * [doc]: AdaScale example and notes
      
      * formatted notes correctly as suggested by Benjamin
      
      * added feature and unit test to make sure lr_scheduler works
      
      * update the example with lr_scheduler
      
      * fixed doc with "make html"
      
      * addressed Mike's suggestions
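The lr_scheduler interaction above can be sketched roughly as follows. This is a hypothetical illustration of the idea (the scheduler advances by AdaScale's gain-weighted progress rather than by one per optimizer step), not fairscale's actual API; the function name is invented:

```python
# Hypothetical sketch (not fairscale's actual API): an AdaScale-style optimizer
# advances "scale-invariant" training progress by its gain (>= 1) each step,
# and the lr_scheduler is stepped once per whole unit of progress.
def scheduler_steps_for(gains):
    progress = 0.0
    steps = 0
    for gain in gains:
        progress += gain              # one optimizer step advances by `gain`, not 1
        while steps < int(progress):  # catch the scheduler up to the progress made
            steps += 1                # stands in for lr_scheduler.step()
    return steps
```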
  2. 14 Dec, 2020 1 commit
  3. 10 Dec, 2020 1 commit
  4. 06 Dec, 2020 1 commit
  5. 04 Dec, 2020 1 commit
  6. 03 Dec, 2020 1 commit
    • [feat] AdaScale: Gradient Accumulation and Add PyTest unit tests (#202) · ce5860ea
      Min Xu authored
      * added AdaScale to README
      
      * [adascale] added gradient accumulation
      
      - added gradient accumulation
      - tested with full CIFAR trainings using different accumulation values
      and verified that full accuracy is obtained
      - also removed the patch optimize flag until we need it
      
      * [adascale] adding pytest
      
      - added basic and ddp tests and grad_accum
      - closes #195
      
      * added changelog
      
      * added ddp grad_accum test
      
      * moved ddp and non-ddp tests into separate files
      
      * added checkpoint test
      
      * more doc
      
      * addressed Mike's comments
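The gradient-accumulation behavior tested above can be sketched in miniature: average every k micro-batch gradients into a single update before stepping. This is an illustrative stand-in with scalar "gradients", not the AdaScale implementation:

```python
# Hypothetical sketch of gradient accumulation: average every k micro-batch
# gradients into one optimizer update (scalars stand in for gradient tensors).
def accumulate(micro_grads, k):
    updates, buf, count = [], 0.0, 0
    for g in micro_grads:
        buf += g
        count += 1
        if count == k:                # accumulation window full: emit one update
            updates.append(buf / k)
            buf, count = 0.0, 0
    return updates
```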
  7. 01 Dec, 2020 2 commits
  8. 21 Nov, 2020 1 commit
    • [feat] ShardedDataParallel with autoreduce (#157) · ad933b34
      Benjamin Lefaudeux authored
      * rewrite using autograd and Variable execution queue to make the reduce automatic
      * share buckets with OSS to remove duplication
      * some speed is likely still on the table, since the measured speed vs. bucketing does not match expectations; could be a follow-up
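The autoreduce-with-buckets idea can be sketched as follows: gradients are added as autograd produces them, and a full bucket triggers a reduce. This is an illustrative toy (a sum stands in for an async all-reduce), not ShardedDataParallel's implementation:

```python
# Hypothetical sketch of autoreduce via buckets: gradients are appended as
# autograd produces them; a full bucket triggers a (simulated) reduce.
class GradBucket:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = []   # gradients waiting in the bucket
        self.reduced = []   # results of each (simulated) reduce

    def add(self, grad):
        self.pending.append(grad)
        if len(self.pending) == self.capacity:
            self.flush()

    def flush(self):
        if self.pending:                          # stand-in for an async all-reduce
            self.reduced.append(sum(self.pending))
            self.pending = []
```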
  9. 18 Nov, 2020 1 commit
  10. 16 Nov, 2020 1 commit
  11. 11 Nov, 2020 2 commits
  12. 10 Nov, 2020 1 commit
    • Single-process control via PipeRPCWrapper (#156) · 5d4f50fb
      Tom Birch authored
      Adds support for:
      * Reused layers (e.g. for weight sharing)
      * Lazily-constructed layers
      * Single-process control via PipeRPCWrapper
      * PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive
      
      Also added examples for multi-process and PipeRPCWrapper
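The per-rank event loop mentioned above can be sketched as a worker draining a message queue and dispatching forward work for activations and backward work for gradients in arrival order. This is an invented illustration of the scheduling idea, not the AsyncSchedule code:

```python
from collections import deque

# Hypothetical sketch of a per-rank event loop: drain incoming messages and
# dispatch forward work for activations, backward work for gradients.
def event_loop(inbox):
    handled = []
    queue = deque(inbox)
    while queue:
        kind, payload = queue.popleft()
        if kind == "activation":
            handled.append(("forward", payload))   # run this stage's forward pass
        elif kind == "gradient":
            handled.append(("backward", payload))  # run this stage's backward pass
    return handled
```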
  13. 06 Nov, 2020 1 commit
  14. 30 Oct, 2020 1 commit
  15. 29 Oct, 2020 1 commit
  16. 28 Oct, 2020 1 commit
  17. 23 Oct, 2020 1 commit
  18. 21 Oct, 2020 1 commit
  19. 20 Oct, 2020 1 commit
  20. 17 Oct, 2020 1 commit
  21. 16 Oct, 2020 2 commits
  22. 14 Oct, 2020 2 commits
  23. 08 Oct, 2020 3 commits
  24. 06 Oct, 2020 1 commit
    • [feat] OSS/SDP : bucketing (#122) · 341d8b2b
      Benjamin Lefaudeux authored
      Same bucketing strategy for OSS and SDP:
      sort everything ahead of time, per rank and per size, smallest tensors first. Pack the smallest elements into a fixed buffer and send it asynchronously, then send all the remaining tensors asynchronously, and return to the bucket. Once done, scatter the contents if needed.
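The bucketing plan described above can be sketched as a packing decision: sort tensors by size ascending, fill a fixed-capacity buffer with the smallest ones, and mark the rest to be sent individually. Sizes and capacity here are illustrative, not the actual buffer dimensions:

```python
# Hypothetical sketch of the bucketing plan: smallest tensors go into a fixed
# buffer (the bucket); everything that doesn't fit is sent on its own.
def plan_buckets(sizes, capacity):
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])  # smallest first
    bucketed, used, singles = [], 0, []
    for i in order:
        if used + sizes[i] <= capacity:
            bucketed.append(i)          # packed into the fixed buffer
            used += sizes[i]
        else:
            singles.append(i)           # too big for the buffer: sent individually
    return bucketed, singles
```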
  25. 05 Oct, 2020 1 commit
  26. 02 Oct, 2020 1 commit
  27. 29 Sep, 2020 1 commit
  28. 17 Sep, 2020 2 commits
    • Multi-process pipe (#90) · 63f7796a
      Tom Birch authored
      Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines)
      * Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess, but also supporting PipelineStyle.MultiProcess
      * Added support for lazy construction of modules (see lazy_construction for an example)
      * Added two implementations of inter-process communication: one based on rpc with globally visible queues, one based on send/recv
      * Copied all the relevant tests from tests/pipe to tests/pipe_process and modified them to exercise PipelineStyle.MultiProcess
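Distributing pipeline stages across processes starts from partitioning the model: a balance list assigns a number of consecutive layers to each stage/rank. The sketch below illustrates that partitioning idea only; it is not Pipe's implementation:

```python
# Hypothetical sketch: split a layer list into pipeline stages according to a
# balance list, one stage per process/rank.
def partition(layers, balance):
    assert sum(balance) == len(layers), "balance must cover every layer exactly"
    stages, start = [], 0
    for n in balance:
        stages.append(layers[start:start + n])  # this rank's consecutive slice
        start += n
    return stages
```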
    • [feat] Sharded DDP - small refactor and new features (#97) · 49a198c9
      Benjamin Lefaudeux authored
      - rename oss_ddp to ShardedDataParallel
      - some refactoring
      - ShardedDataParallel owns the sharded optimizer, exposed if need be
      - some small perf bumps
  29. 15 Sep, 2020 2 commits
  30. 09 Sep, 2020 1 commit
    • [feat] OSS flatten state dict (#65) · 4f597233
      Benjamin Lefaudeux authored
      Changes the structure of the returned state dict with respect to param_groups, to make it closer to what a vanilla optimizer would return (un-sharding them). The state is sharded again when loading.
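The un-sharding step can be sketched as merging per-rank param_groups back into one flat view: every rank holds the same groups but only its own params. This is an illustrative toy (dicts with scalar param ids), not OSS's state_dict code:

```python
# Hypothetical sketch of un-sharding param_groups: merge per-rank groups so the
# result is shaped like a vanilla optimizer's param_groups.
def unshard_param_groups(shards):
    merged = [dict(group) for group in shards[0]]
    for group in merged:
        group["params"] = list(group["params"])   # copy, don't mutate the shard
    for shard in shards[1:]:
        for dst, src in zip(merged, shard):
            dst["params"].extend(src["params"])   # re-attach this rank's params
    return merged
```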
  31. 08 Sep, 2020 1 commit
    • [feat] OSS: Sync all attributes (#67) · 5a268b25
      Benjamin Lefaudeux authored
      Make sure that all attributes (not just the LR) stay in sync between OSS.param_groups and the actual wrapped optimizer. Some frameworks make it possible to alter any attribute on a schedule, which can be useful depending on the optimizer, so keys need to be supported generically (not just "lr"). Not syncing these attributes was a silent worst-case scenario, since the adjustments were simply not propagated; this fixes that.
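The generic attribute sync can be sketched as copying every key except "params" from the wrapper's param_groups down to the wrapped optimizer's groups. This is an illustrative sketch of the idea, not OSS's code:

```python
# Hypothetical sketch of generic attribute syncing: push every non-"params" key
# from the wrapper's param_groups into the wrapped optimizer's groups, so
# scheduled changes to e.g. momentum or weight_decay propagate, not just "lr".
def sync_attributes(wrapper_groups, wrapped_groups):
    for src, dst in zip(wrapper_groups, wrapped_groups):
        for key, value in src.items():
            if key != "params":   # params stay sharded; everything else syncs
                dst[key] = value
```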
  32. 03 Sep, 2020 1 commit
    • Add grad scaler (#48) · b6a5e634
      Jun Ru Anderson authored
      
      
      Add GradScaler to Fairscale, subclassing PyTorch's GradScaler. Use GradScaler in the pipe benchmark; though it is not needed in this case, it is a good example of how to use gradient scaling for larger models that do require it in order to converge.
      Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
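The loss-scaling idea behind GradScaler can be sketched in miniature: gradients are computed on a scaled loss, unscaled before the optimizer step, and the step is skipped (with the scale backed off) when an inf/nan shows up. This is an illustrative scalar sketch of the mechanism, not PyTorch's GradScaler API:

```python
# Hypothetical sketch of loss scaling: unscale the gradients, skip the step and
# halve the scale on overflow (growth of the scale on success is omitted).
def unscale_and_check(grads, scale):
    unscaled = [g / scale for g in grads]
    overflow = any(g != g or g in (float("inf"), float("-inf")) for g in unscaled)
    if overflow:
        return None, scale / 2.0   # skip this optimizer step, back off the scale
    return unscaled, scale         # safe to step with the unscaled gradients
```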