Commits · 7abaa2bedb06a7a128ea805699fb37abdc1560ca · OpenDAS / fairscale

29 Dec, 2020 1 commit
- [hotfix] Catching properly a given test failing if not enough gpus (#274) · 7abaa2be
  Benjamin Lefaudeux authored Dec 28, 2020
```
* catching properly a given test failing if not enough gpus
```
  7abaa2be
28 Dec, 2020 1 commit
- [chore] Move all unit tests dist init to being file based (#272) · b640cab5
  Benjamin Lefaudeux authored Dec 28, 2020
```
* file based dist init
* nicer handling of world sizes that do not match the number of available GPUs: warn instead of breaking
```
  b640cab5
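
The two bullet points of this commit can be illustrated with a minimal sketch. It assumes that "warn out" means shrinking the world size to the number of GPUs present (the actual tests may instead skip). Both helper names, `usable_world_size` and `file_init_method`, are hypothetical and do not come from fairscale. The returned `file://` URL is the kind of string that `torch.distributed.init_process_group` accepts as its `init_method` argument.

```python
import os
import tempfile
import warnings


def usable_world_size(requested: int, available_gpus: int) -> int:
    """Return a world size that fits the GPUs present, warning instead of failing."""
    if requested > available_gpus:
        # Assumption: fall back to what is available rather than raising.
        warnings.warn(
            f"Requested world size {requested} exceeds the {available_gpus} "
            f"available GPUs; using {available_gpus} instead."
        )
        return available_gpus
    return requested


def file_init_method() -> str:
    """Build a file:// rendezvous URL backed by a fresh temporary file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)  # only the path is needed; the processes open the file themselves
    return "file://" + path
```

A file-based rendezvous avoids the port collisions that can occur when many tests initialize a TCP rendezvous on the same host.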
19 Dec, 2020 1 commit

[OSS] Getting rid of the "should bucket" hash table, just use a list + non... · ca74ee22

Benjamin Lefaudeux authored Dec 19, 2020

[OSS] Getting rid of the "should bucket" hash table, just use a list + non trainable params fix (#259)

* Getting rid of the "should bucket" hash table, just use a list
Properly handle all params, with or without requires_grad

* make sure that this case is unit tested

ca74ee22

10 Dec, 2020 1 commit

[fix] Check ShardedDDP / DDP parity + bugfix (#242) · 138b2033

Benjamin Lefaudeux authored Dec 09, 2020

* unit test checking ddp and sharded_ddp equivalence, reproducing the issue that Sean spotted
* fixing the issue, not counting requests in flight properly
* adding a multiple optimizers case

138b2033

04 Dec, 2020 1 commit

[fix] Fix iGPT buckets with ShardedDDP (#223) · 6d223777

Benjamin Lefaudeux authored Dec 03, 2020

* proper unit testing, but no solution other than disabling bucketing for now; the couple of options tested do not work

6d223777

01 Dec, 2020 2 commits
- [chore] Refactor unit testing, shared utils (#218) · e83da060
  Benjamin Lefaudeux authored Dec 01, 2020
  
  e83da060
- [fix][Pipe] fallback for Pipe tests on internal pytorch numbering (#216) · 4d8f2e59
  Benjamin Lefaudeux authored Nov 30, 2020
```
* fallback on internal pytorch numbering
```
  4d8f2e59
21 Nov, 2020 1 commit

[feat] ShardedDataParallel with autoreduce (#157) · ad933b34

Benjamin Lefaudeux authored Nov 21, 2020

* rewrite using autograd and Variable execution queue to make the reduce automatic
* share buckets with OSS to remove duplication
* some speed is likely still on the table, since the speed-up from bucketing does not match expectations; this could be a follow-up

ad933b34

18 Nov, 2020 1 commit
- fix bug (#193) · f80b303c
  Tom Birch authored Nov 17, 2020
  
  f80b303c
11 Nov, 2020 2 commits
- [fix] moe: fix bug for multiple experts per-gpu case (#184) · 317c0945
  msbaines authored Nov 11, 2020
  
  317c0945
- [refactor] moe: remove G dimension (#183) · 89176e34
  msbaines authored Nov 11, 2020
  
  89176e34
10 Nov, 2020 1 commit

Single-process control via PipeRPCWrapper (#156) · 5d4f50fb

Tom Birch authored Nov 10, 2020

Adds support for:
* Reused layers (e.g. for weight sharing)
* Lazily-constructed layers
* Single-process control via PipeRPCWrapper
* PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive

Also added examples for multi-process and PipeRPCWrapper

5d4f50fb
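
The per-worker event loop mentioned above can be pictured with the following sketch. It is a simplified illustration, not the fairscale implementation: the message format (`kind`, `payload`), the `"stop"` sentinel, and the function name `run_event_loop` are all assumptions made for this example.

```python
import queue
from typing import Any, Callable, List


def run_event_loop(
    inbox: "queue.Queue",
    on_activation: Callable[[Any], Any],
    on_gradient: Callable[[Any], Any],
) -> List[Any]:
    """Process messages in arrival order until a "stop" message is received."""
    results = []
    while True:
        kind, payload = inbox.get()  # blocks until the next message arrives
        if kind == "stop":
            break
        if kind == "activation":
            results.append(on_activation(payload))  # forward work
        elif kind == "gradient":
            results.append(on_gradient(payload))  # backward work
        else:
            raise ValueError(f"unknown message kind: {kind!r}")
    return results


inbox = queue.Queue()
for message in [("activation", 1), ("gradient", 2), ("stop", None)]:
    inbox.put(message)
handled = run_event_loop(inbox, lambda x: ("forward", x), lambda g: ("backward", g))
```

Because the loop handles whichever message arrives next, a worker does not need a fixed forward/backward schedule.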

30 Oct, 2020 1 commit
- [chore] add circleci testing of torch==1.5.1 (#172) · 4247f602
  msbaines authored Oct 29, 2020
  
  4247f602
29 Oct, 2020 1 commit
- [chore] update to torch v1.7.0 (#171) · ace61a41
  msbaines authored Oct 28, 2020
  
  ace61a41
23 Oct, 2020 1 commit
- [feat] moe: add support for multiple experts per device (#161) · 339cf060
  msbaines authored Oct 23, 2020
  
  339cf060
21 Oct, 2020 1 commit
- [test] moe: add a more thorough MOELayer routing test (#151) · c6d9be79
  msbaines authored Oct 20, 2020
  
  c6d9be79
20 Oct, 2020 1 commit

[test] fine tune test for checkpoint & DDP (#148) · 66b2b514

Min Xu authored Oct 20, 2020

- fixed typing
- make it run less often to reduce CI time

testing: ran it in a loop to make sure it runs at the right frequency.

66b2b514

17 Oct, 2020 1 commit
- [cleanup] moe: rename moelayer.py to moe_layer.py (#141) · 61234360
  msbaines authored Oct 16, 2020
  
  61234360
16 Oct, 2020 2 commits
- [feat] moe: annotate expert params (#140) · ee88bb19
  msbaines authored Oct 16, 2020
```
The expert annotation is used by clip_grads and DDP.
```
  ee88bb19
- [feat] moe: add all_to_all backward support (#137) · d99c445a
  msbaines authored Oct 16, 2020
  
  d99c445a
14 Oct, 2020 1 commit
- [feat] moe: add all_to_all support (#134) · 6d802f5a
  msbaines authored Oct 13, 2020
  
  6d802f5a
08 Oct, 2020 2 commits

[feat] moe: initial implementation of MOELayer (#128) · 22ff665d
msbaines authored Oct 08, 2020
```
Currently only implemented for a single process and expert.
```
22ff665d

[test] Add unittest for checkpoint & DDP (#126) · 6658be22

Min Xu authored Oct 07, 2020

* Add unittest for checkpoint & DDP

- this change adds test cases to reproduce the error with checkpoint & DDP
- Mandeep mentioned that there is also a deadlock in this case, but this
  change doesn't cover that.
- we cover cases where weight sharing is OK
- however, checkpointing the same module multiple times and
  find_unused_parameters are both not OK

* added norm checks

6658be22

06 Oct, 2020 1 commit

[feat] OSS/SDP : bucketing (#122) · 341d8b2b

Benjamin Lefaudeux authored Oct 05, 2020

Same bucketing strategy for OSS and SDP:
sort everything ahead of time, per rank and per size, smallest tensors first. Bucket the smallest elements in a fixed buffer and send it asynchronously, then send all the others asynchronously, and get back to the bucket. Once done, scatter the contents if needed.

341d8b2b
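
The planning step of the bucketing strategy described above can be sketched in plain Python. This is an illustration under stated assumptions, not the fairscale code: the function name `plan_buckets`, the dictionary inputs, and the `bucket_capacity` parameter (counted in elements) are hypothetical, and the asynchronous sends themselves are left out.

```python
from typing import Dict, List, Tuple


def plan_buckets(
    param_sizes: Dict[str, int],
    owner_rank: Dict[str, int],
    bucket_capacity: int,
) -> Dict[int, Tuple[List[str], List[str]]]:
    """For each destination rank, split parameters into (bucketed, sent_directly)."""
    # Group parameter names by the rank they are sent to.
    per_rank: Dict[int, List[str]] = {}
    for name, rank in owner_rank.items():
        per_rank.setdefault(rank, []).append(name)

    plan: Dict[int, Tuple[List[str], List[str]]] = {}
    for rank, names in per_rank.items():
        names.sort(key=lambda n: param_sizes[n])  # smallest tensors first
        bucketed: List[str] = []
        direct: List[str] = []
        used = 0
        for name in names:
            size = param_sizes[name]
            if used + size <= bucket_capacity:
                bucketed.append(name)  # packed into the fixed buffer
                used += size
            else:
                direct.append(name)  # too large, sent on its own
        plan[rank] = (bucketed, direct)
    return plan


sizes = {"a": 10, "b": 200, "c": 5, "d": 50}
owners = {"a": 0, "b": 0, "c": 1, "d": 0}
plan = plan_buckets(sizes, owners, bucket_capacity=64)
```

Sorting ahead of time means the plan is computed once and reused at every step; only the small tensors pay the cost of a copy into the buffer.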

05 Oct, 2020 1 commit
- [fix] moe: fix Top2Gate to work on GPU (#124) · 662667d0
  msbaines authored Oct 05, 2020
  
  662667d0
02 Oct, 2020 1 commit
- [feat] moe: initial implementation of Top2Gating (#118) · 7815f6f3
  msbaines authored Oct 01, 2020
  
  7815f6f3
29 Sep, 2020 1 commit
- [ShardedDDP] Sync buffers + small cleanup (#112) · 79ded821
  Benjamin Lefaudeux authored Sep 28, 2020
```
- adding the buffer broadcast option
- minor cleanup in shardedDDP
```
  79ded821
17 Sep, 2020 2 commits

Multi-process pipe (#90) · 63f7796a

Tom Birch authored Sep 17, 2020

Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines)
* Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess, but also supporting PipelineStyle.MultiProcess
* Added support for lazy construction of modules (see lazy_construction for an example)
* Added two implementations of inter-process communication: one based on rpc with globally visible queues, one based on send/recv
* Copied all the relevant tests from tests/pipe to tests/pipe_process and modified them to exercise PipelineStyle.MultiProcess

63f7796a

[feat] Sharded DDP - small refactor and new features (#97) · 49a198c9

Benjamin Lefaudeux authored Sep 17, 2020

- rename oss_ddp to ShardedDataParallel
- some refactoring
- ShardedDataParallel owns the sharded optimizer, exposed if need be
- some small perf bumps

49a198c9

28 Aug, 2020 1 commit
- [fix] fix eval for oss_ddp (#55) · 8c8eb8e8
  Min Xu authored Aug 28, 2020
```
- added train(mode) method to be aware of eval mode
```
  8c8eb8e8
06 Aug, 2020 1 commit
- [feat] add ddp that works with oss with reduce() not all_reduce() (#19) · 525e709b
  Min Xu authored Aug 06, 2020
```
Co-authored-by: Min Xu <m1n@fb.com>
```
  525e709b
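
The difference between the two collectives named in this commit title can be shown with a small simulation. It is a sketch under assumptions: gradients are single floats, `owner[p]` names the rank that holds the optimizer state for parameter `p`, and the function names are hypothetical. No real communication takes place.

```python
from typing import List, Optional


def simulate_all_reduce(grads_per_rank: List[List[float]]) -> List[List[float]]:
    """Every rank ends up with the summed gradient of every parameter."""
    totals = [sum(values) for values in zip(*grads_per_rank)]
    return [list(totals) for _ in grads_per_rank]


def simulate_reduce_to_owner(
    grads_per_rank: List[List[float]], owner: List[int]
) -> List[List[Optional[float]]]:
    """Only the owning rank receives the summed gradient of each parameter."""
    totals = [sum(values) for values in zip(*grads_per_rank)]
    return [
        [total if owner[p] == rank else None for p, total in enumerate(totals)]
        for rank in range(len(grads_per_rank))
    ]


grads = [[1.0, 2.0], [3.0, 4.0]]  # grads[rank][param]
everywhere = simulate_all_reduce(grads)
owners_only = simulate_reduce_to_owner(grads, owner=[0, 1])
```

With a sharded optimizer, each rank only updates the parameters it owns, so delivering the summed gradient to the owner alone is sufficient.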
31 Jul, 2020 1 commit
- [feat] Model parallel (#3) · 30f5009a
  Tom Birch authored Jul 22, 2020
  
  30f5009a
08 Jul, 2020 1 commit
- Initial commit · 0cd65242
  Mandeep Singh Baines authored Jul 07, 2020
  
  0cd65242