Commits · 15512d9e071c861f6cd9129eaa678d9d93ec371e · OpenDAS / fairscale

23 Feb, 2021 1 commit

Add FullyShardedDataParallel (FSDP) (#413) · 15512d9e

Myle Ott authored Feb 22, 2021

Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper.

Compared to PyTorch DDP:
* FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
* FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
* FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
    * all-gather parameters at start of forward pass and start of backward pass
    * reduce-scatter grads at end of backward pass

Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
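
The communication figures in the comparison above can be checked with a small back-of-the-envelope calculation. This is only a sketch under stated assumptions, not code from fairscale: it assumes that DDP's gradient all-reduce costs about as much as one reduce-scatter plus one all-gather, and that each all-gather or reduce-scatter moves a volume proportional to the parameter count. The function names `ddp_comm_volume` and `fsdp_comm_volume` are hypothetical and exist only for this illustration.

```python
def ddp_comm_volume(num_params: int) -> int:
    """Approximate per-step communication volume for DDP, in parameter units.

    Assumption: an all-reduce of the gradients is modelled as one
    reduce-scatter followed by one all-gather, i.e. two units of traffic.
    """
    return 2 * num_params


def fsdp_comm_volume(num_params: int, reshard_after_forward: bool) -> int:
    """Approximate per-step communication volume for FSDP, in parameter units.

    Assumption: every all-gather and every reduce-scatter costs one unit.
    """
    # With resharding, parameters are all-gathered at the start of the forward
    # pass and again at the start of the backward pass; without it, the
    # parameters gathered for the forward pass are kept for the backward pass.
    all_gathers = 2 if reshard_after_forward else 1
    # Gradients are reduce-scattered once, at the end of the backward pass.
    reduce_scatters = 1
    return (all_gathers + reduce_scatters) * num_params


if __name__ == "__main__":
    n = 1_000_000  # hypothetical parameter count
    base = ddp_comm_volume(n)
    for flag in (False, True):
        vol = fsdp_comm_volume(n, reshard_after_forward=flag)
        print(f"reshard_after_forward={flag}: {vol / base:.2f}x DDP traffic")
```

Under this simplified model, `reshard_after_forward=False` matches DDP (two units of traffic per step), while `reshard_after_forward=True` adds a second all-gather, giving three units instead of two, which is the 50% increase quoted in the commit message.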

15512d9e

29 Jan, 2021 1 commit

[test]: test with py39 + torch 1.8 nightly (#339) · e348806b

Min Xu authored Jan 29, 2021

* [test]: test with py39 + torch 1.8 nightly

* version fix

* more fix

* fix version function for nightly version

* fix torch_pg build

* invalidate cache

* separate benchmark requirements

* comment

* fixed mypy

* fixed a test

e348806b

25 Jan, 2021 1 commit

[test] cover python 3.7 to 3.9 on CPU (#303) · 8459634f

Min Xu authored Jan 25, 2021

* [test] cover python 3.7 to 3.9 on CPU

- covering common python versions on CPU tests
- added doc build test

* add doc build test

* skipping failing tests on py39

* catching doc build warnings

* add doc build to py38 and py39

* minor fix

* fix doc build for adascale

* removed dead code

* fix the skipping

* skip unit test for py39

* add failing example

* no more py39 skipping the tests

8459634f

21 Jan, 2021 1 commit
- [feat] Enabling ViT in OSS benchmarks (#322) · 8a49a748
  Benjamin Lefaudeux authored Jan 21, 2021
  
  8a49a748
05 Jan, 2021 1 commit

[fix] Flaky tests (#283) · 79365ee6

Benjamin Lefaudeux authored Jan 04, 2021

* adding the pytest timeout plugin to properly root out hanging tests
* removing redundant code, slightly more reasonable timeout, works on single cuda
* finding the root bug for some of the cpu hangs, rpc init
* propagating all the rpc init test changes to the pipe and model parallel tests

79365ee6

30 Oct, 2020 1 commit
- [chore] add circleci testing of torch==1.5.1 (#172) · 4247f602
  msbaines authored Oct 29, 2020
  
  4247f602
29 Oct, 2020 1 commit
- [chore] update to torch v1.7.0 (#171) · ace61a41
  msbaines authored Oct 28, 2020
  
  ace61a41
28 Oct, 2020 1 commit
- [chore] update isort to 5.6.4 (#170) · ea9876e3
  msbaines authored Oct 27, 2020
  
  ea9876e3
14 Oct, 2020 1 commit
- [feat] moe: add all_to_all support (#134) · 6d802f5a
  msbaines authored Oct 13, 2020
  
  6d802f5a
21 Aug, 2020 1 commit

[feat] Simple macro OSS benchmark (#47) · 46c3776b

Benjamin Lefaudeux authored Aug 21, 2020



* initial commit, dummy training loop, pure pytorch but not DDP

* probably slightly broken, but rough DDP benchmark run

* adding the torchvision requirement for testing

* brainfart

* reduce the loss, do something slightly distributed

* Some cleanup, distributing the training on two GPUs

* some cleanup + adding a vanilla run, still not good to go

* less silly defaults, gtg for a start I think

* smaller batch to fit the smaller gpus used in the circleci rigs

* Adding some options for the benchmark, and regression testing

* [test] set torch seed for Adam tests (#49)

Set the torch seed for tests. xfail mixed precision and memory-efficient mixed-precision state_dict tests due to their states being cast to FP16 and back to FP32 during load_state_dict.
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

* linting, I really need to automate this isort insanity
Co-authored-by: Jun Ru Anderson <33384298+andersonic@users.noreply.github.com>
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>

46c3776b

13 Aug, 2020 1 commit
- [chore] run tests on PyTorch 1.6.0 and gpu tests on 1.6.0 and 1.5.1 (#34) · 571f5efa
  msbaines authored Aug 13, 2020
  
  571f5efa
06 Aug, 2020 1 commit
- add pytest coverage (#24) · 4cd2590c
  Min Xu authored Aug 06, 2020
  Co-authored-by: Min Xu <m1n@fb.com>
  4cd2590c
31 Jul, 2020 1 commit
- [test] Use PyTorch v1.5 for ci (#7) · 8634280c
  msbaines authored Jul 22, 2020
  
  8634280c
08 Jul, 2020 1 commit
- Initial commit · 0cd65242
  Mandeep Singh Baines authored Jul 07, 2020
  
  0cd65242