Commits · 29d81c43fa06b3f23f7128a08950e678c26709f8 · OpenDAS / fairscale

08 May, 2021 4 commits
- [perf] nn.moe: replace einsum with faster equivalent code (#667) · 29d81c43
  msbaines authored May 08, 2021
```
Co-authored-by: @myleott
```
  29d81c43
- [chore][benchmarks] Add license file headers for all files in fairscale/benchmarks (#670) · a9156260
  anj-s authored May 08, 2021
```
* add license file headers for all files

* fix lint
```
  a9156260
- [test] Force overflow in top2gating test (#664) · 29c01fb1
  Sam Shleifer authored May 08, 2021
  
  29c01fb1
- [chore] Rename and move utils.py from optim/ to utils/ (#669) · 5739930f
  anj-s authored May 07, 2021
```
* rename and move optim/utils.py

* attach the new file
```
  5739930f
07 May, 2021 3 commits

[perf] nn.moe: workaround inefficiency in PyTorch's one_hot (#666) · 99b30a04
msbaines authored May 07, 2021
```
Workaround for https://github.com/pytorch/pytorch/issues/55579

Co-authored-by: @shruti-bh, @myleott
```
99b30a04

[fix]: support pytorch SyncBatchNorm under AMP & checkpointing with FSDP (#659) · 6db68518

Min Xu authored May 07, 2021



* [test]: add a more general test case

- also rebalance the tests a bit

* added missing arg

* balance

* better checking

* balance

* make test smaller and faster

* make ddp results cached and enable sync_bn

* clean up

* fix tests

* changelog

* blance

* fix

* addressing comments
Co-authored-by: Min Xu <min.xu@acm.org>

6db68518

[feat] experimental.nn.SyncBatchNorm: initial commit (#662) · f0a40046

msbaines authored May 07, 2021

* [feat] experimental.nn.SyncBatchNorm: initial commit

Fast/simple re-implementation of SyncBatchNorm.

When profiling SSL Vision, I was seeing a majority of cycles spent in
SyncBatchNorm. With this change, I see a 10% to 20% speedup on the
model I was profiling.

When running benchmarks/experimental/sync_batchnorm.py on 8 x V100,
I get a 6x speedup:

<class 'torch.nn.modules.batchnorm.BatchNorm2d'>
Elapsed time is  0.08709120750427246
Elapsed time is  0.12632274627685547
Elapsed time is  0.14095258712768555
Elapsed time is  0.16529417037963867
Elapsed time is  0.1419970989227295
Elapsed time is  0.15166854858398438
Elapsed time is  0.12000870704650879
Elapsed time is  0.17534875869750977
<class 'torch.nn.modules.batchnorm.SyncBatchNorm'>
Elapsed time is  2.5087168216705322
Elapsed time is  2.497001886367798
Elapsed time is  2.5204885005950928
Elapsed time is  2.526789903640747
Elapsed time is  2.5080230236053467
Elapsed time is  2.524489641189575
Elapsed time is  2.513214588165283
Elapsed time is  2.5359973907470703
<class 'fairscale.experimental.nn.sync_batchnorm.SyncBatchNorm'>
Elapsed time is  0.4126114845275879
Elapsed time is  0.39051294326782227
Elapsed time is  0.40685415267944336
Elapsed time is  0.4159870147705078
Elapsed time is  0.42383885383605957
Elapsed time is  0.4080159664154053
Elapsed time is  0.41202712059020996
Elapsed time is  0.42400121688842773

f0a40046

05 May, 2021 6 commits

[fix] better assert and better test for frozen weights (#657) · b54eed1b

Min Xu authored May 05, 2021



* [fix] better assert and better test for frozen weights

- the precise condition should have been check m.parameters(), not
  m.params.
- fixes #643

* add changelog

* use enum is so much better
Co-authored-by: Min Xu <min.xu@acm.org>

b54eed1b

[fix][adascale] Fix infinite loop in docstring (#656) · 1ae77784
anj-s authored May 05, 2021
```
* fix infinite loop in docstring

* fix docstring
```
1ae77784
[draft][chore] SDP : increase code coverage (#653) · 69cbdf5d
Benjamin Lefaudeux authored May 05, 2021
```
* increasing the code coverage, good practice and raising bugs.  hopefully getting to 100%
* small bugfix
```
69cbdf5d
[chore] Rename misc.py to better reflect functionality. (#652) · c65a48f3
anj-s authored May 04, 2021
```
* rename files

* add newly renamed file
```
c65a48f3
add info about PEP8 style guide (#651) · 0ce85af2
anj-s authored May 04, 2021
```
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
```
0ce85af2

[fix] add clear_autocast_cache flag (#650) · 861b5ce2

Min Xu authored May 04, 2021



* [fix] add clear_autocast_cache flag

- when training in AMP model with weight dtype32, FSDP may need to
  optionally clear the autocast cache to avoid GPU OOM
- this flag is default false, automatically doing it is a future TODO
- also added a verbose flag to make print(fsdp_model) a bit shorter
- updated the memory test to cover those new code
- added a couple of useful functions in parallel.py and testing.py

* minor

* address comments

* format

* improve the test
Co-authored-by: Min Xu <min.xu@acm.org>

861b5ce2

04 May, 2021 1 commit

[feat]Adding DynamicLossScaler class for supporting optimizer updates on the CPU (#635) · 14d1f78c

tmarkstrum authored May 03, 2021

* dynamic loss scaler

* isort

* black

* flake8

* comments

* added the test to ci file, added a line to catch the overflow error, fixed some formatting errors

* adding type annotation

* added todo for adding more test cases for handling Nan gradients

* fix some doc string and comments, add more tods

* fix two doc strings

14d1f78c

03 May, 2021 2 commits

[fix] SDP: expose module property fix + unit test (#647) · 4e438ba1
Benjamin Lefaudeux authored May 03, 2021
```
* fix + unit test
* changelog update
```
4e438ba1

[minor] not creating a temp file on import (#641) · b66168da

Min Xu authored May 03, 2021



* [minor] not creating a temp file on import

* address review

* Revert "address review"

This reverts commit f65eb9bc7f7ea8829b1ac0a369ef9a3e6b56420a.
Co-authored-by: Min Xu <min.xu@acm.org>

b66168da

30 Apr, 2021 1 commit
- [test] nn.Pipe: add a parity test that also tests with amp (#645) · fee979d9
  msbaines authored Apr 30, 2021
  
  fee979d9
29 Apr, 2021 2 commits
- [test][refactor][SDP] Using the nice context-based tempfiles (#640) · 3b7373e2
  Benjamin Lefaudeux authored Apr 29, 2021
  
  3b7373e2
- [test][minor] Improving SDP test coverage (#639) · 8c8a625a
  Benjamin Lefaudeux authored Apr 29, 2021
```
* Improving test coverage on SDP
* using pytest exception catcher
```
  8c8a625a
28 Apr, 2021 4 commits

[test] improve BN test coverage (#638) · 21cba91b

Min Xu authored Apr 28, 2021



* [test] improve BN test coverage

- Added sync_bn on/off cases
- Added conv and linear bias on/off cases
- clarified when sync_bn is off, when is BN wrapping needed with the test

* adding a comment
Co-authored-by: Min Xu <min.xu@acm.org>

21cba91b

adding auto graph generation for distributed pipeline (#615) · bdc0581b

Mehdi Mirzazadeh authored Apr 28, 2021

* adding auto graph generation for distributed pipeline

* ignore trace.py for my for now, since it needs pytorch 1.8

* fixing tests

* simplifying graph api

* remove unused debug utilities

* use inspect to find argument lists

* use sharded linear layer

* flkae8

* comment

* polishing

* polishing

bdc0581b

[chore] do not build cuda extensions by default (#634) · 2bb2a134
msbaines authored Apr 27, 2021

2bb2a134

[feat] save memory by using bucket buffer only in backward (#633) · a5594032

Min Xu authored Apr 27, 2021



* [feat] save memory by using bucket buffer only in backward

- this fixes bug #627
- added documentation to clarify the buffer's cost and speed/memory
  tradeoff
- added setup/teardown calls so that the buffer is only allocated
  during the backward pass, saving more memory for forward and stepping
  so that they can be used for things like activations.
- added a unit test that assert the memory is in range.

Comparing with DDP:

  1. buffer size scales with # of FSDP not model size
  2. buffer is only allocated during backward
  3. buffer is used for small tensors only to reduce overhead
  4. overlapping of compute-reduction is very different

* add PR number to changelog

* filled in with memory number on 1.9

* addressed comments

* update comments

* fix for 1.6

* add a todo
Co-authored-by: Min Xu <min.xu@acm.org>

a5594032

27 Apr, 2021 1 commit
- [chore] OSS - adding the profiler labels (#629) · 9b79cc02
  Benjamin Lefaudeux authored Apr 26, 2021
  
  9b79cc02
26 Apr, 2021 4 commits

[chore] SDP - adding the profiler labels (#630) · 85dea5b2
Benjamin Lefaudeux authored Apr 26, 2021
```
* adding the labels
* longer labels, following aten::
```
85dea5b2
[chore] PR Template update, mention Changelog (#632) · 38ce54b7
Benjamin Lefaudeux authored Apr 26, 2021

38ce54b7

[chore] 0.3.6 release (#631) · 36da9d6e

Min Xu authored Apr 26, 2021



* [chore] 0.3.6 release

* try redo the caches
Co-authored-by: Min Xu <min.xu@acm.org>

36da9d6e

[fix]: let FSDP handle model with multiple forward pass and checkpoint (#621) · a1612d79

Min Xu authored Apr 26, 2021



* [fix]: let FSDP handle model with multiple forward pass and checkpoint

* try CI again

* save

* save

* fixed case with bn

* minor

* add the new file

* minor

* added test of a single case, runtime is about 50s

* enable all 8 test cases

* cleanup

* cleanup

* skip flatten case with 1.6 and 1.7

* minor
Co-authored-by: Min Xu <min.xu@acm.org>

a1612d79

23 Apr, 2021 2 commits

[fix] check before calling _specify_ddp_gpu_num (#626) · 5cddaea4

Min Xu authored Apr 23, 2021



- this function is being removed in pytorch
- we only need to call it in case we are working with older pytorch
Co-authored-by: Min Xu <min.xu@acm.org>

5cddaea4

[FSDP] relax checking root condition (#620) · d3b86d65

shuyingsunshine21 authored Apr 22, 2021

* relax checking root condition

* formatting

* add unittest

* add unittest to ci test list

* isort for import of unittest

* format black .

* move test to list 1

* add skip no cuda

* black and isort

d3b86d65

22 Apr, 2021 3 commits

[fix] mypy and flaky test (#624) · 961df76e

Min Xu authored Apr 22, 2021



* [fix] mypy and flaky test

- CI didn't seem to catch this or maybe I merged incorrectly yesterday
- this should fix the mypy error on master
- also updated a test that seems to be flaky due to tcp port conflict

* another flaky test, hopefully more determinism helps

* CR

* skip 1.6

* fix

* minor
Co-authored-by: Min Xu <min.xu@acm.org>

961df76e

[SDP] removing an assert which does not seem always accurate (#625) · 85962b97
Benjamin Lefaudeux authored Apr 22, 2021

85962b97

Fixing logging, changing info to debug to avoid clutter (#622) · b0048b28

girifb authored Apr 21, 2021



* Changing FSDP init to by pass pg validation for freshly minted pgs inside of init.

* Addressing Min's review comments.

* Changing logging in init to debug from info

* Changing logging in init to debug from info
Co-authored-by: Giri Anantharaman <giriman@devfair0439.h2.fair>

b0048b28

21 Apr, 2021 2 commits

Changing FSDP init to by pass pg validation (#619) · f768eb93

girifb authored Apr 21, 2021



* Changing FSDP init to by pass pg validation for freshly minted pgs inside of init.

* Addressing Min's review comments.
Co-authored-by: Giri Anantharaman <giriman@devfair0439.h2.fair>

f768eb93

[chore] OSS to 100% coverage (#618) · b0e6b9bd
Benjamin Lefaudeux authored Apr 20, 2021

b0e6b9bd

20 Apr, 2021 1 commit
- [FSDP] Consolidate cpu_adam optimizer state dict (#607) · d9f36130
  Sam Shleifer authored Apr 20, 2021
  
  d9f36130
19 Apr, 2021 2 commits

[chore] 0.3.5 release (#616) · 1141528e

Min Xu authored Apr 19, 2021



* [chore] 0.3.5 release

* address comment
Co-authored-by: Min Xu <min.xu@acm.org>

1141528e

FSDP: fixing training with freezing weights (#614) · 24da3b11

Min Xu authored Apr 18, 2021



* FSDP: fixing training with freezing weights

- an assert is changed to catch this case correctly
- unit test added (based on Quentin's test code) for this case and
  compare DDP and FSDP

fixes: #610

* added test file to list 1

* Use better and simpler code as suggested by Myle

* testing both methods of freezing as well
Co-authored-by: Min Xu <min.xu@acm.org>

24da3b11

15 Apr, 2021 2 commits

[chore][SDP] privatizing all the things (#611) · c084b202
Benjamin Lefaudeux authored Apr 15, 2021

c084b202

[fix] Revert change that removed the option to run OffloadModel with out... · a77c56f0

anj-s authored Apr 14, 2021


[fix] Revert change that removed the option to run OffloadModel with out activation checkpointing. (#608)

* revert change made

* add tests and revert sync shard changes

* add tests

* remove file checked in by error

* inine var

* fix lint errors

* add checkpoint activation

* fix mypy

* use a bigger model

* modify tests for now

* resolve conflicts
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

a77c56f0