Commits · 72c6bab24f398dbc583a26508dd9ee1f3dbc4fc2 · OpenDAS / fairscale

12 May, 2021 1 commit

[chore] Rename and move checkpoint_activations from misc folder. (#654) · 72c6bab2

anj-s authored May 12, 2021

* rename files

* add newly renamed file

* rename and move checkpoint activations related files

* add test files to ci list

* fix lint errors

* modify docs

* add changelog

* retain old path for now

* fix lint errors

* add another import test case

* fix merge conflict

* add missing test file

72c6bab2

11 May, 2021 1 commit

[fix] FSDP forward pass overlap between compute and all-gather (#671) · 8a42a8e3

Min Xu authored May 10, 2021



* [fix] FSDP forward pass overlap between compute and all-gather

- many thanks to @cyanguwa for the report and @QuentinDuval for debugging it
- a new unit test checks for this and ensures we detect issues with
  overlapping and CPU/GPU blocking wait calls

* fix

* fix

* fix

* better assertion outputs

* fix format and tune all_gather mb for CI

* more tuning with non_flatten

* undo an accidental change

* tuning all gather mb and del model

* Update + fix overlapping test to use patched all_gather w/ delay (#672)

* fixing get_cycles_per_ms

* add get_smi_memory

* update the docstring
Co-authored-by: Min Xu <min.xu@acm.org>
Co-authored-by: Myle Ott <myleott@fb.com>

8a42a8e3

08 May, 2021 2 commits
- [test] Force overflow in top2gating test (#664) · 29c01fb1
  Sam Shleifer authored May 08, 2021
  
  29c01fb1
- [chore] Rename and move utils.py from optim/ to utils/ (#669) · 5739930f
  anj-s authored May 07, 2021
```
* rename and move optim/utils.py

* attach the new file
```
  5739930f
07 May, 2021 2 commits

[fix]: support pytorch SyncBatchNorm under AMP & checkpointing with FSDP (#659) · 6db68518

Min Xu authored May 07, 2021



* [test]: add a more general test case

- also rebalance the tests a bit

* added missing arg

* balance

* better checking

* balance

* make test smaller and faster

* make ddp results cached and enable sync_bn

* clean up

* fix tests

* changelog

* balance

* fix

* addressing comments
Co-authored-by: Min Xu <min.xu@acm.org>

6db68518

[feat] experimental.nn.SyncBatchNorm: initial commit (#662) · f0a40046

msbaines authored May 07, 2021

* [feat] experimental.nn.SyncBatchNorm: initial commit

Fast/simple re-implementation of SyncBatchNorm.

When profiling SSL Vision, I was seeing a majority of cycles spent in
SyncBatchNorm. With this change, I see a 10% to 20% speedup on the
model I was profiling.

When running benchmarks/experimental/sync_batchnorm.py on 8 x V100,
I get a 6x speedup:

<class 'torch.nn.modules.batchnorm.BatchNorm2d'>
Elapsed time is  0.08709120750427246
Elapsed time is  0.12632274627685547
Elapsed time is  0.14095258712768555
Elapsed time is  0.16529417037963867
Elapsed time is  0.1419970989227295
Elapsed time is  0.15166854858398438
Elapsed time is  0.12000870704650879
Elapsed time is  0.17534875869750977
<class 'torch.nn.modules.batchnorm.SyncBatchNorm'>
Elapsed time is  2.5087168216705322
Elapsed time is  2.497001886367798
Elapsed time is  2.5204885005950928
Elapsed time is  2.526789903640747
Elapsed time is  2.5080230236053467
Elapsed time is  2.524489641189575
Elapsed time is  2.513214588165283
Elapsed time is  2.5359973907470703
<class 'fairscale.experimental.nn.sync_batchnorm.SyncBatchNorm'>
Elapsed time is  0.4126114845275879
Elapsed time is  0.39051294326782227
Elapsed time is  0.40685415267944336
Elapsed time is  0.4159870147705078
Elapsed time is  0.42383885383605957
Elapsed time is  0.4080159664154053
Elapsed time is  0.41202712059020996
Elapsed time is  0.42400121688842773

f0a40046
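
The 6x figure quoted in this commit message can be checked against the listed timings. The following is only a sketch: the values are copied from the output above and rounded to four decimals, and the variable names are made up for illustration.

```python
from statistics import mean

# Timings in seconds, copied from the benchmark output above (rounded to 4 decimals).
torch_sync_bn = [2.5087, 2.4970, 2.5205, 2.5268, 2.5080, 2.5245, 2.5132, 2.5360]
fairscale_sync_bn = [0.4126, 0.3905, 0.4069, 0.4160, 0.4238, 0.4080, 0.4120, 0.4240]

# Ratio of the mean elapsed times: torch.nn SyncBatchNorm vs. the experimental one.
speedup = mean(torch_sync_bn) / mean(fairscale_sync_bn)
print(f"speedup: {speedup:.1f}x")  # prints "speedup: 6.1x"
```

The ratio of the means is about 6.1, consistent with the "6x speedup" stated in the message.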

05 May, 2021 3 commits

[fix] better assert and better test for frozen weights (#657) · b54eed1b

Min Xu authored May 05, 2021



* [fix] better assert and better test for frozen weights

- the precise condition should have checked m.parameters(), not
  m.params.
- fixes #643

* add changelog

* using an enum is so much better
Co-authored-by: Min Xu <min.xu@acm.org>

b54eed1b

[draft][chore] SDP : increase code coverage (#653) · 69cbdf5d
Benjamin Lefaudeux authored May 05, 2021
```
* increasing the code coverage, good practice and raising bugs.  hopefully getting to 100%
* small bugfix
```
69cbdf5d

[fix] add clear_autocast_cache flag (#650) · 861b5ce2

Min Xu authored May 04, 2021



* [fix] add clear_autocast_cache flag

- when training in AMP mode with fp32 weights, FSDP may need to
  optionally clear the autocast cache to avoid GPU OOM
- this flag defaults to false; doing it automatically is a future TODO
- also added a verbose flag to make print(fsdp_model) a bit shorter
- updated the memory test to cover the new code
- added a couple of useful functions in parallel.py and testing.py

* minor

* address comments

* format

* improve the test
Co-authored-by: Min Xu <min.xu@acm.org>

861b5ce2

04 May, 2021 1 commit

[feat]Adding DynamicLossScaler class for supporting optimizer updates on the CPU (#635) · 14d1f78c

tmarkstrum authored May 03, 2021

* dynamic loss scaler

* isort

* black

* flake8

* comments

* added the test to ci file, added a line to catch the overflow error, fixed some formatting errors

* adding type annotation

* added todo for adding more test cases for handling NaN gradients

* fix some docstrings and comments, add more todos

* fix two doc strings

14d1f78c
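
For readers unfamiliar with dynamic loss scaling, the general idea can be sketched in plain Python. This is not the DynamicLossScaler class added by this commit: the class name, parameters, and update rule below are assumptions chosen for illustration (halve the scale on overflow, double it after a window of clean steps).

```python
import math


class ToyDynamicLossScaler:
    """Hypothetical, simplified scaler; NOT fairscale's DynamicLossScaler API."""

    def __init__(self, init_scale=2.0 ** 15, scale_factor=2.0, scale_window=3):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window  # clean steps required before growing
        self._good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied, False if skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            # Overflow detected: shrink the scale and skip this step.
            self.scale /= self.scale_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps % self.scale_window == 0:
            # A full window without overflow: try a larger scale.
            self.scale *= self.scale_factor
        return True


scaler = ToyDynamicLossScaler(init_scale=1024.0, scale_window=2)
history = []
for grads in ([0.1, 0.2], [float("inf"), 0.3], [0.1, 0.1], [0.2, 0.2]):
    applied = scaler.update(grads)
    history.append((applied, scaler.scale))
```

In this toy run the second step overflows, so the scale drops from 1024 to 512 and the step is skipped; two clean steps later it grows back to 1024.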

03 May, 2021 1 commit
- [fix] SDP: expose module property fix + unit test (#647) · 4e438ba1
  Benjamin Lefaudeux authored May 03, 2021
```
* fix + unit test
* changelog update
```
  4e438ba1
30 Apr, 2021 1 commit
- [test] nn.Pipe: add a parity test that also tests with amp (#645) · fee979d9
  msbaines authored Apr 30, 2021
  
  fee979d9
29 Apr, 2021 2 commits
- [test][refactor][SDP] Using the nice context-based tempfiles (#640) · 3b7373e2
  Benjamin Lefaudeux authored Apr 29, 2021
  
  3b7373e2
- [test][minor] Improving SDP test coverage (#639) · 8c8a625a
  Benjamin Lefaudeux authored Apr 29, 2021
```
* Improving test coverage on SDP
* using pytest exception catcher
```
  8c8a625a
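
The "context-based tempfiles" mentioned in #640 presumably refer to Python's `tempfile` context managers; the commit body does not show the code, so the snippet below is only a sketch of that pattern, with a hypothetical file name.

```python
import os
import tempfile

# The directory and everything in it is removed when the block exits,
# even if the test body raises.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rendezvous")  # hypothetical file name
    with open(path, "w") as f:
        f.write("init")
    existed_inside = os.path.exists(path)

removed_after = not os.path.exists(path)
```

The benefit over manual cleanup is that a failing test cannot leave stray files behind.
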
28 Apr, 2021 3 commits

[test] improve BN test coverage (#638) · 21cba91b

Min Xu authored Apr 28, 2021



* [test] improve BN test coverage

- Added sync_bn on/off cases
- Added conv and linear bias on/off cases
- clarified, with the test, when BN wrapping is needed if sync_bn is off

* adding a comment
Co-authored-by: Min Xu <min.xu@acm.org>

21cba91b

adding auto graph generation for distributed pipeline (#615) · bdc0581b

Mehdi Mirzazadeh authored Apr 28, 2021

* adding auto graph generation for distributed pipeline

* ignore trace.py for mypy for now, since it needs pytorch 1.8

* fixing tests

* simplifying graph api

* remove unused debug utilities

* use inspect to find argument lists

* use sharded linear layer

* flake8

* comment

* polishing

* polishing

bdc0581b
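
One bullet in this commit mentions using `inspect` to find argument lists. How the commit does it is not shown here; as a minimal sketch, the standard-library call `inspect.signature` can recover parameter names from a callable. The `forward` function below is a hypothetical stand-in for a module's forward method.

```python
import inspect


def forward(x, mask=None, *, scale=1.0):
    """Hypothetical stage function standing in for a module's forward."""
    return x


params = inspect.signature(forward).parameters

# All parameter names, in declaration order.
arg_names = list(params)
# Parameters without a default value must be supplied by the caller.
required = [n for n, p in params.items() if p.default is inspect.Parameter.empty]
```

Here `arg_names` is `["x", "mask", "scale"]` and `required` is `["x"]`.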

[feat] save memory by using bucket buffer only in backward (#633) · a5594032

Min Xu authored Apr 27, 2021



* [feat] save memory by using bucket buffer only in backward

- this fixes bug #627
- added documentation to clarify the buffer's cost and speed/memory
  tradeoff
- added setup/teardown calls so that the buffer is only allocated
  during the backward pass, saving more memory for forward and stepping
  so that they can be used for things like activations.
- added a unit test that asserts the memory is in range.

Comparing with DDP:

  1. buffer size scales with # of FSDP not model size
  2. buffer is only allocated during backward
  3. buffer is used for small tensors only to reduce overhead
  4. overlapping of compute-reduction is very different

* add PR number to changelog

* filled in with memory number on 1.9

* addressed comments

* update comments

* fix for 1.6

* add a todo
Co-authored-by: Min Xu <min.xu@acm.org>

a5594032
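
The setup/teardown idea described in this commit (allocate the bucket buffer only for the backward pass) can be illustrated without PyTorch. Everything below is an assumption made for illustration: the class, the context manager, and the use of a `bytearray` in place of a GPU tensor are not taken from fairscale.

```python
from contextlib import contextmanager


class ToyBucket:
    """Hypothetical flat buffer that only exists between setup() and teardown()."""

    def __init__(self, num_bytes: int) -> None:
        self.num_bytes = num_bytes
        self.buffer = None  # nothing is allocated until setup()

    def setup(self) -> None:
        if self.buffer is None:
            self.buffer = bytearray(self.num_bytes)

    def teardown(self) -> None:
        self.buffer = None  # drop the reference so the memory can be reclaimed


@contextmanager
def backward_pass(bucket: ToyBucket):
    """Allocate the buffer on entry and release it on exit, even on errors."""
    bucket.setup()
    try:
        yield bucket
    finally:
        bucket.teardown()


bucket = ToyBucket(num_bytes=1024)
allocated_before = bucket.buffer is not None
with backward_pass(bucket) as b:
    allocated_during = len(b.buffer)
allocated_after = bucket.buffer is not None
```

Outside the `with` block the buffer holds no memory, which mirrors the stated goal of leaving more room for activations during forward and stepping.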

26 Apr, 2021 1 commit

[fix]: let FSDP handle model with multiple forward pass and checkpoint (#621) · a1612d79

Min Xu authored Apr 26, 2021



* [fix]: let FSDP handle model with multiple forward pass and checkpoint

* try CI again

* save

* save

* fixed case with bn

* minor

* add the new file

* minor

* added test of a single case, runtime is about 50s

* enable all 8 test cases

* cleanup

* cleanup

* skip flatten case with 1.6 and 1.7

* minor
Co-authored-by: Min Xu <min.xu@acm.org>

a1612d79

23 Apr, 2021 1 commit

[FSDP] relax checking root condition (#620) · d3b86d65

shuyingsunshine21 authored Apr 22, 2021

* relax checking root condition

* formatting

* add unittest

* add unittest to ci test list

* isort for import of unittest

* format black .

* move test to list 1

* add skip no cuda

* black and isort

d3b86d65

22 Apr, 2021 2 commits

[fix] mypy and flaky test (#624) · 961df76e

Min Xu authored Apr 22, 2021



* [fix] mypy and flaky test

- CI didn't seem to catch this or maybe I merged incorrectly yesterday
- this should fix the mypy error on master
- also updated a test that seems to be flaky due to tcp port conflict

* another flaky test, hopefully more determinism helps

* CR

* skip 1.6

* fix

* minor
Co-authored-by: Min Xu <min.xu@acm.org>

961df76e
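
The commit message attributes one flaky test to a TCP port conflict but does not say how it was fixed. One common way to reduce such conflicts, shown here purely as an assumption and not as what this commit does, is to let the OS pick a free port by binding to port 0. The helper name is hypothetical.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 means "choose any free port"
        return s.getsockname()[1]


port = find_free_port()
```

A small race remains: another process could take the port between the socket closing and the test binding it again, so this lowers the chance of a conflict rather than removing it.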

[SDP] removing an assert which does not seem always accurate (#625) · 85962b97
Benjamin Lefaudeux authored Apr 22, 2021

85962b97

19 Apr, 2021 1 commit

FSDP: fixing training with freezing weights (#614) · 24da3b11

Min Xu authored Apr 18, 2021



* FSDP: fixing training with freezing weights

- an assert is changed to catch this case correctly
- unit test added (based on Quentin's test code) for this case and
  compare DDP and FSDP

fixes: #610

* added test file to list 1

* Use better and simpler code as suggested by Myle

* testing both methods of freezing as well
Co-authored-by: Min Xu <min.xu@acm.org>

24da3b11

15 Apr, 2021 1 commit

[fix] Revert change that removed the option to run OffloadModel without... · a77c56f0

anj-s authored Apr 14, 2021


[fix] Revert change that removed the option to run OffloadModel without activation checkpointing. (#608)

* revert change made

* add tests and revert sync shard changes

* add tests

* remove file checked in by error

* inline var

* fix lint errors

* add checkpoint activation

* fix mypy

* use a bigger model

* modify tests for now

* resolve conflicts
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

a77c56f0

13 Apr, 2021 3 commits
- [FSDP] use all_gather for 10X OSD consolidation speedup (#595) · a82825db
  Sam Shleifer authored Apr 13, 2021
  
  a82825db
- replacing multi-process pipe implementation with more flexible one (#567) · 4726d5be
  Mehdi Mirzazadeh authored Apr 13, 2021
```
replacing multi-process pipe implementation with more flexible one

Initial implementation of proposal pytorch/pytorch#55256
```
  4726d5be
- [SDP] Adding a unit test which checks for multiple FW passes on the same block (#596) · b191fe5f
  Benjamin Lefaudeux authored Apr 12, 2021
```
* Adding a unit test which checks for multiple FW passes on the same block
* Adding an embedding table, but still no problem to show for it
```
  b191fe5f
08 Apr, 2021 1 commit
- [fix] [FSDP] optim state dict should be completely on CPU (#590) · a6549be7
  Sam Shleifer authored Apr 08, 2021
  
  a6549be7
07 Apr, 2021 2 commits
- [fix][ShardedDDP] Properly handle .eval() mode (#587) · ce1f2cea
  Benjamin Lefaudeux authored Apr 07, 2021
```
* Properly handle .train() and .eval() modes
* showing that the unit test works, now fixed
* code review
```
  ce1f2cea
- [FSDP] [feat] Add state_dict_device option (#579) · 14abed6e
  Myle Ott authored Apr 07, 2021
  
  14abed6e
06 Apr, 2021 1 commit
- [fix][OSS] two small hotfixes.. repro not obvious for grad_fn (#583) · 121b9db0
  Benjamin Lefaudeux authored Apr 06, 2021
  
  121b9db0
05 Apr, 2021 1 commit
- [OSS/ShardedDDP] making APIs more private (#582) · e41452e8
  Benjamin Lefaudeux authored Apr 05, 2021
```
* making APIs more private
* linting
```
  e41452e8
04 Apr, 2021 3 commits
- [FSDP] add no_broadcast_optim_state option (#560) · 1fcbd624
  Sam Shleifer authored Apr 04, 2021
  
  1fcbd624
- [test] disable test which has started to become flaky (#575) · 54a97ee5
  msbaines authored Apr 04, 2021
```
This test is flaky for torch >= 1.8.0.
```
  54a97ee5
- [fix] OSS - enforce cuda parameters for state consolidation if NCCL backend (#573) · 88553373
  Benjamin Lefaudeux authored Apr 03, 2021
  
  88553373
02 Apr, 2021 1 commit
- [test] modify MOE tests to use NCCL (#570) · 5a3df0da
  msbaines authored Apr 02, 2021
```
NCCL all_to_all is now supported in PyTorch (since v1.8.0)

Fixes: #548
```
  5a3df0da
01 Apr, 2021 1 commit
- [feat] remove old MultiProcessPipe (#563) · 2d3d5a7b
  msbaines authored Apr 01, 2021
  
  2d3d5a7b
31 Mar, 2021 4 commits

[refactor] multiprocess_pipe: only support torch >= 1.9.0 (#561) · 204392e5
msbaines authored Mar 31, 2021

204392e5
[offload] Audit OffloadModel API, add error messages and remove redundant code path. (#557) · 34384e1b
anj-s authored Mar 31, 2021
```
* renaming/adding error messages

* address comments

* address comments

* add more comments

* add more comments
```
34384e1b

[fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed... · a0458b98

Min Xu authored Mar 31, 2021

[fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556)

* [fix] disable single rank process group for auto_wrap_bn

- beefed up unit test with regnet-like model
- found that single-rank process group is causing problems
- disabled it to enable convergence tests on the vissl side
- use `raise e from None` to get a better assertion output
  in testing.py.

* [test] fix regnet test for ddp+mixed_precision

- need AMP context in FSDP
- work around differences between DDP & FSDP when bias=True
- fixed a bug in input data generation that caused different ranks to have
  the same data with wrong iteration count.
- added a TODO for needing a better loss and grad_scaler and reduced
  iters so there is no NaN.
- added (disabled) debugging code

* lint

* lint

* add scaler

* lint

* scaler

* add a real loss

* seeding in the ranks

* balance tests

* run AMP DDP==FSDP test only on cuda version 11 and up

* add relu inplace and comment

* make wrap_bn cover more cases in full precision mode

a0458b98
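
The `raise e from None` idiom mentioned in this commit suppresses the chained "During handling of the above exception..." context so that only the assertion itself is printed. The sketch below shows the language feature in isolation; the function names are hypothetical and this is not the code in testing.py.

```python
def failing_check():
    """Hypothetical check that asserts while another exception is being handled."""
    try:
        {}["missing"]
    except KeyError:
        raise AssertionError("values differ")


def run_check(fn):
    try:
        fn()
    except AssertionError as e:
        # Re-raise without the chained KeyError context in the traceback.
        raise e from None


caught = None
try:
    run_check(failing_check)
except AssertionError as err:
    caught = err
```

After this runs, `caught.__suppress_context__` is `True` and `caught.__cause__` is `None`, so a printed traceback would show only the `AssertionError`.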

[chore] add testing of torch 1.9.0 nightly build (#559) · acb9ef00
msbaines authored Mar 31, 2021

acb9ef00