Commits · 0ce85af25346bdb00f326ad8889dddaf4174e3b1 · OpenDAS / fairscale

05 May, 2021 2 commits

add info about PEP8 style guide (#651) · 0ce85af2
anj-s authored May 04, 2021
```
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
```
0ce85af2

[fix] add clear_autocast_cache flag (#650) · 861b5ce2

Min Xu authored May 04, 2021



* [fix] add clear_autocast_cache flag

- when training in AMP mode with fp32 weights, FSDP may need to
  optionally clear the autocast cache to avoid GPU OOM
- this flag defaults to false; doing it automatically is a future TODO
- also added a verbose flag to make print(fsdp_model) a bit shorter
- updated the memory test to cover the new code
- added a couple of useful functions in parallel.py and testing.py

* minor

* address comments

* format

* improve the test
Co-authored-by: Min Xu <min.xu@acm.org>

861b5ce2

04 May, 2021 1 commit

[feat] Adding DynamicLossScaler class for supporting optimizer updates on the CPU (#635) · 14d1f78c

tmarkstrum authored May 03, 2021

* dynamic loss scaler

* isort

* black

* flake8

* comments

* added the test to ci file, added a line to catch the overflow error, fixed some formatting errors

* adding type annotation

* added todo for adding more test cases for handling NaN gradients

* fix some docstrings and comments, add more todos

* fix two doc strings

14d1f78c
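
The commit above does not show how the scaler adjusts itself, so here is a minimal sketch of the general dynamic loss scaling technique in plain Python. It is not fairscale's `DynamicLossScaler` API: the class name `SketchLossScaler`, its parameters, and its defaults are all assumptions made for illustration. The idea is to shrink the scale when a gradient overflows (inf or NaN) and to grow it again after a window of clean steps.

```python
import math


class SketchLossScaler:
    """Illustrative dynamic loss scaler; names and defaults are assumptions."""

    def __init__(self, init_scale=2.0 ** 15, scale_factor=2.0,
                 scale_window=2000, min_scale=1.0):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window  # clean steps required before growing
        self.min_scale = min_scale
        self._good_steps = 0

    @staticmethod
    def has_overflow(grads):
        # A gradient that is inf or NaN means the scaled loss overflowed.
        return any(math.isinf(g) or math.isnan(g) for g in grads)

    def update(self, overflow):
        if overflow:
            # Back off, and restart the count of clean steps.
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self._good_steps = 0
            return
        self._good_steps += 1
        if self._good_steps >= self.scale_window:
            # A full window without overflow: try a larger scale.
            self.scale *= self.scale_factor
            self._good_steps = 0


scaler = SketchLossScaler(init_scale=8.0, scale_window=2)
overflow = scaler.has_overflow([0.5, float("inf")])
scaler.update(overflow)  # the optimizer step would be skipped on overflow
```

In a real training loop the loss would be multiplied by `scaler.scale` before the backward pass and the gradients divided by it before the optimizer step.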

03 May, 2021 2 commits

[fix] SDP: expose module property fix + unit test (#647) · 4e438ba1
Benjamin Lefaudeux authored May 03, 2021
```
* fix + unit test
* changelog update
```
4e438ba1

[minor] not creating a temp file on import (#641) · b66168da

Min Xu authored May 03, 2021



* [minor] not creating a temp file on import

* address review

* Revert "address review"

This reverts commit f65eb9bc7f7ea8829b1ac0a369ef9a3e6b56420a.
Co-authored-by: Min Xu <min.xu@acm.org>

b66168da
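
One common way to avoid creating a temp file at import time is to defer creation to first use. The sketch below illustrates that pattern only; the helper name `get_scratch_path` is hypothetical and this is not necessarily how the commit above implements it.

```python
import functools
import os
import tempfile


@functools.lru_cache(maxsize=None)
def get_scratch_path():
    """Create the temp file on first call and reuse the same path afterwards."""
    fd, path = tempfile.mkstemp(suffix=".sketch")
    os.close(fd)  # keep only the path; callers reopen the file as needed
    return path


# Importing this module creates nothing on disk; the file appears only here.
scratch = get_scratch_path()
```

The caller owns the cleanup of the returned path, for example with `os.remove(scratch)`.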

30 Apr, 2021 1 commit
- [test] nn.Pipe: add a parity test that also tests with amp (#645) · fee979d9
  msbaines authored Apr 30, 2021
  
  fee979d9
29 Apr, 2021 2 commits
- [test][refactor][SDP] Using the nice context-based tempfiles (#640) · 3b7373e2
  Benjamin Lefaudeux authored Apr 29, 2021
  
  3b7373e2
- [test][minor] Improving SDP test coverage (#639) · 8c8a625a
  Benjamin Lefaudeux authored Apr 29, 2021
```
* Improving test coverage on SDP
* using pytest exception catcher
```
  8c8a625a
28 Apr, 2021 4 commits

[test] improve BN test coverage (#638) · 21cba91b

Min Xu authored Apr 28, 2021



* [test] improve BN test coverage

- Added sync_bn on/off cases
- Added conv and linear bias on/off cases
- clarified with the test when BN wrapping is needed if sync_bn is off

* adding a comment
Co-authored-by: Min Xu <min.xu@acm.org>

21cba91b

adding auto graph generation for distributed pipeline (#615) · bdc0581b

Mehdi Mirzazadeh authored Apr 28, 2021

* adding auto graph generation for distributed pipeline

* ignore trace.py for mypy for now, since it needs pytorch 1.8

* fixing tests

* simplifying graph api

* remove unused debug utilities

* use inspect to find argument lists

* use sharded linear layer

* flake8

* comment

* polishing

* polishing

bdc0581b

[chore] do not build cuda extensions by default (#634) · 2bb2a134
msbaines authored Apr 27, 2021

2bb2a134

[feat] save memory by using bucket buffer only in backward (#633) · a5594032

Min Xu authored Apr 27, 2021



* [feat] save memory by using bucket buffer only in backward

- this fixes bug #627
- added documentation to clarify the buffer's cost and speed/memory
  tradeoff
- added setup/teardown calls so that the buffer is only allocated
  during the backward pass, saving more memory during forward and stepping
  so that it can be used for things like activations.
- added a unit test that asserts the memory is in range.

Comparing with DDP:

  1. buffer size scales with the number of FSDP instances, not model size
  2. buffer is only allocated during backward
  3. buffer is used for small tensors only to reduce overhead
  4. overlapping of compute-reduction is very different

* add PR number to changelog

* filled in with memory number on 1.9

* addressed comments

* update comments

* fix for 1.6

* add a todo
Co-authored-by: Min Xu <min.xu@acm.org>

a5594032
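
The setup/teardown idea described in the commit above can be sketched in plain Python. This is an assumption-laden illustration, not FSDP code: `BucketSketch` and `backward_pass` are hypothetical names, and a `bytearray` stands in for a GPU tensor buffer.

```python
from contextlib import contextmanager


class BucketSketch:
    """Holds a reduction buffer that exists only between setup and teardown."""

    def __init__(self, size):
        self.size = size
        self.buffer = None  # nothing is allocated at construction time

    def setup(self):
        if self.buffer is None:
            self.buffer = bytearray(self.size)

    def teardown(self):
        # Drop the reference so the memory can be reused elsewhere.
        self.buffer = None


@contextmanager
def backward_pass(bucket):
    """Allocate the buffer for the duration of the backward pass only."""
    bucket.setup()
    try:
        yield bucket
    finally:
        bucket.teardown()


bucket = BucketSketch(1024)
with backward_pass(bucket) as active:
    size_during_backward = len(active.buffer)
```

Outside the `with` block the buffer is released, which is the point of the change: forward and optimizer steps do not pay for it.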

27 Apr, 2021 1 commit
- [chore] OSS - adding the profiler labels (#629) · 9b79cc02
  Benjamin Lefaudeux authored Apr 26, 2021
  
  9b79cc02
26 Apr, 2021 4 commits

[chore] SDP - adding the profiler labels (#630) · 85dea5b2
Benjamin Lefaudeux authored Apr 26, 2021
```
* adding the labels
* longer labels, following aten::
```
85dea5b2
[chore] PR Template update, mention Changelog (#632) · 38ce54b7
Benjamin Lefaudeux authored Apr 26, 2021

38ce54b7

[chore] 0.3.6 release (#631) · 36da9d6e

Min Xu authored Apr 26, 2021



* [chore] 0.3.6 release

* try redo the caches
Co-authored-by: Min Xu <min.xu@acm.org>

36da9d6e

[fix]: let FSDP handle model with multiple forward pass and checkpoint (#621) · a1612d79

Min Xu authored Apr 26, 2021



* [fix]: let FSDP handle model with multiple forward pass and checkpoint

* try CI again

* save

* save

* fixed case with bn

* minor

* add the new file

* minor

* added test of a single case, runtime is about 50s

* enable all 8 test cases

* cleanup

* cleanup

* skip flatten case with 1.6 and 1.7

* minor
Co-authored-by: Min Xu <min.xu@acm.org>

a1612d79

23 Apr, 2021 2 commits

[fix] check before calling _specify_ddp_gpu_num (#626) · 5cddaea4

Min Xu authored Apr 23, 2021



- this function is being removed in pytorch
- we only need to call it in case we are working with older pytorch
Co-authored-by: Min Xu <min.xu@acm.org>

5cddaea4
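
A guard of this kind is typically a `getattr` or `hasattr` check. The sketch below shows the pattern with stand-in classes; `OlderBatchNorm`, `NewerBatchNorm`, and `call_if_present` are hypothetical names for illustration and are not taken from fairscale or PyTorch.

```python
def call_if_present(obj, method_name, *args, **kwargs):
    """Call obj.method_name(...) only if it exists; report whether it ran."""
    method = getattr(obj, method_name, None)
    if callable(method):
        method(*args, **kwargs)
        return True
    return False


class OlderBatchNorm:
    """Stand-in for a layer from a library version that still has the method."""

    def __init__(self):
        self.gpu_num = None

    def _specify_ddp_gpu_num(self, gpu_num):
        self.gpu_num = gpu_num


class NewerBatchNorm:
    """Stand-in for a layer from a library version where the method is gone."""


older, newer = OlderBatchNorm(), NewerBatchNorm()
called_on_older = call_if_present(older, "_specify_ddp_gpu_num", 1)
called_on_newer = call_if_present(newer, "_specify_ddp_gpu_num", 1)
```

Checking for the attribute rather than comparing version strings keeps the code working on both sides of the removal.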

[FSDP] relax checking root condition (#620) · d3b86d65

shuyingsunshine21 authored Apr 22, 2021

* relax checking root condition

* formatting

* add unittest

* add unittest to ci test list

* isort for import of unittest

* format black .

* move test to list 1

* add skip no cuda

* black and isort

d3b86d65

22 Apr, 2021 3 commits

[fix] mypy and flaky test (#624) · 961df76e

Min Xu authored Apr 22, 2021



* [fix] mypy and flaky test

- CI didn't seem to catch this or maybe I merged incorrectly yesterday
- this should fix the mypy error on master
- also updated a test that seems to be flaky due to tcp port conflict

* another flaky test, hopefully more determinism helps

* CR

* skip 1.6

* fix

* minor
Co-authored-by: Min Xu <min.xu@acm.org>

961df76e

[SDP] removing an assert which does not seem always accurate (#625) · 85962b97
Benjamin Lefaudeux authored Apr 22, 2021

85962b97

Fixing logging, changing info to debug to avoid clutter (#622) · b0048b28

girifb authored Apr 21, 2021



* Changing FSDP init to bypass pg validation for freshly minted pgs inside of init.

* Addressing Min's review comments.

* Changing logging in init to debug from info

* Changing logging in init to debug from info
Co-authored-by: Giri Anantharaman <giriman@devfair0439.h2.fair>

b0048b28

21 Apr, 2021 2 commits

Changing FSDP init to bypass pg validation (#619) · f768eb93

girifb authored Apr 21, 2021



* Changing FSDP init to bypass pg validation for freshly minted pgs inside of init.

* Addressing Min's review comments.
Co-authored-by: Giri Anantharaman <giriman@devfair0439.h2.fair>

f768eb93

[chore] OSS to 100% coverage (#618) · b0e6b9bd
Benjamin Lefaudeux authored Apr 20, 2021

b0e6b9bd

20 Apr, 2021 1 commit
- [FSDP] Consolidate cpu_adam optimizer state dict (#607) · d9f36130
  Sam Shleifer authored Apr 20, 2021
  
  d9f36130
19 Apr, 2021 2 commits

[chore] 0.3.5 release (#616) · 1141528e

Min Xu authored Apr 19, 2021



* [chore] 0.3.5 release

* address comment
Co-authored-by: Min Xu <min.xu@acm.org>

1141528e

FSDP: fixing training with freezing weights (#614) · 24da3b11

Min Xu authored Apr 18, 2021



* FSDP: fixing training with freezing weights

- an assert is changed to catch this case correctly
- unit test added (based on Quentin's test code) for this case to
  compare DDP and FSDP

fixes: #610

* added test file to list 1

* Use better and simpler code as suggested by Myle

* testing both methods of freezing as well
Co-authored-by: Min Xu <min.xu@acm.org>

24da3b11

15 Apr, 2021 3 commits

[chore][SDP] privatizing all the things (#611) · c084b202
Benjamin Lefaudeux authored Apr 15, 2021

c084b202

[fix] Revert change that removed the option to run OffloadModel with out... · a77c56f0

anj-s authored Apr 14, 2021


[fix] Revert change that removed the option to run OffloadModel without activation checkpointing. (#608)

* revert change made

* add tests and revert sync shard changes

* add tests

* remove file checked in by error

* inline var

* fix lint errors

* add checkpoint activation

* fix mypy

* use a bigger model

* modify tests for now

* resolve conflicts
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

a77c56f0

[offload] Add API, tutorial and smaller doc string changes. (#576) · 56506951

anj-s authored Apr 14, 2021



* modify doc string

* add offload docs

* add tutorial

* remove print

* remove print statement

* modify import

* modify constants

* modify README and add Offload symbol

* fix lint

* smaller mods

* lint errors

* Update README.md

added the references at the bottom of the readme

* address comments

* doc changes

* add blank line
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
Co-authored-by: Vittorio Caggiano <caggiano@gmail.com>

56506951

14 Apr, 2021 1 commit
- [fix] [FSDP] Make _get_default_cuda_device more robust to modules without params (#606) · 8f7ee69f
  Myle Ott authored Apr 14, 2021
  
  8f7ee69f
13 Apr, 2021 4 commits
- [chore] v0.3.4 (#603) · 82d6997c
  Benjamin Lefaudeux authored Apr 13, 2021
  
  82d6997c
- [FSDP] use all_gather for 10X OSD consolidation speedup (#595) · a82825db
  Sam Shleifer authored Apr 13, 2021
  
  a82825db
- replacing multi-process pipe implementation with more flexible one (#567) · 4726d5be
  Mehdi Mirzazadeh authored Apr 13, 2021
```
replacing multi-process pipe implementation with more flexible one

Initial implementation of proposal pytorch/pytorch#55256
```
  4726d5be
- [SDP] Adding a unit test which checks for multiple FW passes on the same block (#596) · b191fe5f
  Benjamin Lefaudeux authored Apr 12, 2021
```
* Adding a unit test which checks for multiple FW passes on the same block
* Adding an embedding table, but still no problem to show for it
```
  b191fe5f
09 Apr, 2021 1 commit
- [cleanup] nn.Pipe: deprecate Pipe when torch version >= 1.8.0 (#597) · e9693976
  msbaines authored Apr 08, 2021
  
  e9693976
08 Apr, 2021 1 commit
- [fix] [FSDP] optim state dict should be completely on CPU (#590) · a6549be7
  Sam Shleifer authored Apr 08, 2021
  
  a6549be7
07 Apr, 2021 3 commits
- [fix][ShardedDDP] Properly handle .eval() mode (#587) · ce1f2cea
  Benjamin Lefaudeux authored Apr 07, 2021
```
* Properly handle .train() and .eval() modes
* showing that the unit test works, now fixed
* code review
```
  ce1f2cea
- [offload] Fix activation offloading to CPU in FW pass. (#588) · e89a1916
  anj-s authored Apr 07, 2021
```
* debugging

* debugging activation issue

* fix activation loading

* remove changes used for testing

* remove comment
```
  e89a1916
- [FSDP] [feat] Add state_dict_device option (#579) · 14abed6e
  Myle Ott authored Apr 07, 2021
  
  14abed6e