"torchvision/csrc/vscode:/vscode.git/clone" did not exist on "a00a72b1ee41483407717379fb5cafe992de2f82"
  1. 22 Apr, 2021 2 commits
  2. 19 Apr, 2021 1 commit
    • FSDP: fixing training with freezing weights (#614) · 24da3b11
      Min Xu authored
      
      
      * FSDP: fixing training with freezing weights
      
      - an assert is changed to catch this case correctly
      - unit test added (based on Quentin's test code) for this case,
        comparing DDP and FSDP
      
      fixes: #610
      
      * added test file to list 1
      
      * Use better and simpler code as suggested by Myle
      
      * testing both methods of freezing as well (both sketched below)
      Co-authored-by: Min Xu <min.xu@acm.org>
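      For readers following along, a hedged sketch of the two ways of freezing this commit tests against each other, assuming an already-initialized torch.distributed process group (e.g. launched via torchrun). The toy model and training step are illustrative, not the actual test from #614, and exactly how FSDP handles a partially frozen flattened module depends on the fairscale version (that is what #610/#614 were about).

      ```python
      import torch
      import torch.nn as nn
      from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

      # Hypothetical toy model: a trunk we want frozen plus a trainable head.
      model = nn.Sequential(
          nn.Linear(8, 8),  # trunk (to be frozen)
          nn.ReLU(),
          nn.Linear(8, 2),  # head (stays trainable)
      )

      # Freezing method 1: disable gradients on the trunk's parameters.
      for p in model[0].parameters():
          p.requires_grad = False

      wrapped = FSDP(model)  # requires torch.distributed to be initialized

      # Freezing method 2: hand only still-trainable parameters to the
      # optimizer; any gradients computed for the trunk are never applied.
      optimizer = torch.optim.SGD(
          (p for p in wrapped.parameters() if p.requires_grad), lr=0.1
      )

      loss = wrapped(torch.randn(4, 8)).sum()
      loss.backward()
      optimizer.step()
      ```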
  3. 15 Apr, 2021 1 commit
  4. 13 Apr, 2021 3 commits
  5. 08 Apr, 2021 1 commit
  6. 07 Apr, 2021 2 commits
  7. 06 Apr, 2021 1 commit
  8. 05 Apr, 2021 1 commit
  9. 04 Apr, 2021 3 commits
  10. 02 Apr, 2021 1 commit
  11. 01 Apr, 2021 1 commit
  12. 31 Mar, 2021 4 commits
    • msbaines authored
    • [offload] Audit OffloadModel API, add error messages and remove redundant code path. (#557) · 34384e1b
      anj-s authored
      * renaming/adding error messages
      
      * address comments
      
      * address comments
      
      * add more comments
      
      * add more comments
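      For context on the API under audit here, a hedged usage sketch of OffloadModel as documented around this time; the import path and keyword names come from fairscale's experimental API and may have changed since, and the model below is a stand-in.

      ```python
      import torch
      import torch.nn as nn
      from fairscale.experimental.nn.offload import OffloadModel

      # Stand-in model; OffloadModel expects an nn.Sequential it can slice.
      model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))

      offloaded = OffloadModel(
          model=model,                         # sliced into num_slices shards
          device=torch.device("cuda"),         # compute device (CUDA required)
          offload_device=torch.device("cpu"),  # where inactive shards live
          num_slices=3,
          checkpoint_activation=True,          # trade recompute for memory
          num_microbatches=1,
      )

      out = offloaded(torch.randn(8, 32).cuda())
      ```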
    • [fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556) · a0458b98
      Min Xu authored
      
      * [fix] disable single rank process group for auto_wrap_bn
      
      - beefed up the unit test with a regnet-like model
      - found that the single-rank process group was causing problems
      - disabled it to enable convergence tests on the vissl side
      - use `raise e from None` to get a better assertion output
        in testing.py.
      
      * [test] fix regnet test for ddp+mixed_precision
      
      - need AMP context in FSDP
      - worked around a difference between ddp & fsdp when bias=True
      - fixed a bug in input data generation that caused different ranks to have
        the same data with a wrong iteration count
      - added a TODO on needing a better loss and grad_scaler, and reduced
        iters so there is no NaN
      - added (disabled) debugging code
      
      * lint
      
      * lint
      
      * add scaler
      
      * lint
      
      * scaler
      
      * add a real loss
      
      * seeding in the ranks
      
      * balance tests
      
      * run the AMP DDP==FSDP test only on CUDA version 11 and up
      
      * add relu inplace and comment
      
      * make wrap_bn cover more cases in full precision mode (see the sketch after this entry)
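      A hedged sketch of the pattern this commit's test exercises: BatchNorm kept in its own full-precision FSDP wrapper via auto_wrap_bn while the rest of the model trains in mixed precision with a sharded gradient scaler. The import path for auto_wrap_bn and the exact keyword names are assumptions based on fairscale around this commit; the snippet also assumes an initialized process group and a CUDA device.

      ```python
      import torch
      import torch.nn as nn
      from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
      from fairscale.nn.wrap import auto_wrap_bn  # import path is an assumption
      from fairscale.optim.grad_scaler import ShardedGradScaler

      model = nn.Sequential(
          nn.Conv2d(3, 8, 3),
          nn.BatchNorm2d(8),
          nn.ReLU(inplace=True),  # cf. "add relu inplace" above
      )

      # Wrap BN in its own full-precision FSDP unit; per this commit, the
      # single-rank process group for BN is disabled.
      model = auto_wrap_bn(model)
      model = FSDP(model.cuda(), mixed_precision=True)

      scaler = ShardedGradScaler()  # cf. the "add scaler" iterations above
      optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

      x = torch.randn(4, 3, 16, 16).cuda()
      with torch.cuda.amp.autocast():  # AMP context is needed with FSDP here
          loss = model(x).float().sum()
      scaler.scale(loss).backward()
      scaler.step(optimizer)
      scaler.update()
      optimizer.zero_grad()
      ```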
    • msbaines authored · acb9ef00
  13. 30 Mar, 2021 1 commit
  14. 29 Mar, 2021 1 commit
  15. 28 Mar, 2021 1 commit
  16. 26 Mar, 2021 1 commit
  17. 25 Mar, 2021 2 commits
  18. 22 Mar, 2021 1 commit
  19. 20 Mar, 2021 1 commit
  20. 19 Mar, 2021 3 commits
  21. 18 Mar, 2021 5 commits
  22. 17 Mar, 2021 2 commits
  23. 15 Mar, 2021 1 commit