Commits · e141a93e0107de466f5e1c2f33dae5621d3ddd21 · OpenDAS / fairscale

31 Mar, 2021 4 commits

[refactor] multiprocess_pipe: only support torch >= 1.9.0 (#561) · 204392e5
msbaines authored Mar 31, 2021

[offload] Audit OffloadModel API, add error messages and remove redundant code path. (#557) · 34384e1b
anj-s authored Mar 31, 2021
```
* renaming/adding error messages

* address comments

* address comments

* add more comments

* add more comments
```

[fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556) · a0458b98

Min Xu authored Mar 31, 2021

* [fix] disable single rank process group for auto_wrap_bn

- beefed up unit test with regnet-like model
- found that the single-rank process group is causing problems
- disabled it to enable convergence tests on the vissl side
- use `raise e from None` to get a better assertion output
  in testing.py.

* [test] fix regnet test for ddp+mixed_precision

- need AMP context in FSDP
- work around a difference between ddp & fsdp when bias=True
- fixed a bug in input data generation that caused different ranks to have
  the same data with the wrong iteration count.
- added a TODO about needing a better loss and grad_scaler, and reduced
  iters so there is no nan.
- added (disabled) debugging code

* lint

* lint

* add scaler

* lint

* scaler

* add a real loss

* seeding in the ranks

* balance tests

* run AMP DDP==FSDP test only on cuda version 11 and up

* add relu inplace and comment

* make wrap_bn cover more cases in full precision mode
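
The `raise e from None` idiom mentioned in this commit can be illustrated with a small, self-contained sketch. The helper names `load_value` and `checked_load` are hypothetical and are not taken from fairscale's testing.py; the point is only to show how `from None` hides the chained exception so that the assertion message is what shows up in the test output.

```python
def load_value(state, key):
    """Hypothetical lookup that turns a missing key into an AssertionError."""
    try:
        return state[key]
    except KeyError:
        # Raised while handling KeyError, so the KeyError is chained as context.
        raise AssertionError(f"missing key: {key!r}")


def checked_load(state, key):
    """Re-raise the failure with the chained context suppressed."""
    try:
        return load_value(state, key)
    except AssertionError as e:
        # `from None` clears __cause__ and sets __suppress_context__, so the
        # traceback shows only the assertion, not the KeyError that led to it.
        raise e from None


def demo():
    """Trigger the failure and report what the re-raised exception carries."""
    try:
        checked_load({"weight": 1.0}, "bias")
    except AssertionError as err:
        return str(err), err.__cause__, err.__suppress_context__
```

Without `from None`, the traceback would also print the original `KeyError` followed by "During handling of the above exception, another exception occurred", which tends to bury the assertion message.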


[chore] add testing of torch 1.9.0 nightly build (#559) · acb9ef00
msbaines authored Mar 31, 2021


30 Mar, 2021 1 commit

[feat][fix] ShardedDDP deferred init (#558) · daa1bad5

Benjamin Lefaudeux authored Mar 30, 2021

* survive the model being moved to device post-construction
* make sure that a unit test would catch a regression


29 Mar, 2021 1 commit
- [feat] multiprocess_pipe: add checkpoint support (#555) · 5e6a7a57
  msbaines authored Mar 29, 2021
  
28 Mar, 2021 1 commit
- [feat] multiprocess_pipe: add support for testing gpu-gpu rpc (#552) · 62635f0f
  msbaines authored Mar 28, 2021
  
26 Mar, 2021 1 commit

[test] FSDP: check with ddp parity with conv + bn (#549) · 0233efca

Min Xu authored Mar 26, 2021

- added DDP equivalency test
- added rmf, state_dict_norm functions to testing utils
- added more debugging output to objects_are_equal


25 Mar, 2021 2 commits
- [chore][fix] SDP: yet another unit test improvement + bugfixes (#546) · ece0cbf9
  Benjamin Lefaudeux authored Mar 25, 2021
```
* re-activating unit test
* removing changes that slipped in
```
- [FSDP][feature] optimizer state dict save and load (#537) · 9474d75d
  Sam Shleifer authored Mar 25, 2021
```
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
```
22 Mar, 2021 1 commit
- [ci][SDP] extending the test matrix which checks for equivalence with DDP (#542) · df493a29
  Benjamin Lefaudeux authored Mar 22, 2021
  
20 Mar, 2021 1 commit

[fix][FSDP] fix weight init when using apply() (fixes #490 and #444) (#543) · fa1b85fb

Myle Ott authored Mar 20, 2021

* Add new test for weight init (fails)
* Set FSDP.compute_device so summon_full_params works before module moves to CUDA
* Override FSDP.apply to enable custom weight init


19 Mar, 2021 3 commits
- [feat][refactor][OSS] Param buckets + fp16 broadcasts (#540) · e3865549
  Benjamin Lefaudeux authored Mar 19, 2021
```
* param buckets
* unifying the buckets
```
- [test] use workaround to enable rpc tests when cuda not available (#541) · 195d62f1
  msbaines authored Mar 19, 2021
  
- [feat] experimental.nn.multiprocess_pipe: re-implemented using rpc (#519) · 84e0de84
  msbaines authored Mar 18, 2021
  
18 Mar, 2021 5 commits
- [refactor][fix][SDP] Extract the grad buckets in a dedicated class, fix the resize_ bug (#532) · a1bdc7d3
  Benjamin Lefaudeux authored Mar 18, 2021
```
* extracting the buckets in a dedicated class, fixing the resize_ bug
* adding a unit test
* copyright
```
- [fix][OSS] enabling disabled tests for 1.8 (#534) · 7b127ccb
  Benjamin Lefaudeux authored Mar 18, 2021
```
* enabling disabled tests
```
- [feat] FSDP: add auto_wrap_bn (#531) · 8b59267b
  Min Xu authored Mar 18, 2021
```
* [feat] FSDP: add auto_wrap_bn

- add a utility function to handle wrapping of BN

* changelog
```
- [feature] FSDP: enable pytorch SyncBN (#527) · 2fc1f6d8
  Min Xu authored Mar 17, 2021
```
* [feature] FSDP: enable pytorch SyncBN

- not fully validated yet but at least not asserting
- this enables VISSL to move forward with its next PR

* add the test file

* changelog and lint

* addressed comment
```
- [refactor] removing duplicated tests (#529) · 98223763
  Benjamin Lefaudeux authored Mar 17, 2021
  
17 Mar, 2021 2 commits
- [fix][SDP] Lightning-compat: deactivating buckets for a single rank, not useful (#514) · d3bfcbf5
  Benjamin Lefaudeux authored Mar 17, 2021
```
* Deactivating buckets for a single rank, not crashing but not useful
```
- [feat][OSS] handle the device being changed after construction (#523) · d217278c
  Benjamin Lefaudeux authored Mar 16, 2021
  
15 Mar, 2021 1 commit

[feat] Make OSS state available on all ranks (#500) · 2d2412e2

Benjamin Lefaudeux authored Mar 15, 2021

* extending the current state_dict interface: make it possible to do everything in a single call, and to checkpoint on all ranks


12 Mar, 2021 2 commits

[fix] FSDP: multi-pass autograd graph and mixed precision (#513) · 82986ca0

Min Xu authored Mar 12, 2021



* FSDP: multi-pass autograd graph and mixed precision

- added BACKWARD_PRE/POST checking
- better assert_state
- fixed issue of backward hook misfiring

* fix

* cleanup

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Myle Ott <myleott@fb.com>


[chore] update to torch v1.8.0 (#508) · c79bbd01
msbaines authored Mar 11, 2021


11 Mar, 2021 1 commit

[fix][OSS] Adding a hard sync stream barrier before broadcast (#512) · c9fdf506

Benjamin Lefaudeux authored Mar 11, 2021

* Adding a hard sync barrier before the broadcast; mostly useful for Gloo, since NCCL is synced behind the scenes
* adding a proper unit test
* adding a unit test for https://github.com/facebookresearch/fairscale/pull/510


09 Mar, 2021 3 commits
- [perf] Further improve performance for FSDP.no_sync (#502) · 0cbf3bab
  Myle Ott authored Mar 09, 2021
  
- [fix] FSDP: fix MoE corner case (fixes #467) (#501) · 05ce7971
  Myle Ott authored Mar 08, 2021
  
- [fix] oss and interleaved param groups (#483) · 02405740
  Benjamin Lefaudeux authored Mar 08, 2021
  
08 Mar, 2021 2 commits

Fixed RNN support for containers (#494) · 8c405c51

Sean Naren authored Mar 08, 2021



* Fix packed sequence apply

* Update fairscale/utils/containers.py
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>


[fix]: handle inputs with containers in mixed precision (#486) · 2e9a14e7

Min Xu authored Mar 08, 2021

* [fix]: handle inputs with containers

- this is an issue surfaced by vissl as well
- fix seems to be super simple
- also cleaned up two tests with respect to multiple such tests
  running back to back (they don't do that presently)

* cleanup

* fix

* lint
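
Handling inputs that arrive inside containers usually comes down to walking the nested structure and transforming only the leaves of interest. The sketch below is a minimal, torch-free illustration of that pattern. `apply_to_leaves` is a hypothetical helper, not fairscale's actual implementation, and plain floats stand in for tensors, with `int` standing in for a dtype cast.

```python
from collections import OrderedDict


def apply_to_leaves(fn, container, is_leaf):
    """Return a copy of `container` with `fn` applied to every matching leaf.

    Dicts, lists and tuples are rebuilt with the same type; any other
    non-leaf object is returned unchanged.
    """
    if is_leaf(container):
        return fn(container)
    if isinstance(container, OrderedDict):
        return OrderedDict(
            (k, apply_to_leaves(fn, v, is_leaf)) for k, v in container.items()
        )
    if isinstance(container, dict):
        return {k: apply_to_leaves(fn, v, is_leaf) for k, v in container.items()}
    if isinstance(container, (list, tuple)):
        # Assumption: plain lists/tuples only; namedtuples would need
        # type(container)(*items) instead.
        return type(container)(apply_to_leaves(fn, v, is_leaf) for v in container)
    return container


# Floats play the role of tensors; the string is left untouched.
batch = {"image": [1.5, 2.5], "meta": ("id", 3.9)}
converted = apply_to_leaves(int, batch, lambda x: isinstance(x, float))
```

Because the structure is rebuilt rather than modified in place, the original `batch` is left unchanged.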


06 Mar, 2021 1 commit
- [perf] FSDP: speed up no_sync and test communication volume (#470) · 1204c7cf
  Myle Ott authored Mar 06, 2021
  
05 Mar, 2021 3 commits

[refactor] enhance wrap and auto_wrap (#467) · a05a79bc

Min Xu authored Mar 05, 2021



* [refactor] enhance wrap and auto_wrap

- Two things were done in this PR
  1. We don't need to import FSDP in wrap.py since the wrapper class
     type is stored in the context now.
  2. We can use an `auto_wrap_policy` function to customize wrapping policy
     for auto_wrap, including size of module, blacklist, exclude list
- The auto_wrap function got simplified a bit as a minor side effect.

* Update fairscale/nn/wrap/auto_wrap.py
Co-authored-by: Sean Naren <sean@grid.ai>

* addressed comments

* addressed more comments
Co-authored-by: Sean Naren <sean@grid.ai>
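
The idea of a policy function that decides which modules get wrapped, based on size and an exclude list, can be sketched without any framework. Everything below is hypothetical: `ToyModule`, `size_policy` and `toy_auto_wrap` are illustration-only names, and the policy signature is an assumption rather than fairscale's actual `auto_wrap_policy` signature. The sketch also counts all parameters under a module, which is a simplification.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyModule:
    """Stand-in for a module: a name, its own parameter count, and children."""
    name: str
    own_params: int = 0
    children: List["ToyModule"] = field(default_factory=list)
    wrapped: bool = False

    def num_params(self) -> int:
        return self.own_params + sum(c.num_params() for c in self.children)


def size_policy(module: ToyModule, min_num_params: int, exclude=frozenset()) -> bool:
    """Wrap a module only if it is large enough and not on the exclude list."""
    return module.name not in exclude and module.num_params() >= min_num_params


def toy_auto_wrap(module: ToyModule, policy: Callable[[ToyModule], bool]) -> ToyModule:
    """Visit children first, then mark the module itself if the policy agrees."""
    for child in module.children:
        toy_auto_wrap(child, policy)
    if policy(module):
        module.wrapped = True
    return module


model = ToyModule(
    "root",
    children=[ToyModule("embed", 5000), ToyModule("norm", 10), ToyModule("head", 2000)],
)
toy_auto_wrap(model, lambda m: size_policy(m, min_num_params=1000, exclude={"head"}))
wrapped_names = [m.name for m in [model, *model.children] if m.wrapped]
```

Passing the policy as a callable keeps the traversal generic: changing the size threshold or the exclude list does not require touching `toy_auto_wrap`.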


[perf][minor] cache the rank lookups, small shardedddp perf fix (#474) · 131a5356
Benjamin Lefaudeux authored Mar 05, 2021
```
* [perf][minor] cache the rank lookups, small shardedddp perf fix
* tiny improvement, code quality
```
[fix][minor] Change empty shard handling for OSS, do not rely on asserts (#460) · d1fab39e
Benjamin Lefaudeux authored Mar 04, 2021
```
* change empty shard handling for OSS, do not rely on asserts
* code review
```

04 Mar, 2021 5 commits

[feat]: checkpoint and normalization (#457) · 5e64d6a7

Min Xu authored Mar 04, 2021

* [feat]: checkpoint and normalization

- added special handling of BN for track_running_stats and checkpointing
- we test BN/LN and checkpointing
- we test them with mixed precision


[feat] add buffer_dtype kwarg for more control of batchnorm (#458) · b36e01d5
Sam Shleifer authored Mar 04, 2021

Fix ampnet unit tests (#466) · 103d33c1
Siddharth Goyal authored Mar 04, 2021
```
* Fix ampnet unit test by adding delegate object

* Remove comments
```

[test] AdaScale & SDP/FSDP (#468) · efed9cee

Min Xu authored Mar 04, 2021

- cover them in terms of code path only
- numerically, AdaScale is different on SDP/FSDP than DDP, mainly
  due to partial view of the gradients.
- this doesn't mean it is definitely not useful but it is yet to
  be validated.
- not going to spend too much time until we have a real use case.
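
One plausible reading of the "partial view of the gradients" is that each rank only holds a shard of the gradient, so any statistic computed locally (for example a squared norm) differs from the value computed on the full gradient. The toy numbers below illustrate just that effect; this is not AdaScale's actual computation, and `split_into_shards` is a hypothetical helper.

```python
def squared_norm(values):
    """Sum of squares of a flat list of gradient values."""
    return sum(v * v for v in values)


def split_into_shards(values, world_size):
    """Evenly split a flat gradient; assumes len(values) divides by world_size."""
    size = len(values) // world_size
    return [values[i * size:(i + 1) * size] for i in range(world_size)]


full_grad = [3.0, 4.0, 12.0, 0.0]
shards = split_into_shards(full_grad, world_size=2)

# What each rank would see if it only looked at its own shard.
local_norms = [squared_norm(s) for s in shards]

# What a DDP-style rank sees, since it holds the full gradient.
global_norm = squared_norm(full_grad)
```

Here the per-rank values only match the full-gradient value after being summed across ranks, which is the kind of extra reduction a sharded setup would need before such a statistic is comparable with DDP.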


[chore] move a test script and a CI test improvement (#464) · eeabc6f1
Min Xu authored Mar 03, 2021
```
* [chore] move a test script

* add a shortcut for installing

* more skipping

* keep apt-get part
```