Commits · 121b9db01d9ccdcd5f32586cb512d0e765dbccac · OpenDAS / fairscale

06 Apr, 2021 1 commit
- [fix][OSS] two small hotfixes.. repro not obvious for grad_fn (#583) · 121b9db0
  Benjamin Lefaudeux authored Apr 06, 2021
  
  121b9db0
05 Apr, 2021 3 commits
- [offload] Add golden data for offload benchmarks. (#578) · 168c9baa
  anj-s authored Apr 05, 2021
```
* add model

* add offload regression benchmarks

* add golden data

* remove mp pipe benchmark

* fix lint

* remove rank

* add check for model type

* lint errors
```
  168c9baa
- [OSS/ShardedDDP] making APIs more private (#582) · e41452e8
  Benjamin Lefaudeux authored Apr 05, 2021
```
* making APIs more private
* linting
```
  e41452e8
- [CI] MNIST download fix (#581) · befbc73a
  Benjamin Lefaudeux authored Apr 05, 2021
```
* fixing given torchvision's change
```
  befbc73a
04 Apr, 2021 3 commits
- [FSDP] add no_broadcast_optim_state option (#560) · 1fcbd624
  Sam Shleifer authored Apr 04, 2021
  
  1fcbd624
- [test] disable test which has started to become flaky (#575) · 54a97ee5
  msbaines authored Apr 04, 2021
```
This test is flaky for torch >= 1.8.0.
```
  54a97ee5
- [fix] OSS - enforce cuda parameters for state consolidation if NCCL backend (#573) · 88553373
  Benjamin Lefaudeux authored Apr 03, 2021
  
  88553373
03 Apr, 2021 1 commit
- [FSDP] Add gradient predivide factor to avoid overflow/underflow with large world size (#565) · 04001e76
  Shruti Bhosale authored Apr 03, 2021
  
  04001e76
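The idea named in the title of #565 can be sketched in plain Python. This is a minimal illustration under assumptions, not the code from the commit: the helper names `predivide_factor` and `averaged_gradient`, the power-of-two heuristic, and the use of a plain `sum` to stand in for an all-reduce are all hypothetical. The point is that averaging over `world_size` ranks is split into a division before the reduction and a division after it, so that neither the running sum grows too large nor the individual terms shrink too far in low precision.

```python
from typing import Sequence


def predivide_factor(world_size: int) -> int:
    """Pick a power-of-two factor near sqrt(world_size) (assumed heuristic)."""
    factor = 1
    while world_size % factor == 0 and world_size / factor > factor:
        factor *= 2
    return factor


def averaged_gradient(per_rank_grads: Sequence[float]) -> float:
    """Average gradients across ranks with a pre- and post-division."""
    world_size = len(per_rank_grads)
    pre = predivide_factor(world_size)
    post = world_size / pre
    # Divide each contribution before the (simulated) all-reduce sum,
    # which keeps the intermediate sum smaller than a raw sum would be.
    reduced = sum(g / pre for g in per_rank_grads)
    # Divide by the remaining factor so the total division is world_size.
    return reduced / post
```

With 64 ranks this sketch divides by 8 before the sum and by 8 after it, rather than by 64 in a single step.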
02 Apr, 2021 6 commits
- [test] modify MOE tests to use NCCL (#570) · 5a3df0da
  msbaines authored Apr 02, 2021
```
NCCL all_to_all is now supported in PyTorch (since v1.8.0)

Fixes: #548
```
  5a3df0da
- [chore] 0.3.3 release (#568) · 60694da1
  Min Xu authored Apr 02, 2021
```
- releasing 0.3.3
- I need it in vissl for the auto_wrap_bn change
```
  60694da1
- remove folder (#572) · f37d7603
  anj-s authored Apr 02, 2021
  
  f37d7603
- move back · 1c88e3b7
  Anjali Sridhar authored Apr 02, 2021
  
  1c88e3b7
- move grad scaler to the tutorials folder · 79a9373a
  Anjali Sridhar authored Apr 02, 2021
  
  79a9373a
- [offload] Add support for record_function when using OffloadModel (#564) · c19cc897
  anj-s authored Apr 01, 2021
```
* add record_function support

* add more record_function cutpoints

* add more record_function cutpoints

* lint errors

* make string ids more specific
```
  c19cc897
01 Apr, 2021 1 commit
- [feat] remove old MultiProcessPipe (#563) · 2d3d5a7b
  msbaines authored Apr 01, 2021
  
  2d3d5a7b
31 Mar, 2021 5 commits
- [feat] experimental: Add xpipe support (#553) · e141a93e
  Siddharth Goyal authored Mar 31, 2021
  
  e141a93e
- [refactor] multiprocess_pipe: only support torch >= 1.9.0 (#561) · 204392e5
  msbaines authored Mar 31, 2021
  
  204392e5
- [offload] Audit OffloadModel API, add error messages and remove redundant code path. (#557) · 34384e1b
  anj-s authored Mar 31, 2021
```
* renaming/adding error messages

* address comments

* address comments

* add more comments

* add more comments
```
  34384e1b
- [fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556) · a0458b98
  Min Xu authored Mar 31, 2021

  * [fix] disable single rank process group for auto_wrap_bn

  - beefed up unit test with regnet-like model
  - found that the single-rank process group was causing problems
  - disabled it to enable convergence tests on the vissl side
  - use `raise e from None` to get a better assertion output
    in testing.py.

  * [test] fix regnet test for ddp+mixed_precision

  - need AMP context in FSDP
  - work around the difference between ddp & fsdp when bias=True
  - fixed a bug in input data generation that caused different ranks to have
    the same data with the wrong iteration count.
  - added a TODO about needing a better loss and grad_scaler, and reduced
    iters so there is no nan.
  - added (disabled) debugging code

  * lint

  * lint

  * add scaler

  * lint

  * scaler

  * add a real loss

  * seeding in the ranks

  * balance tests

  * run AMP DDP==FSDP test only on cuda version 11 and up

  * add relu inplace and comment

  * make wrap_bn cover more cases in full precision mode

  a0458b98
- [chore] add testing of torch 1.9.0 nightly build (#559) · acb9ef00
  msbaines authored Mar 31, 2021
  
  acb9ef00
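The a0458b98 entry above mentions using `raise e from None` to get cleaner assertion output. The sketch below shows what that statement does in general. It is not the code from testing.py, and the helper name `first_items_equal` is hypothetical. Raising `from None` hides the "During handling of the above exception, another exception occurred" section of the traceback, so only the final error is printed.

```python
def first_items_equal(a, b):
    """Hypothetical helper: compare the first elements of two sequences."""
    try:
        return a[0] == b[0]
    except IndexError:
        e = AssertionError("both inputs must be non-empty")
        # `from None` sets __suppress_context__, so the IndexError that
        # triggered this handler is left out of the printed traceback.
        raise e from None
```

The original `IndexError` is still stored on the exception's `__context__` attribute; it is only omitted from the default traceback display.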

30 Mar, 2021 1 commit
- [feat][fix] ShardedDDP deferred init (#558) · daa1bad5
  Benjamin Lefaudeux authored Mar 30, 2021

  * survive the model being moved to device post-construction
  * make sure that a unit test would catch a regression

  daa1bad5

29 Mar, 2021 3 commits
- [feat] multiprocess_pipe: add checkpoint support (#555) · 5e6a7a57
  msbaines authored Mar 29, 2021
  
  5e6a7a57
- [chore] Enable codecov for fairscale (#551) · 9a950651
  anj-s authored Mar 29, 2021
```
* codecov testing

* codecov testing

* more changes for uploading cov

* fix invalid config

* fix invalid config

* modify name

* fix config
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
```
  9a950651
- [chore] update to torch v1.8.1 (#554) · c9db4775
  msbaines authored Mar 28, 2021
  
  c9db4775
28 Mar, 2021 1 commit
- [feat] multiprocess_pipe: add support for testing gpu-gpu rpc (#552) · 62635f0f
  msbaines authored Mar 28, 2021
  
  62635f0f
26 Mar, 2021 2 commits
- [cleanup] consistent __init__.py for import * (#550) · 9a6ca9bd
  Min Xu authored Mar 26, 2021
```
- fixes #471
- one less thing to worry about during development.
```
  9a6ca9bd
- [test] FSDP: check with ddp parity with conv + bn (#549) · 0233efca
  Min Xu authored Mar 26, 2021

  - added DDP equivalency test
  - added rmf, state_dict_norm functions to testing utils
  - added more debugging output to objects_are_equal

  0233efca

25 Mar, 2021 3 commits
- [doc] Adding some more ShardedDDP documentation (#547) · a2b11de4
  Benjamin Lefaudeux authored Mar 25, 2021
  
  a2b11de4
- [chore][fix] SDP: yet another unit test improvement + bugfixes (#546) · ece0cbf9
  Benjamin Lefaudeux authored Mar 25, 2021
```
* re-activating unit test
* removing changes that slipped in
```
  ece0cbf9
- [FSDP][feature] optimizer state dict save and load (#537) · 9474d75d
  Sam Shleifer authored Mar 25, 2021
```
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
```
  9474d75d
22 Mar, 2021 1 commit
- [ci][SDP] extending the test matrix which checks for equivalence with DDP (#542) · df493a29
  Benjamin Lefaudeux authored Mar 22, 2021
  
  df493a29
20 Mar, 2021 1 commit
- [fix][FSDP] fix weight init when using apply() (fixes #490 and #444) (#543) · fa1b85fb
  Myle Ott authored Mar 20, 2021

  * Add new test for weight init (fails)
  * Set FSDP.compute_device so summon_full_params works before module moves to CUDA
  * Override FSDP.apply to enable custom weight init

  fa1b85fb

19 Mar, 2021 3 commits
- [feat][refactor][OSS] Param buckets + fp16 broadcasts (#540) · e3865549
  Benjamin Lefaudeux authored Mar 19, 2021
```
* param buckets
* unifying the buckets
```
  e3865549
- [test] use workaround to enable rpc tests when cuda not available (#541) · 195d62f1
  msbaines authored Mar 19, 2021
  
  195d62f1
- [feat] experimental.nn.multiprocess_pipe: re-implemented using rpc (#519) · 84e0de84
  msbaines authored Mar 18, 2021
  
  84e0de84
18 Mar, 2021 5 commits
- [fix] super minor, but make sure that the mem leak does not come back (#536) · f7e6680b
  Benjamin Lefaudeux authored Mar 18, 2021
  
  f7e6680b
- [chore] 0.3.2 release (#535) · 9a37498c
  Min Xu authored Mar 18, 2021
  
  9a37498c
- [refactor][fix][SDP] Extract the grad buckets in a dedicated class, fix the resize_ bug (#532) · a1bdc7d3
  Benjamin Lefaudeux authored Mar 18, 2021
```
* extracting the buckets in a dedicated class, fixing the resize_ bug
* adding a unit test
* copyright
```
  a1bdc7d3
- [perf] [FSDP] micro-optimization for memory usage (#533) · fcbf1ea3
  Myle Ott authored Mar 18, 2021
  
  fcbf1ea3
- [fix][OSS] enabling disabled tests for 1.8 (#534) · 7b127ccb
  Benjamin Lefaudeux authored Mar 18, 2021
```
* enabling disabled tests
```
  7b127ccb