Commits · a77c56f09959d3dabe4107fd23a27c8875af92fb · OpenDAS / fairscale

15 Apr, 2021 2 commits

[fix] Revert change that removed the option to run OffloadModel without activation checkpointing. (#608) · a77c56f0

anj-s authored Apr 14, 2021

* revert change made

* add tests and revert sync shard changes

* add tests

* remove file checked in by error

* inline var

* fix lint errors

* add checkpoint activation

* fix mypy

* use a bigger model

* modify tests for now

* resolve conflicts
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

a77c56f0

[offload] Add API, tutorial and smaller doc string changes. (#576) · 56506951

anj-s authored Apr 14, 2021



* modify doc string

* add offload docs

* add tutorial

* remove print

* remove print statement

* modify import

* modify constants

* modify README and add Offload symbol

* fix lint

* smaller mods

* lint errors

* Update README.md

added the references at the bottom of the readme

* address comments

* doc changes

* add blank line
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
Co-authored-by: Vittorio Caggiano <caggiano@gmail.com>

56506951

14 Apr, 2021 1 commit
- [fix] [FSDP] Make _get_default_cuda_device more robust to modules without params (#606) · 8f7ee69f
  Myle Ott authored Apr 14, 2021
  
  8f7ee69f
13 Apr, 2021 4 commits
- [chore] v0.3.4 (#603) · 82d6997c
  Benjamin Lefaudeux authored Apr 13, 2021
  
  82d6997c
- [FSDP] use all_gather for 10X OSD consolidation speedup (#595) · a82825db
  Sam Shleifer authored Apr 13, 2021
  
  a82825db
- replacing multi-process pipe implementation with more flexible one (#567) · 4726d5be
  Mehdi Mirzazadeh authored Apr 13, 2021
```
replacing multi-process pipe implementation with more flexible one

Initial implementation of proposal pytorch/pytorch#55256
```
  4726d5be
- [SDP] Adding a unit test which checks for multiple FW passes on the same block (#596) · b191fe5f
  Benjamin Lefaudeux authored Apr 12, 2021
```
* Adding a unit test which checks for multiple FW passes on the same block
* Adding an embedding table, but still no problem to show for it
```
  b191fe5f
09 Apr, 2021 1 commit
- [cleanup] nn.Pipe: deprecate Pipe when torch version >= 1.8.0 (#597) · e9693976
  msbaines authored Apr 08, 2021
  
  e9693976
08 Apr, 2021 1 commit
- [fix] [FSDP] optim state dict should be completely on CPU (#590) · a6549be7
  Sam Shleifer authored Apr 08, 2021
  
  a6549be7
07 Apr, 2021 3 commits
- [fix][ShardedDDP] Properly handle .eval() mode (#587) · ce1f2cea
  Benjamin Lefaudeux authored Apr 07, 2021
```
* Properly handle .train() and .eval() modes
* showing that the unit test works, now fixed
* code review
```
  ce1f2cea
- [offload] Fix activation offloading to CPU in FW pass. (#588) · e89a1916
  anj-s authored Apr 07, 2021
```
* debugging

* debugging activation issue

* fix activation loading

* remove changes used for testing

* remove comment
```
  e89a1916
- [FSDP] [feat] Add state_dict_device option (#579) · 14abed6e
  Myle Ott authored Apr 07, 2021
  
  14abed6e
06 Apr, 2021 1 commit
- [fix][OSS] two small hotfixes.. repro not obvious for grad_fn (#583) · 121b9db0
  Benjamin Lefaudeux authored Apr 06, 2021
  
  121b9db0
05 Apr, 2021 3 commits
- [offload] Add golden data for offload benchmarks. (#578) · 168c9baa
  anj-s authored Apr 05, 2021
```
* add model

* add offload regression benchmarks

* add golden data

* remove mp pipe benchmark

* fix lint

* remove rank

* add check for model type

* lint errors
```
  168c9baa
- [OSS/ShardedDDP] making APIs more private (#582) · e41452e8
  Benjamin Lefaudeux authored Apr 05, 2021
```
* making APIs more private
* linting
```
  e41452e8
- [CI] MNIST download fix (#581) · befbc73a
  Benjamin Lefaudeux authored Apr 05, 2021
```
* fixing given torchvision's change
```
  befbc73a
04 Apr, 2021 3 commits
- [FSDP] add no_broadcast_optim_state option (#560) · 1fcbd624
  Sam Shleifer authored Apr 04, 2021
  
  1fcbd624
- [test] disable test which has started to become flaky (#575) · 54a97ee5
  msbaines authored Apr 04, 2021
```
This test is flaky for torch >= 1.8.0.
```
  54a97ee5
- [fix] OSS - enforce cuda parameters for state consolidation if NCCL backend (#573) · 88553373
  Benjamin Lefaudeux authored Apr 03, 2021
  
  88553373
03 Apr, 2021 1 commit
- [FSDP] Add gradient predivide factor to avoid overflow/underflow with large world size (#565) · 04001e76
  Shruti Bhosale authored Apr 03, 2021
  
  04001e76
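The idea named in the commit title above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not FairScale's implementation: `to_fp16` and `simulated_mean` are hypothetical helpers, half precision is emulated with the standard-library `struct` module, and the split factor of 8 for 64 ranks is picked by hand for the example. Dividing only after the sum lets large gradients overflow, dividing fully before the sum lets small gradients underflow, and splitting the division across both steps avoids both.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses values beyond the fp16 range; model that as +/-inf.
        return math.copysign(math.inf, x)


def simulated_mean(grads: list, predivide: float) -> float:
    """Average per-rank gradients in emulated fp16 (hypothetical helper).

    Each gradient is divided by `predivide` before the sum, and the sum is
    divided by `world_size / predivide` afterwards.
    """
    world_size = len(grads)
    postdivide = world_size / predivide
    total = 0.0
    for g in grads:
        total = to_fp16(total + to_fp16(g / predivide))
    return to_fp16(total / postdivide)


world_size = 64
large = [2000.0] * world_size
small = [1e-6] * world_size

overflowed = simulated_mean(large, predivide=1.0)          # sum first, divide after
underflowed = simulated_mean(small, predivide=world_size)  # divide fully before the sum
balanced_large = simulated_mean(large, predivide=8.0)      # split the division
balanced_small = simulated_mean(small, predivide=8.0)

print(overflowed, underflowed, balanced_large, balanced_small)
```

In this toy setup the first call ends at infinity and the second at zero, while the two split-division calls stay finite and non-zero.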
02 Apr, 2021 6 commits
- [test] modify MOE tests to use NCCL (#570) · 5a3df0da
  msbaines authored Apr 02, 2021
```
NCCL all_to_all is now supported in PyTorch (since v1.8.0)

Fixes: #548
```
  5a3df0da
- [chore] 0.3.3 release (#568) · 60694da1
  Min Xu authored Apr 02, 2021
```
- releasing 0.3.3
- I need it in vissl for the auto_wrap_bn change
```
  60694da1
- remove folder (#572) · f37d7603
  anj-s authored Apr 02, 2021
  
  f37d7603
- move back · 1c88e3b7
  Anjali Sridhar authored Apr 02, 2021
  
  1c88e3b7
- move grad scaler to the tutorials folder · 79a9373a
  Anjali Sridhar authored Apr 02, 2021
  
  79a9373a
- [offload] Add support for record_function when using OffloadModel (#564) · c19cc897
  anj-s authored Apr 01, 2021
```
* add record_function support

* add more record_function cutpoints

* add more record_function cutpoints

* lint errors

* make string ids more specific
```
  c19cc897
01 Apr, 2021 1 commit
- [feat] remove old MultiProcessPipe (#563) · 2d3d5a7b
  msbaines authored Apr 01, 2021
  
  2d3d5a7b
31 Mar, 2021 5 commits

[feat] experimental: Add xpipe support (#553) · e141a93e
Siddharth Goyal authored Mar 31, 2021

e141a93e
[refactor] multiprocess_pipe: only support torch >= 1.9.0 (#561) · 204392e5
msbaines authored Mar 31, 2021

204392e5
[offload] Audit OffloadModel API, add error messages and remove redundant code path. (#557) · 34384e1b
anj-s authored Mar 31, 2021
```
* renaming/adding error messages

* address comments

* address comments

* add more comments

* add more comments
```
34384e1b

[fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test (#556) · a0458b98

Min Xu authored Mar 31, 2021

* [fix] disable single rank process group for auto_wrap_bn

- beefed up unit test with regnet-like model
- found that single-rank process group is causing problem
- disabled it to enable convergence tests on the vissl side
- use `raise e from None` to get a better assertion output
  in testing.py.

* [test] fix regnet test for ddp+mixed_precision

- need AMP context in FSDP
- work around difference between ddp & fsdp when bias=True
- fixed a bug in input data generation that caused different ranks to have
  the same data with the wrong iteration count.
- added TODO for needing a better loss and grad_scaler, and reduced
  iters so there is no nan.
- added (disabled) debugging code

* lint

* lint

* add scaler

* lint

* scaler

* add a real loss

* seeding in the ranks

* balance tests

* run AMP DDP==FSDP test only on cuda version 11 and up

* add relu inplace and comment

* make wrap_bn cover more cases in full precision mode

a0458b98
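The commit message above mentions using `raise e from None` to get cleaner assertion output. As a minimal sketch of that Python idiom (the helpers `assert_close` and `format_failure` are hypothetical and are not the code in testing.py), raising `from None` inside an `except` block suppresses the chained "During handling of the above exception, another exception occurred" section of the traceback, so only the final assertion is reported:

```python
import traceback


def assert_close(a: float, b: float, tol: float = 1e-6) -> None:
    """Hypothetical comparison helper that reports a single, clean error."""
    try:
        if abs(a - b) > tol:
            raise ValueError(f"difference {abs(a - b)} exceeds {tol}")
    except ValueError:
        err = AssertionError(f"values differ: {a} vs {b}")
        # `from None` hides the ValueError context in the printed traceback.
        raise err from None


def format_failure(a: float, b: float) -> str:
    """Return the formatted traceback of a failed comparison, or ''."""
    try:
        assert_close(a, b)
    except AssertionError as e:
        return "".join(traceback.format_exception(type(e), e, e.__traceback__))
    return ""


report = format_failure(1.0, 2.0)
print(report)
```

The original exception is still reachable through `__context__` for debugging; `from None` only sets `__suppress_context__` so that it is not printed.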

[chore] add testing of torch 1.9.0 nightly build (#559) · acb9ef00
msbaines authored Mar 31, 2021

acb9ef00

30 Mar, 2021 1 commit

[feat][fix] ShardedDDP deferred init (#558) · daa1bad5

Benjamin Lefaudeux authored Mar 30, 2021

* survive the model being moved to device post-construction
* make sure that a unit test would catch a regression

daa1bad5

29 Mar, 2021 3 commits
- [feat] multiprocess_pipe: add checkpoint support (#555) · 5e6a7a57
  msbaines authored Mar 29, 2021
  
  5e6a7a57
- [chore] Enable codecov for fairscale (#551) · 9a950651
  anj-s authored Mar 29, 2021
```
* codecov testing

* codecov testing

* more changes for uploading cov

* fix invalid config

* fix invalid config

* modify name

* fix config
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
```
  9a950651
- [chore] update to torch v1.8.1 (#554) · c9db4775
  msbaines authored Mar 28, 2021
  
  c9db4775
28 Mar, 2021 1 commit
- [feat] multiprocess_pipe: add support for testing gpu-gpu rpc (#552) · 62635f0f
  msbaines authored Mar 28, 2021
  
  62635f0f
26 Mar, 2021 2 commits

[cleanup] consistent __init__.py for import * (#550) · 9a6ca9bd
Min Xu authored Mar 26, 2021
```
- fixes #471
- one less thing to worry about during development.
```
9a6ca9bd

[test] FSDP: check with ddp parity with conv + bn (#549) · 0233efca

Min Xu authored Mar 26, 2021

- added DDP equivalency test
- added rmf, state_dict_norm functions to testing utils
- added more debugging output to objects_are_equal

0233efca

25 Mar, 2021 1 commit
- [doc] Adding some more ShardedDDP documentation (#547) · a2b11de4
  Benjamin Lefaudeux authored Mar 25, 2021
  
  a2b11de4