- 05 Mar, 2021 1 commit
-
-
Min Xu authored
* [hotfix]: fix a bug in CI command
* debug
* debug
* bump cache ver
* fix
* eq
* check
* bump
* addressed comment
-
- 04 Mar, 2021 6 commits
-
-
Min Xu authored
* [feat]: checkpoint and normalization (see the sketch below)
  - added special handling of BN for track_running_stats and checkpointing
  - we test BN/LN and checkpointing
  - we test them with mixed precision
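Background for the BN handling: activation checkpointing re-runs the forward pass during backward, which would otherwise update BatchNorm running stats twice. A minimal sketch of the exercised path, assuming the checkpoint_wrapper import location of this period; the module and shapes are placeholders:

```
import torch
import torch.nn as nn
from fairscale.nn.misc import checkpoint_wrapper  # assumed import path

# BatchNorm has track_running_stats=True by default; the wrapper must keep
# the running-stat update to a single logical forward even though the
# forward is recomputed during backward
block = checkpoint_wrapper(
    nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8), nn.ReLU())
)

block.train()
x = torch.randn(4, 8, requires_grad=True)
block(x).sum().backward()  # recomputation happens here; stats counted once
```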
-
Sam Shleifer authored
-
Siddharth Goyal authored
* Fix ampnet unit test by adding delegate object
* Remove comments
-
Min Xu authored
- cover them in terms of code path only
- numerically, AdaScale is different on SDP/FSDP than on DDP, mainly due to each rank's partial view of the gradients
- this doesn't mean it is definitely not useful, but it has yet to be validated
- not going to spend too much time on it until we have a real use case (a sketch of the validated DDP usage follows below)
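For reference, a minimal sketch of AdaScale on the DDP path where its gain is understood; everything except the AdaScale wrapper is a placeholder, and process-group/DDP setup is elided:

```
import torch
from fairscale.optim import AdaScale

model = torch.nn.Linear(16, 2)  # placeholder; wrap in DDP for real use
optim = AdaScale(torch.optim.SGD(model.parameters(), lr=0.1))

step, max_steps = 0, 100
while step < max_steps:
    optim.zero_grad()
    loss = model(torch.randn(8, 16)).sum()
    loss.backward()
    step += optim.gain()  # the gain measures effective progress per step
    optim.step()
```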
-
Min Xu authored
* [chore] move a test script
* add a shortcut for installing
* more skipping
* keep apt-get part
-
Benjamin Lefaudeux authored
-
- 03 Mar, 2021 3 commits
- 02 Mar, 2021 2 commits
-
-
Myle Ott authored
-
Sean Naren authored
This adds a context manager that assists in making child modules with similar defaults. Usage:

```
from fairscale.nn.misc import enable_wrap, wrap

with enable_wrap(**handful_of_important_params):
    layer_1 = wrap(torch.nn.Linear(5, 5))
    # Override parameters if you'd like
    layer_2 = wrap(torch.nn.Linear(5, 5), flatten_parameters=True)

# without the context manager, creates a plain Linear layer
layer_1 = wrap(torch.nn.Linear(5, 5))
```

If not within the FSDP context, this is a no-op. This makes it easier to annotate layers without having to copy any changes in parameters.
-
- 01 Mar, 2021 3 commits
-
-
Min Xu authored
* [chores]: CI py39 on GPU and more efficiency
* add test list files
* fix
* add test list files
* split benchmark run into 2 runs
* fix 1.8 version and balance benchmarks
* fix
* fix
* fix
* fix
* recording tests
* py39 install fix
* test again
* move tests
* reorg tests
* skip tests for torch 1.8 due to an upstream bug
* removed __init__.py from tests since it confuses pytest
* Revert "removed __init__.py from tests since it confuses pytest" (reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0)
* don't include __init__ in file list
* notes on __init__.py and added missing ones
* fixed mypy in a test file
* balance test runtime
* better pip install
* balance more
* pip fix
* balance
* balance more, all tests should finish within 20m now
* minor license update
* trying cu102
* more doc and addressed Ben's comments
* debugging
* debugging...
-
Min Xu authored
* [test] FSDP: add the failing test for #421 (a sketch of the scenario follows below)
* skip on 1.5
* better skipping
* Update tests/nn/data_parallel/test_fsdp_grad_scaler.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
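Judging by the test file name, the failing case involves AMP grad scaling under FSDP. A hypothetical sketch of that scenario, assuming fairscale's ShardedGradScaler and an already-initialized process group; model and shapes are placeholders:

```
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.optim.grad_scaler import ShardedGradScaler  # assumed location

# requires torch.distributed.init_process_group(...) to have been called
model = FSDP(torch.nn.Linear(16, 4).cuda())
optim = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = ShardedGradScaler()

for _ in range(3):
    optim.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 16).cuda()).sum()
    scaler.scale(loss).backward()
    scaler.step(optim)   # unscale the sharded grads, then step
    scaler.update()
```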
-
Sean Naren authored
-
- 27 Feb, 2021 3 commits
-
-
vfdev authored
-
Min Xu authored
* [fix] FSDP corner case of all params in the children (illustrated below)
* lint
* fix
* tradeoff
* fix doc build
* review comments
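An illustration of the corner case named above, as I read it: the root module owns no parameters directly because every parameter lives in an already-wrapped child. The class and wrapping here are illustrative placeholders, not the actual regression test:

```
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

class Outer(nn.Module):
    def __init__(self):
        super().__init__()
        # every parameter belongs to a wrapped child ...
        self.a = FSDP(nn.Linear(4, 4))
        self.b = FSDP(nn.Linear(4, 4))

    def forward(self, x):
        return self.b(self.a(x))

# ... so the root wrapper is left with an empty list of direct parameters,
# the situation this fix has to handle (process-group setup elided)
model = FSDP(Outer())
```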
-
Vittorio Caggiano authored
-
- 26 Feb, 2021 7 commits
-
-
Myle Ott authored
-
Min Xu authored
-
Myle Ott authored
-
Vittorio Caggiano authored
* Update README.md
-
Min Xu authored
-
Min Xu authored
* [feat]: add summon_full_params context mgr (usage sketch below)
* fix
* fix
* addressed comments
* fixed the state_dict copy
* lint
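A minimal sketch of the new context manager, assuming it is exposed as a method on the FSDP wrapper; the model and the process-group setup are placeholders:

```
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# requires an initialized process group
model = FSDP(torch.nn.Linear(16, 4).cuda())

with model.summon_full_params():
    # inside the context, each rank sees the full unsharded parameters,
    # e.g. for inspection or for copying a full state_dict
    weight_copy = model.module.weight.detach().clone()
# on exit the parameters are re-sharded
```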
-
anj-s authored
* clean start
* removing per layer split strategy, probably not that useful indeed
* initial transformer benchmark
* hack, enable testing ViT + offload: python3 benchmarks/oss.py --epochs 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
* proper cuda streams and device, something off in terms of mem consumption
* minor, stashing
* unit test fix
* removing all the distributed parts
* simpler test, needs debugging
* working OOP, running a model which does not fit in GPU memory
* spring cleaning
* removing the ill-advised optimizer bits, better to keep that orthogonal
* [offload] Add support for activation offloading + other changes (#367) (API sketch after this list)
  - initial fwd/bwd commit
  - checkpoint work
  - modify shard loop
  - activation offloading and test to start with
  - fix lint errors
  - update comments
  - fix lint
  - remove unused var
  - remove commented out lines
  - modify name
  - remove break
  - remove profiler comments
  - avoid saving inputs
  - fix lint errors
  Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
* [offload] Add support for fp16 training (#374)
  - initial fwd/bwd commit
  - checkpoint work
  - modify shard loop
  - activation offloading and test to start with
  - fix lint errors
  - update comments
  - fix lint
  - remove unused var
  - remove commented out lines
  - modify name
  - remove break
  - remove profiler comments
  - add support for fp16
  - add unit tests
  - fix lint errors
  - fix test failure
  Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
* [offload] Add support for activation checkpointing for all layers (#381)
  - initial fwd/bwd commit
  - checkpoint work
  - modify shard loop
  - activation offloading and test to start with
  - fix lint errors
  - update comments
  - fix lint
  - remove unused var
  - remove commented out lines
  - modify name
  - remove break
  - remove profiler comments
  - add support for fp16
  - add unit tests
  - fix lint errors
  - fix test failure
  - cp work, incorrect output dimensions still need to be fixed
  - fixed activation outputs
  - intermediate cp of work
  - add tests
  - fix lint errors
  Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
* add support for microbatches
* revert benchmark config changes
* add parametrization
* fix lint errors and tests
* skip test for 1.5
* fix lint errors
* skip test if there are no GPUs
* fix lint errors
* fix lint errors
* move experimental to the fairscale repo
* lint error fixes
* modify test imports
* lint error fixes
* move offload files to the experimental directory
* move tests and benchmarks to their folder
* fix mypy errors
* cp intermediate working benchmarks
* more changes
* split benchmark configs
* remove print statements
* fix lint errors
* remove unused print
* stress testing
* remove unused file
* change param name
* lint fixes
* move file to the right folder
* offload_experimental
* add doc string
* add error message

Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@gmail.com>
Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@protonmail.com>
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
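As promised above, a minimal sketch of the experimental offload API these PRs build up; the keyword names are assumptions reconstructed from the feature list (slicing, activation checkpointing, microbatches), not a confirmed signature:

```
import torch
import torch.nn as nn
from fairscale.experimental.nn.offload import OffloadModel  # assumed location

model = nn.Sequential(*[nn.Linear(32, 32) for _ in range(8)])
offload_model = OffloadModel(
    model=model,
    device=torch.device("cuda"),         # device that executes each shard
    offload_device=torch.device("cpu"),  # where idle shards live
    num_slices=4,                        # split the model into 4 shards
    checkpoint_activation=True,          # per-shard activation checkpointing
    num_microbatches=2,                  # microbatch support from this PR
)

loss = offload_model(torch.randn(16, 32).cuda()).sum()
loss.backward()
```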
-
- 25 Feb, 2021 3 commits
-
-
Benjamin Lefaudeux authored
* bring back a fix from FSDP, may help a few existing users
-
Myle Ott authored
-
Min Xu authored
-
- 24 Feb, 2021 4 commits
-
-
anj-s authored
* refactor experimental file locations
* refactor fix
* disable test temporarily
* lint error fix
* make the change in the right file
* fix lint errors
* skip failing tests

Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
-
Myle Ott authored
-
anj-s authored
-
Min Xu authored
* use weakref in the wrapper (pattern sketched below)
* comment
* comment
* Update fairscale/nn/misc/checkpoint_activations.py

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
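A generic sketch of the weakref pattern this commit applies, not the actual fairscale code: when a patched forward is stored on the module itself, closing over the module strongly would create a reference cycle, and weakref.ref breaks it. The helper name is illustrative:

```
import weakref
import torch
import torch.nn as nn

def attach_wrapped_forward(module: nn.Module) -> None:
    weak_self = weakref.ref(module)        # no strong reference to the module
    inner_forward = type(module).forward   # unbound function, holds no module ref

    def wrapped_forward(*args, **kwargs):
        self = weak_self()                 # resolve the module at call time
        assert self is not None, "module was garbage collected"
        return inner_forward(self, *args, **kwargs)

    # storing the closure on the module no longer creates a
    # module -> closure -> module cycle
    module.forward = wrapped_forward

layer = nn.Linear(3, 3)
attach_wrapped_forward(layer)
layer(torch.randn(1, 3))  # dispatches through wrapped_forward
```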
-
- 23 Feb, 2021 8 commits
-
-
Min Xu authored
* [test]: add peak mem in checkpoint test (measurement sketch below)
* more debugging
* new test
* more fix
* better collection of debug info in case of future failures
* update the comment
* typo
* comment
* clarify
* better wording
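For reference, a minimal sketch of how a test can assert on peak memory with the stock torch.cuda counters; the model and the threshold are illustrative placeholders:

```
import torch

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

model = torch.nn.Linear(1024, 1024).cuda()
model(torch.randn(64, 1024, device="cuda")).sum().backward()

peak = torch.cuda.max_memory_allocated()
assert peak < 512 * 1024 * 1024, f"peak memory regression: {peak} bytes"
```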
-
Benjamin Lefaudeux authored
* v0.3.0 it is, celebration time
-
anj-s authored
* move experimental to the fairscale repo
* lint error fixes
* modify test imports
* lint error fixes
* lint errors

Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
-
Benjamin Lefaudeux authored
-
Benjamin Lefaudeux authored
* POC, testing against the DDP comm hook when available (see the sketch below)
* docs, adding a reference to DDP's compress hook
* updating changelog, prep for v0.1.8 release
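For reference, a sketch of the upstream hook being compared against, using PyTorch's DDP communication-hook API (present in recent builds, hence "when available"); process-group setup is elided:

```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# requires torch.distributed.init_process_group(...) beforehand
model = DDP(torch.nn.Linear(16, 4).cuda())

# compress gradient buckets to fp16 for the all-reduce, decompress after
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```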
-
Myle Ott authored
-
Min Xu authored
-
Min Xu authored
-