1. 12 Nov, 2021 1 commit
    • Setup pre-commit github action and apply pre-commit to all files (#849) · 7d7edf6d
      Anupam Bhatnagar authored
      * adding pre-commit files
      
      * applying pre-commit to all files
      
      * adding no-strict-optional argument to mypy in circle ci config
      
      * fix typo
      
      * updating python versions
      
      * [skip ci] remove extra args
      
      * adding python 3.9
      
      * [skip ci] set pre-commit version in requirements-dev.txt
      
      * set CACHE_VERSION
      
      * move linters from circleci to github actions
      
      * update python version
      
      * update python version in benchmarks_2
      
      * moving to python 3.9.7
      7d7edf6d
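
One step above adds no-strict-optional to the mypy invocation in CI; that flag turns off mypy's strict Optional checking, under which None becomes compatible with every type. A minimal illustration (the find_port function is hypothetical, purely for demonstration):

```python
from typing import Dict, Optional

def find_port(config: Dict[str, int]) -> Optional[int]:
    # dict.get returns None when the key is missing, hence Optional[int]
    return config.get("port")

# Under mypy's default strict Optional checking this assignment is an error
# ("Optional[int]" is not "int"); with --no-strict-optional, as in the CI
# change above, None is treated as compatible and mypy accepts it.
port: int = find_port({})
```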
  2. 05 Nov, 2021 1 commit
    • [feat] experimental MEVO layer (#840) · 8347c1a2
      Min Xu authored
      * [feat] MEVO kernel
      
      - initial import from min/softmax and min/testing branches
      - need to rename and further cleanup
      
      * only test with newer pytorch
      
      * renamed and added comments and code cleanup
      
      * rename and reduce test memory
      
      * testing
      
      * minor fixing
      
      * fixing
      
      * more fix
      
      * changelog
      
      * more 1.7 and 1.8 paper cuts
      
      * remove dead code
      
      * addressed Benjamin's comments
      
      * addressed more comments
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
      8347c1a2
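
MEVO is fairscale's memory-efficient vocab output layer: it fuses the final projection, softmax, and loss so that the full [tokens, vocab] logits tensor is never materialized at once (the min/softmax branch name above hints at this). The sketch below illustrates only the underlying tiling idea in plain PyTorch; the function name, signature, and tile size are assumptions, it is not the kernel from this PR, and it covers just the forward pass (a real kernel also needs a memory-efficient backward):

```python
import torch

def tiled_vocab_loss(hidden, weight, targets, tile=16384):
    """Cross-entropy over a huge vocab without building [tokens, vocab] logits.

    hidden:  [tokens, d_model] final hidden states
    weight:  [vocab, d_model] output projection matrix
    targets: [tokens] target token ids
    """
    tokens, vocab = hidden.shape[0], weight.shape[0]
    run_max = torch.full((tokens,), float("-inf"), device=hidden.device)
    run_sum = torch.zeros(tokens, device=hidden.device)
    tgt_logit = torch.zeros(tokens, device=hidden.device)
    for start in range(0, vocab, tile):
        logits = hidden @ weight[start : start + tile].t()  # [tokens, <=tile]
        new_max = torch.maximum(run_max, logits.max(dim=1).values)
        # online log-sum-exp: rescale the running sum to the new running max
        run_sum = run_sum * torch.exp(run_max - new_max) + torch.exp(
            logits - new_max[:, None]
        ).sum(dim=1)
        run_max = new_max
        # pick out the target logit for tokens whose target falls in this tile
        in_tile = (targets >= start) & (targets < start + tile)
        tgt_logit[in_tile] = logits[in_tile, targets[in_tile] - start]
    # per-token loss is logsumexp(all logits) minus the target's logit
    return (run_max + run_sum.log() - tgt_logit).mean()
```

With a 50k-plus vocabulary, peak activation memory for this step drops from tokens × vocab to tokens × tile, at the cost of looping over vocab tiles.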
  3. 07 May, 2021 1 commit
    • [feat] experimental.nn.SyncBatchNorm: initial commit (#662) · f0a40046
      msbaines authored
      * [feat] experimental.nn.SyncBatchNorm: initial commit
      
      Fast/simple re-implementation of SyncBatchNorm.
      
      When profiling SSL Vision, I was seeing a majority of cycles spent in
      SyncBatchNorm. With this change, I see a 10% to 20% speedup on the
      model I was profiling.
      
      When running benchmarks/experimental/sync_batchnorm.py on 8 x V100,
      I get a 6x speedup:
      
      <class 'torch.nn.modules.batchnorm.BatchNorm2d'>
      Elapsed time is  0.08709120750427246
      Elapsed time is  0.12632274627685547
      Elapsed time is  0.14095258712768555
      Elapsed time is  0.16529417037963867
      Elapsed time is  0.1419970989227295
      Elapsed time is  0.15166854858398438
      Elapsed time is  0.12000870704650879
      Elapsed time is  0.17534875869750977
      <class 'torch.nn.modules.batchnorm.SyncBatchNorm'>
      Elapsed time is  2.5087168216705322
      Elapsed time is  2.497001886367798
      Elapsed time is  2.5204885005950928
      Elapsed time is  2.526789903640747
      Elapsed time is  2.5080230236053467
      Elapsed time is  2.524489641189575
      Elapsed time is  2.513214588165283
      Elapsed time is  2.5359973907470703
      <class 'fairscale.experimental.nn.sync_batchnorm.SyncBatchNorm'>
      Elapsed time is  0.4126114845275879
      Elapsed time is  0.39051294326782227
      Elapsed time is  0.40685415267944336
      Elapsed time is  0.4159870147705078
      Elapsed time is  0.42383885383605957
      Elapsed time is  0.4080159664154053
      Elapsed time is  0.41202712059020996
      Elapsed time is  0.42400121688842773
      f0a40046
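
A minimal timing sketch in the spirit of the benchmark quoted above (benchmarks/experimental/sync_batchnorm.py); the import path follows the commit title, but the tensor shape, iteration count, and launch details are assumptions rather than the benchmark's actual configuration. It is meant to run with one process per GPU, e.g. under torchrun:

```python
import time
import torch
import torch.distributed as dist
from fairscale.experimental.nn import SyncBatchNorm  # the layer this commit adds

def bench(bn: torch.nn.Module, x: torch.Tensor, iters: int = 100) -> None:
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        bn(x).sum().backward()  # time forward + backward, like a training step
    torch.cuda.synchronize()
    print("Elapsed time is ", time.time() - start)

dist.init_process_group("nccl")  # one process per GPU
torch.cuda.set_device(dist.get_rank())
x = torch.randn(32, 64, 56, 56, device="cuda", requires_grad=True)

# same three classes as the printed comparison above
for cls in (torch.nn.BatchNorm2d, torch.nn.SyncBatchNorm, SyncBatchNorm):
    print(cls)
    bench(cls(64).cuda(), x)
```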
  4. 26 Feb, 2021 1 commit
    • [feature] Add support for OffloadModel to enable training large models on 1 GPU. (#432) · f7813d6d
      anj-s authored
      * clean start
      
      * removing per layer split strategy, probably not that useful indeed
      
      * initial transformer benchmark
      
      * hack, enable testing ViT + offload, python3 benchmarks/oss.py  --epochs 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
      
      * proper cuda streams and device, something off in terms of memory consumption
      
      * minor, stashing
      
      * unit test fix
      
      * removing all the distributed parts
      
      * simpler test, needs debugging
      
      * working OOP, running a model which does not fit in GPU memory
      
      * spring cleaning
      
      * removing the ill-advised optimizer bits, better keep that orthogonal
      
      * [offload] Add support for activation offloading + other changes (#367)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * avoid saving inputs
      
      * fix lint errors
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for fp16 training (#374)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      
      * [offload] Add support for activation checkpointing for all layers. (#381)
      
      * initial fwd/bwd commit
      
      * checkpoint work
      
      * modify shard loop
      
      * activation offloading and test to start with
      
      * fix lint errors
      
      * update comments
      
      * fix lint
      
      * remove unused var
      
      * remove commented out lines
      
      * modify name
      
      * remove break
      
      * remove profiler comments
      
      * add support for fp16
      
      * add unit tests
      
      * fix lint errors
      
      * fix test failure
      
      * cp work, incorrect output dimensions still need to be fixed
      
      * fixed activation outputs
      
      * intermediate cp of work
      
      * add tests
      
      * fix lint errors
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      
      * add support for microbatches
      
      * revert benchmark config changes
      
      * add parametrization
      
      * fix lint errors and tests
      
      * skip test for 1.5
      
      * fix lint errors
      
      * skip test if there are no GPUs
      
      * fix lint errors
      
      * fix lint errors
      
      * move experimental to the fairscale repo
      
      * lint error fixes
      
      * modify test imports
      
      * lint error fixes
      
      * move offload files to the experimental directory
      
      * move tests and benchmarks to their folder
      
      * fix mypy errors
      
      * cp intermediate working benchmarks
      
      * more changes
      
      * split benchmark configs
      
      * remove print statements
      
      * fix lint errors
      
      * remove unused print
      
      * stress testing
      
      * remove unused file
      
      * change param name
      
      * lint fixes
      
      * move file to the right folder
      
      * offload_experimental
      
      * add doc string
      
      * add error message
      Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@gmail.com>
      Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@protonmail.com>
      Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>
      f7813d6d
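
Putting the pieces of this PR together (layer slicing, activation offloading/checkpointing from #367 and #381, fp16 support from #374, and microbatches), usage looks roughly like the sketch below. The import path follows the "move offload files to the experimental directory" commit; the exact constructor arguments are assumptions based on the feature list, so treat this as a sketch rather than the definitive API:

```python
import torch
from fairscale.experimental.nn.offload import OffloadModel

# OffloadModel expects an nn.Sequential so it can be cut into slices.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

offload = OffloadModel(
    model=model,
    device=torch.device("cuda"),          # device that executes each slice
    offload_device=torch.device("cpu"),   # device where idle slices are parked
    num_slices=3,                         # slices are paged onto the GPU one at a time
    checkpoint_activation=True,           # activation offloading/checkpointing (#367, #381)
    num_microbatches=4,                   # microbatch support added late in this PR
)

optim = torch.optim.SGD(offload.parameters(), lr=0.01)
x = torch.randn(32, 1024, device="cuda")
offload(x).sum().backward()  # stand-in loss; each slice pages in for fwd and bwd
optim.step()
```

Only one slice's parameters (plus the current activations) reside on the GPU at a time, which is what lets a model larger than GPU memory train on a single device.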