Commits · 377e96969665e31eb65c73288fb69cab92a7c4af · OpenDAS / fairscale

15 Feb, 2023 1 commit
- [fix] typo in wikitext2_data.py (#1104) · 377e9696
  Junyeol Ryu authored Feb 16, 2023
```
* [fix] typo in wikitext2_data.py

* [fix] typo and code duplication in fsdp.py
```
  377e9696
24 Sep, 2022 2 commits
- [cleanup] remove ssd offload to simplify the FSDP code (#1080) · e71d2570
  Min Xu authored Sep 24, 2022
```
* simplified the readme

* clean up ssd offload

* try to fix readthedocs
Co-authored-by: Min Xu <min.xu.public@gmail.com>
```
  e71d2570
- [chore] move fair_dev into fairscale (#1078) · 8f8f8ef9
  Min Xu authored Sep 23, 2022
```
Co-authored-by: Min Xu <min.xu.public@gmail.com>
```
  8f8f8ef9
12 Jun, 2022 1 commit
- Move f/utils => f/internal; move testing libs to fair_dev/testing (#1004) · 2350968e
  Crutcher Dunnavant authored Jun 12, 2022
  
  2350968e
30 Mar, 2022 1 commit

Remove sort_iseed_config and related dependencies. (#969) · 72f373c1

Paul Johnson authored Mar 30, 2022

This is no longer needed now that isort is at version 5.10.

Also pin black to 22.3.0 to fix an issue with the click dependency.

Update files that now fail with the new version of black: {a = 2 ** 4} -> {a = 2**4}

72f373c1
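
As a small illustration of the formatting change mentioned in this commit: to my understanding, black 22.x removes the spaces around `**` when both operands are simple (names, numeric literals, attribute access) and keeps them otherwise. The variable names below are made up for the example and do not come from the repository.

```python
# Sketch of black 22.x power-operator formatting (all names are hypothetical).
base = 2

a = 2**4              # simple operands: black 22.x writes no spaces
b = base**2           # a plain name is also a simple operand
c = (base + 1) ** 2   # a parenthesized expression keeps the spaces

# Formatting only changes the layout, never the value.
assert a == 2 ** 4
```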

08 Mar, 2022 1 commit

[chore] Fix copyright headers & fixed issue with mypy & NumPy versions in pre-commit (#951) · 8fa26ae4

Min Xu authored Mar 08, 2022



* copyright headers

* isort and pyproject.toml

* precommit and requirement for isort-seed-config

* mypy

* dummy change

* numpy version for pre-commit

* fix mypy issue caused by numpy
Co-authored-by: Min Xu <min.xu.public@gmail.com>

8fa26ae4

22 Feb, 2022 1 commit

[benchmarks] Add benchmarks for FSDP (#765) · f9a125db

anj-s authored Feb 22, 2022

* add benchmarks for fsdp

* fix lint errors

* clean up

* clean up unused flags

* add the benchmarks

* remove unused args

* fix lint errors

* fix lint errors

* update command line

* add support for multiple devices

* try full fp16 mode

* try full fp16 mode

* lint errors

* merge main

* lint errors

* lint errors

* lint error

* update intersphinx mapping for numpy

* update intersphinx mapping for numpy

* skip test

* added golden configs

* use synthetic benchmarks

* fix fn name

* fix cuda device id

* fix verify

* lint fix

f9a125db

14 Feb, 2022 1 commit

[chore] [cleanup]: pytest, pytorch new versions, fix tests (#933) · fae29959

Min Xu authored Feb 14, 2022



* update pytest versions

* [test] test related changes

- upgrade to newer pytorch versions
- added function to make test more deterministic on A100 and TF32
- fixed some tests so that they are correctly skipped on a single GPU system

* more fixes

* formatting overly long lines

* format

* better test without triggering a warning

* fix an optim state bug with newer pytorch

- the adam optimizer seems to return "step" as a singleton tensor now in the
nightly build
- this fixes it, assuming a non-tensor value can still be loaded back by
the optimizer

* improve oss.py

- use min_loss for regression checking is a bit more reliable
- also increased the num epochs from 10 to 12

* small oss.py fix

* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py
Co-authored-by: Min Xu <min.xu.public@gmail.com>

fae29959
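
The "use min_loss for regression checking" item in this commit can be sketched roughly as follows. This is a minimal illustration and not the code in oss.py: the function name, the reference value, and the tolerance are all assumptions chosen for the example.

```python
from typing import Sequence


def passes_loss_regression(
    epoch_losses: Sequence[float],
    reference_min_loss: float,  # hypothetical reference value
    tolerance: float = 0.01,    # hypothetical slack
) -> bool:
    """Compare the best (minimum) loss over all epochs against a reference.

    The final-epoch loss can bounce around from run to run; the minimum over
    the run is less sensitive to where training happens to stop.
    """
    min_loss = min(epoch_losses)
    return min_loss <= reference_min_loss + tolerance


# The last epoch (1.1) is above the reference of 1.0, but the best epoch (0.9) is not.
losses = [2.0, 1.2, 0.9, 1.1]
regression_ok = passes_loss_regression(losses, reference_min_loss=1.0)
```

Running a couple more epochs (the commit goes from 10 to 12) gives the minimum more chances to reach the reference value.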

24 Nov, 2021 1 commit

[benchmarks] Add an MOE benchmark (#866) · 56254247

Ying Zhang authored Nov 24, 2021

* Add MOE to lm benchmarks

* linter

* Fix source / target

* address comments

* address comments

* address comments

* add circleci

* fix circleci

* precommit

56254247

18 Nov, 2021 1 commit

[fix] [MEVO]: make mevo work with eval and optim_state checkpointing (#851) · 0db50ce5

Min Xu authored Nov 18, 2021



* [fix]: fix eval for shared weight FSDP

* fixing optim state saving

* add changelog

* reformat with newer local isort

* update test

* avoid computing reference state unless we are testing training

* added optim_state test

* make mypy happy

* move tests; maybe we need to put CUDA memory related tests first in the lists
Co-authored-by: Min Xu <min.xu.public@gmail.com>

0db50ce5

17 Nov, 2021 1 commit
- [feature] Add an OffloadConfig object to specify offloading params to disk. (#855) · ef194cd2
  anj-s authored Nov 17, 2021
```
* fixed lint issues

* remove unused print statements

* add changelog entry

* [skip ci] fix lint errors
```
  ef194cd2
12 Nov, 2021 1 commit

Setup pre-commit github action and apply pre-commit to all files (#849) · 7d7edf6d

Anupam Bhatnagar authored Nov 11, 2021

* adding pre-commit files

* applying pre-commit to all files

* adding no-strict-optional argument to mypy in circle ci config

* fix typo

* updating python versions

* [skip ci] remove extra args

* adding python 3.9

* [skip ci] set pre-commit version in requirements-dev.txt

* set CACHE_VERSION

* move linters from circleci to github actions

* update python version

* update python version in benchmarks_2

* moving to python 3.9.7

7d7edf6d

05 Nov, 2021 1 commit

[feat] experimental MEVO layer (#840) · 8347c1a2

Min Xu authored Nov 05, 2021



* [feat] MEVO kernel

- initial import from min/softmax and min/testing branches
- need to rename and further cleanup

* only test with newer pytorch

* renamed and added comments and code cleanup

* rename and reduce test memory

* testing

* minor fixing

* fixing

* more fix

* changelog

* more 1.7 and 1.8 paper cuts

* remove dead code

* addressed Benjamin's comments

* addressed more comments
Co-authored-by: Min Xu <min.xu.public@gmail.com>

8347c1a2

24 Oct, 2021 1 commit
- [chore] Fix main breakage temporarily by relaxing constraints (#828) · eadfdc49
  anj-s authored Oct 23, 2021
```
* relax speed constraints

* relax the regressions constraints
```
  eadfdc49
22 Oct, 2021 1 commit
- modify golden data (#825) · 35f327f3
  anj-s authored Oct 22, 2021
  
  35f327f3
21 Oct, 2021 1 commit
- [chore] Update the PyTorch version that we run benchmarks with. (#823) · e4da75ea
  anj-s authored Oct 21, 2021
```
* update pytorch version for benchmarks

* reduce golden data precision check
```
  e4da75ea
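
A loosened golden-data precision check, as this commit describes, might look like the sketch below. It assumes the check is a relative-tolerance comparison; the function name and the numbers are hypothetical and are not taken from the benchmark configs.

```python
import math


def matches_golden(measured: float, golden: float, rel_tol: float) -> bool:
    """Return True when `measured` is within `rel_tol` (relative) of `golden`."""
    return math.isclose(measured, golden, rel_tol=rel_tol)


# Hypothetical numbers: a 1% drift fails a 0.5% check but passes a 5% check.
strict_ok = matches_golden(101.0, 100.0, rel_tol=0.005)
relaxed_ok = matches_golden(101.0, 100.0, rel_tol=0.05)
```

Widening `rel_tol` trades sensitivity for stability: small regressions slip through, but the check stops failing on noise introduced by a new PyTorch version.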
02 Aug, 2021 1 commit
- Change test to use tensorpipe rpc backend (#759) · 57821dd2
  Howard Huang authored Aug 02, 2021
  
  57821dd2
14 Jun, 2021 1 commit
- [chore] Migrate away from legacy torchtext iterators (#713) · cec011bb
  anj-s authored Jun 14, 2021
```
* migrate away from legacy iterators

* fix lint error
```
  cec011bb
08 May, 2021 1 commit
- [chore][benchmarks] Add license file headers for all files in fairscale/benchmarks (#670) · a9156260
  anj-s authored May 08, 2021
```
* add license file headers for all files

* fix lint
```
  a9156260
07 May, 2021 1 commit

[feat] experimental.nn.SyncBatchNorm: initial commit (#662) · f0a40046

msbaines authored May 07, 2021

* [feat] experimental.nn.SyncBatchNorm: initial commit

Fast/simple re-implementation of SyncBatchNorm.

When profiling SSL Vision, I was seeing a majority of cycles spent in
SyncBatchNorm. With this change, I see a 10% to 20% speedup on the
model I was profiling.

When running benchmarks/experimental/sync_batchnorm.py on 8 x V100,
I get a 6x speedup:

<class 'torch.nn.modules.batchnorm.BatchNorm2d'>
Elapsed time is  0.08709120750427246
Elapsed time is  0.12632274627685547
Elapsed time is  0.14095258712768555
Elapsed time is  0.16529417037963867
Elapsed time is  0.1419970989227295
Elapsed time is  0.15166854858398438
Elapsed time is  0.12000870704650879
Elapsed time is  0.17534875869750977
<class 'torch.nn.modules.batchnorm.SyncBatchNorm'>
Elapsed time is  2.5087168216705322
Elapsed time is  2.497001886367798
Elapsed time is  2.5204885005950928
Elapsed time is  2.526789903640747
Elapsed time is  2.5080230236053467
Elapsed time is  2.524489641189575
Elapsed time is  2.513214588165283
Elapsed time is  2.5359973907470703
<class 'fairscale.experimental.nn.sync_batchnorm.SyncBatchNorm'>
Elapsed time is  0.4126114845275879
Elapsed time is  0.39051294326782227
Elapsed time is  0.40685415267944336
Elapsed time is  0.4159870147705078
Elapsed time is  0.42383885383605957
Elapsed time is  0.4080159664154053
Elapsed time is  0.41202712059020996
Elapsed time is  0.42400121688842773

f0a40046
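
The "6x speedup" quoted in this commit can be checked against the timings it lists. The snippet below simply averages the eight runs reported for `torch.nn.SyncBatchNorm` and for the fairscale re-implementation and divides one by the other; the variable names are mine.

```python
from statistics import mean

# Timings copied from the commit message (8 x V100).
torch_sync_bn = [
    2.5087168216705322, 2.497001886367798, 2.5204885005950928, 2.526789903640747,
    2.5080230236053467, 2.524489641189575, 2.513214588165283, 2.5359973907470703,
]
fairscale_sync_bn = [
    0.4126114845275879, 0.39051294326782227, 0.40685415267944336, 0.4159870147705078,
    0.42383885383605957, 0.4080159664154053, 0.41202712059020996, 0.42400121688842773,
]

# Ratio of mean elapsed times: roughly 2.52 s versus 0.41 s.
speedup = mean(torch_sync_bn) / mean(fairscale_sync_bn)
print(f"speedup: {speedup:.1f}x")  # about 6.1x
```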

20 Apr, 2021 1 commit
- [FSDP] Consolidate cpu_adam optimizer state dict (#607) · d9f36130
  Sam Shleifer authored Apr 20, 2021
  
  d9f36130
15 Apr, 2021 1 commit

[fix] Revert change that removed the option to run OffloadModel without activation checkpointing. (#608) · a77c56f0

anj-s authored Apr 14, 2021

* revert change made

* add tests and revert sync shard changes

* add tests

* remove file checked in by error

* inline var

* fix lint errors

* add checkpoint activation

* fix mypy

* use a bigger model

* modify tests for now

* resolve conflicts
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

a77c56f0

07 Apr, 2021 1 commit

[offload] Fix activation offloading to CPU in FW pass. (#588) · e89a1916

anj-s authored Apr 07, 2021

* debugging

* debugging activation issue

* fix activation loading

* remove changes used for testing

* remove comment

e89a1916

05 Apr, 2021 2 commits

[offload] Add golden data for offload benchmarks. (#578) · 168c9baa

anj-s authored Apr 05, 2021

* add model

* add offload regression benchmarks

* add golden data

* remove mp pipe benchmark

* fix lint

* remove rank

* add check for model type

* lint errors

168c9baa

[CI] MNIST download fix (#581) · befbc73a
Benjamin Lefaudeux authored Apr 05, 2021
```
* fixing given torchvision's change
```
befbc73a

02 Apr, 2021 1 commit

[offload] Add support for record_function when using OffloadModel (#564) · c19cc897

anj-s authored Apr 01, 2021

* add record_function support

* add more record_function cutpoints

* add more record_function cutpoints

* lint errors

* make string ids more specific

c19cc897

01 Apr, 2021 1 commit
- [feat] remove old MultiProcessPipe (#563) · 2d3d5a7b
  msbaines authored Apr 01, 2021
  
  2d3d5a7b
31 Mar, 2021 2 commits
- [feat] experimental: Add xpipe support (#553) · e141a93e
  Siddharth Goyal authored Mar 31, 2021
  
  e141a93e
- [offload] Audit OffloadModel API, add error messages and remove redundant code path. (#557) · 34384e1b
  anj-s authored Mar 31, 2021
```
* renaming/adding error messages

* address comments

* address comments

* add more comments

* add more comments
```
  34384e1b
18 Mar, 2021 1 commit
- [fix] super minor, but make sure that the mem leak does not come back (#536) · f7e6680b
  Benjamin Lefaudeux authored Mar 18, 2021
  
  f7e6680b
17 Mar, 2021 1 commit

[offload] Add support for multiple streams and fix issue with integer inputs. (#515) · 39a12a8b

anj-s authored Mar 17, 2021



* debugging statements

* fix index inputs and streams

* fix lint errors

* remove print

* lint errors

* address comments

* lint error
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

39a12a8b

12 Mar, 2021 1 commit
- [chore] update to torch v1.8.0 (#508) · c79bbd01
  msbaines authored Mar 11, 2021
  
  c79bbd01
10 Mar, 2021 1 commit
- [feat] experimental: Add spectrain support (#372) · 5e8a6422
  Siddharth Goyal authored Mar 09, 2021
```
* experimental: Add spectrain support

* Address review comments

* Address review comments
```
  5e8a6422
09 Mar, 2021 1 commit

[refactor] Fix for using synthetic data + remove unused flags (#485) · 8eaa3622

anj-s authored Mar 09, 2021



* small fix, remove unused flags

* remove unused flag

* add back max_batch flag

* adding back lazy_construction

* adding back lazy_construction

* add missing device arg
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

8eaa3622

08 Mar, 2021 1 commit
- [chore] OSS perf test, super minor (#495) · 886aa327
  Benjamin Lefaudeux authored Mar 08, 2021
  
  886aa327
05 Mar, 2021 1 commit
- [fix] OSS speed ref adjusted down.. (#480) · 80cc7559
  Benjamin Lefaudeux authored Mar 05, 2021
```
:(
```
  80cc7559
04 Mar, 2021 1 commit
- [fix] Cache MNIST fetches, use alternative URLs (#465) · 0491715f
  Benjamin Lefaudeux authored Mar 03, 2021
  
  0491715f
03 Mar, 2021 1 commit

[refactor] Use logging in place of print statements, remove unused functions and other minor refactoring changes. (#461) · 7a3199b1

anj-s authored Mar 02, 2021

* fix pipe logging and other cleanups

* more log/debug changes

7a3199b1
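
For readers unfamiliar with the pattern this commit refers to, here is a minimal sketch of replacing a `print` call with the standard `logging` module. It is not taken from the benchmark code: the logger name, the function, and the message format are assumptions.

```python
import logging

# Configure output once, at program start; libraries should only create loggers.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("benchmarks.sketch")  # hypothetical logger name


def report_throughput(words_per_sec: float) -> str:
    """Log a throughput figure instead of printing it; return the message text."""
    message = f"throughput: {words_per_sec:.1f} wps"
    logger.info(message)  # previously something like: print(message)
    return message


report_throughput(1234.56)
```

Unlike `print`, the log level can be raised (for example to `logging.WARNING`) to silence these messages without touching the call sites.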

01 Mar, 2021 1 commit

[chores]: make CI more efficient and update py39 env a bit (#447) · 5eb6b8c7

Min Xu authored Mar 01, 2021

* [chores]: CI py39 on GPU and more efficiency

* add test list files

* fix

* add test list files

* split benchmark run into 2 runs

* fix 1.8 version and balance benchmarks

* fix

* fix

* fix

* fix

* recording tests

* py39 install fix

* test again

* move tests

* reorg tests

* skip tests for torch 1.8 due to an upstream bug

* removed __init__.py from tests since it confuses pytest

* Revert "removed __init__.py from tests since it confuses pytest"

This reverts commit 7e156ba33dfaa5ed052031780613ec0cb57a45b0.

* don't include __init__ in file list

* notes on __init__.py and added missing ones

* fixed mypy in a test file

* balance test runtime

* better pip install

* balance more

* pip fix

* balance

* balance more, all tests should finish within 20m now

* minor license update

* trying cu102

* more doc and addressed Ben's comments

* debugging

* debugging...

5eb6b8c7

26 Feb, 2021 1 commit

[feature] Add support for OffloadModel to enable training large models on 1 GPU. (#432) · f7813d6d

anj-s authored Feb 25, 2021



* clean start

* removing per layer split strategy, probably not that useful indeed

* initial transformer benchmark

* hack, enable testing ViT + offload, python3 benchmarks/oss.py  --epochs 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224

* proper cuda streams and device, something off in terms of memory consumption

* minor, stashing

* unit test fix

* removing all the distributed parts

* simpler test, needs debugging

* working OOP, running a model which does not fit on the gpu memory

* spring cleaning

* removing the ill-advised optimizer bits, better keep that orthogonal

* [offload] Add support for activation offloading + other changes (#367)

* initial fwd/bwd commit

* checkpoint work

* modify shard loop

* activation offloading and test to start with

* fix lint errors

* update comments

* fix lint

* remove unused var

* remove commented out lines

* modify name

* remove break

* remove profiler comments

* avoid saving inputs

* fix lint errors
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

* [offload] Add support for fp16 training (#374)

* initial fwd/bwd commit

* checkpoint work

* modify shard loop

* activation offloading and test to start with

* fix lint errors

* update comments

* fix lint

* remove unused var

* remove commented out lines

* modify name

* remove break

* remove profiler comments

* add support for fp16

* add unit tests

* fix lint errors

* fix test failure
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

* [offload] Add support for activation checkpointing for all layers. (#381)

* initial fwd/bwd commit

* checkpoint work

* modify shard loop

* activation offloading and test to start with

* fix lint errors

* update comments

* fix lint

* remove unused var

* remove commented out lines

* modify name

* remove break

* remove profiler comments

* add support for fp16

* add unit tests

* fix lint errors

* fix test failure

* cp work, incorrect output dimensions still need to be fixed

* fixed activation outputs

* intermediate cp of work

* add tests

* fix lint errors
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

* add support for microbatches

* revert benchmark config changes

* add parametrization

* fix lint errors and tests

* skip test for 1.5

* fix lint errors

* skip test if there are no GPUs

* fix lint errors

* fix lint errors

* move experimental to the fairscale repo

* lint error fixes

* modify test imports

* lint error fixes

* move offload files to the experimental directory

* move tests and benchmarks to their folder

* fix mypy errors

* cp intermediate working benchmarks

* more changes

* split benchmark configs

* remove print statements

* fix lint errors

* remove unused print

* stress testing

* remove unused file

* change param name

* lint fixes

* move file to the right folder

* offload_experimental

* add doc string

* add error message
Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@gmail.com>
Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@protonmail.com>
Co-authored-by: Anjali Sridhar <anj@devfair0443.h2.fair>

f7813d6d