- 22 Jul, 2022 1 commit
Min Xu authored

* flip per_tensor's default
* fixed original size computation

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 21 Jul, 2022 1 commit
Min Xu authored

* additional metadata, step 1
* add gzip option to repo::add
* add repo::add's return value and some refactoring and todo
* added size metadata to sha1_store
* added names metadata to sha1_store

Co-authored-by: Min Xu <min.xu.public@gmail.com>
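The gzip-compressed add described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual sha1_store code: the function name `sha1_store_add` is hypothetical, and the git-style two-character prefix directory layout is borrowed from a later commit in this log.

```python
import gzip
import hashlib
from pathlib import Path


def sha1_store_add(store_dir: Path, file_path: Path, compress: bool = True) -> str:
    """Add a file to a content-addressed store, optionally gzip-compressed.

    Returns the SHA1 hex digest used as the object's key.
    """
    data = file_path.read_bytes()
    sha1 = hashlib.sha1(data).hexdigest()
    # Mirror git's loose-object layout: first two hex chars form a subdirectory.
    obj_dir = store_dir / sha1[:2]
    obj_dir.mkdir(parents=True, exist_ok=True)
    obj_path = obj_dir / sha1[2:]
    obj_path.write_bytes(gzip.compress(data) if compress else data)
    return sha1
```

The key is computed over the uncompressed bytes, so the same content maps to the same object whether or not compression is enabled.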
-
- 19 Jul, 2022 1 commit
Min Xu authored

* formatting change, no logical change
* formatting and name change, no logical change
* [refactor] sha1_store's path arg
  - make sha1_store's path arg directly the path, not its parent
  - this is because sha1_store is not like a .git or a .wgit dir, which is nested inside another "working" dir. It is simply a store, which uses a given dir.
  - updated repo and tests as well.
* remove a test warning due to deprecated API from torch
* [refactor] change how dot_wgit_dir_path is used
  - it should only be assigned in __init__.
  - we use it in error checking in the rest of the APIs.
* simplify the init a bit
* refactor the sanity check
* moved some functions, no code change
* [feat] added per-tensor add to the repo
* enabled gzip compression on add
* fix a unit test
* add a note
* make sha1 store work on general dict
* handle general state_dict from a model, not just a module's one-level OrderedDict
* formatting

Co-authored-by: Min Xu <min.xu.public@gmail.com>
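The "general state_dict" handling above amounts to walking a possibly nested dict down to its leaves so that each tensor can be added to the store individually (per-tensor add). A minimal sketch, where `iter_leaves` is a hypothetical helper and plain Python values stand in for tensors:

```python
from typing import Any, Dict, Iterator, Tuple


def iter_leaves(state: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Walk a (possibly nested) state_dict and yield (dotted_key, leaf) pairs,
    so each leaf can be hashed and stored as its own object."""
    for key, value in state.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sub-dicts (e.g. a model's sub-module states).
            yield from iter_leaves(value, prefix=full_key + ".")
        else:
            yield full_key, value
```

A one-level OrderedDict from a module is just the base case of this walk; nested model state_dicts fall out of the recursion.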
-
- 18 Jul, 2022 1 commit
Min Xu authored

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 14 Jul, 2022 2 commits
Min Xu authored

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
Min Xu authored

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 12 Jul, 2022 1 commit
Min Xu authored

* refactor SHA1_Store
  - renamed the class
  - added created_on field and refactored how init is done
  - wrap long lines
* wrapped longer lines
* rename json file to ref_count.json
* make sha1_buf_size an argument
* update gitignore
* added tmp_dir
* added new sha1_store add and tests
* update chdir
* add debug to test
* fixing unit test for 1.8

Co-authored-by: Min Xu <min.xu.public@gmail.com>
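A reference-count file like the `ref_count.json` renamed above could be maintained as in this sketch. `incref` is a hypothetical helper for illustration, not the library's API:

```python
import json
from pathlib import Path


def incref(ref_file: Path, sha1: str) -> int:
    """Increment the reference count for a stored object, persisting the
    counts to a JSON file, and return the new count."""
    counts = json.loads(ref_file.read_text()) if ref_file.exists() else {}
    counts[sha1] = counts.get(sha1, 0) + 1
    ref_file.write_text(json.dumps(counts, indent=2))
    return counts[sha1]
```

A matching `decref` would decrement and delete the object file once its count reaches zero, which is what makes deduplicated storage safe to garbage-collect.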
-
- 05 Jul, 2022 1 commit
Riyasat Ohib authored

* [Fix] Restructure for wgit availability as a package
* Preliminary implementation of wgit status
* [Feat] Addition of wgit status
  1. Functionalities to check the status of the repo.
  2. Checks if a file has been modified, and whether changes have been added or added changes committed.
* [test] Addition of tests for weigit status
  1. Some minor refactors and docstring changes
* [Fix] Changes in repo status test
* [test] status test fix
  1. made the status test printing-order independent
* [refactor] Metadata dirs mirroring chkpt paths, changes in wgit status
  1. Metadata files are now created within wgit with a directory structure mirroring the relative paths of the checkpoint/files they track.
  2. Changes in status: 3 statuses now.
  3. Changes in tests.
  4. Some code refactoring.
* [cleanup] minor changes in comments and cleanup
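The status check described above (has a tracked file changed since its last recorded hash?) can be illustrated with a small sketch. The function `file_status` and its three labels are assumptions for illustration and do not claim to match wgit's actual status names:

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_status(tracked: Path, last_recorded_sha1: Optional[str]) -> str:
    """Classify a tracked file: 'untracked' if we have no recorded hash for it,
    'modified' if its current content no longer matches, else 'clean'."""
    if last_recorded_sha1 is None:
        return "untracked"
    current = hashlib.sha1(tracked.read_bytes()).hexdigest()
    return "modified" if current != last_recorded_sha1 else "clean"
```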
-
- 29 Jun, 2022 1 commit
Min Xu authored

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 24 Jun, 2022 1 commit
Riyasat Ohib authored

weigit: Fixed file tracking with metadata. Changes in sha1_store for better encapsulation. Docstrings. (#1013)

* [Feat] Fixed file tracking with metadata. Changes in sha1_store for better encapsulation. Tests.
  1. Adds metadata creation per added file and independently tracks the version of each separate file added. That is, it now creates separate metadata files for each file to be tracked.
  2. Changes in reference tracking to accommodate the change in 1.
  3. Some changes in SHA1_store for better encapsulation.
  4. Modified the tests to reflect the above.
* [Feat]
  1. Added docstrings to the classes.
  2. Added a recursive search for the weigit repo up to root.
  3. Some refactoring of the code.
* [Feat][Refactor] repo and sha1_store add modification and separation. Modification in reference tracking.
  1. Separation of the add functionalities of repo.add and sha1_store.add.
  2. Updated the reference tracking.
  3. New tests and code refactoring.
* [Fix] Sha1_store: fix overlap in the first two characters of sha1 hashes.
  1. Accept multiple sha1 hashes with the same two starting characters and create directories accordingly.
* [Fix] Minor refactoring and test fix
* [Fix] Fix for PyGit class initialization in cases when no .gitconfig file is available

Co-authored-by: Riyasat Ohib <riohib@devfair0756.h2.fair>
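The per-file metadata idea above (a separate metadata file per tracked file, recording each version's sha1) might look like the following sketch. `record_version` and the JSON layout are hypothetical, chosen only to make the versioning idea concrete:

```python
import json
from pathlib import Path


def record_version(wgit_dir: Path, rel_path: Path, sha1: str) -> None:
    """Append a new version entry to a per-file metadata file, stored inside
    the repo dir at a path mirroring the tracked file's relative path."""
    meta_path = wgit_dir / (rel_path.as_posix() + ".json")
    meta_path.parent.mkdir(parents=True, exist_ok=True)
    history = json.loads(meta_path.read_text()) if meta_path.exists() else []
    history.append({"sha1": sha1, "version": len(history)})
    meta_path.write_text(json.dumps(history, indent=2))
```

Because the metadata path mirrors the tracked file's relative path, two files with identical names in different directories never collide, and each file's version history stays independent.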
-
- 14 Jun, 2022 1 commit
Riyasat Ohib authored

* [feat] Adds the implementation of the wgit add functionality, with sha1 hash creation, reference tracking, dependency graph creation, and all related functionality for the wgit add method.
* [feat] Adds the wgit add and wgit commit functionalities and major refactors.
  1. Adds the wgit add and wgit commit functionalities to the api.
  2. Introduces a new PyGit class that wraps the internal .wgit/.git repo.
  3. Refactors the Repo class in the api and introduces some methods.
  4. Refactors all the classes, which no longer use @staticmethod and now use object instances instead.
  5. Moved much of the directory path handling code from os.path to the pathlib library.
* [Feat] Combines the Repo and WeiGit classes. Separates classes into separate modules.
  1. Combines the functionalities of the WeiGit and Repo classes into a single WeiGitRepo class.
  2. Classes are now separated into their own modules.
  3. Moved some functions and staticmethods to utils.
  4. Adds a range of tests for the add and commit functionalities of weigit.
* [fix] adds a new test to the ci_test_list_3
* [fix] test fix
* [fix] test fix
* [Feat] Directory restructuring, type checking and some standardization
  1. Restructured the directory and moved wgit to fairscale/experimental/wgit so that it can be found as a package when pip installed.
  2. Added a range of type checking.
  3. Some refactors.
* [Feat][Refactor] Directory restructuring, test addition and type checking
  1. Restructured the test directory.
  2. Added and modified a few wgit tests.
  3. Added some type checking to the code.
* test fix
* setup fix and repo checking added in cli
* [Feat] Better initialization and error handling for init and wgit subcommands. Test reorg.
* [refactor] Changes in classes, encapsulation and addition of PyGit test.
* [Feat][Refactor]
  1. Changed some class method arguments for better encapsulation of Sha1_store.
  2. Moved sha1 hash calculation within sha1_store.
  3. Some standardization and cleanup of unnecessary code snippets.
  4. Added new tests for the PyGit and Sha1_Store classes.
-
- 12 Jun, 2022 1 commit
Crutcher Dunnavant authored
-
- 01 Jun, 2022 1 commit
Riyasat Ohib authored

* [feat] Adding wgit within fairscale/experimental/wgit.
* [feat] adding experimental wgit
* [feat] wgit init functionalities and skeleton for the rest.
* adapted the suggested changes
* repo class working
* [feat] wgit functionalities and skeleton. Addition of subparsers and repo class along with some changes.
* [feat] wgit functionalities and skeleton, move to subparsers and addition of Repo class
* [docs] changed a comment in .gitignore
* [refactor] changed the sequence of tests in ci_test_list2.txt
-
- 26 May, 2022 1 commit
Crutcher Dunnavant authored
-
- 25 May, 2022 1 commit
Riyasat Ohib authored

* [feat] Adding wgit within fairscale/experimental/wgit.
* [feat] adding experimental wgit
-
- 02 May, 2022 1 commit
Paul Johnson authored

[FSDP] ssd_offload: fixing backward path (grad_fn) for SsdFlatParameter and SsdFlatParameterView (#974)

* [FSDP] fixing backward path for SsdFlatParameter and SsdFlatParameterView when overriding .data
* Get ssd_offload unit tests passing
* [FSDP] get all test_fsdp_offload tests passing w/ ssd_offload on
* Update changelog
-
- 26 Apr, 2022 1 commit
Min Xu authored

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 06 Apr, 2022 1 commit
Paul Johnson authored

Improvements to ssd_offload to support pickling/unpickling SsdTensorHandle (and derived classes) (#964)

Verified that FSDP-wrapped models using ssd_offload checkpoint save and restore correctly.
-
- 14 Feb, 2022 1 commit
Min Xu authored

* update pytest versions
* [test] test related changes
  - upgrade to newer pytorch versions
  - added function to make tests more deterministic on A100 and TF32
  - fixed some tests so that they are correctly skipped on a single GPU system
* more fixes
* formatting overly long lines
* format
* better test without triggering a warning
* fix an optim state bug with newer pytorch
  - the adam optimizer seems to return "step" as a singleton tensor now in the nightly build
  - this fixes it, assuming a non-tensor value can still be loaded back by the optimizer
* improve oss.py
  - using min_loss for regression checking is a bit more reliable
  - also increased the num epochs from 10 to 12
* small oss.py fix
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Co-authored-by: Min Xu <min.xu.public@gmail.com>
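The optim-state fix mentioned above (newer PyTorch returning "step" as a singleton tensor) boils down to unwrapping 0-d tensors back to plain scalars. A duck-typed sketch that runs without torch installed; `normalize_step` is a hypothetical name, not the actual fix's function:

```python
def normalize_step(value):
    """Newer PyTorch may store an optimizer's 'step' state as a 0-d tensor
    rather than a plain int; .item() recovers the Python scalar. Duck-typed
    on .item() so the sketch works without torch installed."""
    return value.item() if hasattr(value, "item") else value
```

Applying this when saving optimizer state keeps serialized checkpoints loadable across PyTorch versions, on the assumption that the optimizer accepts a plain scalar back.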
-
- 28 Jan, 2022 1 commit
Min Xu authored

* [feat] add CosFace paper's LMCL to MEVO
  - added baseline algorithm to the reference kernel
  - added MEVO version of LMCL
  - added unit test to verify it is correct with respect to the reference, as well as its memory usage
* updated changelog

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 07 Jan, 2022 1 commit
tmarkstrum authored

* enable reduce scatter overlap with other operations
* fixed unit tests and added docstrings for the new fsdp parameters
* fixed more unit tests
* fixed unit tests
* avoided the pickle error on process_group_reduce_scatter
* removed an unnecessary parameter in unit tests
* removed unnecessary prints
* fixed the docstring
* skipped the test_offload unit test because it failed in the main branch
* removed the enable_reduce_scatter_overlap API parameter
* added docstring for the default value of the process_group_reduce_scatter parameter
* fixed a syntax bug
* fixed a bug which caused a unit test failure
* removed the all_gather in the ProcessGroupName enum
* added more comments
* changed the default value of process_group_reduce_scatter from None to ProcessGroupName.reduce_scatter
-
- 05 Jan, 2022 1 commit
Paul Johnson authored

* Enabling ssd_offload training and test via tests/nn/data_parallel/test_fsdp_offload.py.
* Removed unused classes: SsdBuffer, SsdTensorHandleView, SsdParameter, SsdTensor
* Enhance test coverage of test_ssd_offloading_train_flatten_params_wrapper
* Modifications from PR #887 review comments.
* Update Changelog
-
- 13 Dec, 2021 1 commit
Min Xu authored

- During eval, we fall back to just the output projection without fusing
- added unit test to ensure the shape is correct
-
- 12 Nov, 2021 1 commit
Anupam Bhatnagar authored

* adding pre-commit files
* applying pre-commit to all files
* adding no-strict-optional argument to mypy in circle ci config
* fix typo
* updating python versions
* [skip ci] remove extra args
* adding python 3.9
* [skip ci] set pre-commit version in requirements-dev.txt
* set CACHE_VERSION
* move linters from circleci to github actions
* update python version
* update python version in benchmarks_2
* moving to python 3.9.7
-
- 08 Nov, 2021 2 commits
anj-s authored

* update release notes
* initial commit
* lint cleanup etc.
* helper functions; lint errors
* lint errors
* lint errors
* add back the boolean for named_parameters
* address comments and fix lint
* remove unused functions and class
* remove unused state
-
Benjamin Lefaudeux authored
Add SlowMo Distributed Data Parallel for clusters with slow interconnects

Co-authored-by: Vinayak Tantia <tantia.vinayak1@gmail.com>
-
- 05 Nov, 2021 1 commit
Min Xu authored

* [feat] MEVO kernel
  - initial import from min/softmax and min/testing branches
  - need to rename and further clean up
* only test with newer pytorch
* renamed and added comments and code cleanup
* rename and reduce test memory
* testing
* minor fixing
* fixing
* more fixes
* changelog
* more 1.7 and 1.8 paper cuts
* remove dead code
* addressed Benjamin's comments
* addressed more comments

Co-authored-by: Min Xu <min.xu.public@gmail.com>
-
- 01 Nov, 2021 1 commit
anj-s authored

* add doc strings
* add lower level SSD APIs and tests
* add the test to the list to be run
* remove unused imports
* more doc string changes
* fix lint errors
-
- 27 Oct, 2021 1 commit
Eugen Hotaj authored

Fixes #827.

Co-authored-by: Eugen Hotaj <ehotaj@fb.com>
-
- 22 Oct, 2021 1 commit
Eugen Hotaj authored

auto_shard.py currently uses torch.fx to create a symbolic DAG of operations and linearizes that DAG into an nn.Sequential so it can later be used for model offloading. This works in most cases but runs into issues for certain eager mode features, such as dynamic conditionals, shape-dependent computation, etc.

This PR extends auto_shard.py to first run a preprocessing step which wraps any nn.Module which cannot be traced through. It adds a test for dynamic conditionals and updates existing failing test code. There are some immediate extensions to this approach which are marked as TODO in the code.
-
- 21 Oct, 2021 1 commit
anj-s authored

* update python version for cpu tests
* run CPU tests with updated PyTorch version
* update nightly and test PyTorch versions
* skip failing multiprocess pipe test
* always skip test
* always skip test
* always skip test
* lint error
* skip unsupported versions
* improve skip message
* lint errors
-
- 20 Oct, 2021 1 commit
Quentin Duval authored

* [feat] layer memory tracking
* [feat] layer memory tracking (add tests in CI)
* [feat] layer memory tracking: doc typos
* [feat] layer memory tracking: mypy fixes
* [feat] layer memory tracking: fixes for FSDP all gather tracking on pytorch 1.9 and above
* [feat] layer memory tracking: lint
* [feat] layer memory tracking: mypy

Co-authored-by: QuentinDuval <QuentinDuval@users.noreply.github.com>
-
- 12 Sep, 2021 1 commit
Darryl Barnhart authored

* [fix] FSDP intra-backwards gradient accumulation. Ensure gradient reduction accumulates into the unsharded gradient tensor within a backwards pass. This matters when an FSDP module is called multiple times within a forward pass, and reduction is _not_ deferred using activation checkpoint forward counters, bucketing or some other mechanism. Closes #780
* [refactor] Remove forward counters. Comments. Removed forward counters from the activation checkpointing utility, now that FSDP does not require them for correct operation. Added a more detailed comment about memory usage behaviour with gradient reduction.
* [refactor] Delete deprecated forward counter usage.
* [refactor] Add state assertion at end of pre-backward hook.
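The accumulation fix described in the first bullet amounts to summing each reduction into the existing gradient rather than overwriting it, so a module called multiple times in one forward pass contributes all of its gradients. A torch-free sketch with plain lists standing in for gradient tensors; `reduce_into` is a hypothetical name:

```python
from typing import List, Optional


def reduce_into(accum: Optional[List[float]], update: List[float]) -> List[float]:
    """Accumulate a newly reduced gradient into the existing unsharded
    gradient instead of overwriting it. With no prior gradient, the update
    becomes the gradient; otherwise the two are summed elementwise."""
    if accum is None:
        return list(update)
    return [a + u for a, u in zip(accum, update)]
```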
-
- 28 Jun, 2021 1 commit
Mehdi Mirzazadeh authored

* fixing bug in setting dependencies in partition handler
* modifying unit test so it requires the fix
* black
-
- 26 Jun, 2021 1 commit
Pavel Belevich authored
-
- 25 Jun, 2021 2 commits
Mehdi Mirzazadeh authored
-
Mehdi Mirzazadeh authored

* Preparing pipeline for newer versions of pytorch
* updated error message
-
- 22 Jun, 2021 1 commit
Pavel Belevich authored

* Update torch to 1.9.0.dev20210614+cu102
* Update config.yml
* Update config.yml
* Update setup.py
* Update config.yml
* Update config.yml
* Update config.yml
* Update config.yml
-
- 11 Jun, 2021 1 commit
anj-s authored

[Offload][feature] Add auto shard functionality to remove requirement of nn.Sequential models. (#695)

* auto wrap functionality
* lint and doc strings
* fix lint errors
* lint errors and version skips
* remove mypy checking and add conditional import
* another math.prod instance
* another import fix
* address comments
* lint errors
* address comments
* fix lint errors
* add placeholder nodes to tracker list
-
- 27 May, 2021 1 commit
msbaines authored

This change also ensures that we calculate running_{mean,var} correctly when wrapped.
-