1. 01 Nov, 2021 1 commit
    • [feat] [FSDP]: add experimental support for shared weights (#836) · f2af4c66
      Min Xu authored
      
      
      * added a new test, passing without shared weights
      
      * tested weight sharing
      
      * added the test to test list file
      
      * extended to world_size = 2
      
      * fixed test
      
      * [feat]: add limited and experimental support for shared parameter
      
      * fixed tests
      
      * simplified to work with layers that have at least one non-shared param, and added code to pick up the linked_param field for sharding the shared param (see the sketch after this commit)
      
      * fixed the case where linked param is not in separate FSDP
      
      * changelog and remove old code
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
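      A minimal sketch of the weight-sharing setup this commit targets, assuming the public fairscale.nn.FullyShardedDataParallel wrapper and an already initialized process group; the linked_param bookkeeping mentioned above is internal and not shown:
      ```
      import torch.nn as nn
      from fairscale.nn import FullyShardedDataParallel as FSDP

      class TiedAutoEncoder(nn.Module):
          """Toy module whose decoder reuses (shares) the encoder's weight."""

          def __init__(self, dim: int = 16):
              super().__init__()
              self.encoder = nn.Linear(dim, dim, bias=False)
              self.decoder = nn.Linear(dim, dim, bias=False)
              # Weight sharing: both layers point at the same Parameter object.
              self.decoder.weight = self.encoder.weight

          def forward(self, x):
              return self.decoder(self.encoder(x))

      # Wrapping the whole model in one FSDP instance keeps the shared parameter
      # inside a single sharding unit, the simplest configuration for tied weights.
      model = FSDP(TiedAutoEncoder().cuda())
      ```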
  2. 12 Sep, 2021 1 commit
    • [fix] FSDP intra-backwards gradient accumulation. (#784) · 4fa2ab9b
      Darryl Barnhart authored
      * [fix] FSDP intra-backwards gradient accumulation.
      
      Ensure gradient reduction accumulates into the unsharded gradient tensor
      within a backwards pass. This matters when an FSDP module is called
      multiple times within a forward pass, and reduction is _not_ deferred
      using activation checkpoint forward counters, bucketing or some other
      mechanism.
      
      Closes #780
      
      * [refactor] Remove forward counters. Comments.
      
      Removed forward counters from the activation checkpointing utility, now
      that FSDP does not require them for correct operation. Added a more
      detailed comment about memory usage behaviour with gradient reduction.
      
      * [refactor] Delete deprecated forward counter usage.
      
      * [refactor] Add state assertion at end of pre-backward hook.
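      A minimal sketch of the scenario fixed here, assuming fairscale.nn.FullyShardedDataParallel and an initialized process group: the same FSDP-wrapped submodule runs twice in one forward pass, so its gradient is reduced twice within a single backward pass and must accumulate rather than overwrite:
      ```
      import torch.nn as nn
      from fairscale.nn import FullyShardedDataParallel as FSDP

      class ReusedTrunk(nn.Module):
          def __init__(self, dim: int = 32):
              super().__init__()
              self.trunk = FSDP(nn.Linear(dim, dim))  # inner FSDP unit, called twice below
              self.head = nn.Linear(dim, dim)

          def forward(self, x):
              # During backward, the second reduction for `trunk` must accumulate
              # into its unsharded gradient instead of replacing the gradient
              # produced by the first call.
              return self.head(self.trunk(x) + self.trunk(x.flip(0)))

      model = FSDP(ReusedTrunk().cuda())
      ```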
  3. 06 Sep, 2021 1 commit
    • [cleanup] CI test updates; mypy cleanup; partial broadcast_object cleanup;... · 3ecf76f4
      Min Xu authored
      
      [cleanup] CI test updates; mypy cleanup; partial broadcast_object cleanup; pre-commit documentation (#744)
      
      * changelog; mypy; oss cleanup
      
      * more broadcast_object cleanup in FSDP
      
      * one more mypy fix
      
      * retire pytorch 1.6 from circleci, add new nightly, add 1.8 LTS and 1.9 stable release
      
      * update torch version for LTS
      
      * minor fixes
      
      * update cache key
      
      * trying newer gpu VMs
      
      * bump the cache
      
      * update to gpu.medium, which should be 2 GPUs
      
      * update nightly version
      
      * add pre-commit instruction
      
      * fixed CHANGELOG after merging
      
      * updated to newer nightly
      
      * retained the older broadcast function for older GPUs for oss.py
      
      * fixed a bug
      
      * added a comment
      
      * fixing a test for pytorch 1.10
      
      * testing a fix
      
      * Update fairscale/optim/oss.py
      
      * Update CONTRIBUTING.md
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
  4. 31 Jul, 2021 1 commit
  5. 21 Jun, 2021 1 commit
    • [feat] FSDP: supporting multiple flatten parameter groups (#711) · ab71efb3
      Min Xu authored
      
      
      * [feat] FSDP: supporting multiple flatten parameter groups
      
      - step 2: extending FPW (FlattenParamsWrapper) to support multiple flat param groups
      - FSDP itself still only uses one group
      - unit tests exercise the new code paths
      - updated the changelog
      
      * first cut, mypy passed
      
      * test_flatten_params_wrapper.py::TestFlattenParams tests pass
      
      * added two more test cases and fixed a case in the code
      
      * fixed one bug with param_path_infos
      
      * fixed two more tests with hardcoded flat_param names
      
      * Update CHANGELOG.md
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
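      For context, a hedged sketch of the wrapper this step extends; the fairscale.nn.misc.FlattenParamsWrapper import path and the param_list keyword for grouping are assumptions based on the tests named above, not a verified API:
      ```
      import torch.nn as nn
      from fairscale.nn.misc import FlattenParamsWrapper

      model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))

      # Default behaviour: every parameter is collapsed into one flat parameter.
      flat = FlattenParamsWrapper(model)

      # With this change the wrapper can maintain several flat parameter groups,
      # one flat buffer per group of parameters passed in (assumed keyword).
      grouped = FlattenParamsWrapper(
          model,
          param_list=[list(model[0].parameters()), list(model[2].parameters())],
      )
      ```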
  6. 08 Jun, 2021 1 commit
  7. 21 May, 2021 1 commit
    • [refactor] ShardedGradScaler init and super call (#691) · 945b9666
      Nicholas Cilfone authored
      Make ShardedGradScaler __init__ mirror GradScaler so that super() can forward parameters. Without this, one cannot configure a ShardedGradScaler object the way one can with the PyTorch native GradScaler object (see the sketch after this commit).
      Updated with the black linter.
      Added a stub for GradScaler __init__, which solves the mypy issues, and removed the ignore comment.
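      A short sketch of what the refactor enables, assuming the fairscale.optim.grad_scaler import path; the keyword arguments are the standard torch.cuda.amp.GradScaler ones that can now be forwarded through super().__init__():
      ```
      from fairscale.optim.grad_scaler import ShardedGradScaler

      # ShardedGradScaler can now be configured like the native GradScaler.
      scaler = ShardedGradScaler(
          init_scale=2.0 ** 14,
          growth_factor=2.0,
          backoff_factor=0.5,
          growth_interval=2000,
          enabled=True,
      )
      ```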
  8. 17 May, 2021 1 commit
    • [fix] auto_wrap: support wrapping based on wrapper_config (#685) · 9d2bbcf2
      Min Xu authored
      
      
      * [fix] auto_wrap: support wrapping based on wrapper_config
      
      - users can use this to avoid an assert when auto_wrap is used multiple times on a module
      - users can traverse the modules multiple times, assign a wrapper_config to each module,
        and then call auto_wrap once to wrap them (see the sketch after this commit)
      
      fix #649
      fix #585
      
      * added changelog
      
      * fix tests
      
      * fix a test
      
      * added an optional assert for collision based on discussions with Quentin
      
      * added config_auto_wrap_policy
      
      * lint
      Co-authored-by: Min Xu <min.xu.public@gmail.com>
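      A hedged sketch of the workflow described above; the wrapper_config attribute name, the wrapper_cls keyword, and the import paths are assumptions based on the commit text rather than a verified API:
      ```
      import torch.nn as nn
      from fairscale.nn import FullyShardedDataParallel as FSDP
      from fairscale.nn import auto_wrap, enable_wrap

      model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 4))

      # Traverse the model as many times as needed and annotate the children to
      # be wrapped, each with its own config (hypothetical attribute name).
      model[0].wrapper_config = {"flatten_parameters": False}
      model[2].wrapper_config = {"flatten_parameters": True}

      with enable_wrap(wrapper_cls=FSDP):
          # A single auto_wrap pass wraps the annotated children according to
          # their per-module config, avoiding repeated auto_wrap calls and the
          # double-wrap assert.
          model = auto_wrap(model)
      ```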
  9. 14 May, 2021 2 commits
  10. 11 May, 2021 1 commit
    • [fix] FSDP forward pass overlap between compute and all-gather (#671) · 8a42a8e3
      Min Xu authored
      
      
      * [fix] FSDP forward pass overlap between compute and all-gather
      
      - many thanks to @cyanguwa for the report and @QuentinDuval for debugging it
      - a new unit test is added to check for this and to ensure we detect
        issues with overlapping and cpu/gpu blocking wait calls
      
      * fix
      
      * fix
      
      * fix
      
      * better assertion outputs
      
      * fix format and tune all_gather mb for CI
      
      * more tuning with non_flatten
      
      * undo an accidental change
      
      * tuning all gather mb and del model
      
      * Update + fix overlapping test to use patched all_gather w/ delay (#672)
      
      * fixing get_cycles_per_ms
      
      * add get_smi_memory
      
      * update the docstring
      Co-authored-by: Min Xu <min.xu@acm.org>
      Co-authored-by: Myle Ott <myleott@fb.com>
  11. 08 May, 2021 1 commit
  12. 07 May, 2021 1 commit
    • [feat] experimental.nn.SyncBatchNorm: initial commit (#662) · f0a40046
      msbaines authored
      * [feat] experimental.nn.SyncBatchNorm: initial commit
      
      Fast/simple re-implementation of SyncBatchNorm.
      
      When profiling SSL Vision, I was seeing a majority of cycles spent in
      SyncBatchNorm. With this change, I see a 10% to 20% speedup on the
      model I was profiling.
      
      When running benchmarks/experimental/sync_batchnorm.py on 8 x V100,
      I get a 6x speedup:
      
      <class 'torch.nn.modules.batchnorm.BatchNorm2d'>
      Elapsed time is  0.08709120750427246
      Elapsed time is  0.12632274627685547
      Elapsed time is  0.14095258712768555
      Elapsed time is  0.16529417037963867
      Elapsed time is  0.1419970989227295
      Elapsed time is  0.15166854858398438
      Elapsed time is  0.12000870704650879
      Elapsed time is  0.17534875869750977
      <class 'torch.nn.modules.batchnorm.SyncBatchNorm'>
      Elapsed time is  2.5087168216705322
      Elapsed time is  2.497001886367798
      Elapsed time is  2.5204885005950928
      Elapsed time is  2.526789903640747
      Elapsed time is  2.5080230236053467
      Elapsed time is  2.524489641189575
      Elapsed time is  2.513214588165283
      Elapsed time is  2.5359973907470703
      <class 'fairscale.experimental.nn.sync_batchnorm.SyncBatchNorm'>
      Elapsed time is  0.4126114845275879
      Elapsed time is  0.39051294326782227
      Elapsed time is  0.40685415267944336
      Elapsed time is  0.4159870147705078
      Elapsed time is  0.42383885383605957
      Elapsed time is  0.4080159664154053
      Elapsed time is  0.41202712059020996
      Elapsed time is  0.42400121688842773
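      A drop-in usage sketch; the class path comes from the benchmark output above, and it is assumed to mirror the torch.nn.SyncBatchNorm constructor and to require an initialized process group:
      ```
      import torch.nn as nn
      from fairscale.experimental.nn.sync_batchnorm import SyncBatchNorm

      model = nn.Sequential(
          nn.Conv2d(3, 64, kernel_size=3, padding=1),
          SyncBatchNorm(64),  # instead of torch.nn.SyncBatchNorm(64)
          nn.ReLU(),
      ).cuda()
      ```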
  13. 05 May, 2021 1 commit
    • [fix] add clear_autocast_cache flag (#650) · 861b5ce2
      Min Xu authored
      
      
      * [fix] add clear_autocast_cache flag
      
      - when training an AMP model with FP32 weights, FSDP may need to
        optionally clear the autocast cache to avoid GPU OOM
      - this flag defaults to False; doing it automatically is a future TODO
      - also added a verbose flag to make print(fsdp_model) a bit shorter
      - updated the memory test to cover the new code
      - added a couple of useful functions in parallel.py and testing.py
      
      * minor
      
      * address comments
      
      * format
      
      * improve the test
      Co-authored-by: Min Xu <min.xu@acm.org>
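      A hedged sketch of the two new flags (both default to False), assuming an initialized process group and a model trained under torch.cuda.amp.autocast with FP32 weights:
      ```
      import torch.nn as nn
      from fairscale.nn import FullyShardedDataParallel as FSDP

      model = FSDP(
          nn.Linear(16, 16).cuda(),
          clear_autocast_cache=True,  # drop autocast's cached casted weight copies to avoid GPU OOM
          verbose=True,               # controls how much detail print(fsdp_model) shows
      )
      print(model)
      ```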
  14. 26 Apr, 2021 1 commit
  15. 07 Apr, 2021 1 commit
  16. 29 Mar, 2021 1 commit
  17. 28 Mar, 2021 1 commit
  18. 26 Mar, 2021 1 commit
  19. 19 Mar, 2021 1 commit
  20. 04 Mar, 2021 2 commits
    • [feat]: checkpoint and normalization (#457) · 5e64d6a7
      Min Xu authored
      * [feat]: checkpoint and normalization
      
      - added special handling of BN for track_running_stats and checkpointing
      - we test BN/LN and checkpointing
      - we test them with mixed precision
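      A minimal sketch of the tested combination, assuming checkpoint_wrapper is importable from fairscale.nn: activation checkpointing around a block containing BatchNorm, where the checkpointed re-run of the forward pass must not update the running stats a second time:
      ```
      import torch.nn as nn
      from fairscale.nn import checkpoint_wrapper

      block = checkpoint_wrapper(
          nn.Sequential(
              nn.Conv2d(3, 8, kernel_size=3, padding=1),
              nn.BatchNorm2d(8),  # track_running_stats: stats should update only once per step
              nn.ReLU(),
          )
      )
      ```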
    • [test] AdaScale & SDP/FSDP (#468) · efed9cee
      Min Xu authored
      - covers them in terms of code paths only
      - numerically, AdaScale behaves differently on SDP/FSDP than on DDP, mainly
        due to each rank's partial view of the gradients
      - this doesn't mean it is definitely not useful, but it is yet to
        be validated
      - not going to spend too much time on it until we have a real use case
  21. 02 Mar, 2021 1 commit
    • [feat] Add context manager to FSDP for easier child module wrapping (#446) · f3359550
      Sean Naren authored
      This adds a context manager that assists in wrapping child modules with shared default arguments.
      Usage:
      ```
      from fairscale.nn.misc import enable_wrap, wrap
      
      with enable_wrap(**handleful_of_important_params):
          layer_1 = wrap(torch.nn.Linear(5, 5))
          layer_2 = wrap(torch.nn.Linear(5, 5), flatten_parameters=True) # Override parameters if you'd like
      
      # without the context manager, wrap() is a no-op and returns the plain Linear layer
      layer_1 = wrap(torch.nn.Linear(5, 5))
      ```
      If not within the enable_wrap context, wrap() is a no-op. This makes it easier to annotate layers without having to copy parameter changes to each wrap site.
  22. 26 Feb, 2021 1 commit
  23. 23 Feb, 2021 1 commit
    • Add FullyShardedDataParallel (FSDP) (#413) · 15512d9e
      Myle Ott authored
      Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper.
      
      Compared to PyTorch DDP:
      * FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
      * FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
      * FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
          * all-gather parameters at start of forward pass and start of backward pass
          * reduce-scatter grads at end of backward pass
      Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
      Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
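      A minimal drop-in sketch, assuming torch.distributed has already been initialized (e.g. with the NCCL backend); where you would have wrapped the model in DistributedDataParallel, wrap it in FSDP instead:
      ```
      import torch
      import torch.nn as nn
      from fairscale.nn import FullyShardedDataParallel as FSDP

      model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()

      # reshard_after_forward=True trades ~50% extra communication (ZeRO-3 style)
      # for a lower memory footprint; False keeps DDP-level communication (ZeRO-2 style).
      fsdp_model = FSDP(model, reshard_after_forward=True)
      optimizer = torch.optim.SGD(fsdp_model.parameters(), lr=0.01)
      ```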
  24. 12 Feb, 2021 1 commit
  25. 10 Feb, 2021 1 commit
  26. 27 Jan, 2021 1 commit
  27. 21 Jan, 2021 1 commit
  28. 11 Jan, 2021 1 commit
  29. 08 Jan, 2021 2 commits
  30. 16 Dec, 2020 1 commit
    • [feat]: AdaScale work with lr_scheduler and tests, examples (#229) · d65cd838
      Min Xu authored
      * [doc]: AdaScale example and notes
      
      * formatted notes correctly as suggested by Benjamin
      
      * added feature and unit test to make sure lr_scheduler works
      
      * update the example with lr_scheduler
      
      * fixed doc with "make html"
      
      * addressed Mike's suggestions
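      A hedged, single-process sketch of the pattern this commit documents and tests (a real run would be distributed); AdaScale's num_gradients_to_accumulate argument and gain() method are assumed here, and the scheduler is advanced once per "effective" epoch of gain-weighted steps:
      ```
      import torch
      from torch.optim import SGD
      from torch.optim.lr_scheduler import LambdaLR
      from fairscale.optim import AdaScale

      model = torch.nn.Linear(4, 1)
      optim = AdaScale(SGD(model.parameters(), lr=0.1), num_gradients_to_accumulate=2)
      scheduler = LambdaLR(optim, lr_lambda=lambda epoch: 1.0 / (epoch + 1))

      data = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(20)]
      steps_per_epoch = len(data) // 2   # one optimizer step per 2 accumulated batches
      progress, last_epoch = 0.0, 0
      for i, (x, y) in enumerate(data):
          torch.nn.functional.mse_loss(model(x), y).backward()
          if (i + 1) % 2 == 0:           # step once per accumulation window
              progress += optim.gain()   # scale-invariant measure of training progress
              optim.step()
              optim.zero_grad()
              epoch = int(progress) // steps_per_epoch
              if epoch > last_epoch:     # advance the LR schedule per effective epoch
                  scheduler.step()
                  last_epoch = epoch
      ```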
  31. 01 Dec, 2020 1 commit
  32. 21 Nov, 2020 1 commit
    • [feat] ShardedDataParallel with autoreduce (#157) · ad933b34
      Benjamin Lefaudeux authored
      * rewrite using autograd and a Variable execution queue to make the reduce automatic
      * share buckets with OSS to remove duplication
      * some speed is likely still on the table, since the speed with vs. without bucketing
        does not match expectations; could be a follow-up
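      A hedged usage sketch of the OSS + ShardedDataParallel pairing with automatic reduction, assuming an already initialized process group; the argument names follow the later public API and may differ slightly from this early version:
      ```
      import torch
      from fairscale.optim.oss import OSS
      from fairscale.nn.data_parallel import ShardedDataParallel as SDP

      model = torch.nn.Linear(1024, 1024).cuda()
      optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.01)
      model = SDP(model, optimizer)  # shares the OSS buckets and reduces grads automatically

      loss = model(torch.randn(8, 1024, device="cuda")).sum()
      loss.backward()                # autograd hooks trigger the gradient reduce
      optimizer.step()
      optimizer.zero_grad()
      ```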
  33. 18 Nov, 2020 1 commit
  34. 16 Nov, 2020 1 commit
  35. 11 Nov, 2020 1 commit
  36. 10 Nov, 2020 1 commit
    • Single-process control via PipeRPCWrapper (#156) · 5d4f50fb
      Tom Birch authored
      Adds support for:
      * Reused layers (e.g. for weight sharing)
      * Lazily-constructed layers
      * Single-process control via PipeRPCWrapper
      * PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive
      
      Also added examples for multi-process and PipeRPCWrapper
  37. 28 Oct, 2020 1 commit