Commits · d60fc2841a01c5c4033a5dcee709b4fd7a3dfadd · OpenDAS / fairscale

08 Jun, 2021 1 commit

[feat] supporting multiple flatten parameter groups (step 1 and step 1.5) (#708) · d60fc284

Min Xu authored Jun 08, 2021



* refactoring FlattenParamWrapper

- use a FlatParameter class to encapsulate the logic of
  flattening and expanding into views.
- this will make it easier to have multiple groups of flatten
  parameters

* fixed testing context issues for both temp files and temp dirs

* fixing test_fsdp_metadata

* fix pickling of FlatParameter

* fixed test_fsdp_optimizer_utils.py

* minor

* fix assert

* lint

* remove nesting from the test

* step 1.5: remove the code related to unnecessary nesting support in FPW

* Update fairscale/nn/misc/flatten_params_wrapper.py
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* address comment
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
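
The flatten-and-expand idea behind the `FlatParameter` refactoring above can be illustrated with a minimal, standard-library sketch. It is not fairscale's implementation: the parameter names, values, and the `layout` bookkeeping are hypothetical, and `array`/`memoryview` stand in for tensors and tensor views.

```python
from array import array

# Toy stand-ins for module parameters (hypothetical names and values).
params = {
    "linear.weight": [1.0, 2.0, 3.0, 4.0],
    "linear.bias": [0.5, 0.25],
}

# Flatten: concatenate every parameter into one contiguous buffer and
# record where each one starts and how many elements it holds.
flat = array("d")
layout = {}
for name, values in params.items():
    layout[name] = (len(flat), len(values))
    flat.extend(values)

# Expand: rebuild per-parameter views that share memory with the flat buffer.
buffer = memoryview(flat)
views = {
    name: buffer[start:start + numel]
    for name, (start, numel) in layout.items()
}

# An in-place update of the flat buffer is visible through the views,
# because the views do not copy the data.
flat[4] = -1.0
print(views["linear.bias"].tolist())
```

Keeping the offsets in a separate table is what would let several such flat buffers coexist, one per parameter group.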

17 May, 2021 1 commit

[fix] auto_wrap: support wrapping based on wrapper_config (#685) · 9d2bbcf2

Min Xu authored May 17, 2021



* [fix] auto_wrap: support wrapping based on wrapper_config

- user can use this to avoid assert if auto_wrap is used multiple times on a module
- user can traverse the modules multiple times and assign a wrapper_config
  to the module and then use auto_wrap once to wrap them

fix #649
fix #585

* added changelog

* fix tests

* fix a test

* added an optional assert for collision based on discussions with Quentin

* added config_auto_wrap_policy

* lint
Co-authored-by: Min Xu <min.xu.public@gmail.com>

14 May, 2021 1 commit

FSDP: Fix saving and loading checkpoints with use_sharded_state=True (#574) · 468874c8

Shruti Bhosale authored May 14, 2021



* fix saving and loading checkpoints with use_sharded_state=True

* mypy fix

* better fix of the infinite recursion

- we need to specifically call FSDP.state_dict from its local state_dict
- added unit test that fails without the fix and works with the fix
- fixed mypy for the overloaded functions

* make cpu-only fsdp work for state_dict at least
Co-authored-by: Min Xu <min.xu@acm.org>
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: Min Xu <m1n@fb.com>

08 May, 2021 1 commit
- [fix] nn.moe: softmax should be done in FP32 (#668) · 002aae63
  msbaines authored May 08, 2021
  Co-authored-by: @myleott
26 Apr, 2021 1 commit

[fix]: let FSDP handle model with multiple forward pass and checkpoint (#621) · a1612d79

Min Xu authored Apr 26, 2021



* [fix]: let FSDP handle model with multiple forward pass and checkpoint

* try CI again

* save

* save

* fixed case with bn

* minor

* add the new file

* minor

* added test of a single case, runtime is about 50s

* enable all 8 test cases

* cleanup

* cleanup

* skip flatten case with 1.6 and 1.7

* minor
Co-authored-by: Min Xu <min.xu@acm.org>

04 Mar, 2021 1 commit

[feat]: checkpoint and normalization (#457) · 5e64d6a7

Min Xu authored Mar 04, 2021

* [feat]: checkpoint and normalization

- added special handling of BN for track_running_stats and checkpointing
- we test BN/LN and checkpointing
- we test them with mixed precision

02 Mar, 2021 1 commit

[feat] Add context manager to FSDP for easier child module wrapping (#446) · f3359550

Sean Naren authored Mar 02, 2021

This adds a context manager that assists in making child modules with similar defaults.
Usage:
```
import torch
from fairscale.nn.misc import enable_wrap, wrap

# handleful_of_important_params: keyword arguments shared by the wrapped layers, defined elsewhere
with enable_wrap(**handleful_of_important_params):
    layer_1 = wrap(torch.nn.Linear(5, 5))
    layer_2 = wrap(torch.nn.Linear(5, 5), flatten_parameters=True)  # Override parameters if you'd like

# Outside the context manager, wrap() is a no-op and returns the plain Linear layer
layer_1 = wrap(torch.nn.Linear(5, 5))
```
Outside the FSDP context, `wrap` is a no-op. This makes it easier to annotate layers without repeating the same parameters at every call site.

23 Feb, 2021 1 commit

Add FullyShardedDataParallel (FSDP) (#413) · 15512d9e

Myle Ott authored Feb 22, 2021

Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper.

Compared to PyTorch DDP:
* FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
* FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
* FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
    * all-gather parameters at start of forward pass and start of backward pass
    * reduce-scatter grads at end of backward pass
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
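
The 50% figure in the comparison above follows from simple counting. The sketch below assumes the common cost model in which an all-reduce is a reduce-scatter followed by an all-gather, with each collective moving one unit (roughly one copy of the parameters per GPU); the unit values and variable names are illustrative, not measurements.

```python
# Communication volume per training step, in units of "one copy of the
# model parameters" moved per GPU. An all-reduce is modeled as a
# reduce-scatter followed by an all-gather, each costing one unit.
REDUCE_SCATTER = 1.0
ALL_GATHER = 1.0

# DDP: one all-reduce of the gradients.
ddp = REDUCE_SCATTER + ALL_GATHER

# FSDP, reshard_after_forward=False: all-gather params once before
# forward, reduce-scatter grads at the end of backward.
fsdp_keep = ALL_GATHER + REDUCE_SCATTER

# FSDP, reshard_after_forward=True: all-gather params before forward and
# again before backward, then reduce-scatter grads.
fsdp_reshard = ALL_GATHER + ALL_GATHER + REDUCE_SCATTER

print(fsdp_keep / ddp, fsdp_reshard / ddp)  # 1.0 1.5
```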

12 Feb, 2021 1 commit

[feature-fix-refactor][ShardedDDP] Make it possible to change trainability graph on the fly (#369) · 13445c55

Benjamin Lefaudeux authored Feb 11, 2021

* Better unit testing
* Make it possible to refresh the DDP assumptions when the model has changed. Make it optional so that you save some time
* Enabling accumulation tests

10 Nov, 2020 1 commit

Single-process control via PipeRPCWrapper (#156) · 5d4f50fb

Tom Birch authored Nov 10, 2020

Adds support for:
* Reused layers (e.g. for weight sharing)
* Lazily-constructed layers
* Single-process control via PipeRPCWrapper
* PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop for each rank/worker to process either activations or gradients as they arrive

Also added examples for multi-process and PipeRPCWrapper

17 Sep, 2020 1 commit

Multi-process pipe (#90) · 63f7796a

Tom Birch authored Sep 17, 2020

Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines)
* Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess, but also supporting PipelineStyle.MultiProcess
* Added support for lazy construction of modules (see lazy_construction for an example)
* Added two implementations of inter-process communication: one based on rpc with globally visible queues, one based on send/recv
* Copied all the relevant tests from tests/pipe to tests/pipe_process and modified them to exercise PipelineStyle.MultiProcess

31 Jul, 2020 2 commits
- [feat] Model parallel (#3) · 30f5009a
  Tom Birch authored Jul 22, 2020
  
- [fix] add TransformerEncoderLayer to stubs (#5) · 63b5b166
  Jun Ru Anderson authored Jul 21, 2020
  Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
08 Jul, 2020 1 commit
- Initial commit · 0cd65242
  Mandeep Singh Baines authored Jul 07, 2020
  