- 01 Nov, 2021 1 commit
Min Xu authored
* added a new test, passing without shared weights
* tested weight sharing
* added the test to the test list file
* extended to world_size = 2
* fixed test
* [feat]: add limited and experimental support for shared parameters
* fixed tests
* simplified to work with layers that have at least one non-shared param, and added code to pick up the linked_param field for sharding the shared param
* fixed the case where the linked param is not in a separate FSDP instance
* changelog and removed old code

Co-authored-by: Min Xu <min.xu.public@gmail.com>
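A minimal sketch of the weight-sharing setup this commit targets (illustrative only; module names and sizes are made up, and constructing FSDP requires an initialized process group):

```
import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

class TiedModel(nn.Module):
    """Two layers sharing one weight; each still owns a non-shared bias."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.embed = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, dim)
        self.head.weight = self.embed.weight  # tie the weights

    def forward(self, x):
        return self.head(torch.relu(self.embed(x)))

# Keeping the tied parameter inside a single FSDP instance is the supported
# layout here; e.g. model = FSDP(TiedModel()) after torch.distributed init.
```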
- 12 Sep, 2021 1 commit
Darryl Barnhart authored
* [fix] FSDP intra-backwards gradient accumulation. Ensure gradient reduction accumulates into the unsharded gradient tensor within a backwards pass. This matters when an FSDP module is called multiple times within a forward pass and reduction is _not_ deferred using activation-checkpoint forward counters, bucketing, or some other mechanism. Closes #780
* [refactor] Remove forward counters. Comments. Removed forward counters from the activation checkpointing utility, now that FSDP does not require them for correct operation. Added a more detailed comment about memory usage behaviour with gradient reduction.
* [refactor] Delete deprecated forward counter usage.
* [refactor] Add state assertion at end of pre-backward hook.
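A sketch of the pattern the fix covers, assuming an initialized process group (the module and dimensions here are illustrative):

```
import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

class ReusedChild(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.inner = FSDP(nn.Linear(dim, dim))  # FSDP-wrapped child

    def forward(self, x):
        # The same FSDP child runs twice in one forward pass, so the grads
        # reduced for each call must accumulate into one unsharded tensor.
        return self.inner(self.inner(x))

# outer = FSDP(ReusedChild())
# outer(torch.randn(4, 16)).sum().backward()
```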
- 31 Jul, 2021 1 commit
Myle Ott authored
* Add test (broken) for gradient accumulation without the no_sync context manager
* changelog
* no_sync to grad_acc renaming for tests
* clean up tmp files
* support grad acc without no_sync
* minor
* update changelog
* Update fairscale/nn/data_parallel/fully_sharded_data_parallel.py with a better assertion from Sam
* lint

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
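Two hedged sketches of gradient accumulation with an FSDP model: the plain loop this change makes work, and the pre-existing no_sync() variant. The loss computation and optimizer handling are placeholders, not fairscale code.

```
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP

def accumulate_plain(model: FSDP, batches, optimizer, every: int = 2):
    # After this change, backward() may simply be called on consecutive
    # micro-batches; gradients accumulate without a special context.
    for step, batch in enumerate(batches):
        model(batch).sum().backward()
        if (step + 1) % every == 0:
            optimizer.step()
            optimizer.zero_grad()

def accumulate_no_sync(model: FSDP, batches, optimizer):
    # The previously required pattern: skip gradient reduction for all
    # but the last micro-batch via the no_sync() context manager.
    with model.no_sync():
        for batch in batches[:-1]:
            model(batch).sum().backward()
    model(batches[-1]).sum().backward()
    optimizer.step()
    optimizer.zero_grad()
```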
- 21 Jun, 2021 1 commit
Min Xu authored
* [feat] FSDP: supporting multiple flatten parameter groups - step 2:
  - extend FPW to support multiple flat parameter groups
  - FSDP still only uses one group
  - unit tests exercise the new code paths
  - updated the changelog
* first cut, mypy passed
* test_flatten_params_wrapper.py::TestFlattenParams tests pass
* added two more test cases and fixed a case in the code
* fixed one bug with param_path_infos
* fixed two more tests with hardcoded flat_param names
* Update CHANGELOG.md

Co-authored-by: Min Xu <min.xu.public@gmail.com>
- 08 Jun, 2021 1 commit
Min Xu authored
* refactoring FlattenParamsWrapper
  - use a FlatParameter class to encapsulate the logic of flattening and expanding into views
  - this will make it easier to have multiple groups of flattened parameters
* fixed testing context issues for both temp files and temp dirs
* fixing test_fsdp_metadata
* fix pickling of FlatParameter
* fixed test_fsdp_optimizer_utils.py
* minor
* fix assert
* lint
* remove nesting from the test
* step 1.5: remove the code related to unnecessary nesting support in FPW
* Update fairscale/nn/misc/flatten_params_wrapper.py
* address comment

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
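A minimal sketch of the flatten-and-view idea that FlatParameter encapsulates (illustrative helper functions, not the fairscale implementation):

```
import math
import torch
import torch.nn as nn

def flatten(params):
    """Concatenate parameters into one flat tensor, remembering shapes."""
    shapes = [p.shape for p in params]
    flat = nn.Parameter(torch.cat([p.detach().reshape(-1) for p in params]))
    return flat, shapes

def as_views(flat, shapes):
    """Expand the flat tensor back into per-parameter views."""
    views, offset = [], 0
    for shape in shapes:
        numel = math.prod(shape)
        views.append(flat[offset:offset + numel].view(shape))
        offset += numel
    return views

layers = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
flat, shapes = flatten(list(layers.parameters()))
weights_and_biases = as_views(flat, shapes)  # same storage, original shapes
```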
- 17 May, 2021 1 commit
Min Xu authored
* [fix] auto_wrap: support wrapping based on wrapper_config
  - users can use this to avoid an assert when auto_wrap is used multiple times on a module
  - users can traverse the modules multiple times, assign a wrapper_config to each module, and then use auto_wrap once to wrap them
  - fixes #649 and #585
* added changelog
* fix tests
* fix a test
* added an optional assert for collisions, based on discussions with Quentin
* added config_auto_wrap_policy
* lint

Co-authored-by: Min Xu <min.xu.public@gmail.com>
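For context, a hedged sketch of auto-wrapping with FSDP. Import paths and keyword names (e.g. wrapper_cls) have shifted between fairscale releases, and building FSDP instances requires an initialized process group, so treat this as illustrative rather than a pinned API:

```
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import enable_wrap, auto_wrap

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

with enable_wrap(wrapper_cls=FSDP):
    # auto_wrap walks the module tree and wraps eligible children; the
    # wrapper_config mechanism from this commit lets a policy tag modules
    # ahead of time and then wrap them all in a single auto_wrap pass.
    sharded = auto_wrap(model)
```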
- 14 May, 2021 1 commit
Shruti Bhosale authored
* fix saving and loading checkpoints with use_sharded_state=True
* mypy fix
* better fix of the infinite recursion
  - we need to specifically call FSDP.state_dict from its local state_dict
  - added a unit test that fails without the fix and passes with the fix
  - fixed mypy for the overloaded functions
* make cpu-only FSDP work for state_dict at least

Co-authored-by: Min Xu <min.xu@acm.org>
Co-authored-by: Min Xu <min.xu.public@gmail.com>
Co-authored-by: Min Xu <m1n@fb.com>
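A hedged sketch of per-rank sharded checkpointing around FSDP's local state dict, the code path this fix touches. Function names and file layout here are illustrative, and it assumes fairscale's local_state_dict/load_local_state_dict helpers; the use_sharded_state flag from the commit lives in the caller, not shown.

```
import torch
from fairscale.nn import FullyShardedDataParallel as FSDP

def save_sharded(model: FSDP, rank: int, prefix: str = "ckpt"):
    # Each rank saves only its own shard of the (flattened) parameters.
    torch.save(model.local_state_dict(), f"{prefix}-rank{rank}.pt")

def load_sharded(model: FSDP, rank: int, prefix: str = "ckpt"):
    shard = torch.load(f"{prefix}-rank{rank}.pt")
    model.load_local_state_dict(shard)
```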
- 08 May, 2021 1 commit
msbaines authored
Co-authored-by: @myleott
- 26 Apr, 2021 1 commit
Min Xu authored
* [fix]: let FSDP handle models with multiple forward passes and checkpointing
* try CI again
* save
* save
* fixed case with bn
* minor
* add the new file
* minor
* added test of a single case, runtime is about 50s
* enable all 8 test cases
* cleanup
* cleanup
* skip flatten case with 1.6 and 1.7
* minor

Co-authored-by: Min Xu <min.xu@acm.org>
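A sketch of the shape of model this fix targets: a checkpointed, FSDP-wrapped child invoked more than once per forward pass. It assumes an initialized process group, and the checkpoint_wrapper import path has varied across fairscale versions.

```
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.checkpoint import checkpoint_wrapper

class TwoPass(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.block = FSDP(checkpoint_wrapper(nn.Linear(dim, dim)))

    def forward(self, x):
        return self.block(x) + self.block(x)  # same child used twice
```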
- 04 Mar, 2021 1 commit
Min Xu authored
* [feat]: checkpoint and normalization
  - added special handling of BN for track_running_stats and checkpointing
  - we test BN/LN and checkpointing
  - we test them with mixed precision
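A small sketch of the situation being tested: a BatchNorm layer inside a checkpointed block, where the forward replay in the backward pass must not update the running statistics a second time (illustrative; assumes fairscale's checkpoint_wrapper):

```
import torch
import torch.nn as nn
from fairscale.nn.checkpoint import checkpoint_wrapper

block = checkpoint_wrapper(
    nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8), nn.ReLU())
)
x = torch.randn(4, 8, requires_grad=True)
block(x).sum().backward()  # forward is recomputed during backward
```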
- 02 Mar, 2021 1 commit
Sean Naren authored
This adds a context manager that assists in making child modules with similar defaults. Usage:

```
from fairscale.nn.misc import enable_wrap, wrap

with enable_wrap(**handleful_of_important_params):
    layer_1 = wrap(torch.nn.Linear(5, 5))
    # Override parameters if you'd like
    layer_2 = wrap(torch.nn.Linear(5, 5), flatten_parameters=True)

# without the context manager, creates Linear layer
layer_1 = wrap(torch.nn.Linear(5, 5))
```

If not within the FSDP context, this would be a no-op. This makes it easier to annotate layers without having to copy any changes in parameters.
- 23 Feb, 2021 1 commit
Myle Ott authored
Recent work by [Microsoft](https://arxiv.org/abs/1910.02054) and [Google](https://arxiv.org/abs/2004.13336) has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new **`FullyShardedDataParallel` (FSDP)** wrapper, which is a drop-in replacement for PyTorch's `DistributedDataParallel` (DDP) wrapper. Compared to PyTorch DDP:

* FSDP shards parameters (FP16 + FP32) and optimizer state across data parallel GPUs
* FSDP with `reshard_after_forward=False` has the same communication cost as PyTorch DDP and is similar to ZeRO-2
* FSDP with `reshard_after_forward=True` increases total communication by 50% and is similar to ZeRO-3:
  * all-gather parameters at start of forward pass and start of backward pass
  * reduce-scatter grads at end of backward pass

Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
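A minimal usage sketch of FSDP as a drop-in DDP replacement (illustrative model and hyperparameters; assumes torch.distributed is already initialized on each rank):

```
import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

def build(rank: int):
    model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 8)).cuda(rank)
    # reshard_after_forward=True trades roughly 50% more communication
    # for lower peak memory (ZeRO-3-like); False is ZeRO-2-like.
    fsdp_model = FSDP(model, reshard_after_forward=True)
    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-3)
    return fsdp_model, optimizer
```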
- 12 Feb, 2021 1 commit
Benjamin Lefaudeux authored
* Better unit testing
* Make it possible to refresh the DDP assumptions when the model has changed. Make it optional so that you save some time
* Enabling accumulation tests
- 10 Nov, 2020 1 commit
Tom Birch authored
Adds support for:
* Reused layers (e.g. for weight sharing)
* Lazily-constructed layers
* Single-process control via PipeRPCWrapper
* PipelineStyle.AsyncSchedule, which lays the foundation for asynchronous pipeline work by introducing an event loop on each rank/worker to process either activations or gradients as they arrive

Also added examples for multi-process and PipeRPCWrapper.
- 17 Sep, 2020 1 commit
Tom Birch authored
Adds support for distributing pipeline stages across multiple processes (and therefore multiple machines):
* Adds a style argument to the Pipe constructor, defaulting to PipelineStyle.SingleProcess but also supporting PipelineStyle.MultiProcess
* Added support for lazy construction of modules (see lazy_construction for an example)
* Added two implementations of inter-process communication: one based on RPC with globally visible queues, one based on send/recv
* Copied all the relevant tests from tests/pipe to tests/pipe_process and modified them to exercise PipelineStyle.MultiProcess
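For reference, the default single-process path looks like this hedged sketch; the multi-process style added here additionally requires RPC/process-group setup that is not shown, and the example assumes at least two CUDA devices are available:

```
import torch.nn as nn
from fairscale.nn import Pipe

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
# Split the sequential into two balanced stages and pipeline micro-batches
# across them; devices default to the available CUDA devices.
pipe = Pipe(model, balance=[2, 1], chunks=4)
```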
- 31 Jul, 2020 2 commits
Tom Birch authored
Jun Ru Anderson authored
Co-authored-by: Jun Ru Anderson <andersonic@fb.com>
- 08 Jul, 2020 1 commit
Mandeep Singh Baines authored