Unverified Commit 2b15720b authored by Min Xu, committed by GitHub

[docs] fsdp changelog and doc (#414)

parent 15512d9e
@@ -5,7 +5,13 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## NEXT - TBD
+### Added
+- FullyShardedDataParallel (FSDP) ([#413](https://github.com/facebookresearch/fairscale/issues/413))
### Fixed
- Catch corner case when the model is too small with respect to the world size, and shards are empty ([#406] (https://github.com/facebookresearch/fairscale/pull/406))
## [0.1.7] - 2021-02-19
...
@@ -9,4 +9,5 @@ API Reference
optim/grad_scaler
nn/pipe
nn/sharded_ddp
+nn/fsdp
nn/misc/checkpoint_activations
...
+FullyShardedDataParallel
+========================
+.. autoclass:: fairscale.nn.FullyShardedDataParallel
+    :members:
+    :undoc-members:
@@ -23,6 +23,7 @@ Components
* `Optimizer state sharding <../../en/latest/api/optim/oss.html>`_
* `Sharded grad scaler - automatic mixed precision <../../en/latest/api/optim/grad_scaler.html>`_
* `Sharded distributed data parallel <../../en/latest/api/nn/sharded_ddp.html>`_
+* `Fully Sharded Data Parallel FSDP <../../en/latest/api/nn/fsdp.html>`_
* Optimization at scale:
* `AdaScale SGD <../../en/latest/api/optim/adascale.html>`_
...
@@ -3,13 +3,14 @@
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
-from .data_parallel import ShardedDataParallel
+from .data_parallel import FullyShardedDataParallel, ShardedDataParallel
from .misc import FlattenParamsWrapper
from .moe import MOELayer, Top2Gate
from .pipe import Pipe, PipeRPCWrapper
__all__ = [
    "FlattenParamsWrapper",
+    "FullyShardedDataParallel",
    "LazyModule",
    "Pipe",
    "PipeRPCWrapper",
...
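For readers following along, the net effect of this `__init__.py` change is that the class becomes importable from the package root as well as from its submodule. A minimal sketch, assuming an installed fairscale build that contains this commit:

```python
# Import sketch (assumption: a fairscale build that includes this commit).
# Both paths below refer to the same class after the __init__.py change.
from fairscale.nn import FullyShardedDataParallel as FSDP        # new top-level export
from fairscale.nn.data_parallel import FullyShardedDataParallel  # submodule path

assert FSDP is FullyShardedDataParallel  # same class object
```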
@@ -85,30 +85,40 @@ class FullyShardedDataParallel(nn.Module):
    )
Args:
-    module (nn.Module): module to checkpoint
-    process_group (Optional): process group for sharding
-    reshard_after_forward (bool, Optional): if ``True``, reshard parameters
+    module (nn.Module):
+        module to checkpoint
+    process_group (Optional):
+        process group for sharding
+    reshard_after_forward (bool, Optional):
+        if ``True``, reshard parameters
        after the forward pass. This saves memory but slows training. This
        is only relevant when resharding individual layers.
-    mixed_precision (bool, Optional): if ``True``, inputs, activations and
+    mixed_precision (bool, Optional):
+        if ``True``, inputs, activations and
        gradients will be kept in FP16; computation and communication will
        occur in FP16; and a (sharded) master copy of the model weights will
        be maintained in FP32.
-    fp32_reduce_scatter (bool, Optional): if ``True``, then reduce-scatter
+    fp32_reduce_scatter (bool, Optional):
+        if ``True``, then reduce-scatter
        gradients in FP32. This is only relevant when *``mixed_precision``*
        is ``True``.
-    flatten_parameters (bool, Optional): if ``True``, flatten parameters
+    flatten_parameters (bool, Optional):
+        if ``True``, flatten parameters
        into a single contiguous tensor, which improves training speed.
-    cpu_offload (bool, Optional): if ``True``, offload FP32 params to CPU.
+    cpu_offload (bool, Optional):
+        if ``True``, offload FP32 params to CPU.
        This is only relevant when *``mixed_precision``* is ``True``.
-    compute_dtype (torch.dtype, Optional): dtype for full parameters for
+    compute_dtype (torch.dtype, Optional):
+        dtype for full parameters for
        computation. This defaults to ``torch.float32`` unless
        *``mixed_precision``* is set, in which case it defaults to
        ``torch.float16``.
-    move_grads_to_cpu (bool, Optional): move gradient shard to CPU after
+    move_grads_to_cpu (bool, Optional):
+        move gradient shard to CPU after
        reduction. This is useful when combined with CPU-based optimizers.
        It defaults to the value of *``cpu_offload``*.
-    bucket_cap_mb (int, Optional): FSDP will bucket parameters so that
+    bucket_cap_mb (int, Optional):
+        FSDP will bucket parameters so that
        gradient reduction can potentially overlap with backward
        computation. bucket_cap_mb controls the bucket size in MegaBytes
        (MB). Buckets are sub-divided based on world_size, so the max shard
...
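To make the constructor arguments documented above concrete, here is a usage sketch that is not part of the commit: it assumes `torch.distributed` has already been initialized (for example via `torchrun`) and that a CUDA device is available, and it uses an illustrative toy model; the flag values simply exercise options described in the docstring.

```python
# Usage sketch for the documented FSDP arguments (illustrative, not from the commit).
# Assumptions: torch.distributed is initialized (e.g. launched with torchrun) and a
# CUDA device is available; the toy model and chosen flag values are examples only.
import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP


def build_wrapped_model() -> FSDP:
    model = nn.Sequential(
        nn.Linear(1024, 1024),
        nn.ReLU(),
        nn.Linear(1024, 1024),
    ).cuda()
    return FSDP(
        model,
        reshard_after_forward=True,  # free full params after forward; saves memory
        mixed_precision=True,        # FP16 compute/communication, FP32 sharded master weights
        flatten_parameters=True,     # single contiguous tensor for faster training
        move_grads_to_cpu=False,     # keep gradient shards on GPU
    )


# Training then looks like any other nn.Module:
#   wrapped = build_wrapped_model()
#   loss = wrapped(torch.randn(8, 1024, device="cuda")).sum()
#   loss.backward()
```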