Unverified Commit 2b15720b authored by Min Xu, committed by GitHub

[docs] fsdp changelog and doc (#414)

parent 15512d9e
@@ -5,7 +5,13 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## NEXT - TBD
+### Added
+- FullyShardedDataParallel (FSDP) ([#413](https://github.com/facebookresearch/fairscale/issues/413))
### Fixed
- Catch corner case when the model is too small with respect to the world size, and shards are empty ([#406] (https://github.com/facebookresearch/fairscale/pull/406))
## [0.1.7] - 2021-02-19
...
@@ -9,4 +9,5 @@ API Reference
optim/grad_scaler
nn/pipe
nn/sharded_ddp
+nn/fsdp
nn/misc/checkpoint_activations
...
+FullyShardedDataParallel
+========================
+.. autoclass:: fairscale.nn.FullyShardedDataParallel
+    :members:
+    :undoc-members:
@@ -23,6 +23,7 @@ Components
* `Optimizer state sharding <../../en/latest/api/optim/oss.html>`_
* `Sharded grad scaler - automatic mixed precision <../../en/latest/api/optim/grad_scaler.html>`_
* `Sharded distributed data parallel <../../en/latest/api/nn/sharded_ddp.html>`_
+* `Fully Sharded Data Parallel FSDP <../../en/latest/api/nn/fsdp.html>`_
* Optimization at scale:
* `AdaScale SGD <../../en/latest/api/optim/adascale.html>`_
...
@@ -3,13 +3,14 @@
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
-from .data_parallel import ShardedDataParallel
+from .data_parallel import FullyShardedDataParallel, ShardedDataParallel
from .misc import FlattenParamsWrapper
from .moe import MOELayer, Top2Gate
from .pipe import Pipe, PipeRPCWrapper
__all__ = [
    "FlattenParamsWrapper",
+    "FullyShardedDataParallel",
    "LazyModule",
    "Pipe",
    "PipeRPCWrapper",
...
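For readers following along, the net effect of this `__init__.py` change is that the class becomes importable from the package root as well as from its submodule. A minimal sketch, assuming an installed fairscale build that contains this commit:

```python
# Import sketch (assumption: a fairscale build that includes this commit).
# Both paths below refer to the same class after the __init__.py change.
from fairscale.nn import FullyShardedDataParallel as FSDP        # new top-level export
from fairscale.nn.data_parallel import FullyShardedDataParallel  # submodule path

assert FSDP is FullyShardedDataParallel  # same class object
```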
@@ -85,30 +85,40 @@ class FullyShardedDataParallel(nn.Module):
    )
Args:
-    module (nn.Module): module to checkpoint
-    process_group (Optional): process group for sharding
-    reshard_after_forward (bool, Optional): if ``True``, reshard parameters
+    module (nn.Module):
+        module to checkpoint
+    process_group (Optional):
+        process group for sharding
+    reshard_after_forward (bool, Optional):
+        if ``True``, reshard parameters
        after the forward pass. This saves memory but slows training. This
        is only relevant when resharding individual layers.
-    mixed_precision (bool, Optional): if ``True``, inputs, activations and
+    mixed_precision (bool, Optional):
+        if ``True``, inputs, activations and
        gradients will be kept in FP16; computation and communication will
        occur in FP16; and a (sharded) master copy of the model weights will
        be maintained in FP32.
-    fp32_reduce_scatter (bool, Optional): if ``True``, then reduce-scatter
+    fp32_reduce_scatter (bool, Optional):
+        if ``True``, then reduce-scatter
        gradients in FP32. This is only relevant when *``mixed_precision``*
        is ``True``.
-    flatten_parameters (bool, Optional): if ``True``, flatten parameters
+    flatten_parameters (bool, Optional):
+        if ``True``, flatten parameters
        into a single contiguous tensor, which improves training speed.
-    cpu_offload (bool, Optional): if ``True``, offload FP32 params to CPU.
+    cpu_offload (bool, Optional):
+        if ``True``, offload FP32 params to CPU.
        This is only relevant when *``mixed_precision``* is ``True``.
-    compute_dtype (torch.dtype, Optional): dtype for full parameters for
+    compute_dtype (torch.dtype, Optional):
+        dtype for full parameters for
        computation. This defaults to ``torch.float32`` unless
        *``mixed_precision``* is set, in which case it defaults to
        ``torch.float16``.
-    move_grads_to_cpu (bool, Optional): move gradient shard to CPU after
+    move_grads_to_cpu (bool, Optional):
+        move gradient shard to CPU after
        reduction. This is useful when combined with CPU-based optimizers.
        It defaults to the value of *``cpu_offload``*.
-    bucket_cap_mb (int, Optional): FSDP will bucket parameters so that
+    bucket_cap_mb (int, Optional):
+        FSDP will bucket parameters so that
        gradient reduction can potentially overlap with backward
        computation. bucket_cap_mb controls the bucket size in MegaBytes
        (MB). Buckets are sub-divided based on world_size, so the max shard
...
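To make the constructor arguments documented above concrete, here is a usage sketch that is not part of the commit: it assumes `torch.distributed` has already been initialized (for example via `torchrun`) and that a CUDA device is available, and it uses an illustrative toy model; the flag values simply exercise options described in the docstring.

```python
# Usage sketch for the documented FSDP arguments (illustrative, not from the commit).
# Assumptions: torch.distributed is initialized (e.g. launched with torchrun) and a
# CUDA device is available; the toy model and chosen flag values are examples only.
import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP


def build_wrapped_model() -> FSDP:
    model = nn.Sequential(
        nn.Linear(1024, 1024),
        nn.ReLU(),
        nn.Linear(1024, 1024),
    ).cuda()
    return FSDP(
        model,
        reshard_after_forward=True,  # free full params after forward; saves memory
        mixed_precision=True,        # FP16 compute/communication, FP32 sharded master weights
        flatten_parameters=True,     # single contiguous tensor for faster training
        move_grads_to_cpu=False,     # keep gradient shards on GPU
    )


# Training then looks like any other nn.Module:
#   wrapped = build_wrapped_model()
#   loss = wrapped(torch.randn(8, 1024, device="cuda")).sum()
#   loss.backward()
```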