Unverified Commit 2b15720b authored by Min Xu, committed by GitHub

[docs] fsdp changelog and doc (#414)

parent 15512d9e
@@ -5,7 +5,13 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## NEXT - TBD
### Added
- FullyShardedDataParallel (FSDP) ([#413](https://github.com/facebookresearch/fairscale/issues/413))
### Fixed
- Catch the corner case where the model is too small relative to the world size and shards are empty ([#406](https://github.com/facebookresearch/fairscale/pull/406))
## [0.1.7] - 2021-02-19
......
@@ -9,4 +9,5 @@ API Reference
optim/grad_scaler
nn/pipe
nn/sharded_ddp
nn/fsdp
nn/misc/checkpoint_activations
FullyShardedDataParallel
========================
.. autoclass:: fairscale.nn.FullyShardedDataParallel
    :members:
    :undoc-members:
@@ -23,6 +23,7 @@ Components
* `Optimizer state sharding <../../en/latest/api/optim/oss.html>`_
* `Sharded grad scaler - automatic mixed precision <../../en/latest/api/optim/grad_scaler.html>`_
* `Sharded distributed data parallel <../../en/latest/api/nn/sharded_ddp.html>`_
* `Fully Sharded Data Parallel (FSDP) <../../en/latest/api/nn/fsdp.html>`_
* Optimization at scale:
* `AdaScale SGD <../../en/latest/api/optim/adascale.html>`_
......
@@ -3,13 +3,14 @@
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
from .data_parallel import ShardedDataParallel
from .data_parallel import FullyShardedDataParallel, ShardedDataParallel
from .misc import FlattenParamsWrapper
from .moe import MOELayer, Top2Gate
from .pipe import Pipe, PipeRPCWrapper
__all__ = [
"FlattenParamsWrapper",
"FullyShardedDataParallel",
"LazyModule",
"Pipe",
"PipeRPCWrapper",
......
@@ -85,30 +85,40 @@ class FullyShardedDataParallel(nn.Module):
)
Args:
module (nn.Module): module to checkpoint
process_group (Optional): process group for sharding
reshard_after_forward (bool, Optional): if ``True``, reshard parameters
module (nn.Module):
module to be wrapped and sharded
process_group (Optional):
process group for sharding
reshard_after_forward (bool, Optional):
if ``True``, reshard parameters
after the forward pass. This saves memory but slows training. This
is only relevant when resharding individual layers.
mixed_precision (bool, Optional): if ``True``, inputs, activations and
mixed_precision (bool, Optional):
if ``True``, inputs, activations and
gradients will be kept in FP16; computation and communication will
occur in FP16; and a (sharded) master copy of the model weights will
be maintained in FP32.
fp32_reduce_scatter (bool, Optional): if ``True``, then reduce-scatter
fp32_reduce_scatter (bool, Optional):
if ``True``, then reduce-scatter
gradients in FP32. This is only relevant when *``mixed_precision``*
is ``True``.
flatten_parameters (bool, Optional): if ``True``, flatten parameters
flatten_parameters (bool, Optional):
if ``True``, flatten parameters
into a single contiguous tensor, which improves training speed.
cpu_offload (bool, Optional): if ``True``, offload FP32 params to CPU.
cpu_offload (bool, Optional):
if ``True``, offload FP32 params to CPU.
This is only relevant when *``mixed_precision``* is ``True``.
compute_dtype (torch.dtype, Optional): dtype for full parameters for
compute_dtype (torch.dtype, Optional):
dtype for full parameters for
computation. This defaults to ``torch.float32`` unless
*``mixed_precision``* is set, in which case it defaults to
``torch.float16``.
move_grads_to_cpu (bool, Optional): move gradient shard to CPU after
move_grads_to_cpu (bool, Optional):
move gradient shard to CPU after
reduction. This is useful when combined with CPU-based optimizers.
It defaults to the value of *``cpu_offload``*.
bucket_cap_mb (int, Optional): FSDP will bucket parameters so that
bucket_cap_mb (int, Optional):
FSDP will bucket parameters so that
gradient reduction can potentially overlap with backward
computation. bucket_cap_mb controls the bucket size in MegaBytes
(MB). Buckets are sub-divided based on world_size, so the max shard
......
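For context on how the wrapper documented above is meant to be used, here is a minimal usage sketch built from the arguments listed in the docstring. It assumes `torch.distributed` is already initialized (e.g. one process per GPU); the toy model, tensor sizes, and optimizer choice are illustrative only and are not part of this commit.

```python
# Minimal sketch of wrapping a model with FSDP, assuming torch.distributed
# has already been initialized (one process per GPU). The toy model, sizes,
# and optimizer below are illustrative, not part of this commit.
import torch
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP

model = nn.Sequential(
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10),
).cuda()

# Shard parameters across the default process group. mixed_precision keeps
# params/grads in FP16 with a sharded FP32 master copy; flatten_parameters
# packs all parameters into one contiguous tensor for speed.
fsdp_model = FSDP(
    model,
    reshard_after_forward=True,
    mixed_precision=True,
    flatten_parameters=True,
)

# Build the optimizer *after* wrapping, so it sees the flattened, sharded
# parameters. Real mixed-precision training would typically also use a
# sharded gradient scaler; it is omitted here to keep the sketch short.
optimizer = torch.optim.SGD(fsdp_model.parameters(), lr=1e-3)

x = torch.randn(8, 1024, device="cuda")
loss = fsdp_model(x).sum()
loss.backward()
optimizer.step()
```

Constructing the optimizer after wrapping matters because flattening and sharding replace the module's original parameter tensors.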