OpenDAS / fairscale · Commits

Unverified commit 2b15720b, authored Feb 22, 2021 by Min Xu, committed by GitHub on Feb 22, 2021

[docs] fsdp changelog and doc (#414)

parent 15512d9e
Showing 6 changed files with 36 additions and 11 deletions (+36 -11):
CHANGELOG.md (+6 -0)
docs/source/api/index.rst (+1 -0)
docs/source/api/nn/fsdp.rst (+6 -0)
docs/source/index.rst (+1 -0)
fairscale/nn/__init__.py (+2 -1)
fairscale/nn/data_parallel/fully_sharded_data_parallel.py (+20 -10)
CHANGELOG.md
@@ -5,7 +5,13 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## NEXT - TBD
### Added
- FullyShardedDataParallel (FSDP) ([#413](https://github.com/facebookresearch/fairscale/issues/413))
### Fixed
- Catch corner case when the model is too small with respect to the world size, and shards are empty ([#406](https://github.com/facebookresearch/fairscale/pull/406))

## [0.1.7] - 2021-02-19
...
docs/source/api/index.rst
@@ -9,4 +9,5 @@ API Reference
    optim/grad_scaler
    nn/pipe
    nn/sharded_ddp
    nn/fsdp
    nn/misc/checkpoint_activations
docs/source/api/nn/fsdp.rst (new file, 0 → 100644)

FullyShardedDataParallel
========================

.. autoclass:: fairscale.nn.FullyShardedDataParallel
    :members:
    :undoc-members:
docs/source/index.rst
@@ -23,6 +23,7 @@ Components
* `Optimizer state sharding <../../en/latest/api/optim/oss.html>`_
* `Sharded grad scaler - automatic mixed precision <../../en/latest/api/optim/grad_scaler.html>`_
* `Sharded distributed data parallel <../../en/latest/api/nn/sharded_ddp.html>`_
* `Fully Sharded Data Parallel (FSDP) <../../en/latest/api/nn/fsdp.html>`_
* Optimization at scale:
    * `AdaScale SGD <../../en/latest/api/optim/adascale.html>`_
...
fairscale/nn/__init__.py
@@ -3,13 +3,14 @@
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.

-from .data_parallel import ShardedDataParallel
+from .data_parallel import FullyShardedDataParallel, ShardedDataParallel
from .misc import FlattenParamsWrapper
from .moe import MOELayer, Top2Gate
from .pipe import Pipe, PipeRPCWrapper

__all__ = [
    "FlattenParamsWrapper",
+    "FullyShardedDataParallel",
    "LazyModule",
    "Pipe",
    "PipeRPCWrapper",
...
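With this change, FSDP is re-exported from the package root alongside the other wrappers. A minimal sketch of the new import path (only the import names below come from this commit; the assert is illustrative):

    # Both names refer to the same class: fairscale.nn re-exports it
    # from fairscale.nn.data_parallel.
    from fairscale.nn import FullyShardedDataParallel
    from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

    assert FullyShardedDataParallel is FSDP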
fairscale/nn/data_parallel/fully_sharded_data_parallel.py
@@ -85,30 +85,40 @@ class FullyShardedDataParallel(nn.Module):
        )

    Args:
-       module (nn.Module): module to checkpoint
-       process_group (Optional): process group for sharding
-       reshard_after_forward (bool, Optional): if ``True``, reshard parameters
+       module (nn.Module):
+           module to checkpoint
+       process_group (Optional):
+           process group for sharding
+       reshard_after_forward (bool, Optional):
+           if ``True``, reshard parameters
            after the forward pass. This saves memory but slows training. This
            is only relevant when resharding individual layers.
-       mixed_precision (bool, Optional): if ``True``, inputs, activations and
+       mixed_precision (bool, Optional):
+           if ``True``, inputs, activations and
            gradients will be kept in FP16; computation and communication will
            occur in FP16; and a (sharded) master copy of the model weights will
            be maintained in FP32.
-       fp32_reduce_scatter (bool, Optional): if ``True``, then reduce-scatter
+       fp32_reduce_scatter (bool, Optional):
+           if ``True``, then reduce-scatter
            gradients in FP32. This is only relevant when *``mixed_precision``*
            is ``True``.
-       flatten_parameters (bool, Optional): if ``True``, flatten parameters
+       flatten_parameters (bool, Optional):
+           if ``True``, flatten parameters
            into a single contiguous tensor, which improves training speed.
-       cpu_offload (bool, Optional): if ``True``, offload FP32 params to CPU.
+       cpu_offload (bool, Optional):
+           if ``True``, offload FP32 params to CPU.
            This is only relevant when *``mixed_precision``* is ``True``.
-       compute_dtype (torch.dtype, Optional): dtype for full parameters for
+       compute_dtype (torch.dtype, Optional):
+           dtype for full parameters for
            computation. This defaults to ``torch.float32`` unless
            *``mixed_precision``* is set, in which case it defaults to
            ``torch.float16``.
-       move_grads_to_cpu (bool, Optional): move gradient shard to CPU after
+       move_grads_to_cpu (bool, Optional):
+           move gradient shard to CPU after
            reduction. This is useful when combined with CPU-based optimizers.
            It defaults to the value of *``cpu_offload``*.
-       bucket_cap_mb (int, Optional): FSDP will bucket parameters so that
+       bucket_cap_mb (int, Optional):
+           FSDP will bucket parameters so that
            gradient reduction can potentially overlap with backward
            computation. bucket_cap_mb controls the bucket size in MegaBytes
            (MB). Buckets are sub-divided based on world_size, so the max shard
...
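Taken together, the documented arguments map directly onto the FullyShardedDataParallel constructor. Below is a hedged usage sketch, assuming the script is launched with one process per GPU and that the NCCL backend and the required RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables are available; the toy model, optimizer, and specific argument values are illustrative and not part of this commit.

    import torch
    import torch.distributed as dist
    import torch.nn as nn

    from fairscale.nn import FullyShardedDataParallel as FSDP

    # Assumes the environment variables above are set by the launcher.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Toy model; any nn.Module can be wrapped.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()

    model = FSDP(
        model,
        reshard_after_forward=True,   # free full params after forward (saves memory, slows training)
        mixed_precision=True,         # FP16 params/grads/activations, FP32 (sharded) master weights
        fp32_reduce_scatter=True,     # reduce-scatter gradients in FP32; only relevant with mixed_precision
        flatten_parameters=True,      # flatten params into one contiguous tensor for speed
        cpu_offload=False,            # optionally keep the FP32 master copy on CPU
        move_grads_to_cpu=False,      # defaults to the value of cpu_offload
        bucket_cap_mb=25,             # bucket size (MB) used to overlap grad reduction with backward
    )

    # Build the optimizer after wrapping so it sees the sharded (flattened) parameters.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # One illustrative training step.
    inputs = torch.randn(8, 1024, device="cuda")
    loss = model(inputs).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()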