![FairScale Logo](./docs/source/_static/img/fairscale-logo.png)

![PyPI](https://img.shields.io/pypi/v/fairscale)
[![Documentation Status](https://readthedocs.org/projects/fairscale/badge/?version=latest)](https://fairscale.readthedocs.io/en/latest/?badge=latest)
[![CircleCI](https://circleci.com/gh/facebookresearch/fairscale.svg?style=shield)](https://app.circleci.com/pipelines/github/facebookresearch/fairscale/) ![PyPI - License](https://img.shields.io/pypi/l/fairscale) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/facebookresearch/fairscale/blob/master/CONTRIBUTING.md)
--------------------------------------------------------------------------------

## Description
FairScale is a PyTorch extension library for high-performance and large-scale training on one or more machines/nodes. It extends basic PyTorch capabilities while adding new experimental features.

FairScale supports:
* Parallelism:
   * Pipeline parallelism (`fairscale.nn.pipe`)
   * Asynchronous Pipeline parallelism (`fairscale.nn.async_pipe`)
   * Model Parallelism (`fairscale.nn.model_parallel.layers`)
   * _experimental_ AmpNet (`fairscale.experimental.nn.ampnet_pipe`)
* Sharded training:
   * Optimizer state sharding (`fairscale.optim.OSS`)
   * Sharded Data Parallel (SDP) (`fairscale.nn.ShardedDataParallel`)
   * Fully Sharded Data Parallel (FSDP) (`fairscale.nn.FullyShardedDataParallel`) (PyTorch >= 1.6)
   * OffloadModel (`fairscale.experimental.nn.OffloadModel`)
* Optimization at scale:
   * AdaScale SGD (`fairscale.optim.AdaScale`)
* GPU memory optimization:
   * Activation checkpointing wrapper (`fairscale.nn.misc.checkpoint_wrapper`) (see the sketch after this list)
* GPU speed optimization:
   * Sharded grad scaler - automatic mixed precision (`fairscale.optim.grad_scaler`)
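
These building blocks are designed to compose. As a minimal illustrative sketch (not taken from the official docs), the activation checkpointing wrapper and FSDP can be combined as follows, assuming `torch.distributed` has already been initialized with a NCCL process group and a GPU is available:

```python
import torch.nn as nn

from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.misc import checkpoint_wrapper

# Recompute this block's activations during the backward pass instead of
# storing them, trading compute for GPU memory.
block = checkpoint_wrapper(nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()))

# Shard parameters, gradients and optimizer state across ranks.
model = FSDP(nn.Sequential(block, nn.Linear(1024, 10)).cuda())
```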

## Requirements

* PyTorch >= 1.5.1

## Installation

Normal installation:
```bash
pip install fairscale
```

Development mode:
```bash
cd fairscale
pip install -r requirements.txt
pip install -e .
```

If either of the above fails, add `--no-build-isolation` to the `pip install` command (build isolation in recent versions of pip can cause the failure).
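
For example, the regular install then becomes:

```bash
pip install fairscale --no-build-isolation
```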


## Getting Started
The full documentation (https://fairscale.readthedocs.io/) contains instructions for getting started and extending fairscale.

## Examples
### Pipe

Run a 4-layer model on 2 GPUs. The first two layers run on cuda:0 and the next two layers run on cuda:1.

```python
import torch

import fairscale

# a, b, c, d are the four layers; any nn.Module works here (they are placeholders in this example)
a, b, c, d = [torch.nn.Linear(16, 16) for _ in range(4)]

model = torch.nn.Sequential(a, b, c, d)
model = fairscale.nn.Pipe(model, balance=[2, 2], devices=[0, 1], chunks=8)
```
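
A hypothetical training step on top of this pipeline (the `inputs`/`targets` tensors and the loss are illustrative, not part of the official example): the input batch has to live on the first pipeline device, and since the Pipe output comes back on the last device, the target is moved there before computing the loss.

```python
import torch.nn.functional as F

inputs = torch.randn(32, 16).to("cuda:0")   # the batch enters on the first device
targets = torch.randn(32, 16).to("cuda:1")  # the Pipe output lands on the last device

outputs = model(inputs)                     # outputs live on cuda:1
loss = F.mse_loss(outputs, targets)
loss.backward()
```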

### Optimizer state sharding (ZeRO)
See a more complete example [here](https://github.com/facebookresearch/fairscale/blob/master/benchmarks/oss.py), but a minimal example could look like the following:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

def train(
    rank: int,
    world_size: int,
    epochs: int):

    # DDP init example
    dist.init_process_group(backend='nccl', init_method="tcp://localhost:29501", rank=rank, world_size=world_size)

    # Problem statement
    model = myAwesomeModel().to(rank)
    dataloader = mySuperFastDataloader()
    loss_fn = myVeryRelevantLoss()
    base_optimizer = torch.optim.SGD  # pick any PyTorch-compliant optimizer here
    base_optimizer_arguments = {}  # pass any optimizer-specific arguments here, or directly below when instantiating OSS

    # Wrap the optimizer in its state sharding brethren
    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)

    # Wrap the model into ShardedDDP, which will reduce gradients to the proper ranks
    model = ShardedDDP(model, optimizer)

    # Any relevant training loop, nothing specific to OSS. For example:
    model.train()
    for e in range(epochs):
        for batch in dataloader:
            # Train
            model.zero_grad()
            outputs = model(batch["inputs"])
            loss = loss_fn(outputs, batch["label"])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Supposing that WORLD_SIZE and EPOCHS are somehow defined somewhere
    mp.spawn(
        train,
        args=(
            WORLD_SIZE,
            EPOCHS,
        ),
        nprocs=WORLD_SIZE,
        join=True,
    )
```
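
The feature list above also mentions a sharded grad scaler for automatic mixed precision. As a rough sketch only (reusing `model`, `optimizer`, `dataloader` and `loss_fn` from the example above, and assuming `ShardedGradScaler` follows the usual `torch.cuda.amp.GradScaler` API), the inner loop could become:

```python
import torch
from fairscale.optim.grad_scaler import ShardedGradScaler

scaler = ShardedGradScaler()

for batch in dataloader:
    model.zero_grad()
    with torch.cuda.amp.autocast():
        outputs = model(batch["inputs"])
        loss = loss_fn(outputs, batch["label"])
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)         # unscale gradients, then call optimizer.step()
    scaler.update()
```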

### AdaScale SGD

AdaScale can be used to wrap an SGD optimizer for DDP (Distributed Data Parallel) training,
or for non-DDP training with gradient accumulation. The benefit is re-using the same LR
schedule from a baseline batch size when the effective batch size is bigger.

Note that AdaScale does _not_ help increase per-GPU batch size.

```python
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
from fairscale.optim import AdaScale

...
optim = AdaScale(SGD(model.parameters(), lr=0.1))
scheduler = LambdaLR(optim, ...)
...
# Note: the train loop should be with DDP or with gradient accumulation.
last_epoch = 0
step = 0
done = False
while not done:
    for sample in dataset:
        ...
        step += optim.gain()
        optim.step()
        epoch = step // len(dataset)
        if last_epoch != epoch:
            scheduler.step()
            last_epoch = epoch
        if epoch > max_epoch:
            done = True
```
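
For the non-DDP, gradient-accumulation case mentioned above, AdaScale needs to know how many backward passes make up one effective step. A sketch, reusing the imports from the snippet above and assuming the `num_gradients_to_accumulate` argument plus illustrative `model`, `loss_fn` and `dataset` objects:

```python
# Accumulate gradients over 4 micro-batches per optimizer step, so the
# effective batch size is 4x the per-micro-batch size.
optim = AdaScale(SGD(model.parameters(), lr=0.1), num_gradients_to_accumulate=4)

for i, sample in enumerate(dataset):
    loss = loss_fn(model(sample["input"]), sample["label"])
    loss.backward()
    if (i + 1) % 4 == 0:  # step only once per accumulation window
        optim.step()
        optim.zero_grad()
```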

The primary goal is to allow scaling to bigger batch sizes without losing model accuracy.
(However, training time might be longer compared to training without AdaScale.)

At a high level, we want ML researchers to:
  * go parallel more easily (i.e. no need to find new learning rate schedules)
  * not worry about losing accuracy
  * benefit from potentially higher GPU efficiency (fewer steps, less networking overhead, etc.)

## Testing

We use CircleCI to test against PyTorch versions 1.6.0, 1.7.1, and 1.8.1. Please create an [issue](https://github.com/facebookresearch/fairscale/issues) if you are having trouble with installation.

## Contributors

See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.

## License

fairscale is licensed under the [BSD-3-Clause License](LICENSE).

fairscale.nn.pipe is forked from [torchgpipe](https://github.com/kakaobrain/torchgpipe), Copyright 2019, Kakao Brain, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.nn.model_parallel is forked from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), Copyright 2020, NVIDIA CORPORATION, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.optim.adascale is forked from [AdaptDL](https://github.com/petuum/adaptdl), Copyright 2020, Petuum, Inc., licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.nn.misc.flatten_params_wrapper is forked from [PyTorch-Reparam-Module](https://github.com/SsnL/PyTorch-Reparam-Module), Copyright 2018, Tongzhou Wang, licensed under [MIT License](https://github.com/SsnL/PyTorch-Reparam-Module/blob/master/LICENSE).

## References

Here is a list of all authors on relevant research papers this work is based on:

* torchgpipe: Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, Sungwoong Kim. [[Paper](https://arxiv.org/pdf/2004.09910.pdf)] [[Code](https://github.com/kakaobrain/torchgpipe)]
* ZeRO: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. [[Paper](https://arxiv.org/pdf/1910.02054.pdf)] [[Code](https://github.com/microsoft/DeepSpeed)]
* Megatron-LM: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro. [[Paper](https://arxiv.org/pdf/1909.08053.pdf)] [[Code](https://github.com/NVIDIA/Megatron-LM)]
* AdaScale SGD: Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin. [[Paper](https://proceedings.icml.cc/static/paper_files/icml/2020/4682-Paper.pdf)]
* GShard: Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen. [[Paper](https://arxiv.org/abs/2006.16668)]
* AMPNet: Alexander L. Gaunt, Matthew A. Johnson, Maik Riechert, Daniel Tarlow, Ryota Tomioka, Dimitrios Vytiniotis, Sam Webster. [[Paper](https://arxiv.org/abs/1705.09786)]
* L2L: Training Large Neural Networks with Constant Memory using a New Execution Algorithm, 2020. [[Paper](https://arxiv.org/abs/2002.05645)]
* ZeRO-Offload: Democratizing Billion-Scale Model Training, 2021. [[Paper](https://arxiv.org/abs/2101.06840)]