![FairScale Logo](./docs/source/_static/img/fairscale-logo.png)

[![Support Ukraine](https://img.shields.io/badge/Support-Ukraine-FFD500?style=flat&labelColor=005BBB)](https://opensource.facebook.com/support-ukraine)
![PyPI](https://img.shields.io/pypi/v/fairscale)
[![Documentation Status](https://readthedocs.org/projects/fairscale/badge/?version=latest)](https://fairscale.readthedocs.io/en/latest/?badge=latest)
[![CircleCI](https://circleci.com/gh/facebookresearch/fairscale.svg?style=shield)](https://app.circleci.com/pipelines/github/facebookresearch/fairscale/) ![PyPI - License](https://img.shields.io/pypi/l/fairscale) [![Downloads](https://pepy.tech/badge/fairscale)](https://pepy.tech/project/fairscale) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/facebookresearch/fairscale/blob/main/CONTRIBUTING.md)
--------------------------------------------------------------------------------

## Description
FairScale is a PyTorch extension library for high-performance and large-scale training.
It extends basic PyTorch capabilities while adding new state-of-the-art (SOTA) scaling techniques.
FairScale makes the latest distributed training techniques available in the form of composable
modules and easy-to-use APIs. These APIs are a fundamental part of a researcher's toolbox as
they attempt to scale models with limited resources.

FairScale was designed with the following values in mind:

* **Usability** - Users should be able to understand and use FairScale APIs with minimum cognitive overload.

* **Modularity** - Users should be able to combine multiple FairScale APIs as part of their training loop seamlessly.

* **Performance** - FairScale APIs provide the best performance in terms of scaling and efficiency.

## Watch Introductory Video

[![Explain Like I’m 5: FairScale](https://img.youtube.com/vi/oDt7ebOwWIc/0.jpg)](https://www.youtube.com/watch?v=oDt7ebOwWIc)

## What's New:

* March 2022 [fairscale 0.4.6 was released](https://github.com/facebookresearch/fairscale/releases/tag/v0.4.6).
* We have support for CosFace's LMCL in MEVO. This is a loss function that is suitable for a large number of prediction target classes.
* January 2022 [fairscale 0.4.5 was released](https://github.com/facebookresearch/fairscale/releases/tag/v0.4.5).
* We have experimental support for layer-wise gradient scaling.
* We enabled reduce_scatter operation overlapping in FSDP backward propagation.
* December 2021 [fairscale 0.4.4 was released](https://github.com/facebookresearch/fairscale/releases/tag/v0.4.4).
* FairScale is tested with the following PyTorch versions (with CUDA 11.2): 1.8.1, 1.10.0 and 1.11.0.dev20211101+cu111.
* November 2021 [fairscale 0.4.3 was released](https://github.com/facebookresearch/fairscale/releases/tag/v0.4.3).
* We have experimental support for offloading params to disk when using the FSDP API for evaluation workloads.
* We have an experimental layer that fuses multiple layers together to support training with large vocabulary sizes.
* November 2021 [fairscale 0.4.2 was released](https://github.com/facebookresearch/fairscale/releases/tag/v0.4.2).
* We have a new experimental API called the LayerwiseMemoryTracker to help track, visualize and suggest fixes for memory issues occurring during the forward/backward pass of your models.
* Introducing the SlowMoDistributedDataParallel API, a distributed training wrapper that is useful on clusters with slow network interconnects (e.g. Ethernet).
* September 2021 [`master` branch renamed to `main`](https://github.com/github/renaming).

## Installation

To install FairScale, please see the following [instructions](https://github.com/facebookresearch/fairscale/blob/main/docs/source/installation_instructions.rst).
You can install the package with pip or conda, or build it directly from source.
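As a quick post-install sanity check, the import below should succeed (a minimal sketch; it only assumes the package exposes a `__version__` attribute):

```python
# Verify that FairScale is importable and print the installed version.
import fairscale

print(fairscale.__version__)
```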

## Getting Started
The full [documentation](https://fairscale.readthedocs.io/) contains instructions for getting started, deep dives and tutorials about the various FairScale APIs.

## Examples

Here are a few sample snippets from a subset of FairScale offerings:

### Pipe

Run a 4-layer model on 2 GPUs. The first two layers run on `cuda:0` and the next two layers run on `cuda:1`.

```python
import torch

import fairscale

# Define four example layers; replace these with the layers of your own model.
a, b, c, d = [torch.nn.Linear(10, 10) for _ in range(4)]

model = torch.nn.Sequential(a, b, c, d)
model = fairscale.nn.Pipe(model, balance=[2, 2], devices=[0, 1], chunks=8)
```
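A minimal forward/backward sketch for the model above (an illustrative addition, not part of the original example; it assumes two visible GPUs and the example `Linear(10, 10)` layers defined above):

```python
# The input lives on the first pipeline device; the output is produced on the last one.
x = torch.randn(64, 10, device="cuda:0")
y = model(x)            # Pipe splits the batch into `chunks` micro-batches internally
y.sum().backward()      # gradients flow back through both pipeline stages
```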

### Optimizer state sharding (ZeRO)
See a more complete example [here](https://github.com/facebookresearch/fairscale/blob/main/benchmarks/oss.py), but a minimal example could look like the following:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

def train(
    rank: int,
    world_size: int,
    epochs: int):

    # DDP init example
    dist.init_process_group(backend='nccl', init_method="tcp://localhost:29501", rank=rank, world_size=world_size)

    # Problem statement
    model = myAwesomeModel().to(rank)
    dataloader = mySuperFastDataloader()
    loss_fn = myVeryRelevantLoss()

    base_optimizer = torch.optim.SGD # pick any pytorch compliant optimizer here
    base_optimizer_arguments = {} # pass any optimizer specific arguments here, or directly below when instantiating OSS

    # Wrap the optimizer in its state sharding brethren
    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)

    # Wrap the model into ShardedDDP, which will reduce gradients to the proper ranks
    model = ShardedDDP(model, optimizer)

    # Any relevant training loop, nothing specific to OSS. For example:
    model.train()
    for e in range(epochs):
        for batch in dataloader:
            # Train
            model.zero_grad()
            outputs = model(batch["inputs"])
            loss = loss_fn(outputs, batch["label"])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Supposing that WORLD_SIZE and EPOCHS are somehow defined somewhere
    mp.spawn(
        train,
        args=(
            WORLD_SIZE,
            EPOCHS,
        ),
        nprocs=WORLD_SIZE,
        join=True,
    )
```
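A common follow-up when using OSS is checkpointing: each rank only holds a shard of the optimizer state, so the shards need to be consolidated before saving. A minimal sketch, assuming the `consolidate_state_dict()` / `state_dict()` helpers behave as in recent FairScale releases (the file name is illustrative):

```python
# Inside train(), before dist.destroy_process_group():
# gather the sharded optimizer state onto the default recipient rank (rank 0),
# then save a regular checkpoint from that rank only.
optimizer.consolidate_state_dict()
if rank == 0:
    torch.save(optimizer.state_dict(), "oss_checkpoint.pt")
```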

### AdaScale SGD

AdaScale can be used to wrap an SGD optimizer, either for DDP (Distributed Data Parallel)
training or for non-DDP training with gradient accumulation. The benefit is that you can reuse
the LR schedule from a baseline batch size when the effective batch size is bigger.

Note that AdaScale does _not_ help increase per-GPU batch size.

```python
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
from fairscale.optim import AdaScale

# model, dataset and max_epoch are assumed to be defined elsewhere.
...
optim = AdaScale(SGD(model.parameters(), lr=0.1))
scheduler = LambdaLR(optim, ...)
...
# Note: the train loop should be with DDP or with gradient accumulation.
last_epoch = 0
step = 0
done = False
while not done:
    for sample in dataset:
        ...  # forward pass and loss.backward() go here
        step += optim.gain()  # gain() reports how many baseline steps this step is worth
        optim.step()
        epoch = step // len(dataset)
        if last_epoch != epoch:
            scheduler.step()
            last_epoch = epoch
        if epoch > max_epoch:
            done = True
```
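For the non-DDP case, gradient accumulation can be combined with AdaScale as well. A minimal sketch, assuming the `num_gradients_to_accumulate` argument available in recent FairScale releases and the same placeholder `model`/`dataset` as above:

```python
from torch.optim import SGD
from fairscale.optim import AdaScale

ACCUMULATE = 4  # hypothetical number of micro-batches per optimizer step
optim = AdaScale(SGD(model.parameters(), lr=0.1), num_gradients_to_accumulate=ACCUMULATE)

for i, sample in enumerate(dataset):
    loss = model(sample).sum()  # placeholder forward pass and loss
    loss.backward()
    if (i + 1) % ACCUMULATE == 0:
        optim.step()
        optim.zero_grad()
```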

The primary goal is to allow scaling to bigger batch sizes without losing model accuracy.
(However, training time might be longer compared to training without AdaScale.)

At a high level, we want ML researchers to:
  * go parallel more easily (i.e. no need to find new learning rate schedules)
  * not worry about losing accuracy
  * potentially achieve higher GPU efficiency (fewer steps, less networking overhead, etc.)

## Testing

We use CircleCI to test FairScale with the following PyTorch versions (with CUDA 11.2):
* the latest stable release (1.10.0)
* the latest LTS release (1.8.1)
* a recent nightly release (1.11.0.dev20211101+cu111)

Please create an [issue](https://github.com/facebookresearch/fairscale/issues) if you are having trouble with installation.

## Contributors

We welcome outside contributions! Please see the [CONTRIBUTING](CONTRIBUTING.md) instructions for how you can contribute to FairScale.

## License

FairScale is licensed under the [BSD-3-Clause License](LICENSE).

fairscale.nn.pipe is forked from [torchgpipe](https://github.com/kakaobrain/torchgpipe), Copyright 2019, Kakao Brain, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.nn.model_parallel is forked from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), Copyright 2020, NVIDIA CORPORATION, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.optim.adascale is forked from [AdaptDL](https://github.com/petuum/adaptdl), Copyright 2020, Petuum, Inc., licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.nn.misc.flatten_params_wrapper is forked from [PyTorch-Reparam-Module](https://github.com/SsnL/PyTorch-Reparam-Module), Copyright 2018, Tongzhou Wang, licensed under [MIT License](https://github.com/SsnL/PyTorch-Reparam-Module/blob/master/LICENSE).


## Citing FairScale

If you use FairScale in your publication, please cite it by using the following BibTeX entry.

```BibTeX
@Misc{FairScale2021,
  author =       {Mandeep Baines and Shruti Bhosale and Vittorio Caggiano and Naman Goyal and Siddharth Goyal and Myle Ott and Benjamin Lefaudeux and Vitaliy Liptchinsky and Mike Rabbat and Sam Sheiffer and Anjali Sridhar and Min Xu},
  title =        {FairScale:  A general purpose modular PyTorch library for high performance and large scale training},
  howpublished = {\url{https://github.com/facebookresearch/fairscale}},
  year =         {2021}
}
```

## FAQ
1. If you experience an error indicating that a default branch does not exist, it is probably due to the latest update, which switched the default branch from "master" to "main":
```
error: pathspec 'non-existing-branch' did not match any file(s) known to git
```
Please run the following commands to update to the main branch.
```
git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a
```