![FairScale Logo](./docs/source/_static/img/fairscale-logo.png)

![PyPI](https://img.shields.io/pypi/v/fairscale)
[![Documentation Status](https://readthedocs.org/projects/fairscale/badge/?version=latest)](https://fairscale.readthedocs.io/en/latest/?badge=latest)
[![CircleCI](https://circleci.com/gh/facebookresearch/fairscale.svg?style=shield)](https://app.circleci.com/pipelines/github/facebookresearch/fairscale/) ![PyPI - License](https://img.shields.io/pypi/l/fairscale) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/facebookresearch/fairscale/blob/master/CONTRIBUTING.md)
--------------------------------------------------------------------------------

## Description
FairScale is a PyTorch extension library for high-performance and large-scale training.
It extends core PyTorch capabilities with new SOTA scaling techniques, making the latest
distributed training techniques available as composable modules and easy-to-use APIs.
These APIs are a fundamental part of a researcher's toolbox when scaling models with
limited resources.

FairScale was designed with the following values in mind:

* **Usability** - Users should be able to understand and use FairScale APIs with minimal cognitive overhead.

* **Modularity** - Users should be able to seamlessly combine multiple FairScale APIs in their training loop.

* **Performance** - FairScale APIs aim to provide the best possible performance in terms of scaling and efficiency.


## Installation

To install FairScale, please see these [instructions](https://github.com/facebookresearch/fairscale/blob/master/docs/source/installation_instructions.rst). You can either install the pip package (typically `pip install fairscale`) or build directly from source.

## Getting Started
The full [documentation](https://fairscale.readthedocs.io/) contains instructions for getting started, deep dives, and tutorials on the various FairScale APIs.

## Examples

Here are a few sample snippets from a subset of FairScale offerings:

### Pipe

Run a 4-layer model on 2 GPUs: the first two layers run on `cuda:0` and the next two layers run on `cuda:1`.

```python
import torch

import fairscale

# Any four nn.Modules can serve as the layers; Linear layers keep the example concrete.
a, b, c, d = [torch.nn.Linear(32, 32) for _ in range(4)]

model = torch.nn.Sequential(a, b, c, d)
# balance=[2, 2]: two layers per device; chunks=8: number of micro-batches per batch.
model = fairscale.nn.Pipe(model, balance=[2, 2], devices=[0, 1], chunks=8)
```
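
Inputs are expected on the first pipeline device and outputs come back on the last one (the usual torchgpipe-style behavior); a minimal, purely illustrative forward/backward pass might look like this:

```python
# Illustrative sketch, assuming the model defined above.
inputs = torch.randn(64, 32, device="cuda:0")  # input lives on the first pipeline device
outputs = model(inputs)                        # output is produced on cuda:1
loss = outputs.sum()                           # placeholder loss
loss.backward()                                # gradients flow back across both devices
```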

### Optimizer state sharding (ZeRO)
See a more complete example [here](https://github.com/facebookresearch/fairscale/blob/master/benchmarks/oss.py); a minimal example could look like the following:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

def train(rank: int, world_size: int, epochs: int):

    # DDP init example
    dist.init_process_group(backend='nccl', init_method="tcp://localhost:29501", rank=rank, world_size=world_size)

    # Problem statement: replace these placeholders with your own model, dataloader and loss
    model = myAwesomeModel().to(rank)
    dataloader = mySuperFastDataloader()
    loss_fn = myVeryRelevantLoss()
    base_optimizer = torch.optim.SGD  # pick any PyTorch-compliant optimizer here
    base_optimizer_arguments = {}  # pass any optimizer-specific arguments here, or directly below when instantiating OSS

    # Wrap the base optimizer in OSS, which shards the optimizer state across ranks
    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)

    # Wrap the model into ShardedDDP, which will reduce gradients to the proper ranks
    model = ShardedDDP(model, optimizer)

    # Any relevant training loop, nothing specific to OSS. For example:
    model.train()
    for e in range(epochs):
        for batch in dataloader:
            # Train
            model.zero_grad()
            outputs = model(batch["inputs"])
            loss = loss_fn(outputs, batch["label"])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Assuming WORLD_SIZE and EPOCHS are defined elsewhere
    mp.spawn(
        train,
        args=(
            WORLD_SIZE,
            EPOCHS,
        ),
        nprocs=WORLD_SIZE,
        join=True,
    )
```
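
Because each rank only holds a shard of the optimizer state, checkpointing needs one extra step. A rough sketch, called from inside `train` above, using OSS's `consolidate_state_dict` to gather the full optimizer state on a single rank:

```python
# Rough sketch: gather the sharded optimizer state onto the recipient rank (0 by default),
# then save it from that rank only.
optimizer.consolidate_state_dict()
if rank == 0:
    torch.save(optimizer.state_dict(), "oss_optimizer_checkpoint.pt")
```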

### AdaScale SGD

AdaScale can be used to wrap an SGD optimizer, either in DDP (Distributed Data Parallel)
training or in non-DDP training with gradient accumulation. The benefit is being able to
re-use the learning-rate schedule of a baseline batch size when the effective batch size is larger.

Note that AdaScale does _not_ help increase per-GPU batch size.

```python
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
from fairscale.optim import AdaScale

...
optim = AdaScale(SGD(model.parameters(), lr=0.1))
scheduler = LambdaLR(optim, ...)
...
# Note: the train loop should be with DDP or with gradient accumulation.
last_epoch = 0
step = 0
done = False
while not done:
    for sample in dataset:
        ...
        step += optim.gain()  # gain() estimates training progress per step relative to the baseline batch size
        optim.step()
        epoch = step // len(dataset)
        if last_epoch != epoch:
            scheduler.step()
            last_epoch = epoch
        if epoch > max_epoch:
            done = True
```
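
For the non-DDP, gradient-accumulation case mentioned above, a minimal sketch could look like the following (assuming AdaScale's `num_gradients_to_accumulate` argument; `model`, `loss_fn` and `dataset` are placeholders, and loss-scaling details are omitted):

```python
from torch.optim import SGD
from fairscale.optim import AdaScale

# Accumulate 4 gradients per optimizer step to emulate a 4x larger batch.
optim = AdaScale(SGD(model.parameters(), lr=0.1), num_gradients_to_accumulate=4)

for i, sample in enumerate(dataset):
    loss = loss_fn(model(sample["inputs"]), sample["label"])
    loss.backward()
    if (i + 1) % 4 == 0:  # step once per 4 accumulated micro-batches
        optim.step()
        model.zero_grad()
```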

The primary goal is to allow scaling to bigger batch sizes without losing model accuracy.
(However, training time might be longer compared to training without AdaScale.)

At a high level, we want ML researchers to:
  * go parallel more easily (i.e., no need to find new learning rate schedules)
  * not worry about losing accuracy
  * potentially get higher GPU efficiency (fewer steps, less networking overhead, etc.)

## Testing

We use CircleCI to test against PyTorch versions 1.6.0, 1.7.1, and 1.8.1. Please create an [issue](https://github.com/facebookresearch/fairscale/issues) if you are having trouble with installation.

## Contributors

We welcome outside contributions! Please see the [CONTRIBUTING](CONTRIBUTING.md) instructions for how you can contribute to FairScale.

## License

FairScale is licensed under the [BSD-3-Clause License](LICENSE).

fairscale.nn.pipe is forked from [torchgpipe](https://github.com/kakaobrain/torchgpipe), Copyright 2019, Kakao Brain, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.nn.model_parallel is forked from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), Copyright 2020, NVIDIA CORPORATION, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.optim.adascale is forked from [AdaptDL](https://github.com/petuum/adaptdl), Copyright 2020, Petuum, Inc., licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.nn.misc.flatten_params_wrapper is forked from [PyTorch-Reparam-Module](https://github.com/SsnL/PyTorch-Reparam-Module), Copyright 2018, Tongzhou Wang, licensed under [MIT License](https://github.com/SsnL/PyTorch-Reparam-Module/blob/master/LICENSE).


## Citing FairScale

If you use FairScale in your publication, please cite it by using the following BibTeX entry.

```BibTeX
@Misc{FairScale2021,
  author =       {Mandeep Baines and Shruti Bhosale and Vittorio Caggiano and Naman Goyal and Siddharth Goyal and Myle Ott and Benjamin Lefaudeux and Vitaliy Liptchinsky and Mike Rabbat and Sam Sheiffer and Anjali Sridhar and Min Xu},
  title =        {FairScale: A general purpose modular PyTorch library for high performance and large scale training},
  howpublished = {\url{https://github.com/facebookresearch/fairscale}},
  year =         {2021}
}
```