![FairScale Logo](./docs/source/_static/img/fairscale-logo.png)

![PyPI](https://img.shields.io/pypi/v/fairscale)
[![Documentation Status](https://readthedocs.org/projects/fairscale/badge/?version=latest)](https://fairscale.readthedocs.io/en/latest/?badge=latest)
[![CircleCI](https://circleci.com/gh/facebookresearch/fairscale.svg?style=shield)](https://app.circleci.com/pipelines/github/facebookresearch/fairscale/) ![PyPI - License](https://img.shields.io/pypi/l/fairscale) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/facebookresearch/fairscale/blob/master/CONTRIBUTING.md)
--------------------------------------------------------------------------------

## Description
fairscale is a PyTorch extension library for high performance and large scale training, optimizing training on a single machine or across multiple machines/nodes. The library extends basic PyTorch capabilities while adding new experimental ones.

fairscale supports:
* Parallelism:
   * pipeline parallelism (fairscale.nn.Pipe)
* Sharded training:
   * Optimizer state sharding (fairscale.optim.oss)
   * Sharded grad scaler - automatic mixed precision
   * Sharded distributed data parallel
* Optimization at scale:
   * AdaScale SGD (from fairscale.optim import AdaScale)


## Requirements

* PyTorch >= 1.5.1

## Installation

Normal installation:
```bash
pip install fairscale
```

Development mode:
```bash
cd fairscale
pip install -r requirements.txt
pip install -e .
```

If either of the above fails, add `--no-build-isolation` to the `pip install` command (build isolation in recent versions of pip can cause this failure).
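
For example, the normal installation then becomes:

```bash
pip install fairscale --no-build-isolation
```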


## Getting Started
The full documentation (https://fairscale.readthedocs.io/) contains instructions for getting started and extending fairscale.

## Examples
### Pipe

Run a 4-layer model on 2 GPUs. The first two layers run on cuda:0 and the next two layers run on cuda:1.

```python
import torch

import fairscale

# a, b, c and d are arbitrary nn.Module layers; small Linear layers are used
# here only to keep the example self-contained.
a, b, c, d = (torch.nn.Linear(10, 10) for _ in range(4))

model = torch.nn.Sequential(a, b, c, d)
model = fairscale.nn.Pipe(model, balance=[2, 2], devices=[0, 1], chunks=8)
```
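
The wrapped model is then used like a regular module. A minimal forward/backward sketch, assuming the torchgpipe convention that fairscale.nn.pipe is forked from (inputs live on the first device, the output comes back on the last):

```python
# Batch of 32 samples with 10 features, matching the example layers above.
x = torch.randn(32, 10, device="cuda:0")  # input goes to the first device
y = model(x)                              # output is returned on cuda:1
y.sum().backward()                        # backward runs across both devices
```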

### Optimizer state sharding (ZeRO)
See a more complete example [here](https://github.com/facebookresearch/fairscale/blob/master/benchmarks/oss.py), but a minimal example could look like the following:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

def train(rank: int, world_size: int, epochs: int):

    # DDP init example
    dist.init_process_group(backend='nccl', init_method="tcp://localhost:29501", rank=rank, world_size=world_size)

    # Problem statement
    model = myAwesomeModel().to(rank)
    dataloader = mySuperFastDataloader()
    loss_fn = myVeryRelevantLoss()
    base_optimizer = torch.optim.SGD  # pick any PyTorch-compliant optimizer here
    base_optimizer_arguments = {}  # pass any optimizer-specific arguments here, or directly below when instantiating OSS

    # Wrap the base optimizer in OSS, which shards the optimizer state across ranks
    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)

    # Wrap the model into ShardedDDP, which will reduce gradients to the proper ranks
    model = ShardedDDP(model, optimizer)

    # Any relevant training loop, nothing specific to OSS. For example:
    model.train()
    for e in range(epochs):
        for batch in dataloader:
            # Train
            model.zero_grad()
            outputs = model(batch["inputs"])
            loss = loss_fn(outputs, batch["label"])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Assumes WORLD_SIZE and EPOCHS have been defined elsewhere
    mp.spawn(
        train,
        args=(
            WORLD_SIZE,
            EPOCHS,
        ),
        nprocs=WORLD_SIZE,
        join=True,
    )
```
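
For mixed precision training, the model/optimizer pair above is typically combined with the sharded grad scaler listed in the features section. A minimal sketch of the training-loop change, assuming ShardedGradScaler lives at `fairscale.optim.grad_scaler` and follows the `torch.cuda.amp.GradScaler` interface (torch.cuda.amp requires PyTorch >= 1.6):

```python
from torch.cuda.amp import autocast
from fairscale.optim.grad_scaler import ShardedGradScaler  # import path assumed, check the docs

scaler = ShardedGradScaler()

for batch in dataloader:
    model.zero_grad()
    # Run the forward pass and loss computation under autocast
    with autocast():
        outputs = model(batch["inputs"])
        loss = loss_fn(outputs, batch["label"])
    # Scale the loss before backward, then step/update through the scaler so
    # unscaling stays consistent with the sharded optimizer state
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```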

### AdaScale SGD

AdaScale wraps an SGD optimizer and can be used in DDP (Distributed Data Parallel)
training or in non-DDP training with gradient accumulation. The benefit is that the same LR
schedule from a baseline batch size can be reused when the effective batch size is bigger.

```python
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
from fairscale.optim import AdaScale

...
optim = AdaScale(SGD(model.parameters(), lr=0.1))
scheduler = LambdaLR(optim, ...)
...
# Note: the train loop should be with DDP or with gradient accumulation.
last_epoch = 0
step = 0
done = False
while not done:
    for sample in dataset:
        ...
        step += optim.gain()
        optim.step()
        epoch = step // len(dataset)
        if last_epoch != epoch:
            scheduler.step()
            last_epoch = epoch
        if epoch > max_epoch:
            done = True
```
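
For the non-DDP case mentioned above, the larger effective batch size comes from gradient accumulation instead. A rough sketch, where model, dataset and loss_fn mirror the earlier examples, and assuming AdaScale accepts a `num_gradients_to_accumulate` argument (the exact argument name is an assumption here; check the fairscale.optim.AdaScale documentation):

```python
from torch.optim import SGD
from fairscale.optim import AdaScale

ACCUM_STEPS = 4  # hypothetical accumulation factor

# `num_gradients_to_accumulate` is assumed; verify against the AdaScale docs.
optim = AdaScale(SGD(model.parameters(), lr=0.1),
                 num_gradients_to_accumulate=ACCUM_STEPS)

for i, sample in enumerate(dataset):
    loss = loss_fn(model(sample["inputs"]), sample["label"])
    loss.backward()
    # Step only once every ACCUM_STEPS backward passes; AdaScale folds the
    # accumulated gradients into its gain computation.
    if (i + 1) % ACCUM_STEPS == 0:
        optim.step()
        optim.zero_grad()
```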

The primary goal is to allow scaling to bigger batch sizes without losing model accuracy.

At a high level, we want ML researchers to:
  * go parallel more easily (i.e. reuse the same LR schedule)
  * not worry about losing accuracy
  * get the same (or higher) GPU efficiency (fewer steps, less networking, etc.)

## Testing

We use CircleCI to test against PyTorch versions 1.5.1, 1.6.0 and 1.7.0, with CUDA version 10.1. Please create an [issue](https://github.com/facebookresearch/fairscale/issues) if you are having trouble with installation.

## Contributors

See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.

## License

fairscale is licensed under the [BSD-3-Clause License](LICENSE).

fairscale.nn.pipe is forked from [torchgpipe](https://github.com/kakaobrain/torchgpipe), Copyright 2019, Kakao Brain, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.nn.model_parallel is forked from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), Copyright 2020, NVIDIA CORPORATION, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.optim.adascale is forked from [AdaptDL](https://github.com/petuum/adaptdl), Copyright 2020, Petuum, Inc., licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

## References

Here is a list of all authors on relevant research papers this work is based on:

* torchgpipe: Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, Sungwoong Kim. [[Paper](https://arxiv.org/pdf/2004.09910.pdf)] [[Code](https://github.com/kakaobrain/torchgpipe)]
* ZeRO: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. [[Paper](https://arxiv.org/pdf/1910.02054.pdf)] [[Code](https://github.com/microsoft/DeepSpeed)]
* Megatron-LM: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro. [[Paper](https://arxiv.org/pdf/1909.08053.pdf)][[Code](https://github.com/NVIDIA/Megatron-LM)]
* AdaScale SGD: Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin. [[Paper](https://proceedings.icml.cc/static/paper_files/icml/2020/4682-Paper.pdf)]