![FairScale Logo](./docs/source/_static/img/fairscale-logo.png)

![PyPI](https://img.shields.io/pypi/v/fairscale)
[![Documentation Status](https://readthedocs.org/projects/fairscale/badge/?version=latest)](https://fairscale.readthedocs.io/en/latest/?badge=latest)
[![CircleCI](https://circleci.com/gh/facebookresearch/fairscale.svg?style=shield)](https://app.circleci.com/pipelines/github/facebookresearch/fairscale/) ![PyPI - License](https://img.shields.io/pypi/l/fairscale) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/facebookresearch/fairscale/blob/master/CONTRIBUTING.md)
--------------------------------------------------------------------------------

## Description
fairscale is a PyTorch extension library for high-performance, large-scale training on one or more machines/nodes. It extends basic PyTorch capabilities while adding new experimental ones.

fairscale supports:
* Parallelism:
   * Pipeline parallelism (fairscale.nn.Pipe)
* Sharded training:
   * Optimizer state sharding (fairscale.optim.oss)
   * Sharded grad scaler - automatic mixed precision
   * Sharded distributed data parallel
* Optimization at scale:
   * AdaScale SGD (from fairscale.optim import AdaScale)


## Requirements

* PyTorch >= 1.5.1

## Installation

Normal installation:
```bash
pip install fairscale
```

Development mode:
```bash
cd fairscale
pip install -r requirements.txt
pip install -e .
```

## Getting Started
The full documentation (https://fairscale.readthedocs.io/) contains instructions for getting started and extending fairscale.

## Examples
### Pipe

Run a 4-layer model on 2 GPUs. The first two layers run on `cuda:0` and the next two layers run on `cuda:1`.

```python
import torch

import fairscale

# Define the four layers; plain nn.Linear layers stand in for real blocks here.
a, b, c, d = [torch.nn.Linear(10, 10) for _ in range(4)]

model = torch.nn.Sequential(a, b, c, d)
model = fairscale.nn.Pipe(model, balance=[2, 2], devices=[0, 1], chunks=8)
```
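
For reference, a minimal forward/backward sketch with the wrapped model, assuming the placeholder `nn.Linear(10, 10)` stages defined above: the input must live on the first Pipe device, and the output comes back on the last one, so the target is moved there before computing the loss.

```python
# Hypothetical usage sketch; shapes and loss follow the placeholder layers above.
inputs = torch.randn(32, 10).to("cuda:0")   # inputs go to the first Pipe device
targets = torch.randn(32, 10).to("cuda:1")  # Pipe returns outputs on the last device
outputs = model(inputs)                     # the batch is split into 8 micro-batches
loss = torch.nn.functional.mse_loss(outputs, targets)
loss.backward()
```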

### Optimizer state sharding (ZeRO)
See a more complete example [here](https://github.com/facebookresearch/fairscale/blob/master/benchmarks/oss.py), but a minimal example could look like the following:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP


def train(
    rank: int,
    world_size: int,
    epochs: int):

    # DDP init example
    dist.init_process_group(backend='nccl', init_method="tcp://localhost:29501", rank=rank, world_size=world_size)

    # Problem statement
    model = myAwesomeModel().to(rank)
    dataloader = mySuperFastDataloader()
    loss_fn = myVeryRelevantLoss()

    base_optimizer = torch.optim.SGD  # pick any PyTorch-compliant optimizer here
    base_optimizer_arguments = {}  # pass any optimizer-specific arguments here, or directly below when instantiating OSS

    # Wrap the optimizer in its state sharding brethren
    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)

    # Wrap the model into ShardedDDP, which will reduce gradients to the proper ranks
    model = ShardedDDP(model, optimizer)

    # Any relevant training loop, nothing specific to OSS. For example:
    model.train()
    for e in range(epochs):
        for batch in dataloader:
            # Train
            model.zero_grad()
            outputs = model(batch["inputs"])
            loss = loss_fn(outputs, batch["label"])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    # Supposing that WORLD_SIZE and EPOCHS are somehow defined somewhere
    mp.spawn(
        train,
        args=(
            WORLD_SIZE,
            EPOCHS,
        ),
        nprocs=WORLD_SIZE,
        join=True,
    )
```
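
The feature list above also mentions a sharded grad scaler for automatic mixed precision. A minimal sketch of how it could replace the plain `backward()`/`step()` in the loop above, assuming PyTorch >= 1.6 and a fairscale version that provides `ShardedGradScaler` under `fairscale.optim.grad_scaler`:

```python
# Sketch only: assumes PyTorch >= 1.6 (torch.cuda.amp) and that your fairscale
# version ships fairscale.optim.grad_scaler.ShardedGradScaler.
from fairscale.optim.grad_scaler import ShardedGradScaler

scaler = ShardedGradScaler()
for batch in dataloader:
    model.zero_grad()
    with torch.cuda.amp.autocast():
        outputs = model(batch["inputs"])
        loss = loss_fn(outputs, batch["label"])
    scaler.scale(loss).backward()  # backprop on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then calls optimizer.step()
    scaler.update()
```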

### AdaScale SGD

AdaScale can be used to wrap an SGD optimizer, either in DDP (Distributed Data Parallel)
training or in non-DDP training with gradient accumulation. The benefit is that the same LR
schedule from a baseline batch size can be reused when the effective batch size is bigger.

```python
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR  # or your scheduler
from fairscale.optim import AdaScale

...
optim = AdaScale(SGD(model.parameters(), lr=0.1))
scheduler = LambdaLR(optim, ...)
...
# Note: the train loop should be with DDP or with gradient accumulation.
last_epoch = 0
step = 0
done = False
while not done:
    for sample in dataset:
        ...
        step += optim.gain()
        optim.step()
        epoch = step // len(dataset)
        if last_epoch != epoch:
            scheduler.step()
            last_epoch = epoch
        if epoch > max_epoch:
            done = True
```
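
In a DDP setup, the same wrapping applies. Below is a hypothetical per-rank sketch; the placeholder model and the assumption that the process group is already initialized are ours, not part of the library API.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import SGD
from fairscale.optim import AdaScale

# Hypothetical per-rank setup; assumes dist.init_process_group(...) has already
# been called (e.g. as in the OSS example above).
rank = dist.get_rank()
model = torch.nn.Linear(10, 10).to(rank)   # placeholder model
model = DDP(model, device_ids=[rank])
optim = AdaScale(SGD(model.parameters(), lr=0.1))
# The training loop shown above can then be used unchanged with this optimizer.
```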

The primary goal is to allow scaling to bigger batch sizes without losing model accuracy.

At a high level, we want ML researchers to:
  * go parallel more easily (i.e. reuse the same LR schedule)
  * not worry about losing accuracy
  * get the same (or higher) GPU efficiency (fewer steps, less networking, etc.)

## Testing

We use CircleCI to test on PyTorch versions 1.5.1, 1.6.0 and 1.7.0 and on CUDA version 10.1. Please create an [issue](https://github.com/facebookresearch/fairscale/issues) if you are having trouble with installation.

## Contributors

See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.

## License

fairscale is licensed under the [BSD-3-Clause License](LICENSE).

fairscale.nn.pipe is forked from [torchgpipe](https://github.com/kakaobrain/torchgpipe), Copyright 2019, Kakao Brain, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.nn.model_parallel is forked from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), Copyright 2020, NVIDIA CORPORATION, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.optim.adascale is forked from [AdaptDL](https://github.com/petuum/adaptdl), Copyright 2020, Petuum, Inc., licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

## References

Here is a list of all authors on relevant research papers this work is based on:

* torchgpipe: Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, Sungwoong Kim. [[Paper](https://arxiv.org/pdf/2004.09910.pdf)] [[Code](https://github.com/kakaobrain/torchgpipe)]
* ZeRO: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. [[Paper](https://arxiv.org/pdf/1910.02054.pdf)] [[Code](https://github.com/microsoft/DeepSpeed)]
* Megatron-LM: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro. [[Paper](https://arxiv.org/pdf/1909.08053.pdf)][[Code](https://github.com/NVIDIA/Megatron-LM)]
* AdaScale SGD: Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin. [[Paper](https://proceedings.icml.cc/static/paper_files/icml/2020/4682-Paper.pdf)]