# fairscale
![PyPI](https://img.shields.io/pypi/v/fairscale)
[![Documentation Status](https://readthedocs.org/projects/fairscale/badge/?version=latest)](https://fairscale.readthedocs.io/en/latest/?badge=latest)
[![CircleCI](https://circleci.com/gh/facebookresearch/fairscale.svg?style=shield)](https://app.circleci.com/pipelines/github/facebookresearch/fairscale/) ![PyPI - License](https://img.shields.io/pypi/l/fairscale) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/facebookresearch/fairscale/blob/master/CONTRIBUTING.md)

## Description
fairscale is a PyTorch extension library for high-performance, large-scale training on a single machine or across multiple machines/nodes. The library extends basic PyTorch capabilities and adds new experimental ones.

fairscale supports:
* Parallelism:
   * pipeline parallelism (fairscale.nn.Pipe)
   * tensor parallelism (fairscale.nn.model_parallel)
* Optimization:
   * optimizer state sharding (fairscale.optim.oss)


## Requirements

* PyTorch >= 1.5.1
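
To check an installed PyTorch against this minimum, a small version comparison can help. The sketch below is illustrative only: `parse_version` and `meets_minimum` are hypothetical helper names, not part of fairscale or PyTorch, and the parser assumes version strings of the form `1.7.0` or `1.7.0+cu101`.

```python
import re

MIN_TORCH = (1, 5, 1)  # minimum version stated in this README


def parse_version(version: str) -> tuple:
    """Return the leading numeric components, ignoring suffixes such as '+cu101'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def meets_minimum(version: str, minimum: tuple = MIN_TORCH) -> bool:
    """Compare numerically, so that 1.10.0 correctly ranks above 1.5.1."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        status = "OK" if meets_minimum(torch.__version__) else "too old"
        print(torch.__version__, status)
```

Comparing tuples of integers rather than raw strings avoids the usual pitfall where `"1.10.0" < "1.5.1"` lexicographically.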

## Installation

Normal installation:
```bash
pip install fairscale
```

Development mode:
```bash
cd fairscale
pip install -r requirements.txt
pip install -e .
```

## Getting Started
The full documentation (https://fairscale.readthedocs.io/) contains instructions for getting started and extending fairscale.

## Examples
### Pipe

Run a 4-layer model on 2 GPUs. The first two layers run on cuda:0 and the next two layers run on cuda:1.

```python
import torch

import fairscale

# a, b, c, d stand for any four torch.nn.Module layers; Linear layers are used here as placeholders
a, b, c, d = (torch.nn.Linear(16, 16) for _ in range(4))
model = torch.nn.Sequential(a, b, c, d)
model = fairscale.nn.Pipe(model, balance=[2, 2], devices=[0, 1], chunks=8)
```
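
To make the `balance` and `chunks` arguments more concrete, here is a minimal pure-Python sketch of the bookkeeping they imply. It is not fairscale's implementation: `split_by_balance` and `micro_batch_sizes` are hypothetical helpers, and the micro-batch rule (ceil division, as in `torch.chunk`) is an assumption about how a batch would be split.

```python
from math import ceil
from typing import List, Sequence


def split_by_balance(layers: Sequence[str], balance: Sequence[int]) -> List[List[str]]:
    """Group consecutive layers into partitions, one partition per device."""
    if sum(balance) != len(layers):
        raise ValueError("balance must sum to the number of layers")
    partitions, start = [], 0
    for count in balance:
        partitions.append(list(layers[start:start + count]))
        start += count
    return partitions


def micro_batch_sizes(batch_size: int, chunks: int) -> List[int]:
    """Sizes of the micro-batches under a ceil-division rule (assumed, see lead-in)."""
    size = ceil(batch_size / chunks)
    sizes, remaining = [], batch_size
    while remaining > 0:
        sizes.append(min(size, remaining))
        remaining -= sizes[-1]
    return sizes


print(split_by_balance(["a", "b", "c", "d"], [2, 2]))  # [['a', 'b'], ['c', 'd']]
print(micro_batch_sizes(64, 8))  # [8, 8, 8, 8, 8, 8, 8, 8]
```

With `balance=[2, 2]`, the first partition maps to the first device and the second to the next one. Splitting the batch into several micro-batches is what lets the two devices work on different micro-batches at the same time.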

### Optimizer state sharding (ZeRO)
A more complete example is available [here](https://github.com/facebookresearch/fairscale/blob/master/benchmarks/oss.py); a minimal example could look like the following:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

def train(
    rank: int,
    world_size: int,
    epochs: int):

    # DDP init example
    dist.init_process_group(backend='nccl', init_method="tcp://localhost:29501", rank=rank, world_size=world_size)

    # Problem statement (myAwesomeModel, mySuperFastDataloader and myVeryRelevantLoss are placeholders)
    model = myAwesomeModel().to(rank)
    model = ShardedDDP(model, device_ids=[rank])  # this will handle the gradient reduce automatically
    dataloader = mySuperFastDataloader()
    loss_fn = myVeryRelevantLoss()
    base_optimizer = torch.optim.SGD  # pick any pytorch compliant optimizer here
    # pass any optimizer specific arguments here, or directly below when instantiating OSS
    base_optimizer_arguments = {"lr": 1e-4}  # SGD requires a learning rate

    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)

    # Any relevant training loop, nothing specific to OSS. For example:
    model.train()
    for e in range(epochs):
        for batch in dataloader:
            # Train
            model.zero_grad()
            outputs = model(batch["inputs"])
            loss = loss_fn(outputs, batch["label"])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Supposing that WORLD_SIZE and EPOCHS are somehow defined somewhere
    mp.spawn(
        train,
        args=(
            WORLD_SIZE,
            EPOCHS,
        ),
        nprocs=WORLD_SIZE,
        join=True,
    )
```
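
The idea behind optimizer state sharding can be illustrated without any GPU: each rank keeps optimizer state only for its own subset of the parameters, so the per-rank state shrinks roughly in proportion to the number of ranks. The sketch below uses a simple greedy assignment by parameter size. `shard_parameters` is a hypothetical helper written for this illustration, and whether OSS partitions parameters in exactly this way is an assumption.

```python
from typing import Dict, List


def shard_parameters(param_sizes: Dict[str, int], world_size: int) -> List[List[str]]:
    """Assign each parameter to the rank that currently holds the fewest elements."""
    shards: List[List[str]] = [[] for _ in range(world_size)]
    totals = [0] * world_size
    for name, numel in param_sizes.items():
        rank = totals.index(min(totals))  # least-loaded rank so far (ties go to the lowest rank)
        shards[rank].append(name)
        totals[rank] += numel
    return shards


# Hypothetical parameter names mapped to their element counts
sizes = {"w1": 100, "b1": 10, "w2": 50, "b2": 5}
print(shard_parameters(sizes, world_size=2))  # [['w1'], ['b1', 'w2', 'b2']]
```

In this toy case rank 0 holds state for 100 elements and rank 1 for 65, instead of both ranks holding state for all 165.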


## Testing

We use CircleCI to test on PyTorch versions 1.5.1, 1.6.0, and 1.7.0 with CUDA version 10.1. Please create an [issue](https://github.com/facebookresearch/fairscale/issues) if you are having trouble with installation.

## Contributors

See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.

## License

fairscale is licensed under the [BSD-3-Clause License](LICENSE).

fairscale.nn.pipe is forked from [torchgpipe](https://github.com/kakaobrain/torchgpipe), Copyright 2019, Kakao Brain, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.nn.model_parallel is forked from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), Copyright 2020, NVIDIA CORPORATION, licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

fairscale.optim.adascale is forked from [AdaptDL](https://github.com/petuum/adaptdl), Copyright 2020, Petuum, Inc., licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0).

## References

Here is a list of all authors on relevant research papers this work is based on:

* torchgpipe: Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, Sungwoong Kim. [[Paper](https://arxiv.org/pdf/2004.09910.pdf)] [[Code](https://github.com/kakaobrain/torchgpipe)]
* ZeRO: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. [[Paper](https://arxiv.org/pdf/1910.02054.pdf)] [[Code](https://github.com/microsoft/DeepSpeed)]
* Megatron-LM: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro. [[Paper](https://arxiv.org/pdf/1909.08053.pdf)] [[Code](https://github.com/NVIDIA/Megatron-LM)]
* AdaScale SGD: Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin. [[Paper](https://proceedings.icml.cc/static/paper_files/icml/2020/4682-Paper.pdf)]