# Quick demo

Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques. It can
accelerate model training on distributed systems with multiple GPUs, and it also runs on systems with only one GPU.
Quick demos showing how to use Colossal-AI are given below.

## Single GPU

Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline
performance. [Here](https://colab.research.google.com/drive/1fJnqqFzPuzZ_kn1lwCpG2nh3l2ths0KE?usp=sharing#scrollTo=cQ_y7lBG09LS)
is an example showing how to train a LeNet model on the CIFAR10 dataset using Colossal-AI.

## Multiple GPUs

Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs and accelerate the
training process drastically by applying efficient parallelization techniques, which are elaborated in
the [Parallelization](parallelization.md) section. Run the command below on your distributed system with 4 GPUs,
where `HOST` is the IP address of your system. Note that we use
the [Slurm](https://slurm.schedmd.com/documentation.html) job scheduling system here.

```bash
HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
```

`./configs/vit/vit_2d.py` is a config file, which is introduced in the [Config file](config.md) section. Colossal-AI
uses config files to define training arguments such as the model, dataset and training method (optimizer,
lr_scheduler, epoch, etc.). Config files are highly customizable and can be modified to train different models.

`./examples/run_trainer.py` contains a standard training script, presented below. It reads the config file and
runs the training process.

```python
import colossalai
from colossalai.core import global_context as gpc
from colossalai.engine import Engine
from colossalai.logging import get_global_dist_logger
from colossalai.trainer import Trainer


def run_trainer():
    # Build the training components from the config file
    model, train_dataloader, test_dataloader, criterion, optimizer, schedule, lr_scheduler = colossalai.initialize()
    logger = get_global_dist_logger()
    schedule.data_sync = False
    engine = Engine(
        model=model,
        criterion=criterion,
        optimizer=optimizer,
        lr_scheduler=lr_scheduler,
        schedule=schedule
    )
    logger.info("engine is built", ranks=[0])

    # The hooks and the number of epochs are taken from the config file
    trainer = Trainer(engine=engine,
                      hooks_cfg=gpc.config.hooks,
                      verbose=True)
    logger.info("trainer is built", ranks=[0])

    logger.info("start training", ranks=[0])
    trainer.fit(
        train_dataloader=train_dataloader,
        test_dataloader=test_dataloader,
        max_epochs=gpc.config.num_epochs,
        display_progress=True,
        test_interval=2
    )


if __name__ == '__main__':
    run_trainer()
```

Alternatively, the `model` variable can be substituted with a self-defined model or a pre-defined model in our Model
Zoo. The detailed substitution process is elaborated [here](model.md).
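
The training script above reads `gpc.config.hooks` and `gpc.config.num_epochs`, so the config file presumably has to
define at least these two names. The snippet below is only a rough sketch of what that part of a config file might
look like. The value of `num_epochs` and the hook entry are placeholders chosen for illustration, not the contents of
`./configs/vit/vit_2d.py`, and the `dict(type=...)` layout is an assumption; see the [Config file](config.md) section
for the actual format.

```python
# Hypothetical config fragment. Only the names `num_epochs` and `hooks` are
# taken from the training script above; every value here is a placeholder.

# Read by the script as gpc.config.num_epochs and passed to trainer.fit(max_epochs=...)
num_epochs = 60

# Read by the script as gpc.config.hooks and passed to Trainer(hooks_cfg=...)
hooks = [
    dict(type='ExampleLoggingHook'),  # placeholder, not a real hook name
]
```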

## Features

Colossal-AI provides a collection of parallel training components. We aim to let you develop distributed deep
learning models the same way you write single-GPU deep learning models, and we provide user-friendly tools to
kickstart distributed training in a few lines.

- [Data Parallelism](parallelization.md)
- [Pipeline Parallelism](parallelization.md)
- [1D, 2D, 2.5D, 3D and sequence parallelism](parallelization.md)
- [Friendly trainer and engine](trainer_engine.md)
- [Extensible for new parallelism](add_your_parallel.md)
- [Mixed Precision Training](amp.md)
- [Zero Redundancy Optimizer (ZeRO)](zero.md)