# Auto-Parallelism with ResNet

## Prepare Dataset

This example uses the CIFAR10 dataset. To obtain it, either run `download_cifar10.py` in the tutorial root directory or run `auto_parallel_with_resnet.py` directly.
By default, the dataset is downloaded to `colossalai/examples/tutorials/data`.
If you wish to use a custom directory for the dataset, set the environment variable `DATA` with the following command.

```bash
export DATA=/path/to/data
```
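
For orientation, a Python script can pick up this variable and fall back to a default directory when it is unset. The sketch below is a minimal illustration and not the tutorial's actual code: the helper name `resolve_data_dir` and the `./data` default are assumptions.

```python
import os
from pathlib import Path


def resolve_data_dir(default: str = "./data") -> Path:
    """Return the dataset directory: $DATA if set and non-empty, else `default`.

    Hypothetical helper for illustration; the tutorial scripts may differ.
    """
    return Path(os.environ.get("DATA") or default).expanduser()


if __name__ == "__main__":
    print(f"CIFAR10 directory: {resolve_data_dir()}")
```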

## Extra Requirements to Use Autoparallel

```bash
pip install pulp
conda install coin-or-cbc
```

## Run on 2*2 device mesh

```bash
colossalai run --nproc_per_node 4 auto_parallel_with_resnet.py
```
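
The command launches four processes, which together form the 2*2 device mesh. As a rough illustration of what that arrangement means, the sketch below maps each global rank to a (row, column) coordinate. It assumes a row-major layout, and the function name `rank_to_mesh_coord` is hypothetical; the layout used by the script itself may differ.

```python
from typing import Tuple


def rank_to_mesh_coord(rank: int, mesh_shape: Tuple[int, int] = (2, 2)) -> Tuple[int, int]:
    """Map a global rank to (row, col) on a logical mesh, assuming row-major order."""
    rows, cols = mesh_shape
    if not 0 <= rank < rows * cols:
        raise ValueError(f"rank {rank} is outside a {rows}x{cols} mesh")
    # divmod gives (rank // cols, rank % cols), i.e. the row and column index.
    return divmod(rank, cols)


if __name__ == "__main__":
    for rank in range(4):
        print(f"rank {rank} -> mesh coordinate {rank_to_mesh_coord(rank)}")
```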

## Auto Checkpoint Benchmarking

We provide two benchmarks for testing the performance of auto checkpoint.

The first test, `auto_ckpt_solver_test.py`, shows the solver's ability to find a checkpoint strategy that fits within a given memory budget (tested on GPT2 Medium and ResNet 50). It outputs a benchmark summary and plots of peak memory vs. budget memory and relative step time vs. peak memory.

The second test, `auto_ckpt_batchsize_test.py`, shows how our activation checkpoint solver lets you train with a larger batch size within limited GPU memory (tested on ResNet152). It outputs a benchmark summary.

Usage of the two tests:
```bash
# run auto_ckpt_solver_test.py on gpt2 medium
python auto_ckpt_solver_test.py --model gpt2

# run auto_ckpt_solver_test.py on resnet50
python auto_ckpt_solver_test.py --model resnet50

# run auto_ckpt_batchsize_test.py
python auto_ckpt_batchsize_test.py
```

Some results are provided below for your reference.

## Auto Checkpoint Solver Test

### ResNet 50
![](https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/tutorial/resnet50_benchmark.png)

### GPT2 Medium
![](https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/tutorial/gpt2_benchmark.png)

## Auto Checkpoint Batch Size Test
```text
===============test summary================
batch_size: 512, peak memory: 73314.392 MB, through put: 254.286 images/s
batch_size: 1024, peak memory: 73316.216 MB, through put: 397.608 images/s
batch_size: 2048, peak memory: 72927.837 MB, through put: 277.429 images/s
```
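
The summary lines follow a regular pattern, so they can be parsed when you want to compare runs. The snippet below is a small sketch under that assumption: the `parse_summary` helper and its regular expression are illustrative and not part of the tutorial, and the sample text is copied from the output above.

```python
import re
from typing import Dict, List

SUMMARY = """\
batch_size: 512, peak memory: 73314.392 MB, through put: 254.286 images/s
batch_size: 1024, peak memory: 73316.216 MB, through put: 397.608 images/s
batch_size: 2048, peak memory: 72927.837 MB, through put: 277.429 images/s
"""

# Matches one summary line; "through put" is spelled as the test prints it.
LINE_PATTERN = re.compile(
    r"batch_size:\s*(\d+),\s*peak memory:\s*([\d.]+)\s*MB,"
    r"\s*through put:\s*([\d.]+)\s*images/s"
)


def parse_summary(text: str) -> List[Dict[str, float]]:
    """Extract batch size, peak memory (MB), and throughput (images/s) per line."""
    return [
        {
            "batch_size": int(m.group(1)),
            "peak_memory_mb": float(m.group(2)),
            "throughput": float(m.group(3)),
        }
        for m in LINE_PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    rows = parse_summary(SUMMARY)
    best = max(rows, key=lambda r: r["throughput"])
    print(f"highest throughput at batch_size={best['batch_size']}")
```

In the sample above, peak memory stays at roughly 73 GB for all three batch sizes, while throughput is highest at batch size 1024.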