# Introduction
This example shows how to pretrain RoBERTa from scratch, covering corpus preprocessing, pretraining, and finetuning. It can help you quickly train a high-quality RoBERTa model.

## 0. Prerequisite
- Install Colossal-AI
- Edit the port in `/etc/ssh/sshd_config` and `/etc/ssh/ssh_config` so that every host exposes the same SSH port on both the server and client side. If you log in as root, also set **PermitRootLogin** in `/etc/ssh/sshd_config` to "yes"
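As a sketch, these are the kinds of lines every host's config should agree on. The port value `2222` and the file name `sshd_config.sample` are placeholders so the snippet can be tried without root; apply the same lines to the real `/etc/ssh/sshd_config` and `/etc/ssh/ssh_config`:

```bash
# Illustrative only: port and file name are placeholders, not the repo's values.
cat > sshd_config.sample <<'EOF'
Port 2222
PermitRootLogin yes
EOF
grep '^Port' sshd_config.sample
```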
- Ensure that each host can log in to every other host without a password. With n hosts, you need to run the following n<sup>2</sup> times:

```
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub ip_destination
```
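With many hosts, the key-copying step above can be wrapped in a loop run on each machine. `HOSTS` is a placeholder list for your cluster, and the `echo` makes this a dry run that only prints the commands; remove it to actually distribute the key:

```bash
# Dry-run sketch: prints one ssh-copy-id command per host.
# HOSTS is a placeholder; substitute your real host names.
HOSTS="GPU001 GPU002 GPU003"
for h in $HOSTS; do
  echo ssh-copy-id -i ~/.ssh/id_rsa.pub "$h"
done
```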

- On every host, edit `/etc/hosts` to record each host's name and IP. An example is shown below.

```bash
192.168.2.1   GPU001
192.168.2.2   GPU002
192.168.2.3   GPU003
192.168.2.4   GPU004
192.168.2.5   GPU005
192.168.2.6   GPU006
192.168.2.7   GPU007
...
```
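For a larger cluster, entries in this form can be generated instead of typed by hand. The IP prefix and host count below come from the example above, not from any requirement of the repo:

```bash
# Print /etc/hosts entries for GPU001..GPU007 on the 192.168.2.x subnet.
for i in $(seq 1 7); do
  printf '192.168.2.%d   GPU%03d\n' "$i" "$i"
done
```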

- Restart the SSH service:
```
service ssh restart
```

## 1. Corpus Preprocessing
```bash
cd preprocessing
```
Following the `README.md` there, preprocess the original corpus into h5py plus numpy format.

## 2. Pretrain

```bash
cd pretraining
```
Following the `README.md` there, load the h5py files generated by the preprocessing in step 1 to pretrain the model.

## 3. Finetune

The checkpoint produced by this repo can directly replace `pytorch_model.bin` from [hfl/chinese-roberta-wwm-ext-large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/tree/main). You can then use Hugging Face `transformers` to finetune downstream applications.
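The swap can be sketched as below. Both paths are placeholders for your own layout (`MODEL_DIR` standing in for a local download of the Hugging Face model), and the snippet creates an empty stand-in checkpoint so it can be run as-is:

```bash
# Placeholder paths: point PRETRAINED at your real checkpoint and MODEL_DIR
# at a local download of hfl/chinese-roberta-wwm-ext-large.
PRETRAINED=checkpoints/epoch_final.bin
MODEL_DIR=chinese-roberta-wwm-ext-large
mkdir -p "$(dirname "$PRETRAINED")" "$MODEL_DIR"
: > "$PRETRAINED"                 # stand-in for the real checkpoint file
cp "$PRETRAINED" "$MODEL_DIR/pytorch_model.bin"
```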

## Contributors
This example is contributed by the AI team from [Moore Threads](https://www.mthreads.com/). If you run into any problem during pretraining, please file an issue or send an email to yehua.zhang@mthreads.com. Finally, any form of contribution is welcome!