This repo shows how to pretrain a Chinese RoBERTa-Large from scratch, including corpus preprocessing, pretraining, and fine-tuning. It can help you quickly train a high-quality BERT-style model.
## 0. Prerequisites
- Install Colossal-AI
- Edit the port in /etc/ssh/sshd_config and /etc/ssh/ssh_config so that every host exposes the same SSH port for both server and client. If you log in as root, also set **PermitRootLogin** in /etc/ssh/sshd_config to "yes", for example (port 22 is only a placeholder):
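```bash
# /etc/ssh/sshd_config -- keep the same Port in /etc/ssh/ssh_config on every host
Port 22
PermitRootLogin yes
```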
- Ensure that every host can log in to every other host without a password. With n hosts, you need to run the commands below n<sup>2</sup> times (once per ordered host pair):
```bash
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub ip_destination
```
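With several hosts, a loop saves repetition. A minimal sketch, assuming the IPs from the /etc/hosts example below; `ssh-copy-id` will still prompt for each host's password:
```bash
# Run this once on every host so that all n^2 ordered host pairs are covered.
for ip in 192.168.2.{1..7}; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub "$ip"
done
```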
- On every host, edit /etc/hosts to record the names and IPs of all hosts. An example is shown below.
```bash
192.168.2.1 GPU001
192.168.2.2 GPU002
192.168.2.3 GPU003
192.168.2.4 GPU004
192.168.2.5 GPU005
192.168.2.6 GPU006
192.168.2.7 GPU007
...
```
- Restart ssh
```bash
service ssh restart
```
## 1. Corpus Preprocessing
```bash
cd preprocessing
```
Following the `README.md`, preprocess the original corpus into h5py+numpy format.
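To sanity-check the output, you can inspect a generated shard with h5py. A minimal sketch; the shard path is a placeholder, and it lists whatever datasets the preprocessing step wrote without assuming their names:
```python
import h5py

def show(name, obj):
    # Print every dataset's name, shape, and dtype, skipping groups.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

# Placeholder path to one preprocessed shard.
with h5py.File('/h5/0.h5', 'r') as f:
    f.visititems(show)
```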
## 2. Pretrain
```bash
cd pretraining
```
Following the `README.md`, load the h5 files generated by the preprocessing in step 1 to pretrain the model.
## 3. Finetune
The checkpoint produced by this repo can directly replace `pytorch_model.bin` from [hfl/chinese-roberta-wwm-ext-large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/tree/main). You can then use Hugging Face Transformers to fine-tune downstream applications.
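For example, after copying your checkpoint over `pytorch_model.bin` in a local clone of the model repo, a fine-tuning setup could start like this (a minimal sketch; the local path and label count are placeholders):
```python
from transformers import BertTokenizer, BertForSequenceClassification

# Local clone of hfl/chinese-roberta-wwm-ext-large whose pytorch_model.bin
# has been replaced by the checkpoint pretrained with this repo.
model_dir = './chinese-roberta-wwm-ext-large'  # placeholder path

tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForSequenceClassification.from_pretrained(model_dir, num_labels=2)

inputs = tokenizer('这是一个测试句子。', return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```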
## Contributors
The repo is contributed by the AI team from [Moore Threads](https://www.mthreads.com/). If you find any problems with pretraining, please file an issue or send an email to yehua.zhang@mthreads.com. Finally, any form of contribution is welcome!
If this repo helps you, please cite:
```
@misc{
title={A simple Chinese RoBERTa Example for Whole Word Masked},
}
```

# Preprocessing
<span id='all_catelogue'/>
## Catalogue:
* <a href='#introduction'>1. Introduction</a>
* <a href='#Quick Start Guide'>2. Quick Start Guide</a>
  * <a href='#Split Sentence'>2.1. Split Sentence & Split data into multiple shards</a>
  * <a href='#Tokenizer & Whole Word Masked'>2.2. Tokenizer & Whole Word Masked</a>

<span id='introduction'/>
## 1. Introduction: <a href='#all_catelogue'>[Back to Top]</a>
This folder is used to preprocess a Chinese corpus with Whole Word Masking. You can obtain a corpus from [WuDao](https://resource.wudaoai.cn/home?ind&name=WuDaoCorpora%202.0&id=1394901288847716352). Moreover, data preprocessing is flexible, and you can modify the code based on your needs, hardware, or parallel framework (Open MPI, Spark, Dask).
<span id='Quick Start Guide'/>
## 2. Quick Start Guide: <a href='#all_catelogue'>[Back to Top]</a>
<span id='Split Sentence'/>
### 2.1. Split Sentence & Split data into multiple shards:
First, each file has multiple documents, and each document contains multiple sentences. Split the documents into sentences on punctuation such as `。!`. **Second, split the data into multiple shards based on the server hardware (CPU, CPU memory, hard disk) and the corpus size.** Each shard contains a part of the corpus, and the model needs to train on all shards to complete one epoch.
In this example, a 200 GB corpus is split into 100 shards, so each shard is about 2 GB. The shard size is memory-dependent, taking into account the number of servers, the memory used by the tokenizer, and the memory used by multi-process training to read the shards (n-way data parallelism requires n\*shard_size memory). **To sum up, data preprocessing and model pretraining require fighting with the hardware, not just the GPU.**
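A minimal sketch of the sentence-splitting step; the function name and punctuation set are illustrative, not the repo's actual implementation:
```python
import re

def split_sentences(document: str) -> list[str]:
    # Hypothetical splitter: cut after Chinese end-of-sentence punctuation
    # (。!?) while keeping the punctuation attached to each sentence.
    parts = re.split(r'(?<=[。!?])', document)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences('今天天气很好。我们去公园吧!好的?'))
# ['今天天气很好。', '我们去公园吧!', '好的?']
```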
The preprocessing script exposes the following arguments (see the script for the full list):
```python
parser.add_argument('--max_predictions_per_seq', type=int, default=80, help='maximum number of masked tokens per sequence')
parser.add_argument('--input_path', type=str, required=True, help='input path of the shards with split sentences')
parser.add_argument('--output_path', type=str, required=True, help='output path of the h5 files containing token ids')
parser.add_argument('--backend', type=str, default='python', help='backend of the mask token step: python, c++, or numpy')
parser.add_argument('--dupe_factor', type=int, default=1, help='specifies how many times the preprocessor repeats to create the input from the same article/document')
parser.add_argument('--worker', type=int, default=32, help='number of processes')
parser.add_argument('--server_num', type=int, default=10, help='number of servers')
```
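A hypothetical invocation; the script name `get_mask_data.py` is a placeholder for the actual preprocessing script in this folder, and the paths are examples:
```bash
python get_mask_data.py \
    --input_path /path/to/sentence_split_shards \
    --output_path /path/to/h5_shards \
    --backend python \
    --worker 32 \
    --server_num 10
```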
1. Pretrain RoBERTa by running the script below. Detailed parameter descriptions can be found in arguments.py. `data_path_prefix` is the absolute path of the preprocessing output. **You have to modify the *hostfile* according to your cluster (see the sketch after the flag list below).**
```bash
bash run_pretrain.sh
```
* `--hostfile`: servers' host names from /etc/hosts
* `--include`: servers which will be used
* `--nproc_per_node`: number of processes (GPUs) on each server
* `--data_path_prefix`: absolute location of the train data, e.g., /h5/0.h5
* `--eval_data_path_prefix`: absolute location of the eval data
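A minimal hostfile sketch, assuming the launcher reads one host name per line matching the /etc/hosts entries above (check your launcher's documentation for the exact format, e.g., whether a `slots=N` suffix is required):
```bash
GPU001
GPU002
GPU003
GPU004
```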