README.md 1.65 KB
Newer Older
yuguo960516's avatar
bloom  
yuguo960516 committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# Text classification projects
**English** | [简体中文](/projects/text_classification/README_zh-CN.md)
## GLUE tasks

Fine-tuning the pretrained language models for sequence classification on the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/). This is a common benchmark for *English* pretrained language models.

GLUE is made up of a total of 9 different tasks. The tasks contains cola, sst2, mrpc, stsb, qqp, mnli, qnli, rte, wnli. You can perform the following command to download and extract the dataset:
```bash
cd projects/text_classification
python3 dataset/download_glue_data.py
```

Here is how to finetune the task on one of them:
```bash
bash tools/train.sh tools/train_net.py projects/text_classification/configs/config.py $num_gpus train.train_iter=10
```
where `$num_gpus` indicates the number of GPUs. If you want to run the distributed program, you can change it, for example:
```bash
bash tools/train.sh tools/train_net.py projects/text_classification/configs/config.py 2 train.train_iter=10
```

Before running the program, you should modify the `config.py` file. Modification fields include but are not limited to task_name, data_dir, vocab_file, model hyperparameter, learning rate, and batch size, and so on.

## CLUE tasks

For *Chinese* pretrained language model, it can be evaluated by [CLUE benchmark](https://github.com/CLUEbenchmark/CLUE). This is a sequence classification task just like GLUE, so you can finetune your model with the same command. Don't forget to modify `config.py` file.

You can download and extract the datasets by:
```bash
cd projects/text_classification
python3 dataset/download_clue_data.py
```