# Finetuning RoBERTa on a custom classification task

This example shows how to finetune RoBERTa on the IMDB dataset, but should illustrate the process for most classification tasks.

### 1) Get the data
```bash
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```

### 2) Format data
The `IMDB` data has one sample per file; the Python snippet below collects them into a single input file and a single label file for each of the train and dev splits, for ease of processing.
```python
import argparse
import os
import random
from glob import glob

random.seed(0)

def main(args):
    for split in ['train', 'test']:
        # Collect (review text, label) pairs: 1 for positive, 0 for negative.
        samples = []
        for class_label in ['pos', 'neg']:
            fnames = glob(os.path.join(args.datadir, split, class_label) + '/*.txt')
            for fname in fnames:
                with open(fname) as fin:
                    line = fin.readline()
                    samples.append((line, 1 if class_label == 'pos' else 0))
        random.shuffle(samples)
        # The IMDB test split is used as the dev (validation) set.
        out_fname = 'train' if split == 'train' else 'dev'
        # Write one review per line to <split>.input0 and the matching label to <split>.label.
        f1 = open(os.path.join(args.datadir, out_fname + '.input0'), 'w')
        f2 = open(os.path.join(args.datadir, out_fname + '.label'), 'w')
        for sample in samples:
            f1.write(sample[0] + '\n')
            f2.write(str(sample[1]) + '\n')
        f1.close()
        f2.close()

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--datadir', default='aclImdb')
    args = parser.parse_args()
    main(args)
```
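
Assuming the snippet above was saved as, say, `make_imdb_splits.py` (a placeholder name) and run with `python make_imdb_splits.py --datadir aclImdb`, a quick optional sanity check is to confirm that each split produced parallel input and label files of the same length (25,000 lines each for the standard IMDB release):

```python
# Optional sanity check on the files written by the snippet above.
for split in ['train', 'dev']:
    with open(f'aclImdb/{split}.input0') as f_in, open(f'aclImdb/{split}.label') as f_lbl:
        n_inputs = sum(1 for _ in f_in)
        n_labels = sum(1 for _ in f_lbl)
    # Both counts should match (25,000 per split for the standard IMDB release).
    print(split, n_inputs, n_labels)
```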

### 3) BPE Encode
Run `multiprocessing_bpe_encoder` to BPE-encode the inputs. You could also encode each sample inline in the previous step, but a separate multi-worker pass is usually faster.
```bash
# Download encoder.json and vocab.bpe
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'

for SPLIT in train dev; do
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json encoder.json \
        --vocab-bpe vocab.bpe \
        --inputs "aclImdb/$SPLIT.input0" \
        --outputs "aclImdb/$SPLIT.input0.bpe" \
        --workers 60 \
        --keep-empty
done
```
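
If you do want to BPE-encode samples inline in the previous step instead, the same GPT-2 encoder can be used in-process. The following is a minimal sketch, assuming fairseq is installed and `encoder.json`/`vocab.bpe` were downloaded as above; the example sentence is made up, and the output mirrors what `multiprocessing_bpe_encoder` writes for each line:

```python
from fairseq.data.encoders.gpt2_bpe import get_encoder

# Same GPT-2 BPE encoder that multiprocessing_bpe_encoder uses internally.
bpe = get_encoder('encoder.json', 'vocab.bpe')

line = 'This movie was surprisingly good.'   # hypothetical review text
ids = bpe.encode(line)                       # list of BPE token ids
print(' '.join(map(str, ids)))               # space-separated ids, as in *.input0.bpe
```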

### 4) Preprocess data

```bash
# Download fairseq dictionary.
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'  

fairseq-preprocess \
    --only-source \
    --trainpref "aclImdb/train.input0.bpe" \
    --validpref "aclImdb/dev.input0.bpe" \
    --destdir "IMDB-bin/input0" \
    --workers 60 \
    --srcdict dict.txt

fairseq-preprocess \
    --only-source \
    --trainpref "aclImdb/train.label" \
    --validpref "aclImdb/dev.label" \
    --destdir "IMDB-bin/label" \
    --workers 60

```
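
`fairseq-preprocess` writes the binarized `train`/`valid` data plus a `dict.txt` into `IMDB-bin/input0` and `IMDB-bin/label`. As an optional check, you can load the label dictionary and confirm both labels were picked up; this is a small sketch using fairseq's `Dictionary` helper:

```python
from fairseq.data import Dictionary

# Inspect the binarized label dictionary produced by fairseq-preprocess.
label_dict = Dictionary.load('IMDB-bin/label/dict.txt')
# After the special symbols, '0' and '1' should both appear
# (fairseq may also append a few 'madeupword*' padding entries).
print(label_dict.symbols[label_dict.nspecial:])
```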

### 5) Run Training

```bash
TOTAL_NUM_UPDATES=7812  # 10 epochs through IMDB for bsz 32: 25,000 examples / 32 ≈ 781 updates per epoch
WARMUP_UPDATES=469      # 6 percent of the total number of updates
LR=1e-05                # Peak LR for polynomial LR scheduler.
NUM_CLASSES=2
MAX_SENTENCES=8         # Batch size per GPU (effective batch size is 32 with --update-freq 4).
ROBERTA_PATH=/path/to/roberta/model.pt

CUDA_VISIBLE_DEVICES=0 python train.py IMDB-bin/ \
    --restore-file $ROBERTA_PATH \
    --max-positions 512 \
    --max-sentences $MAX_SENTENCES \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch roberta_large \
    --criterion sentence_prediction \
    --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
    --max-epoch 10 \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --truncate-sequence \
    --find-unused-parameters \
    --update-freq 4
```
The above will train with an effective batch size of `32`; it was tested on a single `NVIDIA V100 32GB` GPU.
Expected `best-validation-accuracy` after `10` epochs is `~96.5%`.
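
Once training finishes, the finetuned checkpoint (saved under `checkpoints/` by default) can be loaded through RoBERTa's hub interface for inference. The sketch below follows the pattern from the main RoBERTa README; the example sentence is made up, and `sentence_classification_head` is assumed to be the default head name registered by the `sentence_prediction` task:

```python
from fairseq.models.roberta import RobertaModel

# Load the finetuned checkpoint together with the IMDB-bin dictionaries.
roberta = RobertaModel.from_pretrained(
    'checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='IMDB-bin'
)
roberta.eval()  # disable dropout for evaluation

# Map the model's output index back to the label written in step 2 ('0' or '1').
label_fn = lambda label: roberta.task.label_dictionary.string(
    [label + roberta.task.label_dictionary.nspecial]
)

tokens = roberta.encode('This movie was great!')  # hypothetical example review
pred = roberta.predict('sentence_classification_head', tokens).argmax().item()
print(label_fn(pred))  # '1' = positive, '0' = negative
```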