## Language model training

These examples fine-tune (or train from scratch) the library models for language modeling on a text dataset. Supported
models include GPT, GPT-2, ALBERT, BERT, DistilBERT, RoBERTa and XLNet. GPT and GPT-2 are trained or fine-tuned using a
causal language modeling (CLM) loss, while ALBERT, BERT, DistilBERT and RoBERTa are trained or fine-tuned using a masked
language modeling (MLM) loss. XLNet uses permutation language modeling (PLM). You can find more information about the
differences between those objectives in our [model summary](https://huggingface.co/transformers/model_summary.html).
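
As a rough illustration of the causal objective (a minimal sketch, not the code used by the scripts): each position is
trained to predict the next token from the tokens before it. The helper name `causal_lm_pairs` and the toy token list
are made up for this example.

```python
def causal_lm_pairs(tokens):
    """Return (context, next_token) pairs for a causal language modeling objective."""
    # The model sees tokens[:i] and must predict tokens[i]; it never sees later tokens.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["The", "cat", "sat", "down"]
for context, target in causal_lm_pairs(tokens):
    print(context, "->", target)
```

A masked language model, by contrast, sees the tokens on both sides of each hidden position.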

These scripts leverage the 🤗 Datasets library and the Trainer API. You can easily customize them to your needs if you
need extra processing on your datasets.

**Note:** The old script `run_language_modeling.py` is still available
[here](https://github.com/huggingface/transformers/blob/master/examples/contrib/legacy/run_language_modeling.py).

The following examples run on datasets hosted on our [hub](https://huggingface.co/datasets) or on your own
text files for training and validation. We give examples of both below.

### GPT-2/GPT and causal language modeling

The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.

```bash
python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```

Training takes about half an hour on a single K80 GPU, and evaluation takes about one minute. Once fine-tuned on the
dataset, the model reaches a perplexity of ~20.
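
For readers unfamiliar with the metric, perplexity is the exponential of the average per-token cross-entropy loss. The
snippet below is a minimal sketch of that relationship; the function name and the loss values are hypothetical and are
not taken from an actual run.

```python
import math


def perplexity(token_losses):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)


# Hypothetical per-token losses whose mean is close to 3.0.
losses = [2.9, 3.1, 3.0, 3.0]
print(round(perplexity(losses), 2))
```

Under this definition, a perplexity of about 20 corresponds to an average evaluation loss of about 3.0.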

To run on your own training and validation files, use the following command:

```bash
python run_clm.py \
    --model_name_or_path gpt2 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```


### RoBERTa/BERT/DistilBERT and masked language modeling

The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.

In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may therefore
converge slightly more slowly (over-fitting takes more epochs).

```bash
python run_mlm.py \
    --model_name_or_path roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
```
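
To illustrate the dynamic masking mentioned above, here is a simplified sketch in which a fresh mask is sampled every
time a sequence is served to the model, instead of being fixed once during preprocessing. The names `dynamic_mask`,
`MASK` and `IGNORE` are hypothetical. The sketch always substitutes `[MASK]`, whereas the BERT recipe also leaves some
selected tokens unchanged or swaps them for random tokens.

```python
import random

MASK = "[MASK]"
IGNORE = None  # marks positions that do not contribute to the loss


def dynamic_mask(tokens, mlm_probability=0.15, rng=random):
    """Sample a fresh mask for one pass over `tokens`."""
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mlm_probability:
            inputs.append(MASK)    # hide the token from the model
            labels.append(token)   # and ask the model to recover it
        else:
            inputs.append(token)
            labels.append(IGNORE)
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
for epoch in range(2):
    # The mask is re-sampled at every epoch, so masked positions generally differ.
    print(dynamic_mask(tokens, rng=random.Random(epoch)))
```

With static masking, the same positions would be hidden at every epoch.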

To run on your own training and validation files, use the following command:

```bash
python run_mlm.py \
    --model_name_or_path roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
```

If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them into blocks of the same length).
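
The default behavior described above can be sketched as follows. This is a minimal illustration on toy token ids, not
the script's actual preprocessing code; the function name `group_texts` is used here for illustration, and dropping
the incomplete final block is an assumption of this sketch.

```python
def group_texts(tokenized_lines, block_size):
    """Concatenate token lists, then cut the stream into equal-sized blocks."""
    stream = [token for line in tokenized_lines for token in line]
    # Drop the tail that does not fill a whole block (an assumption of this sketch).
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


lines = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(group_texts(lines, block_size=4))  # blocks may cross line boundaries
print(lines)                             # line by line: one sample per line
```

Grouping yields samples of identical length, while line-by-line processing keeps each line as a separate sample of
its own length.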

**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.

### Whole word masking

The BERT authors released a new version of BERT using Whole Word Masking in May 2019. Instead of masking randomly
selected tokens (which may be part of words), they mask randomly selected words (masking all the tokens corresponding
to that word). This technique has been refined for Chinese in [this paper](https://arxiv.org/abs/1906.08101).
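
The idea can be sketched in a few lines, assuming WordPiece-style tokens where a `##` prefix marks the continuation of
a word. The helper names `group_into_words` and `whole_word_mask` are hypothetical and this is not the collator used by
the script.

```python
import random


def group_into_words(tokens):
    """Group token indices so that `##` pieces join the word that precedes them."""
    words = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and words:
            words[-1].append(index)
        else:
            words.append([index])
    return words


def whole_word_mask(tokens, mask_probability=0.15, rng=random):
    """Select words at random and mask every piece of each selected word."""
    masked = list(tokens)
    for word in group_into_words(tokens):
        if rng.random() < mask_probability:
            for index in word:
                masked[index] = "[MASK]"
    return masked


tokens = ["the", "un", "##believ", "##able", "story"]
print(group_into_words(tokens))
print(whole_word_mask(tokens, mask_probability=0.5, rng=random.Random(0)))
```

The pieces `un`, `##believ` and `##able` are always masked together or not at all.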

To fine-tune a model using whole word masking, use the following script:
```bash
python run_mlm_wwm.py \
    --model_name_or_path roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm-wwm
```

For Chinese models, we need to generate reference files (which requires the ltp library), because Chinese text is
tokenized at the character level.

**Q:** Why a reference file?

**A:** Suppose we have a Chinese sentence like `我喜欢你`. The original Chinese-BERT will tokenize it as
`['我','喜','欢','你']` (character level). But `喜欢` is a whole word. For whole word masking, we need a result
like `['我','喜','##欢','你']`, so we need a reference file to tell the model which positions of the original BERT
tokens should be prefixed with `##`.
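
To make this concrete, the sketch below derives such positions from a word segmentation and applies them to the
character-level tokens. Both helpers are hypothetical, and the word list stands in for the output of a segmenter such
as LTP. The actual format written by `run_chinese_ref.py` may differ, for instance in how positions are offset for
special tokens.

```python
def continuation_positions(words):
    """Return the character positions that continue a multi-character word."""
    positions = []
    index = 0
    for word in words:
        for offset in range(len(word)):
            if offset > 0:  # every character after the first continues the word
                positions.append(index)
            index += 1
    return positions


def apply_reference(tokens, positions):
    """Prefix `##` to the tokens found at the given positions."""
    marked = set(positions)
    return ["##" + token if i in marked else token for i, token in enumerate(tokens)]


words = ["我", "喜欢", "你"]   # assumed word segmentation of the sentence
tokens = list("我喜欢你")      # character-level tokens
positions = continuation_positions(words)
print(positions)
print(apply_reference(tokens, positions))
```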

**Q:** Why LTP?

**A:** Because the best known Chinese WWM BERT is [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm) by HIT.
It works well on many Chinese tasks such as CLUE (Chinese GLUE). They use LTP, so if we want to fine-tune their model,
we need LTP.

LTP currently only works well with `transformers==3.2.0`, so we don't add it to `requirements.txt`.
You need to create a separate environment with this version of Transformers to run the `run_chinese_ref.py` script that
will create the reference files. The script is in `examples/contrib`. Once in the proper environment, run the
following:


```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export LTP_RESOURCE=/path/to/ltp/tokenizer
export BERT_RESOURCE=/path/to/bert/tokenizer
export SAVE_PATH=/path/to/data/ref.txt

# Use the variables exported above instead of placeholder paths.
python examples/contrib/run_chinese_ref.py \
    --file_name=$TRAIN_FILE \
    --ltp=$LTP_RESOURCE \
    --bert=$BERT_RESOURCE \
    --save_path=$SAVE_PATH
```

Then you can run the script like this: 

```bash
python run_mlm_wwm.py \
    --model_name_or_path roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --train_ref_file path_to_train_chinese_ref_file \
    --validation_ref_file path_to_validation_chinese_ref_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm-wwm
```

**Note:** On TPU, you should use the flag `--pad_to_max_length` to make sure all your batches have the same length.

### XLNet and permutation language modeling

XLNet uses a different training objective, which is permutation language modeling. It is an autoregressive method 
to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input 
sequence factorization order.

We use the `--plm_probability` flag to define the ratio of the length of a span of masked tokens to the length of its
surrounding context for permutation language modeling.

The `--max_span_length` flag may also be used to limit the length of a span of masked tokens used 
for permutation language modeling.
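
To make the two flags concrete, here is a minimal sketch of span sampling that follows the description above. It is
an illustration under stated assumptions rather than the library's implementation: the function name
`sample_plm_mask` and the default values in its signature are chosen for this example.

```python
import random


def sample_plm_mask(seq_len, plm_probability=1 / 6, max_span_length=5, rng=None):
    """Mark one span of masked tokens inside each successive context window."""
    rng = rng or random.Random()
    masked = [False] * seq_len
    cursor = 0
    while cursor < seq_len:
        # Span length is capped by max_span_length.
        span = rng.randint(1, max_span_length)
        # The context window is sized so that span / context is about plm_probability.
        context = max(span, int(span / plm_probability))
        # Place the span at a random offset inside its context window.
        start = cursor + rng.randint(0, context - span)
        for i in range(start, min(start + span, seq_len)):
            masked[i] = True
        cursor += context
    return masked


mask = sample_plm_mask(60, rng=random.Random(0))
print("".join("x" if m else "." for m in mask))
print(sum(mask) / len(mask))  # fraction of masked positions
```

A larger `plm_probability` shrinks the context around each span, so a larger share of the sequence is masked.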

Here is how to fine-tune XLNet on wikitext-2:

```bash
python run_plm.py \
    --model_name_or_path=xlnet-base-cased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm
```

To fine-tune it on your own training and validation file, run:

```bash
python run_plm.py \
    --model_name_or_path=xlnet-base-cased \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm
```

If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them into blocks of the same length).

**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.
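
The difference between padding each batch to its longest sample and padding everything to a fixed length can be
sketched as follows. The helper names and the padding id are hypothetical and this is not the scripts' collation code.
Fixed shapes matter on TPU because each new input shape typically triggers a recompilation.

```python
PAD_ID = 0  # hypothetical padding token id


def pad_to_longest(batch):
    """Dynamic padding: the batch width depends on its longest sample."""
    width = max(len(sample) for sample in batch)
    return [sample + [PAD_ID] * (width - len(sample)) for sample in batch]


def pad_to_max_length(batch, max_length):
    """Fixed padding: every batch has the same width; longer samples are truncated."""
    return [(sample + [PAD_ID] * max_length)[:max_length] for sample in batch]


batches = [[[5, 6], [7]], [[8, 9, 10, 11], [12, 13]]]
print([len(pad_to_longest(b)[0]) for b in batches])        # widths vary per batch
print([len(pad_to_max_length(b, 6)[0]) for b in batches])  # widths are constant
```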