<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

## Language model training

Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2,
ALBERT, BERT, DistilBERT, RoBERTa, XLNet... GPT and GPT-2 are trained or fine-tuned using a causal language modeling
(CLM) loss while ALBERT, BERT, DistilBERT and RoBERTa are trained or fine-tuned using a masked language modeling (MLM)
loss. XLNet uses permutation language modeling (PLM); you can find more information about the differences between those
objectives in our [model summary](https://huggingface.co/transformers/model_summary.html).

There are two sets of scripts provided. The first set leverages the Trainer API. The second set, with `no_trainer` in the suffix, uses a custom training loop and leverages the 🤗 Accelerate library. Both sets use the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.

**Note:** The old script `run_language_modeling.py` is still available [here](https://github.com/huggingface/transformers/blob/main/examples/legacy/run_language_modeling.py).

The following examples will run on datasets hosted on our [hub](https://huggingface.co/datasets) or with your own
text files for training and validation. We give examples of both below.

### GPT-2/GPT and causal language modeling

The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.

```bash
python run_clm.py \
    --model_name_or_path openai-community/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```

This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.

To run on your own training and validation files, use the following command:

```bash
python run_clm.py \
    --model_name_or_path openai-community/gpt2 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```

This uses the built-in HuggingFace `Trainer` for training. If you want to use a custom training loop, you can utilize or adapt the `run_clm_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:

```bash
python run_clm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --model_name_or_path openai-community/gpt2 \
    --output_dir /tmp/test-clm
```

### GPT-2/GPT and causal language modeling with fill-in-the-middle objective

The following example fine-tunes GPT-2 on WikiText-2 using the fill-in-the-middle (FIM) training objective. The FIM objective was proposed in [Efficient Training of Language Models to Fill in the Middle](https://arxiv.org/abs/2207.14255), which showed that autoregressive language models can learn to infill text after a straightforward transformation is applied to the dataset: a span of text is simply moved from the middle of a document to its end.

We're using the raw WikiText-2 (no tokens were replaced before the tokenization). The loss here is that of causal language modeling.

```bash
python run_fim.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --fim_rate 0.5 \
    --fim_spm_rate 0.2 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```

To run on your own training and validation files, use the following command:

```bash
python run_fim.py \
    --model_name_or_path gpt2 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --fim_rate 0.5 \
    --fim_spm_rate 0.2 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```

This uses the built-in HuggingFace `Trainer` for training. If you want to use a custom training loop, you can utilize or adapt the `run_fim_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:

```bash
python run_fim_no_trainer.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --fim_rate 0.5 \
    --fim_spm_rate 0.2 \
    --output_dir /tmp/test-clm
```

**Note:** Passing a FIM rate of `0.5` means that the FIM transformation is applied to an example with a probability of 50%, while a FIM SPM rate of `0.2` means that 20% of the transformed examples use the SPM (Suffix-Prefix-Middle) ordering and the remaining 80% use the PSM (Prefix-Suffix-Middle) ordering.

### RoBERTa/BERT/DistilBERT and masked language modeling

The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.

In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore,
converge slightly slower (over-fitting takes more epochs).

```bash
python run_mlm.py \
    --model_name_or_path FacebookAI/roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
```

To run on your own training and validation files, use the following command:

```bash
python run_mlm.py \
    --model_name_or_path FacebookAI/roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
```

If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them in blocks of the same length).

This uses the built-in HuggingFace `Trainer` for training. If you want to use a custom training loop, you can utilize or adapt the `run_mlm_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:

```bash
python run_mlm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --model_name_or_path FacebookAI/roberta-base \
    --output_dir /tmp/test-mlm
```

**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.
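
For example, building on the command above, a line-by-line run with fixed-length padding could look like this:

```bash
python run_mlm.py \
    --model_name_or_path FacebookAI/roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --line_by_line \
    --pad_to_max_length \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
```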

### Whole word masking

This part was moved to `examples/research_projects/mlm_wwm`.

### XLNet and permutation language modeling

XLNet uses a different training objective, which is permutation language modeling. It is an autoregressive method
to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input
sequence factorization order.

We use the `--plm_probability` flag to define the ratio of length of a span of masked tokens to surrounding
context length for permutation language modeling.

The `--max_span_length` flag may also be used to limit the length of a span of masked tokens used
for permutation language modeling.

Here is how to fine-tune XLNet on wikitext-2:

```bash
python run_plm.py \
    --model_name_or_path=xlnet/xlnet-base-cased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm
```
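
To set the span parameters explicitly, you can add the `--plm_probability` and `--max_span_length` flags to the same command; the values below are purely illustrative:

```bash
python run_plm.py \
    --model_name_or_path=xlnet/xlnet-base-cased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --plm_probability 0.16 \
    --max_span_length 5 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm
```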

To fine-tune it on your own training and validation file, run:

```bash
python run_plm.py \
    --model_name_or_path=xlnet/xlnet-base-cased \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm
```

If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them in blocks of the same length).

**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.

## Streaming

To use the streaming dataset mode, which can be very useful for large datasets, add `--streaming` to the command line. This is supported by `run_mlm.py`, `run_clm.py` and `run_fim.py`. Make sure to adapt the other scripts to your use case by taking inspiration from them.
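
For example, a sketch of the first causal language modeling command above with streaming enabled (a streamed dataset has no predefined length, so an explicit `--max_steps` budget is also passed):

```bash
python run_clm.py \
    --model_name_or_path openai-community/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --streaming \
    --max_steps 1000 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```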

## Low CPU Memory Usage

To use the low CPU memory mode, which can be very useful for large language models, add `--low_cpu_mem_usage` to the command line. This is currently supported by `run_clm.py`, `run_mlm.py`, `run_plm.py`, `run_fim.py`, `run_mlm_no_trainer.py`, `run_clm_no_trainer.py` and `run_fim_no_trainer.py`.
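
For example, building on the first causal language modeling command above:

```bash
python run_clm.py \
    --model_name_or_path openai-community/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --low_cpu_mem_usage \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```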

## Creating a model on the fly

When training a model from scratch, configuration values may be overridden with the help of `--config_overrides`:


```bash
python run_clm.py --model_type gpt2 --tokenizer_name openai-community/gpt2 \
    --config_overrides="n_embd=1024,n_head=16,n_layer=48,n_positions=1024" \
[...]
```

This feature is only available in `run_clm.py`, `run_plm.py`, `run_mlm.py` and `run_fim.py`.
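
For example, a complete from-scratch run on WikiText-2 combining `--config_overrides` with the usual arguments could look like this (the override values are purely illustrative):

```bash
python run_clm.py \
    --model_type gpt2 \
    --tokenizer_name openai-community/gpt2 \
    --config_overrides="n_embd=1024,n_head=16,n_layer=48,n_positions=1024" \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm-from-scratch
```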