<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

## Language model training

The scripts here fine-tune (or train from scratch) the library models for language modeling on a text dataset, for
models such as GPT, GPT-2, ALBERT, BERT, DistilBERT, RoBERTa and XLNet. GPT and GPT-2 are trained or fine-tuned using a
causal language modeling (CLM) loss, while ALBERT, BERT, DistilBERT and RoBERTa are trained or fine-tuned using a
masked language modeling (MLM) loss. XLNet uses permutation language modeling (PLM). You can find more information
about the differences between those objectives in our [model summary](https://huggingface.co/transformers/model_summary.html).
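
To make the difference between the first two objectives concrete, here is a minimal, library-free sketch of how the
training targets could be built for a toy sequence. The helper names `clm_targets` and `mlm_targets`, the `[MASK]`
string and the `IGNORE` sentinel are illustrative assumptions and do not come from the scripts.

```python
IGNORE = None  # hypothetical sentinel: this position does not contribute to the loss


def clm_targets(tokens):
    """Causal LM: each position predicts the next token from its left context."""
    return tokens[:-1], tokens[1:]


def mlm_targets(tokens, masked_positions, mask_token="[MASK]"):
    """Masked LM: hidden positions are predicted from context on both sides."""
    inputs = [mask_token if i in masked_positions else t for i, t in enumerate(tokens)]
    labels = [t if i in masked_positions else IGNORE for i, t in enumerate(tokens)]
    return inputs, labels


tokens = ["the", "cat", "sat", "down"]
clm_inputs, clm_labels = clm_targets(tokens)
mlm_inputs, mlm_labels = mlm_targets(tokens, masked_positions={2})
print(clm_inputs, clm_labels)
print(mlm_inputs, mlm_labels)
```

This only illustrates the idea; it is not how the scripts prepare their batches.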

These scripts leverage the 🤗 Datasets library and the Trainer API. You can easily customize them if your datasets
need extra processing.

**Note:** The old script `run_language_modeling.py` is still available [here](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py).

The following examples run on datasets hosted on our [hub](https://huggingface.co/datasets) or on your own
text files for training and validation. We give examples of both below.

### GPT-2/GPT and causal language modeling

The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.

```bash
python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```

This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. Once
fine-tuned on the dataset, the model reaches a perplexity of ~20.
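
Perplexity is the exponential of the mean per-token cross-entropy loss (measured in nats), so a perplexity of ~20
corresponds to an evaluation loss of roughly 3.0. A minimal sketch of the conversion, where the `eval_loss` value is an
assumed placeholder rather than a measured result:

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (natural log) into perplexity."""
    return math.exp(mean_cross_entropy)


eval_loss = 3.0  # assumed illustrative value, not taken from an actual run
ppl = perplexity(eval_loss)
print(f"perplexity = {ppl:.2f}")
```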

To run on your own training and validation files, use the following command:

```bash
python run_clm.py \
    --model_name_or_path gpt2 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```


### RoBERTa/BERT/DistilBERT and masked language modeling

The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.

In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore,
converge slightly more slowly (over-fitting takes more epochs).
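
With static masking, the masked positions are chosen once during preprocessing and reused every epoch. With dynamic
masking, they are re-sampled each time a sequence is batched. The sketch below illustrates the dynamic variant in plain
Python. It is a simplification under stated assumptions: `MASK_ID`, the 15% masking rate and the helper name
`dynamic_mask` are hypothetical, and the usual refinement of sometimes keeping or randomly replacing a selected token
is left out.

```python
import random

MASK_ID = 0          # hypothetical id of the mask token
IGNORE_INDEX = -100  # label value conventionally ignored by the loss


def dynamic_mask(token_ids, mlm_probability=0.15, rng=random):
    """Pick a fresh set of masked positions on every call."""
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mlm_probability:
            inputs.append(MASK_ID)  # hide the token from the model
            labels.append(tok)      # and ask the model to recover it
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)
    return inputs, labels


rng = random.Random(0)
sequence = list(range(1, 41))  # 40 toy token ids
epoch_masks = []
for epoch in range(3):
    inputs, labels = dynamic_mask(sequence, rng=rng)
    epoch_masks.append([i for i, lab in enumerate(labels) if lab != IGNORE_INDEX])
    print(f"epoch {epoch}: masked positions {epoch_masks[-1]}")
```

Because the model sees a different corrupted version of each sequence at every epoch, it has a harder time memorizing
the training set, which fits the slower over-fitting mentioned above.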

```bash
python run_mlm.py \
    --model_name_or_path roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
```

To run on your own training and validation files, use the following command:

```bash
python run_mlm.py \
    --model_name_or_path roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
```

If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them into blocks of the same length).

**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.
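
The two preprocessing modes can be sketched on already-tokenized toy data as follows. This is a simplified
illustration: the helper names `group_texts` and `line_by_line` are chosen here for the sketch, and dropping the
trailing remainder that does not fill a whole block is an assumption about the grouping step.

```python
def group_texts(tokenized_texts, block_size):
    """Concatenate all token lists, then cut them into equal-length blocks."""
    concatenated = [tok for text in tokenized_texts for tok in text]
    # Assumption: the trailing remainder shorter than block_size is dropped.
    usable = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, usable, block_size)]


def line_by_line(tokenized_texts, block_size):
    """Keep one sample per line, truncating lines longer than block_size."""
    return [text[:block_size] for text in tokenized_texts if text]


lines = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
blocks = group_texts(lines, block_size=4)
samples = line_by_line(lines, block_size=4)
print(blocks)   # equal-length blocks
print(samples)  # one sample per line, lengths differ
```

The line-by-line samples have different lengths, which is why padding them to a fixed length matters on TPU.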

### Whole word masking

This part was moved to `examples/research_projects/mlm_wwm`. 

### XLNet and permutation language modeling

XLNet uses a different training objective, which is permutation language modeling. It is an autoregressive method 
to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input 
sequence factorization order.

We use the `--plm_probability` flag to define the ratio of the length of a span of masked tokens to the length of its
surrounding context for permutation language modeling.

The `--max_span_length` flag may also be used to limit the length of a span of masked tokens used 
for permutation language modeling.
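
One way to read these two flags together is sketched below: a span length is drawn up to the maximum, the context
window is sized so that the span covers a `plm_probability` fraction of it, and the span is placed at a random offset
inside that window. This is an illustrative sketch only. The function name `sample_masked_spans` and the values used
for `plm_probability` and `max_span_length` are assumptions chosen for the example, not the script's defaults.

```python
import random


def sample_masked_spans(seq_len, plm_probability=0.25, max_span_length=5, rng=random):
    """Return (start, end) index pairs of masked spans, with end exclusive."""
    spans = []
    cur = 0
    while cur < seq_len:
        span_length = rng.randint(1, max_span_length)
        # The span fills a plm_probability fraction of its context window.
        context_length = int(span_length / plm_probability)
        start = cur + rng.randint(0, context_length - span_length)
        if start < seq_len:
            # Clip the last span so it never runs past the sequence.
            spans.append((start, min(start + span_length, seq_len)))
        cur += context_length
    return spans


rng = random.Random(0)
spans = sample_masked_spans(seq_len=64, rng=rng)
masked = sum(end - start for start, end in spans)
print(spans)
print(f"masked fraction: {masked / 64:.2f}")
```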

Here is how to fine-tune XLNet on WikiText-2:

```bash
python run_plm.py \
    --model_name_or_path=xlnet-base-cased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm
```

To fine-tune it on your own training and validation files, run:

```bash
python run_plm.py \
    --model_name_or_path=xlnet-base-cased \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm
```

If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them into blocks of the same length).

**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.