language_modeling.mdx 16.2 KB
Newer Older
Steven Liu's avatar
Steven Liu committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Language modeling

Language modeling predicts words in a sentence. There are two forms of language modeling.

<Youtube id="Vpjb1lu0MDk"/>

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left.

<Youtube id="mqElG5QJWUg"/>

Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally.

This guide will show you how to fine-tune [DistilGPT2](https://huggingface.co/distilgpt2) for causal language modeling and [DistilRoBERTa](https://huggingface.co/distilroberta-base) for masked language modeling on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://huggingface.co/datasets/eli5) dataset.

<Tip>

You can fine-tune other architectures for language modeling such as [GPT-Neo](https://huggingface.co/EleutherAI/gpt-neo-125M), [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B), and [BERT](https://huggingface.co/bert-base-uncased), following the same steps presented in this guide!

See the text generation [task page](https://huggingface.co/tasks/text-generation) and fill mask [task page](https://huggingface.co/tasks/fill-mask) for more information about their associated models, datasets, and metrics.

</Tip>

## Load ELI5 dataset

Load only the first 5000 rows of the ELI5 dataset from the 馃 Datasets library since it is pretty large:

```py
>>> from datasets import load_dataset

>>> eli5 = load_dataset("eli5", split="train_asks[:5000]")
```

Split this dataset into a train and test set:

```py
eli5 = eli5.train_test_split(test_size=0.2)
```

Then take a look at an example:

```py
>>> eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}
```

Notice `text` is a subfield nested inside the `answers` dictionary. When you preprocess the dataset, you will need to extract the `text` subfield into a separate column.

## Preprocess

<Youtube id="ma1TrR7gE7I"/>

For causal language modeling, load the DistilGPT2 tokenizer to process the `text` subfield:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
```

<Youtube id="8PmhEIXhBvI"/>

For masked language modeling, load the DistilRoBERTa tokenizer instead:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
```

Extract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process.html#flatten) method:

```py
>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []}
```

Each subfield is now a separate column as indicated by the `answers` prefix. Notice that `answers.text` is a list. Instead of tokenizing each sentence separately, convert the list to a string to jointly tokenize them.

Here is how you can create a preprocessing function to convert the list to a string and truncate sequences to be no longer than DistilGPT2's maximum input length:

```py
>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)
```

Use 馃 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once and increasing the number of processes with `num_proc`. Remove the columns you don't need:

```py
>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )
```

Now you need a second preprocessing function to capture text truncated from any lengthy examples to prevent loss of information. This preprocessing function should:

- Concatenate all the text.
- Split the concatenated text into smaller chunks defined by `block_size`.

```py
>>> block_size = 128


>>> def group_texts(examples):
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     result["labels"] = result["input_ids"].copy()
...     return result
```

Apply the `group_texts` function over the entire dataset:

```py
>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
```

For causal language modeling, use [`DataCollatorForLanguageModeling`] to create a batch of examples. It will also *dynamically pad* your text to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient. 

Sylvain Gugger's avatar
Sylvain Gugger committed
160
161
<frameworkcontent>
<pt>
Steven Liu's avatar
Steven Liu committed
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
You can use the end of sequence token as the padding token, and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:

```py
>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```

For masked language modeling, use the same [`DataCollatorForLanguageModeling`] except you should specify `mlm_probability` to randomly mask tokens each time you iterate over the data.

```py
>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
Sylvain Gugger's avatar
Sylvain Gugger committed
178
179
180
181
182
183
184
185
186
187
188
189
190
191
```
</pt>
<tf>
You can use the end of sequence token as the padding token, and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:

```py
>>> from transformers import DataCollatorForLanguageModeling

>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
```

For masked language modeling, use the same [`DataCollatorForLanguageModeling`] except you should specify `mlm_probability` to randomly mask tokens each time you iterate over the data.

```py
Steven Liu's avatar
Steven Liu committed
192
193
194
195
>>> from transformers import DataCollatorForLanguageModeling

>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
```
Sylvain Gugger's avatar
Sylvain Gugger committed
196
197
</tf>
</frameworkcontent>
Steven Liu's avatar
Steven Liu committed
198
199
200
201
202

## Causal language modeling

Causal language modeling is frequently used for text generation. This section shows you how to fine-tune [DistilGPT2](https://huggingface.co/distilgpt2) to generate new text.

203
### Train
Steven Liu's avatar
Steven Liu committed
204

205
206
<frameworkcontent>
<pt>
Steven Liu's avatar
Steven Liu committed
207
208
209
210
211
212
213
214
215
216
Load DistilGPT2 with [`AutoModelForCausalLM`]:

```py
>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
```

<Tip>

Steven Liu's avatar
Steven Liu committed
217
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
Steven Liu's avatar
Steven Liu committed
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244

</Tip>

At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`].
2. Pass the training arguments to [`Trainer`] along with the model, datasets, and data collator.
3. Call [`~Trainer.train`] to fine-tune your model.

```py
>>> training_args = TrainingArguments(
...     output_dir="./results",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     weight_decay=0.01,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
... )

>>> trainer.train()
```
245
246
247
</pt>
<tf>
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
Steven Liu's avatar
Steven Liu committed
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266

```py
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     dummy_labels=True,
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = lm_dataset["test"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     dummy_labels=True,
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
```

267
268
269
270
271
272
<Tip>

If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](training#finetune-with-keras)!

</Tip>

Steven Liu's avatar
Steven Liu committed
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
Set up an optimizer function, learning rate, and some training hyperparameters:

```py
>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
```

Load DistilGPT2 with [`TFAutoModelForCausalLM`]:

```py
>>> from transformers import TFAutoModelForCausalLM

>>> model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")
```

Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):

```py
>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)
```

Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:

```py
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
```
302
303
</tf>
</frameworkcontent>
Steven Liu's avatar
Steven Liu committed
304
305
306
307
308

## Masked language modeling

Masked language modeling is also known as a fill-mask task because it predicts a masked token in a sequence. Models for masked language modeling require a good contextual understanding of an entire sequence instead of only the left context. This section shows you how to fine-tune [DistilRoBERTa](https://huggingface.co/distilroberta-base) to predict a masked word.

309
### Train
Steven Liu's avatar
Steven Liu committed
310

311
312
<frameworkcontent>
<pt>
Steven Liu's avatar
Steven Liu committed
313
314
315
316
317
318
319
320
321
322
Load DistilRoBERTa with [`AutoModelForMaskedlM`]:

```py
>>> from transformers import AutoModelForMaskedLM

>>> model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
```

<Tip>

Steven Liu's avatar
Steven Liu committed
323
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
Steven Liu's avatar
Steven Liu committed
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351

</Tip>

At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`].
2. Pass the training arguments to [`Trainer`] along with the model, datasets, and data collator.
3. Call [`~Trainer.train`] to fine-tune your model.

```py
>>> training_args = TrainingArguments(
...     output_dir="./results",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     num_train_epochs=3,
...     weight_decay=0.01,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
... )

>>> trainer.train()
```
352
353
354
</pt>
<tf>
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
Steven Liu's avatar
Steven Liu committed
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373

```py
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     dummy_labels=True,
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = lm_dataset["test"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     dummy_labels=True,
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
```

374
375
376
377
378
379
<Tip>

If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](training#finetune-with-keras)!

</Tip>

Steven Liu's avatar
Steven Liu committed
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
Set up an optimizer function, learning rate, and some training hyperparameters:

```py
>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
```

Load DistilRoBERTa with [`TFAutoModelForMaskedLM`]:

```py
>>> from transformers import TFAutoModelForMaskedLM

>>> model = TFAutoModelForCausalLM.from_pretrained("distilroberta-base")
```

Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):

```py
>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)
```

Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:

```py
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
```
409
410
</tf>
</frameworkcontent>
Steven Liu's avatar
Steven Liu committed
411
412
413
414
415
416
417
418

<Tip>

For a more in-depth example of how to fine-tune a model for causal language modeling, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling-tf.ipynb).

</Tip>