<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# How to fine-tune a model for common downstream tasks

[[open-in-colab]]

This guide will show you how to fine-tune 🤗 Transformers models for common downstream tasks. You will use the 🤗
Datasets library to quickly load and preprocess the datasets, getting them ready for training with PyTorch and
TensorFlow.

Before you begin, make sure you have the 🤗 Datasets library installed. For more detailed installation instructions,
refer to the 🤗 Datasets [installation page](https://huggingface.co/docs/datasets/installation.html). All of the
examples in this guide will use 🤗 Datasets to load and preprocess a dataset.

```bash
pip install datasets
```

Learn how to fine-tune a model for:

- [seq_imdb](#seq_imdb)
- [tok_ner](#tok_ner)
- [qa_squad](#qa_squad)

<a id='seq_imdb'></a>

## Sequence classification with IMDb reviews

Sequence classification refers to the task of classifying sequences of text according to a given number of classes. In
this example, learn how to fine-tune a model on the [IMDb dataset](https://huggingface.co/datasets/imdb) to determine
whether a review is positive or negative.

<Tip>

For a more in-depth example of how to fine-tune a model for text classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification-tf.ipynb).

</Tip>

### Load IMDb dataset

The 🤗 Datasets library makes it simple to load a dataset:

```python
from datasets import load_dataset

imdb = load_dataset("imdb")
```

This loads a `DatasetDict` object which you can index into to view an example:

```python
imdb["train"][0]
{
    "label": 1,
    "text": "Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as \"Teachers\". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is \"Teachers\". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!",
}
```
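
You can also print the `DatasetDict` itself to see which splits are available before training (a quick illustrative
check; IMDb ships labeled `train` and `test` splits plus an unlabeled `unsupervised` split, and this guide only uses
the first two):

```python
# Inspect the available splits and their columns before picking one for training.
print(imdb)
```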

### Preprocess

The next step is to tokenize the text into a format the model can understand. It is important to load the same
tokenizer the model was pretrained with to ensure the text is tokenized in the same way. Load the DistilBERT tokenizer
with [`AutoTokenizer`] because you will eventually train a classifier using a pretrained [DistilBERT](https://huggingface.co/distilbert-base-uncased) model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```

Now that you have instantiated a tokenizer, create a function to tokenize the text. You should also truncate longer
sequences so they are no longer than the model's maximum input length:

```python
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
```

Use the 🤗 Datasets `map` function to apply the preprocessing function to the entire dataset. You can also set
`batched=True` to apply the preprocessing function to multiple elements of the dataset at once for faster
preprocessing:

```python
tokenized_imdb = imdb.map(preprocess_function, batched=True)
```

Lastly, pad your texts so they are a uniform length. While it is possible to pad the text in the `tokenizer` function
by setting `padding=True`, it is more efficient to only pad the text to the length of the longest element in its
batch. This is known as **dynamic padding**. You can do this with [`DataCollatorWithPadding`]:

```python
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
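
To see what dynamic padding does, here is a minimal sketch (the sentences are made up for illustration): each batch is
only padded to the length of its own longest sequence rather than to the model maximum:

```python
# Two tokenized examples of different lengths...
features = [tokenizer("a short review"), tokenizer("a slightly longer review about the same movie")]

# ...are padded to the length of the longer one when collated into a batch.
batch = data_collator(features)
print(batch["input_ids"].shape)  # (2, length_of_the_longest_sequence_in_this_batch)
```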

### Fine-tune with the Trainer API

Now load your model with the [`AutoModelForSequenceClassification`] class along with the number of expected labels:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
```

At this point, only three steps remain:

1. Define your training hyperparameters in [`TrainingArguments`].
2. Pass the training arguments to a [`Trainer`] along with the model, dataset, tokenizer, and data collator.
3. Call [`Trainer.train`] to fine-tune your model.

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
```

### Fine-tune with TensorFlow

Fine-tuning with TensorFlow is just as easy, with only a few differences.

Start by batching the processed examples together with dynamic padding using the [`DataCollatorWithPadding`] function.
Make sure you set `return_tensors="tf"` to return `tf.Tensor` outputs instead of PyTorch tensors!

```python
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer, return_tensors="tf")
```

Next, convert your datasets to the `tf.data.Dataset` format with `to_tf_dataset`. Specify inputs and labels in the
`columns` argument:

```python
tf_train_set = tokenized_imdb["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "label"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = tokenized_imdb["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "label"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
```

Set up an optimizer function, learning rate schedule, and some training hyperparameters:

```python
from transformers import create_optimizer
import tensorflow as tf

batch_size = 16
num_train_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_train_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
```

Load your model with the [`TFAutoModelForSequenceClassification`] class along with the number of expected labels:

```python
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
```

Compile the model. Note that you don't pass a loss to `compile`; 🤗 Transformers models can compute a task-appropriate loss internally from the labels in the batch:

```python
import tensorflow as tf

model.compile(optimizer=optimizer)
```

Finally, fine-tune the model by calling `model.fit`:

```python
model.fit(
    tf_train_set,
    validation_data=tf_validation_set,
    epochs=num_train_epochs,
)
```
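
As a quick usage example (illustrative, not part of the original guide), you can run the fine-tuned model on a new
review; in IMDb, label `1` corresponds to a positive review and `0` to a negative one:

```python
# Tokenize a new review and pick the class with the highest logit.
inputs = tokenizer("This movie was a delight from start to finish.", return_tensors="tf")
logits = model(**inputs).logits
predicted_class = int(tf.math.argmax(logits, axis=-1)[0])  # 1 -> positive, 0 -> negative
```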

<a id='tok_ner'></a>

## Token classification with WNUT emerging entities

Token classification refers to the task of classifying individual tokens in a sentence. One of the most common token
classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence,
such as a person, location, or organization. In this example, learn how to fine-tune a model on the [WNUT 17](https://huggingface.co/datasets/wnut_17) dataset to detect new entities.

<Tip>

For a more in-depth example of how to fine-tune a model for token classification, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification-tf.ipynb).

</Tip>

### Load WNUT 17 dataset

Load the WNUT 17 dataset from the 🤗 Datasets library:

```python
>>> from datasets import load_dataset

>>> wnut = load_dataset("wnut_17")
```

A quick look at the dataset shows the labels associated with each word in the sentence:

```python
>>> wnut["train"][0]
{'id': '0',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
 'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}
```

View the specific NER tags with:

```python
>>> label_list = wnut["train"].features["ner_tags"].feature.names
>>> label_list
[
    "O",
    "B-corporation",
    "I-corporation",
    "B-creative-work",
    "I-creative-work",
    "B-group",
    "I-group",
    "B-location",
    "I-location",
    "B-person",
    "I-person",
    "B-product",
    "I-product",
]
```

Each NER tag is prefixed with a letter that indicates the token's position in an entity:

- `B-` indicates the beginning of an entity.
- `I-` indicates a token is contained inside the same entity (e.g., the `State` token is part of an entity like
  `Empire State Building`).
- `O` indicates the token doesn't correspond to any entity.
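
For example, you can map the integer tags of the first training example back to their names (a quick illustration
using the `wnut` and `label_list` objects from above):

```python
>>> [label_list[tag] for tag in wnut["train"][0]["ner_tags"]]
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-location', 'I-location', 'I-location', 'O', 'B-location', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```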

### Preprocess

Now you need to tokenize the text. Load the DistilBERT tokenizer with an [`AutoTokenizer`]:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```

Since the input has already been split into words, set `is_split_into_words=True` to tokenize the words into
subwords:

```python
>>> example = wnut["train"][0]
>>> tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
>>> tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
>>> tokens
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
```

Adding the special tokens `[CLS]` and `[SEP]` and splitting words into subwords creates a mismatch between the
inputs and the labels: a single word with one label may now be split into several subwords. Realign the labels and tokens by:

1. Mapping all tokens to their corresponding word with the `word_ids` method.
2. Assigning the label `-100` to the special tokens `[CLS]` and `[SEP]` so the PyTorch loss function ignores
   them.
3. Only labeling the first token of a given word. Assign `-100` to the other subtokens from the same word.

Here is how you can create a function that will realign the labels and tokens:

```python
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```
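
Before applying it to the whole dataset, you can sanity-check the alignment on a single example (illustrative only,
assuming the objects defined above):

```python
# Special tokens and non-initial subwords get -100; every other token keeps its original tag id.
sample = tokenize_and_align_labels(wnut["train"][:1])
print(sample["labels"][0])
```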

Now tokenize and align the labels over the entire dataset with the 🤗 Datasets `map` function:

```python
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
```

Finally, pad your texts and labels so they are a uniform length:

```python
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)
```

### Fine-tune with the Trainer API

Load your model with the [`AutoModelForTokenClassification`] class along with the number of expected labels:

```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
```

Gather your training arguments in [`TrainingArguments`]:

```python
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)
```

Collect your model, training arguments, dataset, data collator, and tokenizer in [`Trainer`]:

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
```

Fine-tune your model:

```python
trainer.train()
```

### Fine-tune with TensorFlow

Batch your examples together and pad your texts and labels so they are a uniform length:

```python
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="tf")
```

Convert your datasets to the `tf.data.Dataset` format with `to_tf_dataset`:

```python
tf_train_set = tokenized_wnut["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = tokenized_wnut["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
```

Load the model with the [`TFAutoModelForTokenClassification`] class along with the number of expected labels:

```python
from transformers import TFAutoModelForTokenClassification

model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
```

Set up an optimizer function, learning rate schedule, and some training hyperparameters:

```python
from transformers import create_optimizer

batch_size = 16
num_train_epochs = 3
num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
    num_warmup_steps=0,
)
```

Compile the model:

```python
import tensorflow as tf

model.compile(optimizer=optimizer)
```

Call `model.fit` to fine-tune your model:

```python
model.fit(
    tf_train_set,
    validation_data=tf_validation_set,
    epochs=num_train_epochs,
)
```

<a id='qa_squad'></a>

## Question Answering with SQuAD

There are many types of question answering (QA) tasks. Extractive QA focuses on identifying the answer from the text
given a question. In this example, learn how to fine-tune a model on the [SQuAD](https://huggingface.co/datasets/squad) dataset.

<Tip>

For a more in-depth example of how to fine-tune a model for question answering, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering-tf.ipynb).

</Tip>

### Load SQuAD dataset

Load the SQuAD dataset from the 🤗 Datasets library:

```python
from datasets import load_dataset

squad = load_dataset("squad")
```

Take a look at an example from the dataset:

```python
>>> squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'
}
```

### Preprocess

Load the DistilBERT tokenizer with an [`AutoTokenizer`]:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```

There are a few things to be aware of when preprocessing text for question answering:

1. Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. You
   can deal with this by truncating only the `context` with `truncation="only_second"`.
2. Next, you need to map the start and end character positions of the answer to the original `context`. Set
   `return_offsets_mapping=True` to handle this.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the `sequence_ids` method to
   find which part of the offsets corresponds to the question and which part corresponds to the context (see the
   short illustration after this list).
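
For example, here is a quick illustration (not part of the original guide) of what `sequence_ids` returns for the
first training example, assuming the `tokenizer` and `squad` objects defined above:

```python
# Illustrative only: sequence_ids marks special tokens with None, question tokens
# with 0, and context tokens with 1. The preprocessing function below relies on this.
encoded = tokenizer(
    squad["train"][0]["question"],
    squad["train"][0]["context"],
    truncation="only_second",
    max_length=384,
)
print(encoded.sequence_ids()[:8])
# [None, 0, 0, 0, 0, 0, 0, 0] -> the [CLS] token followed by the first question tokens
```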

Assemble everything in a preprocessing function as shown below:

```python
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
```
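
As a quick sanity check (illustrative, not part of the original guide), the start and end positions computed for the
first example should decode back to its answer text:

```python
processed = preprocess_function(squad["train"][:1])
start, end = processed["start_positions"][0], processed["end_positions"][0]
print(tokenizer.decode(processed["input_ids"][0][start : end + 1]))
# saint bernadette soubirous
```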

Apply the preprocessing function over the entire dataset with the 🤗 Datasets `map` function:

```python
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
```

The examples were already padded to `max_length` during preprocessing, so no dynamic padding is needed here. Batch the processed examples together:

```python
from transformers import default_data_collator

data_collator = default_data_collator
```

### Fine-tune with the Trainer API

Load your model with the [`AutoModelForQuestionAnswering`] class:

```python
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
```

Gather your training arguments in [`TrainingArguments`]:

```python
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)
```

Collect your model, training arguments, dataset, data collator, and tokenizer in [`Trainer`]:

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
```

Fine-tune your model:

```python
trainer.train()
```

### Fine-tune with TensorFlow

Batch the processed examples together with a TensorFlow default data collator:

```python
from transformers.data.data_collator import tf_default_data_collator

data_collator = tf_default_data_collator
```

Convert your datasets to the `tf.data.Dataset` format with the `to_tf_dataset` function:

```python
tf_train_set = tokenized_squad["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
    dummy_labels=True,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
    dummy_labels=True,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
```

Set up an optimizer function, learning rate schedule, and some training hyperparameters:

```python
from transformers import create_optimizer

batch_size = 16
num_epochs = 2
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=total_train_steps,
)
```

Load your model with the [`TFAutoModelForQuestionAnswering`] class:

```python
from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
```

Compile the model:

```python
import tensorflow as tf

model.compile(optimizer=optimizer)
```

Call `model.fit` to fine-tune the model:

```python
model.fit(
    tf_train_set,
    validation_data=tf_validation_set,
    epochs=num_epochs,
)
```