.. 
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

How to fine-tune a model for common downstream tasks
=======================================================================================================================

This guide will show you how to fine-tune 🤗 Transformers models for common downstream tasks. You will use the 🤗
Datasets library to quickly load and preprocess the datasets, getting them ready for training with PyTorch and
TensorFlow.

Before you begin, make sure you have the 🤗 Datasets library installed. For more detailed installation instructions,
refer to the 🤗 Datasets `installation page <https://huggingface.co/docs/datasets/installation.html>`_. All of the
examples in this guide will use 🤗 Datasets to load and preprocess a dataset.

.. code-block:: bash

    pip install datasets

Learn how to fine-tune a model for:

- :ref:`seq_imdb`
- :ref:`tok_ner`
- :ref:`qa_squad`

.. _seq_imdb:

Sequence classification with IMDb reviews
-----------------------------------------------------------------------------------------------------------------------

Sequence classification refers to the task of classifying sequences of text according to a given number of classes. In
this example, learn how to fine-tune a model on the `IMDb dataset <https://huggingface.co/datasets/imdb>`_ to determine
whether a review is positive or negative.

.. note::

    For a more in-depth example of how to fine-tune a model for text classification, take a look at the corresponding
    `PyTorch notebook
    <https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb>`__
    or `TensorFlow notebook
    <https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification-tf.ipynb>`__.

Load IMDb dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 🤗 Datasets library makes it simple to load a dataset:

.. code-block:: python

    from datasets import load_dataset
    imdb = load_dataset("imdb")

This loads a ``DatasetDict`` object which you can index into to view an example:

.. code-block:: python

    imdb["train"][0]
    {'label': 1,
     'text': 'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
    }

Preprocess
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The next step is to tokenize the text into a format the model can understand. It is important to load the same
tokenizer the model was trained with to ensure the text is tokenized consistently. Load the DistilBERT tokenizer with
the :class:`~transformers.AutoTokenizer` because we will eventually train a classifier using a pretrained `DistilBERT
<https://huggingface.co/distilbert-base-uncased>`_ model:

.. code-block:: python

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Now that you have instantiated a tokenizer, create a function that will tokenize the text. You should also truncate
longer sequences in the text to be no longer than the model's maximum input length:

.. code-block:: python

    def preprocess_function(examples):
        return tokenizer(examples["text"], truncation=True)

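As a quick sanity check (a minimal sketch, not required for the rest of the guide), you can apply the function to a
small slice of the dataset and confirm it returns ``input_ids`` and an ``attention_mask`` truncated to the model's
maximum input length:

.. code-block:: python

    sample = preprocess_function(imdb["train"][:2])
    print(list(sample.keys()))                        # expect ['input_ids', 'attention_mask'] for DistilBERT
    print([len(ids) for ids in sample["input_ids"]])  # each should be at most 512 tokens
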
Use 馃 Datasets ``map`` function to apply the preprocessing function to the entire dataset. You can also set
``batched=True`` to apply the preprocessing function to multiple elements of the dataset at once for faster
preprocessing:

.. code-block:: python

    tokenized_imdb = imdb.map(preprocess_function, batched=True)

Lastly, pad your text so the sequences are a uniform length. While it is possible to pad the text in the ``tokenizer``
function by setting ``padding=True``, it is more efficient to only pad the text to the length of the longest element in
its batch. This is known as **dynamic padding**. You can do this with the ``DataCollatorWithPadding`` function:

.. code-block:: python

    from transformers import DataCollatorWithPadding
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

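To see dynamic padding in action, here is a minimal sketch (using the ``tokenized_imdb`` and ``data_collator`` objects
from above) that collates two examples and checks that they are padded only to the longest sequence in that batch:

.. code-block:: python

    # Keep only the tensorizable columns; the raw "text" column is not needed by the collator.
    features = [
        {k: v for k, v in tokenized_imdb["train"][i].items() if k in ["input_ids", "attention_mask", "label"]}
        for i in range(2)
    ]
    batch = data_collator(features)
    print(batch["input_ids"].shape)  # padded to the longer of the two examples in this batch
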
Fine-tune with the Trainer API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now load your model with the :class:`~transformers.AutoModelForSequenceClassification` class along with the number of
expected labels:

.. code-block:: python

    from transformers import AutoModelForSequenceClassification
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

At this point, only three steps remain:

1. Define your training hyperparameters in :class:`~transformers.TrainingArguments`.
2. Pass the training arguments to a :class:`~transformers.Trainer` along with the model, dataset, tokenizer, and data
   collator.
3. Call ``trainer.train`` to fine-tune your model.

.. code-block:: python

    from transformers import TrainingArguments, Trainer

    training_args = TrainingArguments(
        output_dir='./results',
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=5,
        weight_decay=0.01,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_imdb["train"],
        eval_dataset=tokenized_imdb["test"],
        tokenizer=tokenizer,
        data_collator=data_collator,
    )

    trainer.train()

Fine-tune with TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fine-tuning with TensorFlow is just as easy, with only a few differences.

Start by batching the processed examples together with dynamic padding using the ``DataCollatorWithPadding`` function.
Make sure you set ``return_tensors="tf"`` to return ``tf.Tensor`` outputs instead of PyTorch tensors!

.. code-block:: python

    from transformers import DataCollatorWithPadding
    data_collator = DataCollatorWithPadding(tokenizer, return_tensors="tf")

Next, convert your datasets to the ``tf.data.Dataset`` format with ``to_tf_dataset``. Specify inputs and labels in the
``columns`` argument:

.. code-block:: python

    tf_train_dataset = tokenized_imdb["train"].to_tf_dataset(
        columns=['attention_mask', 'input_ids', 'label'],
        shuffle=True,
        batch_size=16,
        collate_fn=data_collator,
    )

    tf_validation_dataset = tokenized_imdb["test"].to_tf_dataset(
        columns=['attention_mask', 'input_ids', 'label'],
        shuffle=False,
        batch_size=16,
        collate_fn=data_collator,
    )

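If you want to confirm what the conversion produced, a quick sketch is to inspect the dataset's ``element_spec``,
which shows the batched tensors that will be fed to the model:

.. code-block:: python

    # The exact element structure can vary between library versions, so inspect it rather than assuming it.
    print(tf_train_dataset.element_spec)
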
Set up an optimizer function, learning rate schedule, and some training hyperparameters:

.. code-block:: python

    from transformers import create_optimizer
    import tensorflow as tf

    batch_size = 16
    num_epochs = 5
    batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
    total_train_steps = int(batches_per_epoch * num_epochs)
    optimizer, schedule = create_optimizer(
        init_lr=2e-5, 
        num_warmup_steps=0, 
        num_train_steps=total_train_steps
    )

Load your model with the :class:`~transformers.TFAutoModelForSequenceClassification` class along with the number of
expected labels:

.. code-block:: python

    from transformers import TFAutoModelForSequenceClassification
    model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Compile the model:

.. code-block:: python

    import tensorflow as tf
    model.compile(optimizer=optimizer)

Finally, fine-tune the model by calling ``model.fit``:

.. code-block:: python

    model.fit(
        tf_train_dataset,
        validation_data=tf_validation_dataset,
        epochs=num_epochs,
    )

.. _tok_ner:

Token classification with WNUT emerging entities
-----------------------------------------------------------------------------------------------------------------------

Token classification refers to the task of classifying individual tokens in a sentence. One of the most common token
classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence,
such as a person, location, or organization. In this example, learn how to fine-tune a model on the `WNUT 17
<https://huggingface.co/datasets/wnut_17>`_ dataset to detect new entities.

.. note::

    For a more in-depth example of how to fine-tune a model for token classification, take a look at the corresponding
    `PyTorch notebook
    <https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb>`__
    or `TensorFlow notebook
    <https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification-tf.ipynb>`__.

Load WNUT 17 dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Load the WNUT 17 dataset from the 🤗 Datasets library:

.. code-block:: python

    from datasets import load_dataset
    wnut = load_dataset("wnut_17")

A quick look at the dataset shows the labels associated with each word in the sentence:

.. code-block:: python

    wnut["train"][0]
    {'id': '0',
     'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
     'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
    }

View the specific NER tag names:

.. code-block:: python

    label_list = wnut["train"].features[f"ner_tags"].feature.names
    label_list
    ['O',
     'B-corporation',
     'I-corporation',
     'B-creative-work',
     'I-creative-work',
     'B-group',
     'I-group',
     'B-location',
     'I-location',
     'B-person',
     'I-person',
     'B-product',
     'I-product'
    ]

Each NER tag is prefixed with a letter that indicates the position of the token in the entity (a short sketch after
this list shows the tags in context):

* ``B-`` indicates the beginning of an entity.
* ``I-`` indicates a token is contained inside the same entity (e.g., the ``State`` token is a part of an entity like
  ``Empire State Building``).
* ``O`` indicates the token doesn't correspond to any entity.

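Here is that sketch: using the ``wnut`` dataset and ``label_list`` defined above, it prints each token of the first
training example next to its tag name:

.. code-block:: python

    for token, tag in zip(wnut["train"][0]["tokens"], wnut["train"][0]["ner_tags"]):
        print(f"{token:20} {label_list[tag]}")  # e.g. 'Empire' should line up with a B- tag and 'State' with an I- tag
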
Preprocess
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now you need to tokenize the text. Load the DistilBERT tokenizer with an :class:`~transformers.AutoTokenizer`:

.. code-block:: python

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Since the input has already been split into words, set ``is_split_into_words=True`` to tokenize the words into
subwords:

.. code-block:: python

    example = wnut["train"][0]
    tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
    tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
    tokens
    ['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']

The addition of the special tokens ``[CLS]`` and ``[SEP]`` and subword tokenization creates a mismatch between the
input and labels. Realign the labels and tokens by:

1. Mapping all tokens to their corresponding word with the ``word_ids`` method.
2. Assigning the label ``-100`` to the special tokens ``[CLS]`` and ``[SEP]`` so the PyTorch loss function ignores
   them.
3. Only labeling the first token of a given word. Assign ``-100`` to the other subtokens from the same word.

Here is how you can create a function that will realign the labels and tokens:

.. code-block:: python

    def tokenize_and_align_labels(examples):
        tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

        labels = []
        for i, label in enumerate(examples[f"ner_tags"]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:                            # Set the special tokens to -100.
                if word_idx is None:
                    label_ids.append(-100)
                elif word_idx != previous_word_idx:              # Only label the first token of a given word.
                    label_ids.append(label[word_idx])
                else:                                            # Assign -100 to other subtokens of the same word.
                    label_ids.append(-100)
                previous_word_idx = word_idx
            labels.append(label_ids)

        tokenized_inputs["labels"] = labels
        return tokenized_inputs

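Before mapping this over the whole dataset, a quick sketch like the following (not required for training) shows what
the realignment does to a single-example batch, where special tokens and the non-first subtokens of a word receive
``-100``:

.. code-block:: python

    sample = tokenize_and_align_labels(wnut["train"][:1])
    print(tokenizer.convert_ids_to_tokens(sample["input_ids"][0]))
    print(sample["labels"][0])  # -100 for [CLS], [SEP], and continuation subwords
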
Now tokenize and align the labels over the entire dataset with the 🤗 Datasets ``map`` function:

.. code-block:: python

    tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

Finally, pad your text and labels so they are a uniform length:

.. code-block:: python

    from transformers import DataCollatorForTokenClassification
    data_collator = DataCollatorForTokenClassification(tokenizer)

Fine-tune with the Trainer API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Load your model with the :class:`~transformers.AutoModelForTokenClassification` class along with the number of
expected labels:

.. code-block:: python

    from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
    model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))

Gather your training arguments in :class:`~transformers.TrainingArguments`:

.. code-block:: python

    training_args = TrainingArguments(
        output_dir='./results',
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
    )

Collect your model, training arguments, dataset, data collator, and tokenizer in :class:`~transformers.Trainer`:

.. code-block:: python

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_wnut["train"],
        eval_dataset=tokenized_wnut["test"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )

Fine-tune your model:

.. code-block:: python

    trainer.train()

Fine-tune with TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Batch your examples together and pad your text and labels so they are a uniform length:

.. code-block:: python

    from transformers import DataCollatorForTokenClassification
    data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="tf")

Convert your datasets to the ``tf.data.Dataset`` format with ``to_tf_dataset``:

.. code-block:: python

    tf_train_set = tokenized_wnut["train"].to_tf_dataset(
        columns=["attention_mask", "input_ids", "labels"],
        shuffle=True,
        batch_size=16,
        collate_fn=data_collator,
    )

    tf_validation_set = tokenized_wnut["validation"].to_tf_dataset(
        columns=["attention_mask", "input_ids", "labels"],
        shuffle=False,
        batch_size=16,
        collate_fn=data_collator,
    )

Load the model with the :class:`~transformers.TFAutoModelForTokenClassification` class along with the number of
expected labels:

.. code-block:: python

    from transformers import TFAutoModelForTokenClassification
    model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))

Set up an optimizer function, learning rate schedule, and some training hyperparameters:

.. code-block:: python

    from transformers import create_optimizer

    batch_size = 16
    num_train_epochs = 3
    num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
    optimizer, lr_schedule = create_optimizer(
        init_lr=2e-5,
        num_train_steps=num_train_steps,
        weight_decay_rate=0.01,
        num_warmup_steps=0,
    )

Compile the model:

.. code-block:: python

    import tensorflow as tf
    model.compile(optimizer=optimizer)

Call ``model.fit`` to fine-tune your model:

.. code-block:: python

    model.fit(
        tf_train_set,
        validation_data=tf_validation_set,
        epochs=num_train_epochs,
    )

.. _qa_squad:

Question answering with SQuAD
-----------------------------------------------------------------------------------------------------------------------

There are many types of question answering (QA) tasks. Extractive QA focuses on identifying the answer from the text
given a question. In this example, learn how to fine-tune a model on the `SQuAD
<https://huggingface.co/datasets/squad>`_ dataset.

.. note::

    For a more in-depth example of how to fine-tune a model for question answering, take a look at the corresponding
    `PyTorch notebook
    <https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb>`__
    or `TensorFlow notebook
    <https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering-tf.ipynb>`__.

Load SQuAD dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Load the SQuAD dataset from the 🤗 Datasets library:

.. code-block:: python

    from datasets import load_dataset
    squad = load_dataset("squad")

Take a look at an example from the dataset:

.. code-block:: python

    squad["train"][0]
    {'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
     'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
     'id': '5733be284776f41900661182',
     'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
     'title': 'University_of_Notre_Dame'
    }

Preprocess
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Load the DistilBERT tokenizer with an :class:`~transformers.AutoTokenizer`:

.. code-block:: python

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

There are a few things to be aware of when preprocessing text for question answering:

1. Some examples in a dataset may have a very long ``context`` that exceeds the maximum input length of the model. You
   can deal with this by truncating only the ``context``, setting ``truncation="only_second"``.
2. Next, you need to map the start and end positions of the answer to the original context. Set
   ``return_offsets_mapping=True`` to handle this.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the ``sequence_ids`` method to
   find which part of the offset corresponds to the question and which part corresponds to the context (see the short
   sketch after this list).

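Here is that sketch (using the ``squad`` dataset and ``tokenizer`` loaded above): it tokenizes a single example and
inspects the ``sequence_ids`` and offset mapping:

.. code-block:: python

    sample = tokenizer(
        squad["train"][0]["question"],
        squad["train"][0]["context"],
        truncation="only_second",
        return_offsets_mapping=True,
    )
    print(sample.sequence_ids()[:5])     # None for special tokens, 0 for question tokens, 1 for context tokens
    print(sample["offset_mapping"][:5])  # (start_char, end_char) spans into the original strings
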
Assemble everything in a preprocessing function as shown below:

.. code-block:: python

    def preprocess_function(examples):
        questions = [q.strip() for q in examples["question"]]
        inputs = tokenizer(
            questions,
            examples["context"],
            max_length=384,
            truncation="only_second",
            return_offsets_mapping=True,
            padding="max_length",
        )

        offset_mapping = inputs.pop("offset_mapping")
        answers = examples["answers"]
        start_positions = []
        end_positions = []

        for i, offset in enumerate(offset_mapping):
            answer = answers[i]
            start_char = answer["answer_start"][0]
            end_char = answer["answer_start"][0] + len(answer["text"][0])
            sequence_ids = inputs.sequence_ids(i)

            # Find the start and end of the context
            idx = 0
            while sequence_ids[idx] != 1:
                idx += 1
            context_start = idx
            while sequence_ids[idx] == 1:
                idx += 1
            context_end = idx - 1

            # If the answer is not fully inside the context, label it (0, 0)
            if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
                start_positions.append(0)
                end_positions.append(0)
            else:
                # Otherwise it's the start and end token positions
                idx = context_start
                while idx <= context_end and offset[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)

                idx = context_end
                while idx >= context_start and offset[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)

        inputs["start_positions"] = start_positions
        inputs["end_positions"] = end_positions
        return inputs

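As a quick sanity check (a minimal sketch, not part of the training pipeline), you can run the function on a single
example and verify that the start and end token positions decode back to the original answer:

.. code-block:: python

    sample = preprocess_function(squad["train"][:1])
    start, end = sample["start_positions"][0], sample["end_positions"][0]
    print(tokenizer.decode(sample["input_ids"][0][start : end + 1]))  # should match the answer, modulo lowercasing
    print(squad["train"][0]["answers"]["text"][0])
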
Apply the preprocessing function over the entire dataset with the 🤗 Datasets ``map`` function:

.. code-block:: python

    tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Batch the processed examples together:

.. code-block:: python

    from transformers import default_data_collator
    data_collator = default_data_collator

Fine-tune with the Trainer API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Load your model with the :class:`~transformers.AutoModelForQuestionAnswering` class:

.. code-block:: python

    from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
    model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Gather your training arguments in :class:`~transformers.TrainingArguments`:

.. code-block:: python

    training_args = TrainingArguments(
        output_dir='./results',
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
    )

Collect your model, training arguments, dataset, data collator, and tokenizer in :class:`~transformers.Trainer`:

.. code-block:: python

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_squad["train"],
        eval_dataset=tokenized_squad["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )

Fine-tune your model:

.. code-block:: python

    trainer.train()

Fine-tune with TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Batch the processed examples together with a TensorFlow default data collator:

.. code-block:: python

    from transformers.data.data_collator import tf_default_data_collator
    data_collator = tf_default_data_collator

Convert your datasets to the ``tf.data.Dataset`` format with the ``to_tf_dataset`` function:

.. code-block:: python

    tf_train_set = tokenized_squad["train"].to_tf_dataset(
        columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
        dummy_labels=True,
        shuffle=True,
        batch_size=16,
        collate_fn=data_collator,
    )

    tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
        columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
        dummy_labels=True,
        shuffle=False,
        batch_size=16,
        collate_fn=data_collator,
    )

Set up an optimizer function, learning rate schedule, and some training hyperparameters:

.. code-block:: python

    from transformers import create_optimizer

    batch_size = 16
    num_epochs = 2
    total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
    optimizer, schedule = create_optimizer(
        init_lr=2e-5, 
        num_warmup_steps=0, 
        num_train_steps=total_train_steps,
    )

Load your model with the :class:`~transformers.TFAutoModelForQuestionAnswering` class:

.. code-block:: python

    from transformers import TFAutoModelForQuestionAnswering
    model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Compile the model:

.. code-block:: python

    import tensorflow as tf
    model.compile(optimizer=optimizer)

Call ``model.fit`` to fine-tune the model:

.. code-block:: python

    model.fit(
        tf_train_set,
        validation_data=tf_validation_set,
        epochs=num_epochs,
    )