Fine-tuning with custom datasets
=======================================================================================================================

.. note::

    The datasets used in this tutorial are available and can be more easily accessed using the `🤗 NLP library
    <https://github.com/huggingface/nlp>`_. We do not use this library to access the datasets here since this tutorial
    is meant to illustrate how to work with your own data. A brief introduction can be found at the end of the tutorial
    in the section ":ref:`nlplib`".

This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets. The guide
shows one of many valid workflows for using these models and is meant to be illustrative rather than definitive. We
show examples of reading in several data formats, preprocessing the data for several types of tasks, and then preparing
the data into PyTorch/TensorFlow ``Dataset`` objects which can easily be used either with
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow.

We include several examples, each of which demonstrates a different type of common downstream task:

  - :ref:`seq_imdb`
  - :ref:`tok_ner`
  - :ref:`qa_squad`
  - :ref:`resources`

.. _seq_imdb:

Sequence Classification with IMDb Reviews
-----------------------------------------------------------------------------------------------------------------------

.. note::

    This dataset can be explored in the Hugging Face model hub (`IMDb <https://huggingface.co/datasets/imdb>`_), and
    can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("imdb")``.

In this example, we'll show how to download, tokenize, and train a model on the IMDb reviews dataset. This task takes
the text of a review and requires the model to predict whether the sentiment of the review is positive or negative.
Let's start by downloading the dataset from the `Large Movie Review Dataset
<http://ai.stanford.edu/~amaas/data/sentiment/>`_ webpage.

.. code-block:: bash

    wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    tar -xf aclImdb_v1.tar.gz

This data is organized into ``pos`` and ``neg`` folders with one text file per example. Let's write a function that can
read this in.

.. code-block:: python

    from pathlib import Path

    def read_imdb_split(split_dir):
        split_dir = Path(split_dir)
        texts = []
        labels = []
        for label_dir in ["pos", "neg"]:
            for text_file in (split_dir/label_dir).iterdir():
                texts.append(text_file.read_text())
                labels.append(0 if label_dir == "neg" else 1)

        return texts, labels

    train_texts, train_labels = read_imdb_split('aclImdb/train')
    test_texts, test_labels = read_imdb_split('aclImdb/test')

We now have a train and test dataset, but let's also create a validation set which we can use for evaluation and
tuning without tainting our test set results. Sklearn has a convenient utility for creating such splits:

.. code-block:: python

    from sklearn.model_selection import train_test_split
    train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

Alright, we've read in our dataset. Now let's tackle tokenization. We'll eventually train a classifier using
pre-trained DistilBert, so let's use the DistilBert tokenizer.

.. code-block:: python

    from transformers import DistilBertTokenizerFast
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Now we can simply pass our texts to the tokenizer. We'll pass ``truncation=True`` and ``padding=True``, which will
ensure that all of our sequences are padded to the same length and are truncated to be no longer than the model's
maximum input length. This will allow us to feed batches of sequences into the model at the same time.

.. code-block:: python

    train_encodings = tokenizer(train_texts, truncation=True, padding=True)
    val_encodings = tokenizer(val_texts, truncation=True, padding=True)
    test_encodings = tokenizer(test_texts, truncation=True, padding=True)

Now, let's turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing a
``torch.utils.data.Dataset`` object and implementing ``__len__`` and ``__getitem__``. In TensorFlow, we pass our input
encodings and labels to the ``from_tensor_slices`` constructor method. We put the data in this format so that the data
can be easily batched such that each key in the batch encoding corresponds to a named parameter of the
:meth:`~transformers.DistilBertForSequenceClassification.forward` method of the model we will train.

.. code-block:: python

    ## PYTORCH CODE
    import torch

    class IMDbDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item

        def __len__(self):
            return len(self.labels)

    train_dataset = IMDbDataset(train_encodings, train_labels)
    val_dataset = IMDbDataset(val_encodings, val_labels)
    test_dataset = IMDbDataset(test_encodings, test_labels)
    ## TENSORFLOW CODE
    import tensorflow as tf

    train_dataset = tf.data.Dataset.from_tensor_slices((
        dict(train_encodings),
        train_labels
    ))
    val_dataset = tf.data.Dataset.from_tensor_slices((
        dict(val_encodings),
        val_labels
    ))
    test_dataset = tf.data.Dataset.from_tensor_slices((
        dict(test_encodings),
        test_labels
    ))

Now that our datasets are ready, we can fine-tune a model either with the 🤗
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow. See :doc:`training
<training>`.

.. _ft_trainer:

Fine-tuning with Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The steps above prepared the datasets in the way that the trainer expects. Now all we need to do is create a model to
fine-tune, define the :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` and
instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`.

.. code-block:: python

    ## PYTORCH CODE
    from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir='./results',          # output directory
        num_train_epochs=3,              # total number of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,   # batch size for evaluation
        warmup_steps=500,                # number of warmup steps for learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        logging_dir='./logs',            # directory for storing logs
        logging_steps=10,
    )

    model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

    trainer = Trainer(
        model=model,                         # the instantiated 🤗 Transformers model to be trained
        args=training_args,                  # training arguments, defined above
        train_dataset=train_dataset,         # training dataset
        eval_dataset=val_dataset             # evaluation dataset
    )

    trainer.train()
    ## TENSORFLOW CODE
    from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

    training_args = TFTrainingArguments(
        output_dir='./results',          # output directory
        num_train_epochs=3,              # total number of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,   # batch size for evaluation
        warmup_steps=500,                # number of warmup steps for learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        logging_dir='./logs',            # directory for storing logs
        logging_steps=10,
    )

    with training_args.strategy.scope():
        model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

    trainer = TFTrainer(
        model=model,                         # the instantiated 🤗 Transformers model to be trained
        args=training_args,                  # training arguments, defined above
        train_dataset=train_dataset,         # training dataset
        eval_dataset=val_dataset             # evaluation dataset
    )

    trainer.train()
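
Once training finishes, the same :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` can also be used to
evaluate on the validation set we passed in. As a minimal sketch (reusing the ``trainer`` object from above; without a
``compute_metrics`` function, the returned dict mostly just reports the evaluation loss):

.. code-block:: python

    eval_results = trainer.evaluate()  # runs prediction over the eval_dataset passed to the trainer
    print(eval_results)                # a dict of metrics, e.g. the evaluation loss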

.. _ft_native:

Fine-tuning with native PyTorch/TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can also train with native PyTorch or TensorFlow:

.. code-block:: python

    ## PYTORCH CODE
    from torch.utils.data import DataLoader
    from transformers import DistilBertForSequenceClassification, AdamW

    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

    model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
    model.to(device)
    model.train()

    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

    optim = AdamW(model.parameters(), lr=5e-5)

    for epoch in range(3):
        for batch in train_loader:
            optim.zero_grad()
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs[0]
            loss.backward()
            optim.step()

    model.eval()
    ## TENSORFLOW CODE
    from transformers import TFDistilBertForSequenceClassification

    model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
    model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
    model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)  # the dataset is already batched, so no batch_size here
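
Evaluation isn't shown above. As a rough sketch in native PyTorch (reusing the ``model``, ``device``, and
``val_dataset`` defined earlier), validation accuracy could be computed like this:

.. code-block:: python

    val_loader = DataLoader(val_dataset, batch_size=64)

    correct = 0
    total = 0
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            logits = model(input_ids, attention_mask=attention_mask)[0]  # no labels passed, so logits come first
            predictions = logits.argmax(dim=-1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)

    print(correct / total)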

.. _tok_ner:

Token Classification with W-NUT Emerging Entities
-----------------------------------------------------------------------------------------------------------------------

.. note::

    This dataset can be explored in the Hugging Face model hub (`WNUT-17 <https://huggingface.co/datasets/wnut_17>`_),
    and can be alternatively downloaded with the 🤗 NLP library with ``load_dataset("wnut_17")``.

Next we will look at token classification. Rather than classifying an entire sequence, this task classifies token by
token. We'll demonstrate how to do this with `Named Entity Recognition
<http://nlpprogress.com/english/named_entity_recognition.html>`_, which involves identifying tokens which correspond to
a predefined set of "entities". Specifically, we'll use the `W-NUT Emerging and Rare entities
<http://noisy-text.github.io/2017/emerging-rare-entities.html>`_ corpus. The data is given as a collection of
pre-tokenized documents where each token is assigned a tag.

Let's start by downloading the data.

.. code-block:: bash

    wget http://noisy-text.github.io/2017/files/wnut17train.conll

In this case, we'll just download the train set, which is a single text file. Each line of the file contains either (1)
a word and tag separated by a tab, or (2) a blank line indicating the end of a document. Let's write a function to read
this in. We'll take in the file path and return ``token_docs`` which is a list of lists of token strings, and
``tag_docs`` which is a list of lists of tag strings.

.. code-block:: python

    from pathlib import Path
    import re

    def read_wnut(file_path):
        file_path = Path(file_path)

        raw_text = file_path.read_text().strip()
        raw_docs = re.split(r'\n\t?\n', raw_text)
        token_docs = []
        tag_docs = []
        for doc in raw_docs:
            tokens = []
            tags = []
            for line in doc.split('\n'):
                token, tag = line.split('\t')
                tokens.append(token)
                tags.append(tag)
            token_docs.append(tokens)
            tag_docs.append(tags)

        return token_docs, tag_docs

    texts, tags = read_wnut('wnut17train.conll')

Just to see what this data looks like, let's take a look at a segment of the first document.

.. code-block:: python

    >>> print(texts[0][10:17], tags[0][10:17], sep='\n')
    ['for', 'two', 'weeks', '.', 'Empire', 'State', 'Building']
    ['O', 'O', 'O', 'O', 'B-location', 'I-location', 'I-location']

``location`` is an entity type, ``B-`` indicates the beginning of an entity, and ``I-`` indicates consecutive positions
of the same entity ("Empire State Building" is considered one entity). ``O`` indicates the token does not correspond to
any entity.

Now that we've read the data in, let's create a train/validation split:

.. code-block:: python

    from sklearn.model_selection import train_test_split
    train_texts, val_texts, train_tags, val_tags = train_test_split(texts, tags, test_size=.2)

Next, let's create encodings for our tokens and tags. For the tags, we can start by just creating a simple mapping
which we'll use in a moment:

.. code-block:: python

    unique_tags = set(tag for doc in tags for tag in doc)
    tag2id = {tag: id for id, tag in enumerate(unique_tags)}
    id2tag = {id: tag for tag, id in tag2id.items()}

To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing with
ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass
``padding=True`` and ``truncation=True`` to pad the sequences to the same length and truncate any that are too long.
Lastly, we can ask the tokenizer to return information about the tokens which are split by the wordpiece tokenization
process, which we will need in a moment.

.. code-block:: python

    from transformers import DistilBertTokenizerFast
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
    train_encodings = tokenizer(train_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
    val_encodings = tokenizer(val_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)

Great, so now our tokens are nicely encoded in the format needed to feed them into our DistilBert model below.

Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in
the W-NUT corpus are not in DistilBert's vocabulary. Bert and many models like it use a method called WordPiece
Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the
vocabulary. For example, DistilBert's tokenizer would split the Twitter handle ``@huggingface`` into the tokens ``['@',
'hugging', '##face']``. This is a problem for us because we have exactly one tag per token. If the tokenizer splits a
token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.
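
You can see this directly by tokenizing a handle yourself (the exact sub-tokens depend on the tokenizer's vocabulary):

.. code-block:: python

    print(tokenizer.tokenize("@huggingface"))  # e.g. ['@', 'hugging', '##face']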

One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in 🤗
Transformers by setting the labels we wish to ignore to ``-100``. In the example above, if the label for
``@huggingface`` is ``3`` (indexing ``B-corporation``), we would set the labels of ``['@', 'hugging', '##face']`` to
``[3, -100, -100]``.

Let's write a function to do this. This is where we will use the ``offset_mapping`` from the tokenizer as mentioned
above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's
start position and end position relative to the original token it was split from. That means that if the first position
in the tuple is anything other than ``0``, we will set its corresponding label to ``-100``. While we're at it, we can
also set labels to ``-100`` if the second position of the offset mapping is ``0``, since this means it must be a
special token like ``[PAD]`` or ``[CLS]``.

.. note::

    Due to a recently fixed bug, ``-1`` must be used instead of ``-100`` when using TensorFlow in 🤗 Transformers
    <= 3.0.2.

.. code-block:: python

    import numpy as np

    def encode_tags(tags, encodings):
        labels = [[tag2id[tag] for tag in doc] for doc in tags]
        encoded_labels = []
        for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
            # create an array filled with -100 (the ignore index), one entry per sub-token
            doc_enc_labels = np.ones(len(doc_offset),dtype=int) * -100
            arr_offset = np.array(doc_offset)

            # set labels whose first offset position is 0 and the second is not 0
            doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
            encoded_labels.append(doc_enc_labels.tolist())

        return encoded_labels

    train_labels = encode_tags(train_tags, train_encodings)
    val_labels = encode_tags(val_tags, val_encodings)

The hard part is now done. Just as in the sequence classification example above, we can create a dataset object:

.. code-block:: python

    ## PYTORCH CODE
    import torch

    class WNUTDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item

        def __len__(self):
            return len(self.labels)

    train_encodings.pop("offset_mapping") # we don't want to pass this to the model
    val_encodings.pop("offset_mapping")
    train_dataset = WNUTDataset(train_encodings, train_labels)
    val_dataset = WNUTDataset(val_encodings, val_labels)
    ## TENSORFLOW CODE
    import tensorflow as tf

    train_encodings.pop("offset_mapping") # we don't want to pass this to the model
    val_encodings.pop("offset_mapping")

    train_dataset = tf.data.Dataset.from_tensor_slices((
        dict(train_encodings),
        train_labels
    ))
    val_dataset = tf.data.Dataset.from_tensor_slices((
        dict(val_encodings),
        val_labels
    ))

Now load in a token classification model and specify the number of labels:

.. code-block:: python

    ## PYTORCH CODE
    from transformers import DistilBertForTokenClassification
    model = DistilBertForTokenClassification.from_pretrained('distilbert-base-cased', num_labels=len(unique_tags))
    ## TENSORFLOW CODE
    from transformers import TFDistilBertForTokenClassification
    model = TFDistilBertForTokenClassification.from_pretrained('distilbert-base-cased', num_labels=len(unique_tags))

The data and model are both ready to go. You can train the model either with
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow, exactly as in the
sequence classification example above; a short :class:`~transformers.Trainer` sketch follows the links below.

  - :ref:`ft_trainer`
  - :ref:`ft_native`
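
For instance, the :class:`~transformers.Trainer` setup is essentially identical to the IMDb example. A minimal sketch
with pared-down arguments (see the IMDb example for a fuller set; for TensorFlow, remember to create the model inside
``training_args.strategy.scope()`` as shown there):

.. code-block:: python

    ## PYTORCH CODE
    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(output_dir='./results', num_train_epochs=3, per_device_train_batch_size=16)

    trainer = Trainer(
        model=model,                  # the token classification model loaded above
        args=training_args,
        train_dataset=train_dataset,  # the WNUTDataset objects defined above
        eval_dataset=val_dataset
    )

    trainer.train()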

.. _qa_squad:

Question Answering with SQuAD 2.0
-----------------------------------------------------------------------------------------------------------------------

.. note::

    This dataset can be explored in the Hugging Face model hub (`SQuAD V2
    <https://huggingface.co/datasets/squad_v2>`_), and can be alternatively downloaded with the 🤗 NLP library with
    ``load_dataset("squad_v2")``.

Question answering comes in many forms. In this example, we'll look at the particular type of extractive QA that
involves answering a question about a passage by highlighting the segment of the passage that answers the question.
This involves fine-tuning a model which predicts a start position and an end position in the passage. We will use the
`Stanford Question Answering Dataset (SQuAD) 2.0 <https://rajpurkar.github.io/SQuAD-explorer/>`_.

We will start by downloading the data:

.. code-block:: bash

    mkdir squad
    wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O squad/train-v2.0.json
    wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json

Each split is in a structured json file with a number of questions and answers for each passage (or context). We'll
take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated since
there are multiple questions per context):

.. code-block:: python

    import json
    from pathlib import Path

    def read_squad(path):
        path = Path(path)
        with open(path, 'rb') as f:
            squad_dict = json.load(f)

        contexts = []
        questions = []
        answers = []
        for group in squad_dict['data']:
            for passage in group['paragraphs']:
                context = passage['context']
                for qa in passage['qas']:
                    question = qa['question']
                    for answer in qa['answers']:
                        contexts.append(context)
                        questions.append(question)
                        answers.append(answer)

        return contexts, questions, answers

    train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
    val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')

The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the
correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on
this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which *token* positions the
answer begins and ends.
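
For example, each answer is a plain dict; the code below relies on its ``'text'`` and ``'answer_start'`` fields (the
values depend on your download):

.. code-block:: python

    first_answer = train_answers[0]
    print(first_answer['text'])          # the answer span as it appears in the passage
    print(first_answer['answer_start'])  # character index where the answer starts in train_contexts[0]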

First, let's get the *character* position at which the answer ends in the passage (we are given the starting position).
Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.

.. code-block:: python

    def add_end_idx(answers, contexts):
        for answer, context in zip(answers, contexts):
            gold_text = answer['text']
            start_idx = answer['answer_start']
            end_idx = start_idx + len(gold_text)

            # sometimes squad answers are off by a character or two – fix this
            if context[start_idx:end_idx] == gold_text:
                answer['answer_end'] = end_idx
            elif context[start_idx-1:end_idx-1] == gold_text:
                answer['answer_start'] = start_idx - 1
                answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
            elif context[start_idx-2:end_idx-2] == gold_text:
                answer['answer_start'] = start_idx - 2
                answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters

    add_end_idx(train_answers, train_contexts)
    add_end_idx(val_answers, val_contexts)

Now ``train_answers`` and ``val_answers`` include the character end positions and the corrected start positions. Next,
let's tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode them together
as sequence pairs.

.. code-block:: python

    from transformers import DistilBertTokenizerFast
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

    train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
    val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers,
we can use the built-in :func:`~transformers.BatchEncoding.char_to_token` method.

.. code-block:: python

    def add_token_positions(encodings, answers):
        start_positions = []
        end_positions = []
        for i in range(len(answers)):
            start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
            end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
            # if None, the answer passage has been truncated
            if start_positions[-1] is None:
                start_positions[-1] = tokenizer.model_max_length
            if end_positions[-1] is None:
                end_positions[-1] = tokenizer.model_max_length
        encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

    add_token_positions(train_encodings, train_answers)
    add_token_positions(val_encodings, val_answers)

Our data is ready. Let's just put it in a PyTorch/TensorFlow dataset so that we can easily use it for training. In
PyTorch, we define a custom ``Dataset`` class. In TensorFlow, we pass a tuple of ``(inputs_dict, labels_dict)`` to the
``from_tensor_slices`` method.

.. code-block:: python

    ## PYTORCH CODE
    import torch

    class SquadDataset(torch.utils.data.Dataset):
        def __init__(self, encodings):
            self.encodings = encodings

        def __getitem__(self, idx):
            return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

        def __len__(self):
            return len(self.encodings.input_ids)

    train_dataset = SquadDataset(train_encodings)
    val_dataset = SquadDataset(val_encodings)
    ## TENSORFLOW CODE
    import tensorflow as tf

    train_dataset = tf.data.Dataset.from_tensor_slices((
        {key: train_encodings[key] for key in ['input_ids', 'attention_mask']},
        {key: train_encodings[key] for key in ['start_positions', 'end_positions']}
    ))
    val_dataset = tf.data.Dataset.from_tensor_slices((
        {key: val_encodings[key] for key in ['input_ids', 'attention_mask']},
        {key: val_encodings[key] for key in ['start_positions', 'end_positions']}
    ))

Now we can use a DistilBert model with a QA head for training:

.. code-block:: python

    ## PYTORCH CODE
    from transformers import DistilBertForQuestionAnswering
    model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
    ## TENSORFLOW CODE
    from transformers import TFDistilBertForQuestionAnswering
    model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")


The data and model are both ready to go. You can train the model with
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` exactly as in the sequence classification example
above. If using native PyTorch, replace ``labels`` with ``start_positions`` and ``end_positions`` in the training
example. If using Keras's ``fit``, we need to make a minor modification to handle this example since it involves
multiple model outputs.

  - :ref:`ft_trainer`

.. code-block:: python

    ## PYTORCH CODE
    from torch.utils.data import DataLoader
    from transformers import AdamW

    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

    model.to(device)
    model.train()

    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

    optim = AdamW(model.parameters(), lr=5e-5)

    for epoch in range(3):
        for batch in train_loader:
            optim.zero_grad()
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            start_positions = batch['start_positions'].to(device)
            end_positions = batch['end_positions'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
            loss = outputs[0]
            loss.backward()
            optim.step()

    model.eval()
    ## TENSORFLOW CODE
    # Keras will expect a tuple when dealing with labels
    train_dataset = train_dataset.map(lambda x, y: (x, (y['start_positions'], y['end_positions'])))

    # Keras will assign a separate loss for each output and add them together. So we'll just use the standard CE loss
    # instead of using the built-in model.compute_loss, which expects a dict of outputs and averages the two terms.
    # Note that this means the loss will be 2x of when using TFTrainer since we're adding instead of averaging them.
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.distilbert.return_dict = False # if using 🤗 Transformers >3.0.2, make sure outputs are tuples

    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
    model.compile(optimizer=optimizer, loss=loss) # can also use any keras loss fn
    model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)  # the dataset is already batched, so no batch_size here
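
Once fine-tuned, the model can be queried by taking the argmax of the start and end logits. A rough sketch in native
PyTorch, reusing ``tokenizer``, ``device``, and the validation data from above (for unanswerable SQuAD 2.0 questions
the decoded span may be empty or meaningless):

.. code-block:: python

    model.eval()

    context = val_contexts[0]
    question = val_questions[0]
    inputs = tokenizer(context, question, truncation=True, padding=True, return_tensors='pt')
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)

    with torch.no_grad():
        start_logits, end_logits = model(input_ids, attention_mask=attention_mask)[:2]

    start = start_logits.argmax(dim=-1).item()
    end = end_logits.argmax(dim=-1).item()
    print(tokenizer.decode(input_ids[0, start:end + 1].tolist()))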

.. _resources:

Additional Resources
-----------------------------------------------------------------------------------------------------------------------

  - `How to train a new language model from scratch using Transformers and Tokenizers
    <https://huggingface.co/blog/how-to-train>`_. Blog post showing the steps to load in Esperanto data and train a
    masked language model from scratch.
  - :doc:`Preprocessing <preprocessing>`. Docs page on data preprocessing.
  - :doc:`Training <training>`. Docs page on training and fine-tuning.

.. _nlplib:

Using the 🤗 NLP Datasets & Metrics library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This tutorial demonstrates how to read in datasets from various raw text formats and prepare them for training with 🤗
Transformers so that you can do the same thing with your own custom datasets. However, we recommend users use the `🤗
NLP library <https://github.com/huggingface/nlp>`_ for working with the 150+ datasets included in the `hub
<https://huggingface.co/datasets>`_, including the three datasets used in this tutorial. As a very brief overview, we
will show how to use the NLP library to download and prepare the IMDb dataset from the first example, :ref:`seq_imdb`.

Start by downloading the dataset:

.. code-block:: python

    from nlp import load_dataset
    train = load_dataset("imdb", split="train")

Each dataset has multiple columns corresponding to different features. Let's see what our columns are.

.. code-block:: python

    >>> print(train.column_names)
    ['label', 'text']

Great. Now let's tokenize the text. We can do this using the ``map`` method. We'll also rename the ``label`` column to
``labels`` to match the model's input arguments.

.. code-block:: python

    train = train.map(lambda batch: tokenizer(batch["text"], truncation=True, padding=True), batched=True)
    train.rename_column_("label", "labels")

Lastly, we can use the ``set_format`` method to determine which columns and in what data format we want to access
dataset elements.

.. code-block:: python

    ## PYTORCH CODE
    >>> train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
    >>> {key: val.shape for key, val in train[0].items()}
    {'labels': torch.Size([]), 'input_ids': torch.Size([512]), 'attention_mask': torch.Size([512])}
    ## TENSORFLOW CODE
    >>> train.set_format("tensorflow", columns=["input_ids", "attention_mask", "labels"])
    >>> {key: val.shape for key, val in train[0].items()}
    {'labels': TensorShape([]), 'input_ids': TensorShape([512]), 'attention_mask': TensorShape([512])}
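
With the PyTorch formatting, the resulting dataset can be used much like the custom ``Dataset`` objects from the first
example. A minimal sketch, assuming the DistilBERT model and ``training_args`` from the IMDb
:class:`~transformers.Trainer` example above:

.. code-block:: python

    ## PYTORCH CODE
    from transformers import Trainer

    trainer = Trainer(
        model=model,          # e.g. DistilBertForSequenceClassification from the IMDb example
        args=training_args,   # the TrainingArguments defined in the IMDb example
        train_dataset=train   # the formatted 🤗 NLP dataset
    )

    trainer.train()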

We now have a fully-prepared dataset. Check out `the 🤗 NLP docs <https://huggingface.co/nlp/processing.html>`_ for a
more thorough introduction.