.. 
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Training and fine-tuning
=======================================================================================================================

Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used
seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the
standard training tools available in either framework. We will also show how to use our included
:func:`~transformers.Trainer` class which handles much of the complexity of training for you.

This guide assumes that you are already familiar with loading and using our models for inference; otherwise, see the
:doc:`task summary <task_summary>`. We also assume that you are familiar with training deep neural networks in either
PyTorch or TF2, and focus specifically on the nuances and tools for training models in 🤗 Transformers.

Sections:

  - :ref:`pytorch`
  - :ref:`tensorflow`
  - :ref:`trainer`
  - :ref:`additional-resources`

.. _pytorch:

Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Model classes in 🤗 Transformers that don't begin with ``TF`` are `PyTorch Modules
<https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_, meaning that you can use them just as you would any
model in PyTorch for both inference and optimization.

Let's consider the common task of fine-tuning a masked language model like BERT on a sequence classification dataset.
When we instantiate a model with :func:`~transformers.PreTrainedModel.from_pretrained`, the model configuration and
pre-trained weights of the specified model are used to initialize the model. The library also includes a number of
task-specific final layers or 'heads' whose weights are instantiated randomly when not present in the specified
pre-trained model. For example, instantiating a model with
``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)`` will create a BERT model instance
with encoder weights copied from the ``bert-base-uncased`` model and a randomly initialized sequence classification
head on top of the encoder with an output size of 2. Models are initialized in ``eval`` mode by default. We can call
``model.train()`` to put it in train mode.

.. code-block:: python

    from transformers import BertForSequenceClassification
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
    model.train()

This is useful because it allows us to make use of the pre-trained BERT encoder and easily train it on whatever
sequence classification dataset we choose. We can use any PyTorch optimizer, but our library also provides the
:func:`~transformers.AdamW` optimizer which implements gradient bias correction as well as weight decay.

.. code-block:: python

    from transformers import AdamW
    optimizer = AdamW(model.parameters(), lr=1e-5)

The optimizer allows us to apply different hyperparameters to specific parameter groups. For example, we can apply
weight decay to all parameters other than bias and layer normalization terms:

.. code-block:: python

    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)

Now we can set up a simple dummy training batch using :func:`~transformers.PreTrainedTokenizer.__call__`. This returns
a :func:`~transformers.BatchEncoding` instance which prepares everything we might need to pass to the model.

.. code-block:: python

    from transformers import BertTokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    text_batch = ["I love Pixar.", "I don't care for Pixar."]
    encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True)
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

When we call a classification model with the ``labels`` argument, the first returned element is the cross-entropy loss
between the predictions and the passed labels. Having already set up our optimizer, we can then do a backward pass and
update the weights:

.. code-block:: python

    import torch

    labels = torch.tensor([1, 0])
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()

Alternatively, you can just get the logits and calculate the loss yourself. The following is equivalent to the previous
example:

.. code-block:: python

    from torch.nn import functional as F
    labels = torch.tensor([1,0])
    outputs = model(input_ids, attention_mask=attention_mask)
    loss = F.cross_entropy(outputs.logits, labels)
    loss.backward()
    optimizer.step()

Of course, you can train on GPU by calling ``to('cuda')`` on the model and inputs as usual.
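
For example, a minimal sketch (reusing the ``model``, ``optimizer``, and ``encoding`` objects from above, and assuming
a CUDA-capable GPU is available) could look like this:

.. code-block:: python

    import torch

    device = torch.device('cuda')

    # In practice, move the model to the GPU before constructing the optimizer so that
    # any optimizer state lives on the same device as the parameters.
    model.to(device)

    # Inputs must live on the same device as the model.
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    labels = torch.tensor([1, 0]).to(device)

    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    outputs.loss.backward()
    optimizer.step()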

We also provide a few learning rate scheduling tools. With the following, we can set up a scheduler which warms up for
``num_warmup_steps`` and then linearly decays to 0 by the end of training.

.. code-block:: python

    from transformers import get_linear_schedule_with_warmup
    # num_warmup_steps and num_train_steps are integers you choose for your run:
    # the number of warmup steps and the total number of optimization steps.
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_train_steps)

Then all we have to do is call ``scheduler.step()`` after ``optimizer.step()``.

.. code-block:: python

    loss.backward()
    optimizer.step()
    scheduler.step()

We highly recommend using :func:`~transformers.Trainer`, discussed below, which conveniently handles the moving parts
of training 🤗 Transformers models with features like mixed precision and easy tensorboard logging.


Freezing the encoder
-----------------------------------------------------------------------------------------------------------------------

In some cases, you might be interested in keeping the weights of the pre-trained encoder frozen and optimizing only the
weights of the head layers. To do so, simply set the ``requires_grad`` attribute to ``False`` on the encoder
parameters, which can be accessed with the ``base_model`` submodule on any task-specific model in the library:

.. code-block:: python

    for param in model.base_model.parameters():
        param.requires_grad = False


.. _tensorflow:

Fine-tuning in native TensorFlow 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Models can also be trained natively in TensorFlow 2. Just as with PyTorch, TensorFlow models can be instantiated with
:func:`~transformers.PreTrainedModel.from_pretrained` to load the weights of the encoder from a pretrained model.

.. code-block:: python

    from transformers import TFBertForSequenceClassification
    model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

Let's use ``tensorflow_datasets`` to load in the `MRPC dataset
<https://www.tensorflow.org/datasets/catalog/glue#gluemrpc>`_ from GLUE. We can then use our built-in
:func:`~transformers.data.processors.glue.glue_convert_examples_to_features` to tokenize MRPC and convert it to a
TensorFlow ``Dataset`` object. Note that tokenizers are framework-agnostic, so there is no need to prepend ``TF`` to
the pretrained tokenizer name.

.. code-block:: python

    from transformers import BertTokenizer, glue_convert_examples_to_features
    import tensorflow as tf
    import tensorflow_datasets as tfds
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    data = tfds.load('glue/mrpc')
    train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
    train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)

The model can then be compiled and trained like any other Keras model:

.. code-block:: python

    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(optimizer=optimizer, loss=loss)
    model.fit(train_dataset, epochs=2, steps_per_epoch=115)

With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it
as a PyTorch model (or vice-versa):

.. code-block:: python

    from transformers import BertForSequenceClassification
    model.save_pretrained('./my_mrpc_model/')
    pytorch_model = BertForSequenceClassification.from_pretrained('./my_mrpc_model/', from_tf=True)


.. _trainer:

Trainer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We also provide a simple but feature-complete training and evaluation interface through :func:`~transformers.Trainer`
and :func:`~transformers.TFTrainer`. You can train, fine-tune, and evaluate any 🤗 Transformers model with a wide range
of training options and with built-in features like logging, gradient accumulation, and mixed precision.

.. code-block:: python

    ## PYTORCH CODE
    from transformers import BertForSequenceClassification, Trainer, TrainingArguments

    model = BertForSequenceClassification.from_pretrained("bert-large-uncased")

    training_args = TrainingArguments(
        output_dir='./results',          # output directory
        num_train_epochs=3,              # total # of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,   # batch size for evaluation
        warmup_steps=500,                # number of warmup steps for learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        logging_dir='./logs',            # directory for storing logs
    )

    trainer = Trainer(
        model=model,                         # the instantiated 🤗 Transformers model to be trained
        args=training_args,                  # training arguments, defined above
        train_dataset=train_dataset,         # training dataset
        eval_dataset=test_dataset            # evaluation dataset
    )
    ## TENSORFLOW CODE
    from transformers import TFBertForSequenceClassification, TFTrainer, TFTrainingArguments

    model = TFBertForSequenceClassification.from_pretrained("bert-large-uncased")

    training_args = TFTrainingArguments(
        output_dir='./results',          # output directory
        num_train_epochs=3,              # total # of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,   # batch size for evaluation
        warmup_steps=500,                # number of warmup steps for learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        logging_dir='./logs',            # directory for storing logs
    )

    trainer = TFTrainer(
        model=model,                         # the instantiated 🤗 Transformers model to be trained
        args=training_args,                  # training arguments, defined above
        train_dataset=tfds_train_dataset,    # tensorflow_datasets training dataset
        eval_dataset=tfds_test_dataset       # tensorflow_datasets evaluation dataset
    )

Now simply call ``trainer.train()`` to train and ``trainer.evaluate()`` to evaluate. You can use your own module as
well, but the first argument returned from ``forward`` must be the loss which you wish to optimize.
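
As an illustration, here is a minimal sketch of such a custom module; the classifier itself (a hypothetical
bag-of-embeddings model, not part of the library) is only there to show the expected ``forward`` contract:

.. code-block:: python

    import torch
    from torch import nn

    class ToyClassifier(nn.Module):
        # Hypothetical model: the only requirement for Trainer is that, when labels
        # are provided, the loss comes first in the values returned by forward.
        def __init__(self, vocab_size=30522, hidden_size=128, num_labels=2):
            super().__init__()
            self.embedding = nn.EmbeddingBag(vocab_size, hidden_size)
            self.classifier = nn.Linear(hidden_size, num_labels)

        def forward(self, input_ids=None, attention_mask=None, labels=None):
            pooled = self.embedding(input_ids)   # attention_mask is ignored in this toy model
            logits = self.classifier(pooled)
            if labels is not None:
                loss = nn.functional.cross_entropy(logits, labels)
                return loss, logits              # loss first, as Trainer expects
            return (logits,)

    trainer = Trainer(
        model=ToyClassifier(),
        args=training_args,
        train_dataset=train_dataset,         # items must be dicts keyed like the forward arguments
        eval_dataset=test_dataset
    )
    trainer.train()
    trainer.evaluate()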

:func:`~transformers.Trainer` uses a built-in default function to collate batches and prepare them to be fed into the
model. If needed, you can also use the ``data_collator`` argument to pass your own collator function which takes in the
data in the format provided by your dataset and returns a batch ready to be fed into the model. Note that
:func:`~transformers.TFTrainer` expects the passed datasets to be dataset objects from ``tensorflow_datasets``.
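
For instance, a rough sketch of a custom collator (assuming each dataset item is a dict of equal-length lists or
tensors whose keys match the model's ``forward`` arguments) might be:

.. code-block:: python

    import torch

    def my_data_collator(features):
        # features is a list of dataset items; stack each field into a batched tensor
        return {key: torch.stack([torch.as_tensor(f[key]) for f in features])
                for key in features[0]}

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=my_data_collator,      # replaces the default collation function
    )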

To compute metrics in addition to the loss, you can define your own ``compute_metrics`` function and pass it to the
trainer.

.. code-block:: python

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
        acc = accuracy_score(labels, preds)
        return {
            'accuracy': acc,
            'f1': f1,
            'precision': precision,
            'recall': recall
        }
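
The function can then be passed to the trainer through the ``compute_metrics`` argument, for example:

.. code-block:: python

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics,     # metrics are reported by trainer.evaluate()
    )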

Finally, you can view the results, including any calculated metrics, by launching TensorBoard and pointing it at your
specified ``logging_dir`` directory.


.. _additional-resources:

Additional resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
  which uses ``Trainer`` for IMDb sentiment classification.

- `🤗 Transformers Examples <https://github.com/huggingface/transformers/tree/master/examples>`_ including scripts for
  training and fine-tuning on GLUE, SQuAD, and several other tasks.

- `How to train a language model
  <https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb>`_, a detailed
  colab notebook which uses ``Trainer`` to train a masked language model from scratch on Esperanto.

- `🤗 Transformers Notebooks <notebooks.html>`_ which contain dozens of example notebooks from the community for
  training and using 🤗 Transformers on a variety of tasks.