Unverified Commit e4d8f517 authored by Steven Liu's avatar Steven Liu Committed by GitHub
Browse files

Rewrite guides for fine-tuning with Datasets (#13923)

* rewrite guides for fine-tuning with datasets

* simple qa code example

* use anonymous rST links

* style
parent 85a4bda4
......@@ -10,720 +10,690 @@
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Fine-tuning with custom datasets
How to fine-tune a model for common downstream tasks
=======================================================================================================================
.. note::
This guide will show you how to fine-tune 🤗 Transformers models for common downstream tasks. You will use the 🤗
Datasets library to quickly load and preprocess the datasets, getting them ready for training with PyTorch and
TensorFlow.
Before you begin, make sure you have the 🤗 Datasets library installed. For more detailed installation instructions,
refer to the 🤗 Datasets `installation page <https://huggingface.co/docs/datasets/installation.html>`_. All of the
examples in this guide will use 🤗 Datasets to load and preprocess a dataset.
The datasets used in this tutorial are available and can be more easily accessed using the `🤗 Datasets library
<https://github.com/huggingface/datasets>`_. We do not use this library to access the datasets here since this
tutorial meant to illustrate how to work with your own data. A brief introduction can be found at the end of the
tutorial in the section ":ref:`datasetslib`".
.. code-block:: bash
This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets. The guide
shows one of many valid workflows for using these models and is meant to be illustrative rather than definitive. We
show examples of reading in several data formats, preprocessing the data for several types of tasks, and then preparing
the data into PyTorch/TensorFlow ``Dataset`` objects which can easily be used either with
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow.
pip install datasets
We include several examples, each of which demonstrates a different type of common downstream task:
Learn how to fine-tune a model for:
- :ref:`seq_imdb`
- :ref:`tok_ner`
- :ref:`qa_squad`
- :ref:`resources`
- :ref:`seq_imdb`
- :ref:`tok_ner`
- :ref:`qa_squad`
.. _seq_imdb:
Sequence Classification with IMDb Reviews
Sequence classification with IMDb reviews
-----------------------------------------------------------------------------------------------------------------------
.. note::
This dataset can be explored in the Hugging Face model hub (`IMDb <https://huggingface.co/datasets/imdb>`_), and
can be alternatively downloaded with the 🤗 Datasets library with ``load_dataset("imdb")``.
Sequence classification refers to the task of classifying sequences of text according to a given number of classes. In
this example, learn how to fine-tune a model on the `IMDb dataset <https://huggingface.co/datasets/imdb>`_ to determine
whether a review is positive or negative.
In this example, we'll show how to download, tokenize, and train a model on the IMDb reviews dataset. This task takes
the text of a review and requires the model to predict whether the sentiment of the review is positive or negative.
Let's start by downloading the dataset from the `Large Movie Review Dataset
<http://ai.stanford.edu/~amaas/data/sentiment/>`_ webpage.
.. note::
.. code-block:: bash
For a more in-depth example of how to fine-tune a model for text classification, take a look at the corresponding
`PyTorch notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb>`__
or `TensorFlow notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification-tf.ipynb>`__.
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz
Load IMDb dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This data is organized into ``pos`` and ``neg`` folders with one text file per example. Let's write a function that can
read this in.
The 🤗 Datasets library makes it simple to load a dataset:
.. code-block:: python
from pathlib import Path
from datasets import load_dataset
imdb = load_dataset("imdb")
def read_imdb_split(split_dir):
split_dir = Path(split_dir)
texts = []
labels = []
for label_dir in ["pos", "neg"]:
for text_file in (split_dir/label_dir).iterdir():
texts.append(text_file.read_text())
labels.append(0 if label_dir is "neg" else 1)
This loads a ``DatasetDict`` object which you can index into to view an example:
.. code-block:: python
return texts, labels
imdb["train"][0]
{'label': 1,
'text': 'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
}
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
Preprocess
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We now have a train and test dataset, but let's also create a validation set which we can use for for evaluation and
tuning without tainting our test set results. Sklearn has a convenient utility for creating such splits:
The next step is to tokenize the text into a readable format by the model. It is important to load the same tokenizer a
model was trained with to ensure appropriately tokenized words. Load the DistilBERT tokenizer with the
:class:`~transformers.AutoTokenizer` because we will eventually train a classifier using a pretrained `DistilBERT
<https://huggingface.co/distilbert-base-uncased>`_ model:
.. code-block:: python
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
Alright, we've read in our dataset. Now let's tackle tokenization. We'll eventually train a classifier using
pre-trained DistilBert, so let's use the DistilBert tokenizer.
Now that you have instantiated a tokenizer, create a function that will tokenize the text. You should also truncate
longer sequences in the text to be no longer than the model's maximum input length:
.. code-block:: python
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
def preprocess_function(examples):
return tokenizer(examples["text"], truncation=True)
Now we can simply pass our texts to the tokenizer. We'll pass ``truncation=True`` and ``padding=True``, which will
ensure that all of our sequences are padded to the same length and are truncated to be no longer than model's maximum
input length. This will allow us to feed batches of sequences into the model at the same time.
Use 🤗 Datasets ``map`` function to apply the preprocessing function to the entire dataset. You can also set
``batched=True`` to apply the preprocessing function to multiple elements of the dataset at once for faster
preprocessing:
.. code-block:: python
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
tokenized_imdb = imdb.map(preprocess_function, batched=True)
Now, let's turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing a
``torch.utils.data.Dataset`` object and implementing ``__len__`` and ``__getitem__``. In TensorFlow, we pass our input
encodings and labels to the ``from_tensor_slices`` constructor method. We put the data in this format so that the data
can be easily batched such that each key in the batch encoding corresponds to a named parameter of the
:meth:`~transformers.DistilBertForSequenceClassification.forward` method of the model we will train.
Lastly, pad your text so they are a uniform length. While it is possible to pad your text in the ``tokenizer`` function
by setting ``padding=True``, it is more efficient to only pad the text to the length of the longest element in its
batch. This is known as **dynamic padding**. You can do this with the ``DataCollatorWithPadding`` function:
.. code-block:: python
## PYTORCH CODE
import torch
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
class IMDbDataset(torch.utils.data.Dataset):
def __init__(self, encodings, labels):
self.encodings = encodings
self.labels = labels
Fine-tune with the Trainer API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def __getitem__(self, idx):
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
item['labels'] = torch.tensor(self.labels[idx])
return item
Now load your model with the :class:`~transformers.AutoModel` class along with the number of expected labels:
def __len__(self):
return len(self.labels)
.. code-block:: python
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
## TENSORFLOW CODE
import tensorflow as tf
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
train_dataset = tf.data.Dataset.from_tensor_slices((
dict(train_encodings),
train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
dict(val_encodings),
val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
dict(test_encodings),
test_labels
))
Now that our datasets are ready, we can fine-tune a model either with the 🤗
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow. See :doc:`training
<training>`.
.. _ft_trainer:
Fine-tuning with Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
At this point, only three steps remain:
The steps above prepared the datasets in the way that the trainer is expected. Now all we need to do is create a model
to fine-tune, define the :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` and
instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`.
1. Define your training hyperparameters in :class:`~transformers.TrainingArguments`.
2. Pass the training arguments to a :class:`~transformers.Trainer` along with the model, dataset, tokenizer, and data
collator.
3. Call ``trainer.train`` to fine-tune your model.
.. code-block:: python
## PYTORCH CODE
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir='./results', # output directory
num_train_epochs=3, # total number of training epochs
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=64, # batch size for evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=10,
output_dir='./results',
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=5,
weight_decay=0.01,
)
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
trainer = Trainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=val_dataset # evaluation dataset
model=model,
args=training_args,
train_dataset=tokenized_imdb["train"],
eval_dataset=tokenized_imdb["test"],
tokenizer=tokenizer,
data_collator=data_collator,
)
trainer.train()
## TENSORFLOW CODE
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments
training_args = TFTrainingArguments(
output_dir='./results', # output directory
num_train_epochs=3, # total number of training epochs
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=64, # batch size for evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=10,
)
with training_args.strategy.scope():
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
Fine-tune with TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
trainer = TFTrainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=val_dataset # evaluation dataset
)
Fine-tuning with TensorFlow is just as easy, with only a few differences.
trainer.train()
Start by batching the processed examples together with dynamic padding using the ``DataCollatorWithPadding`` function.
Make sure you set ``return_tensors="tf"`` to return ``tf.Tensor`` outputs instead of PyTorch tensors!
.. _ft_native:
.. code-block:: python
Fine-tuning with native PyTorch/TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer, return_tensors="tf")
We can also train using native PyTorch or TensorFlow:
Next, convert your datasets to the ``tf.data.Dataset`` format with ``to_tf_dataset``. Specify inputs and labels in the
``columns`` argument:
.. code-block:: python
## PYTORCH CODE
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification, AdamW
tf_train_dataset = tokenized_imdb["train"].to_tf_dataset(
columns=['attention_mask', 'input_ids', 'label'],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
tf_validation_dataset = tokenized_imdb["train"].to_tf_dataset(
columns=['attention_mask', 'input_ids', 'label'],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
.. code-block:: python
optim = AdamW(model.parameters(), lr=5e-5)
from transformers import create_optimizer
import tensorflow as tf
for epoch in range(3):
for batch in train_loader:
optim.zero_grad()
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['labels'].to(device)
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs[0]
loss.backward()
optim.step()
batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
init_lr=2e-5,
num_warmup_steps=0,
num_train_steps=total_train_steps
)
model.eval()
## TENSORFLOW CODE
from transformers import TFDistilBertForSequenceClassification
Load your model with the :class:`~transformers.TFAutoModel` class along with the number of expected labels:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
.. code-block:: python
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
.. _tok_ner:
Compile the model:
Token Classification with W-NUT Emerging Entities
-----------------------------------------------------------------------------------------------------------------------
.. code-block:: python
.. note::
import tensorflow as tf
model.compile(optimizer=optimizer)
This dataset can be explored in the Hugging Face model hub (`WNUT-17 <https://huggingface.co/datasets/wnut_17>`_),
and can be alternatively downloaded with the 🤗 Datasets library with ``load_dataset("wnut_17")``.
Finally, fine-tune the model by calling ``model.fit``:
Next we will look at token classification. Rather than classifying an entire sequence, this task classifies token by
token. We'll demonstrate how to do this with `Named Entity Recognition
<http://nlpprogress.com/english/named_entity_recognition.html>`_, which involves identifying tokens which correspond to
a predefined set of "entities". Specifically, we'll use the `W-NUT Emerging and Rare entities
<http://noisy-text.github.io/2017/emerging-rare-entities.html>`_ corpus. The data is given as a collection of
pre-tokenized documents where each token is assigned a tag.
.. code-block:: python
Let's start by downloading the data.
model.fit(
tf_train_set,
validation_data=tf_validation_set,
epochs=num_train_epochs,
)
.. code-block:: bash
.. _tok_ner:
wget http://noisy-text.github.io/2017/files/wnut17train.conll
Token classification with WNUT emerging entities
-----------------------------------------------------------------------------------------------------------------------
In this case, we'll just download the train set, which is a single text file. Each line of the file contains either (1)
a word and tag separated by a tab, or (2) a blank line indicating the end of a document. Let's write a function to read
this in. We'll take in the file path and return ``token_docs`` which is a list of lists of token strings, and
``token_tags`` which is a list of lists of tag strings.
Token classification refers to the task of classifying individual tokens in a sentence. One of the most common token
classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence,
such as a person, location, or organization. In this example, learn how to fine-tune a model on the `WNUT 17
<https://huggingface.co/datasets/wnut_17>`_ dataset to detect new entities.
.. code-block:: python
.. note::
from pathlib import Path
import re
For a more in-depth example of how to fine-tune a model for token classification, take a look at the corresponding
`PyTorch notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb>`__
or `TensorFlow notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification-tf.ipynb>`__.
def read_wnut(file_path):
file_path = Path(file_path)
Load WNUT 17 dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
raw_text = file_path.read_text().strip()
raw_docs = re.split(r'\n\t?\n', raw_text)
token_docs = []
tag_docs = []
for doc in raw_docs:
tokens = []
tags = []
for line in doc.split('\n'):
token, tag = line.split('\t')
tokens.append(token)
tags.append(tag)
token_docs.append(tokens)
tag_docs.append(tags)
Load the WNUT 17 dataset from the 🤗 Datasets library:
return token_docs, tag_docs
.. code-block:: python
texts, tags = read_wnut('wnut17train.conll')
from datasets import load_dataset
wnut = load_dataset("wnut_17")
Just to see what this data looks like, let's take a look at a segment of the first document.
A quick look at the dataset shows the labels associated with each word in the sentence:
.. code-block:: python
>>> print(texts[0][10:17], tags[0][10:17], sep='\n')
['for', 'two', 'weeks', '.', 'Empire', 'State', 'Building']
['O', 'O', 'O', 'O', 'B-location', 'I-location', 'I-location']
wnut["train"][0]
{'id': '0',
'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}
``location`` is an entity type, ``B-`` indicates the beginning of an entity, and ``I-`` indicates consecutive positions
of the same entity ("Empire State Building" is considered one entity). ``O`` indicates the token does not correspond to
any entity.
Now that we've read the data in, let's create a train/validation split:
View the specific NER tags by:
.. code-block:: python
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_tags, val_tags = train_test_split(texts, tags, test_size=.2)
label_list = wnut["train"].features[f"ner_tags"].feature.names
label_list
['O',
'B-corporation',
'I-corporation',
'B-creative-work',
'I-creative-work',
'B-group',
'I-group',
'B-location',
'I-location',
'B-person',
'I-person',
'B-product',
'I-product'
]
A letter prefixes each NER tag which can mean:
* ``B-`` indicates the beginning of an entity.
* ``I-`` indicates a token is contained inside the same entity (e.g., the ``State`` token is a part of an entity like
``Empire State Building``).
* ``0`` indicates the token doesn't correspond to any entity.
Preprocess
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Next, let's create encodings for our tokens and tags. For the tags, we can start by just create a simple mapping which
we'll use in a moment:
Now you need to tokenize the text. Load the DistilBERT tokenizer with an :class:`~transformers.AutoTokenizer`:
.. code-block:: python
unique_tags = set(tag for doc in tags for tag in doc)
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
To encode the tokens, we'll use a pre-trained DistilBert tokenizer. We can tell the tokenizer that we're dealing with
ready-split tokens rather than full sentence strings by passing ``is_split_into_words=True``. We'll also pass
``padding=True`` and ``truncation=True`` to pad the sequences to be the same length. Lastly, we can tell the model to
return information about the tokens which are split by the wordpiece tokenization process, which we will need in a
moment.
Since the input has already been split into words, set ``is_split_into_words=True`` to tokenize the words into
subwords:
.. code-block:: python
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
train_encodings = tokenizer(train_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, is_split_into_words=True, return_offsets_mapping=True, padding=True, truncation=True)
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
Great, so now our tokens are nicely encoded in the format that they need to be in to feed them into our DistilBert
model below.
The addition of the special tokens ``[CLS]`` and ``[SEP]`` and subword tokenization creates a mismatch between the
input and labels. Realign the labels and tokens by:
Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in
the W-NUT corpus are not in DistilBert's vocabulary. Bert and many models like it use a method called WordPiece
Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the
vocabulary. For example, DistilBert's tokenizer would split the Twitter handle ``@huggingface`` into the tokens ``['@',
'hugging', '##face']``. This is a problem for us because we have exactly one tag per token. If the tokenizer splits a
token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.
1. Mapping all tokens to their corresponding word with the ``word_ids`` method.
2. Assigning the label ``-100`` to the special tokens ``[CLS]`` and ``[SEP]``` so the PyTorch loss function ignores
them.
3. Only labeling the first token of a given word. Assign ``-100`` to the other subtokens from the same word.
One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in 🤗
Transformers by setting the labels we wish to ignore to ``-100``. In the example above, if the label for
``@HuggingFace`` is ``3`` (indexing ``B-corporation``), we would set the labels of ``['@', 'hugging', '##face']`` to
``[3, -100, -100]``.
Here is how you can create a function that will realign the labels and tokens:
Let's write a function to do this. This is where we will use the ``offset_mapping`` from the tokenizer as mentioned
above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's
start position and end position relative to the original token it was split from. That means that if the first position
in the tuple is anything other than ``0``, we will set its corresponding label to ``-100``. While we're at it, we can
also set labels to ``-100`` if the second position of the offset mapping is ``0``, since this means it must be a
special token like ``[PAD]`` or ``[CLS]``.
.. code-block:: python
.. note::
def tokenize_and_align_labels(examples):
tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
labels = []
for i, label in enumerate(examples[f"ner_tags"]):
word_ids = tokenized_inputs.word_ids(batch_index=i) # Map tokens to their respective word.
previous_word_idx = None
label_ids = []
for word_idx in word_ids: # Set the special tokens to -100.
if word_idx is None:
label_ids.append(-100)
elif word_idx != previous_word_idx: # Only label the first token of a given word.
label_ids.append(label[word_idx])
Due to a recently fixed bug, -1 must be used instead of -100 when using TensorFlow in 🤗 Transformers <= 3.02.
labels.append(label_ids)
tokenized_inputs["labels"] = labels
return tokenized_inputs
Now tokenize and align the labels over the entire dataset with 🤗 Datasets ``map`` function:
.. code-block:: python
import numpy as np
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
def encode_tags(tags, encodings):
labels = [[tag2id[tag] for tag in doc] for doc in tags]
encoded_labels = []
for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
# create an empty array of -100
doc_enc_labels = np.ones(len(doc_offset),dtype=int) * -100
arr_offset = np.array(doc_offset)
Finally, pad your text and labels, so they are a uniform length:
# set labels whose first offset position is 0 and the second is not 0
doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
encoded_labels.append(doc_enc_labels.tolist())
.. code-block:: python
return encoded_labels
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer)
train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)
Fine-tune with the Trainer API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The hard part is now done. Just as in the sequence classification example above, we can create a dataset object:
Load your model with the :class:`~transformers.AutoModel` class along with the number of expected labels:
.. code-block:: python
## PYTORCH CODE
import torch
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
class WNUTDataset(torch.utils.data.Dataset):
def __init__(self, encodings, labels):
self.encodings = encodings
self.labels = labels
Gather your training arguments in :class:`~transformers.TrainingArguments`:
def __getitem__(self, idx):
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
item['labels'] = torch.tensor(self.labels[idx])
return item
.. code-block:: python
def __len__(self):
return len(self.labels)
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
)
train_encodings.pop("offset_mapping") # we don't want to pass this to the model
val_encodings.pop("offset_mapping")
train_dataset = WNUTDataset(train_encodings, train_labels)
val_dataset = WNUTDataset(val_encodings, val_labels)
## TENSORFLOW CODE
import tensorflow as tf
Collect your model, training arguments, dataset, data collator, and tokenizer in :class:`~transformers.Trainer`:
train_encodings.pop("offset_mapping") # we don't want to pass this to the model
val_encodings.pop("offset_mapping")
.. code-block:: python
train_dataset = tf.data.Dataset.from_tensor_slices((
dict(train_encodings),
train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
dict(val_encodings),
val_labels
))
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_wnut["train"],
eval_dataset=tokenized_wnut["test"],
data_collator=data_collator,
tokenizer=tokenizer,
)
Now load in a token classification model and specify the number of labels:
Fine-tune your model:
.. code-block:: python
## PYTORCH CODE
from transformers import DistilBertForTokenClassification
model = DistilBertForTokenClassification.from_pretrained('distilbert-base-cased', num_labels=len(unique_tags))
## TENSORFLOW CODE
from transformers import TFDistilBertForTokenClassification
model = TFDistilBertForTokenClassification.from_pretrained('distilbert-base-cased', num_labels=len(unique_tags))
trainer.train()
The data and model are both ready to go. You can train the model either with
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` or with native PyTorch/TensorFlow, exactly as in the
sequence classification example above.
Fine-tune with TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- :ref:`ft_trainer`
- :ref:`ft_native`
Batch your examples together and pad your text and labels, so they are a uniform length:
.. _qa_squad:
.. code-block:: python
Question Answering with SQuAD 2.0
-----------------------------------------------------------------------------------------------------------------------
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="tf")
.. note::
Convert your datasets to the ``tf.data.Dataset`` format with ``to_tf_dataset``:
.. code-block:: python
This dataset can be explored in the Hugging Face model hub (`SQuAD V2
<https://huggingface.co/datasets/squad_v2>`_), and can be alternatively downloaded with the 🤗 Datasets library with
``load_dataset("squad_v2")``.
tf_train_set = tokenized_wnut["train"].to_tf_dataset(
columns=["attention_mask", "input_ids", "labels"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
Question answering comes in many forms. In this example, we'll look at the particular type of extractive QA that
involves answering a question about a passage by highlighting the segment of the passage that answers the question.
This involves fine-tuning a model which predicts a start position and an end position in the passage. We will use the
`Stanford Question Answering Dataset (SQuAD) 2.0 <https://rajpurkar.github.io/SQuAD-explorer/>`_.
tf_validation_set = tokenized_wnut["validation"].to_tf_dataset(
columns=["attention_mask", "input_ids", "labels"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
We will start by downloading the data:
Load the model with the :class:`~transformers.TFAutoModel` class along with the number of expected labels:
.. code-block:: bash
.. code-block:: python
mkdir squad
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O squad/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json
from transformers import TFAutoModelForTokenClassification
model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
Each split is in a structured json file with a number of questions and answers for each passage (or context). We'll
take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated since
there are multiple questions per context):
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
.. code-block:: python
import json
from pathlib import Path
from transformers import create_optimizer
def read_squad(path):
path = Path(path)
with open(path, 'rb') as f:
squad_dict = json.load(f)
contexts = []
questions = []
answers = []
for group in squad_dict['data']:
for passage in group['paragraphs']:
context = passage['context']
for qa in passage['qas']:
question = qa['question']
for answer in qa['answers']:
contexts.append(context)
questions.append(question)
answers.append(answer)
batch_size = 16
num_train_epochs = 3
num_train_steps = (len(tokenized_datasets["train"]) // batch_size) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
init_lr=2e-5,
num_train_steps=num_train_steps,
weight_decay_rate=0.01,
num_warmup_steps=0,
)
return contexts, questions, answers
Compile the model:
train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')
.. code-block:: python
The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the
correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on
this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which *token* positions the
answer begins and ends.
import tensorflow as tf
model.compile(optimizer=optimizer)
First, let's get the *character* position at which the answer ends in the passage (we are given the starting position).
Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.
Call ``model.fit`` to fine-tune your model:
.. code-block:: python
def add_end_idx(answers, contexts):
for answer, context in zip(answers, contexts):
gold_text = answer['text']
start_idx = answer['answer_start']
end_idx = start_idx + len(gold_text)
model.fit(
tf_train_set,
validation_data=tf_validation_set,
epochs=num_train_epochs,
)
# sometimes squad answers are off by a character or two – fix this
if context[start_idx:end_idx] == gold_text:
answer['answer_end'] = end_idx
elif context[start_idx-1:end_idx-1] == gold_text:
answer['answer_start'] = start_idx - 1
answer['answer_end'] = end_idx - 1 # When the gold label is off by one character
elif context[start_idx-2:end_idx-2] == gold_text:
answer['answer_start'] = start_idx - 2
answer['answer_end'] = end_idx - 2 # When the gold label is off by two characters
.. _qa_squad:
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
Question Answering with SQuAD
-----------------------------------------------------------------------------------------------------------------------
Now ``train_answers`` and ``val_answers`` include the character end positions and the corrected start positions. Next,
let's tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode them together
as sequence pairs.
There are many types of question answering (QA) tasks. Extractive QA focuses on identifying the answer from the text
given a question. In this example, learn how to fine-tune a model on the `SQuAD
<https://huggingface.co/datasets/squad>`_ dataset.
.. code-block:: python
.. note::
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
For a more in-depth example of how to fine-tune a model for question answering, take a look at the corresponding
`PyTorch notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb>`__
or `TensorFlow notebook
<https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering-tf.ipynb>`__.
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)
Load SQuAD dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers,
we can use the built in :func:`~transformers.BatchEncoding.char_to_token` method.
Load the SQuAD dataset from the 🤗 Datasets library:
.. code-block:: python
def add_token_positions(encodings, answers):
start_positions = []
end_positions = []
for i in range(len(answers)):
start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
from datasets import load_dataset
squad = load_dataset("squad")
# if start position is None, the answer passage has been truncated
if start_positions[-1] is None:
start_positions[-1] = tokenizer.model_max_length
if end_positions[-1] is None:
end_positions[-1] = tokenizer.model_max_length
Take a look at an example from the dataset:
encodings.update({'start_positions': start_positions, 'end_positions': end_positions})
.. code-block:: python
add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
'id': '5733be284776f41900661182',
'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
'title': 'University_of_Notre_Dame'
}
Our data is ready. Let's just put it in a PyTorch/TensorFlow dataset so that we can easily use it for training. In
PyTorch, we define a custom ``Dataset`` class. In TensorFlow, we pass a tuple of ``(inputs_dict, labels_dict)`` to the
``from_tensor_slices`` method.
Preprocess
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python
Load the DistilBERT tokenizer with an :class:`~transformers.AutoTokenizer`:
## PYTORCH CODE
import torch
.. code-block:: python
class SquadDataset(torch.utils.data.Dataset):
def __init__(self, encodings):
self.encodings = encodings
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def __getitem__(self, idx):
return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
There are a few things to be aware of when preprocessing text for question answering:
def __len__(self):
return len(self.encodings.input_ids)
1. Some examples in a dataset may have a very long ``context`` that exceeds the maximum input length of the model. You
can deal with this by truncating the ``context`` and set ``truncation="only_second"``.
2. Next, you need to map the start and end positions of the answer to the original context. Set
``return_offset_mapping=True`` to handle this.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the ``sequence_ids`` method to
find which part of the offset corresponds to the question, and which part of the offset corresponds to the context.
train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)
## TENSORFLOW CODE
import tensorflow as tf
Assemble everything in a preprocessing function as shown below:
train_dataset = tf.data.Dataset.from_tensor_slices((
{key: train_encodings[key] for key in ['input_ids', 'attention_mask']},
{key: train_encodings[key] for key in ['start_positions', 'end_positions']}
))
val_dataset = tf.data.Dataset.from_tensor_slices((
{key: val_encodings[key] for key in ['input_ids', 'attention_mask']},
{key: val_encodings[key] for key in ['start_positions', 'end_positions']}
))
.. code-block:: python
Now we can use a DistilBert model with a QA head for training:
def preprocess_function(examples):
questions = [q.strip() for q in examples["question"]]
inputs = tokenizer(
questions,
examples["context"],
max_length=384,
truncation="only_second",
return_offsets_mapping=True,
padding="max_length",
)
.. code-block:: python
offset_mapping = inputs.pop("offset_mapping")
answers = examples["answers"]
start_positions = []
end_positions = []
## PYTORCH CODE
from transformers import DistilBertForQuestionAnswering
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
## TENSORFLOW CODE
from transformers import TFDistilBertForQuestionAnswering
model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
for i, offset in enumerate(offset_mapping):
answer = answers[i]
start_char = answer["answer_start"][0]
end_char = answer["answer_start"][0] + len(answer["text"][0])
sequence_ids = inputs.sequence_ids(i)
# Find the start and end of the context
idx = 0
while sequence_ids[idx] != 1:
idx += 1
context_start = idx
while sequence_ids[idx] == 1:
idx += 1
context_end = idx - 1
# If the answer is not fully inside the context, label it (0, 0)
if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
start_positions.append(0)
end_positions.append(0)
else:
# Otherwise it's the start and end token positions
idx = context_start
while idx <= context_end and offset[idx][0] <= start_char:
idx += 1
start_positions.append(idx - 1)
idx = context_end
while idx >= context_start and offset[idx][1] >= end_char:
idx -= 1
end_positions.append(idx + 1)
inputs["start_positions"] = start_positions
inputs["end_positions"] = end_positions
return inputs
Apply the preprocessing function over the entire dataset with 🤗 Datasets ``map`` function:
.. code-block:: python
The data and model are both ready to go. You can train the model with
:class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` exactly as in the sequence classification example
above. If using native PyTorch, replace ``labels`` with ``start_positions`` and ``end_positions`` in the training
example. If using Keras's ``fit``, we need to make a minor modification to handle this example since it involves
multiple model outputs.
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
- :ref:`ft_trainer`
Batch the processed examples together:
.. code-block:: python
## PYTORCH CODE
from torch.utils.data import DataLoader
from transformers import AdamW
from transformers import default_data_collator
data_collator = default_data_collator
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
Fine-tune with the Trainer API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
model.to(device)
model.train()
Load your model with the :class:`~transformers.AutoModel` class:
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
.. code-block:: python
optim = AdamW(model.parameters(), lr=5e-5)
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
for epoch in range(3):
for batch in train_loader:
optim.zero_grad()
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
start_positions = batch['start_positions'].to(device)
end_positions = batch['end_positions'].to(device)
outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
loss = outputs[0]
loss.backward()
optim.step()
Gather your training arguments in :class:`~transformers.TrainingArguments`:
model.eval()
## TENSORFLOW CODE
# Keras will expect a tuple when dealing with labels
train_dataset = train_dataset.map(lambda x, y: (x, (y['start_positions'], y['end_positions'])))
.. code-block:: python
# Keras will assign a separate loss for each output and add them together. So we'll just use the standard CE loss
# instead of using the built-in model.compute_loss, which expects a dict of outputs and averages the two terms.
# Note that this means the loss will be 2x of when using TFTrainer since we're adding instead of averaging them.
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.distilbert.return_dict = False # if using 🤗 Transformers >3.02, make sure outputs are tuples
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
)
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
Collect your model, training arguments, dataset, data collator, and tokenizer in :class:`~transformers.Trainer`:
.. _resources:
.. code-block:: python
Additional Resources
-----------------------------------------------------------------------------------------------------------------------
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_squad["train"],
eval_dataset=tokenized_squad["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
)
- `How to train a new language model from scratch using Transformers and Tokenizers
<https://huggingface.co/blog/how-to-train>`_. Blog post showing the steps to load in Esperanto data and train a
masked language model from scratch.
- :doc:`Preprocessing <preprocessing>`. Docs page on data preprocessing.
- :doc:`Training <training>`. Docs page on training and fine-tuning.
Fine-tune your model:
.. _datasetslib:
.. code-block:: python
Using the 🤗 Datasets & Metrics library
trainer.train()
Fine-tune with TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This tutorial demonstrates how to read in datasets from various raw text formats and prepare them for training with 🤗
Transformers so that you can do the same thing with your own custom datasets. However, we recommend users use the `🤗
Datasets library <https://github.com/huggingface/datasets>`_ for working with the 150+ datasets included in the `hub
<https://huggingface.co/datasets>`_, including the three datasets used in this tutorial. As a very brief overview, we
will show how to use the Datasets library to download and prepare the IMDb dataset from the first example,
:ref:`seq_imdb`.
Batch the processed examples together with a TensorFlow default data collator:
.. code-block:: python
from transformers.data.data_collator import tf_default_collator
data_collator = tf_default_collator
Start by downloading the dataset:
Convert your datasets to the ``tf.data.Dataset`` format with the ``to_tf_dataset`` function:
.. code-block:: python
from datasets import load_dataset
train = load_dataset("imdb", split="train")
tf_train_set = tokenized_squad["train"].to_tf_dataset(
columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
dummy_labels=True,
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
dummy_labels=True,
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
Each dataset has multiple columns corresponding to different features. Let's see what our columns are.
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
.. code-block:: python
>>> print(train.column_names)
['label', 'text']
from transformers import create_optimizer
Great. Now let's tokenize the text. We can do this using the ``map`` method. We'll also rename the ``label`` column to
``labels`` to match the model's input arguments.
batch_size = 16
num_epochs = 2
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
init_lr=2e-5,
num_warmup_steps=0,
num_train_steps=total_train_steps,
)
Load your model with the :class:`~transformers.TFAutoModel` class:
.. code-block:: python
train = train.map(lambda batch: tokenizer(batch["text"], truncation=True, padding=True), batched=True)
train.rename_column_("label", "labels")
from transformers import TFAutoModelForQuestionAnswering
model = TFAutoModelForQuestionAnswering("distilbert-base-uncased")
Lastly, we can use the ``set_format`` method to determine which columns and in what data format we want to access
dataset elements.
Compile the model:
.. code-block:: python
## PYTORCH CODE
>>> train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
>>> {key: val.shape for key, val in train[0].items()})
{'labels': torch.Size([]), 'input_ids': torch.Size([512]), 'attention_mask': torch.Size([512])}
## TENSORFLOW CODE
>>> train.set_format("tensorflow", columns=["input_ids", "attention_mask", "labels"])
>>> {key: val.shape for key, val in train[0].items()})
{'labels': TensorShape([]), 'input_ids': TensorShape([512]), 'attention_mask': TensorShape([512])}
import tensorflow as tf
model.compile(optimizer=optimizer)
Call ``model.fit`` to fine-tune the model:
We now have a fully-prepared dataset. Check out `the 🤗 Datasets docs
<https://huggingface.co/docs/datasets/processing.html>`_ for a more thorough introduction.
.. code-block:: python
model.fit(
tf_train_set,
validation_data=tf_validation_set,
epochs=num_train_epochs,
)
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment