translation.mdx 7.37 KB
Newer Older
Steven Liu's avatar
Steven Liu committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Translation

<Youtube id="1JvfrvZgi6c"/>

Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework that extends to vision and audio tasks. 

This guide will show you how to fine-tune [T5](https://huggingface.co/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text to French.

<Tip>

See the translation [task page](https://huggingface.co/tasks/translation) for more information about its associated models, datasets, and metrics.

</Tip>

## Load OPUS Books dataset

Load the OPUS Books dataset from the 馃 Datasets library:

```py
>>> from datasets import load_dataset

>>> books = load_dataset("opus_books", "en-fr")
```

Split this dataset into a train and test set:

```py
books = books["train"].train_test_split(test_size=0.2)
```

Then take a look at an example:

```py
>>> books["train"][0]
{'id': '90560',
 'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
  'fr': 'Mais ce plateau 茅lev茅 ne mesurait que quelques toises, et bient么t nous f没mes rentr茅s dans notre 茅l茅ment.'}}
```

The `translation` field is a dictionary containing the English and French translations of the text.

## Preprocess

<Youtube id="XAR8jnZZuUs"/>

Load the T5 tokenizer to process the language pairs:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")
```

The preprocessing function needs to:

1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Tokenize the input (English) and target (French) separately. You can't tokenize French text with a tokenizer pretrained on an English vocabulary. A context manager will help set the tokenizer to French first before tokenizing it.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

```py
>>> source_lang = "en"
>>> target_lang = "fr"
>>> prefix = "translate English to French: "


>>> def preprocess_function(examples):
...     inputs = [prefix + example[source_lang] for example in examples["translation"]]
...     targets = [example[target_lang] for example in examples["translation"]]
...     model_inputs = tokenizer(inputs, max_length=128, truncation=True)

...     with tokenizer.as_target_tokenizer():
...         labels = tokenizer(targets, max_length=128, truncation=True)

...     model_inputs["labels"] = labels["input_ids"]
...     return model_inputs
```

Sylvain Gugger's avatar
Sylvain Gugger committed
90
Use 馃 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
Steven Liu's avatar
Steven Liu committed
91
92
93
94
95
96
97

```py
>>> tokenized_books = books.map(preprocess_function, batched=True)
```

Use [`DataCollatorForSeq2Seq`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.

Sylvain Gugger's avatar
Sylvain Gugger committed
98
99
<frameworkcontent>
<pt>
Steven Liu's avatar
Steven Liu committed
100
101
102
103
```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
Sylvain Gugger's avatar
Sylvain Gugger committed
104
105
106
107
```
</pt>
<tf>
```py
Steven Liu's avatar
Steven Liu committed
108
109
110
111
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
```
Sylvain Gugger's avatar
Sylvain Gugger committed
112
113
</tf>
</frameworkcontent>
Steven Liu's avatar
Steven Liu committed
114

115
## Train
Steven Liu's avatar
Steven Liu committed
116

117
118
<frameworkcontent>
<pt>
Steven Liu's avatar
Steven Liu committed
119
120
121
122
123
124
125
126
127
128
Load T5 with [`AutoModelForSeq2SeqLM`]:

```py
>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```

<Tip>

Steven Liu's avatar
Steven Liu committed
129
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
Steven Liu's avatar
Steven Liu committed
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162

</Tip>

At this point, only three steps remain:

1. Define your training hyperparameters in [`Seq2SeqTrainingArguments`].
2. Pass the training arguments to [`Seq2SeqTrainer`] along with the model, dataset, tokenizer, and data collator.
3. Call [`~Trainer.train`] to fine-tune your model.

```py
>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="./results",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=1,
...     fp16=True,
... )

>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_books["train"],
...     eval_dataset=tokenized_books["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
... )

>>> trainer.train()
```
163
164
</pt>
<tf>
Sylvain Gugger's avatar
Sylvain Gugger committed
165
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
Steven Liu's avatar
Steven Liu committed
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182

```py
>>> tf_train_set = tokenized_books["train"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = tokenized_books["test"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
```

183
184
185
186
187
188
<Tip>

If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](training#finetune-with-keras)!

</Tip>

Steven Liu's avatar
Steven Liu committed
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
Set up an optimizer function, learning rate schedule, and some training hyperparameters:

```py
>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
```

Load T5 with [`TFAutoModelForSeq2SeqLM`]:

```py
>>> from transformers import TFAutoModelForSeq2SeqLM

>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")
```

Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):

```py
>>> model.compile(optimizer=optimizer)
```

Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:

```py
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
```
216
217
</tf>
</frameworkcontent>
Steven Liu's avatar
Steven Liu committed
218
219
220
221

<Tip>

For a more in-depth example of how to fine-tune a model for translation, take a look at the corresponding
222
223
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb).
Steven Liu's avatar
Steven Liu committed
224
225

</Tip>