<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Translation

<Youtube id="1JvfrvZgi6c"/>

Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework that extends to vision and audio tasks. 

This guide will show you how to fine-tune [T5](https://huggingface.co/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text to French.

<Tip>

See the translation [task page](https://huggingface.co/tasks/translation) for more information about its associated models, datasets, and metrics.

</Tip>

## Load OPUS Books dataset

Load the OPUS Books dataset from the 🤗 Datasets library:

```py
>>> from datasets import load_dataset

>>> books = load_dataset("opus_books", "en-fr")
```

Split this dataset into a train and test set:

```py
books = books["train"].train_test_split(test_size=0.2)
```

Then take a look at an example:

```py
>>> books["train"][0]
{'id': '90560',
 'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
  'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}
```

The `translation` field is a dictionary containing the English and French translations of the text.
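
You can pull out either language by indexing into the `translation` dictionary with its language code, as in this quick sketch based on the example above:

```py
>>> books["train"][0]["translation"]["en"]
'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.'
```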

## Preprocess

<Youtube id="XAR8jnZZuUs"/>

Load the T5 tokenizer to process the language pairs:

```py
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")
```

The preprocessing function needs to:

1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Tokenize the input (English) and target (French) separately because you can't tokenize French text with a tokenizer pretrained on an English vocabulary. Passing the French text to the `text_target` parameter ensures it is tokenized correctly as the labels.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

```py
>>> source_lang = "en"
>>> target_lang = "fr"
>>> prefix = "translate English to French: "


>>> def preprocess_function(examples):
...     inputs = [prefix + example[source_lang] for example in examples["translation"]]
...     targets = [example[target_lang] for example in examples["translation"]]
...     model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
...     return model_inputs
```

Use 馃 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
>>> tokenized_books = books.map(preprocess_function, batched=True)
```
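
As an optional sanity check (not part of the original recipe), you can decode one of the tokenized examples to confirm the task prefix was added to the input and that the French target ended up in `labels`:

```py
>>> # the decoded input should start with "translate English to French: "
>>> tokenizer.decode(tokenized_books["train"][0]["input_ids"], skip_special_tokens=True)

>>> # the labels decode back to the French target sentence
>>> tokenizer.decode(tokenized_books["train"][0]["labels"], skip_special_tokens=True)
```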

<frameworkcontent>
<pt>
Load T5 with [`AutoModelForSeq2SeqLM`]:

```py
>>> from transformers import AutoModelForSeq2SeqLM

>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
</pt>
<tf>
Load T5 with [`TFAutoModelForSeq2SeqLM`]:

```py
>>> from transformers import TFAutoModelForSeq2SeqLM

>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
</tf>
</frameworkcontent>

Use [`DataCollatorForSeq2Seq`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.

<frameworkcontent>
<pt>
```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
```
</pt>
<tf>
```py
>>> from transformers import DataCollatorForSeq2Seq

>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
```
</tf>
</frameworkcontent>
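
To see the dynamic padding in action, you can collate a few tokenized examples yourself. This is an optional illustration rather than a required step; it keeps only the model input columns because the tokenized dataset still carries the original `id` and `translation` fields:

```py
>>> keep = ["input_ids", "attention_mask", "labels"]
>>> features = [{k: v for k, v in tokenized_books["train"][i].items() if k in keep} for i in range(4)]

>>> batch = data_collator(features)
>>> {k: v.shape for k, v in batch.items()}  # every tensor is padded to the longest sequence in this batch
```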

## Train

<frameworkcontent>
<pt>

<Tip>

If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!

</Tip>

At this point, only three steps remain:

1. Define your training hyperparameters in [`Seq2SeqTrainingArguments`].
2. Pass the training arguments to [`Seq2SeqTrainer`] along with the model, dataset, tokenizer, and data collator.
3. Call [`~Trainer.train`] to fine-tune your model.

```py
>>> from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="./results",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=1,
...     fp16=True,
... )

>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_books["train"],
...     eval_dataset=tokenized_books["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
... )

>>> trainer.train()
```
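
Once training finishes, you can try the fine-tuned model on new text with the model's `generate` method. This is a minimal sketch, and the example sentence is made up rather than taken from the dataset:

```py
>>> text = "translate English to French: The weather is lovely today."
>>> input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

>>> outputs = model.generate(input_ids, max_new_tokens=50)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
```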
</pt>
<tf>
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify the inputs and labels in `columns`, whether to shuffle the dataset order, the batch size, and the data collator:

```py
>>> tf_train_set = tokenized_books["train"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = tokenized_books["test"].to_tf_dataset(
...     columns=["attention_mask", "input_ids", "labels"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
```

<Tip>

If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](../training#finetune-with-keras)!

</Tip>

Set up an optimizer function, learning rate schedule, and some training hyperparameters:

```py
>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
```
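
If you would rather use a learning rate schedule with warmup, the `create_optimizer` function imported above can build one for you. The step counts below are assumptions based on the batch size of 16 used earlier and the three epochs of training used below:

```py
>>> # alternative: Adam with weight decay plus a linear warmup/decay schedule
>>> num_epochs = 3
>>> num_train_steps = (len(tokenized_books["train"]) // 16) * num_epochs

>>> optimizer, lr_schedule = create_optimizer(
...     init_lr=2e-5,
...     num_train_steps=num_train_steps,
...     num_warmup_steps=0,
...     weight_decay_rate=0.01,
... )
```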

Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):

```py
>>> model.compile(optimizer=optimizer)
```

Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:

```py
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
```
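
After training, you can save the model and tokenizer to a local directory so they can be reloaded later with `from_pretrained`; the directory name here is just a placeholder:

```py
>>> save_directory = "./t5-small-en-fr"  # placeholder path
>>> model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)
```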
</tf>
</frameworkcontent>

<Tip>

For a more in-depth example of how to fine-tune a model for translation, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb).

</Tip>