<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# T5v1.1

## Overview

T5v1.1 was released in the [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511)
repository by Colin Raffel et al. It's an improved version of the original T5 model.

One can directly plug in the weights of T5v1.1 into a T5 model, like so:

```python
>>> from transformers import T5ForConditionalGeneration

>>> model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
```

T5 Version 1.1 includes the following improvements compared to the original T5 model:

- GEGLU activation in the feed-forward hidden layer, rather than ReLU. See [this paper](https://arxiv.org/abs/2002.05202).

- Dropout was turned off in pre-training (quality win). Dropout should be re-enabled during fine-tuning.

- Pre-trained on C4 only without mixing in the downstream tasks.

- No parameter sharing between the embedding and classifier layer.

- "xl" and "xxl" replace "3B" and "11B". The model shapes are a bit different - larger `d_model` and smaller
  `num_heads` and `d_ff`.
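
As a quick sanity check, one can compare the configuration of a T5v1.1 checkpoint with that of the original T5. This is only an illustrative sketch: it assumes the released `google/t5-v1_1-base` and `t5-base` checkpoints and inspects standard `T5Config` attributes.

```python
>>> from transformers import AutoConfig

>>> v1_1_config = AutoConfig.from_pretrained("google/t5-v1_1-base")
>>> t5_config = AutoConfig.from_pretrained("t5-base")

>>> # GEGLU feed-forward ("gated-gelu") instead of ReLU
>>> v1_1_config.feed_forward_proj, t5_config.feed_forward_proj
('gated-gelu', 'relu')

>>> # no parameter sharing between the embedding and the classifier layer
>>> v1_1_config.tie_word_embeddings, t5_config.tie_word_embeddings
(False, True)
```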

Note: T5 Version 1.1 was only pre-trained on [C4](https://huggingface.co/datasets/c4) excluding any supervised
training. Therefore, this model has to be fine-tuned before it is usable on a downstream task, unlike the original T5
model. Since T5v1.1 was pre-trained in an unsupervised fashion, there's no real advantage to using a task prefix during
single-task fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
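
As a rough sketch of what single-task fine-tuning looks like without a task prefix, one can simply feed the raw input and target text through the model. The example sentences below are placeholders; only the `google/t5-v1_1-base` checkpoint name comes from this page.

```python
>>> from transformers import AutoTokenizer, T5ForConditionalGeneration

>>> tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-base")
>>> model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")

>>> # note: no task prefix such as "summarize: " is prepended to the input
>>> input_ids = tokenizer("Studies have shown that owning a dog is good for you", return_tensors="pt").input_ids
>>> labels = tokenizer("Studies show that owning a dog is good for you", return_tensors="pt").input_ids

>>> # the returned loss is what you would minimize during fine-tuning
>>> loss = model(input_ids=input_ids, labels=labels).loss
```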

Google has released the following variants:

- [google/t5-v1_1-small](https://huggingface.co/google/t5-v1_1-small)

- [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base)

- [google/t5-v1_1-large](https://huggingface.co/google/t5-v1_1-large)

- [google/t5-v1_1-xl](https://huggingface.co/google/t5-v1_1-xl)

- [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl)

One can refer to [T5's documentation page](t5) for all tips, code examples and notebooks.

This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
found [here](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511).