"examples/vscode:/vscode.git/clone" did not exist on "3e09d813aaba0bcda4cd9e46422591ac9b13b3ac"
Unverified Commit 47a98fc4 authored by Patrick von Platen's avatar Patrick von Platen Committed by GitHub
Browse files

ByT5 model (#11971)



* allow tf to use uneven num of layers

* add tokenizer

* finish docs

* finish docs

* Apply suggestions from code review

* include in index

* finish

* Update docs/source/model_doc/byt5.rst
Co-authored-by: default avatarNielsRogge <48327001+NielsRogge@users.noreply.github.com>

* apply sylvais suggestions

* make style
Co-authored-by: default avatarNielsRogge <48327001+NielsRogge@users.noreply.github.com>
parent 1eb58b45
...@@ -206,6 +206,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h ...@@ -206,6 +206,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Blenderbot](https://huggingface.co/transformers/model_doc/blenderbot.html)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. 1. **[Blenderbot](https://huggingface.co/transformers/model_doc/blenderbot.html)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1. **[BlenderbotSmall](https://huggingface.co/transformers/model_doc/blenderbot_small.html)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. 1. **[BlenderbotSmall](https://huggingface.co/transformers/model_doc/blenderbot_small.html)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
1. **[BORT](https://huggingface.co/transformers/model_doc/bort.html)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry. 1. **[BORT](https://huggingface.co/transformers/model_doc/bort.html)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry.
1. **[ByT5](https://huggingface.co/transformers/model_doc/byt5.html)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
1. **[CamemBERT](https://huggingface.co/transformers/model_doc/camembert.html)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot. 1. **[CamemBERT](https://huggingface.co/transformers/model_doc/camembert.html)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CLIP](https://huggingface.co/transformers/model_doc/clip.html)** from (OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. 1. **[CLIP](https://huggingface.co/transformers/model_doc/clip.html)** from (OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[ConvBERT](https://huggingface.co/transformers/model_doc/convbert.html)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 1. **[ConvBERT](https://huggingface.co/transformers/model_doc/convbert.html)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
......
This diff is collapsed.
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
ByT5
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ByT5 model was presented in `ByT5: Towards a token-free future with pre-trained byte-to-byte models
<https://arxiv.org/abs/2105.13626>`_ by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir
Kale, Adam Roberts, Colin Raffel.
The abstract from the paper is the following:
*Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from
the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they
can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by
removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token
sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of
operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with
minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count,
training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level
counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on
tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of
pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our
experiments.*
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The original code can be
found `here <https://github.com/google-research/byt5>`__.
ByT5's architecture is based on the T5 model, so one can refer to :doc:`T5's documentation page <t5>`.
Example
_______________________________________________________________________________________________________________________
ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:
.. code-block::
from transformers import T5ForConditionalGeneration
import torch
model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3 # add 3 for special tokens
labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3 # add 3 for special tokens
loss = model(input_ids, labels=labels).loss # forward pass
For batched inference and training it is however recommended to make use of the tokenizer:
.. code-block::
from transformers import T5ForConditionalGeneration, AutoTokenizer
model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')
model_inputs = tokenizer(["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt")
labels = tokenizer(["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt").input_ids
loss = model(**model_inputs, labels=labels).loss # forward pass
ByT5Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ByT5Tokenizer
See :class:`~transformers.ByT5Tokenizer` for all details.
...@@ -165,6 +165,7 @@ _import_structure = { ...@@ -165,6 +165,7 @@ _import_structure = {
"BlenderbotSmallConfig", "BlenderbotSmallConfig",
"BlenderbotSmallTokenizer", "BlenderbotSmallTokenizer",
], ],
"models.byt5": ["ByT5Tokenizer"],
"models.camembert": ["CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "CamembertConfig"], "models.camembert": ["CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "CamembertConfig"],
"models.clip": [ "models.clip": [
"CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
...@@ -1636,6 +1637,7 @@ if TYPE_CHECKING: ...@@ -1636,6 +1637,7 @@ if TYPE_CHECKING:
BlenderbotSmallConfig, BlenderbotSmallConfig,
BlenderbotSmallTokenizer, BlenderbotSmallTokenizer,
) )
from .models.byt5 import ByT5Tokenizer
from .models.camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig from .models.camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from .models.clip import ( from .models.clip import (
CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP, CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
......
...@@ -27,6 +27,7 @@ from ..bert_japanese.tokenization_bert_japanese import BertJapaneseTokenizer ...@@ -27,6 +27,7 @@ from ..bert_japanese.tokenization_bert_japanese import BertJapaneseTokenizer
from ..bertweet.tokenization_bertweet import BertweetTokenizer from ..bertweet.tokenization_bertweet import BertweetTokenizer
from ..blenderbot.tokenization_blenderbot import BlenderbotTokenizer from ..blenderbot.tokenization_blenderbot import BlenderbotTokenizer
from ..blenderbot_small.tokenization_blenderbot_small import BlenderbotSmallTokenizer from ..blenderbot_small.tokenization_blenderbot_small import BlenderbotSmallTokenizer
from ..byt5.tokenization_byt5 import ByT5Tokenizer
from ..convbert.tokenization_convbert import ConvBertTokenizer from ..convbert.tokenization_convbert import ConvBertTokenizer
from ..ctrl.tokenization_ctrl import CTRLTokenizer from ..ctrl.tokenization_ctrl import CTRLTokenizer
from ..deberta.tokenization_deberta import DebertaTokenizer from ..deberta.tokenization_deberta import DebertaTokenizer
...@@ -287,6 +288,7 @@ TOKENIZER_MAPPING = OrderedDict( ...@@ -287,6 +288,7 @@ TOKENIZER_MAPPING = OrderedDict(
NO_CONFIG_TOKENIZER = [ NO_CONFIG_TOKENIZER = [
BertJapaneseTokenizer, BertJapaneseTokenizer,
BertweetTokenizer, BertweetTokenizer,
ByT5Tokenizer,
CpmTokenizer, CpmTokenizer,
HerbertTokenizer, HerbertTokenizer,
HerbertTokenizerFast, HerbertTokenizerFast,
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...file_utils import _BaseLazyModule
_import_structure = {
"tokenization_byt5": ["ByT5Tokenizer"],
}
if TYPE_CHECKING:
from .tokenization_byt5 import ByT5Tokenizer
else:
import importlib
import os
import sys
class _LazyModule(_BaseLazyModule):
"""
Module class that surfaces all objects but only performs associated imports when the objects are requested.
"""
__file__ = globals()["__file__"]
__path__ = [os.path.dirname(__file__)]
def _get_module(self, module_name: str):
return importlib.import_module("." + module_name, self.__name__)
sys.modules[__name__] = _LazyModule(__name__, _import_structure)
# coding=utf-8
# Copyright 2018 The T5 authors and HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert T5 checkpoint."""
import argparse
from transformers import T5Config, T5ForConditionalGeneration, load_tf_weights_in_t5
from transformers.utils import logging
logging.set_verbosity_info()
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path):
# Initialise PyTorch model
config = T5Config.from_json_file(config_file)
print(f"Building PyTorch model from configuration: {config}")
model = T5ForConditionalGeneration(config)
# Load weights from tf checkpoint
load_tf_weights_in_t5(model, config, tf_checkpoint_path)
# Save pytorch-model
print(f"Save PyTorch model to {pytorch_dump_path}")
model.save_pretrained(pytorch_dump_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--tf_checkpoint_path", default=None, type=str, required=True, help="Path to the TensorFlow checkpoint path."
)
parser.add_argument(
"--config_file",
default=None,
type=str,
required=True,
help="The config json file corresponding to the pre-trained T5 model. \n"
"This specifies the model architecture.",
)
parser.add_argument(
"--pytorch_dump_path", default=None, type=str, required=True, help="Path to the output PyTorch model."
)
args = parser.parse_args()
convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path)
# coding=utf-8
# Copyright 2021 T5 Authors and HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Tokenization class for model ByT5."""
import re
import warnings
from typing import Dict, List, Optional, Tuple
from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging
logger = logging.get_logger(__name__)
class ByT5Tokenizer(PreTrainedTokenizer):
"""
Construct a ByT5 tokenizer. ByT5 simply uses raw bytes utf-8 encoding.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
Args:
eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
The end of sequence token.
.. note::
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the :obj:`sep_token`.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
extra_ids (:obj:`int`, `optional`, defaults to 100):
Add a number of extra ids added to the end of the vocabulary for use as sentinels. These tokens are
accessible as "<extra_id_{%d}>" where "{%d}" is a number between 0 and extra_ids-1. Extra tokens are
indexed from the end of the vocabulary up to beginning ("<extra_id_0>" is the last token in the vocabulary
like in ByT5 preprocessing see `here
<https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117>`__).
additional_special_tokens (:obj:`List[str]`, `optional`):
Additional special tokens used by the tokenizer.
"""
model_input_names = ["input_ids", "attention_mask"]
def __init__(
self,
eos_token="</s>",
unk_token="<unk>",
pad_token="<pad>",
extra_ids=125,
additional_special_tokens=None,
**kwargs
) -> None:
# Add extra_ids to the special token list
if extra_ids > 0 and additional_special_tokens is None:
additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]
elif extra_ids > 0 and additional_special_tokens is not None:
# Check that we have the right number of extra_id special tokens
extra_tokens = len(set(filter(lambda x: bool("extra_id" in str(x)), additional_special_tokens)))
if extra_tokens != extra_ids:
raise ValueError(
f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to ByT5Tokenizer. "
"In this case the additional_special_tokens must include the extra_ids tokens"
)
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
super().__init__(
eos_token=eos_token,
unk_token=unk_token,
pad_token=pad_token,
extra_ids=extra_ids,
additional_special_tokens=additional_special_tokens,
**kwargs,
)
# define special tokens dict
self.special_tokens_encoder: Dict[int, str] = {
self.pad_token: 0,
self.eos_token: 1,
self.unk_token: 2,
}
self.special_tokens_decoder: Dict[str, int] = {v: k for k, v in self.special_tokens_encoder.items()}
self._num_special_tokens = len(self.special_tokens_encoder)
self._utf_vocab_size = 2 ** 8 # utf is 8 bits
self._extra_ids = extra_ids
@property
def vocab_size(self):
return self._utf_vocab_size + self._num_special_tokens + self._extra_ids
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
)
# normal case: some special tokens
if token_ids_1 is None:
return ([0] * len(token_ids_0)) + [1]
return ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]:
"""Do not add eos again if user already added it."""
if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id:
warnings.warn(
f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated eos tokens being added."
)
return token_ids
else:
return token_ids + [self.eos_token_id]
def create_token_type_ids_from_sequences(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. ByT5 does not
make use of token type ids, therefore a list of zeros is returned.
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of zeros.
"""
eos = [self.eos_token_id]
if token_ids_1 is None:
return len(token_ids_0 + eos) * [0]
return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. A sequence has the following format:
- single sequence: ``X </s>``
- pair of sequences: ``A </s> B </s>``
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
token_ids_0 = self._add_eos_if_not_present(token_ids_0)
if token_ids_1 is None:
return token_ids_0
else:
token_ids_1 = self._add_eos_if_not_present(token_ids_1)
return token_ids_0 + token_ids_1
def _tokenize(self, text: str) -> List[str]:
"""Take as input a string and return a list of strings (tokens) for words/sub-words"""
def _sub_tokenize(sub_text):
character_list = list(sub_text)
utf_tokens_lists = [list(char.encode("utf-8")) for char in character_list]
sub_tokens = [chr(utf_token) for utf_tokens in utf_tokens_lists for utf_token in utf_tokens]
return sub_tokens
# split on special characters
pattern = f"({'|'.join(self.special_tokens_encoder.keys())})"
sub_texts = list(filter(None, re.split(pattern, text)))
tokens = []
for sub_text in sub_texts:
if sub_text in self.special_tokens_encoder.keys():
tokens += [sub_text]
else:
tokens += _sub_tokenize(sub_text)
return tokens
def _convert_token_to_id(self, token):
"""Converts a token (str) in an id using the vocab."""
if token.startswith("<extra_id_"):
match = re.match(r"<extra_id_(\d+)>", token)
num = int(match.group(1))
token_id = self.vocab_size - num - 1
elif token in self.special_tokens_encoder:
token_id = self.special_tokens_encoder[token]
elif len(token) > 1:
# token of length > 1 must be newly added tokens => set them to unk token
token_id = self.unk_token_id
else:
token_id = ord(token) + self._num_special_tokens
return token_id
def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
if index < self._num_special_tokens:
token = self.special_tokens_decoder[index]
elif index < self._utf_vocab_size + self._num_special_tokens:
token = chr(index - self._num_special_tokens)
else:
token = f"<extra_id_{self.vocab_size - 1 - index}>"
return token
def convert_tokens_to_string(self, tokens):
"""Converts a sequence of tokens (string) in a single string."""
def _convert_sub_string(sub_chars):
byte_string = bytes([ord(char) for char in sub_chars])
return byte_string.decode("utf-8", errors="ignore")
string = ""
sub_chars = []
for token in tokens:
# if is special token
if len(token) > 1:
string += _convert_sub_string(sub_chars)
string += token
sub_chars = []
else:
sub_chars.append(token)
# add remaining chars
string += _convert_sub_string(sub_chars)
return string
# ByT5Tokenizer has no vocab file
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
return ()
...@@ -1092,6 +1092,7 @@ class TFT5Model(TFT5PreTrainedModel): ...@@ -1092,6 +1092,7 @@ class TFT5Model(TFT5PreTrainedModel):
decoder_config = copy.deepcopy(config) decoder_config = copy.deepcopy(config)
decoder_config.is_decoder = True decoder_config.is_decoder = True
decoder_config.num_layers = config.num_decoder_layers
self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name="decoder") self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name="decoder")
def get_encoder(self): def get_encoder(self):
...@@ -1255,6 +1256,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel, TFCausalLanguageModeling ...@@ -1255,6 +1256,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel, TFCausalLanguageModeling
decoder_config = copy.deepcopy(config) decoder_config = copy.deepcopy(config)
decoder_config.is_decoder = True decoder_config.is_decoder = True
decoder_config.num_layers = config.num_decoder_layers
self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name="decoder") self.decoder = TFT5MainLayer(decoder_config, embed_tokens, name="decoder")
if not config.tie_word_embeddings: if not config.tie_word_embeddings:
......
...@@ -30,7 +30,7 @@ from .test_modeling_common import ModelTesterMixin, ids_tensor ...@@ -30,7 +30,7 @@ from .test_modeling_common import ModelTesterMixin, ids_tensor
if is_torch_available(): if is_torch_available():
import torch import torch
from transformers import T5Config, T5EncoderModel, T5ForConditionalGeneration, T5Model, T5Tokenizer from transformers import ByT5Tokenizer, T5Config, T5EncoderModel, T5ForConditionalGeneration, T5Model, T5Tokenizer
from transformers.models.t5.modeling_t5 import T5_PRETRAINED_MODEL_ARCHIVE_LIST from transformers.models.t5.modeling_t5 import T5_PRETRAINED_MODEL_ARCHIVE_LIST
...@@ -846,6 +846,30 @@ class T5ModelIntegrationTests(unittest.TestCase): ...@@ -846,6 +846,30 @@ class T5ModelIntegrationTests(unittest.TestCase):
EXPECTED_SCORE = -59.0293 EXPECTED_SCORE = -59.0293
self.assertTrue(abs(mtf_score - EXPECTED_SCORE) < 1e-4) self.assertTrue(abs(mtf_score - EXPECTED_SCORE) < 1e-4)
@slow
def test_small_byt5_integration_test(self):
"""
For comparision run:
>>> import t5 # pip install t5==0.9.1
>>> path_to_byt5_small_checkpoint = '<fill_in>'
>>> t5_model = t5.models.MtfModel(model_dir=path_to_tf_checkpoint, batch_size=1, tpu=None)
>>> vocab = t5.data.ByteVocabulary()
>>> score = t5_model.score(inputs=["Hello there"], targets=["Hi I am"], vocabulary=vocab)
"""
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small").to(torch_device)
tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
input_ids = tokenizer("Hello there", return_tensors="pt").input_ids
labels = tokenizer("Hi I am", return_tensors="pt").input_ids
loss = model(input_ids.to(torch_device), labels=labels.to(torch_device)).loss
mtf_score = -(labels.shape[-1] * loss.item())
EXPECTED_SCORE = -60.7397
self.assertTrue(abs(mtf_score - EXPECTED_SCORE) < 1e-4)
@slow @slow
def test_summarization(self): def test_summarization(self):
model = self.model model = self.model
......
...@@ -26,7 +26,7 @@ from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor ...@@ -26,7 +26,7 @@ from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor
if is_tf_available(): if is_tf_available():
import tensorflow as tf import tensorflow as tf
from transformers import T5Tokenizer, TFT5EncoderModel, TFT5ForConditionalGeneration, TFT5Model from transformers import ByT5Tokenizer, T5Tokenizer, TFT5EncoderModel, TFT5ForConditionalGeneration, TFT5Model
class TFT5ModelTester: class TFT5ModelTester:
...@@ -499,6 +499,30 @@ class TFT5ModelIntegrationTests(unittest.TestCase): ...@@ -499,6 +499,30 @@ class TFT5ModelIntegrationTests(unittest.TestCase):
EXPECTED_SCORE = -59.0293 EXPECTED_SCORE = -59.0293
self.assertTrue(abs(mtf_score - EXPECTED_SCORE) < 1e-4) self.assertTrue(abs(mtf_score - EXPECTED_SCORE) < 1e-4)
@slow
def test_small_byt5_integration_test(self):
"""
For comparision run:
>>> import t5 # pip install t5==0.9.1
>>> path_to_byt5_small_checkpoint = '<fill_in>'
>>> t5_model = t5.models.MtfModel(model_dir=path_to_tf_checkpoint, batch_size=1, tpu=None)
>>> vocab = t5.data.ByteVocabulary()
>>> score = t5_model.score(inputs=["Hello there"], targets=["Hi I am"], vocabulary=vocab)
"""
model = TFT5ForConditionalGeneration.from_pretrained("google/byt5-small")
tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
input_ids = tokenizer("Hello there", return_tensors="tf").input_ids
labels = tokenizer("Hi I am", return_tensors="tf").input_ids
loss = model(input_ids, labels=labels).loss
mtf_score = -tf.math.reduce_sum(loss).numpy()
EXPECTED_SCORE = -60.7397
self.assertTrue(abs(mtf_score - EXPECTED_SCORE) < 1e-4)
@slow @slow
def test_summarization(self): def test_summarization(self):
model = self.model model = self.model
......
# coding=utf-8
# Copyright 2020 Google T5 Authors and HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import shutil
import tempfile
import unittest
from transformers import BatchEncoding, ByT5Tokenizer
from transformers.file_utils import cached_property, is_tf_available, is_torch_available
from .test_tokenization_common import TokenizerTesterMixin
if is_torch_available():
FRAMEWORK = "pt"
elif is_tf_available():
FRAMEWORK = "tf"
else:
FRAMEWORK = "jax"
class ByT5TokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer_class = ByT5Tokenizer
test_rust_tokenizer = False
def setUp(self):
super().setUp()
tokenizer = ByT5Tokenizer()
tokenizer.save_pretrained(self.tmpdirname)
@cached_property
def t5_base_tokenizer(self):
return ByT5Tokenizer.from_pretrained("google/byt5-small")
def get_tokenizer(self, **kwargs) -> ByT5Tokenizer:
return self.tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
def test_eos_treatment(self):
tokenizer = self.t5_base_tokenizer
batch_with_eos_added = tokenizer(["hi</s>", "I went to the gym</s>", "</s>"])
batch_without_eos_added = tokenizer(["hi", "I went to the gym", ""])
self.assertListEqual(batch_with_eos_added["input_ids"], batch_without_eos_added["input_ids"])
def test_prepare_batch_integration(self):
tokenizer = self.t5_base_tokenizer
src_text = ["A long paragraph for summarization.", "Another paragraph for summarization."]
# fmt: off
expected_src_tokens = [68, 35, 111, 114, 113, 106, 35, 115, 100, 117, 100, 106, 117, 100, 115, 107, 35, 105, 114, 117, 35, 118, 120, 112, 112, 100, 117, 108, 125, 100, 119, 108, 114, 113, 49, 1, 0]
# fmt: on
batch = tokenizer(src_text, padding=True, return_tensors=FRAMEWORK)
self.assertIsInstance(batch, BatchEncoding)
if FRAMEWORK != "jax":
result = list(batch.input_ids.numpy()[0])
else:
result = list(batch.input_ids.tolist()[0])
self.assertListEqual(expected_src_tokens, result)
self.assertEqual((2, 37), batch.input_ids.shape)
self.assertEqual((2, 37), batch.attention_mask.shape)
def test_empty_target_text(self):
tokenizer = self.t5_base_tokenizer
src_text = ["A long paragraph for summarization.", "Another paragraph for summarization."]
batch = tokenizer(src_text, padding=True, return_tensors=FRAMEWORK)
# check if input_ids are returned and no decoder_input_ids
self.assertIn("input_ids", batch)
self.assertIn("attention_mask", batch)
self.assertNotIn("decoder_input_ids", batch)
self.assertNotIn("decoder_attention_mask", batch)
def test_max_length_integration(self):
tokenizer = self.t5_base_tokenizer
tgt_text = [
"Summary of the text.",
"Another summary.",
]
with tokenizer.as_target_tokenizer():
targets = tokenizer(
tgt_text, max_length=32, padding="max_length", truncation=True, return_tensors=FRAMEWORK
)
self.assertEqual(32, targets["input_ids"].shape[1])
def test_eos_in_input(self):
tokenizer = self.t5_base_tokenizer
src_text = ["A long paragraph for summarization. </s>"]
tgt_text = ["Summary of the text. </s>"]
# fmt: off
expected_src_tokens = [68, 35, 111, 114, 113, 106, 35, 115, 100, 117, 100, 106, 117, 100, 115, 107, 35, 105, 114, 117, 35, 118, 120, 112, 112, 100, 117, 108, 125, 100, 119, 108, 114, 113, 49, 35, 1]
expected_tgt_tokens = [86, 120, 112, 112, 100, 117, 124, 35, 114, 105, 35, 119, 107, 104, 35, 119, 104, 123, 119, 49, 35, 1]
# fmt: on
batch = tokenizer(src_text)
with tokenizer.as_target_tokenizer():
targets = tokenizer(tgt_text)
self.assertEqual(expected_src_tokens, batch["input_ids"][0])
self.assertEqual(expected_tgt_tokens, targets["input_ids"][0])
# cannot use default save_and_load_tokenzier test method because tokenzier has no vocab
def test_save_and_load_tokenizer(self):
# safety check on max_len default value so we are sure the test works
tokenizers = self.get_tokenizers()
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
self.assertNotEqual(tokenizer.model_max_length, 42)
# Now let's start the test
tokenizers = self.get_tokenizers()
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
# Isolate this from the other tests because we save additional tokens/etc
tmpdirname = tempfile.mkdtemp()
sample_text = " He is very happy, UNwant\u00E9d,running"
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
after_tokens = after_tokenizer.encode(sample_text, add_special_tokens=False)
self.assertListEqual(before_tokens, after_tokens)
shutil.rmtree(tmpdirname)
tokenizers = self.get_tokenizers(model_max_length=42)
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
# Isolate this from the other tests because we save additional tokens/etc
tmpdirname = tempfile.mkdtemp()
sample_text = " He is very happy, UNwant\u00E9d,running"
tokenizer.add_tokens(["bim", "bambam"])
additional_special_tokens = tokenizer.additional_special_tokens
additional_special_tokens.append("new_additional_special_token")
tokenizer.add_special_tokens({"additional_special_tokens": additional_special_tokens})
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
after_tokens = after_tokenizer.encode(sample_text, add_special_tokens=False)
self.assertListEqual(before_tokens, after_tokens)
self.assertIn("new_additional_special_token", after_tokenizer.additional_special_tokens)
self.assertEqual(after_tokenizer.model_max_length, 42)
tokenizer = tokenizer.__class__.from_pretrained(tmpdirname, model_max_length=43)
self.assertEqual(tokenizer.model_max_length, 43)
shutil.rmtree(tmpdirname)
# tokenizer can be instantiated without any pretrained files, so no need for pretrained tokenizer list
def test_pretrained_model_lists(self):
pass
# tokenizer does not have vocabulary
def test_get_vocab(self):
pass
# inputs cannot be pretokenized since ids depend on whole input string and not just on single characters
def test_pretokenized_inputs(self):
pass
# tests all ids in vocab => vocab doesn't exist so unnecessary to test
def test_conversion_reversible(self):
pass
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment