Unverified Commit 6e685978 authored by NielsRogge's avatar NielsRogge Committed by GitHub
Browse files

Add CANINE (#12024)



* First pass

* More progress

* Add support for local attention

* More improvements

* More improvements

* Conversion script working

* Add CanineTokenizer

* Make style & quality

* First draft of integration test

* Remove decoder test

* Improve tests

* Add documentation

* Mostly docs improvements

* Add CanineTokenizer tests

* Fix most tests on GPU, improve upsampling projection

* Address most comments by @dhgarrette

* Remove decoder logic

* Improve Canine tests, improve docs of CanineConfig

* All tokenizer tests passing

* Make fix-copies and fix tokenizer tests

* Fix test_model_outputs_equivalence test

* Apply suggestions from @sgugger's review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Address some more comments

* Add support for hidden_states and attentions of shallow encoders

* Define custom CanineModelOutputWithPooling, tests pass

* First pass

* More progress

* Add support for local attention

* More improvements

* More improvements

* Conversion script working

* Add CanineTokenizer

* Make style & quality

* First draft of integration test

* Remove decoder test

* Improve tests

* Add documentation

* Mostly docs improvements

* Add CanineTokenizer tests

* Fix most tests on GPU, improve upsampling projection

* Address most comments by @dhgarrette

* Remove decoder logic

* Improve Canine tests, improve docs of CanineConfig

* All tokenizer tests passing

* Make fix-copies and fix tokenizer tests

* Fix test_model_outputs_equivalence test

* Apply suggestions from @sgugger's review
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Address some more comments

* Make conversion script work for Canine-c too

* Fix tokenizer tests

* Remove file
Co-authored-by: default avatarSylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 69f57015
......@@ -212,7 +212,8 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[BORT](https://huggingface.co/transformers/model_doc/bort.html)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry.
1. **[ByT5](https://huggingface.co/transformers/model_doc/byt5.html)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
1. **[CamemBERT](https://huggingface.co/transformers/model_doc/camembert.html)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CLIP](https://huggingface.co/transformers/model_doc/clip.html)** from (OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CANINE](https://huggingface.co/transformers/model_doc/canine.html)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[CLIP](https://huggingface.co/transformers/model_doc/clip.html)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[ConvBERT](https://huggingface.co/transformers/model_doc/convbert.html)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
1. **[CPM](https://huggingface.co/transformers/model_doc/cpm.html)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
1. **[CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
......
This diff is collapsed.
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
CANINE
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The CANINE model was proposed in `CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
Representation <https://arxiv.org/abs/2103.06874>`__ by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It's
among the first papers that trains a Transformer without using an explicit tokenization step (such as Byte Pair
Encoding (BPE), WordPiece or SentencePiece). Instead, the model is trained directly at a Unicode character-level.
Training at a character-level inevitably comes with a longer sequence length, which CANINE solves with an efficient
downsampling strategy, before applying a deep Transformer encoder.
The abstract from the paper is the following:
*Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models
still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword
lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all
languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE,
a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a
pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.
To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input
sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by
2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.*
Tips:
- CANINE uses no less than 3 Transformer encoders internally: 2 "shallow" encoders (which only consist of a single
layer) and 1 "deep" encoder (which is a regular BERT encoder). First, a "shallow" encoder is used to contextualize
the character embeddings, using local attention. Next, after downsampling, a "deep" encoder is applied. Finally,
after upsampling, a "shallow" encoder is used to create the final character embeddings. Details regarding up- and
downsampling can be found in the paper.
- CANINE uses a max sequence length of 2048 characters by default. One can use :class:`~transformers.CanineTokenizer`
to prepare text for the model.
- Classification can be done by placing a linear layer on top of the final hidden state of the special [CLS] token
(which has a predefined Unicode code point). For token classification tasks however, the downsampled sequence of
tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
details for this can be found in the paper.
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/google-research/language/tree/master/language/canine>`__.
Example
_______________________________________________________________________________________________________________________
CANINE works on raw characters, so it can be used without a tokenizer:
.. code-block::
from transformers import CanineModel
import torch
model = CanineModel.from_pretrained('google/canine-s') # model pre-trained with autoregressive character loss
text = "hello world"
# use Python's built-in ord() function to turn each character into its unicode code point id
input_ids = torch.tensor([[ord(char) for char in text]])
outputs = model(input_ids) # forward pass
pooled_output = outputs.pooler_output
sequence_output = outputs.last_hidden_state
For batched inference and training, it is however recommended to make use of the tokenizer (to pad/truncate all
sequences to the same length):
.. code-block::
from transformers import CanineTokenizer, CanineModel
model = CanineModel.from_pretrained('google/canine-s')
tokenizer = CanineTokenizer.from_pretrained('google/canine-s')
inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
outputs = model(**encoding) # forward pass
pooled_output = outputs.pooler_output
sequence_output = outputs.last_hidden_state
CANINE specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.canine.modeling_canine.CanineModelOutputWithPooling
:members:
CanineConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CanineConfig
:members:
CanineTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CanineTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences
CanineModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CanineModel
:members: forward
CanineForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CanineForSequenceClassification
:members: forward
CanineForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CanineForMultipleChoice
:members: forward
CanineForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CanineForTokenClassification
:members: forward
CanineForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.CanineForQuestionAnswering
:members: forward
......@@ -170,6 +170,7 @@ _import_structure = {
],
"models.byt5": ["ByT5Tokenizer"],
"models.camembert": ["CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "CamembertConfig"],
"models.canine": ["CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP", "CanineConfig", "CanineTokenizer"],
"models.clip": [
"CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"CLIPConfig",
......@@ -505,7 +506,6 @@ if is_torch_available():
"load_tf_weights_in_albert",
]
)
_import_structure["models.auto"].extend(
[
"MODEL_FOR_CAUSAL_LM_MAPPING",
......@@ -632,6 +632,19 @@ if is_torch_available():
"CamembertModel",
]
)
_import_structure["models.canine"].extend(
[
"CANINE_PRETRAINED_MODEL_ARCHIVE_LIST",
"CanineForMultipleChoice",
"CanineForQuestionAnswering",
"CanineForSequenceClassification",
"CanineForTokenClassification",
"CanineLayer",
"CanineModel",
"CaninePreTrainedModel",
"load_tf_weights_in_canine",
]
)
_import_structure["models.clip"].extend(
[
"CLIP_PRETRAINED_MODEL_ARCHIVE_LIST",
......@@ -1756,6 +1769,7 @@ if TYPE_CHECKING:
)
from .models.byt5 import ByT5Tokenizer
from .models.camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from .models.canine import CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP, CanineConfig, CanineTokenizer
from .models.clip import (
CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
CLIPConfig,
......@@ -2156,6 +2170,17 @@ if TYPE_CHECKING:
CamembertForTokenClassification,
CamembertModel,
)
from .models.canine import (
CANINE_PRETRAINED_MODEL_ARCHIVE_LIST,
CanineForMultipleChoice,
CanineForQuestionAnswering,
CanineForSequenceClassification,
CanineForTokenClassification,
CanineLayer,
CanineModel,
CaninePreTrainedModel,
load_tf_weights_in_canine,
)
from .models.clip import (
CLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
CLIPModel,
......
......@@ -30,6 +30,7 @@ from . import (
blenderbot,
blenderbot_small,
camembert,
canine,
clip,
convbert,
cpm,
......
......@@ -33,6 +33,7 @@ from ..blenderbot_small.configuration_blenderbot_small import (
BlenderbotSmallConfig,
)
from ..camembert.configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from ..canine.configuration_canine import CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP, CanineConfig
from ..clip.configuration_clip import CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP, CLIPConfig
from ..convbert.configuration_convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig
from ..ctrl.configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
......@@ -96,6 +97,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
for pretrained_map in [
# Add archive maps here
VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP,
ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
BIGBIRD_PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP,
......@@ -155,6 +157,7 @@ CONFIG_MAPPING = OrderedDict(
[
# Add configs here
("visual_bert", VisualBertConfig),
("canine", CanineConfig),
("roformer", RoFormerConfig),
("clip", CLIPConfig),
("bigbird_pegasus", BigBirdPegasusConfig),
......@@ -220,6 +223,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
[
# Add full (and cased) model names here
("visual_bert", "VisualBert"),
("canine", "Canine"),
("roformer", "RoFormer"),
("clip", "CLIP"),
("bigbird_pegasus", "BigBirdPegasus"),
......
......@@ -81,6 +81,13 @@ from ..camembert.modeling_camembert import (
CamembertForTokenClassification,
CamembertModel,
)
from ..canine.modeling_canine import (
CanineForMultipleChoice,
CanineForQuestionAnswering,
CanineForSequenceClassification,
CanineForTokenClassification,
CanineModel,
)
from ..clip.modeling_clip import CLIPModel
from ..convbert.modeling_convbert import (
ConvBertForMaskedLM,
......@@ -312,6 +319,7 @@ from .configuration_auto import (
BlenderbotConfig,
BlenderbotSmallConfig,
CamembertConfig,
CanineConfig,
CLIPConfig,
ConvBertConfig,
CTRLConfig,
......@@ -371,6 +379,7 @@ MODEL_MAPPING = OrderedDict(
[
# Base model mapping
(VisualBertConfig, VisualBertModel),
(CanineConfig, CanineModel),
(RoFormerConfig, RoFormerModel),
(CLIPConfig, CLIPModel),
(BigBirdPegasusConfig, BigBirdPegasusModel),
......@@ -624,6 +633,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING = OrderedDict(
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
[
# Model for Sequence Classification mapping
(CanineConfig, CanineForSequenceClassification),
(RoFormerConfig, RoFormerForSequenceClassification),
(BigBirdPegasusConfig, BigBirdPegasusForSequenceClassification),
(BigBirdConfig, BigBirdForSequenceClassification),
......@@ -664,6 +674,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
[
# Model for Question Answering mapping
(CanineConfig, CanineForQuestionAnswering),
(RoFormerConfig, RoFormerForQuestionAnswering),
(BigBirdPegasusConfig, BigBirdPegasusForQuestionAnswering),
(BigBirdConfig, BigBirdForQuestionAnswering),
......@@ -705,6 +716,7 @@ MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING = OrderedDict(
MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
[
# Model for Token Classification mapping
(CanineConfig, CanineForTokenClassification),
(RoFormerConfig, RoFormerForTokenClassification),
(BigBirdConfig, BigBirdForTokenClassification),
(ConvBertConfig, ConvBertForTokenClassification),
......@@ -735,6 +747,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
[
# Model for Multiple Choice mapping
(CanineConfig, CanineForMultipleChoice),
(RoFormerConfig, RoFormerForMultipleChoice),
(BigBirdConfig, BigBirdForMultipleChoice),
(ConvBertConfig, ConvBertForMultipleChoice),
......
......@@ -37,6 +37,7 @@ from ..bertweet.tokenization_bertweet import BertweetTokenizer
from ..blenderbot.tokenization_blenderbot import BlenderbotTokenizer
from ..blenderbot_small.tokenization_blenderbot_small import BlenderbotSmallTokenizer
from ..byt5.tokenization_byt5 import ByT5Tokenizer
from ..canine.tokenization_canine import CanineTokenizer
from ..convbert.tokenization_convbert import ConvBertTokenizer
from ..ctrl.tokenization_ctrl import CTRLTokenizer
from ..deberta.tokenization_deberta import DebertaTokenizer
......@@ -78,6 +79,7 @@ from .configuration_auto import (
BlenderbotConfig,
BlenderbotSmallConfig,
CamembertConfig,
CanineConfig,
ConvBertConfig,
CTRLConfig,
DebertaConfig,
......@@ -294,6 +296,7 @@ TOKENIZER_MAPPING = OrderedDict(
(GPTNeoConfig, (GPT2Tokenizer, GPT2TokenizerFast)),
(LukeConfig, (LukeTokenizer, None)),
(BigBirdPegasusConfig, (PegasusTokenizer, PegasusTokenizerFast)),
(CanineConfig, (CanineTokenizer, None)),
]
)
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...file_utils import _BaseLazyModule, is_tokenizers_available, is_torch_available
_import_structure = {
"configuration_canine": ["CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP", "CanineConfig"],
"tokenization_canine": ["CanineTokenizer"],
}
if is_torch_available():
_import_structure["modeling_canine"] = [
"CANINE_PRETRAINED_MODEL_ARCHIVE_LIST",
"CanineForMultipleChoice",
"CanineForQuestionAnswering",
"CanineForSequenceClassification",
"CanineForTokenClassification",
"CanineLayer",
"CanineModel",
"CaninePreTrainedModel",
"load_tf_weights_in_canine",
]
if TYPE_CHECKING:
from .configuration_canine import CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP, CanineConfig
from .tokenization_canine import CanineTokenizer
if is_torch_available():
from .modeling_canine import (
CANINE_PRETRAINED_MODEL_ARCHIVE_LIST,
CanineForMultipleChoice,
CanineForQuestionAnswering,
CanineForSequenceClassification,
CanineForTokenClassification,
CanineLayer,
CanineModel,
CaninePreTrainedModel,
load_tf_weights_in_canine,
)
else:
import importlib
import os
import sys
class _LazyModule(_BaseLazyModule):
"""
Module class that surfaces all objects but only performs associated imports when the objects are requested.
"""
__file__ = globals()["__file__"]
__path__ = [os.path.dirname(__file__)]
def _get_module(self, module_name: str):
return importlib.import_module("." + module_name, self.__name__)
sys.modules[__name__] = _LazyModule(__name__, _import_structure)
# coding=utf-8
# Copyright Google AI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" CANINE model configuration """
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"google/canine-s": "https://huggingface.co/google/canine-s/resolve/main/config.json",
# See all CANINE models at https://huggingface.co/models?filter=canine
}
class CanineConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.CanineModel`. It is used to
instantiate an CANINE model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the CANINE `google/canine-s
<https://huggingface.co/google/canine-s>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args:
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimension of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the deep Transformer encoder.
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoders.
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoders.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilitiy for all fully connected layers in the embeddings, encoders, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 16384):
The maximum sequence length that this model might ever be used with.
type_vocab_size (:obj:`int`, `optional`, defaults to 16):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.CanineModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
If :obj:`True`, use gradient checkpointing to save memory at the expense of slower backward pass.
downsampling_rate (:obj:`int`, `optional`, defaults to 4):
The rate at which to downsample the original character sequence length before applying the deep Transformer
encoder.
upsampling_kernel_size (:obj:`int`, `optional`, defaults to 4):
The kernel size (i.e. the number of characters in each window) of the convolutional projection layer when
projecting back from :obj:`hidden_size`*2 to :obj:`hidden_size`.
num_hash_functions (:obj:`int`, `optional`, defaults to 8):
The number of hash functions to use. Each hash function has its own embedding matrix.
num_hash_buckets (:obj:`int`, `optional`, defaults to 16384):
The number of hash buckets to use.
local_transformer_stride (:obj:`int`, `optional`, defaults to 128):
The stride of the local attention of the first shallow Transformer encoder. Defaults to 128 for good
TPU/XLA memory alignment.
Example::
>>> from transformers import CanineModel, CanineConfig
>>> # Initializing a CANINE google/canine-s style configuration
>>> configuration = CanineConfig()
>>> # Initializing a model from the google/canine-s style configuration
>>> model = CanineModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "canine"
def __init__(
self,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=16384,
type_vocab_size=16,
initializer_range=0.02,
layer_norm_eps=1e-12,
use_cache=True,
is_encoder_decoder=False,
pad_token_id=0,
bos_token_id=0xE000,
eos_token_id=0xE001,
downsampling_rate=4,
upsampling_kernel_size=4,
num_hash_functions=8,
num_hash_buckets=16384,
local_transformer_stride=128, # Good TPU/XLA memory alignment.
**kwargs
):
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.type_vocab_size = type_vocab_size
self.layer_norm_eps = layer_norm_eps
self.use_cache = use_cache
# Character config:
self.downsampling_rate = downsampling_rate
self.upsampling_kernel_size = upsampling_kernel_size
self.num_hash_functions = num_hash_functions
self.num_hash_buckets = num_hash_buckets
self.local_transformer_stride = local_transformer_stride
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert CANINE checkpoint."""
import argparse
from transformers import CanineConfig, CanineModel, CanineTokenizer, load_tf_weights_in_canine
from transformers.utils import logging
logging.set_verbosity_info()
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, pytorch_dump_path):
# Initialize PyTorch model
config = CanineConfig()
model = CanineModel(config)
model.eval()
print(f"Building PyTorch model from configuration: {config}")
# Load weights from tf checkpoint
load_tf_weights_in_canine(model, config, tf_checkpoint_path)
# Save pytorch-model (weights and configuration)
print(f"Save PyTorch model to {pytorch_dump_path}")
model.save_pretrained(pytorch_dump_path)
# Save tokenizer files
tokenizer = CanineTokenizer()
print(f"Save tokenizer files to {pytorch_dump_path}")
tokenizer.save_pretrained(pytorch_dump_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--tf_checkpoint_path",
default=None,
type=str,
required=True,
help="Path to the TensorFlow checkpoint. Should end with model.ckpt",
)
parser.add_argument(
"--pytorch_dump_path",
default=None,
type=str,
required=True,
help="Path to a folder where the PyTorch model will be placed.",
)
args = parser.parse_args()
convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.pytorch_dump_path)
This diff is collapsed.
# coding=utf-8
# Copyright Google AI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for CANINE."""
from typing import Dict, List, Optional
from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging
logger = logging.get_logger(__name__)
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
"nielsr/canine-s": 2048,
}
# Unicode defines 1,114,112 total “codepoints”
UNICODE_VOCAB_SIZE = 1114112
# Below: Constants defining canonical codepoints for special, pseudo-characters.
# Copied from https://github.com/google-research/language/blob/master/language/canine/special_codepoints.py
PAD = 0
CLS = 0xE000
SEP = 0xE001
BOS = 0xE002
MASK = 0xE003
RESERVED = 0xE004
# Maps special codepoints to human-readable names.
SPECIAL_CODEPOINTS: Dict[int, str] = {
# Special symbols are represented using codepoints values that are valid,
# but designated as "Private Use", meaning that they will never be assigned
# characters by the Unicode Consortium, and are thus safe for use here.
#
# NOTE: Do *NOT* add any sort of [UNK_CHAR] here. They are explicitly
# excluded and should fail with a hard error.
CLS: "[CLS]",
SEP: "[SEP]",
BOS: "[BOS]",
MASK: "[MASK]",
PAD: "[PAD]",
RESERVED: "[RESERVED]",
}
# Maps special codepoint human-readable names to their codepoint values.
SPECIAL_CODEPOINTS_BY_NAME: Dict[str, int] = {name: codepoint for codepoint, name in SPECIAL_CODEPOINTS.items()}
class CanineTokenizer(PreTrainedTokenizer):
r"""
Construct a CANINE tokenizer (i.e. a character splitter). It turns text into a sequence of characters, and then
converts each character into its Unicode code point.
:class:`~transformers.CanineTokenizer` inherits from :class:`~transformers.PreTrainedTokenizer`.
Refer to superclass :class:`~transformers.PreTrainedTokenizer` for usage examples and documentation concerning
parameters.
Args:
model_max_length (:obj:`int`, `optional`, defaults to 2048):
The maximum sentence length the model accepts.
"""
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
def __init__(
self,
bos_token=chr(CLS),
eos_token=chr(SEP),
sep_token=chr(SEP),
cls_token=chr(CLS),
pad_token=chr(PAD),
mask_token=chr(MASK),
add_prefix_space=False,
model_max_length=2048,
**kwargs
):
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
# Mask token behave like a normal word, i.e. include the space before it
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
add_prefix_space=add_prefix_space,
model_max_length=model_max_length,
**kwargs,
)
# Creates a mapping for looking up the IDs of special symbols.
self._special_codepoints: Dict[str, int] = {}
for codepoint, name in SPECIAL_CODEPOINTS.items():
self._special_codepoints[name] = codepoint
# Creates a mapping for looking up the string forms of special symbol IDs.
self._special_codepoint_strings: Dict[int, str] = {
codepoint: name for name, codepoint in self._special_codepoints.items()
}
self._unicode_vocab_size = UNICODE_VOCAB_SIZE
self._num_special_tokens = len(self._special_codepoints)
@property
def vocab_size(self) -> int:
return self._unicode_vocab_size
def _tokenize(self, text: str) -> List[str]:
"""Tokenize a string (i.e. perform character splitting)."""
return list(text)
def _convert_token_to_id(self, token: str) -> int:
"""Converts a token (i.e. a Unicode character) in an id (i.e. its integer Unicode code point value)."""
try:
return ord(token)
except TypeError:
raise ValueError(f"invalid token: '{token}'")
def _convert_id_to_token(self, index: int) -> str:
"""
Converts a Unicode code point (integer) in a token (str). In case it's a special code point, convert to
human-readable format.
"""
try:
if index in SPECIAL_CODEPOINTS:
return SPECIAL_CODEPOINTS[index]
return chr(index)
except TypeError:
raise ValueError(f"invalid id: {index}")
def convert_tokens_to_string(self, tokens):
return "".join(tokens)
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. A CANINE sequence has the following format:
- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
result = cls + token_ids_0 + sep
if token_ids_1 is not None:
result += token_ids_1 + sep
return result
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
)
result = [1] + ([0] * len(token_ids_0)) + [1]
if token_ids_1 is not None:
result += ([0] * len(token_ids_1)) + [1]
return result
def create_token_type_ids_from_sequences(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A CANINE
sequence pair mask has the following format:
::
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
result = len(cls + token_ids_0 + sep) * [0]
if token_ids_1 is not None:
result += len(token_ids_1 + sep) * [1]
return result
# CanineTokenizer has no vocab file
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None):
return ()
......@@ -995,6 +995,72 @@ class CamembertModel:
requires_backends(cls, ["torch"])
CANINE_PRETRAINED_MODEL_ARCHIVE_LIST = None
class CanineForMultipleChoice:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class CanineForQuestionAnswering:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class CanineForSequenceClassification:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class CanineForTokenClassification:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class CanineLayer:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class CanineModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class CaninePreTrainedModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def load_tf_weights_in_canine(*args, **kwargs):
requires_backends(load_tf_weights_in_canine, ["torch"])
CLIP_PRETRAINED_MODEL_ARCHIVE_LIST = None
......
......@@ -6,6 +6,7 @@ from collections import OrderedDict
MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = OrderedDict(
[
("CanineConfig", "CanineForQuestionAnswering"),
("RoFormerConfig", "RoFormerForQuestionAnswering"),
("BigBirdPegasusConfig", "BigBirdPegasusForQuestionAnswering"),
("BigBirdConfig", "BigBirdForQuestionAnswering"),
......@@ -112,6 +113,7 @@ MODEL_FOR_MASKED_LM_MAPPING_NAMES = OrderedDict(
MODEL_FOR_MULTIPLE_CHOICE_MAPPING_NAMES = OrderedDict(
[
("CanineConfig", "CanineForMultipleChoice"),
("RoFormerConfig", "RoFormerForMultipleChoice"),
("BigBirdConfig", "BigBirdForMultipleChoice"),
("ConvBertConfig", "ConvBertForMultipleChoice"),
......@@ -175,6 +177,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
[
("CanineConfig", "CanineForSequenceClassification"),
("RoFormerConfig", "RoFormerForSequenceClassification"),
("BigBirdPegasusConfig", "BigBirdPegasusForSequenceClassification"),
("BigBirdConfig", "BigBirdForSequenceClassification"),
......@@ -222,6 +225,7 @@ MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING_NAMES = OrderedDict(
MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
[
("CanineConfig", "CanineForTokenClassification"),
("RoFormerConfig", "RoFormerForTokenClassification"),
("BigBirdConfig", "BigBirdForTokenClassification"),
("ConvBertConfig", "ConvBertForTokenClassification"),
......@@ -252,6 +256,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
MODEL_MAPPING_NAMES = OrderedDict(
[
("VisualBertConfig", "VisualBertModel"),
("CanineConfig", "CanineModel"),
("RoFormerConfig", "RoFormerModel"),
("CLIPConfig", "CLIPModel"),
("BigBirdPegasusConfig", "BigBirdPegasusModel"),
......
This diff is collapsed.
# coding=utf-8
# Copyright 2021 Google AI and HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import shutil
import tempfile
import unittest
from transformers import BatchEncoding, CanineTokenizer
from transformers.file_utils import cached_property
from transformers.testing_utils import require_tokenizers, require_torch
from transformers.tokenization_utils import AddedToken
from .test_tokenization_common import TokenizerTesterMixin
class CanineTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer_class = CanineTokenizer
test_rust_tokenizer = False
def setUp(self):
super().setUp()
tokenizer = CanineTokenizer()
tokenizer.save_pretrained(self.tmpdirname)
@cached_property
def canine_tokenizer(self):
# TODO replace nielsr by google
return CanineTokenizer.from_pretrained("nielsr/canine-s")
def get_tokenizer(self, **kwargs) -> CanineTokenizer:
return self.tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
@require_torch
def test_prepare_batch_integration(self):
tokenizer = self.canine_tokenizer
src_text = ["Life is like a box of chocolates.", "You never know what you're gonna get."]
# fmt: off
expected_src_tokens = [57344, 76, 105, 102, 101, 32, 105, 115, 32, 108, 105, 107, 101, 32, 97, 32, 98, 111, 120, 32, 111, 102, 32, 99, 104, 111, 99, 111, 108, 97, 116, 101, 115, 46, 57345, 0, 0, 0, 0]
# fmt: on
batch = tokenizer(src_text, padding=True, return_tensors="pt")
self.assertIsInstance(batch, BatchEncoding)
result = list(batch.input_ids.numpy()[0])
self.assertListEqual(expected_src_tokens, result)
self.assertEqual((2, 39), batch.input_ids.shape)
self.assertEqual((2, 39), batch.attention_mask.shape)
@require_torch
def test_encoding_keys(self):
tokenizer = self.canine_tokenizer
src_text = ["Once there was a man.", "He wrote a test in HuggingFace Tranformers."]
batch = tokenizer(src_text, padding=True, return_tensors="pt")
# check if input_ids, attention_mask and token_type_ids are returned
self.assertIn("input_ids", batch)
self.assertIn("attention_mask", batch)
self.assertIn("token_type_ids", batch)
@require_torch
def test_max_length_integration(self):
tokenizer = self.canine_tokenizer
tgt_text = [
"What's the weater?",
"It's about 25 degrees.",
]
with tokenizer.as_target_tokenizer():
targets = tokenizer(tgt_text, max_length=32, padding="max_length", truncation=True, return_tensors="pt")
self.assertEqual(32, targets["input_ids"].shape[1])
# cannot use default save_and_load_tokenzier test method because tokenzier has no vocab
def test_save_and_load_tokenizer(self):
# safety check on max_len default value so we are sure the test works
tokenizers = self.get_tokenizers()
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
self.assertNotEqual(tokenizer.model_max_length, 42)
# Now let's start the test
tokenizers = self.get_tokenizers()
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
# Isolate this from the other tests because we save additional tokens/etc
tmpdirname = tempfile.mkdtemp()
sample_text = " He is very happy, UNwant\u00E9d,running"
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
after_tokens = after_tokenizer.encode(sample_text, add_special_tokens=False)
self.assertListEqual(before_tokens, after_tokens)
shutil.rmtree(tmpdirname)
tokenizers = self.get_tokenizers(model_max_length=42)
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
# Isolate this from the other tests because we save additional tokens/etc
tmpdirname = tempfile.mkdtemp()
sample_text = " He is very happy, UNwant\u00E9d,running"
additional_special_tokens = tokenizer.additional_special_tokens
# We can add a new special token for Canine as follows:
new_additional_special_token = chr(0xE007)
additional_special_tokens.append(new_additional_special_token)
tokenizer.add_special_tokens({"additional_special_tokens": additional_special_tokens})
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
after_tokens = after_tokenizer.encode(sample_text, add_special_tokens=False)
self.assertListEqual(before_tokens, after_tokens)
self.assertIn(new_additional_special_token, after_tokenizer.additional_special_tokens)
self.assertEqual(after_tokenizer.model_max_length, 42)
tokenizer = tokenizer.__class__.from_pretrained(tmpdirname, model_max_length=43)
self.assertEqual(tokenizer.model_max_length, 43)
shutil.rmtree(tmpdirname)
def test_add_special_tokens(self):
tokenizers = self.get_tokenizers(do_lower_case=False)
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
input_text, ids = self.get_clean_sequence(tokenizer)
# a special token for Canine can be defined as follows:
SPECIAL_TOKEN = 0xE005
special_token = chr(SPECIAL_TOKEN)
tokenizer.add_special_tokens({"cls_token": special_token})
encoded_special_token = tokenizer.encode(special_token, add_special_tokens=False)
self.assertEqual(len(encoded_special_token), 1)
text = tokenizer.decode(ids + encoded_special_token, clean_up_tokenization_spaces=False)
encoded = tokenizer.encode(text, add_special_tokens=False)
input_encoded = tokenizer.encode(input_text, add_special_tokens=False)
special_token_id = tokenizer.encode(special_token, add_special_tokens=False)
self.assertEqual(encoded, input_encoded + special_token_id)
decoded = tokenizer.decode(encoded, skip_special_tokens=True)
self.assertTrue(special_token not in decoded)
@require_tokenizers
def test_added_token_serializable(self):
tokenizers = self.get_tokenizers(do_lower_case=False)
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
# a special token for Canine can be defined as follows:
NEW_TOKEN = 0xE006
new_token = chr(NEW_TOKEN)
new_token = AddedToken(new_token, lstrip=True)
tokenizer.add_special_tokens({"additional_special_tokens": [new_token]})
with tempfile.TemporaryDirectory() as tmp_dir_name:
tokenizer.save_pretrained(tmp_dir_name)
tokenizer.from_pretrained(tmp_dir_name)
@require_tokenizers
def test_encode_decode_with_spaces(self):
tokenizers = self.get_tokenizers(do_lower_case=False)
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
input = "hello world"
if self.space_between_special_tokens:
output = "[CLS] hello world [SEP]"
else:
output = input
encoded = tokenizer.encode(input, add_special_tokens=False)
decoded = tokenizer.decode(encoded, spaces_between_special_tokens=self.space_between_special_tokens)
self.assertIn(decoded, [output, output.lower()])
# tokenizer has a fixed vocab_size (namely all possible unicode code points)
def test_add_tokens_tokenizer(self):
pass
# CanineTokenizer does not support do_lower_case = True, as each character has its own Unicode code point
# ("b" and "B" for example have different Unicode code points)
def test_added_tokens_do_lower_case(self):
pass
# CanineModel does not support the get_input_embeddings nor the get_vocab method
def test_np_encode_plus_sent_to_model(self):
pass
# CanineModel does not support the get_input_embeddings nor the get_vocab method
def test_torch_encode_plus_sent_to_model(self):
pass
# tokenizer can be instantiated without any pretrained files, so no need for pretrained tokenizer list
def test_pretrained_model_lists(self):
pass
# tokenizer does not have vocabulary
def test_get_vocab(self):
pass
# inputs cannot be pretokenized since ids depend on whole input string and not just on single characters
def test_pretokenized_inputs(self):
pass
# tests all ids in vocab => vocab doesn't exist so unnecessary to test
def test_conversion_reversible(self):
pass
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment