Commit a884b7fa authored by Sylvain Gugger, committed by GitHub

Update the new model template (#6019)

# How to add a new model in 🤗 Transformers

This folder describes the process to add a new model in 🤗 Transformers and provides templates for the required files.

The library is designed to incorporate a variety of models and code bases. As such, the process for adding a new model
usually mostly consists in copy-pasting the relevant original code into the various sections of the templates included
in the present repository.

One important point though is that the library has the following goals impacting the way models are incorporated:

- One specific feature of the API is the capability to run the model and tokenizer inline. The tokenization code thus
  often has to be slightly adapted to allow for running in the Python interpreter.
- The package is also designed to be as self-consistent as possible, with a small and reliable set of package
  dependencies. As a consequence, additional dependencies are usually not allowed when adding a model but can be
  allowed for the inclusion of a new tokenizer (recent examples of dependencies added for tokenizer specificities
  include `sentencepiece` and `sacremoses`). Please make sure to check the existing dependencies when possible before
  adding a new one.

For a quick overview of the general philosophy of the library and its organization, please check the
[QuickStart section of the documentation](https://huggingface.co/transformers/philosophy.html).
# Typical workflow for including a model

Here is an overview of the general workflow:

- [ ] Add model/configuration/tokenization classes.
- [ ] Add conversion scripts.
- [ ] Add tests and a @slow integration test.
- [ ] Document your model.
- [ ] Finalize.

Let's detail what should be done at each step.
## Adding model/configuration/tokenization classes

Here is the workflow for adding model/configuration/tokenization classes (a reference example of how the three pieces
fit together follows the list):

- [ ] Copy the Python files from the present folder to the main folder and rename them, replacing `xxx` with your model
  name.
- [ ] Edit the files to replace `XXX` (with various casing) with your model name.
- [ ] Copy-paste or create a simple configuration class for your model in the `configuration_...` file.
- [ ] Copy-paste or create the code for your model in the `modeling_...` files (PyTorch and TF 2.0).
- [ ] Copy-paste or create a tokenizer class for your model in the `tokenization_...` file.
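The end goal is for your three classes to behave like those of any existing model. As a point of reference (this
example is not part of the template, it simply shows the target API with BERT):

```python
from transformers import BertConfig, BertModel, BertTokenizer

config = BertConfig()                                            # what your `configuration_...` file provides
model = BertModel(config)                                        # what your `modeling_...` file provides
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # what your `tokenization_...` file provides

inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
```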
## Adding conversion scripts

Here is the workflow for the conversion scripts (a sketch of a typical script follows the list):

- [ ] Copy the conversion script (`convert_...`) from the present folder to the main folder.
- [ ] Edit this script to convert your original checkpoint weights to the current PyTorch ones.
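As a rough sketch of what such a script usually boils down to (the checkpoint format, argument names and the
`XxxConfig`/`XxxModel` classes below are placeholders for your own model, not a prescribed implementation):

```python
import argparse

import torch

from transformers import XxxConfig, XxxModel  # placeholders for the classes added in the previous step


def convert_xxx_checkpoint_to_pytorch(checkpoint_path, config_file, pytorch_dump_path):
    # Build a randomly initialized model from the configuration of the original checkpoint.
    config = XxxConfig.from_json_file(config_file)
    model = XxxModel(config)

    # Load the original weights (assumed here to be a plain torch checkpoint) and copy them
    # into the model, renaming/reshaping parameters as needed for your architecture.
    original_state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(original_state_dict, strict=False)

    # Save in the 🤗 Transformers format (config.json + pytorch_model.bin).
    model.save_pretrained(pytorch_dump_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--checkpoint_path", type=str, required=True)
    parser.add_argument("--config_file", type=str, required=True)
    parser.add_argument("--pytorch_dump_path", type=str, required=True)
    args = parser.parse_args()
    convert_xxx_checkpoint_to_pytorch(args.checkpoint_path, args.config_file, args.pytorch_dump_path)
```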
## Adding tests

Here is the workflow for adding tests (a sketch of a @slow integration test follows the list):

- [ ] Copy the Python files from the `tests` sub-folder of the present folder to the `tests` sub-folder of the main
  folder and rename them, replacing `xxx` with your model name.
- [ ] Edit the test files to replace `XXX` (with various casing) with your model name.
- [ ] Edit the test code as needed.
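The @slow integration test compares the output of the ported model on a fixed input against values computed with the
original implementation. A hedged sketch (the model class, checkpoint name, input IDs and expected values are all
placeholders to be replaced with your own):

```python
import unittest

import torch

from transformers import XxxModel
from transformers.testing_utils import slow


class XxxModelIntegrationTest(unittest.TestCase):
    @slow
    def test_inference_no_head(self):
        model = XxxModel.from_pretrained("xxx-base-uncased")  # placeholder checkpoint name
        input_ids = torch.tensor([[0, 345, 232, 328, 740, 140, 1695, 69, 6078, 1588, 2]])
        with torch.no_grad():
            output = model(input_ids)[0]
        # Check the shape and a slice of values against the original implementation.
        self.assertEqual(output.shape, torch.Size((1, 11, 768)))
        expected_slice = torch.tensor([[[-0.05, 0.11, -0.03], [0.02, -0.07, 0.08], [0.01, 0.04, -0.01]]])
        self.assertTrue(torch.allclose(output[:, 1:4, 1:4], expected_slice, atol=1e-4))
```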
## Documenting your model

Here is the workflow for documentation (an illustration of the docstring constants follows the list):

- [ ] Make sure all your arguments are properly documented in your configuration and tokenizer.
- [ ] Most of the documentation of the models is automatically generated; you just have to make sure that
  `XXX_START_DOCSTRING` contains an introduction to the model you're adding and a link to the original
  article, and that `XXX_INPUTS_DOCSTRING` contains all the inputs of your model.
- [ ] Create a new page `xxx.rst` in the folder `docs/source/model_doc` and add this file in `docs/source/index.rst`.

Make sure to check that you have no Sphinx warnings when building the documentation locally, and follow our
[documentation guide](https://github.com/huggingface/transformers/tree/master/docs#writing-documentation---specification).
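For reference, the two constants mentioned above live in the `modeling_...` file and typically look like the following
(the wording below is illustrative, not the template's exact text):

```python
XXX_START_DOCSTRING = r"""
    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
    It was proposed in `this paper <https://arxiv.org/abs/xxxx.xxxxx>`__ (replace with the original article).

    Parameters:
        config (:class:`~transformers.XxxConfig`): Model configuration class with all the parameters of the model.
"""

XXX_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.
        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
            Mask to avoid performing attention on padding token indices.
"""
```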
## Final steps

You can then finish the addition by adding imports for your classes in the common files (a quick sanity check is
sketched after the list):

- [ ] Add imports for all the relevant classes in `__init__.py`.
- [ ] Add your configuration in `configuration_auto.py`.
- [ ] Add your PyTorch and TF 2.0 models respectively in `modeling_auto.py` and `modeling_tf_auto.py`.
- [ ] Add your tokenizer in `tokenization_auto.py`.
- [ ] Add your models and tokenizer to `pipeline.py`.
- [ ] Add a link to your conversion script in the main conversion utility (in `commands/convert.py`).
- [ ] Edit the PyTorch to TF 2.0 conversion script to add your model in the `convert_pytorch_checkpoint_to_tf2.py`
  file.
- [ ] Add a mention of your model in the doc: `README.md` and the documentation itself
  in `docs/source/index.rst` and `docs/source/pretrained_models.rst`.
- [ ] Upload the pretrained weights, configurations and vocabulary files.
- [ ] Create model card(s) for your models on huggingface.co. For those last two steps, check the
  [model sharing documentation](https://huggingface.co/transformers/model_sharing.html).
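Once everything is wired up and the weights are uploaded, a quick sanity check (not part of the checklist; the model
identifier below is a placeholder) is to load your model through the auto classes:

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

config = AutoConfig.from_pretrained("username/xxx-base-uncased")        # placeholder identifier
tokenizer = AutoTokenizer.from_pretrained("username/xxx-base-uncased")
model = AutoModel.from_pretrained("username/xxx-base-uncased")
```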
@@ -16,6 +16,7 @@
import logging
from typing import Callable, Union

from .configuration_utils import PretrainedConfig
@@ -30,85 +31,76 @@ XXX_PRETRAINED_CONFIG_ARCHIVE_MAP = {

class XxxConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.XXXModel`.
    It is used to instantiate a XXX model according to the specified arguments, defining the model
    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
    the XXX `xxx-base-uncased <https://huggingface.co/xxx/xxx-base-uncased>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
    to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
    for more information.

    Args:
        vocab_size (:obj:`int`, optional, defaults to 30522):
            Vocabulary size of the XXX model. Defines the different tokens that
            can be represented by the `input_ids` passed to the forward method of :class:`~transformers.XXXModel`.
        hidden_size (:obj:`int`, optional, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, optional, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (:obj:`int`, optional, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`function`, optional, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler.
            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, optional, defaults to 512):
            The maximum sequence length that this model might ever be used with.
            Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, optional, defaults to 2):
            The vocabulary size of the `token_type_ids` passed into :class:`~transformers.XXXModel`.
        initializer_range (:obj:`float`, optional, defaults to 0.02):
            The standard deviation of the :obj:`truncated_normal_initializer` for initializing all weight matrices.
        layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):
            The epsilon used by the layer normalization layers.
        gradient_checkpointing (:obj:`bool`, optional, defaults to :obj:`False`):
            If :obj:`True`, use gradient checkpointing to save memory at the expense of a slower backward pass.
        kwargs:
            Additional arguments for common configurations, passed to :class:`~transformers.PretrainedConfig`.
    """

    model_type = "xxx"

    def __init__(
        self,
        vocab_size: int = 30522,
        hidden_size: int = 768,
        num_hidden_layers: int = 12,
        num_attention_heads: int = 12,
        hidden_act: Union[str, Callable] = "gelu",
        hidden_dropout_prob: float = 0.1,
        attention_probs_dropout_prob: float = 0.1,
        max_position_embeddings: int = 512,
        type_vocab_size: int = 2,
        initializer_range: float = 0.02,
        layer_norm_epsilon: float = 1e-5,
        gradient_checkpointing: bool = False,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.layer_norm_epsilon = layer_norm_epsilon
        self.gradient_checkpointing = gradient_checkpointing
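A brief usage sketch for the configuration template above (not part of the file itself; the directory name is
arbitrary):

```python
config = XxxConfig(num_hidden_layers=6)             # XxxConfig is the class defined above; override any default
config.save_pretrained("./xxx-tiny")                # writes ./xxx-tiny/config.json
config = XxxConfig.from_pretrained("./xxx-tiny")    # reload it later
```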
@@ -18,6 +18,7 @@
import collections
import logging
import os
from typing import List, Optional

from .tokenization_utils import PreTrainedTokenizer
@@ -77,12 +78,37 @@ def load_vocab(vocab_file):

class XxxTokenizer(PreTrainedTokenizer):
    r"""
    Constructs a XXX tokenizer. Based on XXX.

    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
    should refer to the superclass for more information regarding methods.

    Args:
        vocab_file (:obj:`str`):
            File containing the vocabulary.
        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to lowercase the input when tokenizing.
        do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to do basic tokenization before WordPiece.
        never_split (:obj:`Iterable`, `optional`, defaults to :obj:`None`):
            Collection of tokens which will never be split during tokenization. Only has an effect when
            :obj:`do_basic_tokenize=True`.
        unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
            for sequence classification or for a text and a question for question answering.
            It is also used as the last token of a sequence built with special tokens.
        pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
            The token used for padding, for example when batching sequences of different lengths.
        cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
            The classifier token which is used when doing sequence classification (classification of the whole
            sequence instead of per-token classification). It is the first token of the sequence when built with
            special tokens.
        mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
            The token used for masking values. This is the token used when training this model with masked language
            modeling. This is the token which the model will try to predict.
    """

    vocab_files_names = VOCAB_FILES_NAMES
@@ -94,21 +120,16 @@ class XxxTokenizer(PreTrainedTokenizer):
        self,
        vocab_file,
        do_lower_case=True,
        do_basic_tokenize=True,
        never_split=None,
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        tokenize_chinese_chars=True,
        **kwargs
    ):
        super().__init__(
            unk_token=unk_token,
            sep_token=sep_token,
@@ -121,22 +142,35 @@ class XxxTokenizer(PreTrainedTokenizer):

        if not os.path.isfile(vocab_file):
            raise ValueError(
                "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
                "model use `tokenizer = XxxTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)
            )
        self.vocab = load_vocab(vocab_file)
        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
        self.do_basic_tokenize = do_basic_tokenize
        # Replace and adapt
        # if do_basic_tokenize:
        #     self.basic_tokenizer = BasicTokenizer(
        #         do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=tokenize_chinese_chars
        #     )
        # self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab, **self.added_tokens_encoder)

    def _tokenize(self, text):
        split_tokens = []
        if self.do_basic_tokenize:
            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):

                # If the token is part of the never_split set
                if token in self.basic_tokenizer.never_split:
                    split_tokens.append(token)
                else:
                    split_tokens += self.wordpiece_tokenizer.tokenize(token)
        else:
            split_tokens = self.wordpiece_tokenizer.tokenize(text)
        return split_tokens
@@ -154,13 +188,25 @@ class XxxTokenizer(PreTrainedTokenizer):

        out_string = " ".join(tokens).replace(" ##", "").strip()
        return out_string

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequences for sequence classification tasks
        by concatenating and adding special tokens.
        A BERT sequence has the following format:

        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
                Optional second list of IDs for sequence pairs.

        Returns:
            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
@@ -168,20 +214,23 @@ class XxxTokenizer(PreTrainedTokenizer):

        sep = [self.sep_token_id]
        return cls + token_ids_0 + sep + token_ids_1 + sep

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer ``prepare_for_model`` method.

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of ids.
            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Set to True if the token list is already formatted with special tokens for the model.

        Returns:
            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """

        if already_has_special_tokens:
@@ -196,14 +245,29 @@ class XxxTokenizer(PreTrainedTokenizer):

            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1]

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
        A BERT sequence pair mask has the following format:

        ::

            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
            | first sequence    | second sequence |

        If token_ids_1 is None, only returns the first portion of the mask (0s).

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of ids.
            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
                Optional second list of IDs for sequence pairs.

        Returns:
            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
            sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
@@ -212,7 +276,16 @@ class XxxTokenizer(PreTrainedTokenizer):

        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

    def save_vocabulary(self, vocab_path):
        """
        Save the tokenizer vocabulary (copy of the original vocabulary file) and special tokens file to a directory.

        Args:
            vocab_path (:obj:`str`):
                The directory in which to save the vocabulary.

        Returns:
            :obj:`Tuple(str)`: Paths to the files saved.
        """
        index = 0
        if os.path.isdir(vocab_path):
            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES["vocab_file"])
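A brief illustration of the special-token helpers documented above (not part of the file; the vocabulary path and
tokens are placeholders, and the template's `_tokenize` still needs to be adapted before full tokenization works):

```python
tokenizer = XxxTokenizer("path/to/vocab.txt")   # XxxTokenizer is the class defined above

ids_a = tokenizer.convert_tokens_to_ids(["hello", "world"])
ids_b = tokenizer.convert_tokens_to_ids(["how", "are", "you"])

tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)
# -> [CLS] hello world [SEP] how are you [SEP], as token IDs

tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)
# -> [0, 0, 0, 0, 1, 1, 1, 1]: 0 for the first segment (with [CLS] and its [SEP]), 1 for the second
```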