Commit a884b7fa authored by Sylvain Gugger, committed by GitHub

Update the new model template (#6019)

# How to add a new model in 🤗 Transformers

This folder describes the process to add a new model in 🤗 Transformers and provides templates for the required files.

The library is designed to incorporate a variety of models and code bases. As such, the process for adding a new model
usually mostly consists in copy-pasting the relevant original code into the various sections of the templates included
in the present repository.

One important point though is that the library has the following goals impacting the way models are incorporated:

- One specific feature of the API is the capability to run the model and tokenizer inline. The tokenization code thus
  often has to be slightly adapted to allow for running in the Python interpreter.
- The package is also designed to be as self-consistent as possible, with a small and reliable set of package
  dependencies. As a consequence, additional dependencies are usually not allowed when adding a model but can be
  allowed for the inclusion of a new tokenizer (recent examples of dependencies added for tokenizer specificities
  include `sentencepiece` and `sacremoses`). Please make sure to check the existing dependencies when possible before
  adding a new one.

For a quick overview of the general philosophy of the library and its organization, please check the
[QuickStart section of the documentation](https://huggingface.co/transformers/philosophy.html).
# Typical workflow for including a model

Here is an overview of the general workflow:

- [ ] Add model/configuration/tokenization classes.
- [ ] Add conversion scripts.
- [ ] Add tests and a @slow integration test.
- [ ] Document your model.
- [ ] Finalize.

Let's detail what should be done at each step.
## Adding model/configuration/tokenization classes

Here is the workflow for adding model/configuration/tokenization classes (a reference example of how the three pieces
fit together follows the list):

- [ ] Copy the Python files from the present folder to the main folder and rename them, replacing `xxx` with your model
  name.
- [ ] Edit the files to replace `XXX` (with various casing) with your model name.
- [ ] Copy-paste or create a simple configuration class for your model in the `configuration_...` file.
- [ ] Copy-paste or create the code for your model in the `modeling_...` files (PyTorch and TF 2.0).
- [ ] Copy-paste or create a tokenizer class for your model in the `tokenization_...` file.
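The end goal is for your three classes to behave like those of any existing model. As a point of reference (this
example is not part of the template, it simply shows the target API with BERT):

```python
from transformers import BertConfig, BertModel, BertTokenizer

config = BertConfig()                                            # what your `configuration_...` file provides
model = BertModel(config)                                        # what your `modeling_...` file provides
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # what your `tokenization_...` file provides

inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
```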
## Adding conversion scripts

Here is the workflow for the conversion scripts (a sketch of a typical script follows the list):

- [ ] Copy the conversion script (`convert_...`) from the present folder to the main folder.
- [ ] Edit this script to convert your original checkpoint weights to the current PyTorch ones.
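As a rough sketch of what such a script usually boils down to (the checkpoint format, argument names and the
`XxxConfig`/`XxxModel` classes below are placeholders for your own model, not a prescribed implementation):

```python
import argparse

import torch

from transformers import XxxConfig, XxxModel  # placeholders for the classes added in the previous step


def convert_xxx_checkpoint_to_pytorch(checkpoint_path, config_file, pytorch_dump_path):
    # Build a randomly initialized model from the configuration of the original checkpoint.
    config = XxxConfig.from_json_file(config_file)
    model = XxxModel(config)

    # Load the original weights (assumed here to be a plain torch checkpoint) and copy them
    # into the model, renaming/reshaping parameters as needed for your architecture.
    original_state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(original_state_dict, strict=False)

    # Save in the 🤗 Transformers format (config.json + pytorch_model.bin).
    model.save_pretrained(pytorch_dump_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--checkpoint_path", type=str, required=True)
    parser.add_argument("--config_file", type=str, required=True)
    parser.add_argument("--pytorch_dump_path", type=str, required=True)
    args = parser.parse_args()
    convert_xxx_checkpoint_to_pytorch(args.checkpoint_path, args.config_file, args.pytorch_dump_path)
```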
## Adding tests

Here is the workflow for adding tests (a sketch of a @slow integration test follows the list):

- [ ] Copy the Python files from the `tests` sub-folder of the present folder to the `tests` sub-folder of the main
  folder and rename them, replacing `xxx` with your model name.
- [ ] Edit the test files to replace `XXX` (with various casing) with your model name.
- [ ] Edit the test code as needed.
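The @slow integration test compares the output of the ported model on a fixed input against values computed with the
original implementation. A hedged sketch (the model class, checkpoint name, input IDs and expected values are all
placeholders to be replaced with your own):

```python
import unittest

import torch

from transformers import XxxModel
from transformers.testing_utils import slow


class XxxModelIntegrationTest(unittest.TestCase):
    @slow
    def test_inference_no_head(self):
        model = XxxModel.from_pretrained("xxx-base-uncased")  # placeholder checkpoint name
        input_ids = torch.tensor([[0, 345, 232, 328, 740, 140, 1695, 69, 6078, 1588, 2]])
        with torch.no_grad():
            output = model(input_ids)[0]
        # Check the shape and a slice of values against the original implementation.
        self.assertEqual(output.shape, torch.Size((1, 11, 768)))
        expected_slice = torch.tensor([[[-0.05, 0.11, -0.03], [0.02, -0.07, 0.08], [0.01, 0.04, -0.01]]])
        self.assertTrue(torch.allclose(output[:, 1:4, 1:4], expected_slice, atol=1e-4))
```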
## Documenting your model

Here is the workflow for documentation (an illustration of the docstring constants follows the list):

- [ ] Make sure all your arguments are properly documented in your configuration and tokenizer.
- [ ] Most of the documentation of the models is automatically generated; you just have to make sure that
  `XXX_START_DOCSTRING` contains an introduction to the model you're adding and a link to the original
  article, and that `XXX_INPUTS_DOCSTRING` contains all the inputs of your model.
- [ ] Create a new page `xxx.rst` in the folder `docs/source/model_doc` and add this file in `docs/source/index.rst`.

Make sure to check that you have no Sphinx warnings when building the documentation locally, and follow our
[documentation guide](https://github.com/huggingface/transformers/tree/master/docs#writing-documentation---specification).
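For reference, the two constants mentioned above live in the `modeling_...` file and typically look like the following
(the wording below is illustrative, not the template's exact text):

```python
XXX_START_DOCSTRING = r"""
    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
    It was proposed in `this paper <https://arxiv.org/abs/xxxx.xxxxx>`__ (replace with the original article).

    Parameters:
        config (:class:`~transformers.XxxConfig`): Model configuration class with all the parameters of the model.
"""

XXX_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.
        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
            Mask to avoid performing attention on padding token indices.
"""
```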
## Final steps

You can then finish the addition by adding imports for your classes in the common files (a quick sanity check is
sketched after the list):

- [ ] Add imports for all the relevant classes in `__init__.py`.
- [ ] Add your configuration in `configuration_auto.py`.
- [ ] Add your PyTorch and TF 2.0 models respectively in `modeling_auto.py` and `modeling_tf_auto.py`.
- [ ] Add your tokenizer in `tokenization_auto.py`.
- [ ] Add your models and tokenizer to `pipeline.py`.
- [ ] Add a link to your conversion script in the main conversion utility (in `commands/convert.py`).
- [ ] Edit the PyTorch to TF 2.0 conversion script to add your model in the `convert_pytorch_checkpoint_to_tf2.py`
  file.
- [ ] Add a mention of your model in the doc: `README.md` and the documentation itself
  in `docs/source/index.rst` and `docs/source/pretrained_models.rst`.
- [ ] Upload the pretrained weights, configurations and vocabulary files.
- [ ] Create model card(s) for your models on huggingface.co. For those last two steps, check the
  [model sharing documentation](https://huggingface.co/transformers/model_sharing.html).
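Once everything is wired up and the weights are uploaded, a quick sanity check (not part of the checklist; the model
identifier below is a placeholder) is to load your model through the auto classes:

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

config = AutoConfig.from_pretrained("username/xxx-base-uncased")        # placeholder identifier
tokenizer = AutoTokenizer.from_pretrained("username/xxx-base-uncased")
model = AutoModel.from_pretrained("username/xxx-base-uncased")
```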
@@ -16,6 +16,7 @@
import logging
from typing import Callable, Union

from .configuration_utils import PretrainedConfig
@@ -30,85 +31,76 @@ XXX_PRETRAINED_CONFIG_ARCHIVE_MAP = {

class XxxConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.XXXModel`.
    It is used to instantiate a XXX model according to the specified arguments, defining the model
    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
    the XXX `xxx-base-uncased <https://huggingface.co/xxx/xxx-base-uncased>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
    to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
    for more information.

    Args:
        vocab_size (:obj:`int`, optional, defaults to 30522):
            Vocabulary size of the XXX model. Defines the different tokens that
            can be represented by the `input_ids` passed to the forward method of :class:`~transformers.XXXModel`.
        hidden_size (:obj:`int`, optional, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, optional, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (:obj:`int`, optional, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`function`, optional, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler.
            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, optional, defaults to 512):
            The maximum sequence length that this model might ever be used with.
            Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, optional, defaults to 2):
            The vocabulary size of the `token_type_ids` passed into :class:`~transformers.XXXModel`.
        initializer_range (:obj:`float`, optional, defaults to 0.02):
            The standard deviation of the :obj:`truncated_normal_initializer` for initializing all weight matrices.
        layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):
            The epsilon used by the layer normalization layers.
        gradient_checkpointing (:obj:`bool`, optional, defaults to :obj:`False`):
            If :obj:`True`, use gradient checkpointing to save memory at the expense of a slower backward pass.
        kwargs:
            Additional arguments for common configurations, passed to :class:`~transformers.PretrainedConfig`.
    """

    model_type = "xxx"

    def __init__(
        self,
        vocab_size: int = 30522,
        hidden_size: int = 768,
        num_hidden_layers: int = 12,
        num_attention_heads: int = 12,
        hidden_act: Union[str, Callable] = "gelu",
        hidden_dropout_prob: float = 0.1,
        attention_probs_dropout_prob: float = 0.1,
        max_position_embeddings: int = 512,
        type_vocab_size: int = 2,
        initializer_range: float = 0.02,
        layer_norm_epsilon: float = 1e-5,
        gradient_checkpointing: bool = False,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.layer_norm_epsilon = layer_norm_epsilon
        self.gradient_checkpointing = gradient_checkpointing
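A brief usage sketch for the configuration template above (not part of the file itself; the directory name is
arbitrary):

```python
config = XxxConfig(num_hidden_layers=6)             # XxxConfig is the class defined above; override any default
config.save_pretrained("./xxx-tiny")                # writes ./xxx-tiny/config.json
config = XxxConfig.from_pretrained("./xxx-tiny")    # reload it later
```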
@@ -18,6 +18,7 @@
import collections
import logging
import os
from typing import List, Optional

from .tokenization_utils import PreTrainedTokenizer
@@ -77,12 +78,37 @@ def load_vocab(vocab_file):

class XxxTokenizer(PreTrainedTokenizer):
    r"""
    Constructs a XXX tokenizer. Based on XXX.

    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
    should refer to the superclass for more information regarding methods.

    Args:
        vocab_file (:obj:`str`):
            File containing the vocabulary.
        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to lowercase the input when tokenizing.
        do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to do basic tokenization before WordPiece.
        never_split (:obj:`Iterable`, `optional`, defaults to :obj:`None`):
            Collection of tokens which will never be split during tokenization. Only has an effect when
            :obj:`do_basic_tokenize=True`.
        unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
            for sequence classification or for a text and a question for question answering.
            It is also used as the last token of a sequence built with special tokens.
        pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
            The token used for padding, for example when batching sequences of different lengths.
        cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
            The classifier token which is used when doing sequence classification (classification of the whole
            sequence instead of per-token classification). It is the first token of the sequence when built with
            special tokens.
        mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
            The token used for masking values. This is the token used when training this model with masked language
            modeling. This is the token which the model will try to predict.
    """

    vocab_files_names = VOCAB_FILES_NAMES
@@ -94,21 +120,16 @@ class XxxTokenizer(PreTrainedTokenizer):
        self,
        vocab_file,
        do_lower_case=True,
        do_basic_tokenize=True,
        never_split=None,
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        tokenize_chinese_chars=True,
        **kwargs
    ):
        super().__init__(
            unk_token=unk_token,
            sep_token=sep_token,
@@ -121,22 +142,35 @@ class XxxTokenizer(PreTrainedTokenizer):

        if not os.path.isfile(vocab_file):
            raise ValueError(
                "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
                "model use `tokenizer = XxxTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)
            )
        self.vocab = load_vocab(vocab_file)
        self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
        self.do_basic_tokenize = do_basic_tokenize
        # Replace and adapt
        # if do_basic_tokenize:
        #     self.basic_tokenizer = BasicTokenizer(
        #         do_lower_case=do_lower_case, never_split=never_split, tokenize_chinese_chars=tokenize_chinese_chars
        #     )
        # self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab, **self.added_tokens_encoder)

    def _tokenize(self, text):
        split_tokens = []
        if self.do_basic_tokenize:
            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):

                # If the token is part of the never_split set
                if token in self.basic_tokenizer.never_split:
                    split_tokens.append(token)
                else:
                    split_tokens += self.wordpiece_tokenizer.tokenize(token)
        else:
            split_tokens = self.wordpiece_tokenizer.tokenize(text)
        return split_tokens
@@ -154,13 +188,25 @@ class XxxTokenizer(PreTrainedTokenizer):

        out_string = " ".join(tokens).replace(" ##", "").strip()
        return out_string

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequences for sequence classification tasks
        by concatenating and adding special tokens.
        A BERT sequence has the following format:

        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
                Optional second list of IDs for sequence pairs.

        Returns:
            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
@@ -168,20 +214,23 @@ class XxxTokenizer(PreTrainedTokenizer):

        sep = [self.sep_token_id]
        return cls + token_ids_0 + sep + token_ids_1 + sep

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer ``prepare_for_model`` method.

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of ids.
            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Set to True if the token list is already formatted with special tokens for the model.

        Returns:
            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """

        if already_has_special_tokens:
@@ -196,14 +245,29 @@ class XxxTokenizer(PreTrainedTokenizer):

            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1]

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
        A BERT sequence pair mask has the following format:

        ::

            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
            | first sequence    | second sequence |

        If token_ids_1 is None, only returns the first portion of the mask (0s).

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of ids.
            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
                Optional second list of IDs for sequence pairs.

        Returns:
            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
            sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
@@ -212,7 +276,16 @@ class XxxTokenizer(PreTrainedTokenizer):

        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

    def save_vocabulary(self, vocab_path):
        """
        Save the tokenizer vocabulary (copy of the original vocabulary file) and special tokens file to a directory.

        Args:
            vocab_path (:obj:`str`):
                The directory in which to save the vocabulary.

        Returns:
            :obj:`Tuple(str)`: Paths to the files saved.
        """
        index = 0
        if os.path.isdir(vocab_path):
            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES["vocab_file"])
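A brief illustration of the special-token helpers documented above (not part of the file; the vocabulary path and
tokens are placeholders, and the template's `_tokenize` still needs to be adapted before full tokenization works):

```python
tokenizer = XxxTokenizer("path/to/vocab.txt")   # XxxTokenizer is the class defined above

ids_a = tokenizer.convert_tokens_to_ids(["hello", "world"])
ids_b = tokenizer.convert_tokens_to_ids(["how", "are", "you"])

tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)
# -> [CLS] hello world [SEP] how are you [SEP], as token IDs

tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)
# -> [0, 0, 0, 0, 1, 1, 1, 1]: 0 for the first segment (with [CLS] and its [SEP]), 1 for the second
```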